LINGUISTIC MEASURES OF SECOND LANGUAGE SPEECH: MOVING FROM MONOLOGIC TO INTERACTIVE SPEECH

By

Dustin Joseph Crowther

A DISSERTATION

Submitted to
Michigan State University
in partial fulfilment of the requirements
for the degree of

Second Language Studies, Doctor of Philosophy

2018

ABSTRACT

LINGUISTIC MEASURES OF SECOND LANGUAGE SPEECH: MOVING FROM MONOLOGIC TO INTERACTIVE SPEECH

By

Dustin Joseph Crowther

Second language (L2) scholars generally agree that pronunciation development should prioritize attaining understandable over nativelike speech (e.g., Derwing & Munro, 2015; Jenkins, 2000; Levis, 2005). Less clear is which specific linguistic measures of speech enable listener understanding. While monologic-based research indicates a combined effect of segmental and suprasegmental measures (word stress, intonation, rhythm), interactive-based research has emphasized only segmental measures. The current study takes a first step in addressing this divide by applying a monologic methodology to interactive speech. Twenty intensive English program students (levels 3/4 of a 4-level program) completed one interactive and three monologic (Picture, Experiential, and Academic) tasks. Using 60-second (interactive) or 30-second (monologic) excerpts, 36 native Listeners rated each Speaker on 9-point scales per task for accentedness (i.e., nativelikeness) and comprehensibility (i.e., ease of understanding). I acoustically coded all utterances for a series of phonological and fluency measures (derived from Isaacs & Trofimovich, 2012). In addition, each Speaker received a task rating for the Experiential, Academic, and Interactive tasks to determine whether perceived accentedness and/or comprehensibility predicted actual task performance. Consistent with previous findings, Listeners perceived comprehensibility more positively than accentedness (e.g., Trofimovich & Isaacs, 2012).
In terms of task, Interactive speech patterned most similarly to Experiential speech, especially for comprehensibility. Speakers were easier to understand on these two tasks than on the Picture or Academic tasks. Across tasks, Listener ratings were associated with fluency measures (Articulation Rate, Mean Length of Run), but not with phonological measures. The more complex, linguistically constrained tasks, Picture and Academic, demonstrated stronger associations with these fluency measures than did Experiential and Interactive, likely an effect of the increased cognitive demands placed on Speakers with regard to lexical retrieval and syntactic encoding (Segalowitz, 2010). Perceived comprehensibility was associated with task scores for the Experiential and Academic tasks, but not for the Interactive task. For both Experiential and Academic, a higher perceived comprehensibility rating appears to align with a higher overall task score (and, for Experiential, with higher scores in both the Pronunciation and Fluency categories). For Interactive speech, task performance likely draws more upon measures of interactive competence (e.g., turn-taking, topic initiation, discourse extension; May, 2011) than upon perceived comprehensibility. I conclude my study by discussing what insight these findings provide into how L2 speech is perceived. This insight includes the potential effects of speaker, listener, and task variables, along with how specific linguistic measures are operationalized. In addition, I discuss the potential pedagogical and assessment implications of perceived comprehensibility being associated with task performance. After addressing the limitations of my study, I provide suggestions for future research to extend my findings.

Copyright by
DUSTIN JOSEPH CROWTHER
2018

For mom, dad, and Tara, who have supported and encouraged me throughout all steps of my life journey thus far.

ACKNOWLEDGMENTS

My dissertation is the product of support I have received throughout my studies at Michigan State University.
I begin by thanking everyone in the Second Language Studies program, including faculty, students, and alumni. Your support has gone beyond what can be expressed here alone. Specifically, I would like to acknowledge the support my dissertation chair, Dr. Debra Hardison, has provided me throughout my four years of study. From time as an RA in my first rt has been a great benefit to my professional development. In addition, my dissertation committee members, Drs. Susan Gass, Peter De Costa, and Paula Winke, have provided guidance well beyond that necessary to complete my degree, and for that I am forever grateful. I have long believed in the support that we as people provide each other. Thank you to my cohort: Jessica Fox, Susie Kim, Jie Liu, Jeff Maloney, Zack Miller, Magda Tigchelaar, and Irina Zaykovskaya. We have travelled these last four years together, and without your support along the way I would not be where I am today. At different stages of my dissertation, many people have offered their time to help me along my way. Though this list is not exhaustive, I would like to acknowledge the help of Caitlin Cornell, Amanda Haag, Lizz Huntley, Dan Isbell, Minhye Kim, Jongbong Lee, Jungmin Lim, and Susie Kim. I would be remiss not to acknowledge the administrative support I have received throughout this process, and thus conclude by acknowledging MS and Department of Linguistics and Germanic, Slavic, Asian and African Languages.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
INTRODUCTION
CHAPTER 1: LITERATURE REVIEW
Global Perception of Second Language Speech
From Monologues to Interaction
Interaction Hypothesis
Linguistic sources of communicative breakdowns
Lingua Franca Core
Methodological concerns
From Listener Perception to Task Performance
Accentedness and comprehensibility in assessment rubrics
A listener versus rater dichotomy
Linguistic correlates of rubric rating
Individual listener/rater effects
The interactive rubric
The Current Study
CHAPTER 2: METHODOLOGY
Participants
Speakers
Assessors
Listeners
Raters
Materials
Monologic tasks
Picture
Experiential
Academic
Interactive task
Task comparisons
Pronunciation survey
Background questionnaire
Procedure: Speech Elicitation
Monologic session
Interactive session
Procedure: Speech Rating
Stimuli preparation
Monologic
Interactive
Speech rating
Monologic
Interactive
Task scoring
Monologic
Interactive
Data Analysis
Linguistic coding
Reliability
Monologic speech rating
Interactive speech rating
Linguistic coding
Task scoring
Analyses parameters
CHAPTER 3: RESULTS
Wave 1: Monologic and Interactive Speech Performance
Descriptive comparisons
Prompt effect
Tests of parametric assumptions
Nonparametric analysis
Accentedness & comprehensibility strength of association
Accentedness & comprehensibility group differences
Accentedness & comprehensibility within task comparisons
Accentedness between task comparisons
Comprehensibility between task comparisons
Spearman correlations
Accentedness
Comprehensibility
Cluster analysis
Wave 2: Participant Patterns
Group responses
Japanese responses
Chinese responses
Wave 3: Task Performance
Experiential
Linguistic associations
Academic
Linguistic associations
Interactive
Linguistic associations
CHAPTER 4: DISCUSSION
Summary of Research Questions and Findings
Task effect
Pronunciation awareness
Task performance
Listener Perception Across Monologic and Interactive Tasks
Exploring task differences
Interactive alignment
Task complexity
Variation in linguistic associations
Proficiency consideration
Listener consideration
Limitations of the current analyses
Monologic bias
Interactive task complexity
Interlocutor variables
eness of L2 Pronunciation Measures
Accentedness and Comprehensibility Effects on Task Rating
Alignment of linguistic associations
Limitations of task analyses
Causes for Concern: 11 Linguistic Measures of Speech
CHAPTER 5: CONCLUSION
Implications
Pedagogical
Assessment
Directions for Future Research
Interlocutor perception
Task assessment
Linguistic coding
Concluding Thoughts
APPENDICES
APPENDIX A: Picture Narrative (Derwing et al., 2009)
APPENDIX B: Experiential Task
APPENDIX C: Academic Task (Educational Testing Service, 2012)
APPENDIX D: Interactive Prompts
APPENDIX E: Pronunciation Questionnaire
APPENDIX F: Questionnaire A, Speaker Background
APPENDIX G: Questionnaire B, Listener Background
APPENDIX H: Questionnaire C, Rater Background
APPENDIX I: Perception of Rating Categories
APPENDIX J: Paired Assessment Rating Rubric (Reproduced as presented in Ockey, 2011)
APPENDIX K: Targeted 11 Linguistic Measures of L2 Speech
REFERENCES

LIST OF TABLES

Table 1 Biographical data for Speakers
Table 2 Biographical data for each Rater
Table 3 Task complexity across three monologic tasks (as reported in Crowther et al., 2017)
Table 4 List of 11 phonological and fluency measures (drawn from Isaacs & Trofimovich, 2012)
Table 5 Intraclass correlation coefficients for accentedness and comprehensibility
Table 6 Intraclass correlation coefficients for 11 linguistic measures of speech
Table 7 Speaker performance on monologic + interactive tasks
Table 8 ) test results for accentedness and comprehensibility across 4 tasks
Table 9 Mann-Whitney U test results for group differences in accentedness and comprehensibility across 4 tasks
Table 10 Results of Wilcoxon signed-ranks tests between accentedness and comprehensibility across 4 tasks
Table 11 Mean (SD) performance on monologic + interactive tasks for Friedman test (N = 17)
Table 12 Results of Wilcoxon signed-ranks tests comparing accentedness ratings across 4 tasks
Table 13 Results of Wilcoxon signed-ranks tests comparing comprehensibility ratings across 4 tasks
Table 14 ) coefficients between accentedness and 9 linguistic measures of speech
Table 15 Summary of Spearman correlations with accentedness per task type
Table 16 ) coefficients between comprehensibility and 9 linguistic measures of speech
Table 17 Summary of Spearman correlations with comprehensibility per task type
Table 18 Descriptive results for task of 3-cluster solution (mean [SD])
Table 19 One-way ANOVAs between High cluster and combined Middle/Low clusters
Table 20 One-way ANOVAs between Middle and Low clusters
Table 21 L1 breakdown of 3-cluster HCA solution
Table 22 Group pronunciation survey results (N = 29; Mean [SD])
Table 23 Pronunciation survey results: Japanese (N = 15; Mean [SD])
Table 24 Pronunciation survey results: Chinese (N = 14; Mean [SD])
Table 25 Experiential hierarchical regression results for Overall score
Table 26 Experiential hierarchical regression results for Pronunciation score
Table 27 Experiential hierarchical regression results for Fluency score
Table 28 ) coefficients between Experiential Overall, Pronunciation, and Fluency scores and 9 linguistic measures of speech (N = 20)
Table 29 Crosstabulation of comprehensibility ratings with Academic band scores
Table 30 ) coefficients between Academic Band score and 9 linguistic measures of speech (N = 27)
Table 31 ) coefficients between accentedness, comprehensibility, Interactive Overall, Pronunciation & Fluency scores (N = 20)
Table 32 Interactive hierarchical regression results (N = 20)
Table 33 ) coefficients between Interactive Overall, Pronunciation, Fluency scores and 9 linguistic measures of speech (N = 20)

LIST OF FIGURES

Figure 1. Continuum of linguistic constraint across 4 speaking tasks
Figure 2. Qualtrics interface for monologic task rating
Figure 3. Qualtrics interface for interactive task rating
Figure 4. Comparison of accentedness and comprehensibility ratings within tasks
Figure 5. Comparison of accentedness and comprehensibility ratings across 4 tasks
Figure 6. Histogram and boxplot depicting distribution of accentedness and comprehensibility ratings for Picture task
Figure 7. Histogram and boxplot depicting distribution of accentedness and comprehensibility ratings for Experiential task
Figure 8. Histogram and boxplot depicting distribution of accentedness and comprehensibility ratings for Academic task
Figure 9. Histogram and boxplot depicting distribution of accentedness and comprehensibility ratings for Interactive task
Figure 10. Dendrogram of hierarchical cluster analysis
Figure 11. Scree plot of hierarchical cluster analysis
Figure 12. P-P and Residual-scatter plots for Experiential hierarchical linear regression
Figure 13. P-P and Residual-scatter plots for Experiential Pronunciation and Fluency hierarchical linear regressions
Figure 14. P-P and Residual-scatter plots for Interactive hierarchical linear regression

KEY TO ABBREVIATIONS

ANOVA: Analysis of Variance
EIL: English as an International Language
ELF: English as a Lingua Franca
HCA: Hierarchical Cluster Analysis
ICC: Intraclass Correlation Coefficients
IELTS: International English Language Testing System
IEP: Intensive English Program
IRIS: A Digital Repository of Instruments and Materials for Research into Second Languages
L1: First Language
L2: Second Language
LFC: Lingua Franca Core
LRE: Language Related Episode
NNS: Nonnative Speaker
NS: Native Speaker
OPI: Oral Proficiency Interview
SLA: Second Language Acquisition
TESOL: Teaching English to Speakers of Other Languages
TOEFL: Test of English as a Foreign Language
TOEFL iBT: Test of English as a Foreign Language (internet-based test)
TOEIC: Test of English for International Communication

INTRODUCTION

A range of ideological perspectives has addressed second language (L2) pronunciation development, including Second Language Acquisition (SLA; Celce-Murcia, Brinton, & Goodwin, 2010), English as a lingua franca (ELF; Walker, 2010), and English as an international language (EIL; Low, 2015). A relatively consistent argument across these ideological views is that pronunciation instruction should adhere to what Levis (2005) referred to as the Intelligibility Principle, that is, a focus on speech that is understandable to one's interlocutor(s). This is an extension beyond a longstanding focus solely on accent reduction (i.e., the Nativeness Principle). From an SLA perspective, this argument stems primarily from the fact that accented L2 speech is often unavoidable, even for L2 learners who begin at an early age (e.g., Abrahamsson & Hyltenstam, 2009; Flege, Munro, & MacKay, 1995; MacKay, Flege, & Imai, 2006; Major, 2001; Moyer, 2013).
EIL and ELF scholars, who advocate on behalf of the ~75% of English users globally who are nonnative speakers (Crystal, 2008), prioritize achieving and maintaining mutual intelligibility over attaining any specific native-English norm (Seidlhofer, 2011). This is primarily due to the wide variety of L2 accents likely to be encountered during multilingual contact (Matsuda, 2017). While pronunciation instruction has been shown to be effective at both the phonemic (Lee, Jang, & Plonsky, 2015) and global (Derwing, Munro, & Wiebe, 1998; Saito & Saito, 2017) levels, the linguistic targets of such intervention-based instruction have varied, encompassing both segmental and suprasegmental measures of speech (Lee et al., 2015; Saito, 2012). To further complicate matters, respondents to surveys of actual classroom practice have reported not only sporadic and unbalanced pronunciation instruction, but also a preference for a segmental emphasis (e.g., Breitkreutz, Derwing, & Rossiter, 2001; Foote, Derwing, & Holtby, 2012; Hardison, 2014).

In line with a greater emphasis on intelligibility (i.e., understandable speech) over nativelikeness (i.e., accent-free speech) among speech production scholars (e.g., Derwing & Munro, 2015; Levis, 2005), I (along with my collaborators) have proposed that listener perception of L2 comprehensibility (i.e., ease of understanding) is associated with a wider range of linguistic dimensions (e.g., phonological, fluency, lexical, grammatical) than is perceived accentedness (primarily phonological), with our findings based on both linguistic (coding of individual speech measures) and subjective (listener ratings of individual speech measures) assessments (e.g., Crowther, Trofimovich, Isaacs, & Saito, 2015a; Crowther, Trofimovich, Saito, & Isaacs, 2015b; Trofimovich & Isaacs, 2012).
However, a primary focus on monologic tasks (e.g., picture narrative) has limited our ability to make any claims about L2 interactive performance. Considering that L2 usage in spontaneous communication often serves as an overarching goal of L2 acquisition (Loewen, 2015), and that it is within interaction that much SLA is theorized to occur (Gass & Mackey, 2015; Long, 1996), it is necessary to investigate whether the linguistic dimensions identified as promoting understandable speech during monologic tasks are the same as those necessary for such performance during interaction. Understanding within interactive speech is primarily considered through either researcher analysis of language-related episodes (LREs) and communicative breakdowns (e.g., Jenkins, (e.g., Gurzynski-Weiss & Baralt, 2014; Kennedy, Guénette, Murphy, & Allard, 2015). Such analyses have, first, emphasized lexical and grammatical issues over phonological ones, which supports our monologic argument that attaining mutual intelligibility requires more than just phonological accuracy. However, in instances where researchers/interlocutors have identified phonological sources of communicative breakdowns, the emphasis has been placed strongly on segmental rather than suprasegmental issues (e.g., Jenkins, 2000; Kennedy et al., 2015; Loewen & Isbell, 2017). This contrasts with our monologic findings, which have argued for at least an equal, if not greater, role for suprasegmental measures in producing understandable speech. This difference in phonological emphasis between monologic and interactive understanding serves as the starting point for my dissertation.

A limitation of existing research on linguistic dimensions associated with understandable speech is a minimal focus on actual task performance.
While we have drawn upon various tasks or how understandable speakers are (usually through Likert scale ratings of ~30-second utterances), and not how effectively speakers have completed the actual task. As some tasks used to elicit speech feature readily available rubrics for task performance, such as those inspired by the International English Language Testing System (IELTS) or the Test of English as a Foreign Language (TOEFL), this seems a serious limitation. Recent years have seen a greater emphasis on the relationship between L2 pronunciation and assessment (see Isaacs & Trofimovich, 2017, and Kang & Ginther, 2018, for two recent edited volumes), yet it is not clear how the pedagogically orientated literature on L2 accentedness and comprehensibility that I draw upon (e.g., Derwing & Munro, 2015) may inform L2 assessment. Specifically, it is necessary to consider how phonology-based pedagogical targets drawn from the former actually inform the latter. As standards-based assessments (contentious a topic as they are) play a significant role in L2 teaching and learning (e.g., Fulcher & Owen, 2016; Ginther & Elder, 2014), it would be a disservice to promote pedagogical targets that may promote understandable speech but do not necessarily lead to higher assessment performance.

CHAPTER 1: LITERATURE REVIEW

Throughout this chapter, I review relevant literature on the linguistic measures associated with listener perception of how accented and understandable second language (L2) speakers are, across both monologic and interactive speech. Within this review, I identify how methodological differences in the scholarship related to each speaking task type may explain diverging results. I then highlight potential divides between this more pedagogically orientated body of L2 pronunciation research and how L2 speaking is viewed and addressed from an L2 assessment perspective. Finally, I present the six research questions that guide my dissertation.
Global Perception of Second Language Speech

Theoretically, L2 pronunciation has received relatively minor attention in models of L2 development and assessment (Galaczi, Post, Li, Barker, & Schmidt, 2017). While numerous variables attributed to L2 development in syntactic, lexical, and pragmatic dimensions (e.g., age, motivation, aptitude) have also been linked to pronunciation, empirical studies rarely relate directly to theories of SLA (e.g., VanPatten & Williams, 2015). Instead, a primary focus of discussion has been on whether nativeness or intelligibility should be the primary target of L2 pronunciation acquisition (e.g., Derwing & Munro, 2015; Levis, 2005), and, more recently, on the linguistic dimensions (e.g., phonology, fluency, lexicon, grammar, discourse) associated with each (e.g., Isaacs & Trofimovich, 2012). Much research in this vein focuses on three key constructs, best defined in Derwing and Munro (2015):

Accentedness member of the target speech community.
Comprehensibility utterance to be.
Intelligibility.

A fourth construct, fluency, has also received a significant amount of attention, though much variation exists in how it has been defined (Chambers, 1997; Segalowitz, 2016). For example, Lennon (1990) makes Munro (2015) refer to the ease of flow of L2 speech, typically in reference to the presence/absence of pauses and other dysfluency markers. Segalowitz (2010) emphasizes the underlying processes involved in L2 fluency attainment, specifically addressing the link between cognitive (retrieval) and utterance (temporal) fluency. Importantly, Derwing and Munro (2015) have argued that these constructs, particularly when comparing accentedness to either comprehensibility or intelligibility, are overlapping, yet partially independent.
This is evidenced by the fact that L2 speakers can be perceived as highly comprehensible/intelligible while still possessing a heavy accent (though a heavy accent is almost always present for speakers deemed to have low comprehensibility/intelligibility). While research within this stream has targeted primarily L2 English speech (e.g., Derwing & Munro, 2013; Isaacs & Trofimovich, 2012; Kang, 2010), recent years have seen an increased focus on additional languages, including Dutch (Caspers, 2010), French (Bergeron & Trofimovich, 2017), Japanese (Saito & Akiyama, 2017), Korean (Isbell, Park, & Lee, in press), and Spanish (Nagle, 2018). The partial independence between global listener perception of accentedness and comprehensibility/intelligibility proposed by Derwing and Munro (2015) has been echoed with regard to these same constructs (e.g., Crowther et al., 2015a; Crowther et al., 2015b; Isaacs & Trofimovich, 2012). Prior to Isaacs and Trofimovich (2012) and Trofimovich and Isaacs (2012), linguistic measures found to influence the perception of L2 speech had primarily been considered independently of each other, despite the range of measures that had been identified. For perceived accentedness, these associations included segmental accuracy (Derwing et al., 1998), pausing and articulation rate (Trofimovich & Baker, 2006), and various suprasegmental measures such as pitch range, stress, and pause length (Kang, 2010). Correlates of understanding (encompassing both comprehensibility and intelligibility) included word stress (Field, 2005) and speech rate (Munro & Derwing, 2001), as well as pitch range and pause or syllable length (Kang, Rubin, & - phonological measures, such as grammatical accuracy, appear to play an important role in speech perception as well (Fayer & Krasinsky, 1987; Varonis & Gass, 1982).
For example, Varonis and Gass (1982) asked native speakers (NSs) of English to rate the accent and comprehensibility of L2 speakers reading a pair of sentences, one grammatical and one ungrammatical. Interestingly, they found that grammatical accuracy did indeed affect perceived accentedness, but only when the speaker was not perceived as being at either end of the accentedness spectrum (i.e., highly accented vs. not accented at all). Speakers were also perceived as more comprehensible when reading grammatical, as opposed to ungrammatical, sentences. Drawing upon this knowledge, Isaacs and Trofimovich (2012) and Trofimovich and Isaacs (2012) employed correlational and regression analyses to measure the relative strength of association between 19 linguistic measures and listener perception of L2 accentedness and comprehensibility. Participants were first language (L1) French/L2 English speakers completing a picture narrative, rated by 60 NSs of English using a pair of 9-point Likert scales. Results indicated that perceived accentedness was linked primarily to phonological measures, while comprehensibility was associated with a wider range of considerations, now including fluency, grammatical, and lexical dimensions. A partial replication indicated similar results when the same speech data were rated by nonnative speakers (NNSs) of English (L1 French, L1 Chinese; Crowther, Trofimovich, & Isaacs, 2016). Further evidence for this distinction comes from Crowther et al. (2015b), in which NSs rated the L2 English speech of learners from three distinct backgrounds (Chinese, Hindi, Farsi), as well as Crowther et al. (2015a), in which performance was compared across two tasks (IELTS- and TOEFL-inspired). Furthermore, recent evidence within this research agenda has strengthened the importance of lexicon in perceived comprehensibility (Saito, Webb, Trofimovich, & Isaacs, 2016).
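The correlational approach described above pairs speaker-level linguistic measures with mean listener ratings. As a minimal sketch, assuming each excerpt has already been segmented into timed stretches of speech and silence with syllable counts (a hypothetical input format), the Python below computes two fluency measures commonly used in this literature: articulation rate (syllables per second of phonation time, pauses excluded) and mean length of run (mean number of syllables between pauses). It then computes a Spearman rank-order correlation, assigning tied values their average rank. The 0.25-second pause threshold, the function names, and all data values are illustrative assumptions. They do not reproduce the coding procedures or results of the studies cited here.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Segment:
    """One timed stretch of an excerpt (hypothetical input format)."""
    duration_s: float
    syllables: int = 0    # syllables produced in this stretch (0 for silence)
    silent: bool = False  # True for an unfilled (silent) pause


def fluency_measures(segments: List[Segment],
                     pause_threshold_s: float = 0.25) -> Dict[str, float]:
    """Articulation rate and mean length of run for one excerpt.

    Silences at or above the (assumed) threshold end a run and are excluded
    from phonation time; shorter silences stay inside the current run.
    """
    runs: List[int] = []
    current_run = 0
    phonation_time = 0.0
    for seg in segments:
        if seg.silent and seg.duration_s >= pause_threshold_s:
            if current_run > 0:          # close the run preceding this pause
                runs.append(current_run)
            current_run = 0
        else:
            phonation_time += seg.duration_s
            current_run += seg.syllables
    if current_run > 0:
        runs.append(current_run)
    if not runs or phonation_time == 0:
        raise ValueError("no speech found in segments")
    total_syllables = sum(runs)
    return {
        "articulation_rate": total_syllables / phonation_time,
        "mean_length_of_run": total_syllables / len(runs),
    }


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend `end` across a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # 0-based positions -> 1-based ranks
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return sxy / (sxx * syy) ** 0.5


# Hypothetical excerpt: speech, 0.5 s pause, speech, 0.125 s hesitation, speech.
excerpt = [
    Segment(1.0, syllables=5),
    Segment(0.5, silent=True),
    Segment(1.0, syllables=3),
    Segment(0.125, silent=True),
    Segment(0.875, syllables=4),
]
print(fluency_measures(excerpt))  # {'articulation_rate': 4.0, 'mean_length_of_run': 6.0}

# Hypothetical speaker-level values for five speakers (illustration only).
comprehensibility = [4.2, 6.1, 5.0, 7.3, 3.8]  # mean 9-point rating
articulation_rate = [2.9, 3.6, 3.1, 4.0, 3.2]  # syllables per second
print(round(spearman_rho(comprehensibility, articulation_rate), 2))  # 0.7
```

In this sketch, how sub-threshold silences are handled is a design choice. Here they stay inside the run and count toward phonation time. Studies differ on this point and on the pause threshold itself, so both values would need to match a given study's coding scheme before any comparison is meaningful.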
The message extending from this line of research, at least to this point, is that while perceived L2 accentedness is primarily associated with phonological measures, the rating of L2 comprehensibility requires listeners to draw on a much wider range of linguistic dimensions to attain understanding. In terms of phonological considerations, listener perception of comprehensibility has generally shown a greater balance between segmental accuracy and suprasegmental measures (word stress, intonation, and rhythm) than has perception of accentedness. Measures of fluency (articulation rate, mean length of run, pause accuracy) have generally been associated with both. An important highlighted in Saito, Trofimovich, and Isaacs (2016). Considering the picture descriptions of L1 Japanese/L2 English speakers at three proficiency levels, the authors argued that for comprehensibility, fluency measures with varied prosody were most relevant for beginner-to-intermediate speakers. At the advanced level, segmental accuracy with good prosody was of greater importance. For accentedness, segmental and prosodic measures were important across levels, though grammaticality became more relevant at the advanced level. However, it should be noted that Saito et al. operationalized proficiency through assigned comprehensibility ratings rather than any standardized measure of proficiency. This leaves open the question of how reliable such findings may be with regard to proficiency. One important caveat to consider is the emphasis on comprehensibility (ease/difficulty of listener understanding), rather than intelligibility (accuracy of understanding). Though more closely related to each other than either is to accentedness, comprehensibility and intelligibility are not perfectly correlated (Derwing & Munro, 2015).
A scholarly emphasis on comprehensibility has been due to the primary usage of scalar measures (e.g., 7- or 9-point Likert scales), which are more closely aligned with how "pronunciation" is often operationalized in high-stakes assessments such as IELTS and TOEFL (Harding, 2017; Isaacs & Trofimovich, 2012). I place pronunciation above in quotations, as the constructs of accentedness and comprehensibility are quite often conflated within assessment rubrics (Harding, 2018; Isaacs, Trofimovich, Yu, & Muñoz Chereau, 2015). To maintain alignment with both the work of Isaacs and Trofimovich and my own, I here maintain an emphasis on comprehensibility.[1]

[1] However, as part of data collection procedures, I included a measure of intelligibility (via orthographic transcription) to be considered in future analyses.

From Monologues to Interaction

The linguistic-based partial independence between accentedness and comprehensibility described above is primarily derived from monologic speech, with a heavy emphasis on the usage of a picture narrative. Though still in need of further investigation, there is evidence that as task complexity increases and L2 speakers are forced to draw upon a wider range of their linguistic resources (Robinson, 2011; Skehan, 2009), the distinction between linguistic correlates of perceived accentedness and comprehensibility begins to diminish. Crowther, Trofimovich, Saito, and Isaacs (2017) compared the linguistic correlates of accentedness and comprehensibility across three tasks (Picture, IELTS- and TOEFL-inspired). Ten English NSs rated the accentedness performance on ten linguistic measures (spanning dimensions of phonology, fluency, lexicogrammar, discourse).[2] 2005) as well as a TOEFL-inspired task (integrated speaking) as being more complex than the other two.
Interestingly, while accentedness was associated only with phonological and fluency measures for the Picture and IELTS-inspired tasks, for the TOEFL-inspired task perceptions were now also associated with grammatical and lexical measures. These associations aligned accentedness more closely with comprehensibility. Though more research is needed to explore the potential overlap between the two constructs with regard to task complexity, additional evidence exists for L2 French speech as well (Bergeron & Trofimovich, 2017).

[2] This study drew upon the same dataset used in Crowther et al., 2015a, 2015b.

If increasing task complexity across monologic tasks influences the perception of L2 accentedness and comprehensibility, it would seemingly follow that manipulating additional variables would further affect listener perception. Specifically, making a task +interactive opens up a number of additional variables to increase complexity. (2005) Cognition Hypothesis, the presence of an interlocutor could potentially affect task complexity in a number of ways. T category which features participation variables such as +/- open solution, +/- convergent solution, +/- few participants, and +/- negotiation not needed and participant variables such as +/- same proficiency, +/- same gender, +/- shared content knowledge, and +/- shared cultural knowledge. Just as monologic speaking tasks can differ in complexity, so too can interactive speaking tasks. Yet, it is not known how the change from -interactive to +interactive may impact the perception of linguistic dimensions of speech. Aside from extending what has become a relatively influential stream of L2 pronunciation research, a focus on interaction may also help address the gap between existing L2 pronunciation research and theoretical views on SLA (Galaczi et al., 2017).
Specifically, any consideration of the role of L2 accentedness and comprehensibility in interactive effectiveness would be remiss if it did not consider the important role interaction plays in L2 development (Gass & Mackey, 2015).

Interaction Hypothesis. As put forth in the Interaction Hypothesis (e.g., Long, 1996), language learning occurs during communicative breakdowns in conversations involving L2 speakers. Within these breakdowns, in an effort to repair communication, speakers, whether native or nonnative, will incorporate discourse moves such as clarification requests or comprehension and confirmation checks. These discourse moves comprise negotiation for meaning, which facilitates L2 development by drawing l measures of speech led to the communicative breakdown in question (see Gass & Mackey, 2015, and Mackey & Goo, 2007, for more in-depth overviews). Though discussing monologic tasks, Crowther et al. (2015a, 2015b) made a loose connection to the Interaction Hypothesis. As some linguistic dimensions are more likely to lead to communicative breakdowns than others (Mackey, Gass, & McDonough, 2000), we argued that identifying the linguistic measures of L2 speech related to perceived comprehensibility more so than accentedness would provide L2 learners with explicit knowledge to help them notice and repair their nontarget production during communicative breakdowns. Pedagogically, identifying phonological difficulties that need to be syntactical measures that hinder communication. The issue, of course, is in the actual identification of these measures.

Linguistic sources of communicative breakdowns. The sources of communicative breakdowns in interaction have often been identified through the analysis of language-related episodes (LREs), in which interlocutors discuss a linguistic item, either due to the breakdown itself or a desire/need for linguistic accuracy (Swain & Lapkin, 1998).
However, pronunciation has rarely been the focus of such research (Loewen & Isbell, 2017). One recent example comes from Kennedy et al. (2015), who considered the sources of communicative breakdowns between intermediate and advanced L2 French speakers in Québec, Canada. Using video recordings of each interaction (i.e., stimulated recall), Kennedy et al. asked interlocutors to comment on potential and actual comprehension problems. They found that 18% of reported comprehension issues were related to pronunciation, primarily due to segmental accuracy. Few links were made to suprasegmental measures of speech, and, interestingly, no effect of proficiency was found. These findings align with those of Loewen and Isbell (2017), who employed a more analytical approach, coding LREs for the linguistic measure of interest (without the interlocutor input collected in Kennedy et al., 2015). Despite the different methodological approach, the results were strikingly similar, with 16% of LREs related to pronunciation (90% of which focused on segmental concerns). Though the share of pronunciation-related LREs ranges from 1% to 40% (Bowles, Toth, & Adams, 2014; Bueno-Alastuey, 2013; Gurzynski-Weiss & Baralt, 2014), the fact that more than half of reported issues are attributed to non-phonological dimensions (e.g., lexicon, syntax) supports previous findings that listener understanding is as reliant (in these examples, more so) on grammatical and lexical considerations as it is on phonological ones (e.g., Crowther et al., 2015a, 2015b; Isaacs & Trofimovich, 2012). In terms of pronunciation, the finding that segmental features appear to be the primary source of communicative breakdown aligns strongly with evidence from an ELF perspective.

Lingua Franca Core. To date, one of the most widely cited analyses of potential pronunciation targets based on interactive data comes from Jenkins (2000, 2002).
Jenkins proposed the Lingua Franca Core (LFC), a series of pedagogical targets designed to allow for mutual intelligibility between users of English from different linguistic and cultural backgrounds (with the caveat that, despite the now common inclusion of NSs of English [Jenkins, 2014], ELF does not adhere to any specific native-English norm [Jenkins, 2006; Seidlhofer, 2011]). Through [...] discussion with the interlocutors, Jenkins identified pronunciation-based measures that hindered mutual intelligibility. Within the LFC, Jenkins advocates for core and non-core elements, which parallel Levis's (2005) Intelligibility (core) and Nativeness (non-core) Principles. As a corpus-based approach, the LFC has been criticized for its limited interlocutors and speech (Munro & Derwing, 2006; Sewell, 2017), and has been the source of much debate between ELF and non-ELF scholars (see Dziubalska-[...] for an example). Of potentially greater interest, however, is that the LFC emphasizes segmental accuracy (Park & Wee, 2015), in the process relegating suprasegmental measures (e.g., word stress, pitch range, and rhythm) to non-core status. While this aligns with Kennedy et al. (2015) and Loewen and Isbell (2017) with regard to interactive tasks, it contrasts with findings linking these suprasegmental features to comprehensibility on monologic tasks (Crowther et al., 2015a, 2015b; Dauer, 2005; Field, 2005; Isaacs & Trofimovich, 2012). It should be noted, though, that more contemporary ELF-inspired research has argued for a greater role of intonation in interactive meaning making (Pickering, 2009; Pickering & Litzenberg, 2011).

Methodological concerns. Clearly, findings from monologic and interactive speech differ in terms of which phonological measures hinder interlocutor understanding, and these differences may be tied to methodology.
Monologic-based data emphasizes [...] across an entire utterance, whereas interactive data relies on retrospective analysis, through outsider observation and participant reflection, of specific moments in time (i.e., communicative breakdowns). Similarly, the linguistic source(s) of these breakdowns are determined either through a fully developed coding scheme that encompasses an entire sample (monologic) or through the identification of the source of difficulty in specific episodes (interactive). Within episodes of communicative breakdown, interactive analysis often emphasizes the perceptions of participants themselves, who, in turn, have placed an emphasis on segmental issues. Generally, segmental accuracy receives far greater emphasis in the L2 classroom (e.g., Foote et al., 2012; Foote, Trofimovich, Collins, & Urzúa, 2013; Hardison, 2014), which may bias stimulated recall. The existence of such a bias may help to explain the divergence between interactive and monologic findings. Similarly, monologic tasks target comprehensibility (i.e., perceived ease/difficulty of understanding). This may indicate the effort needed to understand an utterance, yet it does not tell us whether actual loss of meaning occurred. For interactive speech, the emphasis is on actual episodes of communicative breakdown, or moments where meaning was clearly lost. As such, intelligibility (i.e., whether the intended message was understood) may be a more relevant construct for interactive tasks (Loewen & Isbell, 2017).3 However, a primary focus on individual moments does not a) take into account forms less prevalent during target interactions (see Sewell, 2017), or b) allow for an understanding of the internal processes learners engage in to understand speech throughout an interactive encounter (Oppenheimer, 2008); more specifically, how much effort they require to comprehend their speaking partner.

3 As discussed earlier, it is important to remember that comprehensibility and intelligibility are not perfectly aligned.
This too may help to explain differences in the phonological measures relevant to interactive versus monologic speech. One way to begin addressing these methodological gaps would be to apply a monologic approach to interactive tasks. As previously noted, comprehensibility, rather than intelligibility, was chosen for monologic studies as it was more aligned with the measure of understanding featured in high-stakes assessment (Harding, 2017; Isaacs & Trofimovich, 2012). As high-stakes assessments that include paired tasks (e.g., Cambridge First Certificate of English) rely upon similar rubric-based measures (i.e., rater-based perceptions), gathering listener-based perceptions would be a logical first step toward bridging the gap in methodology.

From Listener Perception to Task Performance

An additional concern prevalent in accentedness/comprehensibility-orientated research is that [...] to be the information most [...] (e.g., do lower accentedness/comprehensibility ratings [...] overall score on a monologic task? Does a greater number of LREs lead to a lower score in an interactive task? Are lower task scores related to specific linguistic measures?). If comprehensibility is chosen based on its alignment with high-stakes assessment measures of understanding (Harding, 2017; Isaacs & Trofimovich, 2012), then it seems reasonable that we should see some level of alignment between this global measure and task performance. However, if there is no alignment, then it may be that the linguistic measures identified from a listener-based perspective are less relevant than those identified from a rater-based perspective.

Accentedness and comprehensibility in assessment rubrics. As has already been stated, accentedness and comprehensibility are partially overlapping constructs (Derwing & Munro, 2015).
In fact, when correlations between the two have been reported, their strength of association can be quite high (for example, Crowther et al., 2016, found r > .90 for three different listener groups, and Crowther et al., 2017, found r = .74-.80 across three tasks).4 While comprehensibility is generally rated higher than accentedness, an increase in one still tends to accompany an increase in the other. However, strength of correlation becomes a statistical concern, as it may raise issues of collinearity (Field, 2009), and, ultimately, as the association between two constructs increases, there is concern about whether the two constructs are actually distinct or are simply different measures of the same underlying skill (Warner, 2008), in this case L2 speaking. This may help to explain why, from an assessment perspective, accentedness and comprehensibility are often conflated into a single scale (e.g., Harding, 2018; Isaacs et al., 2015). For example, Ockey and French (2016) describe accent within their assessment framework as [...] the local variety, and how much this difference is perceived to impact comprehension of listeners who are familiar with the local variety. Therefore, the strength of an accent indicates the degree to which it is judged to be different than the local variety, and how it is perceived to impact the comprehension of users of the local variety ([...], emphasis added). In another example, Isaacs et al. (2015) describe how [...] accentedness, intelligibility, and comprehensibility have been conflated in the revised IELTS pronunciation scale. For [...] understand throughout; L1 [...] greater precision in which construct is being addressed across Band scores, drawing upon the results of a mixed-methods analysis of eight IELTS examiners' use of the Pronunciation scale.

4 Also see Munro & Derwing (1995), who included speaker-specific correlations and found that strength of association between accentedness and comprehensibility ranged from .41 to .82.
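To make the collinearity concern concrete, the short Python sketch below computes the variance inflation factor (VIF) that two correlated predictors, such as accentedness and comprehensibility entered together in a regression, would each receive. It assumes a model with exactly two predictors, in which case regressing one predictor on the other gives R^2 = r^2, so VIF = 1/(1 - r^2); with more predictors, the multiple R^2 would be needed instead. The function name is hypothetical, and the r values are simply those reported above, not a reanalysis of any dataset.

```python
def vif_two_predictors(r: float) -> float:
    """Variance inflation factor for each of two predictors correlated at r.

    Assumes a two-predictor model: regressing one predictor on the other
    gives R^2 = r^2, so VIF = 1 / (1 - r^2).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r ** 2)


if __name__ == "__main__":
    # Correlations of the size reported above (r = .74-.80; r > .90).
    for r in (0.74, 0.80, 0.90):
        print(f"r = {r:.2f} -> VIF = {vif_two_predictors(r):.2f}")
    # prints:
    # r = 0.74 -> VIF = 2.21
    # r = 0.80 -> VIF = 2.78
    # r = 0.90 -> VIF = 5.26
```

Under this two-predictor assumption, VIF grows sharply as r approaches 1, which is one way to see why correlations above .90 raise concern. Cutoffs for a "problematic" VIF (values of 5 or 10 are often cited) vary by source.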
Isaacs et al.'s (2015) results indicated inconsistent classification of examinees into Bands 5-8, with Bands 5 and [...] ("shows all the positive features of [band below] and some, but not all, of the positive features of [band above]") [...] different measures when conducting their ratings. Interestingly, despite variability in which measures examiners attended to, Isaacs et al. still found medium-strength correlations between comprehensibility and IELTS Speaking (r = .51) and Pronunciation (r = .48) scores. This tendency to conflate accentedness and comprehensibility into a single scale appears to be fairly prevalent in L2 pronunciation assessment (see also Harding, 2017, 2018, for recent overviews). Conflation of constructs is concerning, as Derwing and Munro's line of research (including my own) is pedagogically orientated and should ideally serve L2 speakers well on standardized tests such as IELTS. However, if the measures relevant to scale-based perception differ from those relevant to rubric-based rating, there is a potentially dangerous gap between L2 pronunciation instruction and assessment.

A listener versus rater dichotomy. Before considering linguistic correlates of the rating scales used in tests such as IELTS, there is an important distinction to consider, described in depth by Yan and Ginther (2018). The evaluation of L2 speech production is usually conducted by two groups: listeners and/or raters. While listener groups can include nearly anyone, rater groups consist of those who have received formal training in how to rate speaking performance on a language proficiency test (e.g., IELTS, TOEFL). Primarily interested in impressionistic judgments, research that employs listeners has emphasized global ratings of accentedness, comprehensibility, and intelligibility.
From a listener perspective, speech perception is often measured through Likert-scale ratings (accentedness, comprehensibility) or orthographic transcription (intelligibility; Derwing & Munro, 2015). In comparison, rater-based studies (e.g., Davis, 2009; Lazaraton & Davis, 2008; Ockey, 2009) align with the assessment goals of selection and placement (Yan & Ginther, 2018), and frequently employ some form of assessment rubric (e.g., IELTS, TOEFL) to collect perception measures. Clearly, the body of research I have highlighted to this point has emphasized impressionistic perceptions, as opposed to those of trained raters. While impressionistic perceptions of global measures are likely to be considered during rater scoring of L2 speech (Yan & Ginther, 2018), the limited emphasis that listener-based studies have placed on overall task performance leaves the relative weight of impressionistic judgments in rater scoring unknown.

Linguistic correlates of rubric rating. Beyond the work of Isaacs et al. (2015) described above, little work has addressed the specific linguistic measures that raters attend to when providing task scores, whether for overall task performance or pronunciation-specific categories. Rather, research on monologic assessment has considered variables such as accent familiarity (e.g., Winke & Gass, 2013; Winke, Gass, & Myford, 2013), and research on paired assessment has considered discourse- and individual-based properties, such as interactive patterning (Galaczi, 2008), interlocutor personality (Ockey, 2009), and interactional competence (May, 2011). While recent studies have proposed rubrics [...] (Isaacs & Trofimovich, 2012; Isaacs, Trofimovich, & Foote, 2017), these rubrics draw upon listener-based rating data and focus group discussions with English for Academic Purposes professionals.5

5 Both Isaacs & Trofimovich, 2012, and Isaacs et al., 2018, targeted L2 English for their scale.
They are not derived from any specific standardized assessment (e.g., IELTS, TOEFL). While Isaacs et al. (2017) propose their rubric as a tool for pre- and in-sessional university students, they still present it primarily as a tool to support continued oral language development.

Individual listener/rater effects. L2 pronunciation scholars, whether listener- or rater-orientated, have paid much attention to how individual characteristics may impact speech perception/rating, specifically with regard to linguistic training (e.g., Saito, Trofimovich, & Isaacs, 2017), familiarity with accented speech (e.g., Gass & Varonis, 1984; Winke & Gass, 2013; Winke et al., 2013), and the L1 background (native versus nonnative) of the listeners themselves (e.g., Bent & Bradlow, 2003; Harding, 2012; Major, Fitzmaurice, Bunta, & Balasubramanian, 2002). Yan and Ginther (2018) distinguished listeners from raters based on formal training in a[...]-based research, though it manifests in a different way. [...] Trained vs. naïve [...]-based research considers how linguistic training may impact speech perception, though such research has been inconclusive (e.g., Bongaerts, van Summeren, Planken, & Schils, 1997; Calloway, 1980; Saito et al., 2017; Thompson, 1991). Saito et al. (2017), working with the same data utilized in Isaacs and Trofimovich (2012) and Trofimovich and Isaacs (2012), found that a group of raters with linguistic and pedagogical experience provided more lenient ratings of accentedness and comprehensibility than their inexperienced peers, and were more consistent in evaluating complex linguistic measures (word stress, intonation, rhythm). This consistency led Crowther et al. (2015a, 2015b, 2017) to employ MA-level applied linguistics students with both L2 instructional and learning experience.
They asked each listener to rate not only accentedness and comprehensibility, but also 10 linguistic measures of speech (spanning phonology, fluency, lexicogrammar, and discourse dimensions). However, whether the perceptions of these experienced listeners aligned with those of listeners with less linguistic training was not addressed. The effect of accent familiarity (whether L1-specific or nonnative in general) on task scoring has also been a popular empirical topic. Gass and Varonis (1984) found that for 142 English NS listeners, familiarity with a) nonnative speech in general, b) a specific nonnative accent, and c) a specific nonnative speaker facilitated their ability to draw accurate meaning from L2 utterances. Connecting back to the identification of linguistic associates, the population (undergraduates) employed in Isaacs and Trofimovich (2012) and Trofimovich and Isaacs (2012) was located in the bilingual city of Montréal, Canada. So, while listeners may not have had formal linguistic training, they surely had significant exposure to both L1-French-accented English and L1-English-accented French. As with Crowther et al. (2015a, 2015b, 2017) above, how this accent familiarity may have impacted results is not clear. The role of accent familiarity has been further investigated in relation to a potential bias effect in L2 speech assessment (Winke & Gass, 2013; Winke et al., 2013), where such familiarity/bias has been identified as a potential source of compromised test reliability between raters. An accent familiarity advantage may not be limited to the NSs employed in the above studies, as evidence exists that NNS listeners/raters may have an easier time understanding same-L1 speech (e.g., Bent & Bradlow, 2003; Harding, 2012; Major et al., 2002).
However, contradictory findings (e.g., Crowther et al., 2016; Munro, Derwing, & Morton, 2006) have led to a belief that this advantage is present only for some NNS listeners/raters and only some of the time (Major et al., 2002; Munro et al., 2006), and that it may depend on variables such as L2 proficiency, context, and learner background characteristics (Hayes-Harb, Smith, Bent, & Bradlow, 2008; Smith & Hayes-Harb, 2011; Xie & Fowler, 2013).

The interactive rubric. While oral proficiency interviews (OPIs) involve some degree of interactivity, they are still primarily learner-centered, as demonstrated by the existence of both person- and computer-moderated OPIs (e.g., Thompson, Cox, & Knapp, 2016). While interactive variety can indeed exist within such testing (e.g., Plough & Bogart, 2008), the role the learner takes differs greatly from the role they take when interacting with a fellow learner. Learner-learner interaction elicits different discourse than examiner-controlled interaction (e.g., Johnson & Tyler, 1998; Kormos, 1999). In fact, learners appear to perform better when engaging with a fellow learner than with a tester (Brooks, 2009). Paired assessment appears to receive far less focus in standardized testing, although the Cambridge First Certificate of English includes a 14-minute interactive component (http://www.cambridgeenglish.org/exams-and-tests/first/exam-format/; see Galaczi, 2008, for an in-depth look at this assessment). Numerous variables have been considered in relation to paired assessment, all of which may impact the type and amount of language produced. These include pair/group dynamics (e.g., Galaczi, 2008; Storch, 2002), interlocutor proficiency (e.g., Csepes, 2009; Davis, 2009; Lazaraton & Davis, 2009), interactional competence (May, 2011; Young, 2011), CAF measures (Sato, 2014), planning time (Niita & Nakatsuhara, 2014), and linguacultural differences (Scollon, Scollon, & Jones, 2012).
However, the majority of rubric-based paired/group oral assessment scholarship has come from Ockey and colleagues (Ockey, 2009, 2011; Ockey, Koyama, Setoguchi, & Sun, 2015). In these studies, Ockey et al. have utilized a group-assessment rubric measuring fluency, grammar, vocabulary, pronunciation, and communicative strategies across five bands. While this stream of research has worked exclusively with university-level Japanese students, it has still proven informative, specifically highlighting a high association between group oral performance and TOEFL iBT speaking scores (r = .76; Ockey et al., 2015). Of concern, and in line with Ockey and French (2014), is that the pronunciation category has conflated accentedness and comprehensibility into a single construct. For example, for Band 3 the following descriptor is given: "Pronunciation is good but has still not mastered the sound system of English; accent [...]" This once again makes it difficult to determine what specific dimensions raters are addressing, and whether they are consistent in these dimensions.

The Current Study

My dissertation serves as a follow-up to my previous work (Crowther et al., 2015a, 2015b, 2016, 2017), which has aimed to identify pedagogical pronunciation targets that would prioritize L2 learners' ability to produce understandable speech (Derwing & Munro, 2015; Jenkins, 2000; Levis, 2005). However, the stream of research I have subscribed to, which focuses on the constructs of accentedness and comprehensibility, has prioritized monologic performance. As previously discussed, this line of research has produced findings that do not seem to align with those found for interactive tasks. Specifically, and with a focus on listener understanding, monologic tasks place a greater emphasis on the production of suprasegmental measures, such as word stress, intonation, and rhythm (e.g., Crowther et al., 2015a, 2015b; Isaacs & Trofimovich, 2012).
This is in contrast with interactive tasks, where listener understanding appears to be tied most significantly to segmental accuracy (e.g., Jenkins, 2000; Kennedy et al., 2015; Loewen & Isbell, 2017). This difference across task type may be related to two key methodological differences. First, while monologic tasks emphasize comprehensibility (i.e., ease of understanding), intelligibility (i.e., accuracy of understanding an intended message) appears to be the primary focus of interactive analyses (Loewen & Isbell, 2017). Second, linguistic correlates of comprehensibility on monologic tasks have been identified by coding specific linguistic measures across longer utterances. In interactive tasks, researcher observation and interlocutor reflection on specific moments (i.e., communicative breakdowns) are used to identify sources of mis- or non-understanding.

To help bridge this gap, the current study applies monologic methodology to interactive speech. Twenty intensive English program (IEP) students completed one interactive and three monologic (Picture, Experiential, Academic) tasks. In the interactive task, speakers discussed an opinion-orientated topic with a fellow participant. Participating speakers came from one of two IEP levels and represented two L1s, Japanese and Chinese, which allowed for some control over potential interlocutor effects.6 Using 60-second (interactive) or 30-second (monologic) excerpts, NS listeners rated each speaker (on Likert scales) per task for accentedness and comprehensibility. I acoustically coded all utterances for a series of phonological and fluency measures (derived from Isaacs & Trofimovich, 2012). From this methodology, I address the following research questions:

1. Does listener perception of L2 accentedness and comprehensibility differ as a function of task (monologic vs. interactive)?
2. Do the linguistic measures of L2 speech that influence listener perception of L2 accentedness and comprehensibility differ as a function of task (monologic vs. interactive)?
3. Do listener perceptions of L2 accentedness and comprehensibility follow any patterns across task (monologic and interactive)?

6 For [...] which participants were drawn.

The next research question addresses one potential reason L2 learners may not reference suprasegmental measures during LRE- and stimulated recall-based analyses (if, of course, these measures are indeed tied to understanding). Teacher respondents to surveys on pronunciation instruction have indicated a segmental bias in the classroom (Breitkreutz et al., 2001; Foote et al., 2011, 2013; Hardison, 2014) [...] suprasegmental measures during LREs and stimulated recall. To gain a better understanding of [...]-point Likert scale questionnaire targeting familiarity with, previous instruction on, self-awareness of, and perceived importance of five phonological measures (consonants, vowels, word stress, intonation, rhythm). Drawing upon questionnaire results, I consider the following research question:

4. What awareness of phonological measures of L2 speech do learners possess?

Finally, as previously discussed, speech production has been measured from the perspective of both listener and rater (Yan & Ginther, 2018). While research questions 1-3 prioritize listeners, it remains to be seen how much impact such global ratings have on overall task performance, the primary target of raters. While it can be argued that accentedness and comprehensibility are both relevant to overall performance (Yan & Ginther, 2018), the relative weight of their impact is unknown. For this reason, three tasks (Experiential, Academic, Interactive) were assessed using task-specific rubrics. This allowed for the inclusion of research questions 5-6:

5. Does listener perception of L2 accentedness and comprehensibility predict overall task performance?
6. Do linguistic measures associated with listener [...] comprehensibility align with those associated with rater [...] interactive tasks?
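As an illustration of the kind of acoustic fluency coding mentioned above, the Python sketch below computes two fluency measures used in this line of research, articulation rate and mean length of run, from a pause-annotated utterance. It is a minimal sketch under stated assumptions, not the coding procedure used in this study: the `Segment` annotation format, the function name `fluency_measures`, and the 0.4-second silent-pause cutoff are hypothetical choices (studies differ in the cutoff they adopt). Here, articulation rate is syllables per second of phonation time (silent pauses at or above the cutoff excluded), and mean length of run is the mean number of syllables produced between such pauses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One stretch of an annotated utterance (hypothetical format)."""
    kind: str            # "speech" or "pause"
    duration: float      # seconds
    syllables: int = 0   # syllables produced (0 for pauses)


def fluency_measures(segments: List[Segment], min_pause: float = 0.4) -> Tuple[float, float]:
    """Return (articulation_rate, mean_length_of_run) for one utterance.

    Silent pauses of at least `min_pause` seconds (an assumed cutoff) are
    excluded from phonation time and end the current run; shorter silences
    are counted as part of speech.
    """
    phonation_time = 0.0
    total_syllables = 0
    runs: List[int] = []
    current_run = 0
    for seg in segments:
        if seg.kind == "pause" and seg.duration >= min_pause:
            if current_run > 0:          # close the run this pause ends
                runs.append(current_run)
            current_run = 0
        else:
            phonation_time += seg.duration
            total_syllables += seg.syllables
            current_run += seg.syllables
    if current_run > 0:                  # final run with no trailing pause
        runs.append(current_run)
    articulation_rate = total_syllables / phonation_time if phonation_time else 0.0
    mean_length_of_run = sum(runs) / len(runs) if runs else 0.0
    return articulation_rate, mean_length_of_run


if __name__ == "__main__":
    utterance = [
        Segment("speech", 2.0, 8),
        Segment("pause", 0.2),           # below cutoff: stays inside the run
        Segment("speech", 1.0, 4),
        Segment("pause", 0.6),           # at/above cutoff: ends the run
        Segment("speech", 1.5, 6),
    ]
    ar, mlr = fluency_measures(utterance)
    print(f"articulation rate = {ar:.2f} syll/s, mean length of run = {mlr:.1f} syll")
```

Treating silences shorter than the cutoff as phonation time, rather than discarding them, is one convention among several; exposing `min_pause` as a parameter keeps that choice explicit.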
My dissertation has theoretical, pedagogical, and assessment implications. Theoretically, L2 pronunciation has received relatively little attention in models of L2 development and assessment (Galaczi et al., 2017). However, if we consider that L2 learning occurs during communicative breakdowns (i.e., the Interaction Approach; Long, 1996), of particular concern is whether the suprasegmental measures identified in monologic speech as creating underlying listener difficulty go unperceived during interaction. As interactive findings using LREs and stimulated recall indicate minimal attention to suprasegmentals during interaction, it may be that such linguistic measures are in need of greater pedagogical focus. That segmental elements tend to receive greater classroom attention would support this argument (e.g., Breitkreutz et al., 2001; Foote et al., 2012, 2013; Hardison, 2014). Considering that explicit pronunciation instruction has been shown to be effective (Lee et al., 2015; Saito, 2012), addressing suprasegmental measures pedagogically would ideally minimize pronunciation as a concern during communicative breakdowns, further [...] grammatical targets. This proposal, though, is based on the hypothesis that the importance of suprasegmentals in producing understandable speech found during monologic performance is relevant to interactive performance. In terms of assessment, accentedness and comprehensibility have often been confounded into a single scale (Harding, 2018; Isaacs & Trofimovich, 2012). While the generally high correlation between the two (e.g., Crowther et al., 2016, 2017) would serve as one justification for this, the fact that comprehensibility is also usually found to differ significantly from accentedness (Derwing & Munro, 2015) indicates potential concerns with this approach. No study has yet considered whether accentedness and comprehensibility, as conceptualized following Derwing and Munro (2015), inform task rating.
If these constructs do indeed exert influence over task rating, then the pedagogical targets drawn from such research would serve to benefit both L2 pronunciation development and L2 assessment preparation. If not, then the targets identified in my previous studies (Crowther et al., 2015a, 2015b, 2016, 2017) may be limited in their generalizable relevance.

CHAPTER 2: METHODOLOGY

Participants

Participants consisted of Speakers and Assessors recruited from the student body of an English-medium Midwestern American university. The latter served as either Listeners or Raters, following Yan and Ginther (2018).

Speakers. I recruited 29 nonnative-English speakers (NNSs) from university-run intensive English program (IEP) courses. Students enrolled in IEP courses choose to pursue full-time English language study, with a school-designed placement test placing each into one of five proficiency levels (090-094). The 29 Speakers (M age = 21.41 [SD = 4.87]; Female = 13, Male = 16) represented two L1 groups: Japanese (N = 15, female = 8, male = 7) and Chinese (N = 14, female = 5, male = 9).7 Speakers began learning English on average at age 10.00 (SD = 3.08) and had studied for 11.46 years (SD = 4.50). Five Speakers reported prior study abroad experience in the US (1-3 years, all during high school). All but two Speakers reported English as their L2 (1 Japanese Speaker reported Chinese, 1 Chinese Speaker reported Japanese), and nine reported an L3 (Spanish = 3, Japanese = 2, Korean = 2, French = 2). An important difference is that while the Chinese Speakers enrolled in IEP courses with the goal of pursuing undergraduate study at the university, only one Japanese Speaker indicated a similar goal. The remaining 14 Japanese Speakers were participants on a semester-length study abroad, either company- (N = 5) or university-sponsored (N = 9).
Table 1 provides biographical data, including standardized and self-assessed proficiency measures.

7 Initial data collection included one L1 Vietnamese Speaker and one L1 Spanish Speaker. As all other Speakers were either Chinese or Japanese, I removed these two Speakers from analyses.

Initial data collection occurred during summer 2017 (N = 6). I recruited Speakers from the two highest levels of the IEP, 093 (N = 4) and 094 (N = 2), via class visits. Speakers received US $20 as compensation, plus 60 minutes of English tutoring provided by me. The second round of data collection took place in fall 2017 (N = 23), with Speakers again recruited via class visits to IEP 093 (N = 1) and 094 (N = 22). Speakers received either US $20 plus 30 minutes of tutoring (N = 14) or 120 minutes of tutoring with no monetary compensation (N = 9).

Assessors. Speakers' performance was scored by either Listeners or Raters.

Listeners. Thirty-six native-English-speaking undergraduate students (M age = 20.61, SD = 1.20, Range = 18-25; Female = 35, Male = 1)8 assigned speech scores for the L2 utterances of the 29 Speakers described above. I recruited Listeners from TESOL minor courses (Second & Foreign Language Learning, Pedagogical Grammar) offered by the Department of Linguistics and Languages. Listeners were primarily education majors (N = 31), though additional majors included Spanish (N = 2), French (N = 1), Chinese (N = 1), and Linguistics (N = 1). Twenty-six reported pursuing the TESOL minor, with six also pursuing a language minor (Spanish = 5, Chinese = 1). Listeners had completed one of two Intro to Linguistics courses at the university but reported no additional theoretical linguistics courses. Those completing a Spanish major/minor had additionally taken several linguistics courses specific to their degree.
Only three indicated prior language teaching experience, all with learners under 10 years of age (Spanish for 6 months in the US, English for one month in Japan, English for one month in Kazakhstan).

8 The gender breakdown presented reflects that of the students enrolled in the TESOL minor program.

Twenty-one Listeners reported knowledge of an L2 (Spanish = 14, French = 2, Chinese = 2, Hindi = 1, Serbian = 1, American Sign Language = 1), and two of a third+ language (French & Spanish, Japanese).9 Three had spent a short time abroad as part of their undergraduate studies (one semester in Ecuador [N = 2], one year in Spain). Listeners rated their exposure to specific accented-L2 English speech on 9-point Likert scales (1 = No previous exposure, 9 = Extensive previous exposure). Comparing the two primary L1s of the focal Speakers, Listeners reported greater familiarity with Chinese- (M = 4.06, SD = 2.40) than with Japanese- (M = 2.89, SD = 2.00) accented speech. A Wilcoxon signed-ranks test indicated this difference to be significant (Z = -3.26, p = .001), with a strong effect size (r = .61).

Speech rating occurred in late fall 2017. I recruited Listeners through class visits to TESOL minor courses with the permission of class instructors. As one course occurred online, the instructor forwarded a recruitment e-mail to enrolled students. All Listeners received class credit as assigned by their instructors. Due to the limited number of international students enrolled in the target TESOL minor courses (only four L1 Chinese students completed the procedure), no NNS Listener data are included. Though such a comparison would be potentially interesting, scholars comparing NS and NNS listener ratings (e.g., of accentedness and comprehensibility) have reported minimal differences in perception between groups (Crowther et al., 2016; Derwing & Munro, 2013; MacKay et al., 2006).

Raters.
Two NSs and two NNSs of English (M age = 27.25, SD = 2.50; Female = 2, Male = 2) scored the task performance of the 29 Speakers. Raters were graduate students in second/foreign language teaching programs. They indicated relatively high levels of L2 accent familiarity both in general (8 [SD = .141, Range = 6–8]) and for Chinese-accented (7.50 [SD = 1.91, Range = 5–9]) and Japanese-accented (6.00 [SD = 2.94, Range = 3–9]) English speech. All familiarity ratings used a 9-point Likert scale (1 = No previous exposure, 9 = Extensive previous exposure). All four Raters indicated previous language instructional experience, teaching English (N = 3), Arabic (N = 1), Chinese (N = 1), German (N = 1), and Latin (N = 1). On average, Raters had taught for 3.15 years (SD = 2.63), teaching a range of age groups (ages 1.5–23) in both second (N = 2) and foreign (N = 4) language contexts. I recruited Raters from a graduate-level L2 assessment course, and each received class credit as compensation. Further biographical data are provided in Table 2.

Footnote 9. Languages listed include only those in which participants rated their proficiency on a 9-point Likert scale as 2 or higher. The scale end points were 1 = Near beginner and 9 = Near nativelike.

Materials

Monologic tasks. The three monologic tasks were the same as those used in Crowther et al. (2015a, 2015b, 2017). They consisted of a picture narrative (hereafter Picture), an IELTS-inspired long-turn task (hereafter Experiential), and a TOEFL iBT-inspired integrated task (hereafter Academic). Having been used in previous research, the three tasks were established speech elicitation tools and allowed for comparison across studies. IELTS and TOEFL iBT inspired the Experiential and Academic tasks respectively. However, the stringent procedures used in high-stakes assessment were not present during data collection. For this reason, I have chosen to use more descriptive labels throughout.
Picture. The Picture task was the same as that used in much previous speech production research (e.g., Derwing, Munro, Thomson, & Rossiter, 2009; Isaacs & Trofimovich, 2012). It is available through the IRIS Database (Marsden, Mackey, & Plonsky, 2016) under Derwing et al. (2009). The eight-frame colored picture narrative tells the story of two strangers who bump into each other on a busy street corner. In the process, they accidentally exchange their identical suitcases. Upon returning home and opening their suitcases, they realize their mistake. Following standard procedures (e.g., Derwing et al., 2009; Isaacs & Trofimovich, 2012), I gave Speakers one minute to preview the eight pictures before they provided their response. The picture narrative can be found in Appendix A.

Table 1
Biographical data for Speakers.
N: 29
Age: Mean (SD) = 21.41 (4.87); Median = 20; Range = 18–36
Age of Onset (SD): 10.00 (3.08)
Years of Study (SD): 11.46 (4.50)
Proficiency (SD): TOEFL (N = 18) = 71.44 (5.37); TOEIC (N = 12) = 678.75 (90.73)
Self-Rated Proficiency (SD) (1 = low ability, 9 = high ability): Speaking = 4.72 (1.49); Listening = 5.27 (1.38); Reading = 5.66 (1.22); Writing = 5.23 (1.23)
Notes. N = sample size; SD = standard deviation.

Table 2
Biographical data for each Rater.
Rater # | Age | L1 | L2 | L2 Proficiency | Current Degree (Field) | Teaching Experience | Accent Familiarity: Chinese | Accent Familiarity: Japanese
1 | 27 | German | English | TOEFL (101) | MA (German Studies) | 0.5 years (Latin), 0.5 years (German) | 7 | 4
2 | 28 | Arabic | English | TOEFL (102) | MA (TESOL) | 4 years (Arabic), 0.75 years (English) | 5 | 3
3 | 30 | English | Chinese | OPIc (Advanced Mid) | MA (TESOL) | 5.5 years (Chinese), 0.5 years (English) | 9 | 8
4 | 24 | English | Japanese | OPIc (Intermediate High) | MA (TESOL) | 10 weeks (English) | 9 | 9
Note. Accent familiarity rated on a 9-point scale (1 = Not familiar at all, 9 = Very familiar).

Experiential. The Experiential task drew upon two publicly available IELTS prompts that required Speakers to discuss a prior experience in their life.
The first version asked participants to describe a party that they enjoyed (International English Language Testing System, 2009). The second asked them to describe a restaurant that they enjoyed going to (International English Language Testing System, 2011). Each Speaker received a card with their written prompt, along with several suggested discussion points. They had up to 1 minute to prepare their response (notes were allowed) before they spoke for 1–2 minutes. Acting as the moderator, I followed up each response with one or two questions (e.g., "Have you been to any other similar parties?" for the party prompt). Appendix B provides the full prompts for both versions.

Academic. The Academic task made use of two TOEFL prompts, publicly available through sample test materials (Educational Testing Service, 2012). The prompts targeted skills deemed necessary for successful academic study. For each prompt, Speakers had 45–50 seconds to read a short passage before listening to an audio recording on a related topic. After the audio recording ended, Speakers responded to a question related to the content of the two sources of input. They had 30 seconds to prepare a response (notes were allowed) and then spoke for one minute, drawing on examples from both the reading and the audio. An audio-recorded examiner, presented via a PowerPoint presentation, moderated the task (see Footnote 10). The topic of Version A was social interaction (104-word text, 95-second audio). The topic of Version B was cognitive dissonance (88-word text, 80-second audio). Written and audio texts are available in Appendix C.

Footnote 10. The same sample test materials provided the sound file for the audio-recorded examiner, which I embedded within the PowerPoint.

Interactive task. The work of Ockey and colleagues (2009, 2011, 2015) motivated the Interactive task materials.
In these studies, the authors assessed L1 Japanese university-level learners of English as a foreign language through a group oral discussion. Topics varied across the studies but were generally open in regard to how learners might respond. As all my participants […]-themed topics. Pilot interactions with IEP students in spring 2017 indicated that my initial academic-specific prompts were not appropriate across cultural backgrounds. For example, the pilot […] that were not afforded such choices. As such, I selected more extracurricular topics for use. The final discussion prompts asked Speakers to agree or disagree with one of three statements: it is important to a) attend many activities when studying abroad, b) make international friends when studying abroad, and c) travel to many places when studying abroad. Prompts were counterbalanced across dyads. A series of prompt questions accompanied each statement (e.g., a] Have you attended any new activities while at the school? b] Do you want to make international friends while here at the school? c] Have you visited anywhere while at the school?). Directions instructed Speakers to first express their opinion to their partner. They were then to determine whether their opinion differed from their partner's and, if it did, to persuade their partner of their own. Speakers had 2 minutes to prepare for the interaction but were not allowed to take notes. The full interactive prompts are available in Appendix D.

Task comparisons. Crowther et al. (2017) differentiated the three monologic tasks using Robinson's (2005) Cognition Hypothesis. Their categorization is reproduced in Table 3. The authors deemed Academic (referred to as TOEFL) more complex than Picture and Experiential (referred to as IELTS) because of the task's greater reasoning demands. Participant ratings indicated a similar perception, with Academic seen as more complex than both Picture (p = .009) and Experiential (p = .005).
Beyond this, the authors differentiated between Picture and Experiential based on the greater linguistic constraints placed on participants when presenting a picture narrative. In Picture, participants are constrained by the lexical items required to complete the narrative, whereas in the Experiential task they are free to draw upon their entire linguistic repertoire.

Table 3
Task complexity across three monologic tasks (as reported in Crowther et al., 2017).
Picture Experiential Academic
Few elements + +
Spatial reasoning + +
Here/now +
Causal reasoning +
Intentional reasoning +
Perspective taking +
Note. Complexity categories drawn from Robinson (2005).

The Interactive task is potentially more complex due to the presence of an interlocutor (+ Interactive). However, as applied in the current study, the overall complexity of the interactive […] 2005) Cognition Hypothesis, the task is + open solution, + convergent solution, and + few participants. The complexity would come from one-way flow, few contributions needed, and negotiation not needed. Ultimately, how complex the task was depended on whether Speakers aligned or differed in their responses and on how often they felt compelled to contribute. As this was not an assessment context (as it was in the Ockey studies), there were no consequences if a Speaker chose not to engage fully. It should be noted […] same proficiency, +/- same gender) were not considered in the current study. This was due primarily to the relative homogeneity of the Speakers recruited and of their subsequent dyads (reported later). In terms of linguistic resources, the prompts were opinion-based, so Speakers were free to draw upon whichever linguistic resources they felt would best support their intended message. This is similar to the linguistic freedom available in the Experiential task. The primary difference between the two is that, ideally, Speakers would take their partner's contributions into consideration.
Figure 1 presents a continuum of linguistic constraint across the four tasks.

Figure 1. Continuum of linguistic constraint across four speaking tasks.

Pronunciation survey. Speakers completed a pronunciation survey that targeted six key phonological/fluency measures of L2 speech. Based on the rating categories of Crowther et al. (2015a, 2015b), the measures included Segments (divided into Consonants and Vowels), Word Stress, Intonation, Rhythm, and Speech Rate. Each measure received a brief written explanation. Speakers then used a series of 5-point Likert scales to rate four aspects of each measure:
- their familiarity with it;
- the amount of instruction they had received on it;
- their self-awareness of it when speaking;
- its perceived importance for intelligible speech.
Five NNSs piloted the survey, and an experienced TESOL practitioner provided additional feedback. Based on this feedback, I clarified each of the written explanations. Appendix E provides the complete survey, including end-point descriptors for the 5-point scales.

Background questionnaire. Speakers filled in Questionnaire A (Appendix F), which targeted biographical information, language learning history, study abroad history, and university study plan. Listeners and Raters completed Questionnaires B and C (Appendices G and H), respectively. Both questionnaires requested biographical information, education history, language learning and teaching history, and accent familiarity.

Procedure: Speech Elicitation

Speakers committed a total of 60 minutes to data collection: 30 minutes for monologic task completion and 30 minutes for interactive task completion. Data collection occurred in 90-minute blocks involving two Speakers each. Speaker 1 would arrive first and complete the 30-minute monologic session. As this session finished, Speaker 2 would arrive and both would engage in the 30-minute interactive session.
Upon completion, Speaker 1 would leave and Speaker 2 would complete their own 30-minute monologic session. This approach also enabled counterbalancing of monologic and interactive task performance. Speaker 1 read and signed a consent form before beginning their monologic session, and Speaker 2 did so before their interactive session.

Monologic session. The Speaker and I completed Questionnaire A together, allowing me to ask follow-up and clarification questions when needed. Speakers then completed the three monologic tasks, with me serving as their moderator to provide instruction and clarification when needed. The order of tasks was counterbalanced (e.g., Picture, Experiential, Academic; Academic, Picture, Experiential; etc.) to control for any potential task-ordering effect. A Sony ICD-PX333 digital voice recorder recorded all Speaker output. Upon completion, the Speaker and I completed the pronunciation survey, with clarification provided when requested or necessary.

Interactive session. The two Speakers met in a large room with two chairs positioned 2–3 feet apart. After introductions, I explained that they would each receive the same interactive prompt. This prompt would serve as the basis for an 8- to 10-minute audio- and video-recorded interaction. Each Speaker then took two minutes to read through the prompt and prepare a response. Before they began, I addressed any clarification questions and switched on the audio (Sony ICD-PX333 digital recorder) and video (Sony HDR-CX580 camera) recorders. Speakers then interacted. Some interactions ended organically, when the two Speakers appeared to have nothing left to discuss. Others I ended deliberately after 8–10 minutes, at an appropriate place in the interaction (e.g., a completed thought, a short pause). Speakers then completed the post-interaction questionnaire. Interaction length ranged from 4:02 to 8:31 (M = 6:36, SD = 1.31).

Procedure: Speech Rating

Stimuli preparation.
I prepared each monologic and interactive speech sample for speech rating.

Monologic. In line with Crowther et al. (2015a, 2015b), I normalized speech samples for peak amplitude. I then edited each sample down to the initial 30 seconds of speech produced, removing all initial disfluencies (e.g., uh, um) and false starts. This length is consistent with previous speech production research using 20- to 60-second recordings to elicit listener judgments of L2 speech (Derwing, Munro, & Thomson, 2008). It is also long enough to allow for reliable judgments (Munro, Derwing, & Burgess, 2010). In addition to these 30-second excerpts, I identified two approximately 10-second excerpts per monologic task, each with logical beginning and end points. These excerpts are intended as measures of intelligibility (i.e., accuracy of understanding) in a future study, so I do not discuss them in the following analyses. I refer to them here only to clarify the procedure Listeners went through, described below. In summary, for each monologic task, each Speaker provided one 30-second excerpt and two approximately 10-second excerpts.

Interactive. I reviewed each interactive speech sample to identify a 60-second excerpt that prominently featured both Speakers. Across interactive samples, individual speaking time ranged from 20.52 s to 38.40 s (M = 29.60, SD = 5.03, Median = 31.30). Though speaking time was not balanced, no excerpt involved a Speaker speaking for less than 37% of the time. This provided Listeners with at least 20 seconds of speech per Speaker, allowing for reliable judgments (Derwing et al., 2008; Munro et al., 2010). Again, I normalized each sample for peak amplitude and removed all initial disfluencies.

Speech rating. Listeners completed speech rating through two 60-minute online questionnaires using Qualtrics (www.qualtrics.com). In Questionnaire 1, Listeners used 9-point Likert scales to rate each monologic speech utterance for accentedness and comprehensibility.
In Questionnaire 2, Listeners orthographically transcribed a subset of the 10-second utterances as a measure of intelligibility and rated each interactive sample for accentedness and comprehensibility. Most global speech rating studies collect speech ratings on site (e.g., Crowther et al., 2015a, 2015b; Munro & Derwing, 1999). Even so, online rating has still yielded high inter-rater reliability (Crowther et al., 2016). Three NSs drawn from the same target population piloted the monologic and interactive rating procedures for timing, clarity of directions, and appropriateness.

Monologic. Listeners rated each 30-second recording for accentedness and comprehensibility using 9-point Likert scales. There is some debate about the appropriateness of such scales (Isaacs & Thomson, 2013; Isbell, 2018; Munro, 2018; Southwood & Flege, 1999). Nonetheless, the use of 9-point scales aligned with much previous research in the area (e.g., Derwing et al., 2015; Isaacs & Trofimovich, 2012). Following Derwing and Munro (2015), I instructed Listeners to treat accentedness as the degree of difference between the Speaker's speech and the target variety (1 = heavily accented, 9 = not accented at all). Comprehensibility was to be treated as how much effort they required to understand the utterance (1 = hard to understand, 9 = easy to understand). Listeners heard each utterance once. They could not advance to the next item until they had heard the entire 30-second recording and rated both accentedness and comprehensibility. They could, however, provide these ratings at any time during the recording. […] (2016) found minimal differences between rating both constructs at once and rating them individually (see Footnote 11). Figure 2 presents the Qualtrics interface. Listeners received online instruction before beginning speech rating. This included a written explanation of each of the targeted constructs and three practice ratings (using pilot recordings).
For accentedness and comprehensibility, the Qualtrics interface informed Listeners that each recording would end after 30 seconds, potentially cutting a Speaker off mid-sentence, and that this should not be considered in their rating. Following Crowther et al. (2015a, 2015b), Listeners rated the three tasks in counterbalanced blocks (e.g., Picture, Experiential, Academic; Experiential, Academic, Picture; etc.). Recordings within each block were randomized into one of six possible orders.

Footnote 11. […] included a third construct, fluency. She did find that L1 German listeners rating L2 German speech indicated greater fluency when rating all three constructs at once than when rating them individually (p = .018). There was also a trend indicating that L1 English/L2 German listeners rated L2 German speech as slightly more comprehensible when rating all three constructs together (p = .054).

Figure 2. Qualtrics interface for monologic task rating.

Interactive. Listeners rated both Speakers within the 60-second interactive samples simultaneously, using the same 9-point scales for accentedness and comprehensibility. The Qualtrics interface informed Listeners that the first voice they heard would be considered Speaker A and the second voice Speaker B. As shown in Figure 3, the 9-point scales designated for Speaker A and Speaker B were clearly labeled. In case of Speaker confusion, the interface gave Listeners the option to indicate that they were unable to clearly differentiate between the two Speakers. After completing their accentedness and comprehensibility ratings for both Speakers, Listeners moved to the next sample. Listeners completed Interactive speech rating in one of seven randomized blocks. As with the monologic tasks, Listeners received online instruction prior to rating, including three practice ratings.
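The ordering scheme for the monologic rating blocks can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure actually used to build the Qualtrics survey. The dissertation does not say how the orders were generated or assigned to Listeners. The cyclic assignment of Listeners to the six task permutations, the seeded shuffling of recordings within each task, and all function names below are therefore hypothetical.

```python
import itertools
import random

# The three monologic tasks rated in counterbalanced blocks.
TASKS = ("Picture", "Experiential", "Academic")


def block_orders():
    """All 3! = 6 possible orders of the three task blocks."""
    return list(itertools.permutations(TASKS))


def recording_orders(recordings, n_orders=6, seed=0):
    """Generate n_orders shuffled orders of one task's recordings.

    Hypothetical: a fixed seed makes the set of orders reproducible.
    """
    rng = random.Random(seed)
    orders = []
    for _ in range(n_orders):
        order = list(recordings)
        rng.shuffle(order)
        orders.append(order)
    return orders


def listener_schedule(listener_index, recordings_by_task, seed=0):
    """Give one Listener a task-block order and a recording order per block.

    Hypothetical assignment rule: Listeners cycle through the six block
    orders and the six recording orders by their index.
    """
    blocks = block_orders()
    task_order = blocks[listener_index % len(blocks)]
    schedule = []
    for task in task_order:
        orders = recording_orders(recordings_by_task[task], seed=seed)
        schedule.append((task, orders[listener_index % len(orders)]))
    return schedule


if __name__ == "__main__":
    recordings = {
        "Picture": ["P01", "P02", "P03"],
        "Experiential": ["E01", "E02", "E03"],
        "Academic": ["A01", "A02", "A03"],
    }
    for task, order in listener_schedule(7, recordings):
        print(task, order)
```

Under this cyclic rule, the 36 Listeners would be spread evenly across the six block orders, six per order. The seven randomized Interactive blocks are not modeled here.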
At the end of the two rating sessions, Listeners self-rated two things for both accentedness and comprehensibility (Appendix I). The first was their understanding of the construct (1 = I did not understand at all, 9 = I understand this concept well). The second was their comfort with rating it (1 = very difficult, 9 = very easy and comfortable). Listeners indicated greater understanding (Mean Acc = 7.58 [SD = 1.71], Mean Comp = 7.81 [SD = 1.60]) than comfort (Mean Acc = 6.19 [SD = 1.78], Mean Comp = 6.89 [SD = 1.67]).

Figure 3. Qualtrics interface for interactive task rating.

Task scoring. The four Raters assessed Speakers on their Experiential, Academic, and Interactive performance. Rating used the audio recordings of monologic performance and the video/audio recordings of interactive performance. Two Raters rated Experiential while the other two rated Academic. The two monologic pairs then each rated half of the interactive task samples. Each pair worked collaboratively to reach consensus on a task score for their assigned Speakers. For training on monologic rating, each pair met with me for a 45-minute session. Prior to training, each Rater took a week to familiarize themselves with their rubric. The training session consisted of three stages. In Stage 1, each pair expressed and discussed questions and concerns regarding their assigned rubric. In Stage 2, they worked collaboratively to rate two speech samples that I had selected to represent a high-performing and a low-performing Speaker. In the final stage, each Rater worked individually to rate 2–3 additional samples before comparing ratings with their partner. Raters then discussed and resolved any discrepancies on their own. Interactive task training followed the same procedure, except that all four Raters were present. After training, Raters took a month to collaboratively assign a score per Speaker per task.

Monologic. Experiential and Academic task scoring used predesigned rubrics.
The Experiential task used the publicly available IELTS speaking rubric (International English Language Testing System, 2016). This 10-band rubric (0–9) featured four categories (fluency and coherence, lexical resource, grammatical range and accuracy, pronunciation). Raters determined an overall score by summing band scores across categories and dividing by four (the number of categories). Averages ending in .25 were rounded up to the next half band, and averages ending in .75 to the next whole band (e.g., 4.25 -> 4.5, 4.75 -> 5.0). For example, Speaker #3 received scores of 5 (fluency and coherence), 6 (lexical resource), 6 (grammatical range and accuracy), and 6 (pronunciation). Dividing their summed score (23) by the number of categories (4) gave 5.75, which was rounded up to an overall Experiential score of 6. This procedure emulated that used for official IELTS rating (https://www.ielts.org/en-us/ielts-for-organisations/ielts-scoring-in-detail). However, it did not feature the same level of training rigor.

The Academic task employed the publicly available TOEFL integrated speaking rubric (Educational Testing Service, 2014). This 5-band rubric (0–4) featured four categories (general description, delivery, language use, topic development). Raters provided a single band score representing performance across categories. As with the Experiential task, scoring followed general TOEFL procedures (https://www.ets.org/toefl/ibt/scores/understand/), though Raters did not receive the same rigor of training. For both rubrics, descriptors accompanied each band for each category.

Unlike Experiential and Academic, no rating rubric exists for this specific Picture task. A search of picture narrative tasks returned limited assessment options (e.g., Sato, 2014, which targeted only fluency).
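The overall-score arithmetic above can be checked with a short Python sketch. It covers the Experiential score and also the unrounded average used for the Interactive rubric described below. The sketch assumes whole-band IELTS category scores, so every four-category mean ends in .00, .25, .50, or .75. The function names are illustrative only.

```python
import math


def ielts_overall(category_bands):
    """Mean of the four IELTS speaking categories, rounded as described above.

    Means ending in .25 round up to the next half band and means ending in
    .75 round up to the next whole band; .00 and .50 are unchanged. Taking
    the ceiling of twice the mean and halving it reproduces this rule only
    because four whole-band scores always give a mean in steps of .25.
    """
    mean = sum(category_bands) / len(category_bands)
    return math.ceil(mean * 2) / 2


def averaged_overall(category_scores):
    """Plain mean with no rounding, as used for the Interactive rubric."""
    return sum(category_scores) / len(category_scores)


# Speaker #3, Experiential: (5 + 6 + 6 + 6) / 4 = 5.75, rounded up to 6.0
print(ielts_overall([5, 6, 6, 6]))
# Speaker #3, Interactive: (2.5 + 2.5 + 2.5 + 2.5 + 3) / 5 = 2.6
print(averaged_overall([2.5, 2.5, 2.5, 2.5, 3]))
```

The ceiling shortcut would misbehave for means outside the quarter-band grid (e.g., 5.1 would become 5.5). An implementation handling half-band category scores would need an explicit rule for those cases.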
Using pilot data, and prior to primary data collection, I considered several solutions drawing from a range of speech rubrics. I consulted with four graduate students enrolled in a language assessment course, the same population intended to use the rubric. Based on this consultation, I determined that any rubric would require a change in task procedure (specifically, more in-depth instructions). To ensure comparability across studies, I chose not to collect task assessment for the Picture task. This left only the Experiential and Academic tasks for this section of the analysis.

Interactive. The four Raters used a rubric previously established in Ockey (2009, 2011) and subsequently used in Leaper and Riazi (2014). Used with permission from Dr. Gary Ockey, the rubric originated as a measure of oral group performance at Kanda University of International Studies (Japan). The rubric is available in Appendix J, as presented in Ockey (2011 [in Language Learning]). It consists of five categories (pronunciation, fluency, grammar, vocabulary, communicative skills/strategies), each scored along five 0–4 bands with accompanying descriptors; half points could be assigned. As with Experiential, I chose to assign Speakers an averaged overall score. Speaker #3 received scores of 2.5 (pronunciation), 2.5 (fluency), 2.5 (grammar), 2.5 (vocabulary), and 3 (communicative skills/strategies). Their overall score was therefore 2.6 (summed score [13] divided by number of categories [5]). Several descriptors in the original design were directly relevant to the target Japanese population of the initial studies, so I rewrote them […]-like phonology […] at katakana-like […]. Though Ockey (2009, 2011) had raters assess interactions live, this study followed May (2011) by providing video recordings of each interaction.

Data Analysis

Not all participants provided a speech utterance for all tasks.
For the Picture task, a technical issue led to the loss of one Japanese Speaker's utterance. For the Academic task, two Japanese Speakers were unable to produce enough language to fill a 30-second sample. No issues arose for the Experiential task. This left 28 Picture utterances, 29 Experiential utterances, and 27 Academic utterances for analyses. For the Interactive task, as might be expected, several scheduled Speakers did not attend their session, leaving multiple Speakers without a speaking partner. I was able to reschedule some of these Speakers by pairing them together, allowing for complete sessions. Even so, five Speakers (all Chinese) were left without speaking partners. In addition, two Japanese Speakers interacted with speaking partners not from the two L1s of focus (see Footnote #1), and I thus removed them from interactive analyses. In total, 22 Speakers (Japanese = 12, Chinese = 10) completed the Interactive task, comprising five Japanese–Japanese, four Chinese–Chinese, and two Japanese–Chinese dyads. Before coding and analysis, I transcribed entire task utterances (monologic and interactive). A second transcriber then verified my transcriptions. I then edited down each utterance to the 30- and 60-second excerpts used for speech rating.

Linguistic coding. I coded each speech sample for a series of phonological and fluency measures following guidelines established in Isaacs and Trofimovich (2012) and Trofimovich and Isaacs (2012). […] measures aligns strongly with the linguistic coding of similar measures by trained coders (Saito et al., 2017), the current study adopted the latter approach for two reasons. First, a subjective approach (as used in Crowther et al., 2015a, 2015b) required a significant level of commitment on the part of listeners (four 2-hour sessions). Given the minimal difference between subjective ratings and linguistic coding, I deemed the latter process to be more time-efficient.
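Two of the fluency measures coded in this study, Articulation Rate and Mean Length of Run, are simple ratios once a sample has been segmented into speech and pause intervals. The Python sketch below is illustrative only. The segment format and the 0.4-second minimum pause are assumptions made for the example. So is the treatment of shorter silences as speaking time. The operational definitions actually followed are those in Appendix K (after Isaacs & Trofimovich, 2012).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One annotated stretch of a sample (hypothetical annotation format)."""
    kind: str        # "speech" or "pause"
    syllables: int   # syllables produced; 0 for pauses
    seconds: float   # duration of the stretch


def fluency_measures(segments, min_pause=0.4):
    """Return (articulation rate, mean length of run).

    Articulation rate: syllables per second of speaking time, where pauses
    of at least min_pause seconds are excluded from speaking time.
    Mean length of run: mean number of syllables between such pauses.
    min_pause is an assumed threshold; shorter silences count as speaking time.
    """
    total_syllables = 0
    speaking_time = 0.0
    runs = []
    current_run = 0
    for seg in segments:
        if seg.kind == "pause" and seg.seconds >= min_pause:
            # A qualifying pause closes the current run.
            if current_run:
                runs.append(current_run)
                current_run = 0
        else:
            total_syllables += seg.syllables
            speaking_time += seg.seconds
            current_run += seg.syllables
    if current_run:
        runs.append(current_run)
    articulation_rate = total_syllables / speaking_time if speaking_time else 0.0
    mean_length_of_run = sum(runs) / len(runs) if runs else 0.0
    return articulation_rate, mean_length_of_run


sample = [
    Segment("speech", 10, 2.0),
    Segment("pause", 0, 0.5),    # counts as a pause (>= 0.4 s)
    Segment("speech", 6, 1.5),
    Segment("pause", 0, 0.25),   # too short: treated as speaking time
    Segment("speech", 4, 0.25),
]
print(fluency_measures(sample))  # 20 syllables / 4.0 s; runs of 10 and 10
```

Raising min_pause merges runs and lengthens speaking time, so both measures depend on the chosen threshold. Any replication would need to match the threshold in the coding guidelines.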
The second reason to pursue the linguistic coding approach was that it offered a finer-grained understanding of what is perceived in speech and how. The subjective approach was time-intensive. In addition, the complexity of the linguistic measures made it difficult to identify the finer components of L2 speech. For example, Saito et al. (2017) reduced the 19 measures employed in Isaacs and Trofimovich (2012) to 10, with several measures of fluency (articulation rate, mean length of run, number of filled and unfilled pauses) […] category was devised consisting of both segmental errors and syllable structure errors (additional reduction occurred for lexical, grammatical, and discourse measures as well). To allow for as much in-depth linguistic analysis as possible, the current study employed 11 phonological (see Footnote 12) and fluency measures previously identified in Isaacs and Trofimovich (2012), following the same coding guidelines (see Footnote 13). Table 4 presents the 11 measures, while Appendix K provides the coding guidelines. For readability, I have revised the names of several measures, as shown in Table 4.

Footnote 12. I removed one measure of phonology (Pitch Range) because overlapping voices in interactive speech made it impossible to calculate using Praat. In Isaacs & Trofimovich (2012) and Trofimovich & Isaacs (2012), this measure showed minimal association with both accentedness and comprehensibility (r < .10).

Table 4
List of 11 phonological and fluency measures (drawn from Isaacs & Trofimovich, 2012).
Original Name | Revised Name
Segmental errors | % Segmental Accuracy
Syllable structure errors | % Syllable Structure Accuracy
Word stress errors | % Word Stress Accuracy
Rhythm (vowel reduction ratio) | % Rhythm
Pitch contour (intonation error rate) | % Intonation
Filled pauses | Filled Pauses
Unfilled pauses | Unfilled Pauses
Pause errors | % Pause Appropriateness
Repetitions/self-corrections | % Repetitions/Self-Corrections
Articulation rate | Articulation Rate
Mean length of run | Mean Length of Run

Reliability. Reliability measures were calculated for monologic and interactive speech rating, linguistic coding, and task scoring.

Monologic speech rating. Following common conventions in speech production research (Munro & Derwing, 2015), I determined Listener reliability by calculating intraclass correlation coefficients (ICCs) for accentedness and comprehensibility for each monologic task. As reported in Table 5, reliability was within acceptable levels (> .80; Larson-Hall, 2009). I therefore calculated a mean score that averaged speech ratings across Listeners for each Speaker per task. These mean scores serve as the accentedness and comprehensibility data for statistical analyses.

Footnote 13. Isaacs & Trofimovich (2012) and Trofimovich & Isaacs (2012) included grammatical, lexical, and discourse measures. As my interest lies in pronunciation training, I emphasize only the phonological and fluency measures here.

Table 5
Intraclass correlation coefficients for accentedness and comprehensibility.
Construct | Picture | Experiential | Academic | Interaction¹
Accentedness | .879 | .888 | .919 | .948
Comprehensibility | .932 | .907 | .966 | .956
Note. ¹ Only the 22 Speakers who completed the interactive task were included.

Interactive speech rating. Interactive speech rating followed the same procedure as above but included only the 22 Speakers who completed the interactive session of the study. As shown in Table 5, ICCs were within acceptable levels (> .80).
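The ICCs in Table 5, and the per-Speaker mean ratings that follow from them, can be illustrated with a short Python sketch. The dissertation does not name the ICC model used. Purely as an assumption, the sketch computes a two-way, consistency, average-measures ICC from a Speakers × Listeners rating matrix. This form is often written ICC(3,k) and is numerically equal to Cronbach's alpha. The function names and toy data are hypothetical.

```python
def icc_consistency_average(ratings):
    """Two-way consistency, average-measures ICC (ICC(3,k)).

    ratings: one row per Speaker (target), one column per Listener (rater),
    with no missing cells.
    """
    n = len(ratings)          # number of Speakers
    k = len(ratings[0])       # number of Listeners
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares into Speaker, Listener, and residual parts.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


def speaker_means(ratings):
    """Mean rating across Listeners for each Speaker."""
    return [sum(row) / len(row) for row in ratings]


# Toy data: 3 Speakers rated by 2 Listeners on the 9-point scale.
toy = [[1, 2], [2, 1], [3, 3]]
print(round(icc_consistency_average(toy), 3))  # 0.667
print(speaker_means(toy))                      # [1.5, 1.5, 3.0]
```

In the consistency form, a constant difference in leniency between Listeners does not lower reliability. An absolute-agreement ICC would penalize such differences and could give lower values on the same data.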
However, an additional consideration was Listeners' confidence in differentiating between the two Speakers in each interaction. Percent confidence ranged from 59% to 100%, though only one dyad was below 70%. Given the high reliability between Listeners for accentedness and comprehensibility, I removed only this dyad (mixed L1 Japanese–Chinese) from further analyses. This left 10 dyads (20 Speakers). Combined with the lost monologic samples, only 17 Speakers completed all four tasks.

Linguistic coding. As in Isaacs and Trofimovich (2012), three trained secondary coders recoded the speech of 12 Speakers (41%) who completed both monologic and interactive tasks. The first was an undergraduate TESOL-minor student, who coded Segmental Accuracy, Word Stress Accuracy, and Articulation Rate. The other two were PhD students in applied linguistics. PhD Coder 1 coded Syllable Structure Accuracy, Rhythm, and Intonation. PhD Coder 2 coded Filled & Unfilled Pauses, Pause Appropriateness, Repetitions/Self-Corrections, and Mean Length of Run. Table 6 reports ICCs for all categories except Syllable Structure Accuracy. I removed this category from analysis because of coding concerns raised in discussion with my secondary coder; I discuss these concerns in more detail at the conclusion of Chapter 4. A second measure, Rhythm, was also removed due to its low ICC (.137). This measure involved coding how accurately Speakers reduced vowel sounds in unstressed syllables and function words, a highly subjective judgment. Despite training, review, and discussion, my secondary coder and I were unable to achieve reliable coding, and there was no discernible pattern in our differences in perception. Trofimovich and Isaacs (2012) found a strong association of Rhythm with both accentedness (r = .74) and comprehensibility (r = .76), so this removal is disappointing. Nonetheless, given the low reliability, pursuing Rhythm within the current analysis would not provide much insight.
The remaining categories had ICCs ranging from .528 to .999. While these values are not ideal for certain categories (Intonation, Pause Appropriateness, Repetitions/Self-Corrections), these measures are also highly subjective and potentially problematic to code (as discussed at the conclusion of Chapter 4). Given the high agreement on the majority of measures, I used my initial coding for analyses, though all interpretations of linguistic associations are presented cautiously given the low ICCs for several variables.

Task scoring. As Raters collaboratively assigned a single score per task utterance, there was no need to calculate a measure of reliability.

Analysis parameters. For all analyses, alpha is initially set at .05. For linguistic measures, values have been coded so that all positive correlations equate to an increase in performance, with the exception of Filled Pauses, Unfilled Pauses, and Repetitions/Self-Corrections. Effect sizes follow the guidelines put forth by Plonsky and Oswald (2014)¹⁴ for SLA research, and the sample size for each analysis is stated explicitly.

Table 6
Intraclass correlation coefficients for 11 linguistic measures of speech.
Measure | Overall
Segmental Accuracy | .823
Syllable Structure Accuracy | N/A
Word Stress Accuracy | .806
Rhythm | .137
Intonation | .655
Filled Pauses | .938
Unfilled Pauses | .860
Pause Appropriateness | .528
Repetitions/Self-Corrections | .610
Articulation Rate | .999
Mean Length of Run | .822

14 Plonsky & Oswald (2014) proposed the following guidelines for effect sizes in SLA: weak (r > .25, d > .40), medium (r > .40, d > .70), and strong (r > .60, d > 1.00). Due to the use of nonparametric analytic tools, only r is used in the analyses below.

CHAPTER 3: RESULTS

Below I report my results in three waves. In Wave 1, I discuss research questions 1-3, which target Listeners' perceptions of accentedness and comprehensibility across monologic and interactive speech.
Next, in Wave 2, I consider research question 4, which concerns Speakers' knowledge, training, and awareness of phonological measures of L2 speech. Finally, in Wave 3, I address the relationship between Listeners' perceptions of accentedness and comprehensibility and their strength of association with task performance on the Experiential, Academic, and Interactive tasks.

Wave 1: Monologic and Interactive Speech Performance

Wave 1 of analyses focused on Listeners' perceptions of accentedness and comprehensibility across monologic and interactive tasks. I included only 20 of 29 Speakers in these analyses. I removed five Speakers who did not complete both the monologic and interactive tasks, two who did not interact with a Chinese- or Japanese-speaking partner, and finally one dyad (two Speakers) whose members Listeners reported limited ability to differentiate (61% indicated a lack of confidence). In summary, this wave of analysis draws upon 10 dyads composed of 11 Japanese and 9 Chinese Speakers (9 same-L1, 1 mixed-L1).

Descriptive comparisons. In Table 7, I report mean scores, standard deviations, and 95% confidence intervals for accentedness and comprehensibility across task types. For all tasks, Listeners rated Speakers as easier to understand (comprehensibility) than nativelike (accentedness). Listeners found Experiential speech the easiest to understand and Academic speech the most difficult, while Interactive speech was rated the most nativelike (accentedness) and Academic the least. Figure 4 and Figure 5 present bar graphs depicting accentedness and comprehensibility comparisons both within and between tasks.

Table 7
Speaker performance on monologic + interactive tasks.
Task | Measure | Mean | SD | 95% Confidence Interval
Picture (N = 19) | accentedness | 3.66 | 0.57 | [3.39, 3.94]
 | comprehensibility | 4.64 | 0.88 | [4.22, 5.06]
Experiential (N = 20) | accentedness | 3.83 | 0.56 | [3.60, 4.15]
 | comprehensibility | 5.23 | 0.73 | [4.97, 5.70]
Academic (N = 18) | accentedness | 3.64 | 0.75 | [3.27, 4.01]
 | comprehensibility | 4.62 | 1.26 | [3.99, 5.25]
Interactive (N = 20) | accentedness | 3.84 | 0.74 | [3.54, 4.28]
 | comprehensibility | 5.02 | 0.92 | [4.63, 5.56]

Figure 4. Comparison of accentedness and comprehensibility ratings within tasks.
Figure 5. Comparison of accentedness and comprehensibility ratings across 4 tasks.

Before investigating whether perceptions of accentedness and comprehensibility differed significantly across tasks, I checked for a prompt effect within the Experiential, Academic, and Interactive tasks and then reviewed common assumptions for statistical analyses.

Prompt effect. For Experiential, half the participants completed each prompt (N = 10 each). Given the small sample size for each prompt, I conducted a nonparametric Mann-Whitney U test to determine whether a prompt effect existed.¹⁵ While there was no prompt effect for accentedness (U = 32.500, Z = -1.324, exact p = .190), there was one for comprehensibility (U = 16.500, Z = -2.536, exact p = .009). Listeners perceived Speakers as easier to understand when responding to the Party prompt (M = 5.69, SD = .65) than to the Restaurant prompt (M = 4.87, SD = .57). I calculated an effect size for this difference using the equation $r = Z/\sqrt{N}$, where Z represents the z-score returned by the Mann-Whitney U test and N equals the total number of observations. The resulting effect size of r = .56 indicated a medium-strength effect.

15 [...] non-significant (p > .05) values for accentedness and comprehensibility ratings across tasks, indicating no concerns with homogeneity of variance.

I ran the same analyses for the two Academic prompts (Social Interaction, Cognitive Dissonance; N = 9 each) and found no prompt effect for either accentedness (U = 39.50, Z = -0.88, exact p = .931) or comprehensibility (U = 33.50, Z = -0.62, exact p = .546). For the Interactive task, which featured three prompts (Activities [N = 4], International Friends [N = 10], Travel [N = 6]), I conducted a Kruskal-Wallis test, which revealed no prompt effect for accentedness (χ² = 2.247, p = .325) or comprehensibility (χ² = 5.512, p = .064). As the p-value for comprehensibility approached significance, I carried out three separate Mann-Whitney U tests to confirm there was no prompt effect, with a manually Bonferroni-adjusted alpha of .017 (α = .05/3). Listeners did not perceive comprehensibility differently between the Activities and International Friends prompts (U = 9.000, Z = -1.556, exact p = .142), the Activities and Travel prompts (U = 10.000, Z = -0.432, exact p = .762), or the International Friends and Travel prompts (U = 10.000, Z = -2.171, exact p = .031).

To summarize, I found a prompt effect only for the Experiential task. As the Experiential prompts were drawn directly from official IELTS materials (International English Language Testing System, 2009, 2011), the finding that they elicit different Listener perceptions of comprehensibility is concerning. I revisit this concern in Wave 3 of analyses.

Tests of parametric assumptions. Recognizing the importance of data exploration prior to conducting parametric analyses (Field, 2009), I first explored the assumption of normal distribution for accentedness and comprehensibility per task.¹⁶
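The prompt-effect comparisons above rely on the Mann-Whitney U statistic and the effect size $r = Z/\sqrt{N}$. The sketch below is an illustrative reimplementation, not the software the analyses were run in: it assumes average ranks for ties and uses the large-sample normal approximation with a tie correction, so it returns Z but not the exact p-values reported above. The function names and example ratings are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values 1..n in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for m in range(i, j + 1):
            ranks[order[m]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney(x, y):
    """Return (U, Z, r) for two independent groups via the normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)  # report the smaller U, so Z <= 0

    # Standard deviation of U, corrected for tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    sd_u = math.sqrt(n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1))))

    z = (u - n1 * n2 / 2) / sd_u
    r = z / math.sqrt(n)  # effect size r = Z / sqrt(N), N = total observations
    return u, z, r


if __name__ == "__main__":
    # Hypothetical mean ratings for two prompt groups (not the study's data).
    print(mann_whitney([4.1, 4.6, 5.0, 4.8], [5.5, 5.9, 5.2, 6.1]))
```

Because U is taken as the smaller of its two possible values, Z is never positive here; the effect size is normally reported as its absolute value.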
For Picture (N = 19), both the Kolmogorov-Smirnov test of normality (p = .200 for both accentedness and comprehensibility) and the skewness (accentedness = 0.68; comprehensibility = 0.88) and kurtosis (accentedness = -0.29; comprehensibility = -0.30) ratios¹⁷ indicated normal distributions (Field, 2009). Figure 6 provides histograms and boxplots that similarly indicate normal distribution. For Experiential (N = 20), although visual inspection of accentedness (Figure 7) suggests potential concerns about distribution, both the Kolmogorov-Smirnov test (p = .200 for both accentedness and comprehensibility) and the skewness (accentedness = 1.03; comprehensibility = -0.63) and kurtosis (accentedness = -0.42; comprehensibility = -0.56) ratios indicate normal distributions. Similarly, while visual inspection of the Academic (N = 18) accentedness histogram (Figure 8) suggests a potential concern with distribution, the Kolmogorov-Smirnov test (p = .200 for both accentedness and comprehensibility) and the skewness (accentedness = 0.95; comprehensibility = -0.10) and kurtosis (accentedness = -0.78; comprehensibility = -0.79) ratios do not. Unlike Picture, Experiential, and Academic, Interactive (N = 20) did not demonstrate normal distribution. The Kolmogorov-Smirnov test was significant for both accentedness (p = .024) and comprehensibility (p = .014). While the kurtosis ratios raised no issues (accentedness = 1.38; comprehensibility = 0.70), both skewness ratios (accentedness = 2.53; comprehensibility = 2.41) exceeded the threshold of 1.96 (Field, 2009).

16 As the majority of comparisons to be conducted involve comparing speech ratings of the same Speakers (e.g., paired-samples t-tests), I assumed homogeneity of variance (Field, 2009; Larson-Hall, 2010).
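The skewness and kurtosis ratios compared with 1.96 above are each statistic divided by its standard error. A minimal sketch follows, assuming the sample-adjusted skewness (G1), excess kurtosis (G2), and standard-error formulas used by common statistics packages such as SPSS; the source does not list these formulas, so treat them as an assumption. The function names are hypothetical.

```python
import math


def skew_kurtosis_ratios(data):
    """Return (skewness / SE, kurtosis / SE) using sample-adjusted G1 and G2."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    z = [(x - mean) / sd for x in data]

    # Adjusted skewness (G1) and excess kurtosis (G2); both equal 0 under normality.
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

    # Standard errors of G1 and G2.
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))

    # Each ratio is a z-score: (statistic - expected value of 0) / standard error.
    return skew / se_skew, kurt / se_kurt


def flags_non_normal(ratio, cutoff=1.96):
    """True when a ratio falls below -1.96 or above 1.96."""
    return abs(ratio) > cutoff


if __name__ == "__main__":
    skew_ratio, kurt_ratio = skew_kurtosis_ratios([1, 2, 3, 4, 5])
    print(skew_ratio, kurt_ratio, flags_non_normal(kurt_ratio))
```

For perfectly symmetric data such as 1 to 5, the skewness ratio is 0, and the flat shape gives a negative kurtosis ratio that still sits well inside the ±1.96 band.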
A visual inspection (Figure 9) indicates that, for both accentedness and comprehensibility, Listeners tended to assign lower ratings to Speakers on the 9-point scale. As revealed in the boxplots, Speakers #3 and #36 (both Japanese) and #9 and #13 (both Chinese) were positive outliers (for accentedness, only #3 was an outlier).

17 Skewness and kurtosis ratios are z-scores calculated by dividing skewness and kurtosis values (minus the mean of the distribution [0]) by their standard error. Values below -1.96 or above 1.96 are considered indicators of non-normal distribution (Field, 2009).

Figure 6. Histogram and boxplot depicting distribution of accentedness and comprehensibility ratings for Picture task.
Figure 7. Histogram and boxplot depicting distribution of accentedness and comprehensibility ratings for Experiential task.
Figure 8. Histogram and boxplot depicting distribution of accentedness and comprehensibility ratings for Academic task.
Figure 9. Histogram and boxplot depicting distribution of accentedness and comprehensibility ratings for Interactive task.

The lack of normal distribution within the Interactive task presents issues for statistical analyses. One solution would be to remove the four outliers. However, this would further reduce the already limited sample size of the study (down to 16, even without removing their interactive partners). In addition, outlier removal should be based on the assumption that the observation is not from the population of interest (Field, 2009). As all Speakers were drawn from the same IEP environment, this is clearly not the case. While another option would be to transform the data (Field, 2009), given the already limited sample size I instead chose nonparametric approaches to data analysis.

Nonparametric analysis.
Nonparametric tests can be used when data, such as those presented above, do not adhere to the assumption of normal distribution; they operate instead on rank orders (Field, 2009; Larson-Hall, 2010). In the current analyses, I use Spearman rank-order correlations (in place of Pearson correlations), Mann-Whitney U tests (in place of independent-samples t-tests), Friedman tests (in place of one-way repeated-measures ANOVAs), and Wilcoxon signed-ranks tests (in place of post-hoc paired-samples t-tests).

Accentedness & comprehensibility strength of association. In the first analysis, I examined the strength of association between Listener perceptions of accentedness and comprehensibility across tasks. As shown in Table 8, the correlation between the two global measures is strong (> .60) on every task. This indicates that as accentedness ratings increase (i.e., Listeners deem Speakers more nativelike), so do comprehensibility ratings (i.e., Speakers are easier to understand).

Table 8
Spearman correlation (ρ) results for accentedness and comprehensibility across 4 tasks.
Picture (N = 19) | Experiential (N = 20) | Academic (N = 18) | Interactive (N = 20)
.756 | .789 | .899 | .665

Accentedness & comprehensibility group differences. To check for group differences between Listener perceptions of Chinese and Japanese Speakers' accentedness and comprehensibility, I conducted a series of Mann-Whitney U tests across tasks,¹⁸ with a manually Bonferroni-adjusted alpha of .006 (.05/8). I found no significant differences between group ratings, as shown in Table 9, and therefore conducted subsequent analyses on the entire sample.

Table 9
Mann-Whitney U test results for group differences in accentedness and comprehensibility across 4 tasks.
Statistic | Picture ACC | Picture COM | Experiential ACC | Experiential COM | Academic ACC | Academic COM | Interaction ACC | Interaction COM
U | 38.00 | 29.00 | 33.50 | 28.00 | 36.00 | 34.00 | 28.50 | 35.00
Z | -0.77 | -1.47 | -1.12 | -1.55 | -0.93 | -1.08 | -1.51 | -1.00
Exact p-value | .473 | .157 | .270 | .135 | .384 | .305 | .135 | .343
Note. ACC = accentedness, COM = comprehensibility.

18 [...] non-significant (p > .05) values for accentedness and comprehensibility ratings across tasks, indicating no concerns with homogeneity of variance.

Accentedness & comprehensibility within-task comparisons. As presented earlier in Table 7, Listeners rated Speakers as more comprehensible than nativelike in their speech. I conducted a series of Wilcoxon signed-ranks tests to determine whether this difference was significant, with a manually Bonferroni-adjusted alpha of .013 (α = .05/4). For all four tasks, the difference was significant (p ≤ .001), with strong effect sizes (r > .80). Table 10 reports full details of this analysis.

Table 10
Results of Wilcoxon signed-ranks tests between accentedness and comprehensibility across 4 tasks.
Statistic | Picture | Experiential | Academic | Interaction
N | 19 | 20 | 18 | 20
Mean Difference | 0.98 | 1.40 | 0.98 | 1.18
z-score | -3.70 | -3.93 | -3.46 | -3.92
p-value | < .001 | < .001 | .001 | < .001
r¹ | .85 | .88 | .82 | .88
Notes. N = sample size; ¹ effect size r calculated using $r = Z/\sqrt{N}$.

Accentedness between-task comparisons. I conducted a Friedman test to determine whether accentedness ratings differed significantly across tasks. It is important to note that the Friedman test considered only the 17 Speakers who completed all four tasks (Table 11 reports task means and standard deviations for these 17 Speakers). The test indicated a significant difference between tasks, χ² = 9.055, p = .029. To determine the source(s) of this difference, I performed six post-hoc Wilcoxon signed-ranks tests with a manually Bonferroni-adjusted alpha of .008 (α = .05/6). Table 12 reports the full post-hoc results.

Table 11
Mean (SD) performance on monologic + interactive tasks for Friedman test (N = 17).
Task | Measure | Mean | SD | 95% Confidence Interval
Picture | accentedness | 3.67 | 0.58 | [3.38, 3.97]
 | comprehensibility | 4.70 | 0.91 | [4.23, 5.16]
Experiential | accentedness | 3.83 | 0.53 | [3.55, 4.10]
 | comprehensibility | 5.32 | 0.76 | [4.93, 5.71]
Academic | accentedness | 3.57 | 0.70 | [3.21, 3.93]
 | comprehensibility | 4.53 | 1.24 | [3.89, 5.16]
Interactive | accentedness | 3.83 | 0.68 | [3.48, 4.19]
 | comprehensibility | 5.00 | 0.86 | [4.55, 5.44]

Table 12
Results of Wilcoxon signed-ranks tests comparing accentedness ratings across 4 tasks.
Comparison | Mean Difference | z-score | p-value | r¹
Picture-Experiential | 0.16 | -0.82 | .410 | .20
Picture-Academic | 0.10 | -0.29 | .776 | .07
Picture-Interactive | 0.16 | -0.63 | .528 | .15
Experiential-Academic | 0.26 | -1.83 | .067 | .44
Experiential-Interactive | 0.00 | -0.08 | .940 | .02
Academic-Interactive | 0.26 | -2.39 | .017 | .58
Notes. ¹ Effect size r calculated using $r = Z/\sqrt{N}$.

While no significant differences were found, the comparisons between Academic and Experiential (p = .067) and between Academic and Interactive (p = .017) could be argued to be approaching significance. Both differences showed a medium-strength effect (Academic-Experiential r = .44; Academic-Interactive r = .58), with Listeners rating Speakers as more nativelike on both Experiential and Interactive than on Academic.

Comprehensibility between-task comparisons. I conducted the same analyses for comprehensibility between tasks. Again, the Friedman test indicated a significant difference, χ² = 8.432, p = .038. Post-hoc Wilcoxon signed-ranks tests, again with a corrected alpha of .008, indicated no significant differences, although the comparisons between Experiential and Picture (p = .009), Experiential and Academic (p = .020), and Interactive and Academic (p = .026) all approached significance.
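The Friedman omnibus tests and Wilcoxon post-hoc comparisons above can be sketched as follows. This is an illustrative reimplementation under stated assumptions (average ranks for ties, tie-corrected formulas, the normal approximation for the Wilcoxon Z), not the software used in the study, and it omits p-values. Dividing |Z| by the square root of the number of Speakers reproduces the effect sizes reported in Tables 10, 12, and 13 (e.g., 3.70/√19 ≈ .85; 2.60/√17 ≈ .63). All function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values 1..n in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for m in range(i, j + 1):
            ranks[order[m]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def tie_term(values):
    """Sum of (t^3 - t) over groups of tied values."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sum(t ** 3 - t for t in counts.values())


def friedman_chi2(rows):
    """rows[i][j] = Speaker i's mean rating on task j; tie-corrected chi-square (k - 1 df)."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    ties = 0
    for row in rows:  # rank the tasks within each Speaker
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
        ties += tie_term(row)
    numerator = 12 * sum(s ** 2 for s in rank_sums) - 3 * n ** 2 * k * (k + 1) ** 2
    denominator = n * k * (k + 1) - ties / (k - 1)
    return numerator / denominator


def wilcoxon_z(a, b):
    """Z for the Wilcoxon signed-ranks test on paired scores (zero differences dropped)."""
    diffs = [y - x for x, y in zip(a, b) if y != x]
    n = len(diffs)
    abs_diffs = [abs(d) for d in diffs]
    ranks = average_ranks(abs_diffs)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t = min(w_plus, n * (n + 1) / 2 - w_plus)
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24 - tie_term(abs_diffs) / 48)
    return (t - n * (n + 1) / 4) / sd


def effect_size_r(z, n_speakers):
    """r = |Z| / sqrt(N), with N taken as the number of Speakers."""
    return abs(z) / math.sqrt(n_speakers)


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison alpha, e.g., .05 / 6 for six post-hoc tests."""
    return alpha / comparisons
```

With these helpers, `effect_size_r(-3.70, 19)` rounds to .85 and `bonferroni_alpha(0.05, 6)` gives the .008 threshold used above.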
For the Picture-Experiential comparison, the effect was strong (r = .63), with Listeners rating Experiential speech as easier to understand, while for the Experiential-Academic (r = .57) and Interactive-Academic (r = .54) comparisons the effect was medium. In both of these latter cases, Listeners rated Academic speech as more difficult to understand. In addition, the Picture-Interactive (r = .38) and Experiential-Interactive (r = .35) comparisons reveal a weaker but present effect, with Interactive speech perceived as easier to understand than Picture but more difficult than Experiential. Table 13 reports full results of the six post-hoc tests.

Table 13
Results of Wilcoxon signed-ranks tests comparing comprehensibility ratings across 4 tasks.
Comparison | Mean Difference | z-score | p-value | r¹
Picture-Experiential | 0.62 | -2.60 | .009 | .63
Picture-Academic | 0.17 | -0.36 | .722 | .09
Picture-Interactive | 0.30 | -1.57 | .117 | .38
Experiential-Academic | 0.79 | -2.33 | .020 | .57
Experiential-Interactive | 0.32 | -1.46 | .145 | .35
Academic-Interactive | 0.47 | -2.22 | .026 | .54
Notes. ¹ Effect size r calculated using $r = Z/\sqrt{N}$.

Spearman correlations. In the final analysis, I calculated Spearman correlations between the nine linguistic measures and Listeners' accentedness and comprehensibility ratings.

Accentedness. Table 14 presents Spearman (ρ) coefficients for accentedness; for clarity of reading, Table 15 summarizes the associations by strength. Listener perception of accentedness revealed a different pattern of associations for each task. For Picture, Pause Appropriateness has the strongest association. For Experiential, Segmental Accuracy has the strongest association. Measures of fluency (Articulation Rate, Mean Length of Run), along with Intonation, were most strongly associated with Listener perception of accentedness in Academic speech. Finally, Interactive speech resembled Academic speech in its association with Articulation Rate, which for Interactive was the only strong association across tasks (ρ = .60).
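Spearman's ρ is Pearson's correlation computed on ranks. The sketch below, with hypothetical function names, computes ρ using tie-averaged ranks and labels its magnitude with the Plonsky and Oswald (2014) cut-offs for r given in footnote 14 and the table notes. Because the tables print rounded coefficients, a value sitting exactly on a cut-off (e.g., the .60 for Interactive Articulation Rate) may be labelled differently by this strict comparison than in Tables 15 and 17.

```python
import math


def average_ranks(values):
    """Rank values 1..n in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for m in range(i, j + 1):
            ranks[order[m]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def strength_label(rho):
    """Apply the > .25 / > .40 / > .60 cut-offs to |rho|."""
    size = abs(rho)
    if size > 0.60:
        return "strong"
    if size > 0.40:
        return "medium"
    if size > 0.25:
        return "weak"
    return "below weak"  # hypothetical label; the summary tables leave these cells empty
```

For example, a monotonic but non-linear relation such as `spearman_rho([1, 2, 3, 4], [1, 4, 9, 16])` returns 1.0, since only the rank order matters.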
In addition, each task shows a series of weaker associations with various measures, with Academic showing the most diverse set. Only two tasks (Picture, Experiential) reveal an association with Segmental Accuracy.

Table 14
Spearman (ρ) coefficients between accentedness and 9 linguistic measures of speech.
Measure | Picture (N = 19) | Experiential (N = 20) | Academic (N = 18) | Interactive (N = 20)
Segmental Accuracy | .36 | .42 | .14 | .16
Word Stress Accuracy | -.19 | -.03 | -.06 | -.01
Intonation | .24 | -.04 | .41 | .05
Filled Pauses | -.24 | -.09 | .01 | -.05
Unfilled Pauses | .05 | -.01 | -.14 | -.05
Pause Appropriateness | .45 | .21 | .36 | -.01
Repetitions/Self-Corrections | -.18 | -.27 | -.36 | -.04
Articulation Rate | .25 | -.04 | .48 | .60
Mean Length of Run | .21 | .01 | .40 | .30
Notes. > .60 = Strong, > .40 = Medium, > .25 = Weak.

Comprehensibility. Spearman correlation coefficients are presented in Table 16, and a summary of association strength is presented in Table 17. Unlike accentedness, where the four tasks tended to demonstrate different patterns, comprehensibility shows more alignment across tasks. All four tasks show associations with two measures of fluency (Articulation Rate, Mean Length of Run). However, these associations are strongest for Picture and Academic and weakest for Experiential, with Interactive in between. In addition, the three monologic tasks feature associations with Pause Appropriateness (though of different strength), while both Experiential and Academic show a medium-strength association with Repetitions/Self-Corrections. The three monologic tasks also show some associations with phonological measures, but these are all weak, Interactive shows none, and only Experiential has an association with Segmental Accuracy.

Table 15
Summary of Spearman correlations with accentedness per task type.
Task | Weak (r > .25) | Medium (r > .40) | Strong (r > .60)
Picture | Segmental Accuracy, Articulation Rate | Pause Appropriateness |
Experiential | Repetitions/Self-Corrections | Segmental Accuracy |
Academic | Pause Appropriateness, Repetitions/Self-Corrections | Intonation, Articulation Rate, Mean Length of Run |
Interactive | Mean Length of Run | | Articulation Rate

Table 16
Spearman (ρ) coefficients between comprehensibility and 9 linguistic measures of speech.
Measure | Picture (N = 19) | Experiential (N = 20) | Academic (N = 18) | Interactive (N = 20)
Segmental Accuracy | .12 | .29 | .20 | .10
Word Stress Accuracy | -.25 | -.08 | -.06 | -.24
Intonation | .33 | .07 | .38 | -.01
Filled Pauses | .10 | -.22 | -.14 | -.17
Unfilled Pauses | -.40 | -.01 | -.13 | .06
Pause Appropriateness | .48 | .35 | .62 | -.03
Repetitions/Self-Corrections | -.07 | -.43 | -.42 | .03
Articulation Rate | .67 | .32 | .68 | .48
Mean Length of Run | .62 | .30 | .60 | .46
Notes. > .60 = Strong, > .40 = Medium, > .25 = Weak.

Cluster analysis. To group Speakers according to their accentedness and comprehensibility ratings across tasks, I conducted a hierarchical cluster analysis (HCA). Cluster analysis is a statistical technique that classifies cases into a number of groups, or clusters. Cases within a group are similar with regard to target characteristics but unlike those in the other groups (Everitt, 1980; King, 2015). Through an objective mathematical function, cluster analysis minimizes variance within groups while maximizing variance between them (King, 2015). HCA, one specific technique of cluster analysis, begins with each case as an individual cluster before combining cases into larger and larger clusters based on distance coefficients (Staples & Biber, 2014). Researchers then use several sources, including inspection of the dendrogram and distance coefficients, to determine the optimal number of clusters. As such, HCA involves a degree of researcher subjectivity.

Table 17
Summary of Spearman correlations with comprehensibility per task type.
Task | Weak (r > .25) | Medium (r > .40) | Strong (r > .60)
Picture | Word Stress Accuracy, Intonation | Unfilled Pauses, Pause Appropriateness | Articulation Rate, Mean Length of Run
Experiential | Segmental Accuracy, Pause Appropriateness, Articulation Rate, Mean Length of Run | Repetitions/Self-Corrections |
Academic | Intonation | Repetitions/Self-Corrections | Pause Appropriateness, Articulation Rate, Mean Length of Run
Interactive | | Articulation Rate, Mean Length of Run |

In the current analysis, the 17 Speakers who completed all four tasks served as the cases to be clustered, and their accentedness and comprehensibility ratings across the four tasks served as the clustering variables from which distances were computed.¹⁹ After inspecting both the HCA dendrogram (Figure 10) and the scree plot (Figure 11),²⁰ I decided on a 3-cluster solution.²¹ I report the descriptive information for each cluster in Table 18.

Figure 10. Dendrogram of hierarchical cluster analysis.
Figure 11. Scree plot of hierarchical cluster analysis.

19 [...] Loewen, forthcoming), and should be paired with squared Euclidean distance (Staples & Biber, 2014).
20 A scree plot (or an approximation of one) can be created using an agglomeration schedule. See Staples & Biber (2014) for more details.
21 When visually inspecting the dendrogram, both a 2- and a 3-cluster solution were plausible. Further consideration of the scree plot indicated that the differences in the coefficients begin to flatten out after the third cluster (Staples & Biber, 2014). For this reason, I decided on a 3-cluster solution.

Table 18
Descriptive results for task of 3-cluster solution (mean [SD]).
Cluster | N | Picture ACC | Picture COM | Experiential ACC | Experiential COM | Academic ACC | Academic COM | Interactive ACC | Interactive COM
1 High | 4 | 4.40 (0.43) | 5.93 (0.45) | 4.36 (0.55) | 5.98 (0.61) | 4.58 (0.39) | 6.13 (0.49) | 4.84 (0.64) | 6.13 (1.11)
2 Middle | 7 | 3.55 (0.24) | 4.53 (0.57) | 3.77 (0.47) | 5.27 (0.72) | 3.58 (0.20) | 4.76 (0.44) | 3.66 (0.24) | 4.85 (0.27)
3 Low | 6 | 3.34 (0.57) | 4.07 (0.63) | 3.54 (0.37) | 4.94 (0.69) | 2.88 (0.21) | 3.19 (0.50) | 3.36 (0.23) | 4.42 (0.32)
Notes. ACC = accentedness, COM = comprehensibility.

As a clear continuum exists across tasks for both accentedness and comprehensibility, I have labelled the three clusters High, Middle, and Low, with Speakers in the High cluster the most nativelike and easiest to understand across tasks and those in the Low cluster the least nativelike and most difficult to understand. Following Staples and Biber (2014), I ran a series of one-way ANOVAs to determine whether differences in mean scores between groups were significant. As seen earlier in the dendrogram, the Middle and Low clusters branched from the same origin and could be argued to form a single cluster. I thus began by conducting one-way ANOVAs between the High cluster and the combined Middle and Low clusters, using a manually Bonferroni-adjusted alpha of .006 (.05/8). For all tasks except Experiential, the High cluster was significantly more nativelike and easier to understand (p ≤ .005). Table 19 provides the full results of this first round of one-way ANOVAs. I then ran the same analysis between the Middle and Low clusters (α = .006). Middle differed significantly from Low only on the Academic task (p < .001), on which Middle Speakers were more nativelike and easier to understand. Table 20 provides the full results of these one-way ANOVAs.

Table 19
One-way ANOVAs between High cluster and combined Middle/Low clusters.
Statistic | Picture ACC | Picture COM | Experiential ACC | Experiential COM | Academic ACC | Academic COM | Interactive ACC | Interactive COM
F | 7.93 | 13.31 | 4.07 | 2.80 | 52.20 | 47.81 | 21.66 | 10.91
p | .005 | .001 | .040 | .095 | < .001 | < .001 | < .001 | .001
Notes. ACC = accentedness, COM = comprehensibility.

Table 20
One-way ANOVAs between Middle and Low clusters.
Statistic | Picture ACC | Picture COM | Experiential ACC | Experiential COM | Academic ACC | Academic COM | Interactive ACC | Interactive COM
F | 0.73 | 1.93 | 0.94 | 0.74 | 37.70 | 36.39 | 5.26 | 7.16
p | .410 | .193 | .353 | .407 | < .001 | < .001 | .043 | .022
Notes. ACC = accentedness, COM = comprehensibility.

Recognizing the differences in task performance between clusters, I next investigated the L1 composition of each cluster (Table 21). While the High and Middle clusters had members from both L1 backgrounds, all Low cluster members were Japanese. It is important to note that the TOEFL (from which the Academic task was derived) serves as a key proficiency examination for international students from non-English-speaking backgrounds seeking admission to English-medium universities. As all Chinese Speakers intended to pursue undergraduate study at the university, they were likely more practiced at the task than their Japanese peers, who were exchange students enrolled in the IEP.

Table 21
L1 breakdown of 3-cluster HCA solution.
Cluster | Japanese | Chinese
1 (High) | 1 | 3
2 (Middle) | 2 | 5
3 (Low) | 6 | 0

Wave 2: Participant Patterns

Given [...] the length of the Likert scale used (1-5), my analysis focuses on a descriptive comparison of Japanese (N = 15) and Chinese (N = 14) Speakers' responses. Additionally, although the questionnaire included a category for Speech Rate, I have chosen to remove this variable and focus solely on the four phonological-based measures. I make this distinction in line with the linguistic measure coding used in Isaacs and Trofimovich (2012), where fluency measures were not only treated separately from phonological measures but were also classified more specifically.

Group responses.
As shown in Table 22, Speakers' responses favored Word Stress, which scored highest in all four survey categories (familiarity, instruction, awareness, importance). Of the two segmental categories, Speakers indicated greater familiarity, instruction, awareness, and importance for Vowel over Consonant production. Given the two different L1 backgrounds of the Speakers, I next considered whether differences existed between Japanese (Table 23) and Chinese (Table 24) Speakers.

Table 22
Group pronunciation survey results (N = 29; Mean [SD]).
Category | Familiarity | Instruction | Awareness | Importance
Consonants | 3.02 (1.07) | 2.90 (0.98) | 2.74 (1.26) | 4.14 (0.92)
Vowels | 3.31 (1.22) | 3.05 (1.07) | 3.16 (1.11) | 4.38 (0.90)
Word Stress | 3.53 (0.93) | 3.38 (1.09) | 3.68 (0.97) | 4.48 (0.82)
Intonation | 3.29 (0.95) | 3.07 (1.03) | 3.17 (0.85) | 4.28 (0.53)
Rhythm | 2.69 (0.97) | 2.28 (0.88) | 2.40 (0.86) | 3.55 (0.78)

Japanese responses. In line with the overall group ratings, Japanese Speakers assigned the highest ratings to Word Stress across categories and the lowest to Rhythm. Unlike the overall group, however, Japanese Speakers indicated greater familiarity with Consonants than Vowels; because they also rated Vowels higher for instruction, awareness, and importance, I interpret this finding with caution.

Table 23
Pronunciation survey results, Japanese (N = 15; Mean [SD]).
Category | Familiarity | Instruction | Awareness | Importance
Consonants | 3.07 (1.10) | 2.60 (0.91) | 3.07 (1.22) | 4.53 (0.64)
Vowels | 2.93 (1.22) | 2.67 (1.05) | 3.27 (1.03) | 4.60 (0.51)
Word Stress | 3.67 (1.05) | 3.47 (1.30) | 4.07 (0.83) | 4.67 (0.62)
Intonation | 3.20 (0.94) | 2.80 (1.21) | 3.27 (0.96) | 4.40 (0.51)
Rhythm | 2.47 (0.83) | 2.00 (0.76) | 2.33 (0.72) | 3.67 (0.72)

Chinese responses. Chinese Speakers showed a slightly different pattern than the Japanese. While Word Stress again scored highest for awareness and importance, Vowel production scored highest for familiarity and instruction.
Interestingly, while Chinese Speakers maintained the pattern of assigning the lowest scores for familiarity, instruction, and importance to Rhythm, they indicated the least awareness of Consonant production.

Wave 3: Task Performance

The final set of analyses considered the potential relationship between Listeners' accentedness and comprehensibility ratings and Speakers' performance on the Experiential, Academic, and Interactive tasks. I planned to run regression analyses with task scores as the outcome variables and accentedness and comprehensibility ratings as predictor variables. For all analyses, I included the entire sample that completed each task (Experiential = 29, Academic = 27, Interactive = 20).

Table 24
Pronunciation survey results, Chinese (N = 14; Mean [SD]).
Category | Familiarity | Instruction | Awareness | Importance
Consonants | 2.96 (1.08) | 3.21 (0.97) | 2.39 (1.24) | 3.71 (0.99)
Vowels | 3.71 (1.12) | 3.46 (0.97) | 3.04 (1.22) | 4.14 (1.17)
Word Stress | 3.39 (0.79) | 3.29 (0.85) | 3.29 (0.97) | 4.29 (0.97)
Intonation | 3.39 (0.98) | 3.36 (0.74) | 3.07 (0.73) | 4.14 (0.53)
Rhythm | 2.93 (1.07) | 2.57 (0.94) | 2.46 (1.01) | 3.43 (0.85)

Experiential. I began by considering the strength of association between accentedness, comprehensibility, and Overall, Pronunciation, and Fluency scores. There was a medium association between accentedness and Overall score (ρ = .44) and a strong association between accentedness and Pronunciation score (ρ = .62), but no association with Fluency score (ρ = .06). For comprehensibility, the associations with both Overall (ρ = .62) and Pronunciation (ρ = .66) were strong, and the association with Fluency was weak (ρ = .29).

I next ran a series of hierarchical linear regressions: the first with Speakers' Experiential Overall scores as the outcome variable, the second with their Pronunciation scores, and the last with their Fluency scores. In these regressions, I treated Experiential task score as a continuous variable, as it was calculated by first summing the scores across all categories and then dividing this sum by the number of categories.
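The hierarchical logic described above, with accentedness entered first and comprehensibility second, can be illustrated for the two-predictor case. This sketch assumes ordinary least squares with an intercept and uses the closed-form R² for two predictors computed from the three pairwise Pearson correlations; it returns unadjusted R², ΔR², and the F statistic for the change (without a p-value), whereas the regression tables in this section also report adjusted R². Function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def hierarchical_two_step(y, x1, x2):
    """Step 1: y ~ x1. Step 2: y ~ x1 + x2 (OLS with intercept).

    Returns (R2 step 1, R2 step 2, R2 change, F for the change).
    """
    n = len(y)
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    r2_step1 = r_y1 ** 2
    # Closed-form R^2 for two predictors from the pairwise correlations.
    r2_step2 = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    change = r2_step2 - r2_step1
    # One predictor added (df1 = 1); full model has 2 predictors (df2 = n - 3).
    f_change = (change / 1) / ((1 - r2_step2) / (n - 3))
    return r2_step1, r2_step2, change, f_change


def adjusted_r2(r2, n, predictors):
    """Adjusted R^2 for a model with the given number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - predictors - 1)


if __name__ == "__main__":
    # Toy data: x1 plays the role of accentedness, x2 of comprehensibility.
    print(hierarchical_two_step([2, 1, 4, 6], [1, 2, 3, 4], [1, -1, -1, 1]))
```

Entering comprehensibility second means ΔR² isolates the variance it explains beyond accentedness, which is the quantity the text reports as comprehensibility's additional contribution.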
Before beginning, I investigated the potential prompt effect identified earlier. As discussed, Speakers were rated as easier to understand on the Party prompt than on the Restaurant prompt. To see whether a similar issue existed for task score, I ran a linear regression with Overall score as the outcome variable and Prompt as the predictor variable (reference = Restaurant). Prompt explained only 2% of the variance in Overall score (R² = .023), and thus prompt was not considered further in this analysis.

I next ran a hierarchical linear regression with Overall score as the outcome variable and accentedness and comprehensibility ratings as predictor variables. Although accentedness and comprehensibility correlated highly (ρ = .79), this was still below the .80 threshold for multicollinearity put forth by Field (2009). In line with [...], comprehensibility was entered into the model second, to investigate whether it explained any variance beyond that explained by accentedness. Table 25 provides the results of the regression. Accentedness and comprehensibility accounted for 33% of the total variance in Overall score, with comprehensibility contributing an additional 13% beyond that of accentedness (p = .023).

Table 25
Experiential hierarchical regression results for Overall score.
Model | Predictor | B (SE) | β | Adj. R² | R² Change | p
Model 1 | Constant | 2.33 (1.00) | | | |
 | accentedness | 0.76 (0.27) | .48 | .20 | .20 | .008
Model 2 | Constant | 1.32 (1.01) | | | |
 | accentedness | 0.07 (0.37) | .04 | | |
 | comprehensibility | 0.69 (0.28) | .58 | .33 | .13 | .023

Following Field (2009) and Larson-Hall (2010), I completed my analysis by reviewing the additional assumptions of linear regression. I found no concerns with multicollinearity (VIF = 2.36, Tolerance = .42), the assumption of independent errors (Durbin-Watson = 1.68), or outliers (residual statistics between [...]; Mahalanobis distance < 11).
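The multicollinearity and independence checks named above are simple to compute. In this minimal sketch with hypothetical function names, each of two predictors has tolerance 1 - r² (r being the Pearson correlation between the predictors) and VIF is its reciprocal, so the reported VIF of 2.36 and tolerance of .42 are two views of the same quantity. The Durbin-Watson statistic compares successive residuals; values near 2 suggest independent errors.

```python
def vif_two_predictors(r_12):
    """With exactly two predictors, both share tolerance 1 - r^2 and VIF = 1 / tolerance."""
    tolerance = 1 - r_12 ** 2
    return 1 / tolerance, tolerance


def durbin_watson(residuals):
    """Sum of squared successive residual differences over the residual sum of squares."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


if __name__ == "__main__":
    # Reciprocal check on the reported values.
    print(round(1 / 2.36, 2))  # 0.42
    # Hypothetical residuals, not the study's.
    print(durbin_watson([0.5, -0.2, 0.1, -0.4, 0.3]))
```

Values of the Durbin-Watson statistic well above 2 (as with perfectly alternating residuals) indicate negative autocorrelation, and values well below 2 indicate positive autocorrelation.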
However, an analysis of the P-P and residual plots indicates slight concerns for the distribution of residuals and the homogeneity of variances, respectively. In Figure 12, the P-P plot shows a slight deviation (curvature) from what would be a normal distribution of residuals, and the residual-scatter plot shows a slight restriction of data points towards the lower middle and left side. Neither deviation would appear to be severe, though I interpret these results with slight caution.

Figure 12. P-P and residual-scatter plots for Experiential hierarchical linear regression.

To investigate Experiential performance more closely, I ran a second regression with Pronunciation score as the outcome variable and accentedness and comprehensibility as predictor variables. This analysis should be viewed as exploratory, as scores within a category are arguably categorical: placement is based on meeting the requirements of the descriptors provided. However, due to the limited sample size, conducting a multinomial regression was not possible (see Academic results below for greater detail). The linear regression revealed that accentedness and comprehensibility accounted for 45% of variance in Pronunciation score, with comprehensibility providing an additional 7% beyond that of accentedness. I next ran the same analysis for Fluency scores and found that only comprehensibility was a significant predictor of variance (12%). The full regression results are presented in Table 26 (Pronunciation) and Table 27 (Fluency).

Table 26
Experiential hierarchical regression results for Pronunciation score
                            B (SE)          β      Adj. R²   R² Change   p
Model 1  Constant           -0.43 (1.32)
         Accentedness       1.48 (0.35)     .63    .38       .38         < .001
Model 2  Constant           -1.64 (1.36)
         Accentedness       0.66 (0.50)     .28
         Comprehensibility  0.83 (0.38)     .47    .45       .07         .039

Table 27
Experiential hierarchical regression results for Fluency score
                            B (SE)          β      Adj. R²   R² Change   p
Model 1  Constant           3.70 (1.54)
         Accentedness       0.37 (0.41)     .17    -.01      -.01        .367
Model 2  Constant           2.26 (1.58)
         Accentedness       -0.61 (0.58)    -.29
         Comprehensibility  0.98 (0.44)     .61    .12       .13         .035

As before, I checked for potential violations of assumptions and identified minimal concern (see Figure 13 for P-P and residual-scatter plots).

Figure 13. P-P and residual-scatter plots for Experiential Pronunciation and Fluency hierarchical linear regressions.

Linguistic associations. I finally checked the strength of association between the nine coded linguistic measures and Experiential Overall, Pronunciation, and Fluency scores. Table 28 presents the results. The patterns for Overall and Pronunciation were similar: Segmental Accuracy showed the strongest association for both, with a weaker association for Pause Appropriateness. The only difference was a weak association between Overall score and Articulation Rate. As might be expected for Fluency, there were medium associations with Articulation Rate, Mean Length of Run, and Unfilled Pauses, and a weaker association with Repetitions/Self-Corrections.

Table 28
Correlation coefficients between Experiential Overall, Pronunciation, and Fluency scores and 9 linguistic measures of speech (N = 20)
                                 Overall   Pronunciation   Fluency
Segmental Accuracy               .49       .52             .20
Word Stress Accuracy             -.08      -.06            .03
Intonation                       -.00      .04             -.03
Filled Pauses                    .17       .16             .12
Unfilled Pauses                  -.19      .01             -.47
Pause Appropriateness            .26       .27             .23
Repetitions/Self-Corrections     -.21      -.04            -.35
Articulation Rate                .36       .03             .55
Mean Length of Run               .11       -.14            .49
Notes. > .60 = Strong, > .40 = Medium, > .25 = Weak.

Academic. Unlike Experiential, I treated Academic task scores as categorical, as Raters assigned each Speaker one of five possible scores (0–4).
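Because band scores like these are ordered categories, a rank-based coefficient such as Spearman's ρ is a natural way to summarize associations of the kind reported in Tables 28 and 30. The minimal sketch below assumes average ranks for ties and applies the cut-offs from the table notes; the label "below weak" for values at or under .25 is my own addition, and the example data are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def strength_label(coef):
    """Label |coef| with the cut-offs from the table notes."""
    size = abs(coef)
    if size > 0.60:
        return "strong"
    if size > 0.40:
        return "medium"
    if size > 0.25:
        return "weak"
    return "below weak"


# Hypothetical band scores and articulation rates (syllables per second).
band = [1, 1, 2, 3, 3, 4, 4]
artic_rate = [2.1, 2.6, 2.4, 3.0, 3.3, 3.1, 3.8]
rho = spearman_rho(band, artic_rate)
print(round(rho, 2), strength_label(rho))
```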
I intended to run a multinomial logistic regression with Task score as the outcome variable (Reference = 1) and accentedness and comprehensibility as continuous predictor variables. However, the correlation between accentedness and comprehensibility was strong (ρ = .920), indicating multicollinearity, so it was not possible to enter both as predictor variables. In line with the Intelligibility Principle, I chose comprehensibility as the predictor variable, to investigate whether this measure of understanding would predict task performance. More concerning than the high correlation between accentedness and comprehensibility was that my limited sample size did not allow me to enter a full range of predictors per Band level. As shown in Table 29,[23] not all comprehensibility values appeared per Band level. While descriptively this may be quite informative, for multinomial logistic regression, values of 0 within cells are problematic, often leading to high standard errors (Field, 2009). This was indeed the case, as the standard error of B (odds) for the intercept in my model was quite high (> 5).[24]

Table 29
Crosstabulation of comprehensibility ratings with Academic band scores
Mean comprehensibility   Band 1   Band 2   Band 3   Band 4
2.00–2.99                2        0        0        0
3.00–3.99                3        1        4        0
4.00–4.99                1        2        2        0
5.00–5.99                0        0        3        4
6.00–6.99                0        1        2        1
7.00–7.99                0        0        0        1

Descriptively, the crosstabulation offers some interesting observations. No Speakers with a mean comprehensibility score < 5.00 were placed in the highest Academic band. While 92% of Speakers with a mean comprehensibility score > 5.00 placed within the highest two Academic bands, 60% of Speakers with a mean comprehensibility score < 5.00 placed in the lowest two bands. Similarly, comprehensibility was strongly correlated with Academic Band placement (ρ = .62). Though only exploratory, this does indicate that, to at least some extent, comprehensibility may predict Academic performance.

[23] For ease of reading, I have presented mean comprehensibility scores as being within a range (e.g., 3.00–3.99). Visually, this helps to reduce the number of cells to be considered, while still maintaining the concern of empty cells.
[24] For interest, the model I ran indicated comprehensibility to be a significant predictor of Academic band placement, χ²(1, 3) = 17.52, p = .001, though only for placement into Band 4 (B = 3.67, SE = 1.37, Exp(B) = 39.21, 95% CIs = 2.68, 572.71, p = .007).

Linguistic associations. As with Experiential above, I checked the strength of association between Academic score and the nine linguistic measures, reported in Table 30. While Segmental Accuracy was the only (weak) association with a phonological measure, Academic scores were strongly associated with fluency measures, specifically Articulation Rate and Mean Length of Run.

Table 30
Correlation coefficients between Academic Band score and 9 linguistic measures of speech (N = 27)
                                 Academic
Segmental Accuracy               .34
Word Stress Accuracy             -.09
Intonation                       -.03
Filled Pauses                    -.24
Unfilled Pauses                  -.34
Pause Appropriateness            .55
Repetitions/Self-Corrections     -.38
Articulation Rate                .81
Mean Length of Run               .75
Notes. > .60 = Strong, > .40 = Medium, > .25 = Weak.

Interactive. I first considered the strength of association between both accentedness and comprehensibility and Interactive Overall, Pronunciation, and Fluency scores. No significant associations were found (see Table 31). This was echoed in the hierarchical linear regression for Interactive Overall score. As with Experiential, I treated Overall scores as a continuous variable (summed score divided by total number of categories). Accentedness correlated with comprehensibility at a high but acceptable level (r = .67) and was again entered into the model first.
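The .80 screening rule applied across the three tasks can be made concrete. With only two predictors, the tolerance of each is 1 - r² and its VIF is 1 / (1 - r²), a standard identity; the sketch below applies it to the correlations reported above. Because the VIF reported for the Experiential model (2.36) comes from the fitted regression, a value derived from the rank correlation (about 2.66 for ρ = .79) need not match it exactly.

```python
def tolerance(r):
    """Tolerance for either of two predictors that correlate at r."""
    return 1 - r ** 2


def vif(r):
    """Variance inflation factor for either of two predictors: 1 / tolerance."""
    return 1 / tolerance(r)


def exceeds_cutoff(r, cutoff=0.80):
    """Flag a predictor pair whose correlation exceeds the .80 rule of thumb."""
    return abs(r) > cutoff


# Correlations between accentedness and comprehensibility reported above.
reported = {"Experiential": 0.79, "Academic": 0.920, "Interactive": 0.67}
for task, r in reported.items():
    print(f"{task}: exceeds .80 = {exceeds_cutoff(r)}, VIF = {vif(r):.2f}")
```

Only the Academic pair is flagged, which matches the decision to enter comprehensibility alone for that task.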
As shown in Table 32, the hierarchical linear regression indicated that accentedness and comprehensibility combined explained a minimal 5% of variance in Interactive task scores (p = .892), with higher comprehensibility even appearing to have a negative impact on Overall score. Similar to Experiential, my inspection of assumptions indicated limited concerns, though the P-P plot features slight deviation from linearity and the residual plot clearly shows a restriction of data points to the lower right side (Figure 14). Considering the minimal association strengths, I did not pursue a regression with either Pronunciation or Fluency scores.

Table 31
Correlation coefficients between accentedness, comprehensibility, and Interactive Overall, Pronunciation, and Fluency scores (N = 20)
                     Accentedness   Comprehensibility   Interactive Score   Pronunciation Score   Fluency Score
Accentedness         -              .665                < .01               -.08                  -.03
Comprehensibility    -              -                   -.11                -.06                  -.03

Table 32
Interactive hierarchical regression results (N = 20)
                            B (SE)          β      Adj. R²   R² Change   p
Model 1  Constant           3.35 (0.50)
         Accentedness       -0.14 (0.13)    -.25   .01       .01         .292
Model 2  Constant           3.37 (0.55)
         Accentedness       -0.10 (0.30)    -.18
         Comprehensibility  -0.33 (0.24)    -.07   -.05      -.06        .892

Figure 14. P-P and residual-scatter plots for Interactive hierarchical linear regression.

Linguistic associations. The associations between Interactive task scores and the nine linguistic measures (Table 33) revealed different and somewhat unexpected patterns. For both Overall and Pronunciation, it appears that more Filled Pauses (e.g., er, um, uh) led to lower scores, whereas more Unfilled Pauses led to higher scores. For Overall, there was also a weak association with Articulation Rate, though its negative direction indicates that less production equated to a higher task score. This may reflect an expectation that both Speakers would contribute equally, so those who were more dominant may have scored lower.
For Pronunciation, one weak phonological association involved Intonation, though this association is not at all intuitive (less accurate Intonation led to higher Pronunciation scores). For Fluency, Filled Pauses had the strongest association (fewer Filled Pauses equated to higher Fluency scores), and a weaker association existed with Segmental Accuracy. That Segmental Accuracy was associated with Fluency and not Pronunciation is an unexpected finding.

Table 33
Correlation coefficients between Interactive Overall, Pronunciation, and Fluency scores and 9 linguistic measures of speech (N = 20)
                                 Overall   Pronunciation   Fluency
Segmental Accuracy               .07       -.15            .32
Word Stress Accuracy             -.02      .15             -.01
Intonation                       -.18      -.31            -.14
Filled Pauses                    -.27      -.31            -.46
Unfilled Pauses                  .27       .49             .18
Pause Appropriateness            -.17      .04             .06
Repetitions/Self-Corrections     .08       .16             -.10
Articulation Rate                -.36      -.24            -.08
Mean Length of Run               -.22      -.20            .05
Notes. > .60 = Strong, > .40 = Medium, > .25 = Weak.

CHAPTER 4: DISCUSSION

In this chapter I discuss my findings in relation to the six research questions posed at the outset of my dissertation. I first provide a summary of the results for each question.

Summary of Research Questions and Findings

Task effect. The first wave of analysis addressed three research questions:

1. Does listener perception of L2 accentedness and comprehensibility differ as a function of task (monologic vs. interactive)?
Though minimal, task type did indeed elicit different Listener perceptions of accentedness and comprehensibility. Across tasks, Listeners consistently judged Speakers as easier to understand than they were nativelike in their speech, with a strong effect (r > .80). Between tasks, Interactive speech appeared to differ only from Academic (TOEFL-inspired) speech, in terms of both perceived accentedness and comprehensibility. It should be noted, though, that this difference only approached significance (albeit with a medium-strength effect, r > .54).

2. Do the linguistic measures of L2 speech that influence listener perception of L2 accentedness and comprehensibility differ as a function of task (monologic vs. interactive)?
There was indeed a difference between the linguistic measures that influenced Listener perception across tasks. For accentedness, the Interactive task was the only task to feature a strong association (Articulation Rate, r = .60), and it aligned more closely with Academic than with either Picture or Experiential. For comprehensibility, all tasks demonstrated associations with fluency measures (Articulation Rate, Mean Length of Run), though the strength of association was greatest for Picture and Academic, with Interactive speech falling in between Experiential and Picture/Academic. In general, Interactive speech demonstrated the weakest associations with phonological measures across accentedness and comprehensibility.

3. Do listener perceptions of L2 accentedness and comprehensibility follow any patterns across task (monologic and interactive)?
A hierarchical cluster analysis (HCA) of Speakers' accentedness and comprehensibility ratings across tasks placed Speakers into three groups. Group 1, High, was more nativelike and easier to understand than Groups 2 and 3 on all tasks except Experiential. Group 2, Middle, was more nativelike and easier to understand than Group 3, Low, but only on the Academic task. Considering the L1 Japanese make-up of Group 3, this difference is potentially an effect of task familiarity, as Chinese Speakers are likely more practiced than Japanese Speakers with the Academic task (which was based on the TOEFL iBT integrated speaking task).

Pronunciation awareness. The second wave of analyses considered research question 4.

4. What awareness of phonological measures of L2 speech do learners possess?
As a group, Speakers indicated the greatest familiarity, instruction, awareness, and importance for Word Stress. For all four categories, Rhythm received the lowest score.
In regard to segments, Vowels scored higher across the four categories than did Consonants. However, whereas the Japanese Speakers aligned closely with the overall group perception, the Chinese Speakers did not. Instead, the Chinese Speakers indicated greater familiarity and instruction with Vowel production (though Word Stress was still rated highest for awareness and importance).

Task performance. The final research questions focused on the relationship between listener perception and task performance.

5. Does listener perception of L2 accentedness and comprehensibility predict task performance?
Perceived accentedness and comprehensibility accounted for 33% of variance in Experiential Overall score, 45% in Experiential Pronunciation score, and 12% in Experiential Fluency score. In all cases, comprehensibility explained variance beyond that of accentedness (13% for Overall, 7% for Pronunciation, 13% for Fluency). While no regression analysis for Academic was possible, there was a strong correlation between Academic Band score and comprehensibility (ρ = .62), and descriptive analysis indicated that Speakers with a comprehensibility rating > 5 were placed in Bands 3 or 4 92% of the time, while Speakers with ratings < 5 were placed in Bands 1 or 2 60% of the time. Finally, neither accentedness nor comprehensibility appeared to predict performance on the Interactive task.

6. Do linguistic measures associated with listener perception of L2 accentedness and comprehensibility align with those associated with rater scores on monologic and interactive tasks?
For Experiential, both Listeners and Raters attended to similar measures when assigning comprehensibility and task scores. Though the strength of associations differed, common measures included Segmental Accuracy, Pause Appropriateness, Repetitions/Self-Corrections, Articulation Rate, and Mean Length of Run. The only common measure between accentedness and Overall score was Segmental Accuracy.
For Academic, while Listeners attended to suprasegmental measures and Raters to segmental measures, both appeared to emphasize the range of fluency measures (Pause Appropriateness, Repetitions/Self-Corrections, Articulation Rate, Mean Length of Run). This was similar when comparing both accentedness and comprehensibility associations to those of task score. For Interactive speech, there appeared to be little overlap between the measures Listeners and Raters attended to.

Listener Perception Across Monologic and Interactive Tasks

Exploring task differences. As stated at the outset, one primary goal of this dissertation was to investigate whether Listeners differed in their perception of L2 accentedness and comprehensibility when rating interactive versus monologic speech. In line with the Intelligibility Principle, my emphasis is on comprehensibility, or the perceived ease or difficulty of understanding (Derwing & Munro, 2015). My previous research had indicated that increased monologic task complexity led listeners to attend to different linguistic measures of L2 oral production (Crowther et al., 2015a, 2017). In addition, Experiential speech, as elicited through the IELTS long turn task, was deemed easier to understand than speech elicited through a Picture Narrative or a TOEFL-inspired (i.e., Academic) integrated speaking task (Crowther et al., 2017). The current findings echo this listener ease to some extent, as again Experiential speech was easier for Listeners to understand than that of Picture (r = .63) and Academic (r = .57). This interpretation, though, is based on medium-strength effect sizes rather than p-values (Plonsky, 2015b). Extending beyond the monologic nature of the above studies, my dissertation asked Listeners to rate two Speakers simultaneously as they engaged in an interactive task.
Listener perception of comprehensibility for the Interactive task fell in between the easiest-to-understand Experiential task (r = .35) and the more difficult-to-understand Picture (r = .38) and Academic (r = .54) tasks, though this effect was strongest in the Interactive–Academic comparison. Though slightly more difficult to understand, Interactive-elicited speech appears to align more closely with Experiential-elicited speech than with Academic-elicited speech. Picture-elicited speech, which could be seen as the least […], falls closer to Academic than to either Experiential or Interactive. In regard to comprehensibility, the four tasks fell along the same linguistic-constraint continuum presented in Chapter 2 (Figure 1). The easiest task to understand, Experiential, is the least constraining, while the most difficult task to understand, Academic, is the most constraining. Interactive and Picture similarly take their expected places on the continuum (Interactive closer to Experiential, Picture closer to Academic). It would appear that the greater the linguistic constraints placed on Speakers, the less comprehensible their utterances were perceived to be.

Though I will discuss specific linguistic measures in more detail below, I would like to note that for all tasks, two measures of fluency, Articulation Rate and Mean Length of Run, were associated with Listener perception of comprehensibility. These associations were quite strong (r > .60) for the two more constraining tasks, Picture and Academic. Segalowitz (2010, 2016) describes three senses in which fluency may be understood:
Utterance: the use of measurable temporal features to characterize the fluidity of observable speech.
Cognitive: the cognitive processes responsible for performing a speech act.
Perceived: the inferences listeners make about a speaker's cognitive fluency based on their perception of utterance fluency.

One way to interpret the differences in strength of association between fluency measures and perceived comprehensibility across tasks may be to consider cognitive and utterance fluency in more detail. Segalowitz (2016) lists a number of cognitive processes that help to define cognitive fluency, including […] focusing demands inherent in utterance construction and operations in working memory, among others (p. 82). The greater linguistic constraints placed on Speakers in the Picture and Academic (TOEFL-inspired) tasks, in which they were required to utilize specific stimuli to formulate a response, likely created a greater cognitive load through more complex retrieval processes (see Hilton, 2008, for a discussion of the link between lexical knowledge and spoken L2 fluency). This greater retrieval burden may have negatively impacted utterance fluency (i.e., poorer performance on temporal measures), and in turn created greater difficulty for Listeners in understanding. This echoes the views of NNS listeners in Crowther et al. (2017), who primarily referenced fluency measures when performing comprehensibility ratings during a think-aloud protocol (unfortunately, no such protocol was utilized with NSs). As linguistic constraints were lessened in the Interactive and Experiential tasks, allowing Speakers to rely on their full range of lexical and syntactic knowledge, there was likely less of a cognitive burden, which led to more balance across Speakers in utterance fluency, and subsequently weaker associations between temporal measures and Listener-perceived comprehensibility. However, as no specific measures of cognitive fluency were taken, further research is needed before any concrete claims can be made (see Kahng, 2014, for one example of how such measures may be taken). While my emphasis here is on comprehensibility, it should be noted that Listener perception of accentedness indicated a similar pattern, with Academic the most accented, followed by Picture.
However, the two least linguistically constrained tasks, Experiential and Interactive, showed minimal difference in how accented they were perceived to be (r = .02). Both Experiential (r = .44) and Interactive (r = .58) also showed medium-strength differences when compared to Academic, though differences with Picture were minimal (r < .20). That Picture is more aligned with Experiential and Interactive speech for accentedness, rather than with Academic as it was for comprehensibility, may be indicative of the type of speech being elicited. Of the four tasks, Academic is likely the most academically orientated. In the Academic task, Speakers were required to draw upon multiple language skills (listening, reading, speaking) to formulate a response, as the TOEFL-inspired task was designed to mimic the demands of English-medium academic study (Educational Testing Service, 2017). While no measures of vocabulary usage were taken, Saito et al. (2016) previously highlighted the importance of lexical considerations in listener perception of comprehensibility. Similarly, Crowther et al. (2017) provided initial evidence that as task complexity increases, lexical and grammatical measures begin to affect perceived accentedness. As the Picture, Experiential, and Interactive tasks required more casual language (sharing an experience or opinion, telling a story) than that necessary to complete the Academic task, Speakers may have been perceived as more nativelike in their speech on those tasks. A final note of interest, drawing from the HCA, is the relationship between task complexity and Speakers' perceived accentedness and comprehensibility across tasks. For Picture, Academic, and Interactive, there were significant differences between members of the High cluster and those in the Middle and Low clusters (p < .005). However, no such difference existed for Experiential.
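To make the clustering step referred to above more concrete, the sketch below shows one common form of agglomerative hierarchical clustering. The linkage method and distance metric are not specified in this section, so average linkage with Euclidean distance is an assumption here (Ward's linkage is a frequent alternative), and the Speaker profiles are hypothetical two-dimensional means; a real profile could hold one mean per rating and task.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean Euclidean distance between members of clusters a and b."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i still points at the merged cluster
    return clusters


def order_by_mean(clusters, points):
    """Sort clusters from highest to lowest mean rating (High, Middle, Low)."""
    def cluster_mean(c):
        values = [v for idx in c for v in points[idx]]
        return sum(values) / len(values)
    return sorted(clusters, key=cluster_mean, reverse=True)


# Hypothetical Speaker profiles: (accentedness, comprehensibility) means.
profiles = [(7.0, 7.0), (7.2, 6.9), (4.5, 4.6), (4.8, 4.4), (2.5, 3.0), (2.6, 2.8)]
high, middle, low = order_by_mean(agglomerate(profiles, 3), profiles)
print(sorted(high), sorted(middle), sorted(low))  # [0, 1] [2, 3] [4, 5]
```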
From a pronunciation instruction perspective, this pattern (differences on the more complex tasks but not on Experiential) indicates that more cognitively complex tasks may better serve L2 learners, as nativelike and comprehensible speech on such tasks seems to ensure similar production on less complex tasks (Crowther et al., 2017). Another important consideration is task familiarity. For those not in the High cluster, the only difference came down to speech performance on the Academic task. Those in the Low cluster were all Japanese exchange students who reported a TOEIC rather than a TOEFL score (in contrast to their Chinese peers). As the TOEFL-inspired Academic task was not only the most complicated but likely also the most unfamiliar, Low cluster Speakers may have struggled with the task, which in turn elicited lower Listener perceptions of accentedness and comprehensibility.

Interactive alignment. One concern in the above analysis is that while Listeners provided relatively normal distributions for Picture, Experiential, and Academic ratings, they did not do so for Interactive ratings. Rather, their ratings tended to be positively skewed, grouping Speakers between 3 and 4 for accentedness and between 4 and 5 for comprehensibility. Interestingly, this may be less a comment on the Listeners and more a comment on the interactive processes of the Speakers. In response to a psycholinguistic focus on individual acts of production or comprehension, Garrod and Pickering (2009) highlight how, over the course of an interaction, interlocutors demonstrate interactive alignment at both linguistic and non-linguistic levels, often through emulation. Drawing from Fowler, Brown, Sabadini, and Weihing (2003), Garrod and Pickering describe how phonological and acoustic alignment can be quite rapid. Thus, even within a 60-second excerpt, interlocutors may have aligned their speech, which led to a more constrained distribution of Listener ratings.
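Positive skew of the kind described here can be checked numerically. The minimal sketch below computes the moment-based sample skewness g1 = m3 / m2^1.5 on hypothetical ratings; many statistical packages instead report a bias-adjusted version, G1 = g1 · √(n(n - 1)) / (n - 2), which has the same sign.

```python
def skewness(ratings):
    """Moment-based sample skewness g1 = m3 / m2**1.5.

    Positive values mean a longer right tail: most ratings sit at the low
    end of the range, with a few high ratings trailing off."""
    n = len(ratings)
    mean = sum(ratings) / n
    m2 = sum((x - mean) ** 2 for x in ratings) / n
    m3 = sum((x - mean) ** 3 for x in ratings) / n
    return m3 / m2 ** 1.5


# Hypothetical comprehensibility ratings on the 9-point scale, clustered
# at 4-5 with a few high values (illustrative only, not the study's data).
interactive = [4, 4, 4, 5, 5, 5, 4, 5, 8, 9]
print(round(skewness(interactive), 2))
```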
The positive skew is likely the result of potentially high-performing Speakers who made alignment difficult. In fact, three of the four outliers identified for comprehensibility on the Interactive task were also placed in the High cluster during the HCA, while their partners were placed in the Middle cluster.[25] I will temper this interpretation, as it is not clear whether 60 seconds is indeed enough time for interlocutors to align their linguistic output, although it may serve as a starting point for further investigation.

[25] The fourth outlier was not included in the HCA due to a recording issue with her Picture Narrative, and thus did not have data across the four task types. However, her speaking partner was placed in the Low cluster.

Task complexity. I highlighted earlier how my interactive task was potentially variable in degree of complexity. How Speakers engaged with the task and prompts likely contributed to how complex the task would be. Given the homogeneous population drawn from (IEP 093 & 094, L1 Japanese & L1 Chinese), with dyads that were primarily shared-L1, it is likely that participation features such as one-way flow, few contributions needed, and negotiation not needed were present. That there was such strong alignment in Listener perception of Experiential and Interactive speech may indicate that the Interactive task itself was not that complex. In support of this view, in their post-interaction questionnaires, many Speakers indicated that what made the interactive task easy was the topic itself.[26] In essence, the Interactive task as designed did not require extensive (causal or intentional) reasoning or perspective taking, and required more opinion sharing than negotiating.
Without considering different types of interactive tasks, such as picture difference or consensus tasks (e.g., Loewen & Isbell, 2017), it is not possible to make any overarching declarations on whether listeners perceive monologic speech differently than interactive speech. It may be that while the Interactive task employed here aligned well with the Experiential task, a more complex, academic-based interactive task would align more closely with the Academic task used here.

[26] Further analyses of the post-interaction questionnaires are not included, as Speakers' responses were generally brief and perfunctory.

Variation in linguistic associations. As already highlighted, for comprehensibility all four tasks shared an association with the fluency measures Articulation Rate and Mean Length of Run. This association was strongest for Picture and Academic, which may reflect the greater linguistic constraints of these tasks and their influence on Speakers' cognitive and utterance fluency (Segalowitz, 2010, 2016). One difference between the three monologic tasks and the Interactive task was a monologic association with Pause Appropriateness that was not present for Interactive speech. This might be due to the turn-taking nature of interaction. Pausing can be indicative of several fluency processes, including breakdown, repair, and retrieval speed (Skehan, 2009). While pausing is likely to be more detrimental to listener perception when it occurs mid-clause rather than end-clause (Davies, 2003), misplaced pauses are also likely to be more salient in monologic speech. Within interaction, runs may be shorter and interlocutors may interject in moments where a pause might have occurred (Michel, Kuiken, & Vedder, 2007), and thus inappropriate pauses are less frequent. A pause may also indicate a change in turns, and thus listeners may be less able to attribute a given pause to either interlocutor.
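Because a between-turn pause cannot be attributed to either interlocutor, one simple way to restrict pause coding to within-turn pauses is to keep only silent gaps bounded by the same Speaker on both sides. The sketch below is a hypothetical illustration, not the study's coding procedure: the `Run` structure, the timings, and the 0.25-second minimum pause length are all assumptions, since the pause threshold is not stated in this section.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """A stretch of speech by one Speaker, timed in seconds."""
    speaker: str
    start: float
    end: float


def within_turn_pauses(runs, min_pause=0.25):
    """Return (speaker, pause_start, pause_end) for silent gaps of at least
    min_pause seconds bounded by the same Speaker on both sides.

    Gaps where the Speaker changes are treated as between-turn pauses and
    skipped, because they cannot be attributed to either interlocutor."""
    ordered = sorted(runs, key=lambda run: run.start)
    pauses = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt.start - prev.end
        if gap >= min_pause and prev.speaker == nxt.speaker:
            pauses.append((prev.speaker, prev.end, nxt.start))
    return pauses


# Hypothetical timings for a two-Speaker excerpt.
excerpt = [
    Run("A", 0.0, 1.0), Run("A", 1.5, 2.0),  # within-turn gap, kept
    Run("B", 2.6, 3.0), Run("B", 3.1, 4.0),  # turn change, then a short gap
    Run("A", 4.5, 5.0),                      # turn change, not coded
]
print(within_turn_pauses(excerpt))  # [('A', 1.0, 1.5)]
```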
From a coding perspective, the inability to assign a between-turn pause to a specific Speaker meant that only within-turn pauses were coded. This may help to explain this monologic versus interactive difference.

The final consideration, directly linked to the primary goal of this study, concerns the phonological associations of monologic versus interactive speech in regard to comprehensibility; specifically, the relevance of suprasegmentals to understanding within an interactive encounter. However, the results of the current study are inconclusive. Based on previous findings, it was expected that there would be, at minimum, an association between Segmental Accuracy and comprehensibility across tasks. Instead, Segmental Accuracy was weakly associated with comprehensibility only for the Experiential task (r = .29). While several suprasegmental measures were associated with comprehensibility across the four tasks, these associations were weak (r < .40), and at times unexpected. For example, lower Word Stress Accuracy appears to elicit higher comprehensibility judgments for the Picture task. While this would seem counterintuitive, a pedagogical emphasis on English stress timing is not unanimously promoted (Low, 2015). While SLA-orientated researchers advocate the need for accurate word stress and rhythm to produce understandable speech (e.g., Benrabah, 1997; Saito & Saito, 2016; Trofimovich & Isaacs, 2012), many ELF-orientated researchers do not (e.g., Deterding, 2010; Jenkins, 2000). Considering the relatively weak association found here, and only for the Picture Narrative, I am hesitant to comment in depth on this debate. Returning to the general lack of associations with segmental and suprasegmental measures, I propose two potential explanations.

Proficiency consideration. Speakers were IEP students, a designation that entails they are not yet proficient enough in English to pursue full-time undergraduate study.
This differs greatly from the communities that made up the samples in Isaacs and Trofimovich (2012) and Crowther et al. (2015a, 2015b, 2017), both of which advocated for a combined emphasis on segmental and suprasegmental measures. In Isaacs and Trofimovich, speakers represented a combined range of proficiencies from beginner to advanced. In Crowther et al., speakers were either undergraduate or graduate students. The findings of Saito et al. (2016) may help to explain the differences between my data and those of these similar studies. Saito et al. proposed that optimal rate of speech and adequate and varied prosody were relevant to beginner-to-intermediate learners, whereas segmental accuracy and good prosody became relevant only at the advanced stage. Although the authors determined proficiency based on comprehensibility scores rather than established measures of L2 proficiency, a comparison can still be drawn. The range of comprehensibility ratings in the current study (mean scores across tasks = 4.62–5.23) falls within the range of the beginner-to-intermediate profiles of Saito et al. (mean scores ≈ 4–6).[27] Additional evidence from Derwing, Rossiter, Munro, and Thomson (2004), who worked with low-proficiency L1 Mandarin/L2 English learners, also indicated a strong relationship between fluency and comprehensibility. My Speakers (IEP) were likely less proficient than those of Crowther et al. (undergraduate/graduate), which may explain differences in perception between the two sets of listeners, specifically the attention to fluency over phonology in my study.

[27] Saito et al. (2016) employed end points opposite to those utilized in the current study. The estimate provided is an approximate conversion. Following Saito et al., beginner-to-intermediate mean scores ranged from 6.03 to 4.06.

Listener consideration. Greater attention to fluency measures may also be a result of the Listeners employed.
Listeners formed a relatively homogeneous group, aged 18-25. All were born, raised, and educated in the American Midwest, and all indicated minimal exposure to non-native English speech. This limited familiarity with L2 speech differs greatly from that of the listeners in previous studies by Isaacs and Trofimovich (2012) and Crowther et al. (2015a, 2015b, 2017). In Isaacs and Trofimovich, listeners were also undergraduate students, but they lived in the bilingual French/English city of Montréal, Canada. Even without any formal linguistics training, they would have been exposed to non-native speech daily, both English-accented French and French-accented English. The speech they rated was also French-accented English. In Crowther et al., listeners were MA students in an applied linguistics program and experienced L2 English instructors, also living in Montréal, Quebec. Both SLA (Gass & Varonis, 1982) and L2 assessment (Winke & Gass, 2013; Winke, Gass, & Myford, 2013) scholarship indicate that familiarity with non-native speech can inform or bias listener perception, often in a positive direction (Saito et al., 2017). Such an effect may be present in my data. With limited exposure to non-native English speech, Listeners may have lacked the skills needed to accommodate receptively to unfamiliar patterns of fluency (Gallois, Ogay, & Giles, 2005). This may in turn have drawn their attention away from phonological considerations. With increased familiarity, a pattern of associations more closely aligned with those in the studies highlighted above may emerge. Derwing and Munro (2014) offer suggestions for NS listener training that would aid listeners' comprehension of L2 speech, including accent perception training, background linguistic information about particular L1s, and communication strategies.

Limitations of the current analyses.
The above discussion has already highlighted the potential effect of speaker and listener variables on how monologic speech is perceived compared to interactive speech. The impact of such variables clearly needs further investigation. Here I highlight three additional limitations of the current study with regard to how I treated interactive speech.

Monologic bias. As stated at the outset, the current study applied what had been a monologic-orientated methodology to interactive speech. I did this knowing that I would gain insight only into how an outsider (Listener) perceived an interaction, with limited input from those actually involved (Speakers). One could argue that this outsider perspective resembles that of interactive studies based on discourse analysis (e.g., Jenkins, 2000) or LREs (Loewen & Isbell, 2017). Even so, it loses the insight that approaches such as stimulated recall may provide (e.g., Kennedy et al., 2015). Clearly, the data I present are only one half of a complicated story, and further research is needed to more closely bridge the monologic/interactive methodological divide.

Interactive task complexity. I approached this project with the mindset that interaction was the next step in task complexity, beyond the monologic tasks employed in my previous work (Crowther et al., 2015a, 2015b, 2017). However, given the strong alignment in Listener perception of the Experiential and Interactive tasks used in the current study, it may be better to view Interactive speech as existing on its own continuum of complexity, which may or may not align with that of monologic speech.
In other words, interactive speech may not simply be the next step in task complexity. Monologic and interactive speech may each lie along their own continuum, and more complex interactive tasks (e.g., picture difference, consensus tasks) may lead to different listener perceptions, as appears to be the case for monologic tasks.

Interlocutor variables. The dyads formed in this study were relatively homogeneous, which allowed me to control for L2 proficiency and linguacultural differences. However, in our globalized world, contact with a range of non-native speaking partners is likely a daily occurrence (Appiah, 2006; MacKenzie, 2011). It therefore becomes necessary to consider how different paired/group dynamics may change the architecture of an interaction. These changes may in turn affect perception, both globally (accentedness, comprehensibility) and linguistically (phonology, fluency). Future work should examine how paired/group dynamics are formed and the role of proficiency and culture in this formation. Storch (2002; SLA perspective) and Galaczi (2008; assessment perspective) have both discussed how different group dynamics affect the way interlocutors co-construct meaning. Storch describes four distinct dynamics that may form within an interactive event: collaborative, expert/novice, dominant/dominant, and dominant/passive (Galaczi refers to them as collaborative, parallel, asymmetric, and blended, respectively). The amount of language produced by each interlocutor varies across these pairings, and collaborative and expert/novice pairings are the most conducive to language development.

Proficiency.
How L2 learners position themselves, or are positioned, within an interaction based on their L2 proficiency plays an important role in how their interactive ability manifests. Much of this evidence comes from performance when learners are paired with same- and different-proficiency interlocutors. Lazaraton and Davis (2008), using collaborative decision tasks, found that similar-proficiency speakers tended to collaborate more than pairs in which one speaker was of higher proficiency than the other. In the latter pairing, whether intentionally or not, the higher proficiency speaker often reinforced a less proficient identity in their speaking partner, which affected both performance and assessment. Yule and Macdonald (1990) demonstrated a similar effect: in an information exchange map task, more proficient learners acting as sender tended to limit the contribution of their less proficient partner. Yet similar tasks using such mismatched pairings have also been shown to lead to greater production (in terms of word count) from lower proficiency learners (Davis, 2009; Long & Porter, 1985). While the effects of asymmetric pairings have been raised as a concern for construct validity and fairness within L2 assessment (e.g., May, 2009), placing lower proficiency [...] proficiency (e.g., Csepes, 2009; Nakatsuhara, 2004; Norton, 2005).

A second key consideration, often overlooked in favor of more linguistic considerations (Scollon et al., 2012), is the role of cultural differences in mis- and non-understanding during interaction between speakers of different linguistic and cultural backgrounds. It has been argued [...]. Scholars call for a greater focus on intercultural awareness as a pedagogical target (Baker, 2015; Byram, 1997; Kumaravadivelu, 2008). Such a focus allows learners to recognize how a dialogue between language and culture shapes the interactions they engage in (Leung, 2005).
Convivial relations across interactions are an ideal goal (Crowther & De Costa, 2017), but they are far from practical given the numerous power differentials that may exist (Norton, 2013). These differentials relate primarily to the possession of material and symbolic resources. Those with greater resources may wield greater power within a given interaction, allowing them to shape how that interaction is constructed.

My motivation for the L2 pronunciation survey was to determine whether Speakers possessed the metalinguistic awareness to refer to suprasegmental measures such as word stress, intonation, and rhythm. I hypothesized that a lack of metalinguistic knowledge might help explain the segmental emphasis during LREs and stimulated recall (e.g., Kennedy et al., 2015; Loewen & Isbell, 2017). Interestingly, Speakers actually indicated greater familiarity with and awareness of word stress than of segmental production. In addition, they rated intonation almost identically to vowel production, which in turn they rated higher across categories than consonant production. This focus on word stress and intonation works against my hypothesis that a lack of metalinguistic awareness led to a greater segmental focus during LREs and stimulated recall. It may instead support the view that segmental errors matter most for mutual intelligibility in interactive contexts (e.g., Jenkins, 2000). The reported emphasis on word stress and intonation instruction also contrasts with what seems characteristic of classroom pronunciation practices, where teachers tend to favor a segmental focus. One consideration is that previous surveys of classroom pronunciation practices (e.g., Breitkreutz et al., 2001; Foote et al., 2012, 2013) were conducted in second language contexts. The vast majority of Speakers (N = 24) reported no study abroad prior to arrival, so their primary English instruction took place in a foreign language context.
Teacher cognition studies in foreign language contexts are less common (Baker & Murphy, 2011), and much of this research comes from EIL/ELF scholars (e.g., Jenkins, 2000; Sifakis & Sougari, 2005). Unfortunately, such research [...] NS versus NNS models of English and is less focused on teachers' [...] and how this manifests in the classroom. Why my Japanese and Chinese Speakers indicated greater familiarity with, and instruction in, word stress and intonation remains an open question in need of further investigation (e.g., speaker and teacher interviews, classroom observation).

Accentedness and Comprehensibility Effects on Task Rating

SLA pronunciation scholars have advocated a pedagogical emphasis on attaining understandable before nativelike speech (e.g., Derwing & Munro, 2015; Levis, 2005). However, in standardized assessment, the constructs of comprehensibility and accentedness have often been conflated in pronunciation assessment scales (e.g., Harding, 2018; Isaacs et al., 2015), [...] high stakes assessment. The current study attempted to bridge this gap. For Experiential, Listener perception of accentedness and comprehensibility significantly predicted Pronunciation (Adj. R² = .45) and Fluency scores (Adj. R² = .12), and also Overall score (Adj. R² = .33). In addition, comprehensibility accounted for significantly more variance in Overall (p = .023), Pronunciation (p = .039), and Fluency (p = .035) scores than did accentedness. For all three, higher comprehensibility predicted higher task performance. Potentially concerning is the limited amount of variance accounted for in both Pronunciation and Fluency scores. Overall score takes into consideration not just Pronunciation and Fluency but also Lexical Resource and Grammatical Range and Accuracy. It is therefore not surprising that comprehensibility explained only 33% of variance in Overall score.
However, it is not clear what accounts for the remaining 55% of variance in Pronunciation and 88% of variance in Fluency scores. Because I used the publicly available IELTS speaking rubric, one consideration may be that the Raters were not officially trained (see below). However, Isaacs et al. (2015) indicated that even trained IELTS raters did not always agree on what they attend to when assigning learners a Pronunciation band score. Another consideration, as referenced in Chapter 3, is that linear regression may not have been the appropriate method of analysis, as Band score may be better treated as a categorical variable than an interval one. Investigating this potential source of concern, however, would require a larger sample.

As with Experiential, listener perception of comprehensibility appears to relate to Academic speaking score. Two pieces of evidence support this: a strong correlation between the two (.62) and a descriptive comparison of band placement. Speakers with comprehensibility scores > 5 were placed into Bands 3 or 4 92% of the time. Those with scores < 5 were placed in Bands 1 or 2 60% of the time. A comprehensibility rating of 5 (the midpoint of the comprehensibility scale) may therefore serve as a cut point for assignment into the two higher or two lower Academic speaking bands (no Speaker was assigned a 0 in this study). Given the small sample (N = 18) and the lack of inferential analysis, more investigation is needed before drawing any concrete conclusions.

Unlike Experiential and Academic, there was no association between accentedness/comprehensibility and Interactive Overall, Pronunciation, and Fluency scores. Simply put, when scoring Interactive performance, Raters were likely more attentive to measures of interactive competence (e.g., turn-taking, topic initiation, discourse extension; May, 2011) than to the actual speech produced.
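The descriptive cut-point comparison for the Academic task can be made concrete with a short sketch. The Python function below is a hypothetical illustration, not the procedure actually used in this study. It makes four assumptions:

- each Speaker has one mean comprehensibility rating and one Academic band score on a 0-4 scale, as implied by the text;
- Speakers are split at a rating of 5;
- the function reports the share above the cut placed in Bands 3-4 and the share below the cut placed in Bands 1-2;
- Speakers rated exactly 5 are excluded, mirroring the "> 5" and "< 5" groupings above.

The example ratings and bands are invented for demonstration.

```python
from typing import Optional, Sequence, Tuple


def cut_point_agreement(
    comprehensibility: Sequence[float],
    bands: Sequence[int],
    cut: float = 5.0,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (share above `cut` placed in Bands 3-4, share below `cut` placed in Bands 1-2).

    Speakers rated exactly at `cut` are excluded from both groups;
    a share is None when its group is empty.
    """
    if len(comprehensibility) != len(bands):
        raise ValueError("need one band score per comprehensibility rating")
    above = [band for rating, band in zip(comprehensibility, bands) if rating > cut]
    below = [band for rating, band in zip(comprehensibility, bands) if rating < cut]
    high_share = sum(band >= 3 for band in above) / len(above) if above else None
    low_share = sum(1 <= band <= 2 for band in below) / len(below) if below else None
    return high_share, low_share


# Invented values for illustration only (not the study's data).
ratings = [6.1, 5.4, 4.2, 3.8, 5.9, 4.7]
academic_bands = [4, 3, 2, 3, 3, 1]
print(cut_point_agreement(ratings, academic_bands))  # (1.0, 0.6666666666666666)
```

With the study's actual ratings, the two shares would correspond to the 92% and 60% figures reported above. Because the comparison is purely descriptive, it cannot show whether the same cut point would hold in a larger sample.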
Another consideration for the Interactive task is that its scoring included a video of the interactive event. This would have allowed Raters to attend to physical cues, such as body language (Ducasse & Brown, 2009), when assigning task scores. In summary, perceived comprehensibility has potential as a predictor of monologic task performance but appears limited as a predictor of interactive performance.

Alignment of linguistic associations. Just as overall comprehensibility aligned with task score for both Experiential and Academic, so did the linguistic measures that Listeners and Raters attended to. By contrast, for Interactive speech I found no consistent pattern in which linguistic measures influenced Listeners versus Raters. The less complex Experiential task featured associations with both phonological (Segmental Accuracy) and fluency (Pause Appropriateness, Articulation Rate) measures. The more complex Academic task, by contrast, emphasized associations with fluency measures (Pause Appropriateness, Repetitions/Self-Corrections, Articulation Rate, Mean Length of Run). Interestingly, the phonological associations found for Listeners on Academic were not found for Raters. This may indicate that for Raters, who focus more on general proficiency than on speech perception (Yan & Ginther, 2018), fluency issues are more salient than phonological concerns. As discussed previously, the Experiential task was less complex than the Academic task, which may have enabled Speakers to produce more fluent speech. In turn, this may have allowed Raters to place greater emphasis on Segmental Accuracy.

Limitations of the task analyses. I make the above interpretations with caution for two important reasons.
First, despite using publicly available IELTS and TOEFL rubrics, the Raters did not receive official IELTS or TOEFL training. They therefore do not represent how official raters might have assessed Speaker performance on either the Experiential or Academic task. Second, in official assessment contexts such as IELTS and TOEFL, an entire speaking battery is used to assess speaking ability holistically. Here, I used only a pair of tasks, each derived from a different standardized assessment (IELTS or TOEFL). The current study should therefore be seen as exploratory. The findings above suggest that perceived comprehensibility may indeed predict task performance, which supports the need for more controlled research comparing listener and rater perception. An additional concern, prevalent in much SLA research (Plonsky, 2013), is the study's limited sample size, which clearly constrained the statistical approaches used throughout. In addition, from an assessment perspective, I would ideally like to generalize across a much wider range of L2 learners. My Speakers represented only two L1s, Japanese and Chinese, and a limited proficiency range (IEP 093 and 094), which clearly limits my ability to generalize beyond this population.

Causes for Concern: 11 Linguistic Measures of Speech

For this study, I drew upon 11 phonological and fluency measures used previously in Isaacs and Trofimovich (2012). This was done to allow for comparability. These authors noted that they surprisingly obtained ICCs > .90 despite the subjective nature of many of the categories (e.g., Segmental Accuracy, Word Stress Accuracy, Rhythm). Application in my study was less straightforward and potentially problematic. For example, Word Stress Accuracy received an ICC of .80, which would fall within the acceptable guidelines put forth by Larson-Hall (2010).
Yet, when reviewing coding with my secondary coder, it became clear that even when we identified the same number of errors within an utterance, the specific errors did not always match. We agreed on the total number of errors, but not on the actual errors. Similar concerns exist for Segmental Accuracy, Syllable Structure Accuracy, Rhythm, and Pause Appropriateness, all of which are highly subjective. For example, Syllable Structure Accuracy was defined as any additional or deleted sound (Isaacs & Trofimovich, 2012, [...]) [...] the following sentence: "a man with a a green suitcase and a woman uh with a green suitcase too" [...] pronounced. This was a point of disagreement between my secondary coder and me, as she indicated that such deletion might be characteristic of native English speech (see Celce-Murcia et al., 2010, for further discussion of this topic). To account for such deletions, I chose to consider only errors that altered the syllable count of a word (e.g., [...]). This substantially lowered the number of syllable structure errors relative to my initial coding, but it also created a category no longer directly comparable with previous studies. I therefore removed Syllable Structure Accuracy from analyses. Rhythm was also removed, in this case because of an extremely low level of agreement between coders (ICC = .137). These instances are, of course, concerning. They raise questions about how well the phonological and fluency coding actually reflects how Listeners perceived the L2 speech. For the current study, I have kept the coding aligned with Isaacs and Trofimovich (2012), aside from Syllable Structure Accuracy and Rhythm. Moving forward, however, a reliability analysis is needed that provides a more conservative estimate of reliability (Plonsky & Derrick, 2016) and considers agreement on each individually coded item.
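The gap between agreeing on error totals and agreeing on the errors themselves can be illustrated with a small sketch. The Python below is hypothetical. It assumes each coder marks every word of an utterance as 1 (error) or 0 (no error), and it uses Cohen's kappa as one example of an item-level agreement index. It is not the reliability procedure used in this study, nor necessarily the index Plonsky and Derrick (2016) would recommend. The two invented codings flag the same number of word stress errors but mostly different words.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of items on which the two coders assigned the same code."""
    if len(a) != len(b) or not a:
        raise ValueError("both coders must code the same, non-empty set of items")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: observed item-level agreement corrected for chance."""
    observed = percent_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of the coders' marginal proportions, summed over categories.
    expected = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / n**2
    if expected == 1.0:
        return 1.0  # convention: both coders used one identical category throughout
    return (observed - expected) / (1 - expected)


# Invented codings of a 10-word utterance (1 = word stress error), not study data.
coder_1 = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
coder_2 = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

print(sum(coder_1), sum(coder_2))                # 3 3 -> totals agree perfectly
print(percent_agreement(coder_1, coder_2))       # 0.6
print(round(cohens_kappa(coder_1, coder_2), 3))  # 0.048
```

Any agreement statistic computed on per-utterance totals would treat these two codings as identical, since each coder flagged three errors. The item-level comparison shows that the coders agreed on only one of the five words flagged by either of them, and kappa falls close to chance.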
Such an analysis would allow for a better understanding of coder perception, specifically any patterns of disagreement that may exist.

CHAPTER 5: CONCLUSION

In this final chapter, I reflect upon the potential theoretical, pedagogical, and assessment implications of the above findings, before providing suggestions for future research.

Implications

In Chapter 2, I proposed potential theoretical, pedagogical, and assessment implications, which I revisit here.

Pedagogical. At the outset, I proposed a link between comprehensibility and the Interaction Approach (Long, 1996). L2 learners attend primarily to lexical and grammatical features during communicative breakdowns, and to segmental issues to a lesser extent. I therefore proposed that a classroom focus on suprasegmental measures would provide the attention to such measures that previous research has called for (e.g., Derwing & Munro, 2015; Isaacs & Trofimovich, 2012). This proposal assumed that Speakers did not reference such measures during LREs and stimulated recall because they lacked metalinguistic awareness. However, my results indicate a greater emphasis on fluency than on segmental or suprasegmental production. Such an emphasis for beginner-to-intermediate L2 speakers is not unfounded (Derwing et al., 2004; Saito et al., 2016), but it is pedagogically troublesome (Thomson, 2018). It may be possible to raise L2 learners' awareness of appropriate pause placement and of how to use filled and unfilled pauses effectively. Fluency concerns related to lexical, syntactic, and semantic retrieval (Segalowitz, 2010, 2016), however, are not as easily targeted. Lee et al. (2014), in a meta-analysis of 86 pronunciation instruction studies, found stronger effects of explicit pronunciation instruction for beginning and advanced students than for intermediate students.
This may indicate that once learners reach a minimum pronunciation threshold, greater emphasis could be placed on developing lexical and grammatical knowledge, which would in turn benefit fluency. The findings of Nagle (2018) may provide initial support for this hypothesis. Focusing on L2 Spanish, Nagle measured the growth of perceived accentedness and comprehensibility over a yearlong period. The study was set in a communicative, university-level classroom in the US, where instructors reported limited attention to pronunciation during their lessons. Despite no explicit focus on pronunciation, learners were still perceived as generally more comprehensible over time. The Interaction Approach theorizes that L2 development occurs through exposure to input, production of output, and engagement in negotiation of meaning (Gass & Mackey, 2015). Learners appear to attend primarily to lexical and grammatical features during communicative breakdowns (e.g., Kennedy et al., 2015; Loewen & Isbell, 2017). The increases in comprehensibility observed by Nagle may therefore directly reflect increases in lexical and grammatical awareness/knowledge, which would in turn benefit learners' retrieval processes. Once higher fluency is achieved, targeted pronunciation instruction may again be necessary (e.g., promoting segmental accuracy; Saito et al., 2016). However, further research across proficiency levels is needed before drawing any concrete conclusions about this claim. Extending the previous discussion of a communicative classroom, the question becomes which types of tasks are needed to further develop the linguistic measures associated with producing understandable speech. Crowther et al. (2017) proposed that complex tasks would enable learners to practice a wider range of linguistic dimensions (phonology, fluency, lexicon, grammar). The results of my hierarchical cluster analysis (HCA) point to the same conclusion.
For the least complex and least linguistically constrained task (Experiential), no differences existed between Speaker clusters for perceived accentedness and comprehensibility. However, as task complexity and linguistic constraints increased, the High cluster began to outperform Mid and Low (in fact, High performed significantly better on Interactive, Picture, and Academic). In addition, the Mid and Low clusters differed only on the most complex, most constrained task (Academic). For perceived accentedness and comprehensibility, then, Speakers who performed well on the more complex, more constrained tasks also appear to have performed well on the less complex, less constrained tasks. Essentially, task complexity served to differentiate between Speakers in this study. Pedagogically, as proposed in Crowther et al. (2017) and following the findings of Nagle (2018), a communicative classroom that emphasizes more complex, more linguistically constrained tasks may help L2 learners produce more understandable L2 speech. Clearly, more investigation is needed to gauge the potential of such a pedagogical approach.

Assessment. [...] scholars with regard to a holistic pronunciation target (e.g., Derwing & Munro, 2015; Isaacs & Trofimovich, 2012). An emphasis in the L2 classroom should be placed on the linguistic measures relevant to producing understandable rather than nativelike speech. From an assessment perspective, comprehensibility is often conflated with accentedness in rubric descriptors (Harding, 2017; Isaacs et al., 2015). This is concerning, as it may muddy decisions about which linguistic measures receive focus in the L2 classroom. L2 speakers can be highly comprehensible even with a heavy accent (Derwing & Munro, 2015), but would such an accent still lower their speaking score?
The exploratory findings of the current study may indicate that this is not the case. For both the IELTS- and TOEFL-inspired tasks (i.e., Experiential and Academic, respectively), Listener perception of comprehensibility seems to predict task performance. If so, then the pedagogical emphasis on the Intelligibility Principle is well founded with regard to both L2 pronunciation development and speaking assessment. One limitation of the above proposal is that neither perceived accentedness nor comprehensibility predicted performance on the Interactive task. This raises questions about the ecological validity of comprehensibility as a predictor of interactive success. Much monologic-based research on perceived comprehensibility has relied on audio-recorded utterances with no visual representation of the speaker (though see Rubin, 1992, and Kang & Rubin, 2009, for exceptions). Such an approach ignores the multimodal nature of communication, in which listeners rely not only on linguistic cues but also on visual cues when determining meaning (Jewitt, 2014). In interaction, such visual cues are frequently available. Nonverbal communication may include gesture, posture, facial expressions, and eye behavior (Hardison, in press; Knapp & Hall, 1992). The effects of nonverbal cues have been investigated with respect to L2 listening comprehension (e.g., Sueyoshi & Hardison, 2005; Suvorov, 2011, 2015; Wagner, 2007, 2008). Their importance has not, however, been considered with respect to listener evaluation of L2 monologic speech. I approached a potential monologic-interactive divide by applying a monologic methodology to interactive speech. However, because it does not consider the availability of visual cues, this monologic methodology may itself be limited.
Directions for Future Research

Several potential avenues for future research have been referenced previously. Here I highlight three that I consider necessary to continue the line of inquiry presented.

Interlocutor perception. Manipulating several variables from the current study would be of interest, including speaker proficiency, listener familiarity, and interactive task complexity. The potential impact of such variables was discussed in Chapter 4. Here I stress the need to foreground the perception of the interlocutor and to examine whether their within-task perception aligns with that of the outside listener. In the current study, I employed a methodology characteristic of monologic speech research. The next step would be to compare these outside-derived, perception-based judgments with those that actual participants may express through stimulated recall (Gass & Mackey, 2017). Of particular interest would be how these within-task perceptions change across interlocutors, and whether outside listeners' judgments capture these changes.

Task assessment. For IELTS and TOEFL, there is initial evidence that listener perception of comprehensibility may help predict task performance. However, this evidence draws upon Raters without formal IELTS or TOEFL training, so it can only be viewed as exploratory. In addition, only a single IELTS-inspired task and a single TOEFL-inspired task were considered, rather than the entire battery of tasks used to assess speaking. The next logical step, beyond increasing sample size, would be to compare listener perception of L2 comprehensibility to actual IELTS and TOEFL ratings across a range of speaking tasks. As referenced previously, an association between listener perception and rater scoring would add credence to a pedagogical focus on understandable speech (Derwing & Munro, 2015; Levis, 2005).

Linguistic coding.
[...] led to a greater emphasis on the methodological rigor employed in conducting empirical research (e.g., Norris & Ortega, 2000; Plonsky & Gass, 2011). Much of this emphasis has been placed on the proper application of statistical procedures (e.g., Cunnings, 2012; Plonsky, 2015a; Plonsky & Gonulal, 2015; Winke, 2014). Such methodological review must clearly also encompass the initial coding scheme (see Plonsky, Marsden, Crowther, Gass, & Spinner, forthcoming, for an example focused on SLA judgment task design and usage). Linguistic coding, whether targeting phonological, fluency, grammatical, lexical, or discourse domains, has varied greatly across studies. For example, Isaacs and Trofimovich (2012) featured four measures of grammar and lexicon, whereas Saito et al. (2016) used six measures for lexicon alone. Comparing Kang (2010) to Kahng (2014) highlights the different ways in which L2 fluency can be, and has been, measured. Segmental production has been measured both perceptually (e.g., Isaacs & Trofimovich, 2012) and acoustically (e.g., Solon, Long, & Gurzynski-Weiss, 2017). Pronunciation scholars have claimed that specific linguistic measures are relevant to the production of understandable L2 speech. Without a uniform approach to linguistic coding, however, making comparisons across studies is not possible. It would be unreasonable to expect all scholars to follow the same coding procedures. It does seem necessary, though, to at least note which procedures have been used, their strengths and weaknesses, and why researchers have used them. A methodological review of linguistic coding within L2 pronunciation research seems long overdue.

Concluding Thoughts

There is clearly a need to pursue many of the themes of my dissertation further, as highlighted above, but some insight can also be drawn.
The findings of my dissertation continue to support a pedagogical emphasis on intelligible (i.e., understandable) before nativelike speech (Derwing & Munro, 2015; Levis, 2005). They also provide evidence that this emphasis is relevant to both L2 communicative and assessment contexts. Greater clarity is still needed on which linguistic measures enable an L2 speaker to produce understandable speech. It is clear, however, that at both the monologic and the interactive level, listener understanding depends on more than a simple segmental versus suprasegmental debate. Along with the linguistic measures of interest, we must consider the proficiency of speakers, the familiarity of listeners, and the complexity of the tasks employed.

APPENDICES

APPENDIX A
Picture Narrative (Derwing et al., 2009)

APPENDIX B
Experiential Task

Version A (IELTS, 2009)
Describe a party that you enjoyed.
You should say:
whose party it was and what it was celebrating
where the party was held and who went to it
what people did during the party
and explain what you enjoyed about this party
You will have to talk about the topic for 1 to 2 minutes. You will have 1 minute to think about what you are going to say. You can make some notes to help you if you wish.

Version B (IELTS, 2011)
Describe a restaurant that you enjoyed going to.
You should say:
where the restaurant was
why you chose this restaurant
what type of food you ate in this restaurant
and explain why you enjoyed eating in this restaurant.
You will have to talk about the topic for 1 to 2 minutes. You will have 1 minute to think about what you are going to say. You can make some notes to help you if you wish.

APPENDIX C
Academic Task (Educational Testing Service, 2012)

Version A: Social Interaction
Reading Text
People account for their own behavior differently from how they account for the behavior of others.
When observing the behavior of others, we tend to attribute their actions to their character or their personality rather than to external factors. In contrast, we tend to explain our own behavior in terms of situational factors beyond our own control rather than attributing it to our own character. One explanation for this difference is that people are aware of the situational forces affecting them but not of situational forces affecting other people. Thus, when evaluating
Listening Text
he store and I was getting in line to buy something. But just before I was actually in line, some guy comes out of I assumed he was just a selfish, inconsiderate person when, in fact, I had no idea why he cut in Ok. So a few days after that, I was at the store again. Only this time I was in a real hurry; I was late for an important meeting and I was frustrated that everything was taking so long. And ving so slowly. But then I saw a slightly shorter line! But some woman with a lot of stuff to buy was think of myself as a bad or rude person for doing this. I had an important meeting to get to; I was in a hurry, so, you know, I had done nothing wrong.

Version B
Cognitive Dissonance
Reading Text
Individuals sometimes experience a contradiction between their actions and their beliefs: between what they are doing and what they believe they should be doing. These contradictions can cause a kind of mental discomfort known as cognitive dissonance. People experiencing cognitive dissonance often do not want to change the way they are acting, so they resolve the contradictory situation in another way: they change their interpretation of the situation in a way that minimizes the contradiction between what they are doing and what they believe they should be doing.
Listening Text
This is a true story from my own life. In my first year in high school, I was addicted to video I was failing chemistry; that was my hardest class.
So this was a conflict for me because I wanted a good job when I grew up, and I believed I knew only class I was doing really badly in was chemistry. In the others I was, I was okay. So I asked to know chemistry. In other words, I changed my understanding of what it meant to do well in school. I reinterpreted my situation: I used to think that doing well in school meant doing well in all my classes, but now I decided that succeeding in school meant only doing well in the classes that related directly to my future career. I eliminated the conflict, at least in my mind.

APPENDIX D
Interactive Prompts

Prompt #1
Agree or disagree with the following statement: It is important to attend many activities when studying abroad.
Have you attended any new activities while at MSU (sports game, club activity, etc.)? Why or why not? What type of activities? Did you enjoy them?
What are positive reasons to attend a new activity? What are negative reasons to attend a new activity?
What type of activity do you like best? A sports activity? An academic activity? Why?
different from yours? Try to convince your partner your opinion is best.

Prompt #2
Agree or disagree with the following statement: It is important to make international friends when studying abroad.
Do you want to make international friends while here at MSU? Why or why not?
What are positive reasons to make international friends? What are negative reasons to make international friends?
Have you made any international friends while here? How did you meet them? What do you do with your international friends? What events do you attend?
different from yours? Try to convince your partner your opinion is best.

Prompt #3
Agree or disagree with the following statement: It is important to travel to many places when studying abroad.
Have you visited anywhere while at MSU? Why or why not? What places have you visited? Why did you choose these places? What was your impression of the places you visited?
What are positive reasons to travel to many different places? What are negative reasons to travel to many different places?
different from yours? Try to convince your partner your opinion is best.

APPENDIX E
Pronunciation Questionnaire

Consonants
Individual sounds that are not vowels. For example, /b/, /d/, /g/, & /s/.
How familiar are you with English consonants? 1 2 3 4 5 (1 = Not familiar at all; 5 = Very familiar)
How much instruction have you received on how to produce English consonants? 1 2 3 4 5 (1 = No instruction at all; 5 = A lot of instruction)
When speaking English, how aware are you of how you are producing consonants? 1 2 3 4 5 (1 = Not aware at all; 5 = Very aware)
How important are consonants in producing English speech that is understandable? 1 2 3 4 5 (1 = Not important at all; 5 = Very important)

Vowels
Individual sounds that are not consonants. For example, /a/, /e/, /i/, /o/, & /u/.
How familiar are you with English vowels? 1 2 3 4 5 (1 = Not familiar at all; 5 = Very familiar)
How much instruction have you received on how to produce English vowels? 1 2 3 4 5 (1 = No instruction at all; 5 = A lot of instruction)
When speaking English, how aware are you of how you are producing vowel sounds? 1 2 3 4 5 (1 = Not aware at all; 5 = Very aware)
How important are vowels in producing English speech that is understandable? 1 2 3 4 5 (1 = Not important at all; 5 = Very important)

Word Stress
How familiar are you with English word stress? 1 2 3 4 5 (1 = Not familiar at all; 5 = Very familiar)
How much instruction have you received on English word stress? 1 2 3 4 5 (1 = No instruction at all; 5 = A lot of instruction)
When speaking English, how aware are you of where you place stress in words? 1 2 3 4 5 (1 = Not aware at all; 5 = Very aware)
How important is word stress in producing English speech that is understandable? 1 2 3 4 5 (1 = Not important at all; 5 = Very important)

Intonation
The melody of English speech, or how the pitch of the voice goes up and down when speaking.
For example, in a yes/no question, English pitch goes up at the end of the question.
How familiar are you with English intonation? 1 2 3 4 5 (1 = Not familiar at all; 5 = Very familiar)
How much instruction have you received on English intonation? 1 2 3 4 5 (1 = No instruction at all; 5 = A lot of instruction)
When speaking English, how aware are you of your intonation usage? 1 2 3 4 5 (1 = Not aware at all; 5 = Very aware)
How important is intonation in producing English speech that is understandable? 1 2 3 4 5 (1 = Not important at all; 5 = Very important)

Rhythm
The regular beat (like in music) created by stressed elements across a sentence. These stressed words, and receive extra emphasis.
How familiar are you with English speech rhythm? 1 2 3 4 5 (1 = Not familiar at all; 5 = Very familiar)
How much instruction have you received on English speech rhythm? 1 2 3 4 5 (1 = No instruction at all; 5 = A lot of instruction)
When speaking English, how aware are you of your speech rhythm? 1 2 3 4 5 (1 = Not aware at all; 5 = Very aware)
How important is speech rhythm in producing English speech that is understandable? 1 2 3 4 5 (1 = Not important at all; 5 = Very important)

Speech Rate
How slowly or quickly a person speaks English.
How aware are you of the speech rate of an English speaker you are listening to? 1 2 3 4 5 (1 = Not aware at all; 5 = Very aware)
Have you ever received instruction on English speech rate? 1 2 3 4 5 (1 = No instruction at all; 5 = A lot of instruction)
When speaking English, how aware are you of your speech rate? 1 2 3 4 5 (1 = Not aware at all; 5 = Very aware)
How important is speech rate in producing English speech that is understandable? 1 2 3 4 5 (1 = Not important at all; 5 = Very important)

APPENDIX F
Questionnaire A: Speaker Background
Name:
Age:
Hometown: (City, Province/State, Country)
1) What is your native language? What was Did you speak any other languages at home?
2) What do you consider your second language? Do you speak any other languages?
3) Places You Have Lived (not visited): Location (City, Province, Country), Reason, Length
MSU Study Experience
4) What is your intended/acquired major?
5) What type of degree is this (e.g., BA, BSc, MA, PhD)?
6) How many years have you studied (or did you study)?
7) Do you have a scholarship/fellowship? Yes No. If yes, could you please give a description?
Language Use and Background
8) At what age did you begin to learn English?
9) Have you ever studied English abroad? Yes No. Where (how long)?
10) Using the below scale, please rate your ability to speak, listen, read, and write English. (1 = Low Ability, 9 = High Ability)
Speaking 1 2 3 4 5 6 7 8 9
Listening 1 2 3 4 5 6 7 8 9
Reading 1 2 3 4 5 6 7 8 9
Writing 1 2 3 4 5 6 7 8 9
11) If you remember, what was your last score on a test like IELTS, TOEFL, or TOEIC? Which test? Score?
12) If you remember, what was your score on the MSUELT?
13) Here in Michigan, approximately what percent of the time do you speak English (as opposed to other languages) in your daily life? 0% 10 20 30 40 50 60 70 80 90 100%
14) Here in Michigan, approximately what percent of the time do you listen to English-language media (as opposed to media in other languages)? 0% 10 20 30 40 50 60 70 80 90 100%
15) Of the time that you spend speaking English in Michigan, approximately what percent of the time do you interact with native English speakers (as opposed to non-native speakers)? 0% 10 20 30 40 50 60 70 80 90 100%
16) Of the time that you spend speaking English in Michigan, approximately what percent of the time do you interact with nonnative English speakers (as opposed to native speakers)? 0% 10 20 30 40 50 60 70 80 90 100%
17) Please list which types of accented English (native and nonnative) you are most familiar with.

APPENDIX G
Questionnaire B: Listener Background
Pre-Listening
1) Please type your name.
2) Please type your e-mail address.
3) What [TESOL minor] courses are you enrolled in?
4) How old are you (in years)?
5) In what country did you study for: a) Elementary school? b) Junior high school? c) Senior high school? d) Undergraduate studies?
6) What is your current degree? a) Undergraduate b) Graduate (MA) c) Graduate (PhD) d) Other (if other, please describe)
7) What is your first language? If you grew up bilingual, please list both languages.
8) What is your second language? Write none if you do not speak a second language.
9) Please rate your proficiency in your L2. (1 = Near beginner, 9 = Near nativelike)
10) Please list all other languages you speak. Please rate your proficiency on a 9-point scale (1 = Near beginner, 9 = Near nativelike).
11) Do you have previous experience teaching a second language? Yes or No
a) If yes, which language(s) did you teach?
b) How long did you teach for (in months/years)? Please respond for each language listed above.
c) How old were your learners? Please respond for each language listed above.
d) In what country did you teach? Please respond for each language listed above.
12) How familiar are you with the following English accents? (1 = Not familiar at all, 9 = Very familiar) a) American b) Arabic c) Australian d) British e) Chinese f) French g) Hindi h) Japanese i) Korean j) Spanish k) Vietnamese
Post-Listening
1) Have you ever studied the Chinese language? If you are a native speaker of Chinese,
a) If yes, please self-rate your proficiency below (1 = low ability, 9 = high ability). a. Speaking b. Listening c. Reading d. Writing
b) If yes, in 2-3 sentences, please describe your Chinese learning experience (e.g., years of study, class type, location, etc.).
2) Have you ever studied the Japanese language? If you are a native speaker of Japanese,
a) If yes, please self-rate your proficiency below (1 = low ability, 9 = high ability). a. Speaking b. Listening c. Reading d. Writing
b) If yes, in 2-3 sentences, please describe your Japanese learning experience (e.g., years of study, class type, location, etc.).
3) What is your major?
4) If you have a minor, please list it here.
5) Please list any linguistics courses you have taken. Please include a descriptive name, such as syntax, morphology, phonology, etc.

APPENDIX H
Questionnaire C: Rater Background
Name:
Age:
Hometown: (City, Province/State, Country)
1) What is your native language? Did you speak any other languages at home?
2) What do you consider your second language? Do you speak any other languages?
3) Places You Have Lived (not visited): Location (City, Province, Country), Reason, Length
MSU Study Experience
4) What is your current degree and focus of study? What year of study are you in?
5) Do you have a scholarship/fellowship? Yes No. If yes, could you please give a description?
6) What is your highest degree earned and focus of study? Where did you earn this degree?
7) Please list any other degrees earned, the focus of each degree, and the location of each degree.
Language Use and Background
8) At what age did you begin to learn your second language (as described above)?
9) Have you ever studied this language abroad? Yes No. Where (how long)?
10) Using the below scale, please rate your ability to speak, listen, read, and write in your L2. (1 = Low Ability, 9 = High Ability)
Speaking 1 2 3 4 5 6 7 8 9
Listening 1 2 3 4 5 6 7 8 9
Reading 1 2 3 4 5 6 7 8 9
Writing 1 2 3 4 5 6 7 8 9
11) If you remember, what was your last score on a language proficiency test (such as ACTFL)? Which test? Score?
12) Here in Michigan, approximately what percent of the time do you speak this language (as opposed to other languages) in your daily life? 0% 10 20 30 40 50 60 70 80 90 100%
13) Here in Michigan, approximately what percent of the time do you listen to media in this language (as opposed to media in other languages)?
0% 10 20 30 40 50 60 70 80 90 100%
14) Of the time that you spend speaking this language in Michigan, approximately what percent of the time do you interact with native speakers (as opposed to non-native speakers)? 0% 10 20 30 40 50 60 70 80 90 100%
15) Of the time that you spend speaking this language in Michigan, approximately what percent of the time do you interact with nonnative speakers (as opposed to native speakers)? 0% 10 20 30 40 50 60 70 80 90 100%
Language Teaching Background
16) Do you have previous experience teaching a second language? Yes or No
a) If yes, which language(s) did you teach?
b) How long did you teach for (in months/years)? Please respond for each language listed above.
c) How old were your learners? Please respond for each language listed above.
d) In what country did you teach? Please respond for each language listed above.
Nonnative-English Speech Familiarity
17) How familiar are you with the following English accents? (1 = Not familiar at all, 9 = Very familiar) a) American b) Arabic c) Australian d) British e) Chinese f) French g) Hindi h) Japanese i) Korean j) Spanish k) Vietnamese
18) How familiar are you with accented English in general? (1 = Not familiar at all, 9 = Very familiar)

APPENDIX I
Perception of Rating Categories
How well did you understand this rating category? (1 = I did not understand this concept well; 5 = Neutral; 9 = I understood this concept well)
Accentedness 1 2 3 4 5 6 7 8 9
Comprehensibility 1 2 3 4 5 6 7 8 9
Intelligibility 1 2 3 4 5 6 7 8 9
How comfortable did you feel rating this category?
(1 = Very difficult; 5 = Neutral; 9 = Very easy and comfortable)
Accentedness 1 2 3 4 5 6 7 8 9
Comprehensibility 1 2 3 4 5 6 7 8 9
Intelligibility 1 2 3 4 5 6 7 8 9

APPENDIX J
Paired Assessment Rating Rubric (Reproduced as presented in Ockey, 2011)

Rating categories:
Pronunciation. Think about: a) pronunciation, b) intonation, c) word blending
Fluency. Think about: a) automatization, b) fillers, c) speaking speech
Grammar. Think about: a) use of morphology, b) complexity of syntax (embedded clauses, parallel structures, connectors)
Vocabulary. Think about: a) range of vocabulary
Communicative skills/strategies. Think about: a) interaction, b) confidence, c) conversational awareness

Score 4
Pronunciation: Speaks with excellent pronunciation and intonation; has practically mastered the sound system of English
Fluency: Excellent fluency; uses fillers effectively; shows ability to speak quickly in short bursts
Grammar: Uses both simple and complex grammar effectively; may make occasional errors but they are only in late-acquired grammar
Vocabulary: Shows evidence of a wide range of vocabulary knowledge
Communicative skills/strategies: Confident and natural; asks others to expand on views; shows how own and related; interacts smoothly

Score 3.5: no descriptors listed

Score 3.0
Pronunciation: Pronunciation is good but has still not mastered the sound system of English; accent does not interfere with comprehension; can blend words
Fluency: May use some fillers; rarely gropes for words but speech may still not be quick
Grammar: Shows ability to use some complex grammar; may make errors but they are only in late-acquired grammar
Vocabulary: Shows some evidence of some advanced vocabulary
Communicative skills/strategies: Generally confident; responds appropriately to shows ability to negotiate meaning quickly and relatively naturally

Score 2.5: no descriptors listed

Score 2.0
Pronunciation: May not have mastered some difficult sounds of English, but would be mostly understandable to a naïve NS; makes some attempts to blend words
Fluency: Speech is hesitant; some groping for words and unfilled spaces are present but generally communication completely
Grammar: Relies mostly on simple (but appropriate) grammar; has enough morphosyntax to express meaning; complex grammar is attempted but may be inaccurate
Vocabulary: Generally has enough lexis for expressing some opinions but does not demonstrate any particular knowledge of vocabulary
Communicative skills/strategies: Responds to others without long pauses to maintain interaction; shows agreement or disagreement with others

Score 1.5: no descriptors listed

Score 1.0
Pronunciation: Somewhat non-nativelike pronunciation; does not blend words together; they are pronounced in isolation
Fluency: Slow, strained speech; constant groping for words and long unnatural pauses; communication with a NS would be difficult
Grammar: have enough grammar to express an opinion clearly; makes frequent errors; no attempt at complex grammar
Vocabulary: Lexis not adequate for task; cannot express opinion properly with limited words used
Communicative skills/strategies: Does not initiate interaction; produces monologue only; shows some turn- not relate ideas in explanation; too nervous to interact effectively

Score 0.5
Pronunciation: Very heavy accent; uses non-nativelike phonology and rhythm; words are not blended together
Fluency: Fragments of speech that are so halting that conversation is not really possible; NS would think person had virtually no English
Grammar: Does not use any discernible grammatical morphology
Vocabulary: Shows knowledge of only the simplest words and phrases taught in early language learning contexts
Communicative skills/strategies: Shows no awareness of other speakers; may speak, but not in a conversation-like way

APPENDIX K
Targeted 11 Linguistic Measures of L2 Speech
All measures are drawn from Isaacs and Trofimovich (2012) but have been relabeled to allow for increased readability.
Phonology. A total of six categories at segmental and suprasegmental levels were used to
(1) Segmental Accuracy: The total number of segmental (vowel, consonant) substitutions divided by the total number of segments articulated (e.g., substituting /i/ for /ɪ/).
(2) Syllable Structure Accuracy: The total number of vowel and consonant epenthesis (insertion) and elision (deletion) errors over the total number of syllables articulated.
REMOVED FROM ANALYSIS
(3) Word Stress Accuracy: The total number of instances where primary stress was misplaced or missing over the total number of polysyllabic words produced. (e.g.,
(4) Rhythm: A measure of English stress-timing: the number of correctly reduced syllables in both polysyllabic words and function words divided by the total number of obligatory vowel reduction contexts.
REMOVED FROM ANALYSIS
(5) Intonation: The number of correct pitch patterns at the end of phrases over the total number of instances where pitch patterns were expected is one inappropriate rise after corner and one appropriate fall after other).
Fluency. Six categories designed to describe dysfluencies in L2 speech were used to
(6) Filled Pauses: Total number of non-lexical pauses (i.e., uh, um) longer than 400 milliseconds
(7) Unfilled Pauses: Total number of unfilled pauses longer than 400 milliseconds
(8) Pause Appropriateness: A measurement of the relationship between fluency and sentence structure: the number of inappropriately filled and unfilled pauses divided by the (Unfilled Pause
(9) Repetitions/Self-Corrections: The sum of all immediately repeated and self-corrected words over the (unfilled pause) big big [REPETITION] building on the (filled pause) uh
(10) Articulation Rate: Excluding dysfluencies (e.g., filled pauses, false starts), the total number of syllables produced divided by the total duration of the speech sample in seconds.
(11) Mean Length of Run: The mean number of syllables produced between two adjacent filled or unfilled pauses greater than 400 milliseconds.

REFERENCES
Abrahamsson, N., & Hyltenstam, K. (2009). Age of onset and nativelikeness in a second language: Listener perception versus linguistic scrutiny. Language Learning, 59(2), 249-306.
Appiah, K. A. (2006). Cosmopolitanism: Ethics in a world of strangers (issues of our time). New York: W.W. Norton & Company, Inc.
Baker, A., & Murphy, J.
(2011). Knowledge base of pronunciation teaching: Staking out the territory. TESL Canada Journal, 28(2), 29.
Baker, W. (2015). Culture and identity through English as a lingua franca. Berlin, Germany: De Gruyter Mouton.
Benrabah, M. (1997). Word-stress: A source of unintelligibility in English. International Review of Applied Linguistics in Language Teaching, 35(3), 157-166.
Bent, T., & Bradlow, A. (2003). The interlanguage speech intelligibility benefit. The Journal of the Acoustical Society of America, 114(3), 1600-1610.
Bergeron, A., & Trofimovich, P. (2017). Linguistic dimensions of accentedness and comprehensibility: Exploring task and listener effects in second language French. Foreign Language Annals, 50(3), 547-566.
Bongaerts, T., Van Summeren, C., Planken, B., & Schils, E. (1997). Age and ultimate attainment in the pronunciation of a foreign language. Studies in Second Language Acquisition, 19(4), 447-465.
Bowles, M. A., Toth, P. D., & Adams, R. J. (2014). A comparison of L2-L2 and L2-heritage learner interactions in Spanish language classrooms. Modern Language Journal, 92(2), 497-517.
Breitkreutz, J., Derwing, T. M., & Rossiter, M. J. (2001). Pronunciation teaching practices in Canada. TESL Canada Journal, 19, 51-61.
Brooks, L. (2009). Interacting in pairs in a test of oral proficiency: Co-constructing a better performance. Language Testing, 26(3), 341-366.
Bueno-Alastuey, M. C. (2013). Interactional feedback in synchronous voice-based computer mediated communication: Effect of dyad. System, 41(3), 543-559.
Byram, M. (1997). Teaching and assessing intercultural communicative competence. Clevedon, UK: Multilingual Matters.
Byrnes, H. (2013). Notes from the editor. The Modern Language Journal, 97(4), 825-827.
Calloway, D. R. (1980). Accent and the evaluation of ESL oral proficiency. In J. W. Oller Jr. & K. Perkins (Eds.), Research in language testing (pp. 102-115). Newbury House.
Caspers, J. (2010).
The influence of erroneous stress position and segmental errors on intelligibility, comprehensibility and foreign accent in Dutch as a second language. Linguistics in the Netherlands, 27, 17-29.
Celce-Murcia, M., Brinton, D. M., & Goodwin, J. M. (2010). Teaching pronunciation: A course book and reference guide (2nd ed.). Cambridge: Cambridge University Press.
Chambers, F. (1997). What do we mean by fluency? System, 25(4), 535-544.
Crowther, D., & De Costa, P. I. (2017). Developing mutual intelligibility and conviviality in the 21st century classroom: Insights from English as a lingua franca and intercultural communication. TESOL Quarterly, 51(2), 450-460.
Crowther, D., Kim, S., Lee, J., Lim, J., & Loewen, S. (forthcoming). Methodological synthesis of cluster analysis in second language acquisition research.
Crowther, D., Trofimovich, P., & Isaacs, T. (2016). Linguistic dimensions of second language accent and comprehensi Journal of Second Language Pronunciation, 2(2), 160-182.
Crowther, D., Trofimovich, P., Isaacs, T., & Saito, K. (2015a). Does a speaking task affect second language comprehensibility? The Modern Language Journal, 99, 80-95.
Crowther, D., Trofimovich, P., Isaacs, T., & Saito, K. (2017). Linguistic dimensions of L2 accentedness and comprehensibility vary across speaking tasks. Studies in Second Language Acquisition. Published online 22 August 2017.
Crowther, D., Trofimovich, P., Saito, K., & Isaacs, T. (2015b). Second language comprehensibility revisited: Investigating the effects of learner background. TESOL Quarterly, 49(4), 814-837.
Crystal, D. (2008). Two thousand million? English Today, 24, 3-6.
Csepes, I. (2009). Measuring oral proficiency through paired-task performance. Frankfurt, Germany: Peter Lang.
Cunnings, I. (2012). An overview of mixed-effects statistical models for second language researchers. Second Language Research, 28(3), 369-382.
Dauer, R. M. (2005).
The lingua franca core: A new model for pronunciation instruction? TESOL Quarterly, 39(3), 543-550.
Davies, A. (2003). The native speaker: Myth and reality. Clevedon, UK: Multilingual Matters.
Davis, L. (2009). The influence of interlocutor proficiency in a paired oral assessment. Language Testing, 26(3), 367-396.
Derwing, T. M., & Munro, M. J. (2013). The development of L2 oral language skills in two L1 groups: A 7 year study. Language Learning, 63(2), 163-185.
Derwing, T. M., & Munro, M. J. (2014). Training native speakers to listen to L2 speech. In J. M. Levis & A. Moyer (Eds.), Social dynamics in second language accent (pp. 219-236). Boston, MA: Walter de Gruyter Inc.
Derwing, T. M., & Munro, M. J. (2015). Pronunciation fundamentals: Evidence-based perspectives for L2 teaching and research. Philadelphia, PA: John Benjamins.
fluency and comprehensibility development. Applied Linguistics, 29(3), 359-380.
Derwing, T. M., Munro, M. J., Thomson, R. I., & Rossiter, M. (2009). The relationship between L1 fluency and L2 fluency development. Studies in Second Language Acquisition, 31(4), 553-557.
Derwing, T. M., Munro, M. J., & Wiebe, G. (1998). Evidence in favor of a broad framework for pronunciation instruction. Language Learning, 48(3), 393-410.
Derwing, T. M., Rossiter, M. J., Munro, M. J., & Thomson, R. I. (2004). Second language fluency judgments on different tasks. Language Learning, 54(4), 655-679.
Deterding, D. (2010). Norms for pronunciation in Southeast Asia. World Englishes, 29(3), 364-377.
interaction. Language Testing, 26(3), 423-443.
Dziubalska- English pronunciation models: A changing scene. New York, USA: Peter Lang.
Educational Testing Service. (2012). The official guide to the TOEFL test (4th ed.). New York: McGraw Hill.
Educational Testing Service. (2017). The official guide to the TOEFL test (5th ed.). New York: McGraw Hill.
Educational Testing Service. (2014).
TOEFL iBT speaking section scoring guide. Retrieved online from https://www.ets.org/s/toefl/pdf/toefl_speaking_rubrics.pdf.
Everitt, B. S. (1980). Cluster analysis (2nd ed.). New York: Halsted Press.
Fayer, J. M., & Krasinski, E. (1987). Native and nonnative judgments of intelligibility and irritation. Language Learning, 37(3), 313-326.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: SAGE Publishing.
Field, J. (2005). Intelligibility and the listener: The role of lexical stress. TESOL Quarterly, 39(3), 399-423.
Flege, J. E., Munro, M. J., & MacKay, I. R. A. (1995). Effects of age of second-language learning on the production of English consonants. Speech Communication, 16, 1-26.
Foote, J. A., Holtby, A. K., & Derwing, T. M. (2011). Survey of teaching pronunciation in adult ESL programs in Canada, 2010. TESL Canada Journal, 29, 1-22.
Foote, J. A., Trofimovich, P., Collins, L., & Urzúa, F. S. (2016). Pronunciation teaching practices in communicative second language classes. The Language Learning Journal, 44(2), 181-196.
Fowler, C. A., Brown, J., Sabadini, L., & Weihing, J. (2003). Rapid access to speech gestures in perception: Evidence from choice and simple response time tasks. Journal of Memory and Language, 49(3), 296-314.
Fulcher, G., & Owens, N. (2016). Dealing with the demands of language testing and assessment. In G. Hall (Ed.), The Routledge Handbook of English Language Teaching (pp. 109-120). New York: Routledge.
Galaczi, E. D. (2008). Peer-peer interaction in a speaking test: The case of the First Certificate in English examination. Language Assessment Quarterly, 5(2), 89-119.
Galaczi, E., Post, B., Li, A., Barker, F., & Schmidt, E. (2017). Assessing second language pronunciation: Distinguishing features of rhythm in learner speech at different proficiency levels. In T. Isaacs & P. Trofimovich (Eds.), Second language pronunciation assessment: Interdisciplinary perspectives (pp. 157-182).
Bristol, UK: Multilingual Matters.
Gallois, C., Ogay, T., & Giles, H. (2005). Communication accommodation theory: A look back and a look ahead. In W. B. Gudykunst (Ed.), Theorizing about intercultural communication (pp. 121-148). Thousand Oaks, CA: SAGE Publications.
Galloway, N., & Rose, H. (2015). Introducing global Englishes. New York: Routledge.
Garrod, S., & Pickering, M. J. (2009). Joint action, interactive alignment, and dialog. Topics in Cognitive Science, 1(2), 292-304.
Gass, S., & Mackey, A. (2015). Input, interaction, and output in second language acquisition. In B. VanPatten & J. Williams (Eds.), Theories in second language acquisition: An introduction (pp. 180-206). New York: Routledge.
Gass, S. M., & Mackey, A. (2017). Stimulated recall methodology in second language research (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Gass, S., & Varonis, E. M. (1984). The effect of familiarity on the comprehensibility of nonnative speech. Language Learning, 34, 65-89.
Ginther, A., & Elder, C. (2014). A comparative investigation into understandings and uses of the TOEFL iBT® test, the International English Language Testing Service (Academic) test, and the Pearson Test of English for graduate admissions in the United States and Australia: A case study of two university contexts. ETS Research Report Series, 2014(2), 1-39.
Gurzynski-Weiss, L., & Baralt, M. (2014). Exploring learner perception and use of task-based interactional feedback in FTF and CMC modes. Studies in Second Language Acquisition, 36, 1-37.
Harding, L. (2012). Accent, listening assessment and the potential for a shared-L1 advantage: A DIF perspective. Language Testing, 29(2), 163-180.
ew. In T. Isaacs & P. Trofimovich (Eds.), Second language pronunciation assessment (pp. 12-34). Bristol, UK: Multilingual Matters.
Harding, L. (2018). Validity in pronunciation assessment. In O. Kang & A. Ginther (Eds.), Assessment in second language pronunciation (pp.
30-48). New York: Routledge.
Hardison, D. (2014). Phonological literacy in L2 learning and teaching. In J. M. Levis & A. Moyer (Eds.), Social dynamics in second language accent (pp. 195-218). Boston, MA: Walter de Gruyter Inc.
Hardison, D. M. (in press). Visualizing the acoustic and gestural beats of emphasis in multimodal discourse: Theoretical and pedagogical implications. Journal of Second Language Pronunciation.
Hayes-Harb, R., Smith, B. L., Bent, T., & Bradlow, A. R. (2008). The interlanguage speech intelligibility benefit for native speakers of Mandarin: Production and perception of English word-final voicing contrasts. Journal of Phonetics, 36(4), 664-679.
Hilton, H. (2008). The link between vocabulary knowledge and spoken L2 fluency. The Language Learning Journal, 36(2), 153-166.
International English Language Testing System. (2009). Cambridge IELTS 7: Examination papers from University of Cambridge ESOL Examinations: English for speakers of other languages. Cambridge: Cambridge University Press.
International English Language Testing System. (2011). Cambridge IELTS 8: Examination papers from University of Cambridge ESOL Examinations: English for speakers of other languages. Cambridge: Cambridge University Press.
International English Language Testing System. (2016). Speaking assessment criteria. Retrieved from https://www.ielts.org/~/media/pdfs/speaking-band-descriptors.ashx.
Isaacs, T., & Thomson, R. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10(2), 135-159.
Isaacs, T., & Trofimovich, P. (2012). Deconstructing comprehensibility: Identifying the Studies in Second Language Acquisition, 34(3), 475-505.
Isaacs, T., & Trofimovich, P. (Eds.). (2017). Second language pronunciation assessment. Bristol, UK: Multilingual Matters.
Isaacs, T., Trofimovich, P., & Foote, J. A. (2017).
Developing a user - oriented second language comprehensibility scale for English - medium universities. Language Testing. Published online 6 May 2017. Isaacs, T., Trofimovich, P., Yu, G., & Mu aspects of speech that most efficiently discriminate between upper levels of the revised IELTS Pronunciation scale. IELTS Research Reports Series, 4 , 1 - 48. Isbell, D. (2018). Assessing pronunciation for re search purposes with listener - based numerical scales. In O. Kang & A. Ginther (Eds.), Assessment of second language pronunciation (pp. 89 - 112). New York: Routledge. Isbell, D., Park, O. S., & Lee, K. (in press). Learning Korean pronunciation: Effects of i nstruction, proficiency, and L1. Journal of Second Language Pronunciation. Jenkins, J. (2000). The phonology of English as an international language . Oxford, UK: Oxford University Press. Jenkins, J. (2002). A sociolinguistically based, empirically researched pronunciation syllabus for English as an international language. Applied Linguistics, 23 , 83 - 103. 141 Jenkins, J. (2006). Current perspectives on teaching world Englishes and English as a lingua franca. TESOL Quarterly, 40 , 157 - 181. Jenkins, J. (2014). English as a lingua franca in the international university . New York: Routledge. Jewitt, C. (2014). An introduction to multimodality. In C. Jewitt (Ed.), The Routledge Handbook of Multimodal Analysis (2nd Ed.) (pp. 15 - 30). New York, NY: Routledge. Johnson, M., & Tyler, A. (1999). Re - analyzing the OPI: How much does it look like natural conversation? In R. Young & A. W. He (Eds.), Talking and testing: Discourse approaches to the assessment of oral pro ficiency (pp. 27 - 52). Philade l phia, PA: John Benjamins North America. Kahng, J. (2014). Exploring utterance and cognitive fluency of L1 and L2 English speakers: Temporal measures and stimulated recall. Language Learning, 64 (4), 809 - 854. Kang, O. (2010). 
Relative salience of suprasegmental features on judgments of L2 comprehensibility and accentedness. System, 38 (2) , 301 - 315. Kang, O., & Ginther, A. (Eds.) (2018). Assessment in second language pronunciation . New York: Routledge. Kang, O., & Rubin, D. L. (2009). Reverse linguistic stereotyping: Measuring the effect of listener expectations on speech evaluation. Journal of Language and Social Psychology, 28 (4) , 441 - 456. Kang, O., Rubin, D. L. , & Pickering, L. (2010). Supra segmental measures of accentedne ss and judgments of language learner proficiency in oral English. Modern Language Journal, 94 (4) , 554 - 566. Kennedy, S., Guénette, D., Murphy, J., & Allard, S. (2015). Le rôle de la prononciation dans l'intercompréhension entre locuteurs de français lingua franca [The role of pronunciation in comprehension between speakers of French as a lingua franca]. Canadian Modern Language Review, 71 , 1 - 25. King, R. S. (2015). Cluster analysis and data mining: An introduction . Boston, MA: Mercury Learning and Information. Knapp, M. L., & Hall, J. A. (1992). Nonverbal communication in human interaction . New York, NY: Holt Rinehart and Winston, Inc. Kormos , J. (1999). Simulating conversations in oral - proficiency assessment: a conversation analysis of role plays and non - scripted interviews in language exams. Language Testing, 16 (2), 163 - 188. 142 Kumaravadivelu, B. (2008). Cultural globalization and language edu cation . New Haven, CT: Yale University Press. Larson - Hall, J. (2010). A guide to doing statistics in second language research using SPSS . New York, NY: Routledge. Lazarton, A., & Davis, L. (2008). A microanalytic perspective on discourse, proficiency, an d identity in paired oral assessment. Language Assessment Quarterly, 5 (4) , 313 - 335. Leaper, D. A., & Riazi, M. (2014). The influence of prompts on group oral tests. Language Testing, 31 (2), 177 - 204. Lee, J., Jang, J., & Plonsky, L. (2015). 
The effectiven ess of second language pronunciation instruction: A meta - analysis. Applied Linguistics, 36 (3) , 345 - 366. Lennon, P. (1990). Investigating fluency in EFL: A quantitative approach. Language learning, 40 (3), 387 - 417. Leung, C. (2005). Convivial communication: Reconceptualizing communicative competence. International Journal of Applied Linguistics, 15 (2) , 119 - 144. Levis, J. M. (2005). Changing contexts and shifting paradigms in pronunciation teaching. TESOL Quarterly, 39 (3) , 369 - 377. Loewen, S. (2015). Introduction to instructed second language acquisition . New York: Routledge. Loewen, S., & Isbell, D. (2017). Pronunciation in face - to - face and audio - only synchronous computer - mediated learner interactions. Studies in S econd Language Acquisition, 39 (2), 225 - 256. Long, M. H. (1996). The role of the linguistic environment in second language acquisition. In W. C. Ritchie & T. K. Bhatia (Eds.), Handbook of language acquisition: Second language acquisition (pp. 413 - 468). New York: Academic Press. Long, M. H., & Porter, P. (1985). Group work, interlanguage talk, and second language acquisition. TESOL Quarterly, 19 (2) , 207 - 228. Low, E. L. (2015). Pronunciation for English as an international language: From research to practice . New York: Routledge. feedback? Studies in Second Language Acquisition, 22 (4) , 471 - 497. 143 Mackey, A. & Goo, J. (2007). Interaction research in SLA: A meta - analysis and research synthesis. In A. Mackey (Ed.), Conversational interaction in second language acquisition: A collection of empirical studies (pp. 407 - 452). Oxford: Oxford University Press. MacKay, I. R. A., Flege, J. E., & Imai, S. (2006). Eva luating the effects of chronological age and sentence duration on degree of perceived foreign accent. Applied Psycholinguistics, 27 (2) , 157 183. MacKenzie, I. (2011). Intercultural negotiations . New York: Routledge. Major, R. C. (2001). 
Foreign accent: T he ontogeny and phylogeny of second language phonology . Mahwah, NJ: Lawrence Erlbaum. Major, R. C., Fitzmaurice, S. F., Bunta, F., Balasubramanian, C. (2001). The effects of nonnative accents on listening comprehension: Implications for ESL assessment. TESOL Quarterly, 36 (2) , 173 - 190. Marsden, E., Mackey A., & Plonsky, L. (2016). The IRIS Repository: Advancing research practice and methodology. In A. Mackey & E. Marsden (Eds.), Advancing methodology and practice: The IRIS Repository of Instruments for R esearch into Second Languages (pp. 1 - 21). New York: Routledge. Matsuda, A. (Ed.). (2017). Preparing teachers to teach English as an international language . Bristol, UK: Multilingual Matters. May, L. (2009). Co - constructed interaction in a paired speaking Language Testing, 26 (3) , 397 - 421. May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters. Language Assessment Quarterly, 8 (2) , 127 - 145. Michel, M. C., Kuiken, F., & Vedder , I. (2007). The influence of complexity in monologic versus dialogic tasks in Dutch L2. International Review of Applied Linguistics in Language Teaching , 45 (3), 241 - 259. Moyer, A. (2013). Foreign accent: The phenomenon of non - native speech. Cambridge: Ca mbridge University Press. Munro, M. J. (2018, forthcoming). Dimensions of pronunciation. In O. Kang, R. I. Thomson, & J. Murphy (Eds.), The Routledge handbook of contemporary English pronunciation . London: Routledge. Munro, M. J., & Derwing, T. M. (1995) . Foreign accent, intelligibility, and comprehensibility in the speech of second language learners. Language Learning, 45 , 73 - 97. 144 Munro, M. J., & Derwing, T. M. (1999). Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 49 (s1) , 285 - 310. Munro, M. J., & Derwing, T. M. (2001). Modeling perceptions of the accentedness and comprehensibility of L2 speech: The role of speaking rate. 
Studies in Second Language Acquisition, 23 (4) , 451 - 468. Munro, M. J., & Derwing, T. M. (2006). The functional load principle in ESL pronunciation instruction: An exploratory study. System, 34 (4) , 520 - 531. Munro, M. J., Derwing, T. M., & Burgess, C. S. (2010). Detection o f nonnative speaker status from content - masked speech. Speech Communication, 52 (5 - 7) , 626 - 637. Munro, M. J., Derwing, T. M., & Morton, S. L. (2006). The mutual intelligibility of L2 speech. Studies in Second Language Acquisition, 28 , 111 - 131. Nagle, C. ( 2018). Motivation, comprehensibility, and accentedness in L2 Spanish: Investigating motivation as a time - varying predictor of pronunciation development. Modern Language Journal, 102 , 199 - 217. Nakatsuhara, F. (2006). The impact of proficiency - level on conv ersational styles in paired speaking tests. Cambridge ESOL Research Notes, 25 , 15 - 20. Nitta, R., & Nakatsuhara, F. (2014). A multifaceted approach to investigating pre - task planning effects on paired oral test performance. Language Testing, 31 (2) , 147 - 175. Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta analysis. Language learning, 50 (3), 417 - 528. Norton, B. (2013). Identity and language learning (2nd e d.). Toronto, Canada: Multilingual Matters. Norton, J. (2005). The paired format in the Cambridge Speaking Tests. ELT Journal, 59 (4) , 287 297. native and nonnative German speech. La nguage Learning, 64 (4), 715 - 748. Studies in Second Language Acquisition, 38 (3) , 587 - 605. ral discussion test scores. Language Testing, 26 (2) , 161 - 186. 145 Ockey, G. (2011). Self - consciousness and assertiveness as explanatory variables in of L2 oral ability: A latent variable approach. Language Learning, 61 (3) , 968 - 989. Ockey , G. J., & French, R. (2014). From one to multiple accents on a test of L2 listening comprehension. Applied Linguistics, 37 (5), 693 - 715. Ockey, G. 
J., Koyama, D., Seto guchi, E., & Sun, A. (2015) . The extent to which TOEFL iBT speaking scores are associate d with performance on oral language tasks and oral ability components for Japanese university students. Language Testing, 32 , 39 - 62. Oppenheimer, D. M. (2008). The secret life of fluency. Trends in Cognitive Science, 12 (6) , 237 - 241. Park, J. S - Y. & Wee, L. (2015). English as a lingua franca: Lessons for language and mobility. In C. Stroud & M. Prinsloo (Eds.), Language, literacy and diversity: Moving words (pp. 55 - 71). New York: Routledge. Pickering, L. (2009). Intonation as a pragma tic resource in ELF interaction. Intercultural Pragmatics, 6 (2) , 235 255. Pickering, L., & Litzenberg, J. (2011). Intonation as a pragmatic resource in ELF interaction, revisited. In A. Archibald, A. Cogo, & J. Jenkins (Eds.), Latest trends in ELF researc h (pp. 77 92). Newcastle, UK: Cambridge Scholars Publishing. Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting practices in quantitative L2 research. Studies in Second Language Acquisition, 35 (4), 655 - 687. Plonsk y, L. (Ed.). (2015a). Advancing quantitative methods in second language research . New York: Routledge. Plonsky, L. (2015 b ). Statistical power, p - to - (Ed.), Advancing quantitative methods in second language research (pp. 23 - 45). New York: Routledge. Plonsky, L., & Derrick, D. (2016). A meta analysis of reliability coefficients in second language research. The Modern Language Journal, 100 (2), 538 - 553. Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes: The case of interactio n research. Language Learning, 61 (2), 325 - 366. Plonsky, L., & Gonulal, T. (2015). Methodological synthesis in quantitative L2 research: A review of reviews and a case study of exploratory factor analysis. Language Learning, 65 (s1) , 9 - 36. 146 Plonsky, L., Mar sden, E., Crowther, D., Gass, S. M., & Spinner, P. (forthcoming). 
A methodological synthesis of judgment tasks in second language research. Language Learning , 64 (4), 878 - 912. Plough, I. C., & Bogart, P. S. (2008). Perceptions of examiner behavior modulate power relations in oral performance testing. Language Assessment Quarterly, 5 (3), 195 - 217. Robinson, P. (2005). Cognitive complexity and task sequencing: Studies in a componential framework for second language task design. International Review of Applied Linguistics in Language Teaching, 43 , 1 - 32. Robinson, P. (2011). Task - based language learni ng: A review of issues. Language Learning, 61 , 1 36. Rubin, D. L. (1992). Nonlanguage factors affecting native English - speaking teaching assistants. Research in Higher Education, 33 (4) , 511 - 531. Saito, K. (2012). Effects of instruction on L2 pronunciation development: A synthesis of 15 quasi - experimental intervention studies. TESOL Quarterly, 46 (4) , 842 - 854. Saito, K., & Akiyama, Y. (2017). Linguistic correlates of comprehensibility in second language Japanese speech. Jo urnal of Second Language Pronunciation, 3 (2), 199 - 217. Saito, Y., & Saito, K. (2017). Differential effects of instruction on the development of second language comprehensibility, word stress, rhythm, and intonation: The case of inexperienced Japanese EFL learners. Language Teaching Research, 21 (5), 589 - 608. Saito, K., Trofimovich, P., & Isaacs, T. (2016). Second language speech production: Investigating linguistic correlates of comprehensibility and accentedness for learners at different ability levels. Applied Psycholinguistics 37 (2), 217 - 240 . Saito, K., Trofimovich, P., & Isaacs, T. (2017). Using listener judgments to investigate linguistic influences on L2 comprehensibility and accentedness: A validation and generalization study. Applied Linguistics, 38 (4) , 439 - 462. Saito, K., Webb, S., Trofimovich, P., & Isaacs, T. (2016). 
Lexical profiles of comprehensible second language speech: The role of appropriateness, fluency, variation, sophistication, abstractness, and sense relations. Studies in Second Lan guage Acquisition, 38 (4) , 677 - 701. Sato, M. (2014). Exploring the construct of interactional oral fluency: Second language acquisition and language testing approaches. System, 45 , 79 - 91. 147 Scollon, R., Scollon, S. W., & Jones, R. H. (2012). Intercultural c ommunication: A discourse approach (3rd ed.). Oxford: Blackwell. Segalowitz, N. (2010). Cognitive bases of second language fluency. New York: Routledge. Segalowitz, N. (2016). Second language fluency and its underlying cognitive and social determinants. International Review of Applied Linguistics in Language Teaching, 54 (2), 79 - 95. Seidlhofer, B. (2011). Understanding English as a lingua franca . Oxford: Oxford University Press. Sewell, A. (2017). Functional load revisit intelligibility studies. Journal of Second Language Pronunciation, 3 , 57 - 79. Sifakis, N.C., & Sougari, A. - M. (2005). Pronunciation issues and EIL pedagogy in the periphery: A survey of Greek state school TESOL Quarterly, 39 (3), 467 - 488. Skehan, P. (2009). Modelling second language performance: Integrating complexity, accuracy, fluency and lexis. Applied Linguistics, 30 (4) , 510 32. Smith, B. L., & Hayes - Harb, R. (2011). Individual diffe rences in the perception of final consonant voicing among native and non - native speakers of English. Journal of Phonetics, 39 , 115 - 120. Solon, M., Long, A. Y., & Gurzynski - Weiss, L. (2017). Task complexity, language - related episodes, and production of L2 Spanish vowels. Studies in Second Language Acquisition, 39 (2), 347 - 380. Southwood, M. H., & Flege, J. E. (1999). Scaling foreign accent: Direct magnitude estimation versus interval scaling. Linguistics & Phonetics, 13 (5) , 335 - 349. Staples, S., & Biber, D. (2015). Cluster anal y sis. In L. 
Plonsky (Ed.), Advancing quantitative methods in second language research (pp. 243 - 274). New York: Routledge. Storch, N. (2002). Pattern s of interaction in ESL pair work. Language Learning, 52 , 119 58. Sueyoshi, A., & Hardison, D. (2005). The role of gestures and facial cues in second language listening comprehension. Language Learning, 55 (4) , 661 - 669. Suvorov, R. (2011). The effects on context visuals on L2 listening comprehension . Cambridge ESOL: Research Notes, 45, 2 - 7. 148 Suvorov, R. (2015). The use of eye tracking in research on video - based second language (L2) listening assessment: A comparison of context videos and content videos. La nguage Testing, 32 (4), 463 - 483. Swain, M., & Lapkin, S. (1998). Interaction and second language learning: Two adolescent French immersion students working together. The Modern Language Journal, 82 (3) , 320 - 337. Thompson, G. L., Cox, T. L., & Knapp, N. (2016). Comparing the OPI and the OPIc: The effect of test method on oral proficiency scores and student preference. Foreign Language Annals, 49 , 75 - 92. Thompson, I. (1991). Foreign accents revisited: The English p ronunciation of Russian immigrants. Language Learning, 41 (2), 177 - 204. Trofimovich, P., & Baker, W. (2006). Learning second language suprasegmentals: Effects of L2 experience on prosody and fluency characteristics of L2 speech. Studies in Second Languag e Acquisition, 28 , 1 - 30. Trofimovich, P., & Isaacs, T. (2012). Disentangling accent from comprehensibility. Bilingualism: Language and Cognition , 15 (4) , 905 - 916. VanPatten, B., & Williams, J. (2015). Theories in second language acquisition: An introduct ion . New York: Routledge. Varonis, E. M., & Gass, S. M. (1982). The comprehensibility of nonnative speech. Studies in Second Language Acquisition, 4 (2) , 114 - 136. Wagner, E. (2007). Are they watching? Test - taker viewing behavior during an L2 video listeni ng test. Language Learning and Technology, 11 , 67 86. Wagner, E. (2008). 
Video listening tests: What are they measuring? Language Assessment Quarterly, 5 (3) , 218 - 243. Walker, R. (2010). Teaching the pronunciation of English as a lingua franca . Oxford: Ox ford University Press. Warner, R. M. (2008). Applied statistics: From bivariate through multivariate techniques . London: SAGE Publications. Winke, P. (2014). Testing hypotheses about language learning using structural equation modeling. Annual Review of Applied Linguistics, 34 , 102 - 122. Winke, P., & Gass, S. (2013). The influence of L2 experience and accent familiarity on oral proficiency rating: A qualitative investigation. TESOL Quarterly, 47 (4) , 762 - 789. 149 Winke, P., Gass, S., & Myford, C. (2013). Rate rating oral performance. Language Testing, 30 (2) , 231 - 252. contributions of F0 and duration. Speech Communi cation, 55 (3) , 486 - 507. Xie, X., & Fowler, C. A. (2013). Listening with a foreign - accent: The interlanguage speech intelligibility benefit in Mandarin speakers of English. Journal of Phonetics, 41 (5) , 369 - 378. Yan, X., & Ginther, A. (2018). Listeners and raters: Similarities and differences in evaluation of accented speech. In O. Kang & A. Ginther (Eds.), Assessment in second language pronunciation (pp. 67 - 88). New York: Routledge. Young, R. F. (2011). Interactional competence in language learning, teach ing, and testing. In E. Hinkle (Ed.), Handbook of research in second language learning (Vol. 2, pp. 426 - 443). New York, NY: Routledge. Yule, G., & Macdonald, D. (1990). Resolving referential conflicts in L2 interaction: The effect of proficiency and inter active role. Language Learning, 40 (4) , 539 - 556.