DIAGNOSING SECOND LANGUAGE PRONUNCIATION

By

Daniel Richard Isbell

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Second Language Studies – Doctor of Philosophy

2019

ABSTRACT

DIAGNOSING SECOND LANGUAGE PRONUNCIATION

By

Daniel Richard Isbell

Pronunciation presents a significant, persistent challenge to second language (L2) learners (Derwing & Munro, 2015; Piske, McKay, & Flege, 2001). Fortunately, pronunciation instruction works (Lee, Jang, & Plonsky, 2015). However, pronunciation receives relatively little attention in language classrooms. Further complicating the matter are classrooms with learners from diverse linguistic backgrounds and/or differing levels of pronunciation ability, which may impact the effectiveness of one-size-fits-all whole-class pronunciation instruction (e.g., Isbell, Park, & Lee, 2019). Diagnostic language assessment (Alderson, 2005; Alderson, Brunfaut, & Harding, 2014), which prioritizes identifying the specific strengths and weaknesses of learners, is a potentially useful approach to addressing individuals’ pronunciation needs.

Addressing these issues, I developed a new diagnostic tool for segmental L2 Korean pronunciation called the Korean Pronunciation Diagnostic (KPD). The KPD consists of two sections, production and perception, each with two tasks that tap into phonological knowledge and abilities. KPD feedback includes a list of a learner’s most-difficult phonemes to prioritize in instruction and accuracy scores for production and perception of all phonemes. To evaluate the quality and usefulness of the test, I constructed a validity argument (Kane, 2013) for the interpretation and use of KPD scores, which included inferences on the operationalization of relevant theory, evaluation of observations, generalization of scores, explanation of scores with respect to underlying theory, extrapolation of scores to general language use, utilization of feedback by stakeholders, and the usefulness and impact of applying scores.

I sought support for these inferences from two main sources: field testing with 198 L2 Korean learners and interviews with 21 learners and one Korean language teacher. Field testing participants completed a background questionnaire, pronunciation self-assessment, independent speaking task, the KPD, and a standardized measure of oral proficiency. Interview participants completed an initial semi-structured interview where they received their KPD score report; 14 learners completed a follow-up interview approximately 3 months later where they discussed recent pronunciation learning activity and took the KPD again. I used several quantitative techniques, including measurement analyses (classical test theory and Rasch), correlations, and cluster analysis, to analyze the field testing data. I analyzed interview data qualitatively.

Support for the operationalization, generalization, and explanation inferences was strong, supporting the interpretation of KPD scores as strengths and weaknesses in the production and perception of Korean phonemes. Support for the extrapolation inference was positive but limited. Correlations between KPD scores and learner self-assessments were positive but not large, as learners had limited awareness of their fine-grained pronunciation abilities. Similarly, the KPD’s discrete and delimited measurements of phoneme accuracy had limited overlap with pronunciation in spontaneous, meaning-focused speech.
The utilization inference was well-supported, though improvements to the KPD score report could further enhance stakeholder interpretation of results. Finally, positive but limited evidence for the usefulness and impact of the KPD was found: Findings suggest that learner application of KPD results has the potential to support pronunciation development, but this is conditional on learner effort. I determined that more evidence is needed to sufficiently support this inference. Overall, the interpretation and use of KPD scores is supported, but future development and research efforts should focus on the effective application of the KPD’s diagnostic feedback.

Copyright by
DANIEL RICHARD ISBELL
2019

For Kyujin.

ACKNOWLEDGEMENTS

This dissertation is the capstone of one incredible year. In this past year, I temporarily relocated to South Korea, became a father, started data collection, navigated the academic job market, finished data collection, completed analyses, finished writing this, and returned to the United States (in roughly that order). I wouldn’t have been able to do this without lots of help.

First, this dissertation was made possible thanks to the financial support of the Fulbright U.S. Student program, an Educational Testing Service TOEFL Doctoral Dissertation Research Support Grant, and several sources at MSU: a Research Enhancement Award from the Graduate College, a Dissertation Completion Fellowship from the College of Arts and Letters and the Graduate College, and funds from the College of Arts and Letters and the Second Language Studies program. Additionally, the Asian Studies Center at MSU and the U.S. Department of Education provided me with a Foreign Language and Area Studies Summer Fellowship to study Korean.

I have benefited immensely from Dr. Paula Winke’s guidance throughout my doctoral studies. Paula was not just an extremely knowledgeable expert in the field of language testing perfectly suited to chair this dissertation. She was also the most positive, supportive advisor a graduate student could have. I learned so much about language assessment, the language testing industry, doing research, writing about research, and applying for funding from Paula, but my greatest takeaway from her mentorship is how to be a good mentor. Through her example, I learned about connecting people with opportunities, promoting the work and talents of others, and supporting someone both as a scholar and as a well-rounded human being. My go-to heuristic when mentoring students in the future will be the question “What would Paula do?”

I wish to express my gratitude to the other members of my committee. Complementing Paula’s assessment knowledge, I was able to assemble a dream team of topical expertise perfectly aligned to this dissertation: Dr. Susan Gass (SLA, and so much more), Dr. Debra Hardison (L2 pronunciation and speech perception), Dr. Junkyu Lee (L2 pronunciation, language learning and research environment in Korea), and Dr. Shawn Loewen (instructed SLA, research methodology). Advice I received from my committee was indispensable. I am especially appreciative of Debra’s perspective and feedback in the early stages of developing my ideas and Junkyu’s support and guidance during my time in Korea. I am grateful to Sue and Shawn for the opportunity to collaborate on other projects during my time at MSU; I learned a great deal about doing research that helped me carry out this project. Many others have helped me along the way.
Thank you to my fellow SLS students for being great colleagues and friends, and especially Jin Soo Choi, Kathy Minhye Kim, Susie Kim, Jongbong Lee, Shinhye Lee, Jungmin Lim, and Myeongeun Son for their generous feedback on instruments and help with trialing. Thanks to Dustin Crowther for always being available to talk L2 pronunciation and bounce ideas around, to Dr. Jai Ok Shim and Heidi Little at the Korean-American Educational Commission and Keehye Shin at Hankuk University of Foreign Studies for logistical support, and to HUFS TESOL graduate students Yerin An, Haewon Kim, Sohee Lee, Minchae Shin, and YounJu Yoo for coding and transcription assistance.

Last and certainly not least, I want to thank my family. I am thankful to my in-laws for their hospitality and support over the last year in Korea. To my son, Euan: Thank you for remembering me after I came back from trips, and thanks for being a good sleeper! It meant a lot to me. To my wife, Kyujin: Thank you for being Euan’s mom, thank you for being my partner in life, and thank you for being my language expert. I couldn’t have done this without you.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
INTRODUCTION
CHAPTER 1: DIAGNOSTIC LANGUAGE ASSESSMENT
  What is Diagnostic Language Assessment?
  Operational Examples of DLA
  Some Key Concerns in DLA
    Practicality
    Grain Size and Score Reporting: How Detailed Should Diagnosis Be?
    Measurement Models and Techniques
    Self-Assessment
  Validity in DLA
CHAPTER 2: DIAGNOSING SECOND LANGUAGE PRONUNCIATION
  Why Diagnose L2 Pronunciation?
  What is L2 Pronunciation?
    The Linguistic Basis of Intelligible Pronunciation
    The Cognitive Basis of Pronunciation
  The Developmental Basis of L2 Pronunciation
    Age
    Cross-Linguistic Influence
    Experience
    Instruction
  Research on L1 and L2 Korean Phonological Development
  The Goal: Diagnosing L2 Korean Pronunciation
CHAPTER 3: DESIGN AND DEVELOPMENT OF THE KOREAN PRONUNCIATION DIAGNOSTIC
  Design
    Test Purpose
    Appropriate Uses
    Structure and Item Specifications
      Production Tasks
        Picture Naming
        Nonword Reading
      Perception Tasks
        Pronunciation Judgment
        Nonword Identification
    Item Writing
    Scoring
    Score Reports
  Development
    Alpha Version Piloting
      Findings and Revisions
    Beta Version Piloting
      Findings
        Developer Observations and Scorer Feedback
        Native Speaker Results
        Task 1 – Picture Naming: Analysis of Non-Target Elicited Words
        Reliability
        Item Statistics
        Score Reporting
      Revisions Leading to Operational Version
  Conclusion
CHAPTER 4: METHODS
  Participants
    Field Testing
      Learners
      Native Speakers
    KPD Production Task Scoring Reliability Study
    Interview Study
  Materials
    Language Background Questionnaire
    Pronunciation Self-Assessment
    Independent Speaking Task
    Elicited Imitation Test
    Semi-Structured Interviews
  Procedures
  Analyses
CHAPTER 5: MEASUREMENT
  Research Questions
  Analysis Details
    Measurement Models
    Two Statistical Approaches to Measurement
    Classical Test Theory Analyses
    Rasch Analyses
    Reliability Analyses
    Correlations
  Results
    Measurement Summary
      CTT Observed Scores
      Rasch Models
        Production Items
        Perception Items
        Production Parcels
        Perception Parcels
      Native Speakers
    Reliability
      Internal Consistency
      Production Items – Inter-Scorer Agreement
      Production Parcels – Inter-Scorer Reliability
      Production Parcels – Identification of Diagnostic Weaknesses across Scorers
    Item Analyses
      CTT Item Analyses
        Individual Items
        Parcels
      Rasch Item Analyses
        Individual Items
        Parcels
      Native Speakers
    Internal Structure
      Production and Perception Total Score Correlations
      Task Total Correlations
      Production and Perception Phoneme Parcel Correlations
  Discussion
    RQ1a: How Reliable is the KPD?
    RQ1b: How Reliably are Production Items Evaluated by Different Scorers?
    RQ2a: What is the Internal Structure of Test Tasks?
    RQ2b: To What Extent Do Item Difficulty Hierarchies Align with Expectations?
    Additional Considerations
CHAPTER 6: PRONUNCIATION PROFILES
  Research Question
  Analysis Details
    Cluster Analysis
    Data Standardization
  Results
    Production Profiles
      Determining the Number of Clusters
      Cluster Descriptions
    Perception Profiles
      Determining the Number of Clusters
      Cluster Descriptions
    Profiles, L1, and Proficiency
  Discussion
CHAPTER 7: EXTERNAL RELATIONSHIPS
  Research Questions
  Analysis Details
    Oral Proficiency
    Pronunciation in Spontaneous Speech
    Self-Assessment
  Results
    Relationship between KPD Results and Oral Proficiency
    Relationship between KPD Results and Pronunciation in Spontaneous Speech
    Relationship between KPD Results and Self-Assessments
      Summary of Learner Self-Assessments
      Phoneme-Level Differences between KPD Results and SA
      Correlations between KPD Results and SA
      Agreement between KPD Diagnostic Flags and SA
  Discussion
    RQ4: To what extent do KPD results show an expected relationship with Korean oral proficiency?
    RQ5: To what extent do results reflect difficulties test-takers show in spontaneous, meaning-focused speech?
    RQ6: To what extent do results reflect self-assessments of pronunciation ability and difficulties?
CHAPTER 8: INTERPRETATION AND USE
  Research Questions
  Methods
    Interviewees
    Score Reports
    KPD Retesting
    Analysis of Interview Data
  Findings
    Learner Understanding of Results and Potential Application
      Interpretation
      New Information
      Potential Application
    A Teacher’s Perspective
      Interpretation
      New Information, Gaps, and Incongruencies
      Potential Application
    Learner Utilization and Impact
      Changes in Production and Perception
      Application of KPD Results
      Perceptions of Change
  Discussion
    RQ7: How do (a) teachers and (b) learners understand KPD score reports? To what extent do they learn anything new from KPD score reports?
    RQ8: Do learners report any changes in their self-study routines and/or their attention to phonological form in formal or informal learning situations?
    RQ9: Do learners show improvements (a) overall and/or (b) in weak areas after receiving and applying KPD feedback?
    Additional Considerations
CHAPTER 9: SUMMARY OF FINDINGS AND EVALUATION OF THE VALIDITY ARGUMENT
  Summarizing the KPD Validity Argument
    Operationalization Inference
    Evaluation Inference
    Generalization Inference
    Explanation Inference
    Extrapolation Inference
    Utilization Inference
    Test Usefulness & Impact Inference
  Evaluation of the KPD Validity Argument
  Conclusion
CHAPTER 10: DISCUSSION & CONCLUSION
  Discussion on Diagnosing Second Language Pronunciation
    Situating the KPD in L2 Pronunciation and DLA
    Important Questions and Tentative Answers
    Room for Expansion
    Towards an Interface between Pronunciation Instruction and Diagnostic Assessment
  Implications for Diagnostic Language Assessment
  Final Thoughts
APPENDICES
  APPENDIX A: KPD Table of Specifications
  APPENDIX B: KPD Item Specifications
  APPENDIX C: KPD Production Task Scoring Sheet
  APPENDIX D: Scoring Guidelines for KPD Production Tasks
  APPENDIX E: Language Background Questionnaire
  APPENDIX F: Pronunciation Self-Assessment
  APPENDIX G: Independent Speaking Task
  APPENDIX H: Korean EIT Directions and Practice Items
  APPENDIX I: Interview Protocols
  APPENDIX J: Item Statistics
REFERENCES

LIST OF TABLES

Table 2.1 Korean Phoneme Inventory
Table 3.1 KPD Design Summary
Table 3.2 KPD Scoring Overview
Table 3.3 Initial KPD Design
Table 3.4 Alpha Pilot Participants
Table 3.5 KPD Beta Design Summary
Table 3.6 KPD Beta Learner Summary Statistics
Table 3.7 KPD Beta Items with Incorrect Responses from Korean NSs
Table 3.8 Summary of KPD Beta Task 1 – Picture Naming Non-Target Responses
Table 3.9 KPD Beta Task 1 Items which Elicited Non-Target NS Responses
Table 3.10 KPD Beta Task 1 Items with Frequent Non-Target Learner Responses
Table 3.11 Reliability of the KPD Beta
Table 3.12 KPD Beta Items Flagged for Potential Revision
Table 4.1 Field Testing Sample Characteristics: Demographic Categories
Table 4.2 Field Testing Sample Characteristics: Age and Exposure
Table 4.3 Self-Assessment of Macroskills
Table 4.4 Korean Learning, Use, and Motivation
Table 5.1 Summary of KPD Learner Scores
Table 5.2 Rasch Measurement Summary for Production Items
Table 5.3 Rasch Measurement Summary for Perception Items
Table 5.4 Rasch Measurement Summary for Production Parcels
Table 5.5 Rasch Measurement Summary for Perception Parcels
Table 5.6 Summary of NS KPD Scores
Table 5.7 Internal Consistency of the KPD
Table 5.8 Rasch Person Reliability Estimates for the KPD
Table 5.9 Inter-Scorer Agreement for Individual Production Items
Table 5.10 Inter-Scorer Reliability for Item Parcel Scores
Table 5.11 Inter-Scorer Reliability/Agreement Indices for all Parcel Scores and Diagnostic Flags
Table 5.12 Inter-Scorer Agreement for Diagnostic Flags
Table 5.13 Production Parcel Statistics
Table 5.14 Perception Parcel Statistics
Table 5.15 Correlations Among KPD Task Sum Scores
Table 5.16 Phoneme Production and Perception Parcel Spearman Correlations
Table 6.1 Phoneme Production Mean Accuracy and Diagnostic Flag Proportion by Cluster
Table 6.2 Phoneme Perception Mean Accuracy and Diagnostic Flag Proportion by Cluster
Table 6.3 L1 Composition of Phoneme Production Clusters
Table 6.4 Oral Proficiency of Phoneme Production Clusters
Table 6.5 L1 Composition of Phoneme Perception Clusters
Table 6.6 Oral Proficiency of Phoneme Perception Clusters
Table 6.7 Cross-Tabs of Production and Perception Cluster Membership
Table 7.1 Average Production and Perception Phoneme Parcel Accuracy by Oral Proficiency Quantiles
Table 7.2 Correlations between Phoneme Production, Perception, and Oral Proficiency
Table 7.3 Comparison of KPD Results and Independent Speaking Productions
Table 7.4 Learner Self-Assessment Results: Phoneme/Item-Level Descriptive Statistics
Table 7.5 Differences between KPD Results and Learner Self-Assessments
Table 7.6 Correlations between KPD Scores and SA for each Phoneme
Table 7.7 Summary Statistics for KPD Flagged Phonemes and SA Agreement
Table 8.1 Interviewees
Table 8.2 Multiple Perspectives on Pronunciation Difficulties
Table 8.3 Group-Level Summary of Changes in KPD Production and Perception Scores
Table 8.4 Individual Summaries of Changes in KPD Production Scores and Learning Activity
Table A1 KPD Table of Specifications
Table J1 KPD Production Item Statistics
Table J2 KPD Perception Item Statistics

LIST OF FIGURES

Figure 1.1. A series of inferences that typify validity arguments.
Figure 2.1. Lower-level listening processes, based on Field (2013, p. 97).
Figure 2.2. Lower-level speaking processes, based on Field (2011, p. 77).
Figure 2.3. A proposed validity argument for using the KPD to inform learning and instruction.
Figure 3.1. Diagram of a KPD score report.
Figure 3.2. KPD Alpha piloting procedures.
Figure 3.3. Early draft of KPD score report.
Figure 3.4. KPD Beta piloting procedures.
Figure 3.5. KPD Beta score report.
Figure 4.1. Structure of interviews.
Figure 5.1. Histograms showing the distributions of sum scores for (A) all dichotomous KPD items, (B) all production KPD items, (C) all perception KPD items, and (D) all KPD tasks.
Figure 5.2. Histograms of average accuracy scores across all phonemes in (A) production and (B) perception.
Figure 5.3. PCA of residuals for production items.
Figure 5.4. Test information function for production items.
Figure 5.5. PCA of residuals for perception items.
Figure 5.6. Test information function for perception items.
Figure 5.7. PCA of residuals for production parcels.
Figure 5.8. Test information function for production parcels.
Figure 5.9. PCA of residuals for perception parcels.
Figure 5.10. Test information function for perception parcels.
Figure 5.11. Histograms of item agreement indices for individual items based on all seven scorers.
Figure 5.12. Average accuracy (inverse of difficulty) for each phoneme parcel on the production (y-axis) and perception (x-axis) sections of the KPD.
Figure 5.13. Wright maps for the KPD (A) production (Task 1 and Task 2) and (B) perception (Task 3 and Task 4) individual items.
Figure 5.14. Rasch item difficulty measures for each phoneme parcel on the production (y-axis) and perception (x-axis) sections of the KPD.
Figure 5.15. Visual summary of production parcel difficulties (A) and category thresholds (B).
Figure 5.16. Visual summary of perception parcel difficulties (A) and category thresholds (B).
Figure 5.17. Item information and partial-credit step probability plots for production parcels.
Figure 5.18. Item information and partial-credit step probability plots for perception parcels.
Figure 5.19. Scatterplot of production and perception raw total scores.
Figure 5.20. Scatterplot of production and perception parcel average accuracy scores.
Figure 6.1. HCA dendrogram depicting suggested clustering of test-takers according to production parcel scores.
Figure 6.2. Plot of within-cluster sum of squares for k = 1..10 clusters based on production parcel scores.
Figure 6.3. Gap statistic plot for k = 1..10 clusters based on production parcel scores.
Figure 6.4. Plot of clusters along the first two principal components of the production parcel data.
Figure 6.5. Heatmaps of phoneme production mean accuracy (A) and diagnostic flag proportion (B) by cluster.
Figure 6.6. HCA dendrogram depicting suggested clustering of test-takers according to perception parcel scores.
Figure 6.7. Plot of within-cluster sum of squares for k = 1..10 clusters based on perception parcel scores.
Figure 6.8. Gap statistic plot for k = 1..10 clusters based on perception parcel scores.
Figure 6.9. Plot of clusters along the first two principal components of the perception parcel data.
Figure 6.10. Heatmaps of phoneme perception mean accuracy (A) and diagnostic flag proportion (B) by cluster.
Figure 7.1. Distribution of EIT scores.
Figure 7.2. Scatterplots of the relationship between EIT scores and (A) average production phoneme accuracy and (B) average perception phoneme accuracy.
Figure 7.3. Mean production and perception phoneme accuracy across oral proficiency quantiles.
Figure 7.4. Mapping average learner accuracy for production and perception.
Figure 7.5. Relationships among average KPD scores and SA.
Figure 7.6. Scatterplots of KPD score and SA for each phoneme in (A) production and (B) perception.
Figure 7.7. Mapping learner discrimination of phoneme difficulty for production and perception.

KEY TO ABBREVIATIONS

ACTFL: American Council on Teaching Foreign Languages
ANOVA: Analysis of Variance
ASR: Automatic Speech Recognition
CAH: Contrastive Analysis Hypothesis
CDA: Cognitive Diagnostic Assessment
CDM: Cognitive Diagnostic Model
CEFR: Common European Framework of Reference
CTT: Classical Test Theory
DLA: Diagnostic Language Assessment
EIT: Elicited Imitation Test
F0: Fundamental Frequency (pitch)
FL: Functional Load (for Foreign Language, see L2)
HCA: Hierarchical Cluster Analysis
HVPT: High-Variability Phonetic Training
ICC: Intraclass Correlation Coefficient
IELTS: International English Language Testing System
IIF: Item Information Function
IPA: International Phonetic Alphabet
IRT: Item Response Theory
ISLA: Instructed Second Language Acquisition
KFL: Korean as a Foreign Language
KSL: Korean as a Second Language
KPD: Korean Pronunciation Diagnostic
L1: First Language
L2: Second Language (includes Foreign Language)
L3: Third Language (L3+ = third or later language)
LBQ: Language Background Questionnaire
NNS: Non-Native Speaker
NS: Native Speaker
OPIc: Oral Proficiency Interview – Computer
PAM: Perceptual Assimilation Model
PCA: Principal Components Analysis
PCM: (Rasch) Partial Credit Model
PTE: Pearson Test of English
SA: Self-Assessment
SAT: Skill Acquisition Theory
SD: Standard Deviation
SLA: Second Language Acquisition
SLM: Speech Learning Model
TA: Teaching Assistant
TIF: Test Information Function
TOEFL: Test of English as a Foreign Language
TOPIK: Test of Proficiency in Korean

INTRODUCTION

Second language (L2) pronunciation is a critical factor in the communicative success of L2 speakers.
Without intelligible pronunciation, listeners experience greater difficulty (Lee, 2017a) and may fail to fully understand speakers (Kang, Thomson, & Moran, 2018), with communication breakdowns likely to occur (Jenkins, 2002; Loewen & Isbell, 2017; Matsumoto, 2011). Even when a speaker’s pronunciation is largely intelligible, poor pronunciation can make listening a more difficult, effortful task (Crowther et al., 2015; Kang, Rubin, & Pickering, 2010; Saito, Trofimovich, & Isaacs, 2017).

Further compounding the gravity of unintelligibility problems is the fact that pronunciation development presents a considerable challenge to language learners. For one, out of all aspects of second language competence, pronunciation appears to be the most susceptible to age-related effects (Long, 2013). Simply put, it is extremely unlikely for learners with a post-puberty age of onset to acquire native-like pronunciation. And although a learner’s other languages can be an asset in learning some aspects of a new second language, already known languages are a strong influence on L2 pronunciation and can be a source of confusion when learning new L2 speech sounds (Best & Tyler, 2007; Flege, 1995). Beyond an initial period of rapid familiarization with the phonological system of a new L2, some researchers have argued that subsequent pronunciation development is limited and/or unlikely to occur as a product of continued, naturalistic language use (Derwing & Munro, 2015).

Fortunately, L2 pronunciation is amenable to instruction (Lee, Jang, & Plonsky, 2015; Saito, 2012; Pennington, 1998; Thomson & Derwing, 2015), and native-like pronunciation is not necessary for an L2 speaker to be broadly intelligible and highly comprehensible (Derwing, Munro, & Wiebe, 1998; Jenkins, 2000; Munro & Derwing, 1995; Levis, 2005). Lee et al.’s (2015) meta-analysis of L2 pronunciation instruction studies found beneficial effects to be comparable in magnitude to instructional treatments targeting other aspects of L2s, such as vocabulary and grammar. However, compared to vocabulary and grammar, pronunciation often receives little attention in L2 classrooms (Foote, Holtby, & Derwing, 2011) or language textbooks (Derwing, Diepenbroek, & Foote, 2012). Language teachers have reported low levels of confidence in teaching pronunciation (Derwing & Munro, 2015), owing to a lack of background knowledge in phonology and pronunciation teaching methods (Murphy, 2014). When it does occur in language classrooms, pronunciation instruction commonly takes a one-size-fits-all approach, where a group of learners is instructed on several features selected based on the intuitions of a teacher, researcher, or materials designer (e.g., Isbell, Park, & Lee, 2019).

While both language testing (Lado, 1961) and L2 pronunciation experts (Derwing & Munro, 2015) have offered many helpful suggestions for assessing individuals’ pronunciation difficulties, to my knowledge there are very few well-documented and researched assessment instruments or accounts of language teacher practices used to inform whole-class or individualized instruction. Some teacher-oriented books and classroom texts for L2 English pronunciation do present some helpful methods for assessing specific problems with perceiving phonological features, but their approach to assessing production involves reading aloud paragraph-length written text and free production (e.g., Celce-Murcia, Brinton, Goodwin, & Griner, 2010; Gilbert, 2005).
The former approach is potentially problematic because reading aloud is a specialized skill that differs from typical speech, requiring strong literacy and sound-symbol correspondence knowledge (Levis & Barriuso, 2012), and the latter approach is limited in the sense that there is no guarantee that features targeted for assessment will be used, or used enough times to yield reliable information. Two recent volumes on L2 pronunciation assessment (Isaacs & Trofimovich, 2017; Kang & Ginther, 2017) have done little to address this gap of identifying individuals’ pronunciation weaknesses, and virtually no attention is given to assessing learners’ pronunciation in a way that informs instruction.

One potential avenue for helping teachers and learners make more informed and confident instructional decisions about pronunciation is Diagnostic Language Assessment (DLA) (Alderson, 2005; Alderson, Brunfaut, & Harding, 2014; Lee, 2015). Situated in a larger movement calling for assessment practices that directly support learning in language assessment (Turner & Purpura, 2015) and educational assessment more broadly (Pellegrino, DiBello, & Goldman, 2016), proponents of DLA emphasize providing detailed, instructionally-useful information on what a learner can and cannot do through well-constructed diagnostic instruments and carefully thought-out procedures. With detailed knowledge of what learners know and do not know, teachers or learners using DLA decide what to study and how to go about studying it.

This dissertation explores the potential of DLA to usefully inform intelligibility-focused L2 pronunciation learning. In the following chapters, I describe a project that spans the development, field testing, and validation of an instrument to diagnose L2 pronunciation called the Korean Pronunciation Diagnostic (KPD). In Chapter 1, I review literature on DLA and validity in language testing to establish guiding principles for diagnosing pronunciation and a framework for examining the validity of the diagnostic process. In Chapter 2, I make the case for a pronunciation diagnostic and summarize theory and research on L2 pronunciation that form the grounds for the design of the KPD. Chapter 2 culminates with a prospective validity argument for the KPD which guided the validation research I carried out. Chapter 3 presents, in detail, the design of the KPD and chronicles its development through two rounds of pilot testing. Chapter 4 describes the methodology of this study, detailing instruments used other than the KPD and providing an overview of procedures for the validation research reported on in Chapters 5 through 8. In Chapter 5, I present the results of measurement analyses of the KPD based on a sample of 198 L2 Korean test-takers. In Chapter 6, I describe learner pronunciation profiles that emerged from a cluster analysis of KPD phoneme-level scores. In Chapter 7, I present the results of analyses that compare KPD scores to three external measures: a measure of overall Korean oral proficiency, learner segmental phonological errors in spontaneous speech, and learner self-assessments of Korean pronunciation abilities. In Chapter 8, I draw on interviews with 21 Korean learners and a teacher who taught two of those learners to explore how these key stakeholders interpret and apply KPD results. I also report on exploratory analyses for a subset of 14 learners who took the KPD again after 2-4 months of time in which they had an opportunity to engage in pronunciation learning activity.
In Chapter 9, I review the results of the previous four chapters holistically through an explication and critical review of the KPD’s validity argument. Finally, in Chapter 10, I close the dissertation with a discussion of implications of the KPD’s development and validation, followed by discussion of broader implications for diagnosing L2 pronunciation and diagnostic language assessment.

CHAPTER 1: DIAGNOSTIC LANGUAGE ASSESSMENT

In this chapter, I provide a broad overview of Diagnostic Language Assessment (DLA). I begin by defining DLA and situating it in relation to other types of assessments. Here I include examples of several DLA instruments. Next, I raise and discuss key concerns in DLA theory and practice that are of particular relevance to this dissertation. Finally, I discuss argument-based validity as a means of (a) establishing support for using tests and (b) setting a validation research agenda for DLA.

What is Diagnostic Language Assessment?

Diagnostic language assessment (DLA) has the aim of uncovering a language learner’s strengths and weaknesses for the purposes of informing instruction (Alderson et al., 2014). In this sense, DLA can be considered a type of formative assessment, which is concerned with student progress toward achieving the goals or target outcomes of an educational curriculum. Indeed, for many decades now, teachers of languages and other subjects have been using formative assessments to inform the teaching of their courses and to help individual students, often through the provision of individualized feedback. One way that DLA can be distinguished from other types of formative assessment is its scope. Whereas many formative assessments are used to gauge student progress toward completing an in-progress task or achieving a near-term curricular outcome, DLA is concerned with a learner’s overall level of ability and finding their weakest links in the use of that ability. Further, DLA also has an orientation to the future: DLA should yield information that is useful for subsequent instruction.

Before proceeding further, it is important to clarify the use of the term diagnostic in reference to DLA and other types of assessments. Unlike diagnostic tools used by psychologists, speech-language pathologists, child development experts, and medical professionals, the diagnosis yielded via DLA is not clinical in nature nor related to foundational cognitive or educational development. Having weaknesses in L2 skills and knowledge is, by and large, not pathological, and most adult L2 learners have successfully and fully acquired one language (and, often, literacy in that language) already. However, almost every L2 learner at some point experiences having some weaknesses or gaps in their L2 knowledge or skills that hamper the effective use of the L2, and the treatments that may be prescribed as a result of a DLA are simply commonly-used (but principled) teaching and learning activities. That being said, DLA theory does draw on other forms of diagnosis (Alderson et al., 2014; Alderson et al., 2015) and, I argue, may draw more directly on certain types of language-related diagnostic techniques and instruments used in other fields such as speech-language pathology and educational psychology.

DLA has a clear emphasis on identifying the weaknesses of L2 learners (Alderson et al., 2014), as there is an obvious connection between these weaknesses and instructional planning.
Still, determining learner strengths is not without some instructional utility: Instruction targeting mastered knowledge or proficient subskills can be confidently skipped over in favor of focusing on more pressing aspects of language competence. The reasons for using specialized assessment procedures to examine learner strengths and weaknesses are not new. Consider Lado (1961) on the challenges language teachers face in assessing their students’ L2 pronunciation weaknesses:

Informal contact with students, even the extended contact of the language classroom, is not very effective as a way to test a student’s pronunciation. From this extended contact one can say that one student has better pronunciation than another in rough terms, but when asked to list the specific pronunciation problems of a particular student of ours we will remember only the very salient mispronunciations and will not as a rule be able to come anywhere near completeness. (Lado, 1961, p. 80)

The challenges Lado outlined over 50 years ago are still relevant in language classrooms and other instructional contexts today. As a means of addressing these challenges, Alderson, Brunfaut, and Harding (2014) argued for five guiding principles of DLA that, if followed, can allow practitioners to bridge the gap between rough comparisons of ability and specific understanding of individual weaknesses (paraphrased):

1. A test user ultimately diagnoses, not the test.
2. Diagnostic instruments should be targeted and discrete and provide highly-detailed information about a learner’s abilities.
3. Diagnostics should take account of multiple perspectives, including learner self-assessments.
4. DLA should involve four stages: observation, initial (informal) assessment, use of diagnostic instruments, and decision making.
5. DLA should be connected to future instruction.

Principle 1 highlights the role of the diagnostician, typically a teacher, and the role of their expertise in interpreting diagnostic information (Edelenbos & Kubarek-German, 2004) and agency in decision making. Lee (2015) added strong arguments for providing elaborate feedback (see Principle 2) and connecting results to future instruction (see Principle 5): These can be seen as essential and distinguishing components of DLA. Without these components, diagnostics are (a) unlikely to be very helpful and (b) essentially do nothing beyond what other types of tests (achievement tests, proficiency tests) already do. In the field of L2 pronunciation, Trofimovich, Isaacs, Kennedy, Saito, and Crowther (2016) provided support for Principle 3. They found that learners’ self-assessments are frequently inaccurate: Lower-ability learners overestimate their pronunciation quality, while higher-ability learners underestimate it. Trofimovich et al. suggested that using self-assessments alongside objective measures could help develop learner awareness and clarify goals for improvement. While these major principles provide important guidance for diagnosticians, previous work on DLA has elaborated in greater detail how diagnostic tests might be designed. The following suggestions from Alderson (2005) further informed specifications for diagnostic tests:

• Diagnostic tests are based on a detailed theory of language development.
• Diagnostic tests are likely to be discrete and focused on specific elements rather than global language abilities.
• Diagnostic tests are likely to focus on lower-level (i.e., bottom-up) language skills rather than higher-order integrated skills.
These three suggestions from Alderson lay a type of foundational blueprint for designing new diagnostic instruments: a starting point, in a sense. The suggestions also provide a way to identify already-existing diagnostic tests, which may be important because not all diagnostic tests are labeled as such.

Operational Examples of DLA

While DLA has been theorized to a considerable degree, Alderson et al. (2015) noted that few specifically-tailored diagnostic language tests exist. More commonly, existing proficiency tests have been retrofitted for diagnostic purposes (e.g., Lee & Sawaki, 2009; Jang, 2009). Jang (2009) is arguably the quintessential example of this approach to diagnosis and worthy of additional consideration here due to its topical relevance and Jang’s rigor of analysis and frankness in interpretation of her findings. In her paper, Jang (2009) describes the application of a measurement technique called cognitive diagnostic assessment (CDA) to the reading section of LanguEdge, an early prototype of the TOEFL iBT. Through a rigorous, iterative analysis of LanguEdge items based on the judgments of a team of experts, Jang identified 9 subskills of reading comprehension that were tapped into by the various test items. Through the application of a sophisticated measurement model, Jang was able to estimate test-taker mastery of these 9 subskills and provide score reports with considerably more information on test-takers’ reading abilities compared to just having a single reading ability score. Jang also collected data on test-taker self-assessments and conducted classroom case studies where she interviewed learners and teachers. While the CDA approach showed promise, Jang identified several obstacles to meaningfully diagnosing learners, such as some subskills being represented by too few items, very large (mostly > .8) correlations among subskills (questioning their separability), difficulty measuring very low and very high ability test-takers, issues with subskill labeling and divisibility of subskills across items, and questionable applicability of subskill feedback to instruction (though the awareness-raising capacity of the subskill feedback was noted positively by teachers). Jang connected most of these difficulties to the design of the test: LanguEdge was built as a proficiency test, not a diagnostic test.

Aside from retrofitting proficiency tests to provide more detailed feedback, other tests labeled diagnostic have ended up measuring language abilities broadly (i.e., primarily functioning as proficiency tests) and/or been mostly used for course placement decisions (e.g., DIALANG: Alderson & Huhta, 2005; DELNA: Elder & von Randow, 2008; Knoch & Elder, 2016). With the KPD, and this dissertation, I aim to put contemporary DLA theory into practice, following the principles and suggestions offered by Alderson and others.

Similar efforts to put DLA principles into practice have recently been made by Kremmel (2017). Kremmel developed an instrument that diagnoses learner levels of (written) form-meaning vocabulary knowledge, information that is useful for understanding difficulties in L2 reading comprehension. Links between diagnosis and vocabulary instruction are readily available thanks to information provided by corpus-based word frequency (Kremmel, 2016) and analyses of the lexical coverage of texts at various levels of sophistication, as the sketch below illustrates.
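To make the diagnosis-to-instruction link concrete, here is a minimal sketch of how a frequency-band coverage check of this kind might work. It is an illustration only: the tokenizer, the 1,000-word band size, and the wordlist are my own simplifying assumptions, not part of Kremmel’s or Nation’s instruments.

```python
# Minimal sketch of frequency-band lexical coverage (illustrative assumptions
# throughout; not code from any published vocabulary test).
import re
from collections import Counter

def coverage_by_band(text, freq_ranked_words, band_size=1000):
    """Cumulative proportion of text tokens covered by each frequency band."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    coverage = {}
    covered = 0
    for start in range(0, len(freq_ranked_words), band_size):
        band = freq_ranked_words[start:start + band_size]
        covered += sum(counts[w] for w in band)
        coverage[start + band_size] = covered / total  # e.g., {1000: 0.81, ...}
    return coverage
```

A teacher (or program) could then match a learner whose diagnosed vocabulary reaches the 2,000-word band to texts whose first two bands cover a high proportion of tokens (95-98% coverage figures are commonly cited comfort thresholds in this literature).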
For example, learners can be given tailored lists or spaced-repetition flashcard programs to study with independently or referred to reading materials at an appropriate lexical level to foster incidental form-meaning knowledge acquisition. Kremmel’s work in this area built more formally on the longstanding use of vocabulary size tests (e.g., Nation & Beglar, 2007) to diagnose learner weaknesses in overall receptive vocabulary knowledge and to direct students to appropriate material in extensive reading programs to promote reading and vocabulary development (Nation, 2001).

Another outstanding example of DLA focuses on the same language and skill area as this dissertation: L2 Korean pronunciation (Kim, 2006). Kim developed an instrument used for diagnosing Korean learners’ pronunciation difficulties and tracking their development over time. Kim’s diagnostic included a broad range of pronunciation phenomena, going beyond individual sounds to include learner knowledge of phonological processes (e.g., nasalization, tensification, consonant cluster simplification) and suprasegmental aspects of pronunciation. Kim, in many ways ahead of the curve in DLA, also described a cyclical process of feedback, observation, and reevaluation that occurred after the administration of her diagnostic. This diagnostic was later included in a two-volume pronunciation textbook (Choi, Kim, Park, Jin, & Park, 2009a, 2009b), and the scoring and feedback form noted relevant textbook units for different categories of pronunciation features. Despite the many strengths and innovations of Kim’s approach, all diagnostic test items consisted of word and sentence read-alouds, and learner perception was not considered when diagnosing difficulties.

The Criterion software published by the Educational Testing Service (https://www.ets.org/criterion) is another example of a diagnostic test. Criterion is a program designed to help learners improve their writing for the Test of English as a Foreign Language (TOEFL). Learners write TOEFL-style essays, which are then given an estimated (computer-generated) overall score, but more importantly, are also given detailed, computer-generated written corrective feedback that diagnoses their writing difficulties (Chapelle, Cotos, & Lee, 2015). Learners can see what their most common errors are and are given algorithm-based advice on how to address them. Teachers can also supplement the Criterion feedback received by students.

Finally, although not labeled as DLA, Dynamic Assessment (Poehner & Lantolf, 2013; Teo, 2012) shares many of the same aims and is worth considering from the perspective of DLA. In Dynamic Assessment, learner difficulties are probed via standard test tasks (e.g., reading comprehension multiple-choice questions, oral interviews). Where learners make mistakes, mediation is provided in the form of hints or other support which allow the learner to eventually arrive at a correct answer (in the case of a discrete-point test) or otherwise improve their understanding or performance. Compared to DLA, however, Dynamic Assessment is not oriented to subsequent instruction in quite the same way. There is some overlap: Dynamic Assessment collects information on what a learner can and cannot do independently, information that can be applied to curricular placement or instructional decisions. However, Dynamic Assessment also emphasizes the mediation that occurs in the assessment event as instruction, effectively melding assessment, teaching, and learning (Poehner & Lantolf, 2013).
The diagnosis of L2 abilities for instructional purposes has not been strictly confined to the field of language assessment. Quite expectedly, L2 researchers and practitioners concerned with language teaching and learning have developed tools and instructional programs to identify and address individual learner needs, and this work has often been carried out without reference (originally, at least) to the work and theory of Alderson, Y. Lee, or other key figures in DLA. Specific to L2 pronunciation, several computer programs have recently been created that, overtly or covertly, diagnose learner difficulties with phoneme perception and/or production and then adjust the content of program learning activities. The NetProfII program (https://netprof.ll.mit.edu/netprof/), developed by MIT’s Lincoln Labs for the U.S. Defense Language Institute, features vocabulary and pronunciation training that provides evaluation and feedback through automated speech recognition. This program also maintains detailed records of learner performance over extended use of the program, yielding detailed reports on phoneme accuracy ratings. Although no initial diagnostic test is available, over time learners’ difficulties with phoneme production are profiled and made available to the learner through an interactive dashboard; in theory a learner could then select words containing difficult phonemes to focus on in subsequent practice sessions.

Focusing on perception instead of production, Qian, Chukharev-Hudilainen, and Levis (2018) developed a program for English-language learners that provided adaptive high-variability phonetic training (HVPT) for English segments. They developed this program in response to several previous calls for greater personalization and efficiency in computer-based HVPT programming (Levis, 2007; Munro, Derwing, & Thomson, 2015). The program required learners to accurately discriminate phoneme contrasts in minimal pairs. For each training session, a phoneme was targeted five times, and if a learner met or exceeded 80% accuracy (i.e., 4+ out of 5 correct), the learner would ‘exit’ further training on that phoneme and instead focus on more subjectively-difficult contrasts (i.e., those responded to correctly on less than 80% of trials). From a DLA perspective, Qian et al.’s (2018) program smoothly integrated diagnosis and instructional planning.
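Because Qian et al. (2018) describe their adaptive logic so concretely (five trials per targeted contrast per session, with an 80% exit criterion), it can be sketched in a few lines. The sketch below is my own paraphrase of that rule under stated assumptions; the function names and data structures are illustrative, not the authors’ code.

```python
# Sketch of an HVPT-style adaptive exit rule: five trials per contrast per
# session; contrasts at >= 80% accuracy (4+ of 5 correct) exit training.
TRIALS_PER_SESSION = 5
EXIT_THRESHOLD = 0.8

def run_session(contrasts, present_trial):
    """present_trial(contrast) -> True if the learner discriminates correctly.
    Returns the contrasts that still need training after this session."""
    remaining = []
    for contrast in contrasts:
        correct = sum(present_trial(contrast) for _ in range(TRIALS_PER_SESSION))
        if correct / TRIALS_PER_SESSION < EXIT_THRESHOLD:
            remaining.append(contrast)  # below 4/5 correct: keep training
    return remaining
```

Looping run_session until the remaining list is empty concentrates later sessions on precisely the subjectively difficult contrasts, which is the sense in which diagnosis and instructional planning are integrated.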
Some Key Concerns in DLA

In this section, I review some key questions in DLA that are especially relevant to the present study. In some cases, these questions reflect a lack of research or development in practice, and in others the questions reflect areas of controversy.

Practicality

Practicality in assessment is an ever-present concern: Any usefulness an assessment has can be made irrelevant by untenable time, money, or expertise requirements. That said, the greater the assessment stakes, the greater the resources that are deemed reasonable. In medicine, where the stakes are extremely high, it is completely reasonable to run many expensive laboratory tests (themselves originally researched and developed at great cost) in order to understand problems underlying painful symptoms, potentially resulting in a life being saved. For learning disabilities, screening all youngsters early and referring probable cases for more detailed, time-intensive diagnosis and treatment by a trained expert can have major positive impacts on a child’s education and long-term quality of life.

For DLA, however, the stakes are generally low. This is not to say that the benefits of facile language abilities are trivial, but gaps in adult L2 ability (a) are not usually a matter of life-and-death, (b) may represent relatively minor inconveniences able to be overcome through communication strategies and/or sympathetic interlocutors, or (c) may eventually be ameliorated without specific intervention, given enough time, L2 exposure, and/or conventional instruction. Thus, the kind of extremely rigorous scientific analyses and/or technically-savvy tools and procedures available thanks to laboratory phonology/acoustic phonetics (e.g., spectrogram analysis, ultrasound), cognitive science (e.g., event-related potentials), psycholinguistics (e.g., eye-tracking, reaction time analyses), and computational linguistics (e.g., automated speech recognition, natural language processing) are rarely practical for real-world DLA due to reluctance to expend money and expert labor on the development and scoring of low-stakes assessments. Similarly, tests of considerable length (e.g., the 3+ hours and two visits necessary to complete an IELTS exam) and rigorous scoring procedures (e.g., multiple, trained human raters and a computer scoring engine on TOEFL productive tasks) are also likely to be out of acceptable practicality bounds. While Alderson et al. (2014) did not specifically say that DLA must be brief, at the very least it should be practical for teachers and learners to do and sensitive to the many time demands on language teaching and learning. Developers of diagnostic instruments and procedures should ask: What can be provided that is practical for learners and teachers? How can one maximize technical quality, or at least strike a reasonable balance between quality and resource expenditures?

Grain Size and Score Reporting: How Detailed Should Diagnosis Be?

The level of detail in information provided by test scores is a key, if not defining, feature of DLA. Clearly, a single score describing ability in a language skill area is insufficient; such a score may only be appropriate for describing global levels of proficiency. However, there is no clear guidance on what level of granularity in scores is necessary to meet the needs of DLA, i.e., identifying specific strengths and weaknesses at a level useful for instruction. The provision of a handful of subscale scores may or may not be sufficient for DLA. Many language proficiency exams, for example, provide subscores for each of the traditional four macroskills of reading, writing, listening, and speaking (e.g., the TOEFL), but this level of detail is unlikely to uncover anything but broad-stroke areas of strengths and weaknesses. Even presenting a handful of subscores in the context of assessing a more delimited area of language ability, such as reading ability (e.g., see the 9 subcomponents of reading ability in Jang, 2009), may not be sufficiently informative for understanding specific student weaknesses, nor for planning instruction.

This question about how fine-grained diagnostic language assessments should be cannot be addressed entirely by the quantity of information reported. The Pearson Test of English (PTE, https://pearsonpte.com/), a standardized test of English proficiency, provides highly-detailed score reports that feature an overall scale score, macroskill subscores, and subscores for six enabling skills (grammar, oral fluency, pronunciation, spelling, vocabulary, written discourse) that underlie performance in the skill areas (i.e., a total of 11 scores; Pearson, 2018).
Pearson (2018) avoids the word diagnostic, yet describes the enabling scores as “information about particular strengths and weaknesses of a test taker’s ability to communicate in speaking or writing” which “may be useful to determine the type of further English study” a learner should engage in to improve (p. 42). I would hazard to say that most experts would not describe the PTE as a particularly useful diagnostic instrument: a single piece of information about, say, a learner’s grammar provides little specific guidance on how or what to study. However, it may be appropriate to say that PTE scores have some diagnostic qualities. Thus, in part, the question of grain size must consider quality. Scores/subscores provided by DLA tools will likely be large in number, but must also contain diagnostically-actionable information, based on a detailed description of language ability and understanding of language development.

Although large grain-size in DLA is clearly undesirable, there may be limits on information granularity due to practicality and utilization issues. Extremely high granularity may require untenably long observation procedures or instrument designs, making use of such techniques impractical, and the stability of diagnostic classifications is likely to be lower at finer grain-size (Lee & Sawaki, 2009). Making use of information from a high-granularity diagnostic procedure may also prove challenging or otherwise overwhelming. If the grain-size of a diagnostic is keyed to the minutest details of learning theory and language, the resulting information may be too technical and/or too voluminous for learners and teachers to fruitfully apply. Imagine being a language learner (or a teacher) and being told that you (or your student) have voice onset times for word-initial stop consonants that are on average 34.11 ms too long. Without substantial training in the phonology and phonetics of the target language, it might be difficult to comprehend what that information means, much less apply it. Now imagine receiving parallel information for other acoustic qualities, such as intensity, for other syllable/word contexts, and other types of sounds. Background in phonology and phonetics aside, the sheer volume of such information would likely be overwhelming, perhaps debilitatingly so. Thus, grain-size is a Goldilocks issue for DLA practitioners: Not too large, not too small, not too vague, and not too technical; just right is what should be strived for.
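One way to picture a workable middle grain-size is a report record that is phoneme-level but selective. The sketch below is hypothetical: it loosely mirrors the kind of feedback the KPD reports (per-phoneme accuracy plus a short list of flagged weaknesses), but the field names, cutoff, and top-five limit are my own assumptions, not the KPD’s actual scoring rules.

```python
# Hypothetical mid-grain diagnostic record: detailed enough to act on,
# but summarized so the learner is not buried in acoustic minutiae.
from dataclasses import dataclass

@dataclass
class PhonemeResult:
    phoneme: str           # e.g., one Korean consonant or vowel
    production_acc: float  # proportion correct on production items
    perception_acc: float  # proportion correct on perception items

def flag_weaknesses(results, cutoff=0.5, top_n=5):
    """Return the weakest phonemes: actionable, not overwhelming."""
    weak = [r for r in results if min(r.production_acc, r.perception_acc) < cutoff]
    weak.sort(key=lambda r: min(r.production_acc, r.perception_acc))
    return [r.phoneme for r in weak[:top_n]]
```

Contrast this with the two extremes discussed above: a single overall pronunciation score (too coarse to guide study) or per-context voice onset times in milliseconds (too fine for most learners to interpret).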
In DLA, score reporting has been framed in terms of providing feedback (Alderson, 2005; Kunnan & Jang, 2010) rather than simply informing a stakeholder of a test result. This is one more way in which DLA emphasizes a connection to subsequent learning: Just like immediate corrective feedback in a classroom interaction (e.g., Saito & Lyster, 2012) or delayed feedback on written assignments (e.g., Ferris, 2010), I argue that the primary purpose of feedback to learners from a diagnostic test is to raise awareness of linguistic form in order for the learner to subsequently apply conscious attention to form in both deliberate learning activity and general language use.

Theoretically, this view of diagnostic feedback is well-aligned with SLA hypotheses and theories that suggest that learners need, or at least can benefit from, conscious attention to forms (i.e., vocabulary and/or grammar) to develop and ultimately acquire or otherwise master those forms (e.g., DeKeyser, 2017; Robinson, 1995; Schmidt, 1990, 1993; Schmidt & Frota, 1986), but perhaps not SLA theory that suggests that all learners need is implicit (i.e., unaware) learning of form in order to develop and acquire the forms (e.g., Krashen, 1982; Truscott, 1996; VanPatten & Rothman, 2015). In Schmidt’s (Schmidt, 1990, 1993; Schmidt & Frota, 1986) influential Noticing Hypothesis, it is claimed that learners must be aware of linguistic forms at a conscious level (i.e., they must notice forms). Noticing is what allows learners to direct attention to a form, which in turn promotes storage in memory and learning. Complementing this hypothesis, Robinson (1995) detailed the process by which noticing and attention to form in short-term memory is a necessary condition for storage in long-term memory. Such a process also factors into Gass and Mackey’s (2006) Interaction Hypothesis, where through input from interlocutors and interactional feedback learners’ awareness of and attention to form is promoted, facilitating acquisition. Coming from a slightly different perspective, a Skill Acquisition Theory (SAT) approach to SLA (DeKeyser, 2017) suggests that a considerable amount of practice with attention given to linguistic forms is necessary to achieve fluent, automatized skill in using them. This practice can come in the form of pre-planned instruction (e.g., a classroom activity) or learners’ own conscious monitoring of explicit knowledge (or declarative knowledge in DeKeyser’s framework) during authentic language use (e.g., daily interactions during study abroad).

More specific to the present study, learner awareness of and attention to form are known to be helpful in L2 speech learning (Guion & Pedersen, 2007; Kennedy & Trofimovich, 2010; Moyer, 2014; Saito, 2018; Thomson, 2012). It is widely accepted that explicit phonetic instruction (e.g., pronunciation instruction based on explicit description of articulation) is beneficial, with learners generally showing improvements on the phonological forms they are taught (Lee et al., 2015). However, all phonological forms cannot be taught or paid attention to all the time. This is where learner autonomy and independent use of learning strategies (or metacognitive strategies) (Moyer, 2014) also comes into play when considering the utility of diagnostic feedback: An experienced, well-trained, strategic learner who is aware of their weaknesses may be able to deliberately pursue study activities or utilize techniques that address their specific needs. In Moyer’s (2014) review of highly successful L2 phonology acquirers, she specifically points out the autonomous deployment of strategies such as “self-monitoring”, “explicit attention to accent”, and “conscious concern for accent” (p. 430), which all draw on learner awareness and attention to form. Along these lines, Kunnan and Jang (2010) suggested that for diagnostic feedback to be most useful, it should be presented in a way that encourages learners to “reset their own learning goals by breaking down goals into manageable tasks” (p. 617).
In other words, by guiding learners to the linguistic forms most in need of attention, diagnostic feedback potentially enhances the learner’s efficacy in autonomous learning and strategy deployment.

It is also worthwhile to consider DLA score reporting from the perspective of teachers or tutors. Although a teacher’s conscious attention to linguistic form is not a primary concern, a teacher’s awareness of student weaknesses can be deployed to induce or reinforce learner awareness through well-matched pedagogical responses, such as in situ corrective feedback or the deliberate provision of pronunciation learning opportunities (e.g., creating or modifying classroom activities, selecting learning materials), drawing on the teacher’s training and knowledge of pronunciation teaching (Baker, 2014; Burri, Baker, & Chen, 2017). As Alderson et al. (2015) pointed out, the whole DLA process is for naught if no party, learner or teacher, appropriately considers and then acts upon diagnostic feedback – a sentiment that would surely be agreed upon by scholars supportive (e.g., Ferris, 2010; Saito & Lyster, 2010) and critical (e.g., Truscott, 1996) of feedback in SLA and language teaching.

Some reports of stakeholder understanding of diagnostic results have been cause for considerable concern. Huhta (2010) found that language learners paid most attention to the overall proficiency level and ignored many other parts of DIALANG results. Similarly, Yang (2003) found that DIALANG test-takers compared their overall scores to their TOEFL or IELTS scores and did not substantially engage with the diagnostic information provided. Jang and Wagner (2014) emphasized that learners with different goals and motivations are likely to differ in their uptake and application of diagnostic feedback. A key question, then, is: How might DLA score reports be designed to effectively promote awareness of linguistic forms (or perhaps other relevant aspects of performance)? There does not currently appear to be a simple answer to this question. For one, the feedback of different types of diagnostic assessments will often, and perhaps necessarily, take different forms: Diagnostic feedback on L2 writing may involve annotation of learner text (e.g., from a teacher or a computer program), while diagnostic feedback on a reading test could utilize item-level hints during the test, and item-level feedback after the test. Despite these skill/content area and method considerations, there are few, if any, specific guidelines for presenting diagnostic information. Some useful advice, though vague in terms of format, comes from Alderson and colleagues (2015), who suggested that diagnostic feedback could (and perhaps should) attempt to link together weaknesses, probable causes, and next steps for learning. They also offered the following key characteristics of diagnostic feedback (p. 169):

• it is much more detailed than, for example, a reading test score;
• it is not limited to the actual errors a learner makes;
• it is based on an understanding of what probably underlies those errors; and finally,
• it is not limited to errors but also addresses what the learner could do to improve the skill involved.

When considering what the literature says about the provision of diagnostic feedback, I wonder whether, for a test designed for diagnostic purposes from the ground up, any kind of total score is necessary.
Although it is common to provide a total score (Alderson, 2005; Jang, 2009; Lee & Sawaki, 2009; Sawaki, Kim, & Gentile, 2009), I question whether the practice was simply born out of habit or just a byproduct of retrofitting proficiency tests for diagnostic purposes. This is not to say that providing an overall ability score is wholly inappropriate, but excluding any overall scores and presenting only detailed information on specific aspects of ability and suggestions for improvement could avoid the problem of learners finding a total score and ceasing further engagement with feedback. Indeed, effective feedback in classrooms is not contingent on a teacher telling a learner how good they are overall before getting into the specifics of an error or recurring difficulties.

Measurement Models and Techniques

A measurement model can be simply defined as the way scores are assigned to objects. In the case of language learning, the objects of measurement are typically L2 learners, and these learners are assigned scores on some attribute (a skill, a domain of linguistic knowledge) as the result of an assessment procedure (e.g., an interviewer’s overall judgment of a learner’s proficiency level, a conventional reading proficiency test). In several treatments of DLA, measurement is little discussed (e.g., Alderson, 2005; Alderson et al., 2015), while in others, measurement techniques are on center stage (Jang, 2005, 2009; Lee & Sawaki, 2005).

Perhaps the issue of measurement is sometimes avoided due to the potential thorniness of dimensionality in DLA. In measurement, dimensionality refers to the number of dimensions along which examinees are meaningfully compared on the basis of an assessment procedure. Most commonly, and especially in high-stakes educational and language proficiency testing, measurement is unidimensional: Learners are assigned a single score that refers to their ability along a single dimension. Unidimensional measurement is supported by well-tested and widely used techniques and analysis software familiar to many assessment practitioners. However, as previously discussed, DLA requires more than a single score in order to be truly diagnostically useful. Ideally, DLA yields multiple scores that allow for meaningful inferences on the status of subcomponents and more narrowly defined knowledge bases that influence macro abilities. Rigorous and simultaneous measurement of multiple dimensions presents a marked increase in theoretical and technical complexity and is unfamiliar territory for many language testing and assessment specialists. DLA, which will typically report multiple scores targeting different aspects of an ability, may on the surface appear to be a prime example of a multidimensional measurement opportunity. It is important to note, however, that a measurement dimension is not the same as a construct or attribute. Rather, measurement dimensions are mathematical/statistical abstractions of assessment data; the relationship between a measurement dimension and a theoretical construct must be inferred and supported by additional evidence (Reckase, 2009), such as the test content or investigation of item response processes. Because human knowledge structures and mental abilities are complex, it is possible to conceive of theoretical constructs as abstractions of complicated, multicomponent mental processes. Language ability is no exception (e.g., Bachman & Palmer, 2010).
For this reason, unidimensional measurement has occasionally been criticized as fundamentally flawed (Buck & Tatsuoka, 1998). However, a higher-level abstraction like reading comprehension, which obviously involves identifiable subcomponents such as lexicogrammatical knowledge and grapheme decoding, can be justifiably measured along a single dimension, focusing on global performance rather than attempting to directly and separately measure each relevant knowledge base and processing routine. Nonetheless, there is still a need in DLA to get information on those knowledge bases and processes. There are at least three approaches to acquiring such information in DLA: arithmetic subscore calculation, unidimensional Item Response Theory (IRT) or Rasch measurement with analysis of unexpected responses, and multidimensional measurement. These approaches differ substantially in their practicality.

Simple subscore calculation has the smallest sample size requirements (essentially none) and demands the least technical expertise. One simply defines which items on a test or other assessment tool constitute meaningful subscales and computes sum scores. These subscores can then be added up to arrive at a total score representing an individual’s overall ability. This approach is (implicitly) in line with Classical Test Theory (CTT), which posits that a person’s true ability is represented by the sum of item/task scores, plus or minus measurement error. Due to its practicality, this method may be the most common approach to gleaning information on subcomponent knowledge and skills in language assessment (Jang, 2009). Although technically simple, the definition of meaningful subscales should nonetheless be principled, based on a thorough understanding of the underlying processes and the linguistic knowledge necessary to carry out higher-level tasks. It is also possible to apply weights to items and/or subscales, usually based on theory but possibly based on technical quality, before adding them to produce an overall ability score (e.g., weighting the pronunciation scores for phonemes according to communicative importance, weighting subscores equally).

IRT and Rasch measurement techniques are common in language testing (McNamara, 1995; Knoch & McNamara, 2012). Rasch and the simplest forms of IRT models require larger sample sizes (a minimum of 50–200, depending on the desired precision of estimates and test design factors; de Ayala, 2009; Linacre, 1994) and greater technical expertise than CTT. Unlike CTT, IRT and Rasch consider the responses of individual examinees to individual items when determining the ability of people and the difficulty of items. In simpler Rasch/IRT analyses, when the data fit the model, raw total scores will correlate almost perfectly with a person’s ability measure, allowing for straightforward (but more rigorously supported) interpretations of raw scores. Importantly, estimates of person ability and item difficulty are theoretically not sample dependent in IRT/Rasch (in practice, this holds when parameters are initially estimated with a sufficiently large and representative sample), which allows for detailed and generalizable consideration of item difficulty hierarchies that can be related to the theoretical understanding of the construct being assessed.
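To make the first two approaches concrete, the sketch below computes CTT-style (optionally weighted) subscores from a response matrix and contrasts them with the Rasch model’s item response function, which can flag responses that deviate sharply from model expectations. All data, subscale assignments, weights, and parameter values here are hypothetical illustrations, not values from the KPD or any published instrument.

    import numpy as np

    # Hypothetical response matrix: rows = examinees, columns = items (1 = correct).
    responses = np.array([
        [0, 1, 0, 1, 0, 0],
        [1, 0, 1, 1, 1, 0],
        [0, 1, 1, 0, 1, 1],
    ])
    # Illustrative subscale definitions and item weights (not from the KPD).
    subscales = {"stops": [0, 1, 2], "fricatives": [3, 4, 5]}
    weights = np.array([1.0, 1.0, 1.0, 1.5, 1.0, 0.5])  # e.g., by communicative importance

    # CTT-style subscores: (weighted) sums over each subscale's items.
    subscores = {name: (responses[:, items] * weights[items]).sum(axis=1)
                 for name, items in subscales.items()}
    total = sum(subscores.values())  # overall score, in line with CTT

    # Rasch model: P(correct) depends only on the difference between person
    # ability (theta) and item difficulty (b), both on the same logit scale.
    def rasch_p(theta, b):
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    # Diagnostic use: flag responses that deviate sharply from expectation.
    theta_hat = 0.4                                    # hypothetical ability estimate
    b = np.array([-3.0, -0.5, 0.0, 0.3, 0.9, 1.6])     # hypothetical item difficulties
    unexpected = np.abs(responses[0] - rasch_p(theta_hat, b)) > 0.75
    print(subscores, total, unexpected)  # flags the miss on the very easy first item

Under the Rasch portion of this sketch, an unexpected miss on a very easy item (or an unexpected success on a very hard one) is exactly the kind of response-level information described in the next paragraph.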
In practice, extracting diagnostic subscore information using Rasch/IRT is very similar to computing subscores in a CTT model, but with greater confidence in the order of item/task difficulty and more precise probabilistic information on items that an examinee under- or overperforms on. Using Rasch analysis to collect validity evidence for an aural vocabulary knowledge test, McLean, Kramer, and Beglar (2015) were able to show that item difficulty patterned reliably according to frequency, with items targeting more frequent vocabulary being easier than those targeting less frequent vocabulary. This aligns with exposure-based accounts of vocabulary acquisition and empirical findings on word frequency in natural language use, which in turn allows for developmental interpretations of the vocabulary test scores. For example, McLean et al.’s (2015) vocabulary test scores can be used to infer a learner’s overall level of vocabulary knowledge and highlight areas of weakness, such as an unexpected number of incorrect responses to items in a high-frequency (easier) band of vocabulary. Such a student could be referred to some remedial vocabulary instruction. Rasch/IRT measurement can also accommodate polytomously scored item/task responses, including the combining of several related items into one item parcel (also referred to as an item bundle or superitem). Justice, Bowles, and Skibbe (2006) used Rasch analysis on data from a developmental pre-literacy test that featured dichotomous and polytomous items; items were constructed to target specific knowledge facets of basic print concepts. A product of this analysis was an easy-to-use scoring sheet which visually incorporated relations between item difficulty and test-taker ability, allowing test users to intuitively understand which item scores a child would be expected to receive given their overall ability level (based on their raw total score). Thus, broad instructional decisions could be made based on an overall score (e.g., referral to remedial pre-literacy instruction) and more specific instructional decisions could be made based on item performance (e.g., reviewing where the title of a book can be found).

Multidimensional measurement is the third and most demanding approach. There are a variety of multidimensional measurement techniques, most based on IRT (Reckase, 2009), which could be applied meaningfully in DLA. For the sake of brevity and relevance, this review will focus on Cognitive Diagnostic Models (CDMs), a family of multidimensional IRT models that includes variants such as the Rule Space Model (e.g., Buck & Tatsuoka, 1998) and the Fusion Model (e.g., Jang, 2009). Expanding on simpler IRT models, CDMs introduce additional dimensions based on cognitive attributes (skills, processes, knowledge) needed to successfully respond to items. Item attributes, usually coded by several experts with thorough understanding of the knowledge bases and cognitive processes tapped by the larger construct being measured, can explain examinee responses to items in finer-grained detail: Examinee mastery of specified cognitive attributes determines their odds of correct responses in accordance with the demands of each item. CDMs may represent an ideal measurement model for DLA. The technical potential of CDMs to provide detailed, precisely-measured information on a range of subordinate skills and knowledge aligns well with the goals of DLA. Criticism of CDMs in DLA primarily stems from their post-hoc application to general proficiency tests, a phenomenon Alderson (2010, p.
99) described as “trying to retrofit a proficiency test into diagnostic uses.” Much of Alderson’s criticism of this approach stems from problems in accurately ascribing cognitive attributes to the kinds of questions found on typical reading or listening comprehension tests. For instance, even experts in L2 reading will not always agree whether a given reading question requires an inference to be made. Jang (2009), whose coders only achieved moderate agreement in assigning attributes to reading items, also recognized the limitations of retrofitted CDM approaches, agreeing with Alderson’s criticism that diagnostic tests need to be built from the ground up for diagnostic purposes in order to capitalize on the technical potential of CDMs. Retrofitting is not the only weakness of CDMs. Because they are more technically sophisticated and potentially estimate numerous cognitive attributes, CDMs usually require much larger samples than the other measurement models discussed so far. Jang’s (2009) application of a CDM to the LanguEdge reading test (a TOEFL iBT precursor) involved 2,703 examinees; Sawaki, Kim, and Gentile (2009) had over 3,000 examinees, while Buck and Tatsuoka (1998) used a more modest sample of 412. Even taking Buck and Tatsuoka (1998) as an acceptable sample size (for 15 cognitive attributes), it is clear that CDM analyses can be quite resource-demanding.

Self-Assessment

Self-assessment (SA) has been popular in L2 research and classroom practice and is a key step in effective DLA according to Alderson (2005; Alderson et al., 2015). Generally, the accuracy of SA (i.e., the association between self-assessment and objective/expert assessment) for language learners has been found to be positive and moderate, yet widely variable: Ross’ (1998) seminal meta-analysis found that the average correlation between learner SA and objective tests for overall proficiency was r = .63, with a range of .09 to .80. Especially relevant to the present dissertation, the average correlation between SA and an objective test for listening ability was r = .65 (range: .25 to .81), while the average correlation for speaking ability was slightly lower at r = .55 (range: .09 to .78). More recently, Ma and Winke (2019) found that L2 Chinese learners could fairly accurately self-assess their proficiency at broad levels but struggled to accurately assess their abilities at finer levels of distinction, especially if they were at Intermediate levels of proficiency (Novice and Advanced learners were better at self-assessing their oral skills than were Intermediate-level learners).

For pronunciation self-assessment, findings pertaining to learner accuracy are mixed. Trofimovich et al. (2016) found weak-to-small correlations between SA and expert judgments of degree of foreign accentedness (r = .06) and comprehensibility (r = .18) for L2 English learners. Lappin-Fortin and Rye (2014) found that learners of French were reasonably accurate in self-assessing their global pronunciation and learned to more accurately assess specific features of French pronunciation that they had been taught explicitly, but nonetheless tended to overestimate their abilities. Dlaska and Krekeler’s (2008) study, where learners of German assessed their segmental productions by comparing their recordings to native speaker models, found high overall learner agreement with expert judges, but the learners failed to identify roughly half of their mispronunciations – in other words, they tended to overestimate the accuracy of their productions.
Thus, it might appear that SA involving (a) productive and (b) more specific aspects of language proficiency may tend to be less accurate than other forms of SA. Interestingly, Trofimovich et al. (2016) also found that less-proficient learners (i.e., those with stronger foreign accents or lower comprehensibility) tended to overestimate their speech quality while more proficient speakers tended to underestimate. This finding reflects the well-known Dunning-Kruger effect (Kruger & Dunning, 1999), that is, the notion that those with less expertise tend to overestimate themselves while those with greater expertise tend to underestimate themselves, which throws another wrench into the machinery of self-assessment.

The apparent flaws of SA are not necessarily a problem for DLA. Rather, in DLA, they may be seen as a learning opportunity: For learners unaware of their weaknesses (or strengths, as it may be in the case of more experienced or proficient learners), reconciling SAs with expert/objective scores from diagnostic instruments can highlight gaps and create awareness in learners that will hopefully support subsequent learning. Indeed, if SA were so accurate that other steps of DLA showed learners nothing new, there would be little argument for doing anything beyond SA in the first place. In the DIALANG test for diagnosing foreign language ability (www.lancaster.ac.uk/researchenterprise/dialang/about.htm, Alderson, 2005), examinees complete a self-assessment prior to taking the DIALANG, and their DIALANG results are then presented alongside their SA results after the test is complete. When there is a mismatch, the DIALANG system provides several possible explanations for it, with the hope that learners will more carefully consider their abilities and take to heart the suggestions for future study provided with the test results. While this approach to utilizing self-assessment is somewhat simplistic, especially as the DIALANG focuses on language skills rather broadly, it may nonetheless provide a useful wake-up call for someone presenting a strong Dunning-Kruger effect in the conceptualization of their language skills.

SA often takes the form of can-do statements, popularized by frameworks of language proficiency such as the Common European Framework of Reference (CEFR, Council of Europe, 2017) and the American Council on the Teaching of Foreign Languages’ Guidelines (ACTFL, 2012). Can-do statements may be framed as yes/no questions (i.e., dichotomous responses) or involve longer rating scales (see Little, 2005; Tigchelaar, Bowles, Winke, & Gass, 2017; Ma & Winke, 2019). Otherwise, SA may employ other item types with rating scales anchored by short weak/low and strong/high descriptors of ability or performance, e.g., the accentedness and comprehensibility scales used by Trofimovich et al. (2016). In DLA, it would seem prudent for the grain size of any SA to be roughly parallel to that of any diagnostic instrument used in the process. This is not to say that the inclusion of some broader, more general self-assessment items should be discouraged, but rather that it would seem easier and more useful for learners to compare self-assessment and diagnostic test results that are more directly relatable.

Validity in DLA

Alongside the previously discussed key questions in DLA, validity is a chief concern for any type of assessment. Validity also provides a framework for investigating and evaluating assessment instruments, procedures, and uses.
In line with the larger field of educational assessment, validity in language assessment is widely conceived of in an argument-based framework (Bachman & Palmer, 2010; Kane, 2013; Chapelle, Enright, & Jamieson, 2008, 2010), and DLA is no exception (Chapelle, Cotos, & Lee, 2015). Whereas classical notions of validity have a narrow focus on whether a test measures what it claims to measure (or more precisely, whether variation in test scores reflects variation in the underlying construct(s) or trait(s)) (Borsboom, Mellenbergh, & van Heerden, 2006), the contemporary argument-based approach broadens the scope of validity to include the decisions made based on test scores and subsequent impacts on test stakeholders (Messick, 1989). Although argument-based validity theorists in educational assessment (e.g., Kane, 2013) and language assessment (Bachman & Palmer, 2010; Chapelle et al., 2008, 2010) differ somewhat in their specifications of validity arguments, the general structure involves a series of progressive inferences that lead from test-taker responses to the use of test results by a range of stakeholders (Figure 1.1). Each inference (indicated by curved arrows in Figure 1.1) requires some sort of backing (gray boxes with bullet-point examples).

The first inference in Figure 1.1 is evaluation. For scores to be meaningful, they must be appropriately assigned to responses elicited by well-designed test items or tasks in a way that reflects the targeted construct of language ability. Support for this inference comes from the theoretical background related to the construct and the connection between theory and operationalization in the form of test tasks and items, with well-reasoned scoring rules in place based on that connection. In a basic sense, this inference requires that the test content and tasks are a sensible snapshot of the way the targeted construct functions in real life. The next inference, generalization, reflects the assumption that scores from a given test observation are consistent with other possible observations. Support for this inference broadly involves estimation of the reliability of scores, which provides statistical backing for the notion that a test-taker is expected to receive similar scores, for example, if he took a slightly different form of the test, if he took the test two days earlier or later, or if a different teacher scored his responses.

Figure 1.1. A series of inferences that typify validity arguments.

Next comes the explanation inference, which holds that differences in scores are explained by differences in the underlying language constructs. Conversely, scores are not attributable to irrelevant factors (e.g., a reading test should not depend on an examinee’s mathematical ability). Support for this inference largely comes from measurement characteristics, such as the alignment between predicted and empirical item/task hierarchies and description of the internal structure of test items/tasks (e.g., dimensionality analysis, relationships among tasks). The extrapolation inference, which allows for test scores to be interpreted as reflective of performance in other, non-test situations, follows. Support for this inference often comes in the form of a relationship between test scores (or test-taker responses) and authentic (or semi-authentic) performance on a task with real-world relevance. The last inference in the chain is utilization.
Some validity theorists make a distinction between this inference and the previous inferences: Kane (2013), for example, distinguishes between ‘interpretation’ and ‘use’ of test scores; utilization falls into the latter category while the previous inferences relate more directly to the issue of interpreting the meaning of scores in terms of the targeted construct. The primary assumption in the utilization inference is that decisions made on the basis of test scores are useful, fair, and beneficial. Evidence is required that shows how the decisions made help ensure or improve outcomes (learning, job performance, etc.), or otherwise beneficially serve a social function (e.g., allow a school to hire teachers with adequate language ability).

While these five inferences are at the core of most validity arguments, other inferences are possible, and often appended to either the beginning or end of this core chain. For example, the work of Carol Chapelle and colleagues (e.g., Chapelle et al., 2008, 2010; Chapelle et al., 2015) features validity arguments that begin with an additional inference (often referred to as “domain definition,” or sometimes “authenticity”) related to the link between the design of test tasks and real-world language use or contexts of use. Chapelle and colleagues as well as Bachman and Palmer (2010) have also included additional inferences at the end of the chain related to the beneficial consequences of utilizing test results. For DLA, in line with Alderson (2005), Alderson et al. (2014), and Harding et al.’s (2015) recommendations for diagnostic instruments to be constructed based on a detailed theory of learning and models of language processing, adding an inference at the beginning of the chain, connecting such theory to test-taker responses, would strengthen a DLA instrument’s validity argument. Similarly, adding an inference at the end of the chain related to the beneficial impact of applying diagnostic results would be in line with the emphasis on subsequent learning in DLA (e.g., Lee, 2015) and would enhance the persuasiveness of the validity argument. In the following chapters, I will introduce a proposed validity argument for the KPD that I use to set the research agenda for this dissertation.

CHAPTER 2: DIAGNOSING SECOND LANGUAGE PRONUNCIATION

As Alderson (2005) suggested, a strong theory of language development ought to underpin any diagnostic language assessment. Theory also supports key inferences in validity arguments. In this chapter, I begin by making a case for developing a pronunciation diagnostic, highlighting a gap in instructionally-relevant pronunciation assessments. Then, I establish the theoretical grounding for the KPD by reviewing theories and research related to L2 pronunciation, both in general and specifically for Korean. Specifically, I review the linguistic, cognitive, and developmental bases that underpin the design of the KPD. I end the chapter by laying out the goals of the KPD development project and introducing a validity argument used to frame the validation research agenda in this dissertation.

Why Diagnose L2 Pronunciation?

In the Introduction of this dissertation, I pointed out that pronunciation can present persistent challenges to L2 learners, and that such problems can lead to intelligibility issues in real-world communication.
I also pointed out that despite the well-documented effectiveness of pronunciation instruction, pronunciation is often neglected in language classrooms due to time/curricular constraints and, in some cases, lack of teacher confidence. At the same time, whole-class pronunciation instruction, when it is done, can be limited in its effectiveness, possibly due to instructional targets being sub-optimally matched to individual learner needs. Derwing and Munro (2014, p. 44) illustrated such a condition, where the resulting instructional decision was to mostly avoid teaching segmentals: “Little emphasis was placed on individual vowels and consonants, it turned out, because the students shared very few problems at the level of the segment.” Thus, anything that would aid teachers and learners in identifying critical targets for pronunciation learning activity, whether in in-class or more individualized out-of-class formats, would appear beneficial.

But why would a diagnostic assessment, specifically, be beneficial? As reviewed in Chapter 1, DLA has several characteristics that make it particularly well-suited to supporting learning. Compared to other types of assessments, such as proficiency or achievement tests, diagnostic instruments are designed with learning theory in mind and provide highly-detailed feedback that can be used to inform instruction. By and large, the most common form of pronunciation assessment would appear to be as a component of large-scale, high-stakes speaking assessments (Isaacs, 2018; Isaacs & Harding, 2017). In these sorts of assessments, pronunciation is treated broadly as just one aspect of rubrics used to evaluate a learner’s overall speaking abilities (e.g., IELTS, https://www.ielts.org/; OPIc, https://www.languagetesting.com/oral-proficiency-interview-by-computer-opic; TOEFL iBT, https://www.ets.org/toefl), and results provide little to no guidance for subsequent learning activity. Isaacs, Trofimovich, and Foote (2018) developed a more detailed scale of global pronunciation quality that is theoretically well-grounded and could be used for upper-level instructional decisions, such as assigning international graduate students to pronunciation support classes. Similarly, for Korean, Lee (2017b) developed and examined pronunciation rating scales that can be used to augment speaking assessments. However, these scales ultimately fall short of providing individualized, instructionally-relevant information about learners’ abilities.

Other pronunciation assessments have engaged more meaningfully with detailed, individualized results informative to learning. Lappin-Fortin and Rye’s (2014) self-assessment approach is commendable for its detail, requiring students to think about their global pronunciation quality as well as their ability to produce individual features of French phonology, such as vowel and consonant segments and features of connected speech. Dlaska and Krekeler (2008), working with learners of German, also took a self-assessment approach that relies on learner comparisons of self-recordings to native speaker audio models to raise learner awareness of segmental pronunciation difficulties. Tsurutani (2008) took advantage of automated speech recognition (ASR) to provide detailed feedback on learner Japanese pronunciation that is integrated with training activities. Kim (2006), discussed in the previous chapter, aimed to diagnose difficulties with individual Korean phonemes and suprasegmental features. Similarly, Celce-Murcia et al.
(2010) provided a tool to diagnose L2 English speakers’ production difficulties and included some tasks targeting perception as well. However, each of these examples, while certainly of considerable utility, could be improved on. Many of them (e.g., Celce-Murcia et al., 2010, p. 481; Dlaska & Krekeler, 2008; Kim, 2006; Lappin-Fortin & Rye, 2014) evaluate pronunciation based entirely on read-aloud words or sentences (which can be prone to non-pronunciation influences; Levis & Barriuso, 2012; Munro, 2008), rely on a native-speaker standard (Dlaska & Krekeler, 2008; Tsurutani, 2008), or have limited observations of pronunciation targets (Dlaska & Krekeler, 2008; Kim, 2006). Aside from some suggestions for perception items from Celce-Murcia et al. (2010, but not included on their diagnostic) that mirror Lado (1961), none incorporate both production and perception of pronunciation features, a design which has strong motivations in pronunciation learning theory (more details follow in later sections). Specific to Korean pronunciation, Lee (2017b) stated that Kim (2006) appears to be the only example of a detailed pronunciation assessment for L2 learners, and further noted that research on Korean pronunciation assessment is lacking in general. In her state-of-the-art review of pronunciation assessment, Isaacs (2018) lamented that Lado’s (1961) nearly 60-year-old text is still the most comprehensive treatment of pronunciation assessment, signaling that new advances are sorely needed. I agreed with Isaacs, especially in regard to lower-stakes, instructionally-relevant pronunciation assessments, and I saw an opportunity to fill these gaps in pronunciation assessment by developing a state-of-the-art yet practical assessment tool, in line with diagnostic principles elaborated by Alderson (2005) and colleagues (2014), that (a) diagnoses learner phoneme-level strengths and weaknesses in pronunciation, (b) integrates both production and perception, (c) explicitly promotes intelligibility-based evaluation of pronunciation, (d) does not rely exclusively on read-aloud tasks, (e) is relatively easy to administer and score, and (f) beneficially informs pronunciation learning, and then to evaluate that tool rigorously.

What is L2 Pronunciation?

Pronunciation refers to how humans produce speech using the vocal apparatus. Speech begins inside the mind, and through complex neural-motor activity, the lungs, vocal tract, and mouth move to produce sounds that represent language. The different ways in which humans use the vocal apparatus affect the resulting sounds produced in terms of both acoustic and temporal features. Pronunciation encompasses the qualities of segmental features that define words, i.e., phonemes, and suprasegmental (or prosodic) features that take shape over multiple segments, such as intonation, pitch accent, and stress. Features commonly associated with speech fluency, like speech rate and pauses, are also related to pronunciation. In naturalistic speech, these features can be difficult to tease apart, but nonetheless form a meaningful and practical basis for examining pronunciation.

L2 researchers have commonly examined pronunciation quality in terms of pronunciation’s impact on a listener. Derwing and Munro (2015) offered a useful (and widely used) framework for considering listener-based dimensions of pronunciation. The degree to which a speaker’s message is accurately received by a listener is referred to as intelligibility.
A related but partially independent dimension is comprehensibility, which refers to the listener’s ease of understanding a speaker. Seen another way, comprehensibility is analogous to the amount of effort a listener must put forth to comprehend speech. Accentedness is the difference between the speaker’s pronunciation and the listener’s own speech variety. When dealing with L2s, accentedness can also be understood as degree of foreign accent (rather than accents associated with L1 regional dialects). While all three dimensions are worth considering, Derwing and Munro declared that intelligibility is “the most fundamental characteristic of successful oral communication” (p. 1). If the sounds produced by a speaker (in an L1 or L2) are not intelligible to listeners, communication will not be successful.

The importance of speech intelligibility has long been recognized throughout the field of L2 pronunciation (e.g., Abercrombie, 1949), but has not always been emphasized in language teaching and assessment. Recently, the importance of intelligibility has been stressed in pedagogy by Levis (2005), who contrasted the previous emphasis in language teaching on achieving nativelike speech (the Nativeness Principle) with a more contemporary focus on learner intelligibility (the Intelligibility Principle). Intelligible, rather than native-like, speech has concomitantly seen greater emphasis in descriptive frameworks of communicative second language ability, such as the Common European Framework of Reference (CEFR, Council of Europe, 2017) and the American Council on the Teaching of Foreign Languages’ ACTFL Guidelines (2012). In these proficiency frameworks, used in both pedagogical settings and assessment, intelligibility is generally depicted as something that lower-proficiency learners will struggle with. Their interlocutors must put forth “some effort” and engage in “collaboration” with the speaker to establish meaning (Council of Europe, 2017, pp. 134-135) or be “sympathetic” and/or “accustomed” (ACTFL, 2012, p. 9) to L2 speech in order for communication to be successful. At intermediate levels, learners are generally intelligible, but still mispronounce some sounds regularly. At higher levels of proficiency, learners are assumed to have sufficient control over the production of almost all L2 sounds (and indeed do have high accuracy in the production of the most critical sounds for distinguishing words in an L2; Kang & Moran, 2014), at which point suprasegmental and fluency-related aspects of L2 pronunciation may figure more prominently in communicative effect and ease of understanding from the listener’s perspective.

Empirical research on factors influencing speech intelligibility has suggested an integral role for segmental pronunciation. In monologic speech, research has shown that segmental accuracy has a clear effect on intelligibility (Kang, Thomson, & Moran, 2018a, 2018b). In Kang et al. (2018b), segmental features had the greatest influence on the intelligibility of individual sentences as well as on the comprehension of extended monologues. Along similar lines, Jenkins (2002) argued that most pronunciation-related breakdowns between L2 users (i.e., L2 pronunciation being processed by an L2 listener) are related to segmental features. Jenkins, focusing on English as an international language, went on to propose a pared-down, intelligibility-oriented pronunciation syllabus for L2 English learners, prioritizing consonant phonemes and deemphasizing many suprasegmental features.
While context is often pointed to as a resource that interlocutors can use to help maintain intelligibility when mispronunciations occur, Jenkins found that this occurs less often when the interlocutor is a non-native speaker. Other L2 research has reinforced the importance of segmental pronunciation in interactive speech, including (but not limited to) research on English (Loewen & Isbell, 2017; Matsumoto, 2011), French (Kennedy, Guénette, Murphy, & Allard, 2015), and Spanish (Bowles, Toth, & Adams, 2014).

In sum, intelligibility is widely considered to be the most important aspect of L2 pronunciation. Intelligibility fails when listeners cannot associate a speaker’s sounds with linguistic forms. Accordingly, segmental features are perhaps the most critical aspects of pronunciation to be mastered by L2 learners, as they form the basis of word forms, though it is not necessary to have native-like production of all segments. While duly noting the communicative functions of suprasegmental features and the effect they can have on listeners (e.g., Kang, Rubin, & Pickering, 2010) and the role of the listener in maintaining intelligibility through contextual cues and communicative strategies, I have focused the KPD and the remainder of this literature review on phonemes as criteria for identifying pronunciation weaknesses across a wide range of L2 proficiency levels.

The Linguistic Basis of Intelligible Pronunciation

As these frameworks of language proficiency suggest, segmental aspects of L2 phonology form the foundation of successful communication for L2 users. At a basic linguistic level, all phonemes are useful in distinguishing higher-level linguistic forms (i.e., words), allowing access to their associated meanings. With natural languages being composed of tens or hundreds of thousands of words, there are bound to be many that have highly similar forms, e.g., minimal pairs which differ by a single phoneme (cap and cab, in English). The concept of Functional Load (FL), first described by Brown (1988), explains the importance of segmental phonological contrasts by examining how much utility (a) phoneme contrasts (e.g., /n/-/m/) and (b) individual phonemes have when it comes to distinguishing the words that compose a language’s lexicon (see also Oh, Coupé, Marsico, & Pellegrino, 2015). For example, the English contrast of /n/-/t/ has a high FL due to the frequency at which those two phonemes distinguish similar words (e.g., nap/tap, night/tight). Because those phonemes are frequent in the lexicon and form crucial contrasts with other phonemes, /n/ and /t/ (as individual phonemes) are said to have high FLs. Oh et al.’s recent survey of FL in several typologically different languages suggested that consonants generally have higher FL than vowels, with vowels gaining some ground when considering inflectional derivations. FL information within a language can provide insights as to how likely a mispronunciation of a phoneme is to lead to listener difficulty. Examining L2 production as understood by L1 listeners, Munro and Derwing (2006) found that (a) utterances with higher-FL errors and (b) utterances with more high-FL errors created greater difficulty in listener understanding. This implies that some mispronunciations are more severe and present a greater threat to intelligible speech for L2 learners. I now turn to the linguistic specifics of Korean pronunciation.
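Before doing so, a brief illustration of the minimal-pair logic behind FL may be useful. The sketch below counts, for a toy, hypothetical lexicon of phonemically transcribed English words, how many word pairs each single-phoneme contrast distinguishes. Real FL estimates are computed over full dictionaries and weight contrasts by word frequency (Oh et al., 2015), which this sketch deliberately omits.

    from itertools import combinations
    from collections import Counter

    # Toy phonemically transcribed lexicon (hypothetical; real FL work uses
    # full dictionaries plus word-frequency weighting).
    lexicon = ["næp", "tæp", "naɪt", "taɪt", "kæp", "kæb", "mæp"]

    def minimal_pair_contrasts(words):
        """Count minimal pairs for each phoneme contrast.

        Two words form a minimal pair if they have equal length and differ
        in exactly one segment position. (Each character is treated as one
        phoneme, which is adequate for this toy transcription.)
        """
        counts = Counter()
        for w1, w2 in combinations(words, 2):
            if len(w1) != len(w2):
                continue
            diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
            if len(diffs) == 1:
                counts[tuple(sorted(diffs[0]))] += 1
        return counts

    fl_proxy = minimal_pair_contrasts(lexicon)
    print(fl_proxy.most_common())
    # The /n/-/t/ contrast distinguishes the most pairs in this toy lexicon
    # (nap/tap, night/tight), mirroring the high-FL example in the text.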
According to Shin, Kiaer, and Cha (2012), the contemporary Korean spoken in South Korea (and particularly by younger people in the capital region) has 28 phonemes. Among these phonemes are 7 vowels, 19 consonants, and 2 glides; the glides combine with vowels to form 10 diphthongs (Table 2.1). Cross-linguistically, Korean is somewhat rare in that it has a tension feature distinction for some consonants, resulting in a two-way (lax-tense) or three-way (lax-tense-aspirated) distinction among consonants with the same place and manner of articulation. Tensing requires pharyngeal articulation, somewhat longer stop/fricative/affricate duration, and tends to result in higher pitch (F0) of the following vowel. Korean has a (C)(G)V(C) syllable structure. The allophonic distribution of Korean consonants is generally sensitive to syllable and word context. One notable idiosyncrasy involves /s/ and /s*/: when these phonemes are followed by the vowel /i/ or glide /j/, the place of articulation changes to the alveopalatal area.

Some research on Korean Functional Load suggests that consonants are more critical than vowels for distinguishing words (Oh et al., 2015). Among Korean consonants, /n, k, l, s, t/ have the greatest FL, in that order. Additionally, the contrast between /l/ and /n/ has a notably higher FL than other contrasts; /n/ features in other top-ranking contrasts, too. Although vowels are somewhat less critical, the vowels /i, ɑ, o/ are comparable in FL to the previously listed consonants, and several vowel contrasts carry greater FL than most consonant contrasts, e.g., /i-ɛ/, /o-i/, /i-ɑ/. Thus, it seems a fair assessment to say consonants and vowels are of similar, if not equal, importance in shaping and distinguishing words.

Table 2.1
Korean Phoneme Inventory

Consonants             Bilabial   Alveolar   Alveopalatal   Velar   Glottal
Stop        Lax        p          t                         k
            Tense      p*         t*                        k*
            Aspirated  ph         th                        kh
Fricative   Lax                   s                                 h
            Tense                 s*
Affricate   Lax                              tɕ
            Tense                            tɕ*
            Aspirated                        tɕh
Nasal                  m          n                         ŋ
Liquid                            l

Vowels      Front (unrounded)   Back (unrounded)   Back (rounded)
High        i                   ɯ                  u
Mid         ɛ                   ʌ                  o
Low                             ɑ

Glide       j, w

Note. Information compiled from Shin et al. (2012).

The Cognitive Basis of Pronunciation

From a psycholinguistic perspective, phonemes play a key role in models of spoken word recognition: Incoming soundwaves, after being decoded into phonemic units, can then inform the activation of potential lexical matches and suppression of competitors (McClelland & Elman, 1986; McQueen, Norris, & Cutler, 1994). Or, in other words, if a sound produced by a speaker is unrecognizable as the intended phoneme, the listener’s word recognition is impeded, and intelligibility may suffer. Failure to identify a speaker’s intended word causes immediate deterioration in intelligibility and can potentially cause ripple effects in subsequent word recognition: the previously (mis)identified word contributes to top-down activation and suppression processes, where the listener attempts to apply their understanding of the current discourse context and general world knowledge. Research has found that segmental pronunciation can substantially affect the intelligibility of L2 utterances (Isbell, 2017; Kang et al., 2018a, 2018b; Loewen & Isbell, 2017; Zoghbor, 2018).

Before proceeding further in discussing the role of phonemes in pronunciation, it is appropriate to highlight some important issues related to the mental representation of phonemes in speakers and listeners.
As Field (2014) pointed out in his synthesis of historical and contemporary perspectives on phonemes, it is unlikely that language users (L1 or L2) possess a distinct, minimalist inventory of phonemes in their minds due to substantial variation in phonetic realization across speakers (e.g., pitch differences among men, women, and children) and contexts, including local linguistic contexts (e.g., those leading to co-articulation phenomena) as well as social contexts (e.g., phonetic differences in phoneme realizations among national, social, and ethnic varieties of a named language). Instead, Field suggested that theories which account for this variation, such as multiple trace-based accounts that center on users’ experience hearing countless variations of individual sounds (Bybee, 2001), are more plausible. In other words, primarily (if not only) through substantial language experience can users of a language develop a sense of what acoustic patterns underlie the sounds used to encode meaning in language, i.e., the abstractions linguists refer to as phonemes. Thus, Field argued that while phonemes are still a valid unit for discussing learner pronunciation and intelligibility, more sophisticated input-based approaches are needed for building up learner knowledge of variation in phoneme realizations.

The foundational role of L2 phonemes in intelligible speech conveniently aligns with recommendations for DLA specifications offered by Harding, Alderson, and Brunfaut (2015). In their article, Harding et al. discussed potential avenues for implementing DLA that specifically targets L2 reading and listening skills. For L2 listening, Harding and colleagues cited Field’s (2013) model, which is based on Cutler and Clifton’s (1999) well-known model of L1 listening, as a strong and detailed model of language ability that could form the basis of a diagnostic test. Harding et al. (2015) emphasized the model’s “obvious scope… for operationalizing elements of this model through discrete assessment tasks” (p. 329). A full summary of this model is beyond the scope of this dissertation, but I will highlight the lower-level processes most critical to DLA (Figure 2.1). In Field’s model, auditory input is decoded into phonemes, syllable structures, and suprasegmental information which is subsequently used in lexical search. While the focus of this dissertation is on diagnosing pronunciation, an aspect of language production, lower-level listening processes (i.e., perception of phonemes in speech for word recognition) play an important role in L2 pronunciation development, an idea I will return to.

Figure 2.1. Lower-level listening processes, based on Field (2013, p. 97).

Although Harding et al. (2015) did not specifically address language production, their advice in selecting a detailed process model of language ability can easily be applied to productive skills. Once again, Field’s work has proven valuable. In 2011, Field articulated a process model of L2 speaking (Figure 2.2), based on Levelt’s (1993) seminal L1 speaking model. Field’s speaking and listening models are not simply mirror images, but their parallels in lower-level processes are obvious: The phonetic encoding and phonological encoding of speaking align with the input decoding and lexical search of listening. In the speaking model, messages that have been grammatically encoded are then converted to strings of phonemes. These phonemes direct phonetic articulatory settings that ultimately result in sounds being produced.
Importantly, both speaking and listening rely on phonological knowledge in the lower-level processes. Field’s (2011) speaking model also connects speaking to listening: After articulating a chunk of speech, the speaker can self-monitor in order to repair mispronunciations or dysfluencies (or other errors).

Figure 2.2. Lower-level speaking processes, based on Field (2011, p. 77).

An important feature of Field’s (2011, 2013) models is a distinction between knowledge and processes. For example, in Figure 2.2, phonological knowledge and the syllabary are sources of knowledge that are drawn on in speech production. On the other hand, phonological encoding and phonetic encoding are processes; these may also be thought of as (sub)skills or abilities. Thus, it is possible for a speaker to possess the relevant knowledge to produce a sound (i.e., they might know what a segment sounds like, or how/where their speech articulators operate to produce it), but nonetheless fail to articulate the sound accurately at times due to a hiccup or failure in a process. Similarly, learners may have imperfect knowledge (e.g., a poorly-defined phonological category), but processes that are sufficiently tuned to produce intelligible (if not native-like) articulations more often than not (e.g., Sheldon & Strange, 1982). This distinction can be related to the competence-performance dichotomy in language assessment and can also be related to different knowledge types in SLA theory. For example, in DeKeyser’s (2017) application of skill acquisition theory to SLA, a distinction is made between declarative knowledge (knowledge of) and proceduralized (and ultimately automatized) knowledge (knowledge how, with an emphasis on expedient use). This distinction has been recently picked up by Saito and Plonsky (in press) in their measurement framework for L2 pronunciation, which contrasts controlled pronunciation tasks, which largely tap into declarative knowledge bases, and spontaneous production tasks, which measure accuracy and efficiency in processing.

The Developmental Basis of L2 Pronunciation

Although L2 learners do tend to develop greater control over phonological features alongside their overall oral proficiency (Kang & Moran, 2014; Saito, Trofimovich, & Isaacs, 2016), it has long been understood that the development of L2 pronunciation is not a straightforward, predictable process. Abercrombie (1949, p. 118), in what would become an early landmark in pronunciation teaching, noted that:

“People vary, to a surprising extent, in ability to learn the pronunciation of foreign language. Every phonetician must have had the experience, at some time or other, of meeting a person to whom the imitation of the most exotic sounds at first hearing presented no difficulty at all. At the other extreme are a more numerous minority who are hopelessly recalcitrant, and for whom any deviation from the native sound system is apparently impossible.”

Decades of subsequent observations from teaching practice, empirical research, and theory building would support Abercrombie’s description of large variability in L2 pronunciation development outcomes. In this section, I review research on L2 pronunciation development, highlighting several salient factors found to influence this variability, including learner age, cross-linguistic influence and bi/multilingualism, experience, the relationship between perception and production of L2 sounds, and instruction.
Discussion of these factors is followed by a review of L1 and L2 Korean segmental development that is useful for establishing general expectations of pronunciation difficulties in the present study.

Age

One of the most consistent and robust findings in research on age-related constraints in SLA is related to phonological development: Learners who begin study of an L2 past the age of six (or more liberally, past puberty) are generally unlikely to acquire native-like perception and articulation of L2 sounds (Abrahamsson, 2012; Flege et al., 1999; Long, 2013; Piske, MacKay, & Flege, 2001). Thus, for older L2 learners, age of onset plays an important predictive role in defining the endpoint of pronunciation development. It is for this reason that Levis’ Intelligibility Principle has taken a firm hold on the field of L2 pronunciation, as native-like pronunciation outcomes are simply not a realistic goal for many L2 learners (and in some cases may not be desired, e.g., by learners who strongly identify with their national/ethnic group). At the same time, even though much adolescent and adult L2 learning does not result in nativelike phonologies, L2 phonology does develop, typically in the direction of more intelligible and/or target-like perception and production of L2 sounds and sound patterns. Instruction has been found to improve various aspects of L2 phonology, including phoneme perception (e.g., Flege, 1991; Hardison, 2005; Thomson, 2012) and production (e.g., Thomson, 2011; Lee et al., 2015).

Cross-Linguistic Influence

A key feature of L2 pronunciation learning that distinguishes it from child L1 acquisition is the bi/multilingual phonemic inventory. The starting point for L2 learners is not a blank slate, and L2 learners do not simply turn off their L1 phonemic inventory or develop an entirely separate phonological system when learning or using the L2. It is widely observed that learners (to varying degrees) substitute L1 phonemes, follow L1 syllable structure constraints, and apply L1 prosodic patterns to L2 speech. The earliest theoretical accounts of L1 transfer or influence, such as the Contrastive Analysis Hypothesis (CAH), relied completely on cross-linguistic differences in phonological systems to make strong predictions about difficulty and learning for various L1-L2 pairings (Lado, 1957; Stockwell & Bowen, 1965). In brief, the CAH predicted that phonological features not present in the L1 would be very difficult to acquire in the L2, features that are optional in the L1 yet obligatory in the L2 would be a moderate challenge, and features that were present and obligatory in both languages would be easily acquired (or transferred). Ultimately, many specific predictions based on the CAH failed to pan out, and the CAH similarly failed to account for variation in learning and accuracy within L1 groups. Notably, L2 pronunciation research has shown that considerable variation in phoneme articulation exists within groups of speakers from the same L1 background (e.g., Abrahamsson, 2012), and that speakers from the same L1 group can vary greatly in the overall strength of foreign accent (e.g., Kang et al., 2010; Munro & Derwing, 1995). As Munro, Derwing, and Thomson (2015) pointed out, even when a contrastive analysis predicts a challenge for a given L1-L2 pairing, in many cases the potential challenge is either (a) overcome quickly or (b) never actually presents substantial, long-lasting difficulty, and thus, in either case, does not require much specific instruction.
Further, while the L1 has an undeniable influence on L2 phonology and pronunciation, not all learners are influenced by their L1 in exactly the same way or to the same degree, and learners may make progress in different aspects of L2 pronunciation at different rates.

Additionally, L1 varieties and multilingualism create conditions that make L1-based predictions of pronunciation difficulties more difficult to carry out and less reliable for teachers. For example, (Mandarin) Chinese is one of the world’s most widely spoken first languages, which might make some knowledge of Mandarin Chinese phonology useful for second/foreign language teachers in many contexts, but the varieties spoken throughout the Chinese-speaking world vary phonologically. While pronunciation textbooks have commonly provided information on the phonological systems of various learner L1s (e.g., Avery & Ehrlich, 1992; Kwon, 2017), they rarely provide information on non-dominant regional varieties. How much can a teacher be expected to know about their students’ specific L1 varieties? At the same time, language classrooms are increasingly being populated by multilingual learners; many foreign language learners are technically L3+ learners. It remains unclear whether native languages or L2s primarily shape L3 phonology, but it is possible for L2 articulatory settings to be transferred to an L3, even when L1 settings would result in production closer to native-like targets (Llama, Cardoso, & Collins, 2010). Relevant specifically to Korean, Chen (2018) illustrated an interesting case where the Korean phonological difficulties experienced by Taiwanese Mandarin speakers, who also had some knowledge of Taiwanese and/or Hakka, differed from what the CAH would predict for Mandarin speakers from China.

Although the strongest accounts of L1 transfer such as the CAH have been abandoned, cross-linguistic influence remains prominent in theoretical accounts of L2 speech learning. L2 phonological development is adequately described by models involving perceptual assimilation (Best & Tyler, 2007; Flege, 1995). Empirical research has demonstrated that L1 phonemes remain active during L2 speech perception and influence word recognition (Imai, Walley, & Flege, 2005; Weber & Cutler, 2004), providing strong evidence for the influence of learners’ pre-existing phonemes on the L2 phonological system. Perceptual assimilation, for L2 learners, happens when a newly encountered L2 sound is parsed as a similar, existing phoneme (typically, but not always, an L1 phoneme). For example, an English learner of Korean may assimilate the Korean /k/ and English /k/, as they share a number of acoustic and articulatory similarities. However, the same learner may also assimilate Korean /kh/ to the L1-L2 /k/ phoneme category. In part, this is because aspiration is not phonemic in English, leading to the learner perceiving the two sounds as being more similar than they really are in Korean. With enough input, learners can separate these assimilated L1-L2 phonemes, but it is not guaranteed, and it can take quite a long time for L2 phonemes to become distinctly and robustly represented in the learner’s inventory (recall Field’s (2014) discussion of phoneme representation in the mind).
Individual learners will also vary in the rate, and potentially the order, of distinguishing new L2 phonemes; this variation is likely driven in part by differing amounts of L2 use/exposure and by individual differences such as motivation, musical aptitude, and other cognitive/neurological differences (see Ingvalson, Ettlinger, & Wong, 2014, for a discussion of the latter).

Experience

The most dynamic period of L2 speech learning tends to occur within the first year or two of exposure to the language (Flege, 1988), at least in immersion contexts, a period Derwing and Munro (2015) referred to as the Window of Maximal Opportunity. Within this window, development may not always be uniformly in the direction of target-like representations and articulation; sometimes learners experience ups and downs in accuracy due to the process of building new representations and reorganizing their phoneme inventories (Holliday, 2016). After the window passes, learners' L2 phonology may fossilize, whereby pronunciation of segments and suprasegments ceases developing toward more intelligible, comprehensible forms (Derwing & Munro, 2013). For instructed L2 Korean learners in a low-input foreign language environment, my colleagues and I (Isbell, Park, & Lee, 2019) found support for this window as well. We found that students within their first year of exposure to Korean as a foreign language showed rapid improvements in pronunciation (greater comprehensibility as well as lower error rates) regardless of treatment, while second-year students without pronunciation instruction showed no improvements. The state of interlanguage phonology and the corresponding quality of L2 pronunciation that exists after this period is perhaps of greater interest: While giving beginners a good start on L2 speech sounds and pronunciation is important, the greater challenge lies in the gradual disentanglement of assimilated phonemes and the development of more intelligible and comprehensible pronunciation, generally in the direction of target language norms. From the perspective of diagnosis and targeted instruction, weaknesses discovered in learners who have most likely cleared the Window of Maximal Opportunity are likely to be more stable and less likely to improve without instruction in a shorter time period (Derwing & Munro, 2014).

Instruction

As previously mentioned, L2 pronunciation instruction is known to be effective (Lee et al., 2015) and durable (Couper, 2006). Moreover, instruction is capable of aiding learner development even when long-term fossilization of L2 phonology has occurred (Derwing, Munro, Foote, Waugh, & Fleming, 2014). The L2 pronunciation instruction literature, both empirical and practice-oriented, is rich with techniques that promote pronunciation learning, such as shadowing (speaking alongside an audio model; Foote & McDonough, 2017), read-alouds (reading text aloud, with feedback if possible; Horgues & Scheuer, 2014; McCrocklin, 2019), choral repetition (teacher-led repetition of words/sentences; Baker, 2014), explicit instruction of acoustic and articulatory features (explaining how to produce sounds and what they should sound like; e.g., Derwing et al., 1998), communicative tasks (such as conversation or information-gap tasks; Loewen & Isbell, 2017; Saito & Lyster, 2012), listening to self-recordings (recording oneself and listening for aspects to improve, often comparing to a model), and using visual aids (looking at acoustic visuals of self- or other-productions; Hardison, 2004), among many others.
While a complete treatment of the various types of pronunciation instruction and their associated benefits and limitations is beyond the scope of this chapter (though see Celce-Murcia et al., 2010; Derwing & Munro, 2015; Thomson & Derwing, 2015; Lee et al., 2015), I will revisit some specific instructional techniques later in the dissertation when relevant. Here, I focus my review of the literature on the more general aspects of pronunciation instruction most relevant to diagnostic assessment. One finding from the L2 phonological development literature that has important implications for instruction is the perception-production link (Flege, 1991; Derwing & Munro, 2015). Recent research in cognitive science has shown that areas of the brain responsible for articulation can also become active during, and contribute to, speech perception (Möttönen & Watkins, 2009). These same areas of the brain can also contribute to the learning of novel phonological forms (Nora, Renvall, Kim, Service, & Salmelin, 2015). In some ways, this relationship is quite intuitive: When a language user has a strong, consistent ability to perceive a specific sound, it suggests that they have a strong underlying mental representation of the sound and its distinguishing features, which in turn would lead to consistent, accurate articulation of the sound. Strong interpretations of the perception-production link hold that accurate perception (a) precedes and (b) predicts accurate production of L2 sounds. For example, if a Japanese learner of English cannot perceive the difference between /l/ and /r/ (instead assimilating both sounds to their Japanese /ɾ/ phoneme), it is unlikely that they will be able to produce the distinction. In some cases, being trained to perceive an L2 phoneme results in improvements to production (e.g., Lee & Lyster, 2017; Thomson, 2011; see also Sakai & Moorman's 2018 meta-analysis supporting such findings across 18 different studies). At the same time, some learners will be able to quite reliably decode a given phoneme from speech but struggle to articulate it in their own production: English learners of Spanish frequently struggle to produce the trill /r/ but are usually quite able to distinguish it from the flap /ɾ/ in listening. Another key finding of research on L2 speech perception and production is that focus on form, i.e., promoting awareness of and directing attention to linguistic (in this case, articulatory/acoustic) form, is beneficial to learning (Derwing & Munro, 2015; Guion & Pedersen, 2007; Kennedy & Trofimovich, 2010; Moyer, 2014; Saito, 2018; Venkatagiri & Levis, 2007). Thomson (2012) discussed the role of attention in phonological learning, whereby learner attention to phonological form leads to improvement in perception and, in turn, production. Focus on form is often operationalized as corrective feedback in speech perception and pronunciation studies, where learners are alerted to their errors and given information to support more target-like performance in the future (e.g., Lee & Lyster, 2016, 2017).
Explicit focus-on-form instruction is also useful when production is the primary focus: Learners receive explicit phonetic instruction prior to carrying out practice and/or communicative activities, receive feedback on their production (involving the provision of model input from a teacher or peer), and then go on to gradually produce more intelligible articulations with continued practice (e.g., Derwing et al., 1998; Isbell et al., 2019; Gooch, Saito, & Lyster, 2016; Saito & Lyster, 2012). This progression from explicit articulatory and acoustic knowledge to consistent, intelligible production aligns well with skill acquisition approaches to SLA (DeKeyser, 2017), where learners, particularly in instructed settings, first acquire declarative knowledge of L2 speech sounds and eventually develop efficiency in producing them through attention-focused practice (Saito & Plonsky, in press). When attempting to diagnose learner pronunciation issues for the purpose of setting instructional targets, Lado (1961) emphasized the assessment of both perception and production, highlighting that testing only one or the other results in an incomplete picture:

If a student pronounces a sound contrast in a foreign language he will also hear it. … At the same time, students learn to hear sound contrasts usually before they are able to pronounce them, and so in testing production we would not discover everything the student has learned to hear. And what is more to the point in this chapter, by testing recognition of the sound segments we will not have tested what the student has learned to pronounce. Finally, the distance between recognition and pronunciation is not the same for every student. Some students who learn to hear reasonably well still have very poor pronunciation, whereas others learn to pronounce almost as well as they can hear. (Lado, 1961, p. 78)

Furthermore, as seen in the excerpt, Lado highlighted the variability in student speech learning and suggested the potential of identifying different profiles that characterize learners' pronunciation. Thus, instruction can benefit from pinpointing the source of individual learners' difficulties: An English instructor might begin with perception training for a Japanese learner who can neither perceive nor produce /l/, or at least tackle both modes simultaneously. On the other hand, the same instructor might have another student work exclusively on production if that student can reliably hear the difference between /r/ and /l/. While the stronger claims of the perception-production link are up for debate, there is nonetheless a straightforward pedagogical argument for establishing perception first: Aural feedback on learner pronunciation has to be interpretable, and if a learner cannot tell the difference between what they produce and the model provided by a program, textbook audio CD, or teacher, adjustments to articulation seem unlikely to occur. Exemplifying this, classroom-based pronunciation instruction research by Saito and Lyster (2012) showed that corrective feedback on pronunciation in the form of recasts, which requires learners to hear the difference between their own productions and the model provided in feedback, could induce changes in L1 Japanese learners' phoneme representation and articulation of English /r/, a feature considered to be fossilized for many of the learners.
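Lado's observations translate readily into a simple decision rule for pairing perception and production results at the level of a single phoneme. The sketch below is one hypothetical way to encode that logic in Python; the 80% mastery cutoff and the recommendation wording are placeholders of my own, not part of Lado (1961) or of the KPD.

```python
def suggest_focus(perception_acc: float, production_acc: float,
                  cutoff: float = 0.80) -> str:
    """Map one phoneme's accuracy in each mode (0-1) to an instructional
    emphasis. The cutoff is an arbitrary placeholder, not a validated value."""
    perceives = perception_acc >= cutoff
    produces = production_acc >= cutoff
    if perceives and produces:
        return "no targeted instruction needed"
    if perceives and not produces:
        return "work on production (articulation practice)"
    if not perceives and not produces:
        return "begin with perception training, or train both modes together"
    # Production without perception is anomalous on Lado's account
    # ("if a student pronounces a sound contrast ... he will also hear it"),
    # so flag it for rescoring rather than prescribing instruction.
    return "unexpected profile; recheck production scoring"

# A Japanese learner of English who can neither perceive nor produce /l/:
print(suggest_focus(perception_acc=0.45, production_acc=0.40))
# A learner who reliably hears the /r/-/l/ contrast but cannot articulate it:
print(suggest_focus(perception_acc=0.95, production_acc=0.50))
```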
Research on L1 and L2 Korean Phonological Development

Research on the acquisition of Korean phonemes, receptively and productively, has yielded some useful insights. In child L1 Korean acquisition, Kim, Kim, and Stoel-Gammon (2017) reported that the earliest acquired consonants tend to be /p, p*, ph, t*, k, m, n, h/ while the latest acquired consonants are /tɕ, tɕh, s, s*, l/ (see also McLeod & Crowe, 2018, which synthesizes the findings of several L1 Korean consonantal acquisition studies). Consonants tend to be acquired earlier in syllable-initial contexts and later in clusters or word-final positions. From a featural perspective, there are reasonably clear orders of acquisition for place and manner of articulation. For place, the order is roughly bilabial → alveolar → velar → alveopalatal → liquid. For tension and voicing, children follow a tense → aspirated → lax sequence; for fricatives, lax precedes tense. These patterns can serve as a baseline for difficulty expectations and acquisition orders where L2 acquisition data are absent or insufficient. While the data from L1 Korean children are potentially useful for understanding L2 development, research on L2/heritage learners of Korean (with English as an L1/dominant language) has shown a notable contrast: L2 learners tend to struggle with tensed consonant articulation (e.g., /k*/) even at advanced proficiency levels (Lee et al., 2009; Oh, Jun, Knightly, & Au, 2002) or after considerable exposure (Holliday, 2015). Holliday (2015), a longitudinal study of Mandarin speakers' acquisition of Korean's lax/tense/aspirated stop consonant distinction, found that learner development trajectories varied considerably and that learners struggled to reliably produce the distinction even after one year of residence in South Korea. Yu (2016) even found that young Korean-English bilinguals' tensed consonants are less tensed than their monolingual peers'. Some research suggests that among adult L1 English speakers, aspirated consonants are mastered first (e.g., Tark, 2016); recall that the tensed feature is learned very early by L1 children. Furthermore, even when adult L2 Korean learners have acquired a phoneme, their articulation may still differ from native speakers (NSs), such as through the use of different areas of the tongue when making alveolar stops (Ko, 2013). Nonetheless, several similarities do exist between child L1 and adult L2 learners. L2 learners have also been shown to struggle with /l/, particularly with respect to its allophone distribution (Kim & Park, 1995; Kim, 2007; Lee, 2012). While this aligns with the pattern children exhibit, it is at the same time somewhat surprising that even L1 English learners struggle with accurate production: Their L1 contains /l/ and /r/ (as well as a flap [ɾ], an allophone of /t, d/ in English but of /l/ in Korean), so the articulation of Korean phonemes is mostly within their existing oral-motor skillset. Kim (2015) suggested that syllable context plays an important role in the accuracy of production for /l/ and other consonants. Following Lee (2012), the apparent hierarchy of ease for /l/ allophones by position is onset > coda ≈ geminate (where an /l/ in a coda position is followed by an /l/ in the onset of the following syllable). The pattern of coda articulations being more difficult mirrors child L1 acquisition order (Lee, 2012). Empirical findings have also suggested that some back vowels, particularly /ʌ, ɯ/, can present a challenge to L1 English speakers (Kim & Silva, 2003).
These findings on L2 Korean phoneme acquisition, complemented by the L1 research, provide a suitable basis for examining the overall hierarchy of difficulty found among targets in the KPD. However, it is important to note that most of the published research on L2 Korean phonological acquisition is based on L1 English speakers; developmental patterns for learners of other L1 backgrounds might be expected to differ. Although relatively few in number, studies on pronunciation instruction for L2 Korean suggest that L2 Korean learners can improve their pronunciation, just like learners of any other language. Tark (2016) demonstrated that form-focused instruction with corrective feedback helped learners improve their mastery of stops, fricatives, and affricates. Focusing on some of the same targets, Shin (2007) highlighted how perception training for three-way consonant distinctions (i.e., lax, tensed, and aspirated stops) led to improvements in both learners' perception and production. Thus, even for features commonly observed to be difficult for learners, good instruction can make a difference. Instructional treatments with a broader scope have also benefitted learners. In my prior work with colleagues (Isbell et al., 2019), students in their fourth semester of Korean study in the United States improved their speech comprehensibility after an 8-hour instructional treatment that targeted a set of segmental and suprasegmental features. Their fourth-semester peers who did not receive the treatment showed no development, indicating a benefit to this broader-scope Korean pronunciation instruction treatment. However, the evidence of improvement was not extremely robust, raising the question of whether better-targeted instruction, suited to learners' individual needs, could have made a bigger impact.

The Goal: Diagnosing L2 Korean Pronunciation

A diagnostic instrument that can help teachers and learners identify pronunciation weaknesses and in turn motivate well-targeted instruction is both desirable and plausible. For this dissertation, I developed and validated a new diagnostic language assessment for L2 Korean pronunciation, the KPD. All the inferences outlined previously in Chapter 1 (see Figure 1.1) are relevant to the KPD. However, for DLA, some inferences are perhaps more important than others. In particular, I consider the utilization inference (or impact, in Bachman & Palmer's (2010) terms) to be key in DLA, aligning with Lee's (2015) emphasis on how DLA results are used for instruction. For instruction, I adopt Housen and Pierrard's (2005) definition of second language instruction: "any systematic attempt to enable or facilitate language learning by manipulating the mechanisms of learning and/or the conditions under which these occur" (p. 2). This broad view of instruction encompasses not only what a teacher does with students in traditional classrooms, but also the aims of learning materials (e.g., textbooks, software) and the deliberate learning activities undertaken by individual learners (Loewen, 2015); learning activity is a term I will use interchangeably with instruction throughout the dissertation to describe the relatively informal and ad hoc yet still deliberate language-focused learning efforts used by learners. Thus, validity arguments for a DLA procedure should include impact on teachers in classrooms/tutorial sessions, on learner awareness and autonomous learning activity, and on the selection of learning materials by either teachers or learners.
Figure 2.3 is my proposed validity argument for using the KPD to inform the learning and instruction of KFL/KSL learners. On the right side of the figure are the sources of information that provide backing (gray boxes) for key inferences (arrows pointing upward) in the argument for using KPD results. When comparing the KPD validity argument to the more generic structure in Figure 1.1, readers may notice that I have included two additional inferences: (1) operationalization and (2) usefulness and impact. The operationalization inference draws on the work of Chapelle et al. (2008, 2010), who highlighted in greater detail how the description of target abilities and situations of use should inform test design and item/task construction. Often, theoretical and descriptive grounds are grouped with test observations in validity arguments, but in the case of the KPD I feel that a finer distinction is useful given Alderson et al.'s (2015) strong emphasis on detailed theory informing the construction of discrete, well-targeted items/tasks. I base the final inference near the top of the figure, usefulness and impact, on how effectively learners are able to make improvements to their pronunciation after applying KPD results. Finally, the entire validity argument for using the KPD can be evaluated by synthesizing the strength of supporting evidence across inferences and considering outstanding shortcomings or gaps in support. The boxes on the left detail how backing supports the inferences. Some backing for the inferences in the proposed argument already exists in the form of theoretical backing, test design, and initial piloting efforts. The former has been elaborated on already, and I detail the latter two shortly. Other backing is needed; these gaps are presented in the form of research questions (RQs). In total I have identified nine RQs, several with sub-questions (note: RQs are reproduced in the Methods chapter for easier reading). More concisely, the work undertaken as part of this dissertation can be summarized in the following four aims:

• Aim 1: Create, pilot, and revise test items to result in a final form of a diagnostic instrument that (a) functions well and takes minimal time to administer, (b) provides detailed, meaningful information about a learner's mastery of Korean phonemes, and (c) can be used productively by Korean language teachers and learners.
• Aim 2: Field test the final form with a suitable number of Korean language learners in order to collect normative data that facilitates the interpretation of results and consideration of diverse learner profiles.
• Aim 3: Study the relationship between diagnostic test scores and spontaneous speech, and plot phoneme acquisition patterns across proficiency levels.
• Aim 4: Study how Korean language teachers and learners interpret and act on test results.

The remainder of this dissertation documents the results of these efforts and reflects on the evidence they provide pertaining to the valid diagnosis of L2 Korean segmental pronunciation. The next chapter describes the culmination of Aim 1, Chapters 5 and 6 cover Aim 2, Chapters 6 and 7 cover Aim 3, and Chapter 8 represents an initial exploration of Aim 4.

Figure 2.3. A proposed validity argument for using the KPD to inform learning and instruction.

CHAPTER 3: DESIGN AND DEVELOPMENT OF THE KOREAN PRONUNCIATION DIAGNOSTIC

In this chapter, I present the design and development of the KPD in detail.
The design section features a detailed description of the operational version of the KPD used in the remainder of this dissertation. The development section chronicles the pre-operational changes to test design and items as a result of two phases of piloting.

Design

This section lays out the test purpose, appropriate uses, test specifications, item specifications, the item creation process, scoring, and score reports.

Test Purpose

The purpose of the Korean Pronunciation Diagnostic (henceforth KPD) is to pinpoint strengths and weaknesses in L2 Korean learners' receptive and productive phonemic inventories, with the goal of then positively influencing learning through learner awareness and instructional remediation. Results are intended to be informative and instructionally relevant to individual learners and their teachers or tutors. Based on Field's models of listening (2013) and speaking (2011) in Chapter 2, the test focuses on phonological knowledge as it relates to lower-level listening and speaking processes. While the KPD would be suitable for a wide range of proficiencies, it is likely not suitable for true beginners or extremely novice learners. The KPD also requires a basic level of familiarity with hangeul (한글, the Korean alphabet). Familiarity with high-frequency Korean vocabulary is helpful, though it does not need to be comprehensive.

Appropriate Uses

The KPD is intended to be used by students, teachers, and potentially language programs to increase awareness of pronunciation difficulties and guide instructional decisions relevant to classroom instruction and autonomous learning activity. These are inherently low-stakes decisions with minimal potential for negative consequences. As examples, the following uses of the KPD would be appropriate:

• A Korean learner, now an undergraduate business student, feels that she struggles with intelligible pronunciation. She asks her old Korean teacher for guidance; the teacher administers and scores the KPD and provides feedback to the learner. The learner then selects material from a Korean pronunciation textbook to practice on her own time. The learner also focuses more on her perception and production of difficult sounds when using Korean in her coursework.
• A learner asks his teacher for help with his pronunciation. The teacher administers and scores the KPD and provides feedback to the student. The teacher meets with the student after class once per week for brief tutorial sessions and assigns some practice materials. The student pays more attention to difficult sounds during class time.
• A Korean language program offers a range of short-term supplemental courses, such as academic presentations, TOPIK preparation, and pronunciation fundamentals. To help ensure that students who sign up for the segmentally-focused pronunciation fundamentals course stand to benefit from it, the KPD is administered so that students with generally strong control of Korean sounds are referred to other courses and only students with segmental difficulties take the pronunciation course. The KPD results are passed on to the teacher of the course to inform more detailed instructional decisions.
• A teacher of a pronunciation class wants to identify common difficulties and assign individualized homework to students. The teacher administers the KPD to each student in her class of 10 students and uses the score reports to select common targets for group instruction. Targets not covered in whole-class sessions are assigned to learners according to their needs.
The KPD should not be used to interpret learners' overall Korean proficiency, speaking ability, or even overall pronunciation quality. The KPD should not be used to make decisions about immigration eligibility, visa status, university entrance, or employment. The KPD is inappropriate for high-stakes decisions, especially those meant to be based on more generalized communicative ability in Korean. The following examples illustrate some inappropriate uses of the KPD:

• A university in Korea has been using the Test of Proficiency in Korean (한국어능력시험, TOPIK, http://www.topik.go.kr) in making admission decisions for international students. The TOPIK does not have a speaking component, and the university is looking for a freely-available, quick, and easy-to-score speaking test. The university decides to use the KPD and require at least a 70% average across all phonemes in production for admission.
• A university in the United States has received complaints about the accents of their non-native teaching assistants (TAs) in the Korean program. The program director decides that all TAs should be able to demonstrate mastery of Korean segmental pronunciation in order to provide a good model for students. TAs who show significant difficulty in producing Korean sounds are excluded from teaching duties.
• A Korean coffee shop owner has received complaints about her international student baristas being difficult to understand. When interviewing new baristas, she administers the KPD to non-Koreans and does not make offers to people with "too many" troublesome sounds.

Structure and Item Specifications

The KPD involves two parts, with a total of four tasks (summarized in Table 3.1; a full table of specifications is available in Appendix A). My approach to task selection was informed by Harding et al.'s (2015) recommendations for using a detailed model of language production to focus on lower-level processes. The perception-production link, with its implications for development and pedagogical practice, was also a major influence. My design of the tasks themselves was inspired by recommendations in Lado (1961) and Derwing and Munro (2015) and heeded the latter's recommendation that "materials suitable for classroom testing are similar to many of those used in pronunciation research" (p. 115). I also took note of Munro's (2008) recommendation to avoid relying on a single task type for evaluating a learner's intelligibility, as most speech elicitation techniques have at least one drawback that should be counterbalanced. Item specifications (following Davidson & Lynch, 2002) for all tasks are in Appendix B.

Table 3.1
KPD Design Summary

Section    | Task                    | Brief Item Specification                                                                                                             | Number of Items
Production | Picture Naming*         | Item: picture of a concrete noun. Response: speaking the matching word.                                                              | 154 (in 35 words)
Production | Nonword Read-Aloud      | Item: 1-2 syllable nonword. Response: reading aloud the nonword.                                                                     | 63
Perception | Pronunciation Judgment* | Item: picture of a concrete noun + audio recording of the word. Response: forced choice whether the audio recording was (in)correct. | 72 (plus 40 filler items)
Perception | Nonword Identification* | Item: audio recording of a 1-2 syllable nonword. Response: forced choice between two written 1-2 syllable nonwords.                  | 63

Note. *Task was part of initial pilot test design.

Within each section and task, the Korean phonemic inventory (Shin, Kiaer, & Cha, 2012) served as a basis for selecting the number of items.
As mentioned previously, there are 7 vowels, 19 consonants, and 2 glides that combine with vowels to form 10 diphthongs. A minimum of four items per phoneme was included in the Production section and at least three items per phoneme in the Perception section. Given the secondary utility of perception scores in relation to the test purpose and use, I felt it acceptable to collect somewhat less information in order to keep the test length more practical. An important consideration for consonant phoneme targets was the inclusion of targets in different syllable contexts to better capture each phoneme's allophonic distribution. While four items per phoneme was set as a general minimum, several phonemes were featured more often due to their prevalence in real words (e.g., the vowels /ɑ/, /i/, /o/, /ʌ/). Some consonants, such as /l/, /s/, and /s*/, have additional items to account for markedly different and previously documented difficult allophonic realizations. For example, when /s/ and /s*/ (both alveolar fricatives, the latter being tensed) are followed by the vowel /i/ or glide /j/, /s/ is realized as [ɕ] and /s*/ is realized as [ɕ*] (alveopalatal fricatives, the latter being tensed).

Production Tasks. The first part focuses on production and includes a Picture Naming task and a Nonword Reading task.

Picture Naming. For the Picture Naming task, learners are required to say the word that corresponds to a picture they are shown. This type of task is commonly used in assessments of children's L1 speech development (e.g., Kim, Pae, & Lee, 2005; Seok et al., 2002). In terms of Field's (2011) process model of speaking, this task requires test-takers to activate lexeme and phonological knowledge to phonologically encode the target word, and then taps knowledge of articulatory settings to complete phonetic encoding immediately preceding articulation of the word. The quality of articulations provides information on phonetic encoding ability and articulation knowledge, but may at the same time reflect malformed lexeme knowledge (i.e., having an erroneous phonological representation of the target word stored in the lexicon). To mitigate this latter possibility, all words are imageable nouns (thereby avoiding lexemes for potentially malformed verbal inflections) and most fall within the 1,500 most common words in Korean; some exceptions were included because the words were known to be introduced relatively early in instructional settings (e.g., body parts, animals, foods) or due to a lack of other imageable nouns featuring a target phoneme.

Nonword Reading. The Nonword Reading task requires learners to read aloud a one- or two-syllable nonword; each nonword has only one target phoneme that is scored. Vowels are assessed in isolation, and consonants and glides are assessed in simple, legal syllable structures: (G/C)V, VC, or VCV. Through this task, written letters are used to tap into phonological knowledge, leading to phonological and then phonetic encoding and articulation, similar to the Picture Naming task. To minimize potential interference from issues related to learners' orthographic knowledge, I constructed items that avoid sound-symbol mismatches (i.e., no phonological processes leading to a discrepancy between the written grapheme and the spoken phoneme). Variation in syllable context for consonants affected only the allophonic realization of phonemes (e.g., [k̥o], [u.k̬u], [ok̚]).
While the Picture Naming task avoids issues learners may have with orthography, the Nonword Reading task helps to ensure that consonant targets are represented in a variety of syllable contexts, but always in ways that are orthographically transparent (i.e., that do not involve grapheme-phoneme mismatches, such as nasalization or consonant relinking). For example, I was unable to find a suitable word for the Picture Naming task that placed /ph/ in an intervocalic (VCV) context, but covering this was easy to do in the Nonword Reading task.

Perception Tasks. The second part of the KPD focuses on perception and includes a Pronunciation Judgment task and an Identification task.

Pronunciation Judgment. The Pronunciation Judgment task presents pictures of common Korean vocabulary and, shortly after and while the picture is still visible, plays an audio recording. The audio recording is either the correct phonological form of the word, or it contains a single phoneme deviation (typically the substitution of another phoneme with mostly similar features); the learner must judge whether the sound they heard was accurate for the picture they saw. This task type has recently been used in experimental psycholinguistics (e.g., Amengual, 2016). Only the items which contain mispronunciations contribute to scores for individual phonemes and features. In terms of Field's (2013) lower-level listening processes, the picture provides learners with the target phonological string associated with the lexeme, and then test-takers decode the speech signal they hear and compare the phonemes they have decoded to the target string. If a test-taker can detect the phoneme in the stimulus that does not match the correct form of the word, it is inferred that their mental representation of that phoneme is robust enough to be distinct from the non-target (but somewhat similar) phoneme in the stimulus. This process is admittedly indirect but avoids some of the pitfalls of a task based on, for example, listening and then choosing between two words in a minimal pair. Minimal pair tasks can be difficult to construct due to a lack of minimal pairs, or a lack of minimal pairs likely to be known by learners. For example, most learners could be expected to know 강 (river), but may not know 간 (liver), which constitutes a minimal pair based on the /ŋ-n/ contrast in word-final position. Finding a sufficient number of minimal pair sets where both words were imageable was another concern.

Nonword Identification. The Identification task presents nonword audio, and learners must choose between two written options that differ by only one phoneme. Here, test-takers must tap into their phonological knowledge to decode the speech signal of each item into short strings of phonemes. Once the string of phonemes has been identified, test-takers select the written representation that best matches what they heard. Like the Nonword Reading task, nonword options consisted of 1-2 syllables in the forms V, (G/C)V, VC, or VCV. I created written keys and distractors that avoided any sound-symbol mismatches.

Item Writing

I was the primary item writer. Because my own Korean proficiency has limitations, I relied on an NS informant with a background in applied linguistics and Korean language teaching (a content expert) to verify keys, proofread, and spot any major problems at early stages of item creation.
This type of test development arrangement is reportedly common for less commonly taught languages and was explored in depth by Ryan and Brunfaut (2016), whose case study of a testing company found that testing experts and language informants working together produced higher-quality items. In the specific case of KPD item writing, the language assessment knowledge and classroom experience of the informant were extremely valuable. Importantly, items were revised after two stages of piloting; this process is documented later in the chapter. Several key resources supported my item writing. Shin et al. (2012) was the primary linguistic resource consulted to verify information on Korean phonetics and phonology. For the selection of target words in the Picture Naming and Pronunciation Judgment tasks, I consulted Lee, Jang, and Seo's (2017) Frequency Dictionary of Korean. Openly available picture collections for psycholinguistic experiments were drawn on for images used in these two tasks, including MultiPic (Duñabeitia et al., 2017) and BOSS (Brodeur, Dionne-Dostie, Montreuil, & Lepage, 2010). I also utilized images with Creative Commons licenses from www.pixabay.com. In a few cases, I produced original hand-drawn images (e.g., an image for ramyeon, Korean ramen noodles). Where necessary, I manually altered, combined, or otherwise edited images from these resources in the Paint.net image editing software. Audio stimuli for both Perception tasks were recorded by the aforementioned expert informant, a female native speaker of Korean originally from Gangwon province who attended university in Seoul and is a fluent speaker of Seoul Korean. Stimuli were recorded using a Blue Snowball microphone connected to a desktop computer and the audio recording and editing software Audacity. I applied Audacity's noise reduction and normalization filters to the recordings, and individual audio files were saved for each item.

Scoring

The KPD utilizes both human scoring (for production tasks) and objective scoring (for perception tasks), as shown in Table 3.2. All items on the KPD are scored dichotomously. The objectively scored perception items are scored against an answer key that I verified through native speaker consultation and involvement in stimulus recording, and through piloting with several native speakers (more details on piloting follow later in the chapter).

Table 3.2
KPD Scoring Overview

Task                   | Scoring Method | Scoring Target                       | Scores
Picture Naming         | Human          | each phoneme in a word               | clear (1) or unclear (0)
Nonword Reading        | Human          | target phoneme in a nonword          | clear (1) or unclear (0)
Pronunciation Judgment | Objective      | pronunciation error in stimulus word | correct (1) or incorrect (0)
Nonword Identification | Objective      | target phoneme in a nonword          | correct (1) or incorrect (0)

For the production items, scoring can be carried out by any proficient speaker of Korean, ideally with some linguistic training and familiarity with the test items. The ideal scorer (and administrator) of the test would be a learner's Korean teacher or tutor; Sundqvist et al. (2018) argued that teachers are well positioned to evaluate learning-oriented, low-stakes tests due to their subject knowledge and ability to apply results to instructional activities. A simple scoring sheet is available (Appendix C), which can be used to cross out unclear phonemes while listening to test-takers' responses.
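Because every item is scored dichotomously and each scored unit maps to one target phoneme, aggregating results into per-phoneme accuracy reduces to simple proportions. The sketch below shows that aggregation in Python; the record format and example values are hypothetical, not the KPD's actual data format or scoring code.

```python
from collections import defaultdict

# Hypothetical scored responses as (phoneme, section, score) records.
# Production scores are human judgments of clear (1) / unclear (0);
# perception scores are key-based correct (1) / incorrect (0).
responses = [
    ("/k*/", "production", 0),
    ("/k*/", "production", 1),
    ("/k*/", "perception", 0),
    ("/l/", "production", 1),
    ("/l/", "perception", 1),
]

totals = defaultdict(lambda: [0, 0])  # (phoneme, section) -> [earned, possible]
for phoneme, section, score in responses:
    totals[(phoneme, section)][0] += score
    totals[(phoneme, section)][1] += 1

# Report accuracy per phoneme and section, the quantity a score report lists.
for (phoneme, section), (earned, possible) in sorted(totals.items()):
    print(f"{phoneme} {section}: {earned}/{possible} = {earned / possible:.0%}")
```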
In consultation with the Korean instructor who scored responses in the pilot and operational testing, I also developed a set of scoring criteria to guide scoring decisions (Appendix D). These criteria emphasize the intelligibility of test-taker productions, i.e., basing scores on the unambiguity of phonemes while not demanding that productions sound native-like.

Score Reports

Score reports for the KPD have so far undergone three phases of development. The earlier versions are discussed in the latter half of this chapter. A sample KPD score report from the operational version of the test is shown in Figure 3.1, annotated with translations of the major features of the report. The goal of the score report is to provide detailed, instructionally relatable information on specific pronunciation weaknesses, such as particular phonemes or articulatory features that are not well mastered by a learner. The current version of the score report was guided by two principles: (1) guide score users' attention to an individual's most severe pronunciation targets, and (2) present detailed information that can be used as a springboard for subsequent learning. The first page of the score report addresses the first principle. The text at the top of the page provides a brief explanation of the score report and directs users to "focus on these difficult sounds and features in class or when studying on your own." No numeric scores are given on the first page; only lists of difficult-to-pronounce phonemes, articulatory features, and word contexts alongside explanations. Wherever possible, non-technical vocabulary is used in the report. For instance, 소리 (sound) is used in lieu of 음소 (phoneme). Korean characters are used to exemplify features, making the concepts more accessible to learners who may not know the precise linguistic terms; this was a particular concern for Korean-language versions of the score report. Addressing the second principle, the second page of the score report features detailed scores for every phoneme (cf. Kim, Pae, & Lee, 2005; Seok, Park, Shin, & Park, 2002) in production and perception, along with examples of each sound in real words, drawn from items the examinee did not respond to clearly/correctly on the KPD (Kunnan & Jang, 2009).

Development

The remainder of this chapter reviews the KPD's development history, highlighting changes to the test structure, tasks, and items as a result of piloting. This provides a record of development and early validation efforts related to the grounds and operationalization inferences in the KPD's validity argument. To date, the KPD has existed in three versions: an initial 'Alpha' version of the test, a heavily revised 'Beta' version, and an 'Operational' version based on limited revisions of the Beta. Both the Alpha and Beta versions were piloted with Korean learners and native speakers; pilot test data and my own observations and content analyses of items informed the development of subsequent versions of the KPD.

Alpha Version

The initial version of the KPD, henceforth the KPD Alpha, was developed in the spring of 2017. It consisted of five tasks organized in three sections (Table 3.3). A major difference between this version and later versions is the presence of a Repetition task and a Sound and Articulation Knowledge task (the latter comprising an additional section for Explicit Knowledge) and the lack of the Nonword Reading task.
The other tasks (Picture Naming, Pronunciation Judgment, and Nonword Identification) survived to subsequent versions of the test with only minor modifications and item revisions. All five tasks were delivered using the PsychoPy experimental software (Peirce, 2009); this mode of delivery was also changed in subsequent versions of the test. The Repetition task involved listening to and repeating a short 1-2 syllable nonword and was scored in a manner virtually identical to the Nonword Reading task. Much like the Nonword Reading task which replaced it, I had designed the Repetition task to elicit phonemes in particular syllable positions without being subject to potentially malformed or absent phonological representations of words in the lexicon. The Sound and Articulation Knowledge task involved three-option multiple-choice questions in English about the acoustic and articulatory qualities of Korean phonemes. My rationale for including this section and task was based on empirical findings supporting explicit phonetic teaching (e.g., Lord, 2005, 2008, 2010) and widespread pedagogical recommendations to teach articulations explicitly (e.g., by using diagrams of the articulatory apparatus, or by teaching students to manually check physical sensations such as vibrating vocal cords and release of air; see Celce-Murcia et al., 2010). If a learner lacked the explicit knowledge necessary to produce a sound, she could be taught it; if the learner had the explicit knowledge but could not accurately produce a sound, the instructional emphasis would likely be on perception and/or production practice without belaboring basic explanations of phonetic qualities and articulatory settings.

Figure 3.1. Diagram of a KPD score report. The first page of the score report is shown on the left, and the second page on the right.

Table 3.3
Initial KPD Design

Section            | Tasks                            | Number of Items
Production         | Picture Naming                   | 35 words; 140 items
Production         | Repetition                       | 63
Perception         | Pronunciation Judgment           | 72
Perception         | Nonword Identification           | 63
Explicit Knowledge | Sound and Articulation Knowledge | 39

Piloting. This subsection details initial piloting of the KPD Alpha. The primary goal of this piloting was to investigate the alignment of test-taker processes and responses with what I had intended in task design, and to root out any undesirable task issues or item-level problems before proceeding to larger-scale piloting efforts. Essentially, I wanted to see whether things generally worked (and what did not) and obtain some guidance for initial item revisions. This kind of initial, small-scale piloting is sometimes referred to as prototyping (see Nissen & Shedl, 2012), as the major aim is quickly finding major flaws before committing additional resources to test development and administration to larger numbers of (pilot) test-takers. The major consequence of this piloting was that two of the KPD Alpha tasks were eliminated or replaced: the Repetition task and the Sound and Articulation Knowledge task. In the summer of 2017, I carried out this small-scale piloting of the five initial tasks with four participants, including two L1 English learners of Korean and two Korean native speakers (Table 3.4). Each participant completed all five of the tasks, and after each task they completed a semi-structured interview with me. Figure 3.2 outlines the general procedures I followed for piloting and includes the list of semi-structured interview questions.
Table 3.4
Alpha Pilot Participants

ID      | L1      | Sex | Notes
Alpha01 | Korean  | F   | Graduate student in Second Language Studies. Expertise in psycholinguistics. Taught Korean as a foreign language for 1 year at an American university.
Alpha02 | Korean  | F   | Graduate student in Second Language Studies. Expertise in language assessment. Taught Korean as a foreign language for 3 years at an American university.
Alpha03 | English | F   | Graduate student in Second Language Studies. Expertise in language assessment. Advanced speaker of French, novice in Korean. Basic Korean learned while teaching English in Korea for one year.
Alpha04 | English | M   | Undergraduate student in Information Technology, minoring in Korean. Previously learned Korean in the U.S. military. Has Korean spouse. Intermediate Korean proficiency.

1. KPD
   • 1.1 - Task 1: Picture Naming
   • 1.2 - Task 2: Repetition
   • 1.3 - Task 3: Pronunciation Judgment
   • 1.4 - Task 4: Identification
   • 1.5 - Task 5: Sound and Articulation Knowledge
   After Each Task: Semi-Structured Interview
   • 1. Overall, what are your impressions of the task?
   • 2. What do you believe the task is about?
   • 3. Were the directions clear? Please explain any difficulties or confusion you encountered.
   • 4. Do you recall any particular questions that you felt were problematic? [Participants shown items to stimulate comments]
   • 5a. For Native Speakers: Did you generally feel confident in your answers? 5b. For Learners: What were the easiest parts of the task? The hardest?
   • 6. Do you have any suggestions for improving the task?
2. Post-Test Interview
   • 2.1. Do you have any comments or questions about the test?
3. Background Questionnaire

Figure 3.2. KPD Alpha piloting procedures.

Findings and Revisions. I analyzed participants' performance on the KPD and their comments to guide my revisions to task design, individual items, and test administration. In terms of task design, I learned several important lessons from the Alpha piloting. For tasks that utilized picture stimuli (i.e., Picture Naming and Pronunciation Judgment), the selection of target words and the selection/construction of the images are paramount. If a test-taker could not recognize the target word from the picture, I was unable to gain any information about their phonological abilities. Some pictures were simply not clear or obvious enough. Other pictures elicited multiple appropriate responses. For example, for an item intended to elicit right (as in right hand, 오른쪽 in Korean), I used a picture with an arrow pointing to the right. One NNS (Alpha03) could not recall the word, the other NNS (Alpha04) immediately responded correctly, and the two NSs instead responded with the Korean equivalent of arrow sign/symbol (화살표). Similarly, if a test-taker did not focus on the targeted aspect or element of an image, they offered non-target responses. For example, one NNS, Alpha04, offered 하트 (heart, a loanword) for an image intended to elicit 사랑 (love), focusing on the literal shape rather than what it commonly symbolizes. Thus, it became clear that images must be referents for commonly used and/or studied words, and that the images should contain ample cues for word identification (e.g., shape, color, and additional symbols such as circles or an 'X'). I also realized that the tester could provide some support, such as asking for another word, pointing to specific parts of an image, or giving clues, without compromising the intended response process (i.e., a test-taker recalls the phonological form of a target word and produces it).
For the Repetition task, the post-task interviews with both native and non-native speakers revealed that perception ability played a major role in responses, even though I had originally intended the task to primarily measure production ability. I had hoped that an audio stimulus for short nonwords, rather than printed letters, would avoid interference from shortcomings/mismatches related to learners' orthographic and sound-symbol relationship knowledge. However, all respondents noted that items were easier to respond to if they felt they had confidently identified what they heard in the stimulus. Thus, I eliminated the Repetition task and replaced it with a nonword reading task using the same stimuli in subsequent versions of the test. Finally, I found that the Explicit Knowledge task had some insurmountable problems. On the positive side, my observations of the test-takers and their interview data showed that they were carefully thinking about and/or mouthing the articulations for target and distractor sounds, and so the task did appear to tap into explicit knowledge of articulatory and acoustic features. However, as one might expect, this section was difficult. Surprisingly, it was difficult for native speakers, too. My suspicion here is that because many of the Explicit Knowledge items relied on analogy to English sounds, the Korean NSs (who were at the same time English NNSs) were in a sense hampered. This is a considerable problem given that the target population of the KPD varies in L1; it seemed it would be inappropriate to give this task to those who do not speak (Standard American) English as an L1. Furthermore, one feature of items targeting articulations required test-takers to correctly choose descriptions of oral articulations. This worked well enough for articulations such as bilabial stops or lip-rounding for vowels, but items that required test-takers to identify what they were doing with their tongue and/or where the tongue was in the mouth were opaque. These items were also challenging for me to write while avoiding technical jargon (e.g., not using terms like palate or alveolar ridge). Thus, despite some evidence of the task and items functioning as intended and the potential instructional usefulness of such information, I decided to remove this task from future versions of the KPD. In terms of individual items, rather than salient task features or features affecting several items, the participants alerted me to several items with idiosyncratic problems. Primarily, these were vocabulary items in the Picture Naming and Pronunciation Judgment tasks. Some words were simply unfamiliar to the NNSs, and so I revised such items to target words that are either (a) higher frequency or (b) introduced earlier in instructional materials (e.g., featured in the first level of the popular Sogang Korean or Integrated Korean textbooks). Other feedback pertained to audio quality and response options for some nonword items (Repetition and Identification tasks). When a native speaker felt an item had a stimulus and/or options that led to any lack of confidence in the response, I took that as a sign that something needed to be fixed. Through this process, I was able to identify several Identification items to be re-recorded to ensure the clarity of keys. Piloting the KPD Alpha also provided me an opportunity to begin developing the KPD score report. Figure 3.3 is an example of this early attempt.
Compared to later versions, this initial score report was somewhat text-heavy, but it contained many of the core elements that would reappear in subsequent revisions. For these initial score reports, I customized the text for each of the two learners who participated in the pilot. One major difference between this version of the score report and later versions is the provision of summary scores (total and task scores) at the top of the report.

Figure 3.3. Early draft of KPD score report.

Beta Version

The second iteration of the KPD, dubbed the KPD Beta, was pared down to four tasks (Table 3.5). The administration of the tasks also underwent changes: The production tasks were administered via paper flipbooks (i.e., each item printed on its own 5.5 × 8.5 inch cardstock, with all items ring-bound in the top-left corner), and the receptive tasks were administered in OpenSesame (Mathôt, Schreij, & Theeuwes, 2012), experimental software similar to PsychoPy but with better (or at least, to me, more intuitive) display settings. The Picture Naming and Pronunciation Judgment task stimuli were extensively revamped. I opted for full-color pictures, drawing on open-source images previously used in psychological studies (MultiPic and BOSS) and free full-color images (www.pixabay.com), and created or edited images manually as needed. In my image editing, I used various techniques to make target words more obvious, including arrows and circles to highlight key aspects and limited amounts of text to highlight a semantically related word. For example, the new picture for 아저씨 (ajeossi, middle-aged man) included a picture of a middle-aged woman labelled "아줌마" (ajumma, middle-aged woman) and a blank under the picture of the man.

Table 3.5
KPD Beta Design Summary

Section    | Task                    | Brief Item Specification                                                                                                             | Number of Items
Production | Picture Naming*         | Item: picture of a concrete noun. Response: speaking the matching word.                                                              | 154 (in 35 words)
Production | Nonword Read-Aloud      | Item: 1-2 syllable nonword. Response: reading aloud the nonword.                                                                     | 63
Perception | Pronunciation Judgment* | Item: picture of a concrete noun + audio recording of the word. Response: forced choice whether the audio recording was (in)correct. | 72 (plus 40 filler items)
Perception | Nonword Identification* | Item: audio recording of a 1-2 syllable nonword. Response: forced choice between two written 1-2 syllable nonwords.                  | 63

Note. *Task was part of initial pilot test design.

Piloting. Compared to the KPD Alpha pilot, the focus of piloting the KPD Beta was more quantitative. I set out to collect data from a sample of learners just large enough for estimates of reliability and item statistics to be meaningful. I also collected data from a handful of Korean NSs. Having established tasks that generally worked (and after removing or addressing major flaws), I was more interested in fine-tuning at the item level. At the same time, this second round of piloting offered an opportunity to pilot other instruments that would be used in the main study, including a background questionnaire, a pronunciation self-assessment, and an independent speaking task (see Chapter 4 for details). The piloting procedures for the KPD Beta are outlined in Figure 3.4. The KPD Beta was piloted with 27 learners and 7 NSs of Korean recruited at Michigan State University. Of the learners, 25 were female and 2 were male. The learner breakdown by class level was as follows: 8 first year (KOR 102), 13 second year (KOR 202), 3 third year (KOR 302), and 3 fourth year (KOR 402).
For L1, 18 reported being English speakers, 5 were Chinese speakers, 2 spoke Malay, and there was 1 speaker each of Japanese and Thai.

PART 1 – Online
1. Send participant link to Qualtrics survey
2. Participant completes all parts of Qualtrics survey
   a. Informed Consent
   b. Part 1: Background
   c. Part 2: Korean Pronunciation Self-Assessment
      i. Global
      ii. Phoneme inventory (production and perception)
PART 2 – In-Person
1. Independent Speaking Task (2-3 minutes)
2. KPD (15-20 minutes)
   a. Production Tasks – Audio Recorded
      i. Picture Naming
      ii. Nonword Reading
   b. Perception Tasks – OpenSesame
      i. Pronunciation Judgment
      ii. Nonword Identification
3. Korean EIT (10 minutes) – Audio Recorded

Figure 3.4. KPD Beta piloting procedures.

Of the NSs, 6 were female and 1 was male. All NSs grew up primarily in South Korea and their dominant language was Korean, but all spoke English at a high level.

Findings. Rather than the learners' specific results and pronunciation strengths and weaknesses, I focus here on the technical qualities of the KPD itself, highlighting the specifics of test-taker responses where relevant. As a point of reference for the more detailed findings which follow, Table 3.6 presents summary statistics for the 27 learners who completed the KPD Beta.

Table 3.6
KPD Beta Learner Summary Statistics

Section / Task                   | n (items) | Mean   | SD    | Min | Max | Skew  | Kurt.
Production                       | 217       | 193.74 | 14.17 | 153 | 214 | -1.09 | 0.85
Task 1 – Picture Naming          | 154       | 142.70 | 7.50  | 124 | 153 | -0.77 | 1.44
Task 2 – Nonword Reading         | 63        | 51.04  | 7.41  | 29  | 61  | -1.28 | 1.34
Perception                       | 135       | 99.56  | 14.00 | 71  | 124 | -0.94 | -0.31
Task 3 – Pronunciation Judgment* | 72        | 41.33  | 11.63 | 18  | 61  | -0.23 | -0.84
Task 4 – Nonword Identification  | 63        | 58.22  | 3.29  | 50  | 63  | -0.72 | 0.03

Developer Observations and Scorer Feedback. During piloting, I observed participants responding to items and took notes when I saw issues. I also took notes on any (unsolicited) verbal feedback participants gave on the tasks and items. At the task level, I noticed that the delay of 1.0 second between initial stimulus presentation and audio in the receptive tasks seemed excessive. For the Nonword Reading task, I noticed an important error: I had mistakenly included two additional items targeting the glide /j/ and failed to include any items targeting ㄲ /k*/. Other notes on individual items were as follows:

• T1_32-6 (ㅣ in 초콜릿): There appears to be substantial speaker variation; some NSs and learners use ㅔ instead of ㅣ. Excluding from analyses.
• T3_06 (팔): The picture would be clearer if it showed more of the upper body (to distinguish the arm from a leg).
• T3_68 (예쁘다): The stimulus, "에쁘다", is perhaps a slang/stylistic variation. Need to look into this.
• T3_16 (미국): The stimulus, "미궄" (articulated with a /kh/ in the coda), is not highly distinct and is also not a phonemic contrast. Consider changing to "미굿" or "미궁".
• T3_50 (시장): The target phoneme is /i/, but the stimulus "쉬장" is not articulated distinctly enough. Consider "셔장" or "섀장" instead.

Another source of feedback came from the Korean instructor who scored the production section of the KPD Beta. She found the noise from shuffling through the paper flipbooks present in the audio recordings to be a minor distraction. At the same time, I did notice that the flipbooks could occasionally be cumbersome.

Native Speaker Results. Due to the small number of examinees and the extremely high proportion of correct responses to most items, most conventional reliability and item analyses are not appropriate.
Instead, the analyses of NS item responses focus solely on the proportion of correct responses: A NS should generally be able to answer every item correctly, barring an occasional slip of the tongue or mishearing, and NS productions should otherwise be judged as acceptable. The first key finding is that relatively few items (13 out of 366) had any incorrect responses from NSs. This provided general support for the notion that the KPD Beta task designs, item specifications, individual items, and scoring procedures were working as intended: Speakers known to have robust Korean phonological systems (i.e., virtually all NSs) could successfully produce and perceive Korean segments according to KPD results. The items in Table 3.7, however, warranted extra scrutiny, because this desired success was not (totally) present. For 9 of the 13 potentially flawed items, there was only one NS incorrect response each. These incorrect responses may conceivably be attributable to accidental mis-presses on the keyboard (perception tasks) or slips of the tongue. In the case of the Picture Naming item, the issue may be an idiosyncratic scoring error rather than a speaker error. Nonetheless, I carefully reviewed these marginally problematic items when revising the KPD, focusing on stimulus clarity and distractor choices (as relevant). More pressing were items T3_01 (3 incorrect responses), T3_33 (7 incorrect), T3_44 (5 incorrect), and T3_50 (7 incorrect). These items required substantial revision and/or replacement.

Table 3.7
KPD Beta Items with Incorrect Responses from Korean NSs

Task                              Item Code  Incorrect  Note
Task 1 – Picture Naming           T1_33-6    1/7        This is the ㄴ /n/ in 빨간색; it was substituted with ㅇ /ŋ/ by one NS.
Task 2 – Nonword Reading          N/A        N/A        All items responded to correctly.
Task 3 – Pronunciation Judgment   OK_30      1/7        과일 /kwɑ.il/ fruit; filler item (pronounced correctly)
                                  T3_01      3/7        비 /pi/ rain pronounced as “피” /phi/
                                  T3_33      7/7        싸움 /s*ɑ.um/ fight pronounced as “사움” /sɑ.um/
                                  T3_39      1/7        사람 /sɑ.lɑm/ person
                                  T3_41      1/7        하나 /hɑ.nɑ/ one pronounced as “하마” /hɑ.mɑ/
                                  T3_44      5/7        창문 /ʨhɑŋ.mun/ window pronounced as “찬문” /ʨhɑn.mun/
                                  T3_50      7/7        시장 [ɕi.ʨɑŋ] market pronounced as “쉬장” [ɕwi.ʨɑŋ]
                                  T3_59      1/7        눈 /nun/ eye pronounced as “는” /nɯn/
                                  T3_68      1/7        예쁘다 /je.p*ɯ.tɑ/ pretty pronounced as /e.p*ɯ.tɑ/
                                  T3_71      1/7        원 /wʌn/ won (Korean currency) pronounced as /wɑn/
Task 4 – Nonword Identification   T4_40      1/7        니 /ni/; distractor 미 /mi/
                                  T4_44      1/7        웅 /uŋ/; distractor 움 /um/

Task 1 – Picture Naming: Analysis of Non-Target Elicited Words. For the KPD Beta, Task 1 procedures were revised to allow the tester to prompt test-takers when they provided a non-target word, up to and including modeling the word for the test-taker if it was completely unknown. While I deemed this accommodation necessary if the KPD were to be administered to learners and L2 users across a reasonably wide range of general proficiency, I also had concerns about items that might consistently require extensive prompting and/or modeling: The flow of the task would be interrupted, and the overall time demand of the test would increase. To investigate this new aspect of Task 1 procedures, I re-listened to all Task 1 audio recordings and logged each instance where a test-taker's first response to an item was off target. I logged what alternative(s) they provided and whether they ultimately required the tester (i.e., me) to model the word for them. Table 3.8 shows a summary of this analysis.
Table 3.8
Summary of KPD Beta Task 1 – Picture Naming Non-Target Responses

Group      N     Number of Non-Target Initial     Number of Tester Models
                 Responses (proportion*)          Supplied (proportion*)
All        34    309 (26%)                        204 (17%)
NSs         7     18 (7%)                           0 (0%)
Learners   27    291 (31%)                        204 (22%)
Note. *Proportions computed based on the total number of items administered to each (sub)group (35 items × N test-takers).

Focusing on specific items, there were only 8 words (out of 35) that elicited non-target responses from NSs (Table 3.9). The most frequently unclear items were 빵 (bread), 포도 (grape), and 돈 (money). The non-target alternatives provided for bread and money were more specific terms, while the alternatives provided for grape indicated some lack of clarity in the picture; it did not appear that most of the NSs could identify the picture as grapes rather than some other similar-looking fruit.

Table 3.9
KPD Beta Task 1 Items which Elicited Non-Target NS Responses

Item          Translation             Freq.  Alternatives Provided
T1_1 빵       bread                   5      바게트 (baguette)
T1_16 포도    grape                   5      열매 (berry), 가지 (eggplant), 블루베리 (blueberry)
T1_17 돈      money                   3      지폐 (bill), 화폐 (bill), 현금 (cash)
T1_11 택시    taxi                    1      자동차 (car)
T1_24 그림    picture                 1      액자 (picture frame)
T1_30 왼쪽    left (side/direction)   1      [mumbling]
T1_4 나비     butterfly               1      나바 (cf. 나방, moth)
T1_9 집       house                   1      주택 (house/dwelling)

Table 3.10 lists the items which most frequently elicited non-target responses from learners and those which most frequently required modeling by the tester (me). In total, 31 out of 35 items initially elicited a non-target word or no response from at least one Korean learner. I took these data with a grain of salt, given that much of the pilot learner sample was on the lower end of Korean proficiency due to having relatively minimal exposure to the language (e.g., second-year students had had only roughly 150 hours of classroom instruction when they took the KPD). Like the NSs, the learners found the images for grape and money somewhat ambiguous. Looking at the non-target words supplied, the learners, compared to the NSs, often substituted more general terms or hypernyms. For example, the Korean word for “fruit” was given for the pictures of grapes and lemon, and the Korean word for “man” was given for the picture of a middle-aged man (n.b., the Korean word for middle-aged man is extremely commonly used). Learners also attempted to supply loanwords or words from other languages, such as the Japanese tori for Korean 새 (bird). In other instances, phonological word forms were inaccurately recalled.

Table 3.10
KPD Beta Task 1 Items with Frequent Non-Target Learner Responses

Item           Eng.        Freq.  Freq.   Prompting  Alternatives Provided
                                  Model   Success
T1_14 땅콩     peanut      24     22      2/24       당근 (carrot), 돈 (money), 상추 (lettuce)
T1_25 용       dragon      23     18      5/23       룡 (similar to Chinese), 량 (amount), 공룡 (dinosaur), “dragon” (English)
T1_3 원숭이    monkey      21     21      0/21       마리 (animal counter), 동물 (animal)
T1_26 침대     bed         18     17      1/18       잠대 (sleep + second half of target word), 베드 (English “bed” in Korean pronunciation), “bed” (English)
T1_4 나비      butterfly   18     17      1/18       빠삐용 (Korean approximation of French for “butterfly”), 냄비 (cooking pot), 비자 (visa)
T1_16 포도     grape       17     13      4/17       과일 (fruit), 폼 (?), 블루… (blue…), “grapes” (English)
T1_11 택시     taxi        13      1      12/13      자동차/차 (car), 기겐샤 (Korean approximation of a Japanese word?)
T1_17 돈       money       13      5      8/13       원 (won, the Korean currency unit), 현금 (cash), 현킨 (malformed 현금/cash), 천원 (1,000 Korean won)
T1_28 왕       king        13     13      0/13       왕자 (prince), “king” (English)
T1_24 그림     picture     12      7      5/12       사진 (photograph), 꽃 (flower), 종이 (paper), “art”/“painting” (English)
T1_10 새       bird        11      9      2/11       아가 (baby), 샘 (?), 토리 (Japanese), 파란색 (blue), 가새 (? + bird)
T1_23 의자     chair       10      7      3/10       자리 (seat), 자기 (oneself), 의사 (doctor)
T1_27 쓰레기   trash        9      6      3/9        휴지통 (wastebasket), 휴게통 (malformed 휴지통), 레서핑 (?), 나비스탄 (?)
T1_8 아저씨    middle-aged man  9  2      7/9        남자 (man), 아버.. (beginning of “father”), 할아버지 (grandfather)
T1_18 레몬     lemon        8      0      8/8        과일 (fruit)
T1_19 시계     clock        8      8      0/8        시간 (hour), 시름 (?), “clock” (English)
T1_5 토끼      rabbit       8      8      0/8        또자 (?), 토자 (?), “rabbit” (English)
T1_7 돼지      pig          8      7      1/8        뒤기 (malformed 되지)
T1_30 왼쪽     left         7      0      7/7        오른쪽 (right), 오른… (beginning of “right”), 왼쪽에 (left + to/on)
T1_1 빵        bread        5      1      4/5        밤 (chestnut; possible mispronunciation of target), 음식 (food), “bread” (English)
T1_13 귀       ear          5      5      0/5        이 (tooth), 얇.. (part of idiom “귀가 얇다”, meaning gullible)
T1_31 불       fire         5      4      1/5        화 (Sino-Korean root meaning “fire”)
T1_22 맥주     beer         4      1      4/5        물 (water), 술 (alcohol), 소주 (Korean traditional alcohol), 비어 (Korean pronunciation of loanword “beer”), “beer” (English)
T1_33 빨간색   red          4      4      0/4        none
T1_34 꽃       flower       4      4      0/4        “flower” (English)
Note. Items responded to with non-target words fewer than 4 times are excluded from the table.

For these items where non-target words were initially elicited, I was also interested in seeing when I was able to prompt learners to eventually provide the target word. This varied greatly. For a word like monkey, which was not initially provided by 21 of the 27 learners, it seemed that those learners were simply unfamiliar with the word in Korean, and I had to provide a model to each of them. However, for a word like left, I was able to successfully prompt all 7 learners who initially supplied something else (most commonly right). In general, I took away from this analysis that several pictures would need revising in order to minimize non-target responses and modeling, yet at the same time I accepted that some degree of prompting and modeling may be necessary, particularly when administering the KPD to learners with limited Korean experience.

Reliability. I examined the reliability of the KPD for the 27 learners by computing Cronbach's alpha. Two scoring models were explored: individual items and item parcels. For the individual items approach, I entered each item separately into reliability and item analyses. I carried these analyses out at the task level (i.e., separately for Task 1, Task 2, etc.) and at the mode level (i.e., Tasks 1 & 2; Tasks 3 & 4). For the item parcels approach, I computed total scores for each phoneme in each mode, collapsing the several items corresponding to a phoneme into a single polytomous item (e.g., a sum score for all items targeting ㄱ /k/ in Task 1 and Task 2). Results of these reliability analyses are in Table 3.11. Generally, reliability results were within desirable ranges, and item parceling led to minimal degradation of reliability despite collapsing 100+ items into just 28. In sum, the test items (or item parcels) appeared to be strongly interrelated.

Table 3.11
Reliability of the KPD Beta

Section/Task                          n     Cronbach's alpha      Cronbach's alpha
                                            (individual items)    (item parcels, n = 28)
Production                            217   .92                   .87
  Task 1 – Picture Naming             154   .85
  Task 2 – Nonword Reading             63   .86
Perception                            135   .92                   .91
  Task 3 – Pronunciation Judgment*     72   .91
  Task 4 – Nonword Identification      63   .65
Note. *Excluding filler items.
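For reference, Cronbach's alpha for a set of k items or parcels is computed from the item variances and the total-score variance. This is the standard formula, not a KPD-specific variant:

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right)

where \sigma_i^2 is the variance of item (or parcel) i and \sigma_X^2 is the variance of the total score. Parceling reduces k, which by itself tends to lower alpha, but each polytomous parcel also contributes more variance than any single dichotomous item, which is one way to understand why the degradation observed in Table 3.11 is small.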
Item Statistics. As another means of investigating the performance of individual items, I ran classical test theory (CTT) item analyses on the set of learner test data, separately for production and perception items, which yielded discrimination (D) and facility (P) statistics for each individual KPD item. The diagnostic decisions made on KPD data are technically based on a cut score (actually, cut scores for parcels of items), which would make criterion-referenced item statistics (i.e., the B index and facility differences between masters and non-masters) more appropriate. However, because I was still in the early stages of developing an appropriate measurement model and framework for interpreting scores, I opted for the CTT analyses, which still gave reasonably informative indications of how well participants with generally more accurate pronunciation did on the items and how easy the items were overall. Additionally, I expected that items would have very high facility values. For example, items targeting ㅏ /ɑ/, a phoneme cross-linguistically common to many learner L1s, were expected to be rather easy. Thus, typical interpretations of CTT item analyses for norm-referenced tests (e.g., desirable facility values are between .25 and .75) were set aside. More weight was given to discrimination. In typical norm-referenced test contexts, discrimination values above .3 are desired (Carr, 2011), but I took a more liberal approach in line with my expectation that some items would be very easy (i.e., have high facility and thus poorly differentiate learners with stronger and weaker pronunciation or perception): I flagged items with negative discrimination (Table 3.12). Negative discrimination indicated that learners with generally more accurate production (or perception) tended to do poorly on the item. At the same time, given the small sample, a small number of people at the higher end of the total score range with similar pronunciation difficulties (e.g., great difficulties with phonemes predicted to be difficult, such as ㄹ /l/) could skew discrimination indices. Thus, I looked for larger negative discrimination values alongside facility values, and I considered the content of items.

Table 3.12
KPD Beta Items Flagged for Potential Revision

Item                                D      P     Notes
Task 1 – Picture Naming
  T1_1-1                            -.14   .70   ㅃ /p*/ in 빵
  T1_12-1                           -.17   .96   ㅌ /th/ in 택시
  T1_18-2                           -.22   .96   ㅘ /wɑ/ in 화장실
  T1_30-4                           -.22   .96   ㅉ /ʨ*/ in 왼쪽
  T1_32-5                           -.05   .67   ㄹ /l/ (geminate) in 초콜릿
Task 2 – Nonword Reading
  T2_05                             -.12   .78   ㅃ /p*/
  T2_30                             -.07   .74   ㅆ /s*/
Task 3 – Pronunciation Judgment
  T3_18                             -.19   .15   (ㄱ)ㄲ /k*/ in 꿀
  T3_34                             -.21   .11   (ㅅ)ㅆ /s*/ in 접시
  T3_45                             -.16   .74   (ㄴ)ㄹ /l/ in 라디오
  T3_68                             -.18   .93   (ㅔ)ㅖ /je/ in 예쁘다
Task 4 – Nonword Identification
  T4_24                             -.24   .85   쪼 /ʨ*/ (조)
  T4_51                             -.28   .96   으 /ɯ/ (우)
Note. For Tasks 3 and 4, distractors are indicated in parentheses.

Many of the items with larger negative discrimination and/or lower facility targeted tensed phonemes, which was not unexpected given their cross-linguistic rarity, their high degree of similarity with other Korean sounds (i.e., articulation differs from the corresponding lax phoneme only in tenseness), and previous empirical findings (e.g., Moon et al., 2009). Similarly, the phoneme /l/ (ㄹ) was flagged in one item. Other flagged items involved English-origin loanwords; this may have been due to learners mixing the Korean phonological form with the form present in their native language.
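To make these computations concrete, the sketch below derives facility and discrimination for a small, hypothetical dichotomous response matrix. It illustrates the statistics described above rather than reproducing the actual KPD analysis scripts, and it uses a corrected item-total correlation (the item is removed from the total it is correlated with), a common variant of the point-biserial approach:

    import numpy as np

    # Hypothetical responses: rows = test-takers, columns = dichotomous items
    # (1 = correct, 0 = incorrect).
    responses = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
        [1, 1, 0, 1],
    ])

    total = responses.sum(axis=1)

    # Facility (P): proportion of correct responses to each item.
    facility = responses.mean(axis=0)

    for i in range(responses.shape[1]):
        # Discrimination (D): correlation between the item and the total
        # score computed from the remaining items.
        rest = total - responses[:, i]
        d = np.corrcoef(responses[:, i], rest)[0, 1]
        print(f"Item {i + 1}: P = {facility[i]:.2f}, D = {d:.2f}")

An item everyone answers correctly would have P = 1.00 and an undefined (zero-variance) D, which is one reason very easy items were interpreted with care here.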
Score Reporting. Each of the 27 learners in the second (Beta) pilot study received an individual score report (Figure 3.5). The reports were composed in English and consisted of two pages. The first page summarized their KPD results, highlighting phonemes that were deemed difficult to produce based on an arbitrary cutoff of 80% accuracy in production. The first page also included information on features (e.g., tenseness) and word contexts (e.g., word-initial) that presented difficulty for learners, using the same 80% cutoff. Notably, the first page had no numeric scores. My intention was to require score report users to read the feedback rather than zero in on any total or part scores (see Alderson et al., 2015, pp. 188-192, for discussion of learners preferring traditional total scores and paying less attention to diagnostic feedback). The second page provided detailed information on learners' accuracy for each of the 28 phonemes in production and perception. It also included stimuli from items on which they made mistakes. My intentions here were to make the results more memorable (“ah, 왼쪽, I always mispronounce the ㅉ”) and to provide some initial material for instruction. A learner could try recording the missed production items and ask their teacher for feedback, or a teacher could provide dictation exercises based on the missed perception (and production) items.

Figure 3.5. KPD Beta score report.

Revisions Leading to Operational Version

Broad, task-level revisions for the KPD Operational Version were few in number and relatively minor. The production tasks were converted to PowerPoint presentations that could be smoothly clicked through on a computer (although using flipbooks would still have been acceptable). For the Pronunciation Judgment task, the time between initial presentation of the image and the start of the audio was reduced from 1.0 second to 0.5 seconds. Based on the previously presented Beta pilot findings and careful review of item content, I made the following changes to items:

Task 1 – Picture Naming
• T1_4 나비 (butterfly): The coloring of the image was manually altered to that of the iconic monarch butterfly to avoid the non-target moth.
• T1_16 포도 (grape): The original image showed only a single grape. I produced an image with a cluster of grapes.
• T1_17 돈 (money): The original image included only paper money/bills, and some non-target responses reflected this. I replaced it with an image including both paper bills and coins, aiming to elicit the more general money.
• T1_25 용 (dragon): Upon careful inspection of the non-target responses, I noticed that several non-Western participants had difficulty coming up with the right word. I added an image of a dragon from East Asian cultures to make this item more cross-culturally effective.
• T1_29 레몬 (lemon): I replaced this item with 라면 (ramyeon, Korean ramen noodles).
• T1_32 초콜릿 (chocolate): Although the National Institute of the Korean Language (2015) maintains that the penultimate phoneme is /i/, I decided not to score (i.e., to ignore/delete from the specifications) the /i/ in the last syllable due to substantial NS and learner variation.

Task 2 – Nonword Reading
• Two /j/ glide items (T2_57 and T2_59) were replaced with items targeting /k*/: 까 (/k*ɑ/) and 이끼 (/i.k*i/).
Task 3 – Pronunciation Judgment
• OK_30 과일 (fruit, filler item), T3_01 비 (rain), T3_39 사람 (person), T3_41 하나 (one), T3_71 원 (Korean won currency): Re-recorded.
• T3_33 싸움 (a fight): Changed to 비싸다 (/pi.s*ɑ.tɑ/, expensive), with the audio as 비사다 (/pi.sɑ.tɑ/).
• T3_44 창문 (window): Changed the audio from 찬문 (/ʨhɑn.mun/) to 차문 (/ʨhɑ.mun/).
• T3_50 시장 (market): Changed the audio from 쉬장 (/ɕwi.ʨɑŋ/) to 섀장 (/ɕje.ʨɑŋ/).
• T3_68 예쁘다 (pretty): Changed the audio from 에쁘다 (/e.p*ɯ.tɑ/) to 왜쁘다 (/we.p*ɯ.tɑ/).

Task 4 – Nonword Identification
• T4_40 니 (/ni/), T4_44 웅 (/uŋ/): Re-recorded.

Conclusion

This chapter documented the design and development of the KPD, highlighting the linguistic and psycholinguistic bases for the design of the test as well as incremental efforts to better represent the underlying constructs and reduce sources of irrelevant variance in test-taker performance. This documentation will be revisited in Chapter 9, where evidence for the validity of the KPD is considered alongside the validity argument proposed in Chapter 1.

CHAPTER 4: METHODS

This dissertation is a test development project. In test development, developers typically go through several stages, beginning with setting a purpose for developing a test and ultimately producing an operational form of the test with supporting documentation (Irwing & Hughes, 2018). Previous chapters have detailed several of the early stages, including defining the test purpose and developing items. In this chapter, I outline the methods I used to carry out the validation stage of test development, that is, collecting evidence relevant to the inferences and assumptions of the KPD's validity argument.

I adopted a mixed-methods research design for the validation stage of test development. Specifically, I used a mixed-methods design that is closest to a convergent parallel design in Creswell and Plano Clark's (2011) widely-used typology. I collected both quantitative and qualitative data at roughly the same time, with the KPD validity argument as the nexus for integrating sources of information and making interpretations. The quantitative component involves the collection of field test data and other relevant measures from a large sample of L2 Korean learners. The qualitative component entails interviews with L2 Korean learners and with a teacher of two of those learners. These two components complement one another primarily by providing evidence relevant to different inferences or assumptions in the KPD's validity argument. In language testing, interviews are commonly used to explore, in some detail, stakeholder test score interpretations (e.g., Dimova & Kling, 2018) and interfaces between tests and teaching and learning (e.g., Allen, 2016; Tan & Turner, 2015).

This study makes use of instruments with Korean-English bilingual directions, with Korean being the target language for participants and English being a global lingua franca that could support participants at earlier stages of Korean learning. Interviews utilized Korean and/or English, with either language being used to varying degrees to support meaning-making and mutual understanding between interviewer and interviewees.

Participants

For the quantitative component of the study, which I refer to as field testing, I collected KPD test data from a large number of adult Korean language learners in Seoul, South Korea. I also collected data from a small number of Korean NSs.
For the qualitative component of the study, which I refer to as the interview study, I interviewed a subset of 21 learners from the field testing sample. In addition, I interviewed one Korean instructor who had taught two of these learners.

Field Testing

For field testing of the KPD, a large sample of Korean learners and a small number of Korean NSs participated.

Learners. In total, 198 learners of Korean participated in the field testing of the KPD (Table 4.1). A large majority (174) were female. A total of 24 L1s and 36 nationalities were represented in the sample. A plurality of these learners were L1 Mandarin speakers from Mandarin-dominant countries (i.e., China and Taiwan). Most learners were affiliated with Korean universities in some way, as intensive-program language students, undergraduates, or graduate students. A small number were currently working in Korea in various capacities (e.g., embassy staff, English teacher).

Table 4.1 Field Testing Sample Characteristics: Demographic Categories 24 174 n 79 5 28 12 21 11 1 1 39 63 17 3 1 34 119 30 13 1 n Category n Circumstances in Korea Language Student*** 59 30 Level 1 (Lowest) Level 2 14 Level 3 14 9 Level 4 Level 5 7 Level 6 (Highest) 6 Other/Specialized Program 6 5 Undergraduate Graduate Student 5 Other (not a student) 43 Category Gender Male Female Nationality China Taiwan Japan USA Russia Vietnam Hong Kong Kazakhstan France Malaysia Other* (less than 5 per country) First (Most Dominant) Language 88 Chinese – Mandarin 19 English 19 Russian 13 Japanese 11 Spanish 8 Chinese – Cantonese 7 Vietnamese French 5 Other** (less than 5 per language) 28 1st 2nd 3rd 4th 5th or later NA Korean as a jth Language (median)
Note. *Includes Azerbaijan, Bangladesh, Belarus, Bermuda, Brazil, Chile, Colombia, Ecuador, El Salvador, Germany, Indonesia, Iran, Italy, Kyrgyzstan, Mexico, Mongolia, Peru, Philippines, Singapore, Spain, Thailand, Turkey, Turkmenistan, Sri Lanka, Ukraine, and Uzbekistan. **Includes Azerbaijani, Bangla, German, Indonesian (Bahasa), Italian, Kazakh, Kyrgyz, Mongol, Malay (Bahasa Malay), Persian (Urdu), Portuguese, Tagalog, Turkish, Taiwanese, and Sinhala. ***“Language Student” refers to learners enrolled in a university-affiliated intensive Korean program. Throughout Korea, instruction in these institutes is almost universally divided into six levels, with 1 being appropriate for (true) beginners and 6 designed for learners at/approaching advanced levels of overall Korean proficiency.

The average age of participants was 24.17 years (median = 23 years; Table 4.2). The average participant began learning Korean at roughly 19 years of age, and most participants were learning it as their third or later language. On average, participants had spent a total of roughly one to one and a half years in Korea but varied considerably in their total time spent in-country. Of that time, approximately six months to one year on average was spent in in-country language study, but again, there was considerable variation (SD = 14.77 months). Outside of Korea, most likely in their home countries, participants had spent one to one and a half years studying Korean as a foreign language, yet again with considerable variation (SD = 24.22 months).
Table 4.2
Field Testing Sample Characteristics: Age and Exposure

Variable                                              n    M      SD     Median  Min  Max
Age (years)                                           198  24.17   4.46  23      19    48
Age of Onset (years)                                  198  19.35   4.79  19       0    39
Time Living in South Korea (months)                   198  17.76  19.72  12       0   130
Time Living with a Korean-Speaking Family (months)    196   9.67  44.63   0       0   360
Time Studying Korean in South Korea (months)          198  11.01  14.49   6.5     0   130
Time Studying Korean as a Foreign Language (months)   198  17.31  24.22  12       0   216
Total Korean Study Time (months)                      198  28.33  30.63  22.5     0   296

Participants self-reported their Korean proficiency in two ways: self-assessment of the four macroskills (speaking, listening, writing, and reading) and self-report of proficiency test results (Table 4.3). The self-assessment was based on a scale of 0 (“none”) to 10 (“perfect”), with each point having a simple descriptor (e.g., 5 = adequate). The mean and median self-ratings for the productive skills were roughly 5, and those for the receptive skills were roughly 6.

Table 4.3
Self-Assessment of Macroskills

Skill       n    Mean  SD    Median  Min  Max
Speaking    198  5.01  1.94  5       1    10
Listening   198  5.82  2.03  6       1    10
Writing     198  4.84  1.84  5       1    10
Reading     198  5.80  2.06  6       1    10

For self-reported proficiency test results, a majority of participants reported Test of Proficiency in Korean (TOPIK) results (n = 140) as their most recent standardized proficiency test; the only other standardized test reported was the ACTFL Oral Proficiency Interview (n = 2; one participant reported a score of Novice Low and another a score of Intermediate High). The TOPIK exam has two levels, with a lower-level form (TOPIK I) that yields results in major bands 1 and 2 and a higher-level form (TOPIK II) that yields results in major bands 3 to 6 (www.topik.go.kr). One hundred twenty-nine participants reported results from the TOPIK II. The average TOPIK band score reported was 4.25 (SD = 1.09), with a median of 4.

Participants also reported on the contribution of extracurricular activities to their Korean learning, their current level of Korean use for common activities, and their motivations for learning Korean (Table 4.4). Relatively few participants reported having any family members who spoke Korean, explaining the low number of responses to questions about interacting with family in the first two parts of Table 4.4. However, as motivation may be more future-oriented or aspirational, most participants did respond to the motivation question about family. In general, participants reported relatively high engagement in a variety of extracurricular activities. Motivation-wise, instrumental goals such as getting a job or going to university were of similar importance to integrative goals such as having friendships with Koreans or appreciating Korean culture. Variation in responses to these questions was rather large, highlighting the diversity of participants' learning practices and current Korean use, and their strong motivations for learning.
Table 4.4 Korean Learning, Use, and Motivation n M 10 10 10 10 10 10 SD Median Min Max 198 7.16 2.61 198 0.76 2.02 198 5.95 2.35 198 6.76 1.93 198 6.77 2.31 198 5.17 2.85 Contribution to Learning Korean* by… Interacting with Friends Interacting with Family Reading Self-Study Watching TV or Movies Listening to Music Level of Current Korean Use** when… Interacting with Friends Interacting with Family Reading Self-Study Watching TV or Movies Listening to Music Motivation for Learning Korean* due to… Getting a Job Earning More Money Going to University or Other Training Impressing Friends and Family Korean-Speaking Family Korean-Speaking Spouse or Partner Friendship with Koreans Korean Culture 193 6.61 3.12 194 5.81 3.25 198 6.76 3.31 198 4.09 3.06 198 1.46 2.52 198 2.75 3.35 188 6.27 2.76 188 6.60 2.45 198 6.11 2.44 198 0.53 1.78 198 5.74 2.54 198 6.74 2.30 198 6.22 2.60 198 5.69 2.98 8 0 6 7 7 5 6 0 6 7 7 6 8 6 8 5 0 1 6.5 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 10 10 10 10 10 10 10 10 10 10 10 10 10
Note. *Scale: 0 = not at all, 1 = minimally, 5 = moderately, 10 = most importantly. **Scale: 0 = none, 1 = almost never, 5 = 50% of the time, 10 = always.

Native Speakers. In total, 6 Korean NSs completed the field testing procedures. NS participants were recruited from the Seoul area, and all were connected to universities in some way (3 undergraduate students, 3 graduate students). Of the 6 NSs, 5 were female. Their average age was 23.5 years (median = 23, min = 19, max = 31). All NS participants reported English as their second most-dominant language; participants additionally reported lower levels of proficiency in Japanese (n = 3), French (n = 1), and Spanish (n = 1). On average, participants reported using Korean 76.5% of the time (median = 75%, min = 60%, max = 97%).

KPD Production Task Scoring Reliability Study

Six Korean NSs (female = 5), all enrolled in or recent graduates of a master's degree program in teaching Korean as a second/foreign language, participated in the scoring reliability study. These participants were not the same individuals as the previously described NSs who participated in the field testing. Participants varied in their teaching experience, ranging from minimal tutoring experience at one end to one year of teaching Korean classes for immigrants at a cultural center at the other. I gave all participants an introduction to the test and training on how to score the production tasks. All information was given in Korean. For each of the two tasks, training included examples of scoring (i.e., listening to real responses and seeing what score was given), detailed explanation of the scoring criteria, and a selection of items from different test-takers to practice scoring (i.e., isolated items and responses) with feedback. Then, participants scored the entire production section for one sample examinee. After completing scoring for the sample examinee, they were given a copy of the scores given by the expert rater who had scored all of the test-takers who completed the KPD in field testing. Participants had the opportunity to ask questions throughout the training.

After the introduction and training, all participants scored a subset of 20 randomly selected KPD tests from field testing. The 20 tests were a stratified random sample: 2 NS KPD tests and 18 learner KPD tests were randomly selected to compose the subset used in the rater reliability study.
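Cohen's kappa, the agreement statistic later used for these interrater analyses (see the RQ1b analysis description below), can be computed from two scorers' dichotomous decisions as in the following minimal sketch; the scores here are hypothetical:

    import numpy as np

    # Hypothetical dichotomous scores (1 = target-like, 0 = not) assigned by
    # two scorers to the same set of production-item responses.
    scorer_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
    scorer_b = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1])

    p_observed = np.mean(scorer_a == scorer_b)

    # Chance agreement: both score 1 by chance, plus both score 0 by chance.
    p_a1, p_b1 = scorer_a.mean(), scorer_b.mean()
    p_chance = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)

    kappa = (p_observed - p_chance) / (1 - p_chance)
    print(f"observed = {p_observed:.2f}, kappa = {kappa:.2f}")

Unlike simple percent agreement, kappa discounts the agreement that two scorers would reach by chance alone, which matters here because most production responses are scored correct.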
Interview Study

A total of 22 participants took part in the interview study. Among these 22, 21 were L2 learners of Korean, all of whom had completed the field testing procedures before their initial interview, and one was a teacher of Korean. Five of the L2 learners were graduate students, four were undergraduate students, and 12 were language students (i.e., currently studying in an intensive, 20-hours-per-week Korean language program housed at a university). Learner interviewees' L1 backgrounds included Chinese (Mandarin), Cantonese, French, German, Japanese, Russian, Spanish, and Vietnamese. More details on these participants can be found in Chapter 8.

I invited these learners to participate in the interview study primarily on the basis of representativeness and having potentially interesting perspectives on L2 Korean pronunciation. I looked for individuals representing a range of interesting KPD score profiles (different weaknesses; relatively many or few weaknesses) and those who made interesting comments when chatting before, during, or after their field testing appointment; during field testing appointments, I made brief notes about small talk with participants concerning jobs, learning experiences, and interest or struggles in pronunciation. I also considered learners' backgrounds (current circumstances in Korea, linguistic background, and Korean proficiency level), as I believed having diverse perspectives to be important (Friedman, 2012). On a more practical level, I considered potential interviewees' linguistic ability to participate in an interview (i.e., sufficient Korean or English proficiency to understand and respond to open-ended interview questions). With just two exceptions, all participants whom I invited to participate in the interview study accepted (one simply had no interest, and another cancelled her appointment due to illness and could not reschedule before departing Korea). Learner interviews took place in Korean and/or English depending on the interviewee's and my own linguistic capabilities; most interviews were conducted entirely in Korean with minimal code-switching to English.

The Korean instructor who participated in an interview had taught two of the language students who participated in the interview study in a university intensive Korean language program. I recruited this teacher through my personal network: He was one of my teachers in an intensive Korean course I took before starting data collection. Through informal observation of his teaching and informal chats about L2 research, I thought he would be interested in participating in the study. The interview with the Korean instructor was conducted in Korean with minimal English code-switching.

Materials

In addition to the KPD, described in detail in the previous chapter, the following instruments and materials were used.

Language Background Questionnaire

The language background questionnaire (LBQ, Appendix E) collects information on participants' general linguistic backgrounds (L1, other L2s and associated proficiency levels) and elicits more detailed information on experiences with the Korean language. I used Marian, Blumenfeld, and Kaushanskaya's (2007) Language Experience and Proficiency Questionnaire (LEAP-Q) as a basis for the LBQ, adding items about current class level, Korean proficiency test results, prior instruction, and heritage status. Additionally, I removed some of the accent items from Marian et al., as these aspects are covered by the self-assessment.
The LBQ was presented bilingually in Korean (the learners' target language) and English (a widely-known lingua franca).

Pronunciation Self-Assessment

The pronunciation self-assessment (SA, Appendix F) was intended to capture (a) perceptions of global pronunciation abilities and attitudes and (b) awareness of pronunciation strengths and weaknesses at the level of individual phonemes. All self-assessment items utilize positively-oriented, left-to-right numerical scales; that is, the leftmost point indicates the least/worst and the rightmost point indicates the most/best. Like the LBQ, the SA was presented bilingually in Korean and English.

The first part of the SA contains items targeting self-perceived comprehensibility and accentedness, following Derwing and Munro's (1998, 2015) widely-used, simple 9-point scales. Learners were directed to focus on how others react to their speech and to focus more on how they produce speech rather than what they are able to say (i.e., to make judgments primarily based on their articulation rather than their knowledge of vocabulary or syntax). Additionally, 9-point scales targeting satisfaction with current pronunciation abilities and the value placed on pronunciation were included. The second part of the instrument deals with the difficulty of (a) production and (b) perception of each phoneme in Korean's inventory (28 phonemes in 2 modalities = 56 total items). When self-assessing, learners indicated how often they have difficulty with a sound in either modality on a 7-point scale, ranging from 1 (“Always”) to 7 (“Almost never”). For production items (k = 28), reliability (Cronbach's alpha) was .95. For perception items (k = 28), Cronbach's alpha was also .95.

Independent Speaking Task

To elicit naturalistic, spontaneous speech, I created an independent speaking task following Kim et al.'s (2016) description, which they based on the TOEFL independent speaking task. The prompt (in English) was: “Some people prefer to live in a small town. Others prefer to live in a big city. Which place would you prefer to live and why?” I produced a Korean translation, which was copyedited by a Korean-English bilingual with Korean teaching experience. This task should have been accessible to advanced beginners and above. Much of the vocabulary (e.g., descriptive adjectives, places) and many of the grammar structures (e.g., present tense, patterns to express like/dislike, comparatives) are covered within the first semester or two of coursework in most Korean programs. The task directions and prompt were presented bilingually on paper (Appendix G). I gave oral directions in the participant's preferred language (Korean or English), and I always read the prompt aloud in Korean. After the directions and the reading of the prompt, participants were given 15 seconds to think about their response and then up to 1 minute to speak. Participants were not cut off immediately at the one-minute mark; I allowed them to continue until a natural stopping point.

A team of three coders, all native speakers of Korean with training in linguistics, completed broad phonemic transcriptions of all 198 Independent Speaking Task responses collected during field testing. A set of six responses was transcribed by all three coders. At the phoneme level, agreement was achieved when all three coders indicated the same phoneme; where one coder differed, a partial agreement (assigned a conservative value of 0.5) was recorded. The agreement among coders across a total of 2,136 phonemes was 92%.
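The tally behind this figure can be illustrated with a short sketch of the scoring rule just described (all three coders match = 1 point; exactly one coder differs = 0.5); the aligned transcription fragments are hypothetical:

    def slot_agreement(labels):
        # 1.0 if all three coders agree; 0.5 if exactly one coder differs
        # (with three coders, two distinct labels means one dissenter).
        if len(set(labels)) == 1:
            return 1.0
        if len(set(labels)) == 2:
            return 0.5
        return 0.0

    # Hypothetical aligned phonemic transcriptions from three coders.
    coder1 = ["k", "a", "m", "s",  "a"]
    coder2 = ["k", "a", "m", "s*", "a"]
    coder3 = ["k", "a", "m", "s",  "a"]

    scores = [slot_agreement(slot) for slot in zip(coder1, coder2, coder3)]
    print(f"{sum(scores) / len(scores):.0%}")  # 4.5 of 5 slots = 90%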
While intercoder agreement was high, I noticed some inconsistency among coders in how they applied the transcription conventions. For example, the common verb 있다 (/it.t*ɑ/, to exist, to have) in some of its inflectional variants was sometimes inappropriately transcribed with the ㅆ letter, corresponding to the phoneme /s*/, even though the speaker clearly did not articulate that sound. This likely arose because /s*/ (written as ㅆ) always changes to /t/ in coda position in Korean. Accordingly, I carefully reviewed the audio files and corrected all phonemic transcriptions used in the analyses to increase consistency across the speech samples.

Elicited Imitation Test

The Korean elicited imitation test (EIT) developed by Kim, Tracy-Ventura, and Jung (2016) served as an independent measure of learners' oral language proficiency. The Korean EIT consists of 30 items and takes approximately 10 minutes to administer. Each item requires the learner to listen to a spoken sentence, wait 2 seconds, and then repeat the sentence orally. Sentences range from 7 to 19 syllables in length; the length of sentences increases as the test progresses. Learner responses are recorded. Each item is scored on a 0-4 scale as follows, with 120 total points possible:
• 4: Perfect repetition without any discrepancy
• 3: Accurate content repetition with minor changes in form allowed
• 2: Features changes to content and/or form that affect meaning
• 1: Includes half of the sentence or less
• 0: Any of the following: no response, only one word repeated, or unintelligible repetition

Kim et al. (2016) reported 95% exact agreement between two raters and an internal consistency of .96 (Cronbach's alpha). Based on a sample of 66 Korean learners who were living in Korea and had an average of 3 years of residence (min = 2 months, max = 7 years), Kim et al. found a mean score of 52.82 (SD = 24.10).

For this study, the directions of the test and the practice items were translated into Korean, drawing on Park's (2014) Korean-language instructions for an English EIT, and simplified in order to make the task more accessible to lower-proficiency Korean learners who do not have a strong command of English (Appendix H). No changes were made to the test items from Kim et al. (2016). A Korean NS research assistant scored participants' EIT responses. The internal consistency (Cronbach's alpha) of the EIT was .96. I scored a subset of 20 randomly-selected EIT responses; the total-score interrater reliability was r = .97.

Semi-Structured Interviews

For the interview and retesting study, I conducted face-to-face semi-structured interviews (Brinkmann, 2013). Learners participated in one or two interviews, and the teacher participated in one interview. Interviews involved responses to stimuli (KPD results and self-assessment results) and a set of pre-defined questions. As the interviews were only semi-structured, I probed further when participants made interesting comments and/or did not directly or elaborately answer questions. I also gave the floor to participants at the end of the interview, encouraging them to ask their own questions or bring up anything else that was on their minds related to the KPD and/or Korean pronunciation. All interviews were recorded and transcribed.

I designed the structure of the interviews for learners and the teacher to facilitate connections across time and perspectives. Figure 4.1 outlines the general structure of these interviews, and the complete interview protocol can be found in Appendix I.
The first interview for learners included four phases: Orientation and Reflection, Interpreting KPD Results, Learning Activity, and Progress. The first phase directs learners to think about their pronunciation and reviews their self-assessment responses. The second phase focuses on how learners understand their KPD results and elicits differences between the self-assessment and KPD results. The Learning Activity phase explores learners' pronunciation learning practices (and those found in their classes) along with their immediate thoughts on what they might try after seeing the KPD results. The last phase, Progress, has learners reflect on their pronunciation learning history and future pronunciation goals.

The second interview for learners included three phases: KPD Results, Learning Activity, and Progress. The KPD Results phase connects to the Interpreting KPD Results phase from the first interview but shifts the focus to which aspects of the KPD results remained salient for learners after some time had passed. The Learning Activity phase of this second interview is aligned with the phase of the same name in the first interview, but this time focused on what the learner had done in the interval between interviews. Similarly, the Progress phase touches on learner perceptions of learning progress over the time interval between the first and second interviews.

The interview for the teacher included four phases: Pronunciation Teaching, Teacher's Observations of Students, Interpreting KPD Results, and Utilizing Results. The Teacher's Observations of Students phase aligns with the Orientation and Reflection phase of the first learner interview, showing another perspective on informal observations of students' strengths and weaknesses. Similarly, the Interpreting KPD Results phase is parallel to the learner interview phase of the same name. The Pronunciation Teaching and Utilizing Results sections elicit information on current, typical teaching practices and on ways in which the KPD results could be applied in a classroom setting, respectively.

Figure 4.1. Structure of interviews. Single-headed arrows indicate sequence; double-headed arrows indicate content relationships.

Procedures

Study procedures were divided into two sets: Field Testing, which all learner participants underwent, and the Interview Study with Retesting. The order of study activities is listed below for each set.

• Field Testing
  1. Informed Consent (3 minutes)
  2. Language Background Questionnaire (10-15 minutes)
  3. Self-Assessment (5-10 minutes)
  4. Independent Speaking Task (3 minutes)
  5. KPD (15-20 minutes)
  6. EIT (10 minutes)
• Interview Study and Retesting
  1. Interview 1 (learners and teacher)
  2. Interview 2 & Retest (subset of learners)
     ▪ Part 1: Interview
     ▪ Part 2: Independent Speaking Task
     ▪ Part 3: KPD

Analyses

This dissertation makes use of several quantitative analyses. When examining test score data, classical test theory (Crocker & Algina, 1986) and Rasch measurement (Rasch, 1960/1980) approaches were used. Correlations were used to examine the relationships among tasks and between instruments (e.g., between the KPD and self-assessments). I used cluster analysis (Staples & Biber, 2015; Yan & Ginther, 2017) to explore learner profiles based on KPD results. Additionally, I qualitatively analyzed interview data to investigate content related to participant understanding and application of KPD results.
For coherence and readability, detailed descriptions of the analyses can be found immediately preceding the results of each analysis in subsequent chapters. Basic analytical details for each RQ are outlined below:
• RQ1a: How reliable is the KPD? Analysis: Cronbach's alpha and Rasch-based reliability estimates were computed based on individual items for each task, each modality, and the whole test. Item parcels in each modality, grouped according to target phoneme, were also created, and Cronbach's alpha and Rasch-based reliability estimates were calculated for them.
• RQ1b: How reliably are production items evaluated by different scorers? Analysis: Interrater reliability for the production tasks was analyzed via computation of Cohen's kappa.
• RQ2a: What is the internal structure of test tasks? Analysis: Pearson correlations were run between the KPD total score and each task. Correlations were also run among all tasks.
• RQ2b: Do item difficulty hierarchies align with expectations and previous research? Analysis: Item facility (percentage correct) and Rasch item difficulty estimates were computed.
• RQ3: Do scores indicate distinct test-taker profiles in terms of mode, articulatory features, and/or mastered phonemes? Analysis: Cluster analysis was used to investigate the presence of clusters representing distinct profiles of pronunciation ability (see the sketch following this list).
• RQ4: Do overall results show the expected relationship with Korean oral proficiency? Analysis: Correlations between KPD total scores and EIT results were computed.
• RQ5: Do KPD results reflect difficulties test-takers show in spontaneous, meaning-focused speech? Analysis: Independent speaking task responses were phonemically transcribed by NS coders. Total errors were tallied and normed to a standardized rate (per 100 words). Additional tabulations were made for individual phonemes, features, and contexts. Pearson correlations between total phonemic error rates in the independent speaking task and KPD scores were run. Additional correlations were run for target phonemes and features between the KPD and the independent speaking task.
• RQ6: To what degree do KPD results reflect self-assessments of pronunciation ability and difficulties? Analysis: Correlations were run between KPD results and self-assessments.
• RQ7: To what extent do (a) learners and (b) teachers understand score reports and/or learn anything new from them? Analysis: Interview data were coded for alignments and discrepancies between (a) learners' understanding and KPD results and (b) teachers' understanding and KPD results.
• RQ8a: Do learners report any changes in their self-study routines and/or their attention to phonological form in formal or informal learning situations? Analysis: Interview data were coded for pronunciation-related study activity and pronunciation awareness/attention. Codes were analyzed within subjects across time, allowing for analysis of changes in study activity and awareness/attention.
• RQ9: To what degree do learners show improvements (a) overall and/or (b) in weak areas after receiving KPD results? Analysis: A subset of interviewed students were retested roughly 2-3 months after receiving their initial KPD score report. Initial KPD and post-test KPD scores were compared. Within-groups t-tests were used at the group level, and descriptive statistics were tallied to examine changes for individual learners on phonemes and features. These results are considered alongside interview data on pronunciation-focused learning activities.
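As a concrete illustration of the cluster-analysis step under RQ3, the sketch below applies agglomerative (Ward) hierarchical clustering to per-phoneme accuracy profiles. The data, the number of clusters, and the linkage method are illustrative assumptions, not the settings of the analysis reported later:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(1)

    # Hypothetical profiles: rows = 198 learners, columns = accuracy
    # proportions for 28 phonemes in one modality.
    profiles = rng.uniform(0.4, 1.0, size=(198, 28))

    # Ward's method on Euclidean distances, a common default for profile data.
    Z = linkage(profiles, method="ward")

    # Cut the dendrogram into, e.g., three candidate profile groups.
    groups = fcluster(Z, t=3, criterion="maxclust")
    print(np.bincount(groups)[1:])  # number of learners per cluster

With real data, the resulting group centroids would then be inspected to see whether clusters differ meaningfully by mode, articulatory feature, or mastered phonemes.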
CHAPTER 5: MEASUREMENT

In this chapter, I present results related to the measurement properties of the KPD. This includes basic summary statistics of whole-test, section, and task scores as well as more detailed item analyses, reliability analyses, and analyses related to the internal structure of the test (i.e., part-total and part-part correlations). Results are primarily focused on learner test data, but NS test data are also considered where relevant and appropriate.

Research Questions

As a convenience to readers, the RQs addressed by the results in this chapter are as follows:
• RQ1a: How reliable is the KPD?
• RQ1b: How reliably are production items evaluated by different scorers?
• RQ2a: What is the internal structure of test tasks?
• RQ2b: To what extent do item difficulty hierarchies align with expectations?

Analysis Details

Brief descriptions of the analyses were provided in the Methods chapter (Chapter 4). In what follows, I provide more detailed descriptions of the measurement analyses.

Measurement Models

A measurement model (or scale) can be simply defined as the way in which scores are assigned to objects of measurement (Hand, 1996; Stevens, 1946). In this case, I am concerned with how scores from the KPD are assigned to L2 speakers of Korean. All individual KPD items are scored dichotomously, and all items reflect some facet of a learner's phonological competence in Korean. Thus, the simplest measurement model would be the sum of all KPD items as a reflection of phonological competence. However, this approach is of limited use and relevance in the present context. Rather, theory and empirical research support the idea that productive and perceptive phonology, while related, are distinct. In turn, it would be defensible and more informative to calculate separate total scores for the production and perception sections of the test, where a learner's production ability is reflected by the sum of all production items and perception ability is reflected by the sum of all perception items. The two abilities are expected to be correlated because the two skills are related in their development; growth in one can support growth in the other (most often, growth in perception aids growth in production). In the measurement models for production and perception abilities, item analyses for diagnosing poorly-performing items and examining the expected hierarchy of item difficulties would occur at the level of individual items.

However, KPD results are not intended to be used as simple sums reflecting an overall level of phonological competence. Rather, subscores for each phoneme in production and perception, each based on the subtotal of several individual items, are the primary unit of interpretation and intended use (Dorans, 2018). Furthermore, due to variation among phonemes in the number of critical allophones and in their overall frequency of occurrence in real words (Shin, Kiaer, & Cha, 2012), each phoneme is represented by a non-uniform number of individual items. In other words, raw phoneme subtotals are not tau-equivalent (i.e., phonemes are not equally weighted by default), making some phonemes more important than others when a simple sum of item scores is used to represent overall production or perception ability. Thus, I found it appropriate to consider measurement models in which (a) subscores are aggregated at the phoneme level, such as within item parcels, and (b) the scale weights of individual phonemes are uniform.
To accomplish this, I computed item parcels for each phoneme in production (Task 1 and Task 2) and perception (Task 3 and Task 4) by summing all of the individual items that target a given phoneme (refer to Appendix A). In the measurement analyses, parcels are made tau-equivalent by (a) converting them to percentage scores in the CTT analyses or (b) specifying equal parcel weights in the Rasch analyses. Thus, the measurement model based on parcels maps test-takers' overall abilities by weighting performance on each of the 28 Korean phonemes equally.

The creation of item parcels, also called item bundles or super items, warrants further discussion. Item parceling typically involves the principled summing of multiple individual items into one polytomous item. Instead of considering the dichotomous items A, B, and C separately in analyses, the responses to all three items are summed and considered as Parcel X with a scale of 0-3 points. This effectively reduces the total number of items on a test, potentially reducing the reliability of scores (Marais & Andrich, 2008), but this is mitigated by the increased amount of information about test-taker abilities provided by a parcel compared to any single item. This is referred to as a score-based approach to item parceling (Eckes, 2014). Item-based approaches to parcel measurement also exist but are excluded here due to their technical complexity and concomitant sample size requirements.

There are two main reasons for parceling items: content and context. Parceling by content groups items that tap into the same aspect of a larger, overarching construct, e.g., items on a test of receptive phonological knowledge which target the same phoneme. Parceling by context groups items that share a context which influences responses across items. For example, consider a reading comprehension test where a test-taker must read a passage and then answer a main-idea question followed by a question about the author's purpose: If a test-taker does not correctly identify the main idea of the passage, they might be less likely to subsequently identify the author's purpose for writing it. Marais and Andrich (2008) discuss these phenomena in terms of local item dependence, that is, sets of items with stronger-than-expected relationships in responses, and refer to two types of dependence: trait dependence (corresponding to content) and response dependence (corresponding to context). Accounting for local dependence is critical to the application of many measurement models (e.g., Rasch, IRT) and can lead to better measurement of underlying test-taker ability. One common application of item parceling is the creation of testlets for several dichotomous items which share a common stimulus, e.g., a text followed by several comprehension questions (e.g., Eckes, 2014). Parceling has also been applied to C-tests, in which several dichotomous items are embedded in the same paragraph (e.g., Lee-Ellis, 2009).

I chose to run and report analyses for both measurement models, individual items and item parcels, due to the quality-assurance benefits of examining individual items and the necessity of considering the way scores are actually intended to be interpreted and used (i.e., as item parcels).
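A minimal sketch of this parceling step, using a hypothetical response matrix and item-to-phoneme map (the actual KPD specifications are in Appendix A): each parcel is the sum of a phoneme's items, and conversion to a percentage gives every phoneme equal weight:

    import numpy as np

    # Hypothetical dichotomous responses: rows = test-takers, columns = items.
    responses = np.array([
        [1, 1, 0, 1, 1, 0],
        [1, 0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 1],
    ])
    # Target phoneme of each item column (hypothetical assignment).
    item_phoneme = ["k", "k", "k*", "k*", "k*", "l"]

    parcels = {}
    for p in sorted(set(item_phoneme)):
        cols = [i for i, ph in enumerate(item_phoneme) if ph == p]
        raw = responses[:, cols].sum(axis=1)   # polytomous parcel score
        parcels[p] = 100 * raw / len(cols)     # percentage = equal weighting

    for p, scores in parcels.items():
        print(p, scores)

Without the percentage conversion, a phoneme represented by more items (here /k*/, with three) would dominate a simple sum relative to a phoneme represented by fewer items.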
Two Statistical Approaches to Measurement

In addition to there being more than one measurement model relevant to the analysis of the KPD, there are also multiple statistical approaches available for analyzing the measurement properties of the test. In the field of measurement, a general distinction is made between classical test theory (CTT; Crocker & Algina, 1986; DeMars, 2018) and item response theory (IRT; Brown, 2018; Meijer & Tendeiro, 2018). Tests with dichotomously scored items as well as tests with polytomously scored items can be analyzed with both CTT and IRT. In short, CTT maintains that an observed score on a test is an examinee's 'true' ability, plus or minus some amount of measurement error. Thus, CTT aligns well with the theory of measurement known as operationalism, which holds that an attribute is, essentially, defined as the score on the test (Hand, 1996). In the present context, this would be akin to saying that a learner's pronunciation accuracy is one and the same as their KPD score. In contrast, IRT is based on the notion that what is really being measured (i.e., the attribute possessed by examinees) can only be measured indirectly: This latent attribute (or trait) is not something that can be directly observed, but its level can be inferred through analysis of observable responses to items. This approach is better aligned with the theory known as representational measurement, which aims to establish accurate links between test scores and the relative levels of the attribute possessed by the people tested (Hand, 1996). With reference to the KPD, this theoretical approach holds that a learner's underlying pronunciation (or perceptual) abilities are represented by scores on the KPD; this representation is mediated by the content and technical qualities of the test.

While these two statistical (and theoretical) approaches to measurement analysis differ in several other ways (see Embretson, 1996, and DeMars, 2018, for summaries), they do share several important features: (a) Tests should measure a single dimension, (b) scores from several items may be summed or otherwise combined, and (c) several statistical analyses are available to investigate flaws in individual items. Usefully, in the simplest of IRT models (i.e., the dichotomous Rasch model and some variations of it), raw sum scores of all items or item parcels correlate nearly perfectly with model estimations of person ability. This is helpful because a simple total of raw scores is easier for test users who are not savvy in quantitative measurement techniques to understand and interpret, and it facilitates comparisons of information about the same dataset obtained by the two approaches.

Despite sharing some basic similarities, IRT offers several practical advantages over CTT. For one, IRT places the ability of examinees and the difficulty of items (or parcels) onto the same interval scale of measurement, allowing for the direct and meaningful comparison of item difficulty and person ability statistics. This can be useful for interpreting what typical and/or particular examinees know or can do. Additionally, IRT allows for a more robust consideration of measurement error. Whereas CTT treats error as constant throughout the range of person ability, in IRT error can be examined conditionally along the continuum of person ability (through a calculation of information aggregated at the test level) as well as at the level of individual items (through calculations of information at the item level). Thus, IRT facilitates consideration of measurement error at critical score ranges, such as around cut-points for interpretation or decision making.
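As a point of reference for the analyses that follow, the dichotomous Rasch model expresses the probability that person n answers item i correctly as a function of the difference between person ability (B_n) and item difficulty (D_i), both in logits (standard formulation; see Bond & Fox, 2015):

P(X_{ni} = 1) = \frac{\exp(B_n - D_i)}{1 + \exp(B_n - D_i)}

When B_n = D_i, the probability of success is .50; each additional logit of ability raises the log-odds of success by one.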
For one, they typically require large(r) sample sizes to estimate model parameters. For the simplest dichotomous models (i.e., 1-parameter models), tests of at least 30 items and sample sizes of 200 to 250 examinees meet minimum recommendations, though for one variation of 1-parameter models, the Rasch model (see below for details), Linacre (1994) has argued that meaningful results can be obtained for 30-item tests with fewer examinees. Linacre advised an absolute minimum of 30 examinees for dichotomously scored tests and 50 for polytomously scored tests, and further suggested that Rasch analyses conducted with 100 to 150 examinees will yield estimates of item difficulty and person ability within a reasonably narrow confidence range (0.5 logits). Beyond sample size considerations, IRT models take stricter stances on the relationship between model estimates and response data. For most IRT approaches, models must be adjusted to fit a given set of response data. This can be done by adding additional parameters to the model to be estimated freely, such as an item discrimination and/or a guessing parameter. However, doing so requires even larger sample sizes (e.g., 500 examinees in order to include a discrimination parameter, and 1,000 examinees to include both discrimination and guessing parameters), and this option is thus not considered further here. In the Rasch family of models, all item discriminations are uniformly constrained, and in line with a view of the Rasch model as prescriptive, item response data must fit the model (rather than the other way around). What this view dictates is that elements of measurement (items, persons) which demonstrate poor fit to the model should be removed. When there are a large number of individual items, removing a handful of poorly-fitting items is usually not a grave concern. However, it is often difficult to justify removing examinees, large numbers of items, or a whole content-based item parcel. After all, people who do not fit the Rasch model still may wish to receive diagnostic feedback on their pronunciation! Similarly, from a content perspective, it is often unreasonable, if not absurd, to remove substantial portions of content from a test due to poor fit statistics. Bowles, Skibbe, and Justice (2011) illustrated this problem in their Rasch analysis of an assessment of letter name knowledge (LNK) for 909 children in the early stages of literacy development. In their LNK test, which featured one item for each letter in the English alphabet that children must point to and name, several items (i.e., letters) were found not to fit the Rasch model. Bowles et al. noted the absurdity, from a content perspective, of effectively removing letters of the alphabet to satisfy Rasch model fit demands. In the case of the KPD, removing a phoneme-based item parcel would not be justifiable, as all phonemes are undeniably part of the attribute being assessed and potentially relevant to making subsequent instructional decisions.

I elected to conduct both CTT and Rasch measurement analyses. Doing so allowed for the examination of converging or diverging evidence of measurement qualities. The inclusion of Rasch measurement provided the previously discussed benefits over CTT, while CTT served both as an additional perspective on the data and as a "back-up" in the event that the data showed unignorable misfit to the Rasch model. In the following subsections, I provide relevant technical details for the present analyses conducted using each approach.
Classical Test Theory Analyses. In CTT, the relationship between test scores and the "true" score associated with the attribute of measurement is defined through the following equation (1):

(1) Observed Score = True Score + Error

where the observed score is typically the sum of all item/task scores and error is typically estimated via the standard error of measurement (SEM), which is calculated from test reliability (e.g., Cronbach's alpha) and the standard deviation of test scores (see Brown, 1999, for the formula). Thus, an examinee's true score is estimated as falling somewhere within an interval defined by the observed score plus or minus the SEM. In practice, such as when using test scores for subsequent statistical analyses, the observed score is taken as a good estimate of an examinee's ability level on the attribute.

In CTT, statistics used for characterizing the qualities and performance of items include item facility (P) and item discrimination (D). P is the proportion of correct responses across all examinees for dichotomous items, or the average score from all examinees for polytomous items. D is the association between test-takers' responses on an item and their overall scores on the test, typically estimated via correlation (the approach taken here) but sometimes as the difference in P between the examinees in the top and bottom third of total scores (see Carr, 2011, for ways to calculate P and D). Item discrimination, which ranges from -1 to 1, is useful as an indicator of an item's technical quality: Larger positive discrimination values indicate that more able examinees responded correctly more often than less able examinees (which is desirable), while negative values indicate the opposite, which is obviously undesirable. Values at or near zero mean that the item did not discriminate, and thus provides no information for measuring the underlying construct, at least from a psychometric perspective. This can happen, for example, when everyone gets the item correct or everyone gets the item wrong (which is information that may be useful to teachers), or when responses on the item are seemingly random (which is information that may not be immediately useful to teachers). For dichotomous items, I used point-biserial correlations to calculate discrimination, and for polytomous item parcels I used Pearson correlations between the parcel score and the total score minus the parcel.
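As an illustration of these CTT statistics, the sketch below computes item facility, corrected item discrimination, and the SEM in base R. The response matrix reuses the hypothetical data from the parceling example above, and the alpha and SD values plugged into the SEM calculation are the whole-test figures reported later in this chapter.

```r
# Sketch of the CTT item statistics just described, reusing the hypothetical
# 'responses' data frame from the parceling example.
X <- as.matrix(responses)
total <- rowSums(X)

P <- colMeans(X)  # item facility: proportion correct per item

# Corrected discrimination: point-biserial correlation between each item and
# the total score with that item removed; items answered identically by all
# examinees have zero variance and return NA, mirroring the cases noted above
D <- sapply(seq_len(ncol(X)), function(j) cor(X[, j], total - X[, j]))

# SEM from reliability and the SD of total scores; plugging in the whole-test
# alpha (.92) and SD (19.35) reported in this chapter gives an SEM of about 5.5
sem <- function(rel, sd_total) sd_total * sqrt(1 - rel)
sem(0.92, 19.35)
```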
Rasch Analyses. Rasch analyses yield estimates of ability for each person, difficulty estimates for each item, fit statistics for persons and items, and estimates of reliability for both person ability and item difficulty (Bond & Fox, 2015; Boone, Staver, & Yale, 2014). Person ability and item difficulty are both expressed in logits (log-odds units), which relate to the probability that a given examinee will produce a correct response to a given item. For the PCM, the Rasch-Andrich difficulty threshold between each step of the scale (e.g., the boundary between a sum score of 3 or 4 on all items targeting /k*/) is estimated, based on the point along the person ability continuum where an examinee would have 50-50 odds of scoring in the higher or lower category, with the average difficulty of all thresholds reported as the overall item parcel difficulty.

The measurement quality of person ability, at the level of the whole test or of individual items, can also be examined using information functions; more information means more robust and precise measurement of ability. Test information functions (TIF) represent the amount of information yielded for examinees along the ability continuum; information is maximized where there are more items (or partial-credit scale steps) at or near a given person ability level. Similarly, item information functions (IIF) are maximized where item difficulty is equal to person ability. For the dichotomous Rasch model, all items have identically shaped IIFs, but IIFs for polytomous items may take different shapes based on the information associated with each scale step (Linacre, 2005).

For both production and perception KPD items, I used two Rasch models to analyze response data: (1) the dichotomous Rasch model (Rasch, 1960/1980) for individual item analyses, and (2) the Rasch partial-credit model (PCM; Masters, 1982) for item parcels. All Rasch models were estimated using the Winsteps software (version 4.3.4). For both models, item response data from all 198 examinees were included. For the dichotomous Rasch model, this sample size was expected to yield highly accurate model parameters per Linacre (1994). For the Rasch PCM, where the partial-credit scale thresholds for each item parcel are estimated separately from all other parcels, the sample size of 198 participants should be sufficient (Linacre, 1994, p. 328, noted that "100 responses per item may be too few").
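The models reported in this chapter were estimated in Winsteps, but for readers who wish to reproduce the general approach in R, a hedged sketch using the eRm package follows; the response matrices X_dich and X_parcel are hypothetical stand-ins for the KPD data.

```r
# Illustrative only: the analyses in this chapter were run in Winsteps, but
# the same two models can be fit in R with the eRm package. 'X_dich'
# (examinees x items, 0/1) and 'X_parcel' (examinees x 28 parcels, scores
# 0..k) are hypothetical stand-ins for the KPD response matrices.
library(eRm)

rasch_fit <- RM(X_dich)                   # dichotomous Rasch model
pp        <- person.parameter(rasch_fit)  # person abilities in logits
itemfit(pp)                               # item infit/outfit mean-squares
personfit(pp)                             # person infit/outfit mean-squares

pcm_fit <- PCM(X_parcel)                  # partial-credit model for parcels

# Reading logits: an examinee 1 logit above an item's difficulty has
# P(correct) = exp(1) / (1 + exp(1)), roughly .73
plogis(1)

# Note: eRm uses conditional maximum likelihood estimation while Winsteps
# uses JMLE, so estimates will differ slightly between the two programs.
```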
Aside from sample size and precision considerations, the Rasch models used also assume unidimensionality (see Chapter 1 for a conceptual discussion of unidimensionality). Assessments of unidimensionality within the framework of a Rasch analysis, and without performing more data-intensive item factor analyses, entail the analysis of model residuals via fit statistics and principal components analysis. Fit statistics, at the level of individual observations or aggregated at measurement elements (i.e., persons, items), provide information on how frequently and significantly the item response data depart from the unidimensional Rasch model. At the level of individual observations, model-predicted values are compared to empirical values and the difference is standardized, allowing for interpretations following a Z distribution (i.e., critical values of ≥ 2 or ≥ 3 are considered statistically significant at the .05 and .01 alpha levels, respectively). Linacre (2019) proposed that model fit is satisfactory when fewer than 5% of standardized residuals exceed Z ≥ 2 and fewer than 1% exceed Z ≥ 3. For person and item fit statistics, it is common to examine both infit (information-weighted fit, reported as a mean-square) and outfit (outlier-sensitive fit, also reported as a mean-square). The former is sensitive to deviations from model expectations in observations near the estimated measure (e.g., observations from persons with ability near the difficulty of an item), while outfit is sensitive to deviations in observations where there is greater distance between measurement elements (e.g., when a high-ability person responds incorrectly to a low-difficulty item, or when a low-ability person responds correctly to a high-difficulty item). Both statistics may range from 0 (representing "overfit," where responses are too predictable) to infinity (representing increasingly large and frequent deviations in responses), with 1.0 indicating perfect fit. Common guidelines for interpreting infit and outfit state that values between 0.7 and 1.3 are acceptable for most purposes, with values between 0.5 and 1.5 acceptable in low-stakes assessment contexts (Wright & Linacre, 1994).

Principal components analysis (PCA) of the Rasch model residuals allows for the detection of systematic patterning among residuals, which may indicate additional measurement dimensions of substance that could interfere with measurement of the primary Rasch dimension. One typically looks for eigenvalues greater than 2 in the first one or two contrasts when determining whether any additional measurement dimensions might be substantial enough to interfere with the unidimensionality requirement. When contrasts have generally small eigenvalues, it is relatively safe to assume that any patterning in Rasch residuals simply reflects noise.

As discussed previously, it is common to remove (or at least revise) persons or items that do not fit the Rasch model. However, because I went into this research uncertain of whether a unidimensional Rasch model is appropriate for the KPD, I considered these analyses exploratory and did not engage in the typical subsequent trimming of items/parcels or people who did not fit the Rasch model. My main interests were (a) obtaining useful information on the reliability of the KPD, the performance of KPD items/parcels, and the hierarchy of KPD items/parcels, and (b) determining the general suitability of applying Rasch measurement to the KPD.

Reliability Analyses

Reliability was considered from two perspectives: (1) conventional test reliability indices from CTT (internal consistency) and Rasch (person reliability) analyses, and (2) the inter-scorer reliability among several teachers (scorers) for the production section of the KPD. From the perspective of conventional test reliability, all 198 KPD responses were scored by an experienced instructor of Korean and submitted to Cronbach's alpha analyses in R using the psych package (version 1.8.12; Revelle, 2018). Cronbach's alpha is a flexible, although conservative, method of estimating test reliability, and is able to accommodate dichotomously and polytomously scored items. For individual dichotomously scored items, alpha was calculated for the whole test, for the production and perception sections separately, and separately for each task. For polytomously scored item parcels, alpha was calculated for all parcels together and for the production and perception parcels separately. A commonly used Rasch counterpart to Cronbach's alpha is the person separation index (Linacre, 2019); this was estimated in Winsteps separately for production and perception items/parcels.
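In code, the internal consistency estimates amount to calls like the following, using the psych package named above; the score matrices are hypothetical stand-ins for the scored KPD data.

```r
# Internal consistency via the psych package actually used for these
# analyses; 'production_items' (0/1 item scores) and 'perception_parcels'
# (polytomous parcel scores) are hypothetical stand-ins.
library(psych)

psych::alpha(production_items)$total$raw_alpha    # alpha for dichotomous items
psych::alpha(perception_parcels)$total$raw_alpha  # alpha accommodates polytomous parcels too
```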
To investigate reliability among several scorers, I recruited six additional scorers, all of whom were Korean NSs pursuing graduate degrees related to teaching Korean as a foreign language. These six scorers varied in their teaching experience; some had only limited tutoring experience while others had up to a year of formal classroom teaching experience. They rated a random sample of 20 KPD responses, in which I deliberately included two (randomly selected) Korean NS responses.

The dichotomous scores from all seven scorers (including the primary scorer) for each item were submitted to calculations of interrater agreement and reliability using the R packages irr (version 0.84.1; Gamer, Lemon, & Singh, 2019) and ragree (version 0.0.4; Redd, 2019), including percent agreement, Fleiss' Kappa (a variant of Cohen's Kappa for more than two scorers), and Gwet's AC1. Percent agreement is a crude measure of interrater agreement that does not account for agreement by chance; values closer to 100% are more desirable. Kappa ranges from -1 to 1 and adjusts for chance agreement, making it a superior estimate of interrater reliability to percent agreement; it is commonly interpreted according to Landis and Koch's (1977) benchmarks: kappa < 0.0 = poor, 0.0 ≤ kappa < 0.2 = slight, 0.2 ≤ kappa < 0.4 = fair, 0.4 ≤ kappa < 0.6 = moderate, 0.6 ≤ kappa < 0.8 = substantial, 0.8 ≤ kappa ≤ 1.0 = almost perfect. Gwet's AC1 accounts for both chance agreement and random guessing by scorers (although truly random guessing is most likely not present in a context such as this one) and is a less-biased estimate of interrater reliability than kappa, especially when one response option is highly prevalent in the data (Gwet, 2008). It has the same range as Fleiss' Kappa and follows the same benchmarks for interpretation. For items with perfect agreement, in which all examinees responded correctly (or incorrectly), Fleiss' Kappa and Gwet's AC1 estimates cannot be produced; in these instances, I manually recoded the indices to 1.0, representing perfect agreement. To summarize overall levels of interrater agreement/reliability across all items, I computed means, SDs, and ranges for each index.

Item parcel scores from each scorer were also calculated for each examinee and converted to percentages. I then examined the consistency of these parcel scores from all seven raters through the calculation of intraclass correlation coefficients (ICCs). I used ICCs that modeled random examinee and random rater effects (ICC(2,1), following the notation of Shrout and Fleiss, 1979), and ICCs that took into account consistency of examinee rankings (ICCC) as well as absolute agreement in score levels (ICCA) assigned by different scorers (McGraw & Wong, 1996). ICCs may range from -1 to 1, with values closer to 1 desirable. Koo and Li (2016) offered the following guidelines for the interpretation of ICC values: ICC < 0.5 = poor, 0.5 ≤ ICC < 0.75 = moderate, 0.75 ≤ ICC < 0.9 = good, ICC ≥ 0.90 = excellent. ICC values where all scorers were in perfect agreement cannot be estimated; in these cases I manually substituted a value of 1.0 to indicate perfect reliability. To summarize overall levels of interrater reliability of parcel scores, I computed means, SDs, and ranges of ICC values across all parcels.

Finally, to examine the reliability of interpretations and the potential impact on decision making, I dichotomized all parcel scores from each rater using the 75% accuracy threshold that represents the diagnostic flag criterion. This allowed for consideration of the reliability of learners' diagnostic profiles across several scorers. As with the interrater reliability for the dichotomously scored items above, I calculated the same three statistics (percent agreement, Fleiss' Kappa, and Gwet's AC1), but this time for the 28 dichotomized item parcels.
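The following sketch outlines these inter-scorer calculations with the irr package named above. Because I am not certain of the ragree package's interface, Gwet's AC1 is computed directly from its standard dichotomous formulation, and all input matrices are hypothetical stand-ins (subjects in rows, the seven scorers in columns).

```r
# 'item_ratings': hypothetical 20-subject x 7-scorer matrix of 0/1 scores for
# one item; 'parcel_scores': hypothetical 20 x 7 matrix of parcel percentages.
library(irr)

agree(item_ratings)          # raw percent agreement
kappam.fleiss(item_ratings)  # Fleiss' Kappa for more than two scorers

# Gwet's AC1 from its standard dichotomous formulation (a stand-in for the
# ragree package call used in the dissertation)
gwet_ac1 <- function(ratings) {
  r  <- ncol(ratings)                  # scorers per subject
  x  <- rowSums(ratings)               # count of "correct" scores per subject
  pa <- mean((x * (x - 1) + (r - x) * (r - x - 1)) / (r * (r - 1)))
  p1 <- mean(x / r)                    # overall prevalence of "correct"
  pe <- 2 * p1 * (1 - p1)              # chance agreement for two categories
  (pa - pe) / (1 - pe)
}

# Two-way random-effects ICCs (ICC(2,1)) for consistency and absolute agreement
icc(parcel_scores, model = "twoway", type = "consistency", unit = "single")
icc(parcel_scores, model = "twoway", type = "agreement",  unit = "single")

# Dichotomize parcel scores at the 75% diagnostic-flag threshold; the same
# agreement indices can then be run on the resulting 0/1 flags
flags <- (parcel_scores < 75) * 1
```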
Correlations

To examine the internal structure of the various sections and tasks of the KPD, I ran Pearson product-moment or Spearman rank-order correlations as appropriate. When data were continuous and appeared to reasonably follow a normal distribution, I used Pearson correlations. When data had less variability and/or did not appear to follow a normal distribution, I used Spearman correlations.

Results

The following results provide information on the distribution of KPD scores, the reliability of scores, detailed summary statistics for KPD items, and relationships among scores on the various parts of the KPD.

Measurement Summary

In this section, I report top-level summary information on the individual item and item parcel measurement models analyzed with CTT and Rasch-based approaches. A brief summary of NS scores follows.

CTT Observed Scores. For individual dichotomously scored items, sum score statistics based on all 198 L2 Korean learners who participated in the field testing are found in Table 5.1. Relative to maximum scores, means were high for the whole test, each section, and each task. However, there was some nontrivial variation in sum scores, as shown by the standard deviations (SD) and ranges. Figure 5.1 illustrates the distribution of sum scores at the level of individual tasks, sections, and all items of the KPD.

Table 5.1
Summary of Learner KPD Scores

Section                              k     M       SD     Range
All                                  352   310.34  19.35  261 – 350
Production                           217   201.21   8.86  178 – 217
  Task 1 – Picture Naming            154   146.16   4.85  131 – 154
  Task 2 – Nonword Reading            63    55.05   4.96   42 – 63
Perception                           135   109.14  11.84   80 – 134
  Task 3 – Pronunciation Judgment*    72    50.39   9.52   23 – 71
  Task 4 – Identification             63    58.74   3.19   45 – 63
Note. *Excluding filler items.

Figure 5.1. Histograms showing the distributions of sum scores for (A) all dichotomous KPD items, (B) all production KPD items, (C) all perception KPD items, and (D) all KPD tasks.

For individual items grouped into phoneme-based parcels (separately for production and perception) and converted to percentage scores to achieve tau-equivalence, the average learner production score was 90.7% (SD = 5.4%, range = 73.3% – 100.0%) and the average perception score was 80.9% (SD = 9.3%, range = 56.9% – 99.4%). Figure 5.2 illustrates the distribution of average phoneme accuracy scores for examinees in production and perception.

Figure 5.2. Histograms of average accuracy scores across all phonemes in (A) production and (B) perception.

Rasch Models. I estimated Rasch models based on individual items and phoneme-based item parcels for the production and perception sections of the KPD. For individual items, the dichotomous Rasch model was used, and for the item parcels, the Rasch partial-credit model (PCM) was used. Measurement summaries and indices of model fit are provided next.

Production Items. For the dichotomous Rasch model of production item responses, Rasch model parameters explained 18.1% of the variance in observations. A total of 1,442 observations, approximately 3.4% of the total number, were unexpected at the Z ≥ |2.0| level. At the Z ≥ |3.0| level, there were 709 unexpected observations (1.7%). A principal components analysis (PCA) of model residuals found several contrasts with eigenvalues > 2.0 and explained variance in excess of 2% (Linacre, 2019; first contrast eigenvalue = 6.24, proportion of variance explained = 2.7%). Examination of a scree plot (Figure 5.3) revealed a pronounced elbow at the third contrast.
Due to the large number of items, it was difficult to extract meaningful patterns when examining biplots of the residual component loadings, but some informal observations could be made. For example, in the first contrast, I was able to observe some clustering of items targeting consonants, particularly tensed consonants. Thus, it appeared that there may be some dependence among phoneme targets.

Figure 5.3. PCA of residuals for production items. TV = total variance, MV = variance explained by person & item measures, PV = variance explained by person measures, IV = variance explained by item measures, UV = unexplained variance, U1-5 = unexplained variance in PCA contrasts 1-5. Boxed region contains PCA contrast scree plot.

Table 5.2 contains summary statistics for person and item measures. As a group, examinees had generally high phoneme production ability compared to the difficulty of items. Figure 5.4 illustrates the test information function (TIF), which shows where along the continuum of examinee ability the test items provide the most precise measurement. According to the TIF, the most accurate information is yielded for examinees with relatively low production abilities. In terms of infit, nearly all persons and items demonstrated good fit to the model. In other words, examinees with phoneme production ability near the difficulty of items tended to perform as expected. However, for outfit, many items showed overfit (values under 0.7) and underfit (values over 1.3). That is, examinees with generally high or low phoneme production ability performed unexpectedly on otherwise easy (or difficult) items with some nontrivial frequency.

Table 5.2
Rasch Measurement Summary for Production Items

Element    Avg. Measure  SD    Measure Range  Avg. Model S.E.  Avg. Infit (Range)  Avg. Outfit (Range)
Persons*   3.36          0.87  1.86 – 6.19    0.33             1.00 (0.73 – 1.39)  0.99 (0.23 – 3.69)
Items**    0.00          0.20  -0.22 – 3.41   0.45             1.00 (0.86 – 1.19)  0.99 (0.30 – 3.59)
Note. *Based on 197 examinees with non-extreme (i.e., not perfect) scores. **Based on 187 non-extreme items.

Figure 5.4. Test information function for production items.

Perception Items. For the dichotomous Rasch model of perception item responses, Rasch model parameters explained 37.2% of the variance in observations. A total of 1,034 observations, approximately 3.9% of the total number, were unexpected at the Z ≥ |2.0| level. At the Z ≥ |3.0| level, there were 409 unexpected observations (1.5%). A PCA of model residuals found several contrasts with eigenvalues > 2.0 and explained variance in excess of 2% (first contrast eigenvalue = 6.47, proportion of variance explained = 3.4%). Examination of a scree plot (Figure 5.5) revealed a pronounced elbow at the third contrast. Similar to the production items, it was possible to informally observe some clustering of items with related targets in the first contrast. For example, I observed negative loadings for several items targeting the /s*/ phoneme in the first contrast. Thus, it again appeared that there may be some dependence in residuals related to phoneme targets.

Figure 5.5. PCA of residuals for perception items. TV = total variance, MV = variance explained by person & item measures, PV = variance explained by person measures, IV = variance explained by item measures, UV = unexplained variance, U1-5 = unexplained variance in PCA contrasts 1-5. Boxed region contains PCA contrast scree plot.

Table 5.3 contains summary statistics for person and item measures.
As a group, examinees had generally higher phoneme perception ability compared to the difficulty of items, but compared to the production items there was more overlap. Figure 5.6 illustrates the test information function (TIF), which shows where along the continuum of examinee ability the test items provide the most precise measurement. According to the TIF, the most accurate information is yielded for examinees with low to moderate perception abilities. In terms of infit, nearly all persons and items demonstrated good fit to the model, with a few exceptions (8 misfitting persons, 1 misfitting item). In other words, examinees tended to perform as expected on test items whose difficulty levels were closely matched to the examinees' ability levels. However, as seen in the outfit values, many persons and items showed overfit (values under 0.7) and underfit (values over 1.3): 83 misfitting persons and 36 misfitting items. Most outfit issues for persons were associated with overfit (n = 57), which indicated that their responses were too predictable based on item difficulties.

Table 5.3
Rasch Measurement Summary for Perception Items

Element    Avg. Measure  SD    Measure Range  Avg. Model S.E.  Avg. Infit (Range)  Avg. Outfit (Range)
Persons*   2.24          1.03  0.31 – 6.33    0.30             0.99 (0.66 – 1.37)  0.94 (0.07 – 4.46)
Items**    0.00          1.91  -3.46 – 4.38   0.33             1.00 (0.80 – 1.52)  0.94 (0.22 – 2.49)
Note. *Based on all 198 examinees. **Based on 121 non-extreme items (out of 135 total items).

Figure 5.6. Test information function for perception items.

Production Parcels. For the Rasch PCM of production parcel scores, Rasch model parameters explained 28.9% of the variance in observations. A total of 285 observations out of 5,544 (5.1%) were unexpected at the Z ≥ |2.0| level. At the Z ≥ |3.0| level, there were 93 unexpected observations (1.7%). A PCA of model residuals found two contrasts with eigenvalues greater than 2.0 and explained variance in excess of 2% (first contrast eigenvalue = 2.62, proportion of variance explained = 6.7%). Examination of a scree plot (Figure 5.7) revealed a pronounced elbow at the third contrast. With the smaller number of parcels (compared to individual items), patterns in contrast loadings were more interpretable: The first contrast was defined primarily by a cluster of tense and aspirated consonants with positive loadings. The second contrast appeared to be characterized mostly by a cluster of lax stops (/k, p, t/). Thus, it appeared that there may be some parcel dependence based on articulatory features associated with phonemes.

Figure 5.7. PCA of residuals for production parcels. TV = total variance, MV = variance explained by person & item measures, PV = variance explained by person measures, IV = variance explained by item measures, UV = unexplained variance, U1-5 = unexplained variance in PCA contrasts 1-5. Boxed region contains PCA contrast scree plot.

Table 5.4 contains summary statistics for person and parcel measures. As a group, examinees had generally high phoneme production ability compared to the difficulty of parcels. Figure 5.8 illustrates the test information function (TIF), which shows where along the continuum of examinee ability the parcels provide the most precise measurement. According to the TIF, the most accurate information is yielded for examinees with lower production abilities.
In terms of infit, nearly all parcels and most persons demonstrated good fit to the model (58 misfitting persons: 34 overfitting and 24 underfitting). For outfit, more persons and parcels showed misfit: 93 misfitting persons and 5 misfitting parcels. Most outfit issues for persons were associated with overfit (n = 60), which indicated that their responses were too predictable; outfit issues for parcels were slight. Full, detailed parcel statistics are found in the following sections.

Table 5.4
Rasch Measurement Summary for Production Parcels

Element    Avg. Measure  SD    Measure Range  Avg. Model S.E.  Avg. Infit (Range)  Avg. Outfit (Range)
Persons*   1.71          0.76  0.54 – 4.36    0.31             0.98 (0.50 – 2.27)  0.95 (0.18 – 2.67)
Parcels    0.00          0.71  -1.54 – 0.93   0.13             1.00 (0.88 – 1.19)  0.95 (0.33 – 1.44)
Note. *Based on 197 examinees with non-extreme (i.e., not perfect) scores.

Figure 5.8. Test information function for production parcels.

Perception Parcels. For the Rasch PCM of perception parcel scores, Rasch model parameters explained 47.5% of the variance in observations. A total of 260 observations out of 5,544 (4.7%) were unexpected at the Z ≥ |2.0| level. At the Z ≥ |3.0| level, there were 50 unexpected observations (1.0%). A PCA of model residuals found one contrast with an eigenvalue > 2.0 (first contrast eigenvalue = 3.45, proportion of variance explained = 6.5%). Examination of a scree plot (Figure 5.9) suggests an elbow at the second or third contrast. First contrast loadings suggest that some relation among phonemes with similar articulations may influence measurement. For example, the affricates /ʨ, ʨ*, ʨʰ/ (lax, tense, and aspirated, respectively) all had large positive loadings (> .40). Similar patterns were observable for other stop consonants with similar places and manners of articulation, although sometimes lax stops had negative loadings.

Figure 5.9. PCA of residuals for perception parcels. TV = total variance, MV = variance explained by person & item measures, PV = variance explained by person measures, IV = variance explained by item measures, UV = unexplained variance, U1-4 = unexplained variance in PCA contrasts 1-4. Boxed region contains PCA contrast scree plot.

Table 5.5 contains summary statistics for person and parcel measures. As a group, examinees had generally higher phoneme perception ability compared to the difficulty of parcels, but compared to the production parcels there was more overlap. Figure 5.10 illustrates the test information function (TIF), which shows where along the continuum of examinee ability the parcels provide the most precise measurement. According to the TIF, the most accurate information was yielded for examinees with low to moderate perception abilities. In terms of infit, nearly all parcels (except one, for /s*/) and most persons demonstrated good fit to the model (61 misfitting persons: 30 overfitting and 31 underfitting). For outfit, more persons and parcels showed misfit: 71 misfitting persons and 3 misfitting parcels. Most outfit issues for persons were associated with overfit (n = 45), which indicated that their responses were too predictable; outfit issues for parcels were slight. Full, detailed parcel statistics are found in the following sections.

Table 5.5
Rasch Measurement Summary for Perception Parcels

Element    Avg. Measure  SD    Measure Range  Avg. Model S.E.  Avg. Infit (Range)  Avg. Outfit (Range)
Persons    1.45          1.04  -0.40 – 5.80   0.30             1.00 (0.42 – 2.00)  0.99 (0.06 – 2.90)
Parcels    0.00          0.76  -1.34 – 1.82   0.12             1.00 (0.77 – 1.36)  0.99 (0.68 – 1.57)
Figure 5.10. Test information function for perception parcels.

Native Speakers. Summary statistics for total scores from the 6 NSs of Korean are in Table 5.6. NS performance on the KPD was at or very near ceiling; this was true for individual tasks as well. For phoneme parcels, the average NS production score was 99.9% (SD = 0.2%, range = 99.6% – 100%) and the average perception score was 98.5% (SD = 1.1%, range = 96.5% – 99.4%).

Table 5.6
Summary of NS KPD Scores

Section                              k     M       SD    Range
All                                  352   349.50  1.38  348 – 351
Production                           217   216.67  0.52  216 – 217
  Task 1 – Picture Naming            154   153.67  0.52  153 – 154
  Task 2 – Nonword Reading            63    63.00  0.00   63 – 63
Perception                           135   132.83  1.17  131 – 134
  Task 3 – Pronunciation Judgment*    72    70.17  0.75   69 – 71
  Task 4 – Identification             63    62.67  0.52   62 – 63
Note. *Excluding filler items.

Reliability

This section details the reliability of the KPD, including estimates of internal consistency for all parts of the KPD and estimates of inter-scorer agreement for the production section.

Internal Consistency. Internal consistency estimates (Cronbach's alpha) for the KPD are in Table 5.7. Across the board, most estimates exceed recommended thresholds for low-stakes testing. The lowest reliability estimate, .65, comes from an item-level analysis of the Identification task. Many of the alpha values obtained are similar to those from the pilot study (Chapter 2), and once again it appeared that item parcels sacrifice little in terms of internal consistency.

Table 5.7
Internal Consistency of the KPD

Section                              k (Items)  alpha (Items)  k (Parcels)  alpha (Parcels)
All                                  352        .92            56           .91
Production                           217        .83            28           .78
  Task 1 – Picture Naming            154        .72
  Task 2 – Nonword Reading            63        .74
Perception                           135        .89            28           .89
  Task 3 – Pronunciation Judgment*    72        .89
  Task 4 – Identification             63        .65
Note. *Excluding filler items.

The Rasch person reliability figures for the KPD production and perception sections (Table 5.8) are similar to the corresponding Cronbach's alpha estimates. Little to no reliability in distinguishing overall production and perception ability appears to be lost when parceling items.

Table 5.8
Rasch Person Reliability Estimates for the KPD

Section     k (Items)  Person Reliability (Items)  k (Parcels)  Person Reliability (Parcels)
Production  217        .82                         28           .78
Perception  135        .90                         28           .90

Production Items – Inter-Scorer Agreement. For each individual item assessing production of phonemes (including Task 1 – Picture Naming and Task 2 – Nonword Reading), percent agreement, Fleiss' Kappa, and Gwet's AC1 were computed based on the scores assigned by the seven scorers. Summary statistics for these agreement indices, based on all 217 items, are presented in Table 5.9.

Table 5.9
Inter-Scorer Agreement for Individual Production Items

Index              Mean   SD     Range
Percent Agreement  85.39  15.55  30 – 100
Fleiss' Kappa       0.48   0.40  -0.11 – 1.00
Gwet's AC1          0.93   0.09  0.49 – 1.00

While the average Fleiss' Kappa indicates only moderate agreement among scorers, the average percent agreement and Gwet's AC1 tell a different story. Due to a high prevalence of intelligible pronunciation (i.e., correct responses), the reduced negative bias of Gwet's AC1 statistic better reflects the simple percent agreement. Figure 5.11 shows the distribution of the three agreement indices across items. While many items have Kappa values interpretable as "none" or "slight" (Landis & Koch, 1977), the large majority of items have AC1 values associated with substantial or near-perfect agreement among raters.

Figure 5.11. Histograms of item agreement indices for individual items based on all seven scorers.
Given the high prevalence of correct responses and the closer alignment of Gwet's AC1 values with the intuitive percent agreement figures, the AC1 values were examined in further detail. Following Landis and Koch's (1977) guidelines (per Gwet, 2008), three items had less than substantial agreement: T2_08 (target phoneme: /t/), T2_23 (target phoneme: /ʨ*/), and T1_30-3 (target phoneme: /ʨ*/; the ㅉ in 왼쪽, left). An additional 18 items (8% of all items) had AC1 values between 0.60 and 0.80, indicating substantial agreement. All other values obtained for individual items exceeded 0.80, indicating almost perfect agreement.

Production Parcels – Inter-Scorer Reliability. The mean ICCs across all 28 item parcels are shown in Table 5.10. The ICC focused on consistency of ratings and the ICC focused on absolute agreement in parcel scores were similar in magnitude and dispersion. Parcel scores across the seven raters ranged from essentially no agreement to perfect agreement.

Table 5.10
Inter-Scorer Reliability for Item Parcel Scores

Index  Mean  SD    Range
ICCC   0.50  0.22  -0.02 – 1.00
ICCA   0.48  0.23  -0.02 – 1.00

Table 5.11 contains ICC estimates for each phoneme parcel. Many of the ICC values fell into the 'poor' range, with 12 phoneme parcels in the 'moderate' or 'good' range. In some cases, closer inspection of the data revealed extremely high accuracy rates for some phonemes, leading to low variability among the 20 test-takers; this results in low ICC values despite generally similar scores being given to each examinee. For example, the phoneme /m/ (ㅁ) had an ICCA and ICCC of -0.02, the lowest among all phonemes and a figure that essentially indicates no interrater reliability. Yet out of the 140 parcel scores assigned to the 20 test-takers by the seven scorers, 129 were 100%, 10 were 87.5%, and one was 75%; the standard deviation of these 140 /m/ parcel scores was 3.8%.

Table 5.11
Inter-Scorer Reliability/Agreement Indices for all Parcel Scores and Diagnostic Flags

           Parcel Accuracy Scores   Diagnostic Flags
Phoneme    ICCA    ICCC             Percent Agreement  Fleiss' Kappa  Gwet's AC1
ㄱ /k/     0.61    0.64              80                  0.10          0.91
ㅋ /kʰ/    0.66    0.69              80                  0.47          0.91
ㄲ /k*/    0.46    0.51              75                  0.17          0.87
ㄷ /t/     0.61    0.62              60                  0.35          0.73
ㅌ /tʰ/    0.48    0.49              90                  0.38          0.95
ㄸ /t*/    0.83    0.84              90                  0.84          0.97
ㅂ /p/     0.25    0.30              65                  0.12          0.82
ㅍ /pʰ/    0.68    0.68              80                  0.60          0.92
ㅃ /p*/    0.56    0.57              90                 -0.01          0.97
ㅈ /ʨ/     0.14    0.15             100                  1.00          1.00
ㅊ /ʨʰ/    0.45    0.48              80                  0.15          0.91
ㅉ /ʨ*/    0.55    0.57              45                  0.44          0.63
ㅅ /s/     0.20    0.35              60                  0.12          0.84
ㅆ /s*/    0.34    0.42              50                  0.26          0.73
ㅎ /h/     1.00    1.00             100                  1.00          1.00
ㄹ /l/     0.44    0.48              80                  0.03          0.93
ㅁ /m/    -0.02   -0.02             100                  1.00          1.00
ㄴ /n/     0.66    0.66              70                  0.26          0.84
ㅇ /ŋ/     0.36    0.41              80                  0.03          0.93
ㅏ /ɑ/     0.57    0.56             100                  1.00          1.00
ㅣ /i/     0.69    0.69             100                  1.00          1.00
ㅔ /ɛ/     0.86    0.87             100                  1.00          1.00
ㅓ /ʌ/     0.45    0.49              50                  0.25          0.77
ㅗ /o/     0.37    0.38             100                  1.00          1.00
ㅜ /u/     0.16    0.16             100                  1.00          1.00
ㅡ /ɯ/     0.27    0.30             100                  1.00          1.00
/w/        0.48    0.51              70                  0.03          0.88
/j/        0.61    0.63              80                  0.50          0.92

Production Parcels – Identification of Diagnostic Weaknesses across Scorers. Parcel scores from each scorer were dichotomized using the 75% accuracy threshold, where scores below the threshold were flagged as targets requiring further instruction (see Chapter 3 for discussion of this approach). Summary statistics for agreement on item parcel diagnostic flags are contained in Table 5.12.
Overall, the average agreement of diagnostic classifications across phonemes was 81.25%, which yielded an average Fleiss' Kappa that would be considered moderate and an average Gwet's AC1 that would be considered near perfect following Landis and Koch (1977).

Table 5.12
Inter-Scorer Agreement for Diagnostic Flags

Index              Mean   SD     Range
Percent Agreement  81.25  17.35  45 – 100
Fleiss' Kappa       0.50   0.39  -0.10 – 1.00
Gwet's AC1          0.91   0.10  0.63 – 1.00

Table 5.11 also contains diagnostic flag agreement index values for each phoneme parcel. As with the agreement index values for individual items, percent agreement and Gwet's AC1 point in the same direction: toward rather high levels of inter-scorer agreement for most phonemes. Due to a preponderance of phonemes not being flagged diagnostically, the same phenomenon of large, negative bias in Kappa values is present here, too. Informally, I observed that many of the phoneme parcels with relatively lower agreement also tended to be those phonemes which were among the most difficult, on average, in the full field-testing sample.

Item Analyses

In this subsection, analyses of individual items and parcels utilizing both CTT and Rasch approaches are presented in detail. More detail is provided for item parcels, as these are the primary units of score interpretation and use for the KPD (i.e., learners and teachers will make instructional decisions based on phoneme difficulties, not difficulties with individual items on the KPD). Finally, the item and parcel statistics of NSs (CTT only; the sample size was too small for Rasch analysis) are covered in brief.

CTT Item Analyses. Item analyses for individual items and item parcels follow.

Individual Items. Item analyses based on all 198 test-takers for production and perception items indicated generally high item facility and minimal levels of discrimination (point-biserial). For production items, the mean item facility was 0.93 (k = 217, SD = 0.10, range = 0.48 – 1.00), with 30 items answered correctly by all 198 test-takers. The most difficult items tended to target tense consonants, though the seventh most difficult item targeted the vowel /ʌ/. Mean discrimination for the production items where at least one examinee was scored 0 was 0.14 (k = 135, SD = 0.11, range = -0.09 – 0.46). Production items with poor discrimination tended to be very easy, with nearly all examinees having earned scores of 1, while items with stronger discrimination values tended to be relatively more difficult. For perception items, mean item facility was 0.81 (SD = 0.22, range = 0.14 – 1.00), with 14 items answered correctly by all learners. The most difficult items targeted the phonemes /s, s*/, and higher-difficulty items were generally diverse in their targets, including consonants, vowels, and glides. The easiest items tended to be found in Task 4 and targeted phonemes such as /w, o, ɛ, l, m, n/. Mean discrimination, based on 121 items, was 0.24 (SD = 0.14, range = -0.17 – 0.58). Perception items presenting at least a moderate degree of difficulty tended to have better discrimination, similar to the production items, with the notable exception of item T3_34 targeting /s*/, which was the most difficult and least discriminating item (D = -0.17). Complete CTT (and Rasch) item statistics based on all 198 test-takers are available in Appendix J for all individual production (Table J1) and perception items (Table J2).

Parcels. For CTT parcel analyses, all raw parcel scores were converted to percentages.
The mean production parcel facility (average percentage correct) was 90.7% (SD = 8.5%, range = 67.2% – 99.7%). For perception parcels, mean facility was 80.1% (SD = 10.8%, range = 56.2% – 97.2%). Parcel facility statistics are displayed visually in Figure 5.12. As can be seen in the figure, production parcels were generally easier than perception parcels. The most difficult phonemes to produce were tensed consonants, which were also among the most difficult phonemes to perceive. The easiest phonemes to produce included cross-linguistically common vowels like /ɑ, i/ and consonants like /m, h/. Some sounds were noticeably more difficult to perceive than produce, such as /s, s*/. For parcel discrimination, production parcels had a mean discrimination (r) of 0.29 (SD = 0.13, range = 0.08 – 0.54). Consonants with the tenseness feature, the nasal /ŋ/, the vowel /ʌ/, and the glide /j/ had the strongest discrimination, while other vowels tended to have low discriminatory power. Perception parcels had a mean discrimination of 0.45 (SD = 0.15, range = 0.08 – 0.67). While vowels such as /ɑ, i, u/ had relatively lower discrimination, similar to production parcels, there were no clear patterns in terms of perception parcels with strong discriminatory power; a wide range of phonemes had high discrimination values. Complete parcel statistics for production and perception phonemes are in Table 5.13 and Table 5.14, respectively.

Figure 5.12. Average accuracy (inverse of difficulty) for each phoneme parcel on the production (y-axis) and perception (x-axis) sections of the KPD.

Table 5.13
Production Parcel Statistics

Phoneme   k   Mean   SD    Min  Max  Mean %  SD %  r*   Measure  S.E.  Infit MS  Infit Z  Outfit MS  Outfit Z
ㄱ /k/    14  13.10  1.10   9   14    94      8    .32   0.52    0.07  1.04       0.35    0.95       -0.28
ㅋ /kʰ/    6   5.44  0.89   1    6    91     15    .30   0.25    0.09  1.09       0.59    1.12        0.71
ㄲ /k*/    4   2.97  0.97   0    4    74     24    .45   0.34    0.08  0.98      -0.24    0.95       -0.43
ㄷ /t/     9   7.75  1.23   4    9    86     14    .38   0.76    0.07  0.96      -0.34    0.97       -0.19
ㅌ /tʰ/    5   4.31  1.05   0    5    86     21    .34   0.30    0.08  1.11       0.83    1.08        0.51
ㄸ /t*/    4   3.08  1.07   0    4    77     27    .54   0.71    0.08  0.93      -0.64    0.83       -1.32
ㅂ /p/     7   6.66  0.63   4    7    95      9    .32  -0.33    0.12  0.93      -0.45    0.82       -0.90
ㅍ /pʰ/    5   4.48  0.85   1    5    90     17    .38   0.25    0.09  0.96      -0.22    0.88       -0.61
ㅃ /p*/    4   3.15  1.12   0    4    79     28    .55   0.72    0.07  0.92      -0.71    0.85       -0.95
ㅈ /ʨ/     8   7.76  0.59   4    8    97      7    .23  -0.06    0.13  0.94      -0.16    0.77       -0.81
ㅊ /ʨʰ/    4   3.53  0.80   0    4    88     20    .33   0.05    0.10  1.03       0.27    0.96       -0.15
ㅉ /ʨ*/    4   2.69  0.94   0    4    67     23    .44   0.93    0.09  0.94      -0.56    0.93       -0.68
ㅅ /s/    10   9.72  0.57   7   10    97      6    .14  -0.13    0.13  1.05       0.32    1.04        0.26
ㅆ /s*/    7   5.67  1.12   2    7    81     16    .38   0.75    0.08  1.07       0.67    1.02        0.21
ㅎ /h/     4   3.98  0.16   2    4   100      4    .20  -1.29    0.45  0.97       0.24    0.33       -0.20
ㄹ /l/    12  11.41  0.93   8   12    95      8    .14   0.50    0.08  1.13       0.90    1.31        1.63
ㅁ /m/     8   7.93  0.27   6    8    99      3    .11  -1.25    0.27  0.99       0.06    0.78       -0.44
ㄴ /n/    10   9.42  0.84   6   10    94      8    .24   0.28    0.09  1.03       0.27    0.96       -0.21
ㅇ /ŋ/    10   9.36  1.15   3   10    94     12    .38   0.36    0.07  0.92      -0.45    1.00        0.06
ㅏ /ɑ/    14  13.95  0.21  13   14   100      1    .19  -1.53    0.34  0.97       0.00    0.68       -0.71
ㅣ /i/    15  14.90  0.33  13   15    99      2    .17  -0.87    0.22  0.99       0.04    0.94       -0.06
ㅔ /ɛ/     9   8.77  0.45   7    9    97      5    .09  -0.75    0.17  1.05       0.40    1.41        2.21
ㅓ /ʌ/     5   4.31  0.87   1    5    86     17    .41   0.16    0.09  0.88      -1.06    0.79       -1.58
ㅗ /o/    12  11.77  0.56   9   12    98      5    .19  -0.03    0.13  1.00       0.08    0.95       -0.08
ㅜ /u/     4   3.90  0.32   2    4    97      8    .17  -1.22    0.23  1.01       0.13    1.09        0.38
ㅡ /ɯ/     4   3.81  0.43   2    4    95     11    .20  -0.59    0.17  0.97      -0.13    0.91       -0.37
/w/       10   9.34  0.84   7   10    93      8    .12   0.59    0.09  1.19       1.66    1.44        2.89
/j/        9   8.05  1.04   5    9    89     12    .38   0.56    0.08  0.99      -0.03    0.95       -0.34
Note. *Parcel-total correlation with the parcel dropped from the total score. Measure and S.E. are Rasch estimates in logits.

Table 5.14
Perception Parcel Statistics

Phoneme   k   Mean  SD    Min  Max  Mean %  SD %  r*   Measure  S.E.  Infit MS  Infit Z  Outfit MS  Outfit Z
ㄱ /k/     6  4.96  0.90   3    6    83     15    .45   0.60    0.10  1.06       0.70    1.03        0.27
ㅋ /kʰ/    4  2.84  0.92   0    4    71     23    .59   0.04    0.09  0.91      -0.87    0.89       -1.01
ㄲ /k*/    4  2.97  0.83   0    4    74     21    .58  -0.04    0.10  0.87      -1.24    0.85       -1.36
ㄷ /t/     6  5.47  0.77   3    6    91     13    .50  -0.25    0.11  0.88      -1.06    0.72       -1.59
ㅌ /tʰ/    4  3.01  0.74   1    4    75     19    .52   0.30    0.11  0.95      -0.46    0.96       -0.41
ㄸ /t*/    4  3.19  0.94   0    4    80     24    .66  -0.01    0.09  0.78      -2.06    0.68       -2.47
ㅂ /p/     6  4.30  1.20   1    6    72     20    .55   0.54    0.08  0.99      -0.09    0.94       -0.54
ㅍ /pʰ/    4  3.01  0.82   1    4    75     20    .66   0.22    0.10  0.77      -2.73    0.74       -2.76
ㅃ /p*/    4  2.96  0.99   0    4    74     25    .50   0.07    0.09  1.08       0.85    1.07        0.58
ㅈ /ʨ/     4  2.91  1.07   0    4    73     27    .67   0.26    0.08  0.80      -2.09    0.76       -2.01
ㅊ /ʨʰ/    4  3.06  0.81   1    4    76     20    .64   0.31    0.10  0.83      -1.90    0.78       -2.16
ㅉ /ʨ*/    4  2.76  0.94   0    4    69     24    .53   0.37    0.09  1.01       0.17    1.00        0.05
ㅅ /s/     8  5.30  1.22   3    8    66     15    .43   1.58    0.08  1.28       2.62    1.27        2.54
ㅆ /s*/    6  3.37  1.02   1    6    56     17    .33   1.82    0.09  1.36       3.25    1.35        3.15
ㅎ /h/     4  3.89  0.32   3    4    97      8    .28  -0.96    0.23  0.97      -0.14    0.80       -0.51
ㄹ /l/     8  7.31  0.93   4    8    91     12    .33  -0.10    0.09  1.16       1.26    1.24        1.39
ㅁ /m/     6  5.77  0.48   4    6    96      8    .38  -0.79    0.16  0.93      -0.47    0.79       -0.83
ㄴ /n/     6  5.49  0.73   2    6    92     12    .33  -0.50    0.11  1.08       0.64    1.01        0.11
ㅇ /ŋ/     4  3.35  0.82   0    4    84     21    .41  -0.39    0.10  1.06       0.55    1.03        0.25
ㅏ /ɑ/     3  2.91  0.33   1    3    97     11    .20  -1.22    0.22  0.96      -0.05    0.99        0.16
ㅣ /i/     3  2.84  0.39   1    3    95     13    .08  -1.34    0.19  1.12       0.70    1.57        1.69
ㅔ /ɛ/     3  2.78  0.44   1    3    93     15    .39  -1.26    0.17  0.92      -0.62    0.87       -0.50
ㅓ /ʌ/     3  2.34  0.66   1    3    78     22    .50   0.47    0.12  0.91      -1.02    0.87       -1.22
ㅗ /o/     3  2.33  0.57   1    3    78     19    .39   0.19    0.14  1.00       0.01    1.01        0.09
ㅜ /u/     3  2.07  0.80   0    3    69     27    .27   0.24    0.10  1.26       2.54    1.25        2.25
ㅡ /ɯ/     3  2.69  0.61   0    3    90     20    .43  -0.90    0.13  0.95      -0.34    1.01        0.11
/w/        8  7.23  0.85   4    8    90     11    .44  -0.12    0.10  1.03       0.28    1.15        1.09
/j/       10  8.02  1.14   5   10    80     11    .58   0.87    0.08  0.99      -0.08    0.98       -0.16
Note. *Parcel-total correlation with the parcel dropped from the total score. Measure and S.E. are Rasch estimates in logits.

Rasch Item Analyses. Rasch item analyses for individual items and item parcels follow.

Individual Items.
Figure 5.13 shows the relationships between item difficulty and person ability for the production and perception items through plots referred to as Wright maps (or variable plots). In each plot, the logit scale, which is used to describe both item difficulty and person ability, is indicated on the y-axis. The left side of each Wright map shows the distribution of person ability estimates, and the right side shows the distribution of item difficulty estimates, where items located higher up are more difficult. Where an item and a person are parallel on the map, that person has a .50 probability of responding correctly to that item. As the Wright maps indicate, relatively few production items presented much of a challenge for most learners on average, but for perception, roughly a third to a half of item difficulties were in the range where many learners would find them challenging. Among production items, those targeting tense consonants were frequent at the higher end of the item difficulty continuum, and items targeting vowels such as /ɑ, i, ɛ/ were common at the lower end. For perception items, items targeting /s, s*/ and aspirated stop consonants were common at the higher end while vowels and glides were common at the lower end.

Figure 5.13. Wright maps for the KPD (A) production (Task 1 and Task 2) and (B) perception (Task 3 and Task 4) individual items. The left columns on each plot show test-taker ability (higher = more able) while the right columns show item difficulty (higher = more difficult).

Considering item fit, as previously mentioned, there were relatively few issues with infit for either production or perception items. Among production items with outfit issues, many of the overfitting (outfit < 0.7) items were relatively easy and targeted vowels such as /ɑ, i, ɛ, o/ and the consonants /h, m/; examinees with abilities substantially greater than the difficulty of these items rarely produced them inaccurately. Underfitting items (outfit > 1.3) varied in their difficulty and target phonemes, but there was a noticeable preponderance of glides, nasals, and the liquid /l/ among underfitting production items. For perception items, the overfitting items were on the easier side but otherwise had little in common. Underfitting items varied in difficulty, but some patterns did emerge in terms of targets: several items targeting glides (/w, j/), /m/, and the fricative consonants /s, s*/ were among the 16 underfitting perception items. Complete Rasch item statistics for production and perception items are found in Appendix J (Table J1 and Table J2, respectively).

Parcels. Figure 5.14 plots the parcel difficulties, in Rasch-scaled logits, of each phoneme in perception (x-axis) and production (y-axis). Phonemes closer to the diagonal had comparable perception and production difficulties, while those further from the diagonal were easier or harder in one modality. There were many similarities between the Rasch measures and the percentages based on observed scores (Figure 5.12). For example, among the easiest phonemes in both modalities were /ɑ, h, m/. The most difficult phonemes to produce were the tensed consonants, and also /t/. The most difficult phonemes to perceive were /s, s*/, but these were not the most difficult to produce (though /s*/ was among the most difficult). Complete Rasch parcel statistics for the production and perception sections of the KPD are contained in Table 5.13 and Table 5.14.
Figures 5.15 and 5.16 provide conventional Wright maps, showing the distribution of average parcel difficulties relative to test-taker abilities, and expected score category keyforms, which show the test-taker ability ranges associated with scores on each parcel (ranges divided by colons ":"), for production and perception parcels, respectively. Much like the observed score parcel analyses, the two Wright maps reveal that production parcels were relatively easy for most test-takers while perception parcels were more likely to present a challenge. The category keyforms allow for direct comparisons of scores across parcels, despite the fact that many parcels contained different total numbers of items. For instance, a learner with an overall Korean phoneme production ability of 1.0 logits would be expected to score 2 out of 4 on /ʨ*/ and 15 out of 15 on /i/. Similarly, a learner with an overall phoneme perception ability of 1.0 logits would be expected to score 3 out of 6 on /s*/ and 3 out of 3 on /i/. Thus, given the high abilities of examinees relative to parcel difficulties, individuals scoring lower on a particular phoneme would often be considered unexpected from the Rasch perspective. On the note of differing numbers of items, some parcel total scores were never achieved. For example, for the production parcel for /ɑ/ (Figure 5.15, near the bottom), only aggregated scores of 13 or 14 were observed, effectively rendering it a dichotomous item with only one threshold. This phenomenon is examined in closer detail in the following section.

Figure 5.14. Rasch item difficulty measures for each phoneme parcel on the production (y-axis) and perception (x-axis) sections of the KPD. Axes inverted; easier parcels are located upward (production) and rightward (perception).

Figure 5.15. Visual summary of production parcel difficulties (A) and category thresholds (B).

Figure 5.16. Visual summary of perception parcel difficulties (A) and category thresholds (B).

In terms of parcel fit, few phonemes in either modality exhibited any substantial misfit. For infit, no production parcels misfit, and only one perception parcel demonstrated slight underfit: /s*/ (infit = 1.36). For outfit, the production parcel /h/ considerably overfit (outfit = 0.33) while three parcels exhibited slight underfit: /l/ (outfit = 1.31), /ɛ/ (outfit = 1.41), and /w/ (outfit = 1.44). One perception parcel, /t*/, had slight overfit (outfit = 0.68), and two perception parcels underfit: /s*/ (outfit = 1.35) and /i/ (outfit = 1.57). Although parcels generally had acceptable fit to the Rasch model, it is worth examining other technical qualities in further depth.

Figures 5.17 and 5.18 display the item information function (IIF, blue regions) and the partial-credit step probability curves (black lines) for each phoneme parcel in production and perception, respectively. IIFs show where, and how much, information is gleaned about examinees relative to the mean parcel difficulty. For example, the production parcel for /k*/ provides some information about test-takers across a range of ability, while the production parcel for /m/ provides most of its information at lower ability ranges. The step probability curves represent the probability of an examinee of a given ability level obtaining a given step score. As noted previously, some step categories are missing due to no examinees earning very low scores on some parcels, such as /o, s/.
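For the dichotomous Rasch model, the information functions plotted in these figures have a simple closed form: item information at ability theta is p(1 - p), where p is the probability of a correct response, and the TIF is the sum of item informations. The short R sketch below computes and plots a TIF from a handful of illustrative item difficulties (not the actual KPD estimates).

```r
# Information functions for the dichotomous Rasch model; difficulties below
# are hypothetical, chosen only to illustrate the shape of a TIF.
theta <- seq(-4, 8, by = 0.1)            # ability grid in logits
difficulties <- c(-1.5, 0, 0.9, 3.4)     # hypothetical item difficulties

iif <- sapply(difficulties, function(b) {
  p <- plogis(theta - b)                 # P(correct) at each ability level
  p * (1 - p)                            # item information function
})
tif <- rowSums(iif)                      # test information function

plot(theta, tif, type = "l",
     xlab = "Ability (logits)", ylab = "Information")
```

Because p(1 - p) peaks at p = .5, information is always greatest where item (or parcel) difficulty is near examinee ability, which is why the easy production parcels yield little information about high-ability examinees.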
The curves for /k*/ in production have distinct peaks, which is generally desirable for score interpretation and indicates that test-takers with differing phoneme production abilities are likely to earn different observed scores. For many production phonemes, several step probability curves are highly overlapping or completely subsumed by other steps, e.g., /ŋ/. In these cases, differences in person ability at certain ranges are not reflected well by observed score differences. Such cases support the collapsing/combining of several categories for the purpose of measurement, not entirely unlike dichotomizing a parcel score to arrive at a diagnostic flag (i.e., the 75% diagnostic flag threshold).

Figure 5.17. Item information and partial-credit step probability plots for production parcels.

Figure 5.18. Item information and partial-credit step probability plots for perception parcels.

In general, the perception parcels provided more information across a wider range of learner abilities compared to the production parcels. Perception parcels also featured more distinct step probability curves. Much of this is due to the relatively higher difficulty of perception parcels; information is always greatest where parcel difficulty is near examinee ability. On the other hand, the IIFs and step probability curves are suggestive of inflection points that might distinguish between learners who have strong control of a phoneme and those who do not. The overall production abilities of many examinees were far beyond the last 2-3 step thresholds for many parcels. Thus, higher scores on these parcels provided relatively little information of use; more could be learned about learner abilities when they earned only middling or low scores on a parcel.

Native Speakers. The six NSs earned scores of 1 on nearly every individual KPD item. For production items, the average item facility was 0.99 (SD = 0.02, range = 0.83 – 1.00). Only 2 out of 217 items were responded to incorrectly, each by just one person: T1_23-1 (the glide /j/ in 의자, chair) and T1_33-6 (the /n/ in the coda of the second syllable of 빨간색, red). Both items were found in Task 1. For perception items, the average item facility was 0.98 (SD = 0.10, range = 0.00 – 1.00). Six perception items were responded to incorrectly by NSs: T3_16 (4/6 correct responses, the final /k/ in 미국, America), T3_34 (0/6 correct responses, the /s*/ in 접시, dish), T3_59 (5/6 correct responses, the /u/ in 눈, eye), T3_71 (4/6 correct responses, the /w/ in 원, Korean Won), T4_07 (5/6 correct responses, the /u/ in 우), and T4_54 (5/6 correct responses, the /pʰ/ in 이피).

For phoneme parcels, the average NS production score was 99.9% (SD = 0.5%, range = 98.1% – 100%) and the average perception score was 98.5% (SD = 3.9%, range = 83.3% – 100%). For production, the only parcels with less-than-perfect scores across all six NS participants were /j/ (one NS received a score of 89%) and /n/ (one NS received a score of 90%). No NS was diagnostically flagged for any phoneme in production. For perception, the following five parcels had less-than-perfect scores: /k/ (two NSs received scores of 83.3%), /pʰ/ (one NS received a score of 75%), /s*/ (all six NSs received scores of 83.3%), /u/ (one NS received a score of 33%), and /w/ (two NSs received scores of 87.5%). One NS would have received a secondary diagnostic flag indicating difficulty hearing /u/, but they would not have received the primary flag for difficulty producing that phoneme.
While these lower perception scores are not ideal, they are perhaps reflective of the high yet imperfect NS performance observed in speech perception research, even in favorable (i.e., quiet, lacking background noise) listening conditions (e.g., Broersma & Scharenborg, 2010; Cutler, Weber, Smits, & Cooper, 2004).

Internal Structure

Examining the internal structure of the various parts of a test can provide information on the degree to which test scores align with expectations about the relationships among (sub)constructs. For the KPD, theory strongly suggests that phoneme production and perception abilities should be related at least moderately. Mechanistic expectations of how and what knowledge and skills are elicited by the various KPD tasks, gleaned from psycholinguistic processing models, hold that scores from production tasks should be at least moderately related, and the same goes for perception tasks. Some degree of relationship among scores from all tasks would in turn be expected. Tasks that tap into orthographic knowledge (and/or sound-symbol correspondences) and tasks that tap into lexical knowledge (i.e., meanings and phonological forms of relatively common lexical items) were also expected to be correlated. At a more intricate level, scores for each phoneme in production and perception are expected to be moderately correlated, in line with theory and empirical findings from speech learning.

Production and Perception Total Score Correlations. The correlation between total Production (raw sum of correct Task 1 and Task 2 items) and Perception (raw sum of Task 3 and Task 4 items) scores was r = .74 (df = 196, p < .001). Figure 5.19 presents this relationship in a scatterplot.

Figure 5.19. Scatterplot of production and perception raw total scores.

Task Total Correlations. I computed correlations among task sum scores, as well as correlations between each task and the total KPD score minus that task (Table 5.15). The KPD tasks largely correlated with one another and with the sum of all other tasks.

Table 5.15
Correlations Among KPD Task Sum Scores
                                  Task 1   Task 2   Task 3   Task 4
Task 1 – Picture Naming            1.00
Task 2 – Nonword Reading            .63     1.00
Task 3 – Pronunciation Judgment     .66      .63     1.00
Task 4 – Identification             .52      .59      .65     1.00
Total - Task                        .71      .70      .76      .69

Production and Perception Phoneme Parcel Correlations. The correlation between learners' average production parcel accuracy and average perception parcel accuracy was r = .73 (df = 196, p < .001). This relationship is shown visually in Figure 5.20. Within each learner, the average correlation between all 28 production and perception phoneme parcels was Spearman's ρ = 0.20, with a standard deviation of 0.21 and a range of -.28 to 0.70. Some smaller (and small negative) individual correlations may be attributable to lack of variability (e.g., learners with very high scores in production and perception across all or most phonemes). In other cases, idiosyncratic differences in learner phonological systems may have yielded small negative correlations. Focusing at the phoneme level across all 198 learners, Table 5.16 contains the Spearman correlations between production and perception phonemes. These values ranged from -0.11 (/u/) to 0.52 (/t*/), with an average of 0.20.
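As a hedged illustration of how the section-, task-, and phoneme-level relationships above can be computed in R (the object names are assumptions, not the study's actual analysis scripts):

    # 'tasks' is an assumed data frame of the four task sum scores (columns
    # task1..task4); 'prod' and 'perc' are assumed 198 x 28 parcel score matrices.
    cor.test(tasks$task1 + tasks$task2, tasks$task3 + tasks$task4)  # section totals
    round(cor(tasks), 2)                               # task intercorrelations
    # Phoneme-level Spearman correlations across learners (one per phoneme):
    phoneme_rho <- sapply(1:28, function(j)
      cor(prod[, j], perc[, j], method = "spearman"))
    # Within-learner correlations across the 28 phonemes (one per learner);
    # cor() returns NA when a learner's scores show no variability:
    learner_rho <- sapply(1:198, function(i)
      cor(as.numeric(prod[i, ]), as.numeric(perc[i, ]), method = "spearman"))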
Phonemes with higher correlations tended to also have higher difficulties and higher discrimination or information across a wider range of examinees, as shown by the CTT and Rasch analyses; very small correlations appeared to be a product of limited variability (e.g., nearly all examinees earning maximal scores for /h/).

Figure 5.20. Scatterplot of production and perception parcel average accuracy scores.

Table 5.16
Phoneme Production and Perception Parcel Spearman Correlations
Phoneme      ρ        Phoneme      ρ
ㄱ /k/       0.21     ㅎ /h/      -0.04
ㅋ /kʰ/      0.31     ㄹ /l/       0.36
ㄲ /k*/      0.29     ㅁ /m/       0.07
ㄷ /t/       0.25     ㄴ /n/       0.24
ㅌ /tʰ/      0.27     ㅇ /ŋ/       0.26
ㄸ /t*/      0.52     ㅏ /ɑ/       0.04
ㅂ /p/       0.29     ㅣ /i/       0.03
ㅍ /pʰ/      0.35     ㅔ /ɛ/       0.00
ㅃ /p*/      0.39     ㅓ /ʌ/       0.31
ㅈ /ʨ/       0.27     ㅗ /o/       0.03
ㅊ /ʨʰ/      0.34     ㅜ /u/      -0.11
ㅉ /ʨ*/      0.16     ㅡ /ɯ/       0.23
ㅅ /s/       0.05     /w/          0.11
ㅆ /s*/      0.21     /j/          0.23

Discussion

In this chapter, I presented results pertaining to the measurement of Korean phoneme production and perception. I analyzed individual item and item parcel measurement models via CTT and Rasch analyses. I also reported additional analyses, including descriptive (technically CTT) analyses of a small sample of NS test data and inter-scorer reliability analyses for a subset of KPD learner test data. In many ways, the results of these several analyses point in similar directions. Broadly speaking, the KPD appeared to have many desirable measurement qualities: the test was of appropriate overall difficulty for diagnostic purposes, the KPD scores and diagnostic flags were adequately reliable considering the low stakes of the decision-making involved, the vast majority of items performed as intended, and the KPD sections and tasks were related in accordance with expectations. In what follows, I review these results in more detail as they pertain to the research questions.

RQ1a: How Reliable is the KPD?

By the standards of lower-stakes tests, such as classroom achievement tests, the KPD exceeds acceptable thresholds of reliability, and in some cases it meets the standards typically expected for high-stakes tests, such as standardized large-scale language proficiency tests. In terms of test reliability (Cronbach's alpha), the individual KPD items exceeded alpha values of .80 for both production and perception sections; the perception section (alpha = .89) approached levels of reliability more commonly associated with high-stakes standardized tests. Each KPD task also had acceptable reliability, with Task 4 – Identification showing the lowest overall level of reliability (.65). Little to no reliability was lost by parceling items according to target phonemes; perception reliability did not appreciably drop, and production reliability fell slightly to a still-respectable .78. Rasch estimates of person separation reliability, which is normally viewed as similar to the internal consistency of Cronbach's alpha, told a nearly identical story, as expected. High levels of internal consistency might naturally be expected for long tests. However, as mentioned, even after collapsing 100+ or 200+ items into 28 parcels, some of which had no observations at lower parcel scores, the KPD still obtained adequately high levels of reliability, which provides some additional evidence in its favor. Of course, each parcel provides considerably more information about test-takers' abilities than a single dichotomously scored item, so in some ways the minimal loss of reliability is not so unexpected.
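The reliability and agreement indices discussed in this and the following subsection can be estimated with standard R packages; the packages named below are one plausible toolkit and an assumption on my part, as the text does not specify the software used for these indices.

    library(psych)   # Cronbach's alpha
    alpha(prod_items)$total$raw_alpha   # 'prod_items' is an assumed 0/1 item matrix
    library(irrCAC)  # Gwet's AC1, designed for high-prevalence data
    gwet.ac1.raw(ratings)$est           # 'ratings' is an assumed items x raters matrix
    library(irr)     # kappa and ICC
    kappam.fleiss(ratings)              # kappa; deflated when responses are mostly correct
    icc(parcel_ratings, model = "twoway", type = "agreement")  # parcel-level ICC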
Additionally, given that test reliability is maximized when items are well-targeted to the range of examinee ability, the reliability indices obtained for the generally low-difficulty KPD items and parcels are also positive in terms of the interpretation of test scores.

RQ1b: How Reliably are Production Items Evaluated by Different Scorers?

In addition to test reliability, inter-scorer (intercoder) agreement results for the human-scored production section were also favorable, though not quite as robust in all indices. The classic index of inter-scorer agreement, kappa, showed very poor levels of agreement due to a high prevalence of correct responses (i.e., intelligible articulations). However, the intuitive percent agreement and Gwet's AC1, an intercoder reliability index designed to reduce bias in such contexts, found high levels of agreement among the 7 scorers at the level of individual items. According to Gwet's AC1, all items had at least moderate levels of agreement and most fell into the range of very good or nearly perfect. For parcels, ICC values were moderate on average, with some showing essentially no agreement or consistency among coders. However, this too appeared to be related to a high prevalence of correct responses/high parcel scores, where a small number of slight deviations could yield a low ICC value. When parcels were dichotomized for the purpose of assigning diagnostic flags, the average Gwet's AC1 was 0.91 and all parcels were in the range of very good to nearly perfect agreement. In sum, this provides reasonably compelling evidence that the KPD can be scored consistently by different teachers after minimal training; the levels of consistency found here are adequate for low-stakes, localized decision-making.

RQ2a: What is the Internal Structure of Test Tasks?

For overall scores (raw sums of individual items), there was a large correlation (Plonsky & Oswald, 2014) between learner production and perception abilities. The total scores for each task were also highly intercorrelated. These correlations aligned with general expectations for production and perception abilities to be substantially related (Flege, 1995; Isbell, 2017). At the level of individual phonemes, correlations between production and perception were present and positive, but smaller. Across all phonemes measured for all test-takers, the correlation between phoneme production and perception scores was .32, which may be interpreted as small to medium following Plonsky and Oswald (2014). Within each learner, the average correlation (Spearman's rank-order) was small at .20. Interestingly, there was substantial variation in these within-learner correlations; some learners had almost no correlation, or even negative correlations, between their perception and production of phonemes. Some cases of no correlation seem plausibly connected to very little variation in both production and perception scores (e.g., a learner with very high scores across the board). In other cases, it may well be learner idiosyncrasies at work, though undesirable influences on measurement cannot be entirely ruled out (e.g., measurement error attenuating correlations). At the specific phoneme level (across all learners), correlations between production and perception ranged from essentially nothing to medium-sized. Here, it was clear that the generally easier sounds had weaker correlations due to ceiling effects/restriction of range.
Otherwise, the results speak positively to expected relationships between phoneme perception and production (Flege, 1995).

RQ2b: To What Extent Do Item Difficulty Hierarchies Align with Expectations?

Specific to L2 Korean phonology, several phonemes were expected to be the most difficult to produce and among the most difficult to perceive: all tensed consonants and /l/ (Kim, 2015; Lee, Moon, & Long). All tensed consonants were indeed among the most difficult for learners to produce and perceive, on average, but somewhat surprisingly, /l/ was not. In fact, in observed score analyses, /l/ was among the most accurately produced phonemes (95%), though Rasch measures placed /l/ more toward the middle of the difficulty continuum. One explanation for this, and a desirable one at that, is that the scoring criteria for the productive task – i.e., unambiguous, not necessarily native-like, intelligible pronunciation – made it possible for learners to be relatively successful with /l/, thanks to Korean lacking any other liquids or phonemes with qualities similar to /l/. In other words, even if an /l/ was substituted with a phone like [ɻ] (which is not present in Korean) by a Chinese or American English speaker, it was unlikely to be misheard, or heard with uncertainty, by the scorer. This apparent phenomenon also relates to the previous RQ, whereby production accuracy tended to exceed perception accuracy in many cases. Aspirated consonants were of moderate difficulty in production and perception, which also finds support in the literature (e.g., Holliday, 2014). Phonemes that were easier to produce and perceive tended to be cross-linguistically common vowels such as /i/ and consonants such as /m/, which makes intuitive sense along the lines of cross-linguistic influence and aligns generally with speech learning theories (e.g., Best & Tyler, 2007; Flege, 1995). Additionally, NSs generally performed at ceiling for all phonemes, which was also expected. In sum, the hierarchy of item difficulty across the sample of learners largely aligned with theory and previous findings once due consideration was given to the Intelligibility Principle-based scoring criteria for production items.

Additional Considerations

Beyond the specific RQs that motivated the analyses in this chapter, the results also motivate additional discussion and consideration of issues of measurement models and analytical approaches for diagnostic assessments such as the KPD. While both item and parcel measurement models appeared to work adequately, I favored the parcel model due to its more direct relation to the way scores were intended to be interpreted and used, even though this resulted in some minor loss of reliability in the production section. Additionally, the question of which measurement analytical approach (CTT or Rasch) is best suited for a diagnostic like the KPD, with the sample available, remains unsettled. Both analyses yielded generally similar information, though the estimation of item difficulty differed in some cases. It was also not clear that traditional Rasch conceptualizations of overall ability and the unexpectedness of observations could be straightforwardly applied to the task of diagnostic flagging, though this avenue was not explored in depth and could not be ruled out entirely. These issues will be revisited in more depth in the Conclusion chapter.
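As a toy illustration of the two analytic lenses compared above (not the dissertation's own code): CTT item difficulty is simply the proportion of correct responses, while Rasch difficulties come from a fitted measurement model.

    # 'items' is an assumed persons x items matrix of dichotomous (0/1) scores.
    ctt_difficulty <- colMeans(items, na.rm = TRUE)  # CTT: proportion correct
    library(eRm)
    rasch_fit <- RM(items)                   # dichotomous Rasch model
    rasch_difficulty <- -rasch_fit$betapar   # betapar are easiness parameters; negate
    plot(ctt_difficulty, rasch_difficulty)   # compare the two difficulty orderings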
CHAPTER 6: PRONUNCIATION PROFILES

In this chapter I focus on learner pronunciation profiles, which relate to the explanation inference of the KPD's proposed validity argument. I present cluster analyses of learners' production and perception scores for Korean phonemes, followed by descriptions of the pronunciation strengths and weaknesses of learner clusters. Then, I present descriptive statistics for within-cluster learner L1 backgrounds and oral proficiency. Finally, I discuss these results in relation to the relevant research question.

Research Question

The primary research question I address in this chapter is:
• RQ3: Do scores indicate distinct test-taker profiles in terms of phoneme production and perception abilities?

This question bears on the explanatory power of KPD scores. While there were general trends in phoneme difficulties in perception and production (see Chapter 5, Measurement), individual learners exhibited variation in their phoneme accuracy scores. Given the influence of learner L1 (and other known languages), proficiency, exposure to Korean, and phonological aptitude on phonological development in an L2, it is unlikely that this variation simply reflects a single-path, deterministic range of L2 Korean phonological ability. Along these lines, one would also expect to see some commonalities emerge across subsets of learners, i.e., profile groupings. The emergence of several such shared profiles would offer some positive evidence that the KPD is sensitive to distinct, and meaningful, differences in pronunciation difficulties. Additionally, the identification of test-taker profiles has implications for the utilization and overall usefulness of the KPD. Namely, if nearly all learners with pronunciation difficulties had similar profiles, or if all learners from shared L1 backgrounds had nearly identical KPD score profiles, then it would make little sense to use the KPD at all: To guide instruction or to raise awareness of a learner's pronunciation difficulties, simply knowing that a learner is struggling with pronunciation, or knowing the learner's L1 (which would predict certain pronunciation difficulties), or, better yet, knowing both, would be more than sufficient. A diagnostic test such as the KPD would not be needed.

Analysis Details

The primary analysis I used to investigate learner profiles was cluster analysis (Hastie, Tibshirani, & Friedman, 2009; Kassambara, 2017; King, 2015; Staples & Biber, 2015). In many respects, cluster analysis can be viewed as a counterpart to factor analysis (especially exploratory factor analysis and principal components analysis). Factor analysis groups variables (or items) into factors (that is, groups of variables or items that share an underlying construct), while cluster analysis sorts people (or other objects of interest) into clusters (that is, groups of people who share similar characteristics). In a data matrix where variables are columns and people are rows, factor analysis combines similar columns while cluster analysis groups similar rows. Although the consideration of individual profiles is crucial in DLA, for the purposes of broadly considering the diversity of profiles that might emerge in KPD results, a means of finding and describing relatively common profiles is useful. Cluster analysis "provides a bottom-up way to identify new groups that are better defined with respect to target variables" (Staples & Biber, 2015, p. 243).
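The rows-versus-columns contrast can be seen in a toy example: applied to the same learner-by-score matrix, principal components analysis summarizes the columns, while k-means groups the rows (illustrative code only).

    X <- scale(matrix(rnorm(2000), nrow = 200))  # 200 simulated learners x 10 scores
    prcomp(X)                        # combines similar columns (variables)
    kmeans(X, centers = 3)$cluster   # assigns each row (learner) to a cluster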
My intent with the analysis was to identify groups of individual learners who shared similar diagnostic profiles. This also allowed me to consider the differences in pronunciation difficulties between these groups and in their backgrounds. In this sense, I did not seek to make claims about theoretically-motivated, generalizable profiles of pronunciation difficulties; rather, I simply aimed to describe profiles that emerged within the study's sample. Ginther and Yan (2018) innovatively applied cluster analysis to language testing data for the purpose of enriched score interpretation and decision-making. By considering the TOEFL subscores of Chinese international students at an American university, Ginther and Yan found four distinct score profiles, each of which fared differently in terms of first-year academic performance. Of note here is that their cluster analysis was able to meaningfully distinguish shared profiles among language learners of the same L1 background. Like Ginther and Yan (2018), my interest in identifying groups of learners lies in the interpretation and use of their test scores. I wished to consider how the groupings of learners, including those from similar backgrounds, pointed to different profiles that would lead to different instructional foci. Due to the high dimensionality of the dataset (28 phoneme parcels for each modality), the sample size in the present study (n = 198) would be considered relatively small for cluster analysis. To deal with this limitation, I adopted two strategies for dividing and paring down the data. First, I elected to run separate cluster analyses for production and perception. Second, I excluded phoneme parcels that exhibited little variation across the entire sample; details on which phonemes follow.

Cluster Analysis

Cluster analysis is used to classify i objects (in this case, test-takers) into groups based on (a) similarity within groups and (b) dissimilarity between groups across a set of j variables. If all objects are highly similar with respect to a particular variable or variables, inclusion of those variables in the analysis adds little information for classification and may inflate the level of within-groups similarity. Thus, I elected to remove the scores for several phonemes from the cluster analysis of production phonemes: ㅁ, ㅎ, ㅏ, ㅣ, ㅗ, ㅔ (/m, h, ɑ, i, o, ɛ/). These phonemes all had mean accuracy ratings > 90%, SDs ≤ 5%, and minimum accuracy ≥ 75% (i.e., no test-taker was flagged for a pronunciation weakness for these phonemes). This left 22 production phoneme parcel scores as variables in the cluster analyses. For the cluster analysis of perception phonemes, I removed several phoneme scores (ㅎ, ㅁ, ㅏ, ㅣ, ㅔ; /h, m, ɑ, i, ɛ/) which had mean accuracy ratings > 90% and SDs ≤ 10%, indicating that most test-takers had similarly high scores and that inclusion of these phonemes in a cluster analysis would have relatively little benefit. Carrying out the cluster analyses from a strictly descriptive perspective, that is, with no theory on the number of clusters, I used three techniques to evaluate the most appropriate number of clusters in the data; all three are described in Chapter 14 of Hastie, Tibshirani, and Friedman (2009). I used R and support functions from the factoextra package (version 1.0.5, Kassambara & Mundt, 2017) for all cluster analyses. First, I conducted a hierarchical cluster analysis (HCA) using Ward's D2 criterion (which squares the input Euclidean distance matrix; Murtagh & Legendre, 2014).
HCA differs from k-means CA primarily in that it starts from the bottom, with each individual as its own cluster, successively joined to other highly similar clusters. HCA yields a graphical representation of the hierarchy of similarities called a dendrogram. By examining forks in the dendrogram, and the distances between forks, it is possible to determine a likely number of clusters in the data. The remaining techniques involve the computation of a set of k-means cluster analyses, usually from 2 to 10 or 15 clusters. The first of these techniques is known as the elbow method. This procedure involves examining a plot of total within-cluster variances for the set of k-means cluster solutions. Where the plot bends (i.e., where an elbow is visible) and levels off is considered a good indicator of a suitable number of clusters, as adding additional clusters only minimally reduces the total amount of within-clusters variation. The final technique I used is called the Gap statistic. This statistic involves the comparison of actual data against a set of uniformly distributed data. The total within-cluster variances for the actual and uniform datasets are plotted for a set of k-means clusterings, much like the elbow method. Where the curve for the actual data deviates furthest from the uniform data is where clustering is likely to be most meaningful. Based on these techniques, I arrived at a suitable number of clusters k to include in my final k-means cluster analyses.

Data Standardization

K-means cluster analysis depends on the distance, usually Euclidean, between observations. As such, the presence of variables that differ considerably in scale leads to cluster assignments that poorly represent patterns in the data: large-scale variables effectively drown out small-scale variables. To resolve this dilemma, it is common practice to standardize the scales of all variables entered in a cluster analysis (Steinley, 2004). Two commonly used methods for scale standardization are z-scores (i.e., subtracting the mean and dividing by the standard deviation) and min-max scaling (i.e., converting variable scores to percentages by dividing values by the maximum possible score). Conveniently, I had already converted the observed phoneme parcel scores for the KPD to a percentage scale to achieve tau-equivalence and easy interpretability, and I used these percentage scores in the present cluster analyses.

Results

In the following subsections, I present the results of cluster analyses on learners' production and perception KPD phoneme scores. This is followed by comparisons of production and perception clusters in terms of learner L1 and overall Korean oral proficiency.

Production Profiles

In this subsection, I present results pertaining to the identification of production parcel clusters and the description of the final solution clusters.

Determining the Number of Clusters. As an initial step, I ran a hierarchical cluster analysis, the results of which are shown as a dendrogram in Figure 6.1. Based on the dendrogram, which has color coding for five clusters, four or five clusters seemed likely; the yellow and green clusters in the middle of the plot are rather small compared to the others, and the last step going from four to five clusters has a small height. The average within-cluster variation (sum of squares) of k-means cluster analyses for 1 through 10 clusters is shown in Figure 6.2. In this plot, there is a fairly distinct elbow at k = 4 clusters.
After this point, reductions of within-cluster variance are fairly small. Figure 6.3 plots the gap statistic for the k = 1 through k = 10 cluster solutions. Here, the 4-cluster solution appeared to be the point at which within-cluster similarity for the production parcel data became markedly different from the within-cluster similarity of a simulated uniform dataset.

Figure 6.1. HCA dendrogram depicting suggested clustering of test-takers according to production parcel scores.

Figure 6.2. Plot of within-cluster sum of squares for k = 1..10 clusters based on production parcel scores.

Figure 6.3. Gap statistic plot for k = 1..10 clusters based on production parcel scores. The vertical dashed line indicates where the number of clusters is optimal.

All in all, a 4-cluster k-means solution appeared to be sufficiently well-supported. The 4-cluster solution resulted in 19 test-takers in Cluster 1, 73 in Cluster 2, 76 in Cluster 3, and 30 in Cluster 4. Figure 6.4 plots the four clusters in two dimensions based on a principal components analysis of the production parcel data; each test-taker's scores on the first (Dim1) and second (Dim2) components are used as Cartesian coordinates. While the four clusters are not entirely distinct in this limited two-dimensional representation, some differences are visible.

Figure 6.4. Plot of clusters along the first two principal components of the production parcel data.

Cluster Descriptions. For each cluster, I computed mean accuracy for each production parcel as well as the proportion of diagnostic flags (based on a < 75% criterion). These values are visually represented in Figure 6.5; detailed numeric values are in Table 6.1. In the Figure 6.5 heatmaps, red cells indicate very low accuracy and a high proportion of diagnostic flags, while green cells indicate high accuracy and a low proportion of diagnostic flags, with yellow representing middling accuracy and a split proportion of diagnostic flags. Based on the mean accuracy rates and proportion of diagnostic flags, the four clusters can be summarized as follows in terms of production difficulties:
• Production Cluster 1: Difficulty with aspirated consonants and with tensed fricative and affricate consonants.
• Production Cluster 2: Limited difficulties; difficulty with the tensed affricate and fricative consonants (and, to a lesser extent, /k*/) and /ʌ/.
• Production Cluster 3: Few to no difficulties.
• Production Cluster 4: Major difficulties with tensed consonants. Some difficulty distinguishing /t, tʰ, t*/, some difficulty with /ŋ, ʌ, j/.

Figure 6.5. Heatmaps of phoneme production mean accuracy (A) and diagnostic flag proportion (B) by cluster.
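A minimal sketch of the production-side workflow just described, using the factoextra support functions cited earlier ('prod_pct', an assumed name, holds the 22 retained parcel percentage scores):

    library(factoextra)
    # 1. HCA with Ward's D2 criterion; inspect the dendrogram (cf. Figure 6.1):
    hc <- hclust(dist(prod_pct), method = "ward.D2")
    plot(hc)
    # 2. Elbow plot of within-cluster sums of squares (cf. Figure 6.2):
    fviz_nbclust(prod_pct, kmeans, method = "wss")
    # 3. Gap statistic against uniform reference data (cf. Figure 6.3):
    fviz_nbclust(prod_pct, kmeans, method = "gap_stat", nboot = 100)
    # Final solution and PCA-based display (cf. Figure 6.4):
    set.seed(1)  # k-means depends on random starting values
    km <- kmeans(prod_pct, centers = 4, nstart = 25)
    fviz_cluster(km, data = prod_pct)
    # Cluster summaries underlying Figure 6.5 and Table 6.1:
    aggregate(prod_pct, by = list(cluster = km$cluster), FUN = mean)       # mean accuracy
    aggregate(prod_pct < 75, by = list(cluster = km$cluster), FUN = mean)  # flag proportion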
Table 6.1
Phoneme Production Mean Accuracy and Diagnostic Flag Proportion by Cluster
                  C1              C2              C3              C4
Phoneme      Acc.%  Flag%    Acc.%  Flag%    Acc.%  Flag%    Acc.%  Flag%
ㄱ /k/         96      0       91      1       96      0       90      7
ㅋ /kʰ/        64     63       95      4       95      4       89     13
ㄲ /k*/        75     32       68     42       91      3       48     77
ㄷ /t/         89      5       82     27       92      4       80     37
ㅌ /tʰ/        49     84       92      3       93      7       78     37
ㄸ /t*/        71     32       80     16       93      1       32     93
ㅂ /p/         97      5       95      5       97      5       90     20
ㅍ /pʰ/        59     68       92      4       95      4       89     10
ㅃ /p*/        68     42       86      7       95      3       27     97
ㅈ /ʨ/         97      5       95      1       99      0       97      0
ㅊ /ʨʰ/        57     53       91     10       96      0       81     20
ㅉ /ʨ*/        61     58       54     66       86      4       54     63
ㅅ /s/         98      0       96      3       98      0       97      0
ㅆ /s*/        72     68       75     53       90     16       78     47
ㅎ /h/        100      0      100      0      100      0       98      3
ㄹ /l/         97      0       92      7       97      0       98      0
ㅁ /m/        100      0       98      0      100      0       99      0
ㄴ /n/         93      5       92      7       96      0       94      7
ㅇ /ŋ/         94      5       92      8       98      1       86     23
ㅏ /ɑ/        100      0       99      0      100      0      100      0
ㅣ /i/         99      0       99      0      100      0       99      0
ㅔ /ɛ/         96      0       97      0       98      0       99      0
ㅓ /ʌ/         87     16       80     27       94      5       79     27
ㅗ /o/         98      0       98      0       99      0       98      0
ㅜ /u/        100      0       98      0       97      1       96      0
ㅡ /ɯ/         97      0       95      3       96      1       93      0
/w/            91      5       92      4       95      5       95      3
/j/            88     11       87     11       93      8       85     17

Perception Profiles

As in the previous section, in the following results I detail how I identified a clustering solution and describe the phoneme perception clusters that emerged.

Determining the Number of Clusters. To start, I ran a hierarchical cluster analysis (Figure 6.6). Based on the dendrogram, which has color coding for four clusters, three to five clusters seemed likely; a fifth cluster would have split the purple cluster (towards the left-hand side of Figure 6.6), while a three-cluster solution would have merged the red and green clusters on the left. For the perception parcel scores, the location of the elbow in a plot of within-cluster variances for the k = 1 through k = 10 cluster solutions (Figure 6.7) was not so clear. The most acute angle appeared to be centered on a 2-cluster solution, but the following 3-, 4-, and 5-cluster solutions also appeared to offer non-trivial reductions in within-cluster variability. More than five clusters seemed unnecessary. Finally, the gap statistic plot (Figure 6.8) suggested that k = 4 clusters would be optimal.

Figure 6.6. HCA dendrogram depicting suggested clustering of test-takers according to perception parcel scores.

Figure 6.7. Plot of within-cluster sum of squares for k = 1..10 clusters based on perception parcel scores.

Figure 6.8. Gap statistic plot for k = 1..10 clusters based on perception parcel scores. The vertical dashed line indicates where the number of clusters is optimal.

Ultimately, I settled on a 4-cluster solution, based both on the evidence considered thus far and on the descriptive utility of the emergent clusters (see the next subsection for details). The 4-cluster solution resulted in 42 test-takers in Cluster 1, 46 in Cluster 2, 74 in Cluster 3, and 36 in Cluster 4. Figure 6.9 plots the four clusters in two dimensions based on a principal components analysis of the perception parcel data; each test-taker's scores on the first (Dim1) and second (Dim2) components are used as Cartesian coordinates. While the four clusters did overlap somewhat in this two-dimensional representation, some clear distinctions between pairs of clusters were quite apparent. For example, there is no overlap between Clusters 3 and 4.

Figure 6.9. Plot of clusters along the first two principal components of the perception parcel data.

Cluster Descriptions. For each cluster, I computed the mean accuracy for each perception parcel and estimated the proportion of diagnostic flags (based on a < 75% criterion).
These values are visually represented in Figure 6.10 (numeric values are available in Table 6.2). Based on the mean accuracy rates and the proportion of diagnostic flags, the four clusters can be summarized as follows in terms of phoneme perception difficulties:
• Perception Cluster 1: Difficulties with /p/ and /s, s*/, some difficulty with back vowels, especially /u/.
• Perception Cluster 2: Moderate difficulties with many stop consonants and fricatives, difficulty with the /s, s*/ and /o-ʌ/ distinctions.
• Perception Cluster 3: Minimal difficulties outside of /u/ and /s*/.
• Perception Cluster 4: Considerable difficulties with most consonants, back vowels, and the glide /j/.

Figure 6.10. Heatmaps of phoneme perception mean accuracy (A) and diagnostic flag proportion (B) by cluster.

Table 6.2
Phoneme Perception Mean Accuracy and Diagnostic Flag Proportion by Cluster
                  C1              C2              C3              C4
Phoneme      Acc.%  Flag%    Acc.%  Flag%    Acc.%  Flag%    Acc.%  Flag%
ㄱ /k/         77     38       78     50       93      5       75     47
ㅋ /kʰ/        70     24       63     48       88      4       47     75
ㄲ /k*/        75     12       67     35       89      4       52     67
ㄷ /t/         85     21       93      9       97      4       85     28
ㅌ /tʰ/        72     29       70     28       86      5       62     47
ㄸ /t*/        82     14       74     22       97      1       47     78
ㅂ /p/         58     81       72     52       83     24       63     69
ㅍ /pʰ/        70     29       65     43       94      0       56     69
ㅃ /p*/        83     19       68     37       88      5       43     86
ㅈ /ʨ/         66     36       62     48       96      1       47     72
ㅊ /ʨʰ/        73     19       71     28       93      0       52     75
ㅉ /ʨ*/        68     31       58     65       86     11       49     75
ㅅ /s/         59     79       65     61       73     45       62     81
ㅆ /s*/        53     90       57     87       61     81       50     92
ㅎ /h/         97      0       96      0       99      0       95      0
ㄹ /l/         86     14       92      2       94      1       90     11
ㅁ /m/         94      2       97      2       98      0       93      8
ㄴ /n/         84     24       95      4       96      1       88     17
ㅇ /ŋ/         82     14       86      9       92      5       67     36
ㅏ /ɑ/         98      7       98      7       99      4       93     14
ㅣ /i/         95     14       96     11       94     16       94     17
ㅔ /ɛ/         87     33       94     17       98      7       87     39
ㅓ /ʌ/         78     60       70     74       90     28       64     83
ㅗ /o/         75     71       75     65       85     45       69     81
ㅜ /u/         47    100       91     26       75     61       54     92
ㅡ /ɯ/         86     38       95     15       99      4       69     61
/w/            88      5       92      0       95      1       82      8
/j/            75     45       80     33       87     12       72     61

Profiles, L1, and Proficiency

I considered two relevant background variables, L1 (dominant language) and oral proficiency (as measured by the EIT), alongside production and perception cluster membership. While there were two dozen L1s represented in the sample, it was only meaningful to examine the distribution of L1s across clusters for languages with several speakers; I selected 10 speakers as a cutoff for inclusion in these analyses. Table 6.3 contains information on the L1 composition of the production clusters. Keeping in mind that Cluster 3 was indicative of few pronunciation difficulties, it is interesting to examine where among the remaining clusters learners from each L1 subgroup were concentrated. L1 Chinese (Mandarin) speakers, the most numerous L1 subgroup, were absent from Cluster 1 and mostly split across Clusters 2 and 3, with a handful in Cluster 4. English speakers fell primarily into Cluster 3, but several were found in Clusters 2 and 4. This relatively even distribution across clusters was found for the Japanese and Spanish speakers as well. Russian speakers, like the Chinese speakers, were absent from Cluster 1, but nearly half of all Russian speakers fell into Cluster 4.

Table 6.3
L1 Composition of Phoneme Production Clusters
Cluster   Chinese    English   Japanese   Russian   Spanish   Others     Total
C1        0 (0%)     2 (11%)   4 (31%)    0 (0%)    3 (27%)   10 (21%)    19
C2        48 (55%)   4 (21%)   2 (15%)    6 (32%)   1 (9%)    12 (25%)    73
C3        35 (40%)   8 (42%)   5 (38%)    4 (21%)   4 (36%)   20 (42%)    76
C4        5 (6%)     5 (26%)   2 (15%)    9 (47%)   3 (27%)   6 (13%)     30
Total     88         19        13         19        11        48         198
Note. Percentages are based on L1 subgroups (columns). Bold indicates highest proportion of an L1 subgroup, italics indicate second highest proportion.
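A sketch of how the L1-by-cluster cross-tabulations and the cluster proficiency comparisons reported in this section can be produced (again with assumed object names):

    # 'l1' is an assumed factor of L1 backgrounds; 'eit' holds EIT scores.
    tab <- table(Cluster = km$cluster, L1 = l1)
    round(100 * prop.table(tab, margin = 2))  # percentages within each L1 subgroup
    tapply(eit, km$cluster, mean)             # cluster proficiency means
    summary(aov(eit ~ factor(km$cluster)))    # one-way ANOVA, df = (k - 1, N - k)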
For oral proficiency (Table 6.4, based on EIT scores), participants in Cluster 3 had the highest mean oral proficiency. Examination of the 95% confidence intervals shows that Cluster 3 had significantly higher oral proficiency than Clusters 1 and 4, but Clusters 2 and 3 overlapped considerably, as did Clusters 1 and 4. However, a one-way analysis of variance (ANOVA) based on production cluster membership did not return a significant result (F(1, 196) = 2.39, p = 0.124), indicating that on the whole the null hypothesis (that clusters did not vary in proficiency) could not be rejected. Interestingly, oral proficiency standard deviations were similarly large across clusters, and each cluster featured at least one member with rather low or considerably high oral proficiency.

Table 6.4
Oral Proficiency of Phoneme Production Clusters
Cluster   Mean    SD      95% CI            Min   Max
C1        46.68   25.93   [34.18, 59.18]     13    97
C2        68.01   21.36   [63.03, 86.28]     21   106
C3        81.14   22.50   [76.00, 86.23]     20   116
C4        54.26   23.98   [45.31, 63.22]      9   109

Table 6.5 contains the L1 composition of each perception cluster. As with the production clusters, Cluster 3 membership was indicative of few difficulties with Korean phonemes. Chinese speakers were primarily concentrated in two of the perception clusters (3 and 1), with a handful being grouped in each of the other two clusters. English speakers were concentrated in Clusters 2 and 3, with a few in Cluster 4, while Japanese speakers mostly fell into Clusters 4 and 1. Aside from one test-taker in Cluster 1, Russian speakers were evenly split across Clusters 2 and 4. Spanish speakers were also concentrated in these two clusters. No Russian or Spanish speakers were found in Cluster 3.

Table 6.5
L1 Composition of Phoneme Perception Clusters
Cluster   Chinese    English   Japanese   Russian   Spanish   Others     Total
C1        26 (30%)   1 (5%)    4 (31%)    1 (5%)    2 (18%)   8 (17%)     42
C2        7 (8%)     7 (37%)   1 (8%)     9 (47%)   4 (36%)   18 (36%)    46
C3        50 (57%)   7 (37%)   2 (15%)    0 (0%)    0 (0%)    15 (31%)    74
C4        5 (6%)     4 (21%)   6 (46%)    9 (47%)   5 (46%)   7 (15%)     36
Total     88         19        13         19        11        48         198
Note. Percentages are based on L1 subgroups (columns). Bold indicates highest proportion of an L1 subgroup, italics indicate second highest proportion.

For oral proficiency (Table 6.6), Cluster 3 had a visibly higher mean proficiency compared to all other clusters: An examination of the 95% confidence intervals suggests that this difference is statistically significant. However, there did not appear to be any reliable differences among the means of the other clusters, as indicated by their highly overlapping confidence intervals. An ANOVA did not return a statistically significant result for oral proficiency differences based on phoneme perception cluster membership (F(1, 196) = 1.13, p = 0.289), which did not permit me to reject the null hypothesis that clusters did not differ in oral proficiency. Interestingly, all four clusters had at least one member with considerably high oral proficiency, and like the production clusters, standard deviations were rather large.

Table 6.6
Oral Proficiency of Phoneme Perception Clusters
Cluster   Mean    SD      95% CI            Min   Max
C1        62.48   24.52   [54.83, 70.12]     19   115
C2        60.89   21.45   [54.52, 67.26]     20   116
C3        84.66   18.41   [80.40, 88.93]     33   115
C4        54.36   27.30   [45.13, 63.60]      9   109

It was also informative to consider combinations of production and perception cluster membership (Table 6.7).
Members of Production Cluster 1, who had notable difficulties producing aspirated stops, most commonly fell into Perception Cluster 4, indicating that they also had difficulty perceiving aspirated stops (alongside difficulties with many other consonants). A smaller number fell into Perception Cluster 2, which was also characterized by some difficulty with aspirated stops, though to a lesser degree. Production Cluster 2, which had generally intelligible pronunciation of Korean sounds but moderate difficulties with /s*, ʨ*/ and some difficulty with /k*, ʌ/, fell primarily into Perception Clusters 3 (minimal difficulties outside of /s*, u/) and 1 (considerable difficulties with /s, s*/ and with distinguishing among /ʌ, o, u/), which seemed to align well. Interestingly, 11 members of Production Cluster 2 fell into Perception Cluster 4, which was characterized by a wide range of perception difficulties, including some that were not salient problems for production. Production Cluster 3, which had good control of nearly all phonemes, mostly fell into Perception Cluster 3, as one might expect, though some individuals fell into other perception clusters. Finally, Production Cluster 4, marked by the most severe and broad pronunciation difficulties, had no one fall into Perception Cluster 3; instead, Production Cluster 4 members were concentrated in Perception Clusters 2 and 4.

Table 6.7
Cross-Tabs of Production and Perception Cluster Membership
                        Production
Perception     C1     C2     C3     C4
C1              3     24     13      2
C2              6     13     14     13
C3              1     25     48      0
C4              9     11      1     15

Discussion

In this chapter, I used cluster analysis to identify groups of learners with similar production and perception profiles. Separate analyses on phoneme parcel scores for production and perception each identified four reasonably well-defined clusters. In each set of clusters, the largest cluster consisted of learners with few to no difficulties with Korean sounds. The remaining clusters were characterized by learners with varying weaknesses, both in terms of targets and degree of difficulty. In broad terms, the answer to Research Question 3 appears to be "yes": The KPD was able to detect substantive differences among test-takers' pronunciation profiles, spanning the production and perception of Korean sounds, including the identification of profiles that were common to subgroups of learners. Of particular interest was Production Cluster 1. The individuals within it had greater difficulty producing aspirated consonants than some tensed consonants, bucking the general trend in phoneme production difficulty found in Chapter 5 (i.e., that tensed consonants were the most difficult phonemes to produce). This cluster complicates potential interpretations of the clusters as representing developmental stages. It is possible to trace a path of development from Cluster 4 to 2 to 3, where less difficult phonemes (Chapter 5) are mastered first, followed by more difficult phonemes. However, Cluster 1 does not fit neatly into any similar progression, as its members showed better mastery of the generally difficult tensed consonants compared to the moderate-difficulty aspirated consonants. Interestingly, Cluster 1 and Cluster 4 had highly similar oral proficiency levels. Production Cluster 4 was notable for its L1 composition. Dominated by Russian speakers (roughly 1/3 of the cluster, and nearly half of all L1 Russian test-takers), it also featured fair proportions of English, Spanish, and other L1 speakers.
There were even a small handful of Chinese L1 speakers in the cluster; in other words, those five L1 Chinese speakers of Korean had more in common with the members of Cluster 4 than they did with the majority of their L1-background peers, who fell into Cluster 2. Perception Cluster 4 is another interesting cluster. Like Production Cluster 4, it breaks up a clean interpretation of clusters as developmental stages or as solely proficiency-related. Perception Cluster 4's acute difficulties with /p/ and the location of the most acute back-vowel difficulty (i.e., /u/ instead of /ʌ/) are distinguishing features that prevent interpretation of a neat Perception Cluster 4 → 2 → 1 → 3 progression or continuum. Roughly equal proportions of L1 Chinese and Japanese speakers ended up in Cluster 1, along with smaller proportions of Spanish, other-L1, English, and Russian speakers. In terms of overall oral proficiency, Perception Clusters 1, 2, and 4 were extremely similar, further dampening any clear developmental interpretation of cluster membership, though it can be said that most learners with advanced oral proficiency tended to have few phoneme perception difficulties (Cluster 3). Cross-referencing production and perception cluster memberships revealed further differences among learners. Members of Production Cluster 2, who struggled with just a few tensed consonants and had some difficulty with /ʌ/, were spread rather evenly across the perception clusters. This provides some support for the inclusion and instructional utilization of phoneme perception scores. For instance, those test-takers in Production Cluster 2 who fell into Perception Cluster 3 (i.e., the cluster with little to no phoneme perception difficulty) would be unlikely to benefit much from perception practice on the /ʨ*, k*/ (and, to a lesser extent, /ʌ/) targets. However, those who fell into Perception Cluster 4 would almost certainly benefit from such perception practice.
Furthermore, in both production and perception, very high and very low oral proficiency members could be found in all clusters. This too suggests that the KPD is potentially useful for addressing a wide array of learners, including beginners who are having difficulty tuning in to Korean phonology and advanced learners who perhaps have a fossilized interlanguage phonology that persists to cause them difficulties in communication. 192 This utility potentially holds considerable value for pedagogy. Teacher training manuals and textbooks for pronunciation often rely on L1-based recommendations for addressing specific learner difficulties (e.g., Kwon, 2017, for an example of a L2 Korean teacher training text and Choi, Kim, Park, Jin, & Park, 2009a, 2009b for Korean pronunciation textbooks). While such recommendations may serve as a broadly useful starting point, the present findings suggest that they will fall short for addressing the needs of some learners, not to mention wasting the time of some others. For example, the Choi et al. (2009a, 2009b) textbooks provide recommendations of which sounds (and corresponding textbook units) to focus on for twelve different L1 background (e.g., English, Japanese, Chinese, Arabic). L1 Chinese speakers are advised to focus on the /s, s*/ and /ʨ, ʨ*, ʨh/ distinctions, which indeed proved challenging for many L1 Chinese speakers in this study, but the book also recommends focus on /p, p*, ph/ and /t, t*, th/, which were not a common problem for the numerous Chinese speakers in Production Cluster 2. That latter advice would be better suited to the members of Production Cluster 1. Some of the L1-based advice for learning and teaching may also be at odds with the Intelligibility Principle and less relevant for instruction. Kwon (2017, p. 124) noted that English speakers often substitute the Korean mid vowels /ɛ/ and /o/ with English diphthongs [eɪ] and [oʊ], respectively, and suggested that teachers make L1 English learners aware of these mistakes. However, neither of these sounds were difficult to produce intelligibly for any cluster, much less clusters with larger concentrations of English speakers. The intelligibility-focused pronunciation diagnosis of the KPD may be able to point teachers and learners to more productive uses of limited time and energy. In the specific context of this study, it is worth pointing out that many of the participants attended the same intensive Korean program or the same graduate program in Korean as a 193 foreign language (focusing on either education or translation/interpretation), and some of them attended specific classes together. Having learners from different backgrounds and with different pronunciation needs is more than a hypothetical. Some important qualifications need to be made to the discussion of results thus far. I have so far pointed out where L1-based predictions of pronunciation difficulties fall short and where production and perception profiles have not lined up, each of which have potentially valuable instructional implications. However, I must clarify that the present results do not contradict or meaningfully call into question prevailing theory and findings in the fields of L2 pronunciation and speech learning. Indeed, there was considerable L1 patterning observable in the clusters: Of L1 Chinese speakers with notable pronunciation difficulties, virtually all of them fell into one production cluster (Production Cluster 2). 
Similarly, most of the adept articulators from Production Cluster 3 were also members of the highly-skilled Perception Cluster 3, in line with theoretical expectations (Best & Tyler, 2007; Flege, 1995), with relatively few falling into clusters characterized by substantial perception difficulties. But with almost any theory, especially complicated ones with many moving parts and difficult-to-observe processes, there are individuals who fall somewhat to the wayside of group-level predictions. DLA can map out ability profiles for those individuals in ways that basic theory-driven expectations might not be able to, and in turn provide relevant support that might otherwise be missing from standard instructional materials, approaches, or curricula.

CHAPTER 7: EXTERNAL RELATIONSHIPS

The relationships between KPD scores and external variables, which are relevant to the explanation and extrapolation inferences in the KPD's proposed validity argument, are the focus of this chapter. Relevant to the explanation inference, I examined the relationship between KPD results and general oral proficiency. Relevant to the extrapolation inference, I examined the relationships between KPD results, pronunciation performance in spontaneous speech, and learner self-assessments. I utilized correlations and descriptive statistics to examine these relationships. In the discussion that follows the presentation of the results, I consider findings in relation to the primary research questions listed below.

Research Questions

The primary research questions addressed by the results in this chapter are:
• RQ4: To what extent do KPD results show an expected relationship with Korean oral proficiency?
• RQ5: To what extent do results reflect difficulties test-takers show in spontaneous, meaning-focused speech?
• RQ6: To what extent do results reflect self-assessments of pronunciation ability and difficulties?

For RQ4, Korean oral proficiency is a product of language experience and instruction, two factors known to influence L2 phonological development (Piske et al., 2001). This premise is well-grounded in SLA theory and empirical research, which has shown generally positive associations between both the amount of instruction and language experience (e.g., length of residence in an L2 environment) and proficiency outcomes (Isbell, Winke, & Gass, 2018). Compared to self-reports of language experience or the amount of instruction, oral proficiency as measured by EIT scores is more directly comparable among subjects. In general, more proficient L2 speakers tend to have more intelligible and comprehensible L2 speech (Kang & Moran, 2014), and thus a small-to-moderate relationship between KPD results and overall oral proficiency is expected. Similarly, higher proficiency learners are expected to have higher accuracy in the production and perception of individual phonemes, and vice-versa. As segmental production and phoneme identification are but pieces of speaking and listening processes and proficiency, respectively, an exceedingly strong relationship cannot be reasonably expected. Furthermore, local fossilization (i.e., the cessation of development over a considerable period of TL exposure and use; Han, 2004) and plateaus in interlanguage phonology are well-attested phenomena in the L2 pronunciation literature, whereby generally high-proficiency speakers' productions are characterized by non-target-like and sometimes unintelligible articulation of L2 speech sounds (Derwing & Munro, 2007; Derwing et al., 2014).
The presence of such individuals in the present analyses, such as one participant with over 10 years of residence in Korea currently pursuing a doctoral degree (see Chapter 8), would limit the strength of any quantitative relationship between pronunciation ability and overall oral proficiency. For Research Question 5, KPD results, in terms of phonemes scored as difficult to produce or produced inaccurately, ought to be reflected in spontaneous, meaning-oriented oral production. If the KPD results do not reflect difficulties in meaning-oriented oral communication, it would be hard to argue that the test reflects learners' actual pronunciation weaknesses, and in turn that the test has any utility at all. Arguably, this is the most important piece of evidence in support of the extrapolation inference. However, it may also be the most difficult evidence to adequately capture, as collecting and analyzing spontaneous speech is subject to a host of challenges, such as collecting speech samples long and representative enough to facilitate rigorous and generalizable analyses of a learner's complete segmental inventory in production. As such, the analysis and results presented here must be treated as preliminary.
For the 21 transcriptions (originally transcribed by linguistically-trained NSs, and edited by me for consistency of conventions, see Chapter 4), I coded all deletion (removal of a phoneme) and substitution (replacement of a phoneme with a non-target sound) errors. Knowledge of the prompt and repeated careful listens of speech files helped me judge what a speaker intended to say, and in instances where I had difficulty or uncertainty in determining intended words I consulted a NS highly familiar with L2 Korean speech (the same NS Korean instructor who scored the KPD; see Chapter 4). I counted the total number of phonemes, excluding nonverbal sounds (e.g., 엄…, umm…) but including lexical word fillers (e.g., 뭐, what), repetitions (e.g., 그 그, that that), and false-starts/interrupted words (e.g., 살- 살아요, li- live). I also counted the total number of erroneous phonemes. For each error, I tallied which target phoneme was mispronounced. Self-Assessment I examined the relationship between KPD scores and self-assessments in three ways: computation of difference scores, correlations, and alignment with diagnostic flags. KPD parcel scores for each phoneme in production and perception were in the form of percentages (see Chapter 4 for details), and to facilitate the computation of difference scores, I converted the phoneme-level self-assessments to percentages. For each learner, I subtracted their self- 198 assessment (as a percentage) from their KPD parcel score to arrive at a difference score. Each learner’s mean difference score in production and perception was computed. For correlational analyses, I used Pearson correlations (no substantial differences were found when using Spearman correlations). I started with correlations among global measures of pronunciation ability: I computed correlations between global self-assessments (accentedness and comprehensibility), average phoneme-level self-assessments in each modality, and average KPD scores (i.e., average parcel scores across phonemes in each modality). To investigate learner agreement with Korean phonemes most in need of remediation, I examined the alignment of KPD diagnostic flags and phoneme-level self-assessments. As the reader might recall, diagnostic flags were assigned to phoneme parcels with < 75% accuracy, and I dichotomized the self-assessments using the same criterion (< 75%, i.e., a rating of 5 or less out of 7). Results The following subsections feature results of analyses on the relationships between KPD results and oral proficiency, pronunciation in spontaneous speech, and self-assessment. Relationship between KPD Results and Oral Proficiency Among the 198 field testing participants, the average EIT score was 68.92 out of 120 (SD = 25.38, median = 70, min = 9, max = 116). Figure 7.1 shows the distribution of total EIT scores. 199 Figure 7.1. Distribution of EIT scores. The correlation between EIT scores and average production phoneme parcel accuracy was r = .51 (95% CI [0.39, 0.60], p < .001), and the correlation between EIT scores and average receptive phoneme parcel accuracy was r = .56 (95% CI [0.45, 0.65], p < .001). The scatterplots in Figure 7.2 visually represent these relationships. As a convenience and reminder for readers, the correlation between average production parcel accuracy and average perception phoneme parcel accuracy originally reported in Chapter 5 was r = .74. 
Results

The following subsections feature the results of analyses on the relationships between KPD results and oral proficiency, pronunciation in spontaneous speech, and self-assessment.

Relationship between KPD Results and Oral Proficiency

Among the 198 field testing participants, the average EIT score was 68.92 out of 120 (SD = 25.38, median = 70, min = 9, max = 116). Figure 7.1 shows the distribution of total EIT scores.

Figure 7.1. Distribution of EIT scores.

The correlation between EIT scores and average production phoneme parcel accuracy was r = .51 (95% CI [0.39, 0.60], p < .001), and the correlation between EIT scores and average perception phoneme parcel accuracy was r = .56 (95% CI [0.45, 0.65], p < .001). The scatterplots in Figure 7.2 visually represent these relationships. As a convenience and reminder for readers, the correlation between average production parcel accuracy and average perception phoneme parcel accuracy originally reported in Chapter 5 was r = .74.

To further illustrate the relationship between KPD scores and oral proficiency, I divided the 198 learners into quantiles based on their EIT scores and then computed summary statistics for average production and perception phoneme parcel accuracy (Table 7.1). While mean production and perception phoneme accuracy increases across oral proficiency quantiles as expected, the differences among quantiles are not extremely large. The third and fourth oral proficiency quantiles differ very little in average production phoneme accuracy, with nearly identical means and standard deviations. Further, these two quantiles are not so different from the second quantile in terms of phoneme production. The progression of perception phoneme accuracy across quantiles is more clear-cut when examining means, yet at the same time there is greater intra-quantile variation in average perception accuracy.

Figure 7.2. Scatterplots of the relationship between EIT scores and (A) average production phoneme accuracy and (B) average perception phoneme accuracy.

Table 7.1
Average Production and Perception Phoneme Parcel Accuracy by Oral Proficiency Quantiles

                           | Production           | Perception
Quantile | EIT Mean (SD)   | Mean  SD  Min  Max   | Mean  SD  Min  Max
1        | 30.93 (10.66)   | 87    5   73   98    | 74    8   57   88
2        | 56.60 (5.31)    | 89    5   77   99    | 78    7   59   91
3        | 70.71 (3.71)    | 91    4   81   97    | 80    8   62   97
4        | 84.38 (4.67)    | 91    4   79   99    | 84    8   66   98
5        | 102.90 (6.17)   | 95    4   80   100   | 89    7   69   99
Note. All production and perception values are percentages.

In addition to the relationships between oral proficiency and overall phoneme production and perception accuracy, I considered the relationship between oral proficiency and individual phonemes. To this end, I computed correlations between production and perception accuracy and oral proficiency for each phoneme (Table 7.2). The average correlation between EIT scores and production phonemes was .18, with a minimum of -.05 and a maximum of .33. For phonemes that were generally easy to produce, such as /ɛ, u/ (see Chapter 4), correlations between production accuracy and oral proficiency were small, likely due to attenuation. On the other hand, tense and aspirated consonants had moderate correlations with oral proficiency. For perception phonemes, the average correlation was .28, with a minimum of .08 and a maximum of .45. While some of the easier-to-perceive vowels had smaller correlations, on the whole the correlations between each perception phoneme's accuracy scores and oral proficiency were moderate. In sum, the relationship between phoneme perception and oral proficiency was relatively stronger than the relationship between phoneme production and oral proficiency.

Table 7.2
Correlations between Phoneme Production, Perception, and Oral Proficiency

Phoneme   | Production-EIT (r) | Perception-EIT (r)
ㄱ /k/    | 0.19               | 0.27
ㅋ /kʰ/   | 0.22               | 0.41
ㄲ /k*/   | 0.27               | 0.38
ㄷ /t/    | 0.17               | 0.41
ㅌ /tʰ/   | 0.31               | 0.28
ㄸ /t*/   | 0.33               | 0.44
ㅂ /p/    | 0.17               | 0.33
ㅍ /pʰ/   | 0.29               | 0.40
ㅃ /p*/   | 0.31               | 0.34
ㅈ /ʨ/    | 0.22               | 0.44
ㅊ /ʨʰ/   | 0.29               | 0.45
ㅉ /ʨ*/   | 0.26               | 0.34
ㅅ /s/    | 0.14               | 0.18
ㅆ /s*/   | 0.28               | 0.09
ㅎ /h/    | 0.04               | 0.38
ㄹ /l/    | 0.15               | 0.30
ㅁ /m/    | 0.14               | 0.16
ㄴ /n/    | 0.17               | 0.15
ㅇ /ŋ/    | 0.22               | 0.19
ㅏ /ɑ/    | 0.11               | 0.08
ㅣ /i/    | 0.16               | 0.18
ㅔ /ɛ/    | -0.05              | 0.17
ㅓ /ʌ/    | 0.26               | 0.27
ㅗ /o/    | 0.11               | 0.21
ㅜ /u/    | -0.01              | 0.12
ㅡ /ɯ/    | 0.07               | 0.30
/w/       | -0.02              | 0.30
/j/       | 0.15               | 0.35
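Per-phoneme correlations of the kind reported in Table 7.2 (and, with confidence intervals, in Table 7.6 below) can be computed with a small helper. This sketch assumes hypothetical data and uses the standard Fisher z-transformation for the intervals.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical wide-format data: one row per learner, with an EIT score and
# one accuracy column (%) per phoneme parcel.
df = pd.read_csv("kpd_production_parcels.csv")

def r_with_ci(x, y, alpha=0.05):
    """Pearson r with a confidence interval via the Fisher z-transformation."""
    pair = pd.concat([x, y], axis=1).dropna()
    r, p = stats.pearsonr(pair.iloc[:, 0], pair.iloc[:, 1])
    n = len(pair)
    z, se = np.arctanh(r), 1 / np.sqrt(n - 3)
    crit = stats.norm.ppf(1 - alpha / 2)
    return r, p, (np.tanh(z - crit * se), np.tanh(z + crit * se)), n

for phoneme in [c for c in df.columns if c != "eit"]:
    r, p, (lo, hi), n = r_with_ci(df["eit"], df[phoneme])
    print(f"{phoneme}: r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}], n = {n}")
```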
As a means of visually exploring the relationship between oral proficiency and phoneme accuracy, I plotted phoneme accuracy means in production and perception for each quantile (Figure 7.3). This visualization illustrates how some of the previously discussed correlations vary. The phoneme /ɑ/, which had very small correlations with oral proficiency, showed nearly uniform accuracy in perception and production across oral proficiency quantiles. In contrast, the progression in accuracy for /t*/ in both production and perception is visually distinctive, as one might expect given the stronger correlations between accuracy and oral proficiency. For the most part, phonemes found to be more difficult according to the measurement analyses in Chapter 4 tended to demonstrate larger correlations between accuracy and oral proficiency and more visible upward progressions across oral proficiency quantiles; low-difficulty phonemes lacked such relationships with oral proficiency. Notable exceptions include /u/ and /s*/ in perception: These phonemes showed little relationship with oral proficiency yet were among the most difficult for learners to perceive accurately, implying that distinct perceptual category formation and/or the ability to discriminate these sounds from similar phonemes eludes even many highly proficient L2 speakers of Korean.

Figure 7.3. Mean production and perception phoneme accuracy across oral proficiency quantiles.

Relationship between KPD Results and Pronunciation in Spontaneous Speech

Examining the relationship between the KPD results and Independent Speaking production was challenging because roughly 1 minute (or less) of spontaneous speech is not guaranteed to elicit all 28 Korean phonemes, much less multiple instances of each phoneme. The most common phonemes of Korean (e.g., /ɑ, n, k/; Shin et al., 2013) were plentiful in the speech samples, but less common phonemes were minimally present or not present at all. For instance, it is entirely possible to respond to the Independent Speaking prompt without using the phoneme /ʨ*/, which according to Shin et al. (2013) makes up less than 1% of phonemes produced in typical Korean speech (unintentionally avoiding other phonemes is also distinctly possible). S113 did exactly this, which was unfortunate because she had a 0% production accuracy score on the KPD for that phoneme.

Thus, I focused this analysis on phonemes mispronounced during IS, and on the KPD production scores for those phonemes. For the subset of 21 analyzed IS samples, Table 7.3 summarizes the overlap between errors in learners' spontaneous speech samples and KPD scores. One observable trend is that learners with lower error rates in spontaneous speech tended to have higher KPD production parcel averages. Learners S005, S016, S105, S111, and S133 all had high average KPD production accuracy (> 90%) and low phonological error rates in speaking (≤ 3%). Meanwhile, learners with < 90% KPD accuracy tended to have higher error rates in speaking (> 5%; e.g., S001, S040, S054).
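The tallying logic behind the comparison in Table 7.3 can be sketched as follows, assuming hypothetical files containing the coded IS tokens and the KPD production parcel scores.

```python
import pandas as pd

# Hypothetical coding output: one row per phoneme token in a learner's IS
# sample, with a binary error code (deletion or substitution).
tokens = pd.read_csv("is_coded_tokens.csv")  # learner, phoneme, error (0/1)
kpd = pd.read_csv("kpd_production.csv")      # learner, phoneme, kpd_pct

# Overall phonological error rate per learner
rates = tokens.groupby("learner")["error"].agg(total="count", errors="sum")
rates["err_rate"] = rates["errors"] / rates["total"]

# Per-phoneme error tallies joined to KPD production parcel scores
tally = (tokens[tokens["error"] == 1]
         .groupby(["learner", "phoneme"]).size().rename("n_errors")
         .reset_index()
         .merge(kpd, on=["learner", "phoneme"], how="left"))
tally["flagged"] = tally["kpd_pct"] <= 75  # diagnostic flag criterion
print(tally.head())
```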
Table 7.3
Comparison of KPD Results and Independent Speaking Productions

Test Taker | KPD Avg. | IS Tot. Phon. | IS Err. Rate | Phonological Errors & KPD Score
S001 | 80% | 417 | 6% | /k/: 5 (86%), /ʨ/: 1 (50%*), /ʨʰ/: 1 (50%*), /ʨ*/: 1 (75%*), /s/: 4 (70%*), /h/: 1 (100%), /l/: 2 (75%*), /ɛ/: 2 (89%), /ʌ/: 1 (80%), /o/: 1 (100%), /ɯ/: 1 (50%*), /j/: 3 (78%)
S004 | 85% | 434 | 3% | /k/: 5 (79%), /tʰ/: 1 (100%), /t*/: 1 (50%*), /l/: 3 (100%), /n/: 2 (80%), /ŋ/: 1 (80%), /i/: 1 (100%), /o/: 1 (75%*)
S005 | 94% | 149 | 2% | /n/: 1 (100%), /ŋ/: 2 (90%)
S013 | 86% | 197 | 7% | /k/: 2 (86%), /t*/: 1 (50%*), /s/: 3 (90%), /l/: 1 (92%), /ŋ/: 1 (100%), /o/: 5 (83%)
S014 | 90% | 128 | 5% | /k*/: 1 (50%*), /t/: 1 (100%), /s/: 3 (100%), /o/: 1 (92%), /j/: 1 (89%)
S016 | 98% | 337 | 3% | /k/: 1 (100%), /k*/: 1 (100%), /p/: 1 (100%), /s/: 2 (100%), /s*/: 1 (100%), /n/: 2 (90%), /ɯ/: 1 (100%), /j/: 1 (89%)
S018 | 93% | 304 | 6% | /k/: 5 (100%), /t/: 2 (89%), /t*/: 1 (75%*), /p/: 1 (71%*), /ʨ/: 2 (100%), /s/: 4 (100%), /l/: 1 (92%)
S035 | 79% | 300 | 5% | /kʰ/: 1 (67%*), /tʰ/: 1 (60%*), /pʰ/: 1 (40%*), /s/: 5 (100%), /l/: 3 (83%), /n/: 1 (60%*), /ŋ/: 1 (50%*), /ʌ/: 2 (80%)
S040 | 84% | 131 | 7% | /s/: 6 (100%), /n/: 1 (80%), /ɛ/: 1 (100%), /ʌ/: 1 (60%*)
S048 | 88% | 488 | 6% | /k/: 1 (86%), /kʰ/: 2 (83%), /k*/: 2 (50%*), /t/: 2 (89%), /t*/: 1 (75%*), /p/: 1 (100%), /s/: 14 (100%), /l/: 4 (67%*), /n/: 1 (90%), /j/: 2 (89%)
S054 | 88% | 481 | 8% | /k/: 1 (86%), /k*/: 2 (75%*), /t/: 1 (67%*), /tʰ/: 1 (60%*), /t*/: 6 (50%*), /ʨ/: 2 (100%), /ʨ*/: 1 (100%), /s/: 1 (100%), /s*/: 1 (100%), /h/: 1 (100%), /n/: 1 (100%), /ŋ/: 11 (90%), /ɛ/: 4 (100%), /ʌ/: 1 (80%), /o/: 1 (100%), /u/: 3 (100%)
S074 | 84% | 362 | 5% | /kʰ/: 3 (50%*), /t/: 4 (78%), /t*/: 3 (75%*), /p/: 1 (100%), /pʰ/: 1 (40%*), /ʨ/: 1 (100%), /l/: 1 (100%), /m/: 1 (100%), /n/: 3 (90%)
S088 | 83% | 349 | 3% | /kʰ/: 1 (33%*), /t/: 1 (89%), /ʨ/: 1 (100%), /m/: 1 (100%), /n/: 5 (80%), /o/: 1 (100%)
S104 | 85% | 346 | 5% | /s/: 7 (100%), /l/: 7 (67%*), /n/: 2 (90%), /i/: 1 (100%)
S105 | 92% | 276 | 2% | /t*/: 2 (75%*), /s/: 4 (80%)
S111 | 92% | 372 | 2% | /t/: 2 (78%), /t*/: 1 (100%), /s/: 2 (90%), /j/: 1 (89%)
S113 | 77% | 181 | 4% | /s/: 5 (100%), /n/: 1 (100%), /ŋ/: 1 (100%)
S121 | 86% | 524 | 2% | /k/: 1 (79%), /t*/: 1 (50%*), /s/: 2 (90%), /n/: 2 (100%), /ɑ/: 1 (100%), /ʌ/: 2 (100%), /j/: 2 (89%)
S133 | 94% | 345 | 1% | /k/: 2 (100%), /k*/: 1 (75%*), /s/: 1 (100%), /s*/: 1 (71%*)
S139 | 91% | 507 | 5% | /k/: 1 (93%), /k*/: 2 (50%*), /t/: 2 (67%*), /t*/: 1 (75%*), /ʨ/: 1 (88%), /l/: 4 (92%), /n/: 3 (100%), /ŋ/: 2 (70%*), /ɑ/: 3 (100%), /ɛ/: 1 (100%), /ʌ/: 1 (80%), /o/: 2 (100%), /u/: 1 (100%), /w/: 1 (90%), /j/: 1 (89%)
S156 | 93% | 139 | 5% | /k/: 2 (93%), /ʨ/: 1 (100%), /s*/: 2 (86%), /j/: 2 (100%)
Note. KPD Avg. = average production phoneme accuracy. IS Tot. Phon. = total phonemes uttered in the independent speaking task. IS Err. Rate = phonological error rate. Each error entry shows the number of errors for a phoneme followed by that phoneme's KPD production parcel accuracy in parentheses; an asterisk (*) marks accuracy at or below the 75% diagnostic flag criterion.
Looking at errors in spontaneous speech in greater detail, the rightmost column of Table 7.3 lists all phonemes that were erroneously produced, the number of times each was produced erroneously, and the KPD production parcel accuracy score for that phoneme. I have marked with an asterisk phonemes where the KPD accuracy score was at or below the diagnostic flag criterion (75%). In many cases, phonemes that would be interpreted as a substantial difficulty according to the KPD diagnostic flag did show up as problematic in spontaneous speech. Consider learner S133. This learner had relatively high production accuracy overall, according to both the KPD and the phonological error rate. The errors this learner did produce in speech, though, aligned in part with phonemes identified as difficult on the KPD: /k*, s*/. Similarly, learner S074 had several points of close alignment with KPD results (/kʰ, t*, pʰ/). Of course, not every phoneme erroneously produced in the speech samples aligned with KPD results. This could be due to complex phonological adjustments in connected speech that were not captured by the KPD, poorly formed phonological representations of words used in the response, different criteria used in KPD scoring and phonemic transcription, or any number of other potential factors.

One phoneme deserves some additional attention and explanation: /s/. This phoneme occurred extremely frequently in the speech samples, in large part due to its connection with the prompt: The word for city, 도시 (/tosi/, [t̥oɕi]), which was central to the topic, contains an /s/. With so many productions of /s/, more errors could be expected to occur. Moreover, this /s/ is realized as a marked allophone in this word context, [ɕ], which occurs only when followed by /i, j/ and requires a substantial change in articulation. This allophonic variant, in addition to being perhaps more challenging for learners than [s], may also have triggered greater sensitivity on the part of the transcribing team. As a result, for many of the 21 learners, /s/ errors in spontaneous speech were not well reflected by their KPD scores.

Relationship between KPD Results and Self-Assessments

In the following subsections, I first present a summary of the results of the learner self-assessments. Then, I present the results of analyses of absolute differences between KPD scores and SA responses, correlations between KPD and learner self-assessments, and agreement between KPD diagnostic flags and learner SA responses.

Summary of Learner Self-Assessment. On a scale of 1 (Always Difficult) to 7 (Almost Never Difficult), the average self-assessment of Korean phonemes was 5.47 (SD = 1.64, min = 1, max = 7) in production and 5.33 (SD = 1.71, min = 1, max = 7) in perception.
Table 7.4 provides descriptive statistics for SA responses at the phoneme level. For each item, learners in the sample used the full range of the SA scale, and with a few exceptions (a rating of 2 for /ɑ/ in production and perception, a rating of 2 for /i/ in perception) there were observations in every scale category for every phoneme. Learners rated several tense consonants (/k*, t*, ʨ*, s*/) and, surprisingly, the mid front vowel /ɛ/ as most difficult to produce, whereas the vowels /ɑ, i/ and the consonants /m, h/ were assessed as the easiest sounds to articulate. In perception, the trio of /ʨ, ʨʰ, ʨ*/ were rated as rather difficult, along with several tense consonants (/k*, t*, s*/), both glides (/j, w/), and several vowels (/ɛ, ʌ, o/).

Table 7.4
Learner Self-Assessment Results: Phoneme/Item-Level Descriptive Statistics

          | Production    | Perception
Phoneme   | Mean    SD    | Mean    SD
ㄱ /k/    | 5.79    1.33  | 5.60    1.42
ㅋ /kʰ/   | 5.54    1.64  | 5.32    1.69
ㄲ /k*/   | 4.95    1.74  | 4.98    1.76
ㄷ /t/    | 5.64    1.42  | 5.51    1.51
ㅌ /tʰ/   | 5.60    1.50  | 5.40    1.62
ㄸ /t*/   | 4.99    1.76  | 4.93    1.75
ㅂ /p/    | 5.66    1.37  | 5.51    1.44
ㅍ /pʰ/   | 5.45    1.57  | 5.22    1.63
ㅃ /p*/   | 5.18    1.67  | 5.16    1.66
ㅈ /ʨ/    | 5.12    1.65  | 4.89    1.72
ㅊ /ʨʰ/   | 5.16    1.71  | 4.84    1.82
ㅉ /ʨ*/   | 4.72    1.79  | 4.65    1.77
ㅅ /s/    | 5.51    1.55  | 5.48    1.54
ㅆ /s*/   | 4.74    1.74  | 4.61    1.77
ㅎ /h/    | 6.17    1.12  | 6.01    1.27
ㄹ /l/    | 5.24    1.89  | 5.78    1.46
ㅁ /m/    | 6.37    1.07  | 6.24    1.23
ㄴ /n/    | 5.85    1.60  | 5.75    1.69
ㅇ /ŋ/    | 5.43    1.78  | 5.38    1.75
ㅏ /ɑ/    | 6.55    0.90  | 6.53    0.92
ㅣ /i/    | 6.49    1.03  | 6.51    1.02
ㅔ /ɛ/    | 4.90    1.96  | 4.36    2.03
ㅓ /ʌ/    | 5.05    1.65  | 4.69    1.82
ㅗ /o/    | 5.05    1.70  | 4.64    1.84
ㅜ /u/    | 5.95    1.42  | 5.82    1.54
ㅡ /ɯ/    | 5.58    1.69  | 5.65    1.64
/w/       | 5.24    1.54  | 4.94    1.76
/j/       | 5.18    1.51  | 4.85    1.62
Note. Higher values = easier. For every phoneme in both modalities, the minimum rating was 1 and the maximum was 7.

Phoneme-Level Differences between KPD Results and SA. One way of looking at the relationship between KPD results and learner SA is to compute difference scores. After converting phoneme parcel scores and SA easiness ratings to percentages to facilitate direct comparisons, I calculated difference scores by subtracting SA percentage scores from KPD percentage scores. Across phonemes, the mean difference was 16% (SD = 8%) for production and 9% (SD = 10%) for perception. The results of this analysis are presented in Table 7.5. Positive values indicate that learners underestimated a phoneme's easiness (i.e., their accuracy was relatively higher than their perception of easiness), while negative values indicate an overestimation (i.e., their accuracy was relatively lower than their perception of easiness). For many phonemes in each modality, learners were on average quite accurate. For example, learners showed only a trivial gap (-1%) between their perceptions and their accuracy in perceiving /kʰ/. However, in almost all cases, standard deviations were considerable, often greater than 20% or 30%. Even more crucially, the range of difference scores was generally large. At the extremes, there were learners who vastly overestimated the easiness of a phoneme (e.g., -100% for /tʰ/ in production) or vastly underestimated their own accuracy (e.g., +100% for /l/ in perception). Surprisingly, learners exhibited considerable differences between KPD scores and SA for the phoneme /ɛ/, especially in perception.
This is likely attributable to some confusion introduced by the format of the SA (see Appendix F), which presented the two Korean letters ㅔ and ㅐ as both corresponding to the phoneme /ɛ/ (which is the case in modern Korean; see Shin et al., 2013, Chapter 5). However, more conservative descriptions (and prescriptions) of Korean phonology do not include /ɛ/, instead featuring /e/ (a front unrounded mid vowel corresponding to ㅔ) and /æ/ (a front unrounded low vowel corresponding to ㅐ). I occasionally received queries about this item, and I was somewhat puzzled when very advanced speakers deliberated for some time on this item before marking a middling degree of easiness. It appears that many learners were under the impression that the two letters corresponded to different phonemes, and that the self-assessment item was asking how well they could distinguish between the two phonemes.

Table 7.5
Differences between KPD Results and Learner Self-Assessments

          | KPD Production – SA Production  | KPD Perception – SA Perception
Phoneme   | Mean   SD    Min     Max        | Mean   SD    Min     Max
ㄱ /k/    | 14%    23%   -21%    100%       | 6%     27%   -33%    83%
ㅋ /kʰ/   | 15%    28%   -67%    100%       | -1%    30%   -83%    83%
ㄲ /k*/   | 9%     35%   -83%    100%       | 8%     31%   -75%    100%
ㄷ /t/    | 9%     27%   -56%    100%       | 16%    28%   -50%    100%
ㅌ /tʰ/   | 10%    28%   -100%   83%        | 2%     28%   -50%    83%
ㄸ /t*/   | 10%    34%   -83%    100%       | 14%    30%   -67%    100%
ㅂ /p/    | 18%    25%   -29%    100%       | -3%    31%   -83%    67%
ㅍ /pʰ/   | 15%    27%   -80%    100%       | 5%     30%   -50%    100%
ㅃ /p*/   | 9%     32%   -83%    100%       | 5%     30%   -75%    83%
ㅈ /ʨ/    | 28%    28%   -25%    100%       | 8%     38%   -100%   83%
ㅊ /ʨʰ/   | 19%    33%   -100%   100%       | 12%    31%   -75%    100%
ㅉ /ʨ*/   | 5%     38%   -100%   100%       | 8%     32%   -75%    83%
ㅅ /s/    | 22%    26%   -20%    100%       | -8%    30%   -63%    88%
ㅆ /s*/   | 19%    32%   -43%    100%       | -4%    32%   -67%    67%
ㅎ /h/    | 13%    19%   -33%    100%       | 14%    22%   -25%    100%
ㄹ /l/    | 24%    31%   -33%    100%       | 12%    25%   -50%    100%
ㅁ /m/    | 10%    18%   -13%    100%       | 9%     21%   -33%    100%
ㄴ /n/    | 13%    26%   -40%    100%       | 12%    28%   -50%    100%
ㅇ /ŋ/    | 20%    30%   -50%    100%       | 11%    33%   -75%    100%
ㅏ /ɑ/    | 7%     15%   -7%     100%       | 5%     18%   -50%    100%
ㅣ /i/    | 8%     17%   -13%    100%       | 3%     20%   -67%    67%
ㅔ /ɛ/    | 33%    32%   -11%    100%       | 37%    36%   -33%    100%
ㅓ /ʌ/    | 19%    31%   -60%    100%       | 16%    34%   -67%    100%
ㅗ /o/    | 31%    28%   -8%     100%       | 17%    33%   -67%    100%
ㅜ /u/    | 15%    24%   -25%    100%       | -11%   35%   -83%    83%
ㅡ /ɯ/    | 19%    29%   -50%    100%       | 12%    28%   -67%    100%
/w/       | 23%    26%   -30%    100%       | 25%    29%   -25%    100%
/j/       | 20%    25%   -44%    100%       | 16%    27%   -40%    83%

At the learner level, considering average differences between KPD and SA results across phonemes in each modality provides insight into a learner's overall level of SA accuracy. Figure 7.4 maps each learner's SA accuracy along production and perception dimensions. A highly accurate learner will be near the origin; less accurate learners will be located farther from the origin. Learners in Quadrant I tended to underestimate the easiness of Korean phonemes in both production and perception, while learners in Quadrant III tended to overestimate it. Many learners were, on average, quite accurate (within 10%, the inner ring) for both production and perception. However, most learners exhibited greater average differences along each dimension, generally between 10% and 30% (middle ring). Interestingly, very few learners tended to underestimate the easiness of production while overestimating the easiness of perception. In other words, learners rarely judged their production to be relatively difficult while judging their perception (on average) to be relatively easy.

Figure 7.4. Mapping average learner accuracy for production and perception.
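The quadrant-and-ring classification in Figure 7.4 reduces to simple arithmetic on each learner's two mean difference scores. The sketch below uses toy values, and the treatment of the rings as Euclidean distances from the origin is an assumption.

```python
import numpy as np
import pandas as pd

# Toy per-learner mean difference scores (KPD minus SA, in %); in the study
# these came from averaging phoneme-level differences within each modality.
mean_diff = pd.DataFrame({"prod": [4.0, -12.5, 22.0, 8.0],
                          "perc": [8.0, -15.0, 18.0, -6.0]})

# Quadrants as in Figure 7.4: I = positive differences (underestimated
# easiness) in both modalities; III = negative in both.
conds = [(mean_diff["prod"] >= 0) & (mean_diff["perc"] >= 0),
         (mean_diff["prod"] < 0) & (mean_diff["perc"] >= 0),
         (mean_diff["prod"] < 0) & (mean_diff["perc"] < 0)]
mean_diff["quadrant"] = np.select(conds, ["I", "II", "III"], default="IV")

# Rings assumed here to be distance from the origin (10% and 30% radii)
dist = np.hypot(mean_diff["prod"], mean_diff["perc"])
mean_diff["ring"] = pd.cut(dist, [0, 10, 30, np.inf],
                           labels=["within 10%", "10-30%", "beyond 30%"])
print(mean_diff)
```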
Correlations between KPD Results and SA. Another way of examining the relationship between KPD results and learner SA results is by focusing on the strength of the relationship via correlations. The upper diagonal of Figure 7.5 contains Pearson correlation coefficients (r) among KPD accuracy scores averaged across phonemes, SA easiness scores averaged across phonemes, and SA responses for global pronunciation qualities (comprehensibility and accentedness). All correlations were significant at p < .01. The figure also features scatterplots for variable pairs (lower diagonal), and the diagonal shows density plots for each variable. The largest correlations were obtained between SA Production and SA Perception (r = .88) and between KPD Production and KPD Perception (r = .73). The global SA measures of comprehensibility and accentedness were moderately correlated with each other and had moderate correlations with averaged KPD Production and Perception scores. Other correlations were smaller; notably, averaged SA Perception had slightly stronger associations with KPD scores than averaged SA Production did. The correlations between averaged SA Production and averaged KPD scores, even for KPD Production, were small.

Looking at finer-grained associations between SA and KPD results, Table 7.6 shows Pearson correlations between KPD results and SA for each phoneme in both modalities; Figure 7.6 shows scatterplots for these relationships. What is perhaps most interesting about these results is the number of small, statistically nonsignificant correlations. These appear to have arisen largely from restriction-of-range effects; for some of the easier phonemes (in terms of both KPD results and SA results), correlations appear to have been attenuated by a lack of variation (i.e., most participants rating the ease of a phoneme such as /i/ at 7 out of 7 and notching very high accuracy scores on the KPD). The strongest phoneme-level correlations were found for objectively difficult phonemes (per KPD results; see Chapter 5) such as /k*, t*, p*/. However, some relatively strong correlations (though still generally small to moderate in magnitude) were found for some broadly easier phonemes, like /n, ŋ, l/.

Figure 7.5. Relationships among average KPD scores and SA.
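The correlation matrix underlying Figure 7.5 amounts to pairwise Pearson correlations among six per-learner averages; a minimal sketch with hypothetical variable names follows.

```python
import pandas as pd

# Hypothetical per-learner averages: KPD accuracy and SA easiness in each
# modality, plus global SA ratings; names are illustrative only.
avg = pd.read_csv("learner_averages.csv")
cols = ["kpd_prod", "kpd_perc", "sa_prod", "sa_perc",
        "sa_comprehensibility", "sa_accentedness"]
print(avg[cols].corr(method="pearson").round(2))  # matrix as in Figure 7.5
```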
Table 7.6
Correlations between KPD Scores and SA for Each Phoneme

          | KPD & SA Production                 | KPD & SA Perception
Phoneme   | n    r     p      95% CI            | n    r     p      95% CI
ㄱ /k/    | 195  .05   0.463  [-0.09, 0.19]     | 195  .06   0.378  [-0.08, 0.20]
ㅋ /kʰ/   | 195  .21   0.004  [0.07, 0.34]      | 192  .35   0.000  [0.22, 0.47]
ㄲ /k*/   | 194  .16   0.024  [0.02, 0.29]      | 193  .25   0.001  [0.11, 0.37]
ㄷ /t/    | 195  -.02  0.772  [-0.16, 0.12]     | 194  .02   0.782  [-0.12, 0.16]
ㅌ /tʰ/   | 196  .27   0.000  [0.13, 0.39]      | 195  .27   0.000  [0.13, 0.39]
ㄸ /t*/   | 195  .28   0.000  [0.14, 0.40]      | 194  .36   0.000  [0.23, 0.48]
ㅂ /p/    | 195  -.04  0.572  [-0.18, 0.10]     | 194  .02   0.746  [-0.12, 0.16]
ㅍ /pʰ/   | 195  .25   0.000  [0.12, 0.38]      | 194  .25   0.000  [0.11, 0.38]
ㅃ /p*/   | 196  .35   0.000  [0.22, 0.46]      | 195  .34   0.000  [0.21, 0.46]
ㅈ /ʨ/    | 192  .13   0.067  [-0.01, 0.27]     | 192  .06   0.442  [-0.09, 0.19]
ㅊ /ʨʰ/   | 193  .08   0.265  [-0.06, 0.22]     | 194  .29   0.000  [0.15, 0.41]
ㅉ /ʨ*/   | 194  -.01  0.914  [-0.15, 0.13]     | 194  .27   0.000  [0.13, 0.39]
ㅅ /s/    | 196  .01   0.900  [-0.13, 0.15]     | 195  .02   0.805  [-0.12, 0.16]
ㅆ /s*/   | 196  .09   0.192  [-0.05, 0.23]     | 195  .16   0.024  [0.02, 0.29]
ㅎ /h/    | 193  .01   0.839  [-0.13, 0.15]     | 193  .07   0.362  [-0.08, 0.20]
ㄹ /l/    | 194  .19   0.009  [0.05, 0.32]      | 195  .16   0.027  [0.02, 0.29]
ㅁ /m/    | 196  .10   0.151  [-0.04, 0.24]     | 196  .07   0.340  [-0.07, 0.21]
ㄴ /n/    | 196  .29   0.000  [0.15, 0.41]      | 195  .24   0.001  [0.10, 0.37]
ㅇ /ŋ/    | 196  .16   0.021  [0.02, 0.30]      | 195  .16   0.029  [0.02, 0.29]
ㅏ /ɑ/    | 196  .00   0.986  [-0.14, 0.14]     | 196  .10   0.158  [-0.04, 0.24]
ㅣ /i/    | 194  -.04  0.585  [-0.18, 0.10]     | 194  .08   0.253  [-0.06, 0.22]
ㅔ /ɛ/    | 193  .17   0.018  [0.03, 0.30]      | 193  .06   0.382  [-0.08, 0.20]
ㅓ /ʌ/    | 195  .13   0.067  [-0.01, 0.27]     | 196  .18   0.011  [0.04, 0.31]
ㅗ /o/    | 194  .12   0.108  [-0.03, 0.25]     | 196  .18   0.010  [0.04, 0.31]
ㅜ /u/    | 196  .06   0.432  [-0.08, 0.19]     | 194  .10   0.157  [-0.04, 0.24]
ㅡ /ɯ/    | 194  .14   0.045  [0.00, 0.28]      | 194  .35   0.000  [0.22, 0.46]
/w/       | 194  .08   0.261  [-0.06, 0.22]     | 195  .23   0.001  [0.10, 0.36]
/j/       | 196  .20   0.00   [0.06, 0.33]      | 195  .22   0.002  [0.08, 0.35]

Figure 7.6. Scatterplots of KPD score and SA for each phoneme in (A) production and (B) perception.

Finally, it is worthwhile to consider the strength of association between SA and KPD results at the level of individual learners. This is interpretable as a measure of how well learners were able to discriminate between phonemes along a continuum of difficulty. For production, the average within-learner correlation between KPD and SA results was r = .21 (SD = .23, min = -.32, max = .79) for 194 learners (four learners had no variation in their SA responses). For perception, the average within-learner correlation between KPD and SA results was r = .20 (SD = .25, min = -.42, max = .79) for 197 learners. Figure 7.7 illustrates the distribution of learner correlations between KPD and SA results. Learners who discriminated phoneme difficulty well, with positive associations between KPD and SA results for both production and perception, are located in Quadrant I. Learners with overall poor or misguided discrimination of phoneme difficulty are located in Quadrant III. While most learners showed some positive association for both production and perception, some seemed to have misperceptions about their strengths and weaknesses. Other learners could discriminate the difficulty of phonemes in production or perception, but not in the other modality (Quadrants II and IV).

Figure 7.7. Mapping learner discrimination of phoneme difficulty for production and perception.
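The within-learner correlations reported above can be computed by grouping long-format data by learner. This sketch assumes hypothetical column names and returns NaN for learners with no variation in their ratings.

```python
import pandas as pd

# Hypothetical long-format production data: learner x phoneme rows with a
# KPD parcel score (%) and an SA rating (1-7).
df = pd.read_csv("kpd_sa_production.csv")  # learner, phoneme, kpd_pct, sa_rating

def within_learner_r(g):
    # Correlation across phonemes for one learner; undefined when the
    # learner gave the same SA rating to every phoneme.
    if g["sa_rating"].nunique() < 2:
        return float("nan")
    return g["kpd_pct"].corr(g["sa_rating"])

learner_r = df.groupby("learner").apply(within_learner_r)
print(learner_r.describe())  # mean, SD, min, max across learners
```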
Agreement between KPD Diagnostic Flags and SA. As a final means of examining the relationship between KPD scores and self-assessments, I turned to the diagnostic flags for especially difficult phonemes. As readers may recall from previous chapters, I set a 75% accuracy threshold for the diagnostic flagging of especially difficult phonemes. To compare these diagnostic flags with learner self-assessments, I dichotomized self-assessment scores using the same 75% threshold. In practice, this meant that a learner indicated a phoneme ease of 5 or less (out of 7; 7 = Almost Never Difficult). From the two sets of binary phoneme scores (diagnostic flags and dichotomized self-assessments), I tagged matches between diagnostic flags and learner-recognized critical difficulties. Table 7.7 contains summary statistics for KPD diagnostic flag and learner self-assessment agreement.

Table 7.7
Summary Statistics for KPD Flagged Phonemes and SA Agreement

           | # KPD Flagged Phonemes  | # KPD-SA Matches       | % KPD-SA Agreement
Mode       | M     SD     Range      | M     SD     Range     | M    SD    Range
Production | 3.55  2.30   1–10       | 2.17  2.02   0–9       | 60   39    0–100
Perception | 8.39  4.67   1–23       | 4.78  4.11   0–18      | 53   33    0–100

Due to the difficulty of the KPD perception tasks (see Chapter 4, Measurement), learners on average had more than twice as many perception phonemes flagged on the KPD as production phonemes. Learner recognition of these phonemes as being difficult was close to two-thirds for production and closer to one-half for perception. As might be expected, there was considerable variation among learners in their recognition of the phoneme difficulties revealed by the KPD, including a substantial number of learners who failed to recognize any of their difficulties. Out of 160 learners who had at least one production phoneme flagged according to KPD results, 33 (21%) had self-assessments that failed to recognize the difficulty of any flagged phonemes. For perception phoneme flags, the number was 31 out of 193 (16%).
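A sketch of the flag-matching procedure, assuming hypothetical data for one modality:

```python
import pandas as pd

# Hypothetical long-format data: learner x phoneme rows with KPD parcel
# accuracy (%) and a 1-7 SA rating for the same modality.
df = pd.read_csv("kpd_sa_perception.csv")  # learner, phoneme, kpd_pct, sa_rating

# Dichotomize both measures at the same criterion
df["kpd_flag"] = df["kpd_pct"] < 75   # diagnostic flag
df["sa_flag"] = df["sa_rating"] <= 5  # learner-recognized difficulty

per_learner = df.groupby("learner").apply(
    lambda g: pd.Series({"n_flags": g["kpd_flag"].sum(),
                         "n_matches": (g["kpd_flag"] & g["sa_flag"]).sum()}))

# Percent agreement among learners with at least one flagged phoneme
flagged = per_learner[per_learner["n_flags"] > 0]
print((100 * flagged["n_matches"] / flagged["n_flags"]).describe())
```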
Discussion

In this chapter, addressing Research Questions 4, 5, and 6, I presented the results of analyses that compared KPD scores to external measures of oral proficiency, pronunciation in spontaneous speech, and self-assessments of pronunciation ability and phoneme difficulty. Research Question 4 and the analysis of oral proficiency primarily addressed the explanation of KPD scores, with the expectation that more proficient speakers will tend to have more intelligible production and more accurate perception of Korean phonemes. The remaining research questions and analyses primarily addressed the extrapolation of KPD results to pronunciation (and perception) in more general domains of Korean use. Research Question 6 also bears on the utilization of KPD scores, whereby weaknesses in learner self-assessments might be corrected by KPD scores. In what follows, I discuss the results with respect to each research question.

RQ4: To what extent do KPD results show an expected relationship with Korean oral proficiency?

Broadly, KPD results demonstrated relationships with Korean oral proficiency that were in line with expectations, providing support for the explanation inference in the KPD's validity argument. Specifically, medium-sized (Plonsky & Oswald, 2014) correlations were found between oral proficiency and average phoneme accuracy for both perception and production of Korean phonemes. When dividing learners into oral proficiency quantiles, a generally steady upward progression was found for average perception accuracy across phonemes, a pattern that was somewhat less visible for production phonemes, though it is worth noting that production phoneme averages were higher overall. For individual phonemes, larger correlations between oral proficiency and accuracy were found for perception phonemes than for production phonemes, but in both modalities the strongest correlations were generally obtained for more difficult phonemes (see Chapter 4). Progression in phoneme accuracy across oral proficiency quantiles showed a similar trend: Easier phonemes had high average accuracy across all quantiles, while more difficult phonemes showed an upward trend from the lower to the upper oral proficiency quantiles (with a small number of exceptions related to universally difficult phonemes that tended to elude even many of the most advanced speakers). In sum, the findings here are in alignment with theory and research suggesting that phonological competence develops with experience and instruction (Piske et al., 2001) and that some Korean phonemes tend to be more difficult and thus take longer to bring under control, difficulties which can persist into advanced stages of overall proficiency (Lee et al., 2009). In particular, at the group level, tense and aspirated consonants tended to be more difficult for low-proficiency learners but became progressively less challenging for more proficient learners.

From the perspective of diagnostic utility, the presence of 'outliers' in these analyses is of great interest: individuals whose phoneme accuracy (specific or averaged) is out of line with the expectations for their overall oral proficiency range. For example, in the fourth and fifth oral proficiency quantiles (i.e., the highest 40% of oral proficiency), there were some learners with average phoneme production accuracy of 79% and 80%, respectively, which would entail the flagging of several phonemes as difficult. The KPD would be of considerable utility for such learners, who despite having generally high levels of oral proficiency in Korean could nonetheless stand to benefit from targeted pronunciation study, study that their generally high-proficiency peers might not need. Similarly, lower proficiency learners with excellent segmental pronunciation (e.g., with an average production phoneme accuracy in the 90-99% range) may not need as much segmental pronunciation instruction or individual study as some of their peers; these learners could confidently spend their time on other aspects of learning Korean.

RQ5: To what extent do results reflect difficulties test-takers show in spontaneous, meaning-focused speech?

Based on an exploratory, descriptive analysis of 21 learners, I observed a trend of learners with higher average KPD phoneme accuracy scores tending to produce phonological errors at a lower rate in their contemporaneous speaking. This provides some broad support for the extrapolation of KPD results to more naturalistic, meaning-focused Korean speaking performance. Of course, the speech samples I collected were not extensive, and thus this support can only be taken as preliminary. Looking at the learner productions in greater detail, I examined the alignment of errors produced with corresponding KPD phoneme accuracy scores.
This examination showed many cases of alignment, where some of each learner's most difficult phonemes according to the KPD appeared as challenges in naturalistic speech. This alignment is encouraging and provides some support for extrapolating KPD results at a finer grain size. However, due to limitations in the volume of spontaneous speech collected and other issues (e.g., different error criteria for the KPD and the phonemic transcriptions, different abilities and knowledge being tapped), I cannot claim particularly strong support for extrapolation from this analysis. That said, the level of alignment observed between scores derived from the highly discrete, non-communicative KPD and genuine (though limited) meaning-focused communication is encouraging.

RQ6: To what extent do results reflect self-assessments of pronunciation ability and difficulties?

While learners' self-assessments of global pronunciation abilities (i.e., accentedness and comprehensibility) were strongly related to their overall levels of performance on the KPD, finer-grained self-assessments were less accurate, a finding in line with research on pronunciation self-assessment in L2s such as French (Lappin-Fortin & Rye, 2014) and German (Dlaska & Krekeler, 2008). The challenge faced by learners appears to be in identifying the relative difficulty of Korean phonemes, as evidenced by the generally low average within-learner correlations between KPD parcel scores and self-assessed ratings in each modality. This interpretation is further supported by the alignment between diagnostic flags and learner self-assessments, where learners recognized substantial difficulty in only between half (perception) and two-thirds (production) of the phonemes that were especially difficult for them according to the KPD. However, it is worth noting that, on average, learners' absolute accuracy, based on difference scores, was not dismal: Learners averaged a 16% difference for production and 9% for perception. Thus, on the whole, I find that KPD scores have a moderate relationship with learner self-assessments, which suggests some meaningful extrapolation of KPD results to pronunciation and listening in typical Korean use, moderated by learner awareness. Indeed, I further suggest that the current findings bode well for the utilization inference in the KPD's validity argument: KPD results have the potential to heighten or fill in gaps in learners' self-awareness of specific pronunciation difficulties, which in turn could lead to more fruitful instructional decisions, attention to form in typical communication, or both. While my discussion of self-assessments so far has focused on sample means, it is worth pointing out the degree of variation in the sample: Several learners demonstrated major shortcomings in their ability to accurately self-assess pronunciation difficulties. With such poor awareness of pronunciation (and perception) difficulties, it is unlikely that these learners would be able to monitor their own productions, selectively pay additional attention to good exemplars in the input, or make productive and efficient decisions when planning additional pronunciation-related study on their own time. While training in self-assessment has been shown to improve self-assessment accuracy (Chen, 2008), research on whether it is feasible to train learners at a variety of proficiency levels to better self-assess their pronunciation difficulties is nonexistent.
It is these learners for whom the KPD might be utilized to the greatest, most beneficial extent.

CHAPTER 8: INTERPRETATION AND USE

Moving further up the chain of inferences in the KPD's proposed validity argument, in this chapter I examine the interpretation and use of KPD results by stakeholders. I sought to understand how test-takers (and one teacher) made sense of the KPD score report and interpreted the results meaningfully. Further, I investigated how test-takers applied information from the KPD score report over a 2- to 4-month period to explore how they were able to use the test results, and to uncover whether or how that use was beneficial. The data I present in this chapter are primarily qualitative, derived from face-to-face semi-structured interviews, but I also provide supporting quantitative information and analyses based on initial KPD results and, for a subset of students, KPD retest results.

Research Questions

For reader convenience, the three research questions I address in this chapter are as follows:
• RQ7: How do (a) teachers and (b) learners understand KPD score reports? To what extent do they learn anything new from KPD score reports?
• RQ8: Do learners report any changes in their self-study routines and/or their attention to phonological form in formal or informal learning situations?
• RQ9: Do learners show improvements (a) overall and/or (b) in weak areas after receiving and applying KPD feedback?

My investigation primarily focused on learners' individual interpretation and utilization of results, rather than interpretation and utilization of test results in a classroom context with the guidance of a teacher. Although this is perhaps a shortcoming, I see it as a useful starting point, as individual students are the ground-floor, most immediately impacted stakeholders in any assessment geared toward learning, and self-regulated learning can lead to desirable pronunciation learning outcomes (Moyer, 2014). Students can benefit from increased awareness of their pronunciation abilities (Kennedy & Trofimovich, 2010; Saito, 2018), and in practice many L2 (Korean or otherwise) classrooms make little time for pronunciation instruction. Additionally, due to diverse learner backgrounds and needs, it may be difficult to arrive at suitable whole-class segmental targets (Derwing & Munro, 2014). Such conditions make learners' own efforts to study more autonomously worthy of interest. In the following sections, I provide methodological details followed by my presentation and discussion of findings.

Methods

A primary description of the interview study procedures was reported in Chapter 4. What follows here are details on the interviewees, the supporting KPD score reports, and analytical details.

Interviewees

As mentioned in Chapter 4, I interviewed a total of 22 individuals: 21 learners and one teacher, who had taught two of the student interviewees in an intensive Korean program class. I refer to all interviewees with pseudonyms. Table 8.1 provides details on the 21 learner interviewees. Among the 21 learners, five were graduate students, four were enrolled in undergraduate programs (two in English-medium programs), and 12 were enrolled in intensive Korean programs (one of these students was an exchange student also taking English-medium undergraduate courses).
The learners represented eight different L1 backgrounds (with nearly half of the group identifying as L1 Mandarin Chinese speakers) and 11 countries of origin (distinguishing Hong Kong as a special region within China). Learners' time spent living in Korea at the time of their initial KPD administration ranged from approximately 1 month to nearly 11 years. The teacher, a male native speaker of Korean whom I will refer to as Jae-woo, had taught Yu-wen and Yuki in different classes over the previous two semesters; I interviewed him on November 21, 2018. For test-takers, the first interview generally occurred within a couple of weeks of their field-testing appointment. Fourteen test-takers were available and willing to complete a second interview and take the KPD again. These second interviews took place roughly three months after each participant's first interview (mean = 3.16 months, min = 2.33 months, max = 4.30 months). After each appointment, participants received 10,000 KRW (approximately $10 USD).

Score Reports

The KPD score reports detailed in Chapter 3 were provided to all interviewees during the first interview and revisited as needed in the second interview (see Figure 3.1). However, I made one small change to the score reports given to learners: I selected a threshold of 80% (rather than 75%) for flagging the critical phonemes that would appear on the first page. Anticipating a generally higher level of both general Korean proficiency and specific pronunciation ability compared to my pilot sample, I was worried that too many test-takers would receive little in the way of helpful feedback with a 75% diagnostic flag criterion. I chose to err on the side of strictness (e.g., for some phoneme parcels, making just one mistake could result in a flag) in order to provide more test-takers, especially those with mostly (but not universally) intelligible pronunciation, with at least some prescriptions for study, as the feedback was advertised as a benefit of participating in the study. In terms of scoring, this did not change anything, and the raw accuracy scores on the second page of the reports were unchanged. However, it did introduce some slight changes in interpretations and (potentially) decision-making for phonemes with scores from 75-79%, in comparison with the flags reported in previous chapters.

Table 8.1
Interviewees

Pseudonym | Sex | Age | From | Languages (b) | Acad. Status | EIT (d) | LOR (e) | Interview 1 | Interview 2
Graduate Students
Min | F | 23 | China | Chinese, Korean, English | KFL (MA) | 101 | 0;1 | 10/1/2018 | 1/7/2019
Hoa | F | 24 | Vietnam | Vietnamese, Korean, English | KFL (MA) | 73 | 0;1 | 10/1/2018 | 1/8/2019
Ju-an | F | 25 | China | Chinese, Korean, English, Japanese | KFL (MA) | 91 | 1;0 | 11/5/2018 | 1/29/2019
Yang | F | 30 | China | Chinese, Korean | International Trade (MA) | 87 | 0;8 | 11/5/2018 | -
Amber | F | 23 | Hong Kong | Cantonese, English, Chinese, Japanese, Korean | Hospitality (PhD) | 96 | 10;10 | 11/6/2018 | 2/1/2019
Undergraduate Students
Leo | M | 26 | Russia | Russian, English, Korean | International Studies (c) | 38 | 5;0 | 10/3/2018 | 1/15/2019
Xiu Lan | F | 23 | China | Chinese, Korean, English | KFL (BA) | 69 | 0;8 | 10/10/2018 | -
Sofia | F | 23 | Belarus | Russian, Belarussian, English, Korean | Business Management | 75 | 2;0 | 10/10/2018 | 1/29/2019
Fang | F | 21 | China | Chinese, Korean, English, Japanese | KFL (BA) | 70 | 1;6 | 10/11/2018 | 1/8/2019
Language Students
Holger | M | 28 | Germany | German, English, Korean | Level 3 | 45 | 0;2 | 8/29/2018 | 11/7/2018
Jing | F | 23 | China | Chinese, Korean, English | Level 4 | 74 | 0;11 | 8/31/2018 | 1/7/2019
Chia-ling | F | 19 | Taiwan | Chinese, Korean, English | Level 4 | 79 | 1;0 | 9/5/2018 | -
Maria | F | 23 | Mexico | Spanish, English, French, Korean | Level 5 | 18 | 0;6 | 10/5/2018 | 12/17/2018
Noriko | F | 29 | Japan | Japanese, Korean | Level 2, International Studies (c) | 59 | 0;6 | 10/8/2018 | 1/7/2019
Sakura | F | 48 | Japan | Japanese, Korean, English | Level 5 | 67 | 0;1 | 11/5/2018 | 2/13/2019
Yu-wen (a) | F | 28 | Taiwan | Chinese, English, Korean | Level 5 | 33 | 0;11 | 11/5/2018 | -
Aylin | F | 19 | Kazakhstan | Russian, English, Korean, Kazakh | Level 3 | 62 | 0;8 | 11/6/2018 | -
Na | F | 23 | China | Chinese, Korean, English | Level 5 | 57 | 1;0 | 11/9/2018 | 2/13/2019
Yuki (a) | F | 21 | Japan | Japanese, Korean | Level 4 | 50 | 0;7 | 11/14/2018 | -
Alice | F | 22 | France | French, English, Korean | Level 2 | 47 | 0;2 | 11/15/2018 | -
Xiu Ying | F | 20 | China | Chinese, Cantonese, Korean, English | Level 5 | 72 | 0;5 | 11/9/2018 | 2/14/2019
Note. (a) Jae-woo taught Yuki (Fall 2018) and Yu-wen (Summer 2018). (b) Self-reported, in order of dominance. (c) English-medium degree. (d) Elicited Imitation Test (oral proficiency measure, scale 0-120). (e) Length of residence at time of initial KPD testing, in years;months.
KPD Retesting

For each of the 14 participants who completed a second interview and KPD retest, I calculated their average production and average perception accuracy across all phonemes at the initial test and at the retest, and computed change scores (retest minus initial test). I also examined their production and perception phoneme flags at the initial test and retest, tallying the total number of flags. At the group level, I computed descriptive statistics. At the individual level, I focused on average production phoneme accuracy over time and further analyzed the production flags by examining which flags were lost or gained from initial test to retest. I then interpreted these analyses alongside learners' comments about learning activity and perceptions of change (see the following section).
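The change-score computation is straightforward arithmetic; the following sketch assumes a hypothetical file with one row per retested learner.

```python
import pandas as pd

# Hypothetical retest data: averaged production accuracy (%) and number of
# flagged production phonemes at initial test (t1) and retest (t2).
rt = pd.read_csv("kpd_retest.csv")  # learner, prod_t1, prod_t2, flags_t1, flags_t2

rt["prod_change"] = rt["prod_t2"] - rt["prod_t1"]   # retest minus initial
rt["flag_change"] = rt["flags_t2"] - rt["flags_t1"]
print(rt[["prod_change", "flag_change"]].describe())
```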
Analysis of Interview Data

The 36 interview sessions took 16 hours (970 minutes) in total. All interviews were transcribed in the language(s) originally used. I used two approaches to transcribing the interview data: manual transcription (i.e., completed by myself or a research assistant) and manually checked automated transcription (i.e., using automated transcription software such as Vocalmatic, www.vocalmatic.com, followed by manual correction). Interviews remained in the original languages in my subsequent analysis of the data.

My approach to analyzing the interview data was primarily qualitative content analysis, “a method for systematically describing the meaning of qualitative data … by assigning successive parts of the material to the categories of a coding frame” (Schreier, 2014, p. 170). More specifically, I took a deductive approach to content analysis (Elo & Kyngäs, 2007; Schreier, 2014), utilizing pre-established categories based on my validity argument-driven research questions and an initial review of the interview data. I created a spreadsheet with one row for each interviewee and columns for relevant categories, which served as a matrix display (Miles, Huberman, & Saldaña, 2014) or coding frame (Schreier, 2014) for the analysis of the interview data. This frame layout facilitated cross-case comparisons, allowing me to see similarities and differences across the pool of participants. The categories included understanding of results, alignment of results with own assessments, typical pronunciation learning/teaching, potential learning activity, actual learning activity, and changes in pronunciation; the latter two categories applied only to learners who completed a second interview. For each topic, I made notes, in English, on what each interviewee said and compiled illustrative interview excerpts.

Findings

In this section, I present my main findings related to stakeholders' utilization of their KPD results. I start by discussing the learners' understanding of, and potential for applying, KPD results in their continued Korean learning efforts. Then, I turn to the comments of the teacher of two of those learners to consider a more expert perspective on the understanding of test results and more conventional classroom-based application of results. Finally, I focus on the second interview and KPD retest data from 14 learners to explore the actual utilization of KPD results and the results' impact on pronunciation learning. Throughout the findings, I present quotations and excerpts from interviews. If the original comments were in Korean, I provide my English translation followed by the original Korean in parentheses; my translations prioritize meaning and do not attempt to reflect any form-related infelicities present in the original Korean. Occasionally, I lead with the original Korean to emphasize linguistic choices made by learners in their comments. I represent Korean letter/sound names with IPA symbol equivalents (e.g., ㄱ, spoken as ‘기역’ /ki.jʌk/, = /k/). For English comments, I provide interviewees' original words without correcting any lexical or grammatical infelicities. Where I felt it was necessary, I inserted bracketed contextual information or corrections into comments.

Learner Understanding of Results and Potential Application

The following findings are based on a cross-case analysis of learner comments, primarily their initial reactions to receiving their KPD score reports (see Figure 3.1 for an example) in the first interview.

Interpretation. At a basic level, learners understood that the KPD results provided information on their pronunciation strengths and weaknesses, and learners tended to focus more on the latter. All learners recognized that the phonemes highlighted on the first page were their weaknesses; in Korean they often used terms like “약점” (weak point, Xiu Ying) or “문제점” (problem (point), Ju-an and Na). Learners latched on to accuracy scores and example words from the second page of the score report, especially for phonemes with low scores. Furthermore, many learners readily thought of these sounds as targets for study and improvement: “Now I know what I should work [on]” (Maria), “After learning what my mistakes are, I can fix them” (저의 실수를 아는 후에 그 실수를 고칠 수 있어요, Aylin).
Across learners, I commonly found general agreement with the information provided by the KPD. Comments indicating broad acceptance of results were common: “After seeing this it really seems right” (이것을 봤을 때 진짜 맞는 것 같아요, Fang), “What I thought was difficult all came out with scores like that” (제가 어렵다고 생각했던 것은 다 그런 점수도 나왔으니까, Yuki), “Ah, as I expected, I am right about the pronunciations I think are difficult” (아 역시, 제가 어렵다고 생각하고 있는 발음 다 맞아요, Noriko). I believe this broad acceptance may be related to learners' epistemic orientation toward the KPD results. Namely, learners appeared to regard the results as valid, due to their objectivity or externality, and as able to fill in gaps in their own self-knowledge. I offer support for this interpretation through the following illustrative excerpts:

Evaluating based on my own impressions isn’t objective. A Korean saying she’s lacking with this or she needs to practice that more are more effective results… In my opinion, I’m not sure about my own evaluation of myself. It’s just based on my own thoughts. (자기 생각대로 평가하면 좀 객관 되지 않은 거예요. 한국 사람이나 이거 이 친구는 이거 부족하구나 이 친구는 이거 더 연습해야 되구나 그거는 더 효과적인 결과가 … 제 생각으로 제가 평가한 거니까 좀 모르겠어요… 그냥 자기 생각으로 한 거니까, Hoa)

But actually it's the first time [getting the KPD results], like when someone like, tell me about like my pronunciation. Yeah. That is why I really wanted to know that because like how our teachers, they never like did it. (Sofia)

Now I know the problem before I didn’t know the problem and just like, okay, they [people I talk to] don’t understand [me], maybe because – Like now I kind of know what kind of problems do I have. (Leo)

However, learner interpretation of the KPD results was not without limitations. A commonly occurring obstacle to understanding the results appeared to be a lack of familiarity with linguistic vocabulary. On the first page of the KPD results, supplemental information on difficult articulatory features and contexts was provided. Few learners knew terms such as 경음 (tense), 격음 (aspirated), or 파찰음 (affricate) that were used to label features, or terms such as 종성 (final consonant) used to label contexts. Only some of the more advanced learners, particularly but not exclusively those pursuing degrees in Korean as a second/foreign language, were immediately able to understand what these terms meant, such as Amber and Yu-wen. Interestingly, some learners with more informal or self-directed Korean learning histories had difficulty talking about Korean sounds, having never formally learned the names of Korean letters, though they clearly had adequate sound-symbol correspondence knowledge to interpret the scores (instead of using letter names for consonants they did not know well, learners such as Sakura and Chia-ling simply constructed a short syllable of [consonant] + /ɯ/ to refer to phonemes; this would be like saying “wuh” to represent the letter w in English). Also related to linguistic deficiencies, lower proficiency learners such as Alice (who was enrolled in a Level 2 Korean language class) demonstrated some difficulty understanding the explanatory prose of the KPD report.

Occasionally, information in the KPD score reports was difficult for students to reconcile. Rarely, a score on a sound or feature, in production or perception, was so different from a learner's self-appraisal of their pronunciation abilities that they voiced some disagreement or disbelief. Several learners were surprised about their low perception scores, either generally or in reference to specific sounds.
However, three learners, Jing, Chia-ling, and Sakura, all expressed difficulty in accepting that their listening (perception) was worse than their pronunciation. In each case, I elaborated on the different scoring criteria for the production and perception sections. I also commented on how narrowly listening was operationalized on the KPD. Other learners voiced more specific disagreement with the information provided by KPD results. For example, Aylin believed her production accuracy scores for /o/ (100%) and /ʌ/ (20%) were a reversed representation of her actual pronunciation of those two vowels. Similarly, Fang had trouble accepting that she had difficulty pronouncing tense consonants (KPD production parcel scores: /k*/ = 50%, /s*/ = 57%, /t*/ = 75%, /ʨ*/ = 75%), genuinely believing that she had little trouble producing them. In one case of disagreement with a KPD score, a learner referred to an external assessment of her pronunciation by a teacher: Xiu Ying said, “But my teacher said that in my last presentation my /l/ sound wasn’t clear” (근데 선생님이 제가 지난번 발표할 때 ㄹ [/l/] 소리가 잘 안 나와 가지고 그렇게 말했어요). Na indicated some disagreement with her low score on /k*/. Her first reaction was that she could produce that sound with little difficulty. In fact, she demonstrated this to me in the interview, producing the example words from her score report, 토끼 (/tʰo.k*i/) and 꿀 (/k*ul/), intelligibly and accurately (to my ears, at least). As we discussed this further, she appeared to come to a realization, or offer a concession, that her knowledge of articulation might not always match her accuracy in production: “Although I know how to pronounce it, I might not [always] be that accurate” (하지만 어떻게 발음하는지 알고 있는데 그렇게 정확하지 않아요).

New Information. Rare disagreements aside, learners' broad acceptance of KPD results led to many discrepancies with their prior self-appraisals being treated as new information to process and incorporate. Almost all interviewees expressed surprise at, but not rejection of, some piece of information contained in their score report. In some cases, there was no surprise that a given sound was difficult, but learners were nonetheless surprised at the degree of difficulty it presented. For example, Aylin readily agreed that several tense sounds were difficult for her to produce but expressed some shock at scores of 0% for /t*, p*, ʨ*/. Sakura was similarly surprised by her perception score of 0% for /u/ but was grateful to now be aware of how acute that difficulty was. In both the first and second interviews, Hoa was appreciative to learn that she had difficulty distinguishing between /p/ and /pʰ/ in her production. Other learners, such as Holger, were surprised by the overall number of pronunciation difficulties identified by the KPD. In the first interview, Holger commented that he did not think his pronunciation was a big obstacle to being understood, referring instead to vocabulary and grammar as bigger challenges. In my personal experience, though, I had considerable difficulty understanding Holger's pronunciation, and clearly the Korean teacher who scored the KPD often found his articulations ambiguous. Learners often viewed these surprises as targets and/or motivation for improvement. Hoa had an especially even-keeled yet highly motivated reaction to the surprises in her results:

I always think I have to keep up my efforts. It’s not a matter of feeling bad or feeling good [about the results].
The only thought that came to mind was “Wow, I really have to practice and study more.” (항상 노력해야 된다고 생각해요. 기분 나쁜 거 아니고 그냥 좋은 거도 아니에요. 그냥 더 많이 연습하고 공부해야 됐구나 라는 생각만 들었어요.)

Sometimes the surprises were pleasant, providing learners with a boost in confidence or an opportunity to reappraise their abilities. Consider the following excerpt from my interview with Xiu Ying, who had voiced some disagreement about the KPD's assessment of her /l/ pronunciation:

Dan: If you look at the back side, there are more detailed results. So /k/- (그 뒤쪽에서 보면 더 자세한 결과 나와요. 그래서 ㄱ은-)
Xiu Ying: [gasp] Really?! (진짜요?!)
Dan: Yes, 100%- It came out as 100% accuracy. (네 100%- 100% 정확도 나왔어요, 이거)
Xiu Ying: When I was learning [Korean] it was what I thought was the most difficult… (네가 배울 때 이거 제일 어려운다고 생각했는데)

Here, Xiu Ying's reaction seems to reflect pleasant surprise, perhaps at having overcome her initial struggles articulating /k/ without fully realizing it herself. At a more general level, Na commented that:

Pronunciation accuracy came out higher than my expectations. Maybe because of that accent of mine, my pronunciation confidence isn’t so high, and for the first time seeing pronunciation scores coming out on the higher end has me feeling pretty good. (발음 정화도 예상보다 더 높은 편이 나왔어요. 아마 그 억양 때문에 자기 발음 자신감 그렇게 높지 않아 가지고 처음으로 이런 좀 높은 편인 점수 나와 가지고 좀 기분이 좋아요.)

In Na's case, the pleasantly surprising results may provide a correction for her perhaps undeservedly low confidence in her pronunciation abilities. Noriko, another learner with low confidence in her pronunciation ability and many genuine difficulties, was very glad to see high scores for phonemes such as /ʨ, ʨʰ/. To me, these comments highlighted that it is not only weaknesses that can be informative or otherwise useful to learners. Clearly, to some extent at least, learners were interested in their underappreciated strengths as well.

Potential Application. Before considering whether and how learners thought they might apply their KPD results, it is worth briefly reviewing what learners said about their typical pronunciation learning activity. First, several participants were not enrolled in any formal Korean language courses at the time of field testing and the interview(s). These learners generally reported not having any current pronunciation learning activity outside of daily-life Korean interaction (e.g., in academic or social settings) and consuming Korean media such as television dramas and pop music. Second, some of the graduate and undergraduate students were enrolled in degree programs for Korean as a second/foreign language; these programs train students to be translators, interpreters, or Korean language teachers. As such, some students were taking courses on Korean phonology, on teaching Korean pronunciation (which included material on Korean phonology), or both. Some of these learners reported recording their pronunciations in connection with course assignments and receiving feedback from their instructor. Third, the learners who were taking classes in intensive Korean language programs generally reported minimal attention to pronunciation in their courses.
When I asked them about typical pronunciation learning activity in their Korean classes, they most commonly referenced general speaking activities with their classmates, instructors incidentally addressing major pronunciation mistakes during read-aloud activities, and occasional choral repetitions, mostly of single words (i.e., commonly used controlled pronunciation activities; Baker, 2014; Celce-Murcia et al., 2010). Outside of class, learners mostly mentioned watching dramas and perhaps trying to shadow lines, if they did anything at all. This information will be useful for interpreting their comments on what they might (and later, did) do after receiving their KPD results.

Turning to what learners said about potential subsequent pronunciation learning activity, which was framed as "study" (공부) and "practice" (연습) in interview questions, the majority of learners (n = 19) said they wanted to study or practice their pronunciation. Some learners, such as Hoa, commented on how the KPD results would help narrow down study targets: "Before coming here [to the interview], I always felt I had to study more. Everything. Now that I've come here, I understand which areas I should focus on more" (여기에 오기 전에도 항상 항상 더 연습해야 된다고 생각해요. 모든 거 다요. 오늘 와서 어떤 부분에 더 집중해야 된다는 것을 알게 되었어요). When it came to specific approaches or techniques for study and practice, learners came up with several ideas: using a textbook (Leo, Noriko), speaking Korean more with friends and getting feedback (Min, Ju-an, Xiu Lan, Maria, Aylin, Na), reading aloud and/or self-recording (Hoa), watching dramas (Chia-ling), and asking a teacher or tutor for help (Maria, Noriko). However, one common finding was a lack of knowledge about how to study pronunciation. Although learners were perhaps put on the spot to come up with something during the interview, many outright confessed that they did not know what to do that would help their pronunciation (Amber, Sofia, Fang, Holger, Jing, Yu-wen, Alice). Another recurring comment was that pronunciation practice was something they could not do on their own (Holger, Maria, Sakura, Alice), as they saw no way of getting feedback on whether they were pronouncing clearly or not. When I gave learners an opportunity for their own comments or questions at the end of the interview, several learners asked me for advice or additional ideas for studying pronunciation (Amber, Sofia, Fang, Jing, Noriko). In my responses, I mentioned activities such as shadowing, recording one's own pronunciation and comparing it to a model, using a textbook to focus on difficult sounds, using a program like Praat (for Amber specifically, who was familiar with the program and had a strong base in Korean phonetics and phonology), and doing listening practice such as the exercises found in pronunciation textbooks or dictation.

A Teacher's Perspective

Interviewing Jae-woo added a valuable perspective to the understanding and potential utilization of KPD results. Jae-woo taught Yuki in Level 4 of an intensive Korean program during the Fall 2018 semester and taught Yu-wen in Level 4 during the Summer 2018 semester. Before the interview, I obtained permission from Yuki and Yu-wen to share their information with Jae-woo. At the time of the interview, Jae-woo's semester with Yuki had recently finished, and it had been approximately three months since he had taught Yu-wen. During the interview, I asked Jae-woo if he would like me to play a sample of Yu-wen's speech (the Independent Speaking task) to jog his memory.
Jae-woo said that he could remember, but that hearing the speech sample would help him remember more accurately, so I played Yu-wen's file before asking Jae-woo to reflect on her pronunciation. To facilitate comparisons between learner, teacher, and KPD perspectives on pronunciation difficulties, I have summarized and compiled the information in Table 8.2. The self-assessment column contains phonemes that the students indicated were especially difficult to produce on their paper self-assessment. The teacher observations are based on Jae-woo's interview comments, and the KPD results are based on diagnostic flags for production phonemes and supplemental information (with a < 80% criterion) from the first page of the score reports.

Table 8.2
Multiple Perspectives on Pronunciation Difficulties

Yuki
  Self-Assessment*: /k*, t*, t, p, ʨ*, s*, n, ŋ, l, ʌ, o, u, ɯ, w, j/
  Teacher's Observations: Phonemes /o, ʌ/, typical Japanese L1 influences on other phonemes; Contexts: syllable coda, consonant clusters; Other: unexpected pitch-accent
  KPD Results: Phonemes /ʨʰ, kʰ, pʰ, p*, k*, t*, ʨ*, t/; Features: aspirated, tense, affricate; Contexts: initial consonant

Yu-wen
  Self-Assessment*: /k*, ʨ, ʨ*, s*, ŋ, l, w, j/
  Teacher's Observations: Phonemes /l, u, o/, broad L1 Chinese interference; Contexts: syllable coda (esp. /l/); Other: lack of facial expression, lack of gesture, muted physical articulation of speech sounds
  KPD Results: Phonemes /s*, t, p*, ʌ, l, ŋ, ʨ, ʨ*, j/; Features: tense, fricative, sonorants; Contexts: final consonant

Note. Bolded elements indicate agreement among two or more sources. *Both learners had a median and mode of 4 (out of 7) on their self-assessments; phonemes shown were rated at 3 or below in production.
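To make the agreement criterion in the table note concrete, the following minimal sketch (in Python; the phoneme sets are copied from the Yu-wen entries of Table 8.2, and this is an illustration rather than analysis code used in this study) identifies the phonemes named as difficult by two or more of the three sources:

# Phoneme sets for Yu-wen as listed in Table 8.2
self_assessment = {"k*", "ʨ", "ʨ*", "s*", "ŋ", "l", "w", "j"}
teacher = {"l", "u", "o"}
kpd = {"s*", "t", "p*", "ʌ", "l", "ŋ", "ʨ", "ʨ*", "j"}

sources = [self_assessment, teacher, kpd]
# Phonemes identified as difficult by at least two of the three sources
agreed = {p for p in set().union(*sources)
          if sum(p in s for s in sources) >= 2}
print(sorted(agreed))  # ['j', 'l', 's*', 'ŋ', 'ʨ', 'ʨ*']

Here only /l/ is shared by all three sources; the remaining agreements hold between the self-assessment and the KPD, previewing the pattern of overlap discussed below.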
Interpretation. As an experienced teacher with strong knowledge of Korean phonology, Jae-woo considered the KPD results more critically than the learners did. At first glance, Jae-woo was not quite sure how to interpret the information on the KPD, remarking that:

At first, not knowing how the mechanism worked for scoring these two [Yuki and Yu-wen], the results were a little vague to me- when I looked at it, everything said "pronunciation is difficult" and it seemed like anything that was difficult for foreigners was included. (일단 그 둘이 어떤 그 메커니즘으로 만들어졌는지 제가 정확히 모르기 때문에 이 결과에 대해서도 역시 조금 막연하다 막연한 부분이 있는데 볼 때는 다 발음하기 어렵다는 외국인들이 발음하기 어려운 발음들이 다 대부분 포함이 되어 있는 거 같아요.)

However, as we talked more about the results and Jae-woo asked several detailed questions (about how scores were calculated, standards for scoring production and perception, example words in the first-page explanations and in the second-page example column, etc.) and became more familiar with the structure of the test, he seemed to move past his initial skepticism and "got a feel for what it was about" (어떤 걸 얘기한는구나를 느낄 수 있었어요). Like some of the learners had commented, he saw the KPD information as filling in gaps in what he was able to observe or perceive:

The biggest reason [the results are useful] is that even though I know the students and what their pronunciation difficulties are, like we've talked about here, there are limits on content and limits on pronunciations that I hear. Right? Like during reading class time I can hear students, and when we share a new text it would be great if I could judge how students pronounce sounds that are in that text, but the fact is the classroom environment isn't like that. Considering the education is centered on must-teach grammar and sentence patterns, what I didn't know about the students' pronunciation, even though it appeared in the [KPD] results, works out to about 50%. (가장 큰 이유는 제가 학생들을 알고 있고 그 사람들의 발음이 뭐가 문제다라고 여기서 이야기 있긴 하지만 제한된 내용과 제한된 발음을 들을 뿐이에요 그죠 제가 뭐 읽기 시간을 통해서 들을 수도 있고 새로운 텍스트를 나눠주고 텍스트 안에 있는 여러가지 소리 듣고 판단하면 좋겠지만 사실 수업 환경이 그렇지 못 하잖아요. 가르쳐야 하는 문법 문형 중심의 교육이다 보니까 학생들이 발음하는 것들은 결과에서도 그대로 나타났지만 저는 그 학생들이 가지고 있는 발음 문제점에 한 50% 정도 밖에 모르고 있었던 셈이죠.)

Jae-woo's comments here mirror what Lado (1961) wrote about the limits of a teacher's observations in identifying a full range of specific learner difficulties. Further, what is in theory possible to accomplish in the classroom will not always happen, and language education which prioritizes other aspects of linguistic competence will impose additional limits on what even a knowledgeable and conscientious teacher can achieve through observation of students.

New Information, Gaps, and Incongruencies. The interview with Jae-woo provided a unique opportunity to triangulate the self-assessments of learners and the KPD results. Here, I highlight new information introduced by the KPD, gaps in all three assessments, and incongruencies among the three sources. It is worth pointing out that although I consider the KPD results to be generally reliable and reflective of pronunciation and perception abilities (see Chapters 3–6), gaps and incongruencies among assessments here should not, by default, be settled by what the KPD results say, as the KPD has measurement error, an arbitrary diagnostic flag criterion (though a seemingly appropriate one; see Chapter 5), and other limitations. It is quite possible that Yuki, Yu-wen, and Jae-woo made accurate observations that the KPD distorted or failed to detect.

First, based on the information in Table 8.2, it was clear to me that the learner self-assessments and the KPD, both of which featured items for every Korean segment, yielded much more detailed information about individual phonemes than the teacher's observations did. Yuki's self-assessment of production difficulties had moderate alignment with her KPD results, with several phonemes showing up as difficulties on both, and Yu-wen's self-assessment was remarkably well aligned with her KPD results. Although Jae-woo did identify a few specific phonemes that were troublesome for both Yuki and Yu-wen, he broadly characterized their segmental difficulties as L1-driven: Yuki had "the errors that Japanese speakers have when pronouncing Korean" (일본어 화자가 한국어를 발음할 때 나타나는 오류들이 그대로 있는 편인데) and Yu-wen had "all the difficulties that generally appear for Chinese speakers when they learn Korean" (그 중국어를 사용자들이 한국어 배울 때 나타나는 문제점들이 전반적으로 들어가 있어). In this sense, there was little specific overlap that I was able to observe between Jae-woo's segmental observations and either student's self-assessment or KPD results. There was some congruency between his assessment of Yu-wen's difficulties related to pronunciation contexts and the KPD, as Jae-woo's observation of her difficulties with syllable coda pronunciation (particularly for /l/) aligned with the KPD supplemental results about final consonants. However, he viewed Yuki as having a similar problem with codas and particularly with consonant clusters (which can be found in sequences such as CVC.CV in Korean), which he attributed to Japanese having highly restricted codas, an observation not reflected in KPD results.
Curiously, Jae-woo made no specific comments about generally difficult articulatory features (e.g., tense, aspirated) for either learner. Any observations he might have had were possibly subsumed under his comments related to L1 influence (e.g., Japanese phonology lacks a tense feature for consonants). While the KPD clearly provided more details related to segments, what I found most interesting about Jae-woo's assessments of Yuki's and Yu-wen's pronunciation were aspects not covered by the KPD. For Yuki, Jae-woo talked at some length about her use of pitch accent and gave examples of how her pitch accent differed markedly from standard Korean. He went on to attribute this to her specific Osaka variety of Japanese. For Yu-wen, Jae-woo made comments about her muted oral articulation ("When pronouncing, she doesn't try to open her mouth much." 발음할 때 입을 크게 벌리고 노력하지 않은 편이에요.), which he attributed to an introverted personality. In our interview, Yu-wen revealed what she described as a complete lack of confidence in her pronunciation; Jae-woo appeared to be cognizant of this. Jae-woo saw Yu-wen's muted style of communication extending to supporting strategies, noting that Yu-wen did not utilize much facial expression or gesture when she spoke. He thought such strategies could help an interlocutor cope with her sometimes unintelligible pronunciation (at one point, Jae-woo commented that Yu-wen was only about 70% as intelligible as Yuki, who despite occasional errors was generally intelligible in communication). Thus, I found Jae-woo's observations, while not as fine-grained at the phoneme level, to contribute to a more well-rounded understanding of Yuki's and Yu-wen's pronunciation challenges. There was also information not provided by the KPD that Jae-woo would have liked to know, perhaps to supplement his limited opportunities to observe student pronunciation in detail: phoneme-level information on difficult pronunciation contexts, e.g., whether a learner's pronunciation difficulties with /l/ were related to syllable codas (as he had observed with Yu-wen). Jae-woo reiterated this request several times throughout the interview. Such comments relate to the issue of grain size in diagnostic assessment, and clearly, Jae-woo hoped for even finer details.

Potential Application. As with the previously discussed learner findings on the potential application of KPD results, it is first crucial to consider Jae-woo's comments on his typical pronunciation teaching as well as his beliefs about pronunciation teaching. With respect to the latter, Jae-woo placed the greatest importance on helping learners be able to communicate with Koreans. While the framing of communicating with Koreans might be seen as narrow or as prioritizing nativelikeness, Jae-woo's original comment in Korean ("한국 사람들과 의사소통이 가능한 수준", level at which communication with Koreans is possible, emphasis mine) is something I interpreted as more or less oriented toward intelligibility, in line with Levis (2005). Jae-woo also stated his awareness of the importance accorded to pronunciation in language education scholarship and research but felt that this importance has not really carried over into Korean language teaching practice. Jae-woo described his typical pronunciation teaching as follows:
For example, when teaching lower levels where there is more focus on form, that part has some exclusive pronunciation practice and [for example] after the instructor reads [a word] the students repeat. Or, when presenting a sentence to highlight syntax, the instructor reads aloud and the students follow along, and then instruction can be given to students based on [pronunciation] errors that arise. (예를 들어서 초급 같은 경우에는 형태 좀 더 초점을 맞춰서 교육을 하고 있기 때문에 그 과정에서 여는 거 같은 발음은 연습할 뿐이고 교사가 읽은 후에 학생들이 따라 읽고 또는 통사적으로 문장을 교사가 읽으면 또 문장을 따라 읽고 거기서 생기는 오류들을 학생들에게 지도하는 편입니다.)

This conventional, choral repetition-based classroom pedagogy is in line with what many of the Korean language program students reported in interviews and supports Jae-woo's view that pronunciation is not generally treated adequately in Korean language education. He went on to ascribe this mismatch to curricular demands and a lack of pedagogical materials, which puts teachers in a difficult situation when it comes to devoting more time to pronunciation. In sum, while Jae-woo appeared well versed in Korean phonology (and at least some learner L1 phonologies) and believes pronunciation to be important, his teaching practice was constrained by the status quo.

Despite some initial skepticism and his critical interpretation of the KPD results for Yuki and Yu-wen, Jae-woo was positive about the potential for both learners and teachers to apply them:

Through diagnostic results like these learners can know what kind of difficulties they have and if instructors could incorporate these in class it seems like it could make for a really effective class. (이런 진단 결과를 통해서 학습자들이 어떤 발음 상의 문제점이 있는지를 알고 교사가 수업에 들어갈 수 있다면 훨씬 효과적인 수업이 될 수 있을 것 같아요)

This quote indicated to me that he sees value in learner awareness as well as potential for teacher-driven application. Although he acknowledged that Yuki and Yu-wen differed in their pronunciation weaknesses and that students from the same L1 could have different profiles, he felt that "90%" of learners from the same L1 background would have the same pronunciation difficulties, barring any extensive time in a target-language environment or extensive self-study. He imagined separate pronunciation classes for students of different L1 backgrounds, an idea I found concordant with his earlier description and attribution of learner pronunciation difficulties along lines of L1 interference. In these classes, he would use repetition activities but also add self-listening, perhaps keying into the perception information in KPD score reports. He also thought it would be helpful to correct students' place of articulation ("조음 위치를 교정하는 거", place of articulation correction), which I took to mean providing explicit articulatory instruction (an approach well supported in the pronunciation instruction literature; Derwing & Munro, 2015; Derwing et al., 1998; Lee et al., 2014). While I had hoped for comments more specific to Yuki's and Yu-wen's difficulties, I did find it interesting that many of the teaching ideas brought up by Jae-woo were not part of what he is typically able to do in his classroom.

Learner Utilization and Impact

Fourteen of the 21 learners were available to complete a second interview, which focused on their application of KPD results and their pronunciation learning activity. During the second interview, I also had them retake the KPD. The second-interview data provided a better, more concrete understanding of how Korean learners might apply the information from their KPD score reports, compared to their speculative comments from the first interview.
The quantitative KPD retest data, though small in scale, shed light on the link between pronunciation learning activity and measurable pronunciation development. To a limited extent, examining the KPD test-retest data alongside learning activity also allowed me to consider measurement stability and sensitivity. For the sake of coherence and conciseness, I focus my reporting of findings primarily on the production of Korean phonemes rather than on supplementary information (features, contexts), with some consideration of perception at a broad level. In what follows, I first consider quantitative data describing the differences in phoneme perception and production from initial test to retest, followed by an analysis of the interview data to connect learner activity and perceptions with retest scores.

Changes in Production and Perception. Overall, the 14 learners made modest improvements to their production and perception of Korean phonemes over the 2 to 4 months between initial KPD and retest (Table 8.3). On average, learners became 1% more accurate in their average phoneme production and 2% more accurate in phoneme perception. It is worth pointing out that phoneme perception averages were lower to begin with, making the somewhat larger gains unsurprising; there was also greater variability in phoneme perception accuracy. In terms of diagnostic flags, learners were able to ameliorate less than one net phoneme flag on average.

Table 8.3
Group-Level Summary of Changes in KPD Production and Perception Scores

                              Production            Perception
                              mean      SD          mean      SD
Initial Parcel Average         88%       5%          79%       9%
Retest Parcel Average          89%       5%          81%       9%
Change                          1%       4%           2%       5%
Initial Flag Count             6.43      3.27        12.79     4.44
Retest Flag Count              6.14      2.44        12.21     5.34
Change in Flag Count          -0.29      1.94        -0.57     2.87

Note. Based on 14 learners who completed the KPD a second time. Diagnostic flags based on < 80% accuracy criterion.

Throughout this subsection, as well as the subsection on learner utilization and impact, readers will find it helpful to refer to Table 8.4. Table 8.4 summarizes the differences in KPD production scores from initial test to retest, as well as the learners' descriptions of their pronunciation learning activities. Learners are listed in the table in descending order according to the magnitude of improvement in their average phoneme accuracy from initial testing to retesting. After each learner's name, the table contains information on average phoneme production accuracy. This is followed by information on diagnostic flags at initial KPD and retest. The last entry of the diagnostic flag part of the table, labeled Description, uses a - sign to note which phoneme flags did not appear again on the retest results and a + sign to note which flags newly appeared on the retest.

Some learners made impressive accuracy gains and showed largely expected patterns in phoneme flag reduction. For example, Maria improved her production accuracy by 7% and removed one phoneme flag without adding any new flags. Similarly, Noriko's average production accuracy improved by 5%, and she was able to remove six phoneme flags (though she added two new ones at retest). For Maria, a student with limited Korean experience to begin with (an EIT score of 18/120, enrolled in Level 2 of an intensive Korean program), the magnitude of improvement in just over two months is not surprising (especially considering her specific learning activity, discussed later).
Noriko, however, was in a Level 5 course at the time of her initial test and had a mid-range EIT score, yet was able to make noticeable improvements. Ju-an's results are interesting—a gain of 4% accuracy and a net loss of two diagnostic flags—because she had high production accuracy to start with (92%) and considerable Korean experience (EIT 91/120, enrollment in a Korean-language master's degree program, and one year of residence in South Korea). Not all learners' KPD results showed signs of progress. In the middle of the pack in Table 8.4 lies Amber, a multilingual student from Hong Kong with high levels of Korean experience and a very high average phoneme production score at her initial KPD: 94%. Amber's average accuracy showed virtually no change at retest, and she only shuffled two phoneme flags, with a net loss of zero flags. Several other learners showed small decreases in average production phoneme accuracy and unclear patterns in diagnostic flags. At the extreme, Leo, an English-medium program undergraduate with moderate Korean proficiency but extensive in-country experience (EIT 38/120, 5 years of residence in South Korea), saw on his retest a decrease in average phoneme accuracy of 9% and the addition of four diagnostic flags.

Table 8.4
Individual Summaries of Changes in KPD Production Scores and Learning Activity

Maria
  Avg. Phoneme Acc.: initial 86%, retest 93%, change 7%
  Diagnostic Flags (n): initial 6, retest 5, change -1; Description: - /t/
  Learning Activity: Thought a lot about results, esp. tense consonants. Paid extra attention to her teacher's pronunciation of difficult sounds. Started to visualize written form of words to aid in remembering to articulate tense sounds. Began exaggerating tenseness. Asked her teacher for feedback on her pronunciation.

Noriko
  Avg. Phoneme Acc.: initial 79%, retest 84%, change 5%
  Diagnostic Flags (n): initial 13, retest 9, change -4; Description: - /ŋ, t*, n, p, ɯ, j/ + /s*, ʌ/
  Learning Activity: Met with a tutor once a week for a month to work on pronunciation (did not show KPD scores to tutor). Did typical class activities such as read aloud and presentations. Watched Korean TV, studied for TOPIK listening.

Holger
  Avg. Phoneme Acc.: initial 80%, retest 83%, change 4%
  Diagnostic Flags (n): initial 13, retest 11, change -2; Description: - /t*, p*, s, j/ + /u, ɛ/
  Learning Activity: Thought about results frequently. Practiced reading sentences aloud; asked language exchange partner to correct mispronunciations when reading news articles aloud.

Ju-an
  Avg. Phoneme Acc.: initial 92%, retest 95%, change 4%
  Diagnostic Flags (n): initial 5, retest 3, change -2; Description: - /t, k*, s*, u/ + /t*, ʨʰ/
  Learning Activity: Memorized short list of weaknesses and tried to keep them in mind while interacting, trying to pronounce those sounds more clearly.

Jing
  Avg. Phoneme Acc.: initial 85%, retest 87%, change 2%
  Diagnostic Flags (n): initial 6, retest 7, change 1; Description: - /t, t*, k, ʌ, o/ + /s*, p*, kʰ, k*, l, u/
  Learning Activity: Did not think about results or practice much outside of incidental Korean use at work or with boyfriend. Reported paying more attention to syllable coda sounds.

Fang
  Avg. Phoneme Acc.: initial 88%, retest 90%, change 2%
  Diagnostic Flags (n): initial 6, retest 5, change -1; Description: - /s*, t*, ʌ/ + /t, u/
  Learning Activity: Paid more attention to difficult sounds in daily use. Took a Korean pronunciation class, but it had little practice opportunity. Worked on /l/ pronunciation by learning a popular song. Paid attention to expressions her Korean coworkers used with customers.

Xiu Ying
  Avg. Phoneme Acc.: initial 92%, retest 95%, change 2%
  Diagnostic Flags (n): initial 5, retest 4, change -1; Description: - /kʰ/
  Learning Activity: Did not think much about results or practice on her own. Teacher in translation/interpretation course corrected imprecise pronunciation. Used Korean informally with friends. Paid more attention to some difficult targets in general use.

Amber
  Avg. Phoneme Acc.: initial 94%, retest 94%, change 0%
  Diagnostic Flags (n): initial 5, retest 5, change 0; Description: - /p*, u/ + /kʰ, t*/
  Learning Activity: Shared her results with Korean friends in phonology/pronunciation course. Little focused practice and did not think about results too often.

Hoa
  Avg. Phoneme Acc.: initial 93%, retest 92%, change -1%
  Diagnostic Flags (n): initial 5, retest 5, change 0; Description: - /p, t*, ʨʰ/ + /s*, p*, k*/
  Learning Activity: Used Google voice-to-text technology to practice, esp. /p, pʰ/. Used proverb and expression books to find meaningful language to practice pronouncing. Practiced TOPIK listening; did some shadowing of passage extracts.

Na
  Avg. Phoneme Acc.: initial 91%, retest 90%, change -1%
  Diagnostic Flags (n): initial 5, retest 7, change 2; Description: - /t*/ + /ʨʰ, k, y/
  Learning Activity: Did some practice of difficult sounds: individual words and sentences with feedback from a Korean friend. Tried to learn more phonological processes.

Min
  Avg. Phoneme Acc.: initial 98%, retest 96%, change -2%
  Diagnostic Flags (n): initial 1, retest 2, change 1; Description: - /l/ + /j, ʨ*/
  Learning Activity: Little feedback from teacher in regular Korean class but did get some advice about syllable codas. Tried to pay more attention to her /l/ pronunciation in daily life. Took a Korean phonology/pronunciation class during fall semester, which included practice opportunities and self-recording homework.

Sofia
  Avg. Phoneme Acc.: initial 88%, retest 86%, change -3%
  Diagnostic Flags (n): initial 7, retest 7, change 0; Description: - /tʰ, j/ + /pʰ, ʌ/
  Learning Activity: Did not think much about results. Could not take a Korean class in current semester. No specific pronunciation practice. Used Korean in daily life and watched Korean dramas.

Sakura
  Avg. Phoneme Acc.: initial 83%, retest 81%, change -3%
  Diagnostic Flags (n): initial 9, retest 8, change -1; Description: - /ʨ*, ɯ, w, y/ + /t, n, ʌ/
  Learning Activity: Did not practice or study much. Asked a Korean friend for confirmation of her difficulties with a few sounds. In-class pronunciation feedback focused mostly on phonological processes. Bought a pronunciation textbook but did not use it.

Leo
  Avg. Phoneme Acc.: initial 90%, retest 81%, change -9%
  Diagnostic Flags (n): initial 4, retest 8, change 4; Description: + /tʰ, pʰ, p*, k/
  Learning Activity: Did little specific pronunciation practice. Spoke Korean in social settings, watched Korean YouTube.
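To make concrete how the figures in Tables 8.3 and 8.4 are derived, the following minimal sketch (in Python, with hypothetical parcel scores; this is an illustration, not the actual scoring code used in this study) shows the computation of average phoneme accuracy, diagnostic flags under the < 80% criterion, and the -/+ flag-change descriptions:

# Illustrative sketch only: hypothetical parcel scores, not actual KPD data.
FLAG_CRITERION = 0.80  # parcels below 80% accuracy receive a diagnostic flag

def parcel_average(parcels):
    # Mean accuracy across phoneme parcels (Avg. Phoneme Acc. in Table 8.4)
    return sum(parcels.values()) / len(parcels)

def flags(parcels):
    # Set of phonemes flagged under the < 80% criterion
    return {p for p, acc in parcels.items() if acc < FLAG_CRITERION}

def change_description(initial, retest):
    # "-" = flags cleared at retest; "+" = flags newly appearing at retest
    cleared = flags(initial) - flags(retest)
    added = flags(retest) - flags(initial)
    net = len(flags(retest)) - len(flags(initial))
    return cleared, added, net

# Hypothetical learner whose /t/ flag clears at retest (cf. Maria's pattern)
initial = {"/t/": 0.60, "/t*/": 0.75, "/l/": 0.90, "/ʌ/": 0.85}
retest = {"/t/": 0.85, "/t*/": 0.75, "/l/": 0.90, "/ʌ/": 0.85}
print(round(parcel_average(retest) - parcel_average(initial), 3))  # ≈ 0.062
print(change_description(initial, retest))  # ({'/t/'}, set(), -1)

Applied to each learner's parcel scores, these quantities yield the individual rows of Table 8.4; averaged across the 14 learners, they yield the group summary in Table 8.3.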
Application of KPD Results. Follow-up interviews with learners illuminated the quantitative test-retest data and showed a range of ways that learners applied what they learned from their KPD results. Learners with some of the largest improvements to their KPD results reported engaging in sustained and focused pronunciation learning activity. In my view, Noriko's learning activity demonstrated the greatest investment: Noriko hired a tutor specifically to work on her pronunciation. She could only afford this for one month, and found the experience of getting intensive pronunciation feedback a little "scary" ("무서워요"), but ultimately found it helpful. Noriko described some of what she did with the tutor as reading aloud and getting evaluation and corrections from the tutor; she further remarked that "I couldn't distinguish things like this [on my own]" (이런 거는 제가 구별이 할 수 없었어요) (this read-aloud with feedback activity is similar to the tandem exercise described in Horgues & Scheuer, 2014). Beyond this specific pronunciation learning activity, Noriko reported engaging in general speaking and listening practice in class, and extra listening practice outside of class for pleasure (watching Korean dramas) and test preparation (TOPIK listening section). Maria, who showed the greatest overall improvement from initial test to retest, also revealed a considerable degree of engagement with her KPD results and commitment to pronunciation learning activity. Similar to Noriko's seeking of external help, Maria reported showing her KPD report to her teacher and asking for additional feedback on her pronunciation, which she received periodically after class: "So after class I would get the feedback, and she would say 'No your pronunciation is not good, you are still doing this wrong.' She mentioned about the 드[/tɯ/], 트[/tʰɯ/]." Interestingly, aligned with this comment, /t/ was the one phoneme flag that Maria was able to clear on her KPD retest. On her own, she reported paying more attention to how her Korean teacher produced difficult sounds, and she also made efforts to pronounce tense consonants more exaggeratedly.
Holger and Hoa were two other learners whose learning activity stood out. Holger, who made some noticeable improvements on his KPD retest, reported regular sentence read-aloud practice and working on pronunciation with a language exchange partner. Hoa, who had rather high production accuracy to begin with yet did not make many overall gains, took a more tech-infused approach: She used Google's automated speech recognition (ASR) service to work on her pronunciation (see McCrocklin, 2019, for a classroom-based application of ASR for pronunciation instruction), with a special focus on her /p-pʰ/ contrast (encouragingly, she did manage to lose her /p/ diagnostic flag at retest). Finally, Fang reported paying attention to difficult sounds in daily use; more specifically, to work on /l/, she practiced the popular Korean children's song "Baby Shark" ("상어 가족", Pinkfong, 2016), which features a nonlinguistic refrain of /t*u.lu.lu.t*u.lu/ ("뚜루루뚜루").

While some learners did not engage much with specific, focused pronunciation learning activities, they did describe how the KPD results led to awareness-raising and low-level, continuous application of results in daily language use. Ju-an, who I previously noted had impressive gains relative to her initially high production accuracy, reported memorizing her major difficulties and then reflecting on them whenever an interlocutor had difficulty understanding something she said. In addition, she generally tried to be more conscious of her articulation of difficult sounds. Xiu Ying and Jing, who both posted modest improvements on their retest results, did not engage in much focused pronunciation learning activity but did report paying more attention to difficult sounds in their daily language use.

Last, some learners neither engaged in much focused pronunciation learning activity nor tried to maintain awareness of difficult sounds in daily use, though many reported engaging in general listening and speaking practice. Unsurprisingly, many of these learners showed little or no evidence of improvement in their KPD retest scores: Leo, Sakura, Sofia, and Amber all followed this pattern. Despite their lack of pronunciation learning activity, they did show at least some initial engagement with results. Leo, Sakura, and Amber all reported talking with Korean friends about the results. Amber even went so far as to say words/syllables and ask her classmates what consonants they heard her say, focusing specifically on the tense consonants that were diagnostically flagged on her score report. Amber found agreement between her classmates' uncertainty about her tense consonant production and her KPD results. Ultimately, other demands prevented further engagement with results. For example, Sakura reported actually buying a Korean pronunciation book but was unable to free up enough time to study it, and Sofia and Leo were both kept busy by their undergraduate coursework and part-time jobs.

Perceptions of Change. As might be expected given the varying levels of learning activity and varying levels of initial pronunciation accuracy, learners varied in their perceptions of change from initial test to retest. Six learners stated that they noticed some kind of improvement to their pronunciation (Min, Ju-an, Fang, Maria, Xiu Ying, and Jing).
Three of these six could describe their improvements in detail, though they said it was not easy to judge for themselves: Maria (improvements to /p*, ʨ, ʨ*/), Ju-an (improvements to /t, t*/ and /ʌ, u, o/), and Min (consonant relinking, a phonological process that is not directly assessed by the KPD). While Maria did not clear her diagnostic flags for /p*, ʨ*/, she did make substantial improvements, going from 0% to 75% accuracy for /p*/. Ju-an did clear her diagnostic flags for /t/ (going from 78% to 89% accuracy) and /u/ (going from 75% to 100% accuracy), though she regressed in accuracy on /t*/ (dropping from 100% to 75%).

Two other learners, Amber and Noriko, spoke of a lack of development in relatively certain terms. Amber felt confident that she had not made any substantial improvements, especially in relation to the tense consonants that still eluded her. In her case, while she did show improvement in one tense consonant (/p*/), a different one (/t*/) was newly flagged on retest, and her overall production phoneme average showed virtually no difference at retest. Noriko, despite making considerable gains in her KPD scores, did not perceive much improvement and felt that her difficulties persisted. To some extent, she was not wrong: Even at retest, she had a total of 9 diagnostic flags on production phonemes and still had room for general improvement (84% average production phoneme accuracy). With some limits, it appeared to me that perception of improvement (or lack thereof) was possible and reasonably accurate for some learners, in some cases even at the phoneme or feature level.

The remaining learners expressed uncertainty when it came to noticing changes in their pronunciation. Some learners reported positive or negative impressions of their progress but qualified them immediately before or after by saying that they were not sure (Hoa, Sofia, Na) or could not tell (Leo, Holger, Sakura). Despite Hoa's uncertainty and overall limited development, she did nonetheless appear to improve on one phoneme that she had practiced, /p/. Holger, who could not tell on his own whether he had progressed, posted relatively strong improvements on his KPD retest. Where these less-certain learners had difficulty judging their own gains, they sometimes turned to the assessments of others: Sofia reported that customers at the restaurant where she worked understood her better compared to a few months prior, while Hoa described comments from Korean friends about reduced Vietnamese-like intonation in her speech and noted that Google's ASR still indicated she had some difficulty with /p/ and /pʰ/.

Discussion

In this chapter, I drew on interview data to report on the interpretation and utilization of KPD results by key stakeholders: Korean learners and a Korean teacher. Furthermore, I brought in KPD retest data to shed light on the connection between utilization of KPD results and subsequent learning, a key consideration in diagnostic language assessment and, in turn, an important piece of evidence for the utilization inference in the KPD's validity argument. In this discussion section, I reflect on the findings with respect to my primary research questions and then offer additional considerations arising from analysis of the data.

RQ7: How do (a) teachers and (b) learners understand KPD score reports? To what extent do they learn anything new from KPD score reports?
The teacher I interviewed, Jae-woo, came to understand KPD score reports as a source of information on segmental pronunciation issues that filled gaps in his own observations. At a more basic level, he had no trouble understanding the content of the score report, though he needed more explanation of the test structure and scoring procedures in order to develop a better sense of how to interpret the information it contained. This points to a need for documentation to be made available to test users. While I have created documentation of the KPD design and task/item specifications, I did not provide these to Jae-woo, nor have I developed more succinct, stakeholder-friendly documentation that would undoubtedly aid in appropriate score interpretation.

Learners, who had all taken the KPD themselves and thus had at least a first-hand understanding of the KPD design, tended to view the KPD as an external, more objective assessment of their pronunciation weaknesses and strengths. As Chapter 7 illustrated, fine-grained self-assessment of segmental production and perception abilities was not easy for learners to do accurately, and learner uncertainty about their own strengths and weaknesses was a topic that came up during interviews as well. In addition to filling in gaps in their knowledge, the KPD results also helped learners confirm or reject what they had (uncertainly) thought about their own abilities. On the other hand, several learners had difficulty reconciling their lower perception scores, which were presented on the same scale as the production scores. Although it may be the case that some learners do have substantial perception difficulties, much of the disparity could be attributed to differences in scoring standards across the two modalities, as discussed in previous chapters. Thus, additional explanation or re-scaling of perception scores may help learners more appropriately interpret their scores.
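As a purely illustrative sketch of what such re-scaling could look like (this is not a feature of the current KPD, and the sample values below are hypothetical), perception and production percentages could each be converted to percentile ranks within their own section's field-test distribution, so that a learner reads each score against other learners' scores in the same modality rather than against the other section's raw percentages:

def percentile_rank(score, sample):
    # Percentage of scores in the reference sample at or below this score
    return 100.0 * sum(s <= score for s in sample) / len(sample)

# Hypothetical section-specific reference distributions from field testing
perception_sample = [0.62, 0.70, 0.74, 0.79, 0.81, 0.83, 0.88]
production_sample = [0.80, 0.84, 0.87, 0.88, 0.90, 0.93, 0.96]

# Raw scores of 81% (perception) and 88% (production) look lopsided, but
# relative to each section's own distribution they are broadly comparable:
print(percentile_rank(0.81, perception_sample))  # ≈ 71st percentile
print(percentile_rank(0.88, production_sample))  # ≈ 57th percentile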
Although learners appeared to immediately understand the information on the first page of the score report as feedback on weaknesses and intuitively grasped the meaning of the percentages for each phoneme on the second page, learners' understanding of their score reports was not free of stumbling blocks. Learners at all levels of overall Korean proficiency were unfamiliar with some linguistic terminology (e.g., 경음 tense), and learners with lower levels of proficiency showed some difficulty comprehending the orienting prose at the top and bottom of the first page. Furthermore, learners frequently asked whether the example words given for phonemes on the second page of the report were from the production or perception section of the KPD (interestingly, I had separated the example words by modality in an earlier version of the score report). Thus, improvements to the score report addressing these stumbling blocks could improve learner interpretation and perhaps utilization of KPD results.

Key to the utilization of diagnostic instruments is that the diagnostic feedback provides information which users could not have easily obtained otherwise; after all, it would make little sense to go through the process of administering a diagnostic test if teacher and learner observations could provide the same benefits. The quantitative comparison of learner self-assessments and KPD results in Chapter 7 suggested that learners had limited awareness of their strengths and weaknesses in the production and perception of segmentals, and the interview data explored in this chapter provided further support for learners becoming aware of new information through KPD score reports. This was not just a matter of students becoming aware of weaknesses, but also of their becoming aware of (and/or more confident in) their strengths. The teacher perspective also supported the idea that KPD results can provide additional information for test users. Jae-woo commented that his observations of students, whom he had taught for a full semester, amounted to perhaps half of the picture.

RQ8: Do learners report any changes in their self-study routines and/or their attention to phonological form in formal or informal learning situations?

Some learners reported concrete changes in their learning activity in response to the KPD results, but not all did. Among those who engaged in focused pronunciation learning activities, several learners discussed activities that are well supported in research or long-standing pedagogical practice: shadowing (Foote & McDonough, 2017), reading aloud with ASR feedback (McCrocklin, 2019) or partner feedback (Horgues & Scheuer, 2014), practice through songs (Graham, 2001; Richards, 1969), and seeking feedback during (or after) meaning-focused interaction (Saito & Lyster, 2012). In the initial interview, some learners reported a lack of knowledge about how to study or practice pronunciation effectively. This suggests that the utilization of the KPD could be enhanced by providing learner-friendly information on pronunciation learning activities and/or by having results delivered by a teacher who can provide more specific guidance in this area.

Learners also reported that the KPD results guided them to effortfully raise their awareness of difficult phonemes, to pay more attention to how they pronounce difficult phonemes in typical language-use situations, or both. This combination of awareness and deliberate attention could lead to increased levels of incidental focus on form which learners may otherwise miss out on during typical meaning-focused Korean use (Kennedy & Trofimovich, 2010; Saito, 2018; Schmidt, 1990, 1993). A minority of learners who completed the second interview and KPD retest reported doing little if anything with the KPD results. While they had perfectly understandable reasons for not committing to additional pronunciation learning activities (e.g., limited time), the more important takeaway is that diagnostic assessment alone cannot be considered an instructional intervention; learning activity naturally depends on learner and/or teacher effort. As Alderson et al. (2014) emphasized, test users are at the heart of diagnostic assessment, and the use of a diagnostic instrument is just one phase of a larger process. Nonetheless, the KPD was fruitfully applied by some learners, which is promising on its own, especially for learners not currently engaged in formal Korean instruction, and bodes well should the test be used by a knowledgeable teacher/diagnostician (such as Jae-woo) within a classroom context.

RQ9: Do learners show improvements (a) overall and/or (b) in weak areas after receiving and applying KPD feedback?
With many qualifications, I believe the answer to this research question is "yes." The learners who took focused, sustained measures to address their Korean pronunciation after receiving their initial KPD results made clear gains. Maria, Noriko, and Holger all took substantive action, guided by their KPD results, to improve their pronunciation, including self-study, paying closer attention to difficult sounds in their input and output, and seeking help from others (a teacher, a tutor, language exchange partners). It is worth noting that the learners who made the most impressive gains were generally those with some of the greatest initial production difficulties and only moderate amounts of Korean language experience, though not exclusively: Ju-an made impressive gains despite initially high levels of production accuracy and Korean exposure. Although directly relating this KPD-motivated and KPD-guided activity to the visible improvements at retest is difficult without a control group, I believe it is reasonable to conclude that beneficial outcomes of post-diagnostic learning activity are certainly possible. The learners who did relatively little with their results subsequently showed little improvement, either in global production accuracy or in the accuracy of specific problem phonemes, which provides some counterfactual support for this conclusion. Namely, when KPD results are not meaningfully applied, learners are not likely to experience improvements to their segmental pronunciation abilities over the course of 2 to 4 months (in the absence of other directed pronunciation learning activity). This counterfactual situation and outcome is intuitive (i.e., what improvements would be expected when no effort is made?) and also makes sense on a more theoretical level, given that learners' pronunciation development often plateaus, showing little to no change over extended periods of time, after a phase of rapid L2 phonological development that starts with initial exposure to the language (i.e., the Window of Maximal Opportunity; Derwing & Munro, 2015). Of course, the data on learning gains presented in this chapter are extremely small in scale, and larger-scale quantitative investigations would provide stronger support for the beneficial consequences of using the KPD to guide pronunciation learning activity. Further, the findings related to post-diagnostic learning activity and pronunciation development again underscore the conceptualization of diagnosis as a process that must feed into instruction (Alderson et al., 2014; Lee, 2015).

Additional Considerations

The test-retest results in this chapter provide additional glimpses into potential support for three other inferences in the KPD's validity argument: generalization, explanation, and extrapolation. While the data presented in this chapter have a variety of limitations, including the depth of interview questioning, learner (and interviewer, in some cases) language proficiency in the interviews, and the small number of test-retest participants, I find the implications for these inferences too interesting not to consider. Generalization inferences in validity arguments broadly pertain to the consistency of results across observations of test takers with the same, presumably static (or temporarily stable) level of ability in the attributes of interest.
Analyses related to the generalization inference should attempt to account for variation in scores across different forms of a test (e.g., test equating), across different human raters (e.g., inter-rater reliability or agreement), and across different points in time (e.g., test-retest reliability). Most commonly, however, generalization is investigated via estimation of internal consistency (e.g., Cronbach's alpha), which theoretically aligns with the average of all possible split-half reliability estimates (Crocker & Algina, 1986). This is actually among the weakest forms of evidence in support of the generalizability of test scores, as it only offers conclusions based on one administration at one point in time.
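To illustrate the contrast between these two kinds of evidence, the sketch below (in Python, with simulated data; these are not analyses reported in this study) computes a single-administration internal consistency estimate (Cronbach's alpha) and a two-administration test-retest correlation from a hypothetical learners-by-parcels score matrix:

import numpy as np

def cronbach_alpha(scores):
    # scores: learners x parcels matrix from ONE administration
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

rng = np.random.default_rng(1)
ability = rng.normal(0.85, 0.06, size=14)  # 14 learners with stable ability
occasion1 = np.clip(ability[:, None] + rng.normal(0, 0.05, (14, 20)), 0, 1)
occasion2 = np.clip(ability[:, None] + rng.normal(0, 0.05, (14, 20)), 0, 1)

print(cronbach_alpha(occasion1))  # internal consistency, one occasion only
# Test-retest: correlation of total scores across the two occasions
print(np.corrcoef(occasion1.mean(axis=1), occasion2.mean(axis=1))[0, 1])

The key conceptual difference is visible in the inputs: alpha never sees the second occasion, whereas the test-retest estimate requires stable ability across both administrations.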
With the KPD test-retest data, it was possible to (at least) consider the test-retest consistency of some individuals who took the test at two different points in time without engaging in behaviors that would produce a substantive change in their ability. Amber, Sofia, Sakura, and Leo appeared to do the least amount of pronunciation study and thus can be assumed to have had reasonably similar ability levels at initial test and retest, though Sakura's lower proficiency and limited exposure could have led to more development than the others. Three of these learners showed very little difference in KPD scores across the two observations, providing some support for the generalization of scores. Leo, however, had noticeably lower production scores on his second KPD, a direction of change that would not follow most predictions for L2 phonological development for a learner with five years of residence in the target-language environment. While intra-scorer variability may partially explain the discrepancy in Leo's scores, intra-speaker variation may play as large a role, or perhaps an even greater one. Recent work by Smith, Johnson, and Hayes-Harb (2019) on L2 intra-speaker variability in vowel production, one of very few such papers on L2 speaker variability, found that, while the L2 speakers in their study did not exhibit larger variation in vowel production than L1 speakers, the L2 speakers' variations were outside of L1 norms approximately 50% of the time. In the context of the KPD, although NS-like productions are not required, substantial deviations could nonetheless lead to unintelligible production of target phonemes and have a negative impact on scores. The phenomenon of intra-speaker variability could also (at least partially) explain subscore variability for someone like Amber, who, despite maintaining virtually the same production phoneme average across two points in time, varied slightly in individual phoneme accuracy scores.

Explanation inferences draw on a wide range of support, from documenting test-taker response processes to investigating relationships with external measures informed by theory and substantive empirical research (Chapelle et al., 2010; Kane, 2013). At the core of classical perspectives on validity, the key consideration is whether variations in scores produce (or reflect) variation in the ability measured (Borsboom et al., 2004). One of the most rigorous ways of investigating this relationship is by testing individuals at multiple points in time, before and after interventions or experience that theoretically should produce a change in the individual's underlying ability. Once again, the KPD test-retest data provide an interesting perspective on this aspect of score meaning. As I have already discussed at some length, learners who engaged in substantial learning activity over a period of 2 to 4 months, especially those with lower initial phoneme production abilities, showed changes in their KPD scores in the expected direction, while those who did less or started with higher abilities (or both) showed comparatively smaller change or little change at all. While the sample size and the specificity of learning activity and intervening exposure to the language are insufficient for rigorously testing this relationship and evaluating the magnitude of change, it was nonetheless promising to observe a chain of test score → learning activity → ability development → higher score.

Finally, interviews with learners and a teacher provided additional evidence pertaining to the extrapolation of KPD scores to other domains, in this case learners' daily life, social interactions, and classroom language use. To some extent, Jae-woo's comments about Yuki's and Yu-wen's pronunciation offered support for a connection between KPD scores and classroom language use, though this support was limited, in part due to Jae-woo's lack of specificity in his observations of the phoneme-level difficulties experienced by the two students. Like the self-assessments analyzed in Chapter 7, learner interview comments related to their (dis)agreement with KPD scores supported the notion that KPD scores reflect non-test performance reasonably well. Learner anecdotes of pronunciation and/or hearing difficulties were especially illuminating and persuasive due to their specificity. For example, Sofia could specifically and vividly recall how customers at her part-time job misunderstood her rendering of 꿀 (honey, /k*ul/), likely due to the tense stop /k*/ in the onset of the word. Leo shared an amusing anecdote about his difficulty pronouncing two of his Korean friends' minimal-pair names (differentiated only by an initial /ʨ-ʨʰ/ aspiration contrast). Some participants endeavored to find their own evidence to support the extrapolation of test results, such as when Amber tasked her Korean classmates with identifying whether she was making a tense or non-tense sound, and when Hoa checked her /p, pʰ/ pronunciation against Google's ASR. The specificity and vividness of this qualitative data provides substantial support for the KPD's extrapolation inference.

CHAPTER 9: SUMMARY OF FINDINGS AND EVALUATION OF THE VALIDITY ARGUMENT

In Chapter 2, I outlined a proposed validity argument for the interpretation and use of KPD scores. In this argument, I sketched out the kinds of information that could be used to support each inference in the argument, which allows test users to go from test observations to real-world, beneficial use of the KPD. For some of the earlier inferences in the KPD's validity argument, my already completed design and initial piloting efforts provided support (Chapter 3). However, for most inferences, I identified gaps in the necessary support, which led to the formation of research questions and the collection of data to answer them. I have reported on these findings over the last several chapters (Chapters 5 through 8). In this chapter, I return to the validity argument, summarize the evidence gathered, and interpret it with respect to the specific inferences necessary to support KPD score interpretation and use. Following this synthesis, I critically evaluate the strength of the argument and consider gaps and weaknesses to be addressed in the future.
Summarizing the KPD Validity Argument

In this section, I return to the proposed validity argument and synthesize all extant support for each inference. In what follows, I provide a formal, detailed rendition of the validity argument, following Chapelle et al. (2008, 2010): I articulate the warrant that explicates each inference, for each warrant I articulate key assumptions, and for each assumption I review the extant support.

Operationalization Inference

Warrant: Observations of learners' Korean segmental production and perception reveal underlying strengths and weaknesses in phonological knowledge and processing that are important to communication and the development of intelligible pronunciation.

Assumptions and Support: (1) Items represent the inventory of Korean phonemes. (2) Test tasks are designed with reference to theories of L2 phonological learning. (3) Test tasks are sufficiently delimited to lower-level subprocesses of speech production and perception.

With regard to the first assumption, the KPD suitably delimits the target domain to Korean segmental phonology, with tasks that exclude most suprasegmental aspects of pronunciation as well as opaque phonological processes found in spontaneous connected speech. Related to the second assumption, the KPD features both production and perception tasks, as informed by theories of speech learning (Flege, 1995) and empirical findings on the link between perception and production (Sakai & Moorman, 2018). Addressing the third assumption, KPD task designs limit the non-phonological resources necessary for learners to respond, requiring only knowledge of basic sound-script correspondences and commonly taught, high-frequency vocabulary. In this way, language production and perception on the KPD are relegated to lower-level subprocesses (Field, 2011, 2013), minimizing construct-irrelevant variance from higher-level subprocesses unrelated to segmental pronunciation knowledge and ability. Reviews of the literature in these areas are found in Chapter 2, and Chapter 3 and Appendices A and B contain detailed information on the development and specification of the test and its component tasks.

Evaluation Inference

Warrant: Observations of phoneme production and perception on the KPD are evaluated to yield scores that are (a) instructionally relevant, (b) indicative of strengths and weaknesses, and (c) in line with the ultimate goal of intelligible oral communication.

Assumptions and Support: (1) Task responses are scored based on appropriate criteria. (2) Measurement characteristics of the KPD differentiate learners by overall ability in perception and production, while phoneme parcel subscores provide appropriate diagnostic information.

Pertaining to the first assumption, evaluation of KPD responses is appropriate, well defined, and verified. Evaluation of production task responses is based on a clear criterion and heuristics that draw on research and best practices in L2 pronunciation pedagogy (i.e., Levis' Intelligibility Principle, 2005; see Chapter 2). Evaluation of the production tasks is aided by an easy-to-use scoring sheet, and training materials were created to orient new scorers to the scoring criteria. The perception task responses are evaluated based on accurate keys which were verified by a NS linguistic informant.
Furthermore, two rounds of piloting provided additional verification of the item keys (Chapter 3), as did the almost universally maximal scores of NS test-takers (Chapter 5). Regarding the second assumption, measurement analyses found that a measurement model based on phoneme parcels had statistical characteristics highly similar to those of models based on individual items (Chapter 5). At the same time, the phoneme parcels align with the intended use of the KPD, i.e., to provide information on phoneme-level strengths and weaknesses to guide instruction. In both CTT and Rasch analyses, some phoneme parcels had very low difficulty and/or discrimination. From a diagnostic perspective, however, this is fine: The major concern is the capability to detect low performance on phonemes, even ones that tend to be easy for most learners. Moreover, Rasch analyses of phoneme parcels showed that parcel information was greatest at lower score levels, supporting this aim.

Generalization Inference

Warrant: Observed KPD scores estimate learners' abilities with stability and are similar across scorers.

Assumptions and Support: (1) Items are sufficient in number and quality to yield stable estimates of overall production and perception abilities and of individual phoneme ability. (2) KPD production section scores are stable across scorers.

Relevant to the first assumption, KPD overall production and perception scores based on phoneme parcels are internally consistent and have adequate precision (Chapter 5). Internal consistency estimates for both individual-item and item-parcel scoring models were suitable for low-stakes assessment and provide positive evidence for the lower bound of the KPD's precision of measurement (Crocker & Algina, 1986). Additionally, there is limited support for the generalization of KPD results across test occasions (Chapter 8). In Chapter 8, several learners took an initial KPD and then took the test again approximately three to four months later, without having engaged in deliberate pronunciation learning activities. These learners saw little to no change in their overall KPD scores, as would be expected for individuals whose underlying abilities had not changed.

Inter-scorer agreement was also found to be high. At the individual item level and at the parcel-based diagnostic flag level, different scorers on average had nearly perfect levels of agreement. Phoneme parcel scores across raters varied widely from phoneme to phoneme, and many low estimates of interrater reliability were obtained. However, this was found to be due to a preponderance of very high scores with near-universal agreement for some phonemes. In sum, different scorers introduce little variability to KPD scores.
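The pattern behind those low interrater estimates can be illustrated with a small sketch (in Python, with hypothetical dichotomous scores; this is not the study's actual rating data): when two raters score nearly every response to a parcel as correct, exact agreement is high while a correlation-based coefficient collapses, because there is almost no score variance for the correlation to capture:

import numpy as np

rater_a = np.ones(20)   # 20 hypothetical item scores from rater A
rater_b = np.ones(20)   # ...and from rater B
rater_a[18] = 0         # each rater scores a single (different) item 0
rater_b[19] = 0

print(np.mean(rater_a == rater_b))          # 0.9 exact agreement
print(np.corrcoef(rater_a, rater_b)[0, 1])  # about -0.05 despite agreement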
Regarding the second assumption, the internal structure of the KPD generally aligned with expectations (Chapter 5). Overall production and perception section scores correlated with one another, and the pattern of correlations among overall task scores generally aligned with expectations. Pertaining to the third assumption, the hierarchy of item difficulties aligned with expectations and empirical findings (Chapter 5). One consonant that was easier than expected based on research findings was /l/. However, the ease of /l/ could be reasonably attributed to the KPD's scoring criteria and to the absence of similar Korean sounds that might otherwise render learner articulations ambiguous. For the fourth assumption, KPD parcel scores indicated several identifiable general profiles for difficulties in phoneme production and perception that were not simply determined by L1 influence or Korean proficiency (Chapter 6). Regarding the fifth assumption, KPD scores increased moderately alongside overall oral language proficiency, as expected (Chapter 7). Lastly, related to the sixth assumption, small-scale exploratory analyses of test-retest data suggest that KPD results appear to reflect changes in the underlying pronunciation abilities of learners (Chapter 8).

Extrapolation Inference

Warrant: The knowledge and abilities measured by the KPD are relevant to learner performance in general Korean oral communication.

Assumptions and Support: (1) Strengths and weaknesses in phoneme production and perception are related to pronunciation in general Korean language use.

As the KPD by design isolates aspects of L2 phonology and of language processing, an idealized one-to-one correspondence between KPD results and meaningful, spontaneous Korean language use could not (and should not) be expected. With this delimitation in mind, the alignment between KPD results, learner self-assessments of production and perception abilities, and learner errors in spontaneous, meaning-focused speaking provided mostly positive support for this assumption (Chapter 8). Additionally, alignment between a teacher's observations of two students and several learner anecdotes of production and perception difficulties provides additional support for the extrapolation of KPD results to strengths and weaknesses in general Korean use (Chapter 7).

Utilization Inference

Warrant: KPD phoneme scores and diagnostic flags are interpretable and useful to learners and teachers for planning pronunciation learning activity and raising awareness of difficulties.

Assumptions and Support: (1) KPD feedback is interpretable by learners and teachers. (2) KPD feedback can support instructional decisions.

Regarding the first assumption, key stakeholders (learners and a teacher) were able to appropriately and beneficially utilize KPD results (Chapter 8). Learners and a teacher were able to easily understand key information on KPD score reports related to phoneme strengths and weaknesses in each modality. However, learners struggled to interpret some supplemental information contained in the score report. Regarding the second assumption, several learners were found to engage in substantial pronunciation learning activity, ranging from exercises such as shadowing to deliberate awareness raising and attention to target phonemes in daily language use (Chapter 8). However, some learners had few ideas on how to apply the KPD results, and others engaged in little to no self-directed pronunciation learning activity after obtaining KPD results.
Nonetheless, many learners' self-assessments were found to contain misconceptions of their segmental strengths and weaknesses, highlighting the potential for KPD results to correct learner understandings and more usefully focus learners' awareness and learning efforts (Chapter 7).

Test Usefulness & Impact Inference

Warrant: Appropriate application of KPD scores by learners and teachers leads to beneficial outcomes through the development of more intelligible segmental pronunciation and accurate perception.

Assumptions and Support: (1) Application of KPD results contributes to pronunciation development.

Learners have the potential to fruitfully apply KPD results on their own through engagement in a variety of pronunciation learning activities (Chapter 8). For learners who sustained pronunciation learning activity to a sufficient degree, KPD retest results suggested meaningful improvement in pronunciation abilities. This suggests that utilizing the KPD can have beneficial consequences for pronunciation learning, fulfilling the primary purpose of a diagnostic assessment.

Evaluation of the KPD Validity Argument

Before proceeding to my evaluation of the KPD's validity argument, I must concede that I have become rather personally invested in the development and use of the KPD, and that a neutral party with an etic perspective would be the ideal evaluator. Nonetheless, dissertations require solo authorship, and so I have endeavored to be as objective (and self-critical) as possible. I hope that I do not fall too short of that goal.

Overall, the KPD's validity argument is well-supported, but that support is thinner toward the end of the chain of its constituent inferences. Support for the operationalization inference draws on well-researched findings from L2 speech learning and psycholinguistics in tandem with Harding et al.'s (2015) cutting-edge ideas on the design of diagnostic language assessment instruments. While Field (2014) pointed out that phonemes are not static, easily delimited entities in the minds of language users, and instead draw on a network of numerous variations due to contexts and speakers, phonemes as an abstraction of phonological knowledge serve as a useful heuristic that is interpretable by stakeholders. Furthermore, the KPD features several instances of each Korean phoneme, all in different phonological contexts, which at least partially addresses this potential weakness in operationalization. In sum, the inference that the KPD suitably operationalizes the target construct in alignment with desired measurement outcomes and uses is well supported.

Support for the evaluation inference can also be regarded as strong. The KPD production scoring criteria draw on Levis' (2005) Intelligibility Principle, which prioritizes effective communication and recognizes typical limits on (adult) L2 acquisition. The KPD's scoring guide, training materials, and scoring sheet facilitate consistent scoring. The degree to which KPD production scoring actually reflects intelligibility in naturalistic language use may be questionable, but the surprising findings for /l/ parcel difficulty suggest that, at the very least, the scoring of KPD responses genuinely did not require native-like articulation. For perception tasks, multiple rounds of piloting, consultation with linguistically-informed NSs, and NS KPD score data all contributed to the verification of answer keys, leaving very little room to question support for this inference.
In sum, the inference that the evaluation of test-taker responses is appropriate is strongly supported.

The generalization inference is well-supported. Conventional CTT and IRT estimates of reliability for the desired phoneme-parcel measurement model are adequate, especially when considering the KPD's relatively low assessment stakes. This lends a considerable amount of support to the generalization inference. The test-retest data are promising, but the sample was ultimately too small to provide substantial support; more research on test-retest reliability would be desirable. The evidence pertaining to inter-scorer agreement also adds strong support for this inference and is especially valuable considering that the KPD was designed to be scored locally by individual Korean teachers or tutors.

A wide range of evidence exists to support the explanation inference, but some of these sources are not without limitations. The use of cluster analysis to examine learner profiles was a useful way to examine broad differences in test-taker profiles, but ultimately it is unclear whether those clusters are stable and generalizable. It is hard to connect these clusters to any theory underlying L2 phoneme production and perception, but it does seem safe to interpret learners in the clusters with minimal difficulties as having reached (or nearly reached) a desirable, high level of phoneme control with limited need for additional instruction. The data pertaining to expected changes in KPD scores before and after a period of substantial pronunciation learning were exploratory and small in scale; clearly, more rigorous investigation of intra-individual changes in ability would be useful for supporting this inference. All in all, however, the evidence accumulated so far strongly suggests that KPD scores reflect learners' productive and perceptive abilities related to segmental phonemes.

Support for the extrapolation inference is perhaps the most challenging to interpret. Learner self-assessments provided some evidence of the extrapolation of KPD results to Korean use more generally, but this relationship had to be viewed as attenuated by limits in learner self-assessment accuracy. Similarly, alignment between KPD production phoneme results and phonological errors in spontaneous, meaningful speech was necessarily attenuated by the numerous influences on real-time speech that were intentionally excluded in the KPD's design. Felicitously, teacher observations of two learners and learner anecdotes from interviews augmented the support for this inference. Ultimately, however, extrapolation is one of the weaker links in the KPD validity argument.

Evidence collected to support the utilization inference was crucial with respect to the instructionally-relevant, diagnostic purpose of the KPD. Learner and teacher interpretation of phoneme-level strengths and weaknesses was strongly supported, though I identified some necessary changes to score reports. While some learners were able to devise and ultimately arrange for or carry out quality pronunciation instruction or learning activity, other learners were less capable of applying their results to learning. Thus, while support for the utilization inference is promising and indicative of potential, it could be strengthened by clearer ties between KPD results and learning activity, whether in the form of linked resources for self-directed learning or in the form of a teacher/tutor who can provide structured pronunciation instruction.
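Before moving on, the restricted-variance point behind the interrater findings discussed above can be made concrete with a minimal sketch: when nearly all learners are at the ceiling of a parcel, two scorers can agree almost perfectly in absolute terms while a correlation-based reliability estimate collapses. The scores below are invented for illustration only and do not reproduce the KPD analyses.

import numpy as np

# Invented parcel scores (0-4 items correct) from two scorers for ten
# learners on a near-ceiling phoneme.
scorer_a = np.array([4, 4, 4, 3, 4, 4, 4, 4, 4, 4])
scorer_b = np.array([4, 4, 4, 4, 4, 4, 4, 3, 4, 4])

# Exact agreement is high: the scorers match on 8 of 10 learners.
exact_agreement = float(np.mean(scorer_a == scorer_b))  # 0.8

# A correlation-based interrater estimate collapses, because almost all
# of the (tiny) variance comes from two isolated disagreements.
r = float(np.corrcoef(scorer_a, scorer_b)[0, 1])  # approx. -0.11

print(exact_agreement, round(r, 2))

The substantive point is that raw agreement, not correlation, is the more informative index for near-ceiling diagnostic parcels.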
The usefulness and impact inference may be the weakest link in the KPD's validity argument. Promisingly, a small number of learners were shown to have made non-trivial improvements to their pronunciation after receiving their initial KPD feedback. A handful of other learners made smaller gains, but these gains are arguably less certain due to the precision limits of the KPD. The weakness in support for this inference comes not so much from the quality of the evidence as from the quantity: The accumulated evidence is based on a small number of learners. Going forward, it would be helpful to collect larger-scale test-retest data. It would also be helpful to do so in a context where there was more structure or guidance leading students to suitable classroom-based instruction or other learning activities, in line with my recommendations for bolstering evidence for the utilization inference.

In sum, I judge there to be a substantial and reliable connection between learner segmental pronunciation abilities and KPD scores, which leads to appropriate interpretations of scores relevant to making decisions about learners' strengths and weaknesses. The KPD also has the potential to be fruitfully utilized, and the potential to make a positive impact on pronunciation learning. The weaknesses in the validity argument point to the necessity of additional validation research, which I have already alluded to. While this dissertation presents a broad collection of evidence to support KPD score interpretation and use, the evidence is not all-encompassing. In my evaluation, further investigating support for the utilization and impact inferences is most critical for future validation research. Specifically, examining the use of the KPD in a classroom setting, preferably in several classrooms taught by several different teachers, would be a valuable source of evidence that could further illuminate the degree to which KPD results can be beneficially applied.

Conclusion

As discussed above, despite some shortcomings, the evidence I collected in this dissertation largely supports the interpretation and use of KPD scores for the diagnosis of L2 Korean segmental pronunciation. Concurrently, the four aims I outlined in Chapter 2 have largely been achieved: The test has been developed (Aim 1), field tested to facilitate interpretation of results (Aim 2), examined in relation to spontaneous speech and oral proficiency (Aim 3), and studied in terms of how teachers and learners understand test results (Aim 4). In the next chapter, I situate the KPD project in the larger L2 pronunciation and DLA literatures.

CHAPTER 10: DISCUSSION & CONCLUSION

In Chapters 5 through 8, I presented the results of validation research and discussed findings with respect to specific research questions. In these discussions, I situated my findings within the L2 pronunciation literature and the broader literature of DLA. In Chapter 9, I summarized and interpreted these findings with respect to the KPD's validity argument, which culminated in an evaluation of the validity of KPD score use and interpretation. In this chapter, I offer broader considerations for diagnosing second language pronunciation and for creating and using diagnostic language assessments. I do this by first situating my research findings within the broader literature of second language pronunciation theory, research, and pedagogy, and the broader literature of DLA theory, research, and use.
I then conclude the dissertation with some parting thoughts on the role of diagnostic assessment and language assessment professionals in the landscape of language learning practice and research.

Discussion on Diagnosing Second Language Pronunciation

Moving beyond the scope of the KPD, I now discuss the broader goal of diagnosing second language pronunciation by considering prior research and how the results of this research project fit within it. I start by situating my research results within the field of second language pronunciation, and then by discussing what I see as key questions for diagnosing L2 pronunciation and sharing my tentative answers based on my dissertation work. Then, I explore ways to expand and further develop the tools and practice of L2 pronunciation diagnosis. Next, I connect important areas of research that need to be combined to develop an interface between pronunciation instruction and diagnosis. Last, I present several implications for DLA theory and practice based on my findings related to the KPD.

Situating the KPD in L2 Pronunciation and DLA

The results of this dissertation support the idea that teachers and learners can benefit from detailed, individualized information when it comes to making informed and confident instructional decisions about teaching and learning pronunciation. To my knowledge, the KPD stands as the only stand-alone pronunciation assessment tool that (a) diagnoses learner phoneme-level strengths and weaknesses in pronunciation (cf. the holistic approaches of Isaacs et al., 2018, and others), (b) integrates both production and perception (Flege, 1995; Sakai & Moorman, 2018), (c) explicitly promotes intelligibility-based evaluation of pronunciation (Levis, 2005), (d) does not rely exclusively on read-aloud tasks (Levis & Barriuso, 2012; Munro, 2008; Saito & Plonsky, in press), (e) is relatively easy to administer and score, (f) has been shown to positively inform pronunciation learning (Lee, 2015), and (g) has been rigorously evaluated using an argument-based validity framework (Kane, 2013; Chapelle et al., 2008, 2010). While other dedicated, instructionally-relevant pronunciation assessments may share some of these features (e.g., Dlaska & Krekeler, 2008; Kim, 2006; Lappin-Fortin & Rye, 2014; Tsurutani, 2008), they do not possess all of them. Beyond the KPD and Kim (2006), there would seem to be few, if any, other detailed, instructionally-relevant assessments of L2 Korean pronunciation (Lee, 2017b). At the same time, the KPD may be the first diagnostic tool for a productive language skill to successfully incorporate Alderson et al.'s (2014) and Harding et al.'s (2015) recommendations for DLA instruments to be designed based heavily on language learning theory, to have discrete tasks focused on lower-level aspects of language processing, and to provide feedback that is directly relatable to subsequent instruction (see also Lee, 2015). Viewed this way, I believe the KPD fills gaps in pronunciation assessment and DLA and represents a new direction for the interface between pronunciation teaching, learning, and assessment.

A key feature of the KPD is its capacity to raise learner awareness, which in turn promotes learner attention to phonological forms in their regular language use and during specific (classroom or self-directed) learning activities.
This is very likely to be beneficial to pronunciation learning based on a body of research on the relationship between phonological awareness and pronunciation outcomes (Kennedy & Trofimovich, 2010; Moyer, 2014; Saito, 2018; Venkatagiri & Levis, 2007). Importantly, through KPD results, self-assessments, and interview data, I was able to observe how learner misperceptions could be addressed through the provision of diagnostic feedback. There is a potential link here to learners' perception skills: Those who cannot hear their own difficulties (or strengths) are more likely to have gaps or errors in awareness, both of which would hinder attention-focusing on critical pronunciation targets. Due to the challenges and constraints of classroom teaching, even phonologically-knowledgeable and experienced teachers may not always be able to help students fill in or correct gaps in their awareness of pronunciation difficulties (see Chapter 8), which further highlights the utility of the KPD.

The KPD stands out as a pronunciation assessment tool that incorporates and promotes the well-supported link between the perception, production, and learning of L2 speech sounds (Flege, 1995; Möttönen & Watkins, 2009; Nora, Renvall, Kim, Service, & Salmelin, 2015; Sakai & Moorman, 2018). Aside from expectations of relationships between perception and production abilities generally being met in the results of this dissertation, what is perhaps more encouraging from a learning perspective is the attention learners gave to perception when interpreting and applying their KPD results. Learners reacted strongly to low perception scores for individual phonemes and reported specific learning activities related to perception, such as devoting more attention to target sounds in their input (e.g., from their teacher) and deliberately using audio models (e.g., songs) when practicing their pronunciation. Language tests are known to have washback effects on teaching and learning (Messick, 1996): Tests influence the what and how of language teaching and learning in classrooms and in learner practices. As shown in this dissertation, pronunciation assessments that incorporate learning principles into their design and feedback have the potential to positively influence learners' awareness and application of these principles in their self-directed learning activities, a positive form of washback. Furthermore, specific to the perception-production link, there seems little reason not to incorporate it in the design of pronunciation assessments: Perception activities are widely recommended in pronunciation instruction (Celce-Murcia et al., 2010; Derwing & Munro, 2015; Thomson, 2011), perception items are quick to administer and easy to score, and promoting perception practice would likely have only beneficial side-effects on L2 listening abilities (Field, 2013; see also Vandergrift & Goh, 2012; Yeldham & Gruba, 2014).

Rigorous examination of the KPD was facilitated by the application of argument-based validity. Language testing specialists have long grappled with ways to ensure that the tests they make, including diagnostic tests, are reliable and valid.
While test reliability is rather easy to evaluate psychometrically (and psychometric test properties are mostly undebatable), validity is anything but easy, as the concept itself and the ways to investigate it in relation to a test and its scores are debated: Is it a relatively straightforward relationship between test scores and what is being measured (e.g., Borsboom et al., 2004) or a more sprawling concept that extends to stakeholder use of test scores and the consequences of that use (Messick, 1989)? For DLA at least, with its strong emphasis on usefulness in subsequent instruction, I believe Messick's broader view necessarily prevails. Diagnostic tests, like other tests concerned with the interpretation and use of test scores, can be placed and investigated within an argument-based framework to examine the validity of score uses (Bachman & Palmer, 2010; Chapelle, Cotos, & Lee, 2015; Chapelle, Enright, & Jamieson, 2008, 2010; Kane, 2013), which I did in this dissertation. Although argument-based validity theorists in educational assessment (e.g., Kane, 2013) and language assessment (Bachman & Palmer, 2010; Chapelle et al., 2008, 2010) differ somewhat in their specifications of validity arguments, the general structure and approach I took involved a series of progressive inferences that led from test-taker responses to the use of test results by a range of stakeholders. As demanded by the validity argument I constructed, I collected a wide range of relevant evidence (learner background data, self-assessment, an oral proficiency measure, spontaneous speech samples, interviews, and KPD retests) that would bear on the critical evaluation of each inference. This work provided the backbone of the dissertation and opened doors to future research questions that must be investigated in more detail.

Important Questions and Tentative Answers

After analyzing the KPD phoneme parcel data, it was clear that some phonemes presented no substantial difficulty for virtually any learner. This raises the question of whether all phonemes need to be assessed when diagnosing segmental pronunciation. Trimming has the obvious benefit of freeing up resources to either collect more information about other phonemes or include more aspects of pronunciation in diagnosis. However, at the outset of this project, I did not wish to make any assumptions about what might be possible in terms of learner pronunciation weaknesses. Now, with data in hand, I feel it is appropriate to consider possible delimitations. For L2 Korean specifically, it appeared that the phonemes /ɑ, i, ɛ, h, m/ were universally not problematic in either production or perception. Given the wide range of L1 backgrounds and levels of language experience in the sample, it appears that these sounds might be non-issues for virtually any learner. In some cases, deciding whether a phoneme is non-problematic is a little less clear-cut. /u/, for example, was very easy in production at the group level, yet it (a) was flagged as a substantial difficulty for a small number of learners and (b) presented a considerable challenge in terms of perception. I would argue that /u/ should be kept. For other languages, I recommend at least collecting pilot data on all phonemes with learners from a range of backgrounds and then seeing what might be appropriate to trim.
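To make this kind of screening decision concrete, the sketch below flags phonemes that nearly all pilot participants find easy in both modalities. Everything in it is assumed for illustration: the simulated accuracy matrices, the 90% per-learner accuracy threshold, and the 95%-of-learners criterion are hypothetical values, not figures derived from the KPD data.

import numpy as np

rng = np.random.default_rng(seed=0)
phonemes = ["p", "p*", "ph", "ɑ", "i", "h", "m", "u"]

# Simulated per-learner accuracy (198 learners x 8 phonemes) per modality,
# with /ɑ, i, h, m/ made artificially easy in both modalities.
production = rng.uniform(0.5, 1.0, size=(198, len(phonemes)))
perception = rng.uniform(0.5, 1.0, size=(198, len(phonemes)))
production[:, 3:7] = rng.uniform(0.95, 1.0, size=(198, 4))
perception[:, 3:7] = rng.uniform(0.95, 1.0, size=(198, 4))

ACC_THRESHOLD = 0.90   # a learner "has no difficulty" above this accuracy
PROP_LEARNERS = 0.95   # trim candidate if this share of learners clears it

def universally_easy(acc):
    # Proportion of learners at/above threshold, per phoneme (column).
    return (acc >= ACC_THRESHOLD).mean(axis=0) >= PROP_LEARNERS

# A phoneme is a trim candidate only if it is easy in BOTH modalities,
# mirroring the argument above for retaining /u/ (easy in production only).
candidates = universally_easy(production) & universally_easy(perception)
print([p for p, c in zip(phonemes, candidates) if c])  # -> ['ɑ', 'i', 'h', 'm']

The two-modality condition is the substantive point here; the numeric thresholds would need to be set with a test's stakes and measurement precision in mind.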
For English, where it might be desirable to diagnose based on a delimited set of phonemes deemed crucial for lingua franca communication (Jenkins, 2002), I feel the need to point out that even if a phoneme contrast itself is unimportant (e.g., English /θ-ð/), intelligibility issues related to constituent members of the contrast cannot be ruled out (e.g., imagine a learner whose /θ/ articulation is closer to [s] in words like think or thin). In cases such as this, explication in scoring criteria might best handle the delimitation, rather than removal of certain phonemes from test specifications. Similarly, pedagogical arguments have been made to prioritize segments and contrasts with high functional load (FL; Kang & Moran, 2014; Munro, Derwing, & Thomson, 2015). This can lead to some sensible recommendations, such as not devoting too much time to the low-FL English /θ-ð/ distinction, but in other cases, the application of FL to teaching and assessment is less intuitive. For example, in Korean, the highest-FL contrast for vowels is /i-ɛ/ and the highest-FL vowel is /i/ (Oh et al., 2015). However, on the KPD, /i/ was one of the easiest phonemes for learners to produce and perceive (see Chapter 5), and I suggested that it could be excluded from a revision of the KPD. In instructional settings, it would seem most learners would have a phoneme extremely close to /i/ in their linguistic repertoires that could be immediately drawn on in Korean (Flege, 1995; see Chapter 7, where learners across the range of L2 Korean oral proficiency had high production and perception accuracy), suggesting limited benefit of a pedagogical focus on this high-FL phoneme. On the other hand, for consonants, the contrast with the highest FL is /l-n/, with /n/ as the first- and /l/ as the third-highest FL consonant. While /n/ and /l/ were not in general difficult for learners to produce and perceive (>90% accuracy on average), several individual learners did have considerable difficulty with these two sounds, suggesting that assessment inclusion is justified and that instructional attention would be worthwhile for learners who need it. FL could conceivably be applied in the construction of diagnostic assessment items. In KPD perception tasks, I primarily designed stimuli (Task 3 – Pronunciation Judgment) and distractors (Task 4 – Nonword Identification) on the basis of articulatory and acoustic similarity. This did yield items featuring the high-FL /i-ɛ/ and /l-n/ contrasts. However, other high-ranking FL contrasts like /i-o/ or /k*-t/ did not appear likely to serve as useful distractors due to considerable articulatory differences; items based on such contrasts would likely be extremely easy for learners, resulting in overestimation of the robustness of their perception of target phonemes. In sum, while FL may have some useful pedagogical applications, such as eliminating low-importance phonemes from pronunciation syllabi, it is less clear how FL might be applied more broadly in pedagogy and the specification of pronunciation diagnostic tests.

In designing KPD tasks, I devoted considerable attention to tapping into lower-level speech production and perception processes (Field, 2011, 2013). However, in these models, the role of speededness or automaticity is not emphasized, presumably because naturalistic language processing is generally assumed to be occurring at the speed of real-time communication.
On the other hand, the pronunciation instruction literature has largely favored outcome measurement tasks that are discrete, controlled, and mostly unspeeded (Lee, Jang, & Plonsky, 2015; Saito & Plonsky, in press). Saito and Plonsky's (in press) recently proposed framework for measuring L2 phonological knowledge in instructional settings distinguishes between pronunciation in controlled, relatively unspeeded contexts, tapping into declarative phonological knowledge, and pronunciation in spontaneous, speeded production, tapping into the degree of proceduralization and automatization of phonological knowledge. How do the conceptualizations of declarative and procedural phonological knowledge impact diagnostic assessment? How might this distinction be applicable to the perception of phonological features? I believe the answers to these questions have substantial implications for construct representation, as well as instructional implications in line with the skill acquisition theory (DeKeyser, 2017) that Saito and Plonsky draw on. I also see a tension between diagnostic assessment and Saito and Plonsky's framework: DLA experts have begun to recommend discrete, controlled tasks (Harding et al., 2015, who dealt with speed in terms of the rate of speech or articulation of listening stimuli, but not response time), which according to Saito and Plonsky would fall short of fully capturing learners' pronunciation abilities. Alderson (2005) anticipated aspects of this tension, noting that "speeded diagnostic tests" could be more useful for assessing proceduralized (or "implicit") knowledge but cautioning that speeded tests may have limited diagnostic utility: "Knowing that somebody reads slowly does not tell us why that person reads slowly, which is the essence of diagnosis" (pp. 260-261). I deliberately designed the KPD to include multiple tasks for measuring production and perception knowledge, but in my original specifications I did not explicitly account for declarative and procedural knowledge. Retrospectively, I believe that Task 1 – Picture Naming would loosely fit Saito and Plonsky's definition of a spontaneous production task due to its primary focus on meaning (test-takers must first search for a word that matches the meaning of the picture) and its relative degree of speededness (the whole word must be uttered at once), reflecting to some extent learners' proceduralized pronunciation ability. Task 2 – Nonword Reading, due to its greater structure and form-focus (i.e., provision of phonemes that learners must produce; no concern for meaning), likely qualifies as a controlled knowledge task, which more likely reflects consolidated declarative pronunciation knowledge. Saito and Plonsky (in press) do not apply their framework to receptive phonology, but I think such an extension is possible and logical, in line with work on receptive morphosyntactic knowledge (e.g., Suzuki, 2017; Suzuki & DeKeyser, 2017). I would hazard that both KPD receptive tasks are controlled knowledge tasks, as the lack of a time constraint on task responses allows learners to deliberately call on their declarative representations of phonemes and compare them to the stimulus for as long as they hold the stimulus in their phonological short-term memory. While I did not report separate scores according to phoneme accuracy in each type of task, I see no reason why it would not be both possible and useful to do so.
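As a sketch of what such task-type subscores might look like, the snippet below compiles phoneme accuracy separately for controlled and spontaneous tasks. The response records are invented, and the mapping of Task 1 to "spontaneous" and the remaining tasks to "controlled" simply encodes the retrospective classification suggested above, which is itself a tentative reading of Saito and Plonsky's framework.

import pandas as pd

# Invented item-level records: one row per scored item response.
responses = pd.DataFrame({
    "phoneme": ["p*", "p*", "p*", "p*", "l", "l", "l", "l"],
    "task": ["picture_naming", "nonword_reading",
             "pronunciation_judgment", "nonword_identification"] * 2,
    "correct": [0, 1, 1, 0, 1, 1, 0, 1],
})

# Tentative mapping: Task 1 as spontaneous/speeded, Tasks 2-4 as controlled.
TASK_TYPE = {
    "picture_naming": "spontaneous",
    "nonword_reading": "controlled",
    "pronunciation_judgment": "controlled",
    "nonword_identification": "controlled",
}
responses["knowledge_type"] = responses["task"].map(TASK_TYPE)

# Mean accuracy per phoneme, split by knowledge type.
subscores = responses.pivot_table(index="phoneme", columns="knowledge_type",
                                  values="correct", aggfunc="mean")
print(subscores.round(2))

A gap between the two columns for a given phoneme would then be read, cautiously, as a difference between declarative knowledge and its proceduralization.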
I recommend that future development of the KPD and other pronunciation diagnostics consider Saito and Plonsky's declarative/procedural distinction; including both (alleviating Alderson's 2005 concerns) could yield information highly pertinent to instructional planning, such as whether to focus on explicit phonetic instruction or simply on practice in more communicative contexts.

For many good reasons, L1 influence holds a prominent place in the study of L2 pronunciation. The L1 is widely thought to influence interlanguage phonology (Flege, 1995), and psycholinguistic research largely agrees that the phonemes of the L1 and other known languages remain active during L2 speech perception (Imai et al., 2005; Weber & Cutler, 2004). On a more practical level, learner L1-specific pedagogical recommendations are abundant (e.g., Avery & Ehrlich, 1992; Derwing & Munro, 2015; Kwon, 2017). In the DLA literature, too, recommendations for L1-based tailoring of instruments can be found (Harding et al., 2015). Should learner L1 play a prominent role in the design of pronunciation diagnostics? I argue that it should not. In the cluster analysis in Chapter 7, although many learners who had pronunciation difficulties shared them with many of their same-L1 peers, I also found that learners from the same L1 background could have distinct differences in terms of phoneme difficulties. For example, there were English and Japanese learners who struggled to produce tense consonants, which a simple contrastive analysis would predict. However, there were also small numbers of English and Japanese learners who had greater difficulties with aspirated consonants, a phenomenon that would not be so readily predicted by a contrastive analysis, given that English and Japanese have aspiration. Contrastive analysis and other L1-influence approaches risk over-simplifying cross-linguistic differences (e.g., in the case of English and Japanese, aspiration and voicing have featural overlap which could lead to less-predictable feature transfer or phoneme assimilations in L2 Korean), and they do not necessarily account for how individuals react to unfamiliar new features of the L2 (or L3+, as it may be) phonology. As I have already mentioned, there may be suitable grounds for excluding a small number of segments from a diagnostic across the board, but outside of that I see little compelling reason to further reduce the domain of segments based on learner L1s. Doing so could lead to overlooking the difficulties of less typical learners within an L1, and it would also limit the applicability of the diagnostic itself, resulting in potentially less mileage out of test development efforts.

Room for Expansion

At the outset of this project, I knew I would have to delimit the scope of the diagnostic instrument, as it would have been beyond my means to develop and validate a truly comprehensive diagnostic of second language Korean pronunciation (if such a diagnostic is even possible). While I feel that my initial focus on segmentals is justified on both practical and theoretical grounds, I remain cognizant of the value of collecting diagnostic information related to other important aspects of pronunciation. Comments from learners and the teacher I interviewed provided additional reminders of this. Some learners felt that intonation was where they struggled most, that they had difficulties related to (certain phonemes in combination with) syllable structure, or that they needed to work on their ability to apply phonological processes in connected speech.
The teacher I interviewed, Jae-woo, also brought up intonation issues with one of his students, and he pointed out another student's weaknesses in facial expression and gesture, which are known to be important parts of a speaker's repertoire capable of enhancing interlocutor understanding (e.g., Hardison, 2018; Sueyoshi & Hardison, 2005). It is thus more appropriate to view the KPD and the present study as one piece of the diagnostic puzzle.

As Alderson et al. (2014) pointed out, diagnosis is a process that begins with a diagnostician and benefits from multiple sources of evidence. Imagine a more robust and holistic classroom-based diagnosis context, with Jae-woo, the Korean teacher featured in Chapter 9, as a diagnostician whose informal observations of the pronunciation difficulties of Yu-wen (one of the learners from Chapter 9) initiate the process of diagnosis. Having noticed some segmental difficulties in Yu-wen's production, Jae-woo might ask her to complete a self-assessment and then administer the KPD. Jae-woo might also utilize other diagnostic tools or observations to provide Yu-wen with feedback related to her gesturing and expression while speaking. Afterwards, Jae-woo could apply the diagnostic information by recommending self-study material or providing some additional homework assignments. Yu-wen, as an active participant in the diagnostic process, might share her lack of confidence in pronunciation with Jae-woo, who in turn might be able to counsel her on the affective challenges involved with second language pronunciation.

Such a view on diagnosing L2 pronunciation points to many possibilities for diagnostic instrument development and for the formulation of principles and procedures for teacher-driven (or teacher-guided) diagnosis. I believe the KPD provides a useful starting point for diagnosing segmental pronunciation (though certainly improvements are possible), and broadly speaking, the KPD represents a set of test specifications that (a) worked as intended and (b) could be easily adapted to other target languages. More original work is needed in the development and validation of practical, reliable, and sufficiently detailed diagnostic tools for suprasegmental aspects of L2 pronunciation and for pronunciation supports such as gesture and communication strategies. Training materials and procedures for teachers to diagnose learner pronunciation are other promising avenues for further developing pronunciation diagnosis.

Towards an Interface between Pronunciation Instruction and Diagnostic Assessment

While this dissertation can primarily be viewed as a test development and validation project, it is a diagnostic test development and validation project in which links to instruction and pedagogy are important. Although I did not examine traditional classroom-based pronunciation instruction (e.g., Isbell et al., 2019) or the use of a structured training program (e.g., Thomson, 2011), Chapter 8 touched on issues related to pronunciation teacher cognition as well as out-of-class and autonomous pronunciation learning activity. Pronunciation teacher cognition deals with the "knowledge, beliefs, thoughts, attitudes, and perceptions" involved in teaching pronunciation (Burri, Baker, & Chen, 2017, p. 110; see also Baker, 2014), and it is an under-researched area in general.
In my interview with Jae-woo, an in-service teacher with considerable knowledge of phonology and experience teaching students from a variety of L1s, I found that his orientation to teaching pronunciation was largely driven by learner L1. He also saw considerable constraints on his classroom pronunciation teaching practices, with the result that limited, teacher-centered read-alouds and repetition served as his primary means of addressing pronunciation in the classroom (Baker, 2014; Foote et al., 2013). When asked how he would apply the KPD's diagnostic feedback, he described separate classes that focused on explicit instruction tailored to addressing difficulties related to L1 influence. What is interesting here, to me, is how pronunciation teacher cognition and teaching practices might interface with teacher competence in the diagnostic assessment of pronunciation issues (Edelenbos & Kubanek-German, 2004). Jae-woo mostly described his two students' pronunciation difficulties broadly in terms of L1 interference and did not offer much detail on the specific phoneme difficulties each learner experienced. By Jae-woo's own admission, he missed "50%" of the picture (in his defense, I would like to point out that he was able to comment insightfully on some other, non-segmental aspects of their pronunciation). I wonder: Did Jae-woo's strong orientation to L1 influence in L2 pronunciation constrain his diagnosis of his students' specific segmental difficulties? Research examining pronunciation teacher cognition alongside diagnostic assessment practice would be informative and could provide guidance for teacher training. Although knowledge of learner L1 phonology and common transfer-related influence is almost certainly useful for pronunciation teaching, it may be time to consider a shift toward training teachers to observe individual difficulties without relying solely on L1-based assumptions (e.g., Avery & Ehrlich, 1992; Kwon, 2017).

Findings related to actual learner application of KPD results were just as interesting as the teacher's orientation to diagnosing pronunciation difficulties. The learning activities and strategies applied by learners included:

• Shadowing
• Reading aloud
• Song-based practice
• (Seeking) feedback in meaning-focused interaction
• Heightened attention to target sounds in input and own output

Most of these strategies and learning activities find at least some support in research (Derwing, Munro, & Wiebe, 1998; Foote & McDonough, 2017; Horgues & Scheuer, 2014; Loewen & Isbell, 2017; Moyer, 2014; Saito & Lyster, 2012) or in commonly-used and recommended pedagogical practices (Baker, 2014; Celce-Murcia et al., 2010). However, much of the support for these practices is based on more formal instructional contexts, such as teacher-led classrooms or well-structured computer-assisted pronunciation training tools. Moyer's (2014) review of exceptional L2 pronunciation outcomes emphasized the role of learner autonomy and engagement in activities and strategy use to promote L2 phonological learning. This raises several questions: How effective are these practices in non-classroom/autonomous learning contexts? Which activities might be best suited for learners to effectively pursue on their own time and initiative? Which activities can learners be trained to do on their own without investing a great deal of time?
Empirical research addressing these questions would contribute to the less-understood facet of self-directed/autonomous pronunciation instruction and at the same time address the DLA-instruction interface: If learner feedback from a test like the KPD could be integrated with recommendations for specific, effective self-directed learning activities, learner pronunciation development could be positively impacted.

Finally, I believe this dissertation should motivate L2 pronunciation researchers to consider more broadly the role of assessment in promoting student pronunciation learning. Previous low-stakes, instructionally-relevant pronunciation assessment efforts, such as Lappin-Fortin and Rye (2014), have shown the potential of self-assessments and pre-post achievement tests for raising learner awareness and tracking learning outcomes. The KPD has shown how diagnostic pronunciation assessment can shape and promote learner attention, pronunciation learning strategy use, and out-of-class pronunciation learning activity. More research on the interface of pronunciation instruction and diagnostic assessment could lead to more concrete recommendations for practitioners and more fruitful self-directed learning for students.

Implications for Diagnostic Language Assessment

I now turn to implications for diagnostic language assessment. Through the course of setting a purpose and scope of diagnosis, developing the KPD, constructing a validity argument, and seeking evidence to support the validity of KPD score interpretation and use, I have arrived at several key considerations for DLA pertaining to grain size, measurement models, validity, and DLA instrument design.

As discussed in Chapter 2, grain size is a key consideration in DLA. Finer grain size in test design, and subsequently in test scores, can lead to more concrete, instructionally-relevant information for stakeholders to utilize. At the same time, finer grain size has an inverse relationship with practicality, requiring more and more tasks, items, or other observations in order to isolate smaller bits of linguistic knowledge and competence. In other tests labeled as diagnostic, learners are given more general feedback, such as feedback on how well they understand main ideas or details. To make a crude analogy to pronunciation diagnosis, that would be like giving a learner just two scores for segmental and suprasegmental accuracy, which perhaps might be a useful starting point, but such score categories are of too large a grain size to provide concrete guidance for instruction. With the KPD, I feel that I have struck an effective balance: A substantial subcomponent of pronunciation (and in turn, of speaking ability) is diagnosed at the level of individual phonemes, which are discrete and relatively fine bits of linguistic knowledge, over the course of about 15 minutes and in a format which takes a teacher only roughly 5 additional minutes to score (caveat: compiling results may take extra time without the specialized software I used to administer the test). The teacher I interviewed in Chapter 9, Jae-woo, nonetheless expressed a desire for an even finer grain size: He wanted to know about allophone-level difficulties (i.e., performance in different syllable positions/phonological environments) within each phoneme. I can certainly see the pedagogical application of such information, and I did consider the allophonic distribution of phonemes in the design of the KPD (Chapter 4).
However, concerns related to reliability/information sufficiency and practicality (test length and complexity of score reports) led me to avoid fully pursuing that level of detail in the KPD's design and reporting of results. Was that the right call? I leave that question open to readers. Nonetheless, it appears that relatively fine grain size in diagnostic instruments has benefits, as shown by the KPD's utility in helping learners narrow down their list of study targets and sounds to pay special attention to in their general Korean use.

In an interview featured on Glenn Fulcher's Language Testing Bytes podcast, Eunice Jang expressed hope for methodological diversification in DLA, including "psychometrically less constrained diagnostic modelling, such as latent class or profile analysis, clustering methods, or subscoring approaches" (Fulcher, 2015). I agree with Jang, as doing anything else would likely limit DLA to retrofitting existing proficiency tests to provide more detailed score reports (Jang, 2009) due to the sample size and resource constraints associated with test development and administration that is not large-scale and high-stakes. While it would have been wonderful to collect data from 1,980 (or 19,800) L2 Korean learners and construct more sophisticated psychometric models (e.g., cognitive diagnostic models based on combinations of phonological features and syllable/phonological contexts), I must wonder what practical use that would have yielded. At the most immediate level of interpreting diagnostic test scores, it appears to me that easily interpretable subscores linked to concrete learning targets are key for learners and teachers. Such subscores are useful for delimiting study targets and for promoting awareness of a manageable, tailored list of targets when using the language. While I concede that the identification of a range of stable diagnostic profiles through very large datasets could be useful for tracking students into predetermined instructional modules, doing so would be fruitless without following through on the development of such modules.

This dissertation and other work on diagnostic instruments, such as those developed for L2 writing, have utilized argument-based validity to set an agenda for validation research that gathers the evidence necessary to support the use of a test. Tasks in L2 writing diagnostic tests tend to more closely resemble authentic writing tasks instead of discretized tasks that target subcomponents of writing ability, with diagnostic information coming from thorough, detailed analysis of written products (Chapelle et al., 2015). As such, validity arguments for such writing diagnostics have been able to support extrapolation inferences with rather straightforward links between test tasks and real-world writing tasks. In contrast, I have encountered a challenge in establishing an appropriate connection between a highly-discrete diagnostic test and real-world, meaningful language use. Extrapolation support for a test like the KPD based on task features or parameters, e.g., pointing out that it elicits words and phonemes used in real-world Korean use, would seem to be trite. This leaves alignment of test task responses with authentic language use performance as a suitable source of support.
The unreasonableness of expecting (composite) scores derived from highly discretized diagnostic assessment task responses to closely reflect learner language behavior in spontaneous and holistic language use raises the question: How much support for extrapolation is needed? I do not have a clear answer to that question. Presumably, a comprehensive set of discrete diagnostic scores pertaining to a communication skill or subskill (say, diagnostic information coming from both segmentally- and suprasegmentally-focused pronunciation diagnostics) should be able to explain a large portion of learner performance in authentic language use. However, I suspect that any one piece of the diagnostic picture can at most explain a proportionally small piece of the larger picture in authentic language use, and even this relationship may be difficult to isolate due to the confluence of linguistic, cognitive, and situational factors that bear on typical language use. I see sorting through this issue as an important area of work for DLA, especially in the approach to DLA espoused by Alderson and colleagues (Alderson, 2005; Alderson et al., 2015; Harding et al., 2015).

Finally, with respect to developing instruments for DLA, I believe this dissertation serves as a proof of concept for Harding et al.'s (2015) recommendations for diagnostic instrument design. Harding et al. focused on diagnosing L2 reading and listening skills, and I was able to apply many of their recommendations for diagnosing listening subskills (i.e., phoneme perception). I was also able to adapt their recommendations to an aspect of speaking ability (i.e., pronunciation). The resulting product, a new test built from the ground up on the basis of learning theory and models of linguistic processing, was capable of providing detailed linguistic feedback that could be appropriately interpreted and applied to support learning. Without discounting other purposes and practices in L2 assessment sometimes referred to as diagnostic, such as retrofitting existing proficiency tests to provide enhanced score reports to learners (Jang, 2009; Jang et al., 2015) or identifying students requiring additional language support, broadly defined (Knoch & Elder, 2016), I see the Harding et al. model of diagnostic instrument design, exemplified by the KPD, as the way forward for DLA.

Final Thoughts

The KPD and this dissertation, for all their limitations (and there are many), represent a considerable undertaking: I developed a brand-new test with four distinct tasks and a total of over 350 items through multiple rounds of piloting, field tested it with nearly 200 learners, and investigated the validity of its interpretation and use through the collection and analysis of diverse types of evidence. I hope that the fruits of these efforts extend beyond the pages of this dissertation, as the final product appears to be useful in promoting the pronunciation development of L2 Korean learners. At the very least, I believe this research benefited the learners who took the KPD and received score reports, many of whom expressed appreciation for and sincere interest in their results and the research itself. To the end of reaping more value from the KPD development and validation efforts, I plan to produce and release a free, publicly available version of the KPD with user-friendly documentation and score-calculation tools for Korean teachers and/or tutors to use.
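As a minimal sketch of what such a score-calculation tool might do, the snippet below tallies a teacher's item-level marks into per-phoneme accuracy by modality and flags low scores for prioritization. The response data and the 50% flagging cut-off are illustrative assumptions only; they are not the KPD's actual flagging rules.

from collections import defaultdict

# Hypothetical scored responses a teacher might transcribe from the
# scoring sheet: (phoneme, modality, correct) triples.
scored = [
    ("p*", "production", 1), ("p*", "production", 0),
    ("p*", "perception", 0), ("p*", "perception", 0),
    ("l", "production", 1), ("l", "production", 1),
    ("l", "perception", 1), ("l", "perception", 0),
]

totals = defaultdict(lambda: [0, 0])  # (phoneme, modality) -> [n_correct, n_items]
for phoneme, modality, correct in scored:
    totals[(phoneme, modality)][0] += correct
    totals[(phoneme, modality)][1] += 1

FLAG_BELOW = 0.5  # illustrative cut-off for flagging a phoneme as a priority

for (phoneme, modality), (n_correct, n_items) in sorted(totals.items()):
    accuracy = n_correct / n_items
    flag = "  <- prioritize" if accuracy < FLAG_BELOW else ""
    print(f"{phoneme:>3} {modality:<11} {accuracy:4.0%}{flag}")

A released tool would of course need the full phoneme-by-task tallies from the KPD scoring sheet, but the compilation logic itself is no more complicated than this.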
More broadly, I believe the kind of test development and validation efforts showcased in this dissertation raise important points for how language testers can contribute to learning-oriented (Turner & Purpura, 2015), instructionally-relevant (Pellegrino et al., 2016), low-stakes assessments. The first point is that rigorous test development and validation can be worth it for low-stakes assessments. While many low-stakes assessment practices and tools are justifiably simple and economical, low-stakes does not necessarily have to be synonymous with low-value: A high-quality diagnostic assessment tool, with interpretations supported by rigorous validation research, can provide teachers and learners with relatively easily-obtained, detailed information mostly unavailable from observations and other informal assessments.

The second point is that these sorts of test development and validation efforts are not reasonable to expect from classroom teachers, who often assume responsibility for the creation of many, if not most, classroom assessments. Rather, I see the development of instruments like the KPD as a prime opportunity for language testing professionals and researchers to contribute to learning-oriented, instructionally-relevant, low-stakes classroom assessment practices: Build a useful tool and put it in the hands of teachers and learners (for free if possible, or at least for cheap). I believe such efforts are an opportunity for language testers to do more good in our work by creating something that is directly and concretely useful to language learning, above and beyond producing knowledge and developing theory through academic research that may (or may not) be relevant to low-stakes classroom assessments.

My final point is that in both the design and validation phases of this instructionally-relevant assessment project, I had to give considerable attention to SLA theory and pronunciation instruction concerns. This is not common in development efforts for many kinds of language tests, such as proficiency tests, which are more concerned with norm-referenced construct definitions and domain-of-use descriptions than with developmental trajectories and the applicability of results to teaching and learning practices. Here, with DLA and instructionally-relevant language assessments more broadly, I see an opportunity for greater interaction and collaboration among language assessment professionals, SLA and especially ISLA researchers, and language pedagogy experts. I hope that this dissertation inspires more of those connections to be made.

APPENDICES

APPENDIX A
KPD Table of Specifications

Table A1
KPD Table of Specifications

Target Consonants Context
Production – Part 1
1. Picture Naming
ㅂ p (lax bilabial stop) ㅃ p* (tense bilabial stop) ㅍ ph (aspir. bilabial stop) ㄷ t (lax alveolar stop) ㄸ t* (tense alveolar stop) ㅌ th (aspir. alveolar stop) ㄱ k (lax velar stop) ㄲ k* (tense velar stop) ㅋ kh (aspir. velar stop) ㅈ ʨ (lax alv-pal. affric.)
initial medial final initial medial final initial medial final initial medial final initial medial final initial medial final initial medial final initial medial final initial medial final initial medial final
2 불, 버스 1 나비 1 집 2 빨간색, 빵 2 피아노, 포도 1 컴퓨터 2 돈, 돼지 1 포도 1 초콜릿 2 딸기, 땅콩 2 토끼, 택시 1 컴퓨터 2 귀, 그림 2 시계, 빨간색 1 빨간색 1 꽃 1 토끼 1 코 1 땅콩 1 집 5 돼지, 여자, 의자, 아저씨, 화장실
2. Nonword Reading
1 보 1 우부 1 압 1 빠 1 오뽀 1 포 1 우푸 1 도 1 이디 1 앋 1 따 1 우뚜 1 토 1 아타 1 기 1 우구 1 옥 1 까 1 이끼 1 키 1 오코 1 자 1 이지
Perception – Part 2
3. Pronunciation Judgment
1 비 1 일본 1 입 1 뿌리 1 아빠 1 팔 1 소파 1 다리 1 바다 1 옷 2 똥, 딸 1 탈 1 외투 1 개 1 미국 1 책 1 꿀 1 어깨 2 칼, 카메라 1 짐 1 사진
# of items
13 7 9 14 8 9 14 8 8 12
4. Identification
1 바 1 오보 1 웁 1 삐 1 우뿌 1 푸 1 이피 1 디 1 아다 1 욷 1 뚜 1 오또 1 티 1 우투 1 구 1 오고 1 익 1 꼬 1 우꾸 1 쿠 1 아카 1 지 1 오조

Table A1 (cont’d)
ㅉ ʨ* (tense alv-pal. affric.) ㅊ ʨh (aspir. alv-pal. affric.) ㅅ s (lax alv-pal. fricative) ㅆ s* (tense alv-pal. fricative) ㅎ h (lax glottal fricative) ㅁ m (bilabial nasal) ㄴ n (alveolar nasal) ㅇ ŋ (velar nasal) ㄹ l (alveolar liquid) Vowels ㅣ i (high front unrounded) ㅐ,ㅔ ɛ (mid front unrounded)
initial medial final initial medial final initial medial before i/j initial medial before i initial medial final initial medial final initial medial final initial medial final initial medial final geminate
2 맥주, 왼쪽 2 침대, 초콜릿 1 새 3 버스, 빨간색, 원숭이 2 시계, 화장실 1 쓰레기 1 학생 2 아저씨, 택시 2 학생, 화장실 1 맥주 2 레몬, 침대 1 그림 1 나비 2 피아노, 빨간색 1 돈 2 화장실, 땅콩 5 빵, 용, 땅콩, 학생, 왕 1 레몬 3 쓰레기, 그림, 빨간색 2 불, 화장실 1 초콜릿 10 집, 피아노, 화장실, 돼지, 쓰레기, 아저씨, 나비, 택시, 토끼, 원숭이
1 쭈 1 오쪼 1 치 1 우추 1 수 1 아사 2 샤,셔 1 쏘 1 우쑤 1 씨 1 하 1 오호 1 미 1 오모 1 움 1 노 1 이니 1 안 1 옹오 1 잉 1 루 1 이리 1 알 1 울루 1 이
2 잡지, 오른쪽 1 차 1 기차 1 소 1 세상 2 음식, 도시 2 쌀, 싸움 1 접시 1 호랑이 1 아홉 1 목 1 나무 1 사람 1 노래하다 1 하나 1 문 1 창문 1 가방 1 라디오 1 사랑 1 별 1 콜라 2 아기, 시장 8 레몬, 쓰레기, 택시, 맥주, 빨간색, 새, 시계, 침대 1 에 2 백, 뱀
1 쪼 1 아짜 1 추 1 이치 1 사 1 우수 2 시, 쇼 1 쑤 1 오쏘 1 씨 1 히 1 우후 1 모 1 우무 1 임 1 니 1 아나 1 온 1 앙아 1 웅 1 라 1 오로 1 울 1 일리 1 이 1 에
8 9 18 13 8 13 13 13 19 14 12

Table A1 (cont’d)
ㅡ ɯ (high back unrounded) ㅓ ʌ (mid back unrounded) ㅏ ɑ (low back unrounded) ㅜ u (high back rounded) ㅗ o (mid back rounded) Glides* j_ /_j w_
3 그림, 쓰레기, 버스 1 으 2 하늘, 음악 3 버스, 컴퓨터 (x2) 1 어 2 커피, 머리 9 아저씨, 빵, 피아노, 빨간색(x2), 학생, 나비, 의자, 여자 1 아 2 산, 강 3 불, 원숭이, 맥주 1 우 2 눈, 둘 11 꽃, 돈, 포도(x2), 레몬, 초콜릿(x2), 토끼, 왼쪽, 피아노, 코 1 오 2 손, 호주 1 으 1 어 1 아 1 우 1 오 4 여자, 의자, 용, 컴퓨터 6 왼쪽, 돼지, 귀, 왕, 화장실, 원숭이 6 얘, 의, 여, 야, 유, 요 4 위, 왜, 워, 와 6 고양이, 교수님, 의사, 연필, 우유, 예술가 6 얘, 의, 여, 야, 유, 요 4 교회, 원, 가위, 화 4 위, 왜, 워, 와 7 7 13 7 15 22 18
Total 128 63 72 63 326
*Glides combine with monophthongs to form 10 diphthongs: ㅖ/ㅒ(jɛ), ㅢ(ɯj), ㅕ(jʌ), ㅑ(jɑ), ㅠ(ju), ㅛ(jo), ㅞ/ㅙ(wɛ), ㅝ(wʌ), ㅟ(wi), ㅘ(wɑ)

APPENDIX B
KPD Item Specifications

I. Specification Title: Picture Naming
1. General Description: Learners should be able to recall phonological representations of words and articulate them accurately.
2. Prompt Attributes: A picture and English text will be displayed. Prompts are words selected due to 1) high frequency, 2) word class (nouns are more image-able), and 3) length (preference for shorter words, which present less potential distraction). Pictures should clearly elicit the target word. Images should thus use color, indicators such as arrows or circles, and possibly even text (but not for the target word) to ensure that expected responses are given.
3. Response Attributes: Responses scored by human judgment. The judge should be trained in Korean phonology and have native/near-native proficiency. Responses will be scored for accuracy of all phonemes.
4. Sample Item: <4 seconds are given to respond>
5. Supplemental Information: Phoneme inventory includes 19 consonants, 7 vowels, and 2 glides (which may combine with vowels to form 10 diphthongs), following Shin, Kiaer, and Cha (2013). Refer to Table of Specifications for environments that must be tested for each phoneme.
Refer to A Frequency Dictionary of Korean (Lee, Jang, & Seo, 2017) for word frequency information; words among the top 1,500 most frequent are preferred, but any word in the top 5,000 and/or determined to be commonly introduced in instructional settings is permissible.
6. Directions and Practice Item(s): Directions: In this section, you will name pictures. First, you will see a picture and a sentence with a blank. Then, you will speak the word for the picture and the blank. Practice Item A: You should have said "책상".

II. Specification Title: Nonword Reading
1. General Description: Learners should be able to articulate the sounds of Korean.
2. Prompt Attributes: A nonword target of one to two syllables (V, CV, VC, or CVC) will be displayed. For consonant items, the Korean vowels /i/, /a/, /o/, and /u/ will be used to provide context, as these are common in many languages and likely to present little challenge. For glides and some allophones, more complicated syllables may be used, but these should not be unnecessarily complex. Items targeting vowels will utilize one-syllable targets with a single vowel.
3. Response Attributes: Responses are scored by human judgment. The judge should be trained in Korean phonology and have native/near-native proficiency. Responses will be scored for accuracy of the target phoneme only.
4. Sample Item: "다" is displayed.
5. Supplemental Information: The phoneme inventory includes 19 consonants, 7 vowels, and 2 glides (which may combine with vowels to form 10 diphthongs), following Shin, Kiaer, and Cha (2013). Refer to the Table of Specifications for environments that must be tested for each phoneme.
6. Directions and Practice Item(s): Directions: In this section, you will see Korean letters and read them out loud. Practice Item A: "다" is displayed.

III. Specification Title: Pronunciation Judgment
1. General Description: Korean users must be able to decode speech sounds and match sounds to phonological representations of words, drawing on contextual clues. Examinees will judge the quality of phonemes in spoken Korean.
2. Prompt Attributes: A single, ideally short, word will be played once, and a picture of the word will be displayed. An English caption of the picture will be provided. The picture and caption will be displayed prior to hearing the word (1 second prior, if computer-administered). The spoken word may be correct (i.e., standard pronunciation) or incorrect. Incorrect words will have one (and only one) phoneme intentionally mispronounced. Prompts are words selected for 1) high frequency, 2) word class (nouns are more imageable), and 3) length (shorter words are preferred, as they present less potential distraction). The prompt will be recorded by a native speaker of Standard Korean with normal duration and neutral pitch/intonation. For each prompt, the question "Is this right?" will be displayed.
3. Response Attributes: The response involves selecting "Yes" or "No" within 2 seconds after the prompt is done playing. "Yes" and "No" will be displayed side by side. The response will be indicated by key press or circling; in the former case, a reaction time may also be recorded.
4. Sample Item: Is this right? Yes No*
5. Supplemental Information: The phoneme inventory includes 19 consonants, 7 vowels, and 2 glides (which may combine with vowels to form 10 diphthongs), following Shin, Kiaer, and Cha (2013). Refer to the Table of Specifications for environments that must be tested for each phoneme.
Refer to A Frequency Dictionary of Korean (Lee, Jang, & Seo, 2017) for word frequency information; words among the top 1,500 most frequent are preferred.
6. Directions and Practice Item(s): Directions: In this section, you will judge the accuracy of Korean word pronunciations. First, you will see a picture that represents a word. Then, you will hear a sound for that word. Last, you will decide whether the word was pronounced correctly. Practice Item A: Is this right? Yes No* The correct answer was "No". The pronunciation you heard was "착상" instead of "책상".

IV. Specification Title: Nonword Identification
1. General Description: Korean users must be able to decode speech sounds. Examinees will identify individual phonemes in spoken Korean.
2. Prompt Attributes: A one- or two-syllable nonword (when possible) will be played once. The target for each item is one Korean phoneme embedded in a V, CV, VC, or VCV nonword carrier. The different sequences allow the phoneme to be tested in a variety of phonetic environments that span its allophonic distribution. For consonant items, the Korean vowels /i/, /a/, /o/, and /u/ will be used to provide context, as these are common in many languages and likely to present little challenge. For vowel items, /n/, /m/, /k/, /p/, and /t/ are candidates for context, as they are also extremely common across world languages. These are also among the most commonly occurring phonemes in spoken Korean. The prompt will be recorded by a native speaker of Standard Korean with normal duration and neutral pitch/intonation. For each prompt, the question "Which sound did you hear?" will be displayed.
3. Response Attributes: The response will involve selecting one of two text options. The selection will be made by pressing a key or circling with a pen/pencil; in the former case, a reaction time may also be recorded. The response may be made as soon as the audio is played, with a limit of 2 seconds per item. The two text options will be identical except for one difference: the target phoneme. The key will match the prompt, and the distractor will be another phoneme with similar articulatory/acoustic properties. For example, if the key is a tense consonant, the distractor would be a lax or aspirated counterpart (e.g., ㄲ and ㅋ). The two options will be displayed side by side, and the location of the key should be random.
4. Sample Item: Which sound did you hear? a. 다* b. 마
5. Supplemental Information: The phoneme inventory includes 19 consonants, 7 vowels, and 2 glides (which may combine with vowels to form 10 diphthongs), following Shin, Kiaer, and Cha (2013). Refer to the Table of Specifications for environments that must be tested for each phoneme. Refer to Shin et al. (2013) and Choo & O'Grady (2003) for suitable key-distractor contrasts.
6. Directions and Practice Item(s): Directions: In this section, you will identify Korean sounds. First, you will hear a sound. Then, you will select the Hangeul letters that match the sound. Practice Item A: Which sound did you hear? a. 다* b. 마 The correct answer was "다".
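For readers who wish to prototype computer delivery of these specifications, the following minimal sketch (in Python) illustrates one way a Nonword Identification trial could be represented and scored. It is not part of the KPD materials: the file name, item content, and all class and function names are hypothetical, while the two-option format, the randomized key position, and the 2-second response limit follow Specification IV above.

    # Illustrative sketch only; not part of the official KPD materials.
    from dataclasses import dataclass
    import random

    @dataclass
    class IdentificationItem:
        audio_file: str       # hypothetical recording of the nonword prompt
        key: str              # option matching the prompt, e.g., "다"
        distractor: str       # similar-articulation contrast, e.g., "따"
        target_phoneme: str   # phoneme the item diagnoses, e.g., "ㄷ"

        def options(self, rng: random.Random) -> list:
            # Per Specification IV, the two options appear side by side
            # and the location of the key should be random.
            opts = [self.key, self.distractor]
            rng.shuffle(opts)
            return opts

    def score_response(item: IdentificationItem, choice: str, rt_seconds: float) -> bool:
        # Correct only if the key is selected within the 2-second limit;
        # the reaction time itself may also be logged for later analysis.
        return choice == item.key and rt_seconds <= 2.0

    if __name__ == "__main__":
        rng = random.Random(7)
        item = IdentificationItem("t4_sample.wav", key="다", distractor="따", target_phoneme="ㄷ")
        print(item.options(rng))                # e.g., ['따', '다']
        print(score_response(item, "다", 1.3))  # True: key chosen in time
        print(score_response(item, "다", 2.6))  # False: over the response limit

Treating a timed-out response as incorrect is one defensible reading of the specification; an alternative would be to record it as missing and exclude it from phoneme accuracy scores.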
APPENDIX C

KPD Production Task Scoring Sheet

Task 1 – Picture Naming / 1 – 그림 말하기
# 단어 – 표적 소리들:
1 빵: ㅃ ㅏ ㅇ; 2 피아노: ㅍ ㅣ ㅏ ㄴ ㅗ; 3 원숭이: ㅝ ㄴ ㅅ ㅜ ㅇ ㅣ; 4 나비: ㄴ ㅏ ㅂ ㅣ; 5 토끼: ㅌ ㅗ ㄲ ㅣ; 6 여자: ㅕ ㅈ ㅏ; 7 돼지: ㄷ ㅙ ㅈ ㅣ; 8 아저씨: ㅏ ㅈ ㅓ ㅆ ㅣ; 9 집: ㅈ ㅣ ㅂ; 10 새: ㅅ ㅐ; 11 택시: ㅌ ㅐ ㄱ ㅆ ㅣ; 12 코: ㅋ ㅗ; 13 귀: ㄱ ㅟ; 14 땅콩: ㄸ ㅏ ㅇ ㅋ ㅗ ㅇ; 15 컴퓨터: ㅋ ㅓ ㅁ ㅍ ㅠ ㅌ ㅓ; 16 포도: ㅍ ㅗ ㄷ ㅗ; 17 돈: ㄷ ㅗ ㄴ; 18 화장실: ㅎ ㅘ ㅈ ㅏ ㅇ ㅅ ㅣ ㄹ; 19 시계: ㅅ ㅣ ㄱ ㅔ; 20 학생: ㅎ ㅏ ㄱ ㅆ ㅐ ㅇ; 21 딸기: ㄸ ㅏ ㄹ ㄱ ㅣ; 22 맥주: ㅁ ㅐ ㄱ ㅉ ㅜ; 23 의자: ㅢ ㅈ ㅏ; 24 그림: ㄱ ㅡ ㄹ ㅣ ㅁ; 25 용: ㅛ ㅇ; 26 침대: ㅊ ㅣ ㅁ ㄷ ㅐ; 27 쓰레기: ㅆ ㅡ ㄹ ㅔ ㄱ ㅣ; 28 왕: ㅘ ㅇ; 29 라면: ㄹ ㅏ ㅁ ㅕ ㄴ; 30 왼쪽: ㅚ ㄴ ㅉ ㅗ ㄱ; 31 불: ㅂ ㅜ ㄹ; 32 초콜릿: ㅊ ㅗ ㅋ ㅗ [ㄹ ㄹ] ㅣ ㄷ; 33 빨간색: ㅃ ㅏ ㄹ ㄱ ㅏ ㄴ ㅅ ㅐ ㄱ; 34 꽃: ㄲ ㅗ ㄷ; 35 버스: ㅂ ㅓ ㅅ ㅡ

Task 2 – Nonword Reading / 2 – 글자 읽기
# 글 – 표적 소리:
1 아사: ㅅ; 2 쏘: ㅆ; 3 어: ㅓ; 4 이끼: ㄲ; 5 쭈: ㅉ; 6 왜: ㅙ; 7 우구: ㄱ; 8 도: ㄷ; 9 키: ㅋ; 10 오뽀: ㅃ; 11 자: ㅈ; 12 와: ㅘ; 13 으: ㅡ; 14 여: ㅕ; 15 포: ㅍ; 16 이리: ㄹ; 17 얘: ㅒ; 18 이지: ㅈ; 19 유: ㅠ; 20 기: ㄱ; 21 오호: ㅎ; 22 워: ㅝ; 23 압: ㅂ; 24 셔: ㅅ; 25 하: ㅎ; 26 잉: ㅇ; 27 아타: ㅌ; 28 이: ㅣ; 29 우부: ㅂ; 30 옥: ㄱ; 31 에: ㅔ; 32 토: ㅌ; 33 오코: ㅋ; 34 오모: ㅁ; 35 의: ㅢ; 36 옴: ㅁ; 37 우쑤: ㅆ; 38 위: ㅟ; 39 알: ㄹ; 40 이니: ㄴ; 41 씨: ㅆ; 42 옹오: ㅇ; 43 빠: ㅃ; 44 우푸: ㅍ; 45 앋: ㄷ; 46 오: ㅗ; 47 루: ㄹ; 48 우뚜: ㄸ; 49 노: ㄴ; 50 미: ㅁ; 51 보: ㅂ; 52 따: ㄸ; 53 이디: ㄷ; 54 우: ㅜ; 55 샤: ㅅ; 56 우추: ㅊ; 57 아: ㅏ; 58 오쪼: ㅉ; 59 안: ㄴ; 60 까: ㄲ; 61 치: ㅊ; 62 울루: [ㄹㄹ]; 63 수: ㅅ

APPENDIX D

Scoring Guidelines for KPD Production Tasks

Supplies: Scoring Sheet, Pen or Pencil, Headphones
How to Score: Write the student's name or ID number on the Scoring Sheet. With the Scoring Sheet in front of you, listen to the student's audio file for Task 1 – Picture Naming and Task 2 – Nonword Reading. Judge each target sound as correct (easily identifiable) or incorrect (uncertain or unclear). Mark incorrect target sounds by crossing them out on the scoring sheet.
Scoring Criteria: This test is designed to identify pronunciation weaknesses. However, pronunciation of target sounds does not have to be "perfect" or exactly native-like. Instead, target sounds should be clearly and easily recognizable, without ambiguity. You should mark a target sound as incorrect if…
• It could be understood as a different Korean sound
• It is not 100% clear to you as the target sound
• You hesitate or have to really think about whether you heard the target sound
• The sound seems starkly out of place in the given word
• The sound does not sound at all like a Korean sound
Note: Sometimes, a student will self-correct, or the test administrator will prompt them to try a different word. In this case, judge the student's final production.

한국어 발음 진단 검사 (KPD) 조음 채점법
필수품: 채점지, 연필이나 볼펜, 이어폰/헤드폰
채점법: 채점지에 학습자의 이름 또는 번호를 적는다. 채점지를 앞에 두고 학습자의 오디오 파일을 듣는다. 각 표적 소리를 정답(쉽게 구분함) 또는 오답(불확실, 분명하지 않음)으로 평가한다. 오답일 경우, 채점지의 표적소리에 줄을 그어 표시한다.
채점 기준: 이 시험은 학습자 발음의 취약점을 알아보기 위한 것이다. 하지만, 표적소리의 조음은 완벽하거나 원어민의 조음과 똑같지 않아도 된다. 그 대신에 표적소리가 애매한 것 없이 분명하고 쉽게 구분할 수 있어야 한다. 다음과 같은 경우의 조음은 오답으로 평가해야 한다:
• 표적소리가 아닌 다른 한국어 소리로 알아들을 수 있다
• 100% 표적소리가 전혀 아니다
• 조음을 들은 후에 망설이게 되고 표적소리 인지를 고민하게 된다
• 주어진 단어 환경에서 조음이 자연스럽지 않다
• 한국어의 소리가 전혀 아닌 것 같다
특이 사항: 가끔 학습자가 조음을 혼자서 수정한 후 다시 말하거나 반복해서 말할 때가 있다. 또는 시험 감독자가 다른 단어를 말하도록 유도하기도 한다. 이런 경우에는 학습자의 마지막 조음을 평가한다.

APPENDIX E

Language Background Questionnaire

배경 설문 – Background Information
1. 기본 정보 – Basic Information
성/Last Name (영문/English): 명/First Name (영문/English): 날짜/Date:
국적/Home Country: 출생년도/Year of Birth (예: 1986): 성별: □남/Male □녀/Female
2. 지금 대학교나 어학당/학원에 다닙니까? Are you currently attending a university or language school? □ 아니요 / No □ 예 / Yes 학교 이름/School Name: ________________________________________________
3. 연락처가 무엇입니까? What is your contact information?
(결과를 받고 싶을 경우/If you want to see your results)
이메일: _______________________________________
카카오 ID: _________________________________
3. 지금 몇 급 수업을 듣습니까? What level of Korean class are you taking now? □ 1 급 □ 2 급 □ 3 급 □ 4 급 □ 5 급 □ 6 급 □ 지금 대학교나 대학원 수업을 듣는다 □ 다른 경우/Other: _______________________________________________________
4. 이번에 언제 한국에 들어왔습니까? (년/월/일)/When did you arrive in Korea? (YYYY/MM/DD): ___________________________________________________
5. 최종 학력을 표시하세요: (가장 최근의 교육 수준을 표시하세요.) Please check your highest education level:
□ 고등학교 시작 / Less than high school
□ 고등학교 졸업 / High school graduate
□ 직업교육훈련 / Vocational training
□ 대학교 시작 / Some college
□ 학사 (대학교 졸업) / 3-4 year degree (B.A., B.S., etc.)
□ 대학원 시작 / Some graduate school
□ 석사 졸업 / Master's degree
□ 박사 등 / Ph.D./M.D./J.D.
□ 다른 학위/Other: ________________________________________

언어 배경 – Language Background
1. 아는 언어 중에서 잘하는 언어를 순서대로 다 쓰세요: Please list all the languages you know in order of dominance (i.e., strength): 1 위: 2 위: 3 위: 4 위: 5 위: 기타/Others:
2. 위의 언어들을 요즘 얼마나 많이 사용하는지 퍼센트(%)로 표시하세요 (총 퍼센트(%)가 100 이어야 합니다): Please list what percentage of the time you are currently and on average exposed to each language you listed above. (Your percentages should add up to 100): 언어/language: 사용%:
3. 한국어는 내가 ____번째로 배운 언어입니다. / Korean is my ____th language. □ 1 □ 2 □ 3 □ 4 □ 5 (또는 5 이상/ 5 or later)
4. 아래에 있는 것을 했을 때 몇 살이었습니까? How old were you when you...
한국어 배우기를 시작했을 때/began learning Korean?
처음 한국어로 이야기할 수 있을 때/became conversational in Korean?
한국어로 읽기 시작했을 때/began reading in Korean?
처음 한국어를 유창하게 할 수 있을 때/became fluent in reading Korean?
5. 다음을 경험한 기간이 몇 년 동안 / 몇 개월 동안이었는지 쓰세요. Please list the total number of years and months you spent in each Korean language environment: (몇 년 / Years; 몇 개월 / Months)
한국에서 살기 Living in South Korea
한국어를 말하는 가족과 살기 Living with a family that speaks Korean
한국 안에 있는 학교나 학원에서 한국어를 공부하기 Studying Korean in a school in South Korea
다른 나라에 있는 학교나 학원에서 한국어를 공부하기 Studying Korean in a school in another country
한국 여행하기 Vacation in South Korea
6. 당신의 한국어 말하기, 듣기, 쓰기, 읽기 능력을 0 부터 10 중에서 표시하세요: On a scale from 0 to 10, please select your level of proficiency in speaking, listening, writing, and reading Korean:
0 = 못 함 / none; 1 = 매우 낮음 / very low; 2 = 낮음 / low; 3 = 보통 / fair; 4 = 거의 충분 / slightly less than adequate; 5 = 충분 / adequate; 6 = 충분보다 조금 높음 / slightly more than adequate; 7 = 잘 함 / good; 8 = 매우 잘 함 / very good; 9 = 훌륭함 / excellent; 10 = 완벽 / perfect
능력/Your Skills (숫자를 쓰세요/write a number): 말하기/Speaking: 듣기/Listening: 쓰기/Writing: 읽기/Reading:
7. 한국어 능력 시험을 본 적이 있습니까? 시험 이름, 날짜와 점수를 쓰세요. Have you ever taken a Korean proficiency test? Please write the test name, date, and score. (시험 이름/Test Name; 점수/Score; 날짜(년/월)/Date (Year/Month))
8. 아래 있는 것들이 당신의 한국어 배우기에 얼마나 기여했는지 0 부터 10 중에서 표시하세요: On a scale from 0 to 10, please select how much the following factors contributed to your Korean learning:
0 = 전혀 도움이 안됨 / not at all; 1 = 아주 조금 도움됨 / minimally; 5 = 보통 / moderately; 10 = 가장 많이 도움됨 / most importantly
친구와 소통/Interacting with friends: 가족과 소통/Interacting with family: 읽기/Reading: 혼자서 공부/Self-study: TV 나 영화 보기/Watching TV or movies: 음악 듣기/Listening to music:
9. 현재 아래의 상황에서 얼마나 한국어를 사용하는지 표시하세요: On a scale from 0 to 10, select how much you are currently exposed to Korean in the following contexts:
0 = 전혀 사용 안함 / none; 1 = 거의 사용 안함 / almost never; 5 = 50%정도 / half the time; 10 = 항상 / always
예: 읽을 때 50%정도 한국어로 읽어요 (즉 남은 50%는 내 모국어로, 아니면 다른 언어로 읽어요). Example: When I am reading, it is in Korean about half the time (5) (e.g., the other half is in my first language or another language).
친구와 소통/Interacting with friends: 가족과 소통/Interacting with family: 읽기/Reading: 혼자서 공부/Self-study: TV 나 영화 보기/Watching TV or movies: 음악 듣기/Listening to music:
10. 아래에 있는 것들이 한국어 배우기에 얼마나 동기를 주는지 0 부터 10 중에서 표시하세요: On a scale from 0 to 10, please select how much each of the following motivates you to learn Korean:
0 = 전혀 도움이 안됨 / not at all; 1 = 아주 조금 도움됨 / minimally; 5 = 보통 / moderately; 10 = 가장 많이 도움됨 / most importantly
취직/Getting a job: 대학교 입학, 다른 교육/Going to university or other training: 한국어 하는 가족/Korean-speaking family: 한국인 친구 사귀기/Friendship with Koreans: 돈 더 벌기/Earning more money: 가족과 친구에게 감명주기/Impressing friends and family: 한국어 하는 부부나 애인/Korean-speaking spouse or romantic partner: 한국문화/Korean culture:

APPENDIX F

Pronunciation Self-Assessment

발음 자기 평가 – Pronunciation Self-Assessment
1 부: 발음 전체 인상/Part 1: Your Overall Impressions
당신의 전반적인 한국어 발음 능력을 생각해 보세요. 무엇을 말하는지 (단어, 문법)가 아니라 당신이 어떻게 말하는지 (소리, 억양/인토네이션, 율동/리듬)에 집중하세요. 또한, 다른 사람이 당신의 한국어 말하기에 어떻게 반응하는지를 생각해 보세요. Think about your general pronunciation ability in Korean. Focus on how you speak (sounds, intonation, and rhythm) rather than what you say (vocabulary, grammar). Also, think about how others react to your Korean speaking.
1. 이해 난이도: 다른 사람들이 당신의 말을 얼마나 쉽게 이해합니까? Comprehensibility: How easily are you understood by others? Your Korean speaking is... (1 = 아주 이해하기 어려움 / Extremely hard to understand … 9 = 아주 이해하기 쉬움 / Extremely easy to understand)
2. (외국) 억양: 당신이 한국어를 말할 때 한국사람처럼 들립니까? Accent: Do you sound like a South Korean when you speak Korean? You have... (1 = 외국 억양이 매우 심함 / A very strong foreign accent … 9 = 외국 억양이 거의 없음 / Very little accent)
3. 만족감: 당신의 한국어 발음에 얼마나 만족합니까? Satisfaction: To what extent are you happy with the way you pronounce Korean? You are... (1 = 하나도 안 만족함 / Not happy at all … 9 = 완전히 만족함 / Completely happy)
4. 가치: 당신에게 한국어 발음은 얼마나 중요합니까? Value: How important is Korean pronunciation to you? It is... (1 = 하나도 안 중요함 / Not important at all … 9 = 매우 중요함 / Extremely important)

2 부: 한국어의 소리 Part 2: Individual Sounds
이 부분에서는 한국어의 소리에 (예: ㄱ, ㄴ, ㄷ) 대해서 생각할 것입니다. 한국어에는 28 개의 소리가 있습니다: 자음 19 개, 모음 7 개와 반모음 2 개입니다. (반모음은 '와'와 '요'의 첫 소리입니다). For this part of the self-assessment, you will need to think about the individual sounds of Korean (ㄱ, ㄴ, ㄷ, etc.). Korean has 28 unique sounds: 19 consonants, 7 vowels, and 2 glides (glides are the first sound in 와 and 요).
각 소리에 대해서 (a) 그 소리를 잘 듣는 것과 (b) 그 소리를 잘 발음하는 것이 얼마나 어려운지를 생각하세요. (1 = 항상 어렵다, 7 = 전혀 어렵지 않다) For each individual sound, think about how difficult (1 = always difficult, 7 = never difficult) it is for you to (a) clearly pronounce the sound and (b) clearly hear the sound.
그 소리가 얼마나 어려운지를 잘 모르겠다면 '모르겠다'를 선택해도 됩니다. If you are really not sure at all about a sound, you can select "Not sure".

한국어 자음 1 / Korean Consonants 1:
[평가표: 각 소리(ㄱ, ㄲ, ㅋ, ㄷ, ㄸ, ㅌ)마다 발음(Pron.)과 듣기(Hearing)를 1–7 척도로 평가 (1 = 항상 어렵다 Always Difficult, 7 = 전혀 어렵지 않다 Never Difficult; ? = 모르겠다 Not sure) / Rating grid: each sound is rated for Pronunciation and Hearing on a 1–7 scale, with "?" for Not sure.]

한국어 자음 2 / Korean Consonants 2:
[평가표: 각 소리(ㅂ, ㅃ, ㅍ, ㅈ, ㅉ, ㅊ, ㅅ, ㅆ, ㅁ, ㄴ, ㅇ (예: 방), ㄹ, ㅎ)마다 발음(Pron.)과 듣기(Hearing)를 같은 1–7 척도와 '?'로 평가 / Same rating grid for these 13 consonants.]

한국어 모음 / Korean Vowels:
[평가표: 각 소리(ㅏ, ㅣ, ㅓ, ㅗ, ㅜ, ㅡ, ㅔ/ㅐ)마다 발음과 듣기를 같은 1–7 척도와 '?'로 평가 / Same rating grid for these vowels.]

한국어 반모음 / Korean Glides:
[평가표: /w/ (예: 와, 워, 외, 왜, 위)와 /j/ (예: 야, 요, 여, 유, 예, 얘, 의)의 발음과 듣기를 같은 1–7 척도와 '?'로 평가 / Same rating grid for the two glides.]

APPENDIX G

Independent Speaking Task

한국어 말하기 Speaking Task
설명: 질문을 잘 읽으세요. 그리고 그 질문에 대한 의견을 말하세요. 15 초동안 의견을 생각한 다음에 1 분 동안 말하세요. Directions: Read the question below. You will give your opinion on the question. You will have 15 seconds to think about your opinion, and then you will have 1 minute to speak.
질문: 어떤 사람들은 작은 도시에 사는 것을 좋아해요. 또 어떤 사람들은 큰 도시에 사는 것을 좋아해요. 당신은 작은 도시와 큰 도시중에서 어디에 살고 싶어요? 왜요? Question: Some people prefer to live in a small town. Others prefer to live in a big city. Which place would you prefer to live in and why?

APPENDIX H

Korean EIT Directions and Practice Items

In this task, you'll be asked to repeat some sentences in Korean and some sentences in English. Please follow the instructions carefully. Please do not take any notes during this exercise. Now let's begin.
이 시험에서는 한국어 문장을 듣고 그 문장을 따라할 거예요. 설명을 잘 듣고 따라해 보세요.
You are going to hear several sentences in Korean. After each sentence, there will be a short pause, followed by a tone sound {TONE}. Your task is to try to repeat exactly what you hear. You will be given sufficient time after the tone to repeat the sentence. Repeat as much as you can. Remember, DON'T START REPEATING THE SENTENCE UNTIL YOU HEAR THE TONE SOUND {TONE}. Now let's begin.
지금부터 한국어 연습 문장 5 개를 들을 거예요. 각 문장을 들은 다음에 삐소리가 나면 <삐소리> 그 문장을 따라 말해보세요. 문장을 완벽하게 따라하는 것이 어려워도 열심히 해 보세요. 삐소리 후 문장을 말할 시간은 충분히 있을 거예요. 하지만 삐소리를 듣기 전에 말하면 안 돼요. 자, 지금 연습 문장을 해 볼까요?
Note: response time given is roughly ~0.6 s per syllable.
나는 꽃이 좋아. (6 syllables) 2.0s pause, 0.5s tone, 3.9s response time [translation: I like flowers]
저는 편지를 써요. (7 syllables) 2.0s pause, 0.5s tone, 4.1s response time [translation: I write a letter]
저는 큰 차가 필요해요. (9 syllables) 2.0s pause, 0.5s tone, 5.6s response time [translation: I need a big car]
비가 와서 밖에 안 나가요. (10 syllables) 2.0s pause, 0.5s tone, 6.0s response time [translation: As it is raining, I don't go out]
그 여자 아이는 축구를 좋아해. (12 syllables) 2.0s pause, 0.5s tone, 7.2s response time [translation: That girl likes soccer]
That was the last practice sentence. 그건 마지막 연습 문장이었어요.
Now you will hear more Korean sentences. Once again, after each sentence, there will be a short pause followed by a tone sound {TONE}. Your task is to try to repeat exactly what you hear in Korean.
You will be given sufficient time after the tone to repeat the sentence. Repeat as much as you can. Remember, don't start repeating the sentence until you hear the tone sound {TONE}.
지금부터 문장들을 더 들을 거예요. 각 문장을 들은 다음에 삐소리가 나면 <삐소리> 그 문장을 따라 말해보세요. 문장을 완벽하게 따라하는 것이 어려워도 열심히 해 보세요. 삐소리 후 말할 시간이 충분히 있을 거예요. 하지만 삐소리를 듣기 전에 말하면 안 돼요.
Do you have any questions? 질문이 있어요?
Now, let's begin. 그럼, 시작합시다!

APPENDIX I

Interview Protocols

Interview 1 – Learner
Orientation; Reflection on Own Pronunciation
1. Ask participants to reflect on their Korean pronunciation / 당신의 발음을 반영해주세요
a. "발음에 대해 어떻게 생각하세요?" / What do you think about your pronunciation?
b. "한국어 발음의 가장 어려운 점은 뭐예요?" / What are the most difficult aspects?
2. Present participants with their self-assessments to assist with reflection.
Interpreting Results
1. Present participants with their KPD results. Ask them to read the results and share any thoughts that arise.
a. "발음 진단 검사 결과를 보면서 의견이나 질문이 있으면 이야기해 주세요"
2. After initial reactions, follow up with the following questions:
a. "결과에 대해 어떤 생각이 있어요?" / What do you think about the test results?
b. "결과는 당신의 생각과 비슷해요?" / Are the results similar to your own impressions? (Can go to the self-evaluation results here)
c. "놀라운 결과가 있어요? 왜요?" / Are there any surprising results? Why?
Learning Activity
1. Participants prompted to discuss study/practice habits:
2. "보통 발음을 어떻게 공부하거나 연습해요?" / How do you usually study or practice pronunciation?
3. "다른 발음 공부나 연습을 시도해 봤어요?" / Have you tried any other methods?
4. "결과를 본 후에 다른 공부나 연습 방법을 시도할 것 같아요?" / After seeing these results, do you think you'll try anything different?
Progress
1. Participants prompted to discuss pronunciation development:
2. "한국어 발음 배우기에 대한 경험을 이야기해주세요"
a. "어려운 게 뭐예요?" "쉬운 것은?" / What has been difficult for you? Easy?
b. "한국어 발음을 배웠을 때, 어떤 단계나 과정을 통해서 경험했어요?" / What process or steps did you experience when learning Korean pronunciation?
3. "한국어 발음에 대한 목표가 있어요?" / Do you have any Korean pronunciation goals?
a. "그 목표를 다 이루었어요? 아니면 아직 멀었어요?" / Have you achieved that goal, or is it still far off?
b. (목표가 없는 경우) "왜 발음에 대한 목표가 없나요?" / Why do you not have any pronunciation goals?

Interview 2 – Learner
KPD Results Follow-up
1. "한국어 발음 시험 결과가 기억나요? 뭐라고 써 있었어요?" Do you remember what your pronunciation test results were? Do you remember what it said?
a. Can review the results if the participant can't recall very well
2. "결과지에 대해 얼마나 생각했어요?" How much have you thought about your test results?
Learning Activity
1. Participants prompted to discuss study/practice habits:
a. "처음 인터뷰한 후에 발음을 공부하거나 연습했어요? 어떻게 했어요?" / Since we last met, have you studied or practiced pronunciation? If so, how?
b. "결과를 본 후에 다른 방법을 해 봤어요?" / After seeing the results, have you tried anything different?
Progress
1. Participants prompted to discuss pronunciation development:
a. "최근에 __씨의 발음에 달라진 게 있어요? 없어요? 설명해주세요." / Recently, have you noticed any changes in your pronunciation? None? Please explain.
2. "한국어 발음에 대한 목표가 있어요?" / Do you have any Korean pronunciation goals?
a. "그 목표를 다 이루었어요? 아니면 아직 멀었어요?" / Have you achieved that goal, or is it still far off?
***Now administer Independent Speaking AND KPD again***

Interview 1 – Teacher
Pronunciation Teaching – 발음 교육
1. Teachers prompted to discuss pronunciation teaching practices:
a. "보통 어떻게 발음을 가르쳐요?" / How do you usually teach pronunciation?
b. "다른 방법으로 가르쳐 봤어요?" / Have you tried any other methods?
c. "학생들 발음을 가르칠 때 어떤 목표나 원칙이 있어요?" / For your students, do you have any pronunciation goals or principles?
Teacher's Observations – 교사의 착안과 평가
2.
Show teachers the list of students who are participating in the study and have completed the KPD. Ask teacher to describe each student’s pronunciation in turn. The student’s Independent Speaking can be played if necessary, to jog the teacher’s memory. a. “전반적으로, <이 학생>의 발음이 어때요?” / Overall, how is this student’s pronunciation? b. “이 학생이 말을 대부분 쉽게 이해 할 수 있어요? 이해하기 얼마나 어려워요?” / Are you able to easily understand what this student says? How difficult is he/she to understand? c. “더 구체적으로, <이 학생>의 발음은 어떤 어려운 점이 있어요?” / More specifically, what are the difficulties this student has in pronunciation? d. “어느 소리/음소를 특별히 어려워해요?” / Which sounds/phonemes are especially difficult? e. “다른 어려운 점이 있어요?” / Are there any other challenging features? Interpreting Results – 결과 이해하기 3. Present teacher with their students’ KPD results. Ask them to read the results and share any thoughts that arise. a. “발음 진단 검사 결과를 보면서 의견이나 질문이 있으면 이야기해 주세요” 4. After initial reactions, follow up with the following questions about the test results in general: a. “결과에 대해 어떻게 생각해요?” / What do you think about the test results? b. “결과는 선생님 생각과 비슷해요?” / Are the results similar to your own impressions? c. “놀라운 결과가 있어요? 왜요?” / Are there any surprising results? Why? Using Results – 결과 응용하기 5. Ask the teacher how he/she might address the student’s pronunciation weaknesses. a. “결과를 본 후에 이 학생들의 취약점을 어떻게 고쳐주고 싶어요?” / After seeing these results, how would you address the students’ weaknesses? b. “이 결과지가 유용할 것 같아요? 왜요?” / Do these score reports seem useful? Why? c. “이 학생들의 발음에 대해서 알고 싶은 다른 것이 있어요?” / Is there anything else you’d like to know about these students’ pronunciation? 330 APPENDIX J Item Statistics 331 Table J1 KPD Production Item Statistics Item T1_01-1 T1_01-2 T1_01-3 T1_02-1 T1_02-2 T1_02-3 T1_02-4 T1_02-5 T1_03-1 T1_03-2 T1_03-3 T1_03-4 T1_03-5 T1_03-6 T1_04-1 T1_04-2 T1_04-3 T1_04-4 T1_05-1 T1_05-2 T1_05-3 T1_05-4 T1_06-1 T1_06-2 T1_06-3 T1_07-1 T1_07-2 T1_07-3 T1_07-4 T1_08-1 T1_08-2 T1_08-3 T1_08-4 T1_08-5 T1_09-1 T1_09-2 T1_09-3 T1_10-1 T1_10-2 T1_11-1 T1_11-2 IF 0.84 1.00 0.96 0.91 1.00 0.98 0.99 1.00 0.90 0.82 0.99 0.94 0.86 0.98 0.99 1.00 0.99 1.00 0.78 0.99 0.75 1.00 0.92 0.98 1.00 0.99 0.88 0.98 0.99 1.00 0.98 0.87 0.94 0.97 0.95 1.00 0.95 0.97 0.99 0.85 0.99 ID 0.34 NA 0.15 0.31 NA 0.10 0.06 NA 0.14 0.04 0.17 0.06 0.22 0.02 0.11 NA 0.03 NA 0.11 0.02 0.05 NA 0.11 0.07 NA 0.01 0.08 0.08 0.01 NA 0.02 0.28 0.19 -0.05 0.18 NA 0.29 0.03 0.06 0.11 0.09 Rasch Measure 1.46 -3.43 -0.23 0.74 -3.43 -0.81 -2.22 -3.43 0.93 1.65 -2.22 0.35 1.29 -0.81 -1.52 -3.43 -1.52 -3.43 1.90 -2.22 2.08 -3.43 0.60 -1.10 -3.43 -1.52 1.15 -1.10 -1.52 -3.43 -1.10 1.24 0.26 -0.58 0.04 -3.43 0.04 -0.39 -2.22 1.38 -2.22 Rasch S.E. 
0.20 1.83 0.39 0.26 1.83 0.51 1.01 1.83 0.24 0.19 1.01 0.30 0.21 0.51 0.72 1.83 0.72 1.83 0.18 1.01 0.17 1.83 0.27 0.59 1.83 0.72 0.22 0.59 0.72 1.83 0.59 0.22 0.32 0.46 0.35 1.83 0.35 0.42 1.01 0.21 1.01 332 Infit MS 0.93 1.00 0.99 0.95 1.00 1.01 1.01 1.00 1.02 1.11 0.99 1.03 1.00 1.02 1.00 1.00 1.01 1.00 1.10 1.01 1.14 1.00 1.03 1.01 1.00 1.02 1.06 1.00 1.01 1.00 1.02 0.97 0.99 1.04 0.98 1.00 0.95 1.02 1.01 1.07 1.00 Infit Outfit Outfit Z -0.55 0.00 0.10 -0.21 0.00 0.17 0.34 0.00 0.19 1.03 0.32 0.22 0.04 0.20 0.23 0.00 0.25 0.00 1.06 0.34 1.62 0.00 0.19 0.20 0.00 0.25 0.42 0.19 0.25 0.00 0.22 -0.18 0.03 0.24 0.04 0.00 -0.06 0.19 0.34 0.55 0.33 MS 0.75 1.00 3.64 0.68 1.00 0.70 0.63 1.00 0.97 1.22 0.32 1.06 0.93 1.37 0.58 1.00 0.88 1.00 1.16 0.87 1.16 1.00 1.00 1.07 1.00 0.93 1.64 1.31 1.69 1.00 1.40 0.85 0.79 1.49 0.67 1.00 0.59 1.21 0.63 1.24 0.51 Z -1.21 0.00 3.19 -1.02 0.00 -0.22 0.21 0.00 0.01 1.14 -0.15 0.28 -0.22 0.70 -0.14 0.00 0.23 0.00 0.98 0.40 1.12 0.00 0.11 0.36 0.00 0.28 2.17 0.61 0.90 0.00 0.70 -0.55 -0.41 0.87 -0.66 0.00 -0.89 0.54 0.21 1.08 0.08 Table J1 (cont’d) T1_11-3 T1_11-4 T1_11-5 T1_12-1 T1_12-2 T1_13-1 T1_13-2 T1_14-1 T1_14-2 T1_14-3 T1_14-4 T1_14-5 T1_14-6 T1_15-1 T1_15-2 T1_15-3 T1_15-4 T1_15-5 T1_15-6 T1_15-7 T1_16-1_p T1_16-2_p T1_16-3_p T1_16-4_p T1_17-1_p T1_17-2_p T1_17-3_p T1_18-1 T1_18-2 T1_18-3 T1_18-4 T1_18-5 T1_18-6 T1_18-7 T1_18-8 T1_19-1 T1_19-2 T1_19-3 T1_19-4 T1_20-1 T1_20-2 T1_20-3 T1_20-4 T1_20-5 T1_20-6 0.97 0.98 1.00 0.92 0.98 0.94 0.84 0.75 0.98 0.97 0.89 1.00 0.94 0.97 0.93 1.00 0.99 0.96 0.94 0.91 0.91 0.92 0.85 0.98 0.96 0.98 0.85 0.99 0.98 0.98 0.99 0.93 1.00 0.98 0.88 0.98 0.99 0.98 0.99 1.00 1.00 0.95 0.92 0.95 0.89 0.08 0.10 NA 0.15 0.11 0.06 0.05 0.28 0.26 0.26 0.00 NA 0.20 0.04 0.09 NA 0.02 0.07 0.19 0.32 0.08 0.17 0.05 0.24 0.05 0.04 0.21 0.03 -0.01 0.17 0.06 0.20 NA 0.19 0.14 0.10 0.18 -0.09 0.09 NA NA 0.12 0.10 0.08 0.25 -0.39 -0.81 -3.42 0.67 -1.10 0.35 1.46 2.11 -1.10 -0.39 1.04 -3.43 0.35 -0.58 0.52 -3.43 -1.52 -0.23 0.26 0.81 0.74 0.60 1.42 -0.81 -0.23 -1.10 1.42 -2.22 -1.10 -1.10 -2.22 0.44 -3.43 -1.10 1.09 -1.10 -1.52 -0.81 -2.22 -3.43 -3.43 0.15 0.67 0.04 0.99 0.42 0.51 1.83 0.27 0.59 0.30 0.20 0.17 0.59 0.42 0.23 1.83 0.30 0.46 0.28 1.83 0.72 0.39 0.32 0.25 0.26 0.27 0.21 0.51 0.39 0.59 0.21 1.01 0.59 0.59 1.01 0.29 1.83 0.59 0.23 0.59 0.72 0.51 1.01 1.83 1.83 0.33 0.27 0.35 0.24 1.01 1.00 1.00 1.02 1.00 1.03 1.09 1.00 0.97 0.96 1.09 1.00 0.98 1.01 1.02 1.00 1.01 1.01 0.99 0.94 1.05 1.00 1.09 0.97 1.01 1.02 1.01 1.01 1.02 0.99 1.01 0.99 1.00 0.98 1.03 1.00 0.99 1.04 1.00 1.00 1.00 1.01 1.02 1.02 0.98 0.16 0.17 0.00 0.14 0.18 0.22 0.76 -0.03 0.13 0.03 0.59 0.00 0.02 0.18 0.18 0.00 0.25 0.15 0.05 -0.25 0.30 0.08 0.72 0.10 0.14 0.21 0.10 0.34 0.23 0.17 0.34 0.06 0.00 0.16 0.25 0.19 0.21 0.24 0.33 0.00 0.00 0.14 0.19 0.18 -0.07 1.07 0.78 1.00 0.88 0.77 1.12 1.52 1.01 0.38 0.49 1.53 1.00 0.78 2.21 0.90 1.00 1.95 1.77 0.70 0.67 1.02 0.81 1.28 0.45 1.29 0.86 0.90 0.76 1.40 0.51 0.63 0.90 1.00 0.48 1.13 0.82 0.43 1.93 0.51 1.00 1.00 0.87 2.32 1.14 0.76 0.32 -0.08 0.00 -0.25 -0.02 0.41 2.17 0.12 -0.71 -0.89 1.77 0.00 -0.48 1.62 -0.16 0.00 1.06 1.35 -0.66 -1.11 0.18 -0.47 1.25 -0.72 0.67 0.10 -0.40 0.32 0.70 -0.43 0.21 -0.14 0.00 -0.50 0.57 0.05 -0.36 1.25 0.08 0.00 0.00 -0.16 3.02 0.44 -0.85 333 Table J1 (cont’d) T1_21-1 T1_21-2 T1_21-3 T1_21-4 T1_21-5 T1_22-1 T1_22-2 T1_22-3 T1_22-4 T1_22-5 T1_23-1 T1_23-2 T1_23-3 T1_24-1 T1_24-2 T1_24-3 T1_24-4 T1_24-5 T1_25-1_p T1_25-2_p T1_26-1 T1_26-2 T1_26-3 T1_26-4 T1_26-5 T1_27-1 T1_27-2 
T1_27-3 T1_27-4 T1_27-5 T1_27-6 T1_28-1 T1_28-2 T1_29-1_n T1_29-2_n T1_29-3_n T1_29-4_n T1_29-5_n T1_30-1 T1_30-2 T1_30-3 T1_30-4 T1_30-5 T1_31-1 T1_31-2 0.85 1.00 0.96 0.99 0.99 0.99 0.99 0.90 0.93 1.00 0.90 0.99 1.00 0.98 1.00 0.99 0.99 0.98 0.92 0.96 0.96 1.00 0.97 0.99 1.00 0.96 0.99 0.98 0.99 0.99 0.99 0.99 0.97 0.97 1.00 1.00 0.95 0.85 0.95 0.98 0.49 0.95 0.81 0.94 0.99 0.32 NA 0.20 0.04 0.12 -0.02 -0.02 0.16 0.21 NA 0.10 0.10 NA 0.04 NA -0.05 0.17 0.15 0.21 0.17 0.13 NA 0.13 0.06 NA 0.12 0.08 0.02 0.13 0.12 0.07 -0.03 0.24 0.13 NA NA 0.15 0.29 0.03 0.09 0.04 -0.01 0.31 0.13 0.04 1.42 -3.43 -0.08 -1.52 -2.22 -2.22 -2.22 0.93 0.44 -3.43 0.87 -1.52 -3.43 -1.10 -3.43 -2.22 -2.22 -1.10 0.67 -0.23 -0.23 -3.43 -0.58 -1.52 -3.43 -0.08 -1.52 -0.81 -1.52 -2.22 -2.22 -2.22 -0.39 -0.58 -3.43 -3.43 0.04 1.42 0.04 -1.10 3.39 0.04 1.69 0.35 -2.22 0.21 1.83 0.37 0.72 1.01 1.01 1.01 0.24 0.29 1.83 0.25 0.72 1.83 0.59 1.83 1.01 1.01 0.59 0.27 0.39 0.39 1.83 0.46 0.72 1.83 0.37 0.72 0.51 0.72 1.01 1.01 1.01 0.42 0.46 1.83 1.83 0.35 0.21 0.35 0.59 0.15 0.35 0.19 0.30 1.01 0.95 1.00 0.98 1.01 1.00 1.02 1.02 1.02 0.98 1.00 1.04 1.00 1.00 1.02 1.00 1.02 0.99 0.99 0.98 0.99 1.00 1.00 1.00 1.01 1.00 1.00 1.00 1.02 1.00 1.00 1.01 1.02 0.97 1.00 1.00 1.00 1.00 0.96 1.04 1.01 1.20 1.05 0.96 1.01 1.01 -0.37 0.00 0.04 0.24 0.33 0.35 0.35 0.15 -0.01 0.00 0.27 0.24 0.00 0.22 0.00 0.35 0.32 0.17 -0.02 0.08 0.12 0.00 0.14 0.24 0.00 0.12 0.24 0.20 0.23 0.33 0.34 0.35 0.04 0.13 0.00 0.00 0.11 -0.24 0.23 0.20 3.66 0.25 -0.35 0.11 0.34 0.78 1.00 0.71 1.46 0.42 1.29 1.29 0.94 2.38 1.00 1.05 0.62 1.00 1.26 1.00 1.73 0.32 0.56 1.02 0.85 0.75 1.00 0.74 0.88 1.00 1.18 0.80 1.25 0.52 0.44 0.60 1.41 0.71 0.92 1.00 1.00 1.01 0.81 1.16 0.72 1.23 1.89 0.83 1.04 0.72 -1.00 0.00 -0.48 0.73 -0.02 0.67 0.67 -0.13 2.79 0.00 0.27 -0.08 0.00 0.56 0.00 0.90 -0.15 -0.35 0.16 -0.11 -0.33 0.00 -0.24 0.23 0.00 0.51 0.14 0.56 -0.22 0.00 0.17 0.74 -0.36 0.09 0.00 0.00 0.17 -0.86 0.49 -0.10 3.08 1.68 -0.88 0.24 0.28 334 Table J1 (cont’d) T1_31-3 T1_32-1 T1_32-2 T1_32-3 T1_32-4 T1_32-5 T1_32-7 T1_33-1 T1_33-2 T1_33-3 T1_33-4 T1_33-5 T1_33-6 T1_33-7 T1_33-8 T1_33-9 T1_34-1 T1_34-2 T1_34-3 T1_35-1 T1_35-2 T1_35-3 T1_35-4 T2_01 T2_02 T2_03 T2_04 T2_05 T2_06 T2_07 T2_08 T2_09 T2_10 T2_11 T2_12 T2_13 T2_14 T2_15 T2_16 T2_17 T2_18 T2_19 T2_20 T2_21 T2_22 0.98 0.99 0.99 0.99 0.99 0.83 0.80 0.80 1.00 0.98 1.00 0.99 0.97 0.99 1.00 0.86 0.76 0.98 0.72 0.98 0.91 1.00 0.86 0.88 0.94 0.96 0.80 0.70 0.83 0.83 0.76 0.85 0.83 0.85 0.63 0.83 0.90 0.96 0.95 0.81 0.76 0.91 0.91 0.97 0.78 0.17 0.05 0.21 0.13 -0.01 0.02 0.29 0.28 NA -0.02 NA -0.03 0.11 0.00 NA 0.28 0.25 0.05 0.40 0.07 0.22 NA 0.19 0.10 0.13 0.30 0.42 0.28 0.34 0.26 0.16 0.14 0.30 0.43 0.25 0.32 0.35 0.07 0.14 0.16 0.25 0.29 0.23 0.16 0.37 -0.81 -1.52 -1.52 -1.52 -1.52 1.58 1.76 1.76 -3.43 -1.10 -3.43 -2.22 -0.39 -2.22 -3.43 1.29 2.05 -1.10 2.28 -1.10 0.74 -3.43 1.33 1.09 0.26 -0.23 1.76 2.37 1.58 1.54 2.05 1.42 1.58 1.38 2.73 1.54 0.93 -0.08 0.15 1.73 2.05 0.74 0.81 -0.39 1.93 0.51 0.72 0.72 0.72 0.72 0.20 0.19 0.19 1.83 0.59 1.83 1.01 0.42 1.01 1.83 0.21 0.17 0.59 0.17 0.59 0.26 1.83 0.21 0.23 0.32 0.39 0.19 0.16 0.20 0.20 0.17 0.21 0.20 0.21 0.16 0.20 0.24 0.37 0.33 0.19 0.17 0.26 0.25 0.42 0.18 335 0.99 1.01 0.98 1.00 1.02 1.12 0.97 0.97 1.00 1.02 1.00 1.02 1.00 1.01 1.00 0.97 1.01 1.01 0.91 1.00 0.98 1.00 1.02 1.04 1.01 0.95 0.89 0.98 0.94 0.99 1.05 1.03 0.96 0.88 1.02 0.95 0.92 1.02 1.01 1.04 1.01 0.96 0.97 0.99 0.92 0.14 0.25 0.20 0.23 0.26 1.00 -0.23 -0.27 0.00 0.23 0.00 0.35 0.13 0.34 0.00 
-0.19 0.09 0.21 -1.32 0.18 -0.02 0.00 0.15 0.28 0.14 -0.03 -1.05 -0.21 -0.50 -0.08 0.60 0.28 -0.28 -0.89 0.38 -0.38 -0.40 0.15 0.13 0.38 0.10 -0.15 -0.10 0.11 -0.92 0.55 0.78 0.37 0.53 1.77 1.31 1.02 0.90 1.00 1.69 1.00 1.41 1.70 1.01 1.00 0.87 0.97 0.95 0.87 0.85 0.91 1.00 1.03 1.07 0.81 0.48 0.74 0.96 0.77 0.92 1.11 1.04 0.85 0.66 0.99 0.73 0.62 1.40 0.76 1.10 0.96 0.68 0.90 0.65 0.75 -0.51 0.12 -0.48 -0.21 0.95 1.50 0.17 -0.53 0.00 0.96 0.00 0.74 1.19 0.50 0.00 -0.48 -0.16 0.22 -0.99 0.09 -0.18 0.00 0.21 0.33 -0.35 -1.03 -1.54 -0.30 -1.20 -0.35 0.74 0.26 -0.74 -1.67 -0.10 -1.39 -1.40 0.88 -0.46 0.58 -0.23 -1.03 -0.24 -0.50 -1.62 Table J1 (cont’d) T2_23 T2_24 T2_25 T2_26 T2_27 T2_28 T2_29 T2_30 T2_31 T2_32 T2_33 T2_34 T2_35 T2_36 T2_37 T2_38 T2_39 T2_40 T2_41 T2_42 T2_43 T2_44 T2_45 T2_46 T2_47 T2_48 T2_49 T2_50 T2_51 T2_52 T2_53 T2_54 T2_55 T2_56 T2_57_n T2_58 T2_59_n T2_60 T2_61 T2_62 T2_63 0.48 0.76 0.81 0.92 0.88 0.98 0.98 0.61 0.75 0.51 0.99 0.99 1.00 0.99 0.98 0.98 0.99 0.98 0.91 0.94 0.96 0.97 0.98 0.91 1.00 0.85 0.96 0.69 1.00 0.96 0.98 0.77 0.76 0.87 0.80 0.98 0.66 0.92 0.99 0.88 0.99 0.34 0.26 0.25 0.06 0.04 0.12 0.19 0.29 0.16 0.23 0.18 0.18 NA -0.05 0.10 0.03 0.11 0.15 0.29 0.30 0.03 0.13 0.15 0.19 NA 0.04 0.20 0.29 NA 0.10 -0.02 0.29 0.16 0.20 0.46 0.08 0.18 -0.07 0.06 0.15 -0.05 3.42 2.02 1.73 0.60 1.09 -0.81 -1.10 2.81 2.11 3.30 -2.22 -2.22 -3.43 -2.22 -1.10 -1.10 -2.22 -1.10 0.74 0.26 -0.23 -0.39 -0.81 0.74 -3.43 1.42 -0.08 2.45 -3.43 -0.23 -1.10 1.99 2.02 1.24 1.76 -0.81 2.58 0.60 -1.51 1.09 -2.22 0.15 0.18 0.19 0.27 0.23 0.51 0.59 0.16 0.17 0.15 1.01 1.01 1.83 1.01 0.59 0.59 1.01 0.59 0.26 0.32 0.39 0.42 0.51 0.26 1.83 0.21 0.37 0.16 1.83 0.39 0.59 0.18 0.18 0.22 0.19 0.51 0.16 0.27 0.72 0.23 1.01 336 0.94 0.99 0.99 1.04 1.08 1.00 0.98 1.01 1.06 1.04 0.99 0.99 1.00 1.02 1.00 1.02 1.00 0.99 0.95 0.94 1.03 1.00 0.99 1.00 1.00 1.09 0.98 0.98 1.00 1.00 1.01 0.98 1.06 1.01 0.87 1.01 1.06 1.09 1.01 1.01 1.02 -1.14 -0.14 -0.03 0.27 0.55 0.15 0.15 0.19 0.77 0.73 0.32 0.32 0.00 0.35 0.19 0.21 0.33 0.17 -0.16 -0.13 0.19 0.12 0.14 0.05 0.00 0.75 0.04 -0.29 0.00 0.13 0.21 -0.25 0.77 0.10 -1.33 0.18 1.08 0.49 0.24 0.12 0.35 0.92 0.90 0.98 1.13 1.15 0.95 0.53 0.99 0.98 1.02 0.31 0.31 1.00 1.73 0.69 1.56 0.46 0.66 0.70 1.08 1.40 0.76 0.68 0.92 1.00 2.01 0.87 0.91 1.00 1.37 0.93 0.88 0.99 0.94 0.68 0.84 1.10 1.63 1.15 1.27 1.73 -1.10 -0.60 -0.06 0.48 0.62 0.17 -0.41 -0.07 -0.06 0.32 -0.17 -0.17 0.00 0.90 -0.13 0.85 0.03 -0.18 -0.92 0.32 0.85 -0.25 -0.27 -0.15 0.00 3.65 -0.10 -0.78 0.00 0.80 0.20 -0.74 -0.05 -0.17 -1.99 0.01 0.99 1.65 0.48 1.04 0.90 Table J2 KPD Perception Item Statistics Item T3_01_s T3_02 T3_03 T3_04 T3_05 T3_06 T3_07 T3_08 T3_09 T3_10 T3_11 T3_12 T3_13 T3_14 T3_15 T3_16 T3_17 T3_18 T3_19 T3_20 T3_21 T3_22 T3_23 T3_24 T3_25 T3_26 T3_27 T3_28 T3_29 T3_30 T3_31 T3_32 T3_33_n T3_34 T3_35 T3_36 T3_37 T3_38 IF 0.58 0.46 0.91 0.59 0.77 0.40 0.69 0.92 0.81 0.98 0.73 0.59 0.33 0.75 0.66 0.52 0.99 0.39 0.71 0.49 0.80 0.29 0.73 0.45 0.46 0.39 0.77 0.39 0.18 0.53 0.34 0.19 0.42 0.14 1.00 0.89 0.91 0.98 ID 0.51 0.18 0.32 0.33 0.40 0.50 0.42 0.25 0.34 0.29 0.47 0.58 0.41 0.21 0.46 0.26 0.13 0.31 0.47 0.53 0.43 0.51 0.37 0.37 0.34 0.57 0.36 0.30 0.41 0.13 0.32 0.17 0.38 -0.17 NA 0.29 0.35 0.12 Rasch Measure 1.80 2.39 -0.48 1.78 0.79 2.66 1.24 -0.62 0.49 -2.04 1.01 1.78 3.03 0.92 1.42 2.09 -3.46 2.71 1.15 2.24 0.59 3.29 1.01 2.41 2.39 2.71 0.76 2.74 4.06 2.04 2.98 3.94 2.56 4.38 -4.67 -0.23 -0.41 -2.34 Rasch S.E. 
0.16 0.16 0.26 0.16 0.18 0.16 0.17 0.27 0.19 0.51 0.17 0.16 0.17 0.17 0.16 0.16 1.00 0.16 0.17 0.16 0.19 0.17 0.17 0.16 0.16 0.16 0.18 0.16 0.20 0.16 0.17 0.20 0.16 0.22 1.82 0.24 0.25 0.58 337 Infit MS 0.87 1.22 0.92 1.04 0.93 0.85 0.92 0.96 0.96 0.92 0.87 0.80 0.95 1.07 0.89 1.10 0.98 1.04 0.87 0.84 0.88 0.82 0.95 0.99 1.01 0.82 0.95 1.07 0.85 1.23 1.08 1.14 0.99 1.52 1.00 0.96 0.90 0.99 Infit Outfit Outfit Z -2.56 3.47 -0.37 0.68 -0.81 -2.36 -1.23 -0.13 -0.39 -0.02 -1.75 -4.00 -0.64 0.89 -1.80 1.84 0.31 0.61 -1.84 -3.04 -1.19 -2.19 -0.58 -0.12 0.19 -2.91 -0.57 1.00 -1.18 3.93 1.08 1.14 -0.18 3.12 0.00 -0.21 -0.49 0.16 MS 0.79 1.23 0.70 1.05 0.75 0.84 0.84 0.75 0.76 0.29 0.72 0.72 0.93 1.33 0.80 1.11 0.31 1.06 0.77 0.79 0.74 0.79 0.85 1.02 1.02 0.74 0.85 1.09 0.86 1.34 1.14 1.49 0.95 2.48 1.00 0.75 0.64 0.62 Z -2.20 2.52 -0.74 0.48 -1.39 -1.85 -1.13 -0.54 -1.08 -0.90 -1.85 -2.94 -0.60 1.80 -1.64 1.15 0.02 0.65 -1.62 -2.48 -1.29 -1.79 -0.93 0.29 0.26 -3.05 -0.76 0.99 -0.67 3.35 1.36 2.40 -0.54 4.55 0.00 -0.70 -1.02 -0.11 Table J2 (cont’d) T3_39_s T3_40 T3_41_s T3_42 T3_43 T3_44_s T3_45 T3_46 T3_47 T3_48 T3_49 T3_50_s T3_51 T3_52 T3_53 T3_54 T3_55 T3_56 T3_57 T3_58 T3_59 T3_60 T3_61 T3_62 T3_63 T3_64 T3_65 T3_66 T3_67 T3_68_s T3_69 T3_70 T3_71_s T3_72 T4_01 T4_02 T4_03 T4_04 T4_05 T4_06 T4_07 0.91 0.75 0.99 0.94 0.70 0.82 0.79 0.94 0.91 0.69 0.93 0.92 0.99 0.79 0.81 0.90 0.89 0.53 0.98 0.94 0.65 0.54 0.42 0.91 0.78 0.23 0.74 0.62 0.95 0.99 0.67 0.98 0.77 0.83 0.46 0.89 0.99 0.78 0.82 0.97 0.94 0.24 0.27 0.22 0.25 0.38 0.28 0.17 0.19 0.35 0.24 0.23 -0.09 0.13 0.39 0.40 0.24 0.16 0.43 0.20 0.19 0.25 0.25 0.35 0.20 0.32 0.30 0.07 0.32 0.25 0.19 0.41 -0.05 0.17 0.25 0.39 0.25 0.10 0.36 -0.02 0.19 0.21 -0.48 0.89 -2.76 -0.87 1.18 0.41 0.63 -0.87 -0.41 1.26 -0.70 -0.55 -2.76 0.63 0.52 -0.35 -0.23 2.04 -2.04 -0.87 1.45 2.02 2.58 -0.48 0.70 3.68 0.95 1.63 -1.19 -2.76 1.37 -2.34 0.79 0.33 2.36 -0.23 -2.76 0.70 0.41 -1.62 -0.87 0.26 0.18 0.71 0.30 0.17 0.20 0.19 0.30 0.25 0.16 0.28 0.27 0.71 0.19 0.19 0.25 0.24 0.16 0.51 0.30 0.16 0.16 0.16 0.26 0.18 0.19 0.17 0.16 0.35 0.71 0.16 0.58 0.18 0.20 0.16 0.24 0.71 0.18 0.20 0.42 0.30 338 0.97 1.03 0.95 0.96 0.95 0.98 1.08 0.99 0.90 1.08 0.96 1.15 0.98 0.91 0.90 0.96 1.03 0.92 0.95 0.98 1.09 1.12 1.03 1.00 0.97 1.06 1.20 1.03 0.95 0.96 0.93 1.03 1.09 1.00 0.99 0.99 0.99 0.94 1.20 0.96 0.98 -0.08 0.34 0.16 -0.10 -0.65 -0.17 0.81 0.04 -0.47 1.14 -0.10 0.76 0.20 -0.88 -0.90 -0.17 0.23 -1.47 0.05 -0.01 1.39 2.11 0.43 0.07 -0.27 0.61 2.34 0.55 -0.07 0.17 -1.04 0.23 1.01 0.03 -0.17 0.00 0.22 -0.57 1.70 0.03 -0.01 0.82 1.00 0.26 0.65 0.87 1.11 1.29 0.77 0.63 1.04 0.87 1.93 0.48 0.88 0.73 1.40 1.08 0.90 0.77 1.01 1.13 1.14 1.04 0.89 0.98 1.09 1.41 0.99 0.58 0.34 0.91 1.83 1.49 1.04 0.97 0.74 0.59 0.88 1.85 0.69 0.75 -0.39 0.05 -0.48 -0.70 -0.90 0.52 1.38 -0.37 -1.05 0.35 -0.18 1.99 -0.13 -0.56 -1.28 1.11 0.34 -1.07 -0.01 0.18 1.06 1.51 0.51 -0.16 -0.03 0.61 2.19 -0.04 -0.71 -0.33 -0.63 1.00 2.34 0.26 -0.31 -0.74 0.01 -0.59 2.99 -0.27 -0.43 Table J2 (cont’d) T4_08 T4_09 T4_10 T4_11 T4_12 T4_13 T4_14 T4_15 T4_16 T4_17 T4_18 T4_19 T4_20 T4_21 T4_22 T4_23 T4_24 T4_25 T4_26 T4_27 T4_28 T4_29 T4_30 T4_31 T4_32 T4_33 T4_34 T4_35 T4_36 T4_37 T4_38 T4_39 T4_40_s T4_41 T4_42 T4_43 T4_44_s T4_45 T4_46 T4_47 T4_48 0.77 0.99 0.99 0.95 0.92 0.93 0.99 0.93 0.85 1.00 0.93 0.94 0.83 0.99 0.76 0.86 0.91 0.93 0.89 1.00 0.88 1.00 1.00 0.98 0.94 0.75 0.92 0.99 1.00 1.00 0.97 0.98 0.89 1.00 0.92 0.98 0.85 1.00 0.99 1.00 0.98 0.37 0.02 0.07 0.20 0.23 0.31 0.17 0.20 0.03 NA 0.22 0.21 
0.30 0.16 0.36 0.30 0.24 0.24 0.20 NA 0.11 NA NA 0.17 0.17 0.23 0.14 -0.05 NA NA -0.02 0.01 0.08 NA 0.17 0.22 0.09 NA 0.17 NA 0.19 0.79 -3.46 -2.76 -1.07 -0.55 -0.70 -3.46 -0.78 0.17 -4.67 -0.70 -0.87 0.33 -2.76 0.83 0.12 -0.48 -0.70 -0.18 -4.67 -0.07 -4.67 -4.67 -2.34 -0.97 0.89 -0.55 -3.46 -4.67 -4.67 -1.81 -2.34 -0.23 -4.67 -0.55 -2.04 0.21 -4.67 -3.46 -4.67 -2.34 0.18 1.00 0.71 0.33 0.27 0.28 1.00 0.29 0.21 1.82 0.28 0.30 0.20 0.71 0.18 0.21 0.26 0.28 0.23 1.82 0.23 1.82 1.82 0.58 0.32 0.18 0.27 1.00 1.82 1.82 0.46 0.58 0.24 1.82 0.27 0.51 0.21 1.82 1.00 1.82 0.58 339 0.94 1.00 1.00 0.97 0.97 0.93 0.97 0.99 1.16 1.00 0.98 0.97 0.97 0.98 0.96 0.97 0.98 0.97 1.02 1.00 1.09 1.00 1.00 0.97 0.99 1.06 1.04 1.01 1.00 1.00 1.04 1.01 1.08 1.00 1.02 0.95 1.12 1.00 0.97 1.00 0.97 -0.62 0.33 0.22 -0.03 -0.09 -0.25 0.29 0.04 1.19 0.00 -0.03 -0.06 -0.18 0.19 -0.47 -0.18 -0.05 -0.07 0.15 0.00 0.60 0.00 0.00 0.13 0.05 0.70 0.26 0.34 0.00 0.00 0.24 0.21 0.54 0.00 0.18 0.06 0.94 0.00 0.29 0.00 0.12 0.83 0.95 0.79 0.85 0.94 0.57 0.22 0.78 1.33 1.00 0.79 0.91 0.83 0.40 0.85 0.75 0.77 0.73 0.90 1.00 1.06 1.00 1.00 0.49 0.93 1.09 1.03 2.21 1.00 1.00 1.41 1.80 1.22 1.00 0.85 0.45 1.18 1.00 0.22 1.00 0.42 -0.88 0.55 0.22 -0.13 -0.02 -1.05 -0.10 -0.38 1.20 0.00 -0.40 -0.03 -0.68 -0.23 -0.82 -0.90 -0.53 -0.57 -0.22 0.00 0.28 0.00 0.00 -0.31 0.02 0.56 0.21 1.10 0.00 0.00 0.73 0.98 0.73 0.00 -0.28 -0.54 0.75 0.00 -0.10 0.00 -0.43 Table J2 (cont’d) T4_49 T4_50 T4_51 T4_52 T4_53 T4_54 T4_55 T4_56 T4_57 T4_58 T4_59 T4_60 T4_61 T4_62 T4_63 0.99 1.00 0.97 0.91 0.99 0.88 1.00 0.89 0.95 0.87 0.99 0.99 1.00 0.98 1.00 0.03 NA 0.11 0.23 0.03 0.00 NA 0.12 0.11 0.19 0.13 0.04 NA 0.15 NA -3.46 -4.67 -1.81 -0.48 -3.46 -0.12 -4.67 -0.18 -1.07 -0.02 -3.46 -3.46 -4.67 -2.34 -4.67 1.00 1.82 0.46 0.26 1.00 0.23 1.82 0.23 0.33 0.22 1.00 1.00 1.82 0.58 1.82 1.00 1.00 1.01 0.98 1.00 1.14 1.00 1.05 1.02 1.05 0.98 1.00 1.00 0.97 1.00 0.33 0.00 0.15 -0.04 0.33 0.89 0.00 0.34 0.15 0.35 0.31 0.33 0.00 0.13 0.00 0.84 1.00 0.75 0.89 0.84 1.46 1.00 1.60 1.12 0.88 0.31 0.78 1.00 0.77 1.00 0.48 0.00 -0.12 -0.17 0.48 1.40 0.00 1.68 0.39 -0.34 0.02 0.44 0.00 0.08 0.00 340 REFERENCES 341 REFERENCES Abrahamsson, N. (2012). Age of onset and nativelike L2 ultimate attainment of morphosyntactic and phonetic intuition. Studies in Second Language Acquisition, 34, 187-214. https://doi.org/10.1017/S0272263112000022 Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. New York, NY: Continuum. Alderson, J. C., Haapakangas, E., Huhta, A., Nieminen, L., & Ullakonoja, R. (2015). The diagnosis of reading in a second or foreign language. New York: Routledge. Alderson, J. C., Brunfaut, T., & Harding, L. (2014). Towards a theory of diagnosis in second and foreign language assessment: Insights from professional practice across diverse fields. Applied Linguistics, 36(2), 236-260. https://doi.org/10.1093/applin/amt046 Alderson, J.C., & Huhta, A. (2005). The development of a suite of computer-based diagnostic tests based on the Common European Framework. Language Testing, 22(3), 301-320. https://doi.org/10.1191/0265532205lt310oa Allen, D. (2016). Investigating washback to the learner from the IELTS test in the Japanese tertiary context. Language Testing in Asia, 6(7), 1-20. https://doi.org/10.1186/s40468- 016-0030-z Amengual, M. (2016). The perception of language-specific phonetic categories does not guarantee accurate phonological representations in the lexicon of early bilinguals. 
Applied Psycholinguistics, 37, 1221-1251. https://doi.org/10.1017/S0142716415000557 American Council on the Teaching of Foreign Languages. (2012). ACTFL proficiency guidelines 2012. Alexandria, VA: ACTFL. Bachman, L., & Palmer, A. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford, UK: Oxford University Press. Baker, A. (2014). Exploring teachers’ knowledge of second language pronunciation techniques: Teacher cognitions, observed classroom practices, and student perceptions. TESOL Quarterly, 48, 136-163. Best, C., & Tyler, M. (2007). Nonnative and second-language speech perception. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning (pp. 13– 34). Amsterdam: John Benjamins. Bond, T. G., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences (3rd ed.). New York: Routledge. 342 Boone, W. J., Staver, J. R., & Yale, M. S. (2014). Rasch analysis in the human sciences. New York: Springer. Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2006). The concept of validity. Psychological Review, 111(4), 1061-1071. https://doi.org/10.1037/0033- 295X.111.4.1061 Bowles, M. A., Toth, P. D., & Adams, R. J. (2014). A comparison of L2-L2 and L2-heritage learner interactions in Spanish language classrooms. Modern Language Journal, 92, 497- 517. https://doi.org/10.1111/j.1540-4781.2014.12 Brinkmann, S. (2013). Qualitative interviewing. New York: Oxford University Press. Brodeur, M. B., Dionne-Dostie, E., Montreuil, T., & Lepage, M. (2010). The bank of standardized stimuli (BOSS), a new set of 480 normative photos of objects to be used as visual stimuli in cognitive research. PloS ONE, 5(5), e10773. Broersma, M., & Scharenborg, O. (2010). Native and non-native listeners’ perception of English consonants in different types of noise. Speech Communication, 52, 980-995. https://doi.org/10.1016/j.specom.2010.08.010 Brown, Adam. (1988). Functional load and the teaching of pronunciation. TESOL Quarterly, 22(4), 593-606. https://doi.org/10.2307/3587258 Brown, Anna. (2018). Item response theory approaches to test scoring and evaluating the score accuracy. In P. Irwing, T. Booth & D. J. Hughes (Eds.), The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale, and Test Development (pp. 607-638). Hoboken, NJ: John Wiley & Sons Ltd. Brown, J. D. (1999). Standard error vs. standard error of measurement. Shiken: JALT Testing & Evaluation SIG Newsletter, 3(1), 20-25. Burri, M., Baker, A., & Chen, H. (2017). “I feel like having a nervous breakdown”: Pre-service and in-service teachers’ developing beliefs and knowledge about pronunciation instruction. Journal of Second Language Pronunciation, 3(1), 109-135. https://doi.org/10.1075/jslp.3.1.05bur Bybee, J. (2001). Phonology and language use. Cambridge, U.K.: Cambridge University Press. Carr, N. (2011). Designing and analyzing language tests. Oxford, UK: Oxford University Press. Celce-Murcia, M., Brinton, D. M., Goodwin, J. M., & Griner, B. (2010). Teaching pronunciation: A course book and reference guide (2nd ed.). New York: Cambridge University Press. 343 Chapelle, C. A., Cotos, E., & Lee, J. (2015). Validity arguments for diagnostic assessment using automated writing evaluation. Language Testing, 32(3), 385-405. https://doi.org/10.1177/0265532214565386 Chapelle, C. A., Enright, M. K., & Jamieson, J. (Eds.) (2008). Building a validity argument for the Test of English as a Foreign Language. 
London: Routledge Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 29(1), 3-13. https://doi.org/10.1111/j.1745-3992.2009.00165.x Chen, Y.-J. (2018). A study on production of Korean syllable-final /n/ and /ŋ/ by Taiwanese speakers. Unpublished master’s thesis. Hankuk University of Foreign Studies. Chen, Y.-M. (2008). Learning to self‐assess oral performance in English: A longitudinal case study. Language Teaching Research, 12, 235– 262. https://doi.org/10.1177/1362168807086293 Choi, E., Kim, E., Park, H., Jin, M., & Park, K. (2009a). 외국인을 위한 한국어 발음 (제 1 권) [Korean pronunciation for foreigners (Vol. 1)]. Seoul: SISA Hangeulpark. Choi, E., Kim, E., Park, H., Jin, M., & Park, K. (2009b). 외국인을 위한 한국어 발음 (제 2 권) [Korean pronunciation for foreigners (Vol. 2)]. Seoul: SISA Hangeulpark. Council of Europe. (2017). Common European framework of reference for languages: Learning, teaching, assessment – Companion volume with new descriptors. Retrieved from https://rm.coe.int/common-european-framework-of-reference-for-languages-learning- teaching/168074a4e2 Creswell, J. W., & Plano Clark, V. L. (2011). Designing and conducting mixed methods research (2nd ed.). Thousand Oaks, CA: Sage. Crocker, L. M., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, and Winston. Crowther, D., Isaacs, T., Trofimovich, P., & Saito, K. (2015). Does a speaking task affect second language comprehensibility? Modern Language Journal, 99(1), 80-95. https://doi.org/10.1111/modl.12185 Cutler, A., & Clifton, C. (1999). Comprehending spoken language: A blueprint of the listener. In C. M. Brown and P. Hagoort (Eds.), The Neurocognition of Language (pp 123-166). Oxford: Oxford University Press. Cutler, A., Weber, A., Smits, R., & Cooper, N. (2004). Patterns of English phoneme confusions by native and non-native listeners. Journal of the Acoustical Society of America, 116(6), 3668-3678. https://doi.org/10.1121/1.1810292 344 Davidson, F., & Lynch, B. K. (2002). Testcraft: A teacher’s guide to writing and using language test specifications. New Haven, CT: Yale University Press. De Ayala, R. J. (2009). The theory and practice of item response theory. New York: Guilford Press. DeKeyser, R. M. (2017). Knowledge and skill in ISLA. In S. Loewen & M. Sato (Eds.), The Routledge handbook of instructed second language acquisition (pp. 15-32). New York: Routledge. DeMars, C. E. (2018). Classical Test Theory and Item Response Theory. In P. Irwing, T. Booth & D. J. Hughes (Eds.), The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale, and Test Development (pp. 49-73). Hoboken, NJ: John Wiley & Sons Ltd. Derwing, T. M., Diepenbrooke, L. G., & Foote, J. A. (2012). How well do general-skills ESL textbooks address pronunciation? TESL Canada Journal, 30(1), 23-44. Derwing, T., & Munro, M. J. (2014). Myth 1: Once you have been speaking a second language for years, it’s too late to change your pronunciation. In L. Grant (Ed.), Pronunciation myths: Applying second language research to classroom teaching (pp. 34-55). Ann Arbor, Michigan: Michigan University Press. Derwing, T. M., & Munro, M. J. (2013). The development of L2 oral language skills in two L1 groups: A 7-year study. Language Learning, 63(2), 163-185. https://doi.org/10.1111/lang.12000 Derwing, T. M., & Munro, M. J. (2015). 
Pronunciation fundamentals: Evidence-based perspectives for L2 teaching and research. Philadelphia, PA: John Benjamins. Derwing, T. M., Munro, M. J., Foote, J. A., Waugh, E., & Fleming, J. (2014). Opening the window on comprehensible pronunciation after 19 years: A workplace training study. Language Learning, 64(3), 526-548. https://doi.org/10.1111/lang.12053 Derwing, T. M., Munro, M. J., & Weibe, G. (1998). Evidence in favor of a broad framework for pronunciation instruction. Language Learning, 48(3), 393-410. Dimova, S., & Kling, J. (2018). Assessing English-medium instruction lecturer language proficiency across disciplines. TESOL Quarterly, 52(3), 634-656. Dlaska, A., & Krekeler, C. (2008). Self-assessment of pronunciation. System, 36, 506-516. https://doi.org/10.1016/j.system.2008.03.003 345 Dorans, N. J. (2018). Scores, scales, and score linking. In P. Irwing, T. Booth & D. J. Hughes (Eds.), The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale, and Test Development (pp. 571-605). Hoboken, NJ: John Wiley & Sons Ltd. Duñabeitia, J.A., Crepaldi, D., Meyer, A.S., New, B., Pliatsikas, C., Smolka, E., & Brysbaert, M. (2017). MultiPic: A standardized set of 750 drawings with norms for six European languages. Quarterly Journal of Experimental Psychology. https://doi.org/10.1080/17470218.2017.1310261 Eckes, T. (2014). Examining teslet effects in the TestDaF listening section: A testlet response theory modeling approach. Language Testing, 31(1), 39-61. https://doi.org/10.1177/0265532213492969 Edelenbos, P., & Kubanek-German, A. (2004). Teacher assessment: The concept of ‘diagnostic competence.’ Language Testing, 21(3), 259-283. https://doi.org/10.1191/0265532204lt284oa Elder, C., & von Randow, J. (2008). Exploring the utility of a web-based English language screening tool. Language Assessment Quarterly, 5(3), 173-194. https://doi.org/10.1080/15434300802229334 Elo, S., & Kyngäs, H. (2007). The qualitative content analysis process. Journal of Advanced Nursing, 62(1), 107-115. https://doi.org/10.1111/j.1365-2648.2007.04569.x Embretson, S. E. (1996). The new rules of measurement. Psychological Assessment, 8(4), 341- 349. Ferris, D. R. (2010). Second language writing research and written corrective feedback in SLA: Intersections and practical applications. Studies in Second Language Acquisition, 32, 181-201. https://doi.org/10.1017/S0272263109990490 Field, J. (2011). Cognitive validity. In L. Taylor (Ed.), Examining Speaking: Research and Practice in Assessing Second Language Speaking (pp. 65-111). Cambridge: Cambridge University Press. Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining listening: Research and practice in assessing second language (pp. 77–151). Cambridge: Cambridge University Press. Field, J. (2014). Myth 3: Pronunciation teaching has to establish in the minds of language learners a set of distinct consonant and vowel sounds. In L. Grant (Ed.), Pronunciation myths: Applying second language research to classroom teaching (pp. 80-106). Ann Arbor, Michigan: Michigan University Press. 346 Flege, J. E. (1991). Perception and production: The relevance of phonetic input to l2 phonological learning. In T. Hueber and C. Ferguson (Eds.), Crosscurrents in second language acquisition and linguistic theories (pp. 249-289). Amsterdam: John Benjamins. Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and the critical period hypothesis (pp. 233-277). 
Timonium, MD: York Press. Flege, J. E., Yeni-Komshian, G. H., & Liu, S. (1999). Age constraints on second-language acquisition. Journal of Memory and Language, 41(1), 78-104. https://doi.org/10.1006/jmla.1999.2638 Foote, J., Holtby, A., & Derwing, T. M. (2011). Survey of the teaching of pronunciation in adult ESL programs in Canada, 2010. TESL Canada Journal, 29, 1-22. https://doi.org/10.18806/tesl.v29i1.1086 Foote, J. A., McDonough, K. (2017). Using shadowing with mobile technology to improve L2 pronunciation. Journal of Second Language Pronunciation, 3(1), 34-56. https://doi.org/10.1075/jslp.3.1.02foo Fulcher, G. (Host). (2015, July). Issue 22: Eunice Jang on Diagnostic Language Testing [Audio Podcast]. Retrieved from http://languagetesting.info/sage/podcasts/Diagnostic%20Language%20Testing.mp3 Friedman, D. (2012). How to collect and analyze qualitative data. In A. Mackey and S. M. Gass (Eds.), Research Methods in Second Language Acquisition: A Practical Guide (pp. 180- 200). London: Wiley. Gamer, M., Lemon, J., Fellows, I., & Singh, P. (2019). irr: Various coefficients of interrater reliability and agreement (R package version 0.84.1) [Computer software]. https://CRAN.R-project.org/package=irr Gass, S., & Mackey, A. (2006). Input, interaction, and output: An overview. AILA Review, 19, 3- 17. Gilbert, J. B. (2005). Clear speech – Pronunciation and listening comprehension in North American English: Student’s book (3rd ed.). New York: Cambridge University Press. Ginther, A., & Yan, X. (2018). Interpreting relationships between TOEFL iBT scores and GPA: Language proficiency, policy, and profiles. Language Testing, 35(2), 271-295. https://doi.org/10.1177/0265532217704010 Graham, C. (2001). Jazz chants old and new. Oxford, England: Oxford University Press. 347 Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29-48. https://doi.org/10.1348/000711006X126600 Han, Z.-H. (2004). Fossilization in adult second language acquisition. Clevedon, UK: Multilingual Matters. Harding, L., Alderson, J. C., & Brunfaut, T. (2015). Diagnostic assessment of reading and listening in a second or foreign language: Elaborating on diagnostic principles. Language Testing, 32(3), 317-336. https://doi.org/10.1177/0265532214564505 Hardison, D. M. (2004). Generalization of computer-assisted prosody training: quantitative and qualitative findings. Language Learning & Technology, 8, 34-52. Hardison, D. M. (2005). Second-language spoken word identification: Effects of perceptual training, visual cues, and phonetic environment. Applied Psycholinguistics, 26(4), 579– 596. https://doi.org/10.1017/S0142716405050319 Hardison, D. M. (2012). Second-language speech perception: A cross-disciplinary perspective on challenges and accomplishments. In S. Gass & A. Mackey (Eds.), The Routledge handbook of second language acquisition (pp. 349–363). London: Routledge. Hardison, D. M. (2018). Visualizing the acoustic and gestural beats of emphasis in multimodal discourse: Theoretical and pedagogical implications. Journal of Second Language Pronunciation, 4(2), 232-259. https://doi.org/10.1075/jslp.17006.har Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer Science+Business Media. Holliday, J. J. (2014). The perceptual assimilation of Korean obstruents by native Mandarin listeners. 
The Journal of the Acoustical Society of America, 135, 1585-1595. https://doi.org/10.1121/1.4863653
Holliday, J. J. (2015). A longitudinal study of the second language acquisition of a three-way stop contrast. Journal of Phonetics, 50, 1-14. https://doi.org/10.1016/j.wocn.2015.01.004
Holliday, J. J. (2016). Second language experience can hinder the discrimination of nonnative phonological contrasts. Phonetica, 73, 33-51. https://doi.org/10.1159/000443312
Horgues, C., & Scheuer, S. (2014). “I understood you, but there was this pronunciation thing…”: L2 pronunciation feedback in English/French tandem interactions. Research in Language, 12(2), 145-161. https://doi.org/10.2478/rela-2014-0005
Housen, A., & Pierrard, M. (2005). Investigations in instructed second language acquisition. Berlin: Mouton de Gruyter.
Imai, S., Walley, A. C., & Flege, J. E. (2005). Lexical frequency and neighborhood density effects on the recognition of native and Spanish-accented words by native English and Spanish listeners. The Journal of the Acoustical Society of America, 117(2), 896-907. https://doi.org/10.1121/1.1823291
Ingvalson, E. M., Ettlinger, M., & Wong, P. C. M. (2014). Bilingual speech perception and learning: A review of recent trends. International Journal of Bilingualism, 18(1), 35-47. https://doi.org/10.1177/1367006912456586
Irwing, P., & Hughes, D. J. (2018). Test development. In P. Irwing, T. Booth, & D. J. Hughes (Eds.), The Wiley handbook of psychometric testing: A multidisciplinary reference on survey, scale, and test development (pp. 3-47). Hoboken, NJ: John Wiley & Sons.
Isaacs, T. (2018). Shifting sands in second language pronunciation teaching and assessment research and practice. Language Assessment Quarterly, 15(3), 273-293. https://doi.org/10.1080/15434303.2018.1472264
Isaacs, T., & Harding, L. (2017). Pronunciation assessment. Language Teaching, 50(3), 347-366. https://doi.org/10.1017/S0261444817000118
Isaacs, T., & Trofimovich, P. (Eds.). (2017). Second language pronunciation assessment: Interdisciplinary perspectives. Bristol, UK: Multilingual Matters.
Isaacs, T., Trofimovich, P., & Foote, J. A. (2018). Developing a user-oriented second language comprehensibility scale for English-medium universities. Language Testing, 35(2), 193-216. https://doi.org/10.1177/0265532217703433
Isbell, D. R. (2017). Explaining intelligibility: What matters most in L2 speech? Paper presented at the 19th annual Second Language Research Forum, Columbus, Ohio.
Isbell, D. R., Park, O.-S., & Lee, K. (2019). Learning Korean pronunciation: Effects of instruction, proficiency, and L1. Journal of Second Language Pronunciation, 5(1), 13-48. https://doi.org/10.1075/jslp.17010.isb
Isbell, D. R., Winke, P. M., & Gass, S. M. (2018). Using the ACTFL OPIc to assess proficiency and monitor progress in a tertiary foreign languages program. Language Testing. Advance online publication. https://doi.org/10.1177/0265532218798139
Jang, E. E. (2009). Cognitive diagnostic assessment of L2 reading comprehension ability: Validity arguments for Fusion Model application to LanguEdge assessment. Language Testing, 26(1), 31-73. https://doi.org/10.1177/0265532208097336
Jang, E. E., Dunlop, M., Park, G., & Van der Boom, E. H. (2015). How do young students with different profiles of reading skill mastery, perceived ability, and goal orientation respond to holistic diagnostic feedback? Language Testing, 32(3), 359-383. https://doi.org/10.1177/0265532215570924
Jang, E. E., & Wagner, M. (2014). Diagnostic feedback in the language classroom.
In A. Kunnan (Ed.), Companion to language assessment (Vol. 2, pp. 693-711). New York, NY: Wiley-Blackwell.
Jenkins, J. (2002). A sociolinguistically based, empirically researched pronunciation syllabus for English as an international language. Applied Linguistics, 23(1), 83-103.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1-73. https://doi.org/10.1111/jedm.12000
Kang, O., & Ginther, A. (Eds.). (2017). Assessment in second language pronunciation. New York: Routledge.
Kang, O., & Moran, M. (2014). Functional loads of pronunciation features in nonnative speakers’ oral assessment. TESOL Quarterly, 48(1), 176-187. https://doi.org/10.1002/tesq.152
Kang, O., Rubin, D., & Pickering, L. (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. Modern Language Journal, 94(4), 554-566. https://doi.org/10.1111/j.1540-4781.2010.01091.x
Kang, O., Thomson, R. I., & Moran, M. (2018a). Empirical approaches to measuring the intelligibility of different varieties of English in predicting listener comprehension. Language Learning, 68(1), 115-146. https://doi.org/10.1111/lang.12270
Kang, O., Thomson, R. I., & Moran, M. (2018b). Which features of accent affect understanding? Exploring the intelligibility threshold of diverse accent varieties. Applied Linguistics. Advance online publication. https://doi.org/10.1093/applin/amy053
Kassambara, A. (2017). Practical guide to cluster analysis in R: Unsupervised machine learning. STHDA.
Kassambara, A., & Mundt, F. (2017). factoextra: Extract and visualize the results of multivariate data analyses (R package version 1.0.5) [Computer software]. https://CRAN.R-project.org/package=factoextra
Kennedy, S., Guénette, D., Murphy, J., & Allard, S. (2015). Le rôle de la prononciation dans l’intercompréhension entre locuteurs de français lingua franca [The role of pronunciation in mutual comprehension between speakers of French as a lingua franca]. Canadian Modern Language Review, 71, 1-25. https://doi.org/10.3138/cmlr.2139
Kennedy, S., & Trofimovich, P. (2010). Language awareness and second language pronunciation: A classroom study. Language Awareness, 19(3), 171-185. https://doi.org/10.1080/09658416.2010.486439
Kim, C.-W., & Park, S.-G. (1995). Pronunciation problems of Australian students learning Korean: Intervocalic liquid consonants. Australian Review of Applied Linguistics, 12, 183-202. https://doi.org/10.1075/aralss.12.12kim
Kim, E.-A. (2006). A study on the diagnosis and evaluation of pronunciation errors of Korean language learners [한국어 학습자의 발음 오류 진단 및 평가에 관한 연구]. Journal of Korean Language Education [한국어교육], 17(1), 71-99.
Kim, J. E., & Silva, D. J. (2003). Accounting for back-vowel under-differentiation: An acoustically-based study of English-speaking learners of Korean. The Korean Language in America, 8, 51-64.
Kim, J.-Y. (2015). Second language acquisition: Phonology. In L. Brown & J. Yeon (Eds.), The handbook of Korean linguistics (pp. 373-388). Hoboken, NJ: John Wiley & Sons.
Kim, M. (2007). Aspects of Korean second language phonology (Doctoral dissertation). Retrieved from ProQuest. (UMI 3279086)
Kim, M., Kim, S.-J., & Stoel-Gammon, C. (2017). Phonological acquisition of Korean consonants in conversational speech produced by young Korean children. Journal of Child Language, 44, 1010-1023. https://doi.org/10.1017/S0305000916000258
Kim, M. J., Pae, S. Y., & Lee, S. E. (2005). The development of the ‘Test of Articulation for Children’: Concurrent validity. Communication Sciences & Disorders, 10(1), 82-96.
Kim, Y., Tracy-Ventura, N., & Jung, Y. (2016). A measure of proficiency or short-term memory? Validation of an elicited imitation test for SLA research. Modern Language Journal, 100(3), 655-673. https://doi.org/10.1111/modl.12346
King, R. S. (2015). Cluster analysis and data mining: An introduction. Boston, MA: Mercury Learning and Information.
Knoch, U., & Elder, C. (2016). Post-entry English language assessments at university: How diagnostic are they? In V. Aryadoust & J. Fox (Eds.), Trends in language assessment research and practice: The view from the Middle East and Pacific Rim (pp. 210-230). Newcastle-upon-Tyne, England: Cambridge Scholars Publishing.
Ko, I. (2013). The articulation of Korean coronal obstruents: Data from heritage speakers and second language learners (Unpublished doctoral dissertation). University of Hawai‘i, Hawai‘i.
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15, 155-163. https://doi.org/10.1016/j.jcm.2016.02.012
Krashen, S. (1982). Principles and practice in second language acquisition. New York: Prentice Hall.
Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6), 1121-1134.
Kwon, S. (2017). 한국어 발음 교육론 [Korean pronunciation pedagogy]. Seoul: SISA Hangeulpark.
Lado, R. (1957). Linguistics across cultures. Ann Arbor, MI: University of Michigan Press.
Lado, R. (1961). Language testing: The construction and use of foreign language tests: A teacher’s book. London: Longmans, Green and Company.
Lappin-Fortin, K., & Rye, B. J. (2014). The use of pre-/posttest and self-assessment tools in a French pronunciation course. Foreign Language Annals, 47(2), 300-320. https://doi.org/10.1111/flan.12083
Lee, A. H., & Lyster, R. (2016). Effects of different types of corrective feedback on receptive skills in a second language: A speech perception training study. Language Learning, 66(4), 809-833. https://doi.org/10.1111/lang.12167
Lee, A. H., & Lyster, R. (2017). Can corrective feedback on second language speech perception errors affect production accuracy? Applied Psycholinguistics, 38, 371-393. https://doi.org/10.1017/S0142716416000254
Lee, H. (2017a). An empirical study to rethink the goals and components of teaching Korean language pronunciation. Journal of Korean Language Education, 28(3), 105-126.
Lee, H. (2017b). 한국어 발음 평가 연구 [Korean pronunciation assessment for foreign language speakers]. Seoul: Jisikgwagyoyang.
Lee, J., Jang, J., & Plonsky, L. (2015). The effectiveness of second language pronunciation instruction: A meta-analysis. Applied Linguistics, 36(3), 1-23. https://doi.org/10.1093/applin/amu040
Lee, S. (2012). Orthographic influence on the phonological development of L2 learners of Korean (Unpublished doctoral dissertation). The University of Wisconsin-Milwaukee, Milwaukee, WI.
Lee, S.-H., Jang, S. B., & Seo, S. K. (2017). A frequency dictionary of Korean. New York: Routledge.
Lee, S.-Y., Moon, J., & Long, M. H. (2009). Linguistic correlates of proficiency in Korean as a second language. Language Research, 45(2), 319-348.
Lee, Y.-W. (2015). Diagnosing diagnostic language assessment. Language Testing, 32(3), 299-316. https://doi.org/10.1177/0265532214565387
Lee, Y.-W., & Sawaki, Y. (2009).
Application of three cognitive diagnosis models to ESL reading and listening assessments. Language Assessment Quarterly, 6(3), 172-189. https://doi.org/10.1080/15434300902985108
Lee-Ellis, S. (2009). The development and validation of a Korean C-Test using Rasch analysis. Language Testing, 26(2), 245-274. https://doi.org/10.1177/0265532208101007
Levelt, W. J. M. (1993). Speaking: From intention to articulation. Cambridge, MA: MIT Press.
Levis, J. (2005). Changing contexts and shifting paradigms in pronunciation teaching. TESOL Quarterly, 39(3), 369-377. https://doi.org/10.2307/3588485
Levis, J. (2007). Computer technology in teaching and researching pronunciation. Annual Review of Applied Linguistics, 27, 184-202. https://doi.org/10.1017/S0267190508070098
Levis, J., & Barriuso, T. A. (2012). Nonnative speakers’ pronunciation errors in spoken and read English. In J. Levis & K. LeVelle (Eds.), Proceedings of the 3rd Annual Pronunciation in Second Language Learning and Teaching Conference (pp. 187-194). Ames, IA: Iowa State University.
Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7(4), 328.
Linacre, J. M. (2005). Dichotomous & polytomous category information. Rasch Measurement Transactions, 19(1), 1005-1006.
Linacre, J. M. (2019). Winsteps® Rasch measurement computer program user’s guide. Beaverton, OR: Winsteps.com.
Little, D. (2005). The Common European Framework and the European Language Portfolio: Involving learners and their judgements in the assessment process. Language Testing, 22(3), 321-336. https://doi.org/10.1191/0265532205lt311oa
Llama, R., Cardoso, W., & Collins, L. (2010). The influence of language distance and language status on the acquisition of L3 phonology. International Journal of Multilingualism, 7(1), 39-57. https://doi.org/10.1080/14790710902972255
Loewen, S. (2015). An introduction to instructed second language acquisition. New York: Routledge.
Loewen, S., & Isbell, D. R. (2017). Pronunciation in face-to-face and oral synchronous computer-mediated interaction. Studies in Second Language Acquisition, 39(2), 225-256. https://doi.org/10.1017/S0272263116000449
Long, M. (2013). Maturational constraints on child and adult SLA. In G. Granena & M. Long (Eds.), Sensitive periods, language aptitude, and ultimate L2 attainment (pp. 3-41). Amsterdam: John Benjamins.
Lord, G. (2005). Can we teach foreign language pronunciation? The effects of a phonetics class on second language pronunciation. Hispania, 88, 557-567.
Lord, G. (2008). Podcasting communities and second language pronunciation. Foreign Language Annals, 41, 364-379.
Lord, G. (2010). The combined effects of immersion and instruction on second language pronunciation. Foreign Language Annals, 43(3), 487-503.
Ma, M., & Winke, P. (2019). Self-assessment: How reliable is it in assessing the oral proficiency of Chinese learners over time? Foreign Language Annals, 52, 66-86. https://doi.org/10.1111/flan.12379
Marais, I., & Andrich, D. (2008). Formalizing dimension and response violations of local independence in the unidimensional Rasch model. Journal of Applied Measurement, 9, 200-215.
Marian, V., Blumenfeld, H. K., & Kaushanskaya, M. (2007). The Language Experience and Proficiency Questionnaire (LEAP-Q): Assessing language profiles in bilinguals and multilinguals. Journal of Speech, Language, and Hearing Research, 50(4), 940-967. https://doi.org/10.1044/1092-4388(2007/067)
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Mathôt, S., Schreij, D., & Theeuwes, J. (2012). OpenSesame: An open-source, graphical experiment builder for the social sciences. Behavior Research Methods, 44(2), 314-324. https://doi.org/10.3758/s13428-011-0168-7
Matsumoto, Y. (2011). Successful ELF communications and implications for ELT: Sequential analysis of ELF pronunciation negotiation strategies. Modern Language Journal, 95, 97-114. https://doi.org/10.1111/j.1540-4781.2011.01172.x
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18(1), 1-86. https://doi.org/10.1016/0010-0285(86)90015-0
McCrocklin, S. (2019). ASR-based dictation practice for second language pronunciation improvement. Journal of Second Language Pronunciation, 5(1), 98-118. https://doi.org/10.1075/jslp.16034.mcc
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30-46.
McLeod, S., & Crowe, S. (2018). Children’s consonant acquisition in 27 languages: A cross-linguistic review. American Journal of Speech-Language Pathology, 1-28. https://doi.org/10.1044/2018_AJSLP-17-0100
McQueen, J. M., Norris, D., & Cutler, A. (1994). Competition in spoken word recognition: Spotting words in other words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20(3), 621-638.
Meijer, R. R., & Tendeiro, J. N. (2018). Unidimensional item response theory. In P. Irwing, T. Booth, & D. J. Hughes (Eds.), The Wiley handbook of psychometric testing: A multidisciplinary reference on survey, scale, and test development (pp. 413-443). Hoboken, NJ: John Wiley & Sons.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13-103). Washington, DC: American Council on Education and National Council on Measurement in Education.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241-256.
Miles, M. B., Huberman, M. A., & Saldaña, J. (2014). Qualitative data analysis: A methods sourcebook (3rd ed.). Thousand Oaks, CA: SAGE.
Möttönen, R., & Watkins, K. E. (2009). Motor representations of articulators contribute to categorical perception of speech sounds. The Journal of Neuroscience, 29(31), 9819-9825. https://doi.org/10.1523/JNEUROSCI.6018-08.2009
Moyer, A. (2014). Exceptional outcomes in L2 phonology: The critical factors of learner engagement and self-regulation. Applied Linguistics, 35(4), 418-440. https://doi.org/10.1093/applin/amu012
Munro, M. J. (2008). Foreign accent and speech intelligibility. In J. G. Hansen Edwards & M. L. Zampini (Eds.), Phonology and second language acquisition (pp. 193-218). Philadelphia, PA: John Benjamins.
Munro, M. J., & Derwing, T. M. (1995). Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 45(1), 73-97. https://doi.org/10.1111/j.1467-1770.1995.tb00963.x
Munro, M. J., & Derwing, T. M. (2006). The functional load principle in ESL pronunciation instruction: An exploratory study. System, 34, 520-531. https://doi.org/10.1016/j.system.2006.09.004
Munro, M. J., Derwing, T. M., & Thomson, R. I. (2015). Setting segmental priorities for English learners: Evidence from a longitudinal study. International Review of Applied Linguistics, 53(1), 39-60. https://doi.org/10.1515/iral-2015-0002
Murphy, J. (2014). Myth 7: Teacher training programs provide adequate preparation in how to teach pronunciation. In L. Grant (Ed.), Pronunciation myths: Applying second language research to classroom teaching (pp. 188-234).
Ann Arbor, MI: University of Michigan Press.
Murtagh, F., & Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: Which algorithms implement Ward’s criterion? Journal of Classification, 31, 274-295. https://doi.org/10.1007/s00357-014-9161-z
Nation, I. S. P. (2001). Planning and running an extensive reading program. NUCB Journal of Language Culture and Communication, 3(1), 1-8.
Nation, I. S. P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31(7), 9-13.
National Institute of the Korean Language. (2015, May 29). “초코렛”과 “초콜릿” [“Chocoret” and “chocolate”]. Retrieved from https://www.korean.go.kr/front/mcfaq/mcfaqView.do?mn_id=&mcfaq_seq=5497&pageIndex=1
Nissan, S., & Schedl, M. (2012). Prototyping new item types. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 281-294). London: Routledge.
Nora, A., Renvall, H., Kim, J.-Y., Service, E., & Salmelin, R. (2015). Distinct effects of memory retrieval and articulatory preparation when learning and accessing new word forms. PLOS ONE, 10(5), 1-27. https://doi.org/10.1371/journal.pone.0126652
Oh, J. S., Jun, S.-A., Knightly, L. M., & Au, T. K. M. (2003). Holding on to childhood language memory. Cognition, 86, B53-B64. https://doi.org/10.1016/S0010-0277(02)00175-0
Oh, Y. M., Coupé, C., Marsico, E., & Pellegrino, F. (2015). Bridging phonological system and lexicon: Insights from a corpus study of functional load. Journal of Phonetics, 53, 153-176. https://doi.org/10.1016/j.wocn.2015.08.003
Pearson. (2018). PTE Academic score guide. Retrieved from https://pearsonpte.com/wp-content/uploads/2017/08/Score-Guide.pdf
Peirce, J. W. (2009). Generating stimuli for neuroscience using PsychoPy. Frontiers in Neuroinformatics, 2(10), 1-8. https://doi.org/10.3389/neuro.11.010.2008
Pellegrino, J. W., DiBello, L. V., & Goldman, S. R. (2016). A framework for conceptualizing and evaluating the validity of instructionally relevant assessments. Educational Psychologist, 51(1), 59-81. https://doi.org/10.1080/00461520.2016.1145550
Pennington, M. C. (1998). The teachability of phonology in adulthood: A re-examination. International Review of Applied Linguistics, 36(4), 323-341.
Pinkfong. (2016). 상어 가족 [Baby Shark] [Music video]. Seoul: SmartStudy. Retrieved from https://youtu.be/761ae_KDg_Q
Piske, T., MacKay, I. R. A., & Flege, J. E. (2001). Factors affecting degree of foreign accent in an L2: A review. Journal of Phonetics, 29(2), 191-215. https://doi.org/10.1006/jpho.2001.0134
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878-912. https://doi.org/10.1111/lang.12079
Poehner, M. E., & Lantolf, J. P. (2013). Bringing the ZPD into the equation: Capturing L2 development during Computerized Dynamic Assessment (C-DA). Language Teaching Research, 17(3), 323-342. https://doi.org/10.1177/1362168813482935
Qian, M., Chukharev-Hudilainen, E., & Levis, J. (2018). A system for adaptive high-variability segmental perceptual training: Implementation, effectiveness, transfer. Language Learning & Technology, 22(1), 69-96. https://doi.org/10125/44582
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Chicago: MESA Press.
Redd, R. (2019). ragree: Rater agreement (R package version 0.0.4) [Computer software]. https://github.com/raredd/ragree
Revelle, W. (2018). psych: Procedures for personality and psychological research (R package version 1.8.12) [Computer software].
Evanston, IL: Northwestern University.
Richards, J. (1969). Songs in language learning. TESOL Quarterly, 3(2), 161-174.
Robinson, P. (1995). Attention, memory, and the “noticing” hypothesis. Language Learning, 45(2), 283-331.
Ryan, E., & Brunfaut, T. (2016). When the test developer does not speak the target language: The use of language informants in the test development process. Language Assessment Quarterly, 13(4), 393-408. https://doi.org/10.1080/15434303.2016.1236110
Saito, K. (2012). Effects of instruction on L2 pronunciation development: A synthesis of 15 quasi-experimental intervention studies. TESOL Quarterly, 46(4), 842-854. https://doi.org/10.1002/tesq.67
Saito, K. (2018). Individual differences in second language speech learning in classroom settings: Roles of awareness in the longitudinal development of Japanese learners’ English /ɹ/ pronunciation. Second Language Research. Advance online publication. https://doi.org/10.1177/0267658318768342
Saito, K., & Lyster, R. (2012). Effects of form-focused instruction and corrective feedback on L2 pronunciation of /ɹ/ by Japanese learners of English. Language Learning, 62(2), 595-633. https://doi.org/10.1111/j.1467-9922.2011.00639.x
Saito, K., & Plonsky, L. (in press). Effects of second language pronunciation teaching revisited: A proposed measurement framework and meta-analysis. Language Learning.
Saito, K., Trofimovich, P., & Isaacs, T. (2016). Second language speech production: Investigating linguistic correlates of comprehensibility and accentedness for learners at different ability levels. Applied Psycholinguistics, 37, 217-240. https://doi.org/10.1017/S0142716414000502
Saito, K., Trofimovich, P., & Isaacs, T. (2017). Using listener judgments to investigate linguistic influences on L2 comprehensibility and accentedness: A validation and generalization study. Applied Linguistics, 38(4), 439-462. https://doi.org/10.1093/applin/amv047
Sakai, M., & Moorman, C. (2018). Can perception training improve the production of second-language phonemes? A meta-analytic review of 25 years of perception training research. Applied Psycholinguistics, 39, 187-224. https://doi.org/10.1017/S0142716417000418
Schmidt, R. (1990). The role of consciousness in second language learning. Applied Linguistics, 11, 129-158.
Schmidt, R. (1993). Awareness and second language acquisition. Annual Review of Applied Linguistics, 13, 206-226.
Schmidt, R., & Frota, S. (1986). Developing basic conversational ability in a second language: A case study of an adult learner of Portuguese. In R. R. Day (Ed.), Talking to learn: Conversation in second language acquisition (pp. 237-326). Rowley, MA: Newbury House.
Schreier, M. (2014). Qualitative content analysis. In U. Flick (Ed.), The SAGE handbook of qualitative data analysis (pp. 170-183). Thousand Oaks, CA: SAGE. https://doi.org/10.4135/9781446282243.n12
Seok, D.-I., Park, S.-H., Shin, H.-J., & Park, J.-H. (2002). A study on the development of the Korean Standard Picture Articulation Test. Communication Sciences & Disorders, 7(3), 121-143.
Sheldon, A., & Strange, W. (1982). The acquisition of /r/ and /l/ by Japanese learners of English: Evidence that speech production can precede speech perception. Applied Psycholinguistics, 3, 243-261.
Shin, E. (2007). How do non-heritage students learn to make the three-way contrast of Korean stops? The Korean Language in America, 12, 85-105.
Shin, J., Kiaer, J., & Cha, J. (2013). The sounds of Korean. New York: Cambridge University Press.
Shrout, P. E., & Fleiss, J. L. (1979).
Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.
Smith, B. L., Johnson, E., & Hayes-Harb, R. (2019). ESL learners’ intra-speaker variability in producing American English tense and lax vowels. Journal of Second Language Pronunciation, 5(1), 139-164. https://doi.org/10.1075/jslp.15050.smi
Staples, S., & Biber, D. (2015). Cluster analysis. In L. Plonsky (Ed.), Advancing quantitative methods in second language research (pp. 243-274). New York: Routledge.
Steinley, D. (2004). Standardizing variables in K-means clustering. In D. Banks, F. R. McMorris, P. Arabie, & W. Gaul (Eds.), Classification, clustering, and data mining applications: Studies in classification, data analysis, and knowledge organisation. Berlin: Springer.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677-680.
Stockwell, R., & Bowen, J. (1965). The sounds of English and Spanish. Chicago, IL: University of Chicago Press.
Sueyoshi, A., & Hardison, D. M. (2005). The role of gestures and facial cues in second language listening comprehension. Language Learning, 55(4), 661-699.
Sundqvist, P., Wikström, P., Sandlund, E., & Nyroos, L. (2018). The teacher as examiner of L2 oral tests: A challenge to standardization. Language Testing, 35(2), 217-238. https://doi.org/10.1177/0265532217690782
Suzuki, Y. (2017). Validity of new measures of implicit knowledge: Distinguishing implicit knowledge from automatized explicit knowledge. Applied Psycholinguistics, 38, 1229-1261. https://doi.org/10.1017/S014271641700011X
Suzuki, Y., & DeKeyser, R. (2017). The interface of explicit and implicit knowledge in a second language: Insights from individual differences in cognitive aptitudes. Language Learning, 67, 747-790. https://doi.org/10.1111/lang.12241
Tan, M., & Turner, C. E. (2015). The impact of communication and collaboration between test developers and teachers on a high-stakes ESL exam: Aligning external assessment and classroom practices. Language Assessment Quarterly, 12, 29-49. https://doi.org/10.1080/15434303.2014.1003301
Tark, E. S. (2016). Acquisition of Korean obstruents by English-speaking second language learners of Korean and the role of pronunciation instruction (Doctoral dissertation). Retrieved from ProQuest. (No. 10191441)
Teo, A. (2012). Promoting EFL students’ inferential reading skills through computerized dynamic assessment. Language Learning & Technology, 16(3), 10-20. http://llt.msu.edu/issues/october2012/action.pdf
Thomson, R. I. (2011). Computer Assisted Pronunciation Training: Targeting second language vowel perception improves pronunciation. CALICO Journal, 28, 744-765.
Thomson, R. I. (2012). Improving L2 listeners’ perception of English vowels: A computer-mediated approach. Language Learning, 62(4), 1231-1258. https://doi.org/10.1111/j.1467-9922.2012.00724.x
Thomson, R. I. (2016). English Accent Coach (Version 2.3) [Computer program]. www.englishaccentcoach.com
Thomson, R. I., & Derwing, T. M. (2015). The effectiveness of L2 pronunciation instruction: A narrative review. Applied Linguistics, 36(3), 326-344. https://doi.org/10.1093/applin/amu076
Tigchelaar, M., Bowles, R. P., Winke, P., & Gass, S. (2017). Assessing the validity of ACTFL Can-Do Statements for spoken proficiency: A Rasch analysis. Foreign Language Annals, 50(3), 584-600. https://doi.org/10.1111/flan.12286
Trofimovich, P., Isaacs, T., Kennedy, S., Saito, K., & Crowther, D. (2016).
Flawed self-assessment: Investigating self- and other-perception of second language speech. Bilingualism: Language and Cognition, 19(1), 122-140. https://doi.org/10.1017/S1366728914000832
Truscott, J. (1996). The case against grammar correction in L2 writing classes. Language Learning, 46(2), 327-369.
Turner, C. E., & Purpura, J. E. (2015). Learning-oriented assessment in second and foreign language classrooms. In D. Tsagari & J. Banerjee (Eds.), Handbook of second language assessment (pp. 255-272). Boston: De Gruyter Mouton.
Vandergrift, L., & Goh, C. C. M. (2012). Teaching and learning second language listening: Metacognition in action. New York, NY: Routledge.
VanPatten, B., & Rothman, J. (2015). What does current generative theory have to say about the explicit-implicit debate? In P. Rebuschat (Ed.), Implicit and explicit learning of languages (pp. 89-116). Amsterdam: John Benjamins.
Venkatagiri, H. S., & Levis, J. M. (2007). Phonological awareness and speech comprehensibility: An exploratory study. Language Awareness, 16(4), 263-277. https://doi.org/10.2167/la417.0
Weber, A., & Cutler, A. (2004). Lexical competition in non-native spoken-word recognition. Journal of Memory and Language, 50, 1-25. https://doi.org/10.1016/S0749-596X(03)00105-0
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.
Yeldham, M., & Gruba, P. (2014). Toward an instructional approach to developing interactive second language listening. Language Teaching Research, 18, 33-53. https://doi.org/10.1177/1362168813505395
Yu, H. J. (2016). The development of obstruent consonants in bilingual Korean-English children (Doctoral dissertation). Retrieved from ProQuest. (No. 10163769)
Zoghbor, W. S. (2018). Teaching English pronunciation to multi-dialect first language learners: The revival of the Lingua Franca Core (LFC). System, 78, 1-14. https://doi.org/10.1016/j.system.2018.06.008