ACQUISITION OF L2 VOWEL DURATION IN JAPANESE
BY NATIVE ENGLISH SPEAKERS
By
Tomoko Okuno

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Second Language Studies – DOCTOR OF PHILOSOPHY
2013

ABSTRACT

Research has demonstrated that focused perceptual training facilitates L2 learners’
segmental perception and spoken word identification. Hardison (2003) and Motohashi-Saigo
and Hardison (2009) found benefits of visual cues in training for the acquisition of L2 contrasts.
The present study examined factors affecting the perception and production of vowel duration
(i.e., long versus short) in Japanese, the benefits of waveform displays as visual cues for the
acquisition of vowel duration in L2 Japanese by native speakers of English, and the transfer of
training to production. Vowel length in Japanese is a contrastive feature, important for
communication, and a challenge for many L2 learners.
A pretest-posttest design with controls was used. The between-subjects variable was training
type: auditory-visual (AV), auditory-only (A-only), and no training (controls). Within-subjects
variables were vowel type, preceding consonant, and pitch pattern. Participants were 64 learners
of Japanese whose L1 was American English. Testing and training materials were 40 bisyllabic
words containing long and short vowels. To create the stimuli, two Japanese vowels (/a, u/), two
consonants (/k, s/), and 10 pitch patterns were selected. The stimuli, produced by six native
speakers (NSs) of Japanese, were recorded.
Production and perception pre- and post-tests were administered to assess the effects of
training on perception accuracy and reaction time (RT). During production testing, participants
produced 16 bisyllabic words in isolation. For perception testing, they completed a
forced-choice, four-alternative identification task for 18 bisyllabic word stimuli. Perception
training, conducted between the pre- and post-tests, involved eight sessions of 25 minutes each,
in which participants completed the same identification task on a computer. During training,
feedback was provided on both correct and incorrect responses; immediately after each choice,
the correct word appeared on the screen.
Results indicated significant improvement in identification accuracy for both training groups,
but the rate of improvement was greater for the AV group. On the other hand, RTs of the two
groups became slower after training. In addition, vowel type, preceding consonant, pitch
pattern, and the talker’s voice in training together affected L2 learners’
perception of vowel duration. The results suggested that the learners’ stages of L2 perceptual
development involve the evaluation of input based on context- and talker-dependent perceptual
categories.

Copyright By
Tomoko Okuno
2013

To my family

ACKNOWLEDGMENTS

I could not have completed this dissertation without help, support, and encouragement from
many people. I would like to express my deepest appreciation to Dr. Debra Hardison, the co-chair
of my dissertation committee, for her support, guidance, and feedback on the proposal, draft,
and statistical analysis. In addition, she helped me to revise the dissertation by giving me
specific comments and feedback. I learned a great deal through working on and revising this
dissertation. I would like to extend my appreciation to Dr. Mutsuko Endo Hudson, the other
co-chair, for her support and feedback on my dissertation and for giving me opportunities to
expand my teaching career. In addition, I would like to thank Dr. Susan Gass and Dr. Yen-Hwei
Lin, members of the dissertation committee, for patiently reading my dissertation and giving me
helpful feedback.
I would like to thank the Department of Linguistics and Languages and Second Language
Studies Program for giving me a lot of support, funding, and opportunities to teach Japanese
courses as an instructor.
Students who were taking Japanese language courses at Michigan State University were
very supportive. They were motivated to learn Japanese and to improve their Japanese language,
and they participated in my study. I would like to thank them for their participation and
encouragement to me to finish my dissertation.
I thank the members of my dissertation working group, Soo Hyon Kim and Baburhan Uzum. We
met twice a week to write our dissertations. It was a very good way to motivate myself to
continue to work on my dissertation and to give encouragement to each other. I am very happy
to graduate with Soo Hyon and Baburhan.

Finally, I would like to thank my family and friends at Michigan State University. My
family, including my two cats, Moggy and Syllable, have always supported me emotionally since I
came to Michigan State University. Also, I want to thank my friends, including Masae Yasuda,
Nao Nakano, Misako Matsubara, Chien Hsiung (Scott) Chiu, Nobuhiro Kamiya, Kanako Kamiya,
Tsuyoshi Oshita, Junkyuu Lee, Shaofeng Lee, Seongmee Ahn, Marthe Russell, Grace Lee
Amuzie, Solène Inceoglu, Jimin Kahng, and Chiung Wang.

TABLE OF CONTENTS

LIST OF TABLES ………………………………………………………………………………xi
LIST OF FIGURES ……………………………………………………………………………..xv
CHAPTER 1: INTRODUCTION AND REVIEW OF THE LITERATURE …………………....1
Introduction ……………………………………………………………………………….1
Review of the Literature ………………………………………………………………….2
A Model for Speech Perception …………………………………………………...2
Status of Vowel Duration in Japanese: Issue of the Mora ………………………..4
Factors Affecting Perception of Segment Length for NSs ……………………….5
Factors Affecting Perception of Segment Length for NNSs ……………………..7
L2 Research with a Focus on Training Studies Involving Spectral Differences …8
L2 Research with a Focus on Training Studies Involving Exaggerated Stimuli …9
L2 Research with a Focus on AV Training Studies …………………………….11
Exemplar-Based Model …………………………………………………………13
Research Questions and Hypotheses ……………………………………………………14
Overview of Study Design ……………………………………………………………...18
CHAPTER 2: EXPERIMENT 1…………………………………………………………………20
Method …………………………………………………………………………………..20
Participants ……………………………………………………………………....20
Materials ………………………………………………………………………...21
Production Test ………………………………………………………….21
Perception Test …………………………………………………………..21
Procedures ……………………………………………………………………….23
Production Test ………………………………………………………….23
Perception Test …………………………………………………………..26
Results …………………………………………………………………………………...27
Overall Results of the Production Test ………………………………………….28
Overall Results of the Perception Test ………………………………………….29
Analysis of Production Data …………………………………………………….32
Analysis of Factors Affecting Perception Accuracy …………………………….35
Analysis of Factors Affecting Perception RT …………………………………...53
Conclusion of Experiment 1 …………………………………………………………….60
CHAPTER 3: EXPERIMENT 2…………………………………………………………………62
Method …………………………………………………………………………………..62
Participants ………………………………………………………………………62
Materials ………………………………………………………………………...62
Production Test ……………………………………………………….…62
Perception Test …………………………………………………………..62
Perception Training ……………………………………………………...63
Procedures ……………………………………………………………………….65
Production Test ………………………………………………………….65
Perception Test …………………………………………………………..65
Perception Training ……………………………………………………...66
Test of Generalization (TG) ……………………………………………..70
Results …………………………………………………………………………………...71
Comparability of Groups at Pretest ……………………………………………...71
Analysis of Overall Effectiveness of the Perception Training …………………..72
Influence of Stimulus Variables on Perception Accuracy ………………………73
Effectiveness of Training Type on Perception RT ……………………………...81
Analysis of Production Data …………………………………………………….87
Analysis of Effectiveness of Training per Group ……………………………….92
Perception Accuracy in Training – AV Group ………………………………….97
Perception Accuracy in Training – A-only Group ……………………………..103
Perception RT in Training – AV Group ……………………………………….112
Perception RT in Training – A-only Group ……………………………………119
TG with Novel Tokens – Comparison of Production Accuracy …………….....127
Overall Effects of TG (familiar and novel tokens) – Perception Accuracy ……130
TG with Familiar and Novel Tokens – Comparison of Perception Accuracy …132
Comparing Accuracy in Pretest and TG1 (novel tokens) ……………...132
Comparing Accuracy in Pretest and TG2 (novel talker) ………………136
Comparing Accuracy in Posttest and TG1 (novel tokens) ……………..142
Comparing Accuracy in Posttest and TG2 (novel talker) ……………...145
Overall Effects of TG (familiar and novel tokens) – Perception RT …….…….149
TG with Familiar and Novel Tokens – Comparison of RT ……………………151
Comparing RT in Pretest and TG1 (novel tokens) ………….................151
Comparing RT in Pretest and TG2 (novel talker) …………………..…156
Comparing RT in Posttest and TG1 (novel tokens) ……………………161
Comparing RT in Posttest and TG2 (novel talker) …………………….164
CHAPTER 4: DISCUSSION AND CONCLUSION ………………...………………………..167
Factors Affecting Perception and Production of Vowel Duration in L2 Japanese …….167
Effectiveness of Perceptual Training on Accuracy and RT ……………………………170
Effectiveness of Training per Group ………………………………………………...…172
Comparison between the Two Types of Training ……………………………………...175
Transfer to Production …………………………………………………………………176
Generalizability of the Training Effects on Perception Accuracy and RT ………….…177
Generalizability of the Training Effects to Production ………………………………...178
Conclusion ……………………………………………………………….………….…179
APPENDICES ……………………………………………………………………………..…183
Appendix A: List of Target Stimuli for Production Test in Experiment 1 ……………184
Appendix B: List of Practice Stimuli for Production Test in Experiment 1 and 2 ……185
Appendix C: List of Target Stimuli for Perception Test in Experiment 1 …………….186
Appendix D: List of Practice Stimuli for Perception Test in Experiment 1 and 2 …….188
Appendix E: List of Target Stimuli for Perception Tests in Experiment 2 ……………189
Appendix F: List of Stimuli for Perception Training in Experiment 2 ……………...…190
Appendix G: List of Practice Stimuli for Training Sessions ………………………..…191
Appendix H: List of Target Stimuli for Production Test in TG1 in Experiment 2 …….192
Appendix I: List of Target Stimuli for Perception Test in TG1 in Experiment 2 …...…193
REFERENCES ………………………………………………………………………………...194

LIST OF TABLES

Table 1: Examples of words with geminates and singletons…………………………………..….4
Table 2: Summary of independent and dependent variables for Experiment 1……………….…19
Table 3: A sample task for the raters………………………………………………………….…28
Table 4: Mean production accuracy in Experiment 1……………………….…29
Table 5: Distribution of perception accuracy by course enrollment in percentages……………..31
Table 6: Mean production accuracy of the four tokens in Experiment 1………………………...32
Table 7: Errors observed in the production data in Experiment 1……………………………….34
Table 8: Descriptive statistics for perception accuracy by pitch pattern, preceding consonant,
and vowel type……………………………………………………………………….…37
Table 9: An example of choices used in the identification task for CVV.CVV tokens………....40
Table 10: An example of choices used in the identification task for CVV.CV tokens………….44
Table 11: An example of choices used in the identification task for CV.CVV tokens………….48
Table 12: An example of choices used in the identification task for CV.CV tokens…………....52
Table 13: Descriptive statistics for perception RT by pitch pattern and CV combination
(in milliseconds)…………………………………………………………………..…...55
Table 14: Talker assignment for recording stimuli used in identification tasks………………....63
Table 15: Descriptive statistics for the perception pre/post-tests per group…………………..…72
Table 16: Mean perception accuracy of the six stimuli in Group I (CVV.CVV) in
Experiment 2………………………………………………………………………..…76
Table 17: Mean perception accuracy of the six stimulus types in Group II (CVV.CV) in
Experiment 2………………………………………………………………………..…78
Table 18: Mean perception accuracy of the six stimulus types in Group III (CV.CVV) in
Experiment 2………………………………………………………………………..…79

Table 19: Mean perception RT of the six stimuli in Group I (CVV.CVV) in
Experiment 2……………………………………………………………………..……82
Table 20: Mean perception RT of the six stimulus types in Group II (CVV.CV) in
Experiment 2………………………………………………………………………......83
Table 21: Mean perception RT of the six stimulus types in Group III (CV.CVV) in
Experiment 2…………………………………………………………………………..85
Table 22: Descriptive statistics for production tests in Experiment 2 (pretest and posttest) for
the AV and A-only groups organized by consonant-vowel combination……………..88
Table 23: Errors observed in the production posttest in Experiment 2…………………….…….91
Table 24: Mean accuracy scores of the five tokens in Group I (CVV.CVV) (AV group) ……...98
Table 25: Mean accuracy scores of the four tokens in Group II (CVV.CV) (AV group)……...100
Table 26: Mean accuracy scores of the five tokens in Group III (CV.CVV) (AV group)……..102
Table 27: Mean accuracy scores of the eight tokens in Group IV (CV.CV) (AV group)……...103
Table 28: Mean accuracy scores of the five tokens in Group I (CVV.CVV) (A-only group)….104
Table 29: Mean accuracy scores of the four tokens in Group II (CVV.CV) (A-only group)…..106
Table 30: Mean accuracy scores of the four tokens in Group III (CV.CVV) (A-only group)....108
Table 31: Mean accuracy scores of the eight tokens in Group IV (CV.CV) (A-only group)….110
Table 32: Mean RT scores of the five tokens in Group I (CVV.CVV) (AV group)………...…112
Table 33: Mean RT scores of the four tokens in Group II (CVV.CV) (AV group)……………115
Table 34: Mean RT scores of the five tokens in Group III (CV.CVV) (AV group)………...…117
Table 35: Mean RT scores of the eight tokens in Group IV (CV.CV) (AV group)……………118
Table 36: Mean RT scores of the five tokens in Group I (CVV.CVV) (A-only group)………..120
Table 37: Mean RT scores of the four tokens in Group II (CVV.CV) (A-only group)………...122
Table 38: Mean RT scores of the five tokens in Group III (CV.CVV) (A-only group)……..…124
Table 39: Mean RT scores of the eight tokens in Group IV (CV.CV) (A-only group)………...125

Table 40: Descriptive statistics (mean, SD) of the production accuracy in pretest, posttest,
and TG……………………………………………………………………………….127
Table 41: Errors observed in the production data in Experiment 2 (TG)………………………128
Table 42: Descriptive statistics for the perception accuracy in pretest, posttest, and two TGs..130
Table 43: List of stimulus types in TG1…………………………………………………………132
Table 44: Mean accuracy scores of tokens in Group I (CVV.CVV) in the comparison
between pretest and TG1………………………………………………………….…133
Table 45: Mean accuracy scores of tokens in Group II (CVV.CV) in the comparison between
pretest and TG1………………………………………………………………………134
Table 46: Mean accuracy scores for tokens in Group III (CV.CVV) in the comparison between
pretest and TG1………………………………………………………………………135
Table 47: Mean perception accuracy of the six stimulus types in Group I (CVV.CVV) in
pretest and TG2 comparison ………………………………………………………...137
Table 48: Mean perception accuracy of the six tokens in Group II (CVV.CV) in pretest and
TG2 comparison …………………………………………………………………..…138
Table 49: Mean perception accuracy of the six tokens in Group III (CV.CVV) in pretest and
TG2 comparison……………………………………………………………………...140
Table 50: Mean perception accuracy of the six tokens in Group I (CVV.CVV) in posttest and
TG1 comparison……………………………………………………………………..142
Table 51: Mean perception accuracy of the six tokens in Group II (CVV.CV) in posttest and
TG1 comparison………………………………………………………………..……143
Table 52: Mean perception accuracy of the six tokens in Group III (CV.CVV) in posttest and
TG1 comparison……………………………………………………………………...144
Table 53: Mean perception accuracy of the six stimulus types in Group I (CVV.CVV) in
posttest and TG2 comparison………………………………………………………...146
Table 54: Mean perception accuracy of the six tokens in Group II (CVV.CV) in posttest and
TG2 comparison…………………………………………………………………..…147
Table 55: Mean perception accuracy of the six tokens in Group III (CV.CVV) in posttest and
TG2 comparison……………………………………………………………………..147
Table 56: Descriptive statistics of the perception RT in the pretest, posttest, and two TGs…..149
Table 57: Mean RT scores of the tokens in Group I (CVV.CVV) in the comparison between
pretest and TG1………………………………………………………………………152
Table 58: Mean RT scores of the tokens in Group II (CVV.CV) in the comparison between
pretest and TG1………………………………………………………………………154
Table 59: Mean RT scores of the tokens in Group III (CV.CVV) in the comparison between
pretest and TG1………………………………………………………………………155
Table 60: Mean perception RT of the six stimulus types in Group I (CVV.CVV) in pretest and
TG2 comparison ………………………………………………………………….…157
Table 61: Mean perception RT of the six tokens in Group II (CVV.CV) in pretest and TG2
comparison …………………………………………………………………………158
Table 62: Mean perception RT of the six tokens in Group III (CV.CVV) in pretest and TG2
comparison………………………………………………………………………..…160
Table 63: Mean perception RT of the six tokens in Group I (CVV.CVV) in posttest and TG1
comparison………………………………………………………………………...…162
Table 64: Mean perception RT of the six tokens in Group III (CV.CVV) in posttest and TG1
comparison …………………………………………………………………………..164
Table 65: Mean perception RT of the six tokens in Group II (CVV.CV) in posttest and TG2
comparison …………………………………………………………………………..166
Table 66: Target stimuli in production test…………………………………………………….184
Table 67: Practice stimuli in production test …………………………………………………..185
Table 68: Target stimuli in perception test in Experiment 1 ……………………….………….186
Table 69: Practice stimuli in perception test ………………………………………………...…188
Table 70: Target stimuli in perception test in Experiment 2 …………………………………..189
Table 71: Stimuli in perception training ……………………………………………….………190
Table 72: Practice stimuli in training …………………………………………………………..191
Table 73: Target stimuli in production test in TG1 ……………………………………………192
Table 74: Target stimuli in perception test in TG1 ……………………………………………193

LIST OF FIGURES

Figure 1: One type of phonological structure for geminates, singletons, and long vowels
(σ: syllable; μ: mora)……………………………………………………………………5
Figure 2: Pitch assignment in the Tokyo dialect…………………………………………………..6
Figure 3: Pitch patterns used in this study with an example word with the pitch pattern ……....22
Figure 4: Instructions for the production test for Experiment 1………………………………....24
Figure 5: A “+” sign shown before the presentation of stimuli ……………………………...….25
Figure 6: Presentation of the stimuli for the production test in Experiment 1………………...…25
Figure 7: Instructions for the perception test in Experiment 1………………………………..…26
Figure 8: Presentation of the auditory stimuli and identification task for the perception test in
Experiment 1 ………………………………………………………………………….27
Figure 9: Distribution of perception accuracy by course enrollment …………………………...31
Figure 10: Effects of consonant and token type on production accuracy in Experiment 1…...…33
Figure 11: Mean perception accuracy by preceding consonant and vowel type ………………..36
Figure 12: Mean perception accuracy by pitch pattern, preceding consonant, and vowel
type ……………………………………………………………………………….…37
Figure 13: Errors for the tokens CVV.CVV with the (1) LH.HH pitch pattern ………..……….41
Figure 14: Errors for the tokens CVV.CVV with the (2) LH.HL pitch pattern ………………....41
Figure 15: Errors for the tokens CVV.CVV with the (3) HL.LL pitch pattern ………………....42
Figure 16: Effects of vowel type and pitch pattern in Group II (CVV.CV) on perception
accuracy ……………………………………………………………………………..44
Figure 17: Errors for the tokens CVV.CV with the (4) LH.H pitch pattern ………………….....45
Figure 18: Errors for the tokens CVV.CV with the (5) HL.L pitch pattern ………………….…46
Figure 19: Errors for the CV.CVV tokens with the (6) L.HH pitch pattern ………………….…48

Figure 20: Errors for the CV.CVV tokens with the (7) L.HL pitch pattern ………………….…49
Figure 21: Errors for the CV.CVV tokens with the (8) H.LL pitch pattern …………………….50
Figure 22: Effects of vowel type and pitch pattern in Group IV (CV.CV) in Experiment 1 …...51
Figure 23: Errors for the CV.CV tokens with the (9) L.H pitch pattern ………………………..52
Figure 24: Errors for the CV.CV tokens with the (10) H.L pitch pattern ………………………53
Figure 25: Mean perception RTs by preceding consonant and vowel type …………………….54
Figure 26: Mean RTs by pitch pattern, consonant, and vowel combination (in milliseconds) …55
Figure 27: Effects of preceding consonant and pitch pattern in Group II (CVV.CV) on RT in
Experiment 1…………………………………………………………………………58
Figure 28: Examples of the waveform displays …………………………………………………64
Figure 29: Instructions for perceptual training for A-only training group ………………………67
Figure 30: Instructions for perceptual training for AV training group ………………………….68
Figure 31: Identification task for perceptual training for A-only training group ……………….68
Figure 32: Identification task for perceptual training for AV training group ……………….…..69
Figure 33: The comparison of perception accuracy between pretest and posttest by group ……73
Figure 34: Stimulus type in pretest and posttest in Experiment 2……………………….………74
Figure 35: The comparison of perception accuracy of the tokens in Group II (CVV.CV) by
training groups in Experiment 2 ………………………………………………….…77
Figure 36: The comparison of perception accuracy of the tokens in Group III (CV.CVV) by
training groups in Experiment 2 ………………………………………………….…80
Figure 37: The comparison of perception RT of the tokens in Group II (CVV.CV) by training
groups in Experiment 2………………………………………………………………84
Figure 38: The comparison of perception RT of the tokens in Group III (CV.CVV) in
Experiment 2 ………………………………………………………………………..86
Figure 39: The comparison of production accuracy by vowel and token type in
Experiment 2 …………………………………………………………………..……90

Figure 40: Perception accuracy in each week and talker by AV and A-only groups ………...…92
Figure 41: Perception accuracy by talker in perceptual training ……………………………..…93
Figure 42: The RT for each week and talker by AV and A-only groups …………………….….94
Figure 43: The RT in the training grouped by the four talkers ……………………………….…95
Figure 44: Tokens in the training sessions by stimulus type ……………………………………96
Figure 45: The comparison of perception accuracy of tokens in Group I (CVV.CVV) for AV
training group ………………………………………………………………………..99
Figure 46: The comparison of perception accuracy of tokens in Group II (CVV.CV) for AV
training group ………………………………………………………………………101
Figure 47: The comparison of perception accuracy of tokens in Group I (CVV.CVV) for
A-only training group ……………………………………………………………...105
Figure 48: The comparison of perception accuracy of tokens in Group II (CVV.CV) for
A-only training group ………………………………………………...……………107
Figure 49: The comparison of perception accuracy of tokens in Group III (CV.CVV) for
A-only training group …………………………………………………….……..…109
Figure 50: The comparisons of perception accuracy of tokens in Group IV (CV.CV) for
A-only training group ………………………………………………………………111
Figure 51: The comparison of perception RT of tokens in Group I (CVV.CVV) for AV
training group ………………………………………………………………………114
Figure 52: The comparison of perception RT of tokens in Group II (CVV.CV) for AV
training group ………………………………………………………………………116
Figure 53: The comparison of perception RT of tokens in Group IV (CV.CV) for AV
training group ………………………………………………………………………119
Figure 54: The comparisons of perception RT of tokens in Group I (CVV.CVV) for A-only
training group ………………………………………………………………………121
Figure 55: The comparison of perception RT of tokens in Group II (CVV.CV) for A-only
training group ………………………………………………………………………123
Figure 56: The comparisons of perception RT of tokens in Group IV (CV.CV) for A-only
training group ………………………………………………………………………126

Figure 57: The comparison of perception accuracy of tokens in Group III (CV.CVV)
between the pretest and TG1 ………………………………………………………136
Figure 58: The comparison of perception accuracy of tokens in Group II (CVV.CV) between
the pretest and TG2 ……………………………………………………………..…139
Figure 59: The comparison of perception accuracy of tokens in Group III (CV.CVV) between
the pretest and TG2 ………………………………………………………………..141
Figure 60: The comparison of perception accuracy of the tokens in Group III (CV.CVV)
between the posttest and TG1 …………………………………………………..…145
Figure 61: The comparison of perception accuracy of tokens in Group III (CV.CVV)
between the posttest and TG2 ……………………………………………………..148
Figure 62: The comparison of perception RT for the tokens in Group I (CVV.CVV)
between the pretest and TG1 ………………………………………………………153
Figure 63: The comparison of perception RT for the tokens in Group II (CVV.CV) between
the pretest and TG1 ………………………………………………………………..154
Figure 64: The comparison of perception RT of the tokens in Group III (CV.CVV) between
the pretest and TG1 ……………………………………………………………..…156
Figure 65: The comparison of perception RT of tokens in Group II (CVV.CV) between the
pretest and TG2 ……………………………………………………………………159
Figure 66: The comparison of perception RT of tokens in Group III (CV.CVV) between the
pretest and TG2 ……………………………………………………………………161
Figure 67: The comparison of perception RT of the tokens in Group I (CVV.CVV) between
the posttest and TG1 ……………………………………………………………….163

CHAPTER 1: INTRODUCTION AND REVIEW OF THE LITERATURE

Introduction
Second language (L2) learners have difficulties in perceiving and producing new L2
contrasts once they have established a phonological system for their first language (L1) (e.g.,
Archibald, 2005; Flege, 1995). The learners need to modify the existing system or establish a
new one in order to be able to perceive or produce a new contrast in the L2, such as the contrast
between English /l/ and /r/ for Japanese and Korean native speakers (NSs) (e.g., Ingram & Park,
1998). One common difficulty that L2 learners of Japanese encounter is the acquisition of
durational contrasts (e.g., Asano, 2005; Enomoto, 1992; Hirata, 1990; Hirata & Kelly, 2010;
Minagawa, 1997; Motohashi, 2007; Motohashi-Saigo & Hardison, 2009; Toda, 1998, 2003,
2009). For English NSs, acquiring the contrasts between geminates and
singletons as well as between long and short vowels in Japanese is a challenge. According to
Toda’s (2009) study, L2 learners experienced communication breakdowns because they failed to
correctly identify or pronounce the durational contrasts. Thus, acquiring L2
durational contrasts is important for communication.
To facilitate the acquisition of L2 contrasts, several researchers have examined focused
perceptual training and found it effective for the acquisition of L2 phonetic contrasts
or segmental perception, using auditory-only (A-only) training (e.g., Borden, Gerber, & Milsark,
1983; Bradlow & Pisoni, 1999; Ingram & Park, 1998; Jamieson & Morosan, 1986; Lively,
Logan, & Pisoni, 1993; Logan, Lively, & Pisoni, 1991; McCandliss, Fiez, Protopapas, & Conway,
2002; Morosan & Jamieson, 1989; Sheldon, 1985; Sheldon & Strange, 1982; Strange &
Dittmann, 1984) and auditory-visual (AV) training (e.g., Hardison, 1999, 2003, 2005a, 2005b).
Other studies have paired auditory information with waveforms (e.g., Motohashi, 2007;
Motohashi-Saigo & Hardison, 2009) or hand gestures (Hirata & Kelly, 2010) to train durational
contrasts. Previous studies suggest that training on durational contrasts is easier than
training on spectral contrasts (Bohn, 1995). In addition, auditory as well as visual information
helped L2 learners to improve their identification accuracy for L2 contrasts in training.
Bimodal training was particularly effective for segments that were phonologically challenging
given the learners’ L1 (Hardison, 2003).
The current project investigated the factors affecting acquisition of L2 durational
contrasts and how perceptual training can contribute to that acquisition. Specifically, the
focus was the factors affecting identification accuracy of vowel duration in Japanese by L1
American English learners. To investigate this issue, four factors (vowel type, pitch pattern,
preceding consonant, and learners’ L2 proficiency) were treated as independent variables in
Experiment 1.
Experiment 2 then examined the effectiveness of two weeks of focused perceptual training using
AV versus A-only input, in order to improve L2 learners’ correct identification of L2 vowel
duration. Visual input was a waveform display. Participants were English NSs who were
studying Japanese as a foreign language in the U.S. The study examined the effectiveness of
input type on identification accuracy and response time before and after the training in order to
see how the training affected perceptual development of L2 vowel duration.

Review of the Literature
A Model for Speech Perception
Both the previous literature and experience in foreign language classrooms indicate that
durational contrasts in Japanese are difficult for L2 learners, particularly for English NSs.
Flege (1995) proposed a model of speech perception and production, the Speech Learning
Model (SLM), to explain why nonnative contrasts pose challenges for learners.
The SLM predicts two kinds of difficulties in acquiring L2 contrasts. First, it is argued
that it is difficult to acquire novel L2 contrasts such as English /l/ and /r/ for Japanese and
Korean learners (e.g., Aoyama, Flege, Guion, Akahane-Yamada, & Yamada, 2004; Flege, 1995;
Ingvalson, McClelland, & Holt, 2011). For instance, Japanese has only one liquid, which is
perceptually more similar to the flap in English (Price, 1981). Therefore, in order to acquire the
novel contrast, it is necessary to create two new categories for English liquids to distinguish
/l/ from /r/ (e.g., Ingram & Park, 1998; Lively, Logan, & Pisoni, 1993; Sekiyama
& Tohkura, 1993; Takagi, 1993).
Flege (1995) also claims that it is difficult to acquire ‘similar L2 contrasts’ (i.e., two
segments that are contrastive in the L2, but not in the L1). For example, the contrast between
English /i/ and /ɪ/ is difficult to acquire for Italian NSs because Italian has only /i/ in its
phonological system (Flege & MacKay, 2004). This second category of difficulty described by
Flege can be found when NSs of English acquire L2 Japanese durational contrasts, including the
contrast between a geminate and a singleton consonant, as well as the contrast between a long
and a short vowel. These durational differences are contrastive in Japanese (Kubozono, 1999b;
Shibatani, 1990), but not in English. For example, Motohashi (2007) showed that English NSs
have difficulties in acquiring the durational contrast between geminates and singletons, as in
Table 1.

Table 1: Examples of words with geminates and singletons (Motohashi, 2007)

Words with Geminates                       Words with Singletons
Geminate   Japanese   English Gloss        Singleton   Japanese   English Gloss
kk         kakko      parenthesis          k           kako       past
tt         kottoo     antique              t           kotoo      isolated island
ss         sassu      to infer             s           sasu       to bite

In addition, Asano (2005) reported that distinguishing vowel duration (i.e., long versus short
vowels), as in ojiisan ‘grandfather’ and ojisan ‘uncle’, is difficult for native English speakers.
The distinction between short and long vowels is not contrastive in English.

Status of Vowel Duration in Japanese: Issue of the Mora
Another factor involved in the difficulty of acquiring L2 durational contrasts in Japanese
is the role of the mora as a unit of timing. English is a stress-timed language and employs a
syllable as a basic unit for timing (Pennington, 1996). Stressed vowels have longer duration than
unstressed vowels when they are spoken in isolation; unstressed vowels go through the process
of lenition and are reduced to schwa (Hayes, Kirchner, & Steriade, 2004). Thus, a key factor
determining the length of vowels in English is whether stress falls on the vowel or not. Vowels
also tend to lengthen before a voiced consonant. In Japanese, on the other hand, word stress is
not a key to determining the length of vowels; neighboring moraic units tend to show equal
duration (Port, Dalby, & O’Dell, 1987). The mora in Japanese is the key unit of timing (e.g.,
Kubozono, 1999a; Tsujimura, 2007). Following Hayes (1989), Figure 1 shows one way to
represent the phonological structures of a geminate, singleton, and long vowel (Hardison &
Motohashi-Saigo, 2010, p. 82) incorporating both the moraic and syllabic levels of representation.
Figure (1a), (1b), and (1c) represent a geminate consonant (tt), a singleton consonant (t), and a
long vowel (ii) respectively.

Figure 1: One type of phonological structure for geminates, singletons, and long vowels (σ:
syllable; μ: mora)
a. kitte (two syllables, three morae)   b. kite (two syllables, two morae)   c. kiite (two syllables, three morae)

Figure (1a) is a phonological structure of the word kitte ‘a stamp;’ it has two syllables, but /t/ in
the coda position of the first syllable also forms the onset of the second syllable and there are
three morae. On the other hand, Figure (1b) is the structure of the word kite ‘coming;’ it also has
two syllables, but only two morae as /t/ only forms the onset of the second syllable. Figure (1c)
is the structure of the word kiite ‘listening;’ the vowel in the nucleus position constitutes two
morae. The difference in the basic units of timing, a syllable versus a mora, may contribute to
the difficulty English NSs have in acquiring durational contrasts in Japanese. A number of
research studies investigated factors affecting perception of these moraic units by both NSs and
NNSs.
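The mora counts described for kitte, kite, and kiite can be illustrated with a short Python sketch. It is a simplified illustration only, assuming the romanization used in this chapter (long vowels written as doubled vowel letters, geminates as doubled consonants, the moraic nasal as capital N, and dots as syllable boundaries). The function name count_morae is hypothetical and is not part of the study's materials.

```python
VOWELS = set("aiueo")

def count_morae(word):
    """Count morae in a romanized Japanese word (simplified sketch)."""
    letters = word.replace(".", "")  # drop syllable-boundary dots
    morae = 0
    for i, ch in enumerate(letters):
        nxt = letters[i + 1] if i + 1 < len(letters) else ""
        if ch in VOWELS:
            morae += 1  # each vowel letter is one mora, so "ii" counts as two
        elif ch == "N":
            morae += 1  # moraic nasal, written N as in hoN.ya
        elif ch == nxt:
            morae += 1  # first half of a geminate consonant
    return morae

for w in ["kitte", "kite", "kiite", "hoN.ya"]:
    print(w, count_morae(w))
```

Under these assumptions, kitte and kiite each count as three morae and kite as two, although all three words have two syllables.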

Factors Affecting Perception of Segment Length for NSs
Regarding Japanese NSs’ perception, Fujisaki, Nakamura, and Imoto (1973, cited in
Toda, 2003) found that the actual length of the special morae (i.e., morae that consist of a
geminate, Figure 1a, a long vowel, Figure 1c, and moraic nasal such as hoN.ya ‘a bookstore’)
plays an important role. A syllable containing a special mora counts as one syllable but two morae, and perception of
special morae is categorical, not continuous. Fujisaki and Sugifuji (1977) examined the Japanese
NSs’ perception of geminates using synthesized stimuli where the closure duration of a stop
consonant was manipulated. The NSs were asked to discriminate between geminates and
singletons. They found that the closure duration was a key for the NSs to correctly discriminate
the two segments.
In addition to the duration itself, other studies (e.g., Nagano-Madsen, 1992; Ofuka, 2003)
found that other factors such as pitch accent patterns can affect NSs’ perception. In Japanese,
each mora receives either High (H) or Low (L) pitch as in Figure 2, and pitch accent is
contrastive (Shibatani, 1990).

Figure 2: Pitch assignment in the Tokyo dialect
a.  a  me   ‘rain’
    H  L
b.  a  me   ‘candy’
    L  H

In Figure (2a), the word ame has a HL pitch pattern, which means ‘rain’ in the Tokyo dialect.
On the other hand, in Figure (2b), the same segmental sequence ame has a LH pitch pattern, and
means ‘candy.’ Ofuka (2003) investigated how different pitch accents, HL or LH, affected
Japanese NSs’ perception of geminates and singletons. She manipulated the closure duration of
the stop /t/ in words such as katta [HL(L)] ‘won’ and katta [LH(H)] ‘bought’ to create geminates
and kata (HL) ‘shoulder’ and kata (LH) ‘pattern’ to create singletons. Her findings
demonstrated that the NSs needed a longer closure duration for a word with LH(H) to be
perceived as a geminate, compared to a word with a HL(L) pattern.

Factors Affecting Perception of Segment Length for NNSs
Regarding NNSs’ perception, Toda (1998), Enomoto (1992), and Hardison and Motohashi-Saigo (2010) reported that L2 proficiency can affect perception of the durational contrast. In her
study of English NSs’ perception of Japanese vowel duration, Toda found that the NSs and
beginning level learners required a different duration for a vowel to be judged as long. Enomoto
found that the advanced learners of Japanese showed similar perceptual boundaries for geminates
and long vowels as Japanese NSs; however, the beginning learners did not. Thus, perception of
durational contrasts may progress along with overall L2 language proficiency. Hardison and
Motohashi-Saigo’s findings also indicated that correct identification of geminates with three
different consonants (i.e., /t/, /k/, and /s/) was affected by learners’ proficiency. For beginners,
segmental duration significantly affected the identification of all types of geminates. Yet for
low-intermediate and advanced learners, geminates with /s/, particularly geminates with /s/
followed by the vowel /u/¹, were significantly more difficult to identify than others.
In addition to proficiency, pitch-accent pattern and position in a word affect perception of
vowel duration. Minagawa (1997) investigated whether pitch patterns (HH, LL, HL, and LH)
affected the perception of vowel duration for L2 learners whose L1s were Korean, Chinese,
English, Spanish, and Thai, and found that they (1) had greater perception accuracy of long
vowels when a word had a high-high (HH) pitch pattern, and (2) showed a tendency to perceive a
long vowel as a short vowel when a word had a low-low (LL) pitch pattern. Koguma (2000)
investigated how L2 learners (L1 English) perceived long vowels in various positions in a word
(i.e., word-initial, word-medial, and word-final) and found that word-final position was the most
difficult and word-initial position was the easiest.

¹ In this dissertation, /u/ is used as a typographical convention, but the Japanese vowel is [ɯ].

L2 Research with a Focus on Training Studies Involving Spectral Differences
In the literature, several perceptual training studies were conducted in order to examine
whether training helps L2 learners to develop their ability to correctly perceive L2 contrasts (e.g.,
Bradlow & Pisoni, 1999; Ingram & Park, 1998; Lively, Logan & Pisoni, 1993; Logan, Lively, &
Pisoni, 1991; Sheldon, 1985; and Strange & Dittman, 1984). Successful perceptual training
studies have reported the improvement of correct identification accuracy of L2 contrasts. The
first successful perceptual training reported in the literature was a study by Logan et al. (1991).
They conducted training for three weeks (i.e., a total of approximately 7.5 hours) to train L1
Japanese learners of L2 English to correctly identify /l/ and /r/. They found significant
improvement in ESL learners’ identification of the sounds.
Following the study by Logan et al. (1991), subsequent studies by Pisoni and colleagues
demonstrated the facilitative effects of auditory perceptual training (Lively et al., 1993; Bradlow,
Akahane-Yamada, Pisoni, & Tohkura, 1999). Lively et al. (1993) reported that their perceptual
training of L2 learners facilitated correct segmental identification of English /l/ and /r/ and
suggested the benefits of having training stimuli produced by multiple talkers. In addition,
Lively et al. (1993) found that the effects of perceptual training could be retained for three
months in a setting where English was a foreign language. Bradlow et al. (1999) examined
whether the facilitative effects of perceptual training could be retained and transferred to
production. They found that perceptual training enhanced correct identification of English /l/
and /r/ as well as improved production of the segments without explicit production training.
Additionally, they discovered that development in both perception and production was retained
even after three months.
The above perceptual training studies suggested the factors necessary to make perceptual
training successful. First, they emphasize that an identification task (vs. a discrimination task)
with stimuli containing high variability should be used because the identification task promotes
learners’ classification of the target sounds into appropriate categories. Logan et al. also used
different phonetic environments (i.e., different positions in a word such as initial/final clusters
and singletons) so that learners were exposed to a full range of cues and the development of
robust perceptual categories was enhanced. In addition, it is important to provide immediate
feedback during training because it can “enhance the learning process by allowing observations
of within-category similarities and between-category distinctions across contexts and talkers”
(Hardison, 2003, p. 515).

L2 Research with a Focus on Training Studies Involving Exaggerated Stimuli
Although most of the training studies employed /l/ and /r/ as the targets for training, there
are a few studies with other approaches, including the use of exaggerated acoustic cues.
Jamieson and Morosan (1989) conducted short perceptual training, including two training
sessions lasting 90 minutes with voiced and voiceless interdental fricatives using natural and
synthetic stimuli. Synthetic stimuli were created by exaggerating the amount of frication. Their
results indicated that (1) identification accuracy improved and (2) training with exaggerated
stimuli generalized to natural speech as well as a new talker. One of the limitations of their
study was the failure of training using the word-initial position to generalize to other positions in
the word such as word medial or final positions. Also, the results did not generalize to improved
performance with the [ð] and [d] confusion.
The efficacy of exaggerated cues was suggested by Kuhl, Andruski, Chistovich,
Chistovich, Kozhevnikova, Ryskina, Stolyarova, Sundberg, and Lacerda (1997). Mothers who
were NSs of English, Russian, and Swedish talked to their infants using hyperarticulated vowels
(/i/, /a/, /u/) in contrast to vowels in their speech to other adults. Kuhl et al.’s results may have
implications for L2 acquisition: hyperarticulated input may be adopted at the beginning of
learning an L2 so that it is easier to draw learners’ attention to the critical features in the input.
However, it is also important to give learners natural speech as input because they have to deal
with natural speech in communication. Therefore, hyperarticulated input should be changed to
natural speech over time. Uther, Knoll, and Burnham (2007) also found that female speakers of
Southern British English showed hyperarticulation of vowels in infant-directed speech as well as
speech directed to adult nonnative speakers of English compared to other adult English speakers.
McCandliss et al. (2002) examined the effectiveness of modified speech in the
development of perception of L2 contrasts between English /l/ and /r/ for L1 Japanese learners of
L2 English. They compared adaptive (i.e., exaggerated input; F3 of /l/ and /r/ are exaggerated)
and fixed (i.e., natural input) training for L2 learners with and without feedback for perception of
English /l/ and /r/. Results indicated that the most effective training condition was natural input
with feedback; exaggerated cues were not necessary. However, they did not examine the effects
of neighboring vowels on the segments. In addition, their perceptual training involved self-controlled sessions; therefore, it is unknown how much the participants paid attention to the
stimuli during the training or how they carried out the training.

L2 Research with a Focus on AV Training Studies
In addition to auditory perceptual training, a few researchers examined auditory-visual
(AV) training on the development of L2 perception (e.g., Hardison, 2003; Hirata & Kelly, 2010;
Motohashi, 2007; Motohashi-Saigo & Hardison, 2009). Different from unimodal language input
such as listening to speech sounds, bimodal input involves auditory information as well as visual
cues such as facial cues and/or hand gestures, which can be additional resources for the learners
to identify contrasts. Hardison (2003) compared two types of perceptual training (i.e., AV using
articulatory gestures with auditory information, and A-only) on the identification of L2 English
/r/ and /l/ by NSs of Japanese and Korean. She found that both training types brought
improvement in identification accuracy; however, the AV training provided significantly greater
improvement. Based on the study, visual input facilitated perception of the segments “in the
most challenging phonetic environments for each L1 group” (p. 514). In addition, she also
discovered that production of /l/ and /r/ improved significantly as a result of perceptual training.
Thus, similar to the successful A-only studies described earlier (e.g., Logan et al., 1991), the
effects of AV training can also be transferred to other skills. In this way, Hardison has shown
the advantage of bimodal input (i.e., auditory-visual input) over unimodal input in identifying
different L2 consonants such as /l/ and /r/.
Motohashi-Saigo and Hardison (2009) also examined the effects of visual input on the
acquisition of Japanese durational contrasts. They used waveform displays along with the
auditory information in AV training, compared it with A-only training, and examined how the
visual cues helped the development of correct identification of Japanese geminate consonants by
NSs of English. They found that learners who received AV training improved significantly in
identification accuracy, generalized the improvement to novel stimuli, and showed transfer to
improved production. There were significant advantages of AV training with waveforms over
A-only training.
Hirata and Kelly (2010) also investigated the effect of multimodal information on the
perception of vowel durations in Japanese by NSs of English. The perceptual training (4
sessions, 120 minutes total) included four types of input: “A-only” (audio with visual image of
speaker with no movement), audio with lip movements, audio with hand gestures, audio with
hand gestures and lip movements. During the training, non-words were embedded in carrier
sentences, and produced at a slower pace by four different talkers. The researchers used
identification tasks in both testing and training. The participants listened to the input and
decided whether the second vowel in each target word was short or long. The results showed
that there were statistically significant effects of training so that the participants improved their
ability to identify vowel duration after the training. The audio with lip movement condition was
significantly better than A-only. The authors concluded that mouth movements were beneficial,
but the hand gestures had not helped perceptual learning. There are several methodological
issues with this study: a) participants were not learners of Japanese and had never been exposed
to the language so that it was difficult to compare or generalize their results/findings with other
studies involving learners of the target language, b) several stimulus factors such as rate of
speech, voice, and varying context of carrier sentence were not treated as variables, c) the hand
gesture involved the type of stroke associated with the given vowel duration and the hand’s
location in the speaker’s gesture space, which were not markedly different between the short and
long vowels, d) training involved four sessions, and e) pre- and post-test data were based only on
auditory information.

Okuno (2009) investigated the most effective training type for the correct identification
of L2 vowel duration (i.e., long and short vowels) in Japanese, using four different types of
perceptual training (i.e., AV and A-only training with hyperarticulated or natural speech).
Participants were 29 learners of Japanese as a FL (L1 English) at the beginning level. AV input
was a speaker’s face. The learners took a total of eight training sessions. In order to examine the
efficacy of the training, perception accuracy scores before and after training were compared.
The results indicated that all the learners improved in identification accuracy after the training;
however, no advantage was found for hyperarticulated speech over natural speech. One of the
possible explanations for the finding is that the study did not involve perceptual fading moving
from exaggerated speech to normal speech that other studies such as Morosan and Jamieson
(1989) had incorporated. Since the participants were not presented with graduated stimuli from
exaggerated to natural, they did not adjust their skills to correctly identify different lengths of
vowels in natural speech. In addition, the pretest scores may have reflected a ceiling effect.
Therefore, it was difficult to conclude whether hyperarticulation was effective for the
development of correct identification of L2 durational contrasts.

Exemplar-Based Model
The L2 learners’ performance in the previous studies that investigated the effects of
perceptual training (e.g., Logan et al., 1991; Hardison, 2003) was affected significantly by the
context in which the contrasts were embedded and by talker variables. Findings in Hardison’s
(2003) studies revealed that “the context- and talker-dependent nature of speech processing
support the view that sources of variability or complexities in the speech signal are not merely
noise discarded from the signal during processing, but are a part of subsequent neural
representations” (p. 515). Perceptual training which provides the learners with multiple
exemplars in visual and/or speech input and feedback can enhance development of identifying
L2 contrasts.

Research Questions and Hypotheses
To sum up, the success of auditory and auditory-visual training for correct identification
of L2 segments has been established in the literature. Lively et al. (1993) concluded that training
should include stimulus variability, multiple talkers, identification tasks, and feedback in order to
develop robust perceptual categories. L2 learners have shown variable performance according to
phonetic environment and talker. This indicates that the learners use context- and talker-dependent exemplars. Most of the previous investigations have paid closer attention to the
perception of consonants in the L2, including /l/ and /r/ or /θ/ and /s/, as a focus of training. On
the other hand, few studies have focused on the effects of perceptual training on vowel
identification. Except for Hirata and Kelly (2010), no study has yet reported the effects of
training on vowel duration, which is a contrastive feature in Japanese, important for
communication, and a challenge for many L2 learners. Learners need to modify their perceptual
system to perceive vowel duration accurately in the L2. Perceptual training can provide focused,
identifiable input, which can shift their attention to relevant cues. The shift could, in turn,
promote a reorganization of perceptual distances in psychophysical space (Hardison, 2003). By
examining the efficacy of perceptual training on the identification of vowel duration and the
possibility of reorganizing perceptual distances, the present study seeks to fill a gap in the
previous literature.

This project investigates the effects of visual cues on the acquisition of vowel duration in
L2 Japanese by English NSs. Following Motohashi-Saigo and Hardison (2009), waveforms
were used as visual cues because they contain visual information on vowel duration. Also,
pseudo words (i.e., words that can be pronounced but do not have any meanings) were used in
order to avoid effects of neighborhood density, word frequency, and size of vocabulary.
Previous psycholinguistic research (e.g., Bundgaard-Nielsen, Best, & Tyler, 2011; Imai, Walley,
& Flege, 2005; Metsala, 1997; Ziegler, Muneaux, & Grainger, 2003) has shown that
neighborhood density and a learner’s size of vocabulary significantly affected word recognition
and determination of the phonological contrasts. For measurement, in addition to accuracy of
perception and production, reaction times (RTs) were measured when L2 learners identified
vowel duration both in testing and training. The proposed study is designed to investigate the
following five main research questions.

Research Question 1: What factors affect perception accuracy, perception latency, and production
accuracy of vowel duration in L2 Japanese?

Hypothesis 1a: Based on Minagawa (1997), I hypothesized that pitch pattern could affect the
perception of vowel duration. In Minagawa’s study, it was easier to identify long vowels with a
HH pitch pattern and short vowels with a LL pitch pattern. Thus, tokens with the high pitch
pattern would have higher accuracy and shorter RT than the low pitch or falling pitch (HL) if the
pitch height is a key for L1 English learners.

Hypothesis 1b: Regarding the types of vowels, high vowels such as /u/ have shorter duration than
the low vowel /a/ in the Tokyo dialect. The duration of the long vowel /u/ could be very close to
that of the short vowel /a/. As a result, NNSs may demonstrate difficulties in correctly
identifying vowel duration for the high vowels. Thus, I hypothesized that the type
of vowel could affect NNSs’ perception of vowel duration, and identification accuracy and RT of
the low vowel would be higher than that of the high vowel.

Hypothesis 1c: Based on Hardison and Motohashi-Saigo (2010), I hypothesized that proficiency
would affect the identification of long vowels. In this study, pseudo-words were used in order to
remove possible influences of vocabulary size, word familiarity, and neighborhood density.
Therefore, the ability to correctly identify the durational contrast could be related to the length
of study and overall L2 proficiency. Thus, it was predicted that identification accuracy would be higher
and RT would be shorter if the learners’ proficiency was higher.

Research Question 2: Is focused perceptual training effective for the acquisition of vowel
duration? How do perceptual accuracy and RT vary across the period of training? Do they vary
according to talker and/or other stimulus factors?

Hypothesis 2a: Based on the previous training studies including Hardison (2003) and Motohashi-Saigo and Hardison (2009), I hypothesized that focused perceptual training could be effective for
the correct identification of vowel duration. In other words, L2 learners would have higher
accuracy in identifying the correct length of vowels after training.

Hypothesis 2b: Based on the previous training studies (e.g., Lively et al., 1993), I hypothesized
that L2 learners’ accuracy in identifying vowel length would increase, and response time (i.e.,
RT) would decrease as they progressed in training. As the other studies show, the largest
improvement in accuracy and RT could take place between Week 1 and Week 2 of the training
or from the pretest to the end of Week 1.

Research Question 3: Which type of input in training, AV (with waveform display) or A-only, is
more effective for development of identification accuracy of durational contrasts in L2 vowels?
Does the effectiveness vary with proficiency level, vowel type, and preceding consonant?

Hypothesis 3: Based on Hardison (2003) and Hardison and Motohashi-Saigo (2010), I
hypothesized that the most effective type of training would be AV training. Hardison and
Motohashi-Saigo suggested that L2 learners can use visual cues, specifically waveforms as “a
valuable source of input in L2 learning” (p. 42).

Research Question 4: Does perception training transfer to production improvement?

Hypothesis 4: Based on Hardison (2003) and Bradlow, Akahane-Yamada, Pisoni, and Tohkura
(1999), I hypothesized that the effect of the training would transfer to another skill (i.e.,
production) if the training was effective.

Research Question 5: Does training generalize to novel stimuli spoken by a familiar talker from
training as well as stimuli spoken by an unfamiliar voice? Does the ability to generalize vary
according to the modality of training input? Do other stimulus factors affect the process?

Hypothesis 5: Based on Hardison (2003), I hypothesized that the effect of the training would
generalize to correct identification of new tokens and a new voice if the training was effective.

Overview of Study Design
Two experiments were conducted for this study. Experiment 1 was designed to
investigate factors affecting the identification and production of L2 vowel duration in Japanese.
In addition, it had the objective of potentially reducing the number of factors and/or levels for
analysis of the effects of training (Experiment 2) if they were not statistically significant. A
cross-sectional design was adopted for the experiment. A between-subject factor was L2
proficiency (i.e., High, Mid, Low). Within-subject factors were vowel type: /a/, /u/ (one low
and one high vowel), pitch pattern (where the dot represents a syllable boundary): LH.HH,
LH.HL, HL.LL, LH.H, HL.L, L.HH, L.HL, H.LL, L.H, H.L, and preceding consonant: /k/, /s/
(one stop and one fricative). Dependent variables were perception accuracy (i.e., percentage of
correct identification of vowel length), production accuracy (i.e., based on NSs’ ratings of
correct pronunciation), and perception reaction time (RT) (i.e., RT in milliseconds). Independent
and dependent variables are summarized in Table 2 below.
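As a rough illustration of how the two perception measures defined above (percentage of correct identification and RT in milliseconds) could be computed per condition from trial-level data, consider the following Python sketch. The trial records, their field layout, and the function name summarize are hypothetical and invented for illustration. The source does not describe a scoring script, and averaging RT over all trials (rather than over correct trials only) is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial records: (pitch_pattern, response_correct, rt_ms).
# These values are invented for illustration and are not data from the study.
trials = [
    ("LH.HH", True, 812.0),
    ("LH.HH", False, 1040.0),
    ("H.L", True, 655.0),
    ("H.L", True, 701.0),
]

def summarize(trials):
    """Return {condition: (percent_correct, mean_rt_ms)}."""
    by_condition = defaultdict(list)
    for condition, correct, rt_ms in trials:
        by_condition[condition].append((correct, rt_ms))
    summary = {}
    for condition, rows in by_condition.items():
        # Percentage of trials identified correctly in this condition.
        accuracy = 100.0 * sum(correct for correct, _ in rows) / len(rows)
        # Mean RT over all trials in this condition (assumption: errors included).
        summary[condition] = (accuracy, mean(rt for _, rt in rows))
    return summary

summary = summarize(trials)
print(summary)
```

The same grouping could be applied to vowel type, preceding consonant, or proficiency level by changing the first field of each record.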

Table 2: Summary of independent and dependent variables for Experiment 1
Variables                                  Description
Between-Subject
  L2 Proficiency (3)                       High, Mid, Low
Within-Subject
  Vowel Type (2)                           Low, High (/a/, /u/)
  Pitch Pattern (10)                       LH.HH, LH.HL, HL.LL, LH.H, HL.L, L.HH, L.HL, H.LL, L.H, H.L
  Preceding Consonant (2)                  /k/, /s/
Dependent Variables
  Perceptual Identification Accuracy       Percentages of correct identification
  Production Accuracy                      NSs’ ratings of correct pronunciation
  Perception RT                            RT in milliseconds

CHAPTER 2: EXPERIMENT 1

Method
Participants
Participants were 64 L2 learners, whose L1 was American English, studying Japanese as
a foreign language at a large Midwestern university in the U.S. They were enrolled in the first
year (n=24), second year (n=17), third year (n=16), and fourth year (n=7) Japanese courses at the
time of the experiment. The participants enrolled in the first year Japanese language course (12
females and 12 males) did not have previous knowledge of Japanese when they started to study it.
At the time of participation, they had studied Japanese for about three months. The participants
enrolled in the second year Japanese language course (9 females and 8 males) had passed the
first year course (i.e., a total of 125 hours instruction in class) and were in the third semester.
The participants enrolled in the third year Japanese language course (9 females and 7 males) had
passed the second year course (i.e., a total of 250 hours instruction in class since the beginning of
their study) and were in the fifth semester. The participants in the fourth year Japanese course (6
females and 1 male) had passed the third year course (i.e., a total of 350 hours instruction in class
since the beginning of their study) and were in the seventh semester. No heritage learners
participated in this study, and all of the participants reported normal hearing and vision.
In the elementary Japanese language courses, the first- and second-year courses, the
contact hours of the class were 50 minutes per day, five times per week (a total of 125 hours of
instruction per year). In class, an instructor corrected the students’ inaccurate pronunciation
during oral drills and communicative activities; however, no special training for discriminating
particular phonemic contrasts was usually provided.

Generally speaking, the longer learners study Japanese, the more interactions they have with
Japanese NSs. However, it was not necessarily the case that those interactions led to
development in Japanese proficiency because of individual differences such as motivation, L2
use, and L2 exposure.

Materials
Production Test: Target materials included 16 tokens contrasting long and short vowels
(Appendix A). High and low vowels, /a, u/, and two consonants /k, s/ were used to construct
target stimuli. The two consonants, a voiceless velar stop and a voiceless fricative, were selected
for this experiment based on the potential role played by consonant-vowel sonority difference on
learner perception (Hardison & Motohashi-Saigo, 2010). The vowels /a, u/ represent the longest
and shortest vowels respectively in the Tokyo dialect. In addition to the target tokens in
Appendix A, four practice trials in Appendix B were prepared to familiarize participants with the
task.

Perception Test: Target materials included 40 tokens contrasting long and short vowels
(Appendix C). High and low vowels, /a, u/, and two consonants /k, s/ were used to construct
target stimuli. Also, 10 pitch patterns that occur in the language were used in this study as
shown in Figure 3. Each target was assigned one of the 10 pitch patterns. As a result, the target
stimuli included both real words and pseudo-words as in Appendix C.
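The count of 40 perception tokens is consistent with a full crossing of the three stimulus factors (2 vowels × 2 consonants × 10 pitch patterns). The Python sketch below enumerates those design cells and derives each cell's CV skeleton from its pitch pattern, on the assumption that each pitch letter corresponds to one mora, so that a two-letter syllable is long (CVV) and a one-letter syllable is short (CV). The variable and function names are hypothetical, and the sketch does not reproduce the actual items, which are listed in Appendix C.

```python
from itertools import product

VOWELS = ["a", "u"]
CONSONANTS = ["k", "s"]
PITCH_PATTERNS = [
    "LH.HH", "LH.HL", "HL.LL", "LH.H", "HL.L",
    "L.HH", "L.HL", "H.LL", "L.H", "H.L",
]

def skeleton(pitch_pattern):
    """Map a pitch pattern to a CV skeleton (assumes one pitch letter per mora)."""
    return ".".join(
        "CVV" if len(syllable) == 2 else "CV"
        for syllable in pitch_pattern.split(".")
    )

# One design cell per vowel x consonant x pitch-pattern combination.
design = [
    {"vowel": v, "consonant": c, "pitch": p, "skeleton": skeleton(p)}
    for v, c, p in product(VOWELS, CONSONANTS, PITCH_PATTERNS)
]

print(len(design))
```

Each of the 40 cells pairs one vowel and one consonant with one pitch pattern, so every pitch pattern appears with both vowels and both consonants.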

Figure 3: Pitch patterns used in this study with an example word with the pitch pattern
 1. CVV.CVV   LH.HH   koo.hoo   ‘official information’
 2. CVV.CVV   LH.HL   koo.hii   ‘coffee’
 3. CVV.CVV   HL.LL   kee.zai   ‘economics’
 4. CVV.CV    LH.H    ii.e      ‘no’
 5. CVV.CV    HL.L    aa.to     ‘art’
 6. CV.CVV    L.HH    ji.koo    ‘statute of limitation’
 7. CV.CVV    L.HL    i.suu     ‘heteromerous’
 8. CV.CVV    H.LL    ma.naa    ‘manner’
 9. CV.CV     L.H     ha.na     ‘flower’
10. CV.CV     H.L     u.mi      ‘sea’

In the current project, mostly pseudo words were used (i.e., words that can be pronounced in
terms of the phonology of Japanese but do not have a meaning). Psycholinguistic
research (Bundgaard-Nielsen, Best, & Tyler, 2011; Imai, Walley, & Flege,
2005; Metsala, 1997; Ziegler, Muneaux, & Grainger, 2003) has found that neighborhood
density and a learner’s size of vocabulary significantly affect word recognition and
determination of phonological contrasts. Therefore, in order to avoid these effects, most of the
stimuli in the current project were pseudo words. There were 10 real words in order to balance
the stimuli; however, their frequency was not high and the learners may have had limited
exposure to them, if any. In the analysis, they will be compared with pseudo words and used if
there are no statistical differences between the two types of words. In addition to the target
tokens in Appendix C, four practice trials in Appendix D were prepared to familiarize
participants with the task.
Six NSs of Japanese, whose ages ranged from 18 to 35 years old and who were born in
Tokyo or near the Tokyo area of Japan, were recruited (4 females and 2 males) to record the
stimuli. In this project, pitch patterns used in kyootsuu-go, a dialect spoken in the Tokyo area
(Shibatani, 1990), were used. Therefore, the NSs who were born in the area were recruited.
While the NSs were bilingual in English and Japanese, their dominant language was
Japanese. One of the female speakers, Talker 1, produced the testing and practice tokens for the
perception test in Experiment 1.

Procedures
Production Test: A computerized production test was created using E-Prime. The production test
was administered prior to the perception test. This order was adopted in order to avoid providing
participants with auditory input of the target tokens prior to the production tests, which could
influence the participants’ correct pronunciation of the target tokens.
During production testing, a visual prompt task of 16 tokens, listed in Appendix A, was
given to participants. Prior to the target stimuli, practice tokens, listed in Appendix B, were
given in order to familiarize participants with the task. The stimuli were written in roomaji (i.e.,
the alphabet representation of Japanese sounds), not hiragana, because the distinction between
long and short vowels was clearer (e.g., kaakaa vs. かあかあ ‘high school’) for some participants
whose proficiency was lower. The experiment was conducted in a quiet room.
The procedure of production testing is described below. First, participants read the
instructions on the computer screen (Figure 4).

Figure 4: Instructions for the production test for Experiment 1

Then, a plus sign (‘+’) appeared on the computer screen for two seconds (Figure 5), followed by
the target stimulus (Figure 6), which the participant was asked to read. When the participants
were ready to pronounce the word, they were asked to press ‘P’ to move to the next screen.

Figure 5: A “+” sign shown before the presentation of stimuli

Figure 6: Presentation of the stimuli for the production test in Experiment 1

On the next screen, the participants were asked to pronounce the word and then to press the ‘P’
key to move to the next stimulus.

Perception Test: After the production test, a perception test was given. During perception testing,
participants were given a forced-choice, four-alternative identification task involving a total of
40 target stimuli (see Appendix C). The rationale for using the identification task rather than a
discrimination task was based on previous studies (e.g., Logan et al., 1991). The choices were
written in romanization to make the distinction between long and short vowels clearer. First,
participants read the instructions on the computer screen (Figure 7).

Figure 7: Instructions for the perception test in Experiment 1

Then, a plus sign (‘+’) appeared on the computer screen (Figure 5) for two seconds. After
participants listened to a word played on the computer, they were asked to choose one option that
they thought matched what they heard from the list provided on the computer screen (Figure 8).
The participants were able to see the choices while they were listening to the stimuli.

Figure 8: Presentation of the auditory stimuli and identification task for the perception test in
Experiment 1

When the auditory stimulus ended, the timer to measure RT started. As soon as the participant
made a choice, the timer to measure RT stopped. Then, the computer screen showed the plus
sign and moved to the next stimulus. There was no feedback given in Experiment 1.
Prior to the target stimuli, the practice tokens in Appendix D were given to the
participants in order to familiarize them with the task. For each stimulus, a participant’s
responses, identification accuracy, and RT were recorded on the computer and saved for later
analysis. It was determined that the participants whose scores were 90% or higher in the
perception test would be excluded from this study in order to avoid ceiling effects.

Results
Identification accuracy scores (i.e., percentages of correct responses), production
accuracy scores (i.e., Japanese NSs’ rating), and Reaction Time (RT) in milliseconds were
tabulated. The data were analyzed and are reported in the following order: (1) the overall results
of the perception and production tests, (2) factors affecting production accuracy of vowel
duration, (3) factors affecting perception accuracy of vowel duration, and (4) factors affecting
perception latency. For the statistical analysis, alpha level was set at .05 (α = .05).

Overall Results of the Production Test: A total of 64 participants took a production test in
Experiment 1. A total of 16 items in Appendix A were used, and accuracy of correct
pronunciation was measured using NSs’ judgment. Three female NSs of Japanese rated the
participants’ pronunciation. The NSs, whose ages ranged from 30 to 40 years old, were born in
Japan and lived in the US. All of the three raters had taken linguistics courses and had Japanese
teaching experience. For rating, the raters were asked to listen to the words pronounced by the
participants and choose what they thought they heard from the list provided as in Table 3 below.

Table 3: A sample task for the raters
Item to be Rated    A List of Choices
kaakaa              (a) kaakaa
                    (b) kaaka
                    (c) kakka
                    (d) kaka
                    (e) other: (          )

When the rater chose (e) ‘other’, she was told to write down what she thought she heard. When
the rater judged that the participants pronounced the word correctly, one point was given for the
token; otherwise, no point was given. The pitch pattern was not measured because it was not a
focus in the production part and, because the items were pseudo words, learners would not have known what pattern to
use. The three raters coded each production individually, and the result on which at least two
raters agreed was used as the basis for the production score for the item. Interrater reliability was
checked using the Pearson correlation coefficient. There was a significant positive correlation
between Rater 1 and Rater 2 (r = .896, p = .001, R² = .80), between Rater 1 and Rater 3 (r = .895,
p = .001, R² = .80), as well as between Rater 2 and Rater 3 (r = .887, p = .001, R² = .79); the
correlation was strong. For all the items, there was an agreement from at least two raters;
therefore, there was no need to resolve any ambiguous items.
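
The scoring rule and the reliability figures above can be illustrated with a minimal sketch. It assumes that a token is scored by the label that at least two of the three raters chose and that the reported R² is simply the square of the Pearson r. The function names and the sample ratings are hypothetical and are not taken from the study materials.

```python
from collections import Counter

def majority_label(ratings):
    """Return the label chosen by at least two raters, or None if all differ."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= 2 else None

def shared_variance(r):
    """Coefficient of determination for a simple correlation (R^2 = r^2)."""
    return r ** 2

# Hypothetical ratings of one production of the target "kaakaa".
ratings = ["kaakaa", "kaakaa", "kaaka"]
target = "kaakaa"
score = 1 if majority_label(ratings) == target else 0

# Correlations reported in the text, squared and rounded to two decimals.
r_values = {"Rater 1-2": 0.896, "Rater 1-3": 0.895, "Rater 2-3": 0.887}
r_squared = {pair: round(shared_variance(r), 2) for pair, r in r_values.items()}
```

Under these assumptions the squared correlations round to the R² values reported above.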
The 16 tokens produced by learners were divided into four types depending on the
location of the long vowels: (1) CVV.CVV, contained long vowels in the first and second
syllables, (2) CVV.CV, contained long vowels in the first syllable, (3) CV.CVV contained long
vowels in the second syllable, and (4) CV.CV, contained no long vowels. Table 4 shows mean
scores of production accuracy sorted by the preceding consonant, vowel type, and token type,
obtained from 64 participants. The mean production accuracy was 70.38% (s.d. 16.96).

Table 4: Mean accuracy for the production accuracy in Experiment 1
Preceding Consonant /k/
Vowel /a/                    Vowel /u/
Item       Mean (s.d.)       Item       Mean (s.d.)
kaa.kaa    .83 (.38)         kuu.kuu    .72 (.45)
kaa.ka     .83 (.38)         kuu.ku     .94 (.24)
ka.kaa     .66 (.48)         ku.kuu     .53 (.50)
ka.ka      .73 (.45)         ku.ku      .67 (.47)

Preceding Consonant /s/
Vowel /a/                    Vowel /u/
Item       Mean (s.d.)       Item       Mean (s.d.)
saa.saa    .84 (.37)         suu.suu    .70 (.46)
saa.sa     .77 (.43)         suu.su     .72 (.45)
sa.saa     .66 (.48)         su.suu     .67 (.47)
sa.sa      .63 (.49)         su.su      .58 (.50)
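
As a rough illustration of how the item means in Table 4 relate to the vowel and token-type means discussed in the analysis, the sketch below averages the rounded item means by factor level. This is an unweighted average of rounded values, so it is only expected to approximate the reported means; the variable and function names are hypothetical.

```python
from statistics import mean

# Item means copied from Table 4, keyed by (consonant, vowel, token type).
item_means = {
    ("k", "a", "CVV.CVV"): .83, ("k", "a", "CVV.CV"): .83,
    ("k", "a", "CV.CVV"): .66, ("k", "a", "CV.CV"): .73,
    ("k", "u", "CVV.CVV"): .72, ("k", "u", "CVV.CV"): .94,
    ("k", "u", "CV.CVV"): .53, ("k", "u", "CV.CV"): .67,
    ("s", "a", "CVV.CVV"): .84, ("s", "a", "CVV.CV"): .77,
    ("s", "a", "CV.CVV"): .66, ("s", "a", "CV.CV"): .63,
    ("s", "u", "CVV.CVV"): .70, ("s", "u", "CVV.CV"): .72,
    ("s", "u", "CV.CVV"): .67, ("s", "u", "CV.CV"): .58,
}

def marginal_mean(position, level):
    """Average all item means whose key has `level` at index `position`."""
    return mean(v for key, v in item_means.items() if key[position] == level)

vowel_a = marginal_mean(1, "a")          # tokens with /a/
vowel_u = marginal_mean(1, "u")          # tokens with /u/
cvv_cv = marginal_mean(2, "CVV.CV")      # long vowel in the first syllable only
cv_cv = marginal_mean(2, "CV.CV")        # no long vowels
```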

Overall Results of the Perception Test: A total of 64 participants took a perception test in
Experiment 1. A total of 40 items in Appendix C were used, and perception accuracy and
latency were measured using E-Prime. For the perception accuracy, the participants’ choice was
coded either correct (one point) or wrong (zero). When a participant did not make a choice, no
point was given for the specific token. The perception reaction time (RT) was measured in
milliseconds using E-Prime.
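
The accuracy coding described above can be sketched as follows. This is only an illustration of the stated rule (one point for a correct choice, zero for a wrong choice or a missing response); the function names and the sample responses are hypothetical, and a missing response is represented here as None.

```python
def score_trial(response, correct):
    """Return 1 for a correct choice, 0 for a wrong choice or no response."""
    return 1 if response == correct else 0

def percent_correct(responses, answers):
    """Percentage of trials on which the response matched the answer."""
    points = sum(score_trial(r, a) for r, a in zip(responses, answers))
    return 100.0 * points / len(answers)

# Hypothetical responses to four trials; None marks a trial with no choice.
responses = ["kaakaa", None, "kaka", "kaka"]
answers = ["kaakaa", "kaaka", "kaka", "kakaa"]
accuracy = percent_correct(responses, answers)
```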
Originally, I planned to treat the participants’ L2 proficiency as a between-subject factor,
with the intention of dividing the participants into three groups using the results of Experiment 1
in order to examine how proficiency affected correct identification of vowel duration in Japanese.
However, the use of test scores to assess proficiency is arbitrary because it is not clear what
scores can indicate the proficiency level. In addition, there is no appropriate independent
measurement available. The Japanese Language Proficiency Test (JLPT) has a listening section;
however, it measures holistic skills in listening. Therefore, the measurement is not directly
related to the issue of vowel duration. Finally, the courses that the participants were enrolled in
were not valid estimates of their ability to identify vowel duration, as shown in Figure 9. Figure
9 and Table 5 show the distribution of accuracy scores according to the participants’ length of
time studying Japanese at the college level (i.e., 1st, 2nd, 3rd, and 4th year of the Japanese
classes). Even some 1st year learners obtained more than 90% identification accuracy, which
was equal to the accuracy of more advanced learners. Therefore, the data were collapsed into
one group and the analysis focused on the remaining within-subject variables.

Figure 9: Distribution of perception accuracy by course enrollment. (For interpretation of the
references to color in this and all other figures, the reader is referred to the electronic version of
this dissertation.)
[Bar chart; y-axis: Number of Participants (0 to 7); x-axis: Percent Correct (100, 90-, 80-, 70-,
60-, 50-, 40-, 30-, 20-, 10-); series: 1st year, 2nd year, 3rd year, 4th year]

Table 5: Distribution of perception accuracy by course enrollment in percentages
Courses     100%   90-      80-      70-      60-      50-      40-      30-      0-
                   99.99%   89.99%   79.99%   69.99%   59.99%   49.99%   39.99%   29.99%
1st year    1      2        4        5        4        5        2        1        ---
2nd year    ---    3        6        6        2        ---      ---      ---      ---
3rd year    ---    3        5        4        2        2        ---      ---      ---
4th year    ---    3        3        1        ---      ---      ---      ---      ---
Total       1      11       18       17       8        7        2        1        ---

Analysis of Production Data: A three-way ANOVA was used to test whether the
preceding consonant, type of vowel, or token type significantly affected accuracy in pronouncing
vowel duration in Japanese. Independent variables were preceding consonant (2; /k/ and /s/),
vowel type (2; /a/ and /u/), and token type (4: CVV.CVV, CVV.CV, CV.CVV, CV.CV). The
dependent variable was production accuracy. Results indicated significant main effects of vowel
type, FVowel(1, 63) = 5.063, p = .028, ηp² = .074, and token type, FType(3, 189) = 6.290,
p < .001, ηp² = .091; however, the main effect of preceding consonant only approached
significance, FPreC(1, 63) = 3.768, p = .057. Thus, vowel type had a significant influence on
production accuracy, whereas the preceding consonant did not reach the .05 alpha level. The mean
accuracy scores for the tokens with the vowel /a/ and /u/
were .74 (s.d. .21) and .69 (s.d. .19) respectively. Thus, it was easier for the learners to correctly
pronounce vowel duration when the vowel was /a/. In addition, it was found that token type had
a significant influence. In order to locate where the differences existed in the four token types,
pairwise comparisons were performed using the Bonferroni correction. The mean accuracy for
the CVV.CVV, CVV.CV, CV.CVV, and CV.CV is tabulated in Table 6 below.

Table 6: Mean production accuracy of the four tokens in Experiment 1
Token       Mean (s.d.)
CVV.CVV     .77 (.26)
CVV.CV      .81 (.24)
CV.CVV      .62 (.36)
CV.CV       .65 (.35)
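
For readers who wish to check the effect sizes reported with the F tests, the sketch below applies the standard identity that recovers partial eta squared from an F ratio and its degrees of freedom. It assumes the reported F values were not adjusted in a way that breaks this identity; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from F and its degrees of freedom:
    eta_p^2 = (F * df_effect) / (F * df_effect + df_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Main effects reported for the production data.
eta_vowel = round(partial_eta_squared(5.063, 1, 63), 3)
eta_token_type = round(partial_eta_squared(6.290, 3, 189), 3)
```

Small discrepancies in the third decimal place can arise from rounding of the reported F values.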

Results indicated that the (2) CVV.CV type was significantly different from the (3)
CV.CVV type (p = .005) as well as the (4) CV.CV type (p = .011). The mean scores of the (2)
CVV.CV were higher than those of the (3) CV.CVV as well as the (4) CV.CV. Therefore, it was
concluded that the (2) CVV.CV type in which the long vowel is in the first syllable was
comparable to CVV.CVV and easier to produce correctly than (3) CV.CVV and (4) CV.CV.
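
The logic of the Bonferroni correction for the four token types can be sketched as follows. This is a generic illustration, not the exact procedure of the statistical package used: it assumes all six pairwise comparisons were tested, and it is not stated in the text whether the reported p values are adjusted or unadjusted. The names below are hypothetical.

```python
from itertools import combinations

token_types = ["CVV.CVV", "CVV.CV", "CV.CVV", "CV.CV"]
pairs = list(combinations(token_types, 2))   # every unordered pair of types

alpha = 0.05
# Each comparison is tested against alpha divided by the number of comparisons.
per_comparison_alpha = alpha / len(pairs)

def bonferroni_adjust(p_value, n_comparisons):
    """Equivalent view: multiply the raw p value by the number of comparisons."""
    return min(1.0, p_value * n_comparisons)
```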
In addition to the main effects above, the Preceding Consonant x Token Type interaction
was significant, F(3, 189) = 4.002, p = .009, ηp² = .061. The results of simple effects tests
indicated that, for the CV.CV tokens (Type 4), accuracy differed significantly between the two
consonants /k/ and /s/, F(1, 63) = 7.87, p = .007, as shown in Figure 10. The CV.CV tokens with
the consonant /k/ (a stop) had higher accuracy than those with the consonant /s/ (a fricative).

Figure 10: Effects of consonant and token type on production accuracy in Experiment 1
[Line chart; y-axis: Percent Correct (50 to 100); x-axis: preceding consonant (k, s); series:
CVV.CVV, CVV.CV, CV.CVV, CV.CV]

An error analysis was then conducted on the production data. There were three cases in
which the participants did not produce anything. Excluding these errors, there were 298
incorrect productions and they are summarized in Table 7 below.

Table 7: Errors observed in the production data in Experiment 1
Token with /a/   Errors    Number    Token with /u/   Errors    Number
CaaCaa           CaaCa     15        CuuCuu           CuuCu     35
                 CaCaa     5                          CuCuu     2
                 CaCCaa    1

CaaCa            CaaCaa    17        CuuCu            CuuCuu    17
                 CaCa      3                          CuCu      4
                 CaCaa     1                          CuCuu     1
                 CaCCa     1
                 CaCCaa    1

CaCaa            CaaCaa    29        CuCuu            CuuCuu    35
                 CaCCaa    6                          CuCCuu    8
                 CaaCa     5                          CuuCu     3
                 CaCa      2                          CuCu      3

CaCa             CaCaa     21        CuCu             CuCuu     26
                 CaaCa     19                         CuuCu     16
                 CaaCaa    1                          CuuCuu    4
                 CaCCa     1                          CuCCu     1
                 CaCCaa    1                          CuCCuu    1

Note: C = consonant (/k/ or /s/)

As shown in the table above, the most common errors observed for the CVV.CVV tokens were
CVV.CV; a long vowel in the second syllable (see footnote 2) was shortened to a short vowel. For the CVV.CV
tokens, the most common error was CVV.CVV; a short vowel in the second syllable was
lengthened. For the CV.CVV tokens, the most common error was CVV.CVV; a short vowel in
the first syllable was lengthened. Finally, the two major errors for the CV.CV tokens were
CV.CVV and CVV.CV; one of the short vowels was lengthened. Based on this error analysis, it
was concluded that when a token contained two long vowels, the learner shortened the one on
the second syllable. On the other hand, when a token contained both short and long vowels, the
learner lengthened the short vowel. When a token contained two short vowels, the learner
lengthened the short vowel either on the first or second syllable.
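
The error patterns summarized above can be recovered by tallying the counts in Table 7. The sketch below is only an illustration with the table's counts typed in by hand; the dictionary layout and variable names are hypothetical.

```python
# Error counts copied from Table 7: target token -> {produced form: count}.
errors = {
    "CaaCaa": {"CaaCa": 15, "CaCaa": 5, "CaCCaa": 1},
    "CaaCa": {"CaaCaa": 17, "CaCa": 3, "CaCaa": 1, "CaCCa": 1, "CaCCaa": 1},
    "CaCaa": {"CaaCaa": 29, "CaCCaa": 6, "CaaCa": 5, "CaCa": 2},
    "CaCa": {"CaCaa": 21, "CaaCa": 19, "CaaCaa": 1, "CaCCa": 1, "CaCCaa": 1},
    "CuuCuu": {"CuuCu": 35, "CuCuu": 2},
    "CuuCu": {"CuuCuu": 17, "CuCu": 4, "CuCuu": 1},
    "CuCuu": {"CuuCuu": 35, "CuCCuu": 8, "CuuCu": 3, "CuCu": 3},
    "CuCu": {"CuCuu": 26, "CuuCu": 16, "CuuCuu": 4, "CuCCu": 1, "CuCCuu": 1},
}

# Most frequent erroneous form for each target token.
most_common_error = {
    target: max(forms, key=forms.get) for target, forms in errors.items()
}

# Number of tabulated errors per target token.
errors_per_target = {target: sum(forms.values()) for target, forms in errors.items()}
```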
In conclusion, in this section, factors affecting production accuracy of vowel duration
were examined. It was found that vowel and token type had significant main effects on
producing tokens containing vowel duration. In addition, the interaction between the preceding
consonant and token type was found; the CV.CV token with the consonant /k/ had higher
production accuracy than those with the consonant /s/.

Analysis of Factors Affecting Perception Accuracy: As possible factors that affected
identification accuracy of vowel duration, preceding consonant (2; /k/, /s/), vowel type (2; /a/,
/u/), and pitch patterns (10) were examined. As shown in Figure 3, not every token type occurs
in the language in conjunction with every possible pitch pattern. The overall mean score for
perception accuracy was 76.04% (s.d. 15.17). The mean identification accuracy for words with
the preceding consonants /k/ and /s/ was 76.02% (s.d. 17.89) and 77.73% (s.d. 14.64)
respectively; the mean identification accuracy for words with the vowels /a/ and /u/ was 78.52%
(s.d. 14.98) and 75.23% (s.d. 17.35) respectively. Then, the preceding consonant and vowel type
were combined. Mean scores for identification accuracy for /ka/, /ku/, /sa/, and /su/ were 76.72%
(s.d. 20.55), 75.31% (s.d. 18.17), 80.31% (s.d. 13.33), and 75.16% (s.d. 20.31) respectively
(Figure 11).

Footnote 2: In this dissertation, the word final position is represented as the second syllable in
order to make a contrast to the first syllable.

Figure 11: Mean perception accuracy by preceding consonant and vowel type
[Bar chart; y-axis: Percentages Correct (70 to 100); x-axis: vowel type (/a/, /u/); series: /k/, /s/.
Data table shown with the figure:]
         /a/      /u/
/k/      76.72    75.31
/s/      80.31    75.16

Descriptive statistics were then conducted on the responses to the 10 pitch patterns (1: LH.HH,
2: LH.HL, 3: HL.LL, 4: LH.H, 5: HL.L, 6: L.HH, 7: L.HL, 8: H.LL, 9: L.H, 10: H.L) assigned to
each combination of the consonants and vowels: /ka/, /sa/, /ku/, and /su/. Table 8 and Figure 12
show the descriptive statistics for perception accuracy by pitch pattern, preceding consonant, and
vowel type.

Table 8: Descriptive statistics for perception accuracy by pitch pattern, preceding consonant, and
vowel type
           Preceding Consonant /k/          Preceding Consonant /s/
Pitch      Vowel /a/       Vowel /u/        Vowel /a/       Vowel /u/
Pattern    Mean (s.d.)     Mean (s.d.)      Mean (s.d.)     Mean (s.d.)
1          .83 (.38)       .63 (.49)        .73 (.45)       .84 (.37)
2          .53 (.50)       .70 (.46)        .58 (.50)       .58 (.50)
3          .53 (.50)       .61 (.49)        .69 (.47)       .66 (.48)
4          .92 (.27)       .95 (.21)        .91 (.29)       .83 (.38)
5          .83 (.38)       .66 (.48)        .75 (.44)       .55 (.50)
6          .84 (.37)       .73 (.45)        .83 (.38)       .84 (.37)
7          .80 (.41)       .77 (.43)        .67 (.47)       .77 (.43)
8          .59 (.50)       .70 (.46)        .98 (.13)       .56 (.50)
9          .86 (.35)       .92 (.27)        .94 (.24)       .95 (.21)
10         .94 (.24)       .86 (.35)        .95 (.21)       .94 (.24)

Figure 12: Mean perception accuracy by pitch pattern, preceding consonant, and vowel type
[Line chart; y-axis: Accuracy (0 to 1); x-axis: pitch pattern, grouped as Group I (LH.HH, LH.HL,
HL.LL), Group II (LH.H, HL.L), Group III (L.HH, L.HL, H.LL), and Group IV (L.H, H.L); series:
/ka/, /ku/, /sa/, /su/]

Figure 12 above shows that perception accuracy for the L.H, H.L, and LH.H pitch was higher
than the other pitch patterns. The LH.H and HL.L patterns also had relatively higher perception
accuracy; the LH.HH, LH.HL, and HL.LL patterns had relatively lower perception accuracy.
In order to examine whether preceding consonant, vowel type, and pitch pattern
significantly affected the correct identification of vowel duration in Japanese, the 10 pitch
patterns were divided into 4 categories according to the location of the long vowels (i.e., first
and/or second syllables) as shown in Figure 12. Pitch patterns (1) LH.HH, (2) LH.HL, and (3)
HL.LL were categorized into Group I which contained two long vowels (CVV.CVV); pitch
patterns (4) LH.H and (5) HL.L were categorized into Group II which contained one long vowel
in the first syllable (CVV.CV); pitch patterns (6) L.HH, (7) L.HL, (8) H.LL were categorized
into Group III which contained one long vowel in the second syllable (CV.CVV); and pitch
patterns (9) L.H and (10) H.L were grouped into Group IV which did not contain any long
vowels and was used as baseline information.
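
The grouping just described follows directly from the notation: each pitch label carries one letter per mora, so a syllable written with two letters contains a long vowel. A minimal sketch of that mapping is given below; the function and variable names are hypothetical.

```python
def structure_from_pitch(pattern):
    """Map a pitch label such as 'LH.HL' to its syllable structure.
    A syllable with two pitch letters (two morae) has a long vowel (CVV)."""
    syllables = pattern.split(".")
    return ".".join("CVV" if len(s) == 2 else "CV" for s in syllables)

GROUP_BY_STRUCTURE = {"CVV.CVV": "I", "CVV.CV": "II", "CV.CVV": "III", "CV.CV": "IV"}

pitch_patterns = ["LH.HH", "LH.HL", "HL.LL", "LH.H", "HL.L",
                  "L.HH", "L.HL", "H.LL", "L.H", "H.L"]

# Collect the pitch patterns that fall into each of the four groups.
grouped = {}
for pattern in pitch_patterns:
    group = GROUP_BY_STRUCTURE[structure_from_pitch(pattern)]
    grouped.setdefault(group, []).append(pattern)
```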
A three-way ANOVA was used to test whether preceding consonant, vowel type, and/or
pitch pattern in Group I (CVV.CVV) significantly affected the correct identification of vowel
duration in Japanese. Independent variables were preceding consonant (2; /k/ and /s/), vowel type
(2; /a/ and /u/), and pitch pattern (3: LH.HH, LH.HL, HL.LL). The dependent variable was
perception accuracy. Results indicated a significant main effect of pitch pattern, FPitch(2, 126) =
10.866, p < .001, ηp² = .147; however, preceding consonant, FPreC(1, 63) = 1.726, p = .194, and
vowel type, FVowel(1, 63) = .578, p = .450, were not significant.
It was found that neither the type of vowel nor the preceding consonant affected
identification of the vowel duration of the tokens in Group I. However, the pitch patterns
affected the correct identification. In order to locate where the differences existed among the
three pitch patterns, pairwise comparisons were performed using the Bonferroni correction.
Results indicated that (1) LH.HH was significantly different from (2) LH.HL (p < .001) and (3)
HL.LL (p = .001). The pitch pattern (1) LH.HH, a mean of .76, had significantly higher
accuracy than (2) LH.HL, a mean of .60 and (3) HL.LL, a mean of .62. Then, these three pitch
patterns were compared. The pitch patterns (1) and (2) shared the same pitch on the first syllable
and only had a difference in the pitch pattern on the second syllable, HH and HL respectively.
Yet, the pitch patterns (1) and (3) did not share any similarity. The pattern (1) started with low
pitch and kept high pitch after the second mora; the pattern (3) had the opposite pattern. The
comparison between (1) and (2) suggested that the differences in the pitch pattern on the second
syllable were the key.
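
The pitch-pattern means cited in this comparison can be approximated from the cell means in Table 8. The sketch below averages the four consonant-vowel cells for each Group I pattern; because it uses rounded cell means, it is only expected to agree with the reported values to about two decimals. The names are hypothetical.

```python
from statistics import mean

# Cell means for the Group I (CVV.CVV) pitch patterns, copied from Table 8.
group_one = {
    "LH.HH": {"ka": .83, "ku": .63, "sa": .73, "su": .84},
    "LH.HL": {"ka": .53, "ku": .70, "sa": .58, "su": .58},
    "HL.LL": {"ka": .53, "ku": .61, "sa": .69, "su": .66},
}

# Unweighted mean accuracy for each pitch pattern.
pattern_means = {p: mean(cells.values()) for p, cells in group_one.items()}

# Pattern with the highest mean accuracy.
easiest = max(pattern_means, key=pattern_means.get)
```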
In addition to the main effect of pitch pattern, the results indicated that the Preceding
Consonant x Vowel Type x Pitch Pattern interaction was significant, F(2, 126) = 7.322, p = .001,
ηp² = .104. The results of simple effects tests revealed that perception accuracy of /ka/ was
higher than /ku/ with the LH.HH pitch while that of /ku/ was higher than /ka/ with the LH.HL
pitch.
Error analysis was conducted on the responses to the CVV.CVV tokens. Table 9 below
shows the four choices used in the identification task for the tokens with the CVV.CVV structure.
The order of presentation of the four choices was randomized; therefore, each token had a
different order of options. Among the four choices, (A) was the correct response; (B) was
different from (A) in terms of the vowel length of the second syllable, (C) was different because
the first syllable contains a geminate, instead of a long vowel, and (D) was different in terms of
the vowel length of the first syllable. Previous literature (e.g., Motohashi-Saigo & Hardison,
2009) found that L2 learners misperceived long vowels as geminates. Also, the token with the
CV.CV structure was considered too different from the CVV.CVV structure; therefore,
geminates were included as one of the choices.

Table 9: An example of choices used in the identification task for CVV.CVV tokens
Choices in the
Identification Task    Examples    Selection would indicate:
A. CVV.CVV             kaa.kaa     -correct
B. CVV.CV              kaa.ka      -misperception of long vowel in second syllable
C. CVC.CV              kak.ka      -misperception of long vowel in second syllable
                                   -misperception of long vowel in first syllable as
                                    geminate
D. CV.CVV              ka.kaa      -misperception of long vowel in first syllable

Note: Syllable boundaries were not marked in the experiment.

Figure 13 shows the number of errors that the participants made for the tokens with the (1)
LH.HH pitch pattern. There were 66 errors in total; approximately 65.15% of the errors were
observed for the choice CVV.CV (misperception of long vowels in second syllable).

Figure 13: Errors for the tokens CVV.CVV with the (1) LH.HH pitch pattern
[Bar chart; y-axis: Number of Errors (0 to 25). Data table shown with the figure:]
       CVV.CVV    CVC.CV    CVV.CV    CV.CVV    No Answer
ka     0          1         8         5         2
ku     0          3         21        0         0
sa     0          1         8         5         2
su     0          2         6         0         2
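
The error total and the percentage quoted for the LH.HH pattern can be checked against the counts shown with Figure 13. The sketch below is a simple tally with the counts entered by hand; the variable names are hypothetical.

```python
# Error counts for the LH.HH tokens, copied from the data shown with Figure 13.
errors_lh_hh = {
    "ka": {"CVC.CV": 1, "CVV.CV": 8, "CV.CVV": 5, "No Answer": 2},
    "ku": {"CVC.CV": 3, "CVV.CV": 21, "CV.CVV": 0, "No Answer": 0},
    "sa": {"CVC.CV": 1, "CVV.CV": 8, "CV.CVV": 5, "No Answer": 2},
    "su": {"CVC.CV": 2, "CVV.CV": 6, "CV.CVV": 0, "No Answer": 2},
}

# Total errors across all four consonant-vowel combinations.
total_errors = sum(sum(row.values()) for row in errors_lh_hh.values())

# Errors in which the second long vowel was heard as short (choice CVV.CV).
cvv_cv_errors = sum(row["CVV.CV"] for row in errors_lh_hh.values())

share_cvv_cv = 100.0 * cvv_cv_errors / total_errors
```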

Next, Figure 14 shows the number of errors that the participants made for the tokens with the (2)
LH.HL pitch pattern. There were 102 errors in total; approximately 88.25% of the errors were
again observed for the choice CVV.CV (misperception of long vowel in second syllable).

Figure 14: Errors for the tokens CVV.CVV with the (2) LH.HL pitch pattern
[Bar chart; y-axis: Number of Errors (0 to 30). Data table shown with the figure:]
       CVV.CVV    CVC.CV    CVV.CV    CV.CVV    No Answer
ka     0          2         26        1         0
ku     0          2         15        2         0
sa     0          0         25        2         0
su     0          4         23        0         0

Finally, Figure 15 shows the number of errors that the participants made for the tokens with the
(3) HL.LL pitch pattern. There were 97 errors in total; approximately 71.13% of the errors were
observed for the choice CVV.CV, similar to the other two patterns.

Figure 15: Errors for the tokens CVV.CVV with the (3) HL.LL pitch pattern
[Bar chart; y-axis: Number of Errors (0 to 25). Data table shown with the figure:]
       CVV.CVV    CVC.CV    CVV.CV    CV.CVV    No Answer
ka     0          6         17        7         0
ku     0          0         20        4         0
sa     0          2         15        1         2
su     0          2         17        4         0

The error analysis revealed that the learners had a tendency to incorrectly perceive the
CVV.CVV tokens as CVV.CV. This error pattern suggested that a long vowel in the second
syllable was perceived as a short vowel. In addition, there were more errors observed for the
vowel /u/ compared to the vowel /a/. The simple effects analysis of the interaction also
suggested that the vowel /a/ had higher accuracy than the vowel /u/ for the three pitch patterns in
this group.
Next, a three-way ANOVA was used to test whether the preceding consonant, vowel type,
and/or pitch pattern in Group II (CVV.CV) significantly affected the correct identification of
vowel duration in Japanese. Independent variables were preceding consonant (2; /k/ and /s/),
vowel type (2; /a/ and /u/), and pitch pattern (2: LH.H, HL.L). The dependent variable was
perception accuracy. Results indicated significant main effects of preceding consonant, FPreC(1,
63) = 7.471, p = .008, ηp² = .106, pitch pattern, FPitch(1, 63) = 28.474, p < .001, ηp² = .311, and
vowel type, FVowel(1, 63) = 10.938, p = .002, ηp² = .148. Based on the results, the type of
vowel, preceding consonant, and pitch pattern affected the identification of L2 vowel duration of
the tokens in Group II. The mean accuracy scores of the tokens with the vowel /a/ and /u/
were .85 and .75 respectively. Therefore, it was easier for the learners to identify vowel duration
when the vowel was /a/, compared to /u/. The mean accuracy scores of the tokens with the
consonant /k/ and /s/ were .84 and .76 respectively. Thus, it was easier for the learners to
identify vowel duration when the preceding consonant was /k/, compared to /s/. The mean
accuracy scores of the tokens with the pitch pattern (4) LH.H and (5) HL.L were .90 and .70
respectively; therefore, the pattern (4) was easier than (5). Similar to the previous comparison
between (1) LH.HH and (3) HL.LL, (4) LH.H and (5) HL.L did not share any similarity; the two
tokens were very distinct.
In addition to the significant main effects above, the Vowel Type x Pitch Pattern
interaction was significant, F(1, 63) = 8.663, p = .005, ηp² = .121. Results of simple effects tests
revealed that accuracy for the vowel /u/ was significantly lower in the pitch pattern HL.L as
shown in Figure 16.

Figure 16: Effects of vowel type and pitch pattern in Group II on perception accuracy
[Line chart; y-axis: Accuracy (0.5 to 1); x-axis: vowel type (a, u); series: LH.H, HL.L.
Data table shown with the figure:]
       LH.H    HL.L
a      0.91    0.79
u      0.89    0.6

Error analysis was conducted on the responses to the CVV.CV tokens. Table 10 below
shows the four choices used in the identification task for the tokens with the CVV.CV structure.
Among the four choices, (B) was the correct response; (A) was different from (B) in terms of the
vowel length in the second syllable, (C) was different because the first syllable contains a
geminate, instead of a long vowel, and (D) was different in terms of the vowel length of the first
syllable.

Table 10: An example of choices used in the identification task for CVV.CV tokens
Choices in the
Identification Task    Examples    Selection would indicate:
A. CVV.CVV             kaa.kaa     -misperception of vowel length in second syllable
B. CVV.CV              kaa.ka      -correct
C. CVC.CV              kak.ka      -misperception of long vowel as geminate
D. CV.CV               ka.ka       -misperception of vowel length in first syllable

Note: Syllable boundaries were not marked in the experiment.

Figure 17 shows the number of errors that the participants made for the tokens with the (4) LH.H
pitch pattern. There were 26 errors in total; approximately 61.54% of the errors were observed
for the choice CVV.CVV (misperception of vowel length in the second syllable).

Figure 17: Errors for the tokens CVV.CV with the (4) LH.H pitch pattern
[Bar chart; y-axis: Number of Errors (0 to 9). Data table shown with the figure:]
       CVV.CVV    CVV.CV    CVC.CV    CV.CV    No Answer
ka     4          0         2         1        0
ku     1          0         2         0        0
sa     3          0         2         0        1
su     8          0         0         2        0

Second, Figure 18 shows the number of errors that the participants made for the tokens with the
(5) HL.L pitch pattern. There were 78 errors in total; the majority, approximately 55.12% of the
errors, were observed for the choice CVV.CVV, similar to the errors for the pitch pattern LH.H.

Figure 18: Errors for the tokens CVV.CV with the (5) HL.L pitch pattern
[Bar chart; y-axis: Number of Errors (0 to 25). Data table shown with the figure:]
       CVV.CVV    CVV.CV    CVC.CV    CV.CV    No Answer
ka     2          0         7         2        0
ku     14         0         4         4        0
sa     5          0         3         8        0
su     22         0         5         2        0

The error analysis also indicates that, for the HL.L pitch pattern shown in Figure 18, more
CVV.CVV errors occurred for the tokens with the vowel /u/ (i.e., a total of 36) than for those with
/a/ (i.e., a total of 7). In addition, more CVV.CVV errors were observed for the tokens with HL.L
pitch (i.e., a total of 43), as shown in Figure 18, than for those with LH.H pitch (i.e., a total of
16), as shown in Figure 17.
Next, a three-way ANOVA was used to test whether preceding consonant, vowel type,
and/or pitch pattern in Group III (CV.CVV) significantly affected the correct identification of
vowel duration in Japanese. Independent variables were preceding consonant (2; /k/ and /s/),
vowel type (2; /a/ and /u/), and pitch pattern (3: L.HH, L.HL, H.LL). The dependent variable
was perception accuracy. Results indicated significant main effects of vowel type, FVowel(1, 63)
= 5.154, p = .027, ηp² = .076, and pitch pattern, FPitch(2, 126) = 5.586, p = .005, ηp² = .081;
however, preceding consonant was not significant, FPreC(1, 63) = 1.595, p = .211. Based on the
findings, the type of vowel and pitch pattern affected the identification of L2 vowel duration of
the tokens in Group III. The mean accuracy scores of the tokens with the vowel /a/ and /u/
were .79 and .73 respectively. Therefore, it was easier for the learners to identify vowel duration
when the vowel was /a/, compared to /u/. In order to locate where the differences existed in the
three pitch patterns, pairwise comparisons were performed using the Bonferroni correction.
Results indicated that (6) L.HH was significantly different from (8) H.LL (p < .01). The pitch
pattern (6) L.HH was significantly easier than (8) H.LL. The two pitch patterns were very
distinct; (6) L.HH starts with low pitch and remains high after the second mora; (8) H.LL is the
opposite pattern.
In addition to the main effects, the following interactions were significant: the Vowel
Type x Pitch Pattern interaction, F(2, 126) = 4.759, p = .01, ηp² = .070, the Preceding Consonant
x Pitch Pattern interaction, F(2, 126) = 3.759, p = .026, ηp² = .056, and the Preceding Consonant
x Vowel Type x Pitch Pattern interaction, F(2, 126) = 18.990, p < .001, ηp² = .232. In order to
analyze the three-way interaction, a simple effects test was conducted. Based on the results, it
was found that the L.HL pitch pattern had higher accuracy with the vowel and consonant
combination of /ka/ than with the other consonant-vowel combinations, /ku/, /sa/, and /su/.
Error analysis was conducted on the responses to the CV.CVV tokens. Table 11 below
shows the four choices used in the identification task for the tokens with the CV.CVV structure.
Among these choices, (C) was the correct response; (A) was different from (C) in terms of the
vowel length in the first syllable, (B) was different in terms of the vowel length in both the first
and second syllable, and (D) was different in terms of the vowel length on the second syllable.

Table 11: An example of choices used in the identification task for CV.CVV tokens
Choices in the
Identification Task    Examples    Selection would indicate:
A. CVV.CVV             kaa.kaa     -misperception of vowel length in first syllable
B. CVV.CV              kaa.ka      -misperception of vowel length in both syllables
C. CV.CVV              ka.kaa      -correct
D. CV.CV               ka.ka       -misperception of vowel length in second syllable

Note: Syllable boundaries were not marked in the experiment.

Figure 19 shows the number of errors that the participants made for the CV.CVV tokens with the
(6) L.HH pitch pattern. There were 44 errors in total; approximately 47.73% of the errors were
observed for the choice CV.CV; approximately 27.27% of the errors were observed for
CVV.CVV; and approximately 22.72% of the errors were observed for CVV.CV. The largest
share of errors was observed for the choice CV.CV (misperception of vowel length in the second syllable).

Figure 19: Errors for the CV.CVV tokens with the (6) L.HH pitch pattern
[Bar chart; y-axis: Number of Errors (0 to 8). Data table shown with the figure:]
       CVV.CVV    CVV.CV    CV.CVV    CV.CV    No Answer
ka     1          0         0         6        0
ku     3          6         0         7        0
sa     6          2         0         2        1
su     2          2         0         6        0

Next, Figure 20 shows the number of errors that the participants made for the CV.CVV tokens
with the (7) L.HL pitch pattern. There were 63 errors in total; approximately 57.14% of the
errors were observed for the choice CV.CV (misperception of the vowel length in the second
syllable).

Figure 20: Errors for the CV.CVV tokens with the (7) L.HL pitch pattern
[Bar chart; y-axis: Number of Errors (0 to 16). Data table shown with the figure:]
       CVV.CVV    CVV.CV    CV.CVV    CV.CV    No Answer
ka     4          4         0         4        1
ku     2          2         0         12       0
sa     1          3         0         15       0
su     5          5         0         5        0

Finally, Figure 21 shows the number of errors that the participants made for the CV.CVV tokens
with the (8) H.LL pitch pattern. There were 75 errors in total; approximately 74.67% of the
errors were observed for the choice CV.CV, similar to the other two patterns.

Figure 21: Errors for the CV.CVV tokens with the (8) H.LL pitch pattern
[Bar chart; y-axis: Number of Errors (0 to 25). Data table shown with the figure:]
       CVV.CVV    CVV.CV    CV.CVV    CV.CV    No Answer
ka     3          4         0         19       1
ku     1          2         0         15       0
sa     0          1         0         1        0
su     3          4         0         21       0

The error analysis revealed that, when the learners misidentified the CV.CVV tokens, they most
often perceived them as CV.CV. In other words, the learners misperceived a long vowel on the
second syllable as a short vowel.
Finally, a three-way ANOVA was used to test whether preceding consonant, vowel type,
and/or pitch pattern in Group IV (CV.CV) significantly affected the correct identification of
vowel duration in Japanese. Independent variables were preceding consonant (2; /k/ and /s/),
vowel type (2; /a/ and /u/), and pitch pattern (2: L.H, H.L). The dependent variable was
perception accuracy. Results indicated a significant main effect of preceding consonant, FPreC(1,
63) = 9.061, p = .004, ηp² = .126; however, vowel type, FVowel(1, 63) = .047, p = .829, and pitch
pattern, FPitch(1, 63) = .034, p = .854, were not significant. Based on the findings, the preceding
consonant affected the identification of L2 vowel duration of the tokens in Group IV (CV.CV).
The mean accuracy scores of the tokens with the consonant /k/ and /s/ were .90 and .95
respectively. Therefore, it was easier for the learners to identify vowel duration when the
preceding consonant was /s/, compared to /k/.
In addition to the main effect, the Vowel Type x Pitch Pattern interaction was significant,
F(1, 63) = 5.154, p = .027, ηp² = .076 (Figure 22). Simple effects tests were conducted, and the
results revealed that accuracy for the L.H pitch pattern was higher with the vowel /u/ than with
the vowel /a/. The figure suggests that the biggest difference is the greater accuracy for /a/ with
H.L than with L.H.
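
As a cross-check on the reported effect sizes, partial eta squared can be recovered from an F ratio and its degrees of freedom through the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below is a reader's check and not part of the original analysis; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Values reported for Group IV (CV.CV) perception accuracy.
main_effect = partial_eta_squared(9.061, 1, 63)   # preceding consonant
interaction = partial_eta_squared(5.154, 1, 63)   # Vowel Type x Pitch Pattern

print(f"{main_effect:.3f} {interaction:.3f}")
```

Both values round to the effect sizes reported in the text (.126 and .076).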

Figure 22: Effects of vowel type and pitch pattern in Group IV in Experiment 1 (mean accuracy)

Vowel   L.H    H.L
/a/     0.90   0.95
/u/     0.94   0.90

Error analysis was conducted on the responses to the CV.CV tokens. Table 12 below
shows the four choices used in the identification task for the CV.CV tokens.
Among the four choices, (D) was the correct response; (A) differed from (D) in the
vowel length of the first syllable, (B) differed because of the geminate, and (C)
differed in the vowel length of the second syllable.

Table 12: An example of choices used in the identification task for CV.CV tokens

Choices in the
Identification Task   Examples   Selection would indicate:
A. CVV.CV             kaa.ka     misperception of vowel length in first syllable
B. CVC.CV             kak.ka     misperception as a geminate
C. CV.CVV             ka.kaa     misperception of vowel length in second syllable
D. CV.CV              ka.ka      correct

Note: Syllable boundaries were not marked in the experiment.

Figure 23 shows the number of errors that the participants made for the CV.CV tokens with the
(9) L.H pitch pattern. There were 20 errors in total; approximately 80% of the errors were
observed for the choice CVC.CV (misperception as a geminate).

Figure 23: Errors for the CV.CV tokens with the (9) L.H pitch pattern (number of errors by response choice)

Token   CVV.CV   CVC.CV   CV.CVV   CV.CV   No Answer
ka      1        7        0        0       0
ku      1        4        1        0       0
sa      0        4        0        0       0
su      1        1        0        0       0

Figure 24 shows the number of errors that the participants made for the CV.CV tokens with the
(10) H.L pitch pattern. There were 20 errors in total; approximately 75% of the errors were
observed for the choice CVC.CV, following the data for the L.H pattern.

Figure 24: Errors for the CV.CV tokens with the (10) H.L pitch pattern (number of errors by response choice)

Token   CVV.CV   CVC.CV   CV.CVV   CV.CV   No Answer
ka      1        3        0        0       0
ku      1        7        1        0       0
sa      1        1        0        0       1
su      0        4        0        0       0

The error analysis revealed that most of the errors involved perceiving the CV.CV
tokens as CVC.CV. This error pattern suggests that the learners perceived extra duration in the
first syllable and attributed it to a geminate. For the L.H pitch pattern, there were more errors on
the tokens with the vowel /a/; however, for the H.L pitch pattern, there were more errors on the
tokens with the vowel /u/. Consistent with this, the simple effects tests indicated that accuracy
for the L.H pattern was higher with the vowel /u/ than with the vowel /a/.

Analysis of Factors Affecting Perception RT: As possible factors that could affect the perception
RT, preceding consonant (2; stop /k/, fricative /s/), vowel type (2; /a/, /u/), and pitch pattern (10
patterns) were examined. The overall mean of perception RT was 2652.42 milliseconds (s.d.
429.11). The mean identification RT for words with the preceding consonant /k/ and /s/ was
2662.76 milliseconds (s.d. 437.85) and 2573.95 milliseconds (s.d. 441.45) respectively. The
mean identification RT for stimuli with the vowel /a/ and /u/ was 2568.20 milliseconds
(s.d. 387.41) and 2668.51 milliseconds (s.d. 482.92) respectively. The mean RT for the
CV combinations /ka/, /ku/, /sa/, and /su/ was 2626.36 milliseconds (s.d. 482.01), 2699.16
milliseconds (s.d. 484.04), 2510.04 milliseconds (s.d. 404.78), and 2637.85 milliseconds (s.d.
551.65) respectively, as shown in Figure 25.

Figure 25: Mean perception RTs by preceding consonant and vowel type (response latency in milliseconds)

Consonant   /a/       /u/
/k/         2626.36   2699.16
/s/         2510.04   2637.85

Then, 10 pitch patterns (1: LH.HH, 2: LH.HL, 3: HL.LL, 4: LH.H, 5: HL.L, 6: L.HH, 7: L.HL,
8: H.LL, 9: L.H, 10: H.L) were assigned to each combination of consonant and vowel: /ka/, /ku/,
/sa/, and /su/. Table 13 and Figure 26 show the descriptive statistics for the perception RT
by pitch pattern, preceding consonant, and vowel type.

Table 13: Descriptive statistics for perception RT by pitch pattern and CV combination (in
milliseconds)

               Preceding Consonant /k/                   Preceding Consonant /s/
               Vowel /a/            Vowel /u/            Vowel /a/            Vowel /u/
Pitch pattern  Mean      (s.d.)     Mean      (s.d.)     Mean      (s.d.)     Mean      (s.d.)
LH.HH          2594.05   (1105.82)  2608.91   (877.32)   2562.36   (1318.76)  2936.28   (1363.07)
LH.HL          3066.11   (1000.43)  2900.61   (956.43)   3006.17   (1013.56)  3130.66   (1075.40)
HL.LL          3084.25   (1129.87)  3007.59   (1040.13)  2958.61   (1262.16)  2776.81   (1152.75)
LH.H           2581.39   (703.08)   2602.78   (1010.40)  2712.05   (916.47)   2669.13   (782.26)
HL.L           2956.86   (661.15)   3230.97   (993.23)   2776.03   (1050.64)  2884.75   (1020.95)
L.HH           2909.63   (963.02)   2928.56   (965.29)   2966.53   (1004.49)  2567.23   (877.07)
L.HL           2405.42   (956.37)   2772.88   (784.10)   2757.14   (920.75)   3033.09   (1143.37)
H.LL           2742.77   (1137.97)  2648.70   (879.31)   1537.41   (603.29)   2843.41   (1039.80)
L.H            2067.11   (834.36)   1981.69   (692.84)   1789.23   (748.91)   1804.94   (779.67)
H.L            1855.98   (903.65)   2308.94   (979.32)   2034.88   (749.48)   1772.22   (858.63)

Figure 26: Mean RTs by pitch pattern, consonant, and vowel combination (in milliseconds)
(Chart of mean RT, 0 to 3500 milliseconds, for /ka/, /ku/, /sa/, and /su/ across Group I: LH.HH,
LH.HL, HL.LL; Group II: LH.H, HL.L; Group III: L.HH, L.HL, H.LL; and Group IV: L.H, H.L.)

As Figure 26 shows, the RT was shortest for the pitch patterns L.H and H.L (CV.CV). Also, when
the token ended in a high pitch, as in LH.HH and LH.H, the RT tended to be shorter than for
other patterns such as LH.HL and HL.L, respectively.
Next, it was examined whether preceding consonant, vowel type, and/or pitch pattern
significantly affected the response latency for the identification of vowel duration in Japanese.
In order to examine the effects in detail, the 10 pitch patterns were divided into four categories
(Groups I, II, III, and IV), as indicated in Figure 26, according to the location of the long vowels,
as described earlier.
A three-way ANOVA was used to test whether preceding consonant, vowel type, and/or
pitch pattern in Group I (CVV.CVV) significantly affected the RT in identifying vowel duration
in Japanese. Independent variables were preceding consonant (2; /k/ and /s/), vowel type (2; /a/
and /u/), and pitch pattern (3; LH.HH, LH.HL, HL.LL). The dependent variable was perception
RT. Results indicated a significant main effect of pitch pattern, FPitch(2, 126) = 7.884, p = .001,
ηp² = .111; however, vowel type, FVowel(1, 63) = .046, p = .810, and preceding consonant,
FPreC(1, 63) = .058, p = .831, were not significant. None of the interactions was significant.
It was found that the three pitch patterns significantly affected the RT to perceive vowel
length. In order to locate where the differences existed among the three pitch patterns, pairwise
comparisons were performed using the Bonferroni correction. Results indicated that (1) LH.HH
was significantly different from (2) LH.HL (p < .01) as well as (3) HL.LL (p < .01). The mean
RT for LH.HH was 2675.40 milliseconds, for LH.HL was 3025.89 milliseconds, and for HL.LL
was 2956.82 milliseconds. Thus, the learners identified the vowel duration for the LH.HH pitch
pattern more quickly than for the other two patterns.
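
The pattern means reported for Group I can be recovered from the cell means in Table 13. The sketch below is illustrative only: it assumes that each CV cell contributes the same number of observations, so that a simple average of the four cell means equals the pattern mean, and its variable names are hypothetical.

```python
from statistics import mean

# Cell means (ms) transcribed from Table 13, in the order /ka/, /ku/, /sa/, /su/.
GROUP_I_CELL_MEANS = {
    "LH.HH": [2594.05, 2608.91, 2562.36, 2936.28],
    "LH.HL": [3066.11, 2900.61, 3006.17, 3130.66],
    "HL.LL": [3084.25, 3007.59, 2958.61, 2776.81],
}

# Pattern means (ms) as reported in the text.
REPORTED_MEANS = {"LH.HH": 2675.40, "LH.HL": 3025.89, "HL.LL": 2956.82}

# Assumption: equal cell sizes, so the unweighted average is appropriate.
pattern_means = {p: mean(cells) for p, cells in GROUP_I_CELL_MEANS.items()}
fastest = min(pattern_means, key=pattern_means.get)

for pattern, value in pattern_means.items():
    agrees = abs(value - REPORTED_MEANS[pattern]) < 0.01
    print(f"{pattern}: {value:.2f} ms (agrees with text: {agrees})")
print("fastest pattern:", fastest)
```

Under this assumption, the averaged cell means agree with the reported pattern means to within rounding, and LH.HH is the fastest of the three patterns.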


Next, a three-way ANOVA was used to test whether preceding consonant, vowel type,
and/or pitch pattern in Group II (CVV.CV) significantly affected the RT in identifying vowel
duration in Japanese. Independent variables were preceding consonant (2; /k/ and /s/), vowel
type (2; /a/ and /u/), and pitch pattern (2; LH.H, HL.L). The dependent variable was perception
RT. Results indicated a significant main effect of pitch pattern, FPitch(1, 63) = 16.853, p < .001,
ηp² = .211; however, preceding consonant, FPreC(1, 63) = 1.409, p = .240, and vowel type,
FVowel(1, 63) = 1.098, p = .299, were not significant. Thus, pitch pattern
significantly affected the RT to perceive vowel length. The mean RT for (4) LH.H was 2641.34
milliseconds and that for (5) HL.L was 2952.15 milliseconds. Therefore, the learners identified
the vowel duration for the tokens with the LH.H pattern faster than for those with the HL.L pattern.
In addition to the main effect, the Preceding Consonant x Pitch Pattern interaction was
significant, F(1, 63) = 7.259, p = .009, ηp² = .103. Simple effects tests were conducted, and as
shown in Figure 27, the results suggest that the RTs for /s/ vs. /k/ showed a greater difference with
HL.L than with LH.H.
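
The direction of this interaction can be illustrated with the cell means plotted in Figure 27. The following sketch is a descriptive check only, not a simple effects test; the names are hypothetical and the values are transcribed from the figure.

```python
# Mean RTs (ms) from Figure 27, keyed by (preceding consonant, pitch pattern).
FIGURE_27_MEAN_RT = {
    ("k", "LH.H"): 2592.09,
    ("k", "HL.L"): 3039.92,
    ("s", "LH.H"): 2690.59,
    ("s", "HL.L"): 2810.39,
}


def consonant_gap(pitch):
    """Mean RT for /k/ minus mean RT for /s/ under one pitch pattern (ms)."""
    return FIGURE_27_MEAN_RT[("k", pitch)] - FIGURE_27_MEAN_RT[("s", pitch)]


for pitch in ("LH.H", "HL.L"):
    print(f"{pitch}: {consonant_gap(pitch):+.2f} ms")
```

The gap between /k/ and /s/ is about -98.50 milliseconds for LH.H and about +229.53 milliseconds for HL.L, so the consonant difference is larger, and reversed in sign, for HL.L.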

Figure 27: Effects of preceding consonant and pitch pattern in Group II on RT in Experiment 1
(mean RT in milliseconds)

Consonant   LH.H      HL.L
/k/         2592.09   3039.92
/s/         2690.59   2810.39

Third, a three-way ANOVA was used to test whether preceding consonant, vowel type,
and/or pitch pattern in Group III (CV.CVV) significantly affected the RT in identifying vowel
duration in Japanese. Independent variables were preceding consonant (2; /k/ and /s/), vowel
type (2; /a/ and /u/), and pitch pattern (3; L.HH, L.HL, H.LL). The dependent variable was
perception RT. Results indicated significant main effects of pitch pattern, FPitch(2, 126) =
12.120, p < .001, ηp² = .161, and vowel type, FVowel(1, 63) = 17.403, p < .001, ηp² = .216;
however, preceding consonant was not significant, FPreC(1, 63) = 3.025, p = .078. It was found
that the type of vowel significantly affected the response speed for L2 vowel duration. The mean
RTs for the tokens with the vowels /a/ and /u/ were 2553.15 milliseconds and 2798.98
milliseconds, respectively. Therefore, the L2 learners identified the vowel duration for the tokens
with the vowel /a/ significantly faster than for those with the vowel /u/. It was also found that the
three pitch patterns significantly affected the RT to perceive vowel length. In order to locate
where the differences existed among the three pitch patterns, pairwise comparisons were performed
using the Bonferroni correction. Results indicated that (8) H.LL was significantly different from
(6) L.HH (p < .001) as well as (7) L.HL (p < .001). The mean RT for (8) H.LL was 2443.07
milliseconds, which was faster than that for (6) L.HH (2842.99 milliseconds) and (7) L.HL (2742.13
milliseconds). Thus, the L2 learners responded to the tokens with the H.LL pitch pattern faster
than to those with the other two pitch patterns.
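
For readers unfamiliar with the Bonferroni correction used in these pairwise comparisons, the sketch below shows the usual form of the adjustment: each uncorrected p value is multiplied by the number of comparisons and capped at 1. The uncorrected p values in the example are hypothetical and chosen only for illustration; they are not taken from the study, and the function name is likewise an assumption.

```python
from itertools import combinations


def bonferroni_adjust(raw_p_values):
    """Multiply each p value by the number of comparisons, capped at 1.0."""
    m = len(raw_p_values)
    return {pair: min(1.0, p * m) for pair, p in raw_p_values.items()}


patterns = ["L.HH", "L.HL", "H.LL"]
pairs = list(combinations(patterns, 2))  # three pairwise comparisons

# HYPOTHETICAL uncorrected p values, for illustration only.
raw = dict(zip(pairs, [0.40, 0.0001, 0.0002]))

adjusted = bonferroni_adjust(raw)
significant = [pair for pair, p in adjusted.items() if p < 0.05]
print(significant)
```

With three pitch patterns there are three comparisons; with the six stimulus types compared later in Experiment 2 there would be 15, so the adjustment becomes correspondingly stricter.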
In addition to the main effects, all of the interactions were significant: Vowel Type x
Pitch Pattern, F(2, 126) = 20.297, p < .001, ηp² = .244; Preceding Consonant x Vowel Type,
F(1, 63) = 4.391, p = .042, ηp² = .064; Preceding Consonant x Pitch Pattern, F(2, 126) =
20.587, p < .001, ηp² = .246; and Preceding Consonant x Vowel Type x Pitch Pattern, F(2,
126) = 26.166, p < .001, ηp² = .293. In order to analyze the three-way interaction in detail,
simple effects tests were conducted. They showed that the tokens with the L.HL pitch
pattern had a faster RT with /ka/ than with /ku/.
Finally, a three-way ANOVA was used to test whether preceding consonant, vowel type,
and/or pitch pattern in Group IV (CV.CV) significantly affected the RT in identifying vowel
duration in Japanese. Independent variables were preceding consonant (2; /k/ and /s/), vowel
type (2; /a/ and /u/), and pitch pattern (2; L.H and H.L). The dependent variable was perception
RT. Results indicated a significant main effect of preceding consonant, FPreC(1, 63) = 8.944, p
= .004, ηp² = .124; however, vowel type, FVowel(1, 63) = .139, p = .710, and pitch pattern,
FPitch(1, 63) = 1.928, p = .170, were not significant. Thus, the preceding consonant
significantly affected the RT. The mean RTs for the tokens with the consonants /k/ and /s/ were
2053.43 milliseconds and 1850.31 milliseconds, respectively. The learners therefore identified
the vowel duration faster when the preceding consonant was /s/ than when it was /k/.
In addition to the main effect, the Preceding Consonant x Vowel Type interaction, F(1,
63) = 5.271, p = .025, ηp² = .107, and the Preceding Consonant x Vowel Type x Pitch Pattern
interaction, F(1, 63) = 9.704, p = .003, ηp² = .133, were significant. Simple effects tests
were conducted, and the results revealed that for the H.L pitch pattern the RT was shorter when
the consonant and vowel combination was /su/ than when it was /sa/.

Conclusion of Experiment 1
In conclusion, in this section, factors affecting accurate production, correct identification,
and response latency of vowel duration were examined. Regarding the production of the vowel
duration, it was found that vowel type and token type had significant main effects. In addition, a
significant interaction between the preceding consonant and token type was found; the stop /k/
had higher accuracy than the fricative /s/ for the CV.CV token.
The important pattern that emerges from the perception accuracy data involves the
influence of structural position of the long vowel, i.e., overall, there is misperception of vowel
length in the second syllable regardless of pitch pattern. In the case of CV.CV, the pattern of
errors suggests participants thought they perceived a longer duration but assigned it to a
geminate.
In the next section, the data obtained from the perceptual training in Experiment 2 are
analyzed. One of the objectives of Experiment 1 was to explore the potential effects of variables
prior to the training study in Experiment 2. For perception accuracy, perception latency, and
production accuracy, three variables (preceding consonant, vowel type, and pitch pattern) were
analyzed and found to affect the identification of vowel duration. Therefore, these variables were
included in the analysis of the data in Experiment 2.


CHAPTER 3: EXPERIMENT 2

Experiment 2 investigated the effects of auditory-visual input (i.e., waveform displays)
and auditory-only input in the training of L1 English speakers to identify L2 Japanese vowel duration.

Method
Participants
Participants were the same as in Experiment 1, which served as an exploratory study. A
total of 12 participants scored 90% or higher on the identification task; therefore, they were
excluded from the study in order to avoid ceiling effects. The remaining 52 learners participated
in Experiment 2.

Materials
Production Test: Materials included 16 tokens contrasting long and short vowels (Appendix A).
High and low vowels /a, u/ and two consonants /s, k/ were used to construct the target stimuli.
As with Experiment 1, pitch production was not treated as a variable in production.

Perception Test: Target stimuli for testing and treatment (i.e., perceptual training) were different.
Out of 40 tokens used in Experiment 1, 18 with long and short vowels were used for testing in
Experiment 2 (see Appendix E). A total of six NSs of Japanese (M=2; F=4) pronounced the
stimuli as shown in Table 14: Talker 1 was used for the testing stimuli (the same as in Experiment 1);
Talker 6 was used for the Test of Generalization 2 (TG2), which contained familiar stimuli
produced by a novel talker; Talker 2 was used for perception training and TG1, which involved
novel stimuli produced by a familiar talker; and the remaining three talkers (Talkers 3, 4, and 5)
were used for training stimuli.

Table 14: Talker assignment for recording stimuli used in identification tasks

Talker   Gender   Experiment     Task
1        Female   Experiment 1   Perception Test
                  Experiment 2   Perception Pretest and Posttest
2        Female   Experiment 2   Training 1 & 5; TG1
3        Male     Experiment 2   Training 2 & 6
4        Male     Experiment 2   Training 3 & 7
5        Female   Experiment 2   Training 4 & 8
6        Female   Experiment 2   TG2

Notes: TG1: generalization test with novel tokens produced by a familiar talker
TG2: generalization test with familiar tokens produced by a new talker

Perception Training: Out of 40 tokens used in Experiment 1, a total of 22 stimuli were used for
the perceptual training (see Appendix F). The tokens were produced by four talkers as shown in
Table 14 above. For both AV and A-only training conditions, the stimuli were audio-recorded
using a digital voice recorder. For the AV condition, waveforms as shown in Figure 28 were
generated using Praat.

Figure 28: Examples of the waveform displays for (a) kaaka, (b) kaka, (c) suusu, and (d) susu,
with each consonant and vowel segment labeled

Procedures
Production Test: A computerized production test was created using E-Prime. The production test
was administered prior to the perception test. The procedure was the same as in Experiment 1.
During production testing, a visual prompt task of 16 tokens, listed in Appendix A, was given to
participants. The stimuli were written in roomaji (i.e., the alphabet representation of Japanese
sounds), not hiragana. The experiment was conducted in a quiet room. The participants’
responses were recorded using a digital voice recorder and saved for later analyses.

Perception Test: After the production test, a perception test was given. A computerized perception
test was created using E-Prime. During perception testing, participants were given a forced-choice,
four-alternative identification task involving a total of 18 target stimuli (see Appendix E).
The rationale for using the identification task rather than a discrimination task was based on
previous studies (e.g., Logan et al., 1991). The choices were written in romanization, not
hiragana. The procedure was the same as in Experiment 1. Identification accuracy, the
participants’ responses, and RTs were recorded on the computer and saved for later analysis.
In studies with a person’s face as the AV input, testing stimuli are often presented in AV,
A-only, and V-only conditions for the group that receives AV training (e.g., Hardison, 2003).
The A-only test scores for the AV and A-only training groups are then compared since it is the
only modality they share. However, in the current study, a waveform was used as the visual
input, and it was not reasonable to test V-only accuracy. In addition, there was no rationale for
AV testing because the waveform was essentially a training tool to facilitate the perception of
duration. Therefore, in the current study, the two groups differed on the type of training, but
were tested with A-only input.


Perception Training: Eight training sessions (approximately 25 minutes each), totaling
approximately 3.5 hours in length, were administered individually, depending on participants’
schedules. A forced-choice identification task was used. Prior to perception training, all the
participants in the AV training group received waveform instruction for about five minutes,
which included demonstration of how long and short vowels appeared in waveforms while
listening to audio files. The purpose of this instruction was to “help learners understand the
relation between the acoustic signal they [are] receiving and the electronic visual representation”
(Motohashi-Saigo, 2007, p. 72). Five practice tokens were used in order to familiarize
participants with the task (Appendix G). The participants in the AV training group listened to
the stimulus and were asked to choose what they heard from the list provided while watching the
associated waveform. The participants in the A-only group, on the other hand, listened to the
stimuli and were asked to choose from the options. Feedback was provided in the training,
regardless of whether the responses were correct or not; the correct stimulus appeared as
feedback on the computer screen after the participants selected their response. After receiving
the feedback, participants in the A-only group had another chance to listen to each stimulus.
Participants in the AV group had another chance to listen to each stimulus with the display
of the associated waveform. The waveform was shown with the feedback so that the participants
in the AV group could use the visual information to pay more attention to the form when their
answers were wrong, and so that the input type was always consistent with the type of training
they were receiving. Their responses and RTs were recorded on the computer.
The detailed procedure for the perceptual training is described below. First, a participant
read the instructions on the computer screen as in Figure 29 for the A-only training group and
Figure 30 for the AV group.


Figure 29: Instructions for perceptual training for A-only training group


Figure 30: Instructions for perceptual training for AV training group

Then, the plus sign (+) appeared on the computer screen for four seconds before the participant
listened to the stimulus presented as an isolated word. The participant listened to a stimulus and
was asked to choose the correct response from the list provided as in Figure 31 for the A-only
training group and Figure 32 for the AV training group.

Figure 31: Identification task for perceptual training for A-only training group


Figure 32: Identification task for perceptual training for AV training group


As shown in Figure 32, the waveform was also provided when the participants in the AV training
group worked on the identification task. As soon as the participant made a choice, the computer
screen showed the correct answer, and the participant listened to the stimulus again. As soon as
the feedback was finished, the computer screen showed a plus sign again, and the task continued
with the rest of the stimuli.
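
The sequence of events within one training trial, as described above, can be summarized in a short sketch. This is not the E-Prime script used in the study; the class and function names are hypothetical, and only the four-second fixation duration is taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TrialEvent:
    description: str
    waveform_shown: bool = False
    duration_s: Optional[float] = None  # None where the text gives no duration


def training_trial(group: str) -> List[TrialEvent]:
    """Ordered events of one training trial for the 'AV' or 'A-only' group."""
    if group not in ("AV", "A-only"):
        raise ValueError("group must be 'AV' or 'A-only'")
    av = group == "AV"  # only the AV group sees the waveform
    return [
        TrialEvent("plus sign (+) on screen", duration_s=4.0),
        TrialEvent("stimulus played; response chosen from list", waveform_shown=av),
        TrialEvent("correct answer shown as feedback", waveform_shown=av),
        TrialEvent("stimulus played again", waveform_shown=av),
    ]


for event in training_trial("AV"):
    print(event.description, "| waveform:", event.waveform_shown)
```

The two groups go through the same four steps; the only difference in the sketch is whether the waveform accompanies the stimulus, the feedback, and the replay.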

Test of Generalization (TG): In order to see whether the participants’ improvement in identifying
vowel duration could be extended to novel stimuli produced by a familiar voice (TG1) and
familiar stimuli (i.e., stimuli used in training sessions) produced by an unfamiliar voice (TG2),
TGs that involve production and perception tests were given to the AV and A-only training
groups. The novel stimuli for TG1 are listed in Appendix H; the familiar stimuli for TG2 were
the same as the posttest stimuli in Appendix E. The novel stimuli involve a vowel not presented
in testing or training, /e/, and a new consonant, /t/.
All the procedures and formats of the tests were the same as the pretest/posttest described
earlier. A familiar and an unfamiliar voice were operationalized in the following way. A
familiar voice was a talker who produced tokens for training; therefore, Talker 2 (female)
produced the target stimuli for TG1. On the other hand, an unfamiliar voice was a talker who
had not produced tokens for either training or testing; therefore, a new talker (Talker 6) produced
target stimuli for TG2. Production accuracy, perception accuracy, and perception RT were
compared with (1) the pretest data to see if there was a significant improvement for these new
materials, and (2) the posttest data to see if the TG data were comparable and any improvement
noted between pretest and posttest could be generalized.


Results
A total of 4 participants did not complete all the tasks in Experiment 2 (i.e., perceptual training,
posttests, and TGs); therefore, their data were removed from the analysis. As a result, the data
from the remaining 48 participants were used for the analysis for Experiment 2. The data were
analyzed following Hardison (2003) and Motohashi-Saigo and Hardison (2009) and are
presented in the following order: (1) comparability of groups at pretest, (2) overall effectiveness
of perceptual training, (3) influence of stimulus variables on perception accuracy, (4) influence
of stimulus variables on perception latency, (5) the effects of perceptual training on production,
(6) the effect of training per group, and (7) tests of generalization. For the statistical analysis, the
alpha level was set at .05 (α = .05).

Comparability of Groups at Pretest: The 48 participants were divided into three groups: AV
training group (n=16), A-only training group (n=16), and Control (i.e., no training) group (n=16).
The mean accuracy scores in the pretest for the AV, A-only, and Control groups were 68.75%
(s.d. 16.21), 71.18% (s.d. 14.94), and 65.97% (s.d. 16.84) respectively. The mean RT scores in
the pretest for the AV, A-only, and Control groups were 2856.63 milliseconds (s.d. 582.99),
2805.25 milliseconds (s.d. 515.29), and 2789.66 milliseconds (s.d. 410.47) respectively. In order
to examine whether the three groups were statistically equivalent at the time of pretest, two
one-way ANOVAs were performed. The independent variable for both was group type (AV,
A-only, Control); the dependent variables were perception accuracy and RT. The results of the
ANOVAs indicated no significant differences among the three groups before perceptual
training: FAccuracy(2, 47) = 0.424, p = .657; FRT(2, 47) = 0.076, p = .927.


Analysis of Overall Effectiveness of the Perception Training: The descriptive statistics of
perception accuracy and RT in the pretest and posttest for each training group are shown in Table
15 below.

Table 15: Descriptive statistics for the perception pre/post-tests per group

                        Accuracy [Mean % (s.d.)]         RT [Mean in milliseconds (s.d.)]
Group     Sample Size   Pretest         Posttest         Pretest            Posttest
AV        16            68.75 (16.21)   96.53 (8.58)     2856.63 (582.99)   3106.78 (530.25)
A-Only    16            71.18 (14.94)   87.50 (12.91)    2805.25 (515.29)   3179.96 (564.17)
Control   16            65.97 (16.84)   64.24 (19.98)    2789.66 (410.47)   3139.41 (520.75)

A mixed ANOVA was used to test whether the training itself was effective for improving the
accuracy of identifying vowel duration and its response speed, compared to no training. The
within-subject factor was time (2; pretest and posttest); the between-subject factor was group
type (3; AV, A-Only, Control). The dependent variable was perception accuracy. Results
indicated significant main effects of time, FTime(1, 45) = 68.275, p < .001, ηp² = .603, and group
type, FGroup(2, 45) = 6.956, p = .002, ηp² = .236. The Time x Training Modality interaction was
also significant, F(2, 45) = 25.271, p < .001, ηp² = .529. In order to locate where the difference
existed among the three groups, post-hoc comparisons were performed using Tukey HSD. Results
indicated that the control group was significantly different from the AV group (p = .003) and the
A-only group (p = .018); however, there was no statistically significant difference between the
two experimental groups (p = .788) (Figure 33), although overall accuracy increased more for the
AV group.

Figure 33: The comparison of perception accuracy between pretest and posttest by group
(Chart of accuracy, 50 to 100%, at pretest and posttest for the AV, A-only, and Control groups.)

The purpose of having a control group was to determine if L2 learners could improve without
training over the same period of time. The participants in the experimental groups spent two
weeks receiving perceptual training. They also received regular classroom instruction during the
training. Therefore, it was important to have the control group to show that the improvement
from the pretest to posttest was due to the training. Since the control group did not improve, it
was concluded that the improvement resulted from the training. Therefore, the control group
was removed from further analyses.
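
The pretest-to-posttest change in accuracy for each group can be read directly from Table 15. The sketch below simply subtracts the pretest mean from the posttest mean; it is a descriptive illustration with hypothetical names and does not replace the mixed ANOVA reported above.

```python
# Mean accuracy (%) at pretest and posttest, transcribed from Table 15.
TABLE_15_ACCURACY = {
    "AV": (68.75, 96.53),
    "A-only": (71.18, 87.50),
    "Control": (65.97, 64.24),
}


def gain(group):
    """Posttest mean minus pretest mean, in percentage points."""
    pretest, posttest = TABLE_15_ACCURACY[group]
    return posttest - pretest


for group in TABLE_15_ACCURACY:
    print(f"{group}: {gain(group):+.2f} percentage points")
```

The gains are about +27.78 points for the AV group, +16.32 for the A-only group, and -1.73 for the control group, which matches the description of improvement in the trained groups only.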

Influence of Stimulus Variables on Perception Accuracy: In Experiment 1, it was found that
there were several interactions involving pitch pattern, preceding consonant, and/or vowel type,
which suggested that the combination of these factors affected perception accuracy of vowel
length. Based on the results of Experiment 1, which indicated that the position of the long vowel
in the second syllable influenced perception accuracy, and to create a manageable stimulus set,
the variables of consonant, vowel type, and pitch pattern were combined into 18 different
stimulus types as shown in Figure 34. The pitch pattern groups I – III are the same as those in
Experiment 1. No stimuli from Group IV (short vowels only, CV.CV) were used in the testing
materials in Experiment 2.

Figure 34: Stimulus type in pretest and posttest in Experiment 2

Group I (CVV.CVV)
1. LH.HH   saa.saa
2. LH.HH   suu.suu
3. LH.HL   kaa.kaa
4. LH.HL   kuu.kuu
5. HL.LL   saa.saa
6. HL.LL   kuu.kuu

Group II (CVV.CV)
7. LH.H    kaa.ka
8. LH.H    kuu.ku
9. LH.H    suu.su
10. HL.L   kaa.ka
11. HL.L   kuu.ku
12. HL.L   suu.su

Group III (CV.CVV)
13. L.HH   sa.saa
14. L.HH   su.suu
15. L.HL   ka.kaa
16 to 18. H.LL   ku.kuu, sa.saa, su.suu

First, a mixed ANOVA was used to test whether the effectiveness of the perceptual
training varied depending on stimulus type within Group I. Within-subject factors were time (2;
pretest and posttest) and stimulus type (6). The between-subject factor was group type (2; AV,
A-Only). The dependent variable was perception accuracy. Results indicated significant main
effects of time, FTime(1, 30) = 44.885, p < .001, ηp² = .599, and stimulus type, FSType(5, 150) =
4.241, p = .001, ηp² = .124; however, group type was not significant, FGroup(1, 30) = .839, p
= .367. None of the interactions was significant.
The mean accuracy scores for the tokens at pretest and posttest were .62 and .91
respectively. Therefore, the perception accuracy of the stimuli in Group I improved from pretest
to posttest. In addition, it was found that stimulus type had a significant influence. In order to
locate where the differences existed among the six stimulus types, pairwise comparisons were
performed using the Bonferroni correction. The mean accuracy scores for each stimulus type are
shown in Table 16 below.

Table 16: Mean perception accuracy of the six stimuli in Group I (CVV.CVV) in Experiment 2

                                        Mean Accuracy
Stimulus Type (ST)   Tokens             Pretest   Posttest
1                    saa.saa (LH.HH)    .66       .97
2                    suu.suu (LH.HH)    .81       .97
3                    kaa.kaa (LH.HL)    .44       .81
4                    kuu.kuu (LH.HL)    .75       .91
5                    saa.saa (HL.LL)    .50       .91
6                    kuu.kuu (HL.LL)    .56       .91

Results indicated that ST3 was significantly different from ST4 (p = .004) and ST2 (p = .007).
Also, ST5 was significantly different from ST2 (p = .026). The two tokens ST3 and ST4 share
the same preceding consonant and pitch pattern but differ in the vowel. Based on the
comparison of these two tokens, the vowel /u/ combined with the consonant /k/ and the LH.HL pitch
pattern was perceived more accurately than the vowel /a/ in the same condition.
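
The overall pretest and posttest means reported for Group I follow from the stimulus-type means in Table 16. The sketch below, with hypothetical names, averages the six stimulus types (assuming each contributes equally) and lists the gain for each type.

```python
from statistics import mean

# (pretest, posttest) mean accuracy per stimulus type, transcribed from Table 16.
TABLE_16 = {
    "ST1 saa.saa (LH.HH)": (0.66, 0.97),
    "ST2 suu.suu (LH.HH)": (0.81, 0.97),
    "ST3 kaa.kaa (LH.HL)": (0.44, 0.81),
    "ST4 kuu.kuu (LH.HL)": (0.75, 0.91),
    "ST5 saa.saa (HL.LL)": (0.50, 0.91),
    "ST6 kuu.kuu (HL.LL)": (0.56, 0.91),
}

# Assumption: each stimulus type carries equal weight in the overall mean.
pretest_mean = mean([pre for pre, _ in TABLE_16.values()])
posttest_mean = mean([post for _, post in TABLE_16.values()])
gains = {st: post - pre for st, (pre, post) in TABLE_16.items()}
largest_gain = max(gains, key=gains.get)

print(f"pretest {pretest_mean:.2f}, posttest {posttest_mean:.2f}")
print("largest gain:", largest_gain)
```

The averages come out at about .62 and .91, in line with the reported means, and the lowest pretest score belongs to ST3, the kaa.kaa token with the LH.HL pattern.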
Next, a mixed ANOVA was used to test whether the effectiveness of the perceptual
training varied according to stimulus type within Group II. Within-subject factors were time (2;
pretest and posttest) and stimulus type (6). The between-subject factor was group type (2; AV,
A-Only). The dependent variable was perception accuracy. Results indicated that the main
effects of time, FTime(1, 30) = 10.083, p = .003, ηp² = .252, stimulus type, FSType(5, 150) =
10.156, p < .001, ηp² = .253, and group type, FGroup(1, 30) = 6.127, p = .019, ηp² = .170, were
all significant. However, none of the interactions was significant.


It was found that all of the factors affected perception accuracy. The mean accuracy
scores for the tokens at pretest and posttest were .76 and .88 respectively. Therefore, the
perception accuracy of the stimuli in Group II improved from pretest to posttest. In addition,
group type had an effect on perception accuracy. Since the AV group had higher accuracy
than the A-only group, it was concluded that the AV training was more effective in developing
perception accuracy of the tokens in Group II than the A-only training (Figure 35).

Figure 35: The comparison of perception accuracy of the tokens in Group II (CVV.CV) by
training groups in Experiment 2
[Graph: y-axis Accuracy; x-axis Pretest, Posttest]

Group     Pretest   Posttest
AV        0.78      0.97
A-only    0.73      0.78

It was also found that stimulus type had a significant influence on correctly identifying
the vowel duration. In order to locate where the differences existed in the six stimulus types,
pairwise comparisons were performed using the Bonferroni correction. The mean accuracy
scores for each stimulus type are tabulated in Table 17 below.


Table 17: Mean perception accuracy of the six stimulus types in Group II (CVV.CV) in
Experiment 2

Stimulus Type (ST)   Tokens           Mean Accuracy
                                      Pretest   Posttest
7                    kaa.ka (LH.H)    .81       .91
8                    kuu.ku (LH.H)    .94       .97
9                    suu.su (LH.H)    .81       .81
10                   kaa.ka (HL.L)    .91       .96
11                   kuu.ku (HL.L)    .63       .88
12                   suu.su (HL.L)    .44       .72

Results indicated that ST7 was significantly different from ST12 (p = .015); ST8 was
different from ST11 (p = .010) and ST12 (p < .001); and ST10 was significantly different from
ST11 (p = .005) and ST12 (p < .001). The difference between ST8 and ST11 was pitch pattern;
therefore, it was concluded that the LH.H pattern was easier for correct perception than the HL.L
pitch pattern when the token contains the preceding consonant /k/ and the vowel /u/. In addition,
the difference between ST10 and ST11 was vowel type; therefore, it was concluded that the
vowel /a/ was easier for correct perception than the vowel /u/ when it followed /k/ in the HL.L
pattern.
Finally, a mixed ANOVA was used to test whether the effectiveness of the perceptual
training varied according to stimulus type within Group III. Within-subject factors were time (2;
pretest and posttest) and stimulus type (6). The between-subject factor was group type (2; AV,
A-Only). The dependent variable was perception accuracy. Results indicated significant main
effects of time, $F_{\text{Time}}(1, 30) = 24.083$, $p < .001$, $\eta_p^2 = .445$, and stimulus type,
$F_{\text{SType}}(5, 150) = 7.358$, $p < .001$, $\eta_p^2 = .197$; however, group type was not
significant, $F_{\text{Group}}(1, 30) = .309$, $p = .582$. The mean accuracy scores for the tokens at
pretest and posttest were .73 and .91, respectively. Therefore, the perception accuracy of stimuli
in Group III improved from pretest to posttest. It was also found that stimulus type had a
significant influence on correctly identifying the vowel duration. In order to locate where the
differences existed among the six stimulus types, pairwise comparisons were performed using the
Bonferroni correction. The mean accuracy scores for each stimulus type are shown in Table 18
below.

Table 18: Mean perception accuracy of the six stimulus types in Group III (CV.CVV) in
Experiment 2

Stimulus Type (ST)   Tokens           Mean Accuracy
                                      Pretest   Posttest
13                   sa.saa (L.HH)    .78       .97
14                   su.suu (L.HL)    .81       .94
15                   ka.kaa (H.LL)    .56       .88
16                   sa.saa (H.LL)    1.00      .97
17                   ku.kuu (H.LL)    .65       .88
18                   su.suu (H.LL)    .56       .74

Results indicated that ST15 was significantly different from ST16 (p < .001); ST16 was
also different from ST17 (p = .006) and ST18 (p = .001). The difference between ST16 (sa.saa
with H.LL) and ST18 (su.suu with H.LL) was the vowel. Considering the mean scores in Table 18,
the vowel /a/ was easier to perceive correctly than the vowel /u/ when the preceding consonant
was /s/ in the H.LL pattern. On the other hand, the difference among ST15, ST16, and ST17 was
the combination of the consonant and vowel. Considering the mean scores in Table 18, the
perception accuracy of /sa/ was higher than that of /ku/ and /ka/ when the tokens had the H.LL
pitch pattern.


In addition to the main effects of time and stimulus type, the Time x Group Type
interaction, $F(1, 30) = 5.682$, $p = .024$, $\eta_p^2 = .159$, and the Time x Stimulus Type
interaction, $F(5, 150) = 2.538$, $p = .031$, $\eta_p^2 = .159$, were significant for Group III
(Figure 36). Based on the result, it was found that the rate of development in the perception
accuracy was faster for the learners in the AV group, compared to those in the A-only group for
CV.CVV stimuli. Regarding the interaction between time and stimulus type, the results of the
simple effects tests revealed that the differences between ST16 and ST13, ST17, and ST18 were
greater in the pretest than in the posttest. In addition, ST16 revealed the highest accuracy; ST15
and ST18 revealed the lowest accuracy.
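
The Time x Group Type interaction can be read as a difference between the two groups' pretest-to-posttest gains. The following sketch computes those gains from the cell means plotted in Figure 36; it is only a descriptive illustration of what the interaction reflects, not a significance test, and the variable names are illustrative.

```python
# Cell means for Group III (CV.CVV), read from Figure 36.
means = {
    "AV": {"pretest": 0.67, "posttest": 0.94},
    "A-only": {"pretest": 0.79, "posttest": 0.89},
}

# Gain = posttest mean minus pretest mean, per training group.
gains = {
    group: round(cell["posttest"] - cell["pretest"], 2)
    for group, cell in means.items()
}

# The interaction contrast is the difference between the two gains.
interaction_contrast = round(gains["AV"] - gains["A-only"], 2)
print(gains, interaction_contrast)
```

The AV group starts lower and ends higher, so its gain is larger than that of the A-only group, which matches the faster rate of development described above.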

Figure 36: The comparison of perception accuracy of the tokens in Group III by training groups
in Experiment 2
[Panel 1. Graph: y-axis Accuracy; x-axis Pretest, Posttest]

Group     Pretest   Posttest
AV        0.67      0.94
A-only    0.79      0.89

Figure 36 (cont’d)
[Panel 2. Graph: y-axis Accuracy; x-axis Pretest, Posttest; one line each for ST13 to ST18]

Effectiveness of Training Type on Perception RT: It was examined whether the effectiveness of
the perceptual training on perception RT varied with preceding consonant, vowel type, and pitch
pattern. Similar to the analysis of perception accuracy, instead of having the three separate
variables as preceding consonant, vowel type, and pitch pattern, they were combined and labeled
as stimulus type in Figure 34 in the previous section. The stimulus type was divided into three
groups as shown in Figure 34. Prior to the statistical analysis, it was confirmed that the AV and
the A-only groups were statistically equivalent at the time of pretest.
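
As an illustration of how the three variables were combined into a single stimulus type and then grouped by syllable structure, the sketch below encodes a few of the tokens listed in Tables 16 to 18. The class and attribute names are hypothetical and serve only to make the coding scheme explicit.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Stimulus:
    token: str  # romanized token, e.g. "kaa.ka"
    pitch: str  # pitch pattern, e.g. "LH.H"

    @property
    def consonant(self) -> str:
        return self.token[0]

    @property
    def vowel(self) -> str:
        return self.token[1]

    @property
    def structure(self) -> str:
        # Map vowels to V first, then the remaining consonants to C.
        return re.sub(r"[ks]", "C", re.sub(r"[au]", "V", self.token))


stimuli = [
    Stimulus("saa.saa", "LH.HH"),  # ST1
    Stimulus("kaa.ka", "LH.H"),    # ST7
    Stimulus("kuu.ku", "HL.L"),    # ST11
    Stimulus("sa.saa", "L.HH"),    # ST13
]

# Group the stimuli by syllable structure (CVV.CVV, CVV.CV, CV.CVV).
groups = defaultdict(list)
for stimulus in stimuli:
    groups[stimulus.structure].append(stimulus)

for structure, members in groups.items():
    print(structure, [(m.consonant, m.vowel, m.pitch) for m in members])
```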
First, a mixed ANOVA was used to test whether the effectiveness of the perceptual
training on perception RT varied according to stimuli within Group I. Within-subject factors
were time (2; pretest and posttest) and stimulus type (6). The between-subject factor was group
type (2; AV, A-Only). The dependent variable was perception RT. Results indicated no
significant main effects: time, $F_{\text{Time}}(1, 30) = 3.198$, $p = .084$; stimulus type,
$F_{\text{SType}}(5, 150) = 1.121$, $p = .352$; and group type, $F_{\text{Group}}(1, 30) = 1.104$,
$p = .302$. None of the interactions was significant. The mean RTs at pretest and posttest were
2926.07 milliseconds and 3207.77 milliseconds, respectively. The pretest revealed faster RT than
the posttest; however, the difference was not statistically significant. The mean RTs for each
stimulus type are shown in Table 19 below. There were no significant differences among the six
tokens.

Table 19: Mean perception RT of the six stimuli in Group I (CVV.CVV) in Experiment 2

Stimulus Type (ST)   Tokens             Mean RT (milliseconds)
                                        Pretest    Posttest
1                    saa.saa (LH.HH)    2420.03    3301.38
2                    suu.suu (LH.HH)    2940.72    3509.88
3                    kaa.kaa (LH.HL)    3188.34    3201.00
4                    kuu.kuu (LH.HL)    3041.44    3172.80
5                    saa.saa (HL.LL)    3010.16    2945.94
6                    kuu.kuu (HL.LL)    2955.75    3115.59
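
The overall pretest and posttest RTs reported for Group I are the averages of the six stimulus-type means in Table 19. The short sketch below reproduces that arithmetic; the dictionary layout is an illustrative choice.

```python
from statistics import mean

# Mean RT in milliseconds per stimulus type: (pretest, posttest), from Table 19.
table_19 = {
    "ST1": (2420.03, 3301.38),
    "ST2": (2940.72, 3509.88),
    "ST3": (3188.34, 3201.00),
    "ST4": (3041.44, 3172.80),
    "ST5": (3010.16, 2945.94),
    "ST6": (2955.75, 3115.59),
}

pretest_mean = mean(pre for pre, _ in table_19.values())
posttest_mean = mean(post for _, post in table_19.values())
print(f"{pretest_mean:.2f} {posttest_mean:.2f}")
```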

Next, a mixed ANOVA was used to test whether the effectiveness of the perceptual
training on perception RT varied according to stimulus type within Group II. Within-subject
factors were time (2; pretest and posttest) and stimulus type (6). The between-subject factor was
group type (2; AV, A-Only). The dependent variable was perception RT. Results indicated a
significant main effect of time, $F_{\text{Time}}(1, 30) = 7.593$, $p = .010$, $\eta_p^2 = .202$;
however, stimulus type, $F_{\text{SType}}(5, 150) = 1.278$, $p = .276$, and group type,
$F_{\text{Group}}(1, 30) = 1.469$, $p = .235$, were not significant. The mean RTs at pretest and
posttest were 2775.23 milliseconds and 3125.75 milliseconds, respectively. Therefore, the
perception RT of the stimulus types in Group II (CVV.CV) increased at the posttest. The mean
RTs for each stimulus type are shown in Table 20 below. There were no significant differences
among the six tokens.

Table 20: Mean perception RT of the six stimulus types in Group II (CVV.CV) in Experiment 2

Stimulus Type (ST)   Tokens           Mean RT (milliseconds)
                                      Pretest    Posttest
7                    kaa.ka (LH.H)    2532.31    3013.28
8                    kuu.ku (LH.H)    2489.34    3315.13
9                    suu.su (LH.H)    2552.63    3396.16
10                   kaa.ka (HL.L)    3070.16    3253.22
11                   kuu.ku (HL.L)    3156.28    2618.78
12                   suu.su (HL.L)    2850.66    3157.94

Although there were no main effects of stimulus type or group type, a significant Time x
Stimulus Type interaction was found, $F(5, 150) = 5.498$, $p < .001$, $\eta_p^2 = .155$ (Figure 37).
In order to examine the interaction, simple effects tests were conducted, and the result revealed
that the differences between ST11 and ST7, ST8, and ST9 were greater in the pretest than in the
posttest. The RT of ST11 was slower than those of ST7, ST8, and ST9 in the pretest; however,
that of ST11 became faster in the posttest while the RTs of the other three became slower.


Figure 37: The comparison of perception RT of the tokens in Group II by training groups in
Experiment 2
[Graph: y-axis RT in milliseconds; x-axis Pretest, Posttest; one line each for ST7 to ST12]

Finally, a mixed ANOVA was used to test whether the effectiveness of the perceptual
training on perception RT varied according to stimulus type within Group III. Within-subject
factors were time (2; pretest and posttest) and stimulus type (6). The between-subject factor was
training type (2; AV, A-Only). The dependent variable was perception RT. Results indicated
significant main effects of time, $F_{\text{Time}}(1, 30) = 16.515$, $p < .001$, $\eta_p^2 = .355$,
and stimulus type, $F_{\text{SType}}(5, 150) = 3.123$, $p = .010$, $\eta_p^2 = .094$; however,
group type was not significant, $F_{\text{Group}}(1, 30) = .063$, $p = .804$. The mean perception
RTs at pretest and posttest were 2624.38 milliseconds and 3249.11 milliseconds, respectively.
Therefore, the perception RT of stimuli in Group III increased from pretest to posttest. It was
also found that stimulus type had a significant influence on response latency. In order to locate
where the differences existed among the six stimulus types, pairwise comparisons were performed
using the Bonferroni correction. The mean perception RTs are shown in Table 21 below.

Table 21: Mean perception RT of the six stimulus types in Group III (CV.CVV) in Experiment 2

Stimulus Type (ST)   Tokens           Mean RT (milliseconds)
                                      Pretest    Posttest
13                   sa.saa (L.HH)    2856.63    2887.28
14                   su.suu (L.HL)    3060.66    3344.09
15                   ka.kaa (H.LL)    2761.66    3120.56
16                   sa.saa (H.LL)    1570.06    3505.66
17                   ku.kuu (H.LL)    2647.31    3482.34
18                   su.suu (H.LL)    2851.97    3154.72

Results indicated that ST14 was significantly different from ST16 (p = .008) and ST16
was significantly different from ST17 (p = .026). The difference between ST16 and ST17 was
the combination of vowel and consonant; the perception of the long vowel in /sa/ was faster than
that in /ku/ when the pitch pattern was H.LL.
In addition to the main effects of time and stimulus type, the Time x Stimulus Type
interaction was significant, $F(5, 150) = 7.304$, $p < .001$, $\eta_p^2 = .196$ (Figure 38). In order
to examine the interaction, simple effects tests were conducted, and the result revealed that the
difference between ST16 and ST17 was greater in the pretest than in the posttest.


Figure 38: The comparison of perception RT of the tokens in Group III in Experiment 2
[Graph: y-axis RT in milliseconds; x-axis Pretest, Posttest; one line each for ST13 to ST18]

In conclusion, the two training groups improved accuracy in identifying vowel duration
after the training. There were significant differences between the two types of training (AV vs.
A-only), with the AV group demonstrating greater improvement than the A-only group. There
were mixed results regarding the influence of preceding consonant, vowel type, and pitch
pattern; the influence of these variables differed depending on the token type (i.e., CVV.CVV,
CVV.CV, and CV.CVV). Although perception accuracy showed significant improvement after
the training, response latency became slower, which suggested that the learners were processing
the input more and thinking more about which choice provided in the identification task was
right. In the analysis of training data, it was found that the talker’s voice affected both
perception accuracy and latency.


Analysis of Production Data: The production accuracy before and after the perceptual training
was analyzed to examine whether the effect of the training on correctly identifying vowel
duration would transfer to another skill such as production. The 32 participants in the AV and
A-only groups who took the perception training in Experiment 2 took a production pretest before
the perception training and a posttest after the training. The same raters who rated the pretest
data rated the posttest data, using the same procedure. Interrater reliability was checked using
the Pearson correlation coefficient. There was a significant positive correlation between Rater 1
and Rater 2 ($r = .914$, $p < .001$, $R^2 = .84$), between Rater 1 and Rater 3 ($r = .930$,
$p < .001$, $R^2 = .86$), as well as between Rater 2 and Rater 3 ($r = .906$, $p < .001$,
$R^2 = .82$); the correlation was strong.
The production accuracy scores (i.e., one point for the correct pronunciation of each token) are
shown in Table 22. The control group did not show improvement of production accuracy.
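
The interrater statistics above pair each Pearson r with its square, the proportion of shared variance. The sketch below shows both steps: a plain implementation of Pearson's r applied to two short, hypothetical rating vectors (not the study's ratings), followed by squaring the three reported coefficients.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical 0/1 accuracy ratings from two raters (illustration only).
rater_a = [1, 0, 1, 1, 0, 1, 0, 1]
rater_b = [1, 0, 1, 1, 0, 1, 1, 1]
print(round(pearson_r(rater_a, rater_b), 3))

# Squaring the reported coefficients gives the reported R-squared values.
reported_r = {
    ("Rater 1", "Rater 2"): 0.914,
    ("Rater 1", "Rater 3"): 0.930,
    ("Rater 2", "Rater 3"): 0.906,
}
r_squared = {pair: round(r ** 2, 2) for pair, r in reported_r.items()}
print(r_squared)
```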


Table 22: Descriptive Statistics for production tests in Experiment 2 (Pretest and Posttest) for the
AV and A-only groups organized by consonant-vowel combination

            AV Group                      A-only Group
Tokens      Pretest       Posttest        Pretest       Posttest
            Mean (s.d.)   Mean (s.d.)     Mean (s.d.)   Mean (s.d.)

kaa.kaa     .81 (.40)     .75 (.45)       .75 (.45)     .88 (.34)
kaa.ka      .75 (.45)     .88 (.94)       .93 (.25)     1.00 (.00)
ka.kaa      .43 (.51)     1.00 (.00)      .63 (.50)     .94 (.25)
ka.ka       .50 (.52)     .81 (.40)       .68 (.48)     1.00 (.00)

kuu.kuu     .56 (.51)     .75 (.45)       .62 (.50)     .63 (.50)
kuu.ku      .94 (.25)     .93 (.25)       .94 (.25)     1.00 (.00)
ku.kuu      .50 (.52)     .88 (.34)       .38 (.50)     .88 (.34)
ku.ku       .43 (.51)     1.00 (.00)      .63 (.50)     1.00 (.00)

saa.saa     .75 (.45)     .81 (.40)       .68 (.48)     .81 (.40)
saa.sa      .75 (.45)     1.00 (.00)      .68 (.48)     1.00 (.00)
sa.saa      .50 (.52)     .94 (.25)       .63 (.50)     .88 (.34)
sa.sa       .43 (.51)     .94 (.25)       .68 (.48)     .94 (.25)

suu.suu     .63 (.50)     .75 (.45)       .69 (.48)     .69 (.48)
suu.su      .75 (.45)     .94 (.25)       .81 (.40)     1.00 (.00)
su.suu      .50 (.52)     1.00 (.00)      .63 (.50)     .88 (.34)
su.su       .44 (.51)     .94 (.25)       .69 (.48)     .88 (.34)

A repeated-measures ANOVA was used to test whether the effects of perceptual training
transferred to correct production of vowel duration. Within-subject factors were time (2: pretest
and posttest), vowel type (2: high /u/ and low /a/ vowel), preceding consonant (2: /k/ and /s/),
and token type (4: CVV.CVV, CVV.CV, CV.CVV, CV.CV); the between-subject factor was group
type (2: AV, A-Only). The dependent variable was production accuracy. Results indicated
significant main effects of time, $F_{\text{Time}}(1, 30) = 67.148$, $p < .001$, $\eta_p^2 = .691$,
and token type, $F_{\text{TType}}(3, 90) = 5.392$, $p = .002$, $\eta_p^2 = .152$; however, vowel
type, $F_{\text{Vowel}}(1, 30) = 1.815$, $p = .188$, group type, $F_{\text{Group}}(1, 30) = 1.600$,
$p = .216$, and preceding consonant, $F_{\text{PreC}}(1, 30) = .062$, $p = .806$, were not
significant. It was found that token type significantly affected the accuracy of participants’
production of vowel duration. The mean accuracy was .72 for CVV.CVV, .90 for CVV.CV, .72
for CV.CVV, and .75 for CV.CV. In order to locate where the differences existed among the four
token types, pairwise comparisons were performed using the Bonferroni correction. The results
indicated that CVV.CV was significantly different from CVV.CVV (p = .003), CV.CVV
(p = .004), and CV.CV (p = .012) and showed more accurate production. The findings suggest
that learners found it easier to produce a long vowel when there was only one and it occurred in
the first syllable.
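
The token-type means can be traced back to the cell means in Table 22. The sketch below averages the AV and A-only cell means for each token type at pretest and posttest; it assumes the two groups are the same size, so that an unweighted average of cell means is appropriate, and the helper names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# token: (AV pretest, AV posttest, A-only pretest, A-only posttest), from Table 22.
TABLE_22 = {
    "kaa.kaa": (0.81, 0.75, 0.75, 0.88), "kaa.ka": (0.75, 0.88, 0.93, 1.00),
    "ka.kaa": (0.43, 1.00, 0.63, 0.94), "ka.ka": (0.50, 0.81, 0.68, 1.00),
    "kuu.kuu": (0.56, 0.75, 0.62, 0.63), "kuu.ku": (0.94, 0.93, 0.94, 1.00),
    "ku.kuu": (0.50, 0.88, 0.38, 0.88), "ku.ku": (0.43, 1.00, 0.63, 1.00),
    "saa.saa": (0.75, 0.81, 0.68, 0.81), "saa.sa": (0.75, 1.00, 0.68, 1.00),
    "sa.saa": (0.50, 0.94, 0.63, 0.88), "sa.sa": (0.43, 0.94, 0.68, 0.94),
    "suu.suu": (0.63, 0.75, 0.69, 0.69), "suu.su": (0.75, 0.94, 0.81, 1.00),
    "su.suu": (0.50, 1.00, 0.63, 0.88), "su.su": (0.44, 0.94, 0.69, 0.88),
}


def structure(token):
    """Syllable structure of a token, e.g. 'kaa.ka' -> 'CVV.CV'."""
    return ".".join("CVV" if len(syl) == 3 else "CV" for syl in token.split("."))


cells = defaultdict(lambda: {"pretest": [], "posttest": []})
for token, (av_pre, av_post, a_pre, a_post) in TABLE_22.items():
    cells[structure(token)]["pretest"] += [av_pre, a_pre]
    cells[structure(token)]["posttest"] += [av_post, a_post]

type_means = {
    token_type: (mean(c["pretest"]), mean(c["posttest"]))
    for token_type, c in cells.items()
}
for token_type, (pre, post) in type_means.items():
    print(f"{token_type}: pretest {pre:.3f}, posttest {post:.3f}")
```

Averaging the pretest and posttest values for each token type gives approximately the overall means reported above (for example, about .72 for CVV.CVV and about .90 for CVV.CV).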
In addition to the main effect of token type, the Time x Token Type interaction,
$F(3, 90) = 7.977$, $p < .001$, $\eta_p^2 = .210$, and the Vowel Type x Token Type interaction,
$F(3, 90) = 2.929$, $p = .038$, $\eta_p^2 = .089$, were also significant (Figure 39). To analyze the
interactions in detail, simple effects tests were conducted. Regarding the Time x Token Type
interaction, results revealed that CVV.CV was better at pretest, CV.CVV and CV.CV showed
parallel improvement, and CVV.CVV barely improved. Regarding the Vowel Type x Token Type
interaction, the accuracy of the CVV.CVV token type was higher when the vowel was /a/
compared to /u/.


Figure 39: The comparison of production accuracy by vowel and token type in Experiment 2
[Panel 1. Graph: y-axis Accuracy; token type by time]

Token Type   Pretest   Posttest
CVV.CVV      0.69      0.76
CVV.CV       0.82      0.97
CV.CVV       0.52      0.92
CV.CV        0.56      0.94

[Panel 2. Graph: y-axis Accuracy; token type by vowel]

Token Type   /a/       /u/
CVV.CVV      0.78      0.66
CVV.CV       0.88      0.91
CV.CVV       0.74      0.70
CV.CV        0.75      0.75

The production errors that the learners made during the production test are summarized in
Table 23 below. The learners made more errors when they pronounced the CVV.CVV tokens;
they tended to shorten the vowel in the second syllable. Also, the errors of the CV.CVV and
CV.CV types showed that the short vowels in the first syllable were harder to pronounce
correctly because they were generally lengthened.

Table 23: Errors observed in the production posttest in Experiment 2

Token with /a/   Errors    Number    Token with /u/   Errors    Number
CaaCaa           CaaCa     11        CuuCuu           CuuCu     17
                 CaCaa     1                          CuCuu     1
CaaCa            CaCaa     1         CuuCu            CuCu      1
                 CaCa      1                          CuCu      4
CaCaa            CaaCa     2         CuCuu            CuuCu     1
                 CaaCaa    1                          CuuCuu    3
                 CaCCaa    1                          CuCu      2
CaCa             CaCaa     1         CuCu             CuCuu     1
                 CaaCa     5                          CuuCu     1
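
The error patterns in Table 23 can be described by comparing the vowel length of each syllable in the target with that in the produced form. The sketch below is one hypothetical way to automate that comparison for the schematic patterns used in the table (C for a consonant, a or u for a vowel); it is not the raters' procedure, and it deliberately ignores consonant changes such as the geminate in CaCCaa.

```python
import re


def vowel_lengths(pattern):
    """Length of each vowel run in a schematic pattern, e.g. 'CaaCa' -> [2, 1]."""
    return [len(run) for run in re.findall(r"[au]+", pattern)]


def describe_error(target, produced):
    """List the syllables whose vowel was shortened or lengthened."""
    changes = []
    lengths = zip(vowel_lengths(target), vowel_lengths(produced))
    for position, (expected, actual) in enumerate(lengths, start=1):
        if actual < expected:
            changes.append(f"syllable {position} vowel shortened")
        elif actual > expected:
            changes.append(f"syllable {position} vowel lengthened")
    return changes


print(describe_error("CaaCaa", "CaaCa"))   # most frequent error for CaaCaa
print(describe_error("CaCa", "CaaCa"))     # most frequent error for CaCa
print(describe_error("CaCaa", "CaCCaa"))   # consonant change only
```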

In conclusion, production accuracy improved from pretest to posttest, while there was
no statistically significant difference between the two training groups. Thus, since the learners
did not receive any specific production training or practice, it was considered that the positive
effect of the focused perceptual training on the L2 vowel duration was transferred to production.
Interactions between time (i.e., pretest and posttest) and token type as well as between vowel
type and token type were found. The three token types CVV.CV, CV.CVV, and CV.CV
significantly improved after the training, but the CVV.CVV type did not. Also, there was a
tendency for the CVV.CVV tokens to be more accurately pronounced if they contained the vowel
/a/, compared to the vowel /u/.


Analysis of Effectiveness of Training per Group: In order to examine the development of
accuracy and response latency as well as effects of talker and other factors (i.e., pitch pattern,
vowel type, and preceding consonant) during the training, the perception accuracy and RT in the
training sessions were analyzed by training groups. Figure 40 illustrates the identification
accuracy in each training session (total of 8) by the AV and A-only groups. For both groups,
perception accuracy became higher starting at the end of the first week (i.e., Session 4);
however, accuracy in the third session in the second week (i.e., Session 7) was lower than in the
other sessions in that week. In addition, the AV group showed higher accuracy than the A-only
group, except for Session 6.

Figure 40: Perception accuracy in each week and talker by AV and A-only groups
[Graph: y-axis Perception Accuracy (percent); x-axis week and talker]

Week and Talker   AV       A-only
W1 Talker2        87.50    84.38
W1 Talker3        87.78    85.80
W1 Talker4        84.94    80.68
W1 Talker5        92.33    91.19
W2 Talker2        95.45    89.77
W2 Talker3        89.20    91.48
W2 Talker4        86.93    81.82
W2 Talker5        95.17    90.91

Figure 41 shows the perception accuracy by the four talkers used in the training: Talker2 (F) was
assigned for the first and fifth sessions; Talker3 (M) was assigned for the second and sixth
sessions; Talker4 (M) was assigned for the third and seventh sessions; and Talker5 (F) was
assigned for the fourth and eighth sessions. Accuracy for tokens produced by Talker3 was
comparable for both groups; in other cases, the AV training group showed higher scores.
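
The per-talker accuracy values follow from averaging each talker's two sessions. The sketch below does this with the session values shown in Figure 40, assuming the session order given by the figure's labels (Week 1 with Talkers 2 to 5, then Week 2 with Talkers 2 to 5); the variable and function names are illustrative.

```python
from statistics import mean

TALKERS = ["Talker2", "Talker3", "Talker4", "Talker5"]

# Percent correct per session, read from Figure 40.
# Assumed order: Week 1 Talkers 2-5, then Week 2 Talkers 2-5.
SESSION_ACCURACY = {
    "AV": [87.50, 87.78, 84.94, 92.33, 95.45, 89.20, 86.93, 95.17],
    "A-only": [84.38, 85.80, 80.68, 91.19, 89.77, 91.48, 81.82, 90.91],
}


def talker_means(sessions):
    """Average each talker's Week 1 session with the same talker's Week 2 session."""
    return {
        talker: mean([sessions[i], sessions[i + 4]])
        for i, talker in enumerate(TALKERS)
    }


by_talker = {group: talker_means(s) for group, s in SESSION_ACCURACY.items()}
for group, values in by_talker.items():
    print(group, {talker: f"{value:.3f}" for talker, value in values.items()})
```

For Talker3 the two groups differ by well under one percentage point, while the AV group is higher for the other three talkers.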

Figure 41: Perception accuracy by talker in perceptual training
[Graph: y-axis Percent Correct; x-axis talker]

Talker     AV       A-only
Talker2    91.48    87.07
Talker3    88.49    88.64
Talker4    85.94    81.25
Talker5    93.75    91.05

Figure 42 illustrates the RT in each training session by the AV and A-only groups. Across the
sessions, RTs were faster for the AV group.


Figure 42: The RT for each week and talker by AV and A-only groups
[Graph: y-axis Response Latency (milliseconds); x-axis week and talker]

Week and Talker   AV         A-only
W1 Talker2        2981.01    3328.31
W1 Talker3        2591.33    2878.80
W1 Talker4        2547.27    2798.88
W1 Talker5        2412.36    2573.63
W2 Talker2        2280.88    2513.79
W2 Talker3        2360.97    2601.00
W2 Talker4        2450.07    2658.22
W2 Talker5        2245.26    2423.40

Figure 43 shows the RT to tokens produced by the four talkers used in the training. As described
earlier, Talker2 (F) was assigned for the first and fifth sessions; Talker3 (M) was assigned
for the second and sixth sessions; Talker4 (M) was assigned for the third and seventh sessions;
and Talker5 (F) was assigned for the fourth and eighth sessions. The AV group showed faster
RTs to all talkers.


Figure 43: The RT in the training grouped by the four talkers
[Graph: y-axis RT (milliseconds); x-axis talker]

Talker     AV         A-only
Talker2    2630.94    2921.05
Talker3    2476.15    2739.90
Talker4    2498.67    2728.55
Talker5    2328.81    2498.51

In order to examine the development of correct identification, response latency, and the
influence of other factors such as talker in training sessions, pitch pattern, vowel type, and
preceding consonant, the 22 tokens used in the training (Appendix F) were divided into four
groups depending on the pitch pattern as shown in Figure 44; vowel type, preceding consonant,
and pitch pattern were combined as stimulus type following the earlier analysis of the pretest and
posttest data.


Figure 44: Tokens in the training sessions by stimulus type

Group I (CVV.CVV)
No.   Pitch    Token
1     LH HH    kaa.kaa
2     HL LL    kaa.kaa
3     LH HL    saa.saa
4     LH HH    kuu.kuu
5     LH HL    suu.suu

Group II (CVV.CV)
No.   Pitch    Token
6     LH H     kaa.ka
7     LH H     saa.sa
8     HL H     saa.sa
9     LH H     suu.su

Group III (CV.CVV)
No.   Pitch    Token
10    L HH     ka.kaa
11    L HL     sa.saa
12    L HL     ku.kuu
13    H LL     ku.kuu
14    L HH     su.suu

Group IV (CV.CV)
No.   Pitch    Token
15    L H      ka.ka
16    L H      sa.sa
17    H L      sa.sa
18    H L      ka.ka
19    L H      ku.ku
20    H L      ku.ku
21    L H      su.su
22    H L      su.su

Perception Accuracy in Training - AV group: A three-way ANOVA was performed to examine
the development of perception accuracy and effects of the factors for the AV group. The
independent variables were week (2: Week1, Week2), talker (4: Talker2, 3, 4, 5), and stimulus
type. The dependent variable was perception accuracy in the eight training sessions.
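
The degrees of freedom reported for the within-subject effects in this chapter follow a simple pattern: the effect df is the product of (levels − 1) over the factors in the effect, and the error df is that value multiplied by the number of participants minus the number of groups. The sketch below reproduces this arithmetic. It assumes 16 learners per training group (32 in the two training groups combined), which is consistent with the reported df, and no sphericity correction; the function name is illustrative.

```python
from math import prod


def within_effect_df(levels, n_participants, n_groups=1):
    """Effect and error df for a within-subject effect (no sphericity correction).

    levels: number of levels of each within-subject factor in the effect.
    """
    df_effect = prod(k - 1 for k in levels)
    df_error = df_effect * (n_participants - n_groups)
    return df_effect, df_error


# Training-session analyses within one training group (assumed n = 16).
print(within_effect_df([2], 16))      # week
print(within_effect_df([4], 16))      # talker
print(within_effect_df([5], 16))      # stimulus type with five tokens
print(within_effect_df([4, 5], 16))   # Talker x Stimulus Type

# Pretest-posttest mixed ANOVAs with two training groups (assumed N = 32).
print(within_effect_df([2], 32, n_groups=2))  # time
print(within_effect_df([6], 32, n_groups=2))  # stimulus type with six tokens
```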
Regarding the tokens in Group I (CVV.CVV), results indicated no significant main
effects: week, $F_{\text{Week}}(1, 15) = 4.444$, $p = .052$; talker, $F_{\text{Talker}}(3, 45) = 2.042$,
$p = .121$; and stimulus type, $F_{\text{Type}}(4, 60) = 1.113$, $p = .350$. The mean accuracy
scores for the first week and second week were .88 and .93, respectively. The difference was
marginally significant. The mean accuracy scores for each talker were .90 (Talker2), .86
(Talker3), .92 (Talker4), and .94 (Talker5), and there were no significant differences among
them. Table 24 shows mean accuracy scores for each stimulus type in Group I. There were no
statistically significant differences among the five tokens.


Table 24: Mean accuracy scores of the five tokens in Group I (CVV.CVV) (AV group)

Stimulus Type (ST)   Tokens             Mean Accuracy Scores
1                    kaa.kaa (LH.HH)    .88
2                    kaa.kaa (HL.LL)    .92
3                    saa.saa (LH.HL)    .93
4                    kuu.kuu (LH.HH)    .91
5                    suu.suu (LH.HL)    .87

Although there were no significant main effects, the Talker x Stimulus Type interaction was
significant: $F(12, 180) = 5.835$, $p < .001$, $\eta_p^2 = .280$ (Figure 45). The Week x Talker
interaction was marginally significant, $F(3, 45) = 2.818$, $p = .050$. Results of simple effects
tests revealed that perception accuracy of ST5 produced by Talker2 and that of ST1 produced by
Talker3 were significantly lower in the first week than in the second week.


Figure 45: The comparison of perception accuracy of tokens in Group I for AV training group
[Graph: y-axis Accuracy; one line each for ST1 to ST5]

Regarding the tokens in Group II (CVV.CV), results indicated significant main effects of
talker, $F_{\text{Talker}}(3, 45) = 19.056$, $p < .001$, $\eta_p^2 = .560$, and stimulus type,
$F_{\text{Type}}(3, 45) = 17.328$, $p < .001$, $\eta_p^2 = .536$; however, week was not significant,
$F_{\text{Week}}(1, 15) = .958$, $p = .343$. The mean accuracy scores for the first week and second
week were .79 and .83, respectively. The second week had higher accuracy; however, the
difference was not significant. The mean accuracy scores for each talker were .93 (Talker2), .82
(Talker3), .63 (Talker4), and .89 (Talker5). Results of the pairwise comparisons with Bonferroni
correction indicated that Talker4 was different from Talker2 (p < .001), Talker3 (p = .002), as
well as Talker5 (p < .001). Thus, Talker4, a male talker, was the talker for whom L2 learners had
the most difficulty correctly perceiving vowel duration. Table 25 shows mean accuracy scores for
each stimulus type in Group II. Results of pairwise comparisons with Bonferroni correction
indicated that ST8 was different from ST6 (p < .001), ST7 (p < .001), and ST9 (p < .001). ST8
had the lowest accuracy among the four tokens.

Table 25: Mean accuracy scores of the four tokens in Group II (CVV.CV) (AV group)

Stimulus Type (ST)   Tokens           Mean Accuracy Scores
6                    kaa.ka (HL.L)    .87
7                    saa.sa (LH.H)    .87
8                    saa.sa (HL.L)    .66
9                    suu.su (LH.H)    .88

In addition to the main effects, the Week x Stimulus Type interaction was significant:
$F(3, 45) = 3.169$, $p = .033$, $\eta_p^2 = .174$ (Figure 46). Results of the simple effects tests
indicated that the difference between ST7 and ST8 was greater in the second week than in the
first week; the accuracy of ST8 improved in the second week although that of ST7 decreased. In
addition, the Talker x Stimulus Type interaction was significant: $F(9, 135) = 5.326$, $p < .001$,
$\eta_p^2 = .262$. Results of the simple effects tests indicated that the effects of the talker were
greater for ST8. The accuracy of ST8 was higher with Talker2, Talker3, and Talker5; however,
Talker4 revealed significantly lower accuracy as shown in the graph.


Figure 46: The comparison of perception accuracy of tokens in Group II for AV training group
[Panel 1. Graph: y-axis Accuracy; x-axis Week1, Week2; one line each for ST6 to ST9]
[Panel 2. Graph: y-axis Accuracy; x-axis Talker2, Talker3, Talker4, Talker5; one line each for
ST6 to ST9]

Regarding the tokens in Group III (CV.CVV), results indicated significant main effects of
talker, $F_{\text{Talker}}(3, 45) = 4.470$, $p = .008$, $\eta_p^2 = .230$, and stimulus type,
$F_{\text{Type}}(4, 60) = 3.982$, $p = .006$, $\eta_p^2 = .210$; however, week was not significant,
$F_{\text{Week}}(1, 15) = .283$, $p = .603$. None of the interactions was significant. The mean
accuracy scores for the first week and second week were .91 and .92, respectively; there was no
significant difference between the two weeks. The mean accuracy scores for each talker were .89
(Talker2), .91 (Talker3), .89 (Talker4), and .96 (Talker5). Results of the pairwise comparisons
with Bonferroni correction indicated that Talker5 was different from Talker2 (p = .019) and
Talker3 (p = .045). Thus, vowel duration in tokens produced by Talker5, a female talker, was
easier for L2 learners to perceive correctly than in tokens produced by Talker2, another female
talker, and Talker3, a male talker. Table 26 shows mean accuracy scores for each token in
Group III. Results of pairwise comparisons with Bonferroni correction did not indicate any
significant differences among the five tokens; however, ST11 had relatively lower accuracy than
the other four tokens.

Table 26: Mean accuracy scores of the five tokens in Group III (CV.CVV) (AV group)

Stimulus Type (ST)   Tokens           Mean Accuracy Scores
10                   ka.kaa (L.HH)    .94
11                   sa.saa (L.HL)    .84
12                   ku.kuu (L.HL)    .92
13                   ku.kuu (H.LL)    .95
14                   su.suu (L.HH)    .91

Regarding the tokens in Group IV (CV.CV), results indicated significant main effects of
stimulus type, $F_{\text{Type}}(7, 105) = 2.717$, $p = .012$, $\eta_p^2 = .153$, and week,
$F_{\text{Week}}(1, 15) = 6.363$, $p = .023$, $\eta_p^2 = .298$; however, talker was not significant,
$F_{\text{Talker}}(3, 45) = .884$, $p = .456$. None of the interactions was significant. The mean
accuracy scores for the first week and second week were .91 and .95, respectively; perception
accuracy for the second week was significantly higher than for the first week. Thus, it was
concluded that there was a significant development of accuracy in the second week. The mean
accuracy scores for each talker were .93 (Talker2), .92 (Talker3), .91 (Talker4), and .95
(Talker5); there were no significant differences among the four talkers. Table 27 shows mean
accuracy scores for each token in Group IV. Although the main effect of stimulus type was
significant, results of pairwise comparisons with Bonferroni correction did not indicate any
significant differences among the eight tokens. However, ST18 and ST22 revealed relatively
lower accuracy than the other six tokens.

Table 27: Mean accuracy scores of the eight tokens in Group IV (CV.CV) (AV group)

Stimulus Type (ST)   Tokens         Mean Accuracy Scores
15                   ka.ka (L.H)    .96
16                   ka.ka (H.L)    .91
17                   sa.sa (L.H)    .97
18                   sa.sa (H.L)    .88
19                   ku.ku (L.H)    .96
20                   ku.ku (H.L)    .88
21                   su.su (L.H)    .95
22                   su.su (H.L)    .93

Perception Accuracy in Training – A-only Group: A three-way ANOVA was performed to
examine the development of perception accuracy and effects of the factors for the A-only group.
The independent variables were week (2: Week1, Week2), talker (4: Talker2, 3, 4, 5), and
stimulus type. The dependent variable was perception accuracy in the eight training sessions.
Regarding the tokens in Group I (CVV.CVV), results indicated a significant main effect
of week, $F_{\text{Week}}(1, 15) = 6.310$, $p = .024$, $\eta_p^2 = .296$; however, talker,
$F_{\text{Talker}}(3, 45) = .823$, $p = .488$, and stimulus type, $F_{\text{Type}}(4, 60) = 1.919$,
$p = .119$, were not significant. The mean accuracy scores for the first week and second week
were .83 and .88, respectively; there was significant development of accuracy from the first week
to the second week. The mean accuracy scores for each talker were .87 (Talker2), .83
(Talker3), .84 (Talker4), and .88 (Talker5), and there were no significant differences among
them. Table 28 shows mean accuracy scores for each stimulus type in Group I. ST1 had
relatively higher accuracy and ST3 had relatively lower accuracy; however, there were no
significant differences among the five tokens.

Table 28: Mean accuracy scores of the five tokens in Group I (CVV.CVV) (A-only group)

Stimulus Type (ST)   Tokens             Mean Accuracy Scores
1                    kaa.kaa (LH.HH)    .91
2                    kaa.kaa (HL.LL)    .84
3                    saa.saa (LH.HL)    .80
4                    kuu.kuu (LH.HH)    .86
5                    suu.suu (LH.HL)    .87

In addition to the significant main effect of week, the Talker x Stimulus Type interaction was
significant: $F(12, 180) = 2.834$, $p = .001$, $\eta_p^2 = .159$. In addition, the Week x Talker x
Stimulus Type interaction was significant: $F(12, 180) = 1.815$, $p = .049$, $\eta_p^2 = .108$
(Figure 47). Simple effects tests were performed to analyze the three-way interaction, and results
revealed that perception accuracy of ST4 produced by Talker4 in the first week was significantly
lower. In addition, perception accuracy of ST5 produced by Talker3 in the first week was lower;
however, it improved in the second week.


Figure 47: The comparison of perception accuracy of tokens in Group I (CVV.CVV) for A-only
training group
[Graph: y-axis Accuracy; one line each for ST1 to ST5]

Regarding the tokens in Group II (CVV.CV), results indicated significant main effects of
talker, $F_{\text{Talker}}(3, 45) = 15.527$, $p < .001$, $\eta_p^2 = .509$, and stimulus type,
$F_{\text{Type}}(3, 45) = 7.242$, $p < .001$, $\eta_p^2 = .326$; however, week was not significant,
$F_{\text{Week}}(1, 15) = 3.412$, $p = .085$. The mean accuracy scores for the first week and
second week were .77 and .82, respectively. The second week had higher accuracy; however, the
difference was not significant. The mean accuracy scores for each talker were .91 (Talker2), .84
(Talker3), .58 (Talker4), and .88 (Talker5). Results of the pairwise comparisons with Bonferroni
correction indicated that Talker4 was different from Talker2 (p = .001), Talker3 (p = .008), and
Talker5 (p < .002). Thus, Talker4 was the talker for whom L2 learners had the most difficulty
correctly perceiving vowel duration. Table 29 shows mean accuracy scores for each stimulus type
in Group II. Results of pairwise comparisons with Bonferroni correction indicated that ST8 was
different from ST6 (p < .009), ST7 (p < .011), and ST9 (p < .002). ST8 had the lowest accuracy
among the four tokens.

Table 29: Mean accuracy scores of the four tokens in Group II (CVV.CV) (A-only group)

Stimulus Type (ST)    Tokens              Mean Accuracy Scores
6                     kaa.ka (HL.L)       .85
7                     saa.sa (LH.H)       .86
8                     saa.sa (HL.L)       .65
9                     suu.su (LH.H)       .84
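
The pairwise comparisons in this section use a Bonferroni correction. As a rough illustration of the bookkeeping only (not the procedure actually run for the study), a factor with four levels, such as the four talkers or the four Group II tokens, yields six pairwise comparisons; each can be tested at alpha divided by six, or equivalently each p value can be multiplied by six and capped at 1. The raw p values below are invented for illustration, and the text does not state whether the reported p values are already adjusted.

```python
from itertools import combinations


def bonferroni(raw_p, alpha=0.05):
    """Return the per-comparison alpha and the Bonferroni-adjusted p values."""
    m = len(raw_p)  # number of comparisons in the family
    per_test_alpha = alpha / m
    adjusted = {pair: min(1.0, p * m) for pair, p in raw_p.items()}
    return per_test_alpha, adjusted


talkers = ["Talker2", "Talker3", "Talker4", "Talker5"]
pairs = list(combinations(talkers, 2))  # four levels give six pairs

# Hypothetical unadjusted p values, invented for illustration only.
hypothetical_raw_p = dict(zip(pairs, [0.40, 0.0002, 0.60, 0.0013, 0.35, 0.0003]))

per_test_alpha, adjusted = bonferroni(hypothetical_raw_p)
significant = [pair for pair, p in adjusted.items() if p < 0.05]

print(f"comparisons: {len(pairs)}, per-comparison alpha: {per_test_alpha:.4f}")
for pair in significant:
    print(pair, round(adjusted[pair], 4))
```

With these made-up inputs, only the pairs involving Talker4 remain below .05 after adjustment, which mirrors the pattern described for the talker comparisons.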

In addition to the main effects, the Talker x Stimulus Type interaction was significant: F(9, 135)
= 4.659, p < .001, ηp² = .237 (Figure 48). Results of the simple effects tests indicated that the
effect of talker was greater for ST8. The accuracy of ST8 was higher with Talker2,
Talker3, and Talker5; however, it was significantly lower with Talker4.

Figure 48: The comparison of perception accuracy of tokens in Group II (CVV.CV) for A-only
training group
[Line graph. Y-axis: Accuracy (0 to 1.2). X-axis: Talker2, Talker3, Talker4, Talker5. Series: ST6, ST7, ST8, ST9.]

Regarding the tokens in Group III (CV.CVV), results indicated significant main effects of
talker, FTalker(3, 45) = 3.425, p = .025, ηp² = .186, and stimulus type, FType(4, 60) = 8.788,
p < .001, ηp² = .369; however, week was not significant, FWeek(1, 15) = .516, p = .484. The mean
accuracy scores for the first week and second week were .90 and .91 respectively; there was no
difference between the two weeks. The mean accuracy scores for each talker were .86
(Talker2), .92 (Talker3), .91 (Talker4), and .95 (Talker5). Results of the pairwise comparisons
with Bonferroni correction did not find significant differences among the four talkers; however,
the difference between Talker2 and Talker5 was approaching significance (p = .081). Table 30
shows mean accuracy scores for each token in Group III. Results of pairwise comparisons with
Bonferroni correction indicated that ST11 was significantly different from ST10 (p < .001),
ST12 (p = .015), and ST14 (p = .047). Thus, the token with L.HL pitch and the combination of
consonant /s/ and vowel /a/ was more difficult for the learners to perceive correctly than the
tokens with the L.HH pitch and /ka/ or /su/ as well as the one with the L.HL pitch and /ku/.

Table 30: Mean accuracy scores of the five tokens in Group III (CV.CVV) (A-only group)

Stimulus Type (ST)    Tokens              Mean Accuracy Scores
10                    ka.kaa (L.HH)       .98
11                    sa.saa (L.HL)       .81
12                    ku.kuu (L.HL)       .94
13                    ku.kuu (H.LL)       .89
14                    su.suu (L.HH)       .91

In addition to the main effects above, the Talker x Stimulus Type interaction was significant:
F(12, 180) = 2.792, p = .002, ηp² = .157 (Figure 49). Results of simple effects tests revealed that
the differences between ST11 and ST13 were greater with Talker4 than with the other talkers.
The learners demonstrated significantly lower accuracy for ST11 when it was produced by
Talker4. In general, Figure 49 shows that accuracy for ST10 and ST12 was much less variable
across talkers compared to ST11, ST13, and ST14.

Figure 49: The comparison of perception accuracy of tokens in Group III (CV.CVV) for A-only
training group
[Line graph. Y-axis: Accuracy (0.6 to 1). X-axis: Talker2, Talker3, Talker4, Talker5. Series: ST10, ST11, ST12, ST13, ST14.]

Regarding the tokens in Group IV (CV.CV), results indicated significant main effects of
stimulus type, FType(7, 105) = 5.770, p < .001, ηp² = .278, and talker, FTalker(3, 45) = 3.431,
p = .025, ηp² = .186; however, week was not significant, FWeek(1, 15) = .460, p = .508. The
mean accuracy scores for the first week and second week were .89 and .90
respectively. Accuracy in the second week was slightly higher than in the first week; however,
the difference was not significant. The mean accuracy scores for each talker were .86
(Talker2), .93 (Talker3), .86 (Talker4), and .92 (Talker5). Results of pairwise comparisons with
Bonferroni correction did not indicate any significant differences among the four talkers;
however, Talker3 and Talker5 showed relatively higher accuracy than the other two talkers.
Table 31 shows mean accuracy scores for each token in Group IV. Results of pairwise
comparisons with Bonferroni correction indicated a significant difference between
ST16 and ST21 (p = .028). Thus, ST21 (L.H) was significantly easier for L2 learners to
perceive correctly than ST16 (H.L).

Table 31: Mean accuracy scores of the eight tokens in Group IV (CV.CV) (A-only group)

Stimulus Type (ST)    Tokens              Mean Accuracy Scores
15                    ka.ka (L.H)         .95
16                    ka.ka (H.L)         .83
17                    sa.sa (L.H)         .92
18                    sa.sa (H.L)         .81
19                    ku.ku (L.H)         .96
20                    ku.ku (H.L)         .78
21                    su.su (L.H)         .98
22                    su.su (H.L)         .90

In addition to the main effects, the Week x Talker interaction was significant: F(3, 45) = 5.293,
p = .003, ηp² = .261 (Figure 50). Results of the simple effects tests indicated that the difference
between Talker3 and Talker4 was greater in the second week, compared to the first week. The
learners demonstrated lower accuracy in correctly identifying the vowel duration of tokens
produced by Talker3 in the second week. The Talker x Stimulus Type interaction was also
significant: F(21, 315) = 1.843, p = .014, ηp² = .109. Results of the simple effects tests
indicated that the perception accuracy for ST20 was highest with Talker4 and lowest with
Talker4.

Figure 50: The comparisons of perception accuracy of tokens in Group IV (CV.CV) for A-only
training group
[Two line graphs. First panel: Y-axis Accuracy (0.8 to 1); X-axis Week1, Week2; one series per talker. Second panel: Y-axis Accuracy (0.6 to 1); X-axis Talker2, Talker3, Talker4, Talker5; series ST15 to ST22.]

Perception RT in Training - AV Group: A three-way ANOVA was performed to examine the
development of perception RT and effects of the factors for the AV group. The independent
variables were week (2: Week1, Week2), talker (4: Talker2, 3, 4, 5), and stimulus type. The
dependent variable was perception RT in the eight training sessions.
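
The degrees of freedom attached to the F ratios in these training analyses follow from the design. As a hedged sketch, assuming 16 learners per training group and that week, talker, and stimulus type are all within-subject factors (both inferred from the reported values such as FWeek(1, 15)), the effect df is the product of (levels - 1) over the factors involved, and the error df is that product times (n - 1). The helper name is hypothetical.

```python
from math import prod


def rm_anova_df(levels, n_subjects):
    """Degrees of freedom for a within-subject effect in a one-group design.

    levels: number of levels of each factor involved in the effect.
    """
    df_effect = prod(k - 1 for k in levels)
    df_error = df_effect * (n_subjects - 1)
    return df_effect, df_error


N = 16            # assumed number of learners in one training group
WEEK, TALKER = 2, 4
ST_GROUP_I = 5    # five stimulus types in Group I (CVV.CVV)
ST_GROUP_IV = 8   # eight stimulus types in Group IV (CV.CV)

print("week:", rm_anova_df([WEEK], N))
print("talker:", rm_anova_df([TALKER], N))
print("stimulus type (Group I):", rm_anova_df([ST_GROUP_I], N))
print("talker x stimulus type (Group I):", rm_anova_df([TALKER, ST_GROUP_I], N))
print("talker x stimulus type (Group IV):", rm_anova_df([TALKER, ST_GROUP_IV], N))
```

Under these assumptions the sketch reproduces the reported pairs such as (1, 15), (3, 45), (4, 60), (12, 180), and (21, 315).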
Regarding the tokens in Group I (CVV.CVV), results indicated significant main effects
of week, FWeek(1, 15) = 19.363, p = .001, ηp² = .563, and stimulus type, FType(4, 60) = 7.395,
p < .001, ηp² = .330; however, talker was not significant, FTalker(3, 45) = 2.340, p = .086. The
mean RT scores for the first week and second week were 2733.71 milliseconds and 2399.91
milliseconds respectively. The RT in the second week was significantly faster than that in
the first week. The mean RT scores for each talker were 2687.28 milliseconds (Talker2),
2598.64 milliseconds (Talker3), 2576.61 milliseconds (Talker4), and 2404.71 milliseconds
(Talker5), and there were no significant differences among them. Table 32 shows mean RT
scores for each stimulus type in Group I.

Table 32: Mean RT scores of the five tokens in Group I (CVV.CVV) (AV group)

Stimulus Type (ST)    Tokens              Mean RT (milliseconds)
1                     kaa.kaa (LH.HH)     2342.13
2                     kaa.kaa (HL.LL)     2555.11
3                     saa.saa (LH.HL)     2693.26
4                     kuu.kuu (LH.HH)     2429.35
5                     suu.suu (LH.HL)     2814.21

Results of pairwise comparisons with Bonferroni correction indicated that ST5 was significantly
different from ST1 (p = .002) as well as ST4 (p = .001). In addition, ST4 was significantly
different from ST3 (p = .038). The learners’ response latency for ST5 was significantly slower
than for ST1 and ST4. ST5 shares the same pitch pattern as ST3 but involves /su/ rather than
/sa/. Also, the response latency for ST3 was slower than for ST4.
In addition to the main effects, the Week x Talker interaction, F(3, 45) = 4.985, p = .005,
ηp² = .249, the Week x Stimulus Type interaction, F(12, 180) = 9.305, p < .001, ηp² = .383, and
the Week x Talker x Stimulus Type interaction, F(12, 180) = 1.911, p = .036, ηp² = .113, were
significant (Figure 51). Simple effects tests were performed in order to analyze the three-way
interaction, and results indicated that the RT difference between ST1 and ST5 was greater for
Talker2, compared to the other talkers, in the first week. Thus, the learners demonstrated slower
RTs for ST5 produced by Talker2 than for ST1 in the first week.

Figure 51: The comparison of perception RT of tokens in Group I (CVV.CVV) for AV training
group
[Line graph. Y-axis: RT in milliseconds (1000 to 5000). Series: ST1, ST2, ST3, ST4, ST5.]

Regarding the tokens in Group II (CVV.CV), results indicated significant main effects of
week, FWeek(1, 15) = 5.105, p = .039, ηp² = .254, and stimulus type, FType(3, 45) = 3.139,
p = .034, ηp² = .173; however, talker was not significant, FTalker(3, 45) = 2.400, p = .080. The
mean RTs for the first week and second week were 2916.06 milliseconds and 2690.68
milliseconds respectively; the second week had significantly faster RTs. The mean RTs for each
talker were 2696.95 milliseconds (Talker2), 2915.54 milliseconds (Talker3), 2934.62
milliseconds (Talker4), and 2666.37 milliseconds (Talker5). The two female talkers (Talker2 and
Talker5) had relatively faster RTs than the two male talkers (Talker3 and Talker4); however, the
difference was not significant. Table 33 shows mean RTs for stimuli in Group II. Results of
pairwise comparisons with Bonferroni correction indicated that ST7 was significantly different
from ST9 (p = .034); the RT for ST9 was significantly faster than that for ST7.

Table 33: Mean RT scores of the four tokens in Group II (CVV.CV) (AV group)

Stimulus Type (ST)    Tokens              Mean RT (milliseconds)
6                     kaa.ka (HL.L)       2824.40
7                     saa.sa (LH.H)       2963.51
8                     saa.sa (HL.L)       2829.56
9                     suu.su (LH.H)       2596.02

In addition to the main effects, the Talker x Stimulus Type interaction was significant: F(9, 135)
= 3.786, p < .001, ηp² = .202 (Figure 52). Results of the simple effects tests indicated that the
differences between ST6 and ST8 were greater for Talker3 and Talker4, compared to Talker5.
The learners demonstrated significantly slower RTs for ST8 produced by Talker3 compared to
ST6. On the other hand, the learners showed significantly longer RTs for ST6 produced by
Talker4 compared to ST8.

Figure 52: The comparison of perception RT of tokens in Group II (CVV.CV) for AV training
group
[Line graph. Y-axis: RT in milliseconds (2200 to 3400). X-axis: talker. Series: ST6, ST7, ST8, ST9.]

Regarding the tokens in Group III (CV.CVV), results indicated significant main effects of
week, FWeek(1, 15) = 5.525, p = .033, ηp² = .269, and talker, FTalker(3, 45) = 6.417, p = .001,
ηp² = .300; however, stimulus type was not significant, FType(4, 60) = 2.319, p = .067. None of
the interactions was significant. The mean RT scores for the first week and second week were
2819.68 milliseconds and 2610.72 milliseconds respectively; the RT in the second week was
significantly faster than in the first week. The mean RTs for each talker were 2957.73 milliseconds
(Talker2), 2644.70 milliseconds (Talker3), 2691.86 milliseconds (Talker4), and 2566.53
milliseconds (Talker5). Results of the pairwise comparisons with Bonferroni correction
indicated that Talker2 was different from Talker3 (p = .021) and Talker5 (p = .004). Thus, the
learners had longer response latency for Talker2, a female talker, compared to Talker3, a male
talker, and Talker5, another female talker. Table 34 shows mean RTs for each token in Group III.
ST10 showed a relatively faster RT than the other four tokens; however, the difference was not
significant.

Table 34: Mean RT scores of the five tokens in Group III (CV.CVV) (AV group)

Stimulus Type (ST)    Tokens              Mean RT (milliseconds)
10                    ka.kaa (L.HH)       2530.57
11                    sa.saa (L.HL)       2790.12
12                    ku.kuu (L.HL)       2722.75
13                    ku.kuu (H.LL)       2850.44
14                    su.suu (L.HH)       2682.14

Regarding the tokens in Group IV (CV.CV), results indicated significant main effects of
week, FWeek(1, 15) = 14.181, p = .002, ηp² = .486, talker, FTalker(3, 45) = 5.452, p = .003,
ηp² = .267, and stimulus type, FType(7, 105) = 6.041, p < .001, ηp² = .287. The mean RT
scores for the first week and second week were 2311.82 milliseconds and 1942.32 milliseconds
respectively; the RT in the second week was significantly faster than in the first week. The mean RT
scores for each talker were 2358.48 milliseconds (Talker2), 2074.56 milliseconds (Talker3),
2111.23 milliseconds (Talker4), and 1964.01 milliseconds (Talker5). The results of pairwise
comparisons with Bonferroni correction revealed that Talker2 was significantly different from
Talker5 (p = .003). The learners demonstrated faster RTs for tokens produced by Talker5 than
for those produced by Talker2. Table 35 shows mean RTs for each token in Group IV. Results of
pairwise comparisons with Bonferroni correction indicated that (1) ST20 was significantly
different from ST15 (p = .004), ST17 (p < .001), and ST19 (p = .002); and (2) ST18 was
significantly different from ST15 (p = .029), ST17 (p = .001), and ST19 (p = .003). Thus, RTs for
both ST18 and ST20 were significantly slower than for ST15, ST17, and ST19; the tokens with
the H.L pitch pattern tended to have longer RTs than the ones with the L.H pitch pattern.

Table 35: Mean RT scores of the eight tokens in Group IV (CV.CV) (AV group)

Stimulus Type (ST)    Tokens              Mean RT (milliseconds)
15                    ka.ka (L.H)         1947.84
16                    ka.ka (H.L)         2167.38
17                    sa.sa (L.H)         1885.06
18                    sa.sa (H.L)         2394.39
19                    ku.ku (L.H)         2003.44
20                    ku.ku (H.L)         2437.48
21                    su.su (L.H)         2047.84
22                    su.su (H.L)         2133.16
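
The tendency for H.L tokens to draw longer RTs than L.H tokens can be seen by averaging the Table 35 means within each pitch pattern. This is a descriptive sketch only, using the token means from the table; it is not one of the inferential tests reported in the text, and the variable names are hypothetical.

```python
from statistics import mean

# Mean RTs (ms) per token, copied from Table 35 (Group IV, AV group).
table_35 = {
    ("ka.ka", "L.H"): 1947.84,
    ("ka.ka", "H.L"): 2167.38,
    ("sa.sa", "L.H"): 1885.06,
    ("sa.sa", "H.L"): 2394.39,
    ("ku.ku", "L.H"): 2003.44,
    ("ku.ku", "H.L"): 2437.48,
    ("su.su", "L.H"): 2047.84,
    ("su.su", "H.L"): 2133.16,
}

# Collect the token means under their pitch pattern.
by_pitch = {}
for (_token, pitch), rt in table_35.items():
    by_pitch.setdefault(pitch, []).append(rt)

pitch_means = {pitch: mean(rts) for pitch, rts in by_pitch.items()}
difference = pitch_means["H.L"] - pitch_means["L.H"]

for pitch, value in pitch_means.items():
    print(f"{pitch}: {value:.0f} ms")
print(f"H.L minus L.H: {difference:.0f} ms")
```

On these table values the H.L tokens average roughly 2283 ms against roughly 1971 ms for the L.H tokens, a gap of about 312 ms.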

In addition to the main effects above, the Week x Talker interaction was significant:
F(3, 45) = 6.672, p = .001, ηp² = .308 (Figure 53). Results of simple effects tests indicated that the
difference in RT between Talker2 and other talkers was significant in the first week, compared to
the second week. The learners demonstrated slower RTs for tokens produced by Talker2 in the
first week; however, these RTs shortened significantly in the second week.

Figure 53: The comparison of perception RT of tokens in Group IV (CV.CV) for AV training
group
[Line graph. Y-axis: RT in milliseconds (1800 to 3000). X-axis: Week1, Week2. Series: Talker2, Talker3, Talker4, Talker5.]

Perception RT in Training - A-only Group: A three-way ANOVA was performed to examine the
development of perception RT and effects of the factors for the A-only group. The independent
variables were week (2: Week1, Week2), talker (4: Talker2, 3, 4, 5), and stimulus type. The
dependent variable was perception RT in the eight training sessions.
Regarding the tokens in Group I (CVV.CVV), results indicated significant main effects
of week, FWeek(1, 15) = 8.683, p = .010, ηp² = .367, and stimulus type, FType(4, 60) = 6.661,
p < .001, ηp² = .308; however, talker was not significant, FTalker(3, 45) = 1.720, p = .176. The
mean RT scores for the first week and second week were 3004.08 milliseconds and 2645.49
milliseconds respectively. The RT in the second week was significantly faster than that in the
first week. The mean RT scores for each talker were 2967.48 milliseconds (Talker2), 2927.73
milliseconds (Talker3), 2787.69 milliseconds (Talker4), and 2616.23 milliseconds (Talker5), and
there were no significant differences among them. Table 36 shows mean RT scores for each
stimulus type in Group I. Results of pairwise comparisons with Bonferroni correction indicated
that ST4 was significantly different from ST3 (p = .010) and ST5 (p = .046). The learners’
response latency for ST4 was significantly faster than for ST3 and ST5. Thus, the learners
responded more quickly to the token with the LH.HH pitch, the consonant /k/, and the vowel /u/
than to the tokens with the LH.HL pitch, the consonant /s/, and the vowel /a/ or /u/.

Table 36: Mean RT scores of the five tokens in Group I (CVV.CVV) (A-only group)

Stimulus Type (ST)    Tokens              Mean RT Scores (milliseconds)
1                     kaa.kaa (LH.HH)     2342.13
2                     kaa.kaa (HL.LL)     2555.11
3                     saa.saa (LH.HL)     2693.26
4                     kuu.kuu (LH.HH)     2429.35
5                     suu.suu (LH.HL)     2814.21

In addition to the main effects, the Week x Talker interaction was significant, F(3, 45) = 4.312,
p = .011, ηp² = .216 (Figure 54). The results of simple effects tests revealed that the difference
between Talker2 and the other talkers was significant in the first week compared to the second
week. The learners showed longer RTs for tokens produced by Talker2 than for the other three
talkers in the first week; however, the difference was not significant in the second week because
the RT for Talker2 was significantly shortened in the second week. In addition, the Talker x
Stimulus Type interaction, F(12, 180) = 4.387, p < .001, ηp² = .226, was significant. Results of
simple effects tests indicated that (1) the differences between ST5 and ST1 as well as ST4 were
greater with Talker2, compared to the other three talkers; and (2) the differences between ST3
and ST1 as well as ST4 were greater with Talker5. Thus, the learners showed slower RTs with
ST5 produced by Talker2 and with ST3 produced by Talker5.

Figure 54: The comparisons of perception RT of tokens in Group I (CVV.CVV) for A-only
training group
[Two line graphs. First panel: Y-axis RT in milliseconds (2400 to 3400); X-axis Week1, Week2; series Talker2, Talker3, Talker4, Talker5. Second panel: Y-axis RT in milliseconds (2000 to 3800); X-axis Talker2, Talker3, Talker4, Talker5; series ST1 to ST5.]

Regarding the tokens in Group II (CVV.CV), results indicated a significant main effect of
talker, FTalker(3, 45) = 3.410, p = .025, ηp² = .185; however, week, FWeek(1, 15) = 3.970,
p = .065, and stimulus type, FType(3, 45) = 1.644, p = .193, were not significant. The mean RT
scores for the first week and second week were 3311.05 milliseconds and 3021.92 milliseconds
respectively. The second week showed faster RTs than the first week; however, the difference
was not significant. The mean RT scores for each talker were 3163.13 milliseconds (Talker2),
3391.39 milliseconds (Talker3), 3226.11 milliseconds (Talker4), and 2885.31 milliseconds
(Talker5). Results of pairwise comparisons with Bonferroni correction did not detect significant
differences among the four talkers; however, the difference between Talker3 and Talker5 was
approaching significance (p = .075). Table 37 shows mean RT scores for each stimulus type in
Group II. There were no significant differences among the four stimulus types.

Table 37: Mean RT scores of the four tokens in Group II (CVV.CV) (A-only group)

Stimulus Type (ST)    Tokens              Mean RT Scores (milliseconds)
6                     kaa.ka (HL.L)       3146.13
7                     saa.sa (LH.H)       3196.09
8                     saa.sa (HL.L)       3327.49
9                     suu.su (LH.H)       2996.23

In addition to the main effects, the Talker x Stimulus Type interaction was significant: F(9, 135)
= 2.908, p = .004, ηp² = .162 (Figure 55). Results of the simple effects tests indicated that the
difference between ST8 and ST9 was greatest for Talker5.

Figure 55: The comparison of perception RT of tokens in Group II (CVV.CV) for A-only
training group
[Line graph. Y-axis: RT in milliseconds (2000 to 4000). X-axis: Talker2, Talker3, Talker4, Talker5. Series: ST6, ST7, ST8, ST9.]

Regarding the tokens in Group III (CV.CVV), results indicated significant main effects of
talker, FTalker(3, 45) = 7.610, p < .001, ηp² = .337, and stimulus type, FType(4, 60) = 8.414,
p < .001, ηp² = .359; however, week was not significant, FWeek(1, 15) = 3.185, p = .095.
None of the interactions was significant. The mean RT scores for the first week and second
week were 2945.29 milliseconds and 2736.69 milliseconds respectively. The RT for the second week
was faster than for the first week; however, the difference was not significant. The mean RT scores
for each talker were 3047.54 milliseconds (Talker2), 2919.75 milliseconds (Talker3), 2797.11
milliseconds (Talker4), and 2599.56 milliseconds (Talker5). Results of the pairwise
comparisons with Bonferroni correction indicated that Talker5 was different from Talker2
(p = .008) and Talker3 (p = .038). Thus, the learners had faster response latency for Talker5, a
female talker, compared to Talker2, another female talker, and Talker3, a male talker. Table 38
shows the mean RT scores for each token in Group III. Results of pairwise comparisons with
Bonferroni correction revealed that (1) ST10 was significantly different from ST11 (p = .001),
ST12 (p = .003), and ST13 (p = .003); and (2) ST11 was significantly different from ST12
(p = .024). The differences between ST12 and ST13 (p = .051) as well as ST10 and ST14
(p = .055) were marginally significant. Thus, the learners demonstrated faster RTs for tokens with
the L.HH pitch than for those with the L.HL or H.LL pitch patterns. Also, with the L.HL pitch
pattern, the learners demonstrated faster RTs when the combination of consonant and vowel
was /ku/ than when it was /sa/.

Table 38: Mean RT scores of the five tokens in Group III (CV.CVV) (A-only group)

Stimulus Type (ST)    Tokens              Mean RT (milliseconds)
10                    ka.kaa (L.HH)       2464.07
11                    sa.saa (L.HL)       3051.66
12                    ku.kuu (L.HL)       2730.14
13                    ku.kuu (H.LL)       3061.66
14                    su.suu (L.HH)       2897.41

Regarding the tokens in Group IV (CV.CV), results indicated significant main effects of
week, FWeek(1, 15) = 29.426, p < .001, ηp² = .662, talker, FTalker(3, 45) = 6.095, p = .001,
ηp² = .289, and stimulus type, FType(7, 105) = 5.372, p < .001, ηp² = .264. The mean RT scores for
the first week and second week were 2587.11 milliseconds and 2135.21 milliseconds
respectively; RTs for the second week were significantly faster than for the first week. The mean
RT scores for each talker were 2691.94 milliseconds (Talker2), 2184.36 milliseconds (Talker3),
2399.96 milliseconds (Talker4), and 2168.39 milliseconds (Talker5). The results of pairwise
comparisons with Bonferroni correction revealed that Talker2 was significantly different from
Talker3 (p = .018). The difference between Talker2 and Talker5 was marginally significant
(p = .051). The learners demonstrated faster RTs for tokens produced by Talker3 than for those
produced by Talker2. Table 39 shows mean RT scores for each token in Group IV. Results of
pairwise comparisons with Bonferroni correction indicated that the difference between ST17 and
ST18 was significant (p = .050). Thus, the learners demonstrated faster RTs for the token with
L.H pitch with /sa/ than for the one with H.L pitch with the same combination of consonant and
vowel.

Table 39: Mean RT scores of the eight tokens in Group IV (CV.CV) (A-only group)

Stimulus Type (ST)    Tokens              Mean RT Scores (milliseconds)
15                    ka.ka (L.H)         2119.91
16                    ka.ka (H.L)         2357.30
17                    sa.sa (L.H)         2112.29
18                    sa.sa (H.L)         2804.39
19                    ku.ku (L.H)         2320.82
20                    ku.ku (H.L)         2709.23
21                    su.su (L.H)         2135.37
22                    su.su (H.L)         2329.98

In addition to the main effects above, the Week x Talker interaction was significant:
F(3, 45) = 12.816, p < .001, ηp² = .461 (Figure 56). Results of simple effects tests indicated that the
differences in RTs between Talker2 and other talkers were significant in the first week,
compared to the second week. The learners demonstrated slower RTs for tokens produced by
Talker2 in the first week; however, these RTs shortened significantly in the second week. The
Talker x Stimulus Type interaction was also significant: F(21, 315) = 1.715, p = .027, ηp² = .103.
Results of the simple effects tests indicated that the differences between ST20 and ST15, ST19,
ST21, and ST22 were greater with Talker4 than with the other three talkers. Thus, the learners
demonstrated slower RTs when they identified ST20 produced by Talker4.

Figure 56: The comparisons of perception RT of tokens in Group IV (CV.CV) for A-only
training group
[Two line graphs. First panel: Y-axis RT in milliseconds (1800 to 3200); X-axis Week1, Week2; series Talker2, Talker3, Talker4, Talker5. Second panel: Y-axis RT in milliseconds (1500 to 3300); X-axis Talker2, Talker3, Talker4, Talker5; series ST15 to ST22.]

TG with novel tokens – Comparison of Production Accuracy: A production TG was also given in
order to assess whether the effects of perceptual training that had transferred to production could
be generalized to the production of novel tokens. The three raters who rated the pretest and the
posttest rated the TG, using the same procedures. Interrater reliability was checked using
the Pearson correlation coefficient. There was a significant positive correlation between Rater 1 and
Rater 2 (r = .915, p = .001, R² = .84), between Rater 1 and Rater 3 (r = .920, p = .001, R² = .85),
as well as between Rater 2 and Rater 3 (r = .961, p = .001, R² = .92); the correlations were strong.
Table 40 shows descriptive statistics for production accuracy scores in the pre-/post-tests and in
the TG for each training group; Table 41 below shows production errors that the learners made
during the TG.
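
The R² values given with the interrater correlations are the squares of the corresponding Pearson r values, that is, the proportion of variance shared between two raters. The short sketch below, with hypothetical variable names, squares the three reported correlations as an arithmetic check.

```python
# Pearson correlations between rater pairs, as reported in the text.
rater_correlations = {
    ("Rater 1", "Rater 2"): 0.915,
    ("Rater 1", "Rater 3"): 0.920,
    ("Rater 2", "Rater 3"): 0.961,
}

# Shared variance (coefficient of determination) is r squared.
shared_variance = {pair: round(r ** 2, 2) for pair, r in rater_correlations.items()}

for (first, second), r_squared in shared_variance.items():
    print(f"{first} vs {second}: R^2 = {r_squared:.2f}")
```

Rounded to two decimals, the squares match the reported values of .84, .85, and .92.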

Table 40: Descriptive Statistics (mean, SD) of the production accuracy in the pretest, posttest,
and TG
Stimulus
Type
CVV.CVV
CVV.CV
CV.CVV
CV.CV

Pretest

Posttest
A-only

AV

A-only

AV

70.31%
(27.72)
87.50%
(22.36)
60.94%
(37.60)
59.38%
(42.70)

78.13%
(25.62)
82.81%
(23.66)
51.56%
(37.05)
64.06%
(35.32)

75.56%
(23.21)
93.75%
(19.37)
95.31%
(10.08)
92.19%
(17.60)

75.00%
(24.15)
100.00%
(.00)
89.06%
(30.23)
95.31%
(10.08)

TG
AV

A-only

87.50%
(20.64)
97.92%
(8.33)
93.75%
(18.13)
91.67%
(19.24)

85.42%
(27.13)
100.00%
(.00)
85.42%
(32.13)
97.92%
(8.33)

Table 41: Errors observed in the production data in Experiment 2 (TG)

Token with /e/    Errors     Number    Token with /a/    Errors     Number
seesee            seese      8         taataa            taata      3
seese             sese       1
sesee             seesee     1         tataa             tattaa     2
                  sessee     6                           tuutuu     3
sese              sesee      3         tata              tataa      2

The pretest scores were first compared to the TG scores using a mixed ANOVA in order
to examine whether there were any improvements in correctly producing vowel duration for the
novel tokens. Independent variables were test (2: Pretest, TG), token type (4: CVV.CVV,
CVV.CV, CV.CVV, CV.CV), and group type (2: AV, A-only); the dependent variable was
production accuracy in the pretest and TG. First, the tokens with /ka/ in the pretest and /ta/ in the
TG (a new consonant and a familiar vowel) were compared. The results of a mixed ANOVA
indicated significant main effects of test, FTest(1, 30) = 22.845, p < .001, ηp² = .432, and token
type, FType(3, 90) = 3.913, p = .011, ηp² = .115; however, group type was not significant,
FGroup(1, 30) = 2.028, p = .165. None of the interactions was significant. The mean
accuracy in the TG (.95) was higher than that in the pretest (.65), indicating improvement. In
addition, among the four token types, there was a significant difference between CVV.CV and
CV.CVV. The CVV.CV type had a higher mean accuracy (.86) than the CV.CVV type (.72);
therefore, CVV.CV was easier to produce.

Second, the tokens with /sa/ in the pretest and /se/ in the TG (a familiar consonant and a
new vowel) were compared. The results of a mixed ANOVA indicated a significant main effect
of test, FTest(1, 30) = 40.814, p < .001, ηp² = .576; however, token type, FType(3, 90) = 1.412,
p = .245, and group type, FGroup(1, 30) = .028, p = .864, were not significant. None of the
interactions was significant. The mean accuracy in the TG (.95) was higher than that in the
pretest (.64), indicating improvement.
Next, the production accuracy in the posttest and the TG were compared to examine
whether the two tests were comparable. Independent variables were test (2: Posttest, TG), token
type (4: CVV.CVV, CVV.CV, CV.CVV, CV.CV), and group type (2: AV, A-only); the
dependent variable was production accuracy in the posttest and TG. First, the tokens with /ka/
in the posttest and /ta/ in the TG (a new consonant and a familiar vowel) were compared. The
results of a mixed ANOVA indicated no significant main effects: test, FTest(1, 30) = .717,
p = .407, token type, FType(3, 90) = 1.725, p = .168, and group type, FGroup(1, 30) = 1.788,
p = .191. None of the interactions was significant. Since there was no significant difference
between the two tests, it was concluded that they were comparable.
Second, the tokens with /sa/ in the posttest and /se/ in the TG (a familiar consonant and a
new vowel) were compared. The results of a mixed ANOVA indicated a significant main effect
of token type, FType(3, 90) = 3.533, p = .018, ηp² = .105; however, test, FTest(1, 30) = 1.364,
p = .252, and group type, FGroup(1, 30) = 2.647, p = .114, were not significant. None of the
interactions was significant. Among the four token types, there was a significant difference
between CVV.CVV and CVV.CV (p = .029); CVV.CV was easier to produce than
CVV.CVV. Since there was no significant difference between the two tests, it was concluded
that they were comparable.

Overall Effects of TG (familiar and novel tokens) – Perception Accuracy: Tests of
generalizations (TGs) were given to the two experimental groups, in order to assess whether the
effects of perceptual training on correctly identifying duration of vowels could be generalized to
novel tokens (Appendix I) spoken by a familiar talker (TG1) and familiar tokens (i.e., tokens
used in testing; Appendix E) spoken by a novel talker (TG2). Table 42 shows descriptive
statistics of perception accuracy in the pre-/post-tests as well as in the TGs for each experimental
group.

Table 42: Descriptive Statistics for the perception accuracy in pretest, posttest, and two TGs

                         Pretest           Posttest          TG1 (novel tokens)    TG2 (novel voice)
Group     Sample Size    Mean % (SD)       Mean % (SD)       Mean % (SD)           Mean % (SD)
AV        16             68.75 (16.21)     96.53 (8.58)      93.36 (7.02)          92.71 (9.01)
A-only    16             71.18 (14.94)     87.50 (12.91)     89.06 (12.40)         88.89 (8.84)
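
As a descriptive aid to reading Table 42, the sketch below computes each group's gain in mean accuracy over its own pretest, in percentage points. It uses only the table means and hypothetical variable names; it says nothing about statistical significance, which is addressed by the mixed ANOVAs that follow.

```python
# Mean perception accuracy (%) per group and test, copied from Table 42.
table_42 = {
    "AV": {"pretest": 68.75, "posttest": 96.53, "TG1": 93.36, "TG2": 92.71},
    "A-only": {"pretest": 71.18, "posttest": 87.50, "TG1": 89.06, "TG2": 88.89},
}

# Gain over the pretest, in percentage points, for each later test.
gains = {
    group: {
        test: round(score - scores["pretest"], 2)
        for test, score in scores.items()
        if test != "pretest"
    }
    for group, scores in table_42.items()
}

for group, group_gains in gains.items():
    summary = ", ".join(f"{test}: +{gain:.2f}" for test, gain in group_gains.items())
    print(f"{group} gains over pretest -> {summary}")
```

On these means, the AV group gains about 24 to 28 points and the A-only group about 16 to 18 points across the posttest and the two TGs.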

The two TGs were compared with the pretest in order to examine whether there were any
improvements in correctly identifying vowel duration from the pretest to TGs. In order to
examine the overall effects of pretest to TG1 (novel tokens), a mixed ANOVA was performed.
Independent variables were test (2: pretest, TG1) and group type (2: AV, A-only); the dependent
variable was perception accuracy. Results indicated a significant main effect of test, FTest(1, 30)
= 108.167, p < .001, ηp² = .783; however, group type was not significant, FGroup(1, 30) = .050,
p = .824. Perception accuracy of novel tokens in TG1 exceeded that in the pretest. The Test x
Training Modality interaction was not significant, F(1, 30) = 2.711, p = .110.
In order to examine the overall effects of pretest to TG2 (novel talker), a mixed ANOVA
was performed. Independent variables were test (2: pretest, TG2) and training type (2: AV,
A-only); the dependent variable was perception accuracy. Results indicated a significant main
effect of test, FTest(1, 30) = 88.889, p < .001, ηp² = .748; perception accuracy also increased for
stimuli produced by a new voice. However, group type was not significant, FGroup(1, 30) = .032,
p = .860. The Test x Training Modality interaction was also not significant, F(1, 30) = 2.000,
p = .168.
In addition, the two TGs were compared with the posttest in order to examine whether the
posttest improvement following training could be generalized to novel tokens and a new talker.
In order to examine whether the posttest and TG1 were comparable, a mixed ANOVA was
performed. Independent variables were test (2: posttest, TG1) and group type (2: AV, A-only);
the dependent variable was perception accuracy. Results indicated no significant main effects of
test, FTest(1, 30) = .438, p = .513, or group type, FGroup(1, 30) = 3.586, p = .068. The Test x
Training Modality interaction was also not significant, F(1, 30) = 3.800, p = .061.
In order to examine whether the posttest and TG2 were comparable, a mixed ANOVA
was performed. Independent variables were test (2: posttest, TG2) and group type (2: AV,
A-only); the dependent variable was perception accuracy. Results indicated no significant main
effect of test, FTest(1, 30) = .786, p = .382; however, group type was marginally significant,
FGroup(1, 30) = 3.890, p = .058. The Test x Training Modality interaction was approaching
significance, F(1, 30) = 3.610, p = .067.
Thus, overall, there was accuracy development from the pretest to the TG1 (novel tokens)
and TG2 (novel voice). In addition, the two TGs were comparable to the posttest; therefore, the
training effects were generalized to novel tokens and a novel talker. In order to examine the
effects of pitch pattern, preceding consonant, and vowel type, tokens in TG1 were divided into
the three groups used earlier (see Table 43). Each token in TG1 contained either /s/ (familiar) +
/e/ (novel) or /t/ (novel) + /a/ (familiar) as its consonant/vowel combination. The tokens in TG1
were compared with those in the pretest/posttest in the following way.

Table 43: List of stimulus types in TG1

Stimulus     TG1 Token          Novel      Familiar   Pretest and Posttest   Group
Type (ST)                       Segment    Segment    Token
ST1          taa.taa (LH.HL)    t          a          kaa.kaa (LH.HL)        I (CVV.CVV)
ST2          see.see (LH.HH)    e          s          saa.saa (LH.HH)        I (CVV.CVV)
ST3          see.see (HL.LL)    e          s          saa.saa (HL.LL)        I (CVV.CVV)
ST4          taa.ta (LH.H)      t          a          kaa.ka (LH.H)          II (CVV.CV)
ST5          taa.ta (HL.L)      t          a          kaa.ka (HL.L)          II (CVV.CV)
ST6          see.se (LH.H)      e          s          suu.su (LH.H)          II (CVV.CV)
ST7          see.se (HL.L)      e          s          suu.su (HL.L)          II (CVV.CV)
ST8          ta.taa (H.LL)      t          a          ka.kaa (H.LL)          III (CV.CVV)
ST9          se.see (L.HH)      e          s          sa.saa (L.HH)          III (CV.CVV)
ST10         se.see (H.LL)      e          s          sa.saa (H.LL)          III (CV.CVV)

Comparing Accuracy in Pretest and TG1 (Novel Tokens): Perception accuracy in the pretest and
TG1 was compared using a mixed ANOVA in order to examine whether there were any
developments in identifying vowel duration for the novel tokens spoken by the familiar talker
(i.e., the talker in the training sessions). In the comparison between pretest and TG1,
independent variables were test (2; pretest, TG1), group type (2; AV and A-only), and stimulus
type (3 or 4 depending on the group); the dependent variable was perception accuracy. Regarding
the tokens in Group I (CVV.CVV), the results of a mixed ANOVA indicated significant main
effects of test, FTest(1, 30) = 65.574, p < .001, ƞp² = .686; however, stimulus type, FType(2, 60)
= 2.391, p = .100, and group type, FGroup(1, 30) = .000, p = 1.00, were not significant. The
mean accuracy scores of the pretest and TG1 were .53 and .95 respectively. Thus, there was
development of perception accuracy for the tokens in Group I. Table 44 below shows mean
accuracy scores for each token in Group I; there were no differences among them.

Table 44: Mean accuracy scores of tokens in Group I (CVV.CVV) in the comparison between
pretest and TG1

Stimulus     Pretest                              TG1
Type (ST)    Token              Mean Accuracy     Token              Mean Accuracy
ST1          kaa.kaa (LH.HL)    .44               taa.taa (LH.HL)    .97
ST2          saa.saa (LH.HH)    .66               see.see (LH.HH)    1.00
ST3          saa.saa (HL.LL)    .50               see.see (HL.LL)    .88
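The Group I means quoted in the text (.53 and .95) can be recovered from the per-token means in Table 44. The sketch below assumes that every token contributed the same number of responses, so that the unweighted mean of the token means equals the pooled mean; the variable names are illustrative.

```python
from statistics import mean

# Per-token mean accuracy for Group I (CVV.CVV), copied from Table 44.
pretest = {"ST1": 0.44, "ST2": 0.66, "ST3": 0.50}
tg1 = {"ST1": 0.97, "ST2": 1.00, "ST3": 0.88}

pretest_mean = mean(list(pretest.values()))
tg1_mean = mean(list(tg1.values()))
print(f"pretest: {pretest_mean:.2f}, TG1: {tg1_mean:.2f}")
```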

Regarding the tokens in Group II (CVV.CV), the results of a mixed ANOVA indicated
significant main effects of stimulus type, FType(3, 90) = 16.858, p < .001, ƞp² = .360; however,
test, FTest(1, 30) = 2.301, p = .140, and group type, FGroup(1, 30) = .303, p = .586, were not
significant. None of the interactions was significant. The mean accuracy scores for the pretest
and TG1 were .74 and .82 respectively. Perception accuracy scores were higher in TG1;
however, the difference between pretest and TG1 was not significant. In order to locate where
differences existed among the four stimulus types, pairwise comparisons with Bonferroni
correction were performed. Table 45 below shows mean accuracy scores for each token in
Group II.

Table 45: Mean accuracy scores of tokens in Group II (CVV.CV) in the comparison between
pretest and TG1

Stimulus     Pretest                            TG1
Type (ST)    Token            Mean Accuracy     Token            Mean Accuracy
ST4          kaa.ka (LH.H)    .81               taa.ta (LH.H)    .97
ST5          kaa.ka (HL.L)    .90               taa.ta (HL.L)    .91
ST6          suu.su (LH.H)    .81               see.se (LH.H)    .88
ST7          suu.su (HL.L)    .43               see.se (HL.L)    .53

As the mean perception scores of each token in Table 45 show, ST7 was significantly lower than
ST4 (p < .001), ST5 (p < .001), and ST6 (p = .001). Thus, ST7 was the most difficult token to
correctly perceive among the four types. ST7 was significantly more difficult to correctly
identify than ST6, although they involved the same consonant and vowel but differed in pitch
pattern. Also, ST5, which contained the novel consonant but had the same pitch pattern as ST7,
had a higher accuracy. Therefore, the novel vowel with the HL.L pitch pattern appears to have
caused the difficulty.
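The pairwise comparisons above rely on the Bonferroni correction, which multiplies each uncorrected p-value by the number of comparisons (six for four stimulus types) and caps the result at 1. The sketch below illustrates only that arithmetic; the raw p-values are hypothetical placeholders, not values from the study, and the function name is an assumption.

```python
from itertools import combinations


def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}


stimulus_types = ["ST4", "ST5", "ST6", "ST7"]
pairs = list(combinations(stimulus_types, 2))  # all pairwise comparisons

# Hypothetical uncorrected p-values, one per pair, for illustration only.
raw = dict(zip(pairs, [0.40, 0.25, 0.00005, 0.60, 0.00008, 0.0002]))
adjusted = bonferroni(raw)
significant = [pair for pair, p in adjusted.items() if p < 0.05]
print(significant)
```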
Regarding the tokens in Group III (CV.CVV), the results of a mixed ANOVA indicated
significant main effects of test, FTest(1, 30) = 13.364, p = .001, ƞp² = .308, and stimulus type,
FType(2, 60) = 5.955, p = .004, ƞp² = .166; however, group type was not significant, FGroup(1,
30) = .000, p = 1.000. The mean accuracy scores for Group III for the pretest and TG1 were .78
and .93 respectively. Thus, there was development of perception accuracy for the tokens in
Group III. Stimulus type was also significant; therefore, pairwise comparisons with Bonferroni
correction were performed in order to locate where differences existed among the three stimulus
types. Table 46 below shows mean accuracy scores for each token in Group III.

Table 46: Mean accuracy scores for tokens in Group III (CV.CVV) in the comparison between
pretest and TG1

Stimulus     Pretest                            TG1
Type (ST)    Token            Mean Accuracy     Token            Mean Accuracy
ST8          ka.kaa (H.LL)    .56               ta.taa (H.LL)    .97
ST9          sa.saa (L.HH)    .78               se.see (L.HH)    .97
ST10         sa.saa (H.LL)    1.00              se.see (H.LL)    .84

The results showed that the difference between ST8 and ST10 was significant (p = .008), although
the two share the same pitch pattern. ST8 contained a novel preceding consonant
/t/ and a familiar vowel /a/; ST10 contained a familiar preceding consonant /s/ and a novel vowel
/e/. Thus, the learners had more difficulty identifying the vowel duration with a novel consonant.
In addition to the main effects above, the Test x Stimulus Type interaction was
significant, F(2, 60) = 10.994, p < .001, ƞp² = .268 (Figure 57). Results of simple effects tests
revealed that the difference between ST8 and ST10 was significantly greater in the pretest than
in TG1; the accuracy of ST10 was higher and that of ST8 was lower in the pretest. The vowel /e/
in the H.LL pitch in TG1 showed lower accuracy than the vowel /a/ in the same pitch pattern in
the pretest. In contrast, the consonant /t/ in the H.LL pitch in TG1 showed higher accuracy than
the consonant /k/ in the same pitch in the pretest.
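The direction of the Test x Stimulus Type interaction can be read off the cell means in Table 46 by comparing the ST10 minus ST8 gap in each test. This is a descriptive sketch only, not the simple effects test itself; it assumes the TG1 value for ST8, printed as "97" in the table, is .97 (the value given for the same token in Table 52).

```python
# Mean accuracy from Table 46 (Group III, CV.CVV).
accuracy = {
    "pretest": {"ST8": 0.56, "ST9": 0.78, "ST10": 1.00},
    "TG1": {"ST8": 0.97, "ST9": 0.97, "ST10": 0.84},  # ST8 assumed to be .97
}


def gap(test, first, second):
    """Difference in mean accuracy between two stimulus types within one test."""
    return accuracy[test][first] - accuracy[test][second]


pretest_gap = round(gap("pretest", "ST10", "ST8"), 2)
tg1_gap = round(gap("TG1", "ST10", "ST8"), 2)
print(pretest_gap, tg1_gap)
```

The gap is larger in absolute size in the pretest and reverses sign in TG1, which is the pattern the simple effects tests describe.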
Figure 57: The comparison of perception accuracy of tokens in Group III (CV.CVV) between
the pretest and TG1
[Figure omitted: line graph of accuracy (y-axis) by test (x-axis: pretest, TG1), with separate
lines for ST8, ST9, and ST10]

Comparing Accuracy in Pretest and TG2 (Familiar Tokens by Novel Talker): Perception
accuracy scores of the pretest and the TG2 were compared using a mixed ANOVA. Following
the analysis of the pretest and posttest comparison, the tokens used in the TG2 were divided into
three categories (Group I, II and III) as shown in Figure 34 in the previous part. Independent
variables were test (2; pretest, TG2), group type (2; AV and A-only), and stimulus type (6); the
dependent variable was perception accuracy. Regarding stimulus type in Group I (CVV.CVV),
results of a mixed ANOVA indicated significant main effects of test, FTest(1, 30) = 49.681, p
< .001, ƞp² = .623, and stimulus type, FType(5, 150) = 3.844, p = .003, ƞp² = .114; however,
group type was not significant, FGroup(1, 30) = .464, p = .501. None of the interactions was
significant. TG2 had a higher mean accuracy (.93) than the pretest (.62). Therefore, it was
concluded that there was a development of perception accuracy from pretest to TG2 for the
tokens in Group I. In addition, stimulus type had significant effects; therefore, pairwise
comparisons were performed using the Bonferroni correction in order to locate the differences.
Table 47 shows the mean perception accuracy for each token in Group I. The results revealed
that perception accuracy of ST3 was significantly different from ST2 (p = .007). ST2 had higher
accuracy than ST3; therefore, the former was easier to identify correctly than the latter.

Table 47: Mean perception accuracy of the six stimulus types in Group I (CVV.CVV) in pretest
and TG2 comparison

Stimulus     Tokens             Mean Accuracy
Type (ST)                       Pretest    TG2
1            saa.saa (LH.HH)    .66        1.00
2            suu.suu (LH.HH)    .81        .97
3            kaa.kaa (LH.HL)    .44        .91
4            kuu.kuu (LH.HL)    .75        .91
5            saa.saa (HL.LL)    .50        .97
6            kuu.kuu (HL.LL)    .56        .88

Regarding stimulus type in Group II (CVV.CV), the results of a mixed ANOVA
indicated significant main effects of test, FTest(1, 30) = 4.156, p = .050, ƞp² = .122, and
stimulus type, FType(5, 150) = 6.235, p < .001, ƞp² = .172; however, group type was not
significant, FGroup(1, 30) = .385, p = .540. The perception accuracy significantly increased from
the pretest (.76) to the TG2 (.84). In order to locate where the differences existed among the six
tokens, pairwise comparisons were performed with Bonferroni correction. Table 48 shows the
mean perception accuracy for each token in Group II. The results revealed that ST7 was
different from ST11 (p = .029) and ST12 (p = .006). In addition, ST10 was different from ST11
(p = .048) and ST12 (p = .009). The accuracy differences among ST10, ST11, and ST12, which
share the same pitch pattern, suggest that the difficulty stems not from pitch pattern, consonant,
or vowel alone but from an interaction of all factors.

Table 48: Mean perception accuracy of the six tokens in Group II (CVV.CV) in pretest and TG2
comparison

Stimulus     Tokens           Mean Accuracy
Type (ST)                     Pretest    TG2
7            kaa.ka (LH.H)    .81        1.00
8            kuu.ku (LH.H)    .94        .81
9            suu.su (LH.H)    .81        .78
10           kaa.ka (HL.L)    .91        .91
11           kuu.ku (HL.L)    .63        .78
12           suu.su (HL.L)    .44        .78

In addition to the main effects, the Test x Stimulus Type interaction was significant, F(5, 150) =
3.573, p = .004, ƞp² = .106 (Figure 58). Results of the simple effects tests revealed that the
differences between ST7 and ST12 were greater in the pretest than in TG2.

Figure 58: The comparison of perception accuracy of tokens in Group II (CVV.CV) between the
pretest and TG2
[Figure omitted: line graph of accuracy (y-axis) by test (x-axis: pretest, TG2), with separate
lines for ST7 through ST12]

Regarding stimulus type in Group III (CV.CVV), the results of a mixed ANOVA
indicated significant main effects of test, FTest(1, 30) = 34.539, p < .001, ƞp² = .535, and
stimulus type, FType(5, 150) = 3.622, p = .004, ƞp² = .108; however, group type was not
significant, FGroup(1, 30) = .758, p = .391. The perception accuracy significantly increased from
the pretest (.73) to the TG2 (.95). In order to locate differences among the six tokens, pairwise
comparisons with Bonferroni correction were performed. Table 49 shows the mean accuracy of
each token in Group III. The results revealed that ST16 was significantly different from ST15 (p
= .021) and ST18 (p = .003).

Table 49: Mean perception accuracy of the six tokens in Group III (CV.CVV) in pretest and
TG2 comparison

Stimulus     Tokens           Mean Accuracy
Type (ST)                     Pretest    TG2
13           sa.saa (L.HH)    .78        1.00
14           su.suu (L.HL)    .81        .88
15           ka.kaa (H.LL)    .56        1.00
16           sa.saa (H.LL)    1.00       .91
17           ku.kuu (H.LL)    .65        .97
18           su.suu (H.LL)    .56        .97

In addition to the main effects, the Test x Group Type interaction was significant, F(1, 30) =
4.203, p = .049, ƞp² = .123. As shown in Figure 59, the improvement for the AV group was
greater than that of the A-only group. The Test x Stimulus Type interaction was also significant,
F(5, 150) = 7.276, p < .001, ƞp² = .195. Results of the simple effects tests revealed that the
accuracy of ST15, ST17, and ST18 improved the most from pretest to TG2.
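The tokens with the largest gains can be identified descriptively from the cell means in Table 49. The sketch below simply subtracts the pretest mean from the TG2 mean for each stimulus type and ranks the results; it illustrates the pattern and does not replace the simple effects tests. Variable names are illustrative.

```python
# Mean accuracy for Group III (CV.CVV), copied from Table 49.
pretest = {"ST13": 0.78, "ST14": 0.81, "ST15": 0.56, "ST16": 1.00, "ST17": 0.65, "ST18": 0.56}
tg2 = {"ST13": 1.00, "ST14": 0.88, "ST15": 1.00, "ST16": 0.91, "ST17": 0.97, "ST18": 0.97}

# Gain from pretest to TG2 for each stimulus type, rounded to two decimals.
gains = {st: round(tg2[st] - pretest[st], 2) for st in pretest}

# Stimulus types ordered from largest to smallest gain.
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked[:3], [gains[st] for st in ranked[:3]])
```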

Figure 59: The comparison of perception accuracy of tokens in Group III (CV.CVV) between the
pretest and TG2
[Figure omitted: two line graphs of accuracy (y-axis) by test (x-axis: pretest, TG2); the first
plots the AV and A-only groups, and the second plots ST13 through ST18]

Comparing Accuracy in Posttest and TG1 (Novel Tokens): Perception accuracy in the posttest
and TG1 was compared using a mixed ANOVA in order to examine whether the two tests were
comparable (i.e., whether training effects generalized to correctly identifying vowel duration of
novel tokens). Independent variables were test (2; posttest, TG1), group type (2; AV and A-only),
and stimulus type (3 or 4 depending on the group); the dependent variable was perception
accuracy. Regarding the tokens in Group I (CVV.CVV) in Table 43, the results of a mixed ANOVA
indicated significant main effects of test, FTest(1, 30) = 10.090, p = .002, ƞp² = .216, and group
type, FGroup(1, 30) = 8.710, p = .006, ƞp² = .225; however, stimulus type was not significant,
FType(2, 60) = 2.547, p = .087. The mean accuracy scores of the posttest and TG1 were .90
and .98 respectively; therefore, accuracy increased from posttest to TG1. The difference between
the two training groups was significant; however, this difference was probably due to the
difference in the posttest (the two groups were not homogeneous before the comparison).
Table 50 below shows mean accuracy scores for each token; the differences among tokens were
not significant.

Table 50: Mean perception accuracy of the tokens in Group I (CVV.CVV) in posttest and
TG1 comparison

Stimulus     Posttest                             TG1
Type (ST)    Token              Mean Accuracy     Token              Mean Accuracy
ST1          kaa.kaa (LH.HL)    .81               taa.taa (LH.HL)    .97
ST2          saa.saa (LH.HH)    .97               see.see (LH.HH)    1.00
ST3          saa.saa (HL.LL)    .91               see.see (HL.LL)    .88

Regarding the tokens in Group II (CVV.CV), the results of a mixed ANOVA indicated
significant main effects of stimulus type, FType(3, 90) = 14.670, p < .001, ƞp² = .328, and group
type, FGroup(1, 30) = 6.788, p = .014, ƞp² = .328; however, test was not significant: FTest(1, 30)
= .714, p = .405. The mean accuracy of the posttest was .85; that of TG1 was .82. The difference
between posttest and TG1 was not significant; therefore, the tokens in Group II in the two tests
were comparable. Stimulus type was significant; therefore, pairwise comparisons with
Bonferroni correction were performed in order to locate where differences existed among the
four stimulus types. Table 51 below shows mean accuracy scores for each token. ST7 was
significantly lower than ST4 (p < .001), ST5 (p < .001), and ST6 (p = .011). Thus, ST7 was the
most difficult token to correctly perceive among the four types.

Table 51: Mean perception accuracy of the tokens in Group II (CVV.CV) in posttest and
TG1 comparison

Stimulus     Posttest                           TG1
Type (ST)    Token            Mean Accuracy     Token            Mean Accuracy
ST4          kaa.ka (LH.H)    .91               taa.ta (LH.H)    .97
ST5          kaa.ka (HL.L)    .97               taa.ta (HL.L)    .91
ST6          suu.su (LH.H)    .81               see.se (LH.H)    .88
ST7          suu.su (HL.L)    .72               see.se (HL.L)    .53

In addition to the main effects above, the Test x Group Type interaction was significant: F(1,
30) = 6.429, p = .017, ƞp² = .176. The difference in perception accuracy between the two groups
was significantly greater in TG1 than in the posttest. The AV group had significantly higher
accuracy in the posttest.

Regarding the tokens in Group III (CV.CVV), the results of a mixed ANOVA did not
indicate any significant main effects: test, FTest(1, 30) = .105, p = .748, stimulus type, FType(2,
60) = 1.455, p = .241, and group type, FGroup(1, 30) = .034, p = .858. The mean accuracy scores
of the posttest and TG1 were quite high: .94 and .93 respectively. Table 52 shows the mean
accuracy of each stimulus type; there were no statistical differences among them.

Table 52: Mean perception accuracy of the tokens in Group III (CV.CVV) in posttest and
TG1 comparison

Stimulus     Posttest                           TG1
Type (ST)    Token            Mean Accuracy     Token            Mean Accuracy
ST8          ka.kaa (H.LL)    .88               ta.taa (H.LL)    .97
ST9          sa.saa (L.HH)    .97               se.see (L.HH)    .97
ST10         sa.saa (H.LL)    .97               se.see (H.LL)    .84

Although there were no significant main effects, the Test x Stimulus Type interaction was
significant, F(2, 60) = 4.549, p = .014, ƞp² = .132 (Figure 60). Results of simple effects tests
revealed that the differences between ST8 and ST10 were significantly greater in TG1 than in the
posttest. The accuracy of ST8 significantly developed while that of ST10 significantly decreased
from the posttest to TG1.

Figure 60: The comparison of perception accuracy of the tokens in Group III (CV.CVV)
between the posttest and TG1
[Figure omitted: line graph of accuracy (y-axis) by test (x-axis: posttest, TG1), with separate
lines for ST8, ST9, and ST10]

Comparing Accuracy in Posttest and TG2 (Familiar Tokens by Novel Talker): The tokens used
in the TG2 were divided into three categories (Group I, II and III) as shown in Figure 34 in the
previous part. Independent variables were test (2; posttest, TG2), group type (2; AV and A-only),
and stimulus type (6); the dependent variable was perception accuracy. Regarding the tokens in
Group I (CVV.CVV), the results of a mixed ANOVA indicated significant main effects of
stimulus type, FType(5, 150) = 2.839, p = .018, ƞp² = .086, and group type, FGroup(1, 30) =
5.867, p = .022, ƞp² = .164; however, test was not significant, FTest(1, 30) = .808, p = .376.
None of the interactions was significant. Since there was no difference between the two tests, it
was considered that the two tests, posttest and TG2, were comparable for the tokens in Group I.
The main effect of group type arose because the perception accuracy of the AV and A-only
groups already differed significantly in the posttest: F(1, 30) = 5.428, p = .027.
Pairwise comparisons were performed to locate where the differences existed among the six tokens
in Group I. Table 53 shows the mean accuracy of each stimulus type in Group I. Regarding
stimulus type, perception accuracy of ST1 was higher than that of ST3; however, the difference
was not significant.

Table 53: Mean perception accuracy of the six stimulus types in Group I (CVV.CVV) in posttest
and TG2 comparison

Stimulus     Tokens             Mean Accuracy
Type (ST)                       Posttest    TG2
1            saa.saa (LH.HH)    .97         1.00
2            suu.suu (LH.HH)    .97         .97
3            kaa.kaa (LH.HL)    .81         .91
4            kuu.kuu (LH.HL)    .91         .91
5            saa.saa (HL.LL)    .91         .97
6            kuu.kuu (HL.LL)    .91         .88

Regarding the tokens in Group II (CVV.CV), the results of a mixed ANOVA revealed
significant main effects of stimulus type, FType(5, 150) = 3.225, p = .009, ƞp² = .097, and group
type, FGroup(1, 30) = 5.758, p = .023, ƞp² = .161; however, test was not significant, FTest(1, 30)
= .871, p = .358. The mean accuracy score of the posttest (.88) was not significantly different
from that of TG2 (.84). There were significant differences between the two groups; however, the
two groups were not homogeneous at the time of posttest. Regarding token type, Table 54 below
shows the mean accuracy of tokens in Group II. The results of the pairwise comparisons did not
reveal any significant differences among the six token types; however, the difference between
ST7 and ST12 approached significance (p = .070).

Table 54: Mean perception accuracy of the six tokens in Group II (CVV.CV) in posttest and
TG2 comparison

Stimulus     Tokens           Mean Accuracy
Type (ST)                     Posttest    TG2
7            kaa.ka (LH.H)    .91         1.00
8            kuu.ku (LH.H)    .97         .81
9            suu.su (LH.H)    .81         .78
10           kaa.ka (HL.L)    .97         .91
11           kuu.ku (HL.L)    .88         .78
12           suu.su (HL.L)    .72         .78

Regarding the tokens in Group III (CV.CVV), the results of a mixed ANOVA did not
indicate any significant main effects: test, FTest(1, 30) = 3.357, p = .077, stimulus type, FType(5,
150) = 1.447, p = .211, and group type, FGroup(1, 30) = .551, p = .464. Table 55 shows mean
accuracy scores for each token in Group III; there were no statistically significant differences
among them.

Table 55: Mean perception accuracy of the six tokens in Group III (CV.CVV) in posttest and
TG2 comparison

Stimulus     Tokens           Mean Accuracy
Type (ST)                     Posttest    TG2
13           sa.saa (L.HH)    .97         1.00
14           su.suu (L.HL)    .94         .88
15           ka.kaa (H.LL)    .88         1.00
16           sa.saa (H.LL)    .97         .91
17           ku.kuu (H.LL)    .88         .97
18           su.suu (H.LL)    .85         .97

On the other hand, the Test x Stimulus Type interaction was significant, F(5, 150) = 2.805, p
= .019, ƞp² = .086 (Figure 61). The results of the simple effects tests revealed that (1) the
difference between ST13 and ST15 was greater in the posttest than in TG2; and (2) the
difference between ST13 and ST14 was greater in TG2 than in the posttest. The accuracy of ST15
significantly improved in TG2; however, that of ST14 significantly decreased in TG2.

Figure 61: The comparison of perception accuracy of tokens in Group III (CV.CVV) between
the posttest and TG2
[Figure omitted: line graph of accuracy (y-axis) by test (x-axis: posttest, TG2), with separate
lines for ST13 through ST18]

In conclusion, two tests of generalization were conducted: one with novel tokens
produced by a familiar voice (TG1) and one with familiar tokens produced by a new voice (TG2).
First, the pretest and the two TGs were compared. Overall, it was found that accuracy improved
from pretest to TG1 and TG2; however, there were some tokens that failed to generalize. In TG1,
the CVV.CV type did not demonstrate higher accuracy. In addition, it was found that
generalization to a new vowel was more difficult than to a new consonant. Second, the posttest
and two TGs were compared. Overall, the learners demonstrated comparable performance,
although generalization failed in some cases. Thus, it was
considered that the training effects were generalized to new tokens and a new talker. Regarding
effects of the training modality on perception accuracy, there were no statistically significant
differences between the two training types. However, it was found that the AV training was
more effective for developing accuracy on the tokens that were most difficult for the learners.

Test of Generalization (Familiar and Novel Tokens) – Comparison of RT: Tests of generalization
were given to the two experimental groups, in order to assess whether the effect of perceptual
training on the response speed to identify vowel duration could be generalized to novel tokens
(Appendix I) spoken by a familiar talker (TG1) and familiar tokens (i.e., tokens used in testing;
Appendix E) spoken by a novel talker (TG2). Table 56 shows descriptive statistics for the
perception RT in the pre-/post-tests and two TGs.

Table 56: Descriptive statistics of the perception RT (in milliseconds) in the pre-/post-tests and
two TGs

Group     Sample   Pretest             Posttest            TG1 (novel tokens)   TG2 (novel voice)
          Size     Mean (SD)           Mean (SD)           Mean (SD)            Mean (SD)
AV        16       2782.15 (557.66)    3155.17 (532.95)    2435.90 (528.33)     2392.59 (571.46)
A-only    16       2893.53 (516.01)    3241.33 (492.71)    2675.71 (477.38)     2685.66 (764.26)

The two TGs were compared with the pretest in order to examine whether there were any
developments in perception RT to identify vowel duration from the pretest to TGs. In order to
examine the overall effects of pretest to TG1, a mixed ANOVA was performed. Independent
variables were test (2: pretest, TG1) and group type (2: AV, A-only); the dependent variable was
perception RT. Results indicated significant main effects of test, FTest(1, 30) = 6.263, p = .018,
ƞp² = .173; RTs in TG1 were faster than in the pretest. However, group type was not significant,
FGroup(1, 30) = .394, p = .535. The Test x Group Type interaction was not significant, F(1, 30)
= 1.757, p = .195.
In order to examine the changes in perception RT from pretest to TG2, a mixed ANOVA
was performed. Independent variables were test (2: pretest, TG2) and group type (2: AV,
A-only); the dependent variable was perception RT. Results indicated significant main effects of
test, FTest(1, 30) = 5.446, p = .027, ƞp² = .154; RTs in TG2 were faster than in the pretest.
However, group type was not significant, FGroup(1, 30) = .492, p = .489. The Test x Group
Type interaction was also not significant, F(1, 30) = 1.897, p = .179.
In addition, the two TGs were compared with the posttest in order to examine whether the
posttest and each TG were comparable. To compare the posttest and TG1, a mixed ANOVA was
performed. Independent variables were test (2: posttest, TG1) and group type (2: AV, A-only);
the dependent variable was perception RT. Results indicated significant main effects of test,
FTest(1, 30) = 92.711, p < .001, ƞp² = .756; RTs in TG1 were faster. However, group type was
not significant, FGroup(1, 30) = .796, p = .379. The Test x Group Type interaction was not
significant, F(1, 30) = 1.873, p = .181.

In order to examine whether the posttest and TG2 were comparable, a mixed ANOVA
was performed. Independent variables were test (2: posttest, TG2) and group type (2: AV,
A-only); the dependent variable was perception RT. Results indicated significant main
effects of test, FTest(1, 30) = 28.422, p < .001, ƞp² = .486; RTs in TG2 were faster. However,
group type was not significant, FGroup(1, 30) = 1.038, p = .316. The Test x Group Type
interaction was also not significant, F(1, 30) = .941, p = .340.
Thus, overall, RTs in the TGs (TG1: 2555.57 milliseconds; TG2: 2539.13
milliseconds) were faster than in the pretest (2830.94 milliseconds) and posttest (3143.38
milliseconds). In order to examine the effects of pitch pattern, preceding consonant, and vowel
type, tokens in TG1 were categorized into the three groups shown earlier in Table 43.

Comparing Perception RT in Pretest and TG1 (Novel Tokens): Perception RT in pretest and
TG1 was compared using a mixed ANOVA in order to examine whether there were any
developments in response speed in identifying vowel duration for the novel tokens spoken by the
familiar talker (i.e., the talker in the training sessions). In the comparison between pretest and
TG1, independent variables were test (2; pretest, TG1), group type (2; AV and A-only), and
stimulus type (3 or 4 depending on the structural pitch pattern group); the dependent variable
was perception RT. Regarding the tokens in Group I (CVV.CVV), the results of a mixed
ANOVA indicated significant main effects of stimulus type, FType(2, 60) = 19.992, p < .001,
ƞp² = .400; however, test, FTest(1, 30) = 2.085, p = .159, and group type, FGroup(1, 30) = .349, p
= .559, were not significant. The mean RTs for the pretest and TG1 were 2872.84 milliseconds
and 2612.03 milliseconds. In order to locate where differences existed among the three stimulus
types in Group I, pairwise comparisons with Bonferroni correction were performed. Table 57
below shows mean RT scores for each token.

Table 57: Mean RT scores of the tokens in Group I (CVV.CVV) in the comparison between
pretest and TG1

Stimulus     Pretest                              TG1
Type (ST)    Token              Mean RT (ms)      Token              Mean RT (ms)
ST1          kaa.kaa (LH.HL)    3188.34           taa.taa (LH.HL)    2187.50
ST2          saa.saa (LH.HH)    2420.03           see.see (LH.HH)    2054.34
ST3          saa.saa (HL.LL)    3010.16           see.see (HL.LL)    3594.25

The results showed that ST3 was significantly different from ST1 (p = .009) and ST2 (p < .001).
ST3 had the longest RT of the three tokens. The source of the difficulty for ST3
appears to be the pitch pattern.
In addition to the main effects above, the Test x Stimulus Type interaction was
significant, F(2, 60) = 8.272, p = .001, ƞp² = .216 (Figure 62). Results of simple effects tests
revealed that the differences between ST1 and ST2 were significantly greater in pretest than in
TG1. The RT of ST1 as well as ST2 decreased from pretest to TG1; however, the rate of
decrease was greater for ST1.

Figure 62: The comparison of perception RT for the tokens in Group I (CVV.CVV) between the
pretest and TG1
[Figure omitted: line graph of RT in milliseconds (y-axis) by test (x-axis: pretest, TG1), with
separate lines for ST1, ST2, and ST3]

Regarding the tokens in Group II (CVV.CV), the results of a mixed ANOVA indicated
significant main effects of stimulus type, FType(3, 90) = 3.155, p = .029, ƞp² = .095; however,
test, FTest(1, 30) = .031, p = .860, and group type, FGroup(1, 30) = .112, p = .740, were not
significant. The mean RTs of the pretest and TG1 were 2751.44 milliseconds and 2773.82
milliseconds respectively. The difference between pretest and TG1 was not significant. In order
to locate where differences existed among the four stimulus types in Group II, pairwise
comparisons with Bonferroni correction were performed. Table 58 below shows mean RT
scores for each token. The difference between ST4 and ST5 was marginally significant (p
= .053).

Table 58: Mean RT scores of the tokens in Group II (CVV.CV) in the comparison between
pretest and TG1

Stimulus     Pretest                           TG1
Type (ST)    Token            Mean RT (ms)     Token            Mean RT (ms)
ST4          kaa.ka (LH.H)    2532.31          taa.ta (LH.H)    2570.75
ST5          kaa.ka (HL.L)    3070.16          taa.ta (HL.L)    2740.41
ST6          suu.su (LH.H)    2552.63          see.se (LH.H)    2847.53
ST7          suu.su (HL.L)    2850.66          see.se (HL.L)    2936.59

In addition to the significant main effects, the Test x Group Type interaction was
significant: F(1, 30) = 14.441, p = .001, ƞp² = .325 (Figure 63). The two groups had greater RT
difference in TG1 compared to the pretest, and the RT of the AV group decreased while that of
the A-only group increased.

Figure 63: The comparison of perception RT for the tokens in Group II (CVV.CV) between the
pretest and TG1
[Figure omitted: line graph of RT in milliseconds (y-axis) by test (x-axis: pretest, TG1), with
separate lines for the AV and A-only groups]

Regarding the tokens in Group III (CV.CVV), the results of a mixed ANOVA indicated
significant main effects of stimulus type, FType(2, 60) = 3.296, p = .044, ƞp² = .099; however,
test, FTest(1, 30) = 1.393, p = .247, and group type, FGroup(1, 30) = .001, p = .970, were not
significant. The mean RT of the pretest was 2395.45 milliseconds; that of TG1 was 2605.95
milliseconds. RT appeared to lengthen in TG1; however, the difference was not
significant. Because stimulus type was significant, pairwise comparisons with Bonferroni
correction were performed in order to locate where differences existed among the three stimulus
types. Table 59 below shows mean RT scores for each token. The results showed the difference
between ST9 and ST10 was significant (p = .005), suggesting that the source was the pitch
pattern. The token with the novel vowel with the H.LL pitch had significantly faster RT than the
one with the L.HH pitch.

Table 59: Mean RT scores of the tokens in Group III (CV.CVV) in the comparison between
pretest and TG1

Stimulus     Pretest                           TG1
Type (ST)    Token            Mean RT (ms)     Token            Mean RT (ms)
ST8          ka.kaa (H.LL)    2761.66          ta.taa (H.LL)    2422.38
ST9          sa.saa (L.HH)    2854.63          se.see (L.HH)    2494.38
ST10         sa.saa (H.LL)    1570.06          se.see (H.LL)    2901.09

In addition to the main effects above, the Test x Stimulus Type interaction was
significant, F(2, 60) = 10.994, p < .001, ƞp² = .268 (Figure 64).

Figure 64: The comparison of perception RT of the tokens in Group III (CV.CVV) between the
pretest and TG1
[Figure omitted: line graph of RT in milliseconds (y-axis) by test (x-axis: pretest, TG1), with
separate lines for ST8, ST9, and ST10]

Results of simple effects tests revealed that the differences between ST10 and ST8, and between
ST10 and ST9, were greater in the pretest than in TG1. The RT of ST10 was significantly lengthened
in TG1, compared to the pretest, while those of ST8 and ST9 were shortened in TG1.
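The direction of change for each token can be read directly from the cell means in Table 59. The sketch below labels each stimulus type as shortened or lengthened from pretest to TG1; it is purely descriptive, and the helper name is illustrative rather than part of the original analysis.

```python
# Mean RT in milliseconds for Group III (CV.CVV), copied from Table 59.
pretest_rt = {"ST8": 2761.66, "ST9": 2854.63, "ST10": 1570.06}
tg1_rt = {"ST8": 2422.38, "ST9": 2494.38, "ST10": 2901.09}


def describe_change(before, after):
    """Return (change in ms, label) for one token; negative change means faster."""
    change = round(after - before, 2)
    label = "shortened" if change < 0 else "lengthened"
    return change, label


changes = {st: describe_change(pretest_rt[st], tg1_rt[st]) for st in pretest_rt}
for st, (change, label) in changes.items():
    print(f"{st}: {change:+.2f} ms ({label})")
```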

Comparing RT in Pretest and TG2 (Novel Talker): Perception RT scores for the pretest and the
TG2 were compared using a mixed ANOVA. Following the previous analyses, the tokens used
in the TG2 were also divided into three categories (Group I, II and III) as shown in Figure 34.
Independent variables were test (2; pretest, TG2), group type (2; AV and A-only), and stimulus
type (6); the dependent variable was perception RT. Regarding stimulus type in Group I
(CVV.CVV), the results of a mixed ANOVA indicated significant main effects of test, FTest(1,
30) = 16.465, p < .001, ƞp² = .345; however, stimulus type, FType(5, 150) = 1.661, p = .147, and
group type, FGroup(1, 30) = 2.663, p = .113, were not significant. None of the interactions was
significant. The mean RT of pretest was 2926.07 milliseconds; the mean RT of TG2 was
2393.98 milliseconds. Therefore, the RT was shortened from the pretest to TG2. Table 60
shows mean RT scores for each stimulus type in Group I.

Table 60: Mean perception RT of the six stimulus types in Group I (CVV.CVV) in pretest and
TG2 comparison

Stimulus Type (ST)   Tokens              Pretest mean RT (ms)   TG2 mean RT (ms)
1                    saa.saa (LH.HH)     2420.03                2261.97
2                    suu.suu (LH.HH)     2940.72                2323.59
3                    kaa.kaa (LH.HL)     3188.34                2277.03
4                    kuu.kuu (LH.HL)     3041.44                2484.28
5                    saa.saa (HL.LL)     3010.16                2530.31
6                    kuu.kuu (HL.LL)     2955.75                2486.72
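
The overall pretest and TG2 means quoted in the text for Group I can be reproduced from the per-stimulus means in Table 60. The sketch below assumes that every stimulus type contributes equally to the overall mean (that is, an unweighted average of the six cell means); the variable names are illustrative.

```python
from statistics import mean

# Mean RT (ms) per stimulus type, copied from Table 60
pretest = [2420.03, 2940.72, 3188.34, 3041.44, 3010.16, 2955.75]
tg2 = [2261.97, 2323.59, 2277.03, 2484.28, 2530.31, 2486.72]

pretest_mean = round(mean(pretest), 2)
tg2_mean = round(mean(tg2), 2)

print(pretest_mean, tg2_mean)              # 2926.07 2393.98
print(round(pretest_mean - tg2_mean, 2))   # 532.09
```

The two averages match the means reported for the main effect of test, and their difference gives the size of the RT reduction from the pretest to TG2 in milliseconds.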

Regarding stimulus type in Group II (CVV.CV), the results of a mixed ANOVA did not
indicate any significant main effects: test, FTest(1, 30) = .447, p = .509, stimulus type, FType(5,
150) = 1.348, p = .223, and group type, FGroup(1, 30) = .113, p = .739. The mean RT scores
decreased from 2775.23 milliseconds (pretest) to 2664.20 milliseconds (TG2); however, the
difference was not significant. Table 61 shows mean RT scores for each stimulus type in Group
II.


Table 61: Mean perception RT of the six tokens in Group II (CVV.CV) in pretest and TG2
comparison

Stimulus Type (ST)   Tokens            Pretest mean RT (ms)   TG2 mean RT (ms)
7                    kaa.ka (LH.H)     2531.31                2382.78
8                    kuu.ku (LH.H)     2489.34                3019.44
9                    suu.su (LH.H)     2552.63                3053.59
10                   kaa.ka (HL.L)     3070.16                2171.75
11                   kuu.ku (HL.L)     3156.28                2599.19
12                   suu.su (HL.L)     2850.66                2788.44

The Test x Group Type interaction was significant: F(1, 30) = 5.132, p = .031, ηp² = .146. As
shown in Figure 65, the two training groups had greater RT difference at pretest, compared to
TG2; the RT of the AV group shortened whereas that of the A-only group lengthened in TG2.
The Test x Stimulus Type interaction was also significant: F(5, 150) = 5.249, p < .001, ηp²
= .149. The results of the simple effects tests revealed that (1) the differences between ST10 and
ST12 were significantly greater in TG2 than in pretest; and (2) the differences between ST10 and
ST7 were significantly greater in pretest than in TG2. RT of ST10 significantly shortened from
the pretest to TG2.


Figure 65: The comparison of perception RT of tokens in Group II (CVV.CV) between the
pretest and TG2
[Plot, first panel: RT in milliseconds (y-axis) by test (Pretest, TG2; x-axis); series: AV, A-only.]
[Plot, second panel: RT in milliseconds (y-axis) by test (Pretest, TG2; x-axis); series: ST7, ST8, ST9, ST10, ST11, ST12.]

Regarding stimulus type in Group III (CV.CVV), the results of a mixed ANOVA indicated a
significant main effect of stimulus type, FType(5, 150) = 7.355, p < .001, ηp² = .197; however,
test, FTest(1, 30) = .218, p = .644, and group type, FGroup(1, 30) = .001, p = .973, were not
significant. The mean RT decreased from 2624.38 milliseconds (pretest) to 2554.20 milliseconds
(TG2); however, the change was not significant. To locate where the differences existed among
the six stimulus types, pairwise comparisons were performed with Bonferroni correction. Table 62
shows mean RT scores for each stimulus type in Group III.

Table 62: Mean perception RT of the six tokens in Group III (CV.CVV) in pretest and TG2
comparison

Stimulus Type (ST)   Tokens            Pretest mean RT (ms)   TG2 mean RT (ms)
13                   sa.saa (L.HH)     2854.63                2812.97
14                   su.suu (L.HL)     3060.66                2472.94
15                   ka.kaa (L.HL)     2761.66                2334.31
16                   sa.saa (H.LL)     1570.06                2496.91
17                   ku.kuu (H.LL)     2647.31                2603.94
18                   su.suu (H.LL)     2815.97                2604.13

It was found that ST16 was significantly different from ST13 (p < .001), ST14 (p < .001), ST15
(p = .022), ST17 (p < .001), and ST18 (p < .001).
In addition to the main effects, the Test x Stimulus Type interaction was significant: F(5,
150) = 5.657, p < .001, ηp² = .159 (Figure 66). The results of simple effects tests revealed that
the differences between ST14 and ST16 were significantly greater in pretest than in TG2. The
RT of ST14 decreased from the pretest to TG2; however, that of ST16 increased.
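
The pattern behind this interaction can be illustrated descriptively by comparing the gap between the two stimulus types at each test. The sketch below uses the mean RTs for ST14 and ST16 from Table 62; it only describes the cell means and is not a substitute for the simple effects tests, and the helper name is illustrative.

```python
# Mean RT (ms) for the two stimulus types named in the text, copied from Table 62
rt = {
    "ST14": {"pretest": 3060.66, "tg2": 2472.94},
    "ST16": {"pretest": 1570.06, "tg2": 2496.91},
}


def gap(test):
    """Return the ST14 minus ST16 difference in mean RT at one test."""
    return round(rt["ST14"][test] - rt["ST16"][test], 2)


gap_pretest = gap("pretest")
gap_tg2 = gap("tg2")
print(gap_pretest, gap_tg2)
```

A large gap at the pretest that nearly disappears at TG2 is the shape that a Test x Stimulus Type interaction of this kind takes in the cell means.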


Figure 66: The comparison of perception RT of tokens in Group III (CV.CVV) between the
pretest and TG2
[Plot: RT in milliseconds (y-axis) by test (Pretest, TG2; x-axis); series: ST13, ST14, ST15, ST16, ST17, ST18.]

Comparing RT in Posttest and TG1 (Novel Tokens): Perception RTs in posttest and TG1 were
compared using a mixed ANOVA in order to examine whether the two tests were comparable
(i.e., training effects were generalized in response speed in identifying vowel duration of novel
tokens). Independent variables were test (2; posttest, TG1), group type (2; AV and A-only), and
stimulus type (3 or 4 depending on the group); the dependent variable was perception RT.
Regarding the tokens in Group I (CVV.CVV), the results of a mixed ANOVA indicated
significant main effects of test, FTest(1, 30) = 10.963, p = .002, ηp² = .268, and stimulus type,
FType(2, 60) = 7.591, p = .001, ηp² = .202; however, group type was not significant: FGroup(1,
30) = 1.734, p = .198. The mean RT of the posttest was 3149.44 milliseconds; that of TG1 was
2612.03 milliseconds. The RT significantly shortened in the TG1. Stimulus type had significant
effects; therefore, pairwise comparisons with Bonferroni correction were performed in order to
locate where differences existed among the three stimulus types. Table 63 shows mean RT
scores for each token in Group I.

Table 63: Mean perception RT of the six tokens in Group I (CVV.CVV) in posttest and TG1
comparison

Stimulus Type (ST)   Posttest token      Posttest mean RT (ms)   TG1 token           TG1 mean RT (ms)
ST1                  kaa.kaa (LH.HL)     3201.00                 taa.taa (LH.HL)     2187.50
ST2                  saa.saa (LH.HH)     3301.38                 see.see (LH.HH)     2054.34
ST3                  saa.saa (HL.LL)     2945.94                 see.see (HL.LL)     3594.25

The results showed ST3 was significantly different from ST1 (p = .008) and ST2 (p = .003). ST3
had a significantly longer RT than the other two tokens, which might be attributable to the pitch
pattern.
In addition to the main effects above, the Test x Stimulus Type interaction was
significant, F(2, 60) = 13.362, p < .001, ηp² = .308 (Figure 67). Results of simple effects tests
revealed that the differences between ST3 and ST1, and between ST3 and ST2, were greater in
TG1 than in the posttest. RT of ST1 and ST2 significantly shortened; however, that of ST3 increased.


Figure 67: The comparison of perception RT of the tokens in Group I (CVV.CVV) between the
posttest and TG1
[Plot: RT in milliseconds (y-axis) by test (Posttest, TG1; x-axis); series: ST1, ST2, ST3.]

Regarding the tokens in Group II (CVV.CV), the results of a mixed ANOVA indicated a
significant main effect of test, FTest(1, 30) = 17.258, p < .001, ηp² = .365; however, stimulus
type, FType(3, 90) = 1.393, p = .250, and group type, FGroup(1, 30) = 3.232, p = .082, were not
significant. None of the interactions was significant. The mean RT of the posttest was 3205.15
milliseconds; that of TG1 was 2773.82 milliseconds. The RT of all the tokens in Group II
significantly shortened from the posttest to TG1.
Regarding the tokens in Group III (CV.CVV), the results of a mixed ANOVA indicated
significant main effects of test, FTest(1, 30) = 19.354, p < .001, ηp² = .392, and stimulus type,
FType(2, 60) = 5.459, p = .007, ηp² = .154; however, group type was not significant, FGroup(1,
30) = .262, p = .612. None of the interactions was significant. The mean RT for the posttest was
3171.17 milliseconds; that for TG1 was 2605.95 milliseconds. The RT of all the tokens in
Group III significantly shortened from the posttest to TG1. Stimulus type had significant effects;
therefore, pairwise comparisons with Bonferroni correction were performed in order to locate
where differences existed among the three stimulus types. Table 64 shows the mean RT of each
stimulus type in Group III.

Table 64: Mean perception RT of the six tokens in Group III (CV.CVV) in posttest and TG1
comparison

Stimulus Type (ST)   Posttest token    Posttest mean RT (ms)   TG1 token        TG1 mean RT (ms)
ST8                  ka.kaa (H.LL)     3120.56                 ta.taa (H.LL)    2422.38
ST9                  sa.saa (L.HH)     2887.28                 se.see (L.HH)    2494.38
ST10                 sa.saa (H.LL)     3505.66                 se.see (H.LL)    2901.09

Results indicated that ST9 and ST10 were significantly different (p = .010). The RT of ST10
was significantly longer than that of ST9, which may be attributable to the pitch pattern as the
segmental information was the same.
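
The Bonferroni correction used for these pairwise comparisons can be sketched as follows. With k stimulus types there are k(k - 1)/2 pairwise comparisons, and each unadjusted p value is multiplied by that number (capped at 1) before it is compared with alpha. This is a minimal sketch of the general procedure, not the software routine used in the study; the p values in the example are hypothetical and are not taken from the data.

```python
from math import comb


def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p, significant?) pairs under the Bonferroni correction."""
    m = len(p_values)  # number of comparisons in the family
    return [(min(p * m, 1.0), p * m < alpha) for p in p_values]


# Number of pairwise comparisons for three and for six stimulus types
print(comb(3, 2), comb(6, 2))  # 3 15

# Hypothetical unadjusted p values for the three comparisons among three stimulus types
for adjusted_p, significant in bonferroni([0.004, 0.020, 0.300]):
    print(round(adjusted_p, 3), significant)
```

The correction becomes considerably stricter as the number of stimulus types grows: with six types, each unadjusted p value is multiplied by 15 rather than by 3.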

Comparing RT in Posttest and TG2 (Novel Talker): Perception RT scores for the posttest and the
TG2 were compared using a mixed ANOVA. Following previous analyses, the tokens used in
the TG2 were divided into three categories (Group I, II and III) as shown in Figure 34. For each
category, independent variables were test (2; posttest, TG2), group type (2; AV and A-only), and
stimulus type (6); the dependent variable was perception RT. Regarding stimulus type in Group
I (CVV.CVV), the results of a mixed ANOVA indicated a significant main effect of test, FTest(1,
30) = 46.802, p < .001, ηp² = .609; however, stimulus type, FType(5, 150) = .320, p = .901, and
group type, FGroup(1, 30) = 2.097, p = .158, were not significant. None of the interactions was
significant. The mean RT of the posttest was 2658.32 milliseconds; the mean RT of TG2 was
2943.43 milliseconds. Therefore, the RT was lengthened from the posttest to TG2.
Regarding stimulus type in Group II (CVV.CV), the results of a mixed ANOVA
indicated significant main effects of test, FTest(1, 30) = 10.800, p = .003, ηp² = .265, and
stimulus type, FType(5, 150) = 3.310, p = .007, ηp² = .099; however, group type was not
significant, FGroup(1, 30) = .440, p = .512. None of the interactions was significant. The mean
RT of the posttest was 3125.75 milliseconds; the mean RT of TG2 was 2669.20 milliseconds.
Therefore, the RT was shortened from the posttest to TG2. In order to locate where the
differences existed among the six stimulus types, pairwise comparisons with Bonferroni
correction were performed. Mean RT scores of each stimulus type are tabulated in Table 65
below. RTs for both ST7 and ST10 were faster than ST11; however, the pairwise comparisons
could not locate significant differences.


Table 65: Mean perception RT of the six tokens in Group II (CVV.CV) in posttest and TG2
comparison

Stimulus Type (ST)   Tokens            Posttest mean RT (ms)   TG2 mean RT (ms)
7                    kaa.ka (LH.H)     3013.28                 2382.78
8                    kuu.ku (LH.H)     3315.13                 3019.44
9                    suu.su (LH.H)     3396.16                 3053.59
10                   kaa.ka (HL.L)     3253.22                 2171.75
11                   kuu.ku (HL.L)     2618.78                 2599.19
12                   suu.su (HL.L)     3157.94                 2788.44

Regarding stimulus type in Group III (CV.CVV), the results of a mixed ANOVA
indicated a significant main effect of test, FTest(1, 30) = 18.760, p < .001, ηp² = .385; however,
stimulus type, FType(5, 150) = 1.038, p = .397, and group type, FGroup(1, 30) = 1.320, p = .260,
were not significant. None of the interactions was significant. The mean RT of the posttest was
2789.16 milliseconds; the mean RT of TG2 was 3014.15 milliseconds. Therefore, the RT was
lengthened from the posttest to TG2.
In conclusion, as a result of comparing the RTs in pretest and posttest with the two TGs,
it was found that the learners generally demonstrated faster RTs in TGs, compared to the pretest
and posttest. Factors such as pitch patterns, vowel types, and preceding consonants affected
perception accuracy and RT. However, there were not many meaningful differences between the
two training groups (AV, A-only).


CHAPTER 4: DISCUSSION AND CONCLUSION

In this study, factors influencing L2 learners’ perception, response latency, and
production of vowel duration in Japanese were explored (Experiment 1). In addition, the
efficacy of focused perceptual training on vowel duration and its influence on production were
examined (Experiment 2). In this chapter, findings of Experiments 1 and 2 are discussed based on
the research questions proposed for this study.

Factors Affecting Perception and Production of Vowel Duration in L2 Japanese (RQ1)
Experiment 1 examined whether preceding consonant, type of vowel, and pitch pattern
for perception or token type for production had any influence on the production and perception
of vowel duration in L2 Japanese. It was found that vowel type and token type significantly
affected correct production of vowel duration. In general, the vowel /a/ had higher accuracy than
the vowel /u/. Also, the CVV.CV token type had higher accuracy than the CV.CVV type as well
as the CV.CV type. The error analysis of the token types showed that the learners had
difficulties correctly producing vowel duration in the final syllable. There was an interaction
between the preceding consonant and token type, which suggested that the CV.CV token with a
stop consonant (/k/) had higher accuracy than that with a fricative consonant (/s/).
For the perception accuracy, the tokens used in this study were divided into four groups:
(I) CVV.CVV, (II) CVV.CV, (III) CV.CVV, and (IV) CV.CV. For the tokens in Group I, it was
found that pitch pattern affected perception accuracy; the LH.HH pattern had higher accuracy
than LH.HL and HL.LL. For the tokens in Group II, it was found that preceding consonant,
pitch pattern, and vowel type all affected perception accuracy; generally, a stop (/k/) and
a low vowel (/a/) revealed higher accuracy than a fricative (/s/) and a high vowel (/u/),
respectively. In addition, the LH.H pitch pattern showed higher accuracy than the HL.L pattern.
There was also an interaction between vowel type and pitch pattern; with the LH.H pitch, the
vowel /a/ revealed higher accuracy than the vowel /u/. Regarding the tokens in Group III, vowel
type and pitch pattern affected perception accuracy. The low vowel /a/ revealed higher accuracy
than the high vowel /u/. Also, the L.HH pitch pattern showed higher accuracy than the H.LL
pattern. There was an interaction among preceding consonant, vowel type, and pitch pattern;
with the LH.H pitch, a combination of a consonant and vowel /ka/ showed higher accuracy than
/ku/, /sa/, and /su/. Finally, for the tokens in Group IV, it was found that preceding consonant
affected perception accuracy; tokens with a fricative /s/ showed higher accuracy than tokens with
a stop /k/. Also, there was an interaction between vowel type and pitch pattern; with the vowel
/a/, the H.L pitch showed higher accuracy than the L.H pitch.
Based on these findings regarding the pitch pattern, it was easier for the learners to
correctly identify the vowel duration with the LH pitch in the first syllable and with the HH pitch
in the second syllable. This finding is compatible with Minagawa (1997) who found that L2
learners including NSs of English more accurately identified long vowels with the HH pitch
pattern than with the LL pitch pattern. The learners in the current study were all NSs of English;
therefore, the higher pitch in word-final position may have been more perceptually salient. In
addition, accented vowels, which can be perceptually salient, have higher pitch and are
lengthened in a stress-timed language like English (Pennington, 1996). Therefore, it is easier for
English NSs to correctly perceive the length of long vowels if high pitch is assigned. Also, the
preference of high pitch on long vowels could demonstrate that the L2 learners were using
English prosodic preferences when processing Japanese speech input, by associating high pitch
with an accented vowel that has longer duration. Furthermore, in English, the first syllable of
many nouns and adjectives gets an accent (e.g., FA.ther) when the word does not have any prefix
(Kubozono & Ohta, 1998). Therefore, the learners may have had higher accuracy with high
pitch on the first syllable (i.e., CVV.CV or LH.H) versus the others (i.e., CVV.CV or HL.L).
Next, the overall findings of this study suggest that the L2 learners’ perception tends to
be continuous while NSs demonstrate categorical perception (Fujisaki, Nakamura, & Imoto,
1973, cited in Toda, 2003). As Figure 28 shows, the consonant /k/ in kaka is 1.5
times as long as the one in kaaka. As the error analysis of the CV.CV token in Figure 23 and
Figure 24 showed, the slightly longer duration of /k/ may have confused the learners, who
perceived it as a long consonant (i.e., geminate).
In addition, regarding the vowel type, it was easier to identify and produce vowel
duration accurately when tokens contained a low vowel /a/, compared to a high vowel /u/. In
Tokyo Japanese, the low vowel /a/ is considered the longest vowel (Shibatani,
1990) and the high back vowel /u/ is the shortest. Thus, the inherent length of the vowel might
have been influential when the learners identified vowel duration. Next, the L2 learners had
difficulty correctly producing and perceiving vowel length in the word-final position
(i.e., the second syllable in this study), which supports what Koguma (2000) reported. The
word-final position can be a very unstable position perceptually. Mutsukawa (2006) reported
that Japanese long vowels in the word-final position (e.g., konpyuutaa ‘computer’) are often
shortened (e.g., konpyuuta) especially in representing loanwords in Japanese. Finally, the
interaction between a pitch pattern and a preceding consonant and/or vowel suggested that
perception accuracy was influenced by a combination of the word-level and prosodic level
factors.


Regarding the perception latency, the tokens used in this study were also divided into
four groups: (I) CVV.CVV, (II) CVV.CV, (III) CV.CVV, and (IV) CV.CV. For the tokens in
Group I, it was found that pitch pattern affected response time; the LH.HH pattern had shorter
RT than LH.HL. For the tokens in Group II, it was found that pitch pattern influenced RT; LH.H
had shorter RT than HL.L. In addition, there was an interaction between a preceding consonant
and pitch pattern; with the HL.L pitch, a stop /k/ had shorter RT than a fricative /s/. Regarding
the tokens in Group III, vowel type and pitch pattern affected perception RT. The low vowel /a/
revealed shorter RT than the high vowel /u/. Also, the H.LL pitch pattern revealed shorter RT than
LH.H and LH.H pitch patterns. An interaction between vowel type and pitch pattern was found,
and it suggested that the CV combination /ka/ revealed shorter RT than /ku/ with the LH.L pitch
pattern. Finally, for the tokens in Group IV, it was found that preceding consonant affected
perception latency; tokens with a fricative /s/ showed shorter RT than tokens with a stop /k/.
Also, there was an interaction between vowel type and pitch pattern; a combination of consonant
and vowel /su/ revealed shorter RT than /sa/ with the H.L pitch.
Based on these findings, regarding the pitch pattern, there was a tendency for the token
ending with the HH pitch to show a shorter RT. In addition, the token with a stop /k/ and/or a
low vowel /a/ revealed shorter RT. However, as the interactions between pitch pattern and
consonant and/or vowel show, the three factors influenced perception RT together.

Effectiveness of Perceptual Training on Accuracy and RT (RQ 2)
Experiment 2 examined whether focused perception training was effective for the
acquisition of vowel duration. In order to test the development of perception accuracy, the
accuracy scores before and after training were compared. It was found that the two groups who
received the training, both auditory-visual with waveform input and auditory-only, improved in
perception accuracy; the two groups demonstrated higher accuracy in identifying vowel duration
after the training. On the other hand, the group that did not receive the training, which served as
a control, did not improve their identification accuracy. Thus, it was concluded that the training
was effective in enhancing correct perception of vowel length. This finding regarding the
benefits of training on accurate perception of L2 contrasts confirmed what Bradlow and Pisoni
(1999); Hardison (2003); Hirata and Kelly (2010); Lively, Logan, and Pisoni (1993); Logan,
Lively, and Pisoni (1991); Motohashi (2007); Motohashi-Saigo and Hardison (2009) had found.
Regarding the influence of preceding consonant, vowel type, and pitch pattern, the results
of the pretest and posttest comparison showed very mixed results. Regarding the tokens in
Group I (the tokens with the CVV.CVV structure), among those with the LH.HH pitch, kaa.kaa
showed lower accuracy than kuu.kuu and suu.suu; among the tokens with the HL.LL pitch,
saa.saa showed lower accuracy than suu.suu. Thus, the learners demonstrated higher accuracy
with the tokens with the vowel /u/ than ones with the vowel /a/. This finding did not support the
results in Experiment 1 which showed that the vowel /a/, with a potentially longer duration,
demonstrated higher accuracy. On the other hand, regarding the tokens in Group II (tokens with
the CVV.CV structure), it was found that (1) perception accuracy was higher for those with
LH.H pitch than ones with HL.L pitch; (2) the vowel /a/ showed higher accuracy than the vowel
/u/ among the tokens with HL.L pitch. In Group III, the results showed that the tokens with
/sa/ had a tendency to have higher accuracy than ones with /ka/. Although the data from Group I
showed slightly different patterns, generally, the learners demonstrated higher accuracy when
they identified vowel duration for tokens with the vowel /a/ than ones with the vowel /u/.


Although perception accuracy showed improvement after perceptual training, both of
the training groups showed that perception latency did not decrease. In other words, except for a
few examples such as kuu.ku with the HL.L pitch, RTs to identify vowel duration generally
became longer. In particular, the RT of sa.saa with the H.LL pitch significantly lengthened. It
was expected that the learners would demonstrate faster RT after the training. It is possible that
as a result of receiving the training, the learners who had not been aware of or confident in their
knowledge of the distinction noticed the difference and their processing time increased as they
considered their response options.

Effectiveness of Training per Group (RQ 2)
The perception accuracy and response latency data obtained in the training sessions for
each training group were analyzed in order to examine whether there was a development of
perception accuracy and response latency and effects of other factors such as talker, pitch pattern,
preceding consonant, and vowel type. For perception accuracy, the AV and A-only training
groups demonstrated similar patterns. First, it was found that there were no significant
differences in perception accuracy in the first and second week, except for tokens with the
CV.CV structure for the AV group and ones with the CVV.CVV structure for the A-only group.
Also, there were effects of talker on the perception accuracy. For example, tokens with the
CV.CVV structure produced by Talker 5, a female talker, were easier than those produced by the other talkers for
the AV group. In addition, tokens with the CVV.CV and CVV.CVV structures produced by
Talker 4 were more difficult. There was an interaction between talker and stimulus type. For the
AV group, tokens such as saa.sa with HL.L pitch as well as kuu.kuu with LH.HH pitch produced
by Talker 4, kaa.kaa with LH.HH pitch produced by Talker 3, and suu.suu with LH.HL pitch
produced by Talker 2 were more challenging for the learners. For the A-only group, tokens such
as kuu.kuu with LH.HH pitch, suu.suu with LH.HL pitch, saa.sa with HL.L pitch, and sa.saa
with L.HL pitch produced by Talker 4 were more challenging for the learners. Finally, it was
found that tokens with /sa/ were challenging in general for the learners. Perception accuracy for
saa.sa with HL.L pitch as well as sa.saa with LH.H pitch was lower than the other tokens. The
reasons why the tokens with /sa/ were more difficult than ones with /ku/ or /su/ could be related
to the devoicing in Japanese. In general, the vowels between two voiceless consonants such as
/s/ and /k/ are devoiced or perceptually lost when they are not accented. By losing the vowel
length by devoicing, the contrast between a devoiced short vowel which does not exist
perceptually and a long vowel could become clearer, which resulted in higher perception
accuracy for tokens with /ku/ and /su/. Also, as the waveform displays in Figure 28 show, a
fricative /s/ has a noise before the vowel, and sonority difference is clearer with a stop /k/ than a
fricative /s/ (Hardison & Motohashi-Saigo, 2010).
The perception accuracy in each training session illustrated in Figure 40 suggests the
arbitrary nature of the timing of a posttest. The studies by Logan et al. (1993) and Hardison (2003)
administered perceptual training for three weeks. However, the training period in the current
study was two weeks. It is not known how posttest results would look if the test had been done
after Session 7 or after an additional session. The results show that the learners struggled at least
in the first three sessions. Therefore, it is probably difficult to see the facilitative effects of
training if the training period is very short.
Regarding the response latency, the AV and A-only groups showed slightly different
patterns. First, for the AV group, it was found that response latency significantly shortened in
the second week, compared to the first week. In addition, there were effects of talker. For
example, for the tokens with the CVV.CVV structure, suu.suu with LH.HL pitch produced by
Talker 1 showed longer RT than kaa.kaa with LH.HH pitch; for the tokens with the CVV.CV
structure, the RT for the tokens produced by Talker 4 was faster than those produced by Talker 2
and Talker 3; and for the tokens with the CV.CVV structures, RT for the tokens produced by
Talker 1 was longer. Finally, the effects of pitch pattern, preceding consonant, and vowel type
showed mixed results; therefore, it was difficult to draw a clear conclusion. However, there was
a tendency for tokens with /k/ to have a faster RT than ones with /s/.
On the other hand, for the A-only group, it was found that the response latency was not
always significantly shortened in the second week, compared to the first week. For example, the
RTs of tokens with the CVV.CVV and CV.CV structures became significantly faster in the
second week; however, the same pattern was not found for the CVV.CV and CV.CVV structures.
Second, there were effects of talker. For example, RTs for the tokens with the CVV.CVV and
CV.CV structure produced by Talker 1 significantly shortened in the second week; and RTs for
the tokens with CV.CV structures produced by Talker 2 were faster than those produced by
Talker 1. In addition, because of the interaction between stimulus type and talker, suu.su
produced by Talker 4 revealed shorter RTs while suu.suu produced by Talker 1 and saa.saa
produced by Talker 5 revealed longer RTs. Finally, similar to the AV group, the effects of the
pitch pattern, preceding consonant, and vowel type showed mixed results; therefore, it was
difficult to draw a clear conclusion. However, there was a tendency for tokens with /k/ to have
faster RTs than ones with /s/.
The data in the training strongly suggested that talker’s voice had effects on the L2
learners’ perception accuracy and RT, even though the study contained only four different
talkers in the training sessions. Generally, the L2 learners revealed higher accuracy for the
female talkers than the male talkers. In addition, between the two male talkers (Talker 3 and 4),
Talker 4 had lower
accuracy. Bradlow, Torretta, and Pisoni (1996) reported six important factors that make a voice
intelligible in American English: 1) female, 2) expanded vowel space, 3) precise articulation for
the point vowels (i.e., /i/, /a/, /u/), 4) low degree of phonetic reduction, 5) regular rhythm in
speech production, and 6) use of a relatively wide range in pitch at the sentence level. This may
explain why the two female talkers had relatively higher perception accuracy. Also, an
examination of Talker 4’s voice showed that his pitch range was lower than that of the other
male talker and that his voice did not show a wide pitch range.

Comparison between the Two Types of Training (RQ 3)
Although the training was beneficial to improve perception accuracy, the present study
did not find significant overall differences in the modality of the training on perception accuracy
or perception latency. Regardless of the types of perceptual training the learner took (i.e., AV or
A-only), significant improvement occurred. There was only one set of data, tokens with the
CVV.CV structure, which showed that the two groups were significantly different. For that set,
the AV group had significantly higher accuracy than the A-only group.
Although the overall efficacy of the training type was not found, the interaction between
the two points in time (i.e., before and after the training) and the training modality on perception
accuracy suggested that the AV training group’s rate of improvement was greater than the
A-only group’s. This finding partially supported Hardison (2003), Motohashi (2007), and
Motohashi-Saigo and Hardison (2009), where perceptual training with bimodal input was more
effective than with unimodal input. Visual cues, including articulatory gestures involved in
producing /l/ and /r/ as well as a visual display of durational contrasts, can explicitly inform
learners about the difference between the two contrasts. On the other hand, the results of the
current study showed that the learners were able to be trained to correctly identify vowel
duration without the additional information; the focused training with only the auditory input
facilitated the correct identification. This may be because the waveform displays do not always show a
clear distinction between a long and short vowel. As Figure 28 shows, the waveform with a
preceding consonant /k/ shows a clear distinction, but not with /s/. Thus, the learners need to pay
more attention to the auditory input with less clear visual cues. However, as the training data in
Figure 41 show, the AV group revealed higher accuracy for Talker 4. Thus, the AV training
could facilitate correct identification of vowel duration for a challenging context such as a
difficult voice/talker.

Transfer to Production (RQ4)
Previous literature suggested the effects of perceptual training can transfer to production
if the training is successful. This study found that overall production accuracy significantly
improved after training for both of the training groups. Since the participants did not receive any
specific training or practice on how to pronounce the words with short and long vowels, it was
concluded that the effects of the perceptual training transferred to production. While the
development of correctly producing vowel duration was observed, there was no effect of training
modality or vowel type. Regarding the token type, there were significant differences on
production accuracy. The tokens with CVV.CV, CV.CVV, and CV.CV structures significantly
improved accuracy from the pretest to posttest; however, the CVV.CVV tokens did not show
significant improvement. The CVV.CVV tokens were more difficult than the other types: the
error analysis revealed that learners made more errors (i.e., the long vowel on the second
syllable was shortened) for this token type than for the others.

Generalizability of the Training Effect on Perception Accuracy and RT (RQ5)
As Logan et al. (1991) argued, determining the effectiveness of training requires examining
whether its effects extend to identification of the L2 contrast in new tokens. Therefore, two tests
of generalization were conducted: one with novel tokens produced by a familiar voice (TG1) and
one with familiar tokens produced by a new voice (TG2). First, perception accuracy was
examined. A comparison of the pretest data with the two TGs showed that, overall, the learners
demonstrated significantly higher accuracy on the two TGs. This confirmed that some
development had occurred after the perceptual training. The only exception was the tokens with
the CVV.CV structure in TG1, for which there were no significant differences in perception
accuracy between the pretest and TG1. The token se.see with H.LL pitch, which contained a novel
vowel, was more difficult than ta.taa with L.HH and H.LL pitch, which contained a novel
consonant. This may suggest that generalization to a new vowel was more difficult than to a new
consonant. It is also the case that /t/ and /k/ are both voiceless stops and have shown greater
similarity in perception patterns (e.g., Hardison & Motohashi-Saigo, 2010).
Next, a comparison of the posttest data with the two TGs showed that, overall, the learners
demonstrated comparable performance. In other words, there was no significant difference
between the posttest and the two TGs, except for the tokens with the CVV.CVV structure, which
showed higher accuracy in TG1 than on the posttest. Regarding stimulus type, the accuracy scores
of see.se and se.see were significantly lower in TG1; therefore, the benefit of the training did not
generalize to those two types of tokens containing the novel vowel /e/.
Regarding the effects of training modality on perception accuracy, in the comparison between
the pretest and TG2 the AV training was more effective than the A-only training for the
development of accuracy on tokens with the CV.CVV structure. However, there were no other
meaningful differences between the AV and A-only groups.
Next, response latency was examined. A comparison of the pretest RT with the two TGs showed
that the learners generally demonstrated significantly shorter response latency on the two TGs,
although some tokens showed the opposite pattern. A comparison of the posttest RT with the two
TGs likewise showed that the learners generally demonstrated significantly shorter response
latency on the two TGs. These findings indicate that the learners were able to respond both
accurately and quickly to novel stimuli and a new voice; however, it must also be acknowledged
that the RTs were significantly longer from the pretest to the posttest. In addition, there were no
meaningful differences in RTs between the AV and A-only groups. Based on these results, it was
concluded that the learners’ response time in correctly identifying vowel duration improved and
that the training effects extended to the novel tokens as well as the novel voice, regardless of
training type.

Generalizability of the Training Effect to Production
In addition to generalizability in the learners’ perception, the study examined whether the
training effects on production accuracy could be generalized to novel tokens. To test this, a test
of generalization (TG) containing novel tokens was administered, and its data were compared
with the pretest and posttest data. A comparison of the pretest data with the TG showed that the
learners demonstrated significantly higher accuracy on the TG. In the comparison between /ka/
and /ta/, where generalization to a novel consonant /t/ was examined, as well as between /sa/ and
/se/, where generalization to a novel vowel /e/ was examined, the learners demonstrated higher
accuracy in the TG. Therefore, it was concluded that there was development from pretest to
posttest. Token type was also significant: the CVV.CV tokens had higher accuracy than the
CV.CVV tokens. No effects of training modality were found.
Next, a comparison of the posttest data with the TG showed that the learners demonstrated
comparable performance. In other words, there was no significant difference between the
posttest and the TG. Training modality was not significant, but token type was. A comparison of
the four token types showed that the CVV.CVV type was significantly more difficult than the
CVV.CV type. Thus, it was concluded that the learners’ ability to correctly produce vowel
duration developed and that the training effects extended to the novel tokens.

Conclusion
In the present study, Experiment 1 explored a range of factors potentially affecting
perception and production of vowel duration by L1 English learners of L2 Japanese. Based on
the findings, Experiment 2 investigated the factors affecting the efficacy of training to increase
learners’ identification accuracy of vowel duration. These factors included modality of training
(AV vs. A-only), preceding consonant, vowel type, talker’s voice, and pitch pattern. Several of
these factors had been the focus of some previous training studies.

In the few studies that have explored different modalities of learning, significant
improvement was found for both AV and A-only training, with a significant advantage for AV
training (Hardison, 2003; Motohashi-Saigo & Hardison, 2009). In the current study, although the
AV and A-only training groups began at comparable levels and both showed significant
improvement, the greater improvement in raw scores for the AV group compared to the A-only
group was not statistically significant. Previous research also demonstrated that L2 perceptual
identification accuracy is influenced by the position of a target sound in a word (e.g., for AE /r/
and /l/), a talker’s voice (e.g., Bradlow et al., 1997; Hardison, 2003; Lively et al., 1993), and an
adjacent vowel (e.g., for AE /r/ and /l/, Hardison, 2003; for Japanese geminates, Motohashi-Saigo
& Hardison, 2009). To this knowledge of contextual influence, the current study adds the
significant effects of the prosodic level of speech in the form of pitch pattern, which also
encompasses the issue of syllabic position of the morae in a token (i.e., in the first and/or second
syllable). Given the significant complex interactions found in the earlier studies, it is not
surprising that the interactions in the current study showed a similar level of complexity in the
L2 learners’ perceptual performance.
Such perceptual variability is best captured by exemplar-based models of learning, in
which the learners’ stages of L2 perceptual development involve the evaluation of input based on
context- and talker-dependent perceptual categories. The influence of context on perceptual
identification must now be understood more broadly, at least for some target languages, as
involving both the segmental and prosodic levels of speech.
In keeping with the hallmarks of successful training established by the past two decades
of research (e.g., Hardison, 2012), the current study has also demonstrated the learners’ ability to
generalize performance improvement from training to novel stimuli and a new voice, and to
transfer an improved perception skill to production in the absence of explicit production training.
Among the somewhat unexpected findings of Experiment 2 is the increase in response time for
the posttest compared to the pretest stimuli. One might hypothesize that greater accuracy as a
result of training would be accompanied by faster response time; however, the reverse finding
may have been due to the learners’ increased awareness of the range of stimulus cues following
training, and their attempts to attend to several dimensions of the speech signal simultaneously.
From a pedagogical standpoint, teachers who wish to focus learner attention on specific features
of the speech event may find that visual displays of waveforms (for segmental duration) and pitch
contours are helpful in the classroom or, for some learners, as self-study aids outside of class
(e.g., Chun, Hardison, & Pennington, 2008; Motohashi-Saigo & Hardison, 2009).
There are a few limitations in the current study. First, the original design called for a
consideration of overall L2 proficiency as a factor. Other studies (Hardison & Motohashi-Saigo,
2010; Toda, 1998) found an effect of L2 proficiency with regard to geminate perception.
Although participants in the current study were recruited from a range of course levels, it became
apparent that exposure to instructed Japanese was not a sound basis for estimating proficiency. A
comparable number of participants from each year of the course were disqualified from the
training study based on ceiling effects in their accuracy in identifying vowel duration.
A review of the literature does suggest that, in general, L2 learners of Japanese have less
difficulty perceiving vowel duration than consonant duration, and the only available,
albeit weak, measure of proficiency (i.e., semester of study) was not valid for the research
objectives. Second, the current study focused on pseudo words in order to avoid the influence of
vocabulary size and neighborhood density. Although this served the objectives of the current
study and its range of learners well, the findings may not be as generalizable to the perception
of real words in the natural language environment. Third, the study focused on words produced
in isolation. Different results might be obtained for words produced in context;
however, the effect of connected speech on the perception of segmental duration is not clear. For
example, while Motohashi-Saigo and Hardison (2009) found no significant effect of condition
(i.e., isolated word vs. carrier sentence context), a related study found significantly lower
identification accuracy for words produced in a carrier sentence versus those produced in
isolation (Hardison & Motohashi-Saigo, 2010). Finally, to keep the stimulus set to a manageable
size in the training study, not every consonant-vowel combination was used for every pitch
pattern that can occur in the language. Future research could expand on this aspect.

APPENDICES

Appendix A: List of target stimuli for production test in Experiment 1

Table 66: Target stimuli in production test
Stimuli
kaakaa
kaaka
kakaa
kaka
saasaa
saasa
sasaa
sasa
kuukuu
kuuku
kukuu
kuku
suusuu
suusu
susuu
susu

Appendix B: List of practice stimuli for production test in Experiments 1 and 2

Table 67: Practice stimuli in production test
Stimuli
noono
nono
rooro
roro

Appendix C: List of target stimuli for perception test in Experiment 1

Table 68: Target stimuli in perception test in Experiment 1
Stimuli   Pitch    Meaning
kaakaa    LH.HH    ----
kaakaa    LH.HL    ----
kaakaa    HL.LL    ----
kaaka     LH.H     ----
kaaka     HL.L     ----
kakaa     L.HH     ----
kakaa     L.HL     ----
kakaa     H.LL     ----
kaka      L.H      ----
kaka      H.L      flowers and fruits
saasaa    LH.HH    ----
saasaa    LH.HL    ----
saasaa    HL.LL    ----
saasa     LH.H     ----
saasa     HL.L     ----
sasaa     L.HH     ----
sasaa     L.HL     ----
sasaa     H.LL     ----
sasa      L.H      sake
sasa      H.L      bamboo leaves
kuukuu    LH.HH    ----
kuukuu    LH.HL    ----
kuukuu    HL.LL    ----
kuuku     LH.H     ----
kuuku     HL.L     ----
kukuu     L.HH     ----
kukuu     L.HL     ----
kukuu     H.LL     ----
kuku      L.H      cane
kuku      H.L      randomness
suusuu    LH.HH    ----
suusuu    LH.HL    ----
suusuu    HL.LL    ----
suusu     LH.H     ----
suusu     HL.L     ----
susuu     L.HH     ----
susuu     L.HL     ----
susuu     H.LL     ----
susu      L.H      ----
susu      H.L      dust

A dot shown with each pitch pattern represents a syllable boundary.
It does not separate morae.
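
The 40 items in Table 68 correspond to a full crossing of the two consonants (/k, s/), the two vowels (/a, u/), and the 10 pitch patterns. The following Python sketch only illustrates that crossing; it is not the procedure used to prepare the materials. The names PITCH_BY_STRUCTURE, build_word, and build_stimuli are hypothetical, and the spelling convention (doubling the vowel letter for a long vowel) is assumed from the table.

```python
from itertools import product

# Pitch patterns listed in Table 68, grouped by token structure.
# A dot marks the syllable boundary; each H or L covers one mora.
PITCH_BY_STRUCTURE = {
    "CVV.CVV": ["LH.HH", "LH.HL", "HL.LL"],
    "CVV.CV": ["LH.H", "HL.L"],
    "CV.CVV": ["L.HH", "L.HL", "H.LL"],
    "CV.CV": ["L.H", "H.L"],
}


def build_word(consonant, vowel, structure):
    """Spell a bisyllabic token, doubling the vowel letter for a long vowel."""
    syllables = []
    for shape in structure.split("."):
        length = shape.count("V")  # 1 = short vowel, 2 = long vowel
        syllables.append(consonant + vowel * length)
    return "".join(syllables)


def build_stimuli():
    """Cross each consonant-vowel pair with every structure and pitch pattern."""
    items = []
    # The iteration order (ka, sa, ku, su) follows the order of the table.
    for vowel, consonant in product("au", "ks"):
        for structure, pitches in PITCH_BY_STRUCTURE.items():
            word = build_word(consonant, vowel, structure)
            items.extend((word, pitch) for pitch in pitches)
    return items


stimuli = build_stimuli()
print(len(stimuli))  # 40
print(stimuli[:3])
```

Dropping the pitch labels and keeping each spelling once leaves the 16 words listed in Table 66.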

Appendix D: List of practice stimuli for perception test in Experiments 1 and 2

Table 69: Practice stimuli in perception test
Stimuli   Pitch
noono     LH.H
nono      H.L
rooro     HL.L
roro      L.H

Appendix E: List of target stimuli for perception tests in Experiment 2

Table 70: Target stimuli in perception test in Experiment 2
Stimuli   Pitch
kaakaa    LH.HL
kaaka     HL.L
kaaka     LH.H
kakaa     L.HL
saasaa    LH.HH
saasaa    HL.LL
sasaa     L.HH
sasaa     H.LL
kuukuu    LH.HL
kuukuu    HL.LL
kuuku     LH.H
kuuku     HL.L
kukuu     H.LL
suusuu    LH.HH
suusu     LH.H
suusu     HL.L
susuu     L.HL
susuu     H.LL

Appendix F: List of stimuli for perception training in Experiment 2

Table 71: Stimuli in perception training
Stimuli   Pitch    Meaning
kaakaa    LH.HH    ----
kaakaa    HL.LL    ----
kakaa     H.LL     ----
kakaa     L.HH     ----
kaka      L.H      ----
kaka      H.L      flowers and fruits
saasaa    LH.HL    ----
saasa     LH.H     ----
saasa     HL.L     ----
sasaa     L.HL     ----
sasa      L.H      sake
sasa      H.L      bamboo leaves
kuukuu    LH.HH    ----
kukuu     L.HH     ----
kukuu     L.HL     ----
kuku      L.H      cane
kuku      H.L      randomness
suusuu    LH.HL    ----
suusuu    HL.LL    ----
susuu     L.HH     ----
susu      L.H      ----
susu      H.L      dust

Appendix G: List of practice stimuli for training sessions

Table 72: Practice stimuli in training
Stimuli   Pitch
noono     HL.L
nonoo     L.HH
rooro     LH.H
roro      H.L

Appendix H: List of target stimuli for production test in TG1 in Experiment 2

Table 73: Target stimuli in production test in TG1
Stimuli
seesee
seese
sesee
sese
taataa
taata
tataa
tata

Appendix I: List of target stimuli for perception test in TG1 in Experiment 2

Table 74: Target stimuli in perception test in TG1
Stimuli   Pitch
seesee    LH.HH
seesee    HL.LL
seesee    LH.HL
seese     LH.H
seese     HL.L
sesee     L.HH
sesee     L.HL
sesee     H.LL
sese      L.H
sese      H.L
taataa    LH.HH
taataa    LH.HL
taataa    HL.LL
taata     LH.H
taata     HL.L
tataa     L.HH
tataa     L.HL
tataa     H.LL
tata      L.H
tata      H.L

REFERENCES

Asano, M. (2005). Boundary of sounds. In M. Minami (Ed.), Linguistics and Japanese
Language Education IV (pp. 283 – 294). Tokyo, Japan: Kuroshio Publishers.
Aoyama, K., Flege, J., Guion, S., Akahane-Yamada, R., & Yamada, T. (2004). Perceived
phonetic dissimilarity and L2 speech learning: the case of Japanese /r/ and English /l/ and
/r/. Journal of Phonetics, 32, 233 – 250.
Archibald, J. (2005). Second language phonology as redeployment of L1 phonological
knowledge. Canadian Journal of Linguistics, 50, 284 – 315.
Bohn, O.-S. (1995). Cross-language speech perception in adults: First language transfer doesn’t
tell it all. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in
cross-language research (pp. 279 – 304). Timonium, MD: York Press.
Borden, G., Gerber, A., & Milsark, G. (1983). Production and perception of the /r/ - /l/ contrast
in Korean adults learning English. Language Learning, 33, 499 – 526.
Bradlow, A. R. & Pisoni, D. B. (1999). Recognition of spoken words by native and non-native
listeners: Talker-, listener-, and item-related factors. Journal of the Acoustical Society of
America, 106, 2074 – 2085.
Bradlow, A. R., Torretta, G. M., & Pisoni, D. B. (1996). Intelligibility of normal speech I: Global
and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20,
255-272.
Bradlow, A., Pisoni, D., Akahane-Yamada, R., & Tohkura, Y. (1997). Training Japanese
listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech
production. Journal of the Acoustical Society of America, 101, 2299 – 2310.
Bradlow, A., Akahane-Yamada, R., Pisoni, D. B., & Tohkura, Y. (1999). Training Japanese
listeners to identify English /r/ and /l/: Long-term retention of learning in perception and
production. Perception & Psychophysics, 61, 977 – 985.
Bundgaard-Nielsen, R. L., Best, C. T., & Tyler, M. D. (2011). Vocabulary size matters: The
assimilation of second-language Australian English vowels to first-language Japanese
vowel categories. Applied Psycholinguistics, 32, 51 – 67.
Chun, D. M., Hardison, D. M., & Pennington, M. C. (2008). Technologies for prosody in context:
Past and future of L2 research and practice. In J. H. Edwards & M. Zampini (Eds.),
Phonology and second language acquisition (pp. 323 – 346). Amsterdam: Benjamins.
Enomoto, K. (1992). Interlanguage phonology: the perceptual development of durational
contrasts by English-speaking learners of Japanese. Edinburgh Working Papers in
Applied Linguistics, 3, 25 – 36.
Flege, J. (1995). Second-language speech learning: Theory, findings, and problems. In W.
Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language
research (pp. 229 - 273). Timonium, MD: York Press.
Flege, J., & MacKay, I. (2004). Perceiving vowels in a second language. Studies in Second
Language Acquisition, 26, 1 - 34.
Fujisaki, H. & Sugitou, M. (1977). Onsei no butsuriteki seishitsu [The physical characteristics
of speech]. In Iwanami Kouza Nihongo 5 On’in, pp. 65-105. Tokyo, Iwanami.
Hagiwara, R. E. (1995). Acoustic realization of American /r/ as produced by women and men
(Doctoral dissertation, University of California, Los Angeles). UCLA Working Papers in
Phonetics, 90.
Hardison, D. M. (1999). Bimodal speech perception by native and nonnative speakers of
English: Factors influencing the McGurk effect. Language Learning, 49, 213 – 283.
Hardison, D. M. (2003). Acquisition of second-language speech: Effects of visual cues, context
and talker variability. Applied Psycholinguistics, 24, 495 – 522.
Hardison, D. M. (2005a). Second-language spoken word identification: Effects of perceptual
training, visual cues, and phonetic environment. Applied Psycholinguistics, 26, 579-596.
Hardison, D. M. (2005b). Variability in bimodal spoken language processing by native and
nonnative speakers of English: A closer look at effects of speech style. Speech
Communication, 46, 73 – 93.
Hardison, D. M. (2012). Second language speech perception: A cross-disciplinary perspective
on challenges and accomplishments. In S. Gass & A. Mackey (Eds.), The Routledge
handbook of second language acquisition (pp. 349 – 363). London: Routledge.
Hardison, D. M., & Motohashi-Saigo, M. (2010). Development of perception of second
language Japanese geminates: Role of duration, sonority, and segmentation strategy.
Applied Psycholinguistics, 31, 81 – 99.
Hayes, B. (1989). Compensatory lengthening in moraic phonology. Linguistic Inquiry, 20,
253 – 306.
Hayes, B., Kirchner, R., & Steriade, D. (2004). Phonetically based phonology. NY: Cambridge
University Press.
Hirata, Y. (1990). Perception of geminated stops in Japanese word and sentence levels by
English-speaking learners of Japanese language. Journal of the Phonetic Society of
Japan, 195, 4 – 10.
Hirata, Y. & Kelly, S. (2010). Effects of lips and hands on auditory learning of second-language
speech sounds. Journal of Speech, Language, and Hearing Research, 53, 298-310.
Imai, S., Walley, A., & Flege, J. (2005). Lexical frequency and neighborhood density effects on
the recognition of native and Spanish-accented words by native English and Spanish
listeners. Journal of the Acoustical Society of America, 117, 896 – 907.
Ingram, J. C. L. & Park, S.-G. (1998). Language, context, and speaker effects in the
identification and discrimination of English /r/ and /l/ by Japanese and Korean listeners.
Journal of the Acoustical Society of America, 103, 1161 – 1174.
Ingvalson, E. M., McClelland, J. L., & Holt, L. L. (2011). Predicting native English-like
performance by native Japanese speakers. Journal of Phonetics, 39, 571 – 584.
Jamieson, D. G., & Morosan, D. E. (1986). Training non-native speech contrasts in adults:
Acquisition of the English /ð/ - /θ/ contrast by francophones. Perception & Psychophysics,
40, 205 – 215.
Koguma, R. (2000). Perception of Japanese short and long vowels by English-speaking learners.
Current Report on Japanese-Language Education around the Globe, 10, 43 – 55.
Kubozono, H. (1999a). The sound system of Japanese. Tokyo, Japan: Iwanami.
Kubozono, H. (1999b). Mora and syllable. In N. Tsujimura (Ed.), The handbook of Japanese
linguistics. Malden, MA: Blackwell Publishers.
Kubozono, H., & Ohta, S. (1998). Onin koozoo to akusento [Phonological structures and
accent]. Tokyo, Japan: Kenkyuusha.
Kuhl, P. K., Andruski, J. E., Chistovich, I. A., Chistovich, L. A., Kozhevnikova, E. V., Ryskina,
V. L., Stolyarova, E. I., Sundberg, U. & Lacerda, F. (1997). Cross-language analysis of
phonetic units in language addressed to infants. Science, 277, 684 – 686.
Lively, S. E., Logan, J. S. & Pisoni, D. B. (1993). Training Japanese listeners to identify
English /r/ and /l/. II: The role of phonetic environment and talker variability in learning
new perceptual categories. Journal of the Acoustical Society of America, 94, 1242 –
1255.
Logan, J. S., Lively, S. E., & Pisoni, D. B. (1991). Training Japanese listeners to identify
English /r/ and /l/: A first report. Journal of the Acoustical Society of America, 89, 874 – 886.
Metsala, J. (1997). An examination of word frequency and neighborhood density in the
development of spoken-word recognition. Memory and Cognition, 25, 47 – 56.
McCandliss, B. D., Fiez, J. A, Protopapas, A., & Conway, M. (2002). Success and failure in
teaching the [r] – [l] contrast to Japanese adults: Tests of a Hebbian model of plasticity
and stabilization in spoken language perception. Cognitive, Affective, & Behavioral
Neuroscience, 2, 89 – 108.
Minagawa, Y. (1997). Accent patterns and segment places as a factor for perceiving Japanese
long and short vowels by native speakers of Korean, Thai, Chinese, English, and Spanish.
Proceedings of the Spring Meeting of the Society for Teaching Japanese as a Foreign
Language, 123 – 128.
Morosan, D. E. & Jamieson, D. G. (1989). Evaluation of a technique for training new speech
contrasts: Generalization across voices, but not word-position or task. Journal of Speech
and Hearing Research, 32, 501 – 511.
Motohashi, M. (2007). Acquisition of geminate consonants in Japanese by American English
speakers. Unpublished doctoral dissertation, Michigan State University, Michigan.
Motohashi-Saigo, M., & Hardison, D.M. (2009). Acquisition of L2 Japanese geminates:
training with waveform displays. Language Learning & Technology, 13, 29 – 47.
Mutsukawa, M. (2006). Japanese loanword phonology in optimality theory: The nature of
inputs and the loanword sublexicon. Unpublished doctoral dissertation, Michigan State
University, Michigan.
Nagano-Madsen, Y. (1992). Mora and prosodic coordination: A phonetic study of Japanese,
Eskimo and Yoruba. Lund: Lund University Press.
Ofuka, E. (2003). Perception of a Japanese geminate stop /tt/: the effect of pitch type and
acoustic characteristics of preceding/following vowels. Journal of the Phonetic Society
of Japan, 7, 70 – 76.
Okuno, T. (2009). Factors influencing L2 vowel perception in Japanese: Hyperarticulation,
phonetic environment, and talker. Paper presented at the American Association for Applied
Linguistics Conference, Denver, Colorado, March 2009.
Pennington, M. C. (1996). Phonology in English language teaching. New York: Longman.
Port, R.F., Dalby, J., & O’Dell, M. (1987). Evidence for mora timing in Japanese. Journal of
the Acoustical Society of America, 81, 1574 – 1584.
Price, P.J. (1981). A cross-linguistic study of flaps in Japanese and in American English.
Unpublished doctoral dissertation, University of Pennsylvania.

Sekiyama, K., & Tohkura, Y. (1993). Inter-language differences in the influence of visual cues
in speech perception. Journal of Phonetics, 21, 427 – 444.
Sheldon, A. (1985). The relationship between production and perception of the /r/ - /l/ contrast
in Korean adults learning English: A reply to Borden, Gerber, and Milsark. Language
Learning, 35, 107 – 113.
Sheldon, A., & Strange, W. (1982). The acquisition of /r/ and /l/ by Japanese learners of English:
Evidence that speech production can precede speech perception. Applied
Psycholinguistics, 3, 243 – 261.
Shibatani, M. (1990). The languages of Japan. New York: Cambridge University Press.
Strange, W., & Dittman, S. (1984). Effects of discrimination training on the perception of /r – l/
by Japanese adults learning English. Perception & Psychophysics, 36, 131 – 145.
Takagi, N. (1993). Perception of American English /r/ and /l/ by adult Japanese learners of
English: A unified view. Unpublished doctoral dissertation, University of California, Irvine.
Toda, T. (1998). Nihongo gakushuusha ni yoru sokuon/chooon/hatsuon no chikakuhanchuuka
[Categorical perception of geminates, long vowels, and moraic nasals by learners of Japanese].
Bungee Gengo Kenkyuu, 33, 65 – 82.
Toda, T. (2003). Second language speech perception and production: Acquisition of
phonological contrasts in Japanese. Lanham, MD: University Press of America.
Toda, T. (2009). Nihongo kyooiku niokeru gakushuusha onsee no kenkyuu to onsee kyooiku
jissenn [Research on learners’ speech sounds and practice of speech education in
Japanese language education]. Nihongo Kyooiku, 142, 47 – 57.
Tsujimura, N. (2007). An introduction to Japanese linguistics (2nd ed). Malden, MA: Blackwell
Publishing.
Uther, M., Knoll, M.A., & Burnham, D. (2007). Do you speak E-NG-L-I-SH? A comparison of
foreigner- and infant-directed speech. Speech Communication, 49, 2 – 7.
Ziegler, J., Muneaux, M., & Grainger, J. (2003). Neighborhood effects in auditory word
recognition: Phonological competition and orthographic facilitation. Journal of Memory
and Language, 48, 779 – 793.
