This is to certify that the dissertation entitled "A Comparison of the Intelligibility and Ergonomics of Speech Synthesizers," presented by Laura Jean Kelly, has been accepted towards fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Audiology and Speech Sciences, Michigan State University.

A COMPARISON OF THE INTELLIGIBILITY AND ERGONOMICS OF SPEECH SYNTHESIZERS

By

Laura Jean Kelly

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Audiology and Speech Sciences

1988

ABSTRACT

A Comparison of the Intelligibility and Ergonomics of Speech Synthesizers

Laura Jean Kelly

Five experiments examined the intelligibility of five speech synthesizers and a human control at three points in a communication system. Experiment 1 assessed technical accuracy by performing spectral analysis of six vowels in CVC context generated by the speech sources. Experiments 2 and 3 assessed semantic precision using word recognition as measured by the Speech Intelligibility in Noise test (SPIN) and listening comprehension as measured by multiple choice tests of passage content.
Two additional experiments assessed task performance via completion of oral instructions without (Experiment 4) and with (Experiment 5) options for communication repair. Completion of oral instructions was measured by a Multiple Instructions Test (MIT) consisting of sets of instructions systematically varied in complexity. Experiments 2-5 employed listeners with normal hearing. In Experiments 2 and 3, stimuli were presented in the presence of a twelve-voice babble noise (+8 dB S/B). Experiment 2 (N=12) revealed significant differences as a result of speech source for the SPIN full-list and the high- and low-predictability key word subtests, and for the interaction between speech source and linguistic predictability. Results for Experiment 3 (N=12) revealed significant differences in comprehension scores as a function of speech source. Test completion time did not differ across sources. Stimuli for Experiments 4 and 5 were presented in the presence of a twelve-voice babble noise (+10 dB S/B). In Experiment 4 (N=12), significant differences were seen between mean MIT scores as a function of speech source, complexity levels of the test, and the interaction between source and complexity level. MIT item completion time did not differ significantly as a function of source, but did demonstrate significant differences as a function of complexity level. The interaction between speech source and item completion time was significant. In Experiment 5 (N=14), subjects were allowed to select from seven communication repair options during presentation of the MIT. Significant differences among types of repair options selected were seen.
The interaction between repair option and complexity level also was significant. A comparison of differences between MIT scores in Experiment 4 and Experiment 5 revealed scores in Experiment 5 to be significantly higher. A comparison of the results of Experiment 1 to Experiments 2, 3, and 4 indicated that the correlation between technical accuracy (as measured by summed vowel distances from human archival data), semantic precision, and task performance did not reach the criterion for further analysis.

With love and greatest respect to my parents, James and Joan Kelly, who continue to be my teachers and friends.

ACKNOWLEDGMENTS

I would like to express my deepest appreciation to Michael R. Chial for his unfailing patience and consistent high standards. Beyond technical and professional skills, I take these with me as examples of the best in teaching. Special thanks to Lonnie Smrkovski for arranging access to equipment necessary for completion of this work. In addition, I extend my love and gratitude to my family members, Bill, Sandy, John, James and Patrick, who contributed willing hands and immeasurable support at several critical points in the research process. Along the way to reaching this goal, I also was fortunate to encounter a special group of doctoral students who shared both crisis and celebration. Thank you for enriching the experience with your help and friendship.

TABLE OF CONTENTS

LIST OF TABLES .......................................
LIST OF FIGURES ......................................

CHAPTER
I. BACKGROUND ........................................
    Introduction .....................................
    Text to Speech Conversion ........................
    Speech Perception ................................
    Word Recognition .................................
    Word Recognition in Sentences ....................
    Listening Comprehension ..........................
    Speech Quality ...................................
    Evaluation of Receiver Performance in
      Complex Tasks ..................................
    Purpose of the Study .............................
    Questions ........................................

II. METHODS
    Signal Sources ...................................
    Synthesizer Configurations .......................

CHAPTER                                             Page

II. METHODS (continued)
    Experiment 1: Acoustical Analysis ................ 51
        Methods ...................................... 51
    Experiment 2: Word Recognition ................... 53
        Methods ...................................... 53
    Experiment 3: Listening Comprehension ............ 60
        Methods ...................................... 60
    Experiment 4: Oral Instructions .................. 64
        Methods ...................................... 65
    Experiment 5: Oral Instructions (With
      Communication Repair) .......................... 73
        Methods ...................................... 73

III. RESULTS AND DISCUSSION
    Introduction ..................................... 77
    Experiment 1: Technical Accuracy ................. 78
    Experiment 2: Word Recognition ................... 86
        Description .................................. 86
        Statistical Procedures ....................... 91
        Implications ................................. 97
    Experiment 3: Listening Comprehension ............ 102
        Description .................................. 102
        Statistical Procedures ....................... 106
        Implications ................................. 115
    Experiment 4: Oral Instructions .................. 117
        Description .................................. 117

RESULTS AND DISCUSSION (continued)
        Statistical Procedures ....................... 122
        Implications ................................. 143
    Experiment 5: Oral Instructions (With
      Communication Repair) .......................... 144
        Description .................................. 145
        Statistical Procedures ....................... 145
    Comparison of Experiments
        Experiments 1 versus 2, 3, and 4 ............. 164
            Statistical Procedures ................... 166
            Implications ............................. 166
        Experiment 4 versus 5 ........................ 168
            Description .............................. 168
            Statistical Procedures ................... 168
            Implications ............................. 174
        Implications for System Design ............... 175
    Clinical Applications ............................ 175
    Future Research .................................. 176

    Background ....................................... 179
    Purpose .......................................... 180
    Experimental Design .............................. 180
        Experiment 1: Technical Analysis ............. 181
        Experiment 2: Word Recognition ............... 182
        Experiment 3: Listening Comprehension ........ 183
        Experiment 4: Task performance ...............
        Experiment 5: Task performance (With
          Communication Repair) ......................
        Experiment 4 versus Experiment 5 .............
    Conclusions ......................................

APPENDIX
    A. Informed Consent Release Form and
       Audiological Screening Form ...................
    B. Speech Source Parameters ......................
    C. Calibration of Experimental Tapes and
       Equipment .....................................
    D. Determination of Revised SPIN
       Signal-To-Babble Ratio ........................
    E. Summary of Passages ...........................
    F. Multiple Instructions Test Text and Forms .....

REFERENCES ...........................................
LIST OF TABLES

Table                                               Page

1.1. Summary of speech synthesizers evaluated, stimuli employed, and assessment goals of the research reviewed ................ 8

1.2. MRT scores (%-correct) at three speech-to-noise ratios as a function of speech source ................ 18

1.3. MRT error rates (%) overall and error rates for consonants in initial and final positions ................ 25

2.1. Permitted content options for each level of command of the MIT ................ 67

2.2. Sample commands and scoring criteria for the MIT ................ 69

3.1. Speech spectra coordinates for the vowel [i] calculated from algorithms suggested by Miller (1984) which identify its placement in three-dimensional space ................ 80

3.2. Speech spectra coordinates for the vowel [ɪ] calculated from algorithms suggested by Miller (1984) which identify its placement in three-dimensional space ................ 80

3.3. Speech spectra coordinates for the vowel [ɛ] calculated from algorithms suggested by Miller (1984) which identify its placement in three-dimensional space ................ 81

3.4. Speech spectra coordinates for the vowel [æ] calculated from algorithms suggested by Miller (1984) which identify its placement in three-dimensional space ................ 81

3.5. Speech spectra coordinates for the vowel [a] calculated from algorithms suggested by Miller (1984) which identify its placement in three-dimensional space ................ 82
3.6. Speech spectra coordinates for the vowel [ʌ] calculated from algorithms suggested by Miller (1984) which identify its placement in three-dimensional space ................ 82

3.7. Distance values for vowels in three-dimensional space from Peterson and Barney (1952) vowel measurements as calculated using xyz coordinates derived from Miller (1984), scaled x 100 ................ 84

3.8. Mean percent-correct scores, standard deviations, and ranges for SPIN full-list test results as a function of speech source (N=12) ................ 87

3.9. Mean percent-correct scores, standard deviations, and ranges for SPIN high-predictability and low-predictability test results as a function of speech source (N=12) ................ 88

3.10. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for SPIN full-list test results as a function of speech source (N=12) ................ 89

3.11. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for SPIN high-predictability and low-predictability test results as a function of speech source (N=12) ................ 90

3.12. One-way within-subject analysis of variance of full-list SPIN test results ................ 96

3.13. Two-way within-subject analysis of variance of SPIN test results with main effects of speech source and word predictability ................ 98

3.14. Simple effects for the variables of speech source and linguistic predictability based on transformed percent-correct scores of the SPIN high- and low-predictability key word subtests ................ 100

3.15. Mean percent-correct scores, standard deviations, and ranges for multiple choice test results as a function of speech source (N=12) ................ 103
3.16. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for multiple choice comprehension test results as a function of speech source (N=12) ................ 104

3.17. Mean completion time, standard deviations, and ranges for multiple choice comprehension tests (in seconds) as a function of speech source ................ 105

3.18. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for the multiple choice comprehension tests (N=12) ................ 107

3.19. One-way within-subject analysis of variance of multiple choice test arcsin transformed percent-correct scores by speech source ................ 113

3.20. Mean percent-correct, standard deviations, and ranges for total scores of the Multiple Instructions Test as a function of speech source (N=12) ................ 118

3.21. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for total scores of the Multiple Instructions Test as a function of speech source ................ 119

3.22. Mean percent-correct, standard deviations, and ranges for all levels of the Multiple Instructions Test as a function of speech source (N=12) ................ 120

3.23. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for all complexity levels of the Multiple Instructions Test as a function of speech source (N=12) ................ 121
3.24. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for experimental versions of the Multiple Instructions Test across all speech sources (N=12) ................ 123

3.25. Mean completion time, standard deviations, and ranges for all levels of the total Multiple Instructions Test (in seconds) as a function of speech source (N=12) ................ 124

3.26. Mean item completion time in seconds, standard deviations, and ranges for all levels of the Multiple Instructions Test as a function of speech source (N=12) ................ 125

3.27. Two-way within-subject ANOVA of Multiple Instructions Test transformed percent-correct scores ................ 133

3.28. Simple effects for the variables of speech source and level of complexity based on transformed percent-correct scores of the MIT ................ 135

3.29. Two-way within-subject analysis of variance of Multiple Instructions Test results with main effects of speech source and time per level ................ 139

3.30. Simple effects for the variables of speech source and level of complexity for item completion time (in seconds) of the MIT ................ 141

3.31. Mean percent-correct, standard deviations, and ranges for scores of the Multiple Instructions Test presented with communication repair options (N=14) ................ 146

3.32. Means, standard deviations, and ranges of arcsin transformed percent-correct scores of the Multiple Instructions Test presented with communication repair options (N=14) ................ 147

3.33. Mean percent, standard deviations, and ranges of repair options selected based on the total number used by each subject for all levels of the Multiple Instructions Test (N=14) ................ 148
3.34. Mean percent, standard deviations, and ranges of the number of repair options selected by level of the Multiple Instructions Test. Calculations are based on the number of repair options a subject selected for complexity levels of the MIT (N=14) ................ 149

3.35. Mean percent, standard deviations, and ranges for repair options selected within levels of the Multiple Instructions Test. Calculations are based on each subject's total number of repair options used within a level of complexity (N=14) ................ 150

3.36. One-way within-subject analysis of variance of total MIT scores as a function of level ................ 158

3.37. One-way within-subject analysis of variance of total repair options selected as a function of complexity level of the MIT ................ 161

3.38. Two-way within-subject analysis of variance of Multiple Instructions Test arcsin transformed percent-correct scores with main effects of level of complexity and repair option ................ 163

3.39. Simple effects for the variables of Multiple Instructions Test level of item complexity and repair option ................ 165

3.40. Mean percent-correct scores, standard deviations, and ranges of the Multiple Instructions Test presented with (Exp. 5) and without (Exp. 4) the option of communication repair (N=12) ................ 169
3.41. Mean arcsin transformed percent-correct scores, standard deviations, and ranges of the Multiple Instructions Test presented with (Exp. 5) and without (Exp. 4) the option of communication repair ................ 170

3.42. One-way between-subject analysis of variance of Multiple Instructions Test arcsin transformed percent-correct scores as a function of communication repair ................ 173

C-1. Sound level measurements of Revised SPIN Test experimental tapes generated using a human talker (male) ................ 196

C-2. Sound level measurements of Revised SPIN Test experimental tapes generated using the DECtalk speech synthesizer ................ 196

C-3. Sound level measurements of Revised SPIN Test experimental tapes generated using the Amiga speech synthesizer ................ 197

C-4. Sound level measurements of Revised SPIN Test experimental tapes generated using the VOTRAX PSS speech synthesizer ................ 197

C-5. Sound level measurements of Revised SPIN Test experimental tapes generated using the Smoothtalker (on Macintosh) speech synthesizer ................ 198

C-6. Sound level measurements of Revised SPIN Test experimental tapes generated using the Echo II+ speech synthesizer ................ 198

C-7. Sound level measurements of Revised SPIN Test twelve-voice babble as recorded on the second channel of Bilger's (1984) tape ................ 199

C-8. Sound level measurements of experimental tapes of factual passages generated by a human talker (male) ................ 200

C-9. Sound level measurements of experimental tapes of factual passages generated by the DECtalk speech synthesizer ................ 200
C-10. Sound level measurements of experimental tapes of factual passages generated by the Amiga speech synthesizer ................ 201

C-11. Sound level measurements of experimental tapes of factual passages generated by the VOTRAX PSS speech synthesizer ................ 201

C-12. Sound level measurements of experimental tapes of factual passages generated by the Smoothtalker (Macintosh) speech synthesizer ................ 202

C-13. Sound level measurements of experimental tapes of factual passages generated by the Echo II+ speech synthesizer ................ 202

C-14. Sound level measurements of experimental tapes for the Multiple Instructions Test generated by a human talker (male) ................ 203

C-15. Sound level measurements of experimental tapes for the Multiple Instructions Test generated by the DECtalk speech synthesizer ................ 203

C-16. Sound level measurements of experimental tapes for the Multiple Instructions Test generated by the Amiga speech synthesizer ................ 204

C-17. Sound level measurements of experimental tapes for the Multiple Instructions Test generated by the VOTRAX PSS speech synthesizer ................ 204

C-18. Sound level measurements of experimental tapes for the Multiple Instructions Test generated by the Smoothtalker (on Macintosh) speech synthesizer ................ 205

C-19. Sound level measurements of experimental tapes for the Multiple Instructions Test generated by the Echo II+ speech synthesizer ................ 205

C-20. Summary statistics for Leq differences between calibration tones and SPIN experimental cassette recordings ................ 206

C-21. Summary statistics for Leq differences between calibration tones and passage experimental cassette recordings ................ 207
C-22. Summary statistics for Leq differences between calibration tones and Multiple Instructions Test experimental cassette recordings ................ 208

C-23. Total running time in seconds of experimental recordings of the SPIN test ................ 209

C-24. Total running time in seconds for cassette recordings of passages ................ 210

C-25. Total running time in seconds of experimental recordings of the Multiple Instructions Test ................ 211

D-1. Mean percent-correct scores, standard deviations, and ranges for SPIN full-list as a function of signal-to-babble ratio using DECtalk and Echo II+ as speech sources (N=7) ................ 217

D-2. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for the SPIN full-list as a function of signal-to-babble ratio using DECtalk and Echo II+ as speech sources (N=7) ................ 218

D-3. High-predictability and low-predictability SPIN test mean percent-correct scores, standard deviations, and ranges as a function of signal-to-babble ratio using DECtalk and Echo II+ as speech sources (N=7) ................ 219

D-4. High-predictability and low-predictability SPIN test mean arcsin transformed percent-correct scores, standard deviations, and ranges as a function of signal-to-babble ratio using DECtalk and Echo II+ as speech sources (N=7) ................ 220
LIST OF FIGURES

Figure                                              Page

1.1. Idealized model of an interactive communication system containing a speech synthesizer ................ 3

1.2. The three categories of human-computer interaction ................ 6

1.3. The steps involved in converting an ASCII orthographic representation for a word into phonemes ................ 12

1.4. Summary of components of the communication model and the corresponding experiments ................ 45

2.1. Process flow chart of experimental events for Experiments 2 through 5 ................ 49

2.2. Apparatus used for making Leq measurements and placement of 1000 Hz calibration tone on experimental tapes ................ 55

2.3. Block diagram of experimental apparatus for Experiment 2 ................ 57

2.4. Block diagram of experimental apparatus for Experiment 3 ................ 62

2.5. Hierarchy of complexity levels used in the Multiple Instructions Test ................ 66

2.6. Block diagram of experimental apparatus for Experiment 4 ................ 71

2.7. Block diagram of experimental apparatus for Experiment 5 ................ 75

3.1. Mean percent-correct scores and standard deviations for SPIN full-list test results as a function of speech source. Each bar represents observations of 12 normal-hearing adults tested monaurally. The lower histogram denotes ±1 standard deviation ................ 92
Figure (Cont.)                                      Page

3.2. Mean percent-correct scores and standard deviations for SPIN high-predictability and low-predictability word sets as a function of speech source. Each bar represents observations of 12 normal-hearing adults tested monaurally. The lower histogram denotes ±1 standard deviation ................ 93

3.3. Mean arcsin transformed percent-correct scores and standard deviations for SPIN full-list test results as a function of speech source. Each bar represents observations of 12 normal-hearing adults tested monaurally. The lower histogram denotes ±1 standard deviation ................ 94

3.4. Mean arcsin transformed percent-correct scores and standard deviations for SPIN high-predictability and low-predictability word sets as a function of speech source. Each bar represents observations of 12 normal-hearing adults tested monaurally. The lower histogram denotes ±1 standard deviation ................ 95

3.5. Illustration of Newman-Keuls' test of pairwise comparisons for SPIN test transformed percent-correct scores for high-predictability and low-predictability words as a function of speech source. Nonsignificant mean pairs are connected by solid lines ................ 99

3.6. Mean percent-correct scores and standard deviations for multiple choice comprehension tests as a function of speech source. Each bar represents observations of 12 normal-hearing adults tested monaurally. The lower histogram denotes ±1 standard deviation ................ 108

3.7. Mean arcsin transformed percent-correct scores and standard deviations for multiple choice tests as a function of speech source.
Each bar represents observations of 12 normal-hearing adults tested monaurally. The lower histogram denotes ±1 standard deviation ................ 109

3.8. Mean multiple choice test completion time and standard deviations in seconds as a function of speech source. Each bar represents observations of 12 normal-hearing adults tested monaurally. The lower histogram denotes ±1 standard deviation ................ 110

3.9. Illustration of Newman-Keuls' test of pairwise comparisons for transformed multiple choice test comprehension scores. Nonsignificant mean pairs are connected by solid lines ................ 113

Figure (Cont.)                                      Page

3.27. Mean percent of repair options selected based on the total number of options for each subject within the complexity level of the MIT. Each bar represents observations of 14 normal-hearing subjects tested monaurally ................ 156

3.28. Standard deviations for repair options selected within complexity level of the MIT ................ 157

3.29. Illustration of results of Newman-Keuls' test of paired comparisons of mean transformed percent-correct scores per complexity level of the MIT. Nonsignificant mean pairs are connected by solid line ................ 159

3.30. Illustration of results of Newman-Keuls' test of paired comparisons of mean repair options selected per complexity level of the MIT.
Nonsignificant mean pairs are connected by solid line ................ 162

3.31. Mean percent-correct scores and standard deviations for the Multiple Instructions Test presented with (Exp. 5) and without (Exp. 4) communication repair options. Each bar represents observations of 12 normal-hearing adults tested monaurally ................ 169

3.32. Means and standard deviations of transformed percent-correct scores on all levels of the Multiple Instructions Test presented with (Exp. 5) and without (Exp. 4) communication repair options. Each bar represents observations of 12 normal-hearing adults tested monaurally ................ 170

Frequency response of the experimental earphone (TDH-39) ................ 212

D-1. Mean percent-correct and standard deviations for SPIN full-lists as a function of signal-to-babble ratio using DECtalk and Echo II+ as speech sources. Each bar represents observations of 7 normal-hearing subjects tested monaurally. The lower histogram denotes ±1 standard deviation ................ 221

D-2. Mean arcsin transformed percent-correct and standard deviations for SPIN full-lists as a function of signal-to-babble ratio using DECtalk and Echo II+ as speech sources. Each bar represents observations of 7 normal-hearing subjects tested monaurally. The lower histogram denotes ±1 standard deviation ................ 222

Figure (Cont.)                                      Page

D-3. Mean percent-correct and standard deviations for SPIN high- and low-predictability test results as a function of signal-to-babble ratio using DECtalk as a speech source.
Each bar represents observations of 7 normal-hearing subjects tested monaurally. The lower histogram denotes ±1 standard deviation ....... 223

Mean arcsin transformed percent-correct scores and standard deviations for SPIN high- and low-predictability test results as a function of signal-to-babble ratio using DECtalk as a speech source. Each bar represents observations of 7 normal-hearing subjects tested monaurally. The lower histogram denotes ±1 standard deviation ....... 224

Mean percent-correct and standard deviations for SPIN high- and low-predictability test results as a function of signal-to-babble ratio using Echo II+ as a speech source. Each bar represents observations of 7 normal-hearing subjects tested monaurally. The lower histogram denotes ±1 standard deviation ....... 225

Mean arcsin transformed percent-correct scores and standard deviations for SPIN high- and low-predictability test results as a function of signal-to-babble ratio using Echo II+ as a speech source. Each bar represents observations of 7 normal-hearing subjects tested monaurally. The lower histogram denotes ±1 standard deviation ....... 226

Chapter I

BACKGROUND

Introduction

Digital speech synthesizers have become commonplace in devices used in business, industry, and education, and as augmentative communication aids to the handicapped. This proliferation raises difficult questions regarding the effects of machine-generated speech upon communication. Assuming the decision to use synthetic speech is intended to make more effective use of human communication resources, it is essential to know the nature and degree of these effects upon the performance of tasks. Once these are determined, appropriate cost/benefit evaluations of available systems can be undertaken.
Of particular interest in the development and selection of a task-appropriate speech synthesis device is the intelligibility of the system in comparison both to human speakers and to other synthetic speech systems. The various acoustic environments in which these systems are used can dramatically influence the degree of intelligibility required and the robustness of intelligibility in the presence of competing signals. A wide variety of variables interact to influence intelligibility, including signal complexity, task complexity, linguistic structure of the message, limitations of the human processing system and listener experience (Nusbaum and Pisoni, 1985). Measurement of intelligibility requires attention to the issues of accuracy, precision, sufficiency, and utility.

An effective means of organizing the many concerns associated with the evaluation of synthesized speech is a communication model such as that developed by Shannon and Weaver (1949). The foundation of their approach is a triad of issues described as follows:

(1) The accuracy with which the symbols of communication are transmitted (technical accuracy).

(2) The precision with which the transmitted symbols convey the desired meaning (semantic precision).

(3) The effectiveness with which the received meaning affects conduct in a desired way (task performance).

Figure 1.1 offers an idealized model of speech synthesis based on the work of Shannon and Weaver (1949), Flanagan (1981), and Chial (1986). Communication can be described as the transfer of information (i.e., facts, feelings, thoughts) from one place to another. In the present model, the process begins with an information source (human) who wishes to transfer information to a receiver (also human).
The purpose of communication is variable, but is assumed, in this example, to be an intent on the part of the information source to effect a response from the receiver. Noise can be defined as anything which increases the ambiguity of the signal, thus reducing the likelihood of the desired receiver response.

[Figure 1.1. Components of a speech synthesis system within an interactive communication system; the figure itself is illegible in the source scan.]

One means for generating an auditory signal for information transfer is a speech synthesizer. For this signal source to function, however, information must be translated or encoded into a form the device can use. This encoding process is the first potential source of noise in the communication system. Noise sources can be intrinsic (e.g., encoding and decoding) or extrinsic (e.g., a competing acoustic signal). Examples of intrinsic noise sources include vocabulary selection and the accuracy of text generation. Extrinsic noise might include competing talkers or traffic noise.

Text, including punctuation marks that cue suprasegmental information, is recoded into segmental phonetic information through algorithms stored in the synthesizer.
Symbols generated from these algorithms are further recoded into parameters used to create a digitized speech wave. A digital-to-analog converter is used to output the synthesized wave for transmission along or through a communication channel. A feedback component may exist to permit interactive control of the synthesizer. The quality, frequency and accuracy of feedback from the receiver, and the utilization of such interaction by the source, can influence the efficiency and effectiveness of communication. In goal-directed communication, variations in the efficiency and effectiveness of the communication process cause variations in the amount of work required to accomplish the goal. The role and use of human energy in such tasks can be addressed as an ergonomic issue; specifically, communication ergonomics.

Interactions between humans and computers can be divided into three modes: simplex, half-duplex and full-duplex. This rubric (historically applied to serial communication technology) is useful for describing the nature of communication between human and computer, or between two or more humans employing a computer as a mediator of communication. As with the previous communication model, a system is assumed in which two actors exchange information over a channel of some sort. The actors may be either human or machine.
"Channel" in this context refers to any transmissive medium (acoustical, optical, electronic) or media (print, video, film). The simplex mode of interaction involves transmission of information in only one direction; that is, one actor is always the source and the other actor is always the receiver (see Figure 1.2). An example of simplex communication involving human actors is a taped or phonographic recording of music. The half-duplex mode allows actors to take "turns" in a discrete, non-overlapping manner, exchanging the roles of sender and receiver. Examples include formal debates, serious telephone conversations and telephone answering systems. The full-duplex mode is one in which actors simultaneously engage in bidirectional communication, serving as both sender and receiver. Examples of full-duplex communication include lively conversation and impassioned arguments (Chial, 1984).

   Actor I                        Actor II
   Sender      ------------>      Receiver
               Simplex Interaction

   Actor I                        Actor II
   Sender      ------------>      Receiver
   Receiver    <------------      Sender
               Half-Duplex Interaction

   Actor I                        Actor II
   Sender +    <----------->      Sender +
   Receiver                       Receiver
               Full-Duplex Interaction

Figure 1.2. The three categories of human-computer interaction.

It is possible to study aspects of synthesized speech at any point in the communication system or through any mode of communication. A practical approach to evaluation lies in the selection of a method appropriate to the task required of the device, or for that portion of the communication process most critical to performance in a given situation.
To paraphrase Shannon and Weaver, if it is not possible or practical to design an evaluation approach which can handle everything perfectly, then a system should be designed to handle well the jobs it is most likely to be asked to do, and should resign itself to being less efficient for the rare task (1949, p. 14).

Table 1.1 summarizes the approaches used to assess synthesized speech. Most prior research has concentrated on receptive intelligibility, usually through the use of word or sentence recognition and listening comprehension tests. This approach provides information limited to the semantic accuracy of the speech synthesis systems. A few studies also attempted to compare synthesizers on the basis of perceived quality and explored the relationship between intelligibility and perceived naturalness. To date, no attempts have been made to design and evaluate procedures focusing on technical precision or on complex task performance.

The goal of this study was threefold: (1) to evaluate a group of speech synthesizers at the three communication system levels of technical accuracy, semantic precision and task performance; (2) to apply a combination of behavioral techniques used at the same level of the communication process to determine whether different rankings of systems occur as a result of different approaches; and (3) to obtain initial data on the role of communication repair in systems employing synthetic speech.

[Table 1.1, summarizing stimulus materials, assessment goals and studies reviewed, is largely illegible in the source scan. Recoverable legend abbreviations include: CV = consonant-vowel combinations; HAS = Harvard Psychoacoustic Sentences; MRT = Modified Rhyme Test; AID = Assessment of Intelligibility of Dysarthric Speech; W-22 = CID Auditory Test W-22; ESWS = experimenter-selected words and sentences; SV = sentence verification; PSS = Phoneme Specific Sentences; ARC = adult reading comprehension tests.]

Text-to-Speech Conversion

Klatt (1987) discussed the state of the art in speech synthesis technology. The first step in the process of converting text to an auditory signal is the assignment of an ASCII (American Standard Code for Information Interchange) code to each typed character or string entered into the synthesizer. ASCII codes are 7-bit or 8-bit binary values assigned to letters, numbers, punctuation marks and special characters (Chial, 1984). According to Klatt (1987), the resulting code is then ideally subjected to the following analysis:

(1) Reformat all digits, abbreviations and special characters into words and punctuation.
(2) Section sentences to establish a surface syntactic structure.

(3) Assign a stress pattern appropriate to the surface structure.

(4) Determine a phonemic representation for each word.

(5) Assign a stress pattern to each word.

At the present time, text-to-speech systems are unable to perform a semantic analysis and thus assign stress patterns on this basis. Instead, systems with the option of sentence-level intonation patterns use a "generic" inflectional pattern which may or may not change with punctuation markers such as commas and question marks. No additional stress patterns are inserted on the basis of semantics unless the user codes the input with additional stress markers. The system therefore proceeds to the derivation of phonemic representation and stress assignment at the word level. This is accomplished via a word-by-word comparison to entries in a pronunciation dictionary. Words not listed are broken into pieces to remove prefixes and suffixes and compared again. If the system is still unable to find a match for the root word, rules for letter pronunciation are used. Some systems incorporate dictionaries to check for exceptions to stress rules or to deal with special cases such as proper names (Klatt, 1987). Figure 1.3 reproduces Klatt's (1987) model of text-to-speech conversion.

Rules used for text-to-speech translation vary from one speech synthesis system to another and can be considered the primary determiner of perceptual differences. These systems are often proprietary, making analysis of underlying rule structures difficult. Even if detailed comparisons of rule structures were possible, there is not yet enough information on the cause-effect relationship between rule configuration and perception. Consequently, empirical comparisons of rule structure are necessary.
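The word-level derivation just described (a whole-word dictionary probe, affix stripping followed by a second probe, then letter-to-sound rules as a last resort) can be sketched in a few lines. The dictionary, suffix table, and phoneme notation below are toy stand-ins for illustration; they are not the contents of any actual synthesizer.

```python
# Toy pronunciation resources (hypothetical entries, ARPAbet-like phones).
DICTIONARY = {"speech": "S P IY CH", "talk": "T AO K"}
SUFFIX_PHONES = {"ing": "IH NG", "s": "Z"}

def letter_to_sound(word):
    """Stand-in for rule-based spelling-to-phoneme conversion; real
    systems apply ordered, context-sensitive rules at this stage."""
    return " ".join(word.upper())

def pronounce(word):
    word = word.lower()
    if word in DICTIONARY:                        # whole-word dictionary probe
        return DICTIONARY[word]
    for suffix, phones in SUFFIX_PHONES.items():  # affix stripping
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in DICTIONARY:  # root probe
            return DICTIONARY[root] + " " + phones        # affix reattachment
    return letter_to_sound(word)                  # letter-to-sound fallback
```

Here a listed word returns its dictionary phonemes directly, an inflected form such as "talks" takes the affix path, and an unknown string falls through to the letter rules.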
INPUT WORD
  |
"WHOLE WORD" DICTIONARY PROBE ----found----------------.
  | not found                                          |
AFFIX STRIPPING                                        |
  |                                                    |
ROOT DICTIONARY PROBE --not found--.                   |
  | found                          |                   |
  |                       LETTER-TO-SOUND              |
  |                       & STRESS RULES               |
  |                                |                   |
AFFIX REATTACHMENT <---------------'                   |
  |                                                    |
PHONEMES, STRESS, PARTS-OF-SPEECH <--------------------'

Figure 1.3. The steps involved in converting an ASCII orthographic representation for a word into phonemes. (Klatt, 1987, p. 768)

Speech Perception

Miller (1984) proposed that short-term spectral patterns of speech can be represented as points in an "auditory-perceptual space". Speech spectra are integrated with normally occurring silences to form a pattern which is compared to targets learned over time. If the pattern falls within the stored target zones, it will be perceived as a particular element. Description of a phonetic pattern can be accomplished using its spectral characteristics. Vowel characteristics traditionally have been quantified through measurements of fundamental frequency and formant frequencies. Some researchers have used these characteristics to describe differences in categories by plotting the phonemes along two dimensions. Shepard (1972) used a three-dimensional plot to demonstrate that perceptual similarities among vowels tend to line up along formant dimensions. Miller theorized that perceptual coding of phonetic elements should be based not on the number of prominences present in the spectrogram, but on the pattern of the spectral shapes as characterized by log-power and log-frequency relationships.
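Miller's log-frequency characterization of spectral shape lends itself to a short numerical sketch. This is an illustration, not Miller's implementation: base-2 logarithms are assumed (so that distances along each axis come out in octaves), and the formant and reference values used below are invented.

```python
import math

def spectral_shape(f1, f2, f3, f0_ref):
    """Miller-style spectral-shape point (x, y, z) for one vowel.

    f1..f3 are the first three spectral prominences in Hz; f0_ref is the
    talker's fundamental frequency scaled by a constant reference factor.
    Base-2 logs are assumed, so each axis unit is one octave.
    """
    return (math.log2(f3 / f2),
            math.log2(f1 / f0_ref),
            math.log2(f2 / f1))

def separation(p, q):
    """Euclidean distance between two spectral-shape points, in octaves."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Invented values: the "same" vowel transposed up one octave maps to the
# same point, since every coordinate is a frequency ratio.
low = spectral_shape(300.0, 2300.0, 3000.0, 150.0)
high = spectral_shape(600.0, 4600.0, 6000.0, 300.0)
```

Because the coordinates are ratios, uniform frequency scaling leaves the point unchanged; comparing a synthesizer's vowel against a human token then reduces to `separation(human_point, synthetic_point)`.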
He suggested that when vowels are plotted using these dimensions, the sensory perception of the phoneme is characterized by "spectral shape" as opposed to absolute position along any one dimension. This reliance on shape allows for simple transposition of the vowel along either dimension without altering the phonetic information carried by the signal.

According to Miller, "spectral shape" can be represented as a single point plotted in three-dimensional space with the coordinates x = log(F3/F2), y = log(F1/F0') and z = log(F2/F1). In the case of periodic speech, F0' is defined as the fundamental frequency of the voice multiplied by a constant (1.5 times greater for males), and F1, F2 and F3 are the frequency locations of the first three spectral prominences. Vowels can be plotted in three-dimensional space, allowing for the calculation of class differences based upon distance as measured in octaves, cents or semitones (Miller, 1982).

Although Miller (1984) directed his interest to the design of cochlear implants, he speculated that cochlear prostheses will succeed in transmitting phonetic information only to the degree they are successful in matching the characteristics of spectral envelopes. It is suggested here that his model also can be applied to the signal source as a means of indexing technical accuracy. Differences between human and synthetic speech can be described on the basis of the degree of separation among points plotted in three-dimensional space.

Speech Understanding

Introduction

By far the most popular technique for assessing synthesized speech has been speech understanding or intelligibility.
Intelligibility is defined here as encompassing the discrimination, recognition and comprehension of speech stimuli. Speech recognition is defined as the process by which an individual receives a speech signal, then immediately reproduces it verbally or in writing. Word-recognition tasks may involve words in isolation or in sentences. Scoring is based on the correct reproduction of single words. Speech comprehension tests, on the other hand, generally require longer retention of the speech signal as well as recognition of message content in a different form (usually written). Researchers have employed tests of segmental intelligibility, word recognition, and word recognition in sentences, as well as listening comprehension tasks using sentence verification and continuous discourse, in their efforts to compare systems.

Word Recognition

Chial (1976) designed four experiments to evaluate word recognition using the VOTRAX VI phonetic speech synthesizer and normal-hearing listeners. The stimuli consisted of CID Auditory Test W-22, List 1, presented monaurally under earphones. Experiment 1 was designed to assess performance in quiet at comfortable listening levels (70 dB SPL). In addition, the effect of repeated trials on word recognition was investigated. An average word recognition score of 55% was obtained for the VOTRAX VI, in comparison to an average score of almost 100% for the human control. A significant improvement in performance was noted with repeated trials for the speech synthesizer.
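Word-recognition scoring of the kind used in these studies reduces to a percent-correct tally over a test list. A minimal sketch follows; the word lists shown in the usage note are invented, not W-22 items.

```python
def percent_correct(presented, responses):
    """Percent of list items reproduced exactly (case-insensitive), the
    usual scoring rule for word-recognition tests."""
    if len(presented) != len(responses):
        raise ValueError("need one response per presented item")
    hits = sum(p.lower() == r.lower() for p, r in zip(presented, responses))
    return 100.0 * hits / len(presented)
```

For example, `percent_correct(["ace", "bad", "cap", "day"], ["ace", "pad", "cap", "they"])` scores 50.0, since two of the four items were reproduced correctly.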
Experiment 2 measured word recognition at six different sensation levels (SL) from +5 dB to +30 dB (re: speech reception threshold) for the VOTRAX VI only. Significant differences in performance were noted among sensation levels, with a possible plateau noted at +20 dB and a definite plateau seen at +30 dB. Chial (1976) noted these results were similar to what would be found with a human talker.

Experiment 3 assessed word recognition performance at six different signal-to-noise ratios (S/N) from -5 dB to +20 dB for the VOTRAX VI only. The stimuli were presented at an SPL of 70 dB in the presence of white noise. Word recognition scores improved as the S/N became more positive, with significant differences noted at all S/N until a plateau was reached at +15 dB. These findings were considered consistent with what would be expected with a human talker.

Experiment 4 investigated the effect of repeated trials when human and synthesized speech are alternated. Ten test lists were presented at an SPL of 65 dB. The results revealed no improvement in performance for the human talker, but significant differences for synthesized speech. Comparing results of Experiments 1 and 4, Chial reported an accelerated learning effect associated with alternating presentation of talkers.

To investigate the effects of noise on the perception of synthesized speech, Pisoni and Koen (1981) presented
Pisoni and Koen concluded that the intelligibility of synthesized speech is affected more by noise than is human speech. It also was suggested that this signal distortion may interact with different processing tasks to produce effects on intelligibility. Hoover, Reichle, Van Tasell and Cole (1987) compared single word-recognition scores and word-recognition in sentences generated by the Echo II and the VOTRAX Type ’N Talk to human speech. Twenty seven consonant-vowel—consonant words were selected such that "all place, manner, and voicing characteristics were represented in either the initial or final positions of the word“ (p.30). For the contextual task, two sets of sentences were generated with these words in the final position. One set was designated as low-probability and used of the phrase “Say the word ' as the precursor to the stimulus item. The second set, designated as high-probability sentences, provided sufficient context for 90% of 15 listeners to correctly guess the item from the sentence content. Results revealed significant differences between each synthesizer and human speech on the basis of the percentage uv‘ ‘ H b.-. ‘-Ab 18 Table 1.2. MRT scores (%-correct) at three speech-to-noise ratios as a function of speech source. Closed Set Open Set Sp/N +30 +20 0 +30 +20 0 Source MITalk 93 89.4 56.6 79 73.5 28.9 Human 99 97.2 69.5 92 88.9 40.3 (Adapted from Pisoni and Koen, 1981) fi‘ wAVI V. v- --..o~ -w 35.. -212." . . . n a . n C. u a .1) '1' QR “ I —\\ uri ‘- pi . q \L S “A m. S i a C c . Ya. . . t a. i S 0 b S "I 3 "I a . 4. c. S e c 1 l. "V a. a o C T. S S . C . -u . . S A... O T.. 4 . a 3. L .. ..~ : a 9 S r < 5 .3 A o A C a k .. . a. L ~ . a o s . «é .3 G» S vs .. n. e h. I. H r. 63 a. .. r a. .1 a. C 5 Lu 9 C ., .L r . .1 1.. Q. .. . . . 4 a 3 3 .v u ..n 1. o u o. uh a . r a s 2. | [LII I? IJ m [lily 19 of words correctly identified in isolation and within sentences. 
Between synthesizers, the VOTRAX performed significantly better than the Echo on both sentence categories. No difference was seen between the synthesizers when the words were presented in isolation, but an error analysis revealed a substantial difference in the recognition of stop consonants. The Echo II proved to be much poorer (23% correct) than the VOTRAX (52% correct). Analysis of subject response patterns indicated 75% substituted the phoneme /m/ for stop consonants in the initial position (usually replacing /b/), whereas less than 1% of the subjects responded with this phoneme for the same items when presented by VOTRAX. In the final position, subjects identified 20% of the Echo II items as fricatives or affricates, compared to 1% of the items presented by VOTRAX. Visual inspection of the acoustic waveforms associated with these stop consonants revealed distinct differences between the two synthesizers. The /b/ in the initial position produced by the Echo II begins with a low-frequency wave similar to a nasal consonant. In the final position, the consonant release is followed by an aspiration resembling a fricative. Thus, the acoustic features correspond with many of the perceptual errors made by the subjects.

Pisoni and Hunnicutt (1980) investigated the intelligibility of the MITalk using a three-phase approach consisting of a segmental intelligibility test, a word recognition test and a listening comprehension test. Segmental intelligibility was evaluated with the Modified Rhyme Test (MRT) under earphones. Normal-hearing subjects (N=72) produced an overall error rate of 6.9%, with scores of 4.6% and 9.3% for consonants in the initial and final positions, respectively. Nasals were found to have the highest error rate (27.6%). An overall error rate of 0.6% was obtained when human speech was used.
Word recognition in sentences was evaluated using the Harvard Psychoacoustics Laboratory Sentences (Egan, 1948) and semantically anomalous sentences created at Haskins Laboratories (Nye and Gaitenby, 1974). Scores of 93.2% and 78.7% correct were obtained for synthetic speech, while scores of 99.2% and 97.7% were seen for recordings using natural speech. Listening comprehension was assessed using narrative passages selected from adult reading comprehension tests. Three groups of subjects were employed. One group listened to MITalk and a second to human speech. A third group viewed the passages in typed form. A set of multiple-choice questions was administered immediately following the task. Scores of 70.3%, 67.2% and 77.2% were obtained for the three groups, respectively.

In a study comparing the MITalk to the Telesensory Systems text-to-speech device (a system based on MITalk) and a human control, Bernstein and Pisoni (1980) used the same battery of tests just described. No significant differences were reported among the systems on any of the measures employed. The largest differences and error rates were said to occur on the sentence materials, particularly the anomalous sentences. Specific percentages for each speech source and test were not reported.

Greene, Manous and Pisoni (1984) conducted an investigation using the Digital Equipment Corporation speech synthesis system version 1.7 (DECtalk). This device provides six different voices, two of which were used for this study: one male ("Perfect Paul") and one female ("Beautiful Betty"). Once again, the investigators used the MRT as a test of segmental intelligibility, along with the Harvard PAL sentences and the Haskins anomalous sentences as stimuli.
The materials were presented to the subjects under earphones (80 dB SPL) in the presence of 55 dB SPL of broadband noise to mask tape hiss. The results showed an error rate on the MRT of 3.3% (male voice) and 5.6% (female voice) when a closed-set response format was used. In contrast, the open-set format resulted in error rates of 13.2% and 17.5% for the male and female voices, respectively. The Harvard and Haskins sentence materials demonstrated error rates of 4.7% and 13.2% (male voice) and 9.5% and 24.0% (female voice).

A comparison of the intelligibility of phoneme classes in a sentence context was undertaken by Logan and Pisoni (1986). The stimuli consisted of a subset of the Phoneme Specific Sentences (Huggins and Nickerson, 1985). These sentences are designed such that each item contains a number of words from a specific class of phonemes. For example, the sentence "Those waves veer over" contains numerous voiced fricatives. The subjects were asked to transcribe sentences presented under earphones. The items were scored on the basis of omissions, transpositions, and additions; an error in any category meant the sentence was counted as incorrect. Analysis revealed significant differences on the basis of synthesizer and on the basis of phonemic category. A significant interaction between voice and phonetic category also was noted. Significant differences were seen between DECtalk and Prose versus Infovox; however, no difference was seen between DECtalk and Prose. These findings are in contrast to previous research using the MRT as the measure of intelligibility.
The authors concluded that the Phoneme Specific Sentences constitute a more difficult test, as evidenced by the higher overall error rates exhibited by all sources. They suggested that errors at the level of phonetic categories can reveal more precise information about sources of error, even in the absence of differences in overall performance among synthesizers.

Kraat and Levinson (1984) compared the intelligibility for sentences of the Echo II and the VOTRAX Personal Speech System (PSS) produced at (1) normal rates and (2) with a 2 1/2 second pause between each word. Test materials consisted of 64 sentences from the Assessment of Intelligibility of Dysarthric Speech (Yorkston and Beukelman, 1981). Although not so stated by Kraat and Levinson, these materials originally were designed to assess the extent of motor speech difficulties upon the intelligibility of speech (Yorkston and Beukelman, 1981). Sixteen sentences were randomly assigned to each of four conditions. Twenty normal-hearing adults were asked to write the sentences following their presentation by loudspeaker. The results revealed a significant difference in performance, with percent-correct scores of 70.4% for the PSS and 45.7% for the Echo II in the normal conditions, and 84.3% and 81.1% in the pause conditions. The pause condition provided significant improvement for both synthesis devices.

Kraat and Levinson (1984) also evaluated the adequacy of the text-to-speech conversion rules of the two systems by using five speech pathology graduate students as judges of the pronunciation of 1500 words produced by the synthesizers. Judges determined whether syllables had been added or deleted and whether vowel substitutions had occurred.
Stimulus items were taken from the Thorndike and Lorge (1944) list of the 1000 most frequent English words. Of these, 45 were judged as mispronounced by the PSS and 175 by the Echo II. An additional 500 words were taken from the Beukelman, Yorkston, Poblete, and Naranjo (1984) lexicon of words commonly employed by users of augmentative communication devices. The PSS was judged to have mispronounced 36 of these words, while the Echo II was judged to have mispronounced 55 of the items.

Greene, Logan and Pisoni (1986) summarized error rates of segmental intelligibility for eight speech synthesis systems from experiments conducted over the past seven years (see Table 1.3). The subject criteria and procedures remained the same for all the systems evaluated. Two of these systems, DECtalk and MITalk, already have been discussed here. As with the previous studies, data were reported for an open-set response format (open-set response formats produced higher error rates than closed-set response formats). These authors suggested a four-level grouping of devices on the basis of intelligibility: (1) natural speech, (2) high-quality synthetic speech (DECtalk, Prose 3.0, and MITalk), (3) moderate-quality synthetic speech (Infovox SA101, Berkely, and TSI proto-1), and (4) low-quality synthetic speech (VOTRAX Type 'N Talk and Echo). These categories reflect the effectiveness of the rules for text-to-speech conversion used in each system.

Table 1.3. MRT error rates (%) overall and for consonants in initial and final positions.

Voice                   Initial    Final    Overall
Natural Speech             0.59     0.56       0.53
DECtalk 1.8 Paul           1.56     4.94       3.25
DECtalk 1.8 Betty          3.39     7.89       5.72
MITalk-79                  4.61     9.39       7.00
Prose 2000 V3.0            7.11     4.33       5.72
Infovox SA 101            10.00    15.00      12.50
Berkely                    9.78    18.50      14.14
TSI Prototype I           10.78    24.72      17.75
VOTRAX Type 'N Talk       32.56    22.33      27.56
Echo                      35.56    35.56      35.56

(Greene, Logan, and Pisoni, 1986, p. 104)

More recently, Mirenda and Beukelman (1987) compared five synthesized voices (the Echo II+, the VOTRAX Personal Speech System, and DECtalk's "Perfect Paul", "Beautiful Betty", and "Kit the Kid") and a human speaker (female) using both single-word recognition and word recognition in sentences. Speech stimuli also were generated for the Echo II+ and VOTRAX synthesizers using both standard English spelling and phonetic coding. Stimuli were a pool of single words (600) and sentences (1,100) selected randomly using the Computerized Assessment of Intelligibility of Dysarthric Speech (CAIDS; Yorkston, Beukelman, & Traynor, 1984). Twelve sentences and fifty words were used with each speech source. The subjects consisted of five listeners from three age groups: adults (ages 26-40), older elementary children (ages 10-12), and younger elementary children (ages 6-8). Stimuli were presented via a monaural speaker in a quiet room. Recording procedures for the natural speaker, provisions for tape equivalency between speech sources, and presentation levels were not reported. Subjects were asked to verbally report what they heard and were given the option of a second trial for each test item. The sentence materials were presented first to all subjects because they were judged by the experimenters to "require more listening effort" (p. 122). At the single-word recognition level, significant differences were noted between speech sources, but no differences were noted across age groups. The results also indicated significant differences between speech sources for word recognition at the sentence level. In addition, significant differences were seen as a function of age group as well as the age group-by-source interaction.
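Returning briefly to Table 1.3: for most systems the overall MRT error rate is simply the unweighted mean of the initial- and final-position rates (a few entries, such as natural speech, deviate slightly, presumably through rounding in the original report). A quick check, with a subset of the table's values hard-coded for illustration:

```python
# Initial- and final-position MRT error rates (%) from Table 1.3.
position_rates = {
    "DECtalk 1.8 Paul": (1.56, 4.94),    # overall reported as 3.25
    "MITalk-79":        (4.61, 9.39),    # overall reported as 7.00
    "Prose 2000 V3.0":  (7.11, 4.33),    # overall reported as 5.72
    "TSI Prototype I":  (10.78, 24.72),  # overall reported as 17.75
}

def overall_rate(initial: float, final: float) -> float:
    """Unweighted mean of the two positional error rates."""
    return round((initial + final) / 2.0, 2)

for voice, (ini, fin) in position_rates.items():
    print(f"{voice}: {overall_rate(ini, fin)}%")
```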
The authors indicated, however, that the speech stimuli were originally designed for an adult population and therefore the difference noted between age groups may reflect the overall linguistic complexity of the sentences. The single-word recognition scores proved to be lower than the sentence scores for the synthesized sources, but not for the human speech source. No differences were noted between stimuli generated using English spelling and phonetic coding for either synthesizer.

Listening Comprehension

Connected discourse has been used as a part of an overall approach to the assessment of speech synthesis systems. The earliest study available for review which used this approach in isolation was conducted by McHugh (1976) using an early VOTRAX text-to-speech system operated at six different stress settings. The stimuli consisted of passages taken from standardized reading comprehension tests. No differences in performance were noted among any of the experimental conditions.

Hersch and Tartaglia (1983, as cited in Manous, Pisoni, Dedina, and Nusbaum, 1985) evaluated the effect of rate on comprehension of speech produced by a prototype of the DECtalk. Stimuli were short passages followed by a set of multiple-choice questions. In this case, questions were said to measure both "literal" and "inferential" comprehension of the material. Subjects were allowed to take notes if they desired. Comprehension was similar to that seen for time-compressed speech (Fairbanks, Guttman, & Miron, 1957) when synthetic speech was produced at slow or normal rates. Increases in rate, however, resulted in decreases in comprehension.

Schwab, Nusbaum and Pisoni (1985) also used connected discourse as a measure of listening comprehension in a study of the effects of training.
Instead of multiple-choice questions, a true/false format was used, with several levels of comprehension ranging from word recognition to inferences. Findings revealed no differences between human speech and that produced by the VOTRAX Type 'N Talk.

Another approach to assessing listening comprehension has been the use of sentence transcription. Jenkins and Franklin (1981, as cited in Pisoni, Manous, and Dedina, 1986) used two groups of subjects, one required to transcribe a passage as it was presented sentence-by-sentence, and a second required to await completion of the entire passage before transcription. The speech sources used were a human control and the VOTRAX text-to-speech system. The model of the VOTRAX system was not reported. No significant differences were noted between synthetic and human speech.

A sentence comprehension task was used by Manous, Pisoni, Dedina, and Nusbaum (1985) to compare performance with two human speakers and five speech synthesis devices. The rationale for using this procedure is based on the historical use of sentence verification procedures to assess speech processing with human speakers. Reaction times have been found to be slower when systematic changes were made in such variables as grammatical form (Gough, 1965, 1966) and prosody (Larkey and Danly, 1983). The authors theorized that "the acoustic-phonetic properties of the speech and its speech quality may affect the amount of time needed to complete any stage involved in the comprehension process" (Manous, Pisoni, Dedina, and Nusbaum, 1985, p. 38). Subjects were asked to identify three- and six-word sentences as true or false through the use of a keyboard and then to transcribe the sentence they heard. Data were obtained for response latency, sentence verification accuracy, and accuracy of sentence transcription. The speech sources differed significantly for all three dependent measures. A grouping of sources into three categories was noted.
These were labeled (1) natural speech, (2) high-quality synthetic speech, and (3) moderate- to low-quality synthetic speech. This experimental procedure appears more sensitive than other methods employing connected discourse and multiple-choice questions.

This same paradigm was used to measure the comprehension of sentences presented via digitally encoded speech (Pisoni and Dedina, 1986). Three different methods were used to generate the speech: (1) 2.4 kbps linear predictive coding (LPC), (2) 9.6 kbps time-domain harmonic scaling-subband coding (TDHS/SBC), and (3) 16 kbps continuously variable slope delta modulation (CVSD). Statistically significant differences were found in performance for all three dependent variables measured between the highest (CVSD) and the lowest (LPC) ranked vocoders.

A third study employing sentence verification as a measure of listening comprehension explored the effects of semantic predictability on performance. Pisoni, Manous and Dedina (1986) constructed a set of 80 sentences composed equally of true and false items (40 each) and high- and low-predictability items (40 each). The predictability of the final word was determined by the number of times it was used to complete a sentence. A pool of 200 potential stimulus sentences was used to generate the frequency data. All sentences were controlled for intelligibility by a sentence transcription task using the DECtalk as speech source. Items ultimately retained for study produced no transcription errors. Data were collected for the three dependent variables of transcription accuracy, sentence verification accuracy, and response latency. As expected, no differences were noted between the human and synthesized voice for transcription accuracy.
The only factor to reach significance in sentence verification accuracy was high- versus low-predictability. Significant differences were noted between speech sources for response verification latency but not for response verification accuracy. In addition, a marked difference in scores was seen for the high- versus low-predictability sentences. Further analysis failed to reveal an interaction between voice and predictability. The authors suggested that this provides evidence against the theory that differences among speech synthesizers are solely the result of segmental intelligibility. Some aspect of the acoustic-phonetic input to the listener interferes with the processing of meaning as opposed to the perception of the sentence. This speculation is further supported by Slowiaczek and Pisoni's (1982) comparison of response times for a lexical naming task between human and synthesized (MITalk) speech sources. Differences between the speech sources remained the same following training in the task over several days. Taken together, these studies indicate that verification response latency is a sensitive measure for comparison of speech synthesis systems.

Summary

Many high-quality speech synthesis systems produce very low error rates. For example, overall error rates on the MRT were between 3.25% and 7.00% for the top four systems (Table 1.3). Differences among several synthesizers were as small as 2%. This ceiling effect for natural speech and high-quality synthesizers makes meaningful comparison difficult. Those studies that employed listening comprehension measures using passages and traditional post-testing failed to demonstrate significant differences among speech sources. These measures do not appear sensitive enough to be useful in comparing speech synthesizers. However, several attributes of test materials increase task difficulty, and hence sensitivity to speech source effects.
These attributes include (1) an open-set response format (Pisoni and Koen, 1981; Greene, Logan, and Pisoni, 1986), (2) anomalous sentences (Pisoni and Hunnicutt, 1980; Greene, Manous, and Pisoni, 1984), (3) the presence of noise (Pisoni and Koen, 1981), and (4) inclusion of high- and low-probability items (Pisoni, Manous and Dedina, 1986). In addition, sentence verification accuracy and sentence verification response latency seem to be response formats particularly sensitive to differences among systems (Slowiaczek and Pisoni, 1982; Manous, Pisoni, Dedina, and Nusbaum, 1985; Pisoni and Dedina, 1986; Pisoni, Manous and Dedina, 1986).

Speech Quality

Speech quality can be described as the overall "goodness" or naturalness of speech produced, processed, or received by an element in a communication system. Some factors, such as intelligibility, have been shown to contribute to the perception of speech quality (Weldele and Millin, 1975); many other attributes have yet to be defined. It can be postulated, however, that each component of speech production contributes in varying degrees to perceived quality. The components and their possible contributions include (1) respiration (via alterations in intensity), (2) phonation (via alterations in fundamental frequency), (3) articulation (via precision of phoneme production), (4) resonance (via changes in oral/nasal coupling), and (5) rate. Thus, speech quality represents more than the individual contributions of speech production, word recognition, intelligibility, discrimination, or prosody: it is the total impact of these (and other, perhaps undefined, attributes) of the speech source which combine to make it unique. In the past, mathematical models used to generate these characteristics in synthesized speech have been limited in their ability to recreate this richness by the availability of computer memory and by knowledge of the contributing factors. However, speech quality measurements may present a more precise method for differentiating among both synthesized and human speech sources than word recognition or comprehension measurements.

Voice Distinctiveness

It has been theorized that the distinct quality of synthetic speech may act to alert the listener to its presence and thus facilitate detection and/or reception (Simpson and Williams, 1980). In order to investigate the roles of voice distinctiveness and phonetic discriminability, Nusbaum, Greenspan and Pisoni (1986) presented CV syllables via earphones to subjects in the presence of natural and synthetic voice distractors. Levels and mode (monaural, diotic, dichotic) of presentation were not reported. Subjects were asked to identify a target syllable spoken by the test talker from a series of 20 presented by either natural or synthetic talkers. If the quality of speech acts as an alerting mechanism, the percent of syllables correctly identified should be higher when the target has a more unique character. The results indicated lower recognition performance for both speech synthesizers (DECtalk and VOTRAX Type 'N Talk) compared to the human talker on the basis of all three performance measures (percent correct, response time, and false alarm rate). These differences occurred regardless of whether the distracting voices were the same, different, or mixed in relation to the target voice. The authors concluded that the distinctiveness of the voice is less critical to target detection than the intelligibility of the speech.
Measurement of Speech Quality

It has been suggested that traditional word-recognition tests fail to provide an adequate representation of the ability to understand speech in normal listening situations (Chial and Hayes, 1974; Oyer and Frankman, 1975; Berger, 1978). These tests often lack the accuracy and precision required to demonstrate significant differences among signal sources, transmission systems, or listeners. This suggests that the complex interactions among intelligibility, prosody, message content, and the listener's knowledge of the language cannot be evaluated using abbreviated stimulus sets and paradigms.

The IEEE Audio and Electroacoustics Group Subcommittee on Subjective Measurements reviewed a variety of procedures for making speech quality measurements with the intent of discovering those which had been successful and which could be used with a variety of signals (IEEE, 1969). Three methods were recommended: (1) the Category-Judgement Method, (2) the Relative-Preference Method, and (3) the Isopreference Method.

The Category-Judgement Method requires subjects to listen to a standard speech sample and then compare other signal(s) of interest to this standard. Their impressions are categorized according to the adjectives Unsatisfactory, Poor, Fair, Good, and Excellent. The result is a mean score for the signal based on the total number of judgements in each category. The main difficulty with this method appears to be excessive sensitivity to the content of the speech material used (IEEE, 1969).

The Relative-Preference Method places the signal of interest along an arbitrary rating scale based upon how often the signal is preferred in comparison to all other signals. The continuum along which the signal is placed is defined by the selected reference signals; therefore, the degree of degradation used with the reference is of utmost importance (IEEE, 1969).
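The Category-Judgement Method reduces to a weighted mean over the category tallies. Assuming the five adjectives are coded 1 (Unsatisfactory) through 5 (Excellent), a common convention that the IEEE (1969) report does not itself mandate, the computation can be sketched as:

```python
# Assumed numeric coding of the IEEE category labels (1 through 5).
CATEGORY_VALUES = {
    "Unsatisfactory": 1, "Poor": 2, "Fair": 3, "Good": 4, "Excellent": 5,
}

def mean_category_score(counts: dict) -> float:
    """Mean score for a signal, given the number of judgements
    falling in each category."""
    total = sum(counts.values())
    return sum(CATEGORY_VALUES[c] * n for c, n in counts.items()) / total

# e.g., 2 "Fair", 6 "Good", and 2 "Excellent" judgements:
print(mean_category_score({"Fair": 2, "Good": 6, "Excellent": 2}))  # 4.0
```

The resulting value is comparable across signals only to the extent that the speech material is held constant, which is exactly the sensitivity the IEEE report warns about.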
The Isopreference Method (ISM) was originally proposed by Munson and Karlin (1962) and later simplified by Rothauser (1968). A test signal (speech) is presented in a forced-choice comparison to a reference signal (also speech) subjected to different degrees of degradation via the addition of noise. The noise level is varied until the preference votes of a listening group are equally divided between test and reference signals. Thus, the signal-to-noise ratio of the reference signal becomes the preference score for the test signal.

An alternative to paired-comparison techniques is quality magnitude estimation. The subject is asked to assign a numerical value to test stimuli. This can be done with (e.g., Chial and Daniel, 1977) or without (e.g., Lawson, 1980) a selected reference signal.

Variations of the Relative-Preference Method have been used to differentiate among speech synthesis systems. Nye, Ingemann, and Donald (1975, as cited in Logan and Pisoni, 1986) presented subjects with pairs of short passages generated by several different algorithms. Subjects were asked to state a preference for one of the speech sources. When compared to data obtained from a test of listening comprehension based on the passages, the results revealed that listeners tended to prefer the algorithms which generated the highest listening comprehension scores.

McHugh (1976) also obtained data on listener preference as it related to performance on a test of listening comprehension. Six different "inflection levels" of the VOTRAX VS 6.0 were used as speech sources. Twelve sentences were recorded for each source and presented in random order. The method of presentation was not reported. Subjects were required to listen to each sentence and rate the "goodness" of the sample on a seven-point scale anchored with "good" and "bad". Results revealed wide variations in how the subjects used the scale. A tendency to cluster at the high, middle, or low end of the scale was noted. Therefore, in order to combine the data it was necessary to rank order the preferences for each subject by obtaining a mean preference score for each speech source over all twelve sentences. Mean rank was then computed across all subjects. A comparison with results obtained on the test of listening comprehension revealed that the order of preference was similar to rankings based on listening comprehension performance.

Logan and Pisoni (1986) conducted two experiments designed to evaluate listeners' preferences using a paired-comparison paradigm. Stimuli used in one experiment were the Harvard Psychoacoustic sentences. Three speech sources were used: the DECtalk 2.0 (voice type not reported), the Prose 2000 V3, and the MITalk-79. Sentences were presented under earphones at an SPL of 80 dB in the presence of 50 dB of white noise. Following presentation of the same sentence by two speech sources, subjects were asked to select which voice was most natural sounding, "A" or "B". Data were collected for pair-wise preference, response latency for the preference choice, and confidence rating for the preference decision. All differences in preference were found to be significant with the exception of the Prose/MITalk combination. Logan and Pisoni state that in all cases the most intelligible voice was also the most preferred in the pair. Confidence in rating also was statistically significant and found to correspond to rankings by intelligibility.

In a second experiment the same procedures were employed; however, the Phoneme Specific Sentences were used as stimuli. In addition, the Infovox was substituted for the MITalk as one of the speech sources. A similar pattern was noted among speech sources for pair-wise preference. In addition, the same association between preference and intelligibility ranking was seen.
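McHugh's aggregation scheme (a mean preference score per source within each subject, ranks within the subject, then a mean rank across subjects) is a simple way to neutralize differences in how individual subjects use a rating scale. A sketch of that computation; the data layout and names are hypothetical, and ties among a subject's means are not handled:

```python
def mean_ranks(ratings):
    """ratings: {subject: {source: [per-sentence goodness scores]}}.
    For each subject, average the scores for each source, rank the
    sources within that subject (rank 1 = highest mean), then
    average each source's rank across subjects."""
    rank_totals = {}
    for per_source in ratings.values():
        means = {src: sum(s) / len(s) for src, s in per_source.items()}
        ordered = sorted(means, key=means.get, reverse=True)
        for rank, src in enumerate(ordered, start=1):
            rank_totals[src] = rank_totals.get(src, 0) + rank
    n = len(ratings)
    return {src: total / n for src, total in rank_totals.items()}

# Two subjects who use the 7-point scale very differently but agree on order:
print(mean_ranks({
    "S1": {"A": [7, 6], "B": [5, 5]},
    "S2": {"A": [3, 3], "B": [1, 2]},
}))  # {'A': 1.0, 'B': 2.0}
```

Because only within-subject order survives, a subject who clusters at the low end of the scale contributes exactly as much as one who clusters at the high end.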
A different approach was used by Nusbaum, Schwab, and Pisoni (1984, as cited in Logan and Pisoni, 1986). A questionnaire was used to determine subjects' subjective preferences for speech generated by the MITalk and the VOTRAX Type 'N Talk. The questionnaire required subjects to make forced-choice judgements between pairs of adjectives (e.g., gentle/harsh, halting/fluent, hard/easy). The synthesizers tended to be rated as having speech that was more harsh, rough, and coarse than natural speech. It is noted, however, that the preselected adjectives may have biased the subjects' responses. Further, the data provided little information about the attributes used to make these judgements.

Summary

A variety of methods have been used to measure the quality of synthesized speech, including questionnaires (Nusbaum, Schwab, and Pisoni, 1984), phoneme recognition (Nusbaum, Greenspan, and Pisoni, 1986), scaling (McHugh, 1976), and forced-choice pair-wise comparisons (Logan and Pisoni, 1986). A persistent problem in the investigation of speech quality is the definition of the concepts under scrutiny and the operationalization of those concepts so as to obtain results which are quantifiable, repeatable, precise, and accurate. Researchers have demonstrated an association between intelligibility and quality (e.g., Logan and Pisoni, 1986); however, a high degree of naturalness does not necessarily mean that the speaker will be intelligible (Nixon, Anderson, and Moore, 1985, as cited in Klatt, 1987).

Evaluation of Receiver Performance in Complex Tasks

Introduction

Most studies of synthesized speech involve only short-term recall or repetition of stimuli.
A plausible next step in the evaluation of synthesized speech is to devise a task which requires a variety of skills and offers different levels of complexity. There is a "need to develop new measures of sentence comprehension that can be used to study speech communication at processing levels above and beyond those indexed through transcription tasks and forced-choice intelligibility tests" (Pisoni, Manous and Dedina, 1986, p. 20).

A distinguishing characteristic of human conversation is feedback between receiver and sender. The nature, extent, and role of such feedback vary with the purpose and formality of the communication paradigm. In some situations, feedback signals sent from receiver to sender acknowledge receipt of a signal or message (or convey the receiver's confidence about what was received). Such acknowledgment may invite additional communication from the sender or signal turn-taking in conversation. In other cases, feedback may be used to verify a receiver's hypothesis about a signal or message originating with a sender. Here verification of portions of the signal or message may enhance the accuracy of information transfer. In still other cases, feedback from a receiver may convey to the sender the idea that a signal was not received or that a message was not understood. Such information may be used by the sender to modify either the signal or the message (or both) to optimize information exchange. Modifications that respond to particular characteristics of the communication system (sender, channel, receiver) may be thought of as adaptive. When a receiver uses feedback to minimize known or suspected errors in communication, the result is that of correcting or repairing flawed information exchange.
Thus, corrective feedback can be thought of as "communication repair" behavior. This seems to be a natural event among humans and may influence perceptions of the naturalness of man-machine interaction, if not also of synthesized speech.

Feedback in Human Communication

Early research in human communication demonstrated that different amounts of feedback increase accuracy of performance (Leavitt and Mueller, 1951; Rosenberg and Hall, 1958). Leavitt and Mueller (1951) designed two experiments with the purpose of determining the effects of the presence or absence of feedback on such variables as accuracy, confidence level of the subjects, and time to completion of the task. Classroom instructors were asked to describe different groupings of geometric shapes. In a first experiment, the students were asked to recreate the geometric patterns from the oral descriptions. Four different feedback conditions were employed: (1) zero feedback (no visual or verbal), (2) visible audience (no verbal), (3) yes/no student responses (visual and limited verbal), and (4) free feedback (unlimited visual and verbal). Results revealed consistent increases in accuracy and in subjects' (both students' and instructors') confidence in their accuracy with increased levels of feedback. Conversely, the time required to give the instructions increased when additional feedback was made available.

Experiment two further explored the conditions of no feedback and free feedback. The purpose of this experiment was to determine the effects of feedback on performance over a longer series of trials (4). The same differences were noted between feedback conditions in initial accuracy, confidence, and time. However, in the zero feedback condition, accuracy climbed steadily with repeated trials, whereas the free feedback condition began at and then remained at the same high level. In contrast, when time to completion was evaluated, the zero feedback condition demonstrated no significant change with trials, whereas the free feedback condition showed a steady decline in the time required to complete the task. Thus, interpersonal feedback during human communication increases subject confidence and accuracy, although the presence of an internal feedback system is suggested by the increase in performance observed even in the absence of free interpersonal feedback. It is further noted that adding more feedback increases the time required to complete tasks. The authors suggested, however, that the potential exists for a continuing decrease in time and amount of feedback to the point where most misunderstandings are clarified and no feedback is required.

Feedback and Synthesized Speech

Similar findings would be expected when synthesized speech is used in place of the human voice. Although no research is available on feedback during task performance, researchers have demonstrated improved perception of synthesized speech with feedback to the subject about performance during training sessions (Schwab, Nusbaum and Pisoni, 1985; Greenspan, Nusbaum and Pisoni, 1986). In a study conducted by Greenspan, Nusbaum and Pisoni (1986), training took the form of visual and auditory repetition of the item to the subject (whether or not it was requested) following transcription of the stimulus item. A variety of message types (words, meaningful sentences, and semantically anomalous sentences) and training stimuli (both novel and repeated) were used. The post-training tests revealed that subjects demonstrated significant improvements in recognition of the stimuli regardless of message type. The untrained control group demonstrated no such improvements in performance.

Summary

An obvious difference between human and synthetic talkers is the difficulty machine-based systems have in eliciting, understanding, and using feedback from a listener. Yet it is possible to provide the listener with control options that produce effects (in the synthetic talker) similar to those observed in person-to-person communication. For example, a listener can be given the option to cause a synthesizer to repeat a previous signal. Other command options can be specified strategically in consideration of the kinds of feedback that enhance information transfer in person-to-person communication conducted in difficult listening situations. Such options include changes in signal level (or in signal-to-noise ratio), speaking rate, phrasing (word choice), and speaking mode (e.g., normal vs. cardinal letter names or coded letter names). This leads to several questions regarding the presence of a feedback option in an ongoing task in which synthesized speech is used. How frequently would feedback be employed? What types of alterations in the message would the listener request? What are the effects of a feedback option on accuracy of task performance? Do patterns of requested alterations vary with listener experience? Can answers to these questions promote further analysis of differences among speech synthesizers?

Purpose of the Study

Studies of the understandability of speech synthesizers have been based primarily on word-recognition and listening comprehension tasks.
Review of the available research has identified aspects of those procedures which may prove useful in further evaluations of synthesized speech. The study was devised to parallel the components of the Shannon-Weaver model of communication systems (see Figure 1.4). Part one entailed an acoustic analysis of signals produced by speech synthesizers. It was theorized that such data can be used to predict the performance of speech synthesis systems. Part two evaluated semantic precision by using sentences and connected discourse presented in noise as measures of intelligibility and listening comprehension. Finally, part three evaluated the effect of synthesized speech on receiver performance and the role of feedback upon receiver performance, using a task requiring a variety of skills and containing several levels of complexity.

Questions

The following questions were addressed upon analysis of the data:

(1) Do synthetic speech sources differ significantly from each other (and from human talkers) as a function of ...

[Figure 1.4. A model of the communication system and the experiments addressing each of its components: technical precision, semantic precision, and task performance.]

... communication lines.
Text-to-speech translation was accomplished using rule structures implemented by read-only memory (ROM) devices located in the synthesizer. System 3 was an "add-in" circuit board (also available as a peripheral device) containing sound generation circuits. Text-to-speech translation was accomplished via software distributed in disk form. The specific settings and voice characteristics classified as the "default" mode for each synthesizer are summarized in Appendix B.

Recording Methods

Unless otherwise noted, recordings for all experiments were made on Ampex 632 tape by connecting the output of the synthesizer to an audio mixer (TEAC, Model MB-20), then to a cassette recorder (JVC, Model KD-15). The materials were generated using the text-to-speech systems set for the male speaker or default pitch settings. Default conditions for rate and inflection were employed whenever possible. If default modes did not exist, a mid-scale value was used. If automatic inflection was available, it was employed. No efforts were made to optimize text-to-speech translation by any of the synthesizers. Volume control settings were set to default or mid-scale positions. Successive repetitions of the sentence "He tacked the tip tap top of the teep with toop" were output to allow adjustment of the input level of the recorder for a VU peak between -3 and 0 dB.

Technical Precision

Experiment 1: Acoustical Analysis

Experiment 1 represented an attempt to describe selected acoustical properties of speech sources. The goal was to develop a single-number index for each speech synthesizer reflecting goodness of fit to human speech.

Methods

Stimuli: The stimuli consisted of four front vowels [i, ɪ, ɛ, æ], four back vowels [u, o, ɔ, ɑ], and one central vowel [ʌ] of standard American English.
These vowels were produced in a consistent CVC nucleus environment where the consonant [p] was specified to simplify spectrographic identification of the vowel. As suggested by House and Fairbanks (1953), the CVC nucleus was preceded by [hə] to minimize the effects of phonetic context on test syllables.

Human talker: A male talker of General American dialect recorded phonetic stimuli using the sentence "Say the word ____ another time." Each vowel was presented to the speaker in a random order and spoken at "habitual" inflection, pitch, rate and linguistic emphasis. Three trials of the word list at each pitch level were conducted, with the average measurements of those trials used for analysis of data. Tape recorder input level was established using successive repetitions of the sentence "He tacked the tip tap top of the teep with toop" for a VU range of -3 to 0 dB. The recordings were made in a sound chamber using a microphone (Electro-Voice, Model RE-15) at a mouth distance of 20 to 30 cm.

Experimental apparatus: To describe the acoustical properties of the signal sources, measurements of rate, fundamental frequency and center formant frequencies (F1 through F3) were performed using a Kay DSP Sona-Graph Model 5500 connected to a Marantz Model 5020 cassette player. The DSP Sona-Graph allows for dual-channel, real-time display and analysis of acoustic signals. Therefore, channel one was used for a spectrographic display and formant measurements, while channel two was used for waveform display and fundamental frequency measurements. Recorded stimuli were gated to isolate the CVC sequence from the carrier phrase. Cursors were used to isolate the target syllables, followed by auditory playback of the segment to verify the selections. Spectrograms were produced using a broadband filter (300 Hz) on a time axis of 1.0 second.
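The fundamental frequency measurements described here reduce to simple arithmetic: F0 is estimated as the reciprocal of the average peak-to-peak period read from the waveform display. A minimal sketch of that calculation follows; the period values are hypothetical, not measured data from this study:

```python
# Sketch: estimating fundamental frequency from peak-to-peak periods
# measured on a waveform display (hypothetical values, in seconds).
periods = [0.0081, 0.0079, 0.0080, 0.0082]  # successive glottal periods

f0_estimates = [1.0 / p for p in periods]        # reciprocal of each period
mean_f0 = sum(f0_estimates) / len(f0_estimates)  # average F0 in Hz

print(round(mean_f0, 1))  # roughly 124 Hz for these periods
```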
Formants were identified by visual inspection using the black-on-white visual graphics. Various color coding options were selected as required to enhance visualization. The DSP automatically displayed the frequency identified by the cursor. Fundamental frequency measurements were made by measuring the peak-to-peak time intervals on the waveform display. The reciprocal of these time intervals estimated the fundamental frequency of CVC vowels.

Semantic Precision

Experiment 2: Word Recognition

Experiment 2 represents the first of three studies designed to evaluate selected aspects of reception. The goal was to obtain a measure of word recognition in context under competing message conditions.

Method

Stimulus Materials

Word recognition was assessed using the Revised Test of Speech Intelligibility in Noise (SPIN) (Bilger, 1983). The original goal of this test was to provide a measure of listening abilities in everyday situations. Evaluations of the materials with both normal-hearing (Kalikow, Stevens and Elliott, 1977) and impaired-hearing subjects (Dirks, Kamm, Dubno and Velde, 1981; Dubno, Dirks and Morgan, 1984; Bilger and others, 1984) have demonstrated the usefulness of the test. Bilger's revision consists of eight recorded lists of 50 sentences. Each list contains 25 high-predictability and 25 low-predictability key words. Test materials are presented together with a competing signal (twelve-talker voice-babble). Bilger and others (1984) offer norms for normal and impaired listeners based upon presentation of test sentences at a speech-to-babble ratio of +8 dB where the sentences are adjusted to a level of 50 dB above threshold for that signal.
Responses are scored on the basis of the number of key words correctly identified. Copies of the SPIN scoring forms are given in Appendix B.

SPIN test results were indexed by percent-correct word-recognition scores computed separately for total test lists, as well as for low-predictability key word and high-predictability key word subtests. Descriptive statistics (means, standard deviations and ranges) were calculated for each of these dependent variables, and for each of the six speech sources employed in this experiment. Because the three percent variables were based on different numbers of items (25 for subtests; 50 for full lists), and because percent-correct indices often are skewed, these data were transformed for further analysis.

Recording Methods

To maintain a consistent relationship between the speech stimuli and babble tape, as well as among stimulus recordings by different speech sources, equivalent sound levels (Leq) were measured. Leq measurements also resolved differences between the synthesizers in terms of relative levels among segmental units (phonemes). Duration and Leq measurements were undertaken using a TDH-39 earphone, a 6-cc coupler, and a Larson-Davis Laboratories integrating sound level meter (Model 800B). A block diagram of the arrangement of the equipment is shown in Figure 2.2.

Figure 2.2. Apparatus used for making Leq measurements and placement of 1000 Hz calibration tones on experimental tapes.

Level calibration tones placed at the beginning of each experimental tape were related to the program material such that the calibration
tone had an Leq nominally 6 dB more than the Leq of the associated material.

SPIN lists 1, 3, 5, 6, 7, and 8 were used as stimuli in Experiment 2. Selection of these lists was based on data provided by Bilger (1983). He demonstrated that list 2 deviates from other lists in reliability of raw scores for the low-predictability subtest. List 4 has the lowest reliability (r=0.927) and the highest standard error of measurement (7.72%). Therefore, lists 2 and 4 were not used as experimental stimuli.

Experimental Apparatus

Description: A block diagram of the arrangement of the experimental apparatus is provided in Figure 2.3. Individual subjects were seated in a double-walled sound suite (IAC, dimensions 2.54 m x 2.74 m x 1.98 m). Speech stimuli were presented monaurally under earphones (TDH-39 mounted in an MX-41/AR cushion) to the better ear as determined via a four-frequency average (500, 1K, 2K and 4K Hz). The poorer ear was covered by a dummy earphone and cushion. SPIN lists and voice-babble were reproduced with cassette players (Marantz, Model 5020 and JVC, Model KD-15, respectively), and routed to a speech audiometer (Grason-Stadler, Model 162), where the signal and babble tracks were mixed for output to the earphone.

Figure 2.3. Block diagram of experimental apparatus for Experiment 2.

Calibration of listening apparatus: In those experiments involving taped presentation to listeners, a calibration selector switch allowed the examiner to monitor a taped calibration tone routed through either audiometer channel.
These signals were monitored with a true RMS VTVM (Bruel and Kjaer, Model 2409) as follows. Before and after each SPIN list was presented, the audiometer was adjusted to a hearing level (HL) of 70 dB. The VTVM levels for test list and babble calibration tones were prerecorded. If pre- and post-test level checks were within ±1 dB, subject responses were accepted. In addition to the within-session calibration, the speech audiometer and earphones were checked and found to be within tolerances specified by ANSI S3.6-1969. The frequency response curve for the TDH-39 earphones can be found in Appendix C.

The signal-to-babble ratios recommended by Bilger (1983) were based upon signals recorded by a human. Because performance-intensity functions may differ for human and synthesized sources (Chial, 1973), a pilot study was conducted to identify a speech-to-babble ratio that could be used for all of the sources studied here. The goal of the pilot study was to find a compromise S/B for which listeners' performance with a high-quality source (DECtalk) and a low-quality source (Echo II+) fell within the linear portion of the performance-intensity function. Details of the pilot study are given in Appendix D. Each subject listened to the DECtalk and Echo II+ synthesizers at four different speech-to-babble ratios: +8, +4, 0, and -4 dB. On the basis of this information a signal-to-babble ratio of +8 dB was selected for use in Experiment 2.

Procedures

After auditory screening, the subjects were seated in the sound room. A monaural voice-babble detection threshold was measured. The SPIN list presentation level was adjusted to be 50 dB above this threshold and the voice-babble presentation level was adjusted to be -8 dB relative to the SPIN list presentation level (Bilger, 1983). Subjects were instructed as follows:

"This is an experiment in which you will hear several sets of sentences. The sentences will come from the earphone on your ear.
Your job will be to repeat the last word of each sentence. For example, if you hear "Mrs. Smith did not consider the door," then say "door." It will be hard to hear the sentences because they will be played in the presence of background noise of many people talking at the same time. The noise will come from the same earphone as the sentences. If you are not sure of the last word, feel free to guess. We will use 6 tests, with 50 sentences each. Each test takes a few minutes and will be presented by a different talker. There will be a short break after each test. Once again, you are to listen to the sentence and repeat the last word you hear. If you are not sure of the word please guess. Any questions? Let's try a practice list."

Earphones and the talkback microphone were positioned by the experimenter and a 50-sentence practice list was administered in the presence of the voice-babble noise (+8 dB S/B). The practice list was divided into two parts. The first 25 sentences consisted of items presented by the male talker as recommended by Bilger (1983), thus providing experience with the task. The second set of 25 sentences consisted of five sentences produced by each of the five speech synthesizers. This was intended to provide exposure to the different talkers to be used during the test. Total experimental time was approximately 1.5 hours per subject.

Experiment 3: Listening Comprehension

Experiment 3 was the second in the series of experiments designed to evaluate selected aspects of reception. The goal was to obtain a measure of listening comprehension as a function of talker.

Method

Stimulus Materials

The stimulus materials consisted of six passages and corresponding multiple-choice tests selected from materials developed by Connors (1974). His selection was based on an index of readability (discussed below) and on test "difference" scores.
Difference scores were calculated by subtracting scores obtained when the tests were administered without reference to the passages ("test only" scores) from scores obtained when the passages were administered. Additional factual passages selected from texts designed for use by junior high and high school students (Chial, 1973) and for use in perceptual research (Cox and McDaniel, 1984) were used as practice stimuli. Phonetic content was not a criterion for selection of material. Copies of these materials are given in Appendix E.

Two measures were applied to the experimental passages to determine levels of readability across passages: Gunning's Fog Index and Flesch's Index of Readability (Gross, 1986). Both indices provide values said to represent the number of years of schooling required to comprehend the material. A grade-level criterion of between 5 and 7 was used for selection of the experimental passages. Analysis of the passages was performed by means of Thunder (Gross, 1986), a program designed for the Macintosh computer. These data, as well as those originally calculated by Connors (1974), are provided in Appendix E.

Recording Methods: Recording procedures were the same as those previously described. Equivalent sound levels (Leq) were used to establish equal levels for taped stimuli and the competing signal. Level calibration tones were related to the program material such that the calibration tone had an Leq of 6 dB greater than the Leq of the associated material.

Experimental Apparatus

Description: A block diagram of the arrangement of the experimental apparatus is provided in Figure 2.4.
The test environment, mode and level of signal presentation, and method of level calibration were the same as in Experiment 2.

Figure 2.4. Block diagram of experimental apparatus for Experiment 3.

In order to measure multiple-choice test completion time, the subject was also given a response button. This was connected to a switch input (Coulbourn S22-02), the output of which was connected to an RS/T flip-flop (Coulbourn S41-02). The flip-flop was connected to a precision time-base generator (Coulbourn S51-1) set to 100 Hz and finally to an electronic timer (Coulbourn R11-25). The timer was reset and activated by the experimenter at the onset of each test using a switch module (Coulbourn S96-03).

Procedures

Each of 12 subjects who satisfied audiometric criteria was seated at a desk and provided with a pencil. The subjects were instructed to listen to each passage, after which they were given a multiple-choice test regarding the content. The following instructions were given orally:

You are going to hear recordings of passages made by several different talkers. You will hear the speaker in only one ear. The passages will come from the earphone on your ear. It will be hard to hear them because they will be played in the presence of a background noise of many people talking at the same time. I want you to ignore the noise completely. Pay attention only to the voice. Once you have listened to the passage you will be given a short test regarding the content. The test will consist of eight or nine multiple-choice questions. Do not leave a question blank. If you are not sure of an answer please guess. Keep in mind you will be scored on the time it takes you to complete the test as well as the number correct, so work as quickly as you can.
I will tell you when to begin, and when you are done press this button to stop the timer. There are 6 passages, each lasting approximately 30 seconds. At the end of each multiple-choice test there will be a short break. Remember, your job is to listen to the passage and then answer the test questions about the passage as quickly as you can. Any questions? Let's try a practice set.

Practice passage one (recorded using a female voice) and its corresponding test were administered to familiarize the subject with the test-taking process. A second practice passage without a test was presented to familiarize the subject with the talker for that list. The same familiarization passage was used prior to each voice. Experimental passages were presented monaurally at a sensation level of 50 dB (re: voice-babble threshold) and at a S/B of +8 dB. Following presentation of an experimental passage the subject was directed to select a folder containing a copy of the associated multiple-choice test. The experimenter told the subject to begin and simultaneously started the timer. The subject pressed the timer response button when the test was completed. Order of passages and talkers was randomized among subjects. Approximately one hour was required for each subject to complete Experiment 3.

Task Performance

Experiment 4: Oral Instructions

Experiment 4 was the first of two studies evaluating selected aspects of receiver performance. The goal was to obtain a measure of listener performance as a function of talker on a relatively complex task.

Method

Stimulus Material Preparation

Subjects completed an adaptation of the Oral Directions subtest of the Detroit Tests of Learning Aptitude (Hammill, 1985). The original subtest is intended to measure "listening comprehension, spatial relations, manual dexterity, short term memory, and attention" (Hammill, 1985, p. 56). The relative contributions of these abilities to test performance are not known.
The subject is given a series of commands to be carried out using pencil and paper, e.g., putting an X in a square, the letter F in a triangle and a numeral 4 in a circle. Thus, each command may be described as containing a minimum of an action-patient combination with an adverb of place. Here action denotes the presence of a verb expressing an activity or movement that can be seen or heard. Patient is defined as the receiver of the effect of a process or action (Heidinger, 1984).

In the present experiment, it was desirable to evaluate the effects of message complexity upon subject performance. In addition, several versions of the test were required for practice and for use with each speech source. Therefore, a modification of the Oral Directions Test was devised. This modification, called the Multiple Instructions Test (MIT), invokes four levels (A through D) of increasing length and complexity. As Figure 2.5 and Table 2.1 illustrate, Level A contained two commands, Level B had three commands, Level C had four commands, and Level D had five commands embedded within each item. Levels B, C and D also contained adjectives (order, color, size). Levels C and D contained restrictions such as "The line may not touch any other object" in place of an action-patient command. For the present purpose a restriction was defined as a statement following a command or series of commands that places additional constraints on the action to be taken.

Figure 2.5. Examples of commands used in the Multiple Instructions Test.

Table 2.1. Permitted content options for each level of command of the MIT.

Complexity Level of Command    A        B        C        D
Number of Commands             2        3        4        5
Number of Words Per Item       9-12     15-18    21-24    27-30
Gunning's Fog Index            4-5      7-9      10-12    12-15
Flesch's Index                 182-179  171-166  155-151  141-139
Action and Patient             yes      yes      yes      yes
Adverb of Place                yes      yes      yes      yes
Adjectives (size, ordinal,
  color of object)             no       yes      yes      yes
Restriction                    no       no       yes      yes
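The action-patient-place grammar of the MIT can be mocked up in a few lines. The sketch below assembles a hypothetical Level A item (two commands, each with an adverb of place, no adjectives or restrictions); the word pools are illustrative stand-ins, not the actual MIT lists of Appendix F:

```python
import random

# Illustrative word pools; the actual MIT pools appear in Appendix F.
ACTIONS = ["Put", "Draw", "Write"]
PATIENTS = ["an X", "the letter F", "a numeral 4"]
PLACES = ["in the square", "in the triangle", "in the circle"]


def level_a_item(rng):
    """Assemble a Level A item: two action-patient commands,
    each with an adverb of place (per Table 2.1)."""
    first = f"{rng.choice(ACTIONS)} {rng.choice(PATIENTS)} {rng.choice(PLACES)}"
    second = f"{rng.choice(ACTIONS).lower()} {rng.choice(PATIENTS)} {rng.choice(PLACES)}"
    return f"{first} and {second}."


print(level_a_item(random.Random(7)))
```

Higher levels would extend the same pattern with adjectives (size, ordinal, color) and, for Levels C and D, a trailing restriction clause.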
A total of six items were presented at each level. Table 2.1 summarizes the characteristics assigned to each level of command. A list of the actions, patients, adverbs and adjectives, as well as copies of the stimuli, are provided in Appendix F.

Subject performance was measured in terms of (1) time-to-completion of the total task, (2) time-to-completion per item, (3) percent of items correctly executed at each level of command, and (4) total percent correct for all items. Determination of correct command execution was based on a predetermined set of scoring criteria for each item. Examples of an item, the scoring criteria for each command and sample responses are shown in Table 2.2.

Table 2.2. Sample commands and scoring criteria for the MIT.

Sample Level A Command: Put a "B" in the circle and circle the small square.
Sample Scoring Criteria:
1. Is there a "B" in the circle?
2. Is there a circle around the small square?

Sample Level C Command: Put an "X" in the red square, an "X" in the second triangle and draw a line from the small square to the oval. The line may not touch any other shape.
Sample Scoring Criteria:
1. Is there an "X" in the red square?
2. Is there an "X" in the second triangle?
3. Is there a line from the small square to the oval?
4. Does it touch any other shape?

Recording Methods: The basic recording procedures were the same as those previously described. Leq measurements of recorded commands were taken using a Larson-Davis Laboratories sound level meter (Model 800B). Calibration tones placed at the beginning of each experimental tape were related to the program material such that the calibration tone had an Leq of 6 dB more than the Leq of the associated material.

Additional screening procedures: In addition to the audiometric screening procedures previously described, the subjects for this experiment were asked to undergo screening for color blindness and appropriate color nomenclature.
The purpose of this test was to rule out those subjects unable to respond effectively to items containing color modifiers. This was accomplished through the use of Dvorine Pseudo-Isochromatic Plates (Dvorine, 1953). The subjects were asked to provide the name for a group of saturated colors (red, brown, purple, yellow, blue, green, gray, and orange). Part two called for the identification of the digits on 15 plates made of eight different color combinations to rule out color blindness.

Experimental Apparatus

Description: Figure 2.6 illustrates the equipment used in Experiment 4. Test items and voice-babble were reproduced (Marantz, Model 5020 and JVC, Model KD-15 cassette recorders, respectively), and routed to the speech audiometer (Grason-Stadler, Model 162), where the signal and voice-babble tracks were mixed for output through a monaural earphone. The subject was provided with a response button connected to the electronic timer described in Experiment 3. The subjects were asked to press the button when they finished each item. Response latency was timed from the onset of an alerting phrase ("Do it now," "Begin now" or "Start").

Figure 2.6. Block diagram of experimental apparatus for Experiment 4.

Procedures

The subjects were provided with the response books, a pen and the response button. They were instructed orally as follows:

You will be hearing a set of directions to be carried out with the materials in front of you. Here are samples of the pictures you will be seeing and the colors used to describe them. Take a moment to look them over. Please tell me now if you think you will have any difficulty identifying any of these colors, objects or shapes. This page demonstrates how the pictures will be arranged. A sample instruction might be "Number one.
Put a P in the diamond and circle the square." Please listen carefully to the directions given for each item, waiting to begin until after you have heard the command to do so. Immediately upon completion of the item push the response button provided. Respond as quickly and as accurately as possible for each one. Your score is based on the time it takes you to complete it correctly. The directions you will hear are made by several different talkers. You will hear the speaker in only one ear. Along with the voice you will hear the noise of a group of people talking at the same time. I want you to ignore the noise completely. Pay attention only to the voice and carry out the instructions as quickly and as accurately as you can. Any questions? Let's try a practice test.

The practice list consisted of 24 items, with one item presented by each of the six speech sources at each complexity level. This served to provide subjects with experience in both the experimental task and the speech sources. Practice and experimental tests were presented at a level of 50 dB above the subject's threshold for the voice-babble. A signal-to-babble ratio of +10 dB was maintained. Order of tests and talkers was randomized between subjects. A short break was provided between each test. The data acquisition time for each subject was approximately 2 hours.

Experiment 5: Oral Instructions With Communication Repair

The goal of Experiment 5 was to describe the pattern and effect of communication repair options on complex task performance.

Methods

Experimental Apparatus

Description: Subjects were asked to complete the MIT using the VOTRAX Personal Speech Synthesizer (PSS). This device was selected because it could be interfaced with the Apple IIe computer for listener control and because it offered a reasonable number of repair options. In addition, initial analysis of the results of Experiment 2 indicated the VOTRAX PSS ranked third of the five synthesizers.
By selecting a mid-range device it was hoped to avoid ceiling effects when repair options were employed. Version 4 of the MIT was selected for use with this experiment. Item presentation was controlled by the computer using a program written for this purpose. All experimental items were presented live, thus allowing the subjects to choose from a limited set of repair options. The calibration phrase "He tacked the tip tap top of the teep with toop" was used to set the VU meter of the audiometer to a range of -3 to 0 dB prior to testing each subject.

Figure 2.7 illustrates the equipment configuration for Experiment 5. Voice-babble was reproduced via cassette recorder and mixed with the MIT items for monaural earphone presentation. MIT stimuli were presented at 50 dB SL (re: voice-babble detection threshold). A voice-to-babble ratio of +10 dB was used for all subjects. If a subject selected the "repeat louder" option, the S/B returned to +10 dB after that item was presented. Sound level measurements of the stimuli indicated the repeat-louder option resulted in an increase of 5 dB. Consequently, the S/B was increased to +15 dB when this option was selected. Control of the repair option was performed by the experimenter via keyboard following instructions from the subject.

Procedures

Each subject was provided with response books and a pen. The following instructions were given orally:

You will be hearing a set of directions to be carried out with the materials in front of you. Here are samples of the pictures you will be seeing and the colors used to describe them. Take a moment to look them over. Please tell me now if you think you will have any difficulty identifying any of these colors, objects or shapes. This page demonstrates how the pictures will be arranged. A sample instruction might be "Number one. Put a P in the diamond and circle the square." Please listen carefully to the directions given for each item.
Your job is to carry them out as accurately as you can. If you wish, you may ask for an item to be repeated using one of the following options: (2) Repeat, no change. (3) Repeat, louder. (4) Repeat, slower. (5) Repeat, louder and slower. (6) Repeat, faster. (7) Repeat, louder and faster. To request a repetition simply state the option you feel will be most helpful to you, either by number or by phrase. You may request up to three repetitions per item. If you do not require the item to be repeated, or if you are finished with the instructions, request number one (no repeat or done). There are 24 items in the set. At the end of each page (every six items) there will be a short pause before the item on the next page is presented. You will hear the speaker in your ear. Along with the voice you will hear a noise composed of a group of people talking at the same time. I want you to ignore the noise completely. Pay attention only to the voice and carry out the instructions as accurately as you can. Any questions? Let's try a practice test. The purpose of this set is to give you a chance to experience the task and to become familiar with the repetition options. In this set you will receive a sample of several different voices including the one you will be hearing. As with the regular test, you may request the repetition option you would like to have if given the chance. However, in the practice test the item will not be repeated. Simply carry out the instructions as accurately as you can.

Figure 2.7. Block diagram of experimental apparatus for Experiment 5.

The practice set was administered. This was the same practice tape used in Experiment 4. This process served to provide the subjects with exposure to both the experimental task and to the voice.
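The level bookkeeping behind the repair options is simple: the three "louder" options (3, 5 and 7) add the measured 5 dB boost to the baseline +10 dB S/B, and everything else stays at baseline. The sketch below is an assumed reconstruction of that dispatch logic, not the original Apple IIe control program:

```python
# Assumed reconstruction of the repair-option level logic: options
# 3, 5 and 7 add the measured 5 dB boost, giving S/B = +15 dB;
# all other options keep the baseline +10 dB S/B.
BASELINE_SB = 10   # dB, signal-to-babble ratio for all subjects
LOUDER_BOOST = 5   # dB, measured increase for "repeat louder"

REPAIR_OPTIONS = {
    1: "no repeat / done",
    2: "repeat, no change",
    3: "repeat, louder",
    4: "repeat, slower",
    5: "repeat, louder and slower",
    6: "repeat, faster",
    7: "repeat, louder and faster",
}


def signal_to_babble(option):
    """Return the S/B ratio (dB) used when presenting a repetition."""
    return BASELINE_SB + (LOUDER_BOOST if option in (3, 5, 7) else 0)


print(signal_to_babble(3), signal_to_babble(4))  # prints: 15 10
```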
Data acquisition time was approximately 45 minutes per subject.

Chapter III

Results and Discussion

Introduction

This study assessed a set of five commercially available digital speech synthesizers representing a range of technologies and costs. The goals of the study were to (1) develop instrumental and behavioral assessments that parallel selected aspects of a model of communication systems, (2) describe differences among speech synthesizers as revealed by various assessment methods, and (3) compare rankings of speech synthesizers resulting from the assessment methods. An additional goal was to define measures of "communication ergonomics" capable of reflecting the utility of speech synthesizers in tasks requiring exchange of information between machines and people.

Five experiments were devised to investigate the three major features of technical accuracy, semantic precision, and complex task performance. Technical accuracy was assessed through measurements of fundamental frequency (F0) and formant frequencies (F1, F2, F3) for each of nine vowels in a CVC context. Semantic precision was evaluated through a word-recognition-in-noise task and a listening-comprehension-in-noise task. Complex task performance was assessed using tests containing instructions of increasing duration and semantic complexity, with and without listener options for feedback control of speech sources.

Fifty normal-hearing subjects were recruited from a university population for participation in the four experiments requiring subjects (Experiments 2 to 4, N=12; Experiment 5, N=14). Each subject received an audiological screening consisting of otoscopic examination, pure tone and tympanometric testing. Subjects were practiced in experimental tasks and speech sources prior to data acquisition.
The following experimental questions were asked: (1) Do synthetic speech sources differ significantly from each other (and from human talkers) as a function of (a) acoustic structure of phonemes, (b) word recognition measured in noise, (c) comprehension measured in noise, and (d) complex task performance? (2) To what extent can b, c, and d (above) be predicted from a? (3) What is the effect of listener-invoked communication repair upon the ability of listeners to accomplish complex tasks directed by speech synthesizers?

Technical Accuracy

Experiment 1 was designed to evaluate the technical accuracy of the five synthetic speech sources. Nine vowels in a CVC context were generated and recorded using the five synthesizers and a human control. The fundamental frequency and the first three formant frequencies were determined for each vowel using spectrographic analysis. In order to calculate spectral loci coordinates it was necessary to have three formant measurements for each vowel. Therefore, if a formant could not be identified on the spectrograph under either normal (black on white) or enhanced treatments (color coded for power), the vowel was rejected for use in further analysis. Three of the nine vowels recorded for analysis were eliminated because of incomplete formant data. Six vowels remained for analysis: [i], [ɪ], [ɛ], [ɔ], [ɑ], and [ʌ].

Data Reduction

According to Miller (1984), a phoneme can be characterized on the basis of its "spectral shape". Miller represents this shape as a single point in three-dimensional space with the coordinates x = log (F3/F2), y = log (F1/F0′) and z = log (F2/F1). F0′ is defined as the fundamental frequency of the voice multiplied by a constant (1.5 for male talkers). The first step in data reduction, therefore, was to determine the xyz coordinates for each vowel generated by each speech source. Tables 3.1 to 3.6 summarize the results of these calculations.
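The coordinate calculation described above can be sketched in Python. This is a minimal illustration, not the software used in the study; the function name and the treatment of F0′ (the fundamental scaled by the constant given in the text) are assumptions, and all frequencies are in Hz.

```python
import math

def miller_coordinates(f0, f1, f2, f3, k=1.5):
    """Spectral-shape coordinates after Miller (1984).

    f0..f3: fundamental and first three formant frequencies (Hz).
    k: constant applied to the fundamental to obtain F0'
       (the text gives 1.5 for male talkers).
    """
    f0_prime = k * f0
    x = math.log10(f3 / f2)        # x = log(F3/F2)
    y = math.log10(f1 / f0_prime)  # y = log(F1/F0')
    z = math.log10(f2 / f1)        # z = log(F2/F1)
    return x, y, z

# Peterson & Barney (1952) male averages for [i]:
# F0 = 136, F1 = 270, F2 = 2290, F3 = 3010 Hz.
x, y, z = miller_coordinates(136, 270, 2290, 3010)
```

With the Peterson & Barney male averages for [i], the y and z values come out near the .121 and .928 listed for that row of Table 3.1.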
Distances among the associated spectral loci were calculated using the formula:

D = [(x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2]^0.5

where D, the distance, is the square root of the sum of the squared differences between paired points indexed by the three axes of the "Miller graph". The dependent variable of spectral loci distance was derived by summing distances across vowels. Calculations were initially conducted manually and were verified via Microsoft Excel, a spreadsheet program for use with the Macintosh. Reliability between data sets was found

Table 3.1. Speech spectra coordinates for the vowel [i] calculated from algorithms suggested by Miller (1984) which identify its placement in three dimensional space.

Source              x     y     z
Human             .189  .205  .879
DECtalk           .162  .304  .689
Amiga             .185  .187  .750
Votrax PSS        .264  .525  .736
Smoothtalker      .152  .306  .910
Echo II+          .150  .273  .477
Peterson & Barney .188  .121  .928

Table 3.2. Speech spectra coordinates for the vowel [ɪ] calculated from algorithms suggested by Miller (1984) which identify its placement in three dimensional space.

Source              x     y     z
DECtalk           .207  .391  .514
Amiga             .187  .371  .526
Votrax PSS        .264  .531  .736
Smoothtalker      .131  .422  .658
Echo II+          .165  .449  .533
Peterson & Barney .107  .284  .707

Table 3.3. Speech spectra coordinates for the vowel [ɛ] calculated from algorithms suggested by Miller (1984) which identify its placement in three dimensional space.

Source              x     y     z
Human             .171  .550  .368
DECtalk           .187  .501  .422
Amiga             .164  .512  .477
Votrax PSS        .136  .749  .433
Smoothtalker      .219  .422  .589
Echo II+          .164  .541  .415
Peterson & Barney .129  .434  .540

Table 3.4. Speech spectra coordinates for the vowel [ɔ] calculated from algorithms suggested by Miller (1984) which identify its placement in three dimensional space.
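The distance computation is a straightforward Euclidean norm in the three-coordinate space. A sketch follows; the function name is arbitrary, and the example values are simply the Echo II+ and Peterson & Barney rows for [i] from Table 3.1.

```python
import math

def loci_distance(p, q):
    """Euclidean distance between two spectral loci (x, y, z tuples)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Example: Echo II+ vs. Peterson & Barney coordinates for [i] (Table 3.1).
echo_i = (0.150, 0.273, 0.477)
pb_i = (0.188, 0.121, 0.928)

d = 100 * loci_distance(echo_i, pb_i)  # scaled x 100, as in Table 3.7
```

The result, about 47.7, agrees with the 47.72 listed for Echo II+ under [i] in Table 3.7; summing such distances across the six vowels yields the row sums of that table.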
Source              x     y     z
Human             .252  .520  .308
DECtalk           .199  .531  .380
Amiga             .204  .505  .368
Votrax PSS        .136  .749  .433
Smoothtalker      .244  .583  .385
Echo II+          .237  .439  .336
Peterson & Barney .146  .539  .416

Table 3.5. Speech spectra coordinates for the vowel [ɑ] calculated from algorithms suggested by Miller (1984) which identify its placement in three dimensional space.

Source              x     y     z
Human             .264  .579  .550
DECtalk           .315  .531  .286
Amiga             .352  .539  .243
Votrax PSS        .284  .760  .268
Smoothtalker      .109  .747  .528
Echo II+          .121  .541  .636
Peterson & Barney .350  .539  .174

Table 3.6. Speech spectra coordinates for the vowel [ʌ] calculated from algorithms suggested by Miller (1984) which identify its placement in three dimensional space.

Source              x     y     z
Human             .325  .406  .373
DECtalk           .339  .501  .285
Amiga             .270  .432  .348
Votrax PSS        .284  .711  .301
Smoothtalker      .355  .547  .335
Echo II+          .273  .541  .329
Peterson & Barney .302  .516  .269

to be on the order of ± .001. Table 3.7 summarizes these distances and the absolute sums of the distances.

Fundamental frequency and formant center frequency data extracted from the spectrograms of human and synthetic speech were compared to archival data for male talkers (N=33) taken from Peterson and Barney (1952) and plotted in the same manner. These findings were later used in correlational analyses to determine the relationship between data obtained in Experiment 1 and in Experiments 2, 3 and 4.

Statistical Methods and Tools for Experiments 2 Through 5

Subject performance for Experiments 2, 3, 4 and 5 was indexed, in part, by percent-correct scores. Other measures included completion time in seconds (Experiments 3 and 4) and percent of communication repair options selected (Experiment 5). Means, standard deviations and ranges were calculated for all indices. Because the percent-correct variables were based upon different numbers of items (e.g.
50 for SPIN full-list; 25 for SPIN subtests), and because percent-correct indices often are skewed, these data were transformed prior to further analysis. Studebaker (1985) suggested the use of a "rationalized" arcsine transform (R) to overcome the problems associated with percentage data. Advantages cited for using R include (1) the correction of the correlation between means and variances typical of percentages, (2) linearization of data relative to the variance, thus allowing for direct comparisons between all parts of the performance range, (3) the provision of percentage-like numbers for purposes of discussion, and (4) permitting variance and standard deviation calculations appropriate to percentage data.

The data transform was implemented via Microsoft Excel. Percent-correct scores were converted to proportions by dividing by 100. The proportions (p) were then entered into the first column of the spreadsheet. Proportions were converted to radians by the formula:

T = arcsin √(pN/(N+1)) + arcsin √((pN+1)/(N+1))

where T is the arcsine transform expressed in radians and N is the number of items tested. The rationalized arcsine transform (RAST) was calculated in the next column as recommended by Studebaker (1985) using the formula:

R = 46.47324337T - 23

where R is a linear transformation of T expressed in raus. The rau is defined as a quasi-physical unit with no physical dimension.

Table 3.7. Distance values for vowels in three dimensional space from Peterson & Barney (1952) vowel measurements as calculated using xyz coordinates derived from Miller (1984), scaled x 100.

Source       [i]     [ɪ]     [ɛ]     [ɔ]     [ɑ]     [ʌ]     SUM
Human       12.02   12.22   21.26   15.21   51.25   15.31  127.27
DECtalk     30.44   24.20   14.80    6.46   13.29    4.26   93.45
Amiga       20.16   21.60   10.64    8.23    8.80   11.96   81.38
PSS         47.02   29.35   33.27   21.06   20.28   19.92  170.90
Smooth      18.85   14.84   10.30    9.43   44.89    9.06  107.90
Echo II+    47.72   24.66   16.85   15.66   51.87    7.13  163.90
SUM        176.20  126.87  107.13   76.05  190.39   67.64
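The two formulas above can be combined into a single function. This is a sketch assuming the Studebaker (1985) formulation exactly as reproduced here; the function name is arbitrary.

```python
import math

def rau(p, n):
    """Rationalized arcsine transform (Studebaker, 1985).

    p: proportion correct (0..1); n: number of test items.
    Returns a score in rationalized arcsine units (raus).
    """
    # T = arcsin sqrt(pN/(N+1)) + arcsin sqrt((pN+1)/(N+1)), in radians
    t = (math.asin(math.sqrt(p * n / (n + 1)))
         + math.asin(math.sqrt((p * n + 1) / (n + 1))))
    # R = 46.47324337 T - 23, a linear rescaling to percentage-like units
    return 46.47324337 * t - 23.0
```

A score of 50% on a 50-item list transforms to roughly 50 raus, and, as the text notes below, transformed values can exceed 100 (rau(1.0, 50) is about 116) or fall below 0 (rau(0.0, 50) is about -16).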
The resulting transformed scores, expressed in raus, were used in subsequent statistical analyses. It is important to note that rau values may exceed 100 or be less than 0. Analysis was accomplished using two software programs. Descriptive statistics (means, standard deviations and ranges) and correlations were generated using StatView (Abacus Concepts, 1986). ANOVAs, Newman-Keuls post hoc tests of mean differences and simple effects were calculated using CLR ANOVA (Clear Lake Research, 1985). The criterion for significance for all statistical tests was p ≤ 0.05.

Semantic Precision

The SPIN test was used to evaluate semantic accuracy via word recognition in sentences. Subjects were asked to listen to a set of fifty sentences presented in the presence of a twelve-voice babble noise (+8 dB S/B) and to repeat the last word of the sentence. Speech sources were presented in randomized orders and SPIN lists were counterbalanced within sources. Error counts were made separately for the full-list and half-list high- and low-predictability key words and the results compared for accuracy of count. Error counts were always conducted at least twice regardless of full- and half-list agreement. In addition, a random selection of seven score sheets (10%) indicated no difference in error count between first and later scorings.

Description. Tables 3.8 and 3.9 summarize mean percent-correct scores, standard deviations and ranges of performance as a function of speech source for the full-list SPIN test and for high-predictability and low-predictability word scores. Tables 3.10 and 3.11 display mean arcsin transformed percent-correct scores, standard deviations and ranges. Overall, scores for speech synthesis sources were poorer than for the human control. Aside from natural speech, the highest mean score occurred with DECtalk (81.6) and the lowest with Echo II+ (28.8). There was considerable variation among synthesized speech sources, as reflected by the ranges.

Table 3.8.
Mean percent-correct scores, standard deviations, and ranges for SPIN full-list test results as a function of speech source (N=12).

Speech Source   Mean  S.D.  Range
Human           90.6   2.4  88-99
DECtalk         81.6   7.0  70-92
Amiga           66     7.6  54-76
Votrax PSS      51.1   8.2  36-56
Smoothtalker    36.5   7.8  26-54
Echo II+        28.8   7.2  16-40

Table 3.9. Mean percent-correct scores, standard deviations, and ranges for SPIN high-predictability and low-predictability test results as a function of speech source (N=12).

Word Predictability        Mean  S.D.  Range
Human
  High-predictability      100    0    100-100
  Low-predictability        81.3  4.9   76-88
DECtalk
  High-predictability       94.6  4.9   88-100
  Low-predictability        70.3 11.1   48-84
Amiga
  High-predictability       81    9.3   64-92
  Low-predictability        50.6  9.8   36-68
Votrax PSS
  High-predictability       62.6 12.4   32-76
  Low-predictability        39.6 13.3   20-60
Smoothtalker
  High-predictability       45.8 12.7   24-68
  Low-predictability        26    8.6   12-40
Echo II+
  High-predictability       30   12     12-52
  Low-predictability        17.8  9.3    2-32

Table 3.10. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for SPIN full-list test results as a function of speech source (N=12).

Speech Source   Mean  S.D.  Range
Human           93.1   3.8  89.0-98.0
DECtalk         81.7   8.5  68.0-95.0
Amiga           65.0   7.3  53.6-74.8
Votrax PSS      51.0   7.5  37.0-64.8
Smoothtalker    37.4   7.3  27.2-53.6
Echo II+        29.8   7.3  16.0-40.8

Table 3.11. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for SPIN high-predictability and low-predictability test results as a function of speech source (N=12).

Word Predictability        Mean    S.D.  Range
Human
  High-predictability      113.83    0   113.0-113
  Low-predictability        80.3    5.    74.3-88.1
DECtalk
  High-predictability      113.2   11.    88-113
  Low-predictability        69.1   10.    48.2-83.1
Amiga
  High-predictability       80.3   10.    62.2-93.8
  Low-predictability        50.6    8.    37.3-66.4
Votrax PSS
  High-predictability       61.6   11.    33.5-74.3
  Low-predictability        40.3   12.    21.3-59.0
Smoothtalker
  High-predictability       46.1   11.
25.6-66.4
  Low-predictability        27.2    8.    11.8-41.0
Echo II+
  High-predictability       31.0   11.    11.8-51.7
  Low-predictability        17.7   11.    -5.2 to 33.5

Figures 3.1 and 3.2 are histograms illustrating the full-list and the high-predictability and low-predictability percent-correct scores. Each bar represents mean percent-correct or mean arcsin transformed percent-correct scores for twelve subjects. Standard deviations are provided in the lower histograms. Figures 3.3 and 3.4 display the functions for the transformed scores. The figures demonstrate the rank ordering of the speech sources as a function of word-recognition scores and the differences between the high- and low-predictability scores for each speech source. The mean differences across speech sources between high- and low-predictability scores were 21.4% and 27 raus for the percent-correct and arcsin transformed percent-correct scores, respectively.

Statistical procedures. To investigate the differences noted in SPIN scores across speech sources, a one-way, repeated measures analysis of variance (ANOVA) was performed. Table 3.12 summarizes the results of the ANOVA. Results were significant. Thus, word recognition varied as a function of speech source. To establish the extent of the differences among speech sources, a Newman-Keuls' test of paired comparisons was performed, revealing that each mean differed significantly from every other mean.

A two-way, mixed-effects ANOVA was used to evaluate the main effects of speech source and SPIN subtests, as well as the interaction between the two effects. Both main effects were repeated, i.e., each subject received both SPIN subtests

Figure 3.1. Mean percent-correct scores and standard deviations for SPIN full-list test results as a function of speech source.
Each bar represents observations of 12 normal-hearing adult subjects tested monaurally. The lower histogram denotes ±1 standard deviation.

Figure 3.2. Mean percent-correct scores and standard deviations for SPIN high-predictability and low-predictability word sets as a function of speech source. Each bar represents observations of 12 normal-hearing subjects tested monaurally. The lower histogram denotes ±1 standard deviation.

Figure 3.3. Mean arcsin transformed percent-correct scores and standard deviations for SPIN full-list test results as a function of speech source. Each bar represents observations of 12 normal-hearing subjects tested monaurally. The lower histogram denotes ±1 standard deviation.

Figure 3.4. Mean arcsin transformed percent-correct scores and standard deviations for SPIN high-predictability and low-predictability word sets as a function of speech source. Each bar represents observations of 12 normal-hearing subjects tested monaurally. The lower histogram denotes ±1 standard deviation.

Table 3.12. One-way within-subject analysis of variance of full-list SPIN test results.
Source         dF       SS          MS         F        p
Subjects       11     475.944      41.6
Speech Source   5   36516.944    7303.389   143.885   <.001
Error          55    2791.722      50.759

(low-predictability and high-predictability key words) and all speech sources (DECtalk, Amiga, VOTRAX PSS, Smoothtalker, Echo II+, and a male human control). Table 3.13 displays the results of these calculations. The main effects of speech source and linguistic predictability (SPIN high- and low-predictability half-lists) proved to be significant, as was the interaction of source and predictability. Thus, speech sources resulted in different word recognition performance with regard to linguistic predictability, and high- and low-predictability scores differed with regard to speech source.

To further assess the interaction between the two main effects, the Newman-Keuls' test of pairwise comparisons was conducted. Figure 3.5 illustrates these results by using a heavy solid line to identify those variables whose mean pairs did not differ significantly. Five mean pairs did not demonstrate significant differences: Human low-predictability (81.3) and Amiga high-predictability (81), DECtalk low (70.3) and VOTRAX PSS high-predictability (62.6), Amiga low (50.6) and Smoothtalker high-predictability (45.8), VOTRAX PSS low (39.6) and Smoothtalker high-predictability (45.8), and Smoothtalker low (28) and Echo II+ high-predictability (30). Table 3.14 lists the results of simple effects tests on the interaction of the two main effects. All F-ratios were significant.

Implications. The experimental question associated with Experiment 2 was whether synthetic speech sources produced

Table 3.13. Two-way within-subject analysis of variance of SPIN test results with main effects of speech source and word predictability.
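The entries of an ANOVA summary table such as Table 3.12 are related by MS = SS/df and F = MS(effect)/MS(error). As a quick consistency check, the printed sums of squares reproduce the printed F-ratio; the values below are copied from Table 3.12.

```python
# Sums of squares and degrees of freedom from Table 3.12.
ss_source, df_source = 36516.944, 5
ss_error, df_error = 2791.722, 55

ms_source = ss_source / df_source  # mean square for speech source
ms_error = ss_error / df_error     # error mean square
f_ratio = ms_source / ms_error     # F(5, 55), printed as 143.885
```

The computed mean squares (7303.389 and 50.759) and F-ratio match the table, which is a useful sanity check when transcribing ANOVA tables.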
Source                  dF       SS           MS          F        p
Subjects                11     1256.971      114.27
Speech Source            5    94310.696    18862.139   144.374   <.001
Error                   55     7185.611      130.647
Predictability           1    22196.779    22196.779   479.619   <.001
Error                   11      509.08        46.28
Source/Predictability    5     1997.125       399.452     4.57    .0015
Error                   55     4806.611        87.393

Figure 3.5. Results of the Newman-Keuls' test of pairwise comparisons of arcsin transformed percent-correct scores for SPIN high- and low-predictability key words as a function of speech source; mean pairs connected by a solid line did not differ significantly.

Table 3.14. Simple effects tests for the interaction of speech source and word predictability for arcsin transformed percent-correct scores of the SPIN high- and low-predictability key word subtests. All F-ratios were significant.

differences in word recognition in noise. On the basis of the present outcomes it can be said that significant differences do exist among all speech sources for word recognition in sentences. These findings are consistent with those obtained in previous research (Pisoni; Mirenda & Beukelman, 1987; Hoover et al., 1987), though it should be noted that not all of the devices employed here have been used by other investigators. Only DECtalk, VOTRAX and Echo II+ have been used in other word recognition studies.
High- and low-predictability subtests of the SPIN made it possible to assess the interaction between speech source and linguistic predictability. Once again the results confirm those obtained in similar studies, at least with regard to DECtalk (Greene and Pisoni, 1988). Performance improved with increased message redundancy, as reflected by the higher scores for high-predictability key words.

Pairwise comparisons provided some interesting patterns of mean differences. For example, the Amiga high-predictability subtests did not differ significantly from the Human low-predictability scores and were significantly better than the DECtalk low-predictability scores. This is in contrast to full-list results, in which Amiga proved to be significantly poorer than both the human and DECtalk. This suggests that a device with mid-range intelligibility can be used successfully in high-predictability communication situations, while even a high quality device will exhibit markedly poorer performance in comparison to a human source in the absence of adequate message redundancy. It is likely that similar outcomes would occur for sets of test materials that differ in word familiarity or word frequency.

Experiment 3: Listening Comprehension

Factual passages were used to evaluate semantic accuracy via a measure of listening comprehension. The passages were presented in the presence of a twelve-voice babble noise (+8 dB S/B) and were immediately followed by a multiple choice test regarding passage content. Speech sources were presented in randomized orders and passages were counterbalanced within sources. The time required to complete each multiple choice test was recorded. Error counts for the multiple choice tests were repeated a minimum of two times. A random selection of seven tests (10%) indicated no differences in error count.

Description. Table 3.15 contains mean percent-correct scores, standard deviations and ranges as a function of speech source for multiple-choice comprehension test scores.
Corresponding transformed percent-correct scores (raus), standard deviations and ranges are provided in Table 3.16. The speech synthesizers demonstrating the highest and lowest mean transformed scores were DECtalk (82.3 R) and Echo II+ (49 R). Examination of the ranges reveals a wide variation in performance within all speech sources. Table 3.17 shows mean multiple choice test completion time, standard deviations and ranges as a function of speech source. Test completion times demonstrated little variation as a function

Table 3.15. Mean percent-correct scores, standard deviations, and ranges for multiple-choice comprehension test results as a function of speech source (N=12).

Speech Source   Mean  S.D.  Range
Human           82    16.4  50-100
DECtalk         82.3   8.9  66.6-100
Amiga           74.6  21.3  33.5-100
Votrax PSS      72.5  16.6  50-100
Smoothtalker    55.9  13.6  33.3-75
Echo II+        49    15    25-75

Table 3.16. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for multiple choice comprehension test results as a function of speech source (N=12).

Speech Source   Mean  S.D.  Range
Human           81.9  19.1  10-107.2
DECtalk         80.0  11.0  64-107.2
Amiga           74.3  22.2  36-107.2
Votrax PSS      71.2  17.5  50-107.2
Smoothtalker    55.1  11.6  35-71.5
Echo II+        49.1  12.7  28-71.5

Table 3.17. Mean completion time, standard deviations, and ranges of multiple choice comprehension tests (in seconds) as a function of speech source (N=12).

Speech Source   Mean  S.D.  Range
Human           38.2  11.9  20.2-56.6
DECtalk         45.5   8.8  26.1-59.3
Amiga           45.7  15.2  18.0-80.9
Votrax PSS      43.4  14.6  21.3-73.6
Smoothtalker    46.3   9.0  32.2-60.0
Echo II+        46.5  12.6  31.0-74.1

of synthesized speech source (43.4 to 46.5 secs). Table 3.18 displays arcsin transformed percent-correct test scores across all speech sources as a function of test version. Figures 3.6, 3.7 and 3.8 are histograms displaying mean test scores, mean transformed percent-correct scores and mean multiple choice test completion time.
It can be seen that the rank ordering of speech sources on the basis of transformed percent-correct comprehension test scores remains the same as in Experiment 2 (Human, DECtalk, Amiga, VOTRAX, Smoothtalker and Echo II+).

Statistical procedures. Passages used as experimental stimuli were originally designed for use with human speakers. The question arises regarding the equivalency of the passages produced by speech synthesizer. Though the versions were counterbalanced across synthesizers, nonequivalent versions could affect statistical outcomes. Noting the presence of higher transformed scores for tests three (77.4 R) and four (79.7 R), a one-way within-subject repeated measures ANOVA was performed. The F-ratio failed to reach significance, thus suggesting the equivalency of the multiple choice comprehension tests across speech sources. It is assumed, therefore, that results were not confounded by differences among versions of the measurement device.

Differences among transformed multiple choice test scores were analyzed using a one-way repeated measures ANOVA. Results were significant, indicating that listening comprehension varies as a function of speech source.

Table 3.18. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for the multiple choice comprehension tests (N=12).

Multiple Choice Test   Mean  S.D.  Range
Test 1                 66.9  10.9  54.1-85.9
Test 2                 63.4  22.0  39.5-107.2
Test 3                 77.4  21.4  28.4-107.2
Test 4                 79.7  24.1  35.8-107.2
Test 5                 65.6  13.3  50.0-84.3
Test 6                 58.4  19.4  35.8-85.9

Figure 3.6. Mean percent-correct scores and standard deviations for multiple choice comprehension tests as a function of speech source. Each bar represents observations of 12 normal-hearing subjects. The lower histogram denotes ±1 standard deviation.
Figure 3.7. Mean arcsin transformed percent-correct scores and standard deviations for multiple choice comprehension tests as a function of speech source. Each bar represents observations of 12 subjects. The lower histogram denotes ±1 standard deviation.

Figure 3.8. Mean multiple choice test completion time and standard deviations in seconds as a function of speech source. Each bar represents observations of 12 normal-hearing subjects. The lower histogram denotes ±1 standard deviation.

Table 3.19 summarizes these results. A post hoc analysis revealed six pairs of means that did not differ. These included: Human/DECtalk, Human/Amiga, Human/VOTRAX PSS, DECtalk/Amiga, DECtalk/VOTRAX PSS, and Amiga/VOTRAX PSS. Figure 3.9 illustrates the results of the Newman-Keuls' test of paired comparisons. It can be seen that Smoothtalker and Echo II+ were the only speech sources which varied significantly from all other speech sources: listening comprehension was significantly poorer for these speech synthesizers.

A one-way repeated measures ANOVA also was performed to evaluate differences in mean multiple choice test completion times. No significant differences between speech sources were noted. Thus, comprehension test completion time did not effectively differentiate speech sources. Time alone can be considered an index of the ergonomics of human-computer interaction; however, the relationship between performance and time can also be viewed as an ergonomic metric. Ideally, efficient information exchange promotes speed, but not at the cost of accuracy.
In the present study, efficiency was calculated by dividing mean accuracy (percent-correct comprehension) by mean time-to-test completion (seconds). Figure 3.10 illustrates outcomes for this derived variable. Ranking according to percent-correct per second follows the order of (1) Human (2.14), (2) DECtalk (1.75), (3) VOTRAX PSS (1.64), (4) Amiga (1.62), (5) Smoothtalker (1.19) and (6) Echo II+ (1.05). This ordering is very similar to that for comprehension test scores. However,

Table 3.19. One-way within-subject analysis of variance of multiple choice arcsin transformed percent-correct scores for speech source.

Source         dF       SS           MS        F       p
Subjects       11     1727.745     157.068
Speech Source   5    10871.231    2174.246   7.577   <.001
Error          55    15781.647     286.939

Figure 3.9. Results of the Newman-Keuls' test of paired comparisons of arcsin transformed multiple choice test scores as a function of speech source; mean pairs connected by a solid line did not differ significantly.

Figure 3.10. Efficiency of six speech sources for a listening comprehension task. Efficiency was calculated by dividing mean percent-correct scores for a multiple-choice comprehension test by the time (in seconds) required to complete the test.

differences between the Human talker and DECtalk become more apparent as task completion time is taken into account. Amiga and VOTRAX exchanged positions as a result of the PSS's slightly smaller mean test completion time. Even so, the efficiency ratio equalizes these two sources, thus emphasizing the similarity in their performance for this task.

Implications. The experimental question asked in Experiment 3 was whether listening comprehension in noise differed as a function of speech source.
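The efficiency ratio can be reproduced from the means of Tables 3.15 and 3.17. A sketch follows; note that ratios computed from the rounded table means differ slightly from the values printed in the text, presumably because the printed figures were computed from unrounded data.

```python
# Mean percent-correct (Table 3.15) and mean completion time in
# seconds (Table 3.17) for the multiple choice comprehension tests.
scores = {"Human": 82.0, "DECtalk": 82.3, "Amiga": 74.6,
          "Votrax PSS": 72.5, "Smoothtalker": 55.9, "Echo II+": 49.0}
times = {"Human": 38.2, "DECtalk": 45.5, "Amiga": 45.7,
         "Votrax PSS": 43.4, "Smoothtalker": 46.3, "Echo II+": 46.5}

# Efficiency = accuracy per unit time (percent-correct per second).
efficiency = {src: scores[src] / times[src] for src in scores}
ranking = sorted(efficiency, key=efficiency.get, reverse=True)
```

This reproduces the ordering reported in the text, with VOTRAX PSS edging out Amiga once completion time is taken into account.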
Significant differences do exist between some synthetic speech sources, even though the effects of source on comprehension were less systematic than the effects of source on word recognition. These results are in contrast to the findings of McHugh (1976), Schwab, Nusbaum & Pisoni (1985) and Greene and Pisoni (1988). None of these studies demonstrated significant differences between synthesized speech sources or between synthesized speech sources and a human control.

One possible explanation might lie in the use of different stimuli. However, the passages employed in Experiment 3 were originally designed for use with grade school children, while those used by Greene and Pisoni (1988) were taken from adult reading comprehension tests. Logically, the greater difficulty associated with an adult task should be more effective in differentiating between speech sources. This was not the case.

Another possibility lies in the choice of speech synthesizers. In previous research the MITalk-79 has been shown (Greene, Logan and Pisoni, 1986) to have a high degree of intelligibility; thus a listening comprehension task might not be sufficient to demonstrate small differences in performance. This is supported by the present findings, in that differences were noted only for the poor quality synthesizers in relation to each other and to the other speech sources. Mid-range and high intelligibility devices did not demonstrate significant differences. However, McHugh (1976) used an early (and, it is assumed, poorer) version of VOTRAX and was unable to demonstrate differences between the speech synthesis device and a human control.

Finally, the use of the twelve-voice babble in the present study very likely increased the processing difficulty sufficiently to reveal those devices most affected by reduced redundancy. As noted previously, studies such as Chial (1973) and Pisoni and Koen (1981) have demonstrated the negative effect of noise on the perception of synthesized speech.
In summary, both measures of semantic accuracy demonstrated differences as a function of speech source. Rank ordering of speech sources based on percent-correct scores resulted in the same hierarchy for both experiments. However, word recognition in sentences as measured by the Revised SPIN Test resulted in a pattern of significant differences among all speech sources. This suggests the R-SPIN is a more sensitive measure of semantic precision than the combination of factual passages and multiple choice comprehension tests.

Task Performance

Experiment 4: Oral Instructions (Without Repair Options)

Task performance was evaluated using the Multiple Instructions Test. Subjects were asked to follow a set of instructions of varying complexity presented monaurally in conjunction with a twelve-voice babble noise (+10 dB S/B). Six alternate forms of the MIT were generated by all speech sources. Order of speech synthesis was randomized across subjects and versions of the MIT were counterbalanced across speech sources. Time-to-completion of each item was recorded and summed to generate total test completion time. Subjects responded by marking clear plastic sheets overlaid on graphic response forms. Because it was necessary to erase the subject responses after each session, copies were made of each response sheet. However, all scoring was done from the original response forms and verified during the duplication process. Error counts were then made from the score sheets. Counts were verified by a volunteer during the process of tabulating errors.

Description. Tables 3.20 to 3.24 give summary statistics for percent-correct scores and arcsin transformed percent-correct scores (raus) for each level of the Multiple Instructions Test as a function of speech source. Scores ranged from 71.3% (DECtalk) to 43.2% (Echo II+) for the synthetic sources, as compared to 77.2% for the human

Table 3.20.
Mean percent-correct scores, standard deviations, and ranges for total scores of the Multiple Instructions Test as a function of speech source (N=12).

Speech Source    Mean   S.D.   Range
Human            77.2    6.9   66.6 - 88
DECtalk          71.3    9.1   52.3 - 83
Amiga            67.9   10.3   42.8 - 80.9
Votrax PSS       67.2    7.1   55.9 - 79
Smoothtalker     57.2    5.0   48.8 - 66.6
Echo II+         43.2   14.0   14 - 61.9

Table 3.21. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for total scores of the Multiple Instructions Test as a function of speech source (N=12).

Speech Source    Mean   S.D.   Range
Human            75.9    7.4   65.0 - 88
DECtalk          69.9    8.9   52.0 - 81.9
Amiga            66.6    9.7   43.5 - 79.5
Votrax PSS       65.8    6.8   55.2 - 77.4
Smoothtalker     56.5    4.5   48.9 - 65.0
Echo II+         43.5   13.2   14.4 - 60.7

Table 3.22. Mean percent-correct scores, standard deviations, and ranges for all complexity levels of the Multiple Instructions Test as a function of speech source (N=12).

MIT Level      Mean   S.D.   Range
Human
  Level A      97.9    3.7   91.6 - 100
  Level B      87.4   17.2   44.4 - 100
  Level C      73.1   10.4   58.3 - 87.7
  Level D      68.0    8.1   56.5 - 80
DECtalk
  Level A      95.1    4.3   91.6 - 100
  Level B      85.8   12.9   61.0 - 100
  Level C      67.0   14.5   45.3 - 91
  Level D      57.3   11.9   33.3 - 70
Amiga
  Level A      92.9   10.0   66.6 - 100
  Level B      72.9   18.5   33.3 - 100
  Level C      60.6   11.9   37.5 - 75
  Level D      62.4   11.9   43.3 - 67.6
Votrax PSS
  Level A      85.2   16.3   50.0 - 100
  Level B      77.3   13.4   55.0 - 94
  Level C      60.5   13.3   45.0 - 83
  Level D      58.9   11.9   43.0 - 86.6
Smoothtalker
  Level A      76.9   10.7   58.0 - 91.6
  Level B      60.4   12.9   33.3 - 83
  Level C      46.6   14.2   29.1 - 70.8
  Level D      56.4   11.6   33.3 - 70
Echo II+
  Level A      59.6   24.3   16.0 - 91.6
  Level B      50.3   19.4   22.2 - 83
  Level C      38.7   16.5    1.2 - 62.5
  Level D      38.1   16.8   10.0 - 66.6

Table 3.23. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for all complexity levels of the Multiple Instructions Test as a function of speech source (N=12).

MIT Level      Mean   S.D.
Range
Human
  Level A     100.7    7.6   87.9 - 104.9
  Level B      87.7   19.4   45.4 - 104.9
  Level C      69.6    9.5   56.7 - 83.3
  Level D      64.9    6.9   55.3 - 75.5
DECtalk
  Level A      95.0    8.7   87.9 - 104.9
  Level B      70.9   18.1   36.3 - 104.9
  Level C      64.5   12.7   46.2 - 87.2
  Level D      56.0    9.7   36.3 - 66.4
Amiga
  Level A      93.5   13.6   63.5 - 104.9
  Level B      73.8   12.4   54.0 - 91.1
  Level C      58.7    9.8   39.8 - 70.8
  Level D      60.3    9.9   44.5 - 72.3
Votrax PSS
  Level A      84.1   17.8   50.0 - 104.9
  Level B      73.8   12.4   54.0 - 91.1
  Level C      58.7   11.2   45.9 - 78.4
  Level D      58.4   11.5   43.7 - 86.3
Smoothtalker
  Level A      73.2   10.3   56.4 - 87.9
  Level B      58.7   10.9   36.3 - 78.4
  Level C      47.2   11.6   32.7 - 67.1
  Level D      55.2    9.4   36.3 - 66.4
Echo II+
  Level A      58.5   21.2   21.1 - 87.9
  Level B      50.4   16.3   26.5 - 78.4
  Level C      39.9   15.6    4.1 - 60.1
  Level D      39.8   14.7   13.9 - 63.5

control. Scores for all speech sources declined as MIT item complexity increased. Table 3.24 contains arcsin transformed percent-correct scores for the experimental versions of the MIT. Tables 3.25 and 3.26 contain mean test completion time, standard deviations and ranges for the total MIT and for each level of the test. Figures 3.11 to 3.14 are histograms illustrating mean test scores, transformed scores, and standard deviations for the total test and for each level. These illustrate both the rank ordering of the synthesizers resulting from the MIT scores and the differences between levels of item complexity. Different levels of the MIT are identified by bar coding within each speech source. Figure 3.15 displays the arcsin transformed percent-correct scores for the experimental versions of the MIT across versions of the test. Figures 3.16 and 3.17 summarize the results of item completion for the total test and for complexity level.

Analysis: A one-way repeated measures ANOVA was performed to determine if differences existed among versions of the MIT for all speech sources. The F-ratios were not significant, indicating that the six versions of the MIT did not differ.
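The arcsin transformed scores (raus) reported in Tables 3.21, 3.23, and 3.24 stabilize the variance of percent-correct data near the floor and ceiling of the scale. A minimal sketch of the transform, assuming Studebaker's (1985) rationalized arcsine formulation; the exact constants are an assumption, as the original analysis may have used a different variant:

```python
import math

def rau(correct: int, total: int) -> float:
    """Rationalized arcsine transform of a number-correct score.

    Assumes Studebaker's (1985) formulation: theta is the sum of two
    arcsine terms, linearly rescaled so that mid-range scores map close
    to their percentage values. The rau scale runs roughly -23 to 123.
    """
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146.76 / math.pi) * theta - 23.0

# A 50% score maps to approximately 50 raus, while extreme scores are
# stretched -- which is why transformed means in Table 3.23 can exceed 100.
```

Because the transform expands the tails, differences between sources near ceiling (Level A) or floor (Echo II+) are weighted more comparably to mid-range differences in the ANOVAs that follow.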
Table 3.27 contains the findings of a two-way, mixed effects ANOVA performed on transformed percent-correct scores to evaluate the significance of the main effects of speech source and MIT complexity level, as well as the interaction between these two main effects. Differences

Table 3.24. Mean arcsin transformed percent-correct scores, standard deviations, and ranges for experimental versions of the Multiple Instructions Test across all speech sources (N=12).

MIT Test   Mean   S.D.   Range
Test 2     64.8   10.4   42.8 - 79.6
Test 3     65.9   12.1   39.1 - 99.0
Test 4     60.9   12.4   29.7 - 74.3
Test 5     61.6   15.9   33.5 - 81.9
Test 6     60.1   17.4   14.4 - 79.6
Test 8     69.5   13.3   45.5 - 85.5

Table 3.25. Mean completion time, standard deviations, and ranges for the total Multiple Instructions Test (in seconds) as a function of speech source (N=12).

Speech Source   Mean    S.D.   Range
Human           104.0   33.2   104.0 - 150.3
DECtalk         116.5   17.4    88.4 - 145.1
Amiga           120.0   28.6    87.0 - 166.8
Votrax PSS      118.9   26.4    73.2 - 170.5
Smoothtalker    118.2   23.1    91.0 - 166.3
Echo II+        106.8   30.6    52.7 - 165.9

Table 3.26. Mean item completion time in seconds, standard deviations, and ranges for all levels of the Multiple Instructions Test as a function of speech source (N=12).

MIT Level    Mean   S.D.   Range
Human
  Level A    2.7    .96    1.1 - 4.3
  Level B    4.2    1.1    2.5 - 5.9
  Level C    6.1    1.1    4.0 - 8.0
  Level D    6.1    1.6    3.4 - 9.0
DECtalk
  Level A    2.9    .65    2.1 - 4.3
  Level B    4.2    .86    2.7 - 5.5
  Level C    6.0    1.4    4.4 - 7.4
  Level D    6.0    1.2    4.1 - 8.0
Amiga
  Level A    3.3    .68    2.3 - 4.6
  Level B    4.4    1.1    3.0 - 6.3
  Level C    5.5    1.4    3.7 - 7.9
  Level D    6.6    2.0    3.8 - 10.5
Votrax PSS
  Level A    3.2    .59    2.2 - 4.2
  Level B    4.5    1.0    3.1 - 6.5
  Level C    5.8    1.7    4.0 - 9.1
  Level D    6.4    1.1    4.8 - 8.7
Smoothtalker
  Level A    3.3    .8     1.6 - 4.8
  Level B    4.6    .92    3.8 - 6.6
  Level C    5.8    1.3    3.6 - 7.9
  Level D    5.9    1.5    4.1 - 8.6
Echo II+
  Level A    3.5    .86    2.3 - 5.6
  Level B    4.0    1.0    2.3 - 6.2
  Level C    4.9    1.8    2.4 - 8.8
  Level D    5.3    1.9    1.7 - 8.2
Figure 3.11. Mean percent-correct scores and standard deviations for the Multiple Instructions Test as a function of speech source. Each bar represents observations of 12 normal-hearing subjects tested monaurally.

Figure 3.12. Mean transformed percent-correct scores and standard deviations for the Multiple Instructions Test as a function of speech source. Each bar represents observations of 12 normal-hearing subjects tested monaurally.

Figure 3.13. Mean percent-correct scores and standard deviations for all levels of the Multiple Instructions Test as a function of speech source. Each bar represents observations of 12 normal-hearing subjects tested monaurally.

Figure 3.14. Mean transformed percent-correct scores and standard deviations for all levels of the Multiple Instructions Test as a function of speech source. Each bar represents 12 normal-hearing subjects tested monaurally.
Figure 3.15. Mean transformed percent-correct scores and standard deviations for the six experimental versions of the Multiple Instructions Test. Each bar represents observations of 12 normal-hearing subjects tested monaurally.

Figure 3.16. Mean completion time in seconds and standard deviations for all items of the Multiple Instructions Test as a function of speech source. Each bar represents observations of 12 normal-hearing subjects tested monaurally.

Figure 3.17. Mean item completion time in seconds and standard deviations for each level of the Multiple Instructions Test as a function of speech source. Each bar represents 12 normal-hearing subjects tested monaurally.

Table 3.27. Two-way within-subject ANOVA of Multiple Instructions Test transformed percent-correct scores.

Source          dF          SS          MS        F       p
Subjects        11    7217.445     655.677
Speech Source    5   33511.346    6702.269   30.730   <.000
Error           55   11995.455     218.099
Level            3   38587.081   12862.360  131.990   <.000
Error           33    3215.844      97.450
Source/Level    15    3559.759     237.317    1.692   .0568
Error          165   23141.124     140.249

between means for the two main effects of speech source and complexity level were significant. The interaction between source and level failed to meet the criterion for significance. Thus, MIT performance varied as a function of speech source and of item complexity level, but the two effects cannot be said to interact.
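The F-ratios in Table 3.27 follow directly from the sums of squares and degrees of freedom: each mean square is SS/dF, and each effect's F is its mean square over the matching error mean square. A small arithmetic sketch using the Table 3.27 values (the helper name is illustrative):

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F-ratio for a repeated measures effect: MS_effect / MS_error."""
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    return ms_effect / ms_error

# Values from Table 3.27 (transformed MIT percent-correct scores).
f_source = f_ratio(33511.346, 5, 11995.455, 55)         # speech source
f_level = f_ratio(38587.081, 3, 3215.844, 33)           # complexity level
f_interaction = f_ratio(3559.759, 15, 23141.124, 165)   # source x level
```

Each effect is tested against its own subject-by-effect error term, as is standard for fully within-subject designs.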
To further analyze the differences among means (raus), Newman-Keuls' tests of paired comparisons were performed on the main effects of speech source and MIT level of complexity. Figures 3.18 and 3.19 illustrate these results. Three pairs of means did not demonstrate significant differences: DECtalk/Amiga, DECtalk/VOTRAX PSS and Amiga/VOTRAX PSS. Thus, the findings suggest subject task performance did not differ significantly for DECtalk, Amiga, and VOTRAX PSS. However, task performance scores did differ among these synthesizers and the human source, as well as among these and Smoothtalker and Echo II+. Levels C and D of the MIT did not differ significantly, but Levels A, B, and C did differ from each other. Table 3.28 lists the results of simple effects comparisons between speech source and level and between level and speech source. All F-ratios were significant, suggesting that each speech source produced significant differences at each level and each level produced significant differences for each source. To assess the differences between means (raus) of speech source and item completion time and the interaction between

Table 3.28. Simple effects of speech source at each complexity level and of complexity level within each speech source for Multiple Instructions Test transformed percent-correct scores.
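The Newman-Keuls procedure compares ordered means stepwise: a difference between two means is tested against a critical value q·√(MS_error/n), where q is the studentized range statistic for the number of ordered means the pair spans. A minimal sketch, assuming approximate α = .05 critical q values for roughly 30 error degrees of freedom and n = 72 observations per level mean (12 subjects × 6 sources); these q entries are illustrative table lookups, not values taken from the original analysis:

```python
import math

# Approximate studentized range critical values, alpha = .05, ~30 error
# df, indexed by the number of ordered means spanned (r). Illustrative.
Q_CRIT = {2: 2.89, 3: 3.49, 4: 3.85}

def newman_keuls(means, ms_error, n):
    """Return pairs of labels whose means do NOT differ significantly.

    Simplified: each pair is tested directly against the critical range
    for the number of ordered means it spans (the full procedure also
    carries nonsignificance inward by containment).
    """
    ordered = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    se = math.sqrt(ms_error / n)
    nonsig = []
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            r = j - i + 1                       # means spanned by this pair
            diff = ordered[i][1] - ordered[j][1]
            if diff < Q_CRIT[r] * se:
                nonsig.append((ordered[i][0], ordered[j][0]))
    return nonsig

# MIT complexity-level means (raus) from Figure 3.19, with the Level
# error mean square from Table 3.27.
levels = {"A": 84.2, "B": 69.2, "C": 56.4, "D": 55.7}
print(newman_keuls(levels, ms_error=97.450, n=72))  # only C/D nonsignificant
```

With these inputs the only nonsignificant pair is Levels C and D, matching the pattern reported in the text.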
Figure 3.18. Illustration of Newman-Keuls' Test of pairwise comparisons of mean arcsin transformed percent-correct scores for the Multiple Instructions Test for the main effect of speech source. Nonsignificant mean pairs are connected by solid line.

Complexity Level   Level A   Level B   Level C   Level D
Means                 84.2      69.2      56.4      55.7

Figure 3.19. Illustration of Newman-Keuls' Test of pairwise comparisons of mean arcsin transformed percent-correct scores for the Multiple Instructions Test for the main effect of item complexity level. Nonsignificant mean pairs are connected by solid line.

them, a two-way within-subject repeated measures ANOVA was performed with the main effects of speech source and MIT mean item completion time at each level of complexity. Table 3.29 summarizes the results. The F-ratio for the effect of speech source was not significant. Thus, item completion time did not differentiate between speech sources. The F-ratio for the main effect of item completion time at each level of complexity was significant, as was the interaction between speech source and item completion time. The Newman-Keuls test revealed that only Levels C and D did not differ according to item completion time. These results are provided in Figure 3.20. Measures of simple effects are shown in Table 3.30. Only one of the paired means for speech source and item completion time was significant. However, item completion time differed for each level within speech source. Thus, mean item completion time varied significantly with MIT level of complexity for all speech sources. Efficiency (percent-correct per second) was calculated for the MIT using total test scores and total test completion time. Results are displayed in Figure 3.21. Once again, the speech sources can be ranked according to the data: (1) Human (.74), (2) DECtalk (.612), (3) Amiga (.574), (4) VOTRAX PSS (.565), (5) Smoothtalker (.483) and (6) Echo II+ (.404). As in Experiment 3, the inclusion of time clarifies distinctions between human and synthesized speech sources.
Thus, task completion times can be useful even in the absence of

Table 3.29. Two-way within-subject analysis of variance of Multiple Instructions Test mean item completion time with main effects of speech source and time per level.

Source         dF        SS        MS       F       p
Subjects       11   190.788    17.344
Speech Source   5     8.941     1.788    .981   .4375
Error          55   100.224     1.822
Time            3   406.127   135.376  65.639   <.001
Error          33    68.060     2.062
Source/Time    15    26.071     1.738   3.651   <.001
Error         165    78.544      .476

Complexity Level   Level A   Level B   Level C   Level D
Means                  3.1      4.36      5.73       6.1

Figure 3.20. Illustration of Newman-Keuls' Test of pairwise comparisons of mean item completion time in seconds for the Multiple Instructions Test for the effect of item complexity level. Nonsignificant mean pairs are connected by solid line.

Table 3.30. Simple effects of speech source at each level of item completion time and of item completion time within each speech source.

Figure 3.21. Efficiency of six speech sources for the Multiple Instructions Test. Efficiency was calculated using percent-correct total test scores and completion times in seconds. Each bar represents observations of 12 normal-hearing subjects.

significant differences between sources. Using this information in combination with performance data is useful in rating the efficiency with which tasks are performed under the direction of various speech sources.
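The efficiency metric is a straightforward ratio of accuracy to time. A minimal sketch of the calculation using the total-test group means from Tables 3.20 and 3.25; because these are group means rather than per-subject averages, the values can differ in the third decimal from those reported in the text, though the rank ordering is the same:

```python
def efficiency(percent_correct, completion_time_secs):
    """Efficiency of a speech source: percent-correct per second."""
    return percent_correct / completion_time_secs

# Mean total-test scores (Table 3.20) and completion times (Table 3.25).
sources = {
    "Human":        (77.2, 104.0),
    "DECtalk":      (71.3, 116.5),
    "Amiga":        (67.9, 120.0),
    "Votrax PSS":   (67.2, 118.9),
    "Smoothtalker": (57.2, 118.2),
    "Echo II+":     (43.2, 106.8),
}
ranked = sorted(sources, key=lambda s: efficiency(*sources[s]), reverse=True)
# Human ranks first and Echo II+ last, matching Figure 3.21.
```

Folding time into the score in this way separates sources (such as Human and Echo II+) whose raw completion times alone did not differ significantly.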
Implications: Experiment 4 asked whether task performance differed significantly as a function of speech source. It can be said that significant differences do exist in performance as a function of some speech sources. On the basis of task scores, speech sources could be grouped in a manner suggested by Greene, Logan and Pisoni (1986) into (1) natural speech (human control), (2) high to mid-range speech (DECtalk, Amiga, VOTRAX PSS), and (3) low quality speech (Smoothtalker, Echo II+). Once again, overall order is the same as that seen in both Experiments 2 and 3. Thus, the speech sources employed in this study demonstrated consistency with regard to relative intelligibility when assessed at different points in the communication model. However, the differences in pattern of grouping reinforce the need to consider the task when selecting devices for specific applications. Significant differences between item complexity levels for both transformed scores and time indicate test design was successful in creating a hierarchy of difficulty through the first three levels. Level D did not vary significantly from Level C in either score or completion time. Thus, the present array of Level D items did not prove to be sufficiently difficult to delineate a fourth level of performance. These items could be removed from the test or redesigned to discover which components might increase the difficulty beyond that already seen in Level C.

Experiment 5: Oral Instructions (With Repair Options)

Subjects were asked to follow a set of instructions of varying complexity presented monaurally in conjunction with a twelve-voice babble noise (+10 dB S/B) using test 4 of the MIT. Test items were generated using the VOTRAX PSS text-to-speech system.
Following presentation of a test item, subjects were allowed to select from seven repair options: (1) no repeat, (2) repeat no change, (3) repeat louder, (4) repeat slower, (5) repeat louder and slower, (6) repeat faster, and (7) repeat louder and faster. A maximum of three repair options was allowed per item and for the total test. The number of each type and the total for the task as a whole were generated via the program used to present stimulus items. These values were used to verify experimenter counts at the end of each session. As with Experiment 4, copies were made of each response sheet, scoring was done from the original forms and verified during the copy process. Error counts were then made from the score sheet. Three sets of percentages were calculated from the repair options recorded: (1) the percent of each repair option selected for the total MIT, (2) the percent of total repair options selected per level of complexity, and (3) the percent of each repair option selected within a level based on the total for that level. Percentages were based upon individual subject totals.

Description: Percent-correct scores and arcsin transformed percent-correct scores (raus) for the total MIT and for each level of complexity are provided in Tables 3.31 and 3.32. Tables 3.33 to 3.35 list mean percent-correct scores, standard deviations and ranges for the three sets of repair option data. Figures 3.22 to 3.27 display these results. Standard deviations for repair option and repair option per level are provided in figures separate from the means. The most frequently selected repair options for the test as a whole (Figure 3.23) were no repeat, repeat no change and repeat louder. The pattern of response by complexity level shows that as level of complexity increases the preferred option shifts, such that no repeat decreases in frequency and repeat no change and repeat louder increase.
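The three sets of percentages can be computed from a per-subject table of repair-option counts by level. A minimal sketch under assumed data structures; the option counts shown are illustrative placeholders, not actual subject data:

```python
# counts[level][option] = number of times one subject selected that
# repair option at that complexity level (illustrative values only).
counts = {
    "A": {"no repeat": 8, "repeat no change": 2, "repeat louder": 1},
    "B": {"no repeat": 5, "repeat no change": 4, "repeat louder": 3},
}

grand_total = sum(sum(lvl.values()) for lvl in counts.values())

# (1) Percent of each repair option selected for the total MIT.
options = {opt for lvl in counts.values() for opt in lvl}
pct_option_total = {
    opt: 100 * sum(lvl.get(opt, 0) for lvl in counts.values()) / grand_total
    for opt in options
}

# (2) Percent of total repair options selected per level of complexity.
pct_per_level = {
    level: 100 * sum(lvl.values()) / grand_total
    for level, lvl in counts.items()
}

# (3) Percent of each repair option within a level, based on that
#     level's own total.
pct_within_level = {
    level: {opt: 100 * n / sum(lvl.values()) for opt, n in lvl.items()}
    for level, lvl in counts.items()
}
```

Averaging these per-subject percentages across subjects yields the group values reported in Tables 3.33 to 3.35.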
Analysis: Noting the presence of differences among percent-correct scores across complexity levels of the MIT, a one-way repeated measures ANOVA was performed on arcsin transformed percent-correct scores to determine whether differences were significant (Table 3.36). A significant difference due to