This is to certify that the thesis entitled FEASIBILITY STUDY OF VOICE ACCESS TO COMPUTERS FOR PEOPLE WITH LIMITED SPEECH presented by LAMBERT MATHIAS has been accepted towards fulfillment of the requirements for the MS degree in ELECTRICAL ENGINEERING.

Major professor
Date 8/15/02

FEASIBILITY STUDY OF VOICE ACCESS TO COMPUTERS FOR PEOPLE WITH LIMITED SPEECH

By

Lambert Mathias

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Electrical and Computer Engineering

2002

ABSTRACT

FEASIBILITY STUDY OF VOICE ACCESS TO COMPUTERS FOR PEOPLE WITH LIMITED SPEECH

By

Lambert Mathias

Dysarthria is a general term for a speech disorder in which speech is slow, weak, imprecise or uncoordinated. Commercially available automatic speech recognition (ASR) systems cannot reliably recognize dysarthric speech due to the inherent variability in such utterances. People with dysarthria generally lack articulatory precision. Simple phonemes like vowels are physically the easiest sounds to produce, since they do not require dynamic movement of the vocal system. This research is primarily a feasibility study investigating the reliability of vowel-based phoneme recognition of dysarthric speech. The goal is to evaluate whether ASR algorithms can be used to reliably differentiate among the different vowel sounds produced by dysarthric speakers. The intended purpose is to provide personal-computer-based access methods for people with dysarthric speech. In this work, the hidden Markov model (HMM) is the basic technological approach adopted in developing the speech recognition algorithms, and all the experimental results quantifying the feasibility of these algorithms are presented.

ACKNOWLEDGEMENTS

I am indebted to my research advisor, Professor John R. Deller Jr., for his excellent guidance and support throughout the course of my graduate studies. I would like to thank him for his valuable insights and technical expertise, which were the driving force behind this research. I would also like to express my gratitude to my committee members, Professor Percy A. Pierre and Dr. Michael Seadle, for their insightful comments.

This work was supported by the National Institutes of Health under Small Business Innovation Research Grant No. 1-R43-HD36164-01A1. The Principal Contractor in the SBIR award is InvoTek, Inc. of Alma, Arkansas. A special thanks is due to InvoTek's President Tom Jakobs for his generous support, encouragement, technical advice, and for superb management of the project. The MSU team also appreciates the synergistic interactions with, and clinical expertise of, Professor David Beukelman, Ms. Susan Fager, Ms. Cara Ullman, and other staff members, and all of the participants in the study, at the Madonna Rehabilitation Hospital in Lincoln, Nebraska, and at the Department of Special Education of the University of Nebraska.

TABLE OF CONTENTS

LIST OF TABLES .............................................................................................................
vi LIST OF FIGURES ........................................................................................................... ix 1 INTRODUCTION ...................................................................................................... 1 1.1 Background ......................................................................................................... 1 1.2 The Voice Access System Project ...................................................................... 2 1.3 Research Objectives and Goals ........................................................................... 3 2 THEORETICAL BACKGROUND ............................................................................ 5 2.1 Overview ............................................................................................................. 5 2.2 The Hidden Markov Model ................................................................................ 6 2.3 HMM Training Algorithm ................................................................................ 10 2.3.1 The General Training Problem ................................................................. 10 2.3.2 The Baum-Welch Training Algorithm ...................................................... 1 1 2.4 HMM Recognition Algorithm .......................................................................... 14 2.4.1 The General Classification Problem ......................................................... 14 2.4.2 The Viterbi Recognition Algorithm .......................................................... 16 2.5 Speech Analysis and Feature Extraction ........................................................... 18 2.5.1 Short-term Processing of Speech .............................................................. 18 2.5.2 Mel-scale Filter-Bank Processing ............................................................. 20 2.5.3 Mel-frequency Cepstrum Coefficients (MFCC) ....................................... 23 2.5.4 Log Energy, Delta and Acceleration Coefficients .................................... 23 2.6 Language Modeling .......................................................................................... 24 2.7 Phonemes and Phonetic Transcription .............................................................. 26 3 IMPLEMENTATION DETAILS... ... ..................................................................... 29 3.1 Hardware and Software Tools .......................................................................... 29 3.2 Speech Databases .............................................................................................. 29 3.2.1 TIMIT Speech Corpus .............................................................................. 29 3.2.2 The Dysarthric Speech Corpus ................................................................. 30 3.3 Creating the Observation Strings ...................................................................... 33 3.3.1 Phoneme Extraction from the TIMIT Speech Corpus .............................. 33 3.3.2 Phoneme Extraction from the Dysarthric Database .................................. 34 3.3.3 Features Comprising the Observation Strings .......................................... 34 3.4 Implementation of the Phoneme Recognizer .................................................... 35 3.4.1 HMM Topology ........................................................................................ 37 iv 3.4.2 HMM Training and Testing ...................................................................... 
38 3.5 Bigram Language Model Implementation ................................................... 40 4 EXPERIMENTAL EVALUATION ......................................................................... 45 4.1 Overview ........................................................................................................... 45 4.2 Experiments with Normal Speech .................................................................... 45 4.3 Experiments with Dysarthric Speech ................................................................ 47 4.3.1 Classification Experiment ......................................................................... 48 4.3.2 Conclusion ................................................................................................ 51 4.4 Experiments with Language Modeling ............................................................. 52 4.4.1 Speaker-Independent Bigram Language Model ....................................... 53 4.4.2 Speaker-Dependent Bigram Language Model .......................................... 58 4.4.3 Conclusion ................................................................................................ 67 4.5 HMMs Trained on Dysarthric Speech .............................................................. 67 4.5.1 Classification Experiment ......................................................................... 68 4.5.2 Conclusion ................................................................................................ 71 5 CONCLUSIONS AND FUTURE RESEARCH ...................................................... 72 5.1 Conclusions ....................................................................................................... 72 5.2 Future Work ...................................................................................................... 74 BIBLIOGRAPHY ............................................................................................................. 76 LIST OF TABLES Table 2.1. Phonetic transcriptions of vowels. ................................................................... 28 Table 3.1. Isolated utterances in the dysarthric speech database. ..................................... 32 Table 3.2. Accuracy of the vowel phoneme HMM on TIMIT training data. ................... 39 Table 4.1. Recognition results with normal speech. ......................................................... 47 Table 4.2. Distribution of utterances in the dysarthric speech database. .......................... 48 Table 4.3. Confusion matrix for Speaker 4 (shaded cells indicate reliable vowel sound). 49 Table 4.4. Confusion matrix for Speaker 1 (shaded cells indicate reliable vowel sound). 50 Table 4.5. Confusion matrix for Speaker 2 (shaded cells indicate reliable vowel sound). 50 Table 4.6. Confusion matrix for Speaker 3 (shaded cells indicate reliable vowel sound). 51 Table 4.7. Speaker-independent Bigram LM .................................................................... 53 Table 4.8. Confusion matrix with speaker-independent LM for Speaker 1 ...................... 54 Table 4.9. Confusion matrix without speaker-independent LM for Speaker 1. ............... 55 Table 4.10. Confusion matrix with speaker-independent LM for Speaker 2 .................... 55 Table 4.11. Confusion matrix without speaker-independent LM for Speaker 2. .............. 55 Table 4.12. Confusion matrix with speaker-independent LM for Speaker 3 .................... 56 Table 4.13. Confusion matrix without speaker-independent LM for Speaker 3. .............
56 Table 4.14. Confusion matrix with speaker-independent LM for Speaker 4 .................... 56 Table 4.15. Confusion matrix without speaker-independent LM for Speaker 4. ............. 57 Table 4.16. Error rates for the recognition task with the speaker-independent LM. ........ 57 Table 4.17. Bigram LM for Speaker 1. ............................................................................. 59 vi Table 4.18. Confusion matrix with speaker-dependent LM for Speakerl (shaded cells indicate reliable vowel sound). ......................................................................................... 59 Table 4.19. Confusion matrix without speaker-dependent LM for Speaker 1(shaded cells indicate reliable vowel sound). ......................................................................................... 60 Table 4.20. Error rate for Speaker 1. ................................................................................ 60 Table 4.21. Bigram speaker-dependent LM for Speaker 2. .............................................. 61 Table 4.22. Confusion matrix with speaker-dependent LM for Speaker 2 (shaded cells indicate reliable vowel sound). ......................................................................................... 61 Table 4.23. Confusion matrix without speaker-dependent LM for Speaker 2 (shaded cells indicate reliable vowel sound). ......................................................................................... 62 Table 4.24. Error rate for Speaker 2. ................................................................................ 62 Table 4.25. Bigram speaker-dependent LM for Speaker 3. .............................................. 63 Table 4.26. Confusion matrix with speaker-dependent LM for Speaker 3 (shaded cells indicate reliable vowel sound). ......................................................................................... 63 Table 4.27. Confusion matrix without speaker-dependent LM for Speaker 3 (shaded cells indicate reliable vowel sound). ......................................................................................... 64 Table 4.28. Error rate for Speaker 3. ................................................................................ 64 Table 4.29. Bigram speaker-dependent LM for Speaker 4. .............................................. 65 Table 4.30. Confusion matrix with speaker-dependent LM for Speaker 4 (shaded cells indicate reliable vowel sound). ......................................................................................... 65 Table 4.31. Confusion matrix without speaker-dependent LM for Speaker 4 (shaded cells indicate reliable vowel sound). ......................................................................................... 66 Table 4.32. Error rate for Speaker 4. ................................................................................ 66 Table 4.33. Partitioning the dysarthric speech database into training and test sets. ......... 68 Table 4.34. Confusion matrix for Speaker 1 (shaded cells indicate reliable vowel sound). ........................................................................................................................................... 69 Table 4.35. Confusion matrix for Speaker 2 (shaded cells indicate reliable vowel sound). ........................................................................................................................................... 69 Table 4.36. Confusion matrix for Speaker 3 (shaded cells indicate reliable vowel sound). 
.................................... 70 Table 4.37. Confusion matrix for Speaker 4 (shaded cells indicate reliable vowel sound). ............................................................................................. 70 Table 5.1. Row-column access for Speaker 1 using reliable vowel phonemes from (a) recognition task without LM (b) recognition task with LM. ............................................ 73

LIST OF FIGURES

Figure 2.1. Three-state left-to-right HMM. ........................................................................ 9 Figure 2.2. Computation of P[O | \lambda_k] using the forward recursion of the forward-backward algorithm. ........................................................................................... 16 Figure 2.3. The Viterbi Algorithm. ................................................................................... 17 Figure 2.4. The mel scale. ................................................................................................. 21 Figure 3.1. Example of MFCC feature extraction: (a) The speech waveform for the phoneme utterance 'AE'. (b) The 39-dimensional MFCC parameters. ............................ 35 Figure 3.2. The phoneme classifier. .................................................................................. 36 Figure 3.3. Example of a bigram LM using seven vowel phonemes. ............................... 41 Figure 3.4. LM-based Viterbi search grid. ........................................................................ 42 Figure 3.5. LM-based Viterbi search algorithm. ............................................................... 43 Figure 4.1. The a priori distribution of the seven vowels in the TIMIT test set. .............. 46

1 Introduction

1.1 Background

Dysarthria is a general term for a speech disorder in which speech is slow, weak, imprecise or uncoordinated. This disorder is commonly associated with other general neuromotor disabilities (Parkinson's disease, cerebral palsy, etc.). People with dysarthria may have difficulty in making themselves understood, or in reliably controlling environmental and communication aids. Many individuals with dysarthric speech who use augmentative and alternative communication (AAC) devices have normal or exceptional intellects, reading and language skills, and would strongly prefer to use their residual speech, however limited [21]. AAC devices using speech technologies have the potential not only to serve vocational and educational needs but also to help satisfy such individuals' social communication needs.

Current commercially available automatic speech recognition (ASR) products (e.g., Dragon Dictate and IBM ViaVoice) are designed for individuals whose speech is not impaired. Commercial systems may be able to recognize the speech of individuals with mild impairments, or individuals who have received sufficient training to alter their articulatory patterns to achieve improved machine recognition rates [25] [26]. However, the use of off-the-shelf commercial recognizers for people with dysarthria has not been particularly successful, with recognition rates for severely dysarthric speakers varying anywhere from 18% to 85% [27] [28].
Severe dysarthria is still a challenge for most commercial recognizers, largely due to the extraordinary variability in dysarthric speech, and also because commercial recognition systems are optimized for the mass market. Dysarthric speech varies not only across individuals, but also for a particular individual, depending upon the amount of stress, the time of day, and other personal and environmental conditions. The inconsistency of dysarthric speech makes its recognition an inherently different problem from that of normal speech.

A different perspective on using ASR for people with severe dysarthria is to rely on a small set of utterances that can be reliably recognized. What is needed is a speech recognition system that is optimized for people who are capable of producing distinct vocalizations, even though these vocalizations may not be meaningful in normal speech. This approach can help individuals with dysarthria to use communication aids more effectively and improve their performance of job-related tasks.

This research is primarily a feasibility study investigating the reliability of a vowel-based phoneme recognition system for dysarthric speech. The phoneme-level recognizer developed must be capable of reliably differentiating among the different vowel sounds produced by dysarthric speakers.

1.2 The Voice Access System Project

This thesis was written as part of an NIH-sponsored SBIR Phase I joint project between InvoTek, Inc. and the Speech Processing Laboratory at Michigan State University. The goal of the Voice Access System (VAS) project is to provide persons who have physical disabilities and unintelligible speech with an access method for assistive devices that significantly reduces the physical fatigue experienced during device access. An important feature of this voice access system is that it does not attempt to recognize a particular sound sequence. The only criterion for recognition is that the system be able to consistently discriminate among the sounds used for access. The VAS offers significant advantages over other, more physically demanding access methods.

1.3 Research Objectives and Goals

People with dysarthria generally lack articulatory precision. Simple "steady-state" phonemes like vowels are physically the easiest sounds to produce, since they do not require dynamic movement of the vocal system [22]. In this research, we investigate the reliability of recognizing vowel utterances of dysarthric speakers. Hidden Markov models (HMMs) are known to be quite effective in speech recognition tasks [29]. In this study, speaker-independent HMMs of seven representative vowel sounds, trained on normal speech, are used to build a phoneme classifier. The test utterances consist of multiple utterances of the seven vowels spoken by dysarthric individuals. The subjects for this research were provided by the Madonna Rehabilitation Hospital, Nebraska, and the Department of Special Education of the University of Nebraska. The dysarthric test utterances are passed through the phoneme classifier and the classification results are used to compute a confusion matrix. The confusion matrix gives information about the number of test utterances that are classified as belonging to each of the HMMs representing the different vowels. The confusion matrix is used to evaluate which vowel sounds are most reliably recognized for a particular speaker.
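As a concrete illustration of how such a confusion matrix can be tallied from classifier decisions, the following is a minimal MATLAB sketch (MATLAB is the environment used in Chapter 3, but this code is illustrative only and not part of the thesis software; the label lists are made-up assumptions). It simply counts, for each spoken vowel, how often the classifier selected each vowel HMM.

% Minimal sketch: tally a confusion matrix from classification decisions.
% 'spoken' holds the true vowel label of each test utterance and
% 'recognized' holds the label chosen by the classifier (example data).
vowels     = {'UW','OW','IY','AE','AO','EY','AY'};
spoken     = {'UW','UW','IY','AE','OW','AY','IY'};
recognized = {'UW','IY','IY','AE','OW','AY','IY'};

nV = numel(vowels);
C  = zeros(nV, nV);                  % rows: spoken vowel, columns: chosen vowel HMM
for n = 1:numel(spoken)
    i = find(strcmp(vowels, spoken{n}));
    j = find(strcmp(vowels, recognized{n}));
    C(i, j) = C(i, j) + 1;           % one more utterance of vowel i classified as j
end
disp(vowels); disp(C);

Reading across a row of C then shows how the utterances of one vowel were distributed over the seven vowel HMMs, which is how the confusion matrices of Chapter 4 are interpreted.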
A similar recognition experiment is performed using HMMs trained on utterances from the dysarthric speech database. Furthermore, a bigram language model is added to the phoneme classifier to evaluate its effect on vowel recognition accuracies. Both speaker-dependent and speaker-independent bigram LMs are considered in this research.

Chapter 2 introduces the concept of HMMs, the algorithms commonly used for speech recognition and training, and the extraction of observation strings from raw speech. Chapter 3 discusses the actual implementation of the vowel-based phoneme recognition system, the feature extraction process, the speech corpora used for training and testing, the Baum-Welch training algorithm, the Viterbi recognition algorithm, and the bigram language model implementation. Chapter 4 discusses the results of the phoneme classification experiments. The focus is on which vowels can be most reliably recognized for a particular dysarthric speaker with the minimum amount of confusability, and whether adding language information to the recognition task can help improve recognition rates. The final chapter, Chapter 5, summarizes the research conclusions and outlines the course of further research.

2 Theoretical Background

2.1 Overview

Speech recognition is a difficult task, given the variability associated with speech. A good recognition system must account for all the dynamics and uncertainties in speech in order to achieve reasonable accuracy. Stochastic methods provide adequate models to characterize much of the variability in speech. Furthermore, the question of whether a given utterance belongs to a certain class becomes one of hypothesis testing, a statistical decision theory problem. Hidden Markov modeling is a parametric technique that has been applied to speech recognition with considerable success [7] [8]. The HMM uses Markov chains to model the changing statistical characteristics that exist in the actual observations of speech signals. The HMM also has inherent time normalization properties. In terms of implementation, the HMM lends itself easily to computation on sequential machines. HMMs are trained iteratively using one of two algorithms and their variations: Viterbi decoding and Baum-Welch re-estimation [9] [10] [11].

Before training, however, front-end processing is applied to the speech data to map it to a feature space that characterizes the dynamics of the speech waveform. The main purpose of the front-end processing is to derive feature vectors such that vectors belonging to a given class of utterance are similar to each other, while feature vectors belonging to different classes are maximally different from one another. The feature extraction is carried out over small segments of speech called "frames," over which the speech signal can reasonably be assumed to be stationary. The feature extraction process serves to isolate the effects of environmental noise and speaker identity on the speech utterance, thereby enhancing the speaker independence of the system and making it more robust to environmental changes. The procedure also reduces the amount of data to be managed by the speech recognition and training systems. The feature vectors thus represent the temporal and spectral behavior of a short segment of the acoustical speech input. The ultimate goal of the front-end is to estimate parameters that effectively discriminate among the different phonetic units, while reducing the computational demand on the classifier.
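To make the idea of frame-based analysis concrete, here is a minimal MATLAB sketch of blocking a sampled signal into overlapping frames. It is illustrative only, not the thesis code; the 16 kHz sampling rate, 160-sample (10 ms) window, and 64-sample (4 ms) advance are the values used later in Chapter 3, and the signal itself is a stand-in.

% Minimal sketch: block a speech signal into overlapping analysis frames.
Fs = 16000;                     % sampling rate (Hz), assumed
s  = randn(Fs, 1);              % stand-in for one second of speech samples
Ls = 160;                       % frame length: 160 samples = 10 ms at 16 kHz
Q  = 64;                        % frame advance: 64 samples = 4 ms at 16 kHz

nFrames = floor((length(s) - Ls) / Q) + 1;
frames  = zeros(Ls, nFrames);
for r = 1:nFrames
    idx          = (r-1)*Q + (1:Ls);   % sample indices covered by frame r
    frames(:, r) = s(idx);             % each column is one quasi-stationary frame
end
% Each column of 'frames' is subsequently windowed and converted into one
% feature vector, so the feature stream advances at the frame rate Fs/Q.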
The mel-frequency cepstrum is the most commonly used feature space for characterizing the speech signal. The different aspects of speech recognition and training are discussed in the following sections. For detailed information, the reader is referred, for example, to the text by Deller et al. [2] and the paper by Rabiner [3].

2.2 The Hidden Markov Model

Signal modeling based on HMMs is a technique that extends conventional stationary spectral analysis principles to the analysis of time-varying signals [4]. HMMs use a Markov state process to model the changing statistical characteristics that are probabilistically manifested through actual observations. The state sequence is hidden, and is observed through another set of observable stochastic processes. The observable output probabilities associated with each of the hidden states are characterized by either discrete probability distributions or continuous probability density functions. In this thesis, the latter approach is used. This class of HMMs is called continuous hidden Markov models (CHMMs). The advantage of CHMMs is that the observations are continuous signals or vectors, and therefore do not suffer from degradation due to quantization errors as in the discrete case.

The model structure usually adopted for speech recognition is a left-to-right or Bakis structure [12]. In the Bakis model, states are aligned so that only "left-to-right" transitions are allowed. Such a model is appropriate to characterize speech signals whose dynamics progress sequentially along a timeline. Based on the above discussion we can now formally define an HMM. An HMM is characterized by the following sets of quantities:

1. N_{state}, the number of states in the model. We denote the individual states as S = {1, 2, 3, ..., N_{state}}, and the state at any time t as s_t.

2. The transition probability matrix, A = {a_{ij}}, where a_{ij} = P[s_{t+1} = j | s_t = i], 1 \le i, j \le N_{state}.

2.4.2 The Viterbi Recognition Algorithm

For each time t and state j, the recursion computes the best partial path score \phi_t(j) and records the best predecessor state \psi_t(j), where b_j(o_t) denotes the observation probability density of state j evaluated at the observation o_t:

\phi_t(j) = \max_{1 \le i \le N_{state}} [ \phi_{t-1}(i) a_{ij} ] \, b_j(o_t),
\psi_t(j) = \arg\max_{1 \le i \le N_{state}} [ \phi_{t-1}(i) a_{ij} ].

Termination: the best score is
P^* = \max_{1 \le j \le N_{state}} [ \phi_T(j) ],
s_T^* = \arg\max_{1 \le j \le N_{state}} [ \phi_T(j) ].

Backtracking: from time t = T - 1, ..., 2, 1,
s_t^* = \psi_{t+1}( s_{t+1}^* ).

The best state sequence is s^* = { s_1^*, s_2^*, ..., s_T^* }.
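The following is a minimal MATLAB sketch of this Viterbi search for a single left-to-right HMM, written in the log domain to avoid numerical underflow. It is an illustration under assumed inputs, not the thesis implementation (which was built on the Bayes Net Toolbox); in particular, the transition matrix, the initial state probabilities, and the observation log-likelihoods logB(j,t) = log b_j(o_t) are stand-in values.

% Minimal Viterbi decoding sketch (log domain) for one HMM.
N = 3;  T = 20;
logPi = log([1; 0; 0] + eps);            % left-to-right model starts in state 1
logA  = log([0.6 0.4 0.0;                % logA(i,j) = log a_ij; only left-to-right
             0.0 0.7 0.3;                % transitions have non-zero probability
             0.0 0.0 1.0] + eps);
logB  = log(rand(N, T));                 % stand-in for log b_j(o_t), j = 1..N, t = 1..T

phi = -inf(N, T);  psi = zeros(N, T);
phi(:, 1) = logPi + logB(:, 1);          % initialization
for t = 2:T                              % recursion
    for j = 1:N
        [best, iBest] = max(phi(:, t-1) + logA(:, j));
        phi(j, t) = best + logB(j, t);
        psi(j, t) = iBest;               % remember the best predecessor of state j
    end
end
[Pstar, sT] = max(phi(:, T));            % termination: best score and final state
path = zeros(1, T);  path(T) = sT;
for t = T-1:-1:1                         % backtracking
    path(t) = psi(path(t+1), t+1);
end

In the phoneme classifier of Chapter 3, a score such as Pstar is computed for each of the seven vowel HMMs, and the model with the largest score determines the recognized vowel.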
2.5 Speech Analysis and Feature Extraction

HMMs do not use raw data directly as an input. The speech signal is first sampled, digitized, and then transformed into a multi-dimensional feature space using either time-domain or frequency-domain approaches. The feature extraction process transforms the one-dimensional speech signal into a multi-dimensional stream of feature vectors at a reduced sampling rate (the frame rate), thereby resulting in the compression of data. Although the speech signal is non-stationary, it contains small portions of stationary spectral characteristics within a given utterance, giving rise to the term quasi-stationary. Hence, the feature analysis procedure must be applied over a window of speech short enough to be considered stationary, and at the same time long enough to make a good estimate of the speech signal parameters.

The mel-cepstrum is the most popular feature space employed in many speech recognition systems [14]. Mel-frequency cepstrum coefficient (MFCC) feature extraction involves computing the short-term discrete Fourier transform (stDFT) of the given speech signal, passing it through the mel-scale filter banks to compute the log total energy in each critical band, and finally taking the inverse discrete Fourier transform (IDFT) of the mel-scale coefficients. The feature extraction process is explained in the following sections.

2.5.1 Short-term Processing of Speech

The first step in feature extraction is the short-term processing of speech. This involves breaking the speech signal into a series of short segments known as analysis frames. Let s(n) be a discrete-time speech signal and w(n) be the finite window with which we multiply the speech signal in order to get the speech frame f(n; r), given by

f(n; r) = s(n) w(r - n).    (2.5.1)

This new frame of speech is a sequence in n, which is zero outside the short-term interval n \in [r - L_s + 1, r]. Here L_s is the total length of the speech frame and r is the end position of the speech frame. Typically, a Hamming window with impulse response of the form

w(n) = 0.54 - 0.46 \cos( \frac{2\pi n}{L_s - 1} ),   n = 0, ..., L_s - 1,    (2.5.2)

is used for the short-term analysis of the speech signal. The length L_s of the window is typically less than the length of the speech utterance. Overlapping frames are used to smooth the frame-to-frame transition. The short-term Fourier transform of the speech signal is then obtained by using the stDFT,

S(d; r) = \sum_{n = r - N' + 1}^{r} f(n; r) e^{-j 2\pi n d / N'},   d = 0, ..., N' - 1.    (2.5.3)

N' is the number of points used to compute the stDFT. The number of points used to compute the DFT is generally a power of two, so that it is easier and more efficient to implement. In the case that the frame length is not a power of two, zeros are padded at the end of the frame sequence to increase the resolution of the stDFT by increasing the number of points over which the stDFT is computed. Hence, N' is usually equal to the length of the speech frame after zero padding. The magnitude of the above stDFT, denoted by |S(d; r)|, gives the magnitude spectrum of the speech frame for which the stDFT has been computed.

2.5.2 Mel-scale Filter-Bank Processing

The human ear resolves frequencies non-linearly across the audio spectrum, and empirical evidence [14] suggests that designing a front-end to operate in a similarly non-linear manner to that of the human auditory system improves recognition rates [6]. The mel-scale filterbank approach is the most straightforward way to obtain the desired non-linear frequency transformation. The mapping from the linear scale to the mel scale is given by the approximation

F_{mel} = 2595 \log_{10}( 1 + \frac{F_{Hz}}{700} ),    (2.5.4)

where F_{mel} is the perceived frequency and F_{Hz} denotes the real frequency [31]. A mel is a unit of measure of the perceived pitch or frequency of a tone. Figure 2.4 shows the warping of the linear frequency scale by the mel scale.

Figure 2.4. The mel scale.

It has been found that the perception of a particular frequency by the auditory system is influenced by the energy in a critical band of frequencies around that particular frequency [17]. Further, the bandwidth of a critical band varies with frequency, beginning at about 100 Hz for frequencies below 1 kHz, and then increasing logarithmically above 1 kHz. The log total energy in a critical band centered on a mel frequency is computed by weighting the log magnitude spectrum with the magnitude response of the corresponding critical-band filter and summing the weighted values over that band.
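As an illustration of this filter-bank step, the following MATLAB sketch maps DFT bin frequencies through the mel warping of (2.5.4) and builds a small bank of critical-band filters. It is a sketch only; the triangular filter shape, the number of filters, and the FFT size are assumptions rather than details taken from the thesis (which used the VoiceBox toolbox for this step).

% Minimal sketch: mel warping of (2.5.4) and a triangular mel filter bank.
Fs = 16000;  Nfft = 256;  Ncb = 20;             % assumed analysis settings
hz2mel = @(f) 2595 * log10(1 + f / 700);        % equation (2.5.4)
mel2hz = @(m) 700 * (10.^(m / 2595) - 1);       % its inverse

edgesMel = linspace(hz2mel(0), hz2mel(Fs/2), Ncb + 2);  % equally spaced in mels
edgesHz  = mel2hz(edgesMel);                    % lower/centre/upper edges in Hz
binHz    = (0:Nfft/2) * Fs / Nfft;              % frequencies resolved by the stDFT

H = zeros(Ncb, Nfft/2 + 1);                     % H(i,q): response of filter i at bin q
for i = 1:Ncb
    fl = edgesHz(i);  fc = edgesHz(i+1);  fu = edgesHz(i+2);
    rise    = (binHz - fl) / (fc - fl);         % rising edge of the triangle
    fall    = (fu - binHz) / (fu - fc);         % falling edge of the triangle
    H(i, :) = max(0, min(rise, fall));
end

Smag = abs(fft(randn(Nfft, 1)));                % stand-in magnitude spectrum of a frame
Y    = H * log(Smag(1:Nfft/2 + 1) + eps);       % weighted sum of the log spectrum per band

Each element of Y corresponds to one critical-band log energy Y(i), defined formally in the equations that follow.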
We use the notation Y(i) to denote the log total energy in the ith critical band with center frequency F_{ic} and lower and upper cutoff frequencies F_{il} and F_{iu}, respectively, where

Y(i) = \sum_{q=0}^{N'/2} \log|S(q; r)| H_i( q \frac{2\pi}{N'} ) = \sum_{q=d_{il}}^{d_{iu}} \log|S(q; r)| H_i( q \frac{2\pi}{N'} ),    (2.5.5)

where N' is the number of points used to compute the stDFT. The integers i index the center frequencies of the critical-band filters, each of which is assumed to be centered on one of the frequencies resolved by the stDFT. H_i( q \frac{2\pi}{N'} ) is the magnitude spectrum of the ith critical-band filter. If we know the sampling frequency F_s of the speech signal, the relation between the cutoff frequencies F_{il} and F_{iu} and their corresponding sequence indices is given by

F_{il} = d_{il} \frac{F_s}{N'}   and   F_{iu} = d_{iu} \frac{F_s}{N'}.    (2.5.6)

The resultant sequence is given by

\tilde{Y}(q) = Y(i) for q = d_{ic}, and 0 for other q \in [0, N' - 1], where F_{ic} = d_{ic} \frac{F_s}{N'}.    (2.5.7)

2.5.3 Mel-frequency Cepstrum Coefficients (MFCC)

The final step in the MFCC feature extraction process is taking an IDFT of the mel-scaled filter-bank coefficients. The MFCCs at frame position r are given by

c_s(n; r) = \frac{2}{N'} \sum_{i=1}^{N_{cb}} \tilde{Y}(d_{ic}) \cos( d_{ic} \frac{2\pi n}{N'} ),   n = 1, 2, ..., N_{cb}.    (2.5.8)

N_{cb} is the total number of critical-band filters used on the Nyquist range, hence there are only N_{cb} terms in the sum of (2.5.8). Here we note that the IDFT reduces to a discrete cosine transform (DCT). This simplifies the stochastic characterization of the features, thereby reducing computational costs.

2.5.4 Log Energy, Delta and Acceleration Coefficients

To augment the spectral parameters derived from the MFCC analysis, a log energy coefficient is added to the MFCC parameters, which is given by

c_s(0; r) = \sum_{i=1}^{N_{cb}} \tilde{Y}(d_{ic}).    (2.5.9)

A further improvement in performance can be obtained by adding differenced or delta cepstrum coefficients to the MFCC parameters, thereby accounting for the dynamics of the speech signal. The delta coefficient at frame r is defined as

\Delta c_s(n; r) \overset{def}{=} c_s(n; r + \eta Q) - c_s(n; r - \eta Q)    (2.5.10)

for all n. Here Q represents the number of samples by which the window is shifted for each frame, and \eta is chosen to smooth the delta cepstrum. The acceleration coefficients (also known as the delta-delta coefficients) are obtained by applying the above equation to the delta coefficients.
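To tie (2.5.8)-(2.5.10) together, here is a minimal MATLAB sketch that turns a matrix of log critical-band energies into cepstral, log-energy, delta, and acceleration coefficients. It is illustrative only; the filter-bank output Ylog, the centre-bin indices dc, the number of cepstral coefficients, and the smoothing parameter eta are assumed values, not those of the thesis implementation (which used the VoiceBox toolbox).

% Minimal sketch: cepstral, log-energy, and delta features from log
% critical-band energies. Ylog(i,r) is the log energy of band i in frame r.
Ncb = 20;  nFrames = 50;  nCep = 12;  eta = 2;  Nfft = 256;
Ylog = randn(Ncb, nFrames);                     % stand-in filter-bank output
dc   = round(linspace(5, Nfft/2 - 5, Ncb));     % assumed centre bins d_ic of the filters

C = zeros(nCep, nFrames);                       % cosine transform as in (2.5.8)
for n = 1:nCep
    C(n, :) = (2 / Nfft) * cos(2 * pi * n * dc / Nfft) * Ylog;
end

c0     = sum(Ylog, 1);                          % log-energy term, as in (2.5.9)
static = [c0; C];                               % 13 static coefficients per frame

% Delta coefficients as in (2.5.10): the shift of eta*Q samples corresponds
% to eta frames, so difference across +/- eta frames (edges replicated).
pad   = @(X) [repmat(X(:,1), 1, eta), X, repmat(X(:,end), 1, eta)];
P     = pad(static);   delta = P(:, 2*eta+1:end) - P(:, 1:end-2*eta);
P     = pad(delta);    accel = P(:, 2*eta+1:end) - P(:, 1:end-2*eta);

features = [static; delta; accel];              % 39-dimensional observation vectors

The resulting 39 components per frame match the observation vectors described later in Section 3.3.3.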
2.6 Language Modeling

When statistical relationships among utterances are known, a language model (LM) makes it possible to reduce the search space for a given recognition task, or alternatively to assign higher probabilities to some utterances than others, thereby reducing recognition errors. Stochastic LMs apply a probabilistic and statistical framework to the language modeling problem. The most widely used stochastic language model in speech recognition tasks is the N-gram model. An N-gram grammar is a representation of a Markov LM in which the probability of occurrence of an utterance is conditioned upon the prior occurrence of the N-1 preceding utterances. The utterances can be either whole words or simple phonemes. In this research, we use utterances of vowel phonemes. In the N-gram approach, the language information is formulated as a probability distribution over the different utterances in the vocabulary. In this thesis, the utterances represent the individual phoneme utterances that are used in the training and testing of HMMs.

Let us formally define an N-gram stochastic language model. Let W = w_1, w_2, ..., w_{L_w} be a string of known utterances of length L_w in the vocabulary and P(W) be the a priori probability of the given sequence W. Then P(W) can be factored as

P(W) = P(w_1, w_2, ..., w_{L_w}) = \prod_{i=1}^{L_w} P(w_i | w_1, ..., w_{i-1}).    (2.6.1)

However, estimating the joint probability above is a computationally intensive task. A practical solution is to use an N-gram language model with N = 2, known as a bigram LM. In a bigram LM, the probability of occurrence of a given utterance is conditioned only on the occurrence of the preceding utterance. Bigram models help reduce computation and also provide a simple unified framework to embed both language and phonetic information in a single HMM. In this thesis, we concentrate on bigram language models. (2.6.1) can now be re-written as

P(W) = \prod_{i=2}^{L_w} P(w_i | w_{i-1}).    (2.6.2)

Let the observed string or sequence of utterances be O = {o_1, o_2, ..., o_{L_o}}. We can find the most likely utterance string W^* using the MAP classification rule

W^* = \arg\max_W { P(O | W) P(W) }.    (2.6.3)

To evaluate (2.6.3), we replace the known word string W = w_1, w_2, ..., w_{L_w} with HMMs representing each of the known utterances in the word string W, i.e., we construct a network of HMMs {\lambda_k, 1 \le k \le L_w} representing each utterance in W. Here, we assume that each utterance in the vocabulary has only one HMM associated with it. Evaluating the probability P(O | W) is then equivalent to estimating the probability P(O | \lambda_k), the likelihood score of the HMM, which can be evaluated using the forward recursion of the forward-backward algorithm as discussed in Section 2.4.1. The bigram probability P(W), which is simply a product of phoneme-to-phoneme transition probabilities, can be obtained from the LM defined in (2.6.2). The detailed algorithm for evaluating the bigram probability scores is described later in Section 3.5.
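As a small illustration of how (2.6.2) and (2.6.3) combine acoustic and language scores, the following MATLAB sketch scores one candidate vowel string. It is a sketch under assumed inputs (the acoustic log-likelihoods logLik and the bigram matrix B are made up here) and is not the LM-based Viterbi search actually implemented in Section 3.5.

% Minimal sketch: MAP scoring of one candidate utterance string with a bigram LM.
vowels = {'UW','OW','IY','AE','AO','EY','AY'};
nV     = numel(vowels);
B      = ones(nV) / nV;               % B(i,j): P(vowel j follows vowel i), assumed uniform

% logLik(k,t) stands for log P(o_t | lambda_k), the log-likelihood of the
% t-th observed utterance under the k-th vowel HMM (assumed precomputed).
T      = 5;
logLik = log(rand(nV, T));

W     = [3 2 3 1 4];                  % one candidate string, as indices into 'vowels'
score = logLik(W(1), 1);              % acoustic score of the first utterance
for t = 2:T
    score = score + log(B(W(t-1), W(t)))  ...  % language term P(w_t | w_{t-1})
                  + logLik(W(t), t);           % acoustic term P(o_t | lambda_{w_t})
end
% The MAP rule (2.6.3) selects the string W with the largest such combined score.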
2.7 Phonemes and Phonetic Transcription

"The basic theoretical unit for describing how speech conveys linguistic information is called a phoneme. For American English, there are about 42 phonemes consisting of vowels, semivowels, diphthongs and consonants. Each phoneme is a result of a unique set of articulatory gestures (such as the type and location of the sound excitation as well as the position or movement of the vocal tract articulators). Due to many different factors including, for example, accents, gender, and, most importantly, coarticulatory effects, a given phoneme will have a variety of acoustic manifestations in the course of flowing speech. Thus, from an acoustic point of view, the phoneme represents a class of sounds that convey the same meaning. The phonemes of a language, therefore, comprise a minimal theoretic set of units sufficient to convey all the meaning in the language. The process of translating speech into a string of symbols representing the phoneme is called phonemic transcription and if it includes diacritical marks indicating allophonic variation, the process is called phonetic transcription" [2].

The three widely used phonetic transcriptions are the International Phonetic Alphabet (IPA), the Single Symbol Version, and the Upper Case version of the ARPAbet. In this thesis, we will use the upper case ARPAbet, which is used for the phonetic transcriptions in the TIMIT database developed by Texas Instruments and Massachusetts Institute of Technology. The mapping between the IPA symbols and the upper case ARPAbet for the vowels in American English is shown in Table 2.1.

Table 2.1. Phonetic transcriptions of vowels.

IPA Symbol    Upper case ARPAbet    Example word
i             IY                    beet
ɪ             IH                    bit
ɛ             EH                    bet
e             EY                    bait
æ             AE                    bat
ɑ             AA                    bott
ɑʊ            AW                    bout
ɑɪ            AY                    bite
ʌ             AH                    but
ɔ             AO                    bought
ɔɪ            OY                    boy
o             OW                    boat
ʊ             UH                    book
u             UW                    boot
ə             AX                    about
ɨ             IX                    debit

3 Implementation Details

3.1 Hardware and Software Tools

The phoneme recognition system and all the following experiments were carried out on Pentium-III class personal computers running the Windows NT 4.0 operating system. The HMM routines from the Bayes Net Toolbox developed by Kevin Murphy at the University of California, Berkeley [15] were used for training and implementing the HMM-based speech recognition system. In addition, the VoiceBox toolbox developed by Mike Brookes, Imperial College, London [16] was employed to extract MFCC parameters from the speech signal. MATLAB 6.1 was used as the development environment, as it is an interactive, matrix-oriented programming language with built-in support for data analysis and visualization.

3.2 Speech Databases

3.2.1 TIMIT Speech Corpus

"The TIMIT database is a corpus of read speech developed by Texas Instruments and Massachusetts Institute of Technology. The main purpose for designing this corpus was to provide speech data for acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing eight major dialect divisions of American English, each speaking 10 phonetically-rich sentences. The TIMIT corpus also includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence. The text material in the TIMIT prompts consists of two dialect "Shibboleth" sentences, 450 phonetically-compact sentences, and 1890 phonetically-diverse sentences. The dialect sentences (designated SA type in the database) were meant to expose dialectal variants of the speakers, and were read by all 630 speakers. The phonetically compact sentences (designated SX type in the database) were meant to be comprehensive as well as compact. The phonetically diverse sentences (designated SI type in the database) were selected to add diversity in sentence types and phonetic contexts. The corpus is also subdivided into a training set (70-80% of the corpus) and a test set (20-30% of the corpus)" [1]. The speakers are both male and female. The speech data were sampled at 16 kHz, and the digitized waveforms were stored in the National Institute of Standards and Technology (NIST) SPeech HEader REsource (SPHERE) format using 16 bits/sample.

3.2.2 The Dysarthric Speech Corpus

This corpus consists of non-labeled speech data collected by personnel in the Communication Center of Excellence at Madonna Rehabilitation Hospital, Nebraska, following an Institutional Review Board (IRB)-approved protocol for the protection of human subjects. The principal investigator for the clinical study is Professor David Beukelman of the University of Nebraska, Department of Special Education, who is also a researcher associated with the Madonna Rehabilitation Hospital. The corpus comprises isolated utterances of nine vowel sounds, four semivowel sounds and two nasal sounds. Each of the utterances was repeated at least 10 times to provide a sufficient number of speech samples for testing and evaluation purposes. Four speakers with varying amounts of dysarthria provided the isolated sound utterances. The description of the four speakers is given below:

Speaker 1 is a 38-year-old male with a diagnosis of Traumatic Brain Injury (TBI).
His intelligibility is severely/profoundly impaired. Speech characteristics include slow rate, inability to produce consonant sounds other than nasals, and vowel distortions, but some control over pitch and intonation.

Speaker 2 is a 26-year-old female with a diagnosis of dysarthria secondary to athetoid cerebral palsy. Her intelligibility is severely impaired. Speech characteristics include imprecise consonants, slow rate, distorted vowels, and some control over prosody.

Speaker 3 is a 39-year-old male with a diagnosis of TBI. His intelligibility is moderately impaired. His speech characteristics include impaired control over respiration, strained/strangled voice quality, imprecise consonant production and decreased word boundaries.

Speaker 4 is a 49-year-old female with a diagnosis of dysarthria secondary to mixed cerebral palsy. Her intelligibility is severely impaired. Speech characteristics include imprecise consonants, repetition of phonemes, irregular articulatory breakdown, and distorted vowels.

The waveforms were digitally recorded in stereo at a 44.1 kHz sampling rate, and all the utterances were stored in a single wave file (.wav format) [18] for each speaker. Table 3.1 tabulates the isolated utterances that were used to build the dysarthric speech corpus.

Table 3.1. Isolated utterances in the dysarthric speech database.

Sounds        Uppercase ARPAbet Symbols    Example
Vowels        OW                           open
              AA                           ma
              IY                           eat
              AH                           up
              AY                           eye
              AE                           cat
              AO                           awful
              UW                           oops
              EY                           ate
Semivowels    L                            fall
              R                            earn
              Y                            young
              W                            way
Nasals        M
              N

3.3 Creating the Observation Strings

As discussed in Section 2.5, the speech signal must be converted to a suitable feature space. This thesis concentrates on developing vowel phoneme-level HMMs and evaluating their performance on the dysarthric speech corpus. For this purpose, seven vowel sounds (UW, OW, AY, AE, AO, IY and EY) were chosen for training and evaluation. The HMMs for the two remaining vowel sounds (AA and AH) could not be properly trained, as there was considerable acoustic variation associated with these phonemes within the TIMIT database. Hence, the phonemes AA and AH were not incorporated into the phoneme classifier. All results and experiments have been carried out using these seven vowel sounds.

3.3.1 Phoneme Extraction from the TIMIT Speech Corpus

The speech utterances in the TIMIT database are complete sentences, and each wave file is associated with a phonetic transcription of the sentence spoken. This transcription was used to extract the seven chosen vowel phonemes from all the speakers across seven dialects for the training set. Before extracting the phonemes, the NIST SPHERE wave files were converted to Windows PCM wave files using the NIST-provided software sphconvert.exe [19]. Only the SI and SX type sentences were considered during the phoneme extraction process, as the SA type sentences tend to introduce a bias into the recognition process [20]. The resultant extracted phonemes were stored in a binary format.
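The MATLAB sketch below illustrates this kind of transcription-driven extraction. It is not the thesis code; it assumes the TIMIT convention that each sentence has a .phn file listing one phone per line as "start-sample end-sample label", the file name si1234.phn is hypothetical, and the sentence waveform is taken to be already converted to PCM and loaded into the vector s.

% Minimal sketch: cut vowel segments out of one TIMIT sentence using its
% phonetic transcription file.
target = {'uw','ow','ay','ae','ao','iy','ey'};  % the seven vowels of interest
s      = randn(48000, 1);                       % stand-in for the loaded sentence waveform

fid  = fopen('si1234.phn', 'r');                % hypothetical transcription file
cols = textscan(fid, '%d %d %s');               % start sample, end sample, label
fclose(fid);
starts = cols{1};  stops = cols{2};  labels = cols{3};

segments = {};                                  % extracted vowel waveforms
for k = 1:numel(labels)
    if any(strcmp(labels{k}, target))
        % TIMIT sample indices start at 0; MATLAB indexing starts at 1.
        segments{end+1} = s(starts(k)+1 : stops(k));  %#ok<SAGROW>
    end
end
% Each cell of 'segments' holds one vowel utterance, ready for MFCC analysis.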
3.3.3 Features Comprising the Observation Strings The feature space is a 39-dimensional feature vector comprising 12 mel-cepstrum coefficients, a log energy coefficient, 13 delta cepstrum coefficients and 13 delta-delta cepstrum coefficients. The 0th order coefficient of the MFCC is not included as it is closely related to the log energy measure. The features are computed over a speech signal frame of 160 points after a Hamming window has been applied. The window is advanced by 64 points for each frame. For a 16kHz speech signal, this implies that a 4ms window is applied for every lOms of speech. The 39-dimensional MFCC parameters obtained both for normal as well as the dysarthric speech, constitutes the observation string that is applied as input to an HMM. Figure 3.1 shows the MFCC for the vowel phoneme ‘AE’. The feature vector consists of 12 mel-cepstrum coefficients, 1 log energy coefficient, 13 delta coefficients and 13 delta-delta coefficients. The MP CC features have been extracted for the entire speech waveform for all the frame positions. 34 0.6 . ‘ 0.4 0'2 [I I... . . , 'i, 0 ‘ II'HHHH‘III [III lllllllll-JH iiijl'lii'rlmliu Wail-V! ‘ I -0.2 ii“ 04 ' 0 5000 10000 15000 (a) 100 I I r I I I T I 50~ l l l o 5 1o 15 20 25 30 35 40 (b) Figure 3.1. Example of MP CC feature extraction: (a) The speech waveform for the phoneme utterance ‘AE’. (b) The 39-dimensional MFCC parameters. 3.4 Implementation of the Phoneme Recognizer The phoneme recognizer consists of seven vowel-based HMMs trained on normal speech from the TIMIT database. A given test speech utterance is passed through the seven vowel HMMs and the likelihood scores are computed. The vowel associated with the model giving the maximum score is chosen as the recognized vowel. Thus each given 35 speech utterance is classified as one of the seven different vowels. Figure 3.2 shows the structure of the phoneme classifier that was implemented. VITERBI RECOGNIZER 0 I O 0 (D m 3 PHONEME 2 ; UNKNOWJEI 'g' RECOGNIZED INPUT E —’ PHONEME PHONEME 8 0 I) III —————_—————————-—————— -.-u..................¢-...-u- Figure 3.2. The phoneme classifier. 36 3.4.1 HMM Topology Phoneme utterances are typically represented by a three-state Bakis structure [12]. The first state and the last state nominally represent the transition into and out of the phoneme, respectively, and the middle state represents the steady-state portion of the utterance. The state transition probabilities are governed by the following equations ail. =0, j < i. (3.4.1) 0, i Table 4.3 shows the confusion matrix for Speaker 4. The columns represent the different phoneme vowels used in the recognizer and the rows indicate the different vowel utterances that were passed through the phoneme classifier. The number in each cell represents the number of test utterances of a given vowel phoneme (from the dysarthric speech database) that were classified as belonging to each vowel HMM. This method of representation of the classification results helps identify the vowel sounds that are least likely to be confiised with other vowel sounds. In Table 4.3, it is observed that the three vowel phonemes OW, IY and AE are never confused with each other and hence are the most reliable sounds produced by Speaker 4. Thus for Speaker 4 we can postulate at least three reliable vowel phonemes. 49 Similarly, confusion matrices for Speaker 1, Speaker 2 and Speaker 3 represent results of running similar classification experiments. Table 4.4. 
Table 4.4. Confusion matrix for Speaker 1 (shaded cells indicate reliable vowel sound).

Table 4.5. Confusion matrix for Speaker 2 (shaded cells indicate reliable vowel sound).

Table 4.6. Confusion matrix for Speaker 3 (shaded cells indicate reliable vowel sound).

For Speaker 1, we observe from Table 4.4 that the phonemes UW and IY have high recognition accuracies. UW is confused with IY only once, and this suggests that with a little intervention (some articulatory training by clinicians), Speaker 1 can be trained to reliably produce the phoneme UW. Similarly, for Speaker 2 we observe that the phonemes OW and IY are reliable choices, and for Speaker 3, OW, AE and IY are reliable choices.

4.3.2 Conclusion

Using a phoneme classifier trained on normal speech, we were able to recognize at least two reliable vowel sounds for each of the dysarthric speakers. This suggests that it is feasible to build a speech recognition system capable of recognizing vowel sounds with the minimum amount of confusability with other vowels.

4.4 Experiments with Language Modeling

The purpose of the LM is to restrict the search space of the recognition task and thereby help reduce recognition error rates. In this research, we use bigram LMs to investigate the effect of language modeling on vowel phoneme recognition rates. The goal is to ascertain whether introducing an LM into the recognition task increases the accuracy with which vowels are recognized. Two cases were considered for the bigram LM. First, a speaker-independent bigram LM was computed for all speakers to determine whether it is possible to derive a single LM that can increase recognition accuracies across all dysarthric speakers. In the second experiment, a restrictive speaker-dependent bigram LM was used to test for an increase in vowel recognition accuracies. A speaker-independent bigram LM in this context is a single LM that represents the entire dysarthric population used in this research. Similarly, a speaker-dependent bigram LM is one that is specific to each speaker.

In order to quantify the results of the bigram LM recognition task, the LM recognition results were compared with the baseline acoustic recognition results using the vowel phoneme error rate. In this research, the vocabulary consists of isolated vowel phoneme utterances, so the error rate is defined as

Error rate E = (# misrecognitions in the utterance string) / (length of the utterance string).    (4.1.2)

The utterance string here refers to a hypothetically observed sequence of phoneme utterances that we wish to recognize.
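The error rate of (4.1.2) is straightforward to compute once a recognized string is available; the short MATLAB sketch below shows the calculation on a made-up example (the label strings are illustrative, not experimental data).

% Minimal sketch: vowel phoneme error rate as defined in (4.1.2).
spoken     = {'UW','OW','IY','AE','AO','EY','AY','OW','IY','UW'};   % example data
recognized = {'UW','OW','IY','AE','OW','EY','AY','OW','IY','IY'};   % example data

misrecognitions = sum(~strcmp(spoken, recognized));   % utterances decoded wrongly
E = misrecognitions / numel(spoken);                  % error rate of (4.1.2)
fprintf('Error rate E = %d/%d = %.2f\n', misrecognitions, numel(spoken), E);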
4.4.1 Speaker-Independent Bigram Language Model

In this experiment, we use a single LM for all four dysarthric speakers. The transition probabilities associated with the bigram LM used in this experiment are shown in Table 4.7. These probabilities are shown in the form of a grid, where the number in each cell denotes the probability of the vowel HMM representing the column (to which the cell belongs) following the vowel HMM representing the row (to which the cell belongs). The details of this representation of the bigram LM are explained in Section 3.5.

Table 4.7. Speaker-independent Bigram LM (rows: preceding vowel HMM; columns: following vowel HMM).

        UW     OW     IY     AE     AO     EY     AY
UW      0      0.56   0      0.66   0.45   0      0
OW      0.45   0      0.56   0      0      0.55   0
IY      0      0.75   0      0      0.65   0      0
AE      0.75   0      0      0      0.6    0      0.65
AO      0.6    0      0.5    0      0      0.43   0.6
EY      0      0.45   0      0.78   0      0      0.6
AY      0      0.5    0      0.6    0      0.5    0

The bigram LM shown in Table 4.7 is a contrived one, as we have only isolated phoneme utterances in the dysarthric database. The results of the experiments in Section 4.2 were first used to ascertain which vowels could be considered reliable phonemes. The phoneme-to-phoneme transition probabilities were then assigned such that a poorly recognized vowel phoneme is always followed by a reliably recognized vowel phoneme. In addition, the probability of a vowel phoneme immediately following itself was set to zero. The probability distribution of the vowel HMMs from the bigram LM in Table 4.7 was used to generate finite strings of 30 vowel phoneme utterances for the recognition task. The vowel phonemes used in the LM task are randomly generated, which implies that the utterances used in the acoustic recognition task in the LM case are not necessarily the same as those used in the recognition task of Section 4.2. Hence, the classification results shown in Section 4.2 are not always the same as those shown in the bigram LM recognition task. The results from the acoustic classification task and the bigram LM recognition task are tabulated below.

Table 4.8. Confusion matrix with speaker-independent LM for Speaker 1.

        UW   OW   IY   AE   AO   EY   AY
UW      6    0    1    0    0    1    0
OW      0    6    0    5    0    0    0
IY      1    0    4    0    0    2    0
AE      0    0    0    3    0    0    0
AO      0    1    0    0    0    0    0
EY      0    0    0    0    0    0    0
AY      0    0    0    0    0    0    0

Table 4.9. Confusion matrix without speaker-independent LM for Speaker 1.

Table 4.10. Confusion matrix with speaker-independent LM for Speaker 2.

Table 4.11. Confusion matrix without speaker-independent LM for Speaker 2.

Table 4.12. Confusion matrix with speaker-independent LM for Speaker 3.

        UW   OW   IY   AE   AO   EY   AY
UW      6    0    1    0    0    3    0
OW      0    7    0    1    0    0    0
IY      0    0    5    0    0    0    0
AE      0    0    0    6    0    0    0
AO      0    0    0    0    0    0    1
EY      0    0    0    0    0    0    0
AY      0    0    0    0    0    0    0

Table 4.13. Confusion matrix without speaker-independent LM for Speaker 3.

Table 4.14. Confusion matrix with speaker-independent LM for Speaker 4.

Table 4.15. Confusion matrix without speaker-independent LM for Speaker 4.

Table 4.16. Error rates for the recognition task with the speaker-independent LM.

Dysarthric Speakers    Error rate for phoneme recognition with LM    Error rate for phoneme recognition without LM
Speaker 1              11/30                                         14/30
Speaker 2              15/30                                         14/30
Speaker 3              6/30                                          17/30
Speaker 4              14/30                                         13/30

The confusion matrices for the recognition task shown in Table 4.8 through Table 4.15 give the distribution of the classified vowel utterances for each speaker, for both baseline acoustic recognition and recognition with the bigram LM. As a side note, the baseline recognition results in Table 4.9, Table 4.11, Table 4.13 and Table 4.15 consist of vowel utterances randomly chosen from the database depending upon the sequence of phonemes being generated using the speaker-independent bigram LM. This implies that the baseline classification results from the tables mentioned before may be different from those shown in Table 4.3 to Table 4.6.
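The following MATLAB sketch shows one way such a test string can be generated from a bigram grid like Table 4.7. It is purely illustrative: the normalization of each row into a proper distribution and the uniform choice of the initial vowel are assumptions, since the thesis does not spell out these details.

% Minimal sketch: sample a string of 30 vowel labels from the bigram grid.
vowels = {'UW','OW','IY','AE','AO','EY','AY'};
B = [0    0.56 0    0.66 0.45 0    0   ;    % Table 4.7, rows = preceding vowel
     0.45 0    0.56 0    0    0.55 0   ;
     0    0.75 0    0    0.65 0    0   ;
     0.75 0    0    0    0.6  0    0.65;
     0.6  0    0.5  0    0    0.43 0.6 ;
     0    0.45 0    0.78 0    0    0.6 ;
     0    0.5  0    0.6  0    0.5  0   ];
B = B ./ sum(B, 2);                   % normalize each row to a distribution (assumed)

L = 30;                               % string length used in the experiments
w = zeros(1, L);
w(1) = randi(numel(vowels));          % initial vowel chosen uniformly (assumed)
for t = 2:L
    cdf  = cumsum(B(w(t-1), :));      % cumulative distribution over the next vowel
    w(t) = find(rand <= cdf, 1, 'first');
end
fprintf('%s ', vowels{w});  fprintf('\n');

Each sampled label is then paired with a recording of that vowel drawn from the dysarthric database, which is why the baseline counts in Tables 4.9-4.15 can differ from those in Tables 4.3-4.6.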
The speaker-independent LM does improve recognition rates for Speaker 1 and Speaker 3 as observed from the error rates in Table 4.16. However, the baseline acoustic recognition rates outperform those of the recognizer with the bigram LM for Speaker 2 and Speaker 4. This suggests that although we can improve recognition rates for dysarthric speech with a bigram LM, it is not possible to deduce a speaker- independent bigram LM to represent the entire dysarthric population. 4.4.2 Speaker-Dependent Bigram Language Model In this experiment, a speaker-dependent bigram LM was computed and the recognition rates for both the baseline acoustic recognition task and the recognition with speaker — dependent bigram LM were compared. In this case, too, the bigram LM is a contrived one. The phoneme-to-phoneme transition probabilities are defined for each speaker such that there is a high probability transition from a poorly recognized vowel phoneme (for that speaker) to a reliably recognized vowel phoneme (for that speaker). In addition, the LM used is very restrictive so as not to allow too many transitions from a given phoneme. The probability of a vowel phoneme following itself is zero. Using these rules we can generate numerous such LMs for the recognition task. The LMs shown in the Tables below are the ones that perform consistently better than the baseline recognition task more than 90% of the time. The results of the language modeling experiments for each speaker are presented in the tables below. 58 Table 4.17. Bigram LM for Speaker 1. Vowel Phonemes A .5 Vowel Phonemes Table 4.18. Confusion matrix with speaker-dependent LM for Speakerl (shaded cells indicate reliable vowel sound). Vowel HMMs A BY AY 1 m o 9 E E o 3:: D 59 Table 4.19. Confusion matrix without speaker-dependent LM for Speaker 1(shaded cells indicate reliable vowel sound). Vowel HMMs UW|ow IY AE Ao EY AY a UW 0 1 0 0 0 0 g ow 6 1 0 0 1 0 g IY 0 0 6 0 0 0 0 .3 AB 0 0 1 0 3 0 7, A0 0 2 0 0 0 0 0 5 BY 0 0 7 0 0 0 0 > AY 0 0 4 0 0 4 0 Table 4.20. Error rate for Speaker 1. Error Rate with bigram LM 5/45 without bigram LM 30/45 We observe from Table 4.18 and Table 4.19 that the bigram model improves recognition rates for the vowels OW, AE, AO, EY and AY. Furthermore, if we observeTable 4.19, the phoneme pairs UW and AE, and OW and AE are never confused with each other for the baseline acoustic recognizer. From, Table 4.18 we also observe that the phoneme set UW, OW, EY and AY are never confused with each other. Thus, there is an increase in the number of reliable vowel phonemes recognized for Speaker 1, as a result of the bigram LM. 60 Table 4.21. Bigram speaker-dependent LM for Speaker 2. Vowel HMMs A Vowel HMMs Table 4.22. Confusion matrix with speaker-dependent LM for Speaker 2 (shaded cells indicate reliable vowel sound). Vowel HMMs A -5 9:1 5:: >5 D 61 Table 4.23. Confusion matrix without speaker-dependent LM for Speaker 2 (shaded cells indicate reliable vowel sound). Vowel HMMs A 2 Vowel Utterances > > Table 4.24. Error rate for Speaker 2. Error Rate with bigram LM 1/50 without bigram LM 25/50 For Speaker 2, without the LM the vowel phonemes OW and IY, and IY and AY respectively, can be considered to be reliable vowels. With the bigram LM, we obtain significant improvements in the recognition results, as six vowel phonemes UW, OW, IY, AE, A0 and AY can be reliably recognized. 62 Table 4.25. Bigram speaker-dependent LM for Speaker 3. nemes A owel Vowel Phonemes Table 4.26. 
Table 4.25. Bigram speaker-dependent LM for Speaker 3.

Table 4.26. Confusion matrix with speaker-dependent LM for Speaker 3 (shaded cells indicate reliable vowel sound).

Table 4.27. Confusion matrix without speaker-dependent LM for Speaker 3 (shaded cells indicate reliable vowel sound).

Table 4.28. Error rate for Speaker 3.
                     Error rate
With bigram LM       7/40
Without bigram LM    25/40

For Speaker 3, without language modeling the vowels OW, IY and AE can be considered reliable vowels for recognition. With the introduction of the bigram LM, the vowel phonemes OW, IY, AE, EY and AY are never confused with each other.

Table 4.29. Bigram speaker-dependent LM for Speaker 4.

Table 4.30. Confusion matrix with speaker-dependent LM for Speaker 4 (shaded cells indicate reliable vowel sound).

Table 4.31. Confusion matrix without speaker-dependent LM for Speaker 4 (shaded cells indicate reliable vowel sound; rows: vowel utterances; columns: vowel HMMs).
        UW   OW   IY   AE   AO   EY   AY
UW       0   10    0    0    0    0    0
OW       1    6    0    0    0    0    0
IY       0    0    3    0    0    1    0
AE       0    0    0    5    0    0    0
AO       0    2    0    1    0    0    4
EY       0    0    0    1    0    1    0
AY       0    0    0    2    0    0    3

Table 4.32. Error rate for Speaker 4.
                     Error rate
With bigram LM       4/40
Without bigram LM    22/40

For Speaker 4, the vowels OW and AE can be identified as reliable vowel sounds for the recognition task without the LM. In the bigram LM case, the vowel phonemes OW, IY, EY and AY are never confused with each other.

4.4.3 Conclusion

The bigram LM increases recognition rates of the vowel phonemes. However, it is not possible to deduce a single bigram LM that represents this dysarthric population, due to the variability in dysarthric speech: different speakers produce different vowel sounds reliably, depending upon the degree and type of dysarthria. A better approach is to build a LM that is specific to a dysarthric speaker. From the language modeling experiments, we observe that the speaker-specific LMs always improve the recognition rates of vowel phonemes. Furthermore, with the help of a proper LM it is possible to obtain a larger number of reliable vowel sounds than in the baseline acoustic recognition case. This suggests that including a bigram LM in the baseline recognition task can yield more reliable sounds that can be used as control triggers for an AAC device.

4.5 HMMs Trained on Dysarthric Speech

In this experiment, the seven vowel HMMs were trained on the dysarthric speech database. The utterances of each speaker were partitioned into a training set and a test set. Table 4.33 shows the number of utterances used in the training and test sets for the HMM training and recognition tasks, respectively.

Table 4.33. Partitioning the dysarthric speech database into training and test sets.
          Speaker 1        Speaker 2        Speaker 3        Speaker 4
Vowel     Train   Test     Train   Test     Train   Test     Train   Test
UW        6       4        6       4        6       4        5       5
OW        7       5        8       4        6       4        6       5
IY        5       3        6       4        6       4        5       5
AE        5       2        8       4        6       4        5       5
AO        4       3        6       4        6       4        5       5
EY        5       4        6       4        6       4        5       5
AY        12      4        6       4        6       4        5       5

4.5.1 Classification Experiment

The purpose of this experiment is to investigate whether reliable vowel recognition can be obtained when the phoneme classifier consists of HMMs trained on the utterances of dysarthric speakers themselves. To achieve speaker independence, the vowel HMMs are trained on the training sets pooled across all four speakers. The test utterances of each of the four speakers are then passed through this phoneme classifier trained on dysarthric speech, and the classification results are used to deduce a confusion matrix for each speaker.
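The thesis implementation used HTK and MATLAB toolboxes for HMM training and testing. Purely as an illustrative sketch of the train-and-classify loop described above, the same idea can be written with Python and the hmmlearn library (both assumptions, not the original toolchain); the train_data dictionary of per-vowel MFCC feature arrays is assumed to have been prepared in advance.

```python
import numpy as np
from hmmlearn import hmm  # assumed stand-in for the HTK tools used in the thesis

VOWELS = ["UW", "OW", "IY", "AE", "AO", "EY", "AY"]

def train_vowel_hmms(train_data, n_states=3):
    """train_data: dict mapping each vowel label to a list of MFCC arrays
    (one (n_frames x n_features) array per utterance), pooled across the
    four speakers' training sets."""
    models = {}
    for vowel in VOWELS:
        X = np.vstack(train_data[vowel])                   # stack all frames
        lengths = [len(seq) for seq in train_data[vowel]]  # per-utterance frame counts
        # Fully connected Gaussian HMM (hmmlearn's default topology; the thesis
        # used a left-to-right topology), re-estimated with Baum-Welch.
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[vowel] = model
    return models

def classify(models, utterance):
    """Assign a test utterance to the vowel HMM with the highest log-likelihood."""
    return max(VOWELS, key=lambda v: models[v].score(utterance))
```

Each test utterance is scored against all seven models, and the confusion matrices below tally how often each spoken vowel is assigned to each model.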
Table 4.34. Confusion matrix for Speaker 1 (shaded cells indicate reliable vowel sound).

Table 4.35. Confusion matrix for Speaker 2 (shaded cells indicate reliable vowel sound).

Table 4.36. Confusion matrix for Speaker 3 (shaded cells indicate reliable vowel sound).

Table 4.37. Confusion matrix for Speaker 4 (shaded cells indicate reliable vowel sound).

For Speaker 1, the vowel phoneme OW is the only one that is well recognized. The phonemes OW, IY and AO are never confused with each other; however, IY and AO do not have high recognition rates, and so they cannot be considered reliable vowel sounds. For Speaker 2, the vowel phonemes OW, IY and AE are reliably recognized and are never confused with each other. Similarly, for Speaker 3 it is OW, IY and AO, and for Speaker 4 it is IY and AO, that can be considered reliable vowel phonemes.

4.5.2 Conclusion

In the above experiment, we trained a classifier on the dysarthric speech utterances pooled across the four speakers. We were able to obtain reliable recognition of some vowel phonemes for all speakers except Speaker 1. There is tremendous variability in dysarthric speech, primarily due to the speakers' articulatory imprecision. This inconsistency makes it difficult to train a recognition system that can reliably recognize phonemes for a large population of dysarthric speakers.

5 Conclusions and Future Research

5.1 Conclusions

The main goal of this research was to evaluate the feasibility of using ASR techniques to obtain reliable recognition of dysarthric vowel utterances, with the long-term goal of incorporating such a vowel recognizer into AAC and PC-based devices. It is very difficult to achieve high recognition accuracy for words or utterances spoken by severely dysarthric individuals, mainly because of the inconsistency of dysarthric speech. A different perspective is to identify vocalizations that can be used as reliable sounds for the recognition task. A “reliable sound,” in the context of this research, is one that the recognizer can consistently discriminate from the other vocalizations. To test this hypothesis, only vowel phoneme utterances obtained from four dysarthric speakers at the Madonna Rehabilitation Hospital, Nebraska, were used for evaluation purposes.

The experimental results obtained from the phoneme-based vowel recognizer (without a LM) trained on normal speech indicate that, for each of the speakers, at least two vowel phonemes can be identified as reliable vocalizations. The next task was to investigate whether the addition of language modeling information to the phoneme recognizer could increase the reliability of vowel recognition. For this purpose, a bigram LM was incorporated into the recognition task. The results from the bigram LM recognition task imply that including language information does increase vowel recognition accuracy. In addition, the number of reliable vowel phonemes obtained for each speaker is larger than that obtained from the recognition task without a LM. However, it is not possible to build a speaker-independent language modeling framework representing a large dysarthric speaker population.
It is not possible for a single LM to take into account all the variability associated with dysarthric speech, so the benefits of language modeling are obtained by building a speaker-dependent LM. The main advantage of language modeling is that it enables us to identify more control vocalizations that can be used as access controls for an AAC device. For example, for Speaker 1 the baseline recognition system (without LM) identified three vowel sounds (UW, OW and AE) as reliable access sounds, which means a maximum of nine keys can be accessed using a row-column access method. The bigram LM implementation, however, gives four reliable vowel sounds (OW, AE, EY and AY), so a maximum of 16 keys can be accessed using the row-column access method.

(a) Without LM:
                              Column control vocalizations
                              UW       OW       AE
Row control       UW         Key 1    Key 2    Key 3
vocalizations     OW         Key 4    Key 5    Key 6
                  AE         Key 7    Key 8    Key 9

(b) With LM:
                              Column control vocalizations
                              OW       AE       EY       AY
Row control       OW         Key 1    Key 2    Key 3    Key 4
vocalizations     AE         Key 5    Key 6    Key 7    Key 8
                  EY         Key 9    Key 10   Key 11   Key 12
                  AY         Key 13   Key 14   Key 15   Key 16

Table 5.1. Row-column access for Speaker 1 using reliable vowel phonemes from (a) the recognition task without LM and (b) the recognition task with LM.

Table 5.1 shows the construction of a row-column access keypad using the reliable vowel phonemes obtained from the recognition tasks for Speaker 1.
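To make the row-column access idea of Table 5.1 concrete, the sketch below maps an ordered pair of reliable vocalizations (a row selection followed by a column selection) to a key index. Python and the key labels are illustrative assumptions; the actual AAC interface design is beyond the scope of this work.

```python
RELIABLE_VOWELS = ["OW", "AE", "EY", "AY"]   # reliable set for Speaker 1 with the bigram LM

def build_keypad(vowels):
    """Map (row vocalization, column vocalization) pairs to key labels,
    as in Table 5.1: n reliable sounds give access to n * n keys."""
    n = len(vowels)
    return {(row, col): "Key %d" % (i * n + j + 1)
            for i, row in enumerate(vowels)
            for j, col in enumerate(vowels)}

keypad = build_keypad(RELIABLE_VOWELS)
# A row vocalization of EY followed by a column vocalization of AE selects "Key 10",
# matching Table 5.1(b).
print(keypad[("EY", "AE")])
```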
An additional recognition experiment was carried out using a vowel recognizer trained on the dysarthric speech of the four participants in the study. The dysarthric speech database was partitioned into training and test sets for this experiment. The goal was to investigate whether reliable vowel phonemes could be obtained from this recognition task. The results indicated that, although it was possible to obtain reliable vowel phonemes for each of the speakers, the results were not consistent with those obtained from the vowel recognizer trained on normal speech. The training utterances in the dysarthric speech database do not provide a good representation of their respective vowel sounds. As a result, the HMMs used in the phoneme recognizer trained on dysarthric speech are not as well modeled as the HMMs used in the phoneme recognizer trained on normal speech. However, if we are interested only in the feasibility of a recognizer trained on dysarthric speech, then it is possible to obtain reliable vowel phonemes.

5.2 Future Work

The scope of this work was limited to testing the feasibility of using ASR techniques for reliable recognition of dysarthric speech. The long-term goal is to use these reliable sounds as control triggers for an array of AAC devices, including the personal computer (PC). All the conclusions and results obtained in this research were derived from the four dysarthric speakers selected to participate in this study; more reliable statistics to validate these conclusions could be obtained by performing similar experiments over a larger population of dysarthric individuals. Another improvement would be to use words built around the reliable vowel sounds as access triggers. For example, if ‘OW’ is a reliable sound, words like ‘boat’ and ‘open’ built around this vowel phoneme could be used as control triggers. Of course, this requires more sophisticated algorithms to implement vowel spotting within a given word. Further, the performance of a context-dependent phoneme-based vowel recognition system could be evaluated to investigate which words can be most reliably used as control triggers in the AAC device.