This is to certify that the dissertation entitled AUTOMATED EVALUATION OF LANGUAGE MODELS BASED ON RECEIVER-OPERATOR-CHARACTERISTIC ANALYSIS, presented by Yin-Pin Yang, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering. Major professor. LIBRARY, Michigan State University.
AUTOMATED EVALUATION OF LANGUAGE MODELS BASED ON RECEIVER-OPERATOR-CHARACTERISTIC ANALYSIS

By Yin-Pin Yang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Department of Electrical Engineering, 1996

ABSTRACT

The problem of designing and evaluating a language model is formulated as an optimization task parameterized by the trade-off between coverage and over-generation with respect to the task language. These parameters parallel the detection and false-alarm rates in signal detection theory. Currently successful language models, such as n-gram models and context-free grammars, provide no systematic approach to design and evaluation other than direct experimentation. Further, current model designs do not take into account the effects of the quality of the acoustic models in the system. A language model that is optimized to maximize recognition may not, in fact, optimally represent the grammar of the task language. In this research, the recognition performance of a language model is statistically characterized by specific mathematical quantities that can be optimized and evaluated similarly to receiver-operator-characteristic (ROC) analysis used in detection theory. The design process accounts for the quality of the acoustic models in the system and presents a graphical ROC-like display of language model performance without the need for extensive experiments with the recognition system.
Specifically, analogously to ROC analysis, the performance of a speech recognizer in terms of recognition rate and computational cost can be illustrated in a two-dimensional plane of coverage (of the task language) and over-generation (of out-of-task-language sentences), resulting in a performance contour plot. A given language model is parameterized by coverage and over-generation and is represented by a point in the ROC plane. While adjusting certain design parameters of the language model (e.g., amount of training, weights of sublanguages, etc.), the designer can readily observe the motion of the point and can "drive" the model in the direction of better performance. X-window-based software called the ROC-LM design tool is presented in this paper. The tool is quite general and can be used to design and evaluate all commonly used language models. Example case studies are presented in the paper.

Copyright by YIN-PIN YANG 1996

To my parents, Jong-Hsiang Yang and Re-Chu Chang, and my wife, Wen-Yu Hsu.

Acknowledgments

First and foremost, I would like to thank my thesis advisor, Professor John R. Deller, Jr., for his endless encouragement, patient guidance and generous support throughout my Ph.D. pursuit at Michigan State University. Without his help, this work could not possibly have been finished. I would like to thank all the members of my Ph.D. guidance committee, Professor Marvin Siegel, Professor Majid Nayeri (Department of Electrical Engineering), and Professor Jonathan I. Hall (Department of Mathematics), for their time and effort in discussing and reviewing my thesis. I would like to express my gratitude to my parents for all their support. I also owe a great deal to the friends I met here at Michigan State University. Most of all, I wish to thank Professor James Gallagher (Department of Education), who is a very good friend of my father, for his kind and sincere words whenever I needed help.
A very special word of thanks to my wife, Wen-Yu Hsu, for all her love, patience, encouragement and understanding over the past three and a half years. We are expecting our first baby, who is due in August. We wish the best for a new life.

Contents

List of Figures

1 Introduction and Background
 1.1 The Significance of Using ROC Analysis in Language Model Design
 1.2 Literature Review and Contributions of This Research
 1.3 Descriptive View of Language Modeling

2 Problem Formulation
 2.1 Formal View of the Recognition and Language Modeling Problems
 2.2 Recognition Problem
 2.3 LD Problem
 2.4 Integration of the LD Problem and Focus on the Decision Processes
 2.5 ROC Analysis of the "Language Detector"

3 Development of the ROC-LM Design
 3.1 Channel Probabilities in the ROC Analysis of the LM Performance
 3.2 Recognition Rate
 3.3 Representing Recognition Rate by Coverage and Over-Generation
 3.4 Modeling the Probability Density Functions
 3.5 Issue of Sentence Length
 3.6 The Weight, w, and Relative Motion of LMs
 3.7 The Scheme of the ROC-LM Design

4 The ROC-LM Design Tool
 4.1 Overview
 4.2 The ROC Display Window, ROC-WINDOW
 4.3 The ESM-RR Block
 4.4 The EVL-LM Block
 4.5 The Control Panel Block

5 Implementation and Design Cases
 5.1 The Acoustic Environments
  5.1.1 Acoustic Settings
  5.1.2 TI-46 and TIMIT Database
 5.2 Introducing the TLs
 5.3 LM Design Cases
  5.3.1 Case I: To Determine the Required Number for Training N-Gram for an Alphabet Recognizer
  5.3.2 Case II: A Hybrid Model of Bigram and Trigram and Concerns About Computational Cost
  5.3.3 Case III: Adaptive Language Modeling

6 Conclusions and Future Work
 6.1 Summary and Concluding Remarks
 6.2 Contributions
 6.3 Future Work

Appendix A: The Case of Continuous Speech Recognition

Bibliography

List of Figures

2.1 Formalizing the recognition problem and the embedded "LD problem" into a conventional signal detection framework.
2.2 An illustration of formalizing the LD problem into a conventional signal detection framework.
2.3 A conceptual illustration of how the LD model can be treated as a grammar observer. First, a practical evaluation methodology can be thought of as hypothesizing all the possible word strings (sentences) first, then performing the acoustic and grammatical evaluations.
2.4 The block diagram of the operation of the decision rule δ_R.
3.5 Illustration of the four channel probabilities: detection rate, false-alarm, miss rate and correct rejection.
3.6 Illustration of the trade-off between coverage and over-generation. More sentences are covered by the LM, but at the cost of increased perplexity of the LM. Consequently, the recognition rate is decreased and the computation cost is increased.
3.7 (a) An example of how f(x_a | d = 0) is constructed, when L = 4 and k = 2. (b) An example of how f(x_{a,0} | d = 3) is constructed, when L = 4, k_0 = 2 and k_1 = 0.
3.8 An illustration of a changing contour plot with LMs unchanged. The shift of a contour plot will make no difference to any of the LMs on the plane.
However, the rotation will result in different increments or decrements which depend on the location.
4.9 A diagram of the relationships among the four blocks: the ROC display window, the Estimation of Recognition Rate, the Evaluation of LM, and the Control Panel and Designer Interface.
4.10 A diagram of the data structure for storing word concatenation information from training context.
4.11 The layout of the ROC-LM Design Tool.
5.12 By clicking on the button ACOU PDF, the two Gaussian p.d.f.'s of acoustic level evaluations are shown on the ROC-WINDOW.
5.13 The bigram performance when the slider PRM #1, representing the number of sentences in training context, is moved from 100, 190, through 550 with PRM #2, the random seed, as 177. A trajectory marked by circles is formed. The random seed is then changed to 1,997, and PRM #1 moves backward from 550 to 100. Another trajectory marked "X" is made.
5.14 The trigram performance when the slider PRM #1, representing the number of sentences in training context, is moved from 100, 190, through 1,000 with PRM #2, the random seed, as 177. A trajectory marked by circles is formed. The random seed is then changed to 1,997, and PRM #1 moves backward from 1,000 to 100. Another trajectory marked "X" is made.
5.15 The comparison of the bigram performance and the trigram.
5.16 After initializations and rescaling, the location of BiTri-Gram and the contour plot is shown. Notice that the newly estimated G-set = {5.61, 1.44, 1.70, 1.71} is directed to its corresponding field on the ESM-RR block.
5.17 When the new contour plot is drawn, the old LM (marked "X") is transformed to a relative position with respect to the new contour plot.
5.18 Move the threshold t_g from 16.0, 13.2, 10.4, 7.6 through 6.2. BiTri-Gram moves toward the criterion line and crosses it at t_g = 7.6.
However, the price for the computational improvement is the decrement of recognition rate from 86.2% to 83.4%.
5.19 PRM #2, representing the size of TrainSet II, is set to 0 at first. Then PRM #1, which is the size of TrainSet I, is moved from 1000, 1900, 2800, 3700, 4600 through 5500. PRM #2 is then changed to 400, and the above is repeated. Another trajectory appears on the left side of the previous one. On this trajectory, when the size of TrainSet I is 3700, the recognition rate reaches its maximum.

Chapter 1

Introduction and Background

1.1 The Significance of Using ROC Analysis in Language Model Design

The objective of this research was to develop a quantified strategy for adjusting parameters in a language model (LM), in order to optimize a general performance function in a speech recognizer. The performance function consists of the recognition rate and the computational cost incurred in using the task language. A novel approach, the ROC-LM design scheme, emulating receiver-operating-characteristic (ROC) analysis [1], [2], is presented to achieve this objective. The main idea is that a LM is fundamentally quantified by two parameters, "coverage" and "over-generation," paralleling hit-rate and false-alarm rate in ROC analysis. In detection problems, ROCs depict detection rate versus false-alarm rate for various values of the signal-to-noise ratio. Analogously to ROC analysis, the performance of a LM in terms of recognition rate and computational cost may be expressed in a plot of coverage versus over-generation. In this study, a speech recognition system is modeled as a detection system with an N-interval forced choice (NIFC) and two-observer (acoustic and grammar observer) task.
Within this framework, given a LM, the recognition rate of a speech-recognition system can be calculated from the statistical model of acoustic evaluation, the perplexity of the task language, and a quantified characterization of the LM, instead of by direct, time-consuming recognition experiments. As a subsystem of the recognizer model, the LM performance can be characterized in terms of a classical binary-signal-detection-in-noise problem. It is this subsystem that is the principal focus of the work. It should be emphasized that acoustic information is incorporated in the LM design. Using the binary-detection analogy, the performances of various LMs can be analytically compared according to their coverage and over-generation. In addition, given a strategy for creating a LM, and given that the coverage and over-generation can be adjusted by changing certain parameters in this model, a trajectory can be drawn on the chart by varying one of these parameters. The designer can thereby determine appropriate values for the parameters.

1.2 Literature Review and Contributions of This Research

Currently successful approaches to language modeling can be divided into the following three main categories: context-free (type 2 in the Chomsky hierarchy [4]) [5]-[10], N-gram (Markov model, regular or finite-state, type 3) [11]-[14] and automatic-learning finite state (ALFS, also type 3) [15]-[19]. With prior syntactic knowledge being applied through the production rules, the context-free grammar provides better performance for word strings that obey the grammar rules. However, the drawback is that the grammar is so rigid (inflexible) that it often involves complex interactive parsing methods. Meanwhile, an N-gram grammar is relatively adaptable, but it suffers from the mismatch of the modeled language and the natural language, and often over-generates meaningless hypothetical sentences, resulting in wasted computation time.
Finally, the ALFS grammar is aimed at retaining adaptability and simultaneously avoiding the mismatch problem. However, the current ALFS algorithm is only successful on recognition tasks with low perplexity [18]. Another important category is adaptive language modeling [20], [22], [23], in which LMs tend to "adapt" to particular sublanguages. When a large heterogeneous language source, which is composed of smaller, more homogeneous "sublanguages," such as newspaper articles, is used for training, an adaptive LM is capable of focusing on certain sets of vocabularies and grammar rules which are related to particular subjects. An adaptive LM is thus powerful because it uses the immediate history of context so that the LM hypothesizes sentences more precisely. This research is concerned with three issues:

1. Current speech recognition research focuses only on the performance of an individual LM rather than the overall performance of a speech recognizer. The accuracy of acoustical evaluation and the properties of the task language are not involved in the training processes of a language model. In this study, the overall speech recognition behavior is mathematically modeled. Therefore, given acoustical evaluation and task language information, the criterion optimized by the LM is the recognition rate rather than entropy [20], which is commonly used in language modeling problems.

2. The trade-off between "flexibility" of a LM and recognition performance has been noted by many speech researchers [18]. However, there has not been a mathematical formulation of the phenomenon. In this work, an arbitrary LM is parameterized by two values, coverage and over-generation¹, that quantify the trade-off. The LM is then detached from the overall recognition system and communicates with the remaining part, mainly acoustic evaluations, through the two parameters.
Given requirements of a recognition task, such as recognition rate and computational cost, the required coverage and over-generation are specified. The LM is then designed to achieve these specifications. Consequently, the language modeling problem is formalized as a design problem.

3. In recent years, there has been considerable interest in the use of adaptive LMs [16], [20]. The proposed ROC design scheme can be used to quantitatively adjust the "adaptability" to given sublanguages. Weights can be assigned to each of the sublanguages to indicate their importance. Thus, different requirements of recognition correctness and computational cost can be specified by an optimization function that varies with application. Consequently, a LM can be trained to be adaptive as required by the performance function.

¹Coverage and over-generation will be defined in later chapters.

1.3 Descriptive View of Language Modeling

Roughly speaking, the syntactic grammar (or syntax) of a speech recognizer is a set of rules by which words are properly combined to form a sentence [3]. The language associated with that syntactic grammar is the set of all possible sentences which are produced by the grammar. The term "language model" is often used to indicate the software system that generates the language (sentences hypothesized as the spoken one) in accordance with the syntactic rules. In a top-down search algorithm, the LM is used to hypothesize a search space consisting of all valid sentences according to the syntax. Each of the sentences in the search space corresponds to a string of word models. In an isolated-word recognizer, by hypothesizing the acoustic word boundaries, each word model can then be evaluated against the corresponding segment of the acoustic symbol string (see [3]) of the utterance.
An evaluation algorithm then produces a likelihood for each word model string, and the word string with the highest total likelihood is selected as the recognized sentence. The main purposes of the LM in a speech recognizer are:

1. To narrow the search space to reduce computational cost. For a recognition system with vocabulary size N_V and an uttered sentence with l_max words in length, if no grammar is involved, the size of the LM search space is N_L = N_V^{l_max}. However, if a grammar with perplexity² P is employed, the size of the search space is reduced to P^{l_max}. P is normally smaller than N_V, so P^{l_max} is consequently much smaller than N_V^{l_max}, and the computation is reduced dramatically. For example, with N_V = 26 and l_max = 5, the unconstrained space contains 26^5 ≈ 11.9 million word strings, while a grammar with perplexity P = 8 reduces this to 8^5 = 32,768.

2. To improve recognition performance. The recognition of a sentence benefits from the knowledge of allowable word concatenations. For instance, in a speech recognizer based on an N-gram statistical model, the sentences "This is a book" and "This eats a book" may both be hypothesized by the model. However, since the frequency of the concatenation between "this" and "is" is higher than that between "this" and "eats," the former sentence is more probable.

²Speech researchers use perplexity as a measurement of the size of a recognition task. Roughly speaking, P is the average "branch factor" at any decision point in the language graph. A precise definition can be found in [3].

In the next chapter, LMs are formalized from the viewpoint of the detection task.

Chapter 2

Problem Formulation

In this chapter, we set forth the basic formalisms and notation to be used throughout the dissertation.

2.1 Formal View of the Recognition and Language Modeling Problems

A speech recognition system is to be designed. Let the vocabulary, V, be an exhaustive set of N_V = |V| words that potentially arise in the recognition task. For convenience, let one element of V be a "null" word.
Assuming that the maximum sentence length considered in the task is l_max < ∞, there are on the order of N_L = N_V^{l_max} strings of words (or sentences) that can be formed from the set V. Let us denote these sentences by L = {s_1, s_2, ..., s_{N_L}}. Further, for each s_i, there will be a (finite) number of acoustic representations ("pronunciations") of that sentence. Let us denote the entire set of pronunciations of sentences (over all sentences) by A = {a_1, a_2, ..., a_{N_A}}. This set also needs to be associated with the acoustic strings which, in general, may be shared by numerous sentences. From a set-theoretic point of view, the pronunciations and the acoustic strings are in one-to-one correspondence. We begin by structuring the recognition problem and the embedded "language detection problem" into a conventional signal detection framework. The discussion can be best understood with reference to Figs. 2.1 and 2.2.

2.2 Recognition Problem

The underlying probability space is (Ω, 2^Ω, P), where P reflects the "true" distribution of pairs of sentences and acoustic strings in the task. The sample points of Ω are ω_ij, and

P(ω_ij) = P(s_i, a_j) = probability of sentence s_i "pronounced as" acoustic string a_j in the task language. (2.1)

Formally, when a sentence s_i is spoken by the user of the device using "pronunciation" a_j, the point ω_ij = (s_i, a_j) is selected from Ω with probability P(ω_ij). The point ω_ij is then analyzed by the "transmitter" using a source classifier:

c_R(ω_ij) = s_i. (2.2)

The underscore below c_R indicates that this quantity is a random variable³. We have

c_R : (Ω, 2^Ω) → (L, 2^L). (2.3)

The space (L, 2^L) is also the classification space (or "parameter" space) for the detection problem. At the "receiver" (recognizer), the observation is made:

a(ω_ij) = a_j. (2.4)

The observation is a random variable mapping,

a : (Ω, 2^Ω) → (A, 2^A). (2.5)

The acoustic space, therefore, plays the role of the observation, or measurement, space in this development.
Based on the value of a, the recognition decision is made:

δ_R(a_j) = argmax_k { ln Q(s_k) Q(a_j | s_k) } = ŝ, (2.6)

where Q(·) is defined below. δ_R is then a random variable mapping,

δ_R : (Ω, 2^Ω) → (L, 2^L). (2.7)

³Strictly speaking, c_R, δ_R, a, and similar mappings in this development should have domains corresponding to the integer indices of the sentences and acoustic strings. It is notationally useful, however, to allow this small conceptual abuse.

Figure 2.1: Formalizing the recognition problem and the embedded "LD problem" into a conventional signal detection framework.

Let us denote by l the decision statistic used in the decision rule δ_R,

l(a_j, s_k) = − (ln Q(a_j | s_k)) / |a_j| − (ln Q(s_k)) / |s_k|. (2.8)

For future purposes, we note that l can be decomposed into a negative log likelihood (NLKHD), l_a, associated with the acoustic model information, and a negative log probability, l_g, associated with the language model⁴:

l(a_j, s_k) = l_a(a_j | s_k) + l_g(s_k) = ( − ln Q(a_j | s_k) / |a_j| ) + ( − ln Q(s_k) / |s_k| ). (2.9)

Where appropriate, the statistics l, l_a and l_g may also be modeled as random variables that depend on the probability structure of the problem. Next, the "evaluator" will compute rates of proper operation and penalties associated with errors. The cost function φ simply determines the correctness of the outcome:

φ : L × L → B, (2.10)

where

φ(s_i, s_k) = 1, if i = k; 0, if i ≠ k. (2.11)

The recognition rate for the system is given by

p_R = Eφ. (2.12)

⁴The subscript g denotes "grammar."

The rate, or further cost criteria, may be used to adjust the parameters of the recognition models. The measure Q may be viewed as a "noisy probability measure" over Ω, learned by training of the system with finite data. It represents an attempt to estimate the "true" (task) distribution P.
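The decision rule δ_R and the decomposition l = l_a + l_g can be made concrete with a small sketch. This is not the dissertation's software: the two-sentence hypothesis set, the Q values, and the lengths |a_j| and |s_k| below are invented for illustration, and δ_R is implemented as a minimum over the length-normalized negative log statistic.

```python
import math

# Toy "noisy" trained grammar probabilities Q(s) over a two-sentence model language
Q_s = {"this is a book": 0.6, "this eats a book": 0.4}

# Toy acoustic likelihoods Q(a|s) for one observed acoustic string a_j
Q_a_given_s = {"this is a book": 0.30, "this eats a book": 0.05}

def nll_statistic(s, n_words, n_symbols):
    """Length-normalized decision statistic l(a_j, s_k) = l_a + l_g."""
    l_a = -math.log(Q_a_given_s[s]) / n_symbols  # acoustic NLKHD term
    l_g = -math.log(Q_s[s]) / n_words            # language-model term
    return l_a + l_g

# delta_R picks the hypothesis with the smallest total NLL, which is the
# same sentence that maximizes ln Q(s_k) Q(a_j|s_k) for these lengths
scores = {s: nll_statistic(s, n_words=4, n_symbols=10) for s in Q_s}
recognized = min(scores, key=scores.get)
print(recognized)  # "this is a book": better in both l_a and l_g
```

With these numbers the first sentence wins on both the acoustic and the grammar term; in general the two terms can disagree, which is exactly the interaction between acoustic quality and LM quality that the ROC-LM analysis quantifies.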
In the language space L, there are two critical sets associated with the distributions P and Q. The task language, L_T ⊆ L, is defined as the set

L_T = { s_i | P(s_i) ≠ 0; s_i ∈ L }, (2.13)

whereas the model language, L_M ⊆ L, is defined as the set

L_M = { s_i | Q(s_i) ≠ 0; s_i ∈ L }. (2.14)

Therefore P(L_T) = 1 and Q(L_M) = 1. For the convenience of further discussions, we also define

L_MT ≝ L_M ∩ L_T, (2.15)

L̄_T ≝ L_T − L_MT, (2.16)

L̄_M ≝ L_M − L_MT. (2.17)

The recognition problem may be viewed as an "N_L-ary" detection problem. For the hypothesis structure,

U_i : c_R = s_i, i ∈ [1, N_L] (i.e., s_i is uttered). (2.18)

For the decision structure:

R_k : δ_R = s_k, k ∈ [1, N_L] (i.e., δ_R responds with s_k). (2.19)

2.3 LD Problem

The purpose of the "language detection" (LD) problem is to characterize the expediency of the hypothesizing of a LM. The LD is not actually performed during the recognition phase. Once a LM is modeled by an evaluation statistic, the same set of hypothesis sentences with their grammar likelihoods is produced for each recognition trial. In the LD problem, the initial probability space can be taken as L as generated in the problem above. The sets L_T and L_M are also subsets of L as created above. A testing sentence s_i, which is considered as a sample point from L with probability P(s_i), can be selected from either outside or inside L_T to evaluate the LM. Note that in the LD problem, sentences outside of L_T, i.e., those with P(s_i) = 0, may be "selected" for attempted detection. However, only those sentences in the set L̄_M play any significant role in this regard. In terms of detection theory, the testing sentence s_i is then analyzed by the "transmitter" (speaker) using a source classifier:

c_L(s_i) = 1_T(s_i) = 1, if s_i ∈ L_T; 0, if s_i ∉ L_T, (2.20)

Figure 2.2: An illustration of formalizing the LD problem into a conventional signal detection framework.
where 1_T is the indicator variable for the event L_T. We have

c_L : (L, 2^L) → (B, 2^B), where B is the set of binary numbers, {0, 1}. (2.21)

The space (B, 2^B) is the classification space (parameter space) for the detection problem. At the "receiver" (detector), the observation is the sample sentence itself:

a_L(s_i) = s_i. (2.22)

The decision is then made by:

δ_L(s_i) = 1_M(s_i) = 1, if s_i ∈ L_M; 0, if s_i ∉ L_M, (2.23)

where 1_M is the indicator variable for the event L_M. The decision random variable is

δ_L : (L, 2^L) → (B, 2^B), (2.24)

where (L, 2^L) and (B, 2^B) are as defined above. It is observed that (L, 2^L) plays the role of both the classification and decision spaces in the recognition problem, while (B, 2^B) plays these roles in the LD problem. The decision processes are further described below. The cost (reward) function is the product

φ_L(c_L, δ_L) = c_L × δ_L, (2.25)

and the expectation of φ_L is exactly the LD rate, that is,

p_L = Eφ_L. (2.26)

Note that for the LD problem, we have inherently specified the following hypothesis structure (letting s_i ∈ L be a testing sentence in a given trial):

H1 : c_L = 1 (i.e., s_i ∈ L_T), (2.27)
H0 : c_L = 0 (i.e., s_i ∉ L_T), (2.28)
D1 : δ_L = 1 (i.e., s_i ∈ L_M), (2.29)
D0 : δ_L = 0 (i.e., s_i ∉ L_M). (2.30)
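The LD mappings c_L and δ_L, the reward φ_L, and the LD rate p_L can be illustrated with toy sets. The sentences below are invented, and a uniform testing distribution over L_T is assumed purely for this sketch (the formulation allows any P supported on L_T):

```python
# Toy task language L_T and model language L_M (invented for illustration)
L_T = {"a b", "a c", "b c"}      # sentences with P(s) != 0
L_M = {"a b", "a c", "c c"}      # sentences with Q(s) != 0

def c_L(s):
    """Source classifier: indicator of L_T."""
    return 1 if s in L_T else 0

def delta_L(s):
    """LD decision rule: indicator of L_M."""
    return 1 if s in L_M else 0

def phi_L(s):
    """Cost (reward) function: product of truth and decision."""
    return c_L(s) * delta_L(s)

# p_L = E[phi_L], here under a uniform testing distribution over L_T
p_L = sum(phi_L(s) for s in L_T) / len(L_T)
print(p_L)  # two of the three task sentences are hypothesized by the LM
```

Here p_L = 2/3: the LM covers two of the three task sentences, while "c c", which lies in L_M but outside L_T, contributes over-generation rather than LD rate.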
2.1, without any specific operation related to the language detection problem. However, the LD problem may be formally integrated into the overall speech system in the following . way: Suppose that the speech recognizer is capable of hypothesizing a_ll s.- 6 L, not just there in LM. Each 3,- is subjected to a LD test “_6L(s,-)” and is then passed on to 63 (with corresponding acoustic observation a,) only if it passes the language test (6,13,) = I). This amounts using the LD scheme in Fig. 2.2 without any probability structure, prior to performing recognition in Fig. 2.1. The LD test serves as a formal binary “blocking mechanism” that prevents untrained sentences from receiving consideration. The integrated mechanism is illustrated in Fig. 2.3. Once a sentence passes the LD test, it is subjected to evaluate and integrate the acoustic and language information in light of observation a,- and the structure embodied in the LM and acoustic models. This process is illustrated in Fig. 2.4. 18 All Possible Word Strings Prune <—<§l>Gl-ammar T Decrsnon Grammatical Evaluation 1 Acoustic Evaluation Figure 2.3: A conceptual illustration of how the LD model can be treated as a gram- mar observer. First, a practical evaluation methodology can be thought as hypothe- sizing all the possible word strings (sentences) first then performing the acoustic and grammatical evaluations. In evaluating the set of total likelihood measures over LM for a given a,-, two assumptions are made in this work: Assumption 2.1 In the present research, the stack is assumed large enough so that stack size does not affect recognition results. In other words, the failure of a correct sentence to be recognized can only be due to the poor acoustic evaluation or the pruning by the LM. The theories developed in this research do not consider the effects of stack size. 
For a recognition system with a high-perplexity LM and poor acoustic evaluations, it is not appropriate to assume the recognition results are independent of stack size.

Assumption 2.2 Word boundaries of the uttered acoustic symbol string are known, so that the acoustic symbol string of a sentence can be correctly divided into l segments, where l is the length of the uttered sentence.

Therefore, the theories created in the following can only be applied in present form to an isolated-word recognition (IWR) system or a continuous-speech recognition (CSR) system with a powerful word-boundary detector.

Figure 2.4: The block diagram of the operation of the decision rule δ_R.

The two assumptions above restrict the immediate applicability of this work to cases of unlimited stack size and known word boundaries (USS-KB). This study can be extended by noting that the pruning due to an insufficient stack and the performance of segmentations can be mathematically modeled given certain corresponding statistics. This study emphasizes the USS-KB case to prove the validity and function of the proposed methodologies. Once this fundamental goal is achieved, these more realistic cases will be considered in future work.

2.5 ROC Analysis of the "Language Detector"

The LD problem is formulated as a convenient, quantified means for assessing the quality of a LM. Currently, a LM can only be assessed using experimentation and simulation.
Further, the prevailing (and rational) approach is to attempt to make p_L as large as possible by sufficient training (make L_M and the TL match as closely as possible). While this approach does not always maximize p_R, there has been no systematic approach to do otherwise. Two factors affect the relationship between p_R and p_L. They are N_M and the accuracy of the likelihood evaluation l(a_j, s_k), s_k ∈ L_M. p_L is usually improved with increases in N_M. In this case, if the probability of l(a_j, s_i) obtaining the best score among all the evaluations in L_M is relatively sensitive to N_M, which implies a bad acoustic evaluation performance, p_R may not be improved by improving p_L. On the other hand, if the likelihood evaluation is extremely powerful and totally insensitive to N_M, p_R can be improved by improving p_L. In the case that N_M is fixed (unchanged with respect to some adjustable parameters in the LM), p_R is improved when p_L is improved. Therefore, the design of a LM requires the integration of both sources of acoustic and language information. Usually, the acoustic space is fixed to designers when considering a speech recognition task. Hence, this research emphasizes the relationships among p_R, p_L and N_M. These are precisely the factors related in the ROC analysis of the LD problem. In the remaining chapters, we will focus on the LD problem as a means to assess the effects of the LM on recognition and to change the recognizer accordingly. The significance of analyzing the recognition problem via LD and ROC analysis is the following:

1. The recognition rate can be directly determined for various parametric permutations of the LM rather than obtained from simulations.

2. LMs are parameterized by two measurable values, "coverage" and "over-generation." Given a performance function, the performance of each coverage-over-generation pair is evaluated and a contour plot is drawn on the ROC-LM design chart.
Once the chart is obtained, the designer can easily compare and design LMs. There are two applications of the proposed ROC design scheme:

1. For comparison purposes: For example, for the same vocabulary size, a trigram model is known to have lower perplexity than a bigram model. However, the bigram model includes more grammatically correct sentences than the trigram (better coverage). The ROC-LM design chart can quantitatively show the trade-off and determine which model is better given a recognition requirement.

2. For design purposes: Suppose there are several designable parameters in a LM; the ROC-LM design chart can be used to adjust these parameters. For instance, a LM which successfully combines trigram and bigram [20] needs weights that optimize the task requirements. On the ROC-LM design chart, the direction toward better performance is clearly shown, so that the designer can readily make adjustments.

Chapter 3 Development of the ROC-LM Design Tool

In this chapter, the ROC-LM design tool is developed by starting from a simplified task in which

1. Recognition rate is the only measurement of recognition performance;

2. All sentences in the TL have the same length in words.

The ROC-LM design tool can be divided into two parts. Part one is the recognition environment, which is concerned with the performance of the acoustic evaluation and the properties of the TL. These are the factors that are assumed to be fixed once the LM design/analysis begins. Part two is concerned with the LM. These two aspects are actually interactive, communicating with each other through two parameters, coverage and over-generation (defined below), which are the two axes of the ROC-LM plane.

In the ROC-LM design, LMs are characterized by coverage and over-generation. One point on the ROC plane, which represents a (coverage, over-generation) pair, corresponds to a LM.
The design starts with measuring (or parameterizing) the LM individually by coverage and over-generation. Next, performance measures, such as recognition rates and computational cost, are estimated according to the given acoustic environments for each of the grid points on the ROC plane. Contour plots parameterized by the performance measures can then be drawn on the ROC plane. On the other hand, a LM measured by coverage and over-generation is located on the contour plot. The designer can easily observe the "motion" of the model when some of its parameters are modified, and "move" the LM to approach the design goal.

In the following, all recognition behaviors, for example, the estimation of recognition rates, the measurement of LMs and the performance of the acoustic-level evaluations, are mathematically modeled. The ROC-LM design scheme is then described at the end of the chapter.

3.1 Channel Probabilities in the ROC Analysis of the LM Performance

Here we elaborate on the LD problem described in Chapter 2 and define some future notation and vocabulary. In terms of classical binary detection theory and the formulation of Ch. 2, we can define the following hypothesis structure and "channel transition" probabilities as related to the LD problem.

1. Correct language detection rate (hit rate), related to "coverage":
$$P(D_1|H_1) = P(\hat{c}_L = 1 \mid c_L = 1) \quad (3.1)$$

2. Language false-alarm rate, related to "over-generation":
$$P(D_1|H_0) = P(\hat{c}_L = 1 \mid c_L = 0) \quad (3.2)$$

3. Miss rate:
$$P(D_0|H_1) = P(\hat{c}_L = 0 \mid c_L = 1) \quad (3.3)$$

4. Correct rejection rate:
$$P(D_0|H_0) = P(\hat{c}_L = 0 \mid c_L = 0) \quad (3.4)$$

Figure 3.5: Illustration of the four channel probabilities: detection rate, false alarm, miss rate and correct rejection.

Figure 3.5 illustrates the set relationships among these four probabilities in the language space $L$. The idea of using ROC analysis in LM design originates from an interesting trade-off phenomenon.
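As a minimal sketch, the four channel probabilities above can be estimated from paired labels $(c_L, \hat{c}_L)$; the counts below are purely illustrative, not measurements from the dissertation's experiments.

```python
# Estimate the four channel probabilities of Eqs. (3.1)-(3.4) from paired
# labels: c_L = 1 if the sentence is truly in the TL, c_hat = 1 if the LM
# accepts it. The sample counts are purely illustrative.
def channel_rates(pairs):
    """pairs: list of (c_L, c_hat) tuples with values 0 or 1."""
    def rate(decision, hypothesis):
        given = [(c, d) for c, d in pairs if c == hypothesis]
        return sum(1 for c, d in given if d == decision) / len(given)
    return {
        "detection":         rate(1, 1),  # P(D1|H1), Eq. (3.1)
        "false_alarm":       rate(1, 0),  # P(D1|H0), Eq. (3.2)
        "miss":              rate(0, 1),  # P(D0|H1), Eq. (3.3)
        "correct_rejection": rate(0, 0),  # P(D0|H0), Eq. (3.4)
    }

pairs = [(1, 1)] * 90 + [(1, 0)] * 10 + [(0, 1)] * 20 + [(0, 0)] * 80
r = channel_rates(pairs)
print(r["detection"], r["false_alarm"])  # 0.9 and 0.2
```

Note that detection and miss rates sum to one, as do false-alarm and correct-rejection rates, since each pair conditions on the same hypothesis.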
A LM needs to be "flexible" ($L_M$ sufficiently large) so that the complete complement of sentences can be hypothesized during recognition. However, if the model is so flexible ($N_M \gg N_T$) that many sentences outside of $L_T$ are hypothesized, the computation cost is increased and the recognition rate is decreased.

Figure 3.6: Illustration of the trade-off between coverage and over-generation. More sentences are covered by $L_M'$, but at the cost of increased perplexity of the LM. Consequently, the recognition rate is decreased and the computation cost is increased.

Figure 3.6 illustrates the trade-off. The TL, $L_T$, is modeled by the LM, $L_M$. $L_M$ usually does not match $L_T$ ideally. In order to increase the percentage of $L_T$ covered, another LM, $L_M'$, is generated by a looser grammar that is derived from $L_M$ by adjusting certain parameters. Now more sentences in $L_T$ are covered by $L_M'$, but usually at the price that the over-generation of $L_M'$ is also increased. The increase of coverage and over-generation implies a likely increase of the perplexity of $L_M'$. Consequently, the correct sentence (if hypothesized) has less chance to be selected and recognized in the larger LM. The recognition rate is thus likely degraded and the computational cost increased.

Based on the trade-off described above, the recognition correctness is affected by the LM through the following two factors. One is whether the correct sentence can be hypothesized ("covered") by the model; the other is how many extraneous word strings (not in $L_T$) are hypothesized, so that the recognizer makes extra effort evaluating them. Meanwhile, the more hypothesized sentences there are, the more unlikely it is that the correct sentence obtains the highest likelihood, thus reducing the recognition rate. More quantified justifications will be addressed in the next sections.
Therefore, the performance of a LM is mathematically described by two parameters, "coverage" and "over-generation." The coverage, $C_M$, of the LM is defined as the probability that any sentence, say $s_i$, generated by $L_T$ ($s_i \in L_T$), that is, legally spoken by the user of the device, is detected as a grammatical sentence ($s_i \in L_M$), that is, hypothesized as a recognition candidate by the LM. In terms of the formalism of Chapter 2, we have

$$C_M \stackrel{\text{def}}{=} P(D_1|H_1) = P(\hat{c}_L = 1 \mid c_L = 1). \quad (3.5)$$

The over-generation, $O_M$, of the LM, however, is not directly defined as the false alarm of the LD problem shown in Eq. (3.2). Because of the way in which the LD problem is structured, $P(c_L = 0) = P(H_0) = 0$, so that $P(D_1|H_0)$ is not defined. We therefore define $O_M$ to be an "anti-miss rate" of $L_M$ versus $L_T$, in which the given condition is any sentence randomly chosen from $L_M$:

$$O_M \stackrel{\text{def}}{=} P(c_L = 0 \mid \hat{c}_L = 1) = P(H_0 \mid D_1). \quad (3.6)$$

In the later sections, we will find that the over-generation $O_M$ concerns only the total number of sentences in $L_M$, and is simply a normalized measurement of the size of $L_M$.

3.2 Recognition Rate

Figure 2.1 illustrates the mechanism by which a speech recognition task produces recognition rates, represented by the notations and definitions above. First, when a speaker utters a sentence $s_i$, one of the samples in $L_T$, $s_i$ is sent to the measurement space and the classification space in parallel. For the former, since $s_i$ is unknown to the speech recognition system, the measurement space evaluates all hypotheses in the sample space $L_M$, resulting in a vector of evaluation results. The decision space then selects a most-likely sentence, say $s_j$, from the result vector. On the other branch, the classification space sends the index of the uttered sentence, the correct answer, say $s_i$, to the cost function block. The cost function then compares the two indexes and responds with the recognition correctness, and the probability of correctness represents the recognition rate. From Eq.
(2.12), the recognition rate $p_R$ is the expectation of $\underline{q}$. As mentioned previously, if $s_i \notin L_{MT}$, it is impossible for the recognizer to respond with $s_i$, so that $P(\hat{s}_R = s_i \mid s_R = s_i) = 0$ and $\underline{q} = 0$. That is,

$$p_R = E\underline{q} = P(L_{MT})E(\underline{q}|L_{MT}) + P(\bar{L}_{MT})E(\underline{q}|\bar{L}_{MT}) \quad (3.7)$$
$$= P(L_{MT})E(\underline{q}|L_{MT}) \quad (3.8)$$
$$= P(L_{MT}) \sum_{s_i \in L_{MT}} P(\hat{s}_R = s_i \mid s_R = s_i)\, P(s_R = s_i \mid L_{MT}). \quad (3.9)$$

Now let $A_i$ be the set of acoustic strings corresponding to sentence $s_i$:

$$A_i \stackrel{\text{def}}{=} \{a_j \mid P(a_j, s_i) > 0\}. \quad (3.10)$$

Suppose by experimentation with the acoustic models we have determined that the statistic

$$\underline{x}_i = \frac{-1}{|a_j| \cdot |s_i|} \ln Q(a_j \mid s_i, a_j \in A_i)\, Q(s_i), \quad (3.11)$$

where $|a_j|$ and $|s_i|$ denote the length of $a_j$ in "symbols" and the length of $s_i$ in words, respectively, is distributed approximately according to the probability density function (p.d.f.)

$$\underline{x}_i \sim f_i(x), \quad x \geq 0, \quad s_i \in L_{MT}. \quad (3.12)$$

Further, experiments have shown that, for most task languages, the random variables $\underline{x}_i$, $i = 1, 2, \cdots, N_{MT}$, are approximately identically distributed. In this research, we assume the above is true and also assume that the $\underline{x}_i$ are independent. Hence, any one of the i.i.d. $\underline{x}_i$ may be characterized by the single distribution $f(x)$.

Similarly, we define $\bar{z}_i$ to represent an "incorrect" evaluation. Let

$$\bar{A}_i \stackrel{\text{def}}{=} \{a_j \mid P(a_j, s_i) = 0\}. \quad (3.13)$$

Again, in this research, we assume that the statistic

$$\bar{z}_i = \frac{-1}{|a_j| \cdot |s_i|} \ln Q(a_j \mid s_i, a_j \in \bar{A}_i)\, Q(s_i), \quad s_i \in L_M, \quad (3.14)$$

is distributed approximately identically for all $i$, and we also assume that the $\bar{z}_i$ are independent. Let the distribution $\bar{f}(z)$ characterize the i.i.d. $\bar{z}_i$'s. In these terms, the recognition rate $p_R$ of (2.12) can be written as

$$p_R = P(L_{MT})\, P(\underline{x} < \bar{z}_k,\ \forall s_k \in L_M) \quad (3.15)$$
$$= P(L_{MT}) \int_0^\infty f(x) \left[ \int_x^\infty \bar{f}(z)\,dz \right]^{N_M - 1} dx, \quad (3.16)$$

thus circumventing the need to know the prior probabilities $P(s_R = s_i \mid L_{MT})$.

Now consider the LD rate $p_L$. Let $s_i$ also serve as a testing sentence for the LD problem. From Eq.
(2.26), the LD rate $p_L$ can be expressed as follows, noting that since $s_i$ is always chosen from $L_T$, $\underline{q}$ is always 1:

$$p_L = E\underline{q}_L = E(\underline{q}\,\hat{c}_L) = E\hat{c}_L \quad (3.17)$$
$$= P(\hat{c}_L = 1) = P(L_{MT}). \quad (3.18)$$

Therefore, we further have

$$p_R = p_L \int_0^\infty f(x) \left[ \int_x^\infty \bar{f}(z)\,dz \right]^{N_M - 1} dx. \quad (3.19)$$

The equation derived above is the recognition rate for a "binary-decision" LM, because all the p.d.f.'s $\bar{f}(z)$ are universal in $L_M$. In this case, the LM is incapable of distinguishing the event $s_k \in L_{M\bar{T}}$ from the event $s_k \in L_{MT}$ by its grammar-likelihood measurement $\underline{x}_g$. According to Fig. 3.5, $\bar{f}(z)$ is then split into two distinguishable p.d.f.'s, $\bar{f}(z|L_{MT})$ and $\bar{f}(z|L_{M\bar{T}})$, letting the sizes of $L_{M\bar{T}}$ and $L_{MT}$ be $N_{M\bar{T}}$ and $N_{MT}$, respectively. By modifying Eq. (3.19), the recognition rate for a general LM can be written as

$$p_R = p_L \int_0^\infty f(x) \left[ \int_x^\infty \bar{f}(z|L_{M\bar{T}})\,dz \right]^{N_{M\bar{T}}} \left[ \int_x^\infty \bar{f}(z|L_{MT})\,dz \right]^{N_{MT} - 1} dx. \quad (3.20)$$

The recognition rates derived above, representing the sentence correct rate in which all of the words in the uttered sentence are correctly recognized, are not the only figure of merit for a speech recognition system. For instance, other measurements such as word correct rate, insertion rate, deletion rate, and replacement rate are frequently used in the evaluation of many speech recognition systems [3]. Although only the sentence correct rate is formulated in this research, it would not be difficult to derive the other measurements by a similar methodology. For instance, the one-word replacement rate is the probability of choosing sentences which have a one-word difference from the uttered sentence.

3.3 Representing Recognition Rate by Coverage and Over-Generation

To represent $p_R$ by $C_M$ and $O_M$, first, by comparing Eqs. (3.5) and (3.18), we have

$$p_L = P(\hat{c}_L = 1 \mid c_L = 1) = C_M. \quad (3.21)$$

To characterize $O_M$, the following assumption is made:

Assumption 3.3 The distributions $P$ and $Q$ over $L$ are sufficiently uniform that the size of a subset $L'$, $L' \subseteq L_M$ (or $L' \subseteq L_T$), in $L_M$ (or $L_T$) can be approximated by

$$|L'| = Q(L')\,N_M \quad (3.22)$$
$$\text{(or } |L'| = P(L')\,N_T\text{)}. \quad (3.23)$$

Based on this assumption, $O_M$ can be interpreted as the percentage of sentences in $L_M$ belonging to $L_{M\bar{T}}$, and $C_M$ as the percentage of sentences in $L_T$ belonging to $L_{MT}$. Then we have

$$N_{MT} = (1 - O_M) \cdot N_M = C_M \cdot N_T; \quad (3.24)$$
$$N_M = \frac{C_M N_T}{1 - O_M}; \quad (3.25)$$
$$O_M = 1 - \frac{C_M N_T}{N_M}. \quad (3.26)$$

Similarly,

$$N_{M\bar{T}} = N_M - C_M \cdot N_T = \frac{C_M N_T}{1 - O_M} - C_M N_T = \frac{C_M O_M}{1 - O_M}\, N_T. \quad (3.27)$$

Therefore, $N_{MT}$ and $N_{M\bar{T}}$ in Eq. (3.20) can be expressed by $O_M$, $C_M$ and $N_T$ as follows:

$$p_R = C_M \int_0^\infty f(x) \left[ \int_x^\infty \bar{f}(z|L_{M\bar{T}})\,dz \right]^{\frac{C_M O_M}{1 - O_M} N_T} \left[ \int_x^\infty \bar{f}(z|L_{MT})\,dz \right]^{C_M N_T - 1} dx. \quad (3.28)$$

Two important issues concerning $O_M$ and $N_M$ are to be addressed.

1. $O_M$ can be viewed as a better measurement of the size of the LM, since it is normalized by $N_M$. By Eq. (3.20), $p_R$ depends on $C_M$ and $N_M$ in terms of LM performance, whereas by Eq. (3.28), $p_R$ depends on $C_M$, $O_M$ and $N_T$. The transformation between $O_M$ and $N_M$ is expressed by Eqs. (3.26) and (3.25). The following example shows the convenience of employing $O_M$ instead of $N_M$ to characterize the LM performance. Suppose there are two LMs, A and B. Both have the same coverage, 0.9, but the size of A is $10^6$ and that of B is $10^4$. The designers have no clue how badly the two LMs over-generate extra sentences unless the size of the TL is specified. Now suppose the size of the $L_T$ which A is intended to model is $10^5$ and that of B is $10^2$. By Eq. (3.26),

for A: $O_M = 1 - \dfrac{0.9 \times 10^5}{10^6} = 0.91$;

for B: $O_M = 1 - \dfrac{0.9 \times 10^2}{10^4} = 0.991$.

Thus, we conclude that 91% of A and 99.1% of B is over-generated. In this case, A is a better LM than B in terms of $C_M$ and $O_M$.

2. Practically, for most speech recognition tasks, $N_M$ is easier to estimate (as will be discussed later) than $O_M$. The recognition rate $p_R$ is thus directly calculated from $N_M$. Therefore, even if Assumption 3.3 fails, only the measurement $O_M$ is affected. The estimation of $p_R$ is still correct as long as the estimation of $N_M$ is accurate.
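The worked example above can be reproduced with a one-line helper for Eq. (3.26):

```python
# Eq. (3.26): O_M = 1 - C_M * N_T / N_M, reproducing the comparison of
# models A and B from the text.
def over_generation(C_M, N_T, N_M):
    """Over-generation of a LM of size N_M with coverage C_M on a TL of size N_T."""
    return 1.0 - C_M * N_T / N_M

O_A = over_generation(0.9, 10**5, 10**6)   # model A
O_B = over_generation(0.9, 10**2, 10**4)   # model B
print(O_A, O_B)  # 0.91 and 0.991, matching the text
```

The same helper makes the dependence on the TL size explicit: without $N_T$, the raw sizes $10^6$ and $10^4$ say nothing about how badly each model over-generates.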
3.4 Modeling the Probability Density Functions

General Formulations

In the binary detection problem, the p.d.f.'s of the observations under hypotheses $H_0$ and $H_1$ are prior knowledge in predicting the detection rate. In this research, the $N_L$-ary detection problem contains two observations, the acoustic evaluation and the grammatical evaluation. Similar to Eq. (2.9), the total likelihoods $\underline{x}$ and $\bar{x}$ can be decomposed as

$$\underline{x} = \underline{x}_a + w\,\underline{x}_g, \quad (3.29)$$
$$\bar{x} = \bar{x}_a + w\,\bar{x}_g, \quad (3.30)$$

where $w$ denotes the weight⁵ on the grammar likelihood.

⁵In most speech recognition systems, these likelihoods are linearly combined to produce a total likelihood.

The p.d.f.'s of $\underline{x}$ and $\bar{x}$ can be obtained from the p.d.f.'s of $\underline{x}_a$, $\bar{x}_a$, $\underline{x}_g$ and $\bar{x}_g$. Since the acoustic-level and grammar-level evaluations are usually assumed to be independent [3], the conditional p.d.f.'s are given by the convolutions

$$f(x|L_{MT}) = f(x = x_a + wx_g \mid L_{MT}) \quad (3.31)$$
$$= \int_0^\infty f_a(x - wx_g \mid L_{MT})\, f_g(x_g \mid L_{MT})\,dx_g; \quad (3.32)$$
$$\bar{f}(x|L_{MT}) = \bar{f}(x = x_a + wx_g \mid L_{MT}) \quad (3.33)$$
$$= \int_0^\infty \bar{f}_a(x - wx_g \mid L_{MT})\, f_g(x_g \mid L_{MT})\,dx_g; \quad (3.34)$$
$$\bar{f}(x|L_{M\bar{T}}) = \bar{f}(x = x_a + wx_g \mid L_{M\bar{T}}) \quad (3.35)$$
$$= \int_0^\infty \bar{f}_a(x - wx_g \mid L_{M\bar{T}})\, f_g(x_g \mid L_{M\bar{T}})\,dx_g. \quad (3.36)$$

The p.d.f.'s of $\underline{x}_a$, $\underline{x}_g$, $\bar{x}_a$ and $\bar{x}_g$ are estimated experimentally by employing predefined density functions to best fit the histograms of the evaluation results. In this research, Gaussian density functions are used to model the p.d.f.'s of the resulting histograms. Because of the independence assumption, the p.d.f. of $\underline{x}$ (or $\bar{x}$) is easily obtained by summing the means and taking the square root of the sum of the squared variances. Using Gaussian p.d.f.'s, the conditional p.d.f.'s of $\underline{x}_a$ and $\underline{x}_g$ at both the grammar level and the acoustic level under the different hypotheses are discussed in the following. Then, the conditional p.d.f.'s of the total likelihoods $\underline{x}$ and $\bar{x}$ in Eqs. (3.32), (3.34) and (3.36) are constructed.

Grammar Level

According to Eqs. (3.5) and (3.6), $C_M$ and $O_M$ are defined as two conditional probabilities. $C_M$ is based on the hypothesis structure, $H_1$ and $H_0$ (see Eqs. (2.27) and (2.28)), while $O_M$ is based on the decision structure, $D_1$ and $D_0$ (see Eqs. (2.29) and (2.30)).
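The Gaussian combination rule just described (means add; standard deviations combine as the square root of the sum of squared, weight-scaled variances) can be sketched as follows; all parameter values are illustrative.

```python
import math

# For independent Gaussian acoustic and grammar likelihoods, the total
# likelihood x = x_a + w * x_g is Gaussian with mean mu_a + w * mu_g and
# variance sigma_a^2 + (w * sigma_g)^2. Parameter values are illustrative.
def combine(mu_a, sigma_a, mu_g, sigma_g, w):
    mu = mu_a + w * mu_g
    sigma = math.sqrt(sigma_a ** 2 + (w * sigma_g) ** 2)
    return mu, sigma

mu, sigma = combine(mu_a=2.0, sigma_a=0.5, mu_g=1.0, sigma_g=0.4, w=0.5)
print(mu, sigma)  # 2.5 and sqrt(0.25 + 0.04)
```

This is the rule applied componentwise later in the chapter, when each Gaussian component of the acoustic mixture receives the grammar bias.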
Therefore, to characterize a LM by the coverage and over-generation, two sets of testing sentences are required. Let $T_T$ denote a set of grammatical sentences randomly chosen from the TL, and $T_M$ denote a sentence set randomly generated by the LM.

To obtain $C_M$, $T_T$ is fed into the LM to be evaluated. Suppose there are $N_{\text{pruned}}$ sentences in $T_T$ which cannot be generated by the LM, or do not have a finite likelihood; then $C_M$ is estimated as

$$C_M = 1 - \frac{N_{\text{pruned}}}{|T_T|}. \quad (3.37)$$

The rest of the sentences, which obtain finite likelihoods, are used to construct a histogram of likelihoods. The $f(x_g)$ is then obtained by employing a Gaussian function to best fit this histogram. Then we have

$$f(x_g|L_{MT}) = g(x_g; \mu_{g,1}, \sigma_{g,1}), \quad (3.38)$$

where $g(\,\cdot\,; \mu, \sigma)$ denotes a Gaussian p.d.f. with mean $\mu$ and standard deviation $\sigma$.

To obtain $O_M$, a sentence set generated by the LM could be fed into an observer which can detect whether the input sentence is grammatical in the task-language sense. $O_M$ could then be estimated as the percentage of these sentences that are not in the TL. However, this method is unrealistic in speech recognition problems, since usually the true grammar rules are unknown or not even existent. Therefore, an alternative method to obtain $O_M$ and $f(x_g|L_{M\bar{T}})$ is proposed in the following.

According to Eq. (3.26), $O_M$ can be expressed by $C_M$, $N_T$ and $N_M$. The $C_M$ can be obtained from the equations above, and $N_T$ is supposed to be known to the designers. The estimation of $O_M$ thus turns into the estimation of $N_M$. For a small LM, $N_M$ can be obtained by simply counting all possible generated sentences. For a large LM, $N_M$ can be estimated by averaging the logarithmic generating probabilities of the sentences in $T_M$, a randomly chosen subset of $L_M$, $T_M \subset L_M$. Let $G(s_i)$ represent the generating probability of a sentence $s_i$ and $|T_M|$ the size of $T_M$. $N_M$ is estimated as

$$N_M \approx \exp\!\left( -\frac{\sum_{s_i \in T_M} \ln G(s_i)}{|T_M|} \right). \quad (3.39)$$

Using Eq. (3.22),

$$f(x_g) = Q(L_{M\bar{T}})\, f(x_g|L_{M\bar{T}})$$
$$+\ Q(L_{MT})\, f(x_g|L_{MT}) \quad (3.40)$$
$$= O_M\, f(x_g|L_{M\bar{T}}) + (1 - O_M)\, f(x_g|L_{MT}). \quad (3.41)$$

Consequently,

$$f(x_g|L_{M\bar{T}}) = \frac{f(x_g) - (1 - O_M)\, f(x_g|L_{MT})}{O_M}. \quad (3.42)$$

Similarly, $f(x_g|L_{M\bar{T}})$ is modeled by a Gaussian p.d.f.,

$$f(x_g|L_{M\bar{T}}) = g(x_g; \mu_{g,0}, \sigma_{g,0}). \quad (3.43)$$

Acoustic Level

It should be noted that the acoustic symbol strings are supposed to be correctly segmented and evaluated by a word model, usually implemented by an HMM. As a consequence, the resulting likelihoods are associated with one "word" instead of one "sentence." Let us work on the word model first.

Using ideas similar to the sentence case, let $A_i'$⁶ be the set of acoustic strings corresponding to a word⁷ $v_i$, $v_i \in V$:

$$A_i' \stackrel{\text{def}}{=} \{a_j \mid P(a_j, v_i) > 0\}. \quad (3.44)$$

Let the random variable $\underline{x}_{a,1}'$ denote the statistic of the acoustic evaluations obtained by experiments:

$$\underline{x}_{a,1}' = \frac{-1}{|a_j|} \ln P(a_j \mid v_i, a_j \in A_i'). \quad (3.45)$$

Similarly, we have

$$\bar{x}_{a,0}' = \frac{-1}{|a_j|} \ln P(a_j \mid v_i, a_j \notin A_i'). \quad (3.46)$$

⁶The "prime" stands for using a "word" as one signal in this chapter.
⁷Note that "a word" is a sentence with length one. The underlying space $(\Omega, 2^\Omega, P)$ is still applicable.

From experiments, $\underline{x}_{a,1}'$ and $\bar{x}_{a,0}'$ can be modeled by a mixed function of a delta and a Gaussian p.d.f.:

$$f(x_{a,0}') = (1 - P_0')\,\delta(x - \epsilon) + P_0'\, g(x; \mu_{a,0}', \sigma_{a,0}'); \quad (3.47)$$
$$f(x_{a,1}') = (1 - P_1')\,\delta(x - \epsilon) + P_1'\, g(x; \mu_{a,1}', \sigma_{a,1}'). \quad (3.48)$$

The reason for using $\delta(\cdot)$ is that, when evaluating an HMM, sometimes the HMM cannot find any valid path for the input symbol string. Consequently, the probability of this string being generated by this HMM is zero, and its NLKHD is $\infty$. Practically, a penalty $\epsilon$ rather than $\infty$ is assigned to $x_{a,1}'$ (or $x_{a,0}'$) in this case. From the detection theory point of view, $P_0'$ represents the probability that an acoustic symbol string evaluated by a wrong word model is mistakenly classified (false alarm), while $P_1'$ represents the probability that an acoustic symbol string evaluated by a correct word model is successfully classified (detection rate).
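The mixed delta-plus-Gaussian model of Eqs. (3.47) and (3.48) can be sketched by sampling; the penalty $\epsilon$, detection rate $P_1'$ and Gaussian parameters below are illustrative rather than measured values.

```python
import random

# Sampling sketch of Eq. (3.48): with probability 1 - P1 the HMM finds no
# valid path and the NLKHD is the penalty EPS; otherwise it is Gaussian.
# All numbers are illustrative.
EPS = 10.0            # penalty assigned in place of an infinite NLKHD
P1 = 0.95             # detection rate of the correct word model
MU1, SIG1 = 1.0, 0.3  # Gaussian part of the correct-word NLKHD

def sample_correct_word_nlkhd(rng):
    if rng.random() > P1:   # no valid HMM path: assign the penalty
        return EPS
    return rng.gauss(MU1, SIG1)

rng = random.Random(0)
draws = [sample_correct_word_nlkhd(rng) for _ in range(50_000)]
frac_penalty = sum(d == EPS for d in draws) / len(draws)
print(round(frac_penalty, 3))  # close to 1 - P1 = 0.05
```

The empirical fraction of penalty outcomes recovers the delta weight $1 - P_1'$, which is exactly the quantity the histogram-fitting step estimates in practice.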
However, in the current formulation, the signal of interest is a "sentence" rather than a "word." In order to construct $f(x_a)$, the p.d.f. of the acoustic likelihoods for a sentence, $L_M$ is partitioned into $\mathcal{L} + 1$ subsets, $L_M^{(0)}, L_M^{(1)}, \cdots, L_M^{(\mathcal{L})}$, where $\mathcal{L}$ is the length of the uttered sentence in words and $L_M^{(d)}$ represents the subset containing sentences with a $d$-word difference from the correct sentence.

We start with the trivial case $d = 0$, which refers to the correct sentence. Obviously, $|L_M^{(0)}| = 1$. The random variable $\underline{x}_a$ for the $d = 0$ case is obtained by summing $\mathcal{L}$ copies of $\underline{x}_{a,1}'$:

$$\underline{x}_a = \mathcal{L} \cdot \underline{x}_{a,1}'. \quad (3.49)$$

From Eq. (3.48), the random variable $\underline{x}_{a,1}'$ is composed of the constant $\epsilon$ with probability $(1 - P_1')$ and a Gaussian random variable with total conditional probability $P_1'$. In other words, $\underline{x}_{a,1}'$ produces either the constant $\epsilon$ or a Gaussian value. To obtain $\underline{x}_a$, which is comprised of $\mathcal{L}$ copies of $\underline{x}_{a,1}'$, let $D^k$ denote the event that $k$ of the $\epsilon$'s and $(\mathcal{L} - k)$ of the Gaussian values are produced. For each $k$, $0 \leq k \leq \mathcal{L}$,

$$\underline{x}_a(k) = k\epsilon + (\mathcal{L} - k)\,\underline{x}_{a,1}'; \quad (3.50)$$
$$f(x_a \mid d = 0, D^k) = g\!\left(x_a;\ k\epsilon + (\mathcal{L} - k)\mu_{a,1}',\ \sqrt{\mathcal{L} - k}\,\sigma_{a,1}'\right). \quad (3.51)$$

The probability of the event $D^k$ is $C_k^{\mathcal{L}} (1 - P_1')^k P_1'^{(\mathcal{L} - k)}$⁸. Thereby, $f(x_a)$ given $d = 0$ can be constructed as

$$f(x_a \mid d = 0) = \sum_{k=0}^{\mathcal{L}} f(x_a \mid d = 0, D^k)\, P(d = 0, D^k) \quad (3.52)$$
$$= \sum_{k=0}^{\mathcal{L}} g\!\left(x_a;\ k\epsilon + (\mathcal{L} - k)\mu_{a,1}',\ \sqrt{\mathcal{L} - k}\,\sigma_{a,1}'\right) C_k^{\mathcal{L}} (1 - P_1')^k P_1'^{(\mathcal{L} - k)}. \quad (3.53)$$

Consult Fig. 3.7(a) for an example of the case $\mathcal{L} = 4$ and $k = 2$.

⁸$C_k^{\mathcal{L}}$ denotes the number of combinations of $k$ chosen from $\mathcal{L}$.

To construct $f(x_a \mid d > 0)$, that is, $\bar{f}(x_a)$, consider the subset $L_M^{(d)}$, $d > 0$. The random variable $\bar{x}_a$ is comprised of $d$ incorrect word models, represented by $\bar{x}_{a,0}'$, and $(\mathcal{L} - d)$ correct word models, represented by $\underline{x}_{a,1}'$:

$$\bar{x}_a(d) = d \cdot \bar{x}_{a,0}' + (\mathcal{L} - d)\,\underline{x}_{a,1}'. \quad (3.54)$$
Figure 3.7: (a) An example of how $f(x_a \mid d = 0, D^2)$ is constructed, when $\mathcal{L} = 4$ and $k = 2$. (b) An example of how $f(x_a \mid d = 3, D_0^2 D_1^0)$ is constructed, when $\mathcal{L} = 4$, $k_0 = 2$ and $k_1 = 0$.

Figure 3.7(b) shows an example of constructing $f(x_a \mid d > 0)$. Extending the idea used in the $d = 0$ case, let $k_1$ and $k_0$ denote the numbers of $\epsilon$'s produced by $\underline{x}_{a,1}'$ and $\bar{x}_{a,0}'$ respectively, and let $D_1^{k_1}$ and $D_0^{k_0}$ denote these events. Then $f(x_a \mid d > 0)$ is simply the sum over all combinations of the events $k_1$ and $k_0$. Given $d > 0$, $k_0 \in [0, d]$ and $k_1 \in [0, \mathcal{L} - d]$, we have

$$\bar{f}(x_a \mid d) = \sum_{k_0=0}^{d} \sum_{k_1=0}^{\mathcal{L}-d} g\big(x_a;\ \mu_{a,0}(d),\ \sigma_{a,0}(d)\big)\, P(D_0^{k_0})\, P(D_1^{k_1}), \quad (3.55)$$

where

$$\mu_{a,0}(d) = (k_0 + k_1)\epsilon + (d - k_0)\mu_{a,0}' + (\mathcal{L} - d - k_1)\mu_{a,1}'; \quad (3.56)$$
$$\sigma_{a,0}(d) = \sqrt{(d - k_0)\sigma_{a,0}'^2 + (\mathcal{L} - d - k_1)\sigma_{a,1}'^2}; \quad (3.57)$$
$$P(D_0^{k_0}) = C_{k_0}^{d} (1 - P_0')^{k_0} P_0'^{(d - k_0)}; \quad (3.58)$$
$$P(D_1^{k_1}) = C_{k_1}^{\mathcal{L}-d} (1 - P_1')^{k_1} P_1'^{(\mathcal{L} - d - k_1)}. \quad (3.59)$$

For $\bar{x}_a$ (i.e., $d > 0$), the number of incorrect words $d$ can range from 1 to $\mathcal{L}$. Subsequently,

$$\bar{f}(x_a) = \sum_{d=1}^{\mathcal{L}} \bar{f}(x_a \mid d)\, Q(L_M^{(d)}). \quad (3.60)$$

Recall that only the modeled language is considered in estimating the recognition rate. $Q(L_M^{(d)})$ can then be approximated by assuming $B_f$ is the averaged branch factor of the modeled language $L_M$. We have

$$B_f \stackrel{\text{def}}{=} \sqrt[\mathcal{L}]{N_M} \quad (3.61)$$

and

$$Q(L_M^{(d)}) \approx \frac{C_d^{\mathcal{L}} (B_f - 1)^d}{B_f^{\mathcal{L}}}. \quad (3.62)$$

Note that $\sum_{d=0}^{\mathcal{L}} Q(L_M^{(d)}) = 1$.

Total Likelihood

According to Eq. (3.29), the total likelihood $\underline{x}$ is the linear combination of the acoustic likelihood $\underline{x}_a$ and the grammar likelihood $\underline{x}_g$. From Eqs. (3.38), (3.43), (3.53) and (3.60), the p.d.f.'s of $\underline{x}_g$, whether in $L_{MT}$ or in $L_{M\bar{T}}$, are all Gaussian, and all $\underline{x}_a$'s are linearly combined Gaussians. Therefore, all p.d.f.'s of the total likelihoods are also linearly combined Gaussians, in which each mean can be obtained by adding the grammar bias (mean) to the mean of each Gaussian component of $f(x_a)$ (or $\bar{f}(x_a)$) in Eq. (3.60) and Eq. (3.53), and each standard deviation can be obtained by taking the square root of the sum of the squared variances. The desired p.d.f.'s, $f(x|L_{MT})$, $\bar{f}(x|L_{MT})$ and $\bar{f}(x|L_{M\bar{T}})$, are thereby obtained. Finally, by plugging the above three equations into Eq.
(3.28), the recognition rate under the Gaussian assumption can be obtained. It is important to note that the recognition rate depends on the sentence length $\mathcal{L}$ and the weight $w$.

3.5 Issue of Sentence Length

Normally, the recognition rate of a speech recognition system is sensitive to the length of the uttered sentence, $\mathcal{L}$. The "length" means "the length in words" in an IWR system and "the length in acoustic symbols" in a CSR system. In this chapter, all formulations developed above are under the assumption that $\mathcal{L}$ is fixed. However, in the real world, the TL usually does not have a unique sentence length. To avoid the dependence on $\mathcal{L}$, the following two methods are proposed.

First, the function $P_{sl}(l)$ is defined as the percentage of the test sentence set with length $l$ in the TL. For each possible $l$, the recognition rate $p_R(l)$ is estimated by Eq. (3.28) given the sentence length $\mathcal{L} = l$. Then the overall recognition rate is defined as

$$\bar{p}_R = \sum_l P_{sl}(l)\, p_R(l). \quad (3.63)$$

The other, simpler definition to eliminate the dependence on $\mathcal{L}$ is to define $\bar{\mathcal{L}}$ as the average sentence length in the TL,

$$\bar{p}_R = p_R(\bar{\mathcal{L}}). \quad (3.64)$$

The second method is simply an approximation for estimating the recognition rate. It is important to note that the coverage and over-generation are also sensitive to $\mathcal{L}$, which means that one LM, when dealing with diverse sentence lengths in the TL, needs more than one pair of $C_M$ and $O_M$ to characterize it. In this case, the ROC-LM design is not achievable because a LM is no longer characterized by only two parameters.

3.6 The Weight, $w$, and Relative Motion of LMs

Another adjustable parameter in Eq. (3.28) is the weight $w$ between the acoustic and grammar levels. $w$ can be interpreted as an indicator of the importance (or reliability) of the grammar evaluation with respect to the acoustic evaluation, and is always adjusted to optimize the recognition rate. Under the equal-likelihood assumption, the weight $w$ is set to zero to optimize the recognition rate.
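The length-weighted combination in Eq. (3.63) above can be sketched as follows; the length distribution and the per-length recognition rates are illustrative placeholders for values Eq. (3.28) would supply.

```python
# Eq. (3.63): the overall rate is the length-distribution-weighted average
# of the per-length recognition rates. Both inputs are illustrative; in the
# design tool, p_R[l] would come from evaluating Eq. (3.28) with L = l.
P_sl = {3: 0.2, 5: 0.5, 8: 0.3}    # fraction of TL sentences of each length
p_R = {3: 0.95, 5: 0.90, 8: 0.80}  # estimated recognition rate per length

p_R_overall = sum(P_sl[l] * p_R[l] for l in P_sl)
print(round(p_R_overall, 3))  # 0.2*0.95 + 0.5*0.90 + 0.3*0.80 = 0.88
```

The simpler alternative of Eq. (3.64) would instead evaluate `p_R` once at the average length, trading accuracy for one integral evaluation.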
In this case, the responses of the LM are binary, that is, $\infty$ or a constant. When $w \neq 0$, from Eqs. (3.43) and (3.38), the grammar-level evaluations are characterized by six parameters: $C_M$, $O_M$, $\mu_{g,0}$, $\sigma_{g,0}$, $\mu_{g,1}$ and $\sigma_{g,1}$. In the ROC-LM design scheme, a trajectory of LMs can be made by assuming that only $C_M$ and $O_M$ change while the rest remain invariant. However, this assumption is not practical in many applications, for example, adaptive language modeling problems. In the case that $C_M$ and $O_M$ remain the same while the rest of the parameters change, so that the ROC contour plot changes, a relative motion of that LM is calculated with respect to the new contour plot. Fig. 3.8 illustrates a changing contour plot with the LMs unchanged. A shift of the contour plot makes no difference to any of the LMs on the plane; however, a rotation results in different increments or decrements depending on the location. In Fig. 3.8, LM1 increases from 93 to 95, while LM2 increases from 93 to 97.

Figure 3.8: An illustration of a changing contour plot with the LMs unchanged (solid line: the old ROC contours; dashed line: the new ROC contours). The shift of a contour plot will make no difference to any of the LMs on the plane. However, a rotation will result in different increments or decrements which depend on the location.

3.7 The Scheme of the ROC-LM Design

To summarize, the scheme of the proposed ROC-LM design is itemized as follows:

1. The means and variances of the random variables $\underline{x}_{a,1}'$ and $\bar{x}_{a,0}'$ are obtained from the histograms of the acoustic evaluations;

2. The means and variances of the random variables $\underline{x}_{g,1}$ and $\bar{x}_{g,0}$ are obtained from the histograms of the grammatical evaluations;

3. The size of the TL, $N_T$, and the average sentence length in the TL are measured;

4. The two-dimensional ROC plane is divided into $n$ by $n$ grids, for example, $n = 5$.
Each grid point corresponds to a pair of coverage $C_M$ and over-generation $O_M$, and its recognition rate can be calculated according to Eq. (3.28). In the equal-likelihood case, $w$ is set to zero;

5. The contour plot of recognition rates is plotted on the ROC chart;

6. $C_M$ and $O_M$ are measured for an existing LM and the corresponding point is located on the ROC plane;

7. By adjusting various parameters of the LM, the $C_M$ and $O_M$ are changed, and a trajectory is made by connecting the locations of $(C_M, O_M)$ on the ROC plane;

8. If $C_M$ and $O_M$ are not changed but $\mu_{g,0}$, $\mu_{g,1}$, $\sigma_{g,0}$, or $\sigma_{g,1}$ are changed, the location of the old $(C_M, O_M)$ is mapped to a relative location with respect to the new ROC contour plot;

9. By overlapping the contour plot and the trajectory, the designer can observe the changing performance of the LM under the influence of the various parameters and can locate the optimal set of parameters in the model.

Chapter 4 The ROC-LM Design Tool

According to the ROC-LM design scheme of the previous chapter, a mouse-driven X-window version of the ROC-LM Design Tool was implemented on a Sun SPARC 10 workstation with its graphic facilities supported by MATLAB. The ROC-LM Design Tool includes three individual programs: one for calculating the estimated recognition rates, one for evaluating the LM by $C_M$ and $O_M$, and one for displaying results and controlling interfaces. The programs communicate with one another through text files which contain the required information. Several keywords (identifiers) are used in those files to identify what the following number stands for, for example, "COVERAGE: 93.1." The order of the identifiers in a text file is not important. The merit of implementing the tool as three individual programs and using identifiers to carry information is that, for instance, when different methodologies of evaluating LMs are considered, designers may simply replace the LM evaluation part of the tool, as long as the same identifiers are employed for communication.
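A minimal sketch of reading such keyword-identifier files (e.g. a line "COVERAGE: 93.1"); everything beyond the key/value convention stated in the text, such as the skipping of unrecognized lines, is an assumption of this sketch.

```python
# Parse an identifier-style communication file: each relevant line is
# "KEYWORD: number", and the order of identifiers does not matter.
# Lines without a colon are ignored (an assumption of this sketch).
def parse_identifier_file(text):
    values = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            values[key.strip()] = float(val)
    return values

sample = "COVERAGE: 93.1\nOVER_GENERATION: 12.5\n"
print(parse_identifier_file(sample))
```

Because each program only looks up the identifiers it needs, one block of the tool can be swapped out without changing the others, exactly the modularity argued for above.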
4.1 Overview

The ROC-LM Design Tool is composed of four functions:

1. A display window for the ROC design plane, ROC-WINDOW;

2. The estimation of recognition rates, ESM-RR;

3. The evaluation of the LM, EVL-LM;

4. The control panel and designer interface, CTRL-PANEL.

Figure 4.9 shows a diagram of the relationships among these four blocks⁹. The first three parts interact with one another, and the fourth is a designer interface controlling the communications and interactivity among the designer and the other three blocks. ROC-WINDOW includes a two-dimensional ROC plane with coverage, $C_M$, as its vertical axis and over-generation, $O_M$, as its horizontal axis. The main function of ROC-WINDOW is to draw the contour plot of the estimated recognition rates, and also to mark the location of the evaluated LM by its $C_M$ and $O_M$.

The estimation of recognition rates involves the acoustic evaluation, the LM evaluation excluding $C_M$ and $O_M$, and some general information such as the sentence length in words and the TL size. Given the ranges of $C_M$ and $O_M$, which can be adjusted from the control panel, the ranges are divided into 5 by 5 cells; each cell corresponds to one estimated recognition rate, resulting in a 5 by 5 matrix. This matrix is sent to the ROC-WINDOW block to draw the contour plot.

⁹The blocks ROC-WINDOW and CTRL-PANEL are implemented as one program.

Figure 4.9: A diagram of the relationships among the four blocks: the ROC display window, the Estimation of Recognition Rate (file: esm_rr.dat), the Evaluation of LM (file: eval_lm.dat), and the Control Panel and Designer Interface (file: esm_rr.prm).
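The 5 by 5 gridding described above, in which each cell contributes one estimated recognition rate, can be sketched as follows; the coverage and over-generation bounds are illustrative.

```python
# Split each selected range into n cells and take the cell centroids as the
# imaginary LMs at which the recognition rate will be estimated. The bounds
# below are illustrative, not values from the tool's experiments.
def centroids(left, right, n=5):
    step = (right - left) / n
    return [left + (i + 0.5) * step for i in range(n)]

c = centroids(0.80, 0.95)   # coverage range
o = centroids(0.10, 0.60)   # over-generation range
grid = [(ci, oj) for ci in c for oj in o]
print(len(grid))  # 25 grid points, one estimated recognition rate each
```

Evaluating the recognition-rate formula at each of the 25 points yields exactly the 5 by 5 matrix that ROC-WINDOW contours.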
On the other hand, a LM with parameters controlled by the control panel is evaluated by $C_M$ and $O_M$, and the evaluation results are sent to ROC-WINDOW, which marks the appropriate location. As described in the previous chapter, two testing sentence sets are required in evaluating LMs. Note that although the LM is assumed to be characterized by only $C_M$ and $O_M$, the means and variances of the evaluated grammar likelihoods are still required by ESM-RR to estimate the recognition rate. Therefore, there is a pathway from EVL-LM to ESM-RR that allows designers to decide whether to pass the resulting means and variances to the ESM-RR block. Note that, most of the time, the means and variances are supposed to be invariant during the modification of LMs.

Finally, CTRL-PANEL is responsible for the settings of all parameters and the commands from the designer. Details for each block are addressed in the following sections.

4.2 The ROC Display Window, ROC-WINDOW

ROC-WINDOW is implemented by employing the MATLAB Engine Library, a set of routines that allows users to call MATLAB from their own application programs, thereby employing MATLAB as a computation engine.

ROC-WINDOW is activated by three events: the output of ESM-RR is ready, the output of EVL-LM is ready, or a plotting command is sent directly from CTRL-PANEL. They are described as follows.

1. When the output of ESM-RR is ready: ROC-WINDOW receives the 5 × 5 matrix from ESM-RR through the file esm_rr.dat, and draws a new contour plot and the scales on both axes by employing plotting commands in the MATLAB Engine Library. To avoid over-writing the existing markers for the evaluated LMs, a buffer, named LM-BUFFER, is used to store the locations of the existing LMs. Each time a new contour plot is completed, LM-BUFFER pops up these reserved LMs and draws them on the plane.

2.
When the output of EVL-LM is ready
ROC-WINDOW receives the C_M and O_M from EVL-LM and draws a marker at the corresponding location on the ROC plane. The style of the marker, such as "green circle" or "red crosshair," is the designer's choice and can be changed at any time during designing. The (C_M, O_M) pair is then stored in LM-BUFFER for redrawing purposes.

3. Any plotting command from the Control Panel
All MATLAB commands can be issued through a MATLAB command window in CTRL-PANEL. Not only can ROC contour plots be displayed on this window, but the p.d.f.'s of acoustic evaluations, or histograms of grammar evaluations, can also be displayed by ROC-WINDOW.

4.3 The ESM-RR Block

The ESM-RR block is in charge of the estimation of recognition rates, which is derived based on an IWR speech system, given the sentence length and TL size. As previously described, ESM-RR is an individual program and is detachable from the ROC-LM Design Tool; that is, if a new method for estimating recognition rates is developed, or the mechanism of a speech recognition system does not fit the mathematical models developed in the previous chapter, this block can be replaced as long as the input and output formats remain the same.

ESM-RR starts by receiving the settings of the ranges of interest of C_M and O_M on the ROC plane from the file esm_rr.prm, which is sent by CTRL-PANEL. Both ranges are evenly divided by five. Then the centroid of each cell is elected as an imaginary LM characterized by its C_M and O_M. The recognition rate is then estimated by substituting the C_M and O_M into Eq. (3.28). Let c_left and c_right represent the left and right bounds of the C_M range, and o_left and o_right represent those of the O_M range. Then the five elected C_M's and O_M's are

    c(i) = c_left + (i + 0.5) (c_right - c_left)/5, for i = 0, ..., 4;   (4.1)
    o(j) = o_left + (j + 0.5) (o_right - o_left)/5, for j = 0, ..., 4.
(4.2)

A 5 x 5 matrix, with each entry being an estimated recognition rate, then results and is written to the output file esm_rr.dat.

Several issues regarding numerical analysis are addressed in the following. From Eq. (3.28), the estimation of recognition rates is mainly an integration from -∞ to +∞ of a mixed Gaussian p.d.f., f(x), multiplied by two "maskers," M_LMT(x) and M_~LM(x), which are defined as

    M_LMT(x) ≜ [ ∫_x^∞ f(u|LM_T) du ]^(N_MT - 1),   (4.3)
    M_~LM(x) ≜ [ ∫_x^∞ f(u|~LM) du ]^(N_~M).   (4.4)

Then Eq. (3.28) can be written as

    p_R = ∫_{-∞}^{+∞} f(x) M_LMT(x) M_~LM(x) dx.   (4.5)

Since f(u|LM_T) and f(u|~LM) are both p.d.f.'s, their integrals are always smaller than one; and since N_MT and N_~M are both positive integers, M_LMT(x) and M_~LM(x) always lie in the range [0, 1] for any finite x.

To implement Eq. (4.5) on a digital computer, the infinite integration domain is first replaced by a finite range, say [a, b]. From the previous chapter, f(x) is a multiple mixed Gaussian p.d.f., and by the fact that erf(2.5) ≈ 0.9996, if the integration domain outside [μ(k) - 2.5σ(k), μ(k) + 2.5σ(k)] for each Gaussian p.d.f. in Eq. (3.60) is truncated, the error is bounded within 0.04%:

    ε = | ∫_{-∞}^{+∞} f(x) M_LMT(x) M_~LM(x) dx - ∫_{∪_k [μ(k)-2.5σ(k), μ(k)+2.5σ(k)]} f(x) M_LMT(x) M_~LM(x) dx |   (4.6)
      = ∫_{outside ∪_k [μ(k)-2.5σ(k), μ(k)+2.5σ(k)]} f(x) M_LMT(x) M_~LM(x) dx   (4.7)
      ≤ ∫_{outside ∪_k [μ(k)-2.5σ(k), μ(k)+2.5σ(k)]} f(x) dx < 0.0004.   (4.8)

Therefore, a is chosen as the left bound of the "left-most" Gaussian p.d.f. in f(x), that is, when k = 0 in Eq. (3.53),10 and b is chosen as the right bound of the "right-most" Gaussian p.d.f., that is, when k = L in Eq. (3.53). Therefore,

    a = L μ_{a,1} + w μ_{g,1} - 2.5 √( L σ_{a,1}² + w² σ_{g,1}² );   (4.9)
    b = L ε + w μ_{g,0} + 2.5 w σ_{g,0}.   (4.10)

Next, by evenly partitioning the range [a, b] into n segments, and letting h = (b - a)/n and x_i = a + h(i - 0.5), the p_R in Eq. (4.5) can be approximated by

    p_R ≈ Σ_{i=1}^{n} f(x_i) M_LMT(x_i) M_~LM(x_i) h.
(4.11)

To calculate M_LMT(x) and M_~LM(x), note that f(u|LM_T) and f(u|~LM) are both mixed Gaussian (linear combinations of Gaussian p.d.f.'s); thereby, their integrals can be obtained by simply dropping the integral symbol and replacing g with ε1 in Eq. (3.55). The subscript g → ε1 on f(·) is used to denote the replacement of g with ε1, where ε1(u) is defined as

    ε1(u) ≜ ∫_u^∞ g(z) dz = 0.5 (1 - erf(u)).   (4.12)

Therefore, we have

    M_LMT(x) = [ f_{g→ε1}(x|LM_T) ]^(N_MT - 1),   (4.13)
    M_~LM(x) = [ f_{g→ε1}(x|~LM) ]^(N_~M).   (4.14)

10Note that ε ≫ μ_{a,1}.

The algorithm for calculating p_R is thus derived as follows.

ESM-RR Algorithm
{Read file: esm_rr.prm}
{Initialization}
for each 0 ≤ i ≤ 4 and 0 ≤ j ≤ 4, do
    c(i) = c_left + (i + 0.5)(c_right - c_left)/5;
    o(j) = o_left + (j + 0.5)(o_right - o_left)/5;
    r_ij = Recognition-Rate-Estimation(c(i), o(j));
end do
{Termination}
record the r_ij;

Recognition-Rate-Estimation Algorithm
{Args: c, o}
1 {Initialization: r = 0}
2 Find the integration domain [a, b], Eqs. (4.9) and (4.10);
3 h = (b - a)/n;
  for each 1 ≤ i ≤ n, do
4     x_i = a + (i - 0.5) h;
5     Calculate f(x_i);
6     Calculate M_LMT(x_i) and M_~LM(x_i) by Eqs. (4.13) and (4.14);
7     r = r + f(x_i) * M_LMT(x_i) * M_~LM(x_i) * h;
  end do
8 return r;

4.4 The EVL-LM Block

In this research, the ROC-LM Design Tool is installed with one LM, Bi-TriGram, a hybrid of bigram and trigram, to exemplify how to evaluate an LM by its C_M and O_M. Bi-TriGram is a flexible model which can easily shift between bigram and trigram, or a combination of them. There are three parameters in Bi-TriGram that control the p.d.f. of NLKHD and the C_M and O_M of the Bi-TriGram. These three parameters are the weight, penalty and threshold.

• weight, w_g: a weight between the bigram and trigram likelihoods, 0 ≤ w_g ≤ 1. Let l_b, l_t and l_g represent the NLKHD of the bigram, the trigram, and the total grammatical likelihood, respectively. We have

    l_g = (1 - w_g) l_b + w_g l_t;   (4.15)

• penalty, p_g: if the trigram does not exist in the training context, l_t is assigned a penalty,
p_g.

• threshold, t_g: if l_g is above the threshold t_g, the whole hypothesized sentence is pruned; that is, l_g = ∞.

The following algorithmic description illustrates their relationships, in which ∞ stands for a pruned sentence. In practice, 100.0 is used to represent ∞, and all l_g's larger than 100.0 are considered to be ∞. The symbol "·" in a word string, for instance w''-w'-·, is used to represent "any word in the vocabulary V."

Grammar Rules for Bi-TriGram
{Initialization;}
1 for each three consecutive words w''-w'-w in one sentence, do
2     if the bigram exists,
3         l_b = -ln (count of w''-w' / count of w''-·);
4         if the trigram exists,
5             l_t = -ln (count of w''-w'-w / count of w''-w'-·);
          else
6             l_t = p_g;
          end if
      else (if the bigram does not exist)
7         l_b = ∞;
      end if
8     l_g ← l_g + (1 - w_g) l_b + w_g l_t;
9     if l_g > t_g, prune the sentence (l_g = ∞);
  end do

The above algorithm is applied to each word concatenation in one hypothesized sentence. Now note that:

1. If the threshold t_g is set to ∞, the pruning due to a high l_g (line 9) never occurs;
2. If p_g is set to ∞, then when a consecutive word triple cannot be found in the trigram, the whole sentence is assigned ∞ (line 6) and can be considered pruned.

Consequently, by letting w_g = 1 and t_g = p_g = ∞, the above algorithm reduces to the following, and Bi-TriGram reduces to a trigram. Note that the existence of a trigram implies the existence of the corresponding bigram.

Grammar Rules for Bi-TriGram (trigram reduction)
{Initialization;}
1 for each three consecutive words w''-w'-w in one sentence, do
2     if the trigram exists,
3         l_t = -ln (count of w''-w'-w / count of w''-w'-·);
      else
4         l_t = ∞;
      end if
5     l_g ← l_g + l_t;
  end do

Similarly, when w_g = 0 and t_g = ∞, the original algorithm reduces to the following, and Bi-TriGram reduces to a bigram.

Grammar Rules for Bi-TriGram (bigram reduction)
{Initialization;}
1 for each three consecutive words w''-w'-w in one sentence, do
2     if the bigram exists,
3         l_b = -ln (count of w''-w' / count of w''-·);
      else (if the bigram does not exist)
4         l_b = ∞;
      end if
5     l_g ← l_g + l_b;
  end do

The following is an algorithm for evaluating Bi-TriGram, which consists of three phases. Note that, for some applications, designers may choose to skip the training phase and directly read a constructed database from disk, reducing the response time of EVL-LM.

EVL-LM (Bi-TriGram) Algorithm
{Initialization}
Training;
Evaluate C_M;
Evaluate O_M;
Write results to file evl_lm.dat;
{Termination}

To illustrate the training algorithm, a diagram of the data structure for storing word-concatenation information from the training context is shown in Fig. 4.10. In Fig. 4.10, the horizontal links represent bigram links, such as w1-wb1, w1-wb2, etc. The vertical links represent trigram links, such as w1-wb1-wt1, w1-wb1-wt2, etc. Note that if there exists a trigram w1-wb1-wt1, there must be a bigram wb1-wt1.

Training Algorithm
{Train-Set: training sentences chosen from the TL;}
{Initialization}
1 repeat until End-of-File, do
2     Read one word, w, from Train-Set;
3     if link w'-w exists,
4         increase the count of link w'-w;
      else
5         Insert link w'-w;
6         count of link w'-w = 1;
      end if
7     if link w''-w'-w exists,
8         increase the count of link w''-w'-w;
      else
9         Insert link w''-w'-w;
10        count of link w''-w'-w = 1;
      end if
11    w'' = w';
12    w' = w;
  end do
13 record the trained grammar;

[Figure 4.10 appears here: each vocabulary entry w1, ..., w|V| heads a horizontal chain of bigram links, and each bigram link heads a vertical chain of trigram links.]

Figure 4.10: A diagram of the data structure for storing word-concatenation information from the training context.

To evaluate C_M and O_M, the same evaluation algorithm but different testing sets are employed. From the previous chapter, C_M is based on the TL, while O_M is based on the LM. Thereby, there are two sentence sets, Test-Set I and Test-Set II, for evaluating these two parameters. The following is an algorithm for evaluating grammar likelihoods for a Bi-TriGram.
The evaluation returns the mean, μ_g, and variance, σ_g, of the grammar likelihoods, and the total number of sentences obtaining finite likelihoods (that is, not pruned).

Evaluation Algorithm
{Initialization}
1 for each word, w, in the testing set, do
2     if link w'-w exists,
3         bi_lk = -log(count of link w'-w / count of w'-·);
4         if link w''-w'-w exists,
5             tri_lk = -log(count of link w''-w'-w / count of w''-w'-·);
          else
6             tri_lk = penalty;
          end if
7         lk = lk + (1-weight)*bi_lk + weight*tri_lk;
8         if w == word-terminator,
9             nlk = lk / sentence_length;
10            if nlk ≤ threshold,
11                increase total_count;
12                store nlk in nlk_array;
              end if
13            {Reset variables;}
          end if
14        w'' = w';
15        w' = w;
      end if
  end do
16 μ = mean of nlk_array;
17 σ = variance of nlk_array;
18 return total_count, μ, σ;

According to Eq. (3.26), when the grammar rules for the TL are not available to EVL-LM in some applications, O_M can be estimated from N_M. In the current Bi-TriGram case, suppose the grammar rules of the TL are unknown, and let N_M,bigram be the size of Bi-TriGram when the threshold t_g = ∞, that is, the size of the bigram part. Then, when t_g is finite, we have

    N_M / N_M,bigram ≈ total_count / |Test-Set II|.   (4.16)

Therefore, N_M is estimated as

    N_M = N_M,bigram · total_count / |Test-Set II|.   (4.17)

N_M,bigram is estimated as follows. Let l̄_g,b denote the averaged bigram negative log-likelihood for sentences in Test-Set II. Then the averaged probability of generating one sentence from the bigram is 10^(-l̄_g,b). Thereby, the size N_M,bigram is the inverse of this probability,

    N_M,bigram = 10^(l̄_g,b).   (4.18)

The algorithms for evaluating C_M and O_M are thus listed as follows.

Evaluate C_M Algorithm
{Test-Set I: randomly choose m test sentences from the TL;}
{Initialization}
1 [count, μ_g, σ_g] = call Evaluation;
2 μ_g,1 = μ_g;
3 σ_g,1 = σ_g;
4 C_M = count / m;
5 write C_M, μ_g,1, σ_g,1 to file evl_lm.dat;
end.
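The per-sentence loop above can be sketched compactly. This is a minimal illustration, not the dissertation's implementation: it assumes conditional-probability tables in place of the count-based link structure, and the parameter values are made up.

```python
import math

# Hypothetical sketch of the Evaluation Algorithm above: interpolated
# bigram/trigram negative log-likelihood per sentence, normalized by
# sentence length, with a penalty for unseen trigrams and threshold
# pruning. bi_p / tri_p map word pairs / triples to conditional
# probabilities (stand-ins for the count ratios in the algorithm).

def sentence_nlk(words, bi_p, tri_p, weight, penalty):
    """Normalized negative log-likelihood of one sentence (lower = better)."""
    lk = 0.0
    for w2, w1, w in zip(words, words[1:], words[2:]):
        bi_lk = -math.log(bi_p[(w1, w)]) if (w1, w) in bi_p else float("inf")
        tri_lk = -math.log(tri_p[(w2, w1, w)]) if (w2, w1, w) in tri_p else penalty
        lk += (1 - weight) * bi_lk + weight * tri_lk
    return lk / len(words)

def coverage(sentences, bi_p, tri_p, weight, penalty, threshold):
    """Fraction of test sentences kept (finite NLK within the threshold);
    this is the C_M estimate when the sentences are drawn from the TL."""
    kept = [s for s in sentences
            if sentence_nlk(s, bi_p, tri_p, weight, penalty) <= threshold]
    return len(kept) / len(sentences)

if __name__ == "__main__":
    bi_p = {("a", "b"): 0.5, ("b", "c"): 0.5}   # illustrative tables
    tri_p = {("a", "b", "c"): 0.25}
    s = ["a", "b", "c"]
    print(round(sentence_nlk(s, bi_p, tri_p, weight=0.0, penalty=10.0), 4))  # → 0.231
```

With `weight=0.0` the model degenerates to the bigram, mirroring the bigram reduction of the grammar rules; `weight=1.0` with an infinite penalty mirrors the trigram reduction.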
Evaluate O_M Algorithm
{Test-Set II: randomly generate m test sentences from the LM;}
{Initialization}
1 [count, μ_g, σ_g, l̄_g,b] = call Evaluation;
2 μ_g,0 = μ_g;
3 σ_g,0 = σ_g;
4 N_M = 10^(l̄_g,b) · count / m;
5 write N_M, μ_g,0, σ_g,0 to file evl_lm.dat;
end.

Note that since the information of |L_T| is held in CTRL-PANEL, it is CTRL-PANEL's responsibility to convert N_M to O_M by applying Eq. (3.26).

4.5 The Control Panel Block

The CTRL-PANEL block is implemented by employing X Window System programming, or X for short, and the Motif toolkit. X provides a standard window environment that allows application programmers to spend less time porting to a graphical user interface (GUI) platform. Motif, one of the leading commercially available GUIs, provides the user-interface components, such as pushbuttons, dialogs, etc., needed to realize the visual on-line ROC design scheme.

The main functions of CTRL-PANEL are to collect all required parameters from the designer and to dispatch commands to EVL-LM, ESM-RR and ROC-WINDOW. The layout of CTRL-PANEL is shown in Fig. 4.11. The following are the function descriptions of each element on the panel.

There are three types of controls on CTRL-PANEL: buttons, sliders, and text fields. Their functions are described as follows. The label on a button used for a single command has the same name as the command. For example, to apply an ROC contour plot on the ROC plane, the designer clicks on the button that says "CONTOUR." Sliders are used to set a numeric value and to give a visual indication of the setting. Every slider is associated with a type-in field and a label on its left side. The type-in field displays the corresponding value immediately when the slider is moved, and it also allows users to type in a number in case the slider fails to provide it. Text fields are used when input from the keyboard is required. If the text field is a command line, pressing the return key activates the command.
For example,

[Figure 4.11 appears here: the tool layout, with the ROC WINDOW; the ROC-Contour Control Panel (ESM-RR) fields MU_A0, SIGMA_A0, P'_A0, MU_A1, SIGMA_A1, P'_A1, EPSILON, Sentence Length, and TL Size; the Language Model Design Panel (EVL-LM) with Parameters #1-#3; the Control and Designer Interface (CTRL-PANEL) with the coverage and over-generation ranges, LM Locator, and MATLAB Command field; and the buttons CONTOUR, ACOU PDF, EXIT, APPLY LM, CLEAR LM, NLK PDF, ULC R.R., and CLR ULCs.]

Figure 4.11: The layout of the ROC-LM Design Tool.

typing "text(0.5,0.5,'here')" followed by a return key in the field labeled MATLAB Command will send the command to the MATLAB Engine and write the string "here" at [0.5, 0.5] on the ROC plane.

From the layout in Fig. 4.11, all of the controls fall into two groups. The upper-left group, the ESM-RR group, concerns the information for ESM-RR, and the bottom group, the EVL-LM group, concerns the information for EVL-LM. The following are the descriptions of each control on CTRL-PANEL.

P'_A1 and P'_A0
    P'_1/P'_0 in Eqs. (3.48) and (3.47), representing the probability of a correct/incorrect word HMM finding a valid path for the input acoustic symbol string and returning a likelihood.

MU_A1 and MU_A0
    μ_a,1/μ_a,0 in Eqs. (3.48) and (3.47), representing the means of the word-level HMM evaluations when correct/incorrect HMMs are used, excluding those evaluations that returned penalties.

SIGMA_A1 and SIGMA_A0
    σ_a,1/σ_a,0 in Eqs. (3.48) and (3.47), representing the standard deviations of the word-level HMM evaluations when correct/incorrect HMMs are used, excluding those evaluations that returned penalties.

PENALTY
    The ε in Eqs. (3.48) and (3.47).

WEIGHT
    The w in Eq. (3.29). Refer to Sec. 3.6 for adjusting w if undecided.

MU_G1 and MU_G0
    μ_g,1/μ_g,0 in Eqs.
(3.38) and (3.43), representing the means of the LM evaluations when grammatical/non-grammatical sentences are used, excluding those evaluations that returned penalties.

SIGMA_G1 and SIGMA_G0
    σ_g,1/σ_g,0 in Eqs. (3.38) and (3.43), representing the standard deviations of the LM evaluations when grammatical/non-grammatical sentences are used, excluding those evaluations that returned penalties.

Sentence Length
    The average sentence length, or the length that appears most frequently, is applied in this work, Eq. (3.64). See the discussion in Sec. 3.5.

Sizeof(Task L)
    The N_T in Eqs. (3.28) and (3.26). |L_T| combined with L indicates the difficulty of a speech recognition task. Experiments show that the estimation of the recognition rate is not very sensitive to |L_T|; a rough estimate of |L_T| is enough.

Over-Gen Range, OVR-G LF and OVR-G RT
    The range of O_M on the ROC plane over which to estimate recognition rates, always between zero and one. Note that O_M cannot be one unless C_M equals zero; see Eq. (3.25).

C_M Range, COVER LF and COVER RT
    The range of C_M on the ROC plane over which to estimate recognition rates, always between zero and one. Similarly to the above, C_M = 0 never happens.

CONTOUR
    When this button is clicked, CTRL-PANEL collects the information in the ESM-RR group, writes it to the file esm_rr.prm, and activates the ESM-RR block to estimate recognition rates at the 5 x 5 grid points on the ROC plane. The resulting matrix is then delivered to ROC-WINDOW to draw a new contour plot. If one of μ_g,0, σ_g,0, μ_g,1, σ_g,1 differs from its value when the old contour plot was made, and if the weight w is not zero, it is an occasion for relative motion: the old LMs are transformed to their relative positions with respect to the new contour plot.

ULC R.R.
    The button CONTOUR produces 5 x 5 recognition rates each time. In contrast, ULC R.R., an abbreviation for upper-left-corner recognition rate, allows designers to estimate the recognition rate at one particular location, (OVR-G LF, COVER RT), on the ROC plane. ULC R.R. uses the
In contrast, ULC RR, abbreviated by upper-left-corner recognition rate, pro- vides designers to estimate recognition rate on one particular location, which is (OVR—G LF, COVER RT) on the ROC plane. ULC RR uses 72 vertical axis as the estimated recognition rate and horizontal axis is the count for each estimation. For example, a designer is interested to show the effect on recognition rates when a parameter 6 is changed. Each time the c is changed, click on the ULC RR button. ULC RR will provide a changing recognition rate associated with (OVR-G LF, COVER RT) versus the changing e. Designers can thus observe the tendency of the evolving recognition rate. CLEAR UCL Clear the UCL RR buffers. ACOU PDF The p.d.f.’s of acoustic evaluations will be drawn on ROC-WIN DOW ac- cording to f(:rf,0) and f(a:f,1) in Eq. (3.47) and (3.48) respectively. EXIT Exit the ROC Design Tool. PRM #1, #2 and #3 The ROC Design Tool allows three adjustable parameters fro designing 3. LM. The setting for each parameter associates with a text field to briefly describe the parameter, a numeric field for displaying its current value, two numeric fields showing its left and right bounds, and a slider between them to adjust the current value between the left and right bounds. 73 When all of the three settings are ready, the designer may click on the APPLY LM button to activate the EVL-LM block. APPLY LM When this button is clicked, CTRL-PAN EL collects the the numeric fields for three parameters and activates EVL-LM. As mentioned above, if the first numeric field is empty, EVL-LM will be notified by an initialization mode rather than activation mode. CLEAR LM Clear the buffer LM-BUFFER. All makers for LMs on ROC plane will be erased. LM Locater Designers may choose from the following six types of markers: ro : red circle; rx : red X; r+ : red crosshair; go : green circle; g+ : green X; gx : green crosshair. 
NLK PDF
    The histograms of the grammar-level evaluation results (the grammar likelihoods) will be drawn on ROC-WINDOW. EVL-LM is required to write the grammar likelihoods to the ASCII files nlkcv.dat and nlkog.dat when evaluating C_M and O_M, respectively.

MATLAB CMD
    Designers may issue MATLAB commands through this field. A return key is needed to activate the command.

Chapter 5

Implementation and Design Cases

The main purpose of the proposed ROC-LM Design Tool is to quantitatively describe the interactions between the performance of the LM and the recognition environment, which includes the acoustic performance and the complexity of the TL. To demonstrate the proficiency of the ROC-LM Design Tool in achieving these goals, four types of LMs, two simulated TLs, and two acoustic environments are used in this chapter. Using ROC-LM, the trade-off between coverage, C_M, and over-generation, O_M, for each underlying LM, with the recognition rate and computational cost as its concerns, can be "seen" on the graphical screen of the tool. In addition, the issues of computational cost and of adaptive LMs can also be shown visually using this tool.

5.1 The Acoustic Environments

5.1.1 Acoustic Settings

The acoustic environment refers to the settings of the first seven parameters in the ESM-RR block of the ROC-LM Tool. Let A denote these seven parameters modeling one set of acoustic evaluation performance:

    A = { μ_a,0, σ_a,0, P'_0, μ_a,1, σ_a,1, P'_1, ε }.   (5.1)

The first six parameters in A are obtained from the histograms of the acoustic evaluations on the correct and incorrect evaluation sets. In this research, these acoustic evaluations employ the following methods.11

1. Mel-Cepstrum
The feature extraction task for these acoustic evaluations is based on the mel-cepstral parameters. Twenty mel-frequency components are desired on the Nyquist range 0-8 kHz (the sampling rate is 16 kHz).

2.
Vector Quantization
Ten features are selected to compose a vector representing each frame of the speech signal, including eight mel-based cepstral parameters, power, and differenced power. The vector quantization codebook contains a set of 128 vectors which provide the minimum average distance between a given training set of analysis vectors and the codebook entries.

3. HMM Training
The six-state Bakis model [3] is used for the word model. Further, additional constraints disallow jumps of more than two states. The forward-backward (or Baum-Welch) re-estimation algorithm [3], [25], [26] is adopted for training.

11These methods have been developed and verified in the Speech Processing Lab of Michigan State University by Chiu [24], 1992. For more details on features and techniques used in speech recognition, see [3].

5.1.2 TI-46 and TIMIT Databases

Two speech corpora are employed in this study:

1. TI-46 Word Speech Database, the Speaker-Dependent Isolated Word Corpus (TI-46), which contains a corpus of isolated words designed and collected at Texas Instruments (TI) in 1980 [27], [28];
2. TIMIT (Texas Instruments - MIT) Speech Database, which provides continuous speech data for the acquisition of acoustic-phonetic knowledge [29].

The TI-46 corpus contains 16 speakers and 46 isolated words per speaker. In comparison, the TIMIT corpus is a continuous speech database containing 630 speakers from eight major dialect regions, with a vocabulary size of 6,100. Obviously, the former has better acoustic performance than the latter. The figures that follow show the experimental histograms of the evaluations under the correct and incorrect hypotheses.

The seventh parameter, ε, represents the penalty assigned to an HMM word model that fails to find a valid path for the input acoustic symbol string. Initially, ε is arbitrarily set to 10. The ROC-LM Design Tool provides a function to adjust the parameters in the settings of the acoustic environment.
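The vector-quantization step described above amounts to a nearest-codebook search per frame. The sketch below is illustrative only: the two-dimensional codebook and squared-Euclidean distance are my simplifications of the actual system's 128 ten-dimensional codebook vectors.

```python
# Illustrative nearest-codebook vector quantization, a sketch of the
# VQ step in Sec. 5.1.1. The tiny 2-D codebook here is hypothetical.

def quantize(frame, codebook):
    """Return the index of the codebook vector closest to `frame`
    under squared Euclidean distance."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))

if __name__ == "__main__":
    codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
    print(quantize((0.9, 1.2), codebook))  # → 1
```

Each speech frame thus becomes a single codebook index, producing the discrete symbol string that the word HMMs evaluate.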
Refer to the description of the button ULC R.R. in the previous chapter. The seven acoustic parameters for TI-46 and TIMIT are:

    A_TI-46 = { 5.20, 1.80, 0.01, 0.85, 1.05, 0.97, 20.1 };   (5.2)
    A_TIMIT = { 4.82, 2.55, 0.03, 0.85, 0.92, 0.93, 6.76 }.   (5.3)

5.2 Introducing the TLs

Two TLs are considered in this work: English word-spelling rules and an artificial digit-string (ADS) language. The ADS language is defined below. The first language is used in the alphabet recognizer, which is considered an important application for an IWR system. In alphabet recognizers, a letter is treated as a "word" and a word is a "sentence." In order to eliminate impossible concatenations of two letters (for example, the letter "j" is never followed by the letter "g"), an "LM" such as a bigram is usually employed to represent the spelling rules and predict the next letter, thus improving the recognition. The source of the words (sentences) used for training or testing the LM is the sports column of the San Jose Mercury News. Repeated words are allowed, which means some "sentences" occur frequently.

The second TL, the ADS language, is suggested in [3, Ch. 10, Problem 10] as a pedagogical language example. The first reason for using the ADS language is that it is quantitatively well defined from the bottom lexical level through the pragmatic level, such that the size and generating rules on each level can be exactly calculated by the ROC-LM Design Tool. This is important because, by definition, the O_M is the percentage of the LM belonging to the TL. Therefore, given a sentence randomly generated from the LM, ROC-LM requires an evaluation model to determine whether or not this sentence is in the TL, thus obtaining the O_M rate. For spoken English or English spelling rules as the TL, there are no such specific rules with which to carry out this evaluation. Although the O_M can be estimated from the estimate of |LM|, Eq. (3.26), it is still more precise if the O_M can be obtained directly from evaluating the LM.
The following is the definition of the ADS language [3]:

Consider a "language" for which the symbols are 0, 1, 2, ..., 9, +, -. The formal components of the language are as follows:

1. Lexical. A "word" consists of any number of digits concatenated by any sequence of + and - signs in any order. The first or last symbol in a word is never a sign. The total numerical value of a word must be nine. For example, 4+5-0 is a word in the language.

2. Syntactic. A "sentence" is any sequence of words such that any word ending in an odd digit must be followed by a word beginning in an even digit, and vice versa. For example, 3+6+4/7+3-1/2+0-5+8+4 is a proper sentence.

3. Semantic. To be "meaningful," a sentence must contain words that become increasingly longer. For example, 9/2+7+1-1/4+5-5+5+0-0+2-1-1 is a meaningful sentence.

4. Pragmatic. When "speaking" to a child, each word in a sentence is usually no more than five digits long. There are no constraints for adult listeners.

5.3 LM Design Cases

The following are three typical examples of designing LMs. These three cases are used to verify the trade-off theory and demonstrate usage of the ROC-LM Design Tool. Note that there are in total 21 adjustable parameters on the ROC-LM Control Panel, and there is no specific design scheme and no guarantee of locating the optimal parameter settings. However, there is one simple rule for adjusting parameters: the most significant parameter for improving the recognition rate is the one that moves the LM on the ROC plane in the direction normal to the local contour curves.

5.3.1 Case I: Determining the Required Number of Sentences for Training an N-Gram for an Alphabet Recognizer

Problem Description

The first design case is to find the best n-gram (n = 2 or n = 3) and an appropriate number of training sentences for an alphabet recognizer. The alphabet recognizer is used for retrieving the spelling of a word, and is an important application of IWR systems.
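The ADS lexical and syntactic rules defined in Sec. 5.2 can be checked mechanically, which is what makes exact O_M evaluation possible for this TL. The sketch below is my reading of the rules: the "/" word separator and single-digit tokens are assumptions inferred from the book's examples, not stated in the source.

```python
import re

# Hypothetical membership test for the ADS "language" of Sec. 5.2.
# Assumes words are single digits joined by +/- and sentences use "/"
# as a word separator, as in the examples (e.g. 4+5-0).

def is_word(w):
    """Lexical rule: digits joined by signs, no leading/trailing sign,
    and a total signed value of nine."""
    if not re.fullmatch(r"\d([+-]\d)*", w):
        return False
    total = int(w[0])
    for i in range(1, len(w), 2):
        d = int(w[i + 1])
        total += d if w[i] == "+" else -d
    return total == 9

def is_sentence(s):
    """Syntactic rule: a word ending in an odd digit must be followed by
    a word beginning in an even digit, and vice versa."""
    words = s.split("/")
    if not all(is_word(w) for w in words):
        return False
    return all(int(prev[-1]) % 2 != int(nxt[0]) % 2
               for prev, nxt in zip(words, words[1:]))

if __name__ == "__main__":
    print(is_word("4+5-0"))        # → True
    print(is_sentence("4+5-0/9"))  # → True (a made-up two-word sentence)
```

A checker like this lets randomly generated LM sentences be classified as in or out of the TL, giving the O_M rate directly rather than via the |LM| estimate of Eq. (3.26).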
Although the vocabulary size in this case is as limited as 26, the recognition task is usually difficult due to the acoustic similarities among alphabet symbols. An n-gram is usually employed to provide the information associated with the spelling rules. The size of the training context for an n-gram is always a trade-off problem for the LM designer. Too few training sentences result in insufficient word-concatenation information and cause a low C_M; meanwhile, too many training sentences result in over-hypothesizing by the LM and cause a high O_M. In this design case, bigram, trigram and a combination of them, BiTri-Gram (mentioned in Chapter 3), are evaluated and compared on the ROC plane. The acoustic environment set-ups are obtained from the TI-46 database. For each n-gram, a trajectory is formed along with the variation of the total number of training sentences.

Parameter Settings

The following is a list of the initial settings required by the ROC-LM Control Panel. The acoustic parameter set-ups use A_TI-46. The ε and w are initialized as listed below. Since w = 0, the parameters μ_g,0, σ_g,0, μ_g,1 and σ_g,1 will not affect recognition rates and are arbitrarily initialized. The sentence length L is set as the average "word" length in the testing context. The TL size is assumed to be 1,000. Finally, the initial scopes of the C_M and O_M are both set to (0.0, 1.0) to obtain a gross view of the ROC contour plot.

    MU_A0     μ_a,0    5.20      MU_A1     μ_a,1    0.85
    SIGMA_A0  σ_a,0    1.05      SIGMA_A1  σ_a,1    1.80
    P'_A0     P'_0     0.01      P'_A1     P'_1     0.97
    EPSILON   ε        -10.00    WEIGHT    w        0.00
    MU_H0     μ_g,0    1.00      MU_H1     μ_g,1    1.00
    SIGMA_H0  σ_g,0    1.00      SIGMA_H1  σ_g,1    1.00
    SNT LEN   L        8         TL SIZE   |L_T|    1000
    COVER LF  c_left   0.00      OVR-G LF  o_left   0.00
    COVER RT  c_right  1.00      OVR-G RT  o_right  1.00

Design Procedures

1. First, we may click on the button ACOU PDF to show the two Gaussian p.d.f.'s of the acoustic-level evaluations (Fig. 5.12).

2. After the button CONTOUR is clicked, a contour plot is shown on the ROC-WINDOW.
Since the performance of the acoustic evaluations is almost perfect, the recognition rates are dominated by C_M, resulting in a set of flat ROC contour curves.

3. Next, a "middleman" program, bigram_train, is run; it conveys two parameter settings, the number of training sentences n and the random seed r, from the ROC-LM Control Panel to the language model bigram, then prepares a training context for the LM evaluation program bigram. Note that bigram_train -init will bring up the description and initial settings for n and r.

4. By clicking on the button APPLY LM, a circle appears at (0.994, 0.482) on the ROC plane, which means O_M is 0.994 and C_M is 0.482. So we change the range of the ROC plane by moving COVER LF to 0.44 and OVR-G LF to 0.992.

5. Then the slider PRM #1, representing n, is moved through the following sequence, and APPLY LM is clicked each time n is changed:

    100 → 190 → 280 → 370 → 460 → 550

A trajectory is made on the plane, as shown in Fig. 5.13.

6. Then the random seed r is changed to show how sensitive the bigram is when a different training context is applied. PRM #2 is moved from 177 to 1997, and the LM Locator is switched from "red circle" to "red X." Next, PRM #1 is moved through the above sequence backward. Another trajectory is made, as shown in Fig. 5.13.

7. The same procedure is applied for the program trigram_train, except that PRM #1 is changed from 100 to 1,000. Results are shown in Fig. 5.14.

Results and Discussion

The comparison between the bigram performances and trigram performances is shown in Fig. 5.15. It is observed that the area the trigram occupies spreads around

[Figure 5.12 appears here: the tool screen with the two acoustic-evaluation p.d.f.'s plotted in the ROC WINDOW.]

Figure 5.12: By clicking on the button ACOU PDF, the two Gaussian p.d.f.'s of acoustic-level evaluations are shown on the ROC-WINDOW.
the southwest of that of the bigram. If the size of the training context is limited to, for instance, 550 sentences, the bigram has a better recognition rate (around 72%) than the trigram (around 54%) due to its higher C_M. The trigram always has a lower O_M, which could benefit the recognition rate if the training context were unlimited and the trigram's C_M reached the same height as the bigram's. However, in this case, the benefit is not promising because of the flat contour plot, which implies that the C_M dominates the recognition rate.

[Figure 5.13 appears here: the bigram trajectories on the ROC plane.]

Figure 5.13: The bigram performance when the slider PRM #1, representing the number of sentences in the training context, is moved from 100, 190, through 550, with PRM #2, the random seed, at 177. A trajectory marked by circles is formed. The random seed is then changed to 1997, and PRM #1 moves backward from 550 to 100. Another trajectory, marked by "X," is made.

[Figure 5.14 appears here: the trigram trajectories on the ROC plane.]

Figure 5.14: The trigram performance when the slider PRM #1, representing the number of sentences in the training context, is moved from 100, 190, through 1,000, with PRM #2, the random seed, at 177. A trajectory marked by circles is formed. The random seed is then changed to 1,997, and PRM #1 moves backward from 1,000 to 100. Another trajectory, marked by "X," is made.
Therefore, it can be predicted that if the contour plot rotates counter-clockwise, which implies that OM is relatively important, the trigram may perform better than the bigram due to its lower OM. Concerning the random seed, which represents a different choice of training context, it is shown that the smaller the training context, the more sensitive the model is to the selection of that context. For example, there is about a 4.5% difference in recognition rate for a bigram using different training contexts of size 100. When the size is increased to 370, the difference is within 1%.

5.3.2 Case II: A Hybrid Model of Bigram and Trigram and Concerns About Computational Cost

Problem Description

At present, n-gram models, especially bigram and trigram models, are recognized as effective LMs and have been widely used in speech recognition systems. As mentioned in Chapter 3, the ROC-LM Design Tool is installed with a hybrid of bigram and trigram, the BiTri-Gram. In this case, the same recognition task, an alphabet recognizer, and the same recognition set-ups are used to show the performance of the BiTri-Gram.

Figure 5.15: The comparison of the bigram and trigram performances.

Computational cost is often an important consideration in many applications. In this research, since the stack size is assumed to be infinite, the computational cost is reasonably assumed to be proportional to the size of the LM, that is, the total number of hypothesized sentences.
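The BiTri-Gram's scoring and pruning can be sketched as follows: where a trigram was observed in training, the score is a weighted combination of bigram and trigram log-probabilities; where only the bigram exists, the bigram term is adjusted by a penalty; a threshold then controls how many hypotheses survive, and hence |LM|. The function names, the probability floor, and the sign convention of the penalty are assumptions for illustration; the exact BiTri-Gram definition is given in Chapter 3.

```python
import math

def bitri_score(sentence, bigram_p, trigram_p, weight, penalty):
    """Hypothetical BiTri-Gram log-likelihood: a linear combination of
    bigram and trigram log-probabilities where the trigram was seen in
    training, and the bigram score with a penalty where it was not."""
    score = 0.0
    padded = ('<s>', '<s>') + tuple(sentence)
    for i in range(2, len(padded)):
        w2, w1, w = padded[i - 2], padded[i - 1], padded[i]
        bi = bigram_p.get((w1, w), 1e-9)  # floor for unseen bigrams
        tri = trigram_p.get((w2, w1, w))
        if tri is not None:
            score += weight * math.log(bi) + (1 - weight) * math.log(tri)
        else:
            # sign chosen so missing trigrams lower the score; the
            # dissertation's convention may differ
            score += math.log(bi) - penalty
    return score

def generate_lm(hypotheses, threshold, **kw):
    """The threshold controls |LM|: only hypotheses scoring at least
    -threshold survive, so a very large threshold keeps every sentence
    the bigram part admits."""
    return [h for h in hypotheses if bitri_score(h, **kw) >= -threshold]
```

With the threshold large enough, every bigram-admitted hypothesis survives regardless of the weight and penalty, mirroring the text's observation that the LM then "stays" at the same location on the ROC plane.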
In this design case, the BiTri-Gram uses a parameter, the threshold, to control the size of the generated LM. Suppose the threshold is set to ∞; then the size of the BiTri-Gram is exactly the size of the bigram, and their CMs are identical too. The reason is that when both the bigram and the trigram exist in the training context, the likelihood of the BiTri-Gram is a linear combination of the bigram and trigram likelihoods; when the bigram exists but the trigram does not, the BiTri-Gram likelihood equals the bigram likelihood plus a penalty. The same CM and the same LM size imply the same OM. In conclusion, when the threshold is infinite, no matter what values the weight (between bigram and trigram) and the penalty take, the CM and OM are always the same, and the LM "stays" at the same location on the ROC plane. Nevertheless, the p.d.f.'s of the grammar likelihoods, that is, μ_{g,0}, σ_{g,0}, μ_{g,1} and σ_{g,1}, may change, and the recognition rate may benefit from more "distinguishable" p.d.f.'s of the grammar likelihoods. Although the CM and OM remain the same, the recognition rate contour plot changes; a trajectory can be made by transforming the previous locations of the LMs to their new locations relative to the new contours. Refer to the discussion of the "relative motion of LM" in Chapter 3.

Design Procedures

1. As in Case I, ATI-46 is employed. According to the result of the previous case, 550 and 1,997 are chosen as the size of the training context and the random seed, respectively.

2. The hybrid of bigram and trigram, BiTri-Gram, is run with its three controllable parameters: the weight wg, the penalty pg, and the threshold tg, which are defined in the previous chapter. As before, after initialization and rescaling, the location of the BiTri-Gram and the contour plot are shown in Fig. 5.16. Notice that the newly estimated set { μ_{g,0}, σ_{g,0}, μ_{g,1}, σ_{g,1} } (call it the G-set) is directed to its corresponding field on the ESM-RR block: G-set = { 5.61, 1.44, 1.70, 1.71 }.

3. Now PRM #1, wg, is moved from 1.0 to 0.66.
After clicking on APPLY LM, the G-set changes to { 4.17, 0.97, 1.46, 1.16 }, but the CM and OM remain the same.

4. This is the illustration of the "relative motion." Since the G-set has changed, the contour plot needs to be re-estimated. By clicking on CONTOUR, the new contour plot is drawn and the "old" LM is transformed to a "relative position" with respect to the new contour plot. See Fig. 5.17.

5. It is shown that as wg moves from 1 toward 0, the recognition rate decreases, indicating that wg = 1 is the best choice. Then, the penalty pg is considered. By the same scheme, pg = 14.0 is found to optimize the recognition rate.

6. The contour plot of |LM| is drawn on the ROC plane (dashed lines); |LM| is given by Eq. 3.25. A BiTri-Gram is employed, starting with wg = 1.0, pg = 14.0, tg = 100.0 (representing ∞). Suppose there is an upper bound, say 2 × 10^6 |LT|, on the computational cost. As shown on the ROC plane, the BiTri-Gram does not meet this requirement. See Fig. 5.18.

7. Now the threshold tg is moved to reduce the OM. Move tg from 16.0, 13.2, 10.4, 7.6 through 6.2. It can be observed that the BiTri-Gram moves toward the criterion line and crosses it at tg = 7.6, as shown in Fig. 5.18. However, the price for the computational improvement is the decrement of the recognition rate from 86.2% to 83.4%.

Results and Discussion

To start with, concerning the hybrid of bigram and trigram and the relative motion: the experimental results show that when wg and pg are modified with tg = 100 (∞), the CM and OM remain the same while the G-set changes. Fig. 5.17 shows that when the G-set is changed, all contour curves propagate toward the northwest, and relatively the LM moves toward the southeast, which is the opposite of the desired direction.
Also note that since the whole contour plot does not rotate much, the three LMs lie on a straight line, which implies that each time the G-set changed, the recognition rates at each grid point decreased by the same amount, so that the "shapes" of the contour plots do not differ much.

Secondly, when the threshold tg is finite, it can be observed from Fig. 5.18 that decreasing tg decreases both the CM and the OM. In this case, since the tangent vector of the contour curves is flatter than the moving direction of the LMs as tg decreases, we conclude that decreasing tg results in a decrease of the recognition rate. It is interesting to note that, if the contour curves were steeper than the moving direction, a decrease in tg would improve the recognition rate. Also note from Fig. 5.18 that there is a big drop from the LM labeled "[2]" to the LM labeled "[3]." At this point, the moving direction is almost normal to the contour curves, which means the recognition rate is sacrificed for only a small improvement in computational cost.

Figure 5.16: After initialization and rescaling, the location of the BiTri-Gram and the contour plot are shown. Notice that the newly estimated G-set = { 5.61, 1.44, 1.70, 1.71 } is directed to its corresponding field on the ESM-RR block.

Figure 5.17: When the new contour plot is drawn, the old LM (marked "X") is transformed to a relative position with respect to the new contour plot.

Figure 5.18: Move the threshold tg from 16.0, 13.2, 10.4, 7.6 through 6.2. The BiTri-Gram moves toward the criterion line and crosses it at tg = 7.6. However, the price for the computational improvement is the decrement of the recognition rate from 86.2% to 83.4%.

5.3.3 Case III: Adaptive Language Modeling

Problem Description

In recent years, adaptive LMs have become popular among speech processing researchers [20], [23] because of their ability to adapt to the style or topics of a document. Assume that from a partial document, denoted by h, h ∈ LT, an adaptive unigram, bigram, or trigram distribution can be written as pd(w|h), pd(w'-w|h), or pd(w''-w'-w|h), respectively, where d denotes the dynamic model, w is any word from the vocabulary, w' is the previous word, and w'' is the word before that. The problem encountered in adaptive modeling design is that h usually does not contain enough sentences for training. Therefore, the "static" model is still needed to supply more information about word concatenations.
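A common way for the static model to supply what the sparse dynamic estimate pd(w|h) lacks is linear interpolation, as in the adaptive-LM literature cited above; the weight `lam`, the function names, and the count-based dynamic estimate below are illustrative assumptions, not the dissertation's exact scheme.

```python
def adaptive_prob(w, prev, dyn_bigrams, dyn_totals, static_prob, lam=0.5):
    """Interpolate a dynamic bigram p_d(w | prev), estimated by counts
    from the partial document h, with a static model trained on a large
    corpus.  For contexts unseen in h, the dynamic term is zero and
    only the static term contributes."""
    if dyn_totals.get(prev, 0) > 0:
        p_dyn = dyn_bigrams.get((prev, w), 0) / dyn_totals[prev]
    else:
        p_dyn = 0.0
    return lam * p_dyn + (1 - lam) * static_prob(prev, w)
```

Raising `lam` trusts the local context more, which is exactly the trade-off this design case sweeps by varying the sizes of the static and local training sets.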
Thus, how many training sentences from the static model are needed to optimize the performance again involves a trade-off problem. The design goal of this case is to find the optimal size of the training context for training the static model. The BiTri-Gram is still employed in this case, except that it is trained in a different way.

There are two sets of training sentences. One is the local context, TrainSet II, which is generated according to the pragmatic rules and is the one the LM intends to adapt to. The other is the static context, TrainSet I, which is generated by semantic rules only. The bigram part inside the BiTri-Gram is trained by both TrainSet I and TrainSet II, and the trigram part is trained by TrainSet II only. The ADS language, which is well defined from the lexical level to the pragmatic level, is employed to demonstrate how an adaptive LM benefits from the knowledge and classification of the TL.

Design Procedures

1. ATIMIT is employed. The sentence length of the TL, SNT LEN, is set to three. The TL size is set to 415, which is counted by the program according to the pragmatic generating rules of the ADS language.

2. The size of TrainSet I and the size of TrainSet II are the two controllable parameters of the program adaptive_train, which first trains the BiTri-Gram with TrainSet I and TrainSet II, then evaluates its CM and OM on TestSet II, which is generated according to the pragmatic rules.

3. PRM #2, representing the size of TrainSet II, is set to 0 at first. Then, PRM #1, the size of TrainSet I, is changed from 1000, 1900, 2800, 3700, 4600 through 5500. A trajectory is drawn in Fig. 5.19.

4. Then, PRM #2 is changed to 400 and the above sweep is repeated. Another trajectory is shown on the left side of the previous one. On this trajectory, when the size of TrainSet I is 3700, the recognition rate reaches its maximum.

Results and Discussion

From Fig. 5.19, with the assistance of TrainSet II, the whole trajectory is shifted toward the northwest.
Not only the CM but also the OM is improved, thus improving the recognition rate as well as the computational cost. Note that the first trajectory stands for a pure bigram. This experiment shows how knowledge from the pragmatic level can benefit recognition performance, and this is what adaptive LMs take advantage of to improve recognition.

Figure 5.19: PRM #2, representing the size of TrainSet II, is set to 0 at first. Then, PRM #1, the size of TrainSet I, is moved from 1000, 1900, 2800, 3700, 4600 through 5500. Then, PRM #2 is changed to 400 and the sweep is repeated. Another trajectory appears on the left side of the previous one. On this trajectory, when the size of TrainSet I is 3700, the recognition rate reaches its maximum.

Chapter 6

Conclusions and Future Work

6.1 Summary and Concluding Remarks

Converting a LM problem into a design task arises from the trade-off between coverage and over-generation, paralleling the hit rate and false-alarm rate in detection theory. Currently successful approaches to modeling language, such as n-gram models, context-free grammars, etc., cannot formalize this trade-off. The main objective of this research was the development of a systematic scheme for the design of a LM. The idea of the design scheme originates from ROC analysis in detection theory. Analogous to ROC analysis, the performance of a speech recognition task, in terms of recognition rate and computational cost, is expressed on a plot of coverage versus over-generation, resulting in a performance contour plot. On the other hand, a
LM can be parameterized by coverage and over-generation and occupies a spot on the two-dimensional plot. Thus, by adjusting certain parameters in the LM, designers can readily observe the motion of the spot and can "drive" the spot in the direction of better performance.

An X-window software package called the ROC-LM Design Tool was created to implement the proposed design scheme. To demonstrate how this tool functions, several standard design cases for LMs were performed. Simulation results show that the design tool is useful for both design and comparison purposes. The recognition rates can be calculated in virtually real time, enabling designers to observe the influence of each factor in the LMs on the resulting recognition rates and thereby achieve their design goals.

6.2 Contributions

The main theme of this study is the application of detection theory and the ROC technique to the speech recognition task. Emulating the methodologies developed in detection theory, the recognition rate is estimated from given prior knowledge, such as the statistics of the acoustic and grammatical evaluations, the length of uttered sentences in words, and the size and statistics of the TL. As a consequence, a general and important contribution of this work is that the overall recognition behavior is modeled by mathematical formulations. The goal of this work has been to open the door to new research directions. Other contributions of this research are summarized as follows:

1. Mathematical definition of all language modeling tasks.

2. Demonstrative evidence of the influence of all well-defined quantities on the recognition rate, given that the speech recognition process is mathematically modeled.

3. Implementation of the ROC-LM Design Tool to demonstrate the proposed design scheme.

6.3 Future Work

In further study, first, the CSR problem must be solved. The known-boundary assumption is very restrictive for current speech recognition systems.
A rough introductory idea for mathematically formalizing the CSR case is addressed in Appendix A.

Secondly, one of the significant aspects of this research is that a totally new viewpoint on the language modeling problem has been proposed. It is interesting to explore further measurements employed in detection theory, for instance detectability and sensitivity, and to apply them to the speech recognition task.

Thirdly, the adaptive LM has demonstrated its powerful ability to hone in on the particular sublanguage used in each article. The definition of a sublanguage requires certain prior knowledge, which should be classified as pragmatic knowledge. The tendency in speech recognition tasks is to employ not only phonetic and grammatical knowledge, but also higher levels of knowledge, such as those at the pragmatic level. It is challenging and interesting work to integrate the whole recognition behavior from the bottom (acoustic) level, through the grammar and semantic levels, to the pragmatic level, by utilizing the well-developed techniques of detection theory.

Finally, since the recognition performance has been formalized, ANN (Artificial Neural Network) engineers would be interested to know that the optimization function which the neurons seek to optimize has been well defined. The proposed idea derives from the fact that the ROC design tool cannot really find an optimal LM for a given recognition environment, but it can find the optimal values for the undecided parameters of an existing language model. Neural networks, which have very flexible structures with many undetermined weights and are equipped with many efficient training schemes, are the best candidates to optimize as initial models. In summary, research topics for future work are listed below:

1. Solution of the CSR problem.

2. Exploration of more techniques from detection theory to apply to language modeling.

3.
Integration of recognition behaviors, from the acoustic level through the pragmatic level.

4. Integration with Artificial Neural Networks.

Appendix A

The Case of Continuous Speech Recognition

To generalize the IWR system to the CSR system, some information about the segmentation performance must first be retrieved from experiments. From the experimental statistics, a stochastic model is created and the estimated recognition rate, which depends on the segmentation results, can be derived for the CSR system. Suppose a known-boundary acoustic symbol string, a = a_1 ... a_{b_1}, a_{b_1+1} ... a_{b_2}, ..., a_{b_{L-1}+1} ... a_{L_a}, is uttered, where L_a is the length of the acoustic symbol string. Let b_1, ..., b_{L-1}, with 1 < b_1 < b_2 < ... < b_{L-1} < L_a, denote the (L - 1) word boundaries, where L is the length of the sentence in words. Then, the segmentation performance can be measured by evaluating the acoustic likelihoods under all possible segmentation hypotheses. Define d as the difference between the true boundary and the hypothesized boundary. The p.d.f. of the acoustic score (likelihood) can be obtained by varying d. In the CSR case, the sentence length L is unknown, so every possible L is hypothesized before hypothesizing boundaries.

As a consequence, in the CSR case, the total number of possible hypothesized sentences is no longer |V|^L as in the IWR case but sum_{L=1}^{L_max} |V|^L, where L_max is the maximum sentence length. For each uttered sentence of length L, there are C(L_a - 1, L - 1) segmentation hypotheses in total.^12 Therefore, theoretically the recognition rate can be calculated from the total number and the p.d.f.'s of all hypotheses. In conclusion, the ROC language model design tool is feasible for the CSR case, but it requires a great amount of computation. More theory and experiments are needed to simplify the calculations. We leave this task to future work.

^12 This is computationally prohibitive unless done in a systematic way, such as the use of an HMM search algorithm.
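The hypothesis counts above can be checked numerically. The first function implements sum_{L=1}^{L_max} |V|^L directly; the binomial segmentation count assumes the boundaries are chosen as L - 1 of the L_a - 1 interior positions, which is this sketch's reading of the count, and the function names are illustrative.

```python
from math import comb

def total_sentence_hypotheses(vocab_size, l_max):
    """Total hypothesized sentences in the CSR case:
    sum over L = 1 .. L_max of |V|**L."""
    return sum(vocab_size ** L for L in range(1, l_max + 1))

def segmentation_hypotheses(l_a, l):
    """Ways to place L - 1 word boundaries among the L_a - 1 interior
    positions of an acoustic string of length L_a (an assumed reading
    of the segmentation count)."""
    return comb(l_a - 1, l - 1)
```

For the 26-letter alphabet task, L_max = 2 already gives 702 sentence hypotheses, which illustrates why footnote 12 calls exhaustive evaluation prohibitive without a systematic search.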
Bibliography

[1] J.P. Egan, Signal Detection Theory and ROC Analysis, New York: Academic Press, 1975.

[2] I. Selin, Detection Theory, Princeton: Princeton University Press, 1965.

[3] J.R. Deller, Jr., J.G. Proakis, and J.H.L. Hansen, Discrete-Time Processing of Speech Signals, New York: Macmillan, 1993.

[4] N. Chomsky, "On certain formal properties of grammars," Information and Control, vol. 2, pp. 137-167, 1959.

[5] M. Weintraub, H. Murveit, M. Cohen, P. Price, J. Bernstein, G. Baldwin, and D. Bell, "Linguistic constraints in Hidden Markov model based speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 699-702, 1989.

[6] H. Murveit and R. Moore, "Integrating natural language constraints into HMM-based speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 573-576, 1989.

[7] Y.L. Chow and S. Roukos, "Speech understanding using a unification grammar," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 727-730, 1989.

[8] K. Kita, T. Kawabata, and H. Saito, "HMM continuous speech recognition using predictive LR parsing," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 703-706, 1989.

[9] K.Y. Su, T.H. Chiang, and Y.C. Lin, "A unified framework to incorporate speech and language information in spoken language processing," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-185-I-188, 1992.

[10] J. Kupiec, "Hidden Markov estimation for unrestricted stochastic context-free grammars," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-177-I-180, 1992.

[11] L. Bahl, F. Jelinek, and R.L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-5, pp. 179-190, 1983.

[12] J.M. Baker, "DragonDictate-30K: Natural language speech recognition with 30,000 words," Proceedings of Eurospeech-89, pp. 161-163, 1989.

[13] H. Cerf-Danon, S. De Gennaro, M. Ferretti, J. Gonzalez, and E. Keppel, "TANGORA - A large vocabulary speech recognition system for five languages," Proceedings of Eurospeech-91, pp. 183-192, 1991.

[14] G. Maltese and F. Mancini, "An automatic technique to include grammatical and morphological information in a trigram-based statistical language model," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-157-I-160, 1992.

[15] P. Garcia, E. Vidal, and F. Casacuberta, "Local languages, the successor method, and a step towards a general methodology for the inference of regular grammars," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 841-845, 1987.

[16] H. Rulot, N. Prieto, and E. Vidal, "Learning accurate finite-state structural models of words through the ECGI algorithm," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 643-646, 1989.

[17] F. Casacuberta, "Some relations among stochastic finite-state networks used in automatic speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 691-695, 1990.

[18] N. Prieto and E. Vidal, "Automatic learning of structural language models," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 789-792, 1991.

[19] E. Giachin, "Automatic training of stochastic finite-state language models for speech understanding," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-173-I-176, 1992.

[20] R. Lau, R. Rosenfeld, and S. Roukos, "Trigger-based language models: A maximum entropy approach," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. II-45-II-48, 1993.

[21] M. Meteer and J.R. Rohlicek, "Statistical language modeling combining n-gram and context-free grammars," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. II-37-II-40, 1993.

[22] L.G. Miller and A.L. Gorin, "A structured network architecture for adaptive language acquisition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-201-I-204, 1992.

[23] S. Della Pietra, V. Della Pietra, R.L. Mercer, and S. Roukos, "Adaptive language modeling using minimum discriminant estimation," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-633-I-636, 1992.

[24] C.C. Chiu, Large-Vocabulary Continuous Speech Recognition Using Partitioned Graph Search, Ph.D. Dissertation, Department of Electrical Engineering, Michigan State University, East Lansing, 1993.

[25] K.F. Lee, Automatic Speech Recognition: The Development of the SPHINX System, Boston: Kluwer Academic Publishers, 1989.

[26] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, 1989.

[27] G.R. Doddington and T.B. Schalk, "Speech recognition: Turning theory to practice," IEEE Spectrum, pp. 26-32, 1981.

[28] T.B. Schalk, "The design and use of speech recognition data bases," Proceedings of the Workshop on Standardization for Speech I/O Technology, pp. 211-214, March 18-19, 1982.

[29] "Getting started with the DARPA TIMIT CD-ROM: An acoustic-phonetic continuous speech database," National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, 1988.