… then record δ_i^p(1), i + 1 ≤ t ≤ T.

The notations δ_i^p(·) and blike(i, t) indicate standard HMM probabilities which are defined in Appendix E. Three thresholds, λ_L, λ_l, and λ_u, are set in this algorithm to find acceptable intervals. λ_L is the minimum length of a candidate interval. λ_l is the lower-bound probability for any potential interval which starts at time i, i.e., if δ_i^p(1) > λ_l, then time i is a possible starting time of a potential interval in the observation sequence corresponding to the given word model. λ_u is the upper-bound likelihood for any promising interval; any interval with likelihood exceeding λ_u is excluded. Briefly, the vertex evaluation algorithm presented above works as follows: for every possible starting time i, i.e., δ_i^p(1) > λ_l, find the largest ending time t such that blike(i, t) < λ_u and (t − i + 1) > λ_L.

Figure 3.19: An example illustrating how phone models can be combined to form a large word model.

3.2.1.2 Refining Boundaries Following Vertex Evaluation

After executing the baseline evaluation algorithm above, many potential intervals for the extension of the path over x_p may be hypothesized. The likelihood for each surviving interval will usually be very close to λ_u. However, one can still tell which interval is most likely by inspecting the lengths of these potential intervals. The best result is the one which has the smallest likelihood per observation time. However, there is a problem: there will generally be many overlapping potential intervals detected by the baseline algorithm, so the space needed to store these results is large. A merging criterion presented below is used to build a stack which contains a relatively small number of entries. At the same time, a reasonable likelihood measure for the merged boundaries is described. Finally, the merged potential boundaries and their corresponding likelihoods are stored in the stack.

In designing a merging criterion and likelihood measure, two requirements are considered. One is that intervals with good likelihoods are retained after they are merged. The other is that the new likelihood resulting from the merger of boundaries must be reasonable, which means the merger does not decrease the likelihood to an unacceptable level (the length of the merged interval is usually increased).

A. Merging Criterion

Given two hypothesized word intervals B^1 and B^2 from the results of the baseline algorithm, let B^i_start and B^i_stop be the starting time and ending time of the hypothesized interval B^i, respectively. Then there are three possible cases for the relationships among the boundaries of B^1 and B^2 (a sketch implementing these cases follows the list):

1. One interval is contained in the other, say B^2 ⊆ B^1. Then the merged interval B is the same as B^1. This case is shown in Fig. 3.20 (a).

2. There is no intersection between B^1 and B^2, e.g., B^1_stop < B^2_start or B^2_stop < B^1_start. In this case, no merging occurs. This case is shown in Fig. 3.20 (b).

3. B^1 and B^2 partially intersect. Assume B^2_start ≤ B^1_stop < B^2_stop. If the boundary gap between B^1 and B^2 is less than λ_L, then the boundaries B^1 and B^2 can be merged as one interval B such that B_start = B^1_start and B_stop = B^2_stop. Otherwise, no merging occurs between B^1 and B^2, because the boundary gap between them is large enough to retain both as hypothetical intervals. This case is shown in Fig. 3.20 (c).
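To make the three cases concrete, the sketch below implements the merging rule in Python under stated assumptions: an interval is a simple (start, stop) pair of frame indices, lambda_L stands for the minimum-interval-length threshold λ_L, and, because the exact quantity compared against λ_L in case 3 is garbled in the source, the sketch compares the two ending times; that comparison is an assumption, not a detail taken from the original text.

```python
# Minimal sketch of the merging criterion above (not the author's code).
# Assumptions: an interval is a (start, stop) tuple of frame indices, and
# lambda_L is the minimum-interval-length threshold from the text.  The
# quantity compared against lambda_L in case 3 is assumed to be the gap
# between the two ending times; the original condition is unreadable.

def merge_intervals(b1, b2, lambda_L):
    """Return the merged interval, or None when no merging occurs."""
    (s1, e1), (s2, e2) = b1, b2
    # Case 1: one interval contains the other -> keep the larger one.
    if s1 <= s2 and e2 <= e1:
        return b1
    if s2 <= s1 and e1 <= e2:
        return b2
    # Case 2: disjoint intervals -> no merging.
    if e1 < s2 or e2 < s1:
        return None
    # Case 3: partial overlap.  Reorder so that b1 starts and ends first.
    if e2 < e1:
        (s1, e1), (s2, e2) = (s2, e2), (s1, e1)
    if (e2 - e1) < lambda_L:          # boundary gap small enough to merge
        return (s1, e2)
    return None                       # gap too large: retain both intervals
```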
Figure 3.20: Three possible cases of merging potential intervals B^1 and B^2, which are the resulting potential intervals from the baseline word-spotting algorithm: (a) B^2 ⊆ B^1; (b) no intersection between B^1 and B^2; (c) B^1 and B^2 partially intersect.

B. Likelihood Measure for Merged Boundaries

Suppose the intervals B^1, B^2, ..., B^n result from the merging process above. Then the following likelihood measure is used to calculate the likelihood of these merged boundaries. Let like(B^i_start, B^i_stop) be the likelihood of the resulting interval B^i. Then the likelihood is of the form:

like(B^i_start, B^i_stop) = blike(B^i_start, B^i_stop) + (duration likelihood)    (3.5)

where the duration likelihood is computed as follows: Since matching the word model M_p to some part of the observation sequence is essentially unconstrained, it is possible to expand or contract the subsequence assigned to a model so that it accounts for a large or a small part of the observation sequence. This problem is alleviated by adding duration constraints to the likelihood measure. A simple Gaussian duration model is used. Let P_p(D_i) be the probability density of the potential interval B^i with length D_i for the word model M_p. Then:

P_p(D_i) = (1 / (sqrt(2π) σ_p)) exp( −(D_i − D_p)^2 / (2 σ_p^2) )    (3.6)

where the mean and standard deviation for M_p are D_p and σ_p, respectively. These statistics are estimated from the training data. Then the "duration likelihood" is just the negative logarithm of P_p(D_i). The conceptual plot of the word-spotting results for the selected vertices (i.e., words) is shown in Fig. 3.21.

Figure 3.21: A conceptual plot of the word-spotting results for the selected vertices. The vertices represented by shaded circles are vertices selected through the graph partitioning approach.

3.2.2 Modified Stack Decoding

In conventional path-driven left-to-right search strategies, the evaluation of vertices takes place as they are encountered along paths. In the partitioned case, the objective is to eliminate this costly procedure by preselecting vertices for evaluation. Here, a modified left-to-right procedure is presented. This method involves some rather straightforward modifications of conventional stack decoding procedures [11], [38] to accommodate the unevaluated vertices.

The essence of the partitioning method is that only O(√N) vertices in G are selected for evaluation. This means that an alternative method must be available to "evaluate" an unselected vertex (one not in the selected set C). A simple and conservative procedure is used: the evaluation model is replaced by a time-duration model at each unselected vertex extension. When the probability of an unselected vertex extension is needed, we use, instead, the probability that the observation string corresponding to this unselected vertex extension is of a given length. These probabilities are estimated from the training data as described above. For unselected vertices, a simple Gaussian duration model is used, i.e.,
the prob- ability density for an unselected vertex. say .r,.. is of the form (3.6) where .r,, and M” are equivalent in these two discussions since model M” is resident at vertex rp. Thus, the duration model is incorporated into the partitioned graph search techniques. Ac- cordingly, a modified stack decoding algorithm for the partitioned graph search is as follows: 53 Modified Stack Decoding Algorithm {Initialization} Put the start vertex. says. in the stack to form a “null” partial path. {Recursionz Best-First Path Growing Loop} Take the top entry (the best partial path) off the stack. say X0,1-1 = 8,.L'1, - - - ..r1_1. If the best partial path is complete (i.e.. the end—of-path flag is “true”). Then, i. Output the path and increment the output hypothesis counter by one. ii. If the output hypothesis counter is equal to n. stop. Else For each possible vertex extension. say I). of the partial path (1'9” 33;): < Bitnrt) do i. If .r, is an unselected vertex. then a time-duration model is used. to update the likelihood (see below) ii. If .131 is a selected vertex, then either a whole-word HMM or a phonetic-baseform model is used (see below). iii. The pruning strategy discussed below is applied. It is clear that the stack itself is just a sorted list which supports the operations: take the best entry off the stack and insert new entries according to their likelihoods. The following must be contained in each stack entry: 1. The history of the partial path. 2. The partial path likelihood. 3. An end-of-path flag to indicate whether the entry is a complete path. 4. The beginning time in the observation sequence corresponding to the terminal vertex of the partial path. Each path extension is entered into the stack. its position determined by its likelihood. The stack is of finite length. say q. so that only the q most likely partial paths survive. The finite stack, therefore. effects the pruning operation called hard pruning in Chapter I. A second type of pruning occurs when a partial path. for which there is sufficient room in the stack. is deemed too unlikely to be viable and is removed. This process is called soft pruning. The best It paths are determined through the following process: At each iteration. paths are extended from the top of the stack down. The decoding is complete when 72 complete paths appear as entries in the top 72. entries of the stack. Likelihood updates during the extensions occur according to the following argu- ments: The method described below is based on the report by Venkatesh, et a1. [13] although differences exist in the present method from the one described there. Suppose the likelihood of a particular path through the graph G represented by the vertex string X = 131.12. - - - .17], is evaluated, and assume that the likelihood mea- sure is the probability of the occurrence of path X. given the observation string Y :2 y1,y2.~-,yT. Using Bayes rule. P(X I Y) = P(X.Y)/P(Y). Since P(Y) is 55 identical for all paths. it is sufficient to seek the path X' such that X' = arg n‘lXaX P(X.Y). (3.7) Let us assume that the best. partial path in the stack ranges over the vertices X114 = 33.1.. - - - ..r,_l. and that the best likelihood associates observations Yup] with this partial path. i.e.. t, — 1 maximizes — log P(X1,1_1.Y1,.) over all 1'. Then we wish to compute the likelihood associated with extending the path over vertex .1?) using observations Yr... where t 2 t1. 
Note that D P(Xl,l—l~Yl.t1—I ‘FI' Yt1.t) : P(Xl.I-l$Yl.f1-I)P(II'Yt1.t I Xl.l-19Yl.t1-l) (3'8) ' : P(X1.1_I,Y1..._1)P(.r1 I X1,(-1.Y1.z)P(Yt1.t IxthJ—DYlJI—l) = P1X1.1—1.Y1.c,—1)P(-FI I-Fl—1)P(Yt1.t 1131). We have made extensive use of the assumed independence of the observations in this derivation. We see that the likelihood extension is simple. The quantity P(.r( I 11-1) is the transition probability associated with the graph edge between 1.1-1 and .r), and HY,” I .171) is obtained form the vertex evaluation procedure described above, or. if an is not selected for evaluation. by inserting the likelihood that the duration of the sequence generated by the model at 1:1 is t — t1. This latter number is obtained from the duration density for the model. When the best 72. paths have been obtained. the question remains as to how to se- lect the optimal path. What remains in the stack are surviving paths which represent 56 a small subgraph of G. say G'. Depending on the scale of the problem. it is useful to “partition” G' using another set of vertices and repeat the procedures. or simply to search GI using the standard left-to-right method or some other ad hoc technique. Whatever the case. the new search problem will be very significantly scaled down with respect to the original problem. Using these methods. the search will converge to a solution very quickly. A grammatical approach is presented in next section to find the most likely path after the first-pass search. Note that the repartition of G' cannot be done in real time in the current research due to the limitations of the machine speed and memory space. Many other interesting search procedures may arise with future research. 3.2.3 Computation Savings from the First Pass The evaluation process comprises a vast majority of the computational effort of the search algorithm above. In fact, each attempt to extend a path by one vertex in- volves multiple evaluations of the vertex with respect to about T(T + 1)/2 separate substrings of the observation sequence. Y. where ‘T is the length of the complete observation sequence. In the case in which a HMM is present at the vertex, each of these “subevaluations” will require between 0(HT’). and (9(H2T’) operations (de— pending on the model structure), where H is the number of states in the HMM and T' denotes the substring lengthg. The other computations required in the search become 8It is not necessary to “start over.” but rather to supplement the information in the existing stack with further evaluations. 9Further improvement might be possible using an HMM evaluation method due to Deller and Snider [l2]. 57 insignificant in this light, and the computational expense is seen to be nearly directly proportional to the number of vertices actually evaluated. The partioning approach. therefore. can be seen to reduce this cost by a factor which is (Xx/T) with respect to a full search of G. It is difficult to quantify the benefit of the partitioning approach in terms of the cost a conventional left-to-right search. since the later is highly problem-dependent. For illustrative purposes. however. consider a language graph of N = 108 vertices. each representing a phoneme (e.g.. see [11]). If there are typically 30 phonemes in a sentence. then. on the average. we would expect to find about 33 x 105 vertices in a time slot. To evaluate only the first three time slots (2 1 word) would require about 107 vertex evaluations. 
Such an evaluation is 1000 times more expensive than the 104 vertex evaluations required in the partitioned case. and does not offer the pruning safety of using acoustical data which are distributed across the message in time. Not explicitly mentioned above is the very important fact that the partitioned search is more robust to impulsive or intermittent noise than conventional left-to-right search. The evaluation of the data. by focusing on the partitioned C-set of vertices, is distributed across the observations in time. This means that the search is less likely to be pruned in the presence of unmodeled noise. Also, the inherent advantage in this work is that there need be no special form of the underlying model for a symbol at a vertex. as long as some suitable measure of likelihood is computable for each vertex extension. 58 3.3 Second Pass: Optimal Solution It is important to keep in mind that the objective of the partitioned graph search is not necessarily to deduce a single best path through a language graph. Partitioned search provides a systematic way to divide and conquer the original large graph using a series of low complexity operations. In a large problem. the subgraph remaining in the stack after the first partition and search can be further partitioned and searched in a similar manner. The solution would be expected to rapidly converge. The it” partitioning recursion would require 0(gV1/2') evaluations. The goal in the second stage of the search is to find the most likely path in the remaining subgraph after the first stage of the search. In this work. we use the simple procedure of iterating over the existing subgraph with its inherent bigram grammar using the stack decoding algorithm described above. The number of resulting hy- potheses. n. is set to unity (“I-best” search). and all nodes are evaluated. Since the graph is very small, the computational effort involved in this procedure is not significant. In future applications. any number of procedures may be used in this “expensive” pass through the graph- Chapter 4 Implementation and Experimental Issues 4.1 The TIMIT Database To evaluate the performance of the partitioned graph search techniques. the TIMIT (Texas Instruments - MIT) speech database [29] is used. The TIMIT database has been designed to provide speech data for the acquisition of acoustic-phonetic knowl- edge and for the development and evaluation of automatic speech recognition systems. TIMIT contains a total of 6.300 sentences. 10 sentences spoken by each of 630 speakers from eight major dialect regions of the United States. Seventy percent of the speakers are male. and most speakers are Caucasian adults. There are of total 2.342 distinct sentences. which consist of two dialect “shibboleth” sentences (sa). 450 phonetically- compact sentences (sx). and 1890 phonetically-(Iiverse sentences (si). The database is divided into dialect regions: New England (drl). Northern (dr2), North Midland (dr3). South Midland (dr4). Southern (dr3). New York City (dr6), W'estern (dr7). and Army Brat (moved around) (dr8). The “sa” sentences are spoken across all speakers. they introduce an unfair bias for certain phonemes in certain contexts [21]. Therefore. :39 60 the two “sa” sentences are not used in the current experiments. The vocabulary size in this work is 6.100. Some relevant information about the use of TIMIT in this work is summarized in Table 4.6. Training set Test set No. of speakers 462 168 Dialect coverage 8 8 No. of male 326 112 No. 
of female 136 56 semen“ mm” gill-1338 :ix 11.7530 No. of sentences 3,696 1.344 Vocabulary size 4,891 2.373 No. of words 1 1.087 4.621 Table 4.6: Summary of the usage of the TIMIT database in this work. The TIMIT corpus includes time-aligned orthographic transcriptions. phonetic transcriptions. word transcriptions as well as speech waveform data for each sentence- utterance. The file structure for the TIMIT database is shown in Fig. 4.22. The TIMIT waveforms are sampled at 16 kHz. and the samples are 16-bit short integers which are in VAX/Intel byte order (least significant byte/most significant byte). Thus. it is required to swap the bytes in the samples to MSB/LSB order before they are used. Since the current experiments are performed on SUN workstations. the following command is used to skip the 1024-l)yte ASCII header and byte-swap the samples: dd if=input-file.wav bs=1024 skip=1 I dd conv=swab > output.file TIMIT employs a set of 62 phonetic labels. These labels. along with examples. are listed in Table 4.7. 61 Phone Example Phone Example Phone Example /iy/ bEEt ‘fl Ray lb/ Bob /ih/ Way Id/ Dad /eh/ Yacht /g/ Gag ley/ lhh/ Hay /p/ Pop lae/ aHead lt/ Tot /aa/ /el/ bottLE /1H—-I——~I-i—> (b) Figure 4.23: All the possible vertices with different phonetic labels for the word ask in a large graph which is similar to that of Fig. 2.1 (a). Each case represents a vertex in the large graph. Figure 4.24: The word ask in a small graph which is similar to that of Fig. 2.1 (b). 66 They often go out in the evening ldh-ey/ lao-f-tcl-t-ix-ng/ /gcl--ow/ /aa-q/ lix-n/ /dh-iy/ liy-v-n-ix-ng/ 1 honor my mom lq-aa/ /q-aa-nx-axr/ lm—aa/ /m-aa-m/ Never happen 11] life Starting . ln-eh-v-er/ lhv-ae-pcl-p-ix-er/ /1h-. -aa-fl node . . . . is thinner than I am Iih-z/ /th-ih-nx-er/ ldh.eh-n/ lay/ lac-m/ a wet boat lax/ lw-eh-tcl/ lb-ow-tcl/ Observation String y1y2y3 y4 y5y6y7y8y9y10y11y12y13 y14 y15y16y17y18 leyZO (Nor is she a wet boat) Figure 4.25: (a) An example language graph. (1)) The boundaries of the observation string are unknown. 67 store the planarity breaking arcs for future reference. Note that the planar subgraph. say G', must include all the vertices in the original graph G. On the other hand, let G(V, E) be the underlying undirected graph of G(V. A). then the undirected version of G(V, A) is the undirected graph formed by converting each edge of G(V.A) to an undirected edge and removing duplicate edges. The way to construct the graph G is as follows: A data file to store the set of sentences 2 is created (recall that each sentence begins with a dummy point 0). The sentences are stored in the data file one by one, and every word together with its phonetic transcriptions in a sentence are stored line by line in the data file according to the concatenation of the words in the sentence. After building up a data file using the method described above, the data file is read line by line from the top. Whenever a new line is read. one new vertex may be added to the partial graph (which has been built up to this point) if the phonetic transcription in this line is different from any one in the existing partial graph. At the same time, a new edge to the partial graph is added (recall that each edge represents the transition from word to word in sentences). Since the language graph is constructed in this fashion, a connected directed graph is obtained. However, the underlying undirected graph might not be planar. 
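As a rough illustration of the construction just described, the sketch below builds the directed language graph from a sentence file in Python. The file layout (one "word, phone-string" entry per line, a blank line between sentences) and all identifiers are assumptions made for the sketch; the thesis does not specify the exact file format.

```python
# Sketch of the language-graph construction described above (illustrative
# only).  Assumed file layout: one line per word, "WORD\tphone-string",
# with a blank line separating sentences; every sentence implicitly starts
# at the dummy start vertex 0.

def build_language_graph(lines):
    vertices = {}            # (word, phone string) -> vertex id
    edges = set()            # directed word-to-word transitions
    start_vertex, next_id = 0, 1
    prev = start_vertex
    for raw in lines:
        line = raw.strip()
        if not line:                       # blank line: sentence boundary
            prev = start_vertex
            continue
        word, phones = line.split("\t")
        key = (word, phones)
        if key not in vertices:            # a vertex per distinct
            vertices[key] = next_id        # word/pronunciation pair
            next_id += 1
        current = vertices[key]
        edges.add((prev, current))         # word-to-word transition edge
        prev = current
    return vertices, edges
```

Storing the edges in a set collapses repeated word-to-word transitions into a single directed edge, so the result is the connected directed graph described above; planarity of its underlying undirected graph is checked separately.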
The graph is generally too large to use the method presented in [16] for extracting the planarity breaking arcs in the graph. The following method is developed to extract the planarity breaking arcs and find the partition sets of the graph. The detailed process of the graph partitioning is described later in this chapter. Since the existence of the start vertex increases the possibility of nonplanarity. 68 we can temporarily remove the start vertex from the graph. However, all the edges incident to the start vertex can be brought back in the partitioning process. The reason why we can remove the start vertex first is that: According to the search algorithms developed in Chapter 3. the start vertex is always in the selected set C. and it is the first vertex to be evaluated on each path. Let us recall the partitioning algorithm for nonplanar graphs developed in Chapter 2. Any planarity breaking arc incident to a vertex in the set C can be brought back without violating the PST. Therefore, we can remove the start vertex first and discover the planarity breaking arcs for each component of the remaining graph. Note that the remaining subgraph might not be connected. Therefore, it might contain some connected components. If some component remains large. we can further discover the blocks10 in that component. A linear algorithm to identify the blocks of a graph is found in [4]. Since the size of each block in each component of the remaining subgraph is relatively small. we can then apply the method described in [16] to find the planarity breaking arcs for each component of the remaining subgraph. After this process. the partitioning procedure can be implemented. 4.2.2 Speech Modeling for Partitioned Graph Search Tech- niques 4.2.3.1 Cepstral Analysis The feature extraction task for this speech recognition system is based on the mel- \ 10A block is a graph which contains no_cut vertex. 69 cepstral parameters. The analysis in this research is based on a paper by Davis and Mermelstein [27], in which the mel-based cepstral coefficients. say 0... are calculated as below: '20 c7. = 2E.- cos[n(i —. )9 I =1 1 for n = l,2.---.I\’ (4.9) [V'— O 2:1 where E.- denotes the critical banal11 filter log energy outputs. and K is set to 8 in our implementation. Twenty mel-frequency components are desired on the Nyquist range 0 — 8 kHz (the sampling rate is 16 kHz). The critical-band filtering is then simulated by a set of twenty triangular bandpass filters as shown in Fig. 4.26, where ten mel- frequency components appear linearly on the Range 0 — 1 kHz and the remaining ten are distributed logarithmically on the range 1 — 8 kHz. 4.2.3.2 Vector Quantization One major issue in vector quantization (VQ) is the design of an appropriate code- book for quantization. The VQ codebook contains a set of L vectors which provide the minimum average distortion (distance) between a given training set of analysis vectors and the codebook entries. The most common measure of difference between two vectors, say c1 and C2, is the Euclidean distance (1(01. 0;) (suppose that there are K features in each vector): 1.- d(c1.c2) —_- 2W3) ——c2(i)I2. (4.10) i=1 11A critical band can be viewed as a bandpass filter whose frequency response corresponds roughly to the tuning curves of auditory neurons [76]. 70 Filter bank 1 v r 0.9 J 0.7 s 0.6 4 -§ 0 0.5 r - 0.4 . . 0.3 > I I I e 0.2 - [ [ [ < 0.1 *- -1 0 . 1 l 1 m 1 L O 1000 2000 3000 4000 5000 6000 7000 8000 Frequency (Hz) Figure 4.26: Filter bank for generating mel-based cepstral coefficients. 
Codebook generation usually involves the analysis of a large sample of training sequences. Many iterative procedures have been proposed for designing codebooks [8]. [67]. In this research. the splitting approach is adopted. which iteratively de- signs codebooks of increasing dimension using the binary search method. The binary search algorithm is summarized as follows: The set of training sequences in the K— dimensional space is first partitioned into two regions using the 2-means algorithm [67]. Then each of the two regions is divided into two subregions. This process is repeated until the space is divided into L regions. The centroid is computed for each region by averaging all of the training vectors in the partition. Then. the centroid of each region is a prototype vector in the VQ codebook. In our implementation. L is set to 128. 71 In order to improve the recognition rate. the power and differenced power are extracted from the speech signal. Power is computed from the speech waveform as follows: lvw P.=108(Zfiil (4.11) k=l where P. is the power for frame i. There are N... discrete samples in each frame. N... is set to 256 and consecutive frames are spaced 128 samples apart. The Hamming window is applied to the speech to create the frame samples. It is also necessary to normalize for speaker loudness variation. In this work, we normalize by the factor which makes the variance of the power term unity. That is, each term is divided by the square root of its variance. Then. a direct distance measure can be computed without. weighting the power term. In addition to the power information. another source of information is differenced power, which is computed as follows: Di 2 i+l —' Pi—l (412) where D. is the differenced power term of the i” frame of the utterance. Clearly. differenced power provides the information about relative changes in amplitude or loudness. Similarly, D. is normalized to make the variance of the differenced power unity. The recognition accuracy improves significantly with the power and differenced power features added. Therefore. there are ten features selected for representing each frame of the speech signal including eight mel-based cepstral parameters, power, and 72 differenced power. 4.2.3.3 HMM Training Both the whole-word HMh/Is and phone models are used in the intraword level in the language graph, and methods described in this subsection are used to train either model form. The Bakis model [8] is used for both the word models and phone models. In the implementation. the three-state Bakis model is used to train the phone models. and six-state for the word models. The state transition coefficients are such that a” = 0, Vj < i in a Bakis model. Further, additional constraints disallow jumps of more than two states. This constraint is of the form: (1.5 = 0, Vj > i+ 2. For example, the form of the ”state transition matrix A of Fig. 2.5 with six states is an an (113 0 0 0 0 (122 (£23 (124 0 0 0 0 (133 (134 (135 0 A = (4.13) 0 0 0 (£44 (145 (146 0 0 0 0 ass “56 {000001) 71'.= (4.14) In this research. the forward-backward (or Baum- 14"elch) reestimation algorithm [8]. [24]. [21] is adopted for training both the word models and phone models. It has 73 been proved by Baum et al. [21] that either the parameters of the re-estimated model remain the same when a critical point is reached or every re-estimate is guaranteed to improve the model parameters. A detailed description of the forward—backward algorithm is given in Appendix E. 
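The transition-matrix constraints of equation (4.13) are easy to state in code; the sketch below builds a six-state matrix with a_ij = 0 for j < i and for j > i + 2. The uniform row values are only placeholder initial estimates for Baum-Welch reestimation, not values taken from the thesis.

```python
import numpy as np

# Sketch of the Bakis constraint of equation (4.13): from state i the model
# may only stay in i or move to i+1 or i+2, and the final state is absorbing.
# Uniform nonzero entries are placeholder initial values for reestimation.

def bakis_transition_matrix(num_states=6):
    A = np.zeros((num_states, num_states))
    for i in range(num_states):
        hi = min(i + 3, num_states)     # allowed targets: i, i+1, i+2
        A[i, i:hi] = 1.0 / (hi - i)     # each row sums to one
    return A
```

Because a transition probability initialized to zero remains zero under the Baum-Welch reestimation formulas, this structure is preserved throughout training.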
There are of total 1.156 words (with different phonetic labels) plus one silent “word” (representing the start vertex) and 60 phones trained in this experiment. These words are selected from the TIMIT database with at least five training data. and none of them is a function word. The training data for phone models are chosen from the “sx” sentences in the TIMIT database. The total number of training data for each phone model is at most 200. Figure 4.8 shows the actual number of phonetic segments in the training set. Phone Number Phone Number Phone Number Ind 2658 /r/ 2953 IN 1317 /ih/ 2425 /w/ 1216 Id/ 1438 leh/ 1815 ly/ 635 /g/ 773 ley/ 1374 lth 462 /p/ 1668 lae/ 1414 lhv/ 337 /t/ 2340 /aa/ I468 lel/ 568 [lg] 2463 law/ 428 /s/ 3679 /dx/ 1081 /ay/ 1 198 /shl 820 /q/ 1626 lah/ 1282 /1./ 2309 Ith 787 /a0/ 1 I62 /zh/ 95 /ch/ 597 loy/ 233 /f/ 1308 [be]! 1 196 low/ 944 /th/ 476 Ich/ 2395 /uh/ 263 IV] 1 125 lch 866 luw/ 342 /dh/ I387 /pcl/ 1693 lux/ 956 /m/ 2012 Itcl/ 3454 ler/ 1036 /n/ 3565 Aid] 2667 lax/ 21 11 lng/ 736 lepi/ 538 [ix] 4387 lem! 69 lpau/ 363 laxr/ 1490 /en/ 402 lux/ 395 lax-bl 22S /eng/ 15 /1/ 2615 Table 4.8: Phonetic segments in the training set. 74 4.2.3 Graph Partitioning After the training process is completed. graph partitioning is used to select the key vertices in the language graph. A good partition set should satisfy the following criteria: 1. The vertices in the selected set C are well-trained. 2. The selected set C contains as few function words” as possible. These function words are problematic in continuous speech recognition [‘21] because they are articulated very poorly and are hard to recognize or even locate. For example, Table 4.10, Table 4.11, Table 4.12, and Table 4.13 enumerate the number of times the function words the. in, with, and a. respectively of different transcrip- tion labels in the TIMIT database. To achieve a partition satisfying the criteria above. a non—negative reward for each vertex is required. A reward is assigned to each vertex according to the following policies: 1. Among all vertices representing whole words, higher rewards are assigned to vertices trained by more training data. ‘2. A higher reward is assigned to a vertex representing a whole-word HMM than to a vertex containing a phone model. l2Function words are typically prepositions. conjunctions. pronouns, and short verbs. For example, the, a. in. with are function words ['21]. 75 3. A penalty (negative reward) is assigned to a vertex representing a function word. The function words used in this research are listed in Table 4.9. There are 42 function words indicated by Lee et al. [21]. a all and any are at be been by did find for from get give has have how in is it list many more of on one or show than that the their to use was were what why will with would Table 4.9: A list of function words appearing in this research. Phonetic Labels No. Phonetic Labels No. II Phonetic Labels No. dh-ix 688 dcl-d-iy 1 I] d-ih 2 (“W 298 q 3 th-ih 2 dh-ax 804 ah-z-ax l l til-21X 2 dh-eh 25 dh-el 3 d-ah 1 dh-ih 36 s 3 ll z-ax 1 dh-ah 46 n-ax 1 z-ax-h l iy 66 n-iy 4 dh-q-eh l dh-ax-h 34 th-ax-h 5 dcl-dh-ix l dh 24 dh-aa 3 aX-Pau-dh-ax 1 3mm (mm 3 I] sh-dh-iy 1 h-ix 5 z-dh-ix 1 I] th-ix 1 Table 4.10: Different ways the function word the is pronounced in TIMIT. Clearly, a better vertex selection is a set of higher reward. To quantify these requirements. a reward function of the following form is used for vertex .13: 76 Phonetic Labels No. IrPhonetic Labels No 0.5+Phonetic Labelsl No. 
en 219 ih-ng hv -—1x n 2 ih-n 261 ix-nL 10_0| ix- pawq- -ih- -n 1 ax-n 114 er q- ax- --h mg 1 q-ix-ng 40 q-ih-ng 1 1 hh-ix-ng 1 w-ix-n 23 3mg 1 ax-h-pau-q-ih-n 1 Cl-ih-n 86 I iX-q-ih—n 2 | hh-ih-n 1 eng 13 q-ih-cng l q-ih-m 1 ix-nx 128 ih-ax-n I hh-cm 1 q-iy-ng 2 ih-ix-n I to] 1 ax-hn 2 q-ih-nx 4 I ng 2 n I Table 4.11: Different ways the function word in. is pronounced in TIMIT. Phonetic Labels No. Phonetic Labels No. Phonetic Labels . w-ax-dh l9 w-ix-tcl-q l ix-th 2 w-ix-th 10 l uh-dh l w-ix-s 8 w-ix-dh 25 w-ax-h-th l w-dh 1 w-ih-th 26 w-th 1 w-uh-tcl-t l w-ih-s 8 w-ih-ax 1 w-ix-tcl 2 w-ax-th 46 w-ih-tCl-th l W-iX-q l hh-w-ax-h-th l w-ih-dh 4 Table J(.122: Different ways the function word with. is pronounced in TIMIT. Phonetic Labels No. Phonetic Labelsl No. Phonetic Labels No. ax-h 31 q-ax 42 q-eh 4 ax 497 61-“! 6 m 2 ix 401 ch 19 uh 6 ey 1 30 q-ix 12 hv-ey 1 0'6)’ 54 q-ax-h 1 q-ah 35 ah 103 Table 4.13: Different ways the function word a is pronounced in TIMIT. 77 Q(.r) =O‘S(£)+13T(I)+‘Y(I)F(.II) (4.15) Let Q(.r) be the non-negative reward assigned to the vertex 1:. Let 5(2) be the weight for vertex J: which is based 011 the number of occurrences of the corresponding word with its particular phonetic baseform in the TIMIT database. Generally speaking, the larger S(.r), the more incoming and outgoing transitions from vertex 1:. Let T(.1:) be the weight for the quality of training for the word at vertex 2:, e.g., the larger T(I) is, the better training the resident word has. Let F(.1:) be the penalty for the vertex .1: if it contains a function word. a, 1‘3, and 7(1') are the weighting factors. These factors satisfy the following properties: a +3 = 1, and (1 >13 (4.16) —1, if the word at vertex 2: is one of the function words in Table 4.5 0, otherwise (4.17) Experimentally, a is set to 0.6, and 1’3 is 0.4. The penalty for a function word [(1‘) is assigned to be 0.805(2). The significance of 5(1) and T(x) in the reward function is the following: Suppose the word model at vertex .1: is not well-trained. Then, T(.r) is set to zero, and 03(2) is the reward for the vertex 1:. If the word model at .r is well-trained, then more reward can be assigned to I through T(r). At the same time, 78 the existence of T(r) can be used to indicate the quality of training of the HMM for the word resident at .1:. By this, we mean that the more training data for training the model at .r, the larger T(r). If a vertex is not represented by a well-trained word HMM, then it can be represented in terms of phone models. After a reward is assigned to each vertex, the partitioning algorithm described in Chapter 2 is applied directly to the graph. Therefore, a set of vertices (i.e., those vertices in the set C) are selected for further evaluation and search. The partition results are affected by the following three factors: 1. [\1 The selection of the source vertex 3. In a planar subgraph of the language graph, we need to choose a source vertex for breadth-first search (BFS) in Step 1 of the partitioning algorithm described in Section 2.3.1. Clearly, different source vertices for the BFS cause different partition results. . The reward assignment for each vertex. The rewards assigned to the vertices can be modified by applying different sets of weighting factors a, 13, and 7(2) in equation (4.17). The resulting partition depends on the choice of the weighting factors. . The extraction of planarity breaking arcs. For a nonplanar graph, the planarity breaking arcs are first extracted to create a planar subgraph. 
The graph parti- tioning algorithm for planar graphs is then applied to this planar subgraph to get a preliminary partition set. A complete partition set is achieved by applying a method described in Section 2.3.2. However, the resulting partition depends on which set of planarity breaking arcs is extracted. The set of arcs causing nonplanarity is not unique. Our goal is to find a partition which contains a high reward in the set C by comparing several partitioning results. A set of high reward should be marked for evaluation. [11 our experiments, there are a total of 470 vertices (z 2.6% of the total number of vertices in the language graph) in the set C. Among the selected vertices, 1.91 of them are from the violation of the PST (recall the partitioning algorithm for nonplanar graphs in Chapter ‘2). The partition result is good enough because the set C contains 25% of the vertex reward in the language graph, and there are a total of 217 well-trained word models in the selected set. In the next section, these selected vertices are used to test. the performance of the graph search algorithm developed in this work. 4.2.4 Two-Pass Graph Search 111 this section, a few parameters for the word-spotting algorithm described in Section 3.2.1 are assigned. Experimentally, three thresholds, AL, A1, and Au, are set as follows to yield good spotting results: For each selected word model M”, AL is set to 0,, — Up, A; is set to 10-3. and _\u is set to 10‘2““, where DP and 0,, are estimated from the training data. The significance of these parameters is discussed in Chapter 3. The stack size (1 is virtually unlimited. Therefore, there is no hard pruning. 80 4.3 Experiments 4.3.1 Methods and Measures of Performance Partitioned graph search techniques are applied to recognition of continuous-speech taken from the TIMIT [21] database. Comparisons of the results of the following approaches are made: 1. partitioned graph search. 2. left-to-right search employing randomly selected vertices of the same number as used in the partitioned graph search cases. 3. conventional left-to-right stack search with pruning. To compare the performance between the partitioned-search-based algorithms and the conventional left-to-right methods, the measures of performance include: 1. computational complexity. (\D . recognition accuracy. 3. noise robustness. ln continuous-speech recognition, there are three types of errors: substitution errors (a word is misrecognized in the sentence), deletion errors (a correct word is omitted in the recognized sentence), and insertion errors (an extra word is added in the recognized sentence). h’leasures suggested by Lee [21] are used to assess the recognition performance. we first align the recognized word string against the correct 81 word string, and then compute the number of words correct, and the number of substitutions, deletions. and insertions. Five results are reported in computing the error rate. These are the percent correct (% Corn), percent substitutions (71;. Subs), percent deletions ((76 Dels.), percent insertions (% 1113.). and word accuracy (\Nord Acc.). Note that the percent correct does not consider insertions as errors but the word accuracy does. These performance measures are computed as follows: . Correct ‘7. C.‘ .= 101) x 4.18 ( orr Correct Sent Length ( ) _ S l.‘ (7:: Subs. = 100 x a, 1‘" (4.19) (.orrect Sent Length _ D1. 70 Dels. = 100 x ‘3 Q (4.20) Correct Sent Length 1113 (7" l ‘. = 100 _ 4.21 C “b x Correct Sent Length ( ) l . .'l. 
D 1 Error Rate = 100 X ns + S” )8 + e S (4.22) (.orrect Sent Length Word Acc. = 1 — Error Rate (4.23) Since Correct Sent Length 2 Correct + Subs + Dels. 82 Correct — Ins W (1 A = 100 . 4.24 or CC x Correct Sent Length ( ) To consider the measure of performance regarding to noise robustness, we need to know the signal-to-notse ratio (SNR). In this work, the SNR is computed as follows: Denote the SNR in (18 by N. Then, . T2 h‘.(dB) = 10 X log ELL; (4.25) 1712' where f,- is the sample at timei in the speech signal, and n, the noise at time i. The summations in equation (4.27) are taken over the ranges of is which are corrupted by noise. 4.3.2 Results and Discussion A set of ‘20 utterances taken from the dr3 dialect (North Midland) segment of TIMIT is used to evaluate the performance of the three methods listed above. In order to reduce the computational effort for the left-to-right search, a pruning strategy is used to prune unlikely partial paths after evaluating at least the first and second models on each path. The partition evaluation likewise comprises a sort of “pruning” procedure and it is assured that no path is pruned in the partition search case unless at least two evaluations have taken place on it. Noisy utterances in these experiments were created by adding noise of 0 dB SNR to randomly selected time intervals in the normal speech. The method used to corrupt the normal speech is as follows: Four fixed length blocks (3,000 samples”) of noisy samples are added to the normal speech beginning at four randomly (using a C- language version of the machine-dependent random number generator) selected times. The experimental results are shown in Table 4.14. In addition to the simple measure of performance defined above, we have added three further measures of performance specifically dealing with complexity reduction. Method I Partition Random Left-to-Right Measures Normal Nolsy Normal Noisy Normal Noisy % Corr. 79.05 % 56.08 % 59.46 % 40.54 % 60.14 % 42.56 % % Subs. 16.22 % 35.13 % 34.46 % 52.03 % 39.19 % 50.68 % % Dels. 4.73 % 8.79 % 6.08 % 7.43 % 0.67 % 6.67 % % Ins. 0.00 % 2.03 % 2.70 % 4.05 % 6.76 % 6. 67% Word Acc- 79.05 % 54.05 % 56.76 % 36.49 % 53.38 % 35.81 % Average it of surviving Paths after first PCS 763 778 1,327 1.318 976 991 or LTR pruning it of vertices evaluated at first PCS or LTR 470 470 470 470 10,061 10.061 Pruning Average node evaluated! node pruned In the first 0.05271 0.05285 0.05824 0.05814 1.17034 1.173512 and second time slots Table 4.14: Experimental results. To assess complexity, let us consider the computational effort exerted to reduce the graph by pruning in the first pass of the search. 13The average sentence length is about 45,000 samples. As a reasonable complerity 84 measure (CM), we define Average number of vertex evaluations performed Complexity Measure (CM) 2 Number of vertices pruned in the first two time slots. (4.26) In the equation above. the number of vertices pruned is obtained either effectively in the partition case, or directly in left-to-right search. This measure favors the left-to-right search in the sense that left—to-right evaluation works directly with the first two models on any path whereas the partitioned search can generally only affect them indirectly. The CM result for each noise—free utterance is shown in Fig. 4.27. Fig. 4.28 is a "blomip" of the CM comparison between the partition search and random selection methods so that the results can be seen clearly. Similar plots for the noisy speech are shown in Figs. 4.29 and 4.30. 
On the average, the computational expense of left-to—right search as measured by the CM is about 22 times more than the computational effort for partitioned search. 111 considering this CM result, it should be carefully noted that the partition and search procedure is only performed once in these experiments. If partition and search were to be repeated in successive stages. even greater cost savings could be obtained. However, in a relatively small graph like the one under consideration here, the benefit of repartitioning is not great, especially in terms of the CM. For example, we estimate that the graph is reduced to about 10% of its original size by the initial partitioning procedure (requiring 470 evaluations). If a second partition recursion could reduce the remaining subgraph by an additional 90% using, say, 47 further evaluations, then the CM measure for partitioned search would be reduced only in the third decimal 85 place. The recursive partitioning procedure will become increasingly more important and beneficial as the graph size increases. It is acknowledged that the recognition rates for each method explored in Ta- ble 4.14 are disappointingly low compared with rates obtained for other speaker- independent continuous-speech recognition systems using the TIMIT and other databases for evaluation [21], [65]. The main focus of this research has been to show the poten- tial for a steep decrease in the computational effort using partitioned graph search compared with conventional left-to-right search approaches in the continuous-speech recognition task. Much of the research effort has focused on the graph-theoretic as- pects of partitioning. and future attention to modeling and training issues. as well as more extensive experimental data, is expected to bring the recognition results in line with the state of the art. There is nothing about the partitioned search method which precludes similar modeling and training procedures to those used in very successful systems (see. e.g. [8]). As an example of the effects of possibly unfavorable modeling and relatively few test utterances in the present work, consider the following facts. Three of the test utterances were found to be particularly problematic in the recognition task. These are 5x140 (from fkmsO). si908 (from fpktO) and si1490 (from fkmsO). We suspect that poorly trained models were to blame for prematurely pruning correct paths in each trial and in each method. Table 4.15 shows the results for the basic measures of performance with these sentences removed from the study. Clearly these results bring the present study more in line with expected rates. Much is still to be learned about important modeling issues, parameter settings, and other details through future experimental research. 86 Method I Partition Random Lefi-to-nght Measures Normal Noisy Normal Noisy Normal Noisy % Corr. 92.8 % 65.6 % 69.6 % 47.2 % 70.4 % 50.4 % % Subs. 5.6 % 26.4 % 26.4 % 46.4 % 29.6 % 42.4 % % Dels. 1.6 % 8.0 % 4.0 % 6.4 % 0.0 % 7,2 % % Ins. 0.0 % 0.8 % 0.1) % 4.0 % 4.8 % 7.2 % Word Ace. 92.8 % 64.8 % 69.6 % 43.2 % 65.6 % 43.2 % Average # of surviving patm after first P68 7 10 748 1,244 1,234 1,013 1.045 or LTR pruning # of vertices evaluated at first PGS or LTR 470 470 470 470 10.061 10.061 Pruning Average node evaluated! node pruned In the first 0.05224 0.05258 0.05725 0.05736 1.17788 1.18487 and second time slots Table 4.15: Experimental results with three problematic test utterances removed. 
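The measures reported in Tables 4.14 and 4.15 can be summarized in a few lines of code. The sketch below assumes the correct, substitution, deletion, and insertion counts have already been obtained from the word-string alignment mentioned in Section 4.3.1, and follows equations (4.18)-(4.24) together with the complexity measure of (4.26); the function and argument names are illustrative.

```python
# Sketch of the performance measures of equations (4.18)-(4.24) and the
# complexity measure of (4.26).  The alignment that yields the counts is
# assumed to be done elsewhere.

def recognition_scores(correct, subs, dels, ins):
    ref_len = correct + subs + dels            # correct sentence length
    pct = lambda count: 100.0 * count / ref_len
    return {
        "% Corr.": pct(correct),
        "% Subs.": pct(subs),
        "% Dels.": pct(dels),
        "% Ins.": pct(ins),
        "Word Acc.": pct(correct - ins),       # insertions count as errors here
    }

def complexity_measure(avg_vertex_evaluations, vertices_pruned_first_two_slots):
    # average vertex evaluations per vertex pruned in the first two time
    # slots, as in equation (4.26)
    return avg_vertex_evaluations / vertices_pruned_first_two_slots
```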
Finally, we wish to exhibit an overall measure of performance for the various methods which takes into account the complexity benefits of the partitioned search. Using CM as defined in (4.26) as a measure of computational effort, the word accuracy normalized to CM is used as a good indicator of the overall performance of these search methods:

Normalized Word Accuracy (NWA) = (% Word Accuracy) / (Complexity Measure).

The results of the NWA for each search method are shown in Fig. 4.31. For the normal speech, the recognition results by partitioned graph search are significantly better than the results of conventional left-to-right methods in these experiments. Furthermore, the partitioned graph search is more robust than the left-to-right approach in noisy environments. An interesting and expected outcome is that the robustness to noise afforded by the partitioned search is apparently due to the "scattered site" evaluation which takes place. This is evident in Fig. 4.31 because the random selection technique is also much more robust to noise than left-to-right search. However, previous results clearly indicate that random selection is not nearly as effective in reducing the computational complexity as partitioning. This is due to the fact that partitioning provides a very systematic way of selecting "scattered," but high-payoff, vertices.

Figure 4.27: The complexity measure (CM) for normal speech using three different methods.

Figure 4.28: The complexity measure (CM) for normal speech using the partition and random selection methods.

Figure 4.29: The complexity measure (CM) for noisy speech using three different methods.

Figure 4.30: The complexity measure (CM) for noisy speech using the partition and random selection methods.

Figure 4.31: The overall performance as measured by the normalized word accuracy (NWA) of three different search methods.

Chapter 5
Further Discussion, Conclusions and Future Work

5.1 Summary and Further Discussion

This work has introduced a novel approach for continuous-speech recognition based on partitioned search of language graphs. In general, the most important finding of this work is that the partitioned search technique shows great potential for application to large-vocabulary continuous-speech recognition.
The general goal of this work has been to open the door to new research directions for future systems, which. if they are to be capable of recognizing speech based on practical and flexible languages. must involve very large graphs of very high perplexity. The central motivation for this partitioning operation is to reduce search complexity from (9(N) to O(\/—\7) by a judicious selection of vertices. This complexity reduction has the obvious advantage of permitting the search of larger spaces. but also some less obvious benefits which are discussed above and below. To summarize. the advantages of the partitioned graph search procedures explored in this work are as follows: 91 92 e The decoding procedure is based upon a very large graph (reflecting a large language model). The benefits of a large graph are twofold: l. A large graph will reflect the original grammar more accurately thereby improving recognition performance. 2. Models in a large graph will be trained in context and reflect the acoustic knowledge more accurately. These two “large model” benefits do not come at the expense of insufficient training because the partitioning procedure selects well-trained vertices. This feature permits a more accurate language model and better training in context. The computational complexity of partitioned search (with respect to conven- tional procedures) is greatly reduced by the inherent divide—and-conquer proce- dure. e Partitioned search can be used to increase robustness to frequently misarticu- lated words such as function words. Partitioned search is more robust to impulsive or intermittent noise than con- ventional left-to-right search, because the evaluation of the data, by focusing on the partitioned C-set of vertices, is distributed across the observations in time. This means that the search is less likely to be pruned in the presence of unmodeled noise. Significantly. partitioned search may be generally viewed as a type of n-best method, which has become quite popular in the speech recognition community in 93 recent years. An n-best approach employs a computationally inexpensive pass to pare down the problem to one involving a smaller graph of likely solutions. The smaller graph can then be subjected to intense scrutiny to find an optimal solution. 5.2 Contributions The partitioning concept is simple to understand, but finding a good partition is a highly nontrivial problem. Graph partitioning algorithms developed in this work offer an efficient and systematic method for locating the high-payoff set of vertices for distributed evaluation. The implementation of the novel partitioning algorithms are reported in this dissertation. In continuous-speech recognition tasks, the boundaries in the observation data are generally unknown. Without knowledge of the boundaries for each word in an utterance, the recognition task becomes significantly more challenging. In this work, a two-pass search algorithm for the unknown boundaries case is developed. The major contributions of this research are summarized as follows: 1. A vertex partitioning algorithm for generally nonplanar graphs is developed, and the key operation for finding a cycle for the planar graph subpartitioning is improved. ‘2. Partitioned graph search algorithms, which are applicable to the case of un- known boundaries in the observation string, are developed. 3. Partitioned graph search techniques are applied to recognition of continuous- 94 speech taken from the TIMIT [‘21] database. 
Comparisons of the results using the following approaches for recognizing continuous speech are made: (a) partitioned graph search. (b) left-to-right search employing randomly selected vertices of the same num- ber as used in the partitioned graph search cases. (c) conventional left-to-right stack search with pruning. 5.3 Future Work The following issues for future research have emerged from this work: 1. There is a “bottleneck” in the process of partitioning a very large graph. The difficult task appears in the planarity testing operation. Before a language graph is partitioned. certain planarity breaking arcs need to be extracted, and the planar embedding for a large graph is extremely time and space consuming. To solve this problem when a large vocabulary system is considered, the following method is suggested. The language graph is decomposed into several small subgraphs and each subgraph is partitioned individually. Then the individual partitioning results are combined to form a selected set for the large graph. However, the selected vertex set C may no longer satisfy (9( fl). Therefore, the very interesting and challenging problem of finding more efficient and effective methods remains for future work. 95 2. Since most language models are expected to be representable only by nonplanar graphs, it is critical that the partitioning procedure be applicable to such graphs. One ad hoc method for dealing with the nonplanarity issue is described in Chapter 2. Other methods for treating nonplanarity and general effects of nonplanarity upon performance remain important issues for future research. 3. In the present research, resident at each vertex is either a whole-word HMM or a phone-based model composed of context-independent phone HMMs. Whole- word models are often found to outperform phone-based models if sufficient data are available to train them [81]. However, this training approach is im- practical in a large speech recognition task. In the future systems, other models for representing these vertices are required. such as context-dependent phone models (e.g., triphones [21]). 4. Much is still to be learned about important modeling issues, parameter settings, and other details through future experimental research aimed at improving the overall performance of the partitioned graph search technique. 5. The complexity reduction is very significant when N is much larger than fl. Future systems must employ a very large (109 - 10” vertices or even more) graphs of very high perplexity if they are to be capable of recognizing speech based on practical and flexible languages. In order to prevent the intractable problem of training of all vertices in huge graphs, the partitioning process can be used to determine a small set of high-payoff vertices to be trained. In this case, the partitioning is not guided by the availability of training data, but a. 96 number of partition sets can be tried to find a set for which sufficient training data are available. Appendices Appendix A Graph-Theoretic Notations and Definitions A graph G(V, E) is an ordered pair consisting of a finite set of vertices V and a finite—set of edges B. Let |V| denote the number of vertices in G and IE] the number of edges. If each edge is an unordered pair of distinct vertices, then the graph is undirected. If each edge has an assigned orientation, then the graph is directed. Let (n.1,?) be an edge in a directed graph. Then (a. v) is said to join vertex u to vertex v; u is the tail of (u, v), and v is its head. 
If G is a directed graph, the underlying graph of G is the undirected graph formed by converting each edge of G to an undirected edge. Only undirected graphs are considered in the following definitions. Of course, we may extend the definitions to directed graphs by considering their underlying graphs. A walk from v0 to 12;, in G is a finite nonempty sequence whose terms are alternately vertices v,- and edges 64-1“, 1 g i S k, such that 6,-” = (v,_1,v,). If all the vertices of a walk are distinct, then the walk is called a path. On the other hand, if all the edges of a walk are distinct, the walk is called a trail. A walk from a vertex to itself is called a closed walk. A cycle. is defined as a closed trail whose origin and internal vertices are distinct. 97 98 A vertex w is reachable from a vertex v if there is a path from v to w. A graph G(V, E) is said to be connected if any vertex in G is reachable from any other vertex. If G is a connected graph. then a spanning tree of G is a subgraph T Q G which is a tree and contains all the vertices of G. The radius of a tree is the maximum distance of any vertex from the root. A graph G(V, E) is said to be planar if it can be drawn, or embedded in the plane in such a way that the edges of the embedding intersect only at their ends. A planar embedding of a planar graph is sometimes referred to as a plane graph [1]. A planar representation of a graph divides the plane into a number of connected regions called faces, each bounded by some edges of the graph. The boundary of a face is a closed walk. A face f is said to be incident with the vertices and edges on its boundary. Figure A.32 depicts the faces of a particular embedding of the graph. Let b(f) denote the boundary of the face f. For example, f; of the plane graph given in Figure A.32 is: b(f)) = "161.2'0262.7U7€7.sl-’666.1Ur Figure A.32: A plane graph with four faces. 99 Of course. any planar representation of a connected planar graph always contains one face enclosing the graph. This face, called the erteriorfacc, is f, in Figure A.32. Note that if e is a cut edge in a plane graph, then only one face is incident with 6 since the removal of the cut edge disconnects the plane graph; otherwise, there are two faces incident with 6. Similarly. a cut vertex of a connected graph is defined as a vertex whose removal disconnects the graph. For instance. 8.7.3 is a cut edge and v7 a cut vertex in Figure A.32. Since 6773 is a cut edge. only f3 is incident with it. On the contrary, em is not a cut edge: there are exactly two faces (f1 and f2) incident with it. This property is important in the developments of graph partitioning algorithms described in ("hapter 2. Finally. if G is connected and contains no cut vertices, then G is biconnectrd. Appendix B Supporting Lemmas for the PST The constructive proof of the Planar Separator Theorem (PST) depends on two fundamental lemmas. A brief description of these supporting lemmas is as follows: Lemma 3.1 Let G be any .V-verter connected planar graph having non-negative ver- ter rewards summing to no more than one. Suppose G has a spanning tree of radius 1'. Then the N rtices ofG can be partitioned into three sets A, B, and C. such that no edge joins a vertex in A with a vertex in B. neither A nor B has total reu'ard exceeding 2/3. and C contains no more than 2r + 1 vertices. one the root of the tree. The proof of the lemma proceeds by first embedding G in the plane and finding a breadth-first spanning tree [4] [28] of G. 
The proof of the lemma proceeds by first embedding G in the plane and finding a breadth-first spanning tree [4], [28] of G. Each face is then triangulated by adding additional edges, so that any nontree edge (including the newly added edges) forms a simple cycle with some of the tree edges. The length of this cycle is at most $2r + 1$ if it contains the root of the tree. By the Jordan Curve Theorem [1], the cycle divides the graph into two parts, the inside and the outside of the cycle. Lipton and Tarjan [17] show that at least one such cycle separates the graph so that neither the inside nor the outside of the cycle has total reward exceeding 2/3.

Lemma B.2 Let G be any N-vertex connected planar graph. Suppose the vertices of G are partitioned into levels according to their distance from some vertex s, and let L(l) denote the total number of vertices on level l. Given any two levels l' and l'' such that levels 0 through l' - 1 have total reward not exceeding 2/3 and levels l'' + 1 and above have total reward not exceeding 2/3, it is possible to find a partition A, B, C of the vertices of G such that no edge joins a vertex in A with a vertex in B, neither A nor B has total reward exceeding 2/3, and

$$|C| \le L(l') + L(l'') + \max\{0,\; 2(l'' - l' - 1)\}.$$

The lemma is very important for constructing a vertex partitioning algorithm for actual implementation. The proof of the lemma concerns the relationship between levels l' and l''.

(i) Suppose $l' \ge l''$. Then the lemma is obviously true if we choose all the vertices on level l' to be in the set C, let A be all the vertices below level l', and let B be all the vertices above level l'.

(ii) Suppose that $l' < l''$. Since the vertices in levels l' and l'' are deleted, the graph naturally partitions into three parts: vertices on levels 0 through l' - 1, vertices on levels l' + 1 through l'' - 1, and vertices on levels l'' + 1 and above. To find an appropriate vertex partitioning in this condition, two cases must be considered. One is the case in which the total reward between levels l' + 1 and l'' - 1 does not exceed 2/3. A proper partition is obtained by setting A to the part of the three with the most vertices, B to the remaining two parts, and C to the set of vertices in levels l' and l''. The other case is that in which the total reward between levels l' + 1 and l'' - 1 exceeds 2/3. In this case, the part between l' + 1 and l'' - 1 requires subpartitioning, which is carried out as follows. All vertices on levels l'' and above are deleted, and all vertices on levels 0 through l' - 1 are shrunk to a single vertex x of reward zero. A new graph, say G', is formed. Note that the new graph preserves planarity [17]. Apply Lemma B.1 to the new graph, and let A', B', C' be its vertex partition, the set C' being the vertices on the cycle. A proper vertex partitioning of the graph G is then obtained by letting A be the set among A' and B' with more vertices, C be the vertices in levels l' and l'' plus the set C', and B be the remaining vertices.

Appendix C
Lipton-Tarjan Approach to Finding a Cycle

The Lipton-Tarjan (L&T) approach [17] for finding a cycle to complete the vertex partitioning for planar graphs involves the following steps:

Figure C.33: Cases of Step 4 in the L&T approach for finding a proper cycle. Solid edges are tree edges; dotted edges are nontree edges.

Step 1 Construct a breadth-first spanning tree rooted at x (created by the shrinking procedure described in the proof of Lemma B.2) in the new graph G'. Record, for each vertex v, the parent of v in the tree and the total number of descendants of v, including v itself.
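As an illustration of Step 1 (not from the original), the bookkeeping can be realized with a standard breadth-first search that records each vertex's parent and then accumulates subtree sizes from the leaves toward the root. A minimal Python sketch, assuming an adjacency-list dictionary, is given below; names are ours.

```python
from collections import deque

def bfs_spanning_tree(adj, root):
    """Breadth-first spanning tree of an undirected graph.

    adj  : dict mapping each vertex to an iterable of neighbours
    root : the root vertex (the shrunken vertex x in Step 1)

    Returns (parent, descendants), where parent[v] is v's parent in the
    tree (None for the root) and descendants[v] counts v and all of its
    descendants in the tree.
    """
    parent = {root: None}
    order = []                       # vertices in BFS (level) order
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in parent:      # first visit: (u, w) is a tree edge
                parent[w] = u
                queue.append(w)

    # Accumulate descendant counts from the leaves toward the root.
    descendants = {v: 1 for v in order}
    for v in reversed(order):
        if parent[v] is not None:
            descendants[parent[v]] += descendants[v]
    return parent, descendants


# Small example.
adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
parent, desc = bfs_spanning_tree(adj, 1)
print(parent)  # {1: None, 2: 1, 3: 1, 4: 2}
print(desc)    # {1: 4, 2: 2, 3: 1, 4: 1}
```

The descendant counts allow the number of vertices on either side of any candidate cycle to be computed quickly in the later steps.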
Step 2 Embed the new graph G' in the plane. Let H' be the planar embedding of G'. Make each face of H' a triangle by adding a suitable number of additional edges.

Step 3 Choose any nontree edge, say $(v_1, w_1)$, and locate the corresponding cycle. Compute the number of vertices on each side of this cycle by scanning the tree edges incident on each side of the cycle. Determine which side of the cycle has greater reward and call it the "inside."

Step 4 Let (x, z) be the nontree edge whose cycle is the current candidate to complete the separator. If the reward inside the cycle exceeds 2/3, then find a better cycle by the following method. Locate the triangle (x, y, z) which has (x, z) as a boundary edge and lies inside the (x, z) cycle. Two cases must be considered:

1. If either (x, y) or (y, z) is a tree edge, let (v, w) be the nontree edge among (x, y) and (y, z). Then compute the reward inside the (v, w) cycle. This case is shown in Figure C.33 (a) and (b).

2. If neither (x, y) nor (y, z) is a tree edge, determine the tree path from y to the (x, z) cycle by following parent pointers from y. Let u be the vertex on the (x, z) cycle reached during this search. Then compute the number of vertices inside the (x, y) cycle and the (y, z) cycle. Let (v, w) be the nontree edge among (x, y) and (y, z) whose cycle has more reward inside it. This case is shown in Figure C.33 (c) and (d).

Repeat Step 4 until a cycle satisfying Lemma B.1 is found.

Appendix D
A Planarity Testing Algorithm

The planarity testing algorithm of Demoucron et al. [1] is shown in Figure D.34. Before using this algorithm to determine whether a given graph is planar, some preprocessing considerably simplifies the work. Note the following points:

1. If the graph is not connected, then each component should be subjected to planarity testing separately.

2. If no cycle is found, then the graph is a tree and therefore planar.

3. If $|E| < 9$ or $N < 5$, then the graph must be planar; if $|E| > 3N - 6$, then the graph must be nonplanar [1].

The following definitions are required. Let $G_i(V_i, E_i)$ be a subgraph of G. A bridge B of G relative to $G_i$ is then:

1. either an edge $(u, v) \in E$ where $(u, v) \notin E_i$ and $u, v \in V_i$, or

2. a connected component of $(G - G_i)$ plus any edge incident with this component.

We denote by $V(B, G_i)$ the vertices of attachment of B to $G_i$. Let $H_i$ be an embedding of $G_i$ in the plane. If B is any bridge of $G_i$, then B is said to be drawable in a face f of $H_i$ if $V(B, G_i)$ is contained in the boundary of f. We write $F(B, H_i)$ for the set of faces of $H_i$ in which B is drawable. The algorithm to follow is based on a very important criterion: if $F(B, H_i) = \emptyset$, then no further planar subgraph embedding can be obtained, and the algorithm terminates, indicating nonplanarity.

Given a graph G, the algorithm determines an increasing sequence $G_1, G_2, \ldots$ of planar subgraphs of G and corresponding planar embeddings $H_1, H_2, \ldots$ when G is planar. Through the algorithm, it is easy to record the faces of each subgraph $G_{i+1}$ at each iteration i, as shown in Figure D.34. The procedure is as follows:

1. If there exists a bridge B such that $F(B, H_i) = \emptyset$, then the graph G is nonplanar, and the planarity testing and planar embedding cease.

2. If there exists a bridge B such that $|F(B, H_i)| = 1$, then let f be the single face in $F(B, H_i)$. From the bridge B, choose a path $P_i \subseteq B$ and set $G_{i+1} = G_i \cup P_i$. The faces of $G_{i+1}$ can be obtained by drawing $P_i$ in the face f of $G_i$.

3. Otherwise, choose any face f and any bridge B such that $f \in F(B, H_i)$. From the bridge B, choose a path $P_i \subseteq B$ and set $G_{i+1} = G_i \cup P_i$. The faces of $G_{i+1}$ can again be obtained by drawing $P_i$ in the face f of $G_i$.
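The preprocessing tests listed at the beginning of this appendix are cheap to apply before invoking the full procedure. The following Python sketch (ours, not part of the original algorithm) assumes a connected simple undirected graph given as a vertex list and an edge list, and returns a definite answer only when the quick tests are conclusive.

```python
def quick_planarity_test(vertices, edges):
    """Apply the preprocessing shortcuts for planarity testing.

    Returns True (surely planar), False (surely nonplanar), or
    None (the full algorithm of Demoucron et al. is still needed).
    Assumes a connected simple undirected graph.
    """
    n, m = len(vertices), len(edges)

    # A connected graph with |E| <= N - 1 has no cycle, i.e. is a tree,
    # and is therefore planar.
    if m <= n - 1:
        return True
    # Small graphs: fewer than 9 edges or fewer than 5 vertices.
    if m < 9 or n < 5:
        return True
    # Necessary condition for planarity: |E| <= 3N - 6.
    if m > 3 * n - 6:
        return False
    return None


# Examples: K5 has 10 > 3*5 - 6 = 9 edges, hence it is nonplanar.
from itertools import combinations
K5_edges = list(combinations(range(5), 2))
print(quick_planarity_test(range(5), K5_edges))                  # False
print(quick_planarity_test(range(4), [(0, 1), (1, 2), (2, 3)]))  # True (a tree)
```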
Note that if G is planar, then by Euler's formula [1], $|E| - N + 2$ faces will be found. Since these faces have been found through the process shown in Figure D.34, a fixed planar embedding of the graph G can be constructed.

Figure D.34: The planarity testing algorithm of Demoucron et al.

Appendix E
Elements of the Hidden Markov Model

In speech recognition, hidden Markov models (HMMs) have become increasingly popular over the last fifteen years, since the modeling provides a mathematically rigorous approach and works very well in practice [8], [9], [21], [24]. Much of the popularity of the HMM is based on the discovery of efficient methods for estimating its parameters. The form of the HMM can be chosen according to knowledge of the application domain, and the parameters are trained from known data. To use the HMM, three basic problems (training, evaluation, and recognition) must be understood:

1. The training problem: Training an HMM is simply the estimation of the parameters of a model from a set of known training data (observation sequences), such that the parameters best describe how a given observation sequence is generated.

2. The evaluation problem: Given a model and a sequence of observations, the evaluation problem considers the probability that the observed sequence is produced by the model.

3. The recognition problem: Recognition with HMMs is the identification of an unknown observation sequence by choosing the most likely model (representing a word, a sentence, etc.) that produces the observation sequence.

A discrete-observation HMM, say M, is defined by a set of I states, J observation symbols, and three probabilistic matrices:

$$M = \{\Pi, A, B\}. \qquad (E.1)$$

Let the individual states be denoted $S = 1, 2, \ldots, I$, and the observation symbols $Z = \{z_1, z_2, \ldots, z_J\}$. Then the probabilistic matrix $\Pi$ contains the initial state probabilities:

$$\Pi = \{\pi_i\}, \qquad \pi_i = P(\text{state } i \text{ at } t = 1), \qquad (E.2)$$

where the variable t denotes discrete time. A is an $I \times I$ matrix containing the Markovian state transition probabilities:

$$A = \{a_{ij}\}, \qquad a_{ij} = P(\text{state } j \text{ at } t + 1 \mid \text{state } i \text{ at } t), \qquad (E.3)$$

and B is an $I \times J$ matrix containing the observation symbol probability distributions:

$$B = \{b_j(k)\}, \qquad b_j(k) = P(z_k \text{ at } t \mid \text{state } j \text{ at } t). \qquad (E.4)$$

Note that $a_{ij}$ and $b_j(k)$ satisfy the following properties:

$$a_{ij} \ge 0, \quad b_j(k) \ge 0, \qquad \forall\, i, j, k \qquad (E.5)$$

$$\sum_{j=1}^{I} a_{ij} = 1, \qquad \forall\, i \qquad (E.6)$$

$$\sum_{k=1}^{J} b_j(k) = 1, \qquad \forall\, j. \qquad (E.7)$$
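To make the notation concrete (our illustration, not part of the original), a discrete-observation HMM can be stored as three NumPy arrays whose entries must satisfy the stochastic constraints (E.5)-(E.7). The sketch below builds a small two-state, three-symbol model and checks those constraints; the function name and example values are ours.

```python
import numpy as np

def make_hmm(pi, A, B):
    """Bundle and validate a discrete HMM M = {Pi, A, B}.

    pi : (I,)   initial state probabilities
    A  : (I, I) state transition probabilities, rows sum to 1
    B  : (I, J) observation symbol probabilities, rows sum to 1
    """
    pi, A, B = np.asarray(pi, float), np.asarray(A, float), np.asarray(B, float)
    assert np.all(pi >= 0) and np.all(A >= 0) and np.all(B >= 0)   # (E.5)
    assert np.isclose(pi.sum(), 1.0)
    assert np.allclose(A.sum(axis=1), 1.0)                          # (E.6)
    assert np.allclose(B.sum(axis=1), 1.0)                          # (E.7)
    return pi, A, B


# A two-state, three-symbol example with left-to-right-like transitions.
pi, A, B = make_hmm(
    pi=[1.0, 0.0],
    A=[[0.7, 0.3],
       [0.0, 1.0]],
    B=[[0.6, 0.3, 0.1],
       [0.1, 0.2, 0.7]],
)
print(A.shape, B.shape)   # (2, 2) (2, 3)
```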
In the HMM, the state sequence is hidden but the output sequence can be directly observed. A set of sample model topologies is shown in Figure E.35. The HMM topology of Figure E.35(a) is called an ergodic topology because any state can be reached from any other state. However, this topology is inappropriate for speech recognition because speech consists of an ordered sequence of sounds. Figures E.35(b) through E.35(d) are all left-to-right models; the left-to-right restriction forces an ordering on the state sequence in which the path evaluation must either stay in the same state or go to a higher-numbered state. Figure E.35(b) is the general left-to-right topology. Figure E.35(c), a Bakis model [9] (stay, move, or skip one), and Figure E.35(d), a linear (stay or move) model, are commonly used for speech recognition. Although we have dichotomized HMMs into ergodic and left-to-right models, there are still many possible variations and combinations [8], [9], [21].

Figure E.35: Some examples of four-state HMM topologies: (a) ergodic, (b) general left-to-right, (c) Bakis, and (d) linear models.

Let $Y = y_1, y_2, \ldots, y_T$ be the discrete observation sequence of a training utterance for a model M. Our goal is to choose the model parameters such that $P(Y \mid M)$ is locally maximized. Let $\alpha_t(i)$ be the forward variable and $\beta_t(i)$ be the backward variable, defined as:

$$\alpha_t(i) = P(y_1, y_2, \ldots, y_t \text{ and } s_t = i \mid M) \qquad (E.8)$$

$$\beta_t(i) = P(y_{t+1}, y_{t+2}, \ldots, y_T \mid s_t = i, M) \qquad (E.9)$$

where $s_t$ denotes the state at time t. Define $\xi_t(i, j)$ as the probability of being in state i at time t and state j at time $t + 1$, given the model M and the observation sequence Y. The forward-backward algorithm [24], which can be used to adjust the model parameters $(A, B, \Pi)$ and maximize the probability of the observation sequence given the model, is sketched below:

Step 1 Initialize all the elements in $\Pi$, A, and B with reasonable random numbers satisfying the basic properties discussed above.

Step 2 For each training datum, implement Steps 3 to 5.

Step 3 Calculate the forward and backward variables inductively as follows:

$$\alpha_1(i) = \pi_i\, b_i(y_1), \qquad 1 \le i \le I$$

$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{I} \alpha_t(i)\, a_{ij}\Big] b_j(y_{t+1}), \qquad 1 \le t \le T - 1, \ \ 1 \le j \le I$$

$$\beta_T(i) = 1, \qquad 1 \le i \le I$$

$$\beta_t(i) = \sum_{j=1}^{I} a_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j), \qquad t = T - 1, T - 2, \ldots, 1, \ \ 1 \le i \le I.$$

Step 4 Calculate $\xi_t(i, j)$ and $\gamma_t(i)$:

$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{I} \sum_{j=1}^{I} \alpha_t(i)\, a_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j)}$$

$$\gamma_t(i) = \sum_{j=1}^{I} \xi_t(i, j).$$

Then, a set of reestimation formulas for $\Pi$, A, and B is:

$$\bar{\pi}_i = \gamma_1(i)$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

$$\bar{b}_j(k) = \frac{\sum_{t:\, y_t = z_k} \gamma_t(j)}{\sum_{t} \gamma_t(j)}.$$

Step 5 Set $\pi_i \leftarrow \bar{\pi}_i$, $a_{ij} \leftarrow \bar{a}_{ij}$, and $b_j(k) \leftarrow \bar{b}_j(k)$, $\forall\, i, j, k$.

Step 6 If some convergence criterion is reached, then stop. Otherwise, go to Step 2.
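To make Steps 1-6 concrete, the following NumPy sketch (ours, offered only as an illustration of the reestimation equations above; variable names are our own) runs the forward and backward recursions and the reestimation pass for a single observation sequence, using a fixed number of iterations in place of an explicit convergence test. It omits the scaling normally needed for long utterances.

```python
import numpy as np

def baum_welch(Y, pi, A, B, n_iter=20):
    """Forward-backward (Baum-Welch) reestimation for one discrete sequence.

    Y  : observation sequence as integer symbol indices, length T
    pi : (I,) initial state probabilities
    A  : (I, I) transition probabilities
    B  : (I, J) symbol probabilities
    Note: no scaling is applied, so this sketch suits short sequences only.
    """
    pi, A, B = (np.array(x, dtype=float) for x in (pi, A, B))
    I, T = len(pi), len(Y)

    for _ in range(n_iter):
        # Step 3: forward and backward recursions.
        alpha = np.zeros((T, I))
        beta = np.zeros((T, I))
        alpha[0] = pi * B[:, Y[0]]
        for t in range(T - 1):
            alpha[t + 1] = (alpha[t] @ A) * B[:, Y[t + 1]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])

        # Step 4: xi_t(i, j) and gamma_t(i).
        xi = np.zeros((T - 1, I, I))
        for t in range(T - 1):
            num = alpha[t][:, None] * A * (B[:, Y[t + 1]] * beta[t + 1])[None, :]
            xi[t] = num / num.sum()
        gamma = xi.sum(axis=2)                      # shape (T-1, I)

        # Reestimation formulas, followed by Step 5 (parameter update).
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]
        B_new = np.zeros_like(B)
        for k in range(B.shape[1]):
            mask = (np.array(Y[:T - 1]) == k)
            B_new[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
        B = B_new

    return pi, A, B


# Usage with a small two-state, three-symbol model.
Y = [0, 0, 1, 2, 2]
pi0 = [0.9, 0.1]
A0 = [[0.7, 0.3], [0.2, 0.8]]
B0 = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi1, A1, B1 = baum_welch(Y, pi0, A0, B0, n_iter=10)
print(np.round(A1, 3))
```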
Bibliography

[1] J.A. Bondy and U.S.R. Murty, Graph Theory with Applications, New York: American Elsevier Publishing, 1976.

[2] F. Harary, Graph Theory, Reading, Massachusetts: Addison-Wesley, 1969.

[3] S. Even, Graph Algorithms, Potomac, Maryland: Computer Science Press, 1979.

[4] A. Gibbons, Algorithmic Graph Theory, New York: Cambridge University Press, 1985.

[5] B.T. Lowerre and R. Reddy, "The HARPY speech understanding system," in W.A. Lea (ed.), Trends in Speech Recognition, pp. 340-360, Englewood Cliffs, New Jersey: Prentice-Hall, 1980.

[6] A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimal decoding algorithm," IEEE Trans. on Information Theory, vol. IT-13, pp. 260-269, Apr. 1967.

[7] F. Jelinek, "Fast sequential decoding algorithm using a stack," IBM J. of Research and Development, vol. 13, pp. 675-685, Nov. 1969.

[8] J.R. Deller, Jr., J.G. Proakis, and J.H.L. Hansen, Discrete Time Processing of Speech Signals, New York: Macmillan, 1993.

[9] J. Picone, "Continuous speech recognition using hidden Markov models," IEEE ASSP Magazine, pp. 26-41, July 1990.

[10] D.B. Paul, "Speech recognition using hidden Markov models," Lincoln Laboratory J., vol. 3, no. 1, pp. 41-61, Spring 1990.

[11] L.R. Bahl, F. Jelinek, and R.L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-5, no. 2, pp. 179-190, Mar. 1983.

[12] J.R. Deller, Jr. and R.K. Snider, "Reducing redundant computation in HMM evaluation," IEEE Trans. on Speech and Audio Processing, June 1993.

[13] C.C. Venkatesh, J.R. Deller, Jr., and C.C. Chiu, "A graph partitioning approach to signal decoding," manuscript.

[14] L.R. Bahl and F. Jelinek, "Decoding for channels with insertions, deletions and substitutions with applications to speech recognition," IEEE Trans. on Information Theory, vol. IT-21, no. 4, pp. 404-411, July 1975.

[15] S.E. Levinson, "Structural methods in automatic speech recognition," Proc. of the IEEE, vol. 73, no. 11, 1985.

[16] C.C. Chiu, Algorithms for Signal Decoding Using Graph Partitioning (M.S. thesis), Dept. of Electrical Engineering, Michigan State University, East Lansing, 1991.

[17] R.J. Lipton and R.E. Tarjan, "A separator theorem for planar graphs," SIAM J. of Computing, vol. 36, no. 2, pp. 177-189, Apr. 1979.

[18] H.N. Djidjev, "On the problem of partitioning planar graphs," SIAM J. of Algebraic and Discrete Methods, vol. 3, no. 2, pp. 229-240, June 1982.

[19] G.L. Miller, "Finding small simple cycle separators for 2-connected planar graphs," Proc. 16th Annual ACM Symp. on Theory of Computing, pp. 376-382, Apr. 1984.

[20] C.C. Chiu, C.C. Venkatesh, A.-H. Esfahanian, and J.R. Deller, Jr., "An algorithm for planar graph subpartitioning," submitted to J. of Combinatorial Mathematics and Combinatorial Computing.

[21] K.F. Lee, Automatic Speech Recognition: the Development of the SPHINX System, Boston: Kluwer Academic Publishers, 1989.

[22] D.A. Plaisted, "A heuristic algorithm for small separators in arbitrary graphs," SIAM J. of Discrete Mathematics, vol. 19, no. 2, pp. 267-280, Apr. 1990.

[23] S. Rao, "Finding near optimal separators in planar graphs," Proc. 28th Annual IEEE Symp. on Foundations of Computer Science, pp. 225-237, 1987.

[24] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. of the IEEE, vol. 77, no. 2, 1989.

[25] C.-H. Lee and L.R. Rabiner, "A frame-synchronous network search algorithm for connected word recognition," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1649-1658, 1989.

[26] R. Schwartz and Y.L. Chow, "The N-best algorithm: an efficient and exact procedure for finding the N most likely sentence hypotheses," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 81-84, 1990.

[27] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 28, pp. 357-366, 1980.

[28] J.A. McHugh, Algorithmic Graph Theory, Englewood Cliffs, New Jersey: Prentice-Hall, 1990.

[29] "Getting started with the DARPA TIMIT CD-ROM: An acoustic-phonetic continuous speech database," National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, 1988.

[30] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, The Design and Analysis of Computer Algorithms, Reading, Massachusetts: Addison-Wesley, 1974.

[31] A.V. Aho and J.D. Ullman, The Theory of Parsing, Translation, and Compiling, Volume I: Parsing, Englewood Cliffs, New Jersey: Prentice-Hall, 1972.

[32] E.B. Messinger, L.A. Rowe, and R.R. Henry, "A divide-and-conquer algorithm for the automatic layout of large directed graphs," IEEE Trans. on Systems, Man, and Cybernetics, vol. 21, no. 1, pp. 1-12, Jan./Feb. 1991.

[33] M.R. Fellows, "Transversals of vertex partitions in graphs," SIAM J. of Discrete Mathematics, vol. 3, no. 2, pp. 206-215, May 1990.

[34] E.R. Barnes, A. Vannelli, and J.Q. Walker, "A new heuristic for partitioning the nodes of a graph," SIAM J. of Discrete Mathematics, vol. 1, no. 3, pp. 299-305, Aug. 1988.
[35] T. Ozawa, "The principal partition of vertex-weighted graphs and its applications," Discrete Algorithms and Complexity, pp. 5-33, 1987.

[36] J.P. Hutchinson and G.L. Miller, "On deleting vertices to make a graph of positive genus planar," Discrete Algorithms and Complexity, pp. 81-98, 1987.

[37] H. Gazit and G.L. Miller, "A parallel algorithm for finding a separator in planar graphs," Proc. 28th Annual IEEE Symp. on Foundations of Computer Science, pp. 238-248, 1987.

[38] J.R. Gilbert, D.J. Rose, and A. Edenbrandt, "A separator theorem for chordal graphs," SIAM J. of Algebraic and Discrete Methods, vol. 5, no. 3, pp. 306-313, Sep. 1984.

[39] R.E. Tarjan, "Space-efficient implementations of graph search methods," ACM Trans. on Mathematical Software, vol. 9, no. 3, pp. 326-339, Sep. 1983.

[40] E.R. Barnes, "An algorithm for partitioning the nodes of a graph," SIAM J. of Algebraic and Discrete Methods, vol. 3, no. 4, pp. 541-550, Dec. 1982.

[41] N. Deo, G.M. Prabhu, and M.S. Krishnamoorthy, "Algorithms for generating fundamental cycles in a graph," ACM Trans. on Mathematical Software, vol. 8, no. 1, pp. 26-42, Mar. 1982.

[42] H.N. Djidjev, "A linear algorithm for partitioning graphs," Comptes rendus de l'Academie Bulgare des Sciences, vol. 35, pp. 1053-1056, 1982.

[43] R.J. Lipton and R.E. Tarjan, "Applications of a planar separator theorem," SIAM J. of Computing, vol. 9, no. 3, pp. 615-627, Aug. 1980.

[44] J. Hopcroft and R.E. Tarjan, "Efficient planarity testing," J. of the Association for Computing Machinery, vol. 21, no. 4, pp. 549-568, Oct. 1974.

[45] J.R. Gilbert, J.P. Hutchinson, and R.E. Tarjan, "A separator theorem for graphs of bounded genus," J. of Algorithms, vol. 5, pp. 391-407, 1984.

[46] R.C. Read, "A new method for drawing a planar graph given the cyclic order of the edges at each vertex," Research Report CORR 86-14, University of Waterloo, July 1986.

[47] R.J. Lipton and R.E. Tarjan, "Applications of a planar separator theorem," Proc. 18th Annual IEEE Symp. on Foundations of Computer Science, pp. 162-170, 1977.

[48] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, Cambridge, Massachusetts: MIT Press, 1991.

[49] G. Brassard and P. Bratley, Algorithms: Theory and Practice, Englewood Cliffs, New Jersey: Prentice-Hall, 1988.

[50] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, The Design and Analysis of Computer Algorithms, Reading, Massachusetts: Addison-Wesley, 1974.

[51] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, New York: W.H. Freeman, 1979.

[52] R. Sedgewick, Algorithms, Reading, Massachusetts: Addison-Wesley, 1988.

[53] F.S. Roberts, Applied Combinatorics, Englewood Cliffs, New Jersey: Prentice-Hall, 1984.

[54] G. Casella and R.L. Berger, Statistical Inference, Belmont, California: Wadsworth, 1990.

[55] P.E. Pfeiffer, Concepts of Probability Theory, New York: Dover Publications, 1978.

[56] A. Papoulis, Probability, Random Variables, and Stochastic Processes, New York: McGraw-Hill, 1984.

[57] R. Schwartz and Y.L. Chow, "The N-best algorithm: an efficient and exact procedure for finding the N most likely sentence hypotheses," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 81-84, 1990.

[58] D.B. Paul, "Algorithms for an optimal A* search and linearizing the search in the stack decoder," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 693-696, 1991.
[59] L.R. Bahl, P.S. Gopalakrishnan, D. Kanevsky, and D. Nahamoo, "Matrix fast match: a fast method for identifying a short list of candidate words for decoding," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 345-348, 1989.

[60] L.R. Rabiner and S.E. Levinson, "A speaker-independent, syntax-directed, connected word recognition system based on hidden Markov models and level building," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 3, pp. 561-573, 1985.

[61] H. Ney, "The use of a one-stage dynamic programming algorithm for connected word recognition," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 2, pp. 263-271, 1984.

[62] F. Jelinek, L. Bahl, and R.L. Mercer, "Design of a linguistic statistical decoder for the recognition of continuous speech," IEEE Trans. on Information Theory, vol. IT-21, no. 3, pp. 250-256, 1975.

[63] C.C. Tappert, N.R. Dixon, and A.S. Rabinowitz, "Application of sequential decoding for converting phonetic to graph representation in automatic recognition of continuous speech (ARCS)," IEEE Trans. on Audio and Electroacoustics, vol. AU-21, pp. 225-228, 1973.

[64] L.R. Bahl, P. de Souza, P.S. Gopalakrishnan, D. Kanevsky, and D. Nahamoo, "Constructing groups of acoustically confusable words," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 85-88, 1990.

[65] K.F. Lee and H.W. Hon, "Speaker-independent phone recognition using hidden Markov models," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641-1648, 1989.

[66] P.S. Cohen and R.L. Mercer, "The phonological component of an automatic speech-recognition system," IEEE Symp. on Speech Recognition, pp. 177-187, 1974.

[67] J. Makhoul, S. Roucos, and H. Gish, "Vector quantization in speech coding," Proc. of the IEEE, vol. 73, no. 11, pp. 1551-1587, 1985.

[68] R.M. Fano, "A heuristic discussion of probabilistic decoding," IEEE Trans. on Information Theory, pp. 64-74, 1963.

[69] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Englewood Cliffs, New Jersey: Prentice-Hall, 1978.

[70] J.D. Markel and A.H. Gray, Jr., Linear Prediction of Speech, Berlin, Heidelberg, New York: Springer-Verlag, 1976.

[71] E.J. Yannakoudakis and P.J. Hutton, Speech Synthesis and Recognition Systems, England: Ellis Horwood Limited, 1987.

[72] J.S. Bridle, M.D. Brown, and R.M. Chamberlain, "An algorithm for connected word recognition," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 899-902, 1982.

[73] H. Sakoe, "Two-level DP-matching - a dynamic programming-based pattern matching algorithm for connected word recognition," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 27, no. 6, pp. 588-595, 1979.

[74] A.N. Ince, Digital Speech Processing: Speech Coding, Synthesis and Recognition, Boston: Kluwer Academic Publishers, 1992.

[75] S. Furui, Digital Speech Processing, Synthesis, and Recognition, New York: Marcel Dekker, Inc., 1989.

[76] D. O'Shaughnessy, Speech Communication: Human and Machine, New York: Addison-Wesley, 1988.

[77] Y. Linde, A. Buzo, and R.M. Gray, "An algorithm for vector quantizer design," IEEE Trans. on Communications, vol. COM-28, no. 1, pp. 84-95, Jan. 1980.

[78] J.G. Wilpon, L.R. Rabiner, C.-H. Lee, and E.R. Goldman, "Automatic recognition of keywords in unconstrained speech using hidden Markov models," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 38, no. 11, pp. 1870-1878, 1990.

[79] J.G. Wilpon, C.-H. Lee, L.R. Rabiner,
and E.R. Goldman, "Application of hidden Markov models for recognition of a limited set of words in unconstrained speech," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 254-257, 1989.

[80] J.G. Wilpon, L.G. Miller, and P. Modi, "Improvements and applications for key word recognition using hidden Markov modeling techniques," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 309-312, 1990.

[81] L.R. Bahl, R. Bakis, P.S. Cohen, A.G. Cole, F. Jelinek, B.L. Lewis, and R.L. Mercer, "Recognition results with several experimental acoustic processors," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 249-251, 1979.