This is to certify that the dissertation entitled "Evaluation and Improvement of the HMM by State-Space Modeling," presented by Yong-Beom Lee, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering. Major Professor. Date: 2000. Michigan State University.

EVALUATION AND IMPROVEMENT OF THE HMM BY STATE-SPACE MODELING

By Yong-Beom Lee

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Electrical and Computer Engineering
2000

ABSTRACT

EVALUATION AND IMPROVEMENT OF THE HMM BY STATE-SPACE MODELING

By Yong-Beom Lee

Analytical modeling of speech production is not an easy task, in part because of the rapidly time-varying nature of speech signals. The hidden Markov model (HMM) is widely used for the stochastic modeling of time-varying signals, and it has been most widely applied in the area of speech production and recognition. Most current HMM research has focused on its applications. On the other hand, studies of the theoretical aspects of the HMM are relatively few. This is due to the difficulties of analyzing a model that is inherently probabilistic and recursive in nature. However, if the fundamentals of the HMM are approached from a different direction, it is possible to obtain useful analyses of the HMM which contribute to its use in speech technologies.

The main objective of this dissertation is to revisit and further investigate three fundamental HMM problems related to speech recognition using a novel mathematical formulation. Rather than the conventional representation of the HMM as a scalar recursive algorithm, the HMM will be represented using a vector-matrix formulation. It will be shown that the HMM can be represented as a state-space model. The conventional Baum-Welch (time-varying) model as well as an "approximate" time-invariant model will be studied in detail in the context of this new formulation. A more thorough theoretical and empirical investigation of this approximate model is presented in this dissertation. In particular, the spoken-digit recognition problem will be the focus of applied studies. Some useful results and techniques using the time-invariant approximation of the HMM are addressed and analyzed. In addition, new state-search techniques using clustering, and novel set-membership identification techniques, are developed as the basis for a novel HMM training approach. The new training results in HMM state assignments corresponding to acoustically meaningful segmentation of the speech, rather than adherence to the conventional maximum likelihood criterion. The results of the new search techniques are compared to those of the Viterbi search.
To my wife and daughter For their love, support, and sacrifice iv ACKNOWLEDGMENTS I would like to extend my deep thanks and gratitude to Professor John R. Deller, Jr. for his guidance and encouragement of my graduate program for quite a long time. His direction was very important in helping me step into the speech processing and digital signal processing field. Not only the idea for the state Space formulation of the HMM, but also major parts of the problems in this thesis were suggested by Professor Deller, who also made uncountable number of suggestions for the research. Moreover, I am very thankful for his meticulous guidance for the writing. Also, I would like to thank all the members of my thesis committee: Dr. H. Khalil, Dr. C. Wei], Dr. P. Pierre, and Dr. J. Deller, Jr. Finally, I give heartful thanks to my family and parents for their love, support, patience, and encouragement. Contents List of Tables viii List of Figures ix 1 Introduction 1 1.1 Background ................................ 1 1.2 History of the Vector-Matrix Formulations of the HMMs ....... 3 1.3 Problems of Existing Vector-Matrix Formulations of the HMM . . . . 5 1.4 Objectives ................................. 6 2 Vector-Matrix Formulations of the HMM 8 2.1 HMM Background ............................ 9 2.2 Time-Varying Forward-Backward HMM ................ 11 2.2.1 Evaluation Problem ........................ 11 2.2.2 Decoding Problem ........................ 23 2.2.3 Training (Estimation) Problem ................. 26 2.3 Time-Invariant Approximation for the HMM .............. 30 2.4 Transformations of State Equations ................... 33 2.4.1 Transformation of Time-Invariant State Equation ....... 33 2.4.2 Transformation of the Time-Varying State Equation ...... 36 2.5 Analysis of Illegal Paths Caused by Approximation .......... 38 2.5.1 Likelihood Difference ....................... 39 2.5.2 Comparison of the State-Transition Matrices .......... 41 2.6 Validity of the T ime—Invariant Approximation of the HMM ...... 43 2.6.1 Matrix Norm Approach ..................... 43 2.6.2 Likelihood Expansion Approach ................. 47 2.6.3 Matrix Inversion Approach .................... 48 2.6.4 Eigenanalysis Approach ..................... 50 3 Practical Issues in the Use of the TIA HMM 53 3.1 Efficient Evaluation Technique ...................... 54 3.2 Analysis of the TIA HMM ........................ 58 3.2.1 Likelihood Structures ....................... 58 3.2.2 Experimental Comparisons of Likelihood Between Model Types 62 3.2.3 State Probability Distribution Vector in the TIA HMM . . . . 67 vi 3.2.4 Comparison of z(t) and 7(t) ........... . ........ 68 3.2.5 Experimental Results on the Effects of A ............ 72 3.3 Reconciliation of the TIA HMM ..................... 78 3.3.1 Feedback Control ......................... 78 3.3.2 Stochastic Modeling of Temporal Information in the TIA HMM 80 3.4 Discussion ................................. 82 Training HMMs so that Hidden Model States Meaningfully Repre- sent Acoustic States 84 4.1 Maximum Likelihood Approach to State Sequence Determination . . 86 4.1.1 Introduction ............................ 86 4.1.2 Experimental Results ....................... 88 4.2 State Sequence Based on “Acoustic Distance” ............. 98 4.2.1 Introduction ............................ 99 4.2.2 The Concept ........................... 101 4.2.3 Recursive Viterbi Search Based on k-Means .......... 108 ' 4.2.4 Experimental Results ....................... 
110 4.2.5 Appropriate Number of States .................. 118 4.2.6 Remark .............................. 125 4.3 State Search by Set-Membership Identification ............. 125 4.3.1 Original Thoughts about Exploiting the SM ID ........ 125 4.3.2 Background of the SM Identification .............. 130 4.3.3 State Search Using the SM Identification ............ 132 Conclusions and Future Research 138 5.1 Conclusions ................................ 138 5.2 Future Research .............................. 141 vii List of Tables 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 3.12 3.13 3.14 4.1 Approximate computational complexities for computing (A(t + 1)A)" by three different approaches. ...................... Likelihood from the F-B HMM in leave-one-out-test. ......... Likelihood from the TIA HMM based on leave-one-out-test. ..... Statistical Properties of the likelihood results from the F-B HMM. . . Statistical Properties of the likelihood results from the TIA HMM. . . State probability distribution vectors under Bakis, :1:(t) B, and ergodic, x(t)e, constraints .............................. Sum of likelihoods of fifteen training utterances for each digit associ- ated with three different state-transition matrices in the F-B HMM. . Likelihood P(0 I M) using A,- and B for each digit 2' in a resubstitu- tion test ................................... Likelihood 11le P(0¢ | M) using A,- and B for each digit 2' in a re- substitution test. ............................. Likelihood P(0 | M) using AB, and BAH for each digit in a resub- stitution test. ................ 1 ............... Likelihood 11;, P(0t | M) using A3, and B As for each digit in a resubstitutiontest. .................l ........... Likelihood P(0 | M) using AB, and B As, for each digit in a resub- stitution test. ............................... Likelihood [1:1 P(Ot | M) using AB, and B As for each digit in a resubstitutiontest. ................2 ........... The likelihoods based on L(0 | M, M') with the TIA HMM. . . . . Likelihoods of the conventional Viterbi search and the recursive Viterbi search based on k-means clustering .................... viii 57 65 65 66 66 68 73 75 76 76 77 77 82 116 List of Figures 2.1 3.1 3.2 4.1 4.2 4.3 4.4 4.5 4.6 4.7 Six-state Bakis topology of the HMM. ................. State probability distribution after training digit “four.” ....... The average of 7,-(t),i = 1,. . . ,5 from the entire training utterances of word “four.” ................................ State search results from the conventional Viterbi and Q‘ = '{q{ = [If arg max,, P(0., q. | A4,), 2': 1,. . . , 10 in a five-state Bakis HMM of a spoken word “one.” Note that each graph represents a different “i” except top two figures in the left column. The tests employ resubstitution. State search results from the conventional Viterbi and Q" 2 [LT q; = [If arg max,,, P(0t,qt I Mg), i = 1,. . . , 10 in a five—state Bakis HMM of a spoken word “two.” Note that each graph represents a different “i” except top two figures in the left column. The tests employ resubstitution. State search results from the conventional Viterbi and Q‘ = [If q; = II? arg max,, P(0.,q, | M,), i = 1,. . . , 10 in a five-state Bakis HMM of a spoken word “four.” Note that each graph represents a different “i” except top two figures in the left column. The tests employ resubstitution. State search results from the conventional Viterbi and Q‘ = [1? q; = [If arg maxq, P(0¢,qt | Mg), i = 1,... 
,10 in a five-state Bakis HMM of a spoken word “six.” Note that each graph represents a different “i” except top two figures in the left column. The tests employ resubstitution. State search results from the conventional Viterbi and Q" = I]? q; = II? arg maxqt P(0t, qt I M,), i = 1,. . . , 10 in a five-state Bakis HMM of a spoken word “one.” Note that each graph represents a different “i” except top two figures in the left column. The tests employ leave-one-out. State search results from the conventional Viterbi and Q" = H? q; = [If argmaxq, P(0¢,qt I Mi), i = 1,... ,10 in a five-state Bakis HMM of a spoken word “two.” Note that each graph represents a different “i” except top two figures in the left column. The tests employ leave-one-out. State search results from the conventional Viterbi and Q" = I]? q; = [If arg maxq, P(0., qt I M,), i = 1,. . . , 10 in a five-state Bakis HMM of a Spoken word “four.” Note that each graph represents a different “i” except top two figures in the left column. The tests employ leave- one—out. .................................. ix 89 90 91 92 93 94 4.8 4.9 4.10 4.11 4.12 4.13 4.14 4.15 4.16 4.17 4.18 4.19 4.20 4.21 4.22 State search results from the conventional Viterbi and Q‘ = ,Tq’,‘ = [If arg maxq, P(0t, q; | M;), i = 1, . . . , 10 in a five-state Bakis HMM of a spoken word “six.” Note that each graph represents a different “i” except top two figures in the left column. The tests employ leave-oneout. 96 State segmentation resulting from conventional ML (Viterbi) training of a five-state Bakis HMM for the utterance “six.” The resulting seg- mentation is not coherent with the physical dynamics of the speech. . 100 Euclidean distances between four different symbols (symbol “zero,” “32,” “64,” and “96”) and the rest of symbols indexed along the ab- scissa in the codebook. Each symbol represents a centroid of the cluster in the feature vector Space ......................... 105 State sequence for the spoken word “six” by the Viterbi search tech- nique of a five-state Bakis HMM. Five different sets of initial values have been assigned to A and B ...................... 112 State sequences for the Spoken word “four” in a five-state Bakis HMM. The second figure is the consequence of the conventional Viterbi search. The third figure is the result of the recursive Viterbi search based on k-means clustering. The 4‘“ graph is the conventional Viterbi search result based on the third graph. ..................... 114 State sequences for the spoken word “six” in a five-state Bakis HMM. The second figure is the consequence of the conventional Viterbi search. The third figure is the result of the recursive Viterbi search based on k-means clustering. The 4‘“ graph is the conventional Viterbi search result based on the third graph. ..................... 115 Davis-Bouldin relative index for fifteen utterances of spoken word “one.” 121 Davis-Bouldin relative index for fifteen utterances of spoken word “two.” 122 Davis-Bouldin relative index for fifteen utterances of spoken word “four.” 123 Davis-Bouldin relative index for fifteen utterances of spoken word “six.” 124 State search results for word “four” using the recursive Viterbi search based on k-means clustering with different initial clusters ........ 126 State search results for word “six” using the recursive Viterbi search based on k-means clustering with different initial clusters ........ 127 Some informative results regarding state segmentation by the SM tech- nique for the word “six.” ......................... 
134 State segmentation result by the SM theory with various values of a for word “four.” .............................. 136 State segmentation result by the SM theory with various values of a for word “six.” Chapter 1 Introduction 1.1 Background Speech is the most natural way of transferring information among human beings. Speech recognition by a machine (e.g., a computer) is a way to translate human speech into corresponding text so that a machine can perform productive work automatically according to human speech inputs. It has many applications such as word dictation, voice activated dialing, automated attendant, command control of a machine, and so forth. The eventual goal of speech processing in engineering is to make machines understand human speech as naturally as humans communicate with one another. While human communication is natural and easy because of the extraordinary capability of the human brain, speech recognition by a machine is not straightforward in Spite of the remarkable deve10pment of computer technology. There are some practical reasons why processing of speech signals is challenging. For example, the same phoneme, when spoken by different speakers, will be acous- tically different due to variations in vocal-tract anatomy. Also, the same speaker may produce different versions of the same sound under different circumstances, for instance, when s/ he is suffering from a cold and when s/he is not [3]. Certain sounds may be shortened or completely left out when the speaker talks very fast. Differences in dialect, like phoneme deletion or phoneme substitution, also complicate the Speech recognition process. Other problems, like speech related noise (e.g., lips smacks and tongue clicks), add to the difficulty of the recognition systems. It is obvious that without some simplifying assumptions the task of modeling speech for recognition would be highly impractical. The hidden Markov model (HMM) is a popular technique in many contempo- rary applications in signal processing, communications, and control. In particular, the HMM has been used to successfully and automatically cope with acoustic uncer— tainty in speech signal applications [4] using the statistical modeling. For example, achieving a flexible model for rapidly time-varying signals using a dynamic program- ming technique [4] is very difficult. The popularity of the HMM is due to its simplicity, compactness, and easy implementation. Along with the dynamic time warping tech- nique (DTW), the HMM has been applied to Speech recognition systems for many years. In particular, this technique can be globally applied to a large and complex speech recognition system [4]. Although the HMM has been widely researched and extensively applied to the speech processing field, it is true that it lacks diverse formulations so that the HMM can be exploited more efficiently depending on specific applications. This is because of its inherent time-varying nature as well as the somewhat complex recursive structure associated with sophisticated model. HMM research has focused primarily on its application rather than its fundamental characteristics. 1.2 History of the Vector-Matrix Formulations of the HMMs The conventional Baum-Welch algorithm, which is also called the Forward-Backward (F—B) algorithm [1, 2, 4], is, a concise, and compact representation of the HMM dynamics for explaining quickly time-varying speech signals efficiently. 
Whether an any-path or a best-path (Viterbi path) [4] criterion is considered for evaluation and training of the HMM, the conventional F-B approach to the HMM is based on a series of scalar recursions [1, 2, 4]. Also, because of the model characteristics associated with stochastic, time-varying signals, the F-B HMM inherently lacks the flexibility and range of application of a linear time-invariant model, which has stationary parameters representing the model.

Work cited in [6, 7, 8, 27, 30, 31, 32] represents several independent approaches that take advantage of vector-matrix formulations of the HMM. Vector-matrix representations of the HMM provide a diverse and unified way to interpret HMM operation. In recent papers by Turin and Karan [27, 31], matrix HMM formulations have been exploited to find useful algorithms for speech recognition technology. In addition, Hjalmarsson et al. [30] have used a state-space formulation of the HMM to find non-recursive formulae for training the HMM. Similarly, in work by Elliott et al. [32], a state-space formulation of the HMM has been used for estimation and control. Except for the work conducted in the author's laboratory [6, 7, 8], these vector-matrix formulations are all based on the F-B HMM. Vector-matrix formulations of the time-invariant approximation (TIA) of the HMM, which is different from the conventional F-B HMM, were first proposed by Snider and Deller [6, 7].

Turin [27] has proposed vector-matrix representations of the HMM to allow parallel computing to achieve some computational savings during training and evaluation. In particular, he uses vector-matrix formulations to obtain a more computationally efficient algorithm when speech signals satisfy a certain condition. Karan et al. [31] have used a matrix formulation for the algorithm proposed by Streit [29] in computing the eventual moments, defined as
$$M_{j,i}(k, T) = E\{P_j(O_T)^k\} = \sum_{O_T} P_i(O_T)\, P_j(O_T)^k,$$
to measure moments of the output-sequence probabilities of the HMM $\mathcal{M}_j$ with respect to $\mathcal{M}_i$. Here $\mathcal{M} = \{N, M, A, B, \pi\}$ is the set of parameter matrices defining the characteristics of the HMM, with state-transition matrix $A$, observation probability matrix $B$, initial state distribution vector $\pi$, and $N$ and $M$ related to the sizes of the matrices. The evaluation of $M_{j,i}(k, T)$ proposed in [29] uses vector-matrix descriptions to overcome a computational difficulty by simplifying the recursive scalar computations, which have a formulation similar to that of the a posteriori probabilities [1, 2] $P(O, q_t = i \mid \mathcal{M})$ in the F-B HMM. Such computational savings are possible due to the asymptotic analysis of the dynamics of the state-space equations.

Elliott et al. [32] suggest a unique state-space model for the two stochastic variables, state and observation, leading to an independent, identically distributed (i.i.d.) observation process through a change of probability measure. By forming such an ideal distribution, it is possible to obtain several key results related to the state estimator by applying the Fubini theorem, which allows interchange of expectation and summation in the product measure space. This technique has shown the capability of the state-space structure of the HMM in state estimation and control.

Snider and Deller [6, 7, 8] adopt a simplified likelihood measure, $\prod_{t=1}^{T} P(O_t)$, to allow a more compact and analyzable approach to evaluating the HMM likelihood, $P(O \mid \mathcal{M})$.
Depending on the circumstances, it is possible to control the compression index, the ratio of the number of eigenvalues merged to the total number of eigenvalues in all the HMMS, for trade-off between speech recognition rate and Speed and memory complexity requirements. 1Precise definitions of notations are established in Chapter 2. 4 1.3 Problems of Existing Vector-Matrix Formula- tions of the HMM In spite of useful results inherent in the vector-matrix and state-Space approaches to the HMM cited above, Open issues remain. First, in [27], to obtain a computationally economical formulation with a vector- matrix formulations of the PB HMM, it is assumed that long stretches of the same symbol string occur within a Speech utterance. In fact, such a condition on a speech signal is very helpful to have more computational savings in training and testing of the HMM in reality. For instance, for a very limited small scale system which has a small vocabulary as well as a small number of symbols with simple waveform structures, 'Ihrin’s condition on signals is justified; therefore, further computational savings can be attained without compromising recognition performance. In addition, under the very unusual circumstance that the speaker is restricted in the number of sounds s/he can reliably produce, Turin’s condition is valid even with a relatively large vocabulary. However, in reality ’I‘urin’s condition on signals is not ubiquitous in the quickly time-varying speech signals. In practice, even a word model which may have as many as 128 symbols after being quantized for the purpose of efficient, secure storage and transmission, does not usually adhere to Turin’s condition very well. Also, for a large scale system with a large vocabulary and complex speech waveform structures, it is unusual to find that Turin’s condition is met. Even for a sentence model or a compound model which is composed of concatenating of phones or word HMMS, it is not easy to argue in support of Turin’s condition on speech signals. In other words, there is a limitation in applying ’I‘urin’s algorithms to the general speech signal ap- plications. When speech signals do satisfy 'I‘urin’s condition, however, computational savings can be obtained. This will be discussed in detail later in the thesis. The work Of Elliott et al. [32] is mainly focused on finding an Optimal estimation algorithm from Observed signals to reveal the originating signals transmitted in a noisy environment. Further, Elliott’s derivations are focused mostly on the estimation Of states and unknown parameters without a specific procedure for the likelihood evaluation Of the HMM. This is a significant derivations from HMM modeling and use in speech recognition. Snider and Deller report empirically useful results in terms Of performance and computation savings, but Offer little discussion of the general viability of the modified likelihood [LT=1 P(Ot). Such a likelihood measure is not identical to P(0 | M) Of the F—B HMM because Of the potential for including extra likelihoods from illegal state paths [4, 7, 25]. A brief explanation about this issue is discussed in [4]. 1.4 Objectives In this research, the focus is on the use Of HMMs to model the acoustic process at the lowest levels Of a Speech recognizer. First, the conventional HMM with a state-space structure will be reformulated leading to a versatile computational structure with rich interpretability. In particular, a TIA HMM suggested in [6, 7, 8] will be a main focus Of this dissertation. 
The viability Of the TIA HMM will be shown through several formal approaches. Such time-invariant formulations and corresponding likelihood measures will be argued tO be proper approximations Of the conventional F-B HMM. This thesis is composed of five chapters. The present chapter is a short introduc- tion to, and background of, this research. The second chapter introduces time-varying and time-invariant state-Space models of the HMM. Vector and matrix formulation notations are used to describe the three fundamental problems of the HMM. In partic- ular, the “illegal state path problem” inherent in the TIA HMM is briefly discussed. The third chapter deals intensively with the problem of illegal state paths produced by the TIA HMM. A few evolving techniques that reconcile the conventional F-B HMM to the TIA HMM are discussed. Chapter 4 is focused on the problem Of finding an apprOpriate state sequence in some sense for a given speech signal. New state-search techniques using the maximum likelihood criterion, clustering, and novel set-membership identification techniques are develOped for HMM training. The re— sults of these search techniques are compared to those Of the conventional Viterbi search. The final chapter, Chapter 5, presents research conclusions and future re- search directions. In this research, theoretical results are applied mainly to the isolated digit recogni- tion problem, one Of the classical problems Of speech recognition. For the experimental studies in this research, input speech, which is uttered by an American male speaker, is sampled at a rate of 10kHz. More details Of this Speech corpus is described in Chapter 3. Because Of the non-stationary nature of speech utterance, the acoustic feature extraction is performed on sampled data on a frame-by-frame basis. Hamming window analysis is applied to each frame, all Of which are 25ms long with a 15ms overlap. Then 10‘“ order mel-frequency cepstral coefficients are computed. This produces a sequence of cepstral speech vectors, or as it is usually called, a speech pattern. These speech patterns are classified based on the seven level clusters so that each speech pattern is represented by 128 symbols. It has been implicitly assumed that the given speech signals are free from all background noise so that we can concentrate exclusively on the main modeling problem. Chapter 2 Vector-Matrix Formulations of the HMM In this chapter, we first review briefly the HMM theory to support new derivations based on the conventional mathematical formulations. Three basic problems Of the HMM, evaluation, estimation (training), and decoding, will be introduced. Then, these basic HMM problems will be reformulated in vector-matrix notation. The con— ventional F-B algorithm as well as the Viterbi search algorithm will also be subsumed under this vector-matrix formulation. Third, the TIA HMM prOposed by Snider and Deller [6, 7, 8] will be discussed, followed by the transformation of the TIA and F-B HMM formulations. Next, some useful characteristics of the HMM discovered using the vector-matrix formulations will be discussed. They give a framework in which to take advantage of the TIA HMM. Fourth, it will be shown analytically with several approaches that such an approximate approach for the HMM evaluation is proper under some practical conditions. Finally, the “illegal path problem” inherent in such an approximation will be briefly discussed. This apparent defect Of the TIA HMM will be treated extensively in Chapter 3. 
2.1 HMM Background

The HMM was first applied to speech technology independently by Baker [21] at Carnegie Mellon University and by Jelinek at IBM in 1975 [1, 2, 3, 4]. When it was first published, it was neither called the HMM, nor was it developed to model and recognize speech signals [22, 23, 24]. However, because of its excellent performance in (apparently) explaining the properties of highly variable signals (this work, in part, investigates whether the HMM accurately models the physical properties of the speech production system; see Chapter 4), it has been broadly used in the area of speech signal processing. The major capability of the HMM lies in its ability to structure the information content of variable data. It also systematically translates this information into a set of stochastic parameters. The HMM uses a stochastic approach to explain the characterization of speech variability. It is used to model a doubly stochastic production process with the transition parameters modulated by a Markov chain [1, 2]. Thus, the observed speech sequence (specified later in this section) is assumed to be the result of the interaction of two stochastic processes.

The Markovian assumption on the transition probabilities of the HMM imposes two major constraints on the possible variations in the speech production system. The first constraint is a state model, and the other is the dynamics of state transitions according to the Markovian assumption. These allow a compact description of the time-varying speech signal, assumed to represent "acoustic states" of speech production. The HMM is a versatile model which can be used to represent a word, a subword unit, or, in principle, a complete sentence or paragraph [4]. Figure 2.1 shows a typical six-state HMM with left-to-right, or Bakis, state-transition constraints [4].

Let us now formalize the dynamics of the HMM. Recall the definition of a homogeneous first-order discrete-state Markov process as one which can be in one of N states (by convention, integers are used to represent states, as shown in Fig. 2.1) $S_1, S_2, \ldots, S_N$ and whose state transitions depend only upon the most recent state. Let us denote the state of the system in the abstract at discrete time t by $q_t$. We denote the stationary conditional probabilities, the probability of a transition to state $S_i$ given that the previous state is $S_j$, by
$$a_{ij} = P(q_t = S_i \mid q_{t-1} = S_j), \qquad 1 \le i, j \le N. \qquad (2.1)$$
Additionally, let the initial state probabilities be denoted
$$\pi_i = P(q_1 = S_i), \qquad 1 \le i \le N. \qquad (2.2)$$
The realization of the process is a state sequence, say, $\{q_1, q_2, \ldots, q_T\}$. This process is completely characterized by the number of states N, the set of state-transition probabilities $\{a_{ij}\}$, and the set of initial state probabilities $\{\pi_i\}$.

[Figure 2.1: Six-state Bakis topology of the HMM.]

Now consider a discrete-observation hidden Markov process. In the discrete HMM, an observation sequence is assumed to be generated by jumping from state to state. With each jump, either during the transition (on the arc) or upon arrival at the next state, an observation is emitted. At each time, an unknown state emits observation symbol $O_t = k$, $1 \le k \le M$, according to the conditional distribution
$$b_j(k) = P(\text{observation } k \text{ at time } t \mid q_t = S_j) = P(O_t = k \mid q_t = S_j), \qquad 1 \le j \le N, \quad 1 \le k \le M, \qquad (2.3)$$
where M is the number of distinct observation symbols. Symbol $O_t$ is the index of some characteristic measurement extracted from speech, usually frame-wise. Therefore, a speech signal is reduced to strings of features extracted from the acoustic speech waveform.
The generated observation sequence, denoted in the abstract by $O = \{O_1, O_2, \ldots, O_T\}$, is a realization of a doubly stochastic process, i.e., a random process generated by an unobservable random process. Here, T is the number of observations in the sequence. A model governed by such a probabilistic structure is the hidden Markov model when the unobservable random process is a stationary Markov process as described above. In the remainder of the discussion, we shall use the notation $\mathcal{M}$ to denote the set of elements of an HMM, namely $\mathcal{M} = \{N, M, \{a_{ij}\}, \{b_j(k)\}, \{\pi_i\}\}$.

2.2 Time-Varying Forward-Backward HMM

In this section, we review the three HMM problems (evaluation, decoding, and training or estimation) and reformulate the conventional F-B HMM in vector-matrix terms. We then exploit this structure to discover interesting properties of the HMM. We also derive some useful expressions for HMM implementation based on state-space formulations.

2.2.1 Evaluation Problem

The recognition or evaluation problem involves the determination of the conditional likelihood for a given observation string O, namely $P(O \mid \mathcal{M})$. The most natural measure of the likelihood of a given HMM, say $\mathcal{M}$, in light of observation sequence O would be the a posteriori probability $P(\mathcal{M} \mid O)$. However, the available data will not allow us to characterize $P(\mathcal{M} \mid O)$ during the training process; under the condition of equal a priori probability $P(\mathcal{M})$ among the HMMs, it is conventional to take the conditional observation probability $P(O \mid \mathcal{M})$ as the likelihood measure instead [4].

Vector-Matrix Formulation of the HMM Along a Forward Path

In the F-B solution to this problem [23], the forward probabilities are defined by
$$\alpha_i(t) = P(O_1, O_2, \ldots, O_t, q_t = i \mid \mathcal{M}) \qquad \text{for } i = 1, \ldots, N. \qquad (2.4)$$
This quantity is the joint probability of the partial observation sequence to time t and residence in state i at time t, given the model $\mathcal{M}$. These probabilities can be calculated recursively as follows [4]. For each state $j = 1, \ldots, N$ and for each $t \ge 1$,
$$\alpha_j(t+1) = \left[\sum_{i=1}^{N} a_{ji}\,\alpha_i(t)\right] b_j(O_{t+1}), \qquad (2.5)$$
where $a_{ji}$ and $b_j(O_{t+1})$ are defined in (2.1) and (2.3). For the initial time, $\alpha_j(1)$ is defined as $b_j(O_1)\pi_j$. The final conditional observation probability is
$$P(O_1, O_2, \ldots, O_T \mid \mathcal{M}) = \sum_{i=1}^{N} \alpha_i(T). \qquad (2.6)$$

The HMM has been developed and studied principally through such conventional formulations. Here, "conventional formulations" means that, in the evaluation and training (explained later) required in HMM computations, the algorithm is basically focused on individual computations for each state, as in (2.5), rather than on an integrated form that combines the state computations. In general, because it represents a linear, time-varying state-space system, the HMM can be studied principally by experiment. However, by combining the state processing, it is possible to acquire several significant insights into the HMM which might be difficult to discover otherwise. Once revealed, these useful characteristics of the HMM can be applied for practical benefit.

Generally, matrices provide convenient tools for systematizing laborious calculations by providing a compact notation for describing complicated interrelationships among system variables. Through vector-matrix notation, it is possible to process all HMM states simultaneously and to reveal useful properties of the HMM in the process. Let us reformulate the three HMM problems with vector-matrix notation to provide one important basis of this research.
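Before turning to the vector-matrix form, the scalar recursion can be illustrated numerically. The following is a minimal MATLAB sketch, not taken from the dissertation; the three-state Bakis model, the symbol probabilities, and the short observation string are arbitrary values assumed purely for illustration. The transition matrix is stored column-stochastically, so entry (i, j) holds the probability of moving to state i from state j, matching the convention used in the vector-matrix development below.

```matlab
% Minimal sketch (not from the dissertation): scalar forward recursion
% (2.4)-(2.6) for a small, arbitrarily chosen 3-state Bakis model.
N = 3;  M = 4;                          % number of states and of symbols
A = [0.6 0.0 0.0;                       % column-stochastic: A(i,j) = P(to i | from j)
     0.4 0.7 0.0;
     0.0 0.3 1.0];
B = [0.5 0.1 0.2;                       % B(k,j) = b_j(k) = P(symbol k | state j)
     0.3 0.2 0.1;
     0.1 0.4 0.3;
     0.1 0.3 0.4];
pivec = [1; 0; 0];                      % Bakis model: start in state 1
O = [1 2 2 3 4];   T = numel(O);        % assumed observation (symbol) string

alpha = B(O(1),:)' .* pivec;            % alpha_j(1) = b_j(O_1)*pi_j
for t = 1:T-1
    alpha = B(O(t+1),:)' .* (A*alpha);  % recursion (2.5), all states j at once
end
P = sum(alpha);                         % P(O_1,...,O_T | M), equation (2.6)
fprintf('P(O | M) = %.6e\n', P);
```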
For an N-state HMM, the N recursions of (2.5) for $\alpha_i(t)$, $i = 1, \ldots, N$, can be written in vector-matrix form as
$$
\begin{pmatrix} \alpha_1(t+1) \\ \alpha_2(t+1) \\ \vdots \\ \alpha_N(t+1) \end{pmatrix}
=
\begin{pmatrix} b_1(O_{t+1}) & 0 & \cdots & 0 \\ 0 & b_2(O_{t+1}) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & b_N(O_{t+1}) \end{pmatrix}
\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & & & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix}
\begin{pmatrix} \alpha_1(t) \\ \alpha_2(t) \\ \vdots \\ \alpha_N(t) \end{pmatrix}. \qquad (2.7)
$$
This equation can be viewed as the state equation of an N-state state-space model with state variables $\alpha_i(t)$, $i \in [1, N]$. (Note that the "states" in this context denote mathematical variables used to represent an alternative time-domain dynamics of an HMM. The meaning of "state" in the context of the "state model" explaining an HMM, on the other hand, is that within such a state a signal possesses some measurable and distinctive properties.) The state equation can be used for $t = 1$ by adding the input term
$$
\begin{pmatrix} b_1(O_1) & 0 & \cdots & 0 \\ 0 & b_2(O_1) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & b_N(O_1) \end{pmatrix}
\begin{pmatrix} \pi_1 \\ \pi_2 \\ \vdots \\ \pi_N \end{pmatrix}\delta(t) \qquad (2.8)
$$
to the right side of (2.7), in which $\delta(t)$ is the Kronecker sequence [4, 33]. In vector-matrix notation, we can write the complete state equation as
$$\alpha(t+1) = \Lambda(t+1)A\,\alpha(t) + \Lambda(1)\pi\,\delta(t), \qquad (2.9)$$
where the vector and matrix definitions are obvious by correspondence to (2.7) and (2.8). The output equation for the state-space model is
$$y(t) = C'\alpha(t). \qquad (2.10)$$
The prime in $C'$ denotes the matrix transpose. The only output of significance (for making a final decision) is
$$P(O_1, O_2, \ldots, O_T \mid \mathcal{M}) = y(T) = C'\alpha(T), \qquad (2.11)$$
with C defined to be a vector of ones,
$$C' = \mathbf{1}' = [1, 1, \ldots, 1]. \qquad (2.12)$$

Analysis of HMMs with Vector-Matrix Formulations Along a Forward Path

Note that, since the probability of making a transition to some state at each time is always unity, each column of A must sum to one. Accordingly, A is a column-stochastic matrix [38], which, in turn, makes it a non-negative matrix [44, 45]. An important consequence of this is that the vector $\mathbf{1}$ consisting of all ones is a left eigenvector of A with eigenvalue 1 [39], so that $\mathbf{1}'A = \mathbf{1}'$. Non-negative matrices occur in a variety of applications [45]. Non-negativeness implies useful characteristics that can be used to analyze the dynamics of a model [51]. Furthermore, in the left-to-right or Bakis HMM, which is often employed in speech recognition, A is a lower-triangular matrix with strong diagonal components. In this case, the eigenvalues of A are the diagonal elements themselves. Moreover, because of its triangular structure, if all the eigenvalues are distinct, then the eigenvector matrix of A is also triangular. Finding the eigenstructure of such models is therefore relatively inexpensive computationally. The use of this eigenstructure will be explained later in this chapter.

The vector-matrix representation (2.9) and (2.10) reveals interesting results that are not apparent in the usual F-B recursions. Above all, it is the combination of $\Lambda(t)$ and A, which comprises the effective state-transition part of the state equation, that monitors and quantifies state-path information. Here A has dimensions N x N and provides all $N^2$ state-transition probabilities at a given time. This implicitly includes information about whether a given state transition is possible. The premultiplication of A by $\Lambda(t)$ regulates the possible paths through the states in light of the states' abilities to generate certain observations. Non-zero values in the diagonal elements of the $\Lambda(t)$ matrix allow state jumps at t, depending on the locations of those non-zero values, whereas zeros prohibit such transitions. For example, if $\Lambda_{ii}(t)$ is zero, meaning that the symbol at time t cannot be generated from state i, then the i-th row of $\Lambda(t)A$ is also zero. Therefore, jumps from any state to state i at time t are prohibited. Consequently, $\Lambda(t)$ can be regarded as a sort of switching matrix at time t which specifies the available transitions in accordance with the topology of A. Thus, if there is a legal path from an initial state to a final state associated with a T-duration speech utterance $O = \{O_1, O_2, \ldots, O_T\}$, the T multiplications of the matrix pairs $\Lambda(t)A$, $t = 1, \ldots, T$, produce a non-zero matrix. Such a non-zero matrix leads to a non-zero likelihood with a suitable initial state probability and final observation condition.

Henceforth, the model consisting of (2.9) and (2.10) is called the time-varying state equation, because the composite state-transition matrix, $\Lambda(t)A$, varies with time. The entries of $\Lambda(t)$ effectively control the state path by prohibiting entry into a state at time t that cannot produce symbol $O_t$. By recursion, the a posteriori probability is written in terms of the matrices defined above as
$$P(O_1, O_2, \ldots, O_T \mid \mathcal{M}) = C'\Lambda(T)A\,\Lambda(T-1)A \cdots \Lambda(2)A\,\Lambda(1)\pi. \qquad (2.13)$$
Since $P(O_1, O_2, \ldots, O_T \mid \mathcal{M})$ is a scalar, it can also be expressed as
$$P(O \mid \mathcal{M}) = \left(C'\Lambda(T)A\,\Lambda(T-1)A \cdots \Lambda(2)A\,\Lambda(1)\pi\right)' = \pi'\Lambda(1)A'\Lambda(2)A' \cdots A'\Lambda(T-1)A'\Lambda(T)\,C. \qquad (2.14)$$
In fact, this is the formulation with which Turin started in deriving other matrix-based HMM algorithms [27].

Vector-Matrix Formulation of the HMM Along a Backward Path

In general, the matrix representation (2.13) provides a flexible way to compute the observation probability through diverse state-space structures and representations of the model. For example, let us derive a state equation that is different from (2.9)-(2.11). To obtain the state-space model (2.9)-(2.11), the grouping $\Lambda(t)A$ in (2.13) was used to propagate the state variable. Instead, let $\beta(t)$ be an N-vector of state variables for (2.14), and define $\beta(T) = C$. Because of the recursive nature of equation (2.14), an alternate state-equation-like formulation follows immediately. Let
$$\beta(t) = A'\Lambda(t+1)\,\beta(t+1), \qquad (2.15)$$
with $\beta(T) = C$, for $t = T-1, T-2, \ldots, 1$. Then the a posteriori conditional observation probability is given by
$$P(O_1, O_2, \ldots, O_T \mid \mathcal{M}) = \pi'\Lambda(1)\,\beta(1). \qquad (2.16)$$
In the matrix formulation it is not necessary to know the statistical interpretation of $\beta$, whose elements are equivalent to $\beta_i(t)$ in (2.17), but these quantities are recognizable as the backward probabilities in the F-B algorithm, where they are defined as
$$\beta_i(t) = P(O_{t+1}, O_{t+2}, \ldots, O_T \mid q_t = i, \mathcal{M}), \qquad i = 1, \ldots, N, \qquad (2.17)$$
and computed recursively as
$$\beta_i(t) = \sum_{j=1}^{N} a_{ji}\, b_j(O_{t+1})\, \beta_j(t+1), \qquad (2.18)$$
similarly to (2.5). Not surprisingly, the state-space formulation (2.15) of the state recursions, written explicitly as
$$
\begin{pmatrix} \beta_1(t) \\ \beta_2(t) \\ \vdots \\ \beta_N(t) \end{pmatrix}
=
\begin{pmatrix} a_{11} & a_{21} & \cdots & a_{N1} \\ a_{12} & a_{22} & \cdots & a_{N2} \\ \vdots & & & \vdots \\ a_{1N} & a_{2N} & \cdots & a_{NN} \end{pmatrix}
\begin{pmatrix} b_1(O_{t+1}) & 0 & \cdots & 0 \\ 0 & b_2(O_{t+1}) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & b_N(O_{t+1}) \end{pmatrix}
\begin{pmatrix} \beta_1(t+1) \\ \beta_2(t+1) \\ \vdots \\ \beta_N(t+1) \end{pmatrix}, \qquad (2.19)
$$
can be decomposed into the F-B backward recursions as in (2.18). The output equation complementing (2.19) is given by
$$y(t) = \pi'\Lambda(1)\,\beta(t), \qquad (2.20)$$
with the only output of significance (for making a final decision) being
$$y(1) = P(O_1, O_2, \ldots, O_T \mid \mathcal{M}) = \pi'\Lambda(1)\,\beta(1) = \sum_{i=1}^{N} \pi_i\, b_i(O_1)\, \beta_i(1). \qquad (2.21)$$
Result (2.21) is equivalent to the likelihood computation provided by the backward F-B recursion [1, 2, 4].

Other Vector-Matrix Formulations for the HMM

In addition to these results, which are equivalent to the widely used F-B recursions, the matrix formulation provides a flexible state-equation-like structure that serves as a basis for many other computational structures. For instance, if $A\Lambda(t)$ is taken as the state-propagation grouping rather than $\Lambda(t)A$ in (2.13), then we obtain a new state equation with a state vector $X(t)$ governed by the recursion
$$X(t+1) = A\Lambda(t)X(t) + \pi\,\delta(t), \qquad (2.22)$$
with output equation
$$y(t) = P(O_1, O_2, \ldots, O_t \mid \mathcal{M}) = C'\Lambda(t)X(t) \qquad (2.23)$$
and final likelihood
$$y(T) = P(O_1, O_2, \ldots, O_T \mid \mathcal{M}) = C'\Lambda(T)X(T). \qquad (2.24)$$
In fact,
$$X(t+1) = A\,\alpha(t), \qquad (2.25)$$
where $\alpha(t)$ is defined in (2.4).

The expressions above are useful in different circumstances. Usually A has full rank and, thus, almost always has an inverse $A^{-1}$. However, in (2.9), $\Lambda(t)$ is often singular because of its sparseness. Therefore, formulations in which $\Lambda(t)$ is premultiplied by A admit transformations of the state-space equation. Equation (2.15) provides a representation similar to that of (2.22) in the sense that $A'$ premultiplies $\Lambda$ in the state-transition equation. Later we show some useful expressions and properties arising from this characteristic. More novel properties of HMMs will also result from the formulation in which $\Lambda(t)$ is premultiplied by A in the state equation.

Some interpretation for the state vector X(t) is obtained by considering one of its elements from (2.22). In terms of the developments above, the final likelihood can be variously represented. For example,
$$P(O_1, O_2, \ldots, O_T \mid \mathcal{M}) = C'\alpha(T) = \pi'\Lambda(1)\beta(1) = \alpha'(t)\beta(t) = \beta'(t)\alpha(t) = X'(t)Y(t) = Y'(t)X(t), \qquad t \in [1, T]. \qquad (2.31)$$
Also, letting $\operatorname{tr}(\cdot)$ denote the trace of a matrix, we can write
$$P(O \mid \mathcal{M}) = \operatorname{tr}\!\left(\alpha(t)\beta'(t)\right) = \operatorname{tr}\!\left(\Lambda(t)A \cdots \Lambda(2)A\Lambda(1)\pi\left(A'\Lambda(t+1) \cdots A'\Lambda(T)C\right)'\right) = \operatorname{tr}\!\left(\Lambda(t)A \cdots \Lambda(2)A\Lambda(1)\pi\, C'\Lambda(T)A \cdots \Lambda(t+1)A\right). \qquad (2.32)$$
Here $\pi C'$ forms an N x N matrix.

Interpretation of the HMM Evaluation Using the Vector-Matrix Formulation

As yet another example of an HMM formulation arising from the vector-matrix framework, consider the Bakis HMM structure in which every path starts from a predetermined initial state (by definition, state 1) and finishes at a final state (state N). From equation (2.13), the likelihood can be represented in the compact form
$$P(O_1, O_2, \ldots, O_t \mid \mathcal{M}) = C'G(t)\pi, \qquad (2.33)$$
where
$$G(t) = \Lambda(t)A\,\Lambda(t-1)A \cdots \Lambda(2)A\,\Lambda(1). \qquad (2.34)$$
Thus, G(t) amounts to the matrix product appearing between the two vectors in the vector-matrix-vector multiplication for $P(O \mid \mathcal{M})$. A matrix is a set of numbers arranged in a rectangular grid of rows and columns; likewise, the matrix G(t) provides an algebraic interpretation for computing $P(O_1, O_2, \ldots, O_t \mid \mathcal{M})$ as follows. Let $C' = (0, 0, \ldots, 0, 1)$ and $\pi' = (1, 0, \ldots, 0)$ for simplicity, suppose that each matrix in (2.34) is N x N, and suppose that G(t) is computed in advance. G(t) multiplies both vectors C and $\pi$ for $P(O_1, O_2, \ldots, O_t \mid \mathcal{M})$. Then the computation of $P(O_1, O_2, \ldots, O_t \mid \mathcal{M})$ in (2.33) is equivalent to choosing the (N, 1) element of G(t), according to the positions of the non-zero entries in $C'$ and $\pi$. Therefore, entry $g_{ji}(t)$ of G(t) amounts to the sum of the likelihoods of the paths leading from state i at the initial time to state j at time t. Thus, the vector-matrix representation simplifies the underlying meaning of the forward or backward computation of the HMM in a way which might not be possible with the conventional F-B HMM algorithm. This example shows that the vector-matrix formulation of the HMM elucidates the likelihood computations in association with state paths for signals.

2.2.2 Decoding Problem

The HMM was conceived as one for which states would represent distinct acoustic phenomena [2, 4]. Solving the decoding problem also elucidates the structure of the model while providing the statistical characteristics of each state. The state sequence $Q = \{q_1, q_2, \ldots, q_T\}$ corresponding to a speech symbol string $O = \{O_1, O_2, \ldots, O_T\}$ in the HMM is "hidden." The hidden part of the HMM, the state sequence, must be found in some modeled sense, since no exact solution exists. There are several ways of finding a state sequence for a speech signal. Among them, the Viterbi search algorithm [57, 58] is popular and recognized as an efficient way of finding an optimal state sequence. Here we structure the Viterbi algorithm in the matrix notation established above and discuss the significance of the resulting formulation. Let
$$d_i(t+1) = \max_{q_1, q_2, \ldots, q_t} P(O_1, O_2, \ldots, O_{t+1}, q_1, q_2, \ldots, q_t, q_{t+1} = i \mid \mathcal{M}), \qquad (2.35)$$
which is the highest probability of a single path ending at state i at time t+1. In a similar way, let
$$\Psi_i(t+1) = \arg\max_{q_t} P(O_1, O_2, \ldots, O_{t+1}, q_1, q_2, \ldots, q_t, q_{t+1} = i \mid \mathcal{M}); \qquad (2.36)$$
$\Psi_i(t+1)$ is the state $q_t$ at time t that leads to $d_i(t+1)$. In these terms, the steps of the Viterbi algorithm are as follows.

Initialization:
$$d(1) = \Lambda(1)\pi, \qquad (2.37)$$
$$\Psi(1) = (0, 0, \ldots, 0)'. \qquad (2.38)$$

Recursion: for $t = 2, \ldots, T$,
$$
\begin{pmatrix} d_1(t) \\ d_2(t) \\ \vdots \\ d_N(t) \end{pmatrix}
=
\begin{pmatrix} b_1(O_t) & 0 & \cdots & 0 \\ 0 & b_2(O_t) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & b_N(O_t) \end{pmatrix}
\begin{pmatrix} \max_{1\le j\le N}\{d_j(t-1)\,a_{1j}\} \\ \max_{1\le j\le N}\{d_j(t-1)\,a_{2j}\} \\ \vdots \\ \max_{1\le j\le N}\{d_j(t-1)\,a_{Nj}\} \end{pmatrix}, \qquad (2.39)\text{-}(2.40)
$$
$$\Psi_i(t) = \arg\max_{1\le j\le N}\{d_j(t-1)\,a_{ij}\}, \qquad i = 1, \ldots, N. \qquad (2.41)$$

Termination:
$$P^*(O_1, O_2, \ldots, O_T \mid \mathcal{M}) = \max_i\{d_i(T)\}, \qquad (2.42)$$
$$q_T^* = \arg\max_i\{d_i(T)\}. \qquad (2.43)$$

Backtracking: for $t = T-1, \ldots, 1$,
$$q_t^* = \Psi_{q_{t+1}^*}(t+1), \qquad (2.44)$$
where $q_t^*$ is the optimal state at time t.

We can represent equations (2.37) through (2.41) in matrix form as follows:
$$d(t) = \Lambda(t)\,\max\{A \circ d(t-1)\}, \qquad (2.45)$$
$$\Psi(t) = \arg\max\left\{\{a_{ij}\}_{1\le i,j\le N} \circ \{d_j(t-1)\}_{1\le j\le N}\right\}, \qquad (2.46)$$
for $t = 2, \ldots, T$, where $d(1) = \Lambda(1)\pi$, the max and arg max are taken row-wise, and $\circ$ denotes the Hadamard-type product [36] of an N x N matrix and an N x 1 vector defined such that
$$
\begin{pmatrix} c_{11} & \cdots & c_{1N} \\ c_{21} & \cdots & c_{2N} \\ \vdots & & \vdots \\ c_{N1} & \cdots & c_{NN} \end{pmatrix}
\circ
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{pmatrix}
=
\begin{pmatrix} c_{11}f_1 & \cdots & c_{1N}f_N \\ c_{21}f_1 & \cdots & c_{2N}f_N \\ \vdots & & \vdots \\ c_{N1}f_1 & \cdots & c_{NN}f_N \end{pmatrix}. \qquad (2.47)
$$
The likelihood in (2.42) can be represented as
$$P^*(O \mid \mathcal{M}) = \|d(T)\|_\infty. \qquad (2.48)$$
The back-substitution process to find the optimal state path is the same as equation (2.44).

The matrix formulation of the Viterbi search provides a compact representation of the search algorithm. Such a formulation also makes the search algorithm easy to implement; in MATLAB, for example, the resulting code is very compact.

2.2.3 Training (Estimation) Problem

The training problem concerns how to estimate the elements of $\mathcal{M}$ so as to best describe O. This problem is often solved in a maximum likelihood (ML) framework. To estimate the model parameters given an observation sequence O, the quantity $P(O \mid \mathcal{M})$ is optimized. Using an iterative procedure, the model parameter set $\mathcal{M}$ is reestimated to maximize $P(O \mid \mathcal{M})$. There are two widely used algorithms for this optimization problem: the F-B reestimation algorithm (iterative update and improvement) and the Viterbi training algorithm [4, 57, 58]. The Viterbi algorithm has been shown to converge to a proper characterization of the underlying observations [17, 19], and has been found to yield models with performance comparable to those trained by F-B reestimation [20]. Further, the Viterbi approach is more computationally efficient than the F-B procedure [4].

A matrix representation of the F-B training algorithm is described in [27]. However, an alternative and more straightforward formulation is presented here. The $\gamma$ variable [1, 2], a key component of HMM training, is first represented in a vector-matrix formulation. $\gamma_{ji}(t)$ is the probability of a path being in state i at time t and making a transition to state j at time t+1, given O and $\mathcal{M}$. Thus, we have
$$\gamma_{ji}(t) = P(q_t = i, q_{t+1} = j \mid O, \mathcal{M}) = \frac{\alpha_i(t)\,a_{ji}\,b_j(O_{t+1})\,\beta_j(t+1)}{P(O \mid \mathcal{M})}, \qquad t = 1, \ldots, T-1, \qquad (2.49)\text{-}(2.50)$$
for each j and i, where $a$, $b$, $\alpha$, and $\beta$ are defined in (2.1), (2.3), (2.4), and (2.17). Let us define the matrix
$$\Gamma(t) = \begin{pmatrix} \gamma_{11}(t) & \gamma_{12}(t) & \cdots & \gamma_{1N}(t) \\ \gamma_{21}(t) & \gamma_{22}(t) & \cdots & \gamma_{2N}(t) \\ \vdots & & & \vdots \\ \gamma_{N1}(t) & \gamma_{N2}(t) & \cdots & \gamma_{NN}(t) \end{pmatrix}. \qquad (2.51)$$
Then, by (2.50),
$$\Gamma(t) = \frac{1}{P(O \mid \mathcal{M})}\,\operatorname{diag}\!\left(\beta(t+1)\right)\Lambda(t+1)\,A\,\operatorname{diag}\!\left(\alpha(t)\right), \qquad t = 1, \ldots, T-1, \qquad (2.52)\text{-}(2.54)$$
where
$$\operatorname{diag}(p) = \begin{pmatrix} p_1 & 0 & \cdots & 0 \\ 0 & p_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & p_N \end{pmatrix} \qquad (2.55)$$
for any row or column vector p.

Then, if $\gamma_i(t)$ (note the single subscript) is defined as in [40],
$$\gamma_i(t) = P(q_t = i \mid O, \mathcal{M}), \qquad (2.56)$$
we have
$$\gamma(t) = \begin{pmatrix} \gamma_1(t) \\ \gamma_2(t) \\ \vdots \\ \gamma_N(t) \end{pmatrix} = \frac{1}{\alpha'(t)\beta(t)}\begin{pmatrix} \alpha_1(t)\beta_1(t) \\ \alpha_2(t)\beta_2(t) \\ \vdots \\ \alpha_N(t)\beta_N(t) \end{pmatrix}, \qquad t = 1, \ldots, T-1, \qquad (2.57)$$
$$= \Gamma'(t)\,\mathbf{1}, \qquad (2.58)$$
assuming that $\alpha'(t)\beta(t) \ne 0$. This equation holds for any $t \in [1, T-1]$.

In terms of the variables defined above, the reestimation formula for the state-transition matrix A becomes
$$\hat{A} = \left(\sum_{t=1}^{T-1}\Gamma(t)\right)\left(\sum_{t=1}^{T-1}\operatorname{diag}\!\left(\gamma(t)\right)\right)^{-1}, \qquad (2.59)\text{-}(2.60)$$
in which $\sum_t$ denotes element-by-element conventional matrix summation, and $\hat{A}$ denotes the estimated value of A. To reestimate the B matrix, let $V = \{v_{kj}\}$ be a K x N matrix such that
$$V = \Big\{\sum_{t\ \text{s.t.}\ O_t = k}\gamma_j(t)\Big\}_{1\le k\le K,\ 1\le j\le N}. \qquad (2.61)$$
Then, with the matrix defined above, the reestimation formula for the observation matrix B becomes
$$\hat{B} = V\left(\sum_{t=1}^{T}\operatorname{diag}\!\left(\gamma(t)\right)\right)^{-1}. \qquad (2.62)\text{-}(2.63)$$
As a consequence, we have compact expressions of the training algorithm in matrix form.

2.3 Time-Invariant Approximation for the HMM

The vector-matrix formulation of the F-B HMM is posed as a state-space formulation with suitable state variables, state-transition matrix, and output equation. In modeling terms, the state-space system of the F-B HMM is linear but time-varying. Because the state-transition matrix varies with time through the observation symbol probabilities, it is not easy, unlike for the linear time-invariant model, to transform the F-B HMM expressions and to derive other representations of the HMM which may be useful for solving the HMM problems in speech applications. An approximate time-invariant model for the time-varying state-space F-B HMM, with more potential for application, is presented here. Since it is theoretically not possible to make the time-varying F-B HMM and the TIA HMM identical, a revised formulation with different state variables and likelihood measurements needs to be posed for the approximation. There are several ways to pose such an approximation. Here we study one approximation method, based on a state-space formulation, which was developed in the author's laboratory by Snider and Deller [4, 6, 7].

The original motivation of the technique proposed in [6, 7] was to decrease the computational load in evaluating the HMMs. However, here we will show that such a derivation is useful not only for the computational aspect, but also as a reasonable approximation of the likelihood $P(O \mid \mathcal{M})$, computed using the time-invariant state-space model. Practically, in the HMM, we can approximate the a posteriori probability using the state equation as
$$\tilde{P} = P(O_1 \mid \mathcal{M})\,P(O_2 \mid \mathcal{M}) \cdots P(O_T \mid \mathcal{M}) \approx P(O_1, O_2, \ldots, O_T \mid \mathcal{M}). \qquad (2.64)$$
The validity of this approximation will be discussed in this and the next chapter.

As in [6, 7, 8], we assume that there is a model $\mathcal{M}$ with N states $q_i$, $i = 1, 2, \ldots, N$, and M discrete observation symbols $k$, $k = 1, 2, \ldots, M$. At each observation time t, we define the state probability vector $x(t)$ and the observation probability vector $y(t)$ as follows:
$$x(t) \triangleq \left(x_1(t), x_2(t), \ldots, x_N(t)\right) \qquad (2.65)$$
$$y(t) \triangleq \left(y_1(t), y_2(t), \ldots, y_M(t)\right) \qquad (2.66)$$
where $x_i(t)$ is the probability of being in $q_i$ at discrete time t given the model $\mathcal{M}$, $P(q_i \text{ at } t \mid \mathcal{M})$, and $y_k(t)$ is the probability of generating symbol k at discrete time t given the model $\mathcal{M}$, $P(k \text{ at } t \mid \mathcal{M})$. In these terms, the dynamics of the HMM are as follows:
$$x(t+1) = A\,x(t) + u(t)\,\delta(t) \qquad (2.67)$$
$$y(t) = B\,x(t) \qquad (2.68)$$
$$\tilde{P}(O \mid \mathcal{M}) = \prod_{t=1}^{T} P(O_t \mid \mathcal{M}) = \prod_{t=1}^{T} y_{O_t}(t), \qquad (2.69)$$
where A is the N x N state-transition matrix associated with the HMM, whose (i, j) element is $a_{ij} = P(q_i \text{ at } t+1 \mid q_j \text{ at } t)$ for any t; B is the M x N observation probability matrix whose (k, j) element is $b_{kj} = P(k \mid q_j)$; $u(0)$ is some vector such that, when $x(0)$ is defined as zero, $x(1)$ takes the proper initial values, with $u(t)$ arbitrary but finite for all $t \ne 0$; and $\delta(t)$ is the Kronecker sequence. $y_{O_t}(t)$ corresponds to the k-th element of the vector $y(t)$, where k is the symbol realized by $O_t$. $\tilde{P}(O \mid \mathcal{M})$ is the likelihood explained in the following.

The expressions (2.67)-(2.69) are not equivalent to (2.9)-(2.11), since the definitions of the state variables, as well as the likelihoods resulting from the two sets of equations, are different. The state variable $\alpha(t)$ is the joint probability of the partial observation sequence from the initial time to time t, whereas $x(t)$ is the probability of being in each state at time t. Therefore, the likelihood $P(O_1, O_2, \ldots, O_T \mid \mathcal{M})$, which must be evaluated from the initial time to a specific time t, cannot be expressed in terms of the state variables $x(t+1)$ and $y(t)$ without an independence condition that will be explained. For reference, because A is a stochastic matrix, $x(t)$ is a non-negative vector. In the Bakis structure of the HMM, which is often employed in speech recognition, except for a few initial values of t, $x_i(t) \ne 0$ for any i. This implies that the system can be in any one of the states $[1, N]$ regardless of the preceding state.

From (2.67) and (2.68), we have
$$x(t) = A^{t-1}u(0), \qquad (2.70)$$
$$y(t) = B\,A^{t-1}u(0). \qquad (2.71)$$
These two expressions can be used to compute the state and observation probabilities at any time $t \in [1, T]$. The observation probability at time t for an observation symbol $O_t$ is
$$P(O_t \mid \mathcal{M}) = y_{O_t}(t) = \left(b_1(O_t), b_2(O_t), \ldots, b_N(O_t)\right)x(t) = b_1(O_t)x_1(t) + b_2(O_t)x_2(t) + \cdots + b_N(O_t)x_N(t) = \mathbf{1}'\Lambda(t)\,x(t). \qquad (2.72)$$
Here $b'(O_t) = \left(b_1(O_t), b_2(O_t), \ldots, b_N(O_t)\right)$. Note that
$$b'(O_t) = \mathbf{1}'\Lambda(t). \qquad (2.73)$$
From (2.72), it is seen that the probability of a given observation at time t can be computed using state equation (2.70) and observation equation (2.71).

2.4 Transformations of State Equations

One of the merits of using the state-space structure developed above is that it admits the transformation of the state and observation equations into alternative formulations. Let us discuss this subject for the conventional F-B HMM and the TIA HMM.

2.4.1 Transformation of the Time-Invariant State Equation

The original motivation for the time-invariant state equation was that the state-transition matrix could be diagonalized to significantly reduce the number of floating-point operations required to compute the HMM likelihood [4, 6, 7]. (In light of the remarks by Mitchell et al. [25], further discussion of this model appears in [4].)

To diagonalize the state-transition matrix, let $x(t) = M z(t)$, where M is a diagonalizing transformation on the state space. M is a square matrix of dimension N x N, and $z(t)$ is a new state variable defined as $M^{-1}x(t)$, assuming that $M^{-1}$ exists. If A does not have distinct eigenvalues other than zero, then $M^{-1}$ does not exist. However, in the practical HMM application to speech processing, a speech signal is modeled as the result of random processes, and A holds the transition information of those random processes. For example, the same phoneme spoken by different speakers will be acoustically different; also, the same speaker may produce different versions of the same sound under different circumstances. Numerically, this means that the entries of A are mostly different from one another and do not exhibit special patterns, such as those leading to singularity of the matrix. Such diversity in the realizations of the random processes underlying speech signals justifies assuming the existence of $M^{-1}$ for a suitably chosen number of states in the state model. Equations (2.67) and (2.68) yield
$$M z(t+1) = A M z(t) + u(t)\,\delta(t) \qquad (2.74)$$
$$y(t) = B M z(t). \qquad (2.75)$$
Diagonal dominance provides a relatively simple criterion for guaranteeing the nonsingularity of a matrix. An N x N real or complex matrix A is diagonally dominant if $|a_{ii}| \ge \sum_{j\ne i}|a_{ij}|$, $i = 1, \ldots, N$; A is strictly diagonally dominant if strict inequality holds. If A is strictly diagonally dominant, then A is nonsingular [37]. Since A is nonsingular, the matrix M, composed of a set of eigenvectors of A, is nonsingular; therefore $M^{-1}$ exists. From (2.74),
$$z(t+1) = M^{-1}A M z(t) + M^{-1}u(t)\,\delta(t) \qquad (2.76)$$
$$y(t) = B M z(t). \qquad (2.77)$$
Now suppose that $M = PU$, where P is the usual matrix of normalized eigenvectors and U is a special diagonal matrix such that the i-th diagonal element of $U^{-1}$ is the reciprocal of the i-th element of the vector $P^{-1}u(0)$. As a consequence of this operation, each element of the excitation vector $M^{-1}u(t)$ is unity at time zero. It follows that
$$z(t+1) = U^{-1}P^{-1}APU\,z(t) + U^{-1}P^{-1}u(t)\,\delta(t) = \Lambda_A\,z(t) + \tilde{u}(t)\,\delta(t) \qquad (2.78)\text{-}(2.79)$$
$$y(t) = BPU\,z(t) = \tilde{B}\,z(t), \qquad (2.80)$$
where $\Lambda_A = U^{-1}P^{-1}APU$ is a diagonal matrix (the eigenvalue matrix of A), $\tilde{u}(t) = U^{-1}P^{-1}u(t)$, and $\tilde{B} = BPU$. This result is significant because it separates all states into independent computations. Furthermore, this property provides a way to combine all the HMMs in the system into one large state-space formulation [6, 7]. Moreover, as in (2.70) and (2.71),
$$z(t) = \Lambda_A^{\,t-1}\tilde{u}(0), \qquad (2.81)$$
$$y(t) = \tilde{B}\,\Lambda_A^{\,t-1}\tilde{u}(0), \qquad (2.82)$$
where $\Lambda_A^{\,t-1}$ is computed easily. In the HMM application to speech modeling, the Bakis condition is generally assumed and, thus, A is a triangular matrix. Therefore, when there are K HMMs to be evaluated, it is in principle necessary to compute the eigensystems of K models. However, under the Bakis condition all the eigenvalues lie on the diagonal of A, so it is not necessary to compute the eigenvalues explicitly. Furthermore, because of the strong diagonal property of A, it is highly probable that there are quite a few cases in which eigenvalues can be shared across models. Hence, practically, we do not need to compute all the eigenvectors of the K systems. Thus, the computational load of
Moreover, as in (2.70) and (2.71),

    z(t) = \Lambda^{t-1} \tilde{u}(0),            (2.81)
    y(t) = \tilde{B} \Lambda^{t-1} \tilde{u}(0),   (2.82)

where \Lambda^{t-1} is computed easily. In the HMM application to speech modeling, the Bakis condition is generally assumed, and thus A is a triangular matrix. Therefore, when there are K HMMs which need to be evaluated, it is necessary to compute the eigensystems of K models. However, under the Bakis condition, all the eigenvalues are located on the diagonal of A; therefore, it is not necessary to compute the eigenvalues. Furthermore, because of the strong diagonal property of A, it is highly probable that there are quite a few cases across which eigenvalues can be shared. Hence, practically, we do not need to compute all the eigenvectors of the K systems. Thus, the computational load of computing the eigenvalues and eigenvectors of A across models can possibly be lessened by suitable preprocessing which examines the diagonal components of the state-transition matrices across models.

2.4.2 Transformation of the Time-Varying State Equation

Let us return to the case of the time-varying state equations,

    \alpha(t+1) = A(t+1) A \alpha(t) + A(1)\pi\delta(t)        (2.83)
    y(t) = P(o_1, o_2, ..., o_t | M) = C' \alpha(t).           (2.84)

Let \Lambda and P denote the eigenvalue and eigenvector matrices of A, respectively, with AP = P\Lambda. Since A is nonsingular, P is nonsingular. Therefore

    \alpha(t+1) = A(t+1)(P \Lambda P^{-1}) \alpha(t) + A(1)\pi\delta(t)                  (2.85)
                = A(t+1) P \Lambda P^{-1} \alpha(t) + A(1)\pi\delta(t)                   (2.86)
    P^{-1}\alpha(t+1) = P^{-1} A(t+1) P \Lambda P^{-1} \alpha(t) + P^{-1} A(1)\pi\delta(t).   (2.87)

Now let P^{-1}\alpha(t) = \bar{\alpha}(t). Then

    \bar{\alpha}(t+1) = (P^{-1} A(t+1) P \Lambda)\bar{\alpha}(t) + (P^{-1} A(1)\pi)\delta(t)   (2.88)
    y(t) = P(o_1, o_2, ..., o_t | M)                                                            (2.89)
         = C' P \bar{\alpha}(t).                                                                 (2.90)

Since P is an eigenvector matrix of A, this expression can be represented as follows:

    \bar{\alpha}(t+1) = P^{-1}(A P A(t+1) - A P A(t+1) + A(t+1) A P)\bar{\alpha}(t) + (P^{-1} A(1)\pi)\delta(t)
                      = P^{-1} A P A(t+1)\bar{\alpha}(t) + P^{-1}(A(t+1) A P - A P A(t+1))\bar{\alpha}(t) + (P^{-1} A(1)\pi)\delta(t)
                      = \Lambda A(t+1)\bar{\alpha}(t) + P^{-1}(A(t+1) P \Lambda - P A(t+1) \Lambda)\bar{\alpha}(t) + (P^{-1} A(1)\pi)\delta(t)        (2.91)
                      = \Lambda A(t+1)\bar{\alpha}(t) + P^{-1}(A(t+1) P - P A(t+1))\Lambda\bar{\alpha}(t) + (P^{-1} A(1)\pi)\delta(t)
                      = \Lambda A(t+1)\bar{\alpha}(t) + (P^{-1} A(t+1) P - A(t+1))\Lambda\bar{\alpha}(t) + (P^{-1} A(1)\pi)\delta(t),

where \Lambda A(t+1) is a diagonal matrix. To obtain a diagonalized state equation from (2.91), we need to have P^{-1} A(t+1) P = A(t+1), or, equivalently,

    A(t+1) P = P A(t+1).        (2.92)

If A(t+1) has N distinct (diagonal) eigenvalues, the necessary and sufficient condition for the commutativity (2.92) is that the eigenvectors of A(t+1) be the same as those of A, that is, the columns of P, for all t [37]. However, P is an eigenvector matrix of A, not of A(t+1). Therefore, P^{-1} A(t+1) P \neq A(t+1) in general. Furthermore, A(t) is time-varying. Thus a constant P satisfying (2.92) for every t does not exist in general. Therefore, there is no universal eigenvector matrix P which diagonalizes the time-varying F-B HMM.

In addition, consider the case in which the eigenvector matrix changes with time, depending on A(t). Let P_t denote an eigenvector matrix of A(t)A. Then it follows that

    \alpha(t+1) = A(t+1) A \alpha(t) + A(1)\pi\delta(t)                                                       (2.93)
    P_{t+1}^{-1}\alpha(t+1) = [P_{t+1}^{-1}(A(t+1)A)P_{t+1}](P_{t+1}^{-1} P_t)(P_t^{-1}\alpha(t)) + P_{t+1}^{-1} A(1)\pi\delta(t)   (2.94)
    \bar{\alpha}(t+1) = \bar{\Lambda}_{t+1}(P_{t+1}^{-1} P_t)\bar{\alpha}(t) + P_{t+1}^{-1}(A(1)\pi)\delta(t)   (2.95)

if P_t^{-1} exists for all t, where \bar{\alpha}(t+1) = P_{t+1}^{-1}\alpha(t+1) and \bar{\Lambda}_{t+1} = P_{t+1}^{-1}(A(t+1)A)P_{t+1} is diagonal. However, due to the sparseness of A(t), A(t)A is singular most of the time, and P_t^{-1} does not exist at those times. Additionally, even if P_t^{-1} exists for all t, it is necessary to compute P_t for each t, resulting in no computational benefit from the matrix diagonalization.
Moreover, due to the fact that P_{t+1}^{-1} P_t \neq I in general, (2.95) implies that it is not possible to obtain a diagonalized state equation using the formulation above.

2.5 Analysis of Illegal Paths Caused by Approximation

In this section, we discuss the problem caused by the approximation of the F-B time-varying HMM by the TIA HMM. This issue was first noted by Mitchell et al. [25] following the publication of the original TIA HMM paper [7].

2.5.1 Likelihood Difference

We first discuss the relationship between \tilde{P} of the TIA HMM and P of the F-B HMM. These likelihoods are central to HMM evaluation in speech recognition. As a matter of fact, the likelihood measure \tilde{P}(O | M) = \prod_{t=1}^{T} P(o_t | M) employed with state equations (2.67)-(2.69) is not linearly related to the a posteriori joint probability P(O | M) = P(o_1, o_2, ..., o_T | M) used in the F-B HMM approach. For example, suppose there are three symbols {o_1, o_2, o_3} in an observation string at times t = 1, 2, 3, respectively. Then, by the "chain rule" of conditional probability,

    P(o_1, o_2, o_3 | M) = P(o_1 | M) P(o_2 | o_1, M) P(o_3 | o_2, o_1, M)        (2.96)

which, if and only if o_1, o_2, and o_3 are conditionally independent (the dependent conditioning information being the state value), can be written as

    P(o_1, o_2, o_3 | M) = P(o_1 | M) P(o_2 | M) P(o_3 | M)        (2.97)
                         = \tilde{P}(o_1, o_2, o_3 | M).            (2.98)

In this case, a time-invariant state equation can be applied to compute the "F-B" a posteriori probability P(O | M). However, the symbol occurrences are generally dependent upon the states in the HMM. The inequality of P(O | M) and \tilde{P}(O | M), caused by the assumption of independence among symbols without consideration of the hidden state dependency, was initially noted in [25], where a simple counter-example using a two-state HMM can be found (in light of the remarks by Mitchell et al. [25], further discussion of this model appears in [4]).

To examine the differences between P and \tilde{P} in more detail, consider a model with two states. From the F-B matrix formulation (2.13),

    P(o_1, o_2, o_3 | M) = C' A(3) A A(2) A A(1) \pi        (2.99)
        = C' [b_1(o_3) 0; 0 b_2(o_3)] [a_{11} a_{12}; a_{21} a_{22}] [b_1(o_2) 0; 0 b_2(o_2)] [a_{11} a_{12}; a_{21} a_{22}] [b_1(o_1) 0; 0 b_2(o_1)] [\pi_1; \pi_2].        (2.100)

Assume a left-to-right (Bakis) model, so that a_{12} = 0, C' = (0, 1), and \pi' = (1, 0). Then

    P(o_1, o_2, o_3 | M) = b_2(o_3) a_{21} b_1(o_2) a_{11} b_1(o_1) + b_2(o_3) a_{22} b_2(o_2) a_{21} b_1(o_1).        (2.101)

On the other hand, the product of the individual observation probabilities is

    P(o_3 | M) P(o_2 | M) P(o_1 | M) = [1' A(3) x(3)] [1' A(2) x(2)] [1' A(1) x(1)]        (2.102)
        = b_1(o_3) a_{11}^2 b_1(o_2) a_{11} b_1(o_1)
        + b_1(o_3) a_{11}^2 b_2(o_2) a_{21} b_1(o_1)
        + b_2(o_3) a_{21} a_{11} b_1(o_2) a_{11} b_1(o_1)
        + b_2(o_3) a_{21} a_{11} b_2(o_2) a_{21} b_1(o_1)        (2.103)
        + b_2(o_3) a_{22} a_{21} b_1(o_2) a_{11} b_1(o_1)
        + b_2(o_3) a_{22} a_{21} b_2(o_2) a_{21} b_1(o_1).

Therefore, \tilde{P}(O | M) involves extra cross terms which can be regarded as resulting from one or more "illegal" state paths. Since all the terms of each P(o_t | M) are multiplied together in computing \prod_{t=1}^{T} P(o_t | M), this technique has been called an "anypath" method [4].
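The cross-term effect can be checked numerically. The sketch below evaluates the F-B likelihood of (2.99) and the "anypath" product of (2.69)/(2.72) for a two-state Bakis model; all numeric values are hypothetical and serve only to expose the extra illegal-path terms.

```python
import numpy as np

# Hypothetical two-state Bakis model of (2.99)-(2.103).
A = np.array([[0.6, 0.0],
              [0.4, 1.0]])                 # a12 = 0, columns sum to one
B = np.array([[0.8, 0.3],                  # B[k, j] = P(symbol k | state j)
              [0.2, 0.7]])
pi = np.array([1.0, 0.0])                  # start in state 1
C = np.array([0.0, 1.0])                   # state 2 is the only final state
obs = [0, 1, 0]                            # symbols o1, o2, o3

def Adiag(o):                              # A(t) = diag(b_1(o_t), b_2(o_t))
    return np.diag(B[o])

# F-B likelihood of (2.99): C' A(3) A A(2) A A(1) pi
alpha = Adiag(obs[0]) @ pi
for o in obs[1:]:
    alpha = Adiag(o) @ A @ alpha
P_fb = C @ alpha

# Anypath product of (2.69): prod_t 1' A(t) x(t), with x(1) = pi
x, P_tia = pi.copy(), 1.0
for o in obs:
    P_tia *= np.ones(2) @ Adiag(o) @ x
    x = A @ x

print(P_fb, P_tia)   # P_tia carries the extra cross terms from illegal paths
```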
Although different from the F-B HMM likelihood, the "time-wise" \tilde{P}(O | M) probability has been useful in discovering new aspects of HMMs in this work. Moreover, the accompanying state equation is advantageous in that the resulting HMMs can be implemented to perform fast processing in real applications with fewer resources than the F-B HMM [6, 7, 25]. It is well known that the training and evaluation of F-B HMMs are computationally very demanding [66]. To decrease the computational complexity, a few techniques have been proposed using vector-matrix formulations [27, 30]. However, since the proposed techniques are based on the time-varying F-B HMM, there is a limit to the possible decrease in computational complexity. In spite of the apparent weakness of permitting illegal paths, the TIA HMM is a useful and effective model, as we discuss later in this work.

2.5.2 Comparison of the State-Transition Matrices

Let us examine the state transitions of both the F-B HMM and the TIA HMM in more detail, and in particular the different roles of the state-transition matrices of each model in constituting the available state paths of a speech utterance.

Since A(t) premultiplies the matrix A in the time-varying state-space HMM, the diagonal elements of A(t) multiply the corresponding rows of A. To examine the dynamics of the A(t)A matrix of the time-varying state equation, consider a two-state Bakis model and a test string O = {o_1, o_2, ..., o_T}. At times t and t+1,

    \alpha(t) = [b_1(o_t) 0; 0 b_2(o_t)] [a_{11} a_{12}; a_{21} a_{22}] \alpha(t-1)        (2.104)
    \alpha(t+1) = [b_1(o_{t+1}) 0; 0 b_2(o_{t+1})] [a_{11} a_{12}; a_{21} a_{22}] \alpha(t)        (2.105)
                = [b_1(o_{t+1}) 0; 0 b_2(o_{t+1})] [a_{11} a_{12}; a_{21} a_{22}] [b_1(o_t) 0; 0 b_2(o_t)] [a_{11} a_{12}; a_{21} a_{22}] \alpha(t-1).        (2.106)

Due to the Bakis condition, a_{12} = 0 and only transitions from state 1 to 2, 1 to 1, and 2 to 2 are legal. However, these legal jumps are also controlled by the probabilities in A(t) and A(t+1). Let b_2(o_t) = b_1(o_{t+1}) = 1 and b_1(o_t) = b_2(o_{t+1}) = 0, for instance. Under these assumptions,

    \alpha(t+1) = [a_{11} 0; 0 0] [0 0; a_{21} a_{22}] \alpha(t-1)        (2.107)
                = [0 0; 0 0] \alpha(t-1).                                  (2.108)

Therefore, the likelihood P(O | M) becomes zero. This follows from the fact that, by A(t), the observation at t can be generated only from state 2 and, by A(t+1), the observation at t+1 can be generated only from state 1, but once an observation is generated by the second state, the path cannot return to state 1. Hence the matrix sequence A(t)A, t = 1, ..., T, inherently determines the allowable state paths, depending on the elements of the observation string.

For the time-invariant model of (2.67) and (2.68), however, the computed likelihood (2.69) is an approximate value based on the assumption that o_t is unconditionally independent of o_\tau for all \tau \in [1, t-1], meaning that

    P(o_t | M) = \sum_{i=1}^{N} P(o_t | q_t = i, M) P(q_t = i | M)        (2.109)
               = 1' A(t) x(t).                                             (2.110)

Here, algebraically, the computation of P(o_t | M) involves x(t), which is a state distribution that is dependent upon t only, not upon the history of the state path nor of the symbol string. Thus, only the state-transition probabilities in A are responsible for predicting the state path in the TIA HMM. However, the observation symbols play a significant role in deciding the feasibility, if not the value of the probability, of a state sequence. The viability of the TIA HMM depends on the degree to which the probabilities computed for illegal state paths are small, rendering them infeasible. This will be discussed in the following chapters.

2.6 Validity of the Time-Invariant Approximation of the HMM

We examine the extent to which the TIA HMM, represented by approximations (2.67)-(2.69), is a viable model for practical speech recognition. In particular, the relative significance of the two matrices A and B of the F-B HMM will be examined analytically and heuristically using several approaches. Of course, the following analyses are in fact heavily dependent upon the probabilities of A(t).

2.6.1 Matrix Norm Approach

Let us first revisit the two state-transition equations (2.11) and (2.25). Again, we have

    P(o_1, o_2, ..., o_t, ..., o_T | M) = C' \alpha(T) = C' A(T) X(T).        (2.111)
Note that the actual relationship between \alpha(t) and X(t) is given in (2.26). For t \neq T, the likelihood of the F-B HMM is

    P(o_1, o_2, ..., o_t | M) = 1' \alpha(t) = 1' A(t) X(t).        (2.112)

For simplicity of analysis, consider an ergodic constraint on A so that all N states are legitimate final states. Then,

    P(o_1, o_2, ..., o_t | M) = C' \alpha(t) = C' A(t) X(t)        (2.113)

for t \in [1, T] with suitable initial conditions \alpha(1) = A(1)\pi and X(1) = \pi. Because A is a stochastic matrix, it is easily verified that

    C' = C' I = C' A = C' A^n        (2.114)

for any natural number n, with the column vector C as defined in (2.12) and I the identity matrix of suitable size. Then for any 1 \leq t \leq T, the a posteriori probability (2.112) can be represented variously as

    P(o_1, o_2, ..., o_t | M) = C' \alpha(t) = \|\alpha(t)\|_1
                              = C' A \alpha(t) = \|A \alpha(t)\|_1
                              = C' A(t) X(t) = \|A(t) X(t)\|_1        (2.115)
                              = C' A A(t) X(t) = \|A A(t) X(t)\|_1 = \|X(t+1)\|_1,

where \| \cdot \|_1 represents the l_1 norm. If \Omega is a matrix such that

    A + \Omega = I,        (2.116)

then

    C' A = C' = C' I        (2.117)
         = C'(A + \Omega) = C' A + C' \Omega.        (2.118)

Therefore, we have that, for C = [1, 1, ..., 1]',

    C' \Omega = 0.        (2.119)

Here 0 is the zero vector of appropriate dimension. The difference matrix \Omega indirectly points out the relative importance of the stochastic matrix A in the evaluation of the a posteriori probability of a given utterance. From (2.115),

    \|\alpha(t)\|_1 = \|A \alpha(t)\|_1 = \|(I - \Omega)\alpha(t)\|_1 = \|\alpha(t) - \Omega\alpha(t)\|_1.        (2.120)

In particular, consider a diagonally dominant A. For simplicity, let A be a 2 x 2 matrix and let \epsilon_1 and \epsilon_2 be two small numbers in the off-diagonal, so that

    A = [a_{11} a_{12}; a_{21} a_{22}] = [1-\epsilon_1  \epsilon_2; \epsilon_1  1-\epsilon_2].        (2.121)

Also, let

    \alpha(t) = (\alpha_1(t), \alpha_2(t))'.        (2.122)

Then,

    \Omega = [\epsilon_1  -\epsilon_2; -\epsilon_1  \epsilon_2].        (2.123)

Therefore, all the elements of \Omega are small numbers. From (2.121), we have that

    A\alpha = ((1-\epsilon_1)\alpha_1(t) + \epsilon_2\alpha_2(t), \epsilon_1\alpha_1(t) + (1-\epsilon_2)\alpha_2(t))'.        (2.124)

If \epsilon_1 and \epsilon_2 are relatively small compared to the entries of \alpha, (2.124) can be approximated as

    ((1-\epsilon_1)\alpha_1(t) + \epsilon_2\alpha_2(t), \epsilon_1\alpha_1(t) + (1-\epsilon_2)\alpha_2(t))'
        \approx ((1-\epsilon_1)\alpha_1(t), (1-\epsilon_2)\alpha_2(t))'        (2.125)
        = [1-\epsilon_1  0; 0  1-\epsilon_2] (\alpha_1(t), \alpha_2(t))' = D\alpha,

where D is a diagonal matrix made up of the diagonal elements of A only, so that A = D + D_1. Therefore,

    P(o_1, o_2, ..., o_t | M) = \|\alpha(t)\| = \|A\alpha(t)\| = \|(D + D_1)\alpha(t)\| \approx \|D\alpha(t)\| = C' D \alpha(t),        (2.126)

where D is practically close to the identity matrix I.
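A short numerical check of (2.120)-(2.126): for a column-stochastic, diagonally dominant A, the l_1 norm of \alpha is preserved under A, and A\alpha differs little from D\alpha. The matrix entries and the vector \alpha below are hypothetical.

```python
import numpy as np

eps1, eps2 = 0.05, 0.03                      # small off-diagonal masses, eq. (2.121)
A = np.array([[1 - eps1, eps2],
              [eps1, 1 - eps2]])             # column-stochastic, diagonally dominant
D = np.diag(np.diag(A))                      # diagonal part of A, so that A = D + D1
alpha = np.array([0.4, 0.6])                 # a hypothetical forward vector at some time t

print(np.linalg.norm(A @ alpha, 1), np.linalg.norm(alpha, 1))   # l1 norm is preserved
print(np.linalg.norm(A @ alpha - D @ alpha, 1))                  # small: A alpha ~ D alpha
```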
2.6.2 Likelihood Expansion Approach

The preceding discussion concerns the relative insignificance of the stochastic matrix A at the final time t in light of the likelihood. In fact, however, we need to see the effect of A at each of the time instants {1, 2, ..., T} at which a speech signal is evaluated. To look at the influence of the A matrix more closely, again consider a two-state model and the dynamics of the HMM over a few time instants. The results of this analysis can be extended to a state model of any size. Let M_1 and M_2 be two HMMs for speech evaluation. For simplicity, suppose that an observation string is composed of three symbols {o_1, o_2, o_3}. Then we need to evaluate the two likelihoods

    P(o_1, o_2, o_3 | M_1) = (1, 1) A_1(3) A_1 A_1(2) A_1 A_1(1) \pi_1        (2.127)
    P(o_1, o_2, o_3 | M_2) = (1, 1) A_2(3) A_2 A_2(2) A_2 A_2(1) \pi_2,       (2.128)

where the subscripts 1 and 2 denote M_1 and M_2, respectively. Assume a Bakis structure for A_1 and A_2, so that a_{1,12} = a_{2,12} = 0. As with the notation A_1, A_2, the first subscript n of a_{n,ij} identifies HMM n; a_{n,ij} is the transition probability into state i from state j in that model. So, the likelihoods for M_1 and M_2 become

    P(o_1, o_2, o_3 | M_1) = b_{1,1}(3) a_{1,11} b_{1,1}(2) a_{1,11} b_{1,1}(1) \pi_{1,1}
                           + b_{1,2}(3) a_{1,21} b_{1,1}(2) a_{1,11} b_{1,1}(1) \pi_{1,1}        (2.129)
                           + b_{1,2}(3) a_{1,22} b_{1,2}(2) a_{1,21} b_{1,1}(1) \pi_{1,1}

    P(o_1, o_2, o_3 | M_2) = b_{2,1}(3) a_{2,11} b_{2,1}(2) a_{2,11} b_{2,1}(1) \pi_{2,1}
                           + b_{2,2}(3) a_{2,21} b_{2,1}(2) a_{2,11} b_{2,1}(1) \pi_{2,1}        (2.130)
                           + b_{2,2}(3) a_{2,22} b_{2,2}(2) a_{2,21} b_{2,1}(1) \pi_{2,1}.

Here, similarly to a_{n,ij}, b_{n,j}(k) denotes the observation probability of symbol k from state j in HMM n, and \pi_{n,i} denotes the initial probability of state i in HMM n. In (2.129)-(2.130), if A_1 is close to A_2, then B plays the decisive role in computing the likelihoods. It is difficult to show analytically how much the matrices A and B affect the likelihood in general, since the result differs depending on the time-varying input speech signals. However, it is possible to estimate roughly the relative contribution of each matrix.

2.6.3 Matrix Inversion Approach

Here we consider the application of the vector-matrix formulation of the HMM to assess the relative importance of A and B to the likelihood. Previously, a matrix norm was applied to an ergodic model to obtain a closed form for the a posteriori probability of a given speech utterance at a specific time, as in (2.115) or (2.116). Now consider the general case covering all time indices. Reconsider a model with the time-varying state-equation representation (2.22). Multiplying both sides of (2.22) by A^{-1},

    A^{-1} X(t+1) = A(t) X(t) + A^{-1}\pi\delta(t).        (2.131)

Let us assume a Bakis structure with strong diagonal elements in A. As before, let \Omega = I - A. Then

    (I - \Omega)^{-1} X(t+1) = A(t) X(t) + A^{-1}\pi\delta(t).        (2.132)

Further,

    (I - \Omega)^{-1} = I + \Omega + \Omega^2 + \Omega^3 + \cdots        (2.133)
                      = I + \sum_{n=1}^{\infty} \Omega^n.                 (2.134)

Therefore,

    (I + \sum_{n=1}^{\infty} \Omega^n) X(t+1) = A(t) X(t) + A^{-1}\pi\delta(t)        (2.135)
    X(t+1) = A(t) X(t) - (\sum_{n=1}^{\infty} \Omega^n) X(t+1) + A^{-1}\pi\delta(t).   (2.136)

From the definition of \Omega and (2.134),

    \sum_{n=1}^{\infty} \Omega^n = A^{-1} - I.        (2.137)

Substituting in (2.136),

    X(t+1) = A(t) X(t) - (A^{-1} - I) X(t+1) + A^{-1}\pi\delta(t)        (2.138)
           = \tilde{X}(t) + \tilde{X}(t+1) + A^{-1}\pi\delta(t),          (2.139)

where \tilde{X}(t) = A(t) X(t) and \tilde{X}(t+1) = -(A^{-1} - I) X(t+1). Because A is assumed to have strong diagonal elements, X(t+1) \approx \tilde{X}(t) for all t > 0. To see \tilde{X}(t+1) quantitatively, consider, for example, a simple case in which A \in R^{2 x 2},

    A = [a_{11} a_{12}; a_{21} a_{22}].        (2.140)

Then,

    A^{-1} - I = (1 / det(A)) [a_{22} - a_{11}a_{22} + a_{12}a_{21}   -a_{12}; -a_{21}   a_{11} - a_{11}a_{22} + a_{12}a_{21}],        (2.141)

assuming that det(A) = a_{11}a_{22} - a_{12}a_{21} > 0, which is justified in the present analysis because of the strong diagonal property. In contrast to the denominator, all entries of the numerator of (2.141) are small numbers. For the Bakis model, a_{22} = 1 and a_{12} = 0, so that

    A^{-1} - I = [a_{21}/a_{11}   0; -a_{21}/a_{11}   0].        (2.142)

Accordingly, \|(A^{-1} - I) X(t+1)\| is very small compared with the values in X(t). For any size N we reach the same conclusion. This is further support for the notion that the observation probabilities are much more significant than the state-transition probabilities in computing the likelihood.
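The smallness of the correction term (A^{-1} - I) X(t+1) in (2.139)-(2.142) can be verified directly for a two-state Bakis matrix; the entries of A and X below are hypothetical.

```python
import numpy as np

a11 = 0.95
a21 = 1.0 - a11
A = np.array([[a11, 0.0],
              [a21, 1.0]])                    # two-state Bakis matrix: a22 = 1, a12 = 0
X = np.array([0.3, 0.2])                      # a hypothetical (scaled) forward vector

correction = (np.linalg.inv(A) - np.eye(2)) @ X          # the term of (2.142) applied to X
print(correction)
print(np.linalg.norm(correction, 1) / np.linalg.norm(X, 1))   # small relative to X
```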
2.6.4 Eigenanalysis Approach

Let us examine the eigensystem of the state-transition matrix of the time-varying F-B HMM. This approach also leads to the conclusion that A is relatively insignificant compared to A(t) in light of the likelihood measure. From (2.86), the diagonalization of the time-varying F-B HMM is explicitly given by

    \alpha(t+1) = A(t+1) P \Lambda P^{-1} \alpha(t) + A(1)\pi\delta(t).        (2.143)

Since A(t+1) is a diagonal matrix, analyzing \Lambda (or P) is sufficient. Note that, in general, the vector-matrix-vector multiplication (2.90) used to compute the likelihood does not have a straightforward relation to the eigensystem of the matrix. However, consider the condition in which \Lambda is close to the identity matrix I. To observe the significance of \Lambda, consider two HMMs which are numerically very close to each other. Let A_1 and A_2 be the state-transition matrices of M_1 and M_2, respectively, and let A_1 have the Bakis topology. For analysis purposes, suppose that all the entries of A_2 are close to those of A_1, and that both are close to I so that they have the strong diagonal property. Further assume that the other two matrices B_1 and \pi_1 from M_1 are the same as B_2 and \pi_2 from M_2. Then let us estimate the likelihood under A_2 in terms of A_1.

Consider first the practical case in which A_2, which is close to A_1, is of the Bakis topology, so that the diagonal entries of A_2 are the eigenvalues themselves. During the matrix multiplications for the likelihood computation, the eigenvalues of the state-transition matrix are explicitly involved in the matrix operations as explicit (diagonal) entries of the matrix. In this case the result is trivial, since A_1 and A_2 produce similar likelihoods.

Next consider the case in which A_2 is not of the Bakis topology but is still strongly diagonally dominant. The non-Bakis condition makes the analysis difficult, since the eigenstructure of the system varies depending on changes in the entries of the matrix. Because A_2 is no longer triangular, the eigenvalues of the matrix are not explicitly involved in the matrix multiplication. However, the Gerschgorin circle theorem [37, 38, 39] provides another way to assess the matrix multiplication approximately. The theorem indicates the possible relative locations of the corresponding eigenvalues after the entries of a matrix change slightly from an original matrix. To apply the theorem, let \lambda be an eigenvalue of the N x N matrix A_2. Then, by the Gerschgorin theorem, each eigenvalue lies in at least one of the discs with center a_{ii} and radius r_i = \sum_{j \neq i} |a_{ij}|, i = 1, ..., N, in the complex plane:

    |\lambda - a_{ii}| \leq r_i.        (2.144)

Because A_2 is diagonally dominant, the size of each Gerschgorin disc is very small. Moreover, since A_2 is a stochastic matrix, one eigenvalue remains unity.

For the matrix computation, consider the eigenvectors of A_2. Although the eigenvalues of A_2 are not affected much and the Euclidean distances between the respective eigenvalues of A_1 and A_2 are small, the sensitivity of the eigenvectors depends on the eigenvalue sensitivity and separation. In particular, in the case of identical eigenvalues, there exists an infinite set of possible eigenvectors because of linear dependency. Therefore, the eigenvector conditions are not helpful for estimating the likelihood under matrix A_2 in association with A_1. However, in case A(t) is sparse (the sparseness of B will be explained in Chap. 3), so that the effect of the off-diagonal elements is reduced, we may get likelihood results with A_2 that are close to those with A_1. The results are in fact mostly dependent upon the probabilities of A(t). For the TIA HMM, (2.76) and (2.77), the eigenanalysis approach is more readily applicable than for the F-B HMM, because the time-invariant state equation does not depend on the observation symbol string.
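A short sketch of the Gerschgorin bound (2.144): for a hypothetical diagonally dominant stochastic matrix A2, every eigenvalue falls inside at least one small disc centered at a diagonal entry, and one eigenvalue is unity.

```python
import numpy as np

# Hypothetical diagonally dominant, non-Bakis stochastic matrix (columns sum to one).
A2 = np.array([[0.92, 0.03, 0.01],
               [0.05, 0.94, 0.04],
               [0.03, 0.03, 0.95]])

centers = np.diag(A2)
radii = np.sum(np.abs(A2), axis=1) - np.abs(centers)   # r_i = sum_{j != i} |a_ij|
eigvals = np.linalg.eigvals(A2)

for lam in eigvals:
    inside = np.abs(lam - centers) <= radii
    print(lam, inside.any())    # each eigenvalue lies in at least one (small) disc
```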
Chapter 3
Practical Issues in the Use of the TIA HMM

This chapter is devoted to practical issues related to the TIA HMM. Here, the "practical issues" relate principally to the problem of illegal state sequences in the TIA HMM. Additionally, the technique proposed by Turin [27] to reduce the computational load of the likelihood computation will be reexamined. Then, a new approach will be proposed and derived to reduce the computational work of the likelihood computation, based on a condition imposed on the utterance by Turin. Through this approach, we will show that we can obtain greater computational savings as well as fast evaluation with reduced computational resources. Even though the assumption imposed by Turin on a speech utterance is not ubiquitous in real speech signals, this study is significant for comparing the efficiency of the computational savings of the new approach with that of Turin. We discuss the problem of computational savings in HMMs first, in the following section.

3.1 Efficient Evaluation Technique

In [27], Turin suggests a new technique for computing the HMM likelihood efficiently using a vector-matrix formulation of the HMM. Turin's approach assumes that the observation string extracted from the speech has long stretches of identical observations. Let us discuss and examine his method so as to derive a technique with greater computational savings. Before presenting the new technique, we briefly review Turin's method [27]. Assume that the observation string has long stretches of identical observations, say,

    o_{t+1} = o_{t+2} = \cdots = o_{t+r}        (3.1)

with r repetitions of the symbol. When there are many blocks of repetitive strings, the following development can be reapplied. From (2.13), the likelihood is given by

    P(o_1, o_2, ..., o_T | M) = C' A(T) A A(T-1) A \cdots A(2) A A(1) \pi        (3.2)
        = C' A(T) A A(T-1) A \cdots A(t+r+1) A (A(t+1)A)^r A(t) A \cdots A(2) A A(1) \pi        (3.3)

under (3.1). Thus, the problem becomes how to compute the matrix (A(t+1)A)^r efficiently. Among several algorithms proposed by Turin [27] for computing (A(t+1)A)^r, one technique, suggested using [28], is as follows:

1. Let {b_{k-1} b_{k-2} \cdots b_1 b_0} be a binary representation of r, so that

       r = b_0 + 2 b_1 + ... + 2^{k-1} b_{k-1}.        (3.4)

   Also, let

       Q_0 = I            (3.5)
       R_1 = A(t+1)A.     (3.6)

2. For i = 1, ..., k,

       R_{i+1} = R_i^2        (3.7)
       Q_i = Q_{i-1}          if b_{i-1} = 0        (3.8)
       Q_i = Q_{i-1} R_i      if b_{i-1} = 1.

3. Termination:

       (A(t+1)A)^r = Q_k.        (3.9)

This algorithm requires on average 3N^3 \log_{10} r floating-point operations (flops) to calculate (A(t+1)A)^r. It is obvious that we obtain greater computational savings when r is large. Depending on r, however, the computational savings vary. For example, when r = 2^n for n \in N, large computational savings can be obtained. On the other hand, the savings become relatively small when r = 2^n - 1. In addition to this variability of the computational savings, the algorithm still requires a recursive squaring of (A(t+1)A).
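To make the algorithm above concrete, here is a minimal numpy sketch of the square-and-multiply recursion in (3.4)-(3.9); the function name and the example matrix R1 are illustrative only.

```python
import numpy as np

def power_by_squaring(R, r):
    """Square-and-multiply evaluation of R^r following (3.4)-(3.9): scan the binary
    digits of r, squaring R at each step and multiplying the accumulator Q only
    where the digit b_{i-1} is one."""
    Q = np.eye(R.shape[0])          # Q_0 = I, eq. (3.5)
    while r > 0:
        if r & 1:                   # b_{i-1} = 1
            Q = Q @ R
        R = R @ R                   # R_{i+1} = R_i^2, eq. (3.7)
        r >>= 1
    return Q

# Illustration with a hypothetical A(t+1)A for a two-state Bakis model.
R1 = np.diag([0.8, 0.3]) @ np.array([[0.9, 0.0],
                                     [0.1, 1.0]])
print(np.allclose(power_by_squaring(R1, 10), np.linalg.matrix_power(R1, 10)))
```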
To improve upon the techniques proposed for computing (A(t+1)A)^r [27, 28], we develop a more computationally efficient technique for this repetitive matrix multiplication based on a linear transformation of the matrix. This method is particularly efficient in cases where the matrix R_1 of (3.6) is a sparse, near-triangular matrix, typical of the HMM structure. The derivation follows.

In Chapter 2, a similarity transformation of the nonsingular matrix A was used to obtain computational savings in the TIA HMM. Similarly, a linear transformation can be applied to the computation of (A(t+1)A)^r. In this case, the matrix product is singular most of the time. Initially, suppose that A(t+1) is nonsingular. Then the resulting product A(t+1)A is nonsingular; thus, the matrix A(t+1)A can be expressed as the product of three matrices,

    A(t+1)A = P(t+1) D(t+1) P^{-1}(t+1),        (3.10)

where P(t+1) is an eigenvector matrix of, and D(t+1) is a diagonal eigenvalue matrix of, A(t+1)A. Therefore,

    (A(t+1)A)^r = P(t+1) D^r(t+1) P^{-1}(t+1).        (3.11)

Since D(t+1) is a diagonal matrix, computing D^r(t+1) is straightforward. If D(t+1) is N x N, for example, it takes only N x r flops to compute D^r(t+1). Likewise, computing P^{-1}(t+1) from P(t+1) takes N^3 flops. This is not computationally demanding when N is not large; N is practically not over 6 in HMMs.

Second, suppose that A(t+1)A is singular because of zero elements in the A(t+1) matrix. Note that A is a nonsingular matrix. In this case we can still choose a nonsingular eigenvector matrix P(t+1), because it is possible to choose any linearly independent eigenvector corresponding to a zero eigenvalue. Thus, a nonsingular matrix P(t+1) always exists regardless of the values of A(t+1). Hence, relation (3.11) is always valid.

To compare the number of floating-point operations required to compute (A(t+1)A)^r by the three techniques described above, consider a simple case as follows. Suppose that none of the diagonal entries of A(t+1) is zero, and that A is a triangular matrix allowing any forward state jump. Furthermore, assume that the state-transition matrix A(t+1)A has distinct eigenvalues. Then the necessary computational complexities of the three techniques are shown in Table 3.1.

    Approach                   | flops
    Conventional F-B HMM       | (1/2)N(N+1)r^2 + (1/2)(N+1)(N+2)(r-1)
    Turin's algorithm          | 3N^3 log_{10} r
    Similarity transformation  | N(N+1) + Nr + N^2 + N^3 + (1/2)(N+1)(N+2)

Table 3.1: Approximate computational complexities for computing (A(t+1)A)^r by three different approaches.

For example, when r = 10 and N = 5, the approximate complexities for computing (A(t+1)A)^r are 2040 flops for the conventional scalar recursive F-B HMM algorithm, 375 flops for Turin's algorithm, and 265 flops for the similarity transformation in (3.11). It is obvious that, as in [27], the savings in computing P increase as r increases. In contrast to the technique in [27], however, the load required to compute (A(t+1)A)^r using the similarity transformation is less sensitive to r, since the computational load increases proportionally at a rate of N. For Turin's algorithm, on the other hand, it increases as 3N^3 associated with \log_{10} r.
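The similarity-transformation route of (3.10)-(3.11) can be sketched in a few lines; this assumes a nonsingular eigenvector matrix can be chosen for A(t+1)A, as argued above, and the example product R1 is again hypothetical.

```python
import numpy as np

def power_by_eig(R, r):
    """Compute R^r = P D^r P^{-1} per (3.10)-(3.11); the eigenvector matrix P is
    assumed nonsingular (independent eigenvectors chosen for zero eigenvalues)."""
    d, P = np.linalg.eig(R)
    return np.real_if_close((P * d**r) @ np.linalg.inv(P))   # P diag(d^r) P^{-1}

R1 = np.diag([0.8, 0.3]) @ np.array([[0.9, 0.0],
                                     [0.1, 1.0]])
print(np.allclose(power_by_eig(R1, 10), np.linalg.matrix_power(R1, 10)))
```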
Now, consider the TIA HMM under Turin's assumption that the observation string has long stretches of identical observations. From (2.78) and (2.80), the partial likelihood from t+1 to t+r becomes

    \prod_{\tau=t+1}^{t+r} P(o_\tau | M) = \prod_{\tau=t+1}^{t+r} \tilde{b}'(o_\tau) z(\tau)        (3.12)
        = \prod_{\tau=t+1}^{t+r} \tilde{b}'(o_\tau) \Lambda z(\tau - 1)                             (3.13)
        = (\tilde{b}'(o_{t+1}) \Lambda z(t)) (\tilde{b}'(o_{t+1}) \Lambda^2 z(t)) \cdots (\tilde{b}'(o_{t+1}) \Lambda^r z(t)).

A related question concerns the ranking of models under the two likelihood measures: when

    P(o_1, o_2, ..., o_T | M_i) > P(o_1, o_2, ..., o_T | M_j)        (3.28)

holds for all j \neq i, is

    \prod_{t=1}^{T} P(o_t | M_i) > \prod_{t=1}^{T} P(o_t | M_j)        (3.29)

always true for any i? It is not easy to specify analytically how much larger the left side of (3.28) needs to be than the right side. We can only say that, in the case of the previous example, the likelihood of the correct word is approximately three times greater than that of the others. A more intensive study of the likelihood relationship between the F-B HMM and the TIA HMM, in conjunction with speech recognition performance, is left for future research.

3.2.3 State Probability Distribution Vector in the TIA HMM

In this section, we investigate x(t) of (2.67) to assess the effect of A in determining the state sequences of an utterance in conjunction with the observation symbols. Consider a Bakis-type TIA HMM. The state-transition equation is then composed of A and the state probability distribution vector x(t). However, since the state-transition part of the TIA HMM is composed only of A, in contrast to A and A(t) in the F-B HMM, the Bakis condition does not have a direct influence in constituting the possible state transitions for an observation string. Instead, A affects x(t), and x(t) indirectly controls the state transitions. To observe the effect of A, let us compare two cases, a Bakis (A_B) and an ergodic (A_e) constraint on the state-transition configuration. Table 3.6 shows the state probability distribution vectors x(t) at a few specific times when

    A_B = [0.9  0; 0.1  1],    A_e = [0.9  0.2; 0.1  0.8]        (3.30)

with x_B(1) = x_e(1) = (1, 0)'.

              t=1    t=2    t=3    t=4    t=5    t=40   t=50   t=60   t=70
    x(t)_B    1.000  0.900  0.810  0.729  0.656  0.016  0.005  0.002  0.000
              0      0.100  0.190  0.271  0.343  0.983  0.994  0.998  0.999
    x(t)_e    1.000  0.900  0.830  0.781  0.746  0.666  0.666  0.666  0.666
              0      0.100  0.170  0.219  0.253  0.333  0.333  0.333  0.333

Table 3.6: State probability distribution vectors under the Bakis, x(t)_B, and ergodic, x(t)_e, constraints.

The effect of A on x(t) is not apparent over short intervals. It follows that the likelihood differences arising from the Bakis and ergodic constraints are not evident over short times. Additionally, it is not possible to distinguish the topology of the HMM from a record of x(t). However, over a sufficient duration of time, we see that the difference between the two state probability distribution vectors becomes distinguishable. As explained in Section 3.2.1, the effect of A in the TIA HMM is indirect and "global," in the sense of forming the "possible" state paths in an utterance, in contrast to A(t), which locally affects the state paths.

3.2.4 Comparison of x(t) and \gamma(t)

In this section, we discuss similarities between one state variable of the F-B HMM and one of the TIA HMM. In particular, we are interested in the similarity of the state probability distribution vector x(t) of the TIA HMM and \gamma(t) of the F-B HMM. Previously, we have

    x_i(t) = P(q_t = i | M)         (3.31)
    \gamma_i(t) = P(q_t = i | O, M),  (3.32)

where O = {o_1, o_2, ..., o_T} and M = {N, M, A, B, \pi}. Note that \gamma_i(t) is the a posteriori probability of a state based on O. On the other hand, x_i(t) is a state probability distribution without O, although both state variables provide information about the probability of being in state i at time t.

Reconsider the previous recognition experiments. For analysis purposes, consider the case of the word "four." Suppose that M is computed from fifteen training utterances using the F-B HMM and A is

        [ 0.9625  0       0       0       0      ]
        [ 0.0375  0.8835  0       0       0      ]
    A = [ 0       0.1165  0.9704  0       0      ]        (3.33)
        [ 0       0       0.0277  0.8652  0      ]
        [ 0       0       0.0020  0.1348  1.0000 ]

The corresponding state probability distribution x_i(t) along t for i = 1, ..., 5 appears in Fig. 3.1. Also, with M, \gamma_i(t), i = 1, ..., 5, can be computed for each utterance O of the training data set. It is interesting to combine all \gamma_i(t), i = 1, ..., 5, of the training data set and to compute the average of \gamma_i(t) along t. Since the length of each training utterance may differ, it is not possible to obtain complete alignment of the training data set along t. Instead, only the time duration commonly occupied by the entire training data set is considered.
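As a check on the behavior plotted in Fig. 3.1, the state distribution x(t) = A^{t-1} u(0) of (2.70) can be iterated directly for the matrix in (3.33). The initial excitation u(0) = (1, 0, 0, 0, 0)' is an assumption made here for illustration, since the trained initial vector is not reproduced above.

```python
import numpy as np

A = np.array([[0.9625, 0.0,    0.0,    0.0,    0.0],
              [0.0375, 0.8835, 0.0,    0.0,    0.0],
              [0.0,    0.1165, 0.9704, 0.0,    0.0],
              [0.0,    0.0,    0.0277, 0.8652, 0.0],
              [0.0,    0.0,    0.0020, 0.1348, 1.0]])   # eq. (3.33), digit "four"
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])                 # assumed u(0): start in state 1

trajectory = []
for t in range(70):                # x(t) = A^{t-1} u(0), eq. (2.70)
    trajectory.append(x.copy())
    x = A @ x

# x(t) at t = 1, 31, and 70: probability mass drifts left to right through the states.
print(np.round(trajectory[0], 3), np.round(trajectory[30], 3), np.round(trajectory[69], 3))
```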
The average of \gamma_i(t) over the entire set of training utterances is shown in Fig. 3.2.

[Figure 3.1: State probability distribution after training on the digit "four."]
[Figure 3.2: The average of \gamma_i(t), i = 1, ..., 5, over the entire set of training utterances of the word "four."]

Comparing the results, the transition pattern of x_i(t) is seen to be similar to that of \gamma_i(t) for each i, even though the actual probabilities are different. For the digits other than the word "four," we see the same phenomenon. It is not easy to say how these two statistical variables from different models are related, since x_i(t) is an infinite sequence if t is not constrained, while \gamma_i(t) is a finite sequence according to the size of the training data. Roughly speaking, however, the average of \gamma_i(t) over the training data set amounts to the state probability distribution x_i(t). This phenomenon is related to the counting process by which A and B are computed.

3.2.5 Experimental Results on the Effects of A

We have shown theoretically that a strongly diagonal A does not make a significant contribution to the likelihood scores in the TIA HMM. Here, we show this experimentally with some examples. Along with this experiment, we discuss a possible way of reducing the computational load required in HMM training using the characteristics of A.

To observe the effect of A in the TIA HMM, first let us update only B during HMM training of the five-state Bakis HMMs, while allowing only one skip in any forward transition. In other words, after A is assigned initially, the training procedure estimates B only and does not change A. Then, compare the likelihoods. Let

              [ 0.99  0     0     0     0    ]
              [ 0.01  0.99  0     0     0    ]
    A_{B_1} = [ 0     0.01  0.90  0     0    ]        (3.34)
              [ 0     0     0.05  0.99  0    ]
              [ 0     0     0.05  0.01  1.00 ]
73 Next, to examine the recognition results for different state-transition matrices, let the resubstitution-test be performed even though such a test is not practical in speech recognition system. In the resubstitution-test, the training utterance is used for a testing utterance. In this simulation, however, there is no difference between a resubstitution-test and leave-one-out-test because we are looking for the effect of the tOpOlogy Of A in the F-B HMM and TIA HMM. We can reach the the same conclusions with a leave-oneout-test. Also, the results from the resubstitution-test will be useful when we discuss the topic about finding an optimal state sequence in a speech utterance in Chapter 4. The recognition results are in Table 3.8 through Table 3.13. Table 3.8 shows the likelihoods for each digit from the F-B HMM computation when one randomly cho- sen testing utterance among fifteen is evaluated by the HMMS. Table 3.9 shows the likelihoods for each digit from the TIA HMM computation when the same testing utterance in the case of F—B HMM is evaluated by the HMMS. Table 3.10 and Ta- ble 3.11 are the likelihoods for A3, for the EB HMM and the TIA HMM respectively. On the other hand, Table 3.12 and Table 3.13 are for A3,. Comparing Table 3.10-Table 3.13 with Table 3.8-Table 3.9, yields the following observations: 0 The digit recognition performance with A3, and AB, matches the performance with A1 and B in the F-B HMM. In addition, the more diagonally dominant A is, the better the recognition performance. a In case of the TIA HMM, we have the same conclusion that the more diagonally dominant A is, the better the recognition performance. However, the recogni- tion performance is more sensitive to the values of state-transition matrix than that of the F-B HMM. o In an extreme case such as 62,-, = 71¢, for all j E [1,N], the recognition per- 74 Test M1 M2 M3 M4 M5 M5 M7 M3 M9 M0 data one 63.7 481.0 569.0 553.8 218.1 532.4 320.4 534.5 173.7 567.1 two 476.2 84.3 669.2 570.5 643.7 638.8 489.7 635.0 724.1 456.8 three 800.7 758.5 75.3 728.7 682.7 697.8 741.8 544.3 747.3 493.6 four 593.2 680.2 711.9 70.7 612.0 757.4 843.7 823.2 868.0 559.5 five 727.4 748.7 622.7 442.9 53.1 637.5 592.1 696.5 694.7 498.7 six 857.2 681.5 756.8 820.3 633.0 122.7 302.2 536.1 711.7 827.9 seven 473.3 654.9 517.1 655.2 365.9 526.8 96.1 685.0 382.7 670.3 eight 562.7 536.1 424.4 574.7 530.2 447.8 524.9 66.8 532.9 551.7 nine 194.7 551.4 629.4 677.9 265.8 621.8 366.9 580.0 63.5 677.6 zero 1054.1 833.4 898.6 839.9 893.2 976.0 940.0 986.3 366.0 129.9 Table 3.8: Likelihood P(O I M) using A.- and B for each digit 2' in a resubstitution test. Test M1 M2 M3 M4 M5 M6 M7 M3 M9 M0 Data one 93.8 457.4 511.6 555.9 251.1 543.2 341.8 526.6 195.8 610.7 two 549.9 106.6 627.2 619.1 557.8 654.5 503.3 626.7 729.2 424.6 three 809.8 725.7 111.8 721.3 678.4 709.3 750.3 547.2 729.8 475.9 four 611.5 689.9 710.9 103.5 612.8 796.5 847.8 838.6 872.7 490.0 five 729.3 714.3 650.9 413.6 97.2 642.6 610.5 712.5 702.1 531.7 six 860.9 705.3 771.5 806.9 639.1 145.2 297.4 553.5 773.6 815.2 seven 413.8 638.2 575.7 699.7 391.0 539.3 122.7 700.8 409.4 683.2 eight 568.2 479.2 360.7 548.9 520.3 449.7 510.2 84.8 543.8 559.1 nine 139.3 487.7 634.9 692.3 231.3 647.5 395.6 588.1 95.8 694.0 zero 1056.0 842.8 818.3 789.2 862.0 981.5 952.5 977.3 1041.7 160.9 Table 3.9: Likelihood 11;, P(o. | M) using A,- and B for each digit 2' in a resubsti- tution test. 75 Test M1 M2 M3 M4 M5 M6 M7 M3 M9 M0 Data one 65.2 532.9 569.5 550.1 214. 
1 530.9 322.0 517.0 170.6 577.8 two 471.2 78.7 675.8 580.3 633.6 680.3 481.9 636.1 724.3 443.5 three 801.6 755.1 76.5 733.8 681.2 755. 1 745.1 557. 1 755.4 483.5 four 587.6 672.2 711.2 70.6 604. 1 759.4 849.6 826. 1 869.1 572.3 five 729.2 743.1 622.2 429.0 59.7 692. 1 585.7 757.9 690.7 491.8 six 858. 1 697.0 755.9 807.0 639.9 115.5 307.4 599.3 715.6 850.2 seven 473.9 665.3 516.6 666.3 376.8 553.0 88.8 740.9 461.6 741.1 eight 563.5 536.6 424.6 573.2 537.8 464.4 519.4 72.4 535.8 529.0 nine 173.5 604.9 628.9 673.3 273.0 621.2 371.2 566.2 52.7 673.1 zero 1054.9 828.2 897.5 823.6 950.3 975.0 945.0 989.2 1040.4 114.0 Table 3.10: Likelihood P(O I M) using A81 and BAH, for each digit in a resubsti- tution test. Test M 1 M2 M3 M4 M5 M5 M7 M8 M9 M0 Data one 175.8 482.5 514.0 567.7 300.4 570.9 385.2 557.4 208.5 510.4 two 470.7 166.6 615.7 595.5 556.2 682.3 546.7 644. 1 724.0 453.7 three 827.4 740.9 161.0 725.8 689.3 743.8 760.5 547.2 741.0 479.1 four 607.2 725.2 694.4 134.2 610.0 802.4 853.7 852.7 876.0 533.5 five 741.9 704.1 619.9 436.9 125.8 665.6 608.9 740.5 669.6 550.5 six 877.0 660.9 743.2 782.3 664.8 197.1 220.9 582.5 724.9 747.4 seven 451.0 666.7 527.6 685.0 432.6 567.3 177.0 722.6 402.6 576.8 eight 583.4 445.1 366.3 537.2 525.9 461.6 509.2 120.6 557.5 478.0 nine 223.0 513.6 634.7 712.0 296.8 673.9 437.8 628.9 99.2 599.2 zero 1056.0 853.9 819.9 787.1 853.2 995.6 946.9 991.6 1044.8 208.9 Table 3.11: Likelihood H2"=1 P(Ot I M) using AB, and B As for each digit in a l resubstitution test. 76 Test M1 M2 M3 M4 M5 M5 M7 M3 M9 M0 Data one 66.2 470.1 566.6 548.1 212.3 536.6 320.9 534.0 179.0 583.5 two 540.9 81.7 681.8 586.2 645.5 639.8 486.6 626.7 726.4 447.6 three 798.3 760.5 81.4 733.8 676.2 697.9 734.9 542. 1 716.2 543.5 four 592.6 672.1 706.4 77.7 617.3 748.6 847.6 821.5 867.3 568.2 five 727. 1 755.7 617.2 433.0 61.8 643. 1 595.7 694.8 690.8 493.3 six 854.8 687.9 804.7 817.0 636.1 123.7 294.1 528.0 720.4 849.4 seven 476.7 650.4 514.3 673.8 376.8 530.4 93.5 677.4 369.6 754.1 eight 560.2 530.2 424.3 577.5 528.3 449.9 525.7 68.7 535.8 558.1 nine 178.9 535.2 627.8 670.0 271.5 628.0 366.5 575.8 59.8 671.4 zero 1051.6 831.6 911.4 830.3 953.7 978.7 952.0 989.3 1038.4 121.2 Table 3.12: Likelihood P(O I M) using .43, and B A3,‘ for each digit in a resubsti- tution test. Test M1 M2 M3 M4 M5 M6 M7 M3 M9 M0 Data one 135.5 470.3 569.8 554.1 306.0 641.6 358.0 536.3 255.2 626.5 two 710.4 206.4 707.7 774.0 680.0 677.2 710.2 613.9 746.7 475.5 three 792.9 775.5 197.6 749.4 676.2 730.1 744.8 725.9 712.6 728.4 four 745.0 672.8 773.8 320.9 705.6 761.5 860.6 828.0 871.8 566.0 five 731.5 825.8 773.0 607.9 441.0 822.4 811.9 702.0 798.5 651.6 six 859.2 919.5 878.3 943.4 739.3 260.8 662.5 536.9 850.5 906.4 seven 491.6 655.6 769.7 844.1 554.8 619.3 381.2 687.4 590.4 759.4 eight 564.6 583.7 417.8 640.4 532.7 481.7 568.0 141.5 540.9 623.0 nine 236.5 540.5 684.0 674.9 406.9 769.5 486.2 584.9 292.3 734.9 zero 992.0 892.5 945.5 989.5 1041.1 999.7 1017.7 980.8 1046.0 611.0 Table 3.13: Likelihood HT:l P(O, I M) using AB, and B AB for each digit in a 2 resubstitution test. 77 formance becomes degraded. This is because of the more ambiguity caused by “equally likely” state transition. Contrary to this argument, with A which is close to a diagonalized matrix, the performance becomes enhanced because of lessened ambiguity about state occupancy at a given time. In information theory, such uncertainty is measured in terms of entropy [5]. For the leave-one-out tests, the conclusions above are the same. 
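The remark above that transition ambiguity can be quantified by entropy [5] can be illustrated with a small sketch; the two matrices below are hypothetical stand-ins for a strongly diagonal and an "equally likely" transition structure, not the trained models of the experiments.

```python
import numpy as np

def mean_transition_entropy(A):
    """Average entropy (bits) of the per-state outgoing-transition distributions;
    in the column-stochastic convention used here, one column of A per state."""
    cols = A.T
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -np.sum(np.where(cols > 0, cols * np.log2(cols), 0.0), axis=1)
    return h.mean()

A_dominant = np.array([[0.99, 0.00], [0.01, 1.00]])
A_uniform  = np.array([[0.50, 0.50], [0.50, 0.50]])
print(mean_transition_entropy(A_dominant), mean_transition_entropy(A_uniform))  # ~0.04 vs 1.0
```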
Associated with these experimental results, we see that the time and resources required to train the HMMs are reduced with a preset state-transition matrix A_{B_1}, since no state-transition matrix needs to be trained. Therefore, it is possible to reduce the computational load required in training. Following this assertion, the question that arises is: "how close does A need to be to a diagonal matrix to obtain satisfactory recognition performance?" This is left for future research.

3.3 Reconciliation of the TIA HMM

In this section, we discuss a few evolving techniques that reconcile the TIA HMM to the conventional F-B HMM.

3.3.1 Feedback Control

From Table 3.10 through Table 3.13, we see that the less the x_i(t) overlap, the better the performance. This implies that, in the TIA HMM, the closer one of the states is to probability one at each t, the better the performance. Simply speaking, we want a TIA HMM close enough to a certain "unknown" desirable system in which the states are separable from each other as much as possible for every t. Such separation can decrease the adverse effects of the illegal paths.

However, the exponentially decaying characteristic induced by the Markovian assumption in the HMM basically does not allow such "separation." One attempt to compensate for the extra probabilities, as well as to decrease the illegal-path effects, is to use state-variable feedback to obtain desired state responses. The state feedback technique relocates the eigenvalues of a system to obtain a desired system response [94]. If a given linear time-invariant system realization is state controllable, any desired characteristic polynomial can be obtained by state-variable feedback. In our problem, however, we neither have specific desired eigenvalues, nor do we know exactly which eigenvalues would be optimal, in the sense that the recognition performance, as well as the robustness, of the TIA HMM would be comparable to those of the F-B HMM. Provided that such desirable poles are known, we have, from the TIA HMM,

    x(t+1) = A x(t) + u(t)\delta(t) + W(t),        (3.36)

where W(t) is the state feedback input which regulates the state probability so that x(t+1) can reach the desired values for each t by

    W(t) = F x(t).        (3.37)

Here we do not have a specific control input except at the initial time t = 0 in the HMM. This is in contrast to the usual state-space control problem. Therefore, to accommodate the time-varying nature of a speech signal and to avoid the exponentially decaying state probabilities of the Markovian model, we need a time-varying feedback control such as

    W(t) = F(t) x(t).        (3.38)

Unfortunately, this attempt does not allow us to have a stationary diagonalized state-transition matrix [A + F(t)] for all t. Therefore, we face the same problem as in the F-B HMM. The motivation for using the TIA HMM in speech recognition lies in its diagonalization; therefore, the principal advantage of using the diagonalization of the state equation does not exist in this approach.

3.3.2 Stochastic Modeling of Temporal Information in the TIA HMM

To make the TIA HMM robust, consider a model which includes additional temporal information between neighboring symbols of an utterance. The assumption that the observations generated by the HMM's hidden process are only state dependent is, in fact, a limitation of the HMM when applied to a real speech signal. In reality, speech features are correlated.
To include an additional time-ordering relation between consecutive symbols in a speech utterance, consider one of the techniques proposed by Dai et al. [63]. The idea is to include a Markovian relation between the symbols, rather than assuming the observations to be simply independent.

In Dai's approach, the state space is the codebook, and each symbol in the codebook becomes a state of the Markov process. The revised criterion seeks an HMM which produces a maximum likelihood in the conventional F-B HMM sense, in conjunction with the likelihood based on the Markovian relation between symbols, as

    L'(O) = P(O | M) P(O | M'),        (3.39)

where

    P(O | M) = P(o_1, o_2, ..., o_T | M),        (3.40)

and

    P(O | M') = P(o_1, o_2, ..., o_T | M')
              = P(o_T | o_1, o_2, ..., o_{T-1}, M') P(o_1, o_2, ..., o_{T-1} | M')
              = P(o_T | o_{T-1}, M') P(o_1, o_2, ..., o_{T-1} | M')        (3.41)
              = \prod_{t=1}^{T} P(o_t | o_{t-1}, M'),

with

    P(o_1 | o_0, M') = P(o_1 | M').        (3.42)

Here M' stands for the set consisting of the initial symbol probabilities and the symbol-transition matrix. The same idea can be applied to the TIA HMM with the likelihood \prod_{t=1}^{T} P(o_t | M):

    L'(O | M, M') = (\prod_{t=1}^{T} P(o_t | M)) P(O | M').
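A minimal sketch of this combined criterion applied to the TIA HMM follows. The symbol-transition matrix S and the initial-symbol vector pi_sym stand for the model M' and are assumed inputs here, as are the TIA parameters; the indexing convention S[k, l] = P(symbol k at t | symbol l at t-1) is an assumption of this sketch.

```python
import numpy as np

def combined_score(A, B, u0, pi_sym, S, obs):
    """Combined score of (3.39) for the TIA HMM: the anypath product of (2.69)
    times the symbol-bigram likelihood P(O | M') of (3.41)-(3.42)."""
    # TIA "anypath" term, prod_t P(o_t | M)
    x, p_tia = u0.copy(), 1.0
    for o in obs:
        p_tia *= (B @ x)[o]
        x = A @ x
    # symbol-bigram term P(O | M')
    p_sym = pi_sym[obs[0]]                   # P(o_1 | M'), eq. (3.42)
    for prev, cur in zip(obs[:-1], obs[1:]):
        p_sym *= S[cur, prev]                # P(o_t | o_{t-1}, M')
    return p_tia * p_sym                     # L'(O | M, M') 
```

In a recognizer, this score would simply replace the anypath product when ranking candidate word models, at the cost of maintaining one symbol-bigram model per word.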