. .
V . , _ .
u .f (4 I. , .p. .9 \
.E, 5.. ,ﬂ . V V be...
.. . .. Vb.

Vaxgiﬂ? . V , . , . . . .. E

3%.
is; .. 5:.
.1 .k. n A
2W?! 3W. , , .be .
. n.3’é::, . .
Vt. 3“ .. ; at
5V. 1...: 2
V . x‘ .I
. a
.
.a

ﬁx,
u. v: up
ma,

.9 . V V . a
r, . U V 4
.. $me
‘ ¢’ 1
.

a,

at”...
. .n. .V1h1l

x3

 

.w, , . . . . V V V , . . . V .V .. . V $3..

.7.“ .C .
.éi .. V .
Fl; :..vL.

kg? . . ,. Em? , wé‘ﬁwwgﬁﬁ

V ..

 

 

THES‘S

[les
-1

soul

This is to certify that the

dissertation entitled

EVALUATION AND IMPROVEMENT OF THE HMM
BY STATE-SPACE MODELING

presented by

Yong-Beom Lee
has been accepted towards fulﬁllment

of the requirements for

Ph. D. degree in Electrical Eng

AM %

 

Major professor

Date I, ﬁt ' 2000

MS U is an Afﬁrmative Action/Equal Opportunity Institution O~ 12771

 

 

M'Chlaan State
University

LIBRARY

 

 

PLACE IN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE

DATE DUE | DATE DUE

 

 

W

 

 

 

 

 

 

 

 

 

 

 

 

 

11m clam-gasp.“

 

 

EVALUATION AND IMPROVEMENT OF THE HMM BY
STATE-SPACE MODELING

By

Yong-Beam Lee

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical and Computer Engineering

2000

ABSTRACT

EVALUATION AND IMPROVEMENT OF THE HMM BY
STATE-SPACE MODELING

By

Yong-Beam Lee

Analytical modeling of speech production is not an easy task, in part because of
the rapidly time-varying nature of speech signals. The hidden Markov model (HMM)
is widely used for the stochastic modeling of time-varying signals, and it has been
most applied in the area of speech production and recognition.

Most current HMM research has focused on its applications. On the other hand,
studies of the theoretical aspects of the HMM are relatively few. This is due to
the difﬁculties of analyzing a model that is inherently probabilistic and recursive in
nature. However, if the fundamentals of the HMM are approached from a different
direction, it is possible to obtain useful analyses of the HMM which contribute to its
use in speech technologies.

The main objective of this dissertation is to revisit and further investigate three
fundamental HMM problems related to speech recognition using a novel mathematical
formulation. Rather than the conventional representation of the HMM as a scalar

recursive algorithm, the HMM will be represented using a vector-matrix formulation.

It will be shown that the HMM can be represented as a state-space model. The
conventional Baum-Welch (time-varying) model as well as an “approximate” time-
invariant model will be studied in detail in the context of this new formulation. A
more thorough theoretical and empirical investigation of this approximate model is
presented in this dissertation. In particular, the spoken-digit recognition problem will
be the focus of applied studies.

Some useful results and techniques using the time—invariant approximation of the
HMM are addressed and analyzed. In addition, new state—search techniques using
clustering, and novel set-membership identiﬁcation techniques are developed as the
basis for a novel HMM training approach. The new training results in HMM state
assignments corresponding to acoustically meaningful segmentation of the speech,
rather than adherence to the conventional maximum likelihood criterion. The results

of new search techniques are compared to those of the Viterbi search.

To my wife and daughter

For their love, support, and sacriﬁce

iv

ACKNOWLEDGMENTS

I would like to extend my deep thanks and gratitude to Professor John R. Deller,
Jr. for his guidance and encouragement of my graduate program for quite a long time.
His direction was very important in helping me step into the speech processing and
digital signal processing ﬁeld.

Not only the idea for the state Space formulation of the HMM, but also major
parts of the problems in this thesis were suggested by Professor Deller, who also made
uncountable number of suggestions for the research. Moreover, I am very thankful
for his meticulous guidance for the writing.

Also, I would like to thank all the members of my thesis committee: Dr. H. Khalil,
Dr. C. Wei], Dr. P. Pierre, and Dr. J. Deller, Jr.

Finally, I give heartful thanks to my family and parents for their love, support,

patience, and encouragement.

Contents

List of Tables viii
List of Figures ix
1 Introduction 1
1.1 Background ................................ 1
1.2 History of the Vector-Matrix Formulations of the HMMs ....... 3
1.3 Problems of Existing Vector-Matrix Formulations of the HMM . . . . 5
1.4 Objectives ................................. 6

2 Vector-Matrix Formulations of the HMM 8
2.1 HMM Background ............................ 9
2.2 Time-Varying Forward-Backward HMM ................ 11
2.2.1 Evaluation Problem ........................ 11

2.2.2 Decoding Problem ........................ 23

2.2.3 Training (Estimation) Problem ................. 26

2.3 Time-Invariant Approximation for the HMM .............. 30
2.4 Transformations of State Equations ................... 33
2.4.1 Transformation of Time-Invariant State Equation ....... 33

2.4.2 Transformation of the Time-Varying State Equation ...... 36

2.5 Analysis of Illegal Paths Caused by Approximation .......... 38
2.5.1 Likelihood Difference ....................... 39

2.5.2 Comparison of the State-Transition Matrices .......... 41

2.6 Validity of the T ime—Invariant Approximation of the HMM ...... 43
2.6.1 Matrix Norm Approach ..................... 43

2.6.2 Likelihood Expansion Approach ................. 47

2.6.3 Matrix Inversion Approach .................... 48

2.6.4 Eigenanalysis Approach ..................... 50

3 Practical Issues in the Use of the TIA HMM 53
3.1 Efﬁcient Evaluation Technique ...................... 54
3.2 Analysis of the TIA HMM ........................ 58
3.2.1 Likelihood Structures ....................... 58

3.2.2 Experimental Comparisons of Likelihood Between Model Types 62

3.2.3 State Probability Distribution Vector in the TIA HMM . . . . 67

vi

3.2.4 Comparison of z(t) and 7(t) ........... . ........ 68

3.2.5 Experimental Results on the Effects of A ............ 72
3.3 Reconciliation of the TIA HMM ..................... 78
3.3.1 Feedback Control ......................... 78
3.3.2 Stochastic Modeling of Temporal Information in the TIA HMM 80
3.4 Discussion ................................. 82

Training HMMs so that Hidden Model States Meaningfully Repre-

sent Acoustic States 84
4.1 Maximum Likelihood Approach to State Sequence Determination . . 86
4.1.1 Introduction ............................ 86
4.1.2 Experimental Results ....................... 88
4.2 State Sequence Based on “Acoustic Distance” ............. 98
4.2.1 Introduction ............................ 99
4.2.2 The Concept ........................... 101
4.2.3 Recursive Viterbi Search Based on k-Means .......... 108 '
4.2.4 Experimental Results ....................... 110
4.2.5 Appropriate Number of States .................. 118
4.2.6 Remark .............................. 125
4.3 State Search by Set-Membership Identiﬁcation ............. 125
4.3.1 Original Thoughts about Exploiting the SM ID ........ 125
4.3.2 Background of the SM Identiﬁcation .............. 130
4.3.3 State Search Using the SM Identiﬁcation ............ 132
Conclusions and Future Research 138
5.1 Conclusions ................................ 138
5.2 Future Research .............................. 141

vii

List of Tables

3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
3.10
3.11
3.12
3.13

3.14
4.1

Approximate computational complexities for computing (A(t + 1)A)"
by three different approaches. ......................
Likelihood from the F-B HMM in leave-one-out-test. .........
Likelihood from the TIA HMM based on leave-one-out-test. .....
Statistical Properties of the likelihood results from the F-B HMM. . .
Statistical Properties of the likelihood results from the TIA HMM. . .
State probability distribution vectors under Bakis, :1:(t) B, and ergodic,
x(t)e, constraints ..............................
Sum of likelihoods of ﬁfteen training utterances for each digit associ-
ated with three different state-transition matrices in the F-B HMM. .
Likelihood P(0 I M) using A,- and B for each digit 2' in a resubstitu-
tion test ...................................
Likelihood 11le P(0¢ | M) using A,- and B for each digit 2' in a re-
substitution test. .............................
Likelihood P(0 | M) using AB, and BAH for each digit in a resub-
stitution test. ................ 1 ...............
Likelihood 11;, P(0t | M) using A3, and B As for each digit in a
resubstitutiontest. .................l ...........
Likelihood P(0 | M) using AB, and B As, for each digit in a resub-
stitution test. ...............................
Likelihood [1:1 P(Ot | M) using AB, and B As for each digit in a
resubstitutiontest. ................2 ...........
The likelihoods based on L(0 | M, M') with the TIA HMM. . . . .

Likelihoods of the conventional Viterbi search and the recursive Viterbi
search based on k-means clustering ....................

viii

57
65
65
66
66
68
73

75

76

76

77

77
82

116

List of Figures

2.1

3.1
3.2

4.1

4.2

4.3

4.4

4.5

4.6

4.7

Six-state Bakis topology of the HMM. .................

State probability distribution after training digit “four.” .......
The average of 7,-(t),i = 1,. . . ,5 from the entire training utterances of
word “four.” ................................
State search results from the conventional Viterbi and Q‘ = '{q{ =

[If arg max,, P(0., q. | A4,), 2': 1,. . . , 10 in a ﬁve-state Bakis HMM
of a spoken word “one.” Note that each graph represents a different “i”

except top two ﬁgures in the left column. The tests employ resubstitution.

State search results from the conventional Viterbi and Q" 2 [LT q; =
[If arg max,,, P(0t,qt I Mg), i = 1,. . . , 10 in a ﬁve—state Bakis HMM
of a spoken word “two.” Note that each graph represents a different “i”

except top two ﬁgures in the left column. The tests employ resubstitution.

State search results from the conventional Viterbi and Q‘ = [If q; =
II? arg max,, P(0.,q, | M,), i = 1,. . . , 10 in a ﬁve-state Bakis HMM
of a spoken word “four.” Note that each graph represents a different “i”

except top two ﬁgures in the left column. The tests employ resubstitution.

State search results from the conventional Viterbi and Q‘ = [1? q; =
[If arg maxq, P(0¢,qt | Mg), i = 1,... ,10 in a ﬁve-state Bakis HMM
of a spoken word “six.” Note that each graph represents a different “i”

except top two ﬁgures in the left column. The tests employ resubstitution.

State search results from the conventional Viterbi and Q" = I]? q; =
II? arg maxqt P(0t, qt I M,), i = 1,. . . , 10 in a ﬁve-state Bakis HMM
of a spoken word “one.” Note that each graph represents a different “i”

except top two ﬁgures in the left column. The tests employ leave-one-out.

State search results from the conventional Viterbi and Q" = H? q; =
[If argmaxq, P(0¢,qt I Mi), i = 1,... ,10 in a ﬁve-state Bakis HMM
of a spoken word “two.” Note that each graph represents a different “i”

except top two ﬁgures in the left column. The tests employ leave-one-out.

State search results from the conventional Viterbi and Q" = I]? q; =
[If arg maxq, P(0., qt I M,), i = 1,. . . , 10 in a ﬁve-state Bakis HMM
of a Spoken word “four.” Note that each graph represents a different

“i” except top two ﬁgures in the left column. The tests employ leave-
one—out. ..................................

ix

89

90

91

92

93

94

4.8

4.9

4.10

4.11

4.12

4.13

4.14
4.15
4.16
4.17
4.18
4.19
4.20
4.21

4.22

State search results from the conventional Viterbi and Q‘ = ,Tq’,‘ =
[If arg maxq, P(0t, q; | M;), i = 1, . . . , 10 in a ﬁve-state Bakis HMM
of a spoken word “six.” Note that each graph represents a different “i”
except top two ﬁgures in the left column. The tests employ leave-oneout. 96
State segmentation resulting from conventional ML (Viterbi) training
of a ﬁve-state Bakis HMM for the utterance “six.” The resulting seg-
mentation is not coherent with the physical dynamics of the speech. . 100
Euclidean distances between four different symbols (symbol “zero,”
“32,” “64,” and “96”) and the rest of symbols indexed along the ab-
scissa in the codebook. Each symbol represents a centroid of the cluster
in the feature vector Space ......................... 105
State sequence for the spoken word “six” by the Viterbi search tech-
nique of a ﬁve-state Bakis HMM. Five different sets of initial values
have been assigned to A and B ...................... 112
State sequences for the Spoken word “four” in a ﬁve-state Bakis HMM.
The second ﬁgure is the consequence of the conventional Viterbi search.
The third ﬁgure is the result of the recursive Viterbi search based on
k-means clustering. The 4‘“ graph is the conventional Viterbi search
result based on the third graph. ..................... 114
State sequences for the spoken word “six” in a ﬁve-state Bakis HMM.
The second ﬁgure is the consequence of the conventional Viterbi search.
The third ﬁgure is the result of the recursive Viterbi search based on
k-means clustering. The 4‘“ graph is the conventional Viterbi search
result based on the third graph. ..................... 115
Davis-Bouldin relative index for ﬁfteen utterances of spoken word “one.” 121
Davis-Bouldin relative index for ﬁfteen utterances of spoken word “two.” 122
Davis-Bouldin relative index for ﬁfteen utterances of spoken word “four.” 123
Davis-Bouldin relative index for ﬁfteen utterances of spoken word “six.” 124
State search results for word “four” using the recursive Viterbi search

based on k-means clustering with different initial clusters ........ 126
State search results for word “six” using the recursive Viterbi search
based on k-means clustering with different initial clusters ........ 127
Some informative results regarding state segmentation by the SM tech-
nique for the word “six.” ......................... 134
State segmentation result by the SM theory with various values of a
for word “four.” .............................. 136

State segmentation result by the SM theory with various values of a
for word “six.”

Chapter 1

Introduction

1.1 Background

Speech is the most natural way of transferring information among human beings.
Speech recognition by a machine (e.g., a computer) is a way to translate human speech
into corresponding text so that a machine can perform productive work automatically
according to human speech inputs. It has many applications such as word dictation,
voice activated dialing, automated attendant, command control of a machine, and
so forth. The eventual goal of speech processing in engineering is to make machines
understand human speech as naturally as humans communicate with one another.

While human communication is natural and easy because of the extraordinary
capability of the human brain, speech recognition by a machine is not straightforward
in Spite of the remarkable deve10pment of computer technology.

There are some practical reasons why processing of speech signals is challenging.
For example, the same phoneme, when spoken by different speakers, will be acous-
tically different due to variations in vocal-tract anatomy. Also, the same speaker
may produce different versions of the same sound under different circumstances, for

instance, when s/ he is suffering from a cold and when s/he is not [3]. Certain sounds

may be shortened or completely left out when the speaker talks very fast. Differences
in dialect, like phoneme deletion or phoneme substitution, also complicate the Speech
recognition process. Other problems, like speech related noise (e.g., lips smacks and
tongue clicks), add to the difﬁculty of the recognition systems. It is obvious that
without some simplifying assumptions the task of modeling speech for recognition
would be highly impractical.

The hidden Markov model (HMM) is a popular technique in many contempo-
rary applications in signal processing, communications, and control. In particular,
the HMM has been used to successfully and automatically cope with acoustic uncer—
tainty in speech signal applications [4] using the statistical modeling. For example,
achieving a ﬂexible model for rapidly time-varying signals using a dynamic program-
ming technique [4] is very difﬁcult. The popularity of the HMM is due to its simplicity,
compactness, and easy implementation. Along with the dynamic time warping tech-
nique (DTW), the HMM has been applied to Speech recognition systems for many
years. In particular, this technique can be globally applied to a large and complex
speech recognition system [4].

Although the HMM has been widely researched and extensively applied to the
speech processing ﬁeld, it is true that it lacks diverse formulations so that the HMM
can be exploited more efﬁciently depending on speciﬁc applications. This is because of
its inherent time-varying nature as well as the somewhat complex recursive structure
associated with sophisticated model. HMM research has focused primarily on its

application rather than its fundamental characteristics.

1.2 History of the Vector-Matrix Formulations of
the HMMs

The conventional Baum-Welch algorithm, which is also called the Forward-Backward
(F—B) algorithm [1, 2, 4], is, a concise, and compact representation of the HMM
dynamics for explaining quickly time-varying speech signals efﬁciently. Whether any-
path or a bestpath (Viterbi path) [4] is considered for evaluation/ training of the
HMM, the conventional F-B approach of the HMM is based on a series of scalar
recursion [1, 2, 4]. Also, due to the model characteristics associated with stochas-
tic time—varying signals, the F—B HMM inherently does not have such flexibility and
applications as a linear time-invariant model which has stationary parameters repre-
senting a model.

Work cited in [6, 7, 8, 27, 30, 31, 32] represents several independent approaches
that take advantage of vector-matrix formulations of the HMM. Vector-matrix rep-
resentations of the HMM provide a diverse and uniﬁed way to interpret the HMM
Operation. In recent paper, by Turin and Karan [27, 31], matrix HMM formulations
have been exploited to ﬁnd useful algorithms for speech recognition technology. In
addition, Hjalmarsson et al. [30] have used a state-space formulation of HMM to ﬁnd
non-recursive formulae for training the HMM. Similarly, in work by Elliott et al. [32],
a state-space formulation of HMM has been used for estimation and control. Ex-
cept for the work conducted in the author’s laboratory [6, 7, 8], these vector-matrix
formulations are all based on the F—B HMM. Vector-matrix formulations of the time—
invariant approximation (TIA) of the HMM, which is different from the conventional
F—B HMM, were ﬁrst proposed by Snider and Deller [6, 7].

Turin [27] has proposed vector-matrix representations of the HMM to allow paral-
lel computing to achieve some computational savings during training and evaluation.

In particular, he uses vector-matrix formulations to obtain a more computationally

efﬁcient algorithm when speech signals satisfy a certain condition.

Karan et al. [31] have used a matrix formulation for the algorithm proposed by
Streit [29] in computing the eventual moments, deﬁned1 as Mj,,-(lc, T) = E{Pj(0t)'°} =
20,]. P,~(0T) PJ-(OT)’°, to measure moments of the output sequence probabilities of the
HMM M,- with respect to M;. Here M = {N , M, A,B,1r} is a set of parameter
matrices deﬁning the characteristics of the HMM with a state-transition matrix A,
observation probability matrix B, as well as the initial state distribution matrix 1r
and the N and M related to the sizes of the matrices. The evaluation Mj,i(k, T) pro-
posed in [29] uses vector-matrix descriptions to overcome a computational difﬁculty
by simplifying the recursive scalar computations which have a similar formulation
to the a posteriori probabilities [1, 2] P(0,q¢ = i I M) in the F-B HMM. Such
computational savings are possible due to the asymptotic analysis of the dynamics of
state-space equations.

Elliott et al. [32] suggest a unique state-space model for the two stochastic vari-
ables, state and observation, leading to a independent identically distributed (i.i.d)
observation process through a change of probability measure. By forming such an
ideal distribution, it is possible to obtain several key results related to state estimator
by applying the Fubini theorem which allows interchange of expectation and summa-
tion in the product measure Space. This technique has shown the capability of the
state—space structure of the HMM in state estimation and control.

Snider and Deller [6, 7, 8] adept a simpliﬁed probability likelihood measure
113;, P(0t) to allow a more compact and analyzable approach to evaluate the HMM
likelihood, P(0 | M). Depending on the circumstances, it is possible to control the
compression index, the ratio of the number of eigenvalues merged to the total number
of eigenvalues in all the HMMS, for trade-off between speech recognition rate and

Speed and memory complexity requirements.

 

1Precise deﬁnitions of notations are established in Chapter 2.

4

1.3 Problems of Existing Vector-Matrix Formula-

tions of the HMM

In spite of useful results inherent in the vector-matrix and state-Space approaches to
the HMM cited above, Open issues remain.

First, in [27], to obtain a computationally economical formulation with a vector-
matrix formulations of the PB HMM, it is assumed that long stretches of the same
symbol string occur within a Speech utterance. In fact, such a condition on a speech
signal is very helpful to have more computational savings in training and testing
of the HMM in reality. For instance, for a very limited small scale system which
has a small vocabulary as well as a small number of symbols with simple waveform
structures, 'Ihrin’s condition on signals is justiﬁed; therefore, further computational
savings can be attained without compromising recognition performance. In addition,
under the very unusual circumstance that the speaker is restricted in the number of
sounds s/he can reliably produce, Turin’s condition is valid even with a relatively
large vocabulary.

However, in reality ’I‘urin’s condition on signals is not ubiquitous in the quickly
time-varying speech signals. In practice, even a word model which may have as many
as 128 symbols after being quantized for the purpose of efﬁcient, secure storage and
transmission, does not usually adhere to Turin’s condition very well. Also, for a
large scale system with a large vocabulary and complex speech waveform structures,
it is unusual to ﬁnd that Turin’s condition is met. Even for a sentence model or a
compound model which is composed of concatenating of phones or word HMMS, it is
not easy to argue in support of Turin’s condition on speech signals. In other words,
there is a limitation in applying ’I‘urin’s algorithms to the general speech signal ap-
plications. When speech signals do satisfy 'I‘urin’s condition, however, computational

savings can be obtained. This will be discussed in detail later in the thesis.

The work Of Elliott et al. [32] is mainly focused on ﬁnding an Optimal estimation
algorithm from Observed signals to reveal the originating signals transmitted in a
noisy environment. Further, Elliott’s derivations are focused mostly on the estimation
Of states and unknown parameters without a speciﬁc procedure for the likelihood
evaluation Of the HMM. This is a signiﬁcant derivations from HMM modeling and
use in speech recognition.

Snider and Deller report empirically useful results in terms Of performance and
computation savings, but Offer little discussion of the general viability of the modiﬁed
likelihood [LT=1 P(Ot). Such a likelihood measure is not identical to P(0 | M) Of the
F—B HMM because Of the potential for including extra likelihoods from illegal state

paths [4, 7, 25]. A brief explanation about this issue is discussed in [4].

1.4 Objectives

In this research, the focus is on the use Of HMMs to model the acoustic process at the
lowest levels Of a Speech recognizer. First, the conventional HMM with a state-space
structure will be reformulated leading to a versatile computational structure with rich
interpretability. In particular, a TIA HMM suggested in [6, 7, 8] will be a main focus
Of this dissertation. The viability Of the TIA HMM will be shown through several
formal approaches. Such time-invariant formulations and corresponding likelihood
measures will be argued tO be proper approximations Of the conventional F-B HMM.

This thesis is composed of ﬁve chapters. The present chapter is a short introduc-
tion to, and background of, this research. The second chapter introduces time-varying
and time-invariant state-Space models of the HMM. Vector and matrix formulation
notations are used to describe the three fundamental problems of the HMM. In partic-
ular, the “illegal state path problem” inherent in the TIA HMM is brieﬂy discussed.

The third chapter deals intensively with the problem of illegal state paths produced

by the TIA HMM. A few evolving techniques that reconcile the conventional F-B
HMM to the TIA HMM are discussed. Chapter 4 is focused on the problem Of
ﬁnding an apprOpriate state sequence in some sense for a given speech signal. New
state-search techniques using the maximum likelihood criterion, clustering, and novel
set-membership identiﬁcation techniques are develOped for HMM training. The re—
sults of these search techniques are compared to those Of the conventional Viterbi
search. The ﬁnal chapter, Chapter 5, presents research conclusions and future re-
search directions.

In this research, theoretical results are applied mainly to the isolated digit recogni-
tion problem, one Of the classical problems Of speech recognition. For the experimental
studies in this research, input speech, which is uttered by an American male speaker,
is sampled at a rate of 10kHz. More details Of this Speech corpus is described in
Chapter 3.

Because Of the non-stationary nature of speech utterance, the acoustic feature
extraction is performed on sampled data on a frame-by-frame basis. Hamming window
analysis is applied to each frame, all Of which are 25ms long with a 15ms overlap. Then
10‘“ order mel-frequency cepstral coefﬁcients are computed. This produces a sequence
of cepstral speech vectors, or as it is usually called, a speech pattern. These speech
patterns are classiﬁed based on the seven level clusters so that each speech pattern
is represented by 128 symbols. It has been implicitly assumed that the given speech
signals are free from all background noise so that we can concentrate exclusively on

the main modeling problem.

Chapter 2

Vector-Matrix Formulations of the

HMM

In this chapter, we ﬁrst review brieﬂy the HMM theory to support new derivations
based on the conventional mathematical formulations. Three basic problems Of the
HMM, evaluation, estimation (training), and decoding, will be introduced. Then,
these basic HMM problems will be reformulated in vector-matrix notation. The con—
ventional F-B algorithm as well as the Viterbi search algorithm will also be subsumed
under this vector-matrix formulation. Third, the TIA HMM prOposed by Snider and
Deller [6, 7, 8] will be discussed, followed by the transformation of the TIA and F-B
HMM formulations. Next, some useful characteristics of the HMM discovered using
the vector-matrix formulations will be discussed. They give a framework in which to
take advantage of the TIA HMM. Fourth, it will be shown analytically with several
approaches that such an approximate approach for the HMM evaluation is proper
under some practical conditions. Finally, the “illegal path problem” inherent in such
an approximation will be brieﬂy discussed. This apparent defect Of the TIA HMM

will be treated extensively in Chapter 3.

2.1 HMM Background

The HMM was ﬁrst applied to Speech technology independently by Baker [21] at
Carnegie Mellon University and Jelinek at IBM in 1975 [1, 2, 3, 4]. When it was
ﬁrst published, neither was it called the HMM, nor was it develOped to model and
recognize speech signals [22, 23, 24]. However, because Of its excellent performance in
(apparently) explaining the prOperties Of highly variable signalsl, it has been broadly
used in the area Of speech signal processing.

The major capability Of the HMM lies in its ability to structure the information
content Of variable data. It also systematically translates this information into a set
Of stochastic parameters.

The HMM uses a stochastic approach to explain the characterization of speech
variability. It is used to model a doubly stochastic production process with the
transition parameters modulated by a Markov chain [1, 2]. Thus, the Observed speech
sequence2 is assumed to be the result of the interaction of two stochastic processes.

The Markovian assumption on the transition probabilities of the HMM imposes
two major constraints on the possible variations in the speech production system.
The ﬁrst constraint is a state model and the other is the dynamics Of state transi-
tions according tO the Markovian assumption. These allow a compact description
of the time-varying speech signal assumed to represent “acoustic states” of speech
production.

The HMM is a versatile model which can be used to represent a word, a subword
unit, or, in principle, a complete sentence or paragraph [4]. Figure 2.1 shows a typical

Six-state HMM with left-to-right or Bakis state transition constraints [4].

Let us now formalize the dynamics Of the HMM. Recall the deﬁnition Of a homo-

 

1This work, in part, investigates whether the HMM accurately models the physical properties of
the speech production system. (See Chapter 4.)
2Speciﬁed later in this section.

 

Figure 2.1: Six-state Bakis tOpOlogy Of the HMM.

geneous ﬁrst-order discrete-state Markov process as one which can be at one of N
states3 31,82, . . . , SN and whose state transitions are dependent only upon the most
recent state. Let us denote the state of the system in the abstract at discrete time t

by qt. We denote the stationary conditional probabilities by
“ii = PI‘It = SjI‘It—l = Si): 1 5. ii]. S M (2-1)
Additionally, let the initial state probabilities be denoted
7r,- = P(q1= Si), 1 g i g N. (2.2)

The realization of the process is a state sequence, say, {q1, q2, . . . , qT}. This process
is completely characterized by the number Of states N, the set Of state-transition
probabilities {0451'}, and the set of initial state probabilities {7r,-}.

Now consider a discrete-Observation hidden Markov process. In the discrete HMM,
an Observation sequence are assumed to be generated by jumping from state to state.
With each jump, either during the transitions (on the arc), or upon at the next state,
an Observation is emitted. At each time, an unknown state emits Observation symbol

0, = k, 1 g k S M according to the conditional distribution

bj(lc) = P(k Observation at time tIqt = 53'), 1 S j S N, 1 g k g M (2.3)

= P(0t=k|(1t=sj):

 

3By convention, integers are used to represent states as shown in Fig. 2.1.

10

where M is the number Of distinct Observation symbols. Symbol 0, is the index of
some characteristic measurement extracted from Speech, usually frame-wise. There-
fore, a speech signal is reduced to strings Of features extracted from the acoustic
speech waveform. The generated Observation sequence, denoted in the abstract by
0 = {01, 02, . . . ,OT}, is a realization Of a doubly stochastic process, is. a random
process generated by an unobservable random process. Here, T is the number Of Ob—
servations in the sequence. A model governed by such a probabilistic structure is the
hidden Markov model when the unobservable random process is a stationary Markov
process as described above.

In the remainder of the discussion, we shall use the notation M to denote the set

Of elements of an HMM, namely M = {N, M, {051'}, {bJ-(k)}, {7r,-}}.

2.2 Time-Varying Forward-Backward HMM

In this section, we review three HMM problems - evaluation, decoding, and training
(or estimation) - and reformulate the conventional F-B HMM in vector-matrix terms.
We then exploit this structure to discover interesting prOpertieS Of the HMM. We
also inherently derive some useful expressions for HMM implementation based on

state-space formulations.

2.2. 1 Evaluation Problem

The recognition or evaluation problem involves the determination of the conditional
likelihood for a given Observation string, 0, namely P(O I M). The most natural
measure Of the likelihood Of a given HMM, say M, in light Of Observation sequence
0, would be a posteriori probability P (M I 0). However, the available data will not
allow us to characterize P (M I 0) during the training process, under the condition Of

equal a priori probability P(M) among the HMMS, so it is conventional to take the a

11

posteriori probability P (0 I M) as the Observation probability measure instead [4].

Vector-Matrix Formulation of the HMM Along a Forward Path

In the F-B solution to this problem [23], the forward probabilities are deﬁned by
0i“) = P(01,02,...,0t,qt :2. I M) for Z = 1,.. .,N. (2.4)

This quantity is the joint probability Of the partial Observation sequence to time t
and residence in state i at time t, given the model M. These probabilities can be

calculated recursively as follows [4]: For each state j = 1, . . . , N; and for each t Z 1
j(t+1)= Zai(t()aj.-b(0t+1) i=1 .,N, (2.5)

where aj, and bj(0t+1) are deﬁned in (2.1) and (2.3). For the initial time, aj(1) is

deﬁned as bj(01)7rj. The ﬁnal conditional observation probability is

P(01,02, . . . ,OT I M) = £047“). (2.6)

i=1

The HMM has been develOped and studied principally through such conventional
formulations. Here, conventional formulations implies that in the evaluation and
training (explained later) required in the HMM computations, the algorithm is ba-
sically focused on individual computations Of each state as (2.5) rather than an in-
tegrated way that combines state computations. In general, because it represents
a linear, time-varying state-Space system, the HMM can be researched principally
experimentally. However, by combining state processing, it is possible to acquire sev-
eral Signiﬁcant insights into the HMM which might be difﬁcult to discover otherwise.
Once revealed, these useful characteristics Of the HMM can be applied to applications

for practical beneﬁts.

12

Generally, matrices provide convenient tools for systemizing laborious calculations
by providing a compact notation for describing complicated interrelationships among
system variables. Through vector-matrix notations, it is possible to process all HMM
states simultaneously and reveal useful properties of the HMM in the process. Let
us reformulate the three HMM problems with vector-matrix notations to provide one
important basis Of this research.

For an N—state HMM, the N recursions of (2.5) for a,(t), i = 1, . . . , N, can be

written in vector-matrix form as

 

 

 

 

 

 

 

 

{ (11(t'l'1) \ (b1(0t+1) 0 0 \
a2(t+1) _ 0 b2(0t+1) . . . 0
K (1N(t+ 1) j \ 0 0 . . . bN(0t+1) }
/ all an ... am ) ( a1(t) \
021 022 . . . 021v . (12“) , (27)
\ 0N1 0N2 . . . aNN / \ (XII/(t) )

This equation can be viewed as the state equation Of an N—State state-space model

with state variable4 01,-(t), i E [1, N]. The state equation can be used for t = 1 by

 

4Note that the “states” in this context are to denote mathematical variables with which they are
used to represent an alternative time-domain dynamics of a HMM. On the other hand, the meaning
of “state” in the context of “state model” explaining a HMM is to imply that within such a state,
a signal possesses some measurable and distinctive properties.

13

adding the input term

 

 

 

 

/ b1(01) 0 . . . O \ ( 7I'1 \
0 b2(01) . . . 0 W2
. «5(2) (2.8)
( 0 0 b~(01)) (7m)

to the right side Of (2.7), in which 6(t) is the Kronecker sequence [4, 33]. In vector-

matrix notation, we can write the complete state equation as
a(t + 1) = A(t + 1)Aa(t) + A(1)1r6(t), (2.9)

where the vector and matrix deﬁnitions are Obvious by correspondence to (2.7) and
(2.8).

The output equation for the state-space model is

I

y(t) = C a(t). (2.10)

The prime in (3'I is used to denote the matrix transpose. The only output of signiﬁ-

cance (for making a ﬁnal decision) is
P(01,02, . . . , 0T I M) = y(T) = c’a(:r), (2.11)
with matrix 0 deﬁned to be a vector of ones,

C" = 1’ = [1,1,...,1]. (2.12)

14

Analysis of HMMS with Vector-Matrix Formulations Along a Forward
Path

Note that since the probability Of making a transition to some state at each time is
always unity, then each column of A must sum to one. Accordingly, A is a column
stochastic matrix [38] which, in turn, makes it non-negative deﬁnite [44, 45]. An
important consequence Of this is that the vector 1 consisting of all ones is a left
eigenvector Of A with eigenvalue 1 [39] SO that l'A = 1'. Non-negative matrices occur
in a variety of applications [45]. Non-negativeness implies useful characteristics that
can be used to analyze the dynamics Of a model [51]. Furthermore, in the left-to-right
or Bakis HMM, which is Often employed in speech recognition, A is a lower-triangular
matrix with strong diagonal components. In this case, the eigenvalues of A are the
diagonal elements themselves. Moreover, because of its triangular structure, if all
the eigenvalues are distinct, then eigenvector matrix Of A is also triangular. Finding
the eigenstructure Of such models is therefore relatively computationally inexpensive.
The use Of this eigenstructure will be explained later in this chapter.

The vector-matrix representation (2.9) and (2.10) reveals interesting results that
are not apparent in the usual F—B recursions. Above all, it is the combination Of A(t)
and A, which comprise the effective state-transition part Of state equation, monitors
and quantiﬁes state path information. Here A has dimensions N x N, and provides all
N2 state-transition probabilities at a given time. This implicitly includes information
about whether a given state transition is possible. The premultiplication Of A by
A regulates the possible paths through the states in light Of the states’ abilities to
generate certain Observations. Non-zero values in the diagonal elements Of the A(t)
matrix allow state jumps at t depending on the locations Of those non—zero values
whereas zeros prohibit such transitions. For example, if A5,,(t) is zero, meaning that
a symbol at time t is not generated from state i, then the it” row Of A - A is also zero.

Therefore, jumps from any state to state i at time t are prohibited. Consequently,

15

A(t) can be regarded as a sort Of switching matrix at t which speciﬁes the available
transitions in accordance with the topology of A. Thus, if there is a legal path starting
from an initial state tO a ﬁnal state associated with T-duration speech utterance
0 = {01,02, . . . ,OT}, T multiplications Of matrix pairs A(t)A with t = 1, . . . ,T
produce a non-zero matrix. Such a non-zero matrix leads to non-zero likelihood with
a suitable initial state probability and a ﬁnal Observation condition.

Henceforth, the model consisting Of (2.9) and (2.10) is called the time-varying
state equation because the composite state-transition matrix, A - A, varies with time.
The entries in A effectively control the state path by prohibiting entry into a state
at time t that cannot produce symbol 0;.

By recursion, the a posteriori probability is written in terms of the matrices deﬁned

above as
P(01,02, . . . ,OT | M) = UA(T)AA(T — 1)A- - . A(2)AA(1)1r. (2.13)
Since P(01,02, . . . , Or I M) is a scalar, it can also be expressed as

P(OIM) = (C'A(T)AA(T—1)A---A(2)AA(1)1r)' (2.14)

= 1r'A(1)A'A(2)A'A(T — 1)A'A(T)C.

In fact, this is the formulation with which Turin started in deriving other matrix-based

HMM algorithms [27].

Vector-Matrix Formulation of the HMM Along Backward Path

In general, the matrix representation (2.13) provides a ﬂexible way to compute the
Observation probability through diverse state-space structures and representations for
the model. For example, let us derive a state equation that is different from (2.9)-

(2.11). To have a state-space model (2.9)-(2.11), A(t)A was considered as a state
16

variable for (2.13). Instead, let ﬂ(t) be an N-vector Of state variables for (2.14).
Deﬁne MT) = C. Because Of recursive nature of equation (2.14), an alternate state-

equation-like formulation follows immediately. Let
ﬁ(t) = A'A(t + 1)s(t + 1) (2.15)

and B (T) = C fort = T—1,T—2, . . . , 1. Then the a posteriori conditional Observation

probability is given by
P(01,02, . . . ,OT | M) = n’A(1)p(1). (2.16)

In the matrix formulation, it is not necessary to know the statistical interpretation
Of )8, whose elements are equivalent to ﬂit) in (2.17), but these quantities are rec-
ognizable as the backward probabilities in the F—B algorithm where they are deﬁned

as

ﬁi(t) = P(0t+1:0t+21---10TIQt=inli i=1:"'?N1 (2'17)

and computed recursively as

=2 a,-,-b,-(0,+1)ﬁ,-(t+ 1), (2.18)

Similarly tO (2.5).

Not surprisingly, the state-space formulation (2.15) Of state recursions, written

17

explicitly as

f .310) \ / an 021 am I

,82(t) = (112 6.22 0N2 (219)

 

 

 

 

(51%”) (am am (INN)

 

 

 

 

(bl(0t+l) 0 0 l { ,BIIt‘I'I) \
O b2(0t+1) . . . 0 . ,32(t + 1)
\ 0 0 bN(0t+1) } \5N(t+1) )

can be decomposed into the F—B backward recursions as in (2.18). The output equa-

tion complementing (2.19) is given by

W) = «'Aumm (2.20)
with the only output Of signiﬁcance (for making a ﬁnal decision) being

y(1) = P(01,02,...,0TIM)
N
= "’A(1)3(1) = Zﬂibi(01)16i(1)' (2-21)

i=1
Result (2.21) is equivalent to the likelihood computation provided by the backward
F-B recursion [1, 2, 4].
Other Vector-Matrix Formulations for the HMM

In addition to these results which are equivalent to the widely-used F-B recursions,

the matrix formulation provides a ﬂexible state-equation-like structure that serves as

18

a basis for many other computational structure. For instance, if AA(t) takes to be

the state variable rather than A(t)A in (2.13), then we Obtain a new state equation

with a state vector X (t) governed by the recursion

X(t+1) = AA(t)X(t)+1r6(t),

with output equation
:10) = P(01,02, - - . .0: l M) = C“A(Wﬂt)

and ﬁnal likelihood

y(T) = P(01,02,...,0TIM)=dA(T)X(T).

In fact,

X(t+1) = Aa(t)

where a(t) is deﬁned in (2.4).

(2.22)

(2.23)

(2.24)

(2.25)

(2.26)

The expressions above are useful in different circumstances. Usually A has full

rank and, thus, almost always has inverse .4”. However, in (2.9), A is Often singu-

lar because of its sparseness. Therefore, having A premultiplied by A in the state

equation allows transformation Of the state-space equation. Equation (2.15) provides

a representation similar to that Of (2.22) in the sense that the A, premultiplies A in

the state-transition equation. Later we Show some useful expressions and prOperties

arising from this characteristic. More novel prOperties Of HMMS will also result from

the formulation in which A is premultiplied by A in the state equation.

19

Some interpretation for the state vector X (t) is Obtained by considering one Of

its elements. From (2.22),

X.<t+ 1) =f3a..b.(o.)X.-(t). (227)

5:1

In the Moore form Of the HMM [4], a symbol is tO be produced when a new state is
entered following the transition. This fact makes it difﬁcult to give an exact prob—
abilistic interpretation Of X ,- (t + 1) as is possible for a(t) in (2.4). By inspection,
however, state variable X ,(t + 1) is similar to a(t) Of (2.5) except that X ,(t + 1)
amounts to a posteriori probability before a symbol is produced but after a state
transition has occurred. This is similar to Kalman ﬁlter update. From (2.27), an
Observation symbol is apparently generated before a state transition occurs, i.e. with
the exit Of the state at time t — 1. This does not imply that this formulation is
equivalent to the Mealy form Of the HMM [4], since the present expression is based
on a,-,- and bj parameters from the Moore form. Therefore, the deﬁnition Of B is quite
different in the Mealy form Of the HMM which is based on a model in which a symbol
is produced during the state transition, not upon arrival at a new state. However,
the usefulness Of (2.27) will be discussed later.

Thus, the state-Space formulation suggested here is a general and ﬂexible repre-
sentation Of the HMM Of which the conventional F—B HMM is a special case. The rep-
resentation embodies various interpretations and computational forms for the HMM.
There are potentially many interpretations and formulations for the HMM using the
vector-matrix form.

Consider another example formulation. Similarly to X (t), we can drive a state

space formulation for the backward computation as
Y(t) = A(t)A'Y(t + 1) + A(T)C§(T — t) (2.28)

20

y(t) = «Y(t) fort=T—1,T—2,...,1 (2.29)
and
y(T) = P(01,02,...,0T I M) = «’Y(1). (2.30)

For the new state variable Y(t), the condition Y(t) = 0 is imposed for t > T.
In terms Of the developments above, the ﬁnal likelihood can be variously repre-

sented. For example,

P(01,02,...,0T|M) = dam
= «'Aumu)
= a’(t)a(t), te{1,T}
= s’(t)a(t), te{1,T} (2.31)
= X’(t)Y(t), te{1,T}

= Y'(t)X(t), te {1,T}.

Also, letting tr(-) denote the trace Of a matrix, we can write

I

P(OIM) = tra(t)ﬂ(t))

= tr A(t)A . . . A(2)AA(1)1r(A'A(t + 1) . . . A’A(T)C)')

(

= tr(a(t)ﬁ’(t)) (2.32)
(
(

tr A(t)A...A(2)AA(1)1rC"A(T)A...A(t+1)A).

Here WC" forms an N x N matrix.

21

Interpretation of the HMM Evaluation using the Vector-Matrix Formula-

tion

As yet another example HMM formulation arising from the vector-matrix framework,
consider the Bakis HMM structure in which every path starts from a predetermined
initial state (by deﬁnition, state 1) and ﬁnishes at a ﬁnal state (state N). From

equation (2.13), the likelihood can be represented in the compact form as
P(01,02, . . . ,0, | M) = C’G(t)1r, (2.33)
where
G(t) = A(t)AA(T — 1)A - - oA(2)AA(1). (2.34)

Thus, G (t) amounts to matrix products among vector-matrix-vector multiplications
for P(O I M). We know that a matrix is a set of numbers arranged in a rectangular
grid Of rows and columns. Likewise, matrix G (t) provides an algebraic interpretation
for computing P(01,02, . . . ,0; I M) as follows: Let C“ = (0,0, . . . ,0, 1) and 1r, =
(1,0, . . . ,0) for simplicity, and let us suppose that the size Of each matrix in (2.34) is
N x N, and that G(t) is computed in advance. G(t) multiplies both vectors C and
11' for P(01,02,...,0¢ I M). Then, the computation Of P(01,02, . . .,0, I M) in
(2.33) is equivalent to choosing the (N, 1) element in G (T) according to the position
Of non-zero entries in C" and 71'. Therefore, entry g,,-(t) of G(t) amounts tO the
the sum of likelihoods Of the paths leading from state i at initial time to state 3'
at time t. Thus, the vector-matrix representation simpliﬁes the underlying meaning
Of the forward or backward computation of the HMM in a way which might not
be possible with the conventional F—B HMM algorithm. This example shows that

the vector-matrix formulation Of the HMM elucidates the likelihood computations in

22

association with state paths for signals.

2.2.2 Decoding Problem

The HMM was conceived as one for which states would represent distinct acoustic
phenomena [2, 4]. The solving the decoding problem also elucidates the structure of
the model while providing the statistical characteristics Of each state.

The state sequence Q = {q1, Q2, . . . ,q'r} corresponding tO a speech symbol string
0 = {01,02, . . .,OT} in the HMM is “hidden.” The hidden part of the HMM, a
state sequence, must be found based on some modeled way since no exact solution
exists. There are several ways of ﬁnding a state sequence for a speech signal. Among
them, the Viterbi search algorithm [57 , 58] is pOpular and recognized as an efﬁcient
way Of ﬁnding an Optimal state sequence. Here we structure the Viterbi algorithm
in the matrix notation established above and discuss the signiﬁcance of resulting
formulation.

Let

di(t+ 1) = max P(01102a' ° '10t+13q17Q22' ' ' :qt1Qt+l = Z I M): (235)

qliq21"'1qt

which implies the highest probability of a single path ending at state i, at time t+ 1.

In the similar way, let
‘I’i(t + 1) = argmqetlxp(01:022~-:0t+1:(I1:Q22---:(It2(1t+1 = i I M) (2-36)

\II,(t + 1) is the state qt at time t that leads to d,(t + 1). In these terms, the steps Of

the Viterbi algorithm are as follows:

23

o Initialization

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

612(1) 2 0 b12501) 0 . 77.2 , (2.37)
{ 1111(1) \ f 0 \
W2“) = 0 (2.38)
K ”NI” 1 K 0 /
o Recursion
Fort=2,...T,
I (11(1) ) (b,(o,) 0 0 )
32(1) _ o b,(o,) 0
( dN(t) ) ( 0 0 bN(0t) )
{ max{d1(t — 1)a11, . . . ,dN(t — 1)01N} I
x max{d1(t — 1)0.21, . . . ,dN(t — 1)0.2N} (239)
\ max{d1(t — 1)am ..... dN(t — Dan/N} )
_ 0 b,(o,) 0

 

24

/ maxlSjSN{dj(t‘1)a1j} \

maxlSjSN{dj(t — 1)a2,-}

 

 

 

 

 

 

x , (2.40)
I max15,5~{d,-(t - 1)“le /

I ‘1’1(t) \ I argmaxlgsﬂddt- Dan} \
‘1’2(t) = arg maxlstNide - 1)a2,~} (2.41)

\ ‘I’N(t) ) \ argmaXISjSN{dj(t_1)aNj} )

0 Termination
P’(01,02,...,OT | M) = max{d,~(T)}, (2.42)
q} = max{\II,-(T)} (2.43)
Fort=T—1,...1,

q? = ink-+1 (t+ 1). (2.44)

Where q; is the Optimal state at t. We can represent equations (2.37) through (2.41)

in matrix form as follows:

d(t) = A(t)max{Aod(t- 1)} (2.45)

‘1’“) = arngaX{{a.-j}1gi,jg~o{dj(t— 1)}193N} (2-46)

for t = 2, . . .T. Where d(1) = A(1)1r and 0 represents the Hadamard product [36]

25

Of the N x N matrix and the N x 1 vector deﬁned such that

{(311 CiNI {f1\ {Cllfl CleN\

021 C2N 0 f2 = C21f1 C2NfN . (2.47)

 

 

 

 

 

 

\CNI CNN) {far} (CNifi CNNfN)

The likelihood in (2.42) can be represented as
P'(0 l M) = I|d(T)||1- (2-48)

The back substitution process to ﬁnd the Optimal state path is the same as equation
(2.44).

The matrix formulation Of the Viterbi search algorithm provides a compact repre-
sentation Of the search algorithm. Also, such formulation makes it easy to implement

the search algorithm. In MATLAB, for example, we can get very compact and sim-

pliﬁed code.

2.2.3 Training (Estimation) Problem

The training problem concerns how to estimate the elements of M so as tO best
describe 0. This problem is Often solved in a maximum likelihood (ML) framework.
To estimate the model parameters given an Observation sequence 0, the quantity
P(OIM) is Optimized. Using an iterative procedure, the model parameter set M
is reestimated to maximize P(OIM). There are two widely-used algorithms for this
optimization problem, the F-B reestimation algorithm (iterative update and improve—
ment) and the Viterbi training algorithm [4, 57, 58]. The Viterbi algorithm has been
shown to converge to a prOper characterization Of the underlying Observations [17 , 19],

and has been found to yield models with comparable performance to those trained

26

by F—B reestimation [20]. Further, the Viterbi approach is more computationally
efﬁcient than the PB procedure [4].

A matrix representation for the F—B training algorithms is described in [27]. How-
ever, an alternative and more straightforward formulation is presented here.

The 7 variable [1, 2], a key component in the HMM training, is ﬁrst represented
in a vector-matrix formulation. 73-,(t) is the probability of a path being in state i at
time t and making a transition tO state j at time t + 1 given 0 and M. Thus, we

have

 

’in(t) = P(‘It =iaQt+1 = LI 0,M) (2.49)
a,-(t)aj,b-(Ot)ﬂ-(t 'I' 1) _ _
P(JOINJI) , t—1,...,T 1 (2.50)

for each j and i, where a, b, 01,6 are deﬁned as (2.1), (2.3), (2.4), and (2.17). Let us

deﬁne the matrix

 

 

 

(7110:) 71203) ’YINIt)
I‘ (t) = 721“) 722.03) ’72N(t) . (2.51)
(7111103) 7N2“) ’YNNU) I
Then by (2.50),
I ﬂ1(t+ 1) 0 0 l
1 0 @(t+1) 0
r“) — P(o | M)
\ 0 0 ﬂN(t+l)/

27

 

 

 

 

 

(01(00 0 0 \ / all (112 am \
0 b2(03) . . . 0 (1.21 0,22 . . . 021v
X . X
K 0 0 bN(0t) / \ 0N1 0N2 aNN /
01(t) 0 0 \
x (I ‘12.“) " (I (2.52)
( 0 0 aN(t) )

 

 

 

 

 

( ,81(t+ 1) 0 0 l
1 0 ﬂ2(t + 1) 0
= P(O I M) ' 3
\ 0 0 ,3N(t+1) }
/ a1(t) 0 0 l
0 (12(t) . . . 0
A(t)A , t = 1,. ,T — l (2.53)
K 0 0 . . . UN“) )
= dias(13(t + 01);] $1023) -dias(a(t)), t: 1 ,T 1, (2.54)
where
( P1 0 ... 0 \
A 0 0
diag(p) = : p2 : : (2-55)

 

 

for any row or column vector p.

28

Then, if 7,-(t) (note single subscript) is deﬁned as in [40],

 

 

 

 

mlt) = P(qt =i I QM), (2-56)
we have
{ '71“) \ { 01(t)51(t) \
_ 72(t) _ 1 02(t)32(t) _ _
7(t) — : —- __(a'(t),8(t)) : , t — 1, . . . ,T 1 (2.57)
\7N(t) ) \ CYNUWNU) )
= r’(t).1 (2.58)

assuming that a'(t)ﬂ(t) 96 0. This equation holds for any t E [1, T — 1].
In terms of variables deﬁned above, the reestimation formula for the state-

transition matrix A becomes

 

 

 

 

( (71(t) 0 0 “—1
A = Tim) M (_) 72°“) (.J (2.59)
( ( 0 0 7N(t) ) )
= r(t) gamma») (2.60)

in which 23:11 denotes a element-by—element conventional matrix summation. Here
A denotes estimated value of A.

To reestimate the B matrix, let V = {Ukj} be an K x N matrix such that

V = { Z 71(t)}15kst<,195~° (2'61)

t 8.1:. O¢=k

Then, with the matrix deﬁned above, the reestimation formula for the observation

29

matrix B becomes

{ {71(t) 0 0 )\_l

B = Vx 2T: 9 72°“) 9 (2.62)

 

 

 

 

( (0 0 ...ry~(t)))
= Vx (édiaghun) . (2.63)

As a consequence, we have a compact expressions of its training algorithm with

matrix formulations.

2.3 Time-Invariant Approximation for the HMM

A vector—matrix formulation of the RB HMM is posed in a state-space formulation
with suitable state variables, state-transition matrix, and output equation. In mod-
eling terms, the state—space system of the F—B HMM is linear, but time-varying.
However, because the state-transition matrix is time varying due to varying obser—
vation symbol probabilities, unlike the linear time-invariant model, it is not easy to
transform the F-B HMM expressions and to derive other representations for the HMM
which may be useful to ﬁnd various techniques tosolve the HMM problems for some
Speech applications.

An approximate time-invariant model for the time-varying state-space F—B HMM
with more potential for application is presented. Since theoretically it is not possible
to make the time-varying F—B HMM and TIA HMM identical, a revised formulation
with different state variables and likelihood measurements needs to be posed for the
approximation. There are several ways to pose such an approximation. Here, we study

one approximation method based on a state-Space formulation which was develOped

30

in the author’s laboratory by Snider and Deller [4, 6, 7].

The original motivation of the technique prOposed in [6, 7] was to decrease the
computational load in evaluating the HMMs. However, here we will show that such a
derivation is useful not only for the computational aspect, but also for an reasonable
approximation of the likelihood P(O | M) computed using the time-invariant state-
space model. Practically, in the HMM, we can approximate a posterion' probability

using the state equation as below:
P = P(o1 lM)P(02 | M) - . - P(OT | M) ~ P(01,02, . . .,OT | M).(2.64)

The validity of this approximation will be discussed in this and the next chapter.
As [6, 7, 8], we assume that there is a model M with N states q,-,z' = 1,2,. . ., N

and M discrete observation symbols k, k = 1,2, . . . , M. At each observation time

t, we deﬁne the state probability vector a:(t), and the observation probability vector,

y(t) , as follows:

a:(t) é (z,(t),x2(t),...,xN(t)) (2.65)

(Mt), y2(t), - - - ,ym(t)) (2-66)

Q
A
H
V
II

where, 2,-(t) is the probability of being in q,- at discrete time t given the model M,
P(q, at t | M), and yk(t) is the probability of generating symbol k at discrete time t
given the model M, P(k at t I M).

In these terms, the dynamics of the HMM are as follows:

a:(t + 1) = Am(t) + u(t)6(t) (2.67)
y(t) = Ba:(t) (2.68)
13w l M) = [[1 Hot | M) = 1311101“) (2.69)

31

where, A is the N x N state-transition matrix associated with the HMM whose (75, j )
element, a,,- = P(qj at t+1 | q,- at t) for any t; B is the M x N observation probability
matrix whose (k, j) element, bk, = P(k | qj); and 11(0) is some vector such that when
2(0) is deﬁned as zero, 3(1) takes the prOper initial values, with u(t) arbitrary but
ﬁnite for all t 75 0, and 6 (t) is the Kronecker sequence. yo, (t) corresponds to the
It“ element of vector y(t). Here It is the symbol realized by 0,. 13(0 | M) is the
likelihood explained in the following.

The expressions (2.67)-(2.69) are not equivalent to (2.9)-(2.11) since the deﬁnitions
of the state variables as well as the likelihoods from both sets of equations are different.
State variable a(t) is the joint probability of the partial observation sequence from
an initial time to time t. However, 2(t) is the probability of being in states at time t.
Therefore, the likelihood P(01,02, . . . ,OT | M) which needs to be evaluated from
an initial time to a speciﬁc time t cannot be expressed by state variables a:(t +1) and
y(t) without an independence condition that will be explained.

For reference, because A is a stochastic matrix, a:(t) is a positive vector. In the
Bakis structure of the HMM which is often employed in speech recognition, except a
few initial values of t, m, (t) 76 0 for any 2'. This implies that the system can be any
one of states [1, N] regardless of the preceding state.

From (2.67) and (2.68), we have

a:(t) = A‘"1u(0), (2.70)

y(t) = BA"1u(O). (2.71)

These two probability values can be used to compute the state and observation prob-

abilities at any time t 6 [1, T]. The observation probability at time t for a observation

32

symbol 0, is

P(Ot l M) = yo.(t)
= (b1(0t), b2(0t), . . . , (IA/(00) ' 23“)
= b1(0t)$1(t) + b2(0¢)$2(t) + . . . + bN(0t)$N(t) (2.72)

= 1’A(t)m(t).
Here 570,) = (51(ot), 52(ot), . . . , bN(0,)). Note that
b’(0,) = 1’A(t). (2.73)

Horn (2.72), it is seen that the probability for a given observation at time t can be

computed using state equation (2.70) and observation equation (2.71).

2.4 Transformations of State Equations

One of the merits of using the state-space structure develOped above is that it admits
the transformation of the state and observation equations into alternative formula-

tions. Let us discuss this subject for the conventional F-B HMM and the TIA HMM.

2.4.1 Transformation of Time-Invariant State Equation

The original motivation for the time-invariant state equation was that the state-
transition matrix could be diagonalized to signiﬁcantly reduce the number of ﬂoating
point Operations required to compute the HMM likelihood5 [4, 6, 7].

To diagonalize the state-transition matrix, let a:(t) = M z(t). Where M is diag-

onalizing transformation on the state space. M is a square matrix with dimension

 

5In light of the remarks by Mitchell et al. [25], further discussion of this model appears in [4].

33

N x N and z(t) is a new state variable deﬁned as M ‘la:(t) assuming that M "1
exists. If A does not have distinct eigenvalues other than zero, then M ‘1 does not
exist. However, in the practical HMM application in speech processing, a speech
signal is modeled as the result of random processes. A holds such variable transition
information of random processes. Numerically, this leads that mostly the entries of
A are different from each other. For example, the same phoneme spoken by different
speakers will be acoustically different. Also, the same speaker may produce different
versions of the same sound under different circumstances. Numerically, this leads
that mostly the entries of A are quite different from each other and they do not have
speciﬁc patterns such as singularity for the matrix for instance. Numerically, such
diversity of realization of random processes for speech signals justiﬁes assuming the
existence of M ’1 under a suitable size of number of state in the state model.

Equation (2.67) and (2.68) yield

Mz(t+1) = AMz(t)+u(t)6(t) (2.74)

y(t) = BMz(t). (2.75)

Diagonal dominance provides a relatively simple criterion for guaranteeing the
nonsingularity of a matrix. An N x N real or complex matrix A is diagonally dominant
if law] 2 23,-,2,- Iaijl,i = 1, . . .,N. A is also strictly diagonally dominant if strict
inequality holds. If A is strictly diagonally dominant, then A is nonsingular [37].
Since A is nonsingular, a matrix M, that is composed of a set of eigenvectors of A,

is nonsingular. Therefore, M ’1 exists. From (2.74),

z(t+1) = M“AMz(t)+M‘lu(t)6(t) (2.76)

y(t) BMz(t). (2.77)

34

Now suppose that M = PU, where P is the usual matrix of normalized eigenvectors
and U is a special diagonal matrix such that the 2"" element of the vector U '1 is
the reciprocal of the 2"” element of the vector P’1u(0). As a consequence of this
operation, each element of the excitation vector, M ‘1u(t), is unity at time zero.

It follows that

z(t + 1) = U"1P‘1APUz(t) + U‘1P_lu(t)6(t) (2.78)
= Az(t) + ﬁ(t)6(t) (2.79)
y(t) = BPUz(t) = Bz(t) (2.80)

where A = U‘IP'IAPU is a diagonal matrix, ﬁ(t) = U"1P“lu(t), and B = BP.
This result is signiﬁcant because it separates all states into independent computations.
Furthermore, this property provides a way to combine all HMMS in the system into

one large state-space formulation [6, 7]. Moreover, as in (2.70) and (2.71),

z(t) = A 17(0), (2.81)

y(t) = BAHam) (2.82)

where A“ is computed easily.

In the HMM application to speech modeling, the Bakis condition is generally
assumed, and, thus, A is a triangular matrix. Therefore, when there are K HMMS
which need to be evaluated, it is necessary to compute eigensystems of K models.
However, under the Bakis condition, all the eigenvalues are located on the diagonal
positions of A. Therefore, it is not necessary to compute eigenvalues. Furthermore,
because of the strong diagonal prOperty of A, it is highly probable that there are
quite a few cases across which eigenvalues can be shared. Hence practically we do

not need to compute all the eigenvectors of K systems. Thus, computational load to

35

computing eigenvalues and eigenvectors of A among models can be possibly lessened

by suitable preprocessing which examines the diagonal components of state-transition

matrices across models.

2.4.2 Transformation of the Time-Varying State Equation

Let us return to the case of the time-varying state equations,

a(t + 1) = A(t + 1)Aa(t) + A(1)1r6(t)

y(t) P(01,02, . . .,0. | M) = C'a(t).

(2.83)

(2.84)

Let A and P denote the eigenvalue and eigenvector matrices of A respectively,

AP = PA.

Since A is nonsingular, P is nonsingular. Therefore

a(t + 1) = A(t+1)(PAP"1)a(t)+ A(1)1r6(t)

= A(t+1)PAP“1a(t)+ A(1)1r6(t)

P'1a(t+1) = P-1A(t+1)PAP'1a(t)+P—1A(1)7r6(t).

Now let

P‘la(t) = 6(t).

Then

a(t + 1) = (P"A(t+1)PA)a(t) + (P’1A(1)1r)6(t)

y(t) = P(01,02,...,0tIM)=C"P6:(T).

36

(2.85)

(2.86)

(2.87)

(2.88)

(2.89)

(2.90)

Since P is an eigenvector matrix of A, this expression can be represented as follows:

a(t + 1) = P"1(APA(t + 1) — APA(t + 1) + A(t + 1)AP)d(t)

+(P“A(1)7r)5(t)

= P“APA(t+1)a(t) + P“1(A(t+1)AP — APA(t+1))c‘i(t)
+(P’1A(1)7r)5(t)

= AA(t + 1)a(t) + P‘1(A(t+ 1)PA — PA(t+1)A)Ez(t) (2.91)
+(P“A(1)1r)6(t)

= AA(t + 1)d(t) + P‘1(A(t + 1)P — PA(t+1))Ad(t)
+(P"A(l)7r)6(t)

= AA(t + 1)6:(t) + (P‘1A(t + 1)P — A(t + 1))Aa(t)

+(P“A(1)vr)6(t),

where AA(t + 1) is a diagonal matrix. To obtain a diagonalized state equation from

(2.91), we need to have
P-1A(t+1)P = A(t + 1).
Or equivalently,
A(t+1)P = PA(t+1). (2.92)

If P has N distinct eigenvalues, the necessary and sufﬁcient condition to satisfy the
commutativity (2.92) is that all the eigenvectors of P should be same as those of
A(t + 1) for all t [37]. However, P is not an eigenvector matrix of A(t + 1) but
of A. Therefore, P‘1A(t + 1)P aé A(t + 1) in general. Furthermore, A(t) is time

Varying. Thus, a constant P which satisﬁes (2.92) for every t does not exist in

37

general. Therefore, there is no universal eigenvector matrix P which diagonalizes the
time-varying F-B HMM.

In addition, consider the case in which the eigenvector matrix P changes with
time, depending on A(t). Let P, denote a eigenvector matrix of A(t)A. Then it

follows that

a(t + 1) = A(t+1)Aa(t)+ A(1)1r6(t) (2.93)
Pzﬁlau + 1) = P:.31(A(t + 1)A)Pt+1P:.31a(t) + P::1A(1)vr6(t) (2.94)
a(t + 1) = A(t + 1)AP;+1,P,P;1a(t)+ P;.},A(1)1r6(t)

= —A(t + 1)A(Pt—+11Pt)a(t) + P1331(A(1)1r)6(t) (2-95)

if P;1 exists for all t, where d(t + 1) = P3100: +1), and Pt'ﬁ1(A(t + 1)A)Pt+1 =
W. However, due to the sparseness of A(t), A(t)A is singular most of the
time and Pfl does not exist at these times. Additionally, even if Pfl exists for all t,
it is necessary to compute P, for each t, resulting in no computational beneﬁt from
the matrix diagonalization. Moreover, due to the fact that P,— +11Pt aé I in general,
(2.95) implies that it is not possible to obtain a diagonalized state equation using the

formulation above.

2.5 Analysis of Illegal Paths Caused by Approxi-

mation

In this section, we discuss the problem caused by the approximation of the PB time-
varying HMM by the TIA HMM. This issue was ﬁrst noted by Mitchell et al. [25]
following the publication of the original TIA HMM paper [7].

38

2.5.1 Likelihood Difference

We ﬁrst discuss the relationship between P of the TIA HMM and P of the PB HMM.
Those likelihood are signiﬁcant for HMM evaluation in speech recognition.

As a matter of fact, the likelihood measure, P(O | M) = {:1 P(Ot | M) em-
ployed with state equations (2.67 )-(2.69) is not linearly related to the a posteriori joint
probability P(O | M) = P(01,02, . . . ,OT | M) used on the F-B HMM approach.
For example, suppose there are three symbols {01, 02,03} in an observation string

at times t = 1, 2, 3, respectively. Then by the “chain rule” of conditional probability,
P(01,02,03 | M) = P(01 l M)P(02 | 01,M)P(03 | 02,01,M) (2.96)
which, if and only if 01,02, and 03 are conditionally independent“, can be written as

P(01,02,O3 | M) = P(Ol I M)P(02 I M)P(03 | M) (2.97)

~

= P(01,02,03 ] M). (2.98)

In this case, a time—invariant state equation can be applied to compute the “F-B”
a posteriori probability P(O | M). However, the symbol occurrences are generally
dependent upon the states in the HMM. The inequality of P(O | M) and P(O | M)
caused by the assumption of independence among symbols without consideration of
hidden state dependency was initially noted in [25], where a simple counter-example
using a two-state HMM can be found7.

To examine the differences in P and I3 in more detail, consider a model with two

states. From the F-B matrix formulation (2.13),

P(01,02, 03 | M) = C’A(3)AA(2)AA(1)1r (2.99)

 

6The dependent conditioning information being the state value.
7In light of the remarks by Mitchell et al. [25], further discussion of this model appears in [4].

39

. b1(03) 0 an 612

= C
0 02(03) G21 G22
)
x “02) 0 a” a” (2.100)
0 b2(02)) G21 022
01(01) 0 \ 71'1

0 52(01) ) W2

 

Assume a left-to—right (Bakis) model so that (712 = 0,U = (0,1) and 1r' = (1,0).
Then,

P(01,02,03 IM) = b2(03)a21b1(02)a11b1(01)

+ b2(03)a22b2(02)a21b1 (01). (2.101)
On the other hand, the product of individual observation probabilities is

. .,(3) ) . 21(2)
P(03 | M)P(02 | M)P(01 I M) = 1 A(3) 1 A(2)

272(3)) 222(2)
, 171(1) \
1 A(l) (2.102)

182(1)) I
b1(03)al1b1(02)aubl(01)
51(03)Gf1b2(02)02151(01)
b2(03)a21a11b1(02)a11b1(01) (2.103)
52(03)a2ia11b2(02)021b1(01)
b2(03)a22a21b1(02)a11b1(01)

+ b2(03)a22a21b2(02)a21b1(01).

 

++++

Therefore, P(O | M) involves extra cross terms which can be regarded as resulting

40

from one or more “illegal” state paths. Since all the terms of each P(Ot I M) are
multiplied together in computing 11;, P(Ot I M), this technique has been called an
“anypath” method [4].

Although different from the F—B HMM likelihood, the “time-wise” P(O I M)
probability has been useful in discovering new aspects of HMMs in this work. More-
over, the accompanying state equation is advantageous in that the resulting HMMS
can be implemented to perform fast processing in real application with fewer re-
sources than with the F-B HMM [6, 7, 25]. It is well-known that the training and
evaluation of F-B HMMS are computationally very demanding [66]. To decrease the
computational complexity, a few techniques have been prOposed using vector-matrix
formulations [27, 30]. However, since the proposed techniques are based on the time-
varying F-B HMM, there is a limitation to the possible decrease in computational
complexity. In spite of the apparent weakness of permitting illegal paths, the TIA

HMM is a useful and effective model as we discuss later in this work.

2.5.2 Comparison of the State-Transition Matrices

Let us examine the state transitions of both F—B HMM and TIA HMM in more detail.
Next, we discuss the discrepancy of the role of state-transition matrices of each model
from the point of how to constitute available state paths of a speech utterance.
Since A premultiplies the matrix A in the time-varying state—space HMM, diag-
onal elements of A multiply the corresponding rows of A. To examine the dynamics
of the AA matrix of the time-varying state equation, consider two two-state Bakis

models and a test string 0 = (01,02, . . .,OT}. At times t and t+ 1,

a(t) = 121(0.) 0 a” a” a(t—1) (2.104)

0 b2(03) G21 0'22

41

a(t+1) = MOM) 0 “11“” a(t) (2.105)

0 b2(0t+1) 021 022

_ 51(0t+1) 0 all 012
0 b2(0t+1) 021 022
b 0 0
1( t) a“ a” a(t—1). (2.106)

0 b2(0t) 021 022

Due to the Bakis condition, an = 0 and only transitions from state 1 to 2, 1 to 1, and
2 to 2 are legal. However, these legal jumps are also controlled by the probabilities
in A(t) and A(t + 1). Let b2(0t) = b1(03+1) = 1 and b1(0¢) = b2(0t+1) = 0 for

instance. From these assumptions,

all 0 O 0
a(t + 1) = a(t — 1) (2.107)
0 0 G21 022
= a(t — 1). (2.108)
0 0

Therefore, the likelihood P(O I M) becomes zero. This is from the fact that by A(t),
the observation at t can be generated from state 2 and by A(t + 1), the observation
at t+ 1 can be generated from state 1, but once an observation is generated by
the second state, the path cannot return to state 1. Hence the matrix sequence
A(t)A,t = 1, . . .T inherently determines the allowable state paths depending on the
elements of the observation string.

For the time-invariant model of (2.67) and (2.68), however, the computed likeli-

hood (2.69) is an approximate value based on the assumption that O, is uncondition-

42

ally independent of 0., for all 7' E [1,t — 1), meaning that

N
P(Ot I M) = ZP(0t I Qt = i:M)P(Qt = i I M) (2-109)

i=1

= 1'A(t)a:(t). (2.110)

Here, algebraically the computation of P(Ot I M) involves a:(t) which is a state
distribution that is dependent upon t only, not upon the history of the state path, nor
of the symbol string. Thus, only the state-transition probabilities in A are responsible
for predicting the state path in the TIA HMM. However, observation symbols play a
signiﬁcant role in deciding the feasibility, if not the value of the probability, of a state
sequence. The viability of the TIA HMM depends on the degree to which probabilities
computed for illegal state paths are small, rendering them infeasible. This will be

discussed in the following chapters.

2.6 Validity of the Time-Invariant Approximation
of the HMM

We examine the extent to which the TIA HMM represented by approximations (2.67)-
(2.69) is a viable model for practical speech recognition. In particular, the relative
signiﬁcance of two signiﬁcant matrices A and B of the F-B HMM will be examined
analytically and heuristically using several approaches. Of course, it is true that the

following analysis are in fact heavily dependent upon the probabilities of A(t).

2.6.1 Matrix Norm Approach

Let us ﬁrst revisit two state—transition equations (2.11) and (2.25). Again, we have

P(01,02,...,0,,...,0TIM) = C’a(T)=C’A(T)X(T). (2.111)

43

Note that the actual relationship between a(t) and X (T) is given in (2.26). For
t 76 T, the likelihood of the RB HMM is

P(0,,02,...,0.|M) = 1a(t)=lA(t)X(t). (2.112)

For simplicity of analysis, consider an ergodic constraint on A so that all N states

are legitimate ﬁnal states. Then,
P(01,02, . . .,0. l M) = C’a(t) = C'A(t)X(t) (2.113)

for t E [1,T] with suitable initial conditions 07(1) = A(1)7r and X (1) = 77. Because

A is a stochastic matrix, it is easily veriﬁed that
(f = C'I=C'A= CA" (2.114)

for any natural number n with a column vector C as deﬁned in (2.12) and I deﬁned as
the identity matrix of suitable size. Then for any 1 g t S T, a posteriori probability

(2.112) can be represented variously as

P(01,02,...,0,|M) = 651(1)
= I|a(t)lll
= dAa(t)
= IIAa(t)II1
= dA(t)X(t)
= IIA(t)X(t)||1 (2115)
= C'AA(t)X(t)

= IIAA(t)X(t)II1

44

where I] -

then

= |IX(t+1)|I1,

II represents the ll norm. If n is a matrix such that

Therefore, we have that

for any n with C = [1,1,...,1].

A+n = I,

C'A = c’=dr
= C’(A+n)
= C’A+dn.

do = d((2)n=0

(2.116)

(2.117)

(2.118)

(2.119)

Here 0 is the zero vector with an appropriate

dimension. The difference matrix 52 indirectly points out the relative importance of

the stochastic matrix A in the process of evaluation of the a posteriori probability

for a given utterance. From (2.115),

Ila(t)II1 = IIAa(t)|I1=II(I-9)a(t)II1

= “G(t) - m3I(I)II1-

(2.120)

In particular, consider a diagonally dominant A. For simplicity, let A is a 2 x 2

45

matrix and let 61 and 62 be two small numbers in the off-diagonal so that

a a 1—6 6
11 12 = 1 2 ' (2.121)
G21 G22 61 1-62
Also, let
a t
a(t)= 1” . (2.122)
0120)
Then,
6 —6
n: 1 2 . (2.123)
—61 62

Therefore, all the elements of Q are composed of small numbers. From (2.121), we

have that

Aa = (2.124)

(1 — 61)C¥1(t) + 6202“)
6101(t) + (1 — 62)oz2(t)

If 61 and 62 are relatively small compared to the entries of a, (2.124) can be approx-

imated as
(1 — £1)a1(t) + 620:2(t) z (1 — 61)al(t) (2.125)
6101 (t) + (1 — 62)02(t) (1 — 62)a2(t)
= (1 — 61) 0 a1(t)
0 (1 — 62)02(t) C12“)
2 Da,

46

where D is a diagonal matrix which is made up of diagonal elements only from matrix

A so that A = D + D1. Therefore,

P(01,02,...,0,IM) = IIa(t)I|
= IIAa(t)II
= ||(D+D1)a(t)||
.9 IIDa(t)II (2.126)

= dDa(t),

where D is practically close to an identity matrix I.

2.6.2 Likelihood Expansion Approach

The preceding discussion concerns ﬁguring out the relative insigniﬁcance of stochastic
matrix A at the ﬁnal time t in light of the likelihood. In fact, however, we need to
see the effect of A at each of the time instants {1,2, . . . ,T} at which a speech signal
is evaluated.

To look at the inﬂuence of A matrix more closely, again consider a two-state model
and the dynamics of the HMM over a few time instants. The results of this analysis
can be extended to any size state model. Let M1 and M2 be two HMMs for speech
evaluation. For simplicity, let us suppose that an observation string is composed of

three symbols {01, 02, 03}. Then, we need to evaluate the ﬁnal two likelihoods,

P(01,02,03 | M1) = (1,1)A1(3)A1A1(2)A1A1(1)1r1 (2.127)
P(01,02, 03 | M2) = (1,1)A2(3)A2A2(2)A2A2(1)1r2, (2.128)

where subscripts 1 and 2 denote M1 and M; respectively. Let us assume a Bakis

structure for A1 and A2. Then 01,12 = (11,22 = (12,12 = (12,22 = 0. Like the notation of

47

A1, A2, the subscript of n from and-,- denotes a transition probability from state 2' to

j in HMM 17.. So, the likelihoods for M1 and M2 become

P(01,02,03 I M1) = b1,1(3)a1,11b1,1(2)a1,11b1,1(1)7rl,1
+ b1,2(3)a1,21b1,1(2)a1,11b1,1(1)7r1,1 (2.129)
+ b1,2(3)a1,22b1,2(2)a1,21b1,1(1)7r1,1
P(01702103 I M2) = b2,1(3)02,1152,1(2)G2,11b2,1(1)7r2,1
+ b2,2(3)a2,21b2,1(2)a2,11b2,1(1)7724 (2.130)

+ 52,2(3)a2,22b2,2(2)02,21b2,1(1)7T2,1-

Here similarly to and-g, bug-(k) denotes an observation probability for symbol 16 from
state j in HMM n and 7r,”- denotes a initial state probability for a state i in HMM
n. In (2.129), if A1 is close to A2, then B plays a decisive role in computing the
likelihoods. It is difﬁcult to show analytically how much the matrices A and B affect
the likelihood in general since its result differs depending on the time-varying input
speech signals. However, it is possible roughly to estimate the relative contribution

of each matrix.

2.6.3 Matrix Inversion Approach

Here we consider the application of the vector-matrix formulation of the HMM to
assess the relative importance of A and B to the likelihood. 5

Previously, a matrix norm has been applied for an ergodic model to obtain a closed
form for the a posteriori probability quantity of a given speech utterance at a Speciﬁc
time as (2.115) or (2.116). Now consider a general case covering all time indices.

Reconsider a model with a time-varying state-equation representation as (2.22).

48

Multiplying both sides of (2.22) by A“,

A‘1X(t+1) = A(t)X(t)+A’11r6(t). (2.131)

Let us assume a Bakis structure with strong diagonal elements in A. As before, let

ﬂ=I—A. Then

(I — n)-1X(t + 1) = A(t)X(t) + A-lmsu). (2.132)
Further
(I—ﬂ)’1 = I+n+nz+n3+m (2.133)
= I + n: a". (2.134)
Therefore,
(I + f: ﬂ")X(t + 1) = A(t)X(t) + A'11r6(t) (2.135)

n=l

X(t+1) = A(t)X(t) —(§;m)X(t+1)+A-1«6(t).(2.136)

n=1
Horn the deﬁnition of {2 and (2.134),
2 n" = A-1 — I. (2.137)
n=l

Substituting in (2.136),

X(t + 1) A(t)X(t) — (A'-l — I)X(t + 1) + A—1n6(t) (2.138)

= 22(1) + 22(6 + 1) + A‘11r6(t) (2.139)

49

where X(t) = A(t)X(t) and X(t+ 1) = —(A’1 —I)X(t+1). Because A is assumed
to have strong diagonal elements, X (t + 1) = X (t) for all t > 0. To see X (t + 1)

quantitatively, consider, for example, a simple case in which A E R2”,

all 012

A = _ (2.140)
G21 022
Then,
A'1 _ I = d tlA . G22 — (111022 + 012021 —a12 (2.141)
e ( ) —a21 0.11 — 011022 ‘I' 012021

assuming that det(A) = (1116722 — (112021 > 0 which is justiﬁed in the present analysis
because of the strong diagonal property. Contrary to the denominator, all entries
of the numerator of (2.141) are small numbers. For the Bakis model, (122 = 1 and

an = 0, so that

921 0
A‘l—I = “1‘ . (2.142)
—“ 0

Accordingly, I] (A—1 — I )X (t+ 1) II is very small compared with the values in X (t). For
any size of N, we reach the same conclusion. This is further support for the notion
that observation probabilities are much more signiﬁcant than the state-transition

probabilities in computing the likelihood.

2.6.4 Eigenanalysis Approach

Let us examine the eigensystem of the state-transition matrix of the time—varying F-B
HMM. This approach also leads to the conclusion that A is relatively insigniﬁcant

compared to A in light of the likelihood measure.

50

 

From (2.86), the diagonalization of time-varying F-B HMM is explicitly given by
a(t + 1) = A(t + 1)PAP'1a(t) + A(1)1r6(t). (2.143)

Since A(t + 1) is a diagonal matrix, analyzing A (or P) is sufﬁcient. Note that
generally the operation of vector-matrix-vector multiplication (2.90) to compute the
likelihood does not have a straightforward relation to the eigensystem of the matrix.
However, consider the condition in which A is close to the identity matrix I.

To observe the signiﬁcance of A, consider two HMMs which are numerically very
close to each other. Let A1 and A2 be state-transition matrix of M1 and M2,
respectively. Let A1 have the Bakis tOpOlogy. For analysis purpose, suppose that all
the entries of A2 are close to those of A1 and they are close to I so that they have
strong diagonal property. Further assume that the other two matrices B1 and m
from M1 are the same as B2 and «2 from M2. Then let us estimate the likelihood
from A2 in terms of A1.

Consider a practical case ﬁrst in which A2 which is close to A1 is of the Bakis
topology so that the diagonal entries of A2 are eigenvalues themselves. During ma-
trix multiplications for likelihood computation, the eigenvalues of the state-transition
matrix are explicitly involved in the matrix operations as explicit (diagonal) entries
in the matrix. In this case, it is trivial since both A1 and A2 produce the similar
likelihood.

Next consider the case such that A2 is not of the Bakis topology but it is still
strongly diagonally dominant. The non-Bakis condition makes the analysis difﬁcult
since the eigenstructure of the system varies depending on changes in the entries
of the matrix. Because A2 is no longer triangular, the eigenvalues of the matrix
are not explicitly involved in the matrix multiplication. However, the Gerschgorin

circle theorem [37 , 38, 39] provides another way to assess the matrix multiplication

51

approximately. The theorem tells the possible relative locations of correSponding
eigenvalues after the entries of a matrix changes a little from an original matrix. To
apply the theorem, let A be an eigenvalue of the N x N matrix A2. Then, by the
Gerschgorin theorem, each eigenvalue lies in at least one of the discs with center am,-

and radius r,- = 23,-,5,- IaW-I, z” = 1, . . . , N in the complex plane,
I/\ — amiI S Ti. (2.144)

Because A2 is diagonally dominant, the size of each Gerschgorin disk is very small.
Moreover, since A2 is a stochastic matrix, one eigenvalue remains unity. For matrix
computation, consider eigenvectors of A2. Although the eigenvalues of A2 are not
affected much and the Euclidean distances between respective eigenvalues of A1 and
A2 are small, the sensitivity of eigenvectors depends on the eigenvalue sensitivity
and separation. In particular, in case of identical eigenvalues, there exists an inﬁnite
set of possible eigenvectors because of linear dependency. Therefore, the eigenvector
conditions are not helpful to estimate the likelihood with matrix A2 in association
with A1.

However, in case that A(t) is sparse8 so that the effect of off-diagonal elements
are reduced, we may get likelihood results with A2 which are close to those with A1.
The results are in fact mostly dependent upon the probabilities of A(t).

For the TIA HMM, (2.76) and (2.7 7), eigenanalysis approach is better applicable
than for the F-B HMM because the time-invariant state equation does not depend

on the observation symbol string.

 

8The sparseness of B will be explained in Chap. 3.
52

Chapter 3

Practical Issues in the Use of the

TIA HMM

This chapter is devoted to the practical issues related to the TIA HMM. Here, the
“practical issues” relate principally to the problem of illegal state sequences in the TIA
HMM. Additionally, the technique prOposed by Turin [27] to reduce the computational
load for likelihood computation will be reexamined. Then, a new approach will be
prOposed and derived to reduce the computational work for likelihood computation
based on a condition imposed on an utterance by Turin. Through such an approach,
we will show that we can obtain more computational savings as well as fast evaluation
with reduced computational resources. Even though such an assumption imposed by
Turin on a speech utterance is not ubiquitous in real speech signals, however, this
study will be signiﬁcant to compare the efﬁciency of computational savings of the new
approach with that of Turin. We will discuss the problem related to computational

savings of HMMS ﬁrst in the following section.

53

3.1 Efﬁcient Evaluation Technique

In [27], Turin suggests a new technique for computing the HMM likelihood efﬁciently
using a vector-matrix formulation of the HMM. Turin’s approach assumes that the
observation string extracted from the speech has long stretches of identical observa-
tions. Let us discuss and examine his method so as to derive more computational
saving technique.

Before presenting a new technique, we brieﬂy review the Turin’s method [27].

Assume that the observation string has long stretches of identical observations, say,
0t+1 = 0t+2 = ' ‘ ' = 0t+r (3.1)

with r—repetitions of the symbol. When there are many blocks of repetitive strings,

the following development can be reapplied. From (2.13), the likelihood is given by

P(01,02,...,0TIM) = C'A(T)AA(T—1)A---A(2)AA(1)7r (3.2)
= UA(T)AA(T—1)A---A(t+r+1)A

(A(t + 1))A)'A(t)A- . - A(2)AA(1)1r (3.3)

under (3.1). Thus, the problem becomes how to compute a matrix (A(t + 1))A)’
efﬁciently. Among several algorithms proposed by Turin [27] for computing (A(t +
1))A)r one of the technique suggested using [28] is as follows:

1. Let {bk_1bk-1 ---b1bo} be a binary representation of r as

r = b0+2b1+...+2’°—1b,,_1. (3.4)

54

Also, let

Qo = I (3.5)
R1 = A(t+1)A. (3.6)
2. Fori=1,...,k+1,
Ri+1 = R3 (3-7)
.-_ ’fb,-_ =0
0.- = Q 1 1 ‘ (3.8)

Qi—iRi if bi—l = 1-

3. Termination

(A(t+1))A)' = Qk (3-9)

This algorithm requires on the average 3N 3 log10 r ﬂoating—point Operations (ﬂ0ps)
in calculating (A(t + 1))A)’. It is obvious that we get more computational savings
when for r is large. Depending on r, however, the computational savings varies. For
example, when r = 2" for n E N, the large computational savings can be obtained. On
the other hand, the computational savings becomes relatively small when r = 2" — 1.
As well as such variability of computational savings, this algorithm still requires a
recursive squaring of (A(t + 1)A).

To improve upon techniques proposed for computing (A(t+ 1)A)" [27, 28], we de-
velOp a more computationally efﬁcient technique for computing this repetitive matrix
multiplication based on a linear transformation of the matrix. This method is par-
ticularly efﬁcient in cases where the matrix R1 from (3.6) is a sparse, near-triangular
matrix, typical of the HMM structure. The derivation follows.

In Chapter 2, a similarity transformation of the non-singular matrix A was used

55

to obtain more computational savings in the TIA HMM. Similarly, a linear transfor—
mation can be applied to the computation of (A(t + 1)A)'. In this case, this matrix
product is singular most of the time.

Initially, suppose that A(t + 1) is non—singular. Then, the resulting product
A(t + 1)A is non-singular; thus, the matrix (A(t + 1)A) can be expressed as the

product of three matrices,

A(t + 1)A -_—_ P(t + 1)D(t+1)P‘1(t+ 1), (3.10)

where P(t + 1) is an eigenvector matrix of, and D(t + 1) is a diagonal eigenvalue

matrix of, A(t + 1)A. Therefore,

(A(t +1)A)r = P(t+1)D’(t + 1)P-1(t+1). (3.11)

Since D(t+ 1) is a diagonal matrix, computing D" (t+1) is straightforward. If D(t+ 1)
is N x N, for example, it takes only N x r ﬂaps to compute D'(t + 1). Likewise,
computing P‘l(t + 1) from P(t + 1) takes N3 ﬂops. This is not computationally
demanding when N is not large. N is practically not over 6 in HMMS.

Second, suppose that A(t+1)A is singular because of zero elements in the A(t+1)
matrix. Note that A is a non-singular matrix. In this case, we still can choose non-
singular eigenmatrix P(t + 1) because it is possible to have any linearly independent
eigenvector corresponding to a zero eigenvalue. Thus, a non-singular matrix P(t + 1)
exists always regardless of values of A(t + 1). Hence, there is always a valid relation
(3.11).

To compare the required number of ﬂoating Operations for computing (A(t+1)A)'
between three techniques described above, consider a simple case as follows. Suppose
that all the diagonal entries of A(t+ 1) are not zero, and A is a triangular matrix with

allowing any forward state jump. Furthermore, assume that state transition matrix

56

A(t + 1)A has distinct eigenvalues. Then, the necessary computational complexities

for three techniques are shown in Table 3.1. For example, when r = 10 and N = 5,

 

 

 

I Approaches 1 ﬂops I
Conventional F-B HMM gm + 1)?“ + %(N + 1)(N + 2) (7' - 1)
Turin’s algorithm 3N 3 log10 r
Similarity Transformation N(N + 1) + N7 + N2 + N3 + %(N + 1)(N + 2)

 

 

 

 

Table 3.1: Approximate computational complexities for computing (A(t + 1)A)' by
three different approaches.

the approximate complexities for computing (A(t + 1)A)' are 2040 ﬂ0ps by the
conventional scalar recursive F—B HMM algorithm, 375 ﬂOps by Turin’s algorithm,
and 265 ﬂOps by the similarity transformation in (3.11).

It is obvious that, like [27], savings in computing P increases with increase of
r. In contrast to the technique in [27], however, the necessary load for computing
(A(t + 1))A)r using the similarity transformation is less sensitive to 7' since the
computational load increases proportionally with rate of N. On the other hand, for
the 'I‘urin’s algorithm, it increases with 3N 3 associated with log10 7'.

Now, consider the TIA HMM under the Turin’s assumption that the observation
string has long stretches of identical observations. From (2.78) and (2.80), the partial
likelihood from t + 1 to t + 7‘ becomes

lilerlM) = 11111210»: (7)} (3.12)
= HII{5(0 (0,)Az(T—1)} (3.13)

= (6’(030426))(B’(o...>4"1z<t))
(5'(0..1)422<t))(6'(050440) (3.14)

because 5(07) and 2(7) do not change over 15+ 1 g 7' S t+ r. In (3.14), however,
57

it is not possible to get further computational savings for Hi”; +1 P(OT I M) even

though A is a diagonal matrix, because A is between row and column vectors which

form a dot product.

3.2 Analysis of the TIA HMM

In this section, the structure of the TIA HMM will be focused on in detail.

3.2.1 Likelihood Structures

The HMM is Obviously based on the assumption that at each time, a symbol is
generated as a consequence of a state transition or result of state entrance. Depending
on the type of the HMM, a symbol is modeled to be generated either during or after
state jump [4]. In either case, the likelihood is made up of T sequential multiplications
of pairs of aj, from A and bj(k) from B for a T-length speech utterance. The initial
state probability 7r,- is a special case of a,,- which could be represented symbolically
as 77,- = 050.

For simplicity, assume that there is a single legal state path. Then, according to

the dynamics of the HMM, the formulation of the likelihood is as follows:

P(0 I M) = {a91.0b¢11(0t)} ' ' ' {amm-lbm(0t)}{a¢1t+1,9tbt+l (Ot+l)} ' ° ' ' (3°15)

In case there are more than a single legal state path, the sum of terms of form
(3.15) comprises the overall likelihood. Therefore, regardless Of the number of legal
states, the number of aJ-is are the same as that of bjs in the likelihood equation when
evaluated by the F-B HMM or Viterbi HMM.

For the TIA HMM, however, 2:, (t) is computed from Markov process (2.67 ); thus,

:3, (t) is sum of terms composed of (1,, and takes a “sum-Of-products” formulation. In

58

prOportion to t, the exponent of aji increases and z,(t) could be up to (t — 1)" powers
of a,,-. Therefore, Hi1 P(Ot I M) requires T terms of (2.72) and thus it takes on
“product-of-sums” formulation. Also, in 1'12":1 P(Ot I M), (t — 1)“ powers of 61,-, is
multiplied by a single bj(0t) rather just a,,-.

In an extreme case, for example, suppose that only a single element of a vector
5' = (b1(0t), b2(0t), . . . , bN(0t)) is not zero for t E [1,T]. Such a condition is rarely
satisﬁed for real speech signals under a HMM framework since it implies a simple
“non—hidden” Markov model. Practically, however, for a few t E [1,T], such phe-
nomenon occurs frequently in the F—B HMM training. It is not easy to quantify for
how many times the assumption is ﬁt because the length of training speech utterance
varies and it really data-dependent. Moreover, initial values of A and B randomly
assigned inﬂuence the estimated values of A and B during training.

The case that only one element of b, has non-zero probability in training is in-
vestigated with ﬁfteen different initial settings for A and B. As before, a Bakis
constraint is considered. Also, ﬁfteen training utterance of word “six” and “four” are
used for this simulation. For the word “six” with 5—state HMM, the rate of the case
that only a single element of bI has non-zero probability is 28.5% of T-observations
in training sequences. On the other hand, for the word “four” with 3—state HMM,
the rate reaches 84% of T-observations in training sequences.

According to the assumption, we have

from (2.72). Here, i designates the state that produces 01- Let N = 2 for instance,

and consider a case with the Bakis constraint on A. It follows that

(1) = , (3.17)

59

1'1 G11

(2) = . (3.18)
$2 (121
2
a: a
1 (3) = ” . (3.19)
1132 021011 + 022021
3
:1: a
1 (4) = ” (3.20)
132 amafl + (1220216111 + 032021 + 022021011

If P = P(Ol | M)P(02 I M)P(03 I M)P(04 | M) is to be evaluated for the

observation string 0 = {01, 02, 03, 04} for instance, then the likelihood becomes

P = P(01 I M)P(02 | M)P(03 I M)P(04 | M) (321)

= bar (Oil-7301 (1)b92 (02)$¢12 (2)1793 (03)3393 (3)914 (04)$94 (4): (3-22)

where {q1, (12, q3, q.,} is a sequence of states which produce a symbol string 0. In case
{(11 = 1, (12 = 1, q3 = 2, q; = 2}, for example, which is one of the legal state sequences

that can produce a symbol sequence 0,

P = ﬁlm.) (323)
= {70540051(02)b2(03)b2(04)}

{011(021011 + 022021)(0210?1 + 022021011 + 032021 + 022021011)}. (3.24)

The highest order of the polynomial in the a,,- coefﬁcients is six, and these polynomials
are multiplied by b1(01)b1(02)b2(03)b2(04)7r1. These products are not consistent with
(3.15) in the sense of Moore or Mealy forms of HMMS.

Comparing (3.15) with (3.22), roughly speaking, aj; in (3.15) is substituted for xq,
in (3.22). With “extra” polynomials composed of ajgs in 3,1,, the likelihood 11;, P(Ot I

60

M) is different from P(O I M) of the F—B HMM. It is not simple to ﬁgure out
quantitatively how much such extra likelihood from the TIA HMM affects the overall
performance of the speech recognition system. The performance is data—dependent
and dependent as well on the values of ajf, bj(0t).

Even though two likelihood measures are not identical, the performance of a speech
recognition system using HQ, P(Ot I M) is not signiﬁcantly degraded. This fact does
not imply that state-transition probabilities are not informative. They are important
to constitute a possible state sequence as well as the likelihood in the HMM. In the
F-B HMM, they are vital. In case of the F-B HMM, the state probabilities themselves
regulate the state path to some extent. In addition, with such state probabilities made
up Of state-transition probabilities sums instead of state-transition probabilities ajis,
the evaluation by the TIA HMM does not signiﬁcantly inﬂuence the performance of
a speech recognition system. More empirical results about viability of the TIA HMM
will be studied again later when we discuss an Optimal state sequence of a speech
utterance. Ultimately, of course, recognition performance obtained from a model
matters, not absolute likelihood scores.

Simply Speaking, ﬁnding a HMM producing maximum HT:1 P(Ot I M) is to ﬁnd a
best scoring HMM in light of P(Ot) over [1, T] on the average. Inherently, this implies
that if there is, on average, a higher matching rate associated with each individual 0:
of a testing utterance to each individual 0: of a training utterance, it is much more
probable that the correct utterance is recognized than if there are fewer matching
cases. This method is similar to the “perplexity” used in a language model which
roughly means the average number of branches at any decision point so that its degree
implies the difﬁculty or uncertainty in each word [4, 33].

In another sense, the likelihood evaluation by the TIA HMM can be regarded
as a subOptimal method when considered against the ML criterion, contrary to the

conventional F-B HMM.

61

3.2.2 Experimental Comparisons of Likelihood Between

Model Types

In this section, we will empirically Show the usefulness of the TIA HMM in speech
recognition. The spoken digit recognition problem will be the focus of the experi-
ments. Digit recognition has important applications in on—line banking, credit card
inquiry, and automatic dialing.

Fifteen isolated-word utterances of each ten digits “zero” through “nine” were
downloaded from ftp: / /archive.egr.msu.edu/pub/jojo/DPHTEXT. They were col-
lected in a quiet room at the author’s lab, recorded on TDK type II using a TEAC
W-450R cassette deck with Dolby C noise reduction. Prior to sampling, the data
were ﬁltered using an active bandpass, fourth-order Butterworth ﬁlter with a low—
pass cutoff frequency of 4.7 kHz and a highpass cutoff of 75 Hz. These were uttered
by an American adult male and sampled at 10 kHz. A MetraByte DAS16F 12—bit
analog to digital conversion board was used to sample the data. Each speech ﬁle con-
sists of integer samples covering the range 21:2048. Tenth order cepstral data [c(1) to
C(10)] were generated as a feature vector sets from these utterances using 256 points
Hamming windows and an FFT algorithm.

These generated cepstral feature vectors were used to construct a codebook of
128 symbols. This codebook was used to quantize the speech sample utterances for
training and testing.

The objective of this discussion is to compare the recognition result of two dif-
ferent likelihood measures from the F-B HMM and TIA HMM, and verify that the
TIA HMM works properly without much degradation in the Speech recognition per-
formance. Since the deﬁnitions of likelihood measures from both models are funda-
mentally different, it is not meaningful to directly compare the likelihood quantities.

Instead, we will compute global performance measures from the model types.

62

For this simulation, a discrete HMM is used as a word model. Due to the available
data set, a speaker-dependent model will be tested.
To Show the usefulness of the time—invariant state-space HMM empirically, one of

the formal tests used in pattern recognition studies is conducted [9].

o In each digit, randomly chosen fourteen isolated words among ﬁfteen utterance
are used as a training set for the HMM and then the unselected utterance is used
as a testing utterance. Next, the testing utterance is included in the training
data set and the other utterance which was chosen as a training utterance
previously is assigned as a testing utterance. This procedure is repeated until
each utterance is used as a testing utterance once. This procedure is called the

leave-one-out, or deleted test [9].

The experimental results appear in Tables 3.2- 3.3. Each table shows the likelihood
result from the F-B HMM and the TIA HMM respectively. For Simplicity, only one
set of results corresponding to the ﬁrst of 15 testing utterances is shown here. The
results for the other utterances are are similar to those in the tables. M.- denotes the
HMM for the digit 2'. Fivestate Bakis HMMs are used to allow only one skip in any
forward transitions.

To avoid numerical underﬂow caused by the multiplication of many numbers
between zero and one, the logarithm with base 10 is taken to the the quantity of

_ P(O.). Therefore, the index of the recognized HMM in a given trial is

T
i'=argmax(HlP( (0.)th)

(i) = arg max (logII( P (0. | M.)) (3.25)
1:1
T

4: z" = arg mad: 108 P(O. I Mi»
' 1:1

T
4:) z" = argmiin(—ZlogP(0t I M.))

i=1
63

In every experiment, the digits are recognized correctly. As well as this recognition

result, two additional observations are made.

0 Generally, the RB HMM produces larger likelihood than the TIA HMM.
Roughly speaking, this is due to the “extra” probabilistic terms ajis multi-
plied to bj(0t) as (2.103) in the TIA HMM. Since those aﬁs take the values
between zero and one, the likelihood decreases as such terms and the cross

terms produced by (2.69) increase.

0 Even if the recognition performance of two types of model may be the same,
there are differences of likelihood. Tables 3.4 and 3.5 are the statistics of Ta-
bles 3.2 and 3.3 respectively. They Show that the F-B HMM is more advanta-
geOus than the TIA HMM from the recognition point of view. This is because
not only the average likelihood difference between a correct digit and incor-
rect digits of the F—B HMM is larger than that of the TIA HMM, but also
the variance of incorrect digits of the former is smaller than that of the latter.
Therefore, the speech recognition system with the F-B HMM is robust than the
TIA HMM. A technique to make TIA HMM robust will be discussed later.

A more fundamental question concerning the speech recognition problem in light of
the likelihoods from the F-B HMM and the TIA HMM is the following: For i 6 [1, M ],
if

P(01,02, . . .,OT I M.) 2 P(01,02, . . .,OT I M,) (3.26)

holds for all j 79 z', is
T

T
H P(Ot I Mi) 2 H P(O. I My) (3-27)
t=l t=1
always true for any 2'? Where M is the number of HMMS which is equivalent to the
number of words to be compared. If (3.27) holds for any i, the TIA HMM can be

used with equal effectiveness in recognizing strings.

64

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Testing M1 M2 M3 M4 M5 M6 M7 M3 M9 M0
Data

one 82.1 508.4 572.2 479.0 223.0 529.3 321.4 512.5 182.3 580.0
two 550.7 83.0 779.0 684.1 642.0 634.1 611.6 778.9 672.6 403.6
three 791.6 732.9 109.8 723.6 710.4 731.7 737.6 367.5 753.7 544.3
four 601.8 826.3 670.7 82.9 599.4 753.6 835.3 781.4 839.9 596.2
ﬁve 721.6 760.3 601.2 431.2 57.2 706.8 556.5 747.6 680.1 482.0
six 891.1 737.2 850.5 906.3 662.9 140.0 457.0 691.8 873.1 850.2
seven 477.3 676.5 587.9 700.7 492.3 520.4 121.1 708.6 482.2 753.2
eight 575.7 515.0 327.0 608.9 530.3 508.3 565.1 72.2 558.2 525.5
nine 220.6 564.5 652.4 573.3 281.8 610.3 383.0 584.5 60.4 728.2
zero 1053.7 827.9 922.1 840.5 933.4 963.8 944.7 974.2 1027.5 127.3

Table 3.2: Likelihood from the F—B HMM in leave-one-out-test.

Testing M1 M2 M3 M4 M5 M6 M7 M8 M9 M0
Data

one 112.0 477.4 510.9 490.0 258.9 543.0 340.1 532.2 194.2 508.5
two 630.9 96.2 772.7 707.9 567.9 650.8 628.0 773.7 688.0 434.5
three 793.5 711.5 119.8 726.8 717.1 714.3 746.7 412.1 753.3 427.8
four 630.5 818.7 697.8 103.8 606.1 772.3 840.2 818.5 835.7 481.7
ﬁve 724.5 723.1 663.5 394.3 85.9 651.6 581.6 719.2 694.8 506.3
six 894.9 737.5 872.3 897.1 663.3 165.3 338.4 656.6 863.6 854.3
seven 484.3 644.7 646.8 681.0 420.1 481.0 145.1 693.9 469.0 619.4
eight 574.9 516.0 303.9 598.9 516.9 508.9 530.5 95.9 565.5 537.4
nine 179.5 498.2 660.9 623.6 256.8 640.3 411.1 610.4 82.0 651.3
zero 1056.0 840.4 834.1 790.0 876.9 971.9 954.0 960.6 1032.5 172.4

 

 

 

 

 

 

 

 

 

 

 

 

65

Table 3.3: Likelihood from the TIA HMM based on leave-one-out-test.

 

 

 

 

 

 

 

 

 

Testing Likelihood Likelihood Likelihood Standard deviation
Data of mean of difference of of
correct digit incorrect digits correct digit & incorrect digits
incorrect digits

one 82.1 434.2 352.1 151.5

two 83.0 639.6 556.6 115.2
three 109.8 677.0 567.2 134.8

four 82.9 722.7 639.8 106.1

ﬁve 57.2 631.9 574.7 120.0

six 140.0 768.9 628.9 147.2
seven 121.1 599.9 478.8 110.8
eight 72.2 523.7 451.5 80.5

nine 60.4 510.9 450.5 174.0
zero 127.3 943.0 815.7 75.0
average 93.6 645.2 551.6 121.5

 

 

 

 

 

 

Table 3.4: Statistical Properties of the likelihood results from the F—B HMM.

 

 

 

 

 

 

Testing Likelihood Likelihood Likelihood Standard deviation
Data of mean of difference of of
correct digit incorrect digits correct digit & incorrect digits
incorrect digits

one 112.0 428.3 316.3 129.7

two 96.2 650.4 554.2 105.5
three 119.8 667.0 547.2 142.4

four 103.8 722.3 618.5 126.6

ﬁve 85.9 628.7 542.8 114.3

six 165.3 753.1 587.8 182.7
seven 145.1 571.1 426.0 105.7
eight 95.9 516.9 421.0 85.5

nine 82.0 503.5 421.5 182.2

zero 172.4 924.0 751.6 92.8
average 117.8 636.5 518.7 126.7

 

 

 

 

 

Table 3.5: Statistical Properties of the likelihood results from the TIA HMM.

66

 

 

 

 

Unfortunately, (3.27) does not hold in general. The relationship is very data-
dependent. Therefore, a more practical question is stated as follows: For 2' e [1, M I,
if

P(01,02, . . .,0T I M.) > P(01,02, . . . ,OT I M,) (3.28)

holds for all j 75 2', is
T T
H P(0. I M.) > H P(Ot I Mj) (3.29)
t=l

t=l
always true for any 2? However, it is not easy to analytically specify how much larger
the left side of (3.28) needs to be than the right side of (3.28) does. We can only
say that in the case of the previous example, the likelihood of the correct word is
approximately three times greater than that of the others. More intensive study of
the likelihood relationship between the F-B HMM and the TIA HMM in conjunction

with the performance of speech recognition is left for future research.

3.2.3 State Probability Distribution Vector in the TIA HMM

In this section, we will investigate a:(t) of (2.67) to assess the effect of A in determining
state sequences of an utterance in conjunction with the observation symbols.

Consider a Bakis-type TIA HMM. Then, a state-transition equation is composed of
A and state probability distribution vector m(t). However, since the state-transition
part in the TIA HMM is only composed of A in contrast to A and A(t) in the F-B
HMM, a Bakis condition does not have a direct inﬂuence on constituting the possible
state transitions for an observation string. Instead, A affects 2(t) and a:(t) indirectly
controls state transitions.

To observe the effect of A, let us compare two cases, a Bakis (AB) and ergodic

(Ac) constraints, for state-transition conﬁguration with each other. Tables 3.6 shows

67

the state probability distribution vectors $(t) for a few speciﬁc times when

0.9 0 0.9 0.2

 

 

 

 

 

 

 

 

 

 

 

 

 

B: 2 Ac: 2 (3'30)
0.1 1 0.1 0.8
with 23(1) = 23(1) = (1,0)’.
t=1 =2 2:3 t=4 t=5 2:40 2:50 t=60 2:70
23(2) 1.000 0.900 0.810 0.729 0.656 0.016 0.005 0.002 0.000
0 0.100 1.190 0.271 0.343 0.983 0.994 0.998 0.999
2:.(2) 1.000 0.900 0.830 0.781 0.746 0.666 0.666 0.666 0.666
0 0.100 0.170 0.219 0.253 0.333 0.333 0.333 0.333

 

 

Table 3.6: State probability distribution vectors under Bakis, x(t)B, and ergodic,
:c(t)e, constraints.

The effect of A on a:(t) is not apparent over short intervals. It follows that the
likelihood differences arising from a Bakis and ergodic constraints are not evident
over short times. Additionally, it is not possible to distinguish the topology of the
HMM from a record of a:(t). However, over a sufﬁcient duration of time, we see that
the difference in two state probability distribution vectors becomes distinguishable.

As explained in Section 3.2.1, the eﬂect of A in the TIA HMM is indirect and
“global” in a sense to form “possible” state paths in an utterance, contrary to A

which locally affects state paths.

3.2.4 Comparison of a:(t) and 7(2)

In this section, we will discuss similarities of one state variable of the F-B HMM and
the other of the TIA HMM. Particularly, we are interested in the similarity of state
probability distribution vector 23(2) of the TIA HMM and 7(0.) of the EB HMM.

68

Previously, we have

931'“) = P(Qt =1 I M) (3-31)

7.0) = P(t}: = i I QM). (332)

where 0 = {01,02,...,0T} and M = {N, M, A,B,7r}. Note that 7.(t) is the a
posteriori probability of a state based on 0. On the other hand, T.(t) is a state
probability distribution without 0 although both state variables provide information
about the probability being in state 2' at time t.

Reconsider the previous recognition experiments. For analysis purposes, consider
the case for word “four.” Suppose that M is computed from ﬁfteen training utter-

ances using the F-B HMM and A is

 

 

(0.9625 0 0 0 0 )
0.0375 0.8835 0 0 0
A = 0 0.1165 0.9704 0 0 (3-33)
0 0 0.0277 0.8652 0
( 0 0 0.0020 0.1348 1.0000 )

The corresponding state probability distribution 23,-(t) along 2 for 2' = 1, . . . ,5 appears
in Fig. 3.1. Also, with M, "7203) 2' = 1, . . . ,5 can be computed for each utterance 0 of
training data set. It is interesting when all 7,-(t),2' = 1, . . . ,5 of the training data set
is combined and the average of the 7,-(t) is computed along t. Here since the length of
each training utterance may be different, it is not possible to get complete alignment
of the training data set along t. Instead, only the time duration commonly occupied
by all training data set is considered.

The average of 7,-(t) for the entire training utterance is shown in Fig. 3.2. Com-

paring results, the transition pattern of 2,-(2) is seen to be similar to that Of 7.(t) for

69

manumy

0.9

0.8

0.7

0.6

0.4

 

 

+ I E same; 2
,_ ...... . .................. +....I.......O...;..,...........'.. ....... /. ......... .. . ..._.

/

 

 

 

Figure 3.1: State probability distribution after training digit “four.”

70

 

0&1» ......... 1.5““31
(18
0.7- ...... ....... .

(16

probability
F’
0|
l

(14

 

(13

012-

o1,_.w. .......... f ......... ..... .

"newsman". +4444. ;

3 i E I '. 1 1 I - . .

A‘IllI I I : A l I : I : : .
4o 50 60 70

 

o ...............
1O 20 30

Figure 3.2: The average of '7.(t),2'
word “four.”

........................................
.

 

1, . . . ,5 from the entire training utterances of

71

each 2' even though the actual probabilities are different. For the other digits other
than word “four,” we see the same phenomenon.

It is not easy to say how these two statistical variables from different models is
related since x.(t) is a inﬁnite sequence if t is not constrained and 7,-(t) is a ﬁnite
sequence according to the size of training data. Roughly speaking, however, the
average of 7,-(t) for the training data set amounts to the state probability distribution

T.(t). This phenomenon is related to counting process when A, B are computed.

3.2.5 Experimental Results on the Effects of A

We have shown theoretically that a strongly diagonal A does not make signiﬁcant
contribution to the likelihood scores in the TIA HMM. Here, we will Show this ex-
perimentally with some examples. Along with this experiment, we will discuss the
possible way of reducing the computational loads required in the HMM training using
the characteristics of A.

To observe the effect of A Of the TIA HMM, ﬁrst let us update B only in the
HMM training in the ﬁve-state Bakis HMMS while allowing only one skip in any for-
ward transitions. In other words, after A is assigned initially, the training procedure

estimates B only and does not change A. Then, compare the likelihoods. Let

(0.99 0.00 0.00 0.00 0.00)
0.01 0.99 0.00 0.00 0.00
An. = 0.00 0.01 0.90 0.00 0.00 . (3-34)
0.00 0.00 0.05 0.99 0.00

 

 

(0.00 0.00 0.05 0.01 1.00)

72

( 0.70
0.20
0.10
0.00

 

( 0.00

0.00
0.80
0.10
0.10
0.00

0.00
0.00
0.60
0.30
0.10

0.00
0.00
0.00
0.80
0.20

0.00
0.00
0.00
0.00
1.00

 

I

(3.35)

be the two preset state-transition matrices, for example. A3, is more diagonally

dominant than A 3,. A. is the state-transition matrix for digit 2' from the F-B HMM

training. Table 3.7 is the sum of likelihoods of ﬁfteen training utterances for each

digit. Note that Since we take the negative log to the likelihood for numerical purpose,

the ML actually amounts to the minimum likelihood in the table.

 

14‘.

A31

A B,

 

 

 

 

 

1008.4
1111.3
1154.5
1182.2
1030.7
2111.8
1725.6
1028.3
1073.2
1908.0

 

1019.5
1053.1
1176.0
1120.5
1072.5
1996.4
1632.3
991.3

899.9

1689.5

 

1053.3
1074.0
1240.6
1227.3
1146.7
2123.1
1719.8
1006.9
991.8
1800.7

 

 

 

Table 3.7: Sum of likelihoods of ﬁfteen training utterances for each digit associated

with three different state-transition matrices in the EB HMM.

From the table, we see that the likelihoods from the usual F-B HMM which

requires both A and B training can be frequently less than those of the models

whose A is arbitrarily set and only B is updated. Other than the problem of local

minimum of the HMM training in the Optimization criterion, we see that the training

A is not much crucial in certain cases such as having diagonally dominant A in the

HMM.

73

Next, to examine the recognition results for different state-transition matrices,
let the resubstitution-test be performed even though such a test is not practical in
speech recognition system. In the resubstitution-test, the training utterance is used
for a testing utterance. In this simulation, however, there is no difference between
a resubstitution-test and leave-one-out-test because we are looking for the effect of
the tOpOlogy Of A in the F-B HMM and TIA HMM. We can reach the the same
conclusions with a leave-oneout-test. Also, the results from the resubstitution-test
will be useful when we discuss the topic about ﬁnding an optimal state sequence in a
speech utterance in Chapter 4.

The recognition results are in Table 3.8 through Table 3.13. Table 3.8 shows the
likelihoods for each digit from the F-B HMM computation when one randomly cho-
sen testing utterance among ﬁfteen is evaluated by the HMMS. Table 3.9 shows the
likelihoods for each digit from the TIA HMM computation when the same testing
utterance in the case of F—B HMM is evaluated by the HMMS. Table 3.10 and Ta-
ble 3.11 are the likelihoods for A3, for the EB HMM and the TIA HMM respectively.
On the other hand, Table 3.12 and Table 3.13 are for A3,.

Comparing Table 3.10-Table 3.13 with Table 3.8-Table 3.9, yields the following

observations:

0 The digit recognition performance with A3, and AB, matches the performance
with A1 and B in the F-B HMM. In addition, the more diagonally dominant

A is, the better the recognition performance.

a In case of the TIA HMM, we have the same conclusion that the more diagonally
dominant A is, the better the recognition performance. However, the recogni-

tion performance is more sensitive to the values of state-transition matrix than

that of the F-B HMM.

o In an extreme case such as 62,-, = 71¢, for all j E [1,N], the recognition per-

74

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Test M1 M2 M3 M4 M5 M5 M7 M3 M9 M0
data
one 63.7 481.0 569.0 553.8 218.1 532.4 320.4 534.5 173.7 567.1
two 476.2 84.3 669.2 570.5 643.7 638.8 489.7 635.0 724.1 456.8
three 800.7 758.5 75.3 728.7 682.7 697.8 741.8 544.3 747.3 493.6
four 593.2 680.2 711.9 70.7 612.0 757.4 843.7 823.2 868.0 559.5
ﬁve 727.4 748.7 622.7 442.9 53.1 637.5 592.1 696.5 694.7 498.7
six 857.2 681.5 756.8 820.3 633.0 122.7 302.2 536.1 711.7 827.9
seven 473.3 654.9 517.1 655.2 365.9 526.8 96.1 685.0 382.7 670.3
eight 562.7 536.1 424.4 574.7 530.2 447.8 524.9 66.8 532.9 551.7
nine 194.7 551.4 629.4 677.9 265.8 621.8 366.9 580.0 63.5 677.6
zero 1054.1 833.4 898.6 839.9 893.2 976.0 940.0 986.3 366.0 129.9
Table 3.8: Likelihood P(O I M) using A.- and B for each digit 2' in a resubstitution
test.
Test M1 M2 M3 M4 M5 M6 M7 M3 M9 M0
Data
one 93.8 457.4 511.6 555.9 251.1 543.2 341.8 526.6 195.8 610.7
two 549.9 106.6 627.2 619.1 557.8 654.5 503.3 626.7 729.2 424.6
three 809.8 725.7 111.8 721.3 678.4 709.3 750.3 547.2 729.8 475.9
four 611.5 689.9 710.9 103.5 612.8 796.5 847.8 838.6 872.7 490.0
ﬁve 729.3 714.3 650.9 413.6 97.2 642.6 610.5 712.5 702.1 531.7
six 860.9 705.3 771.5 806.9 639.1 145.2 297.4 553.5 773.6 815.2
seven 413.8 638.2 575.7 699.7 391.0 539.3 122.7 700.8 409.4 683.2
eight 568.2 479.2 360.7 548.9 520.3 449.7 510.2 84.8 543.8 559.1
nine 139.3 487.7 634.9 692.3 231.3 647.5 395.6 588.1 95.8 694.0
zero 1056.0 842.8 818.3 789.2 862.0 981.5 952.5 977.3 1041.7 160.9

 

 

 

 

Table 3.9: Likelihood 11;, P(o. | M) using A,- and B for each digit 2' in a resubsti-
tution test.

75

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Test M1 M2 M3 M4 M5 M6 M7 M3 M9 M0
Data

one 65.2 532.9 569.5 550.1 214. 1 530.9 322.0 517.0 170.6 577.8
two 471.2 78.7 675.8 580.3 633.6 680.3 481.9 636.1 724.3 443.5
three 801.6 755.1 76.5 733.8 681.2 755. 1 745.1 557. 1 755.4 483.5
four 587.6 672.2 711.2 70.6 604. 1 759.4 849.6 826. 1 869.1 572.3
ﬁve 729.2 743.1 622.2 429.0 59.7 692. 1 585.7 757.9 690.7 491.8
six 858. 1 697.0 755.9 807.0 639.9 115.5 307.4 599.3 715.6 850.2
seven 473.9 665.3 516.6 666.3 376.8 553.0 88.8 740.9 461.6 741.1
eight 563.5 536.6 424.6 573.2 537.8 464.4 519.4 72.4 535.8 529.0
nine 173.5 604.9 628.9 673.3 273.0 621.2 371.2 566.2 52.7 673.1
zero 1054.9 828.2 897.5 823.6 950.3 975.0 945.0 989.2 1040.4 114.0

 

 

 

Table 3.10: Likelihood P(O I M) using A81 and BAH, for each digit in a resubsti-
tution test.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Test M 1 M2 M3 M4 M5 M5 M7 M8 M9 M0

Data

one 175.8 482.5 514.0 567.7 300.4 570.9 385.2 557.4 208.5 510.4
two 470.7 166.6 615.7 595.5 556.2 682.3 546.7 644. 1 724.0 453.7
three 827.4 740.9 161.0 725.8 689.3 743.8 760.5 547.2 741.0 479.1
four 607.2 725.2 694.4 134.2 610.0 802.4 853.7 852.7 876.0 533.5
ﬁve 741.9 704.1 619.9 436.9 125.8 665.6 608.9 740.5 669.6 550.5
six 877.0 660.9 743.2 782.3 664.8 197.1 220.9 582.5 724.9 747.4
seven 451.0 666.7 527.6 685.0 432.6 567.3 177.0 722.6 402.6 576.8
eight 583.4 445.1 366.3 537.2 525.9 461.6 509.2 120.6 557.5 478.0
nine 223.0 513.6 634.7 712.0 296.8 673.9 437.8 628.9 99.2 599.2
zero 1056.0 853.9 819.9 787.1 853.2 995.6 946.9 991.6 1044.8 208.9

 

 

 

 

Table 3.11: Likelihood H2"=1 P(Ot I M) using AB,

and B As for each digit in a
l
resubstitution test.

76

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Test M1 M2 M3 M4 M5 M5 M7 M3 M9 M0
Data

one 66.2 470.1 566.6 548.1 212.3 536.6 320.9 534.0 179.0 583.5
two 540.9 81.7 681.8 586.2 645.5 639.8 486.6 626.7 726.4 447.6
three 798.3 760.5 81.4 733.8 676.2 697.9 734.9 542. 1 716.2 543.5
four 592.6 672.1 706.4 77.7 617.3 748.6 847.6 821.5 867.3 568.2
ﬁve 727. 1 755.7 617.2 433.0 61.8 643. 1 595.7 694.8 690.8 493.3
six 854.8 687.9 804.7 817.0 636.1 123.7 294.1 528.0 720.4 849.4
seven 476.7 650.4 514.3 673.8 376.8 530.4 93.5 677.4 369.6 754.1
eight 560.2 530.2 424.3 577.5 528.3 449.9 525.7 68.7 535.8 558.1
nine 178.9 535.2 627.8 670.0 271.5 628.0 366.5 575.8 59.8 671.4
zero 1051.6 831.6 911.4 830.3 953.7 978.7 952.0 989.3 1038.4 121.2

 

 

 

Table 3.12: Likelihood P(O I M) using .43, and B A3,‘ for each digit in a resubsti-
tution test.

 

 

 

 

Test M1 M2 M3 M4 M5 M6 M7 M3 M9 M0

Data

one 135.5 470.3 569.8 554.1 306.0 641.6 358.0 536.3 255.2 626.5
two 710.4 206.4 707.7 774.0 680.0 677.2 710.2 613.9 746.7 475.5
three 792.9 775.5 197.6 749.4 676.2 730.1 744.8 725.9 712.6 728.4
four 745.0 672.8 773.8 320.9 705.6 761.5 860.6 828.0 871.8 566.0
ﬁve 731.5 825.8 773.0 607.9 441.0 822.4 811.9 702.0 798.5 651.6
six 859.2 919.5 878.3 943.4 739.3 260.8 662.5 536.9 850.5 906.4
seven 491.6 655.6 769.7 844.1 554.8 619.3 381.2 687.4 590.4 759.4
eight 564.6 583.7 417.8 640.4 532.7 481.7 568.0 141.5 540.9 623.0
nine 236.5 540.5 684.0 674.9 406.9 769.5 486.2 584.9 292.3 734.9
zero 992.0 892.5 945.5 989.5 1041.1 999.7 1017.7 980.8 1046.0 611.0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 3.13: Likelihood HT:l P(O, I M) using AB,

and B AB for each digit in a
2
resubstitution test.

77

formance becomes degraded. This is because of the more ambiguity caused
by “equally likely” state transition. Contrary to this argument, with A which
is close to a diagonalized matrix, the performance becomes enhanced because
of lessened ambiguity about state occupancy at a given time. In information

theory, such uncertainty is measured in terms of entropy [5].

For the leave-one-out tests, the conclusions above are the same.

Associated with these experimental results, we see that the required time and
resources to train the HMMS become lessened with a preset state-transition matrix
.43, due to the unnecessity of training a state-transition matrix. Therefore, it is
possible to reduce the computational loads required in the training.

Followed by this assertion, the arising problem could be “how close does A need
to be a diagonalized matrix to obtain a satisfactory recognition performance?”. This

is left for future research.

3.3 Reconciliation of the TIA HMM

In this section, we are going to discuss a few evolving techniques that reconcile the

TIA HMM to the conventional F-B HMM.

3.3.1 Feedback Control

From Table 3.10 through 3.13, we see that the less each 23,-(t) is overlapped, the
better the performance. The implies that in the TIA HMM, the closer one of states
is probability one at each t, the better the performance. Simply speaking, we want a
TIA HMM close enough to a certain “unknown” desirable system so that the states
are separable from each other as much as possible for every t. Such separation can

decrease the adverse effects of illegal paths.

78

However, the exponentially decaying characteristic by the Markovian assumption
in the HMM basically does not allow such “separation.” One attempt to compensate
for extra probabilities, as well as to decrease illegal paths effects, is to use state-
variable feedback to get a desired state responses. State feedback technique is to
relocate the eigenvalues of a system to get a desired system response [94].

If a given linear time-invariant system realization is state controllable, any desired
characteristic polynomial can be obtained by state-variable feedback. In our problem,
however, neither do we have speciﬁc desired eigenvalues, not do we know exactly which
eigenvalues will be Optimal in a sense that the recognition performance as well as its
robustness of the TIA HMM is comparable to those of the F—B HMM. Provided that

such desirable poles are known, we have

a:(t + 1) = A2:(t) + u(t)6(t) + W(t), (3.36)

from the TIA HMM. Where W is the state feedback input which regulates the state
probability so that a:(t + 1) can reach the desired values for each t by

W(t) = Fa:(t). (3.37)

Here we do not have a speciﬁc control input except an initial time t = 0 in the
HMM. This is in contrast to the usual state-space control problem. Therefore, to
accommodate the time-varying nature of a speech signal and to avoid the exponential
decaying state probabilities of the Markovian model, we need a time-varying feedback

control such as

W(t) = F(t):c(t). (3.38)

Unfortunately, this attempt does not allow us to have stationary diagonalized state-

79

transition matrix [A + F(t)] for all t. Therefore, we have the same problem which
we face in the F-B HMM. The motivation for using the TIA HMM in speech recog-
nition lies on its diagonalization. Therefore, the principal advantage of using the

diagonalization of the state equation does not exist in this approach.

3.3.2 Stochastic Modeling of Temporal Information in the
TIA HMM

To make the TIA HMM robust, consider a model which includes additional temporal
information between neighboring symbols of an utterance.

The assumption that the observations generated by the HMM’s hidden process
are only state dependent is, in fact, a limitation of the HMM when applied to a
real speech signal. In reality, speech features are correlated. To include additional
time-ordering relation between consecutive symbols in a Speech utterance, consider
one of the techniques prOposed by Dai et al. [63]. The idea is to include a Markovian
relation between symbols instead of just “observation independent.”

In Dai’s approach, the state-space is the codebook and each symbol in the code-
book becomes a state of the Markov process. The revised criterion seeks to ﬁnd a
HMM which produces a ML in the conventional F-B HMM sense in conjunction with

the likelihood based on the Markovian relation between symbols as
L’(0) = P(o | M)P(0 | M’), (3.39)

where

P(0 I M) = P(013022"'20T I M): (3'40)

80

and

P(O I M’) : P(01,02,.-.,0T I M’)
= P(OT l 01,02,...,0T_1,M')P(01,02,. ..,0T-1 | M’)
= P(OTI0T_1,M')P(01,02,...,0T_1 IM’) (3.41)

= {113(0. |0t_1,M')

T
t=1

with

P(ol |00,M') = P(O1 | M’). (3.42)

Here M, stands for the set of initial symbol probability and symbol transition matrix.
The same idea can be applied to the TIA HMM, the likelihood [1:11 P(Ot I M),

L’(0 I M,M’> = (£13m. l M>)(P<o I M’»
= (1:1 P(Oth))(ﬁ P(O. I OHM» (3.43)
= @110. | M)P(0, | 0,_1,M’)).
Therefore,

no I 34.34) = - Iog1:1(P(0. I M)P(0. I0._1,M’))
= ‘ZIOSUD (0t I M)P(Ot I 0t— 12M )) (3-44)

= ZlogPOtIM)+108P(0tIOt13M»

81

The L(0 I M, M') likelihood results for the ten spoken digits database described

in Section 3.2.2 are given Table 3.14. Here 00 in the table denotes inﬁnite value caused

 

Testing
Data

5
3

l0

.3

3

sh

M5

.3
g
.3
.3
.3

 

one
two
three
four
ﬁve

o
...-I
O

C»:

(0

DO

00
00
00
00

91.9

Six
seven
eight
nine
ten

8888888

888888888.
8888888888
8888888888
888888§888
8888§88888
8885888888
E
88888888
8888888888

00
00
00
00
00

88
85

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 3.14: The likelihoods based on L(o | M, M’) with the TIA HMM.

by the negative log 0. It is obvious that all the digits are recognized correctly. As
well as this correct recognition performance, the difference of likelihood measure (or
variance in likelihood measure) between digit 2' and digit j 75 2' evaluated by M.-

becomes large. As a result, the recognition system is robust.

3.4 Discussion

In this chapter, the problem caused by mismatch between two model types, the TIA
and F—B HMM, have been focused on in detail. Additionally, practical issues in the
use of the TIA HMM have been discussed with the theoretical and empirical evidences
of the model.

Regardless of the problem caused by illegal state sequences, with the TIA HMM,
we obtain the comparable speech recognition performance to the F—B HMM in some

applications such as digit recognition. Such ﬂexibility of controlling recognition rate

82

and speed and memory makes the TIA HMM useful some applications.

Next, to reconcile the TIA HMM to the F-B HMM, various attempts are taken
although both models theoretically cannot be identical. Through those approaches,
however, two signiﬁcant results are reverified. First, albeit the inherent difference of
both models, there are some similarities between certain state variables of each model.
Next, the relative importance of B over A was reveriﬁed. Also, it was found that the
diagonally dominant condition on the state-transition matrix in the TIA HMM is an
important factor to affect the the performance of speech recognition.

Finally, we introduced a possible technique to render the TIA HMM robust. The
technique is to add one more temporal constrain between symbols of an utterance to
the existing HMM for HMM evaluation. Although the technique requires additional

memory and computations for this new constraint, it increases the robustness of

speech recognition.

83

Chapter 4

Training HMMS so that Hidden
Model States Meaningfully

Represent Acoustic States

The HMM is a state model and a speech utterance is modeled to be generated in
accordance with state transitions. In particular, a meaningful state sequence is Sig-
niﬁcant. Through a meaningful state sequence, for example, we can learn about
the structure of the signal model, and obtain the average statistics of the individual
states. In addition, the experimental evidence suggests that a state frequently repre-
sents one or more identiﬁable acoustic phenomena [2, 4]. Thus, we can discover the
acoustic characteristics of a speech utterance associated with such a meaningful state
sequence.

Finding a meaningful state sequence of a Speech signal in the HMM is often cited
as one of three major analytical problems centered on the HMM. The information
about an apprOpriate state sequence is useful to improve the performance of Speech
recognition system in conjunction with the solutions of two other HMM problems,

the evaluation as well as the training. This is because the evaluation and training

84

through a meaningful state sequence produces better performance and gives simple
algorithms for evaluation and training.

Like the evaluation problem of the HMM, for which a solution can be given de-
pending on the likelihood criterion, there are several possible ways to ﬁnd a mean-
ingful state sequence corresponding to a given observation sequence. Depending on
the Optimality criterion followed by the deﬁnition of “ state sequence,” the result of
possible state sequences corresponding to a speech signal may be different.

The problem of ﬁnding a meaningful state sequence involves the attempt to un-
cover the hidden part of the model. Depending on the application, different criteria
can be employed to ﬁnd an Optimal state sequence [56, 57, 59]. Among them, the
Viterbi search is a prevalent one and it is based on the probability that the HMM could

generate the observation sequence using the best possible state sequence [57 , 58],
Q“ = argmqax P(0,Q I M), (4.1)

in which Q represents any state sequence of length T.

The speciﬁc goal in this chapter is to prOpose some new techniques for ﬁnding
a meaningful state sequence associated with a Speech utterance. These techniques
are used for training HMMS in meaningful ways. Then, the results from each search
technique are compared to those of the Viterbi of the HMM.

In this discussion, it is assumed that a Speech Signal is already encoded with
reference to a codebook of 128 unique Spectral vectors. Hence, a speech utterance is
the sequence of codebook indices represented in the abstract as {01, 0t, . . . , 0T}. In
order to reduce the computational complexities required for recognition and analysis,

this research is restricted to the recognition of isolated words based on the discrete

HMM.

85

4.1 Maximum Likelihood Approach to State Se-
quence Determination

We showed previously the usefulness of the TIA HMM in a spoken digit recogni-
tion problem. Also, the TIA HMM is potentially advantageous in a large-vocabulary
system for computationally efﬁciency. As well as evaluating the likelihood, the TIA
HMM technique can also be useful to ﬁnd a meaningful state sequence of a speech sig-
nal. One way to enhance the performance of speech recognition is to exploit the state
sequence information which is Signiﬁcant to compute more informative parameters

for M during training HMMS. Let us discuss this topic in detail.

4.1.1 Introduction

The Viterbi search technique is used to ﬁnd an Optimal state sequence of a speech
signal based on the criterion (4.1) in the F-B HMM. As well as this widely used tech-
nique, however, there may potentially exist many other search techniques. Consider

one of the possible criteria for ﬁnding state sequence under the framework of the

HMMas
q? = argnggwahqth) forlStST (4.2)
and
Q‘ = {qI.43.---,qi~}- (4-3)

In other words, an optimal state sequence is a sequence of individuals state at each
t which is most likely to produce a symbol at t in conjunction with a state distribu-

tion probability. Without an explicit imposing constraint on state transitions, this

86

criterion is simpler than (4.1) which requires a backtracking procedure.

Actually, this criterion is close to the formulation of the TIA HMM. In fact, the
state searching procedure of the F-B HMM is to Viterbi what that of the TIA HMM
is to (4.3). Apparently, (4.2) and (4.3) can be easily implemented by the TIA HMM

since

q? = It)? P(0t29t I M) (4.4)
= 1%?X{P(0t I qt2M)P(Qt IM)} (4-5)
= lrsniasilcv{bi(0t)$i(t)}- (4-6)

A Bakis condition on state-transition matrix is not “explicitly” involved in in consti-
tuting the legal state sequence by (4.2) and (4.3). Because of lack of explicit imposing
constraint on state transitions, the TIA HMM produce an illegal state sequences dur—
ing operations.

In Chapter 3, however, we have Shown that the recognition performance is com-
parable to the conventional F-B HMM in spite of illegal state sequences. By the
same token, it will be shown here that the criterion (4.2) and (4.3) is also practically
efficient in ﬁnding a meaningful state sequence.

It will be shown that even though the computed state sequence of a Speech ut-
terance by (4.2) and (4.3) is not completely identical to the state sequence by the
Viterbi search technique for all t E [1,T], however, remarkably, the result of this
technique is quite close to that of the Viterbi search in a “global sense” for correct
word. The global shapes of computed state sequences by the Viterbi search and this
new ML in accordance with the TIA HMM are Similar to each other. Moreover, it
will be discussed that this ML technique is a fast and suboptimal method to Obtain a
possible state sequence information without backtracking procedure required in the

Viterbi search.

87

4.1.2 Experimental Results

To see the viability of the search criterion (4.2), (4.6) is applied to ﬁnd a meaningful
state sequence of the four different spoken utterances of “one,” “two,” “four,” and
“six.” The resubstitution and leave-one-out test are performed with these example
utterances. Here, it is assumed that M, for all z' = 1,. . .10 is computed in advance
using the F-B algorithm. Index 3‘ = 1 represents word “one,” and i = 10 represent
word “zero.”

To assess the effectiveness of the TIA HMM at ﬁnding a meaningful state sequence
for speech signals, let us apply (4.6) to the four example utterances.

Figures 4.1 through 4.8 are computed state sequences of four individual testing
word composed of “one,” “two,” “four,” and “six” based on Mg, 3' = 1,...,10.
Figures 4.1 through 4.4 are the resubstitution results and Figs. 4.5 through 4.8 are
the leave-one-out test results for the same utterances. Each ﬁgure is also composed of
ten subplots. For example, in Fig. 4.1, the left top ﬁgure is the raw Speech waveform of
the original spoken digit of word “one.” Below this raw waveform are the state search
results using the conventional Viterbi algorithm with the F-B HMM formulation based
on {A,-, 3,}, where z' = 1,2,4, 7, respectively. For example, in Fig. 4.1, the Viterbi
search result is obtained when the testing word “one” is evaluated by {A1, Bl} which
is the trained HMM for the word “one.” When that utterance is evaluated by different
HMMS other than {A1, B 1}, the Viterbi algorithm does not provide reasonable results
except the cases of mis-recognition. Therefore, the Viterbi state searCh results for
other words not drawn.

The rest of the ﬁgures are state search results based on criteria (4.2) and (4.3)
for M,, i = 1,... ,10 ranging from top to bottom in the left and right columns,
respectively.

Horn the ﬁgures, the following are observed.

0 As expected, the resubstitution method produces better result than the leave-

88

 

 

 

 

 

 

 

 

 

 

4' A L o i
0 1 WC 2000 3000 4000 5000 0 20 40 60 80 1 00

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

o z i g g o . .
0 20 4O 60 8O 1 00 0 20 4O 60 80 100

Figure 4.1: State search results from the conventional Viterbi and Q" = H? q; =
I]? argmaxq, P(Ot,qt I Mi), 2' = 1, . . . , 10 in a ﬁve-state Bakis HMM of a spoken
word “one.” Note that each graph represents a different “i” except top two ﬁgures in
the left column. The tests employ resubstitution.

89

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

oo 26 4b 80 80 100 0o 80 47) So 30 100
Figure 4.2: State search results from the conventional Viterbi and Q“ = H? q; =
I]:~ argmaxq, P(tht I Mg), 2' = 1, . . . , 10 in a ﬁve-state Bakis HMM of a spoken
word “two.” Note that each graph represents a different “i” except tOp two ﬁgures in
the left column. The tests employ resubstitution.

90

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

O 50 100 150 0 50 100 150

Figure 4.3: State search results from the conventional Viterbi and Q” = I]? q,“ =
II? argmaxq, P(0¢,q¢ I M5), 2' = 1, . . . , 10 in a ﬁve-state Bakis HMM of a Spoken
word “four.” Note that each graph represents a different “i” except top two ﬁgures
in the left column. The tests employ resubstitution.

91

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0 50 100 1 50 O 50 100 1 50

 

 

 

 

 

 

 

 

 

0 50 100 150

 

 

......................

...................

   

 

 

 

 

 

 

0 50 100 150 O 50 100 150

Figure 4.4: State search results from the conventional Viterbi and Q" = 11ft]; =
H? argmaxq, P(0¢,qt I Mi), 2' = 1, . . . , 10 in a ﬁve-state Bakis HMM of a spoken
word “six.” Note that each graph represents a different “i” except top two ﬁgures in
the left column. The tests employ resubstitution.

92

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 4.5: State search results from the conventional Viterbi and Q‘ = ['1qu =
HT argmaxq, P(thg I Mi), 2' = 1, . . . , 10 in a ﬁve-state Bakis HMM of a Spoken
word “one.” Note that each graph represents a different “i” except tOp two ﬁgures in
the left column. The tests employ leave-one—out.

93

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

00 2i) 43 80 80 100 0o 26 4:0 So 85 100
Figure 4.6: State search results from the conventional Viterbi and Q“ = I]? q; =
1'1? argmaxq, P(tht | Mg), 2' = 1, . . . , 10 in a ﬁve-state Bakis HMM of a spoken
word “two.” Note that each graph represents a different “i” except tOp two ﬁgures in
the left column. The tests employ leave-one—out.

94

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

O 50 100 150 0 50 100 150

Figure 4.7: State search results from the conventional Viterbi and Q” = fqg =
H? arg max,” P(0¢,qt I M5), 2' = 1, . . . , 10 in a ﬁve-state Bakis HMM of a spoken
word “four.” Note that each graph represents a different “i” except tOp two ﬁgures
in the left column. The tests employ leave-one—out.

95

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

O 50 100 150 O 50 100 150

Figure 4.8: State search results from the conventional Viterbi and Q“ = H? q; =
[If argmaxq, P(0¢,qt I Mi), 3' = 1, . . . , 10 in a ﬁve-state Bakis HMM of a spoken
word “Six.” Note that each graph represents a different “i” except tOp two ﬁgures in
the left column. The tests employ leave-one-out. '

96

one-out method. In particular, the state search result for the correct word with
a resubstitution method adheres more closely to the Bakis constraints than with

leave-one-out test.

Roughly speaking, state sequence results using (4.1) and (4.2) appear to be
useful indicators to pre—ﬁlter the correct utterances for this limited vocabulary.
This implies that, with the likelihood values, but also by examining the se-
quence, it is possible to sort out the possible “candidate” utterances from the
utterance data base. Under a Bakis condition, for example, the state sequence
needs to be monotonically non-decreasing. After applying (4.2) and (4.3) to
the testing utterances, the correct utterance is one of the utterances in which
the computed state sequence is closely consistent with the Bakis topology. Ex-
periments with larger vocabularies are needed to conﬁrm that this is a general

phenomenon.

The global shape of the state sequences found by the conventional Viterbi al-
gorithm and ML criterion using (4.2) and (4.3) for the correct word are similar
to one another. Even though there are some Short-term peaks which represent
jumps to different states and return to the previous states occurring over short
time durations, it is not difﬁcult to estimate a possible plausible state sequence

in light of Bakis tOpOlogy by removing such short-term peaks.

When the correct speech utterance is evaluated, ﬂuctuation in the state se-

quences are relatively scarce.

At a glance, the prOposed criterion does not appear to provide a reasonable crite—

rion for ﬁnding the “legitimate” state sequence since the scheme does not explicitly

impose constraint on Q“ composed of (1:: t = 1, . . .,T according to a Bakis con-

straint. However, the state sequence by (4.2) is close to the conventional Viterbi

search result without “explicit” constraint of state transition paths.

97

The implications of those results can be summarized as follows: First, we know
that the TIA HMM relies on individual state probabilities at each time to dissuade
illegal paths, and it implicitly constitutes available state path information. Thus,
the evaluation through the TIA HMM provides a reasonable measure to classify the
utterances. Similarly to the evaluation of the HMM by the TIA HMM, criterion
(4.2) implicitly has information about a possible state sequence. Like the TIA HMM,
criterion (4.2) only indirectly controls the overall state paths.

Next, the results implicitly Show the relative signiﬁcance of A and B. In the
previous deve10pments, we have assessed the relative Signiﬁcance of A and B in the
HMM scoring process. We have dealt with this problem from various points of view in
the previous chapters. Added to such attempts, the approach to ﬁnd an appropriate
state sequence can also help to Show the relative signiﬁcance of A and B. As a
consequence of the various approaches, therefore, it is not difﬁcult to infer that most
of the information of the training utterances is concentrated in the elements of B,
rather than A, in the F-B or Viterbi reestimation algorithm.

It is concluded that criterion (4.2) and (4.3) provides a simple and fast way to
ﬁnd a meaningful state sequence Of a correct speech utterance approximately without

a Viterbi criterion.

4.2 State Sequence Based on “Acoustic Distance”

In this section, a new state search technique using recursive Viterbi search based on
an “acoustic” distance is presented. There are several advantages of applying this
technique to ﬁnd an state sequence for a speech utterance. They are discussed here,
together with a basic idea and corresponding algorithm. Some results in a practical
application will be presented and those results are compared to the results from the

conventional Viterbi method.

98

4.2.1 Introduction

One of the original motivations for using HMMS in speech recognition is the apparent
congruity between the speech production process and the mathematical dynamics of
the model. Human speech production can be approximately modeled as a process of
dynamically positioning the Speech system “articulators” into physical “states” which
correspond to resulting acoustic outputs. The acoustic manifestations Of these states
are “observable” to the listener, but the physical states are “hidden.” Accordingly,
the states of a HMM are often thought of, to a ﬁrst approximation, as representing
distinct acoustical phenomena in the utterance, such as a vowel sound in a word or
a transition between phonemes in a word. In fact, the number of states in a model
is sometimes chosen to correspond to the expected number of such phenomena. For
example, if an HMM is used to model a phoneme (rather than a complete word),
then three states might be used — one to capture the transition on either end of the
phoneme, and one for the steady—state portion.

However, the HMM organizes itself to maximize an analytic criterion (usually
a ML), and not necessarily to correspond to some preconceived acoustic structure.
In fact, our work has shown that the conventional ML approaches (Baum-Welch
and Viterbi) frequently yield model structures which clearly exhibit little relation-
ship between waveform acoustics and HMM states. See Fig. 4.9 for an illustration.
A more “global” view of this phenomenon has led researchers, notably Ostendorf
and colleagues [61], to seek “segment based” models of the waveform that are more
meaningfully associated with regions of acoustic coherence in the Speech.

In this work, we present a Simple HMM decoding algorithm which seeks a mean—
ingful state sequence by ﬁnding an acoustic similarity among observation symbol so
that an acoustic meaningfulness for the state can be achieved. It is a new and simple
state searching technique for a given observation sequence using only a distance in-

formation within symbols of an utterance so that a state sequence could be consistent

99

(a) Speech waveform of word "six“

 

....................................................

   

 

 

 

 

 

 

 

 

Speech amplitude, s(n)

 

 

 

 

 

Time, n(norm-sec)

(b) Optimal state sequence by Viterbi decoding

I l l !

 

 

 

State label

 

 

 

 

 

 

Figure 4.9: State segmentation resulting from conventional ML (Viterbi) training of
a ﬁve-state Bakis HMM for the utterance “Six.” The resulting segmentation is not
coherent with the physical dynamics of the speech.

100

with the physical characteristics of a Speech waveform.

The work inherently begs the question as to why the conventional HMMS do
not exhibit more coherence between the “acoustic states” of the Speech, and the
analytical states of the model. We provide both qualitative and analytical analyses

of this question and also suggest some implications for HMM performance.

4.2.2 The Concept

Suppose that there is a sequence of speech samples along the time axis and the task is
to ﬁnd a state assignment that explains their production in an Optimal and meaningful
way. From the modeling point of view, the desirable state sequence is such that each
state is a set of entities (symbols) that are acoustically similar. This general concept is
the basis for clustering algorithms [10]. Using ideas akin to a clustering procedure, we
seek an algorithm with which to ﬁnd an state sequence for a given speech utterance.

Generally, clustering techniques are based on the heuristic argument that samples
representing the same cluster should be “close” to one another in the vector space
and “far” from vectors representing other clusters. The underlying assumption is
that the feature vector representing a sample is appropriate and efﬁcient in capturing
similarity among exemplars. The most commonly used clustering strategy is based
on the minimum squared-error criterion where squared-error amounts to the distance
between a sample and the centroid of a cluster in the vector space. The general
objective is to obtain that partition which, for a ﬁxed number of clusters, minimizes
the total squared-error. It is known that minimizing squared—error, or within—cluster
variation, is equivalent to maximizing the between-cluster variation [10]. In general,
a clustering method employs an iterative algorithm to Optimize a clustering criterion
function. Various criteria have been suggested in the literature, but among these,
the family of criterion functions quantifying the average afﬁnity of feature vectors to

cluster representatives have proved to be most useful [18].

101

In a discrete HMM system, let 0 = {01,02, . . . ,OT} be a sequence of a quan-
tized Speech symbol string for which we are interested in ﬁnding a meaningful state
sequence. For each 0:, 1 g t S T, there exists a corresponding M -dimensional fea-
ture vector x, in the vector space and thus we have a set X = {x1, x2, . . . ,xt, . . . ,xT}
associated with 0. In fact, x, amounts to a vector representing a centroid of a cer-
tain cluster in the codebook since index 0, is the result of quantization of a signal.
Frequently met-cepstral coefﬁcients are used for elements of x, of when processing
a speech signal. Also, let 8 = {1,2, . . .,N} be a ﬁnite set of sequential natural
numbers, each representing a state. The task is to associate X with a sequence
Q = (q1,q2,...,q1~), qt 6 S, in a meaningful way based on a given optimization
criterion.

For this task, suppose that we initially have N partitions for X and let us denote

them as C = {6(1),C(2), . . . ,c(’°), . . . , C(N)}. Therefore, cm is also a set and it has nu“)
entries as
k _
C( ) __ {x2:;11n(j)+1, . . . , $Z:=l 110)} (4.7)
so that
N .
Zn“) = T. (4.8)

i=1

Simply, let us have an initial state segmentation such that the entire symbol
string is divided into N segments of approximately equal lengths. How the ini-
tial segmentation effects the performance will be brieﬂy discussed later. Also, let
9 = {m(1),m(2), . . . ,m‘"’} be a set representing the centroids of clusters 60:), k =
1, . . . , N. Let the distance between z, and m0“) of cluster k be denoted by d(:rt, mm).
Here k E 8 so that eventually a sequence of states is denoted in terms of sequence of

clusters.

102

Then, we have three sets as 9, X, and D. Where D is deﬁned as

'D = {d($t:m(k))}lstsn ISkSN° (49)

Note that since the initial number Of clusters may not be the number of clusters, the
condition of merging of clusters may be imposed in the clustering algorithm. In that
case, the number of clusters can be less than N.

Now the task is to seek a meaningful state sequence Q" = {q1, Q2, . . . ,qT}, q E 8
under an optimality criterion.

Using the clustering algorithm, we seek a sequence Q" based on the criterion

T
Q’ = argm3n2d($t,m(q‘)) (4.10)

t=l
= argmgn dllma (4.11)

where Q = {q1,q2, . . . , qT}, dug] = 2;, d(zt,m(9‘)). It is known that the Euclidean
distance between two cepstral vectors representing features is a reasonable measure
of spectral similarity in the models [4]. Hence, let d(a:t,m(9‘)) = [In — m(‘1*)II2. Addi-

tionally, the constraint

QISQ2S-HSQT—ISQT (4-12)

is required for the Bakis model. This algorithm iterates until Q" converges to a stable
state sequence. The algorithm is based on the fact that there is high metric Similari-
ties between the components within the same cluster and high metric dissimilarities
between different clusters.

In develOping this technique, we assume a discrete HMM and thus a Speech signal
is assumed to be quantized. However, in fact, x, can be replaced with the unquantized

cepstral feature vector in this deve10pment since the algorithm is based on distance

103

information between feature vectors. Usually the speech frame vectors are quantized
by the LGB algorithm [4] which is based on the k-means algorithm. Basically, It-
means algorithm is one of the ways of hierarchical clustering [4]. In a hierarchical
clustering procedure like k-means, theoretically there does not exist ordered relations
between objects that are located at different places in the same layer in the hierarchical
tree. However, in our deve10pment, we use a partitional clustering method which is
used frequently in engineering and science for problems in which single partitions are
important. Therefore, it is signiﬁcant to check the proximities among symbols in the
codebook which are encoded by hierarchical clustering.

Figure 4.10, for example, shows the Euclidean distances between the features
representing for diﬂerent symbols such as “zero,” “32,” “64,” and “96” and the rest
of the features representing the symbols derived from spoken digits with cepstral
features quantized to 7 bits (128 levels) by LGB. For the other symbols other than
the above four symbols, we see similar pictures to Fig. 4.10.

The ﬁgure Shows that although the distances are not completely ordered as the
symbols assigned to the feature are ordered, the Euclidean distance is another mean-
ingful indicator to Show the proximities among symbols. Here symbols are encoded by
hierarchical clustering. Therefore, the quantized symbols may be classiﬁed according
to the partitional clustering method using the Euclidean distance between features.

The algorithm for ﬁnding a meaningful state sequence based on (4.10) through
(4.12) is as follows:

1. Initialization
One of N states is initially assigned to each feature in X = {x1, . . . ,xT}. X
can be divided into N approximately equal segments. Each components of a

given segment iS assigned the integer index of its associated segment, say,

{x1,x2,...,xn,} I—) 1

104

 

Distance between symbol 0 and the others Distance between symbol 32 and the others

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

600 f . 90° . i
800
500 >-
700 ,_ .
g 400 . ................................... , g 600
5 % 500
c 3m ................ c
'8 g 400
El 200» m 300
E 200
100 .. . ...........
o i i o
o 50 100 150
Symbol
Distance between symbol 96 and the others
400 700 T .
350
300
§2501 i
S 8
.9 s
‘O ‘O
5 200 g
8 s
8150 8
m m
100 -
50
o i i i i
o 50 100 150 o 50 100 150
Symbol Symbol

Figure 4.10: Euclidean distances between four different symbols (symbol “zero,” “32,”
“64,” and “96”) and the rest Of symbols indexed along the abscissa in the codebook.
Each symbol represents a centroid of the cluster in the feature vector Space.

105

{xm+1,xm+2,...,xn2} 1") 2 (4.13)

{an_,+1,x,,~_,+2,...,xT} I—> N.

Also, let

91={m(1),m(2),...,m(m} (4.14)

be the set of centroids of the initial segments. Therefore, the elements of 91 are

 

given by
,,.
mm = zen-1+1“ (4.15)
Tbj — nj._1

for j E [1,N] with no = 0 and nN = T. Also, if Bakis tOpOlogy is concerned,

which is often employed in speech recognition, let

(II = 1. (1:? = N. (4.16)
. Recursion
Forl=2,3,...,
Fort=2,3,...,T-1,
where
q: = argrrlilin d($t,m("‘)) and q{_1 S q{. (4.18)
t
Next t

106

Recompute C, 9,D with reassigned elements. If g, = 91-1, go to step 3. Oth-

erwise, Next I .

3. Termination

Q‘ = {qI.q5.---,qi~} (4-19)

Note that except for the initial cluster, the only required data for the recursive Viterbi
search based on k-means clustering comprise the set of distances between centroids
and a features.

To give the algorithm a probabilistic ﬂavor, consider a mapping which transforms

distance measure to a probabilistic measure by using the mapping
P = f(d) = e-4 (4.20)
so that
P(q. = i) = e-dmm‘”). (4.21)

This transformation makes the above algorithm more like a ML formulation of state
search algorithm Similar to the conventional Viterbi search which is based on the ML
criterion. In (4.20), it is straightforward to see that as d(:ct, mg) —+ co, the probability
that qt is i approaches zero, while conversely, as d(:z:t, m.) —-) 0, the probability that q;
is i approaches unity. This is acoustically reasonable because in the high dimensional
vector Space, a small distance between two cepstral features vectors implies that

the corresponding acoustic frames are very similar acoustically and, thus, should be

107

assigned to the same state. Therefore, (4.10) with constraint (4.12) can be written as

T
Q‘ = argmgXIIP(qt=ic)
t=1

T
= arg mgx 1'] (“the“) (4.22)

t=1

_ T (q)
= argmgxe Zt=1 d(z:,m t)

= argmax e’dl‘le
o
where it 6 {1,2, . ..,N} for all t. This expression could be useful to assess the
relationship between the conventional Viterbi search technique and the technique
above. We will have more to say later about the relation between the conventional

Viterbi technique and the recursive Viterbi search technique suggested here.

4.2.3 Recursive Viterbi Search Based on k-Means

The sequence Q” can be computed by a conventional k-means algorithm [9]. However,
in the case of speech Signals where time ordering of speech samples is important
associated with a state sequence constrained by Bakis topology for example, the
conventional Viterbi search technique is an apprOpriate way for ﬁnding a meaningful
state sequence. Therefore, let us apply the conventional Viterbi search technique
in ﬁnding Q" in our problem. The Viterbi searching is engineered to ﬁnd an path
which satisﬁes (4.10) and (4.12) under the Bakis topology. Here, note that in contrast
to an ordinary Viterbi decoding technique, the prOposed approach requires Viterbi
technique to be applied iteratively until Q‘ converges to a stable state sequence. This
is because the algorithm requires arbitrary initial clusters and the algorithm is based
on the squared-error between a feature and a centroid. Therefore, the resulting state
sequence may be different at each iteration depending on the members of clusters.

This recursive Viterbi search employs k-means to ﬁnd a meaningful state sequence

108

using the distance information among symbols. The algorithm is described in the
following way. Here, we need two more variables d,( j), 1M j) for this deve10pment.
Here d,( j ) stands for a sum of distances between the feature and a centroid of cluster
from t = 1, . . . ,t over the bestpath until it reaches state j at time t. 10,0) represents

a state at time t — 1 which corresponds to dt(j).

1. Initialization

Create initial clusters as in (4.13), and let

c1 = {C(11),C(12),...,C(1N)}. (4.23)

The centroids of N-clusters comprising the set 91 = {mIl), mjz), . . . ,mIN)} are

given by (4.15). Furthermore, let q'f = 1, q} = N as before. Additionally, the

initial values of the variables d¢(i) and 111,0) are given by

(11(1) = 0, (11(2) = 00 for 2 g i S N (4.24)

(01(3) = 0, Vj=1,...,N. (4.25)

2. Recursion
Forl=1,2,...,
Fort=2,3,...T, and forj =1,2,...N,

dtU) = lgiSnN{dt-l(i)+d($taml(i)l)} (4-26)
28(2) = arglrsrgian{dr—1(i)+d($t,mz(’;)1)} (4.27)

= magnesia}

109

Next t

(if?) = 1g;.i<nNd«r(j)=aI/r(N) (4.28)
(1% = N. (4.29)
Backtracking:
q; = ¢,(q{+1), t=T—1,T—2,...,1. (4.30)
Q? = {quSr-qqil (4.31)

Recompute C; = {cIl),cI2), . . . ,ch)}igr = {mIl),mI2), . . . ,mIN)}. If 9; = 91-1,

go to step 3. Otherwise, Next 1.

3. Termination:
Q’ = QI={qI.QS,---.Qi~} (4-32)

The probabilistic version of this algorithm would use (4.20) in place of the direct use

of the metric.

4.2.4 Experimental Results

In contrast to the conventional Viterbi technique or the state search with a criterion
(4.2), the proposed technique does not require any speciﬁc a priori knowledge like
{A, B} from the training utterance set. Additionally, this technique does not have
mismatch problem between the training and testing data. Only an initial segmenta-
tion information is required. Therefore, this technique can be applied universally to

ﬁnd a state sequence for a speech utterance.

110

The Relationship between Likelihoods and the State Path in the conven-
tional Viterbi Technique

First, let us consider the effects of an initial values of the two signiﬁcant matrices
in the PB HMM. The state segmentation results using the resubstitution method
for the word “six” is Shown Fig. 4.11. Each graph displays a state sequence and
the corresponding likelihood for a different set of initial values for A and B by the
conventional Viterbi search technique. For the numerical purpose, — loglo P(O I M)
along with a Optimal path is computed.

Figure 4.11 shows the impact of varying the initial values of A and B. In addition,
it shows the variable relationship between a state sequence and its corresponding
likelihood quantity. For example, although the likelihoods of two results are close
to each other, the state sequences could be quite different. Also, the state sequence
which has largest likelihood (the second ﬁgure) may be not coherent with the acoustics
of the utterance.

In an extreme case, the state sequence can be poorly matched to the acoustical
pattern of the corresponding speech utterance. This is because in the EB HMM,
for example, the likelihood is composed of T—multiplications of pairs of a,-,- and bj(k).
Under a Bakis topology with N states, a)”; is always one and it is highly probable
that “1,1 is close to one and it is greater than (1,.-,5, i E [2, N — 1]. To have large
likelihood under T-multiplications of pairs of probabilistic terms lying between zero
and one as (2.13), each aj, and bJ-(k) needs be as close as to one as much as possible.
Therefore, when the HMM is reestimated from the lengthy training utterances, the
HMM is, if possible, conﬁgured to have many initial (1) state and last (N) state when
structured according to the ML criterion in certain case. Accordingly, A and B could
be trained to assign as many initial and ﬁnal states as possible without considering a

congruity of the state path with the acoustic “states” of the utterance.

111

 

g

 

 

 

 

l
N

 

s(t), Speech Amplitude
o
l l
l i
{I
Ir
Il

 

 

 

 

 

 

 

 

 

 

 

N J5 O) O N «h 0) g
I
l

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

l m i 4' l
o 20 40 so so 100 120
I l r l I
_ ........................... ................ ' ....... f ...................... _
; h /
_ ..... / ............................... -
__l : Likelihood a 114.4
0 1 1 l l l
o 20 40 so so 100 120
6 I I l I I
4 I
2
o
s
4
2
o
3 . . . . .
o4 ‘ ' ‘
E f : : : :
W2... ............ .......... . ..... f ...... ............. i ............... ..
._I g g ; Likelihood: 123.3 g
o l l I l l
o 20 40 so so 100 120

Tune, t(msec)
Figure 4.11: State sequence for the spoken word “six” by the Viterbi search technique

of a ﬁve-state Bakis HMM. Five different sets of initial values have been assigned to
A and B.

112

State Search by Recursive Viterbi Based on k-Means

Let us apply the recursive Viterbi search based on k-means clustering to ﬁnd a mean—
ingful state sequence corresponding to a Speech utterance. In particular, the spoken
utterances “four” and “six” are considered as examples. Figures 4.12 and 4.13 are
the resulting state sequences with the conventional Viterbi technique and recursive
Viterbi search based on k-means clustering. The ﬁrst graph in Fig. 4.12 is the Speech

”

waveform for the word “four. The second ﬁgure is the result of the conventional
Viterbi search technique. The third ﬁgure is result of the recursive Viterbi search
based on k-means clustering. In addition, we need to compare the likelihoods of
the conventional Viterbi and the recursive Viterbi search based on 112-means cluster-
ing. Since probabilistic likelihood is not a formal measure to the proposed recursive
Viterbi search technique, it is necessary to have state sequence information as well
as likelihood of the recursive Viterbi search based on k-means clustering in terms of
the conventional Viterbi search technique. For this purpose, initially, the recursive
Viterbi search based on k-means clustering is applied to all ﬁfteen training utterances
to supply a state information. Then, by counting the symbols in each state, two ma-
trices A and B of the HMM are constructed. From these matrices, we can apply the
conventional Viterbi technique to ﬁnd a meaningful state sequence and corresponding
likelihood. These are shown in the fourth graph of Fig. 4.12. The shapes of the third
and last ﬁgures are frequently identical.

For utterance “four,” the resulting state sequence by the conventional Viterbi
technique and the recursive Viterbi search based on k-means clustering are not very
different. This is because there are too many states in the model. However, for
word “six,” it is apparent that the resulting state sequence by the recursive Viterbi
search based on k-means clustering is more consistent with the acoustic prOperties
the utterance. This is because the Viterbi algorithm has been focused on the ﬁnding

a state sequence which produces 3 ML criterion rather than considering the acoustic

113

Speech Wavefom for word "four”

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

§1°°° I I I I I.
E. $0,. ........................... I ............................... _.
g o ... A‘ I' I l
- "WP ........................................................................ ...
%_1m 1 J l L J.
0 1000 2000 3000 4000 5000
Time,t(msec)
6 I I I I I
4,. ................ .................. ..... H ..... .......................... .I
2- ................ /__/ ....... .......................... _
I } Viterbi search from the F—8_ HMM
0 1 I l i 1
0 20 40 60 80 100
6 I I I I I
4i— ...... /____/ .......................... ..
2,. .................. fl .......................... ..I
Recursive Viterbi based ongk-means _
o I I I I I
0 20 4o 60 80 100
6 I I I I I
4._/——/V ......................... "I
2h— ................ fli; ................. ......... —1
- E Viterbi search from third graph :
o I I I I I
0 20 40 60 80 100

Time, m(norm-sec)

Figure 4.12: State sequences for the spoken word “four” in a ﬁve-state Bakis HMM.
The second ﬁgure is the consequence of the conventional Viterbi search. The third
ﬁgure is the result of the recursive Viterbi search based on k-means clustering. The
4‘“ graph is the conventional Viterbi search result based on the third graph.

114

Speech Wavefom for word "six"

 

 

 

 

0
“O
3
.2
E
(I)
l l l l
—2000 ' ' 4

 

0 1000 2000 3000 4000 5000 6000 7000
Time. t(msec)

 

6 I I I I I I

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

4.. ......................................................................................... .I
2.. ............. ................... . ....... . .......... _I
i [ Viterbi search from the F—B l-iMM
o I I I I I I
0 20 40 60 80 100 120
6 I I I I I I
4- ...................................... 4/—J ....................... ..I
._ ............................... I ......... E ............. . .............. .......... _
g Recursive Viterbi based on k-f-means
o l l L 1 L I
0 20 40 60 80 100 120
6 I I I I I I

 

 

State
&
I
I

 

 

 

 

2). ............................. /. .............. .............. s .......... .-
EViterbi search from third graph

0 I h' I I I I

0 20 40 60 80 100 120

Time, m(norrn-sec)

Figure 4.13: State sequences for the spoken word “six” in a ﬁve-state Bakis HMM.
The second ﬁgure is the consequence of the conventional Viterbi search. The third
ﬁgure is the result of the recursive Viterbi search based on k-means clustering. The
4‘“ graph is the conventional Viterbi search result based on the third graph.

115

structure of the speech signal. On the other hand, the recursive Viterbi search based
on k-means clustering is mainly focused on acoustic features of speech Signals.

The likelihood quantities for all ﬁfteen training utterances of “four” and “six”
resulting from the conventional Viterbi technique and the recursive Viterbi based on

k-means clustering are shown in Table 4.1. Table 4.1 shows that frequently we can

 

 

 

utterance “four” “six”
Viterbi k-means Viterbi k-means

1st 70.7 69.5 134.4 131.7
2nd 80.6 74.9 173.4 164.5
3rd 71.2 68.9 150.2 143.4
4th 70.1 65.8 124.9 126.3
5th 55.2 51.7 127.8 122.4
6th 65.9 63.7 120.8 119.0
7th 134.3 106.2 145.3 145.0
8th 79.6 76.1 128.4 129.2
9th 78.6 73.0 138.7 134.0
10th 75.6 74.1 136.4 134.1
11th 84.8 84.4 145.8 138.7
12th 82.7 80.0 118.1 116.5
13th 85.9 83.0 141.6 138.3
14th 71.8 74.3 106.5 103.3
15th 65.2 67.6 135.9 139.5

 

 

 

 

 

 

 

 

Table 4.1: Likelihoods of the conventional Viterbi search and the recursive Viterbi
search based on k-means clustering.

ﬁnd a state sequence which not only is acoustically better consistent with a speech

utterance, but also has more likelihood than the conventional Viterbi search.

Discussion

We have focused on an HMM training algorithm which seeks to Optimize acous-
tic meaningfulness of the HMM in the sense of minimizing a Euclidean distance of
acoustic Similarity among observations assigned to given states. This work differs

from that of Ostendorf et al. [61], not only in its focus on a more “localized” view of

116

the waveform (frame processing with the explicit goal of training an HMM), but more
importantly in the explicit attempt to Optimize acoustic state-wise similarity rather
than Optimally segment the waveform using conventional ML. Similarly to the con-
ventional methods, the new algorithm can also be employed in recognition strategies
which assign scores based on acoustic match. The new method inherently provides
apprOpriate dynamic time warping of training and test strings, and is readily modiﬁed
to Optimize over multiple training sequences.

It can also be used to objectively and dynamically guide the selection of the
number of model states, the need for state merger, and to assess certain changes
in topology on-line. Most importantly, the new method is provably convergent to a
local minimum of acoustic mismatch, and regularly provides HMMS with meaningful
relationships between states and the acoustic content of the speech that the HMM
represents. Also this proposed work is different from the segmental method of states
by k—means segmentation [11, 33] Since the prOposed technique is based on the distance
metric between centroids and a sequence of symbols of speech. However, in [11, 33],
the segmental method is based on the F-B HMM which has parameter matrices
{A, B}.

Clustering, an unsupervised learning technique, has been widely applied to prob-
lems in pattern recognition and classiﬁcation [9, 10]. In speech signal technology, this
technique has been particularly applied in generating a codebook for Speech coding,
classifying a Speaker, and acoustic modeling [4, 11, 12, 13, 14, 15, 33].

First, in contrast to the Viterbi algorithm which requires a priori knowledge M,
the recursive Viterbi search based on k-means clustering does not require any in-
formation. Knowing a state sequence without a priori knowledge has a few useful
implications. The state information for an utterance can help to exploit and adopt
various linear time-invariant system techniques to the speech processing technology

by obviating the modeling of the time-varying dynamics.

117

Next, although the quantized symbols have been used in the deve10pment of the
algorithm, such quantization is not a requirement of the method. The cepstral feature
vector is directly involved in the computations. Therefore, the quantization distortion
can be avoided.

Finally, the relation between the conventional Viterbi and the recursive Viterbi
search based on k-means clustering may be interesting ultimately although it is not
clear from the present study. The hard part of this analysis is to uncover how the
distance information between centroids and feature vectors is dynamically distributed
in the M = {A, B, 1r}. In the F-B HMM training, the dynamics or characteristics of
an utterance are transmitted to aj; and bJ-(Ot) by the Expectation-Maximization [24]
method. For this method, the HMM requires quite many recursive iterations as (2.57 )-
(2.62), which is not easily analyzed. The relation between the conventional Viterbi
technique based on F-B HMM and the recursive Viterbi search based on k-means

clustering is left for future research.

4.2.5 Appropriate Number of States

A clustering algorithm based on k-means helps to determine an effective number
of states in a meaningful way [10]. Determining the number of states is also an
important issue in designing a HMM [4, 16]. Generally, the number of states is
determined roughly based on the expected number of identiﬁable acoustic phenomena
in the utterance. For example, for a word model, ﬁve to ten states are often assigned
to capture the phones in the utterances. For the phoneme model, three states are
assigned for a HMM to model discrete phones. Since the recursive Viterbi search
based on Ic-means clustering is conﬁgured. to ﬁnding an state sequence which considers
acoustical characteristics of a Speech utterance, this technique can be exploited to
automatically determine the appropriate number of states in the HMM.

One way of ﬁnding an appropriate number of states according to the sequence of

118

feature vectors of a speech utterance is found in [10]. AS a basis for further deve10p—
ment, we review this technique brieﬂy.

There are a few ways of computing an appropriate number of clusters. Among
them, the Davis-Bouldin (DB) index provides a relatively Simple way of deciding an
appropriate number of clusters [10]. It has been veriﬁed that the index does not
depend on either the number of clusters nor the clustering method [10]. This index
is as follows:

Given a partition of T objects into N clusters, one ﬁrst deﬁnes the following
measure of within-tO-between cluster spread for all pairs of clusters (j, k) as

6j+8k

R- =
1.1:
"W:

 

(4.33)

where e, is the average error for the 3"" cluster and mg: is the Euclidean distance

between the centers of the j", and kt), clusters. Where the index for the kt), cluster is

deﬁned as
R = R ' 4.34
I: Iglaxmf .k} ( )

and the Davies-Bouldin index for the N -cluster clustering is deﬁned as

1

N
N Z R), for N > 1. (4.35)

k=l

DB(N) =

DB(N) will be small for good clustering. The index is supposed to decrease mono-
tonically as N decreases until the “correct” number of clusters is achieved for well-
clustered data. The DB index is plotted against N and clustering is stOpped when
the index iS apparently minimized.

The DB(N) index can be easily implemented by the recursive Viterbi search

based on k-means clustering because such a search technique employs the criterion

119

of minimizing the sum of Euclidean distances between the centroids of clusters and
a sequence of feature vectors for a speech utterance. The distance information is the
main element in DB computations. By using this method, let us ﬁnd an apprOpriate
number of states in our digit recognition problem.

DB(N) versus N for the spoken utterances Of “one, two, four,” and “six” are
displayed in Fig. 4.14 through 4.17 respectively. For word “two” and “four,” we
see that three states are adequate. In case of word “one” and “six,” ﬁve states are
suitable in the HMMS. We can apply this method to the other words. For the spoken
digit recognition problem, therefore, ﬁve states will be apprOpriate. This result iS
consistent with the fact that conventionally we adOpt four to six states for modeling
a word in the HMM based on ML criterion.

Besides determining an adequate number of states, the DB (N) index also let
us know whether the Objects are well-clustered or poorly-clustered. As explained
previously, in the case of the well-clustered data, the DB(N) index decrease mono-
tonically as N increases until it reaches the “correct” number of states and then it
increases. For arbitrary random data, such a trend does not occur [10]. For our
example utterances, as shown in the ﬁgures, word “six” is better-clustered than the
other utterances.

All the other spoken digits have Similar patterns of DB(N). In addition, the
recursive Viterbi search based on k-means clustering is a meaningful technique for
ﬁnding a meaningful state sequence of a Speech utterance.

In the HMM, although the number of states of the HMM can be ﬂexible, having
an adequate number of states is signiﬁcant for enhancing the performance of Speech
recognition system by tuning each model to more faithfully match the acoustic prop—

erties of its corresponding Speech signals.

120

Davis-Boulding relative indices for the fifteen different word “one“

I _I r 092 ................ . ..........................

I . 0. e . "' ... e I

0%. ................ ..................... ,, .__ _
. n .' 1‘ e .

 

 

. .- . . . 0,9 ...... .3 .......... .......... I
0.94 ........... 1..-' ...... . :1 ........ ......... .. '. .-

0'92 088 ......

\ .r. ....... I ......... 2 .......... ...... . :2...
0.9: ' 0-85 ;\ ; :

 

2088-V~\ .......... ..... ' ........... 0.846 ........ .\ ........ 3 ........
0.82 ......... ......... \'. ......... I

I
DB(N)
/

270.86....IV’ \.., ........ ..... ......... .
0.84 ........ "VT . . .‘. -~.- .\ ...... ..... _ ..... 0.3 ........... ......... ..... \. ... ..........
0.82». ....... \\ ..... . ..... 078 \z .........

0,8 ...... ... ....... :4; . . , x ..... g ........... 0,76 ...... .4 ........ ....... § ........ _.,

0.78I ---------- ---------- -------- \~f-- ------ -I 0.74 4”

 

 

 

 

 

 

N, Number of States N, Number of States

 

 

 

 

 

 

 

 

 

N, Number of States N. Number of States

Figure 4.14: Davis-Bouldin relative index for ﬁfteen utterances of spoken word “one.”

121

Davis-Boulding relative indices for the ﬁfteen different word "two'

0.95

Figure 4.15: Davis-Bouldin relative index for ﬁfteen utterances of spoken word “two.”

 

............................. :........-q
. 1
........... ../.
' i /'
: : I]
. . ..4‘ ./ /

...... s .,,/1,..
1 .,, """""
E E/

.......... I,
: / '

I /
........... /
.’2" 4 2/
/~ ‘-
/ 2
..l ....... i ............................. ‘
I L3 -

 

 

 

 

N, Number of States

 

 

 

 

N, Number of States

 

0.95

0.75

 

 

 

0.95

0.9

0.8

 

N, Number of States

 

0.75 .

: .’:'\, .
. ...... .: ............ j.:....\..../...

I I I /(.

I I / \

: .’ I

I I I]:
........ .\./....d

: : ll .

: : / ‘

: :I .I.

: , 4' I.-'
r .......... f, ...... ‘1....".‘ ...............

 

 

 

N, Number of States

 

Davis-Bouiding relative indices for the ﬁfteen different word 'four“

1.1

1 .05

0.8

0.75

0.7

1.05

Figure 4.16: Davis-Bouldin relative index for ﬁfteen utterances of spoken word “four.”

 

I V

..........................

........

 

 

 

N, Number of States

 

.........................................

 

 

 

3 4 5 6 7

N, Number of States

0.85

0.8

DB(N)
P
N
or

0.7

0.85

1.05

0.95

 

...................

 

 

 

N, Number of States

 

 

 

'ch 1'_

 

 

 

3 4
N, Number of States

Davis-Boulding relative indices for the fifteen different word 'six"

 

3

1.8

1.6

 

1.4

 

 

N, Number of States

 

 

2.8 ......................................
2.6 ,. . . . ................... I
\ -_ I
\ -'
24 ...... \. .\. ......................
\
g \1
2‘ 2.2 .......... -. . .\... .......................... I
{5 V.
o '\.

 

 

 

Figure 4.17: Davis-Bouldin relative index for ﬁfteen utterances of spoken word “Six.”

N, Number of States

 

124

2.3 .
2.2 '
2.1 -

2 .

DB(N)

1.8
1.7
1.6

1.5

 

2.4 -

1.9

...............................

..........................

.....................................

 

 

 

 

N, Number of States

 

 

 

 

\ .
3.2...\ ....... I ............................ ,
\ E .
3, ..... \.\ .......... .‘ ................. I
2.8 ......... \\ .......... 1' ................... .I
i\ T
2.6 ........... \ ...... .......... ...........
2 \ z ,4
62.4 ..\._ ............... \\’(‘\*,
o \, z’ '
....\. ............ l ..................
2'2I ‘.: f .
V, : t
2I .......... ;..\.. ....... f .......... Z .......... 1
" ' \. I I d
1.8 ................. \ ....... ’. .1... ., ’ .4
\L.’ —' l
1.6 ...........................................
3 4 5 6 7

N, Number of States

4.2.6 Remark

First, let us examine the effect of initial clusters for the recursive Viterbi search based
on k-means clustering. Different initial partitions can lead to different clustering
results when the prOposed algorithm is applied. This is because algorithms based on
squared-error can inherently converge to local minima. This is especially true when
the clusters are not well-separated. Therefore, in choosing the initial partitions, if
possible, it is better to locate them far away from each other. Also one way to
overcome the local minimum problem is to run the state searching algorithm with
several different initial partitions. If they all lead to the same ﬁnal partitions, we can
be more conﬁdent that the global minimum of squared-error has been achieved.

Figures 4.18 and 4.19 are a set of several clustering results for word “four” and
“six” with different initial clusters. For the word “four,” we see that with the given
initial clusters, there are two different patterns for the state sequence. For the word
“six,” we see more diverse patterns. Although the resulting state sequences are various
depending on initial state partitions, they are still more consistent with the acoustical
prOperties of a Speech signal than those obtained by the Viterbi technique.

Also in this technique, it is possible to adjust the number of clusters by imposing

a criterion on the state search algorithm so that it can be merged or split as in the

ISODATA method [4, 10].

4.3 State Search by Set-Membership Identiﬁca-
tion

4.3.1 Original Thoughts about Exploiting the SM ID

In the HMM, no constraints are imposed on the speech signals (observations) at

different time epochs except a nonstationary Markovian assumption. As another

125

Speech Wavefom for word 'four' Recursive Viterbi Result based on distance metric

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

g 1000 w . v . f .
3 mo ..............
E o .............. 4
m _m ..................................

‘” -1000 . - ‘ * . -

0 1000 2000 3000 4000 5000 100

Time, t(msec)

L ...................................... .1

100

100

 

 

 

 

 

 

 

 

 

1ill..‘ ;;;';
020406080100 020406080100

 

 

Figure 4.18: State search results for word “four” using the recursive Viterbi search
based on k-means clustering with different initial clusters.

126

Speech Waveforn for word 'six' Recursive Viterbi Result based on distance metric

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2°00 ﬁr Y 5 v

3 ‘ ' .
g t :
E. 4 .......... . . . . ........ .......... 1
g 3 ...... . ....... l ....... ............
U) 2 ............................ . ............
m 2000 ; .' z 1 ; q

0 2000 4000 6000 0 50 100

5 5

4 4

3 3

2 ..
1 L i 1 ; ;
0 50 100 0 50 100

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

   

 

 

 

 

 

 

 

50 100 0 100

Figure 4.19: State search results for word “six” using the recursive Viterbi search
based on k-means clustering with different initial clusters.

127

approach, for signals which cannot be modeled by stationary processes, for exam-
ple speech, nonstationary autoregressive (AR) processes also have received consider—
able attention [77, 78]. In hidden ﬁlter model (HFM) approach, for example, 0, is
not only conditioned by Markov transition probabilities, but also is dependent on {

Ot_1, . . . ,OH, } for an apprOpriate p so that
0t = 9i(1)0t—1 + 0i(2)0t—2 + . . . + 0i (p)0t_p + 65(t). (4.36)

Here 03' = [ 9,.(1) 9,.(2) ... 9,0,) ] is the vector of AR coefﬁcients for state i and
the driving sequence e,(t) is i.i.d Gaussian with mean u,- and variance of. The AR
model parameters are made conditional on the state of the Markov chain 2'.

The original motive for exploiting the SM ID in speech recognition is taking a type
of “ﬁltering approach” which considers both stochastic and deterministic aspects of
observations at the same time in speech recognition. The prOposed model was thus
called the hidden ﬁlter for output probability distributions (HF-OPD).

For the development of HF-OPD, the TIA HMM was employed. Applying z-

transformation to (2.67) and (2.68), we have an input-output equation such as
y(2) = B(zI — A)‘1u(0)6(z). (4.37)
For convenience, (4.37) has been transformed to
B(zI — A)-1u(0) = D;1(z)NL(z) (4.38)
using matrix fraction description so that (4.37) can be represented as

DL(Z)y(z) = NL(Z)6(Z)‘ (4-39)

128

 

By inverse transformation, the corresponding temporal difference equation becomes
Aoy(t) + . . . + ANy(t — N): B16(t — 1) + . . . + BN6(t — N) + e(t) (4.40)

with the added noise process. An estimated output probability distribution at each
time was used as training data for this dynamics. To accommodate a more reasonable
noise assumption, and to prevent excessive computation, a SM ID [81, 82] technique
was proposed for this process.

However, it turned out that the SM ID approach using the TIA HMM is not easy.
The main reason of the difﬁculty of applying the SM ID to the TIA HMM is the noise
bound s(t). Since the SM identiﬁcation is based on a bounded-noise assumption, it is
very important to have tenable noise bound. It is well known that noise bounds have
signiﬁcance inﬂuence on the performance of the SM identiﬁcation [89, 90, 91, 92].
However, it was revealed that it is not easy to ﬁnd an informative error bound when
the training data for the SM ID is a probability distribution vector as y in (4.40).

For example, if m(n) is a state vector and :i:(n) is an estimate of m(n), and if
e(n) = m(n) - a‘:(n), (4.41)
we have
||8(n)||§ S 2- (4-42)

Thus, the error bound of the input-output equation from the TIA HMM is too trivial
and large. Similarly to this case, we had more difﬁculty in ﬁnding an informative
error bound for e(t) in a multivariable system as (4.40).

However, we found that the SM ID is still useful to identify a preliminary state

sequence of speech signals because of its inherent characteristics of identifying the

129

 

dynamics of a model.

4.3.2 Background of the SM Identiﬁcation

The SM identiﬁcation was pioneered by Schweppe [81], Witsenhausen, and Bertsekas
and Rhodes [82] in the late 19605, in the domain of control and system science. In
recent years, SM-based signal processing has been receiving considerable attention and
has become increasingly popular around the world; especially for the “bounded error”
(BE) problem which aims at characterizing the set of all parameter vectors consistent
with prior bounds on the errors between the measurements and model outputs. These
BE algorithms can be combined with various forms of least square error (LSE) signal
processing algorithms with beneﬁcial consequences [83, 84, 85, 87, 88, 89, 90, 91, 92].

Suppose that there is a general ARX (Auto-Regressive with Exogenous input)

model,
to 4:
yn = Z aiyn—i + z: bjun—j + ”it (443)
i=1 j=0
= O'Txn + 127, (4.44)
where 0” = [a1, . . . ,ap,bo, . . . ,b,,], x: = [yn_1, . . . ,yn_p,un,un_1, . . . ,un_q], yn and un

are measurable outputs and inputs, respectively, and un is an unknown noise process.
Let m = p + q + 1 and assume that for each time n, on is bounded by a magnitude,

i.e.,

of, < 7,, (4.45)

where {7"} is a known positive sequence.
Let w(n) be a parameter set at time n such that all 0 E w(n) are feasible parameter

estimates of the model which are consistent with the error bounding in (4.45). In

130

conjunction with the model of form (4.44), m(n), which is a “hyper-strip” region, can
be expressed as
m(n) z {0 : 0 E Rm, (3111 - ngn)2 S 7n}

which, when intersected over a given time range t E [1,n], usually form convex

polytOpes of feasible parameters

In general, <I>(n) is an irregular convex set and hence it is difﬁcult to describe
and track. But in conjunction with the WRLS (Weighted Recursive Least Square)

processing, <I>(n) can be shown to be contained in a hyper-ellipsoid superset 5(12)
5(71) = {9 : (0 — 0")T91:n—m(0 — 9,.) < 1} (4.46)
Where C(n) is the weighted covariance matrix,
C(n) = C(n — 1) + Anxnxz,‘ (4.47)

n" is a scalar quantity,

n 2
..., = 030000. + 24.7.(1 — if") (448)

i=1

and 0“, the center of 5(71), is the weighted LS estimate at time n using the weights

{A,}?zl. It can be shown that 0,. can be computed recursively using

,9

3

V
|l>

— P-1(n)
G7, = x£P(n - 1)xn
8,, = yn — 0:_1xn (4.49)

131

PM) = P(n — 1) _ A"P(n - 1)xnx,7,'P(n — 1)

 

1 + AnGn
6n = 0,._1 + /\,,P(n)x,,en
Km = nil—1 + An’l'n — M
1 + Ana"

The SM-WRLS algorithm starts off with a large ellipsoid, 5(0), which contains all
admissible values of the model parameter vector. The objective is to ﬁnd the weight
A“ at time n so as to minimize the size of the membership set. Different criteria can

be applied to the optimization process. One criterion is to minimize the volume ratio

_ detB(n)
A") _ detB(n — 1)

 

V( (4.50)

where B(n) é KmC‘1(n).
It has been shown[84, 88] that the Optimal weight A; is the unique positive root

of the quadratic equation

0193, + a2)” + a3 = 0 (4.51)
where
(11 = (m — 1)Gi'yn (4.52)
(kg = 2mG’n'yn — MAG: — Guy" + EiGn (4.53)
03 = m'yn — me: — rtn_1G,, (4.54)

4.3.3 State Search Using the SM Identiﬁcation

The SM ID is useful not only for computing the unknown system parameters, but
also for identifying the candidate state boundaries in a speech signal. In this section,

a state search technique using the SM ID is discussed.

132

 

Concerned with a problem of ﬁnding a meaningful state sequence of a speech
utterance, we focus on where the transitions of the parameter sets occurs rather than
parameters of a model in the SM ID. The interesting part of the SM identiﬁcation in
connection with ﬁnding an apprOpriate sequence is that it takes advantage of volume
or trace of an ellipsoid to describe the feasible values of parameters of a system. If
the data are not informative, there is no change of volume or trace and this selective
updating makes the SM theory useful. In addition, when the data come from a
same source which may be regarded as one “state,” the volume or trace will be
monotonically non-increasing for new data. When the input data may come from a
different source or “state,” there exists a different set of parameters. Thus, it leads
to a sudden change of the volume or trace (reset to the initial values) according to
the theory of the SM ID. Therefore, if the volume or trace is tracked graphically
or analytically with the SM algorithm, it is possible to get approximate information
about plausible state boundaries in a speech utterance. In the SM, algebraically this
implies that if the volume becomes negative at some points, those points could be
considered where a signiﬁcant change of characteristics of a signal happens.

The resulting state sequences of a spoken digit “six” deduced are shown in
Fig. 4.20. The ﬁgure shows the sequences of three computed parameters, the vol-
ume of a bounding ellipsoid, Km, and informative points which indicates a sudden
change of volume or an in the time axis.

Here, the model is based on AR(3) which has three unknown parameters in a
linear prediction model. However, the order can be changed. As described, the
performance of the SM theory depends on the accurate bounds. In fact, knowing the
bound information for the model is a difﬁcult problem. There are a few of ways to

estimate the noise bound. One method to approximately estimate the noise bound is

133

 

 

 

The input speech of spoken word “six“ noise bound for the normalized speech

 

 

 

 

 

 

 

 

 

 

 

 

1500 . 0.25 . . f

0.2 p . ............................. .4
, 0.15 . .........................................
i’ 0,, ..........................................
005 ......................................... a

-1000 4 4 i 0 ‘ '

0 2000 4000 6000 8000 0 2000 4000 6000 8000
SM algorithm to estimate at SM algorithm to estimate a2
3 v v v 2 r v v

   

 

 

 

 

 

 

 

 

_2 , .......... . . . ...... F . . . . ..... ......... ..
_2 '. L .' _3 ‘ ; ;
0 2000 4000 6000 8000 0 2000 4000 6000 8000
SM algorithm to estimate as ‘0 volume of the ellipsoid
2 . . v 1 0 1— v .

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

. . a . . .
—2 - ‘ - 10 ‘ . .
0 2000 4000 6000 8000 0 2000 4000 6000 8000
10 kappa of the ellipsoid informative point
10 r v . 1
0.8
g 0.6
lg 0.4
>-
0.2
10“° 4 'L ‘ 0
0 2000 4000 6000 8000
Time (n)

 

Figure 4.20: Some informative results regarding state segmentation by the SM tech-
nique for the word “six.”

134

 

to use the short-term signal power of Speech signals itself as

 

e 0

Here, Q represents the integer part of quotient. These expressions imply that the
noise is bounded uniformly within a N -size frame and that the noise is proportional
to the average signal power of the frame. In addition, an amplitude or attenuation
scalar a has been multiplied to adjust a noise bound depending on the application.

Conceptually, there is a transition of a set of parameters signifying the dynamics
of a signal when the system changes the state. In Fig. 4.20, three parameters, 01, a2,
and a3 make a transition simultaneously at the points where physical status may
change. More signiﬁcantly, two major indicators, the volume of an ellipsoid covering
the feasible parameter quantities and 5,, make sudden changes at those points so that
the size of volume is reset to the initial size of volume. Those informative points
indicate a change of the dynamics of a signal. Thus, by detecting the informative
points, we had information about the transitions of acoustical phenomena of a speech
signal.

The performance of this technique is heavily dependent on the accuracy of noise
bound. Figure 4.21 and 4.22 are the results of state search for spoken digits “four”
and “six” by the SM technique with various values of a. First, depending on the
scale factor a, different sets of informative points are identiﬁed. Therefore, ﬁnding a
good scale factor is signiﬁcant for processing speech signals in this way. Second, most
informative points are located in the boundary between unvoiced and voiced region

as well as the unvoiced region. At the unvoiced region, the dynamics of AR model

135

 

changes rapidly. Thus, there occurs lots of informative points. For the transition

region, it is as we expected. Therefore, the search technique is useful to detect a

voiced / unvoiced regions of a signal by controlling a constant.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

The input speech for word "four' informative point
2000 4000 6000
Time (n) 1 T1me(n)
‘i’
.1 i 0.5 ..............................
>-
0 Z
4000 6000 0 2000 4000 6000
1 Time (n) 1 Time (n)
i 0.5 . .......................... Eu 0 5 . . . . . ..........................
>- E 3 £
0 L L o 1
0 2000 4000 6000 0 2000 4000 6000
1 T1me(n) 1 Time (n)
g 0.5 h ......................... a g o 5 ] ........................................
l t
>- E >-
0 f L 0 ;
0 2000 4000 6000 0 2000 4000 6000
1 Tm“ (n) 1 Tune (11)
g o 5 .......................................... g o 5 l ......................................... ,
'- o 4 v- 0 “
11 11
£ _o'5 ............................ . ............. g _'°.5 ............................ ...............
_1 L i _1 A 4
0 2000 4000 6000 0 2000 4000 6000
T1me(n) Turns (0)

Figure 4.21: State segmentation result by the SM theory with various values of a for

word “four.”

136

 

 

 

The input speech for word 'six" informative point

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2m 1 ]
E 0.5
L ' i o
0 2000 4000 6000 8000 2000 4000
1 Time r 1 Time (r)
ET 0 5 .......... E 05 ............................
i .
0 i 0
0 2000 4000 6000 8000 0 2000 4000 6000 8000
1 Time (1) 1 T1me(n)
Z7 0.5 E 05 ..............................
i .
'0 2000 4000 8000 "o 2000 4000 6000 8000
1 Time (r) 1 Time (n)
i E
E o 5 ............. E 05 .................
>- >-
v0 2000 4000 6000 8000 v0 2000 4000 6000 8000
1 ﬁrm (n 1 Time (n)
g 0 5 ~ - - .Z- 0 5
i l :l
>- >-
UO 2000 4000 6000 8000 U0 2000 4000 6000 8000
Time (n) Time (n)

Figure 4.22: State segmentation result by the SM theory with various values of a for
word “six.”

137

Chapter 5

Conclusions and Future Research

5. 1 Conclusions

This research is study about the HMM, a state-of-the—art technique in Speech recog-
nition. Most speech-based studies of the HMM are focused on the applications of the
conventional F-B HMM.

First, the conventional F-B HMM has been transformed to the vector-matrix
formulation. This general formulation admits diverse formulations of the three HMM
problems and assist in the derivation of a more computationally efﬁcient model.

Then, the main focus of this research is the reexamination of the TIA HMM, an
approximate model of the conventional F-B HMM, with a vector-matrix formulation
suggested by Snider et al. [7, 8]. In particular, we have focused on the inherent advan-
tages with the TIA HMM in speech recognition. The TIA HMM reduces the storage
requirements, improves computational efﬁciency, and increases numerical stability.

The TIA HMM has a natural vector-matrix formulation akin to a state-space
model. This state-space formulation permits the reduction of the dimensions of the
elements within a HMM pOpulation of tying some state variables so that the prob—

ability can be shared by the tied state variables. Therefore, by providing analytical

138

 

 

and empirical results to the TIA HMM, this research becomes the basis of viability
of work of Snider [6] in related to the HMM compression.

In general speaking, this dissertation is an extension and more thorough examina-
tion of the work begun in [6] which falls short of reasonable explanations of applying
the TIA HMM to speech recognition. In particular, this dissertation is focused on the
analysis of the TIA HMM and its relation to the F-B HMM. By taking [1le P(Ot)
as likelihood measure for a speech utterance, the TIA HMM naturally causes to the
generation of the extra likelihoods. However, we showed analytically and empirically
that under some practical conditions such as a Bakis tOpOlogy which has a diagonally
dominant state-transition matrix, such an approximation is viable in some speech
processing applications like Spoken digit recognition.

Besides the analysis of vector-matrix formulation of the HMM undertaken here
to demonstrate the viability of the TIA HMM, we derive some useful mathematical
expressions for the F—B HMM as well as the TIA HMM. Such derived equations and
accompanying results make it possible to re—exploit classic results of the HMM.

Next, we showed a couple of techniques to reconcile the F—B HMM to the TIA
HMM although theoretically it is not possible to make the two models completely
equivalent. However, these attempts are signiﬁcant in showing similarities and rela-
tionships between some of the state variables of the models. Those approaches illu-
minate the Operation of the HMMS while supporting the validity of the TIA HMM
approximation of the F-B HMM.

It is concluded that the TIA HMM has some signiﬁcant advantages resulting from
loosening of the legal state path constraint which is implicitly required in the F-B
HMM. However, we showed that such an illegal likelihood does not severely degrade
the performance of spoken digit recognition with the TIA HMM.

Finally, we have proposed two new state search techniques. They are a new max-

imum likelihood approach (different from the conventional Viterbi technique which

139

also employs a ML criterion), and the acoustic distance approach. Comparing with
the conventional Viterbi technique, by the new ML approach, we obtain an appropri-
ate state sequence of a speech signal which is close to that of the Viterbi technique in
a global sense that the global shapes of computed state sequences by the conventional
Viterbi and the new ML approach are similar to each other. Also, as we obtain more
computational efﬁciency in the TIA HMM, we also get the computational savings with
this approach. Furthermore, the quite closeness of the state sequence between the
conventional Viterbi approach and the new ML approach also becomes one viability
condition of the TIA HMM.

The recursive Viterbi search based on k-means clustering is designed to examine
the acoustic temporal variations of the speech signals and determine the correspond-
ing state sequence. Thus, the resulting state sequence is more consistent with the
acoustical evidence of a speech utterance.

In addition, the other possible approach of determining a possible state sequence
is to use the SM ID. Even though the original motivation of adOpting the SM ID in
our research was to apply an adaptive linear ﬁltering technique to ﬁnd a parameters
for M efﬁciently, we showed that SM ID technique is useful to ﬁnd an appropriate
state sequence of a speech signal. In particular, this SM technique is useful to detect
boundaries of voiced, and unvoiced region of a speech signals. According to the
application, we can choose a suitable state search method among proposed technique.

Recently, it is required for the speech recognition system to cover larger vocabu-
laries as well as to support real-time speech applications. These applications require
a robust and computationally fast processing of speech recognition system. Under
certain environment of applications, however, one factor between the recognition rate
and computational advantages is relatively more required than the other. In case
that the computational aspect is relatively important without severe degradation of

performance of the speech recognition system, the TIA HMM is a suitable technique

140

 

 

which allows to compromise the rate of recognition and resources with additional

controllable factor according to the application.

The TIA HMM is one HMM approach which uses the state-space formulation.

We conclude that the TIA HMM represents a viable alternative to the conventional

F-B HMM in certain speech processing area such as digit recognition.

5.2

Future Research

Some of the open problems related to this research are summarized as follows:

Using the vector-matrix formulation of the HMM, there are possibilities we
can ﬁnd some unknown characteristics of the F-B HMM which may be useful
in speech-recognition technology albeit its inherent difﬁculty of time-varying

characteristics.

To develop and apply the TIA HMM more to the speech recognition problems,
it is necessary to test and show the viability of the TIA HMM under the di-
verse domains of speech recognition problems. For a certain application, this
technique may be favorable and for other applications, it may not. Therefore,

it is signiﬁcant to apply this technique to various application areas.

In connection with the previous Open problem, it is necessary to ﬁnd or develOp
the conditions, if any, of the TIA HMM to improve the recognition performance
without undermining the advantages of the TIA HMM.

In this research, the application of the TIA HMM has been to the discrete
HMM whose probabilities are all discrete components. However, application to
continuous-density HMMs is also possible. In continuous observation HMM, M
is to denote the elements of an HMM, namely N, {aij}, {bJ-(x)}, and {7r,-}. Here
b, (x) is the density function of continuous observation process x. Of course, the

141

same diagonalizing and compression idea can be applied to the state-transition
matrix A. However, for the continuous probability distribution, we need to ﬁnd
how to efﬁciently integrate the means and variances of the bJ-(x) and determine

the effects on recognition performance.

Since the HMM is a linear model, we can interpret the HMM in light of the
adaptive signal processing technology. In the previous analysis, the SM ID
technology has been exploited only to ﬁnd a state sequence of a speech sequence.
However, the SM ID theory is mostly applicable to the computation of unknown
parameters of a model. More than anything else, the SM ID has an advantage in
a selective updating of the parameters through the bounding ellipsoid based on
the assumption of an known error bound. Therefore, the SM technique can be
applied to deduce the state parameters in a different fashion from the F-B HMM.
Furthermore, because the SM uses a bounding ellipsoid which includes all the
feasible parameter values, the SM ID technique is assumed to make the speech
recognition system robust. This is because using “set” of parameters instead of
Speciﬁc value for parameters, the SM ID theory can possibly assist to recover

lost symbol information in the string during pronunciation or quantization.

In connection with the application of the SM technique, we need to ﬁnd an
informative error bound in advance. Presently, the error bound of the input-
output equation from the TIA HMM is too trivial and large for the deve10pment
of HF—OPD. Also, the size of {A, B} is still too large to process efﬁciently in the
SM framework. As known, B is usually a sparse matrix which has many zeros
in its elements and A is a triangular matrix. By considering matrix properties,
a useful application of the SM ID technique to the HMM problems may emerge.

This requires more intensive study.

142

Bibliography

[1] L.R. Rabiner, and EH. Juang, “An introduction to hidden Markov models,”
IEEE ASSP Magazine, vol. 3, pp. 4-16, 1986.

[2] L.R. Rabiner, “A tutorial on hidden Markov models and selected applications in
speech recognition,” Proc. of the IEEE, vol. 77, pp. 257-285, 1978.

[3] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals,
Englewood-Cliffs, New Jersey: Prentice-Hall, 1978.

[4] J.R. Deller, Jr., J.H.L. Hansen, and J.G. Proakis, Discrete Time Processing of
Speech Signals, (2nd ed.), New York: IEEE Press, 2000.

[5] J .G. Proakis, Digital Communications, (4th ed.), New York: McGraw-Hill, 2000.

[6] R.K. Snider, Eﬂ‘icient Discrete Symbol Hidden Markov Model Evaluation Using
Transformation and State Reduction (MS. Thesis), Michigan State University,
East Lansing, 1990.

[7] J .R. Deller, Jr. and R. K. Snider, “Reducing redundant computation in HMM
evaluation,” IEEE Trans. on Speech and Audio Processing, vol. 1, Oct. 1993.

[8] J .R. Deller, Jr., and R.K. Snider, “‘Quantized’ hidden Markov models for efﬁ-
cient recognition of cerebral palsy Speech,” Proc. 19.90 IEEE Int. Symp. Circuits
and Sys., New Orleans, vol. 3, pp. 2041-2044, May 1990.

[9] RA. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach,
Englewood-Cliffs, New Jersey: Prentice-Hall, 1982.

[10] AK. Jain, and RC. Dubes, Algorithms for Clustering Data, Englewood Cliffs,
New Jersey: Prentice-Hall, 1988.

[11] L.R. Rabiner, B.H Juang, and MM. Sondhi, “Recognition of Isolated Digits
Using Hidden Markov Models with Continuous Mixture Densities,” AT 85 T
Tech. J., vol. 64, pp. 1211-1234, July-Aug. 1985.

143

[12] M. Naito, L. Deng, and Y. Sagisaka, “Speaker clustering for speech recognition
using the parameters characterizing vocal-tract dimensions,” Proc. IEEE Int.
Conf. Acoustics, Speech, and Signal Processing, Seattle, Washington, vol. 2, pp.
981-984, May 1998.

[13] A. Lazarides, Y. N ormandin, and R. Kuhn, “Improving decision trees for acoustic
modeling,” Proc. IEEE Int. Conf. Spoken Language, Philadelphia, PA, vol. 2,
pp. 1053—1056, Oct. 1996.

[14] C. Dugast, P. Beyerlein, and R. Haeb—Umbach, “Application of clustering tech-
niques to mixture density modeling for continuous speech recognition,” charac-
terizing vocal-tract dimensions,” Proc.]EEE Int. Conf. Acoustics, Speech, and
Signal Processing, Detroit, vol. 1, pp. 524—527, May 1995.

[15] DB. Paul, “Extensions to phone-state decision-tree clustering: Single tree and
tagged clustering,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process-
ing, Munich, Germany, vol. 2, pp. 1487-1490, Apr. 1997.

[16] J. Zhang, Z. Hwang, and X. Wang, “Selection and analysis of HMM’S state-
number in speech recognition,” Proc. IEEE Int. Conf. Signal Processing, Beijing,
China, pp. 641-645, Oct. 1998.

[17] KS. Fu, Syntactic Pattern Recognition and Applications, Englewood-Cliffs, New
Jersey: Prentice-Hall, 1982.

[18] J .1. T011 and RC. Gonzalez, Pattern Recognition Principles, Reading, Mas-
sachusetts: Addison-Wesley, 1974.

[19] C.-H. Lee, and KS. Fu, “A stochastic syntax analysis procedure and its appli-
cation to patter classiﬁcation,” IEEE Trans. Computers, vol 21, pp. 660—666,
July 1972.

[20] J. Picone, “Continuous speech recognition using hidden Markov models,” IEEE
ASSP Magazine, vol. 7, pp. 26-41, July 1990.

[21] J .K. Baker, “The dragon system-An Overview,” IEEE Trans. Acoustics, Speech,
and Signal Processing, vol. 23, pp. 24-29, Feb. 1975.

[22] LE. Baum, and J .A. Eagon, “An inequality with application to statistical esti-
mation for probabilistic functions of Markov processes and to a model for ecol-
ogy,” Bulletin of the American Mathematical Society, vol. 73, pp. 360-363, 1967.

[23] LE. Baum, and T. Petrie, “Statistical inference for probabilistic function of ﬁnite
state Markov chains,” Annals of Mathematical Statistics, vol. 37, pp. 1554-1563,
1966.

[24] LE. Baum, and T. Petrie, G. Soules et al., “A maximization technique in the
statistical analysis of probabilistic functions of Markov chains,” Annals of Math-
ematical Statistics, vol. 41, pp. 164-171, 1970.

144

[25] C. Mitchell, M. Harper, and L. J amieson,“ Comments on “Reducing computation
in HMM evaluation””, IEEE Trans. Speech and Audio Processing, vol 2, Oct.
1994.

[26] CT. Chen, Introduction to Linear System Theory, New York: Holt, Reinhart
and Winston, 1970.

[27] W. Turin, “Unidirectional and parallel Baum-Welch algorithms,” IEEE Trans.
Speech and Audio Processing, vol. 6, Nov. 1998.

[28] C. Moler, and C. Van Loan, “Nineteen dubious ways to compute the exponential
of a matrix,” SIAM Rev., vol. 20, pp. 801-836, 1978.

[29] R.L. Streit, “The moments of matched and mismatched hidden Markov models,”
IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, Apr. 1990.

[30] H. Hjalmarsson, and B. Ninness, “Fast, non-iterative estimation of hidden
Markov models,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process-
ing, Seattle, vol. 4, pp. 2253-2256, May 1998.

[31] M. Karan, B.D.O. Anderson and RC. William, “An efﬁcient calculation of the
moments of matched and mismatched hidden Markov models,” IEEE Trans.
Signal Processing, vol. 43, Oct. 1995.

[32] R.J. Elliott, L. Aggoun, and J .B. Moore, Hidden Markov models : Estimation
and Control, New York: Springer-Verlag, 1995.

[33] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Englewood
Cliffs, New Jersey: Prentice-Hall, 1993.

[34] D. O’Shaughnessy, Speech Communication: Human and Machine, Reading,
Massachusetts: Addison-Wesley, 1987.

[35] J. Allen, “How do humans process and recognize speech?,” IEEE Trans. Speech
and Audio Processing, vol. 2, Oct. 1994.

[36] RA. Horn, and CR. Johnson, Topics in Matrix Analysis, New York: Cambridge
University Press, 1991.

[37] J .M. Ortega, Matrix Theory: A Second Course, New York: Plenum Press, 1987.

[38] B. Noble, and J. W. Daniel, Applied Linear Algebra, Englewood Cliffs, New
Jersey: Prentice-Hall, 1988.

[39] G.H. Golub, and CF. Van Loan, Matrix Computations, (2nd ed.), Baltimore:
Johns Hopkins Press, 1989.

[40] X.D. Huang, Y. Ariki and MA. Jack, Hidden Markov Models for Speech Recog-
nition, Edinburgh: Edinburgh University Press, 1990.

145

[41] 8.1. Resnick, Adventures in Stochastic Processes, Boston: Birkhauser, 1992.
[42] M.J.R. Healy, Matrices for Statistics, , New York: Oxford University Press, 1986.
[43] SR. Searle, Matrix Algebra Useful for Statistics, New York: Wiley, 1982.

[44] H. Minc, Nonnegative Matrices, New York: John Wiley & Sons, 1988.

[45] A. Graham, Nonnegative Matrices and Applicable Topics in Linear Algebra, New
York: John Wiley & Sons, 1987.

[46] RS. Varga, Matrix Iterative Analyses, Englewood Cliffs, New Jersey: Prentice-
Hall, 1962.

[47] D.K. Faddeev, Computational Methods of Linear Algebra, San Francisco: W. H.
Freeman and Company, 1963.

[48] A. Jennings, Matrix Computation for Engineers and Scientists, New York: John
Wiley & Sons, 1977.

[49] M. Aoki, State Space Modeling of Time Series, New York: Springer-Verlag,
1987.

[50] W.H.M. Zijm, Nonnegative Matrices in Dynamic Programming, Amsterdam:
Mathematisch Centrum, 1983.

[51] E. Seneta, Non-negative Matrices and Markov Chains, (2nd ed.), New York:
Springer-Verlag, 1981.

[52] E. Bodewig, Matrix Calculus, Amsterdam: North-Holland, 1959.

[53] SS Haykin, Adaptive Filter Theory, Englewood Cliffs, New Jersey: Prentice-
Hall, 1996.

[54] J .S. Lim and A.V. Oppenheim (editors), Advanced Topics in Signal Processing,
Englewood Cliffs, New Jersey: Prentice-Hall, 1988

[55] A.V. Oppenheim, and R.W. Schafer, Discrete-Time Signal Processing, Engle-
wood Cliffs, New Jersey: Prentice—Hall, 1989.

[56] CH. Lee and L.R. Rabiner, “A frame synchronous network search algorithm for
connected word recognition”, IEEE Trans. Acoustics, Speech and Signal Process-
ing, vol. 37, pp. 1649-1658, Nov. 1989.

[57] A. Viterbi, “Error bounds for convolutional codes and an asymptotically 0p—
timum decoding algorithm,” IEEE Trans. Information Theory, vol. 13, pp.
260-269, Apr. 1967.

[58] A.J. Viterbi, and J .K. Omura, Principles of Digital Communications and Coding,
New York: McGraw—Hill, 1979.

146

[59] F. Jelinek, “A fast sequential decoding algorithm using a stack”, IBM J. Research
and Development, vol. 13, pp. 675-685, Nov. 1969.

[60] J. Makhoul, S. Roucos, and H. Gish, “Vector quantization in Speech coding,”
Proc. IEEE, vol. 73, Nov. 1985.

[61] M. Ostendorf, V.V. Digalakis, and DA. Kimball, “From HMM’s to segment
models: A uniﬁed view of stochastic modeling for Speech recognition,” IEEE
Trans. Speech and Audio Processing, vol. 4, Sept. 1996.

[62] Sing-Tze Bow, Pattern Recognition: Applications to Large Data-Set Problems,
New York: M. Dekker, 1984.

[63] J. Dai, I.G. Mackenzie, and EM. Tyler, “Stochastic modeling of temporal in-
formation in speech for hidden Markov models,” IEEE Trans. Speech and Audio
Processing, vol. 2, Jan. 1994.

[64] N. Nocerino, F.K. Soong, L.R. Rabiner, and DH. Klatt, “Comparative study
of several distortion measures for speech recognition,” Proc. IEEE Int. Conf.
Acoustics, Speech, and Signal Processing, Tampa, vol. 1, pp. 25—28, 1985.

[65] L. Rabiner, J.G. Wilpon, and F.K. Soong, “High performance connected digit
recognition,” IEEE Ti‘ans. Acoustics, Speech, and Signal Processing, vol. 36,
Aug. 1989.

[66] P. McGoldrick, “Did you hear what I said?,” Electronic Design, Oct. 1996.

[67] AB. Poritz, “Hidden Markov models: A guided tour,” Proc. IEEE Int. Conf.
Acoustics, Speech, and Signal Processing, vol. 1, pp. 7-13, 1988.

[68] P. Kenny, M. Lennig, and P. Merrnelstein, “A linear predictive HMM for vector-
valued observations with applications to speech recognition,” IEEE Trans. Speech
and Audio Processing, vol. 38, pp. 220—225, Feb. 1990.

[69] AB. Poritz, “Linear predictive hidden Markov models and the Speech signal,”
'Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. pp. 1291-
1294, May 1982.

[70] B. Juang, and LR Rabiner, “Mixture autoregressive hidden Markov models for

speech Signals,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 33,
pp. 1404-1413, Dec. 1983.

[71] H. Sheikhzadeh, and L. Deng, “Waveform-based Speech recognition using hidden
ﬁlter models: parameter selection and sensitivity to power normalization, ” IEEE
Trans. Speech and Audio Processing, vol. 2, pp. 80-89, Jan. 1994.

[72] L. Deng, “A stochastic model of speech incorporating hierarchical nonstation-
arity,” IEEE Trans. Speech and Audio Processing, vol. 4, pp. 471-474, Oct.
1993.

147

[73] B. Juang, S. Levinson, and M. Sondhi, “Maximum likelihood estimation for
multivariate mixture observations of Markov chains, ” IEEE Trans. Information
Theory, vol. 32, pp. 307—309, Mar. 1986.

[74] N. Tishby, “On the application of mixture AR hidden Markov models to text
independent Speaker recognition,” IEEE Trans. Signal Processing, vol. 39, pp.
563—570, Mar. 1991.

[75] L. Liporace, “Maximum likelihood estimation for multivariate observations of
Markov sources, ” IEEE Trans. Information Theory, vol. 28, pp. 729-734, Sep.
1982.

[76] B. Atal, and S. Hanauer, “Speech analysis and synthesis by linear prediction of
the speech wave, ” J. Acoust. Soc. Amer., vol. 50, no. 2, pt. 2, pp. 637—655, 1971

[77] L. Liporace, “Linear estimation of nonstationary signals,” J. Acoust. Soc. Amer.,
vol. 58, pp. 1288-1295, 1975.

[78] Y. Grenier, “Time—dependent ARMA modeling of nonstationary signals,” IEEE
Trans. Speech and Audio Processing, vol. 31, pp. 899-911, 1983.

[79] R. Bakis, “Continuous speech word recognition via centisecond acoustic states,”

Proc. 913t Annual Meeting of the Acoustic Society of America, Washington,
DC, 1976

[80] J. J. Ford, and J. B. Moore, “On adaptive HMM state estimation,” IEEE Trans.
Signal Processing, vol. 46, Feb. 1998.

[81] RC. Schweppe, “Recursive state estimation: Unknown but bounded errors and

system inputs,” IEEE Trans. Automatic Control, vol. AC-13, pp. 22-28, Feb.
1968.

[82] DP. Bertsekas, and LB. Rhodes, “Recursive state estimation for a set-
membership description of uncertainty,” IEEE Trans. Automatic Control, vol.
AC—16, pp. 117-128, Apr. 1971.

[83] E. Fogel, “System identiﬁcation via membership set constraints with energy con-
strained noise,” IEEE Trans. Automatic Control, vol. AC-24, pp. 752-758, Oct.
1979.

[84] E. Fogel, and Y.F. Huang, “On the value of information in system identiﬁcation
- bounded noise case,” Automatica, vol. 18, pp. 229-238, 1982.

[85] S. Dasgupta, and Y.F. Huang, “Asymptotically convergent modiﬁed recursive
least-squares with data-dependent updating and forgetting factor for systems
with bounded noise,” IEEE Trans. on Information Theory, vol. IT-33, pp. 383-
392, May 1987.

148

[86] JP. Norton, “Recursive computation Of inner bounds for the parameters Of linear
model,” Int. J. Control, vol. 50, pp. 2423-2430, 1989.

[87] J .R. Deller, Jr., and SF. Odeh, “Implementing the Optimal bounding ellipsoid
algorithm on a fast processor,” Proc. IEEE Int. Conf. Acoustics, Speech, and
Signal Processing, Glasgow, vol. 2, pp. 1067-1070, May 1989.

[88] J .R. Deller, Jr., “Set-membership identiﬁcation in digital Signal processing,”
IEEE ASSP Magazine, vol. 6, pp. 4-22, Oct. 1989.

[89] J.R. Deller, Jr., and TC. Luk, “Linear prediction analysis Of Speech based on
set-membership theory,” Computer Speech and Language, vol. 3, pp. 301-327,
Oct. 1989. '

[90] J.R. Deller, M. Nayeri, and MS. Liu, “Unifying the landmark deve10pments
in Optimal bounding ellipsoid identiﬁcation,” Int. J. on Automatic Control and
Signal Processing, vol. 8, pp. 43—60, Jan—Feb. 1994.

[91] M. Nayeri, M.S. Liu, and J .R. Deller, “An interpretable and converging set-
membership algorithm,” Proc. IEEE Int. Conf. Acoustics, Speech, and Signal
Processing, Minneapolis, vol. 4, pp. 472-475, Apr. 1993.

[92] M. Nayeri, J .R. Deller, and MS. Liu, “A converging Optimal ellipsoid algorithm
with volume minimization,” Proc. 26th Annual Asilomar Conf. Signals, Systems,
and Computers, Monterey, pp. 20—24, Oct. 1992.

[93] AK. Rao, Y.F. Huang, and S. Dasgupta, “ARMA parameter estimation using
a novel recursive estimation algorithm with selective updating,” IEEE Trans.
Speech and Audio Processing, vol. 38, pp. 447-457, Mar. 1990.

[94] T. Kailath, Linear Systems, EnglewoOd-Cliffs, New Jersey: Prentice-Hall, 1980.

[95] K.F. Lee, Automatic Speech Recognition, the Development of the SPHINX Sys-
tem, Boston: Kluwer, 1989.

[96] T. Soderstrom, and P. Stoica, System Identiﬁcation, Englewood-Cliffs, New Jer-
sey: Prentice-Hall, 1989.

149