… then record δ_i^p(1), i + 1 ≤ t ≤ T.

The notations δ_i^p(·) and blike(i, t) indicate standard HMM probabilities which are defined in Appendix E. Three thresholds, λ_L, λ_l, and λ_u, are set in this algorithm to find acceptable intervals. λ_L is the minimum length of a candidate interval. λ_l is the lower-bound probability for any potential interval which starts at time i, i.e., if δ_i^p(1) > λ_l, then time i is a possible starting time of a potential interval in the observation sequence corresponding to the given word model. λ_u is the upper-bound likelihood for any promising interval; any interval with likelihood exceeding λ_u is excluded. Briefly, the vertex evaluation algorithm presented above works as follows: for every possible starting time i, i.e., δ_i^p(1) > λ_l, find the largest ending time t such that blike(i, t) < λ_u and (t − i + 1) > λ_L.

Figure 3.19: An example illustrating how phone models can be combined to form a large word model.

3.2.1.2 Refining Boundaries Following Vertex Evaluation

After executing the baseline evaluation algorithm above, many potential intervals for the extension of the path over x_p may be hypothesized. The likelihood for each surviving interval will usually be very close to λ_u. However, one can still tell which interval is most likely by inspecting the lengths of these potential intervals. The best result is the one which has the smallest likelihood per observation time. However, there is a problem: there will generally be many overlapping potential intervals detected by the baseline algorithm, so the space needed to store these results is large. A merging criterion presented below is used to build a stack which contains a relatively small number of entries. At the same time, a reasonable likelihood measure for the merged boundaries is described. Finally, the merged potential boundaries and their corresponding likelihoods are stored in the stack.

In designing a merging criterion and likelihood measure, two requirements are considered. One is that intervals with good likelihoods are retained after they are merged. The other is that the new likelihood resulting from the merger of boundaries must be reasonable, which means the merger does not decrease the likelihood to an unacceptable level (the length of the merged interval is usually increased).

A. Merging Criterion

Given two hypothesized word intervals B^1 and B^2 from the results of the baseline algorithm, let B^i_start and B^i_stop be the starting time and ending time of the hypothesized interval B^i, respectively. Then there are three possible cases for the relationships among the boundaries of B^1 and B^2 (a sketch implementing these cases follows the list):

1. One interval is contained in the other, say B^2 ⊆ B^1. Then the merged interval B is the same as B^1. This case is shown in Fig. 3.20 (a).

2. There is no intersection between B^1 and B^2, e.g., B^1_stop < B^2_start or B^2_stop < B^1_start. In this case, no merging occurs. This case is shown in Fig. 3.20 (b).

3. B^1 and B^2 partially intersect. Assume B^2_start ≤ B^1_stop < B^2_stop. If the boundary gap between B^1 and B^2 is less than λ_L, then the boundaries B^1 and B^2 can be merged as one interval B such that B_start = B^1_start and B_stop = B^2_stop. Otherwise, no merging occurs between B^1 and B^2, because the boundary gap between them is large enough to retain both as hypothetical intervals. This case is shown in Fig. 3.20 (c).
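To make the three cases concrete, the sketch below implements the merging rule in Python under stated assumptions: an interval is a simple (start, stop) pair of frame indices, lambda_L stands for the minimum-interval-length threshold λ_L, and, because the exact quantity compared against λ_L in case 3 is garbled in the source, the sketch compares the two ending times; that comparison is an assumption, not a detail taken from the original text.

```python
# Minimal sketch of the merging criterion above (not the author's code).
# Assumptions: an interval is a (start, stop) tuple of frame indices, and
# lambda_L is the minimum-interval-length threshold from the text.  The
# quantity compared against lambda_L in case 3 is assumed to be the gap
# between the two ending times; the original condition is unreadable.

def merge_intervals(b1, b2, lambda_L):
    """Return the merged interval, or None when no merging occurs."""
    (s1, e1), (s2, e2) = b1, b2
    # Case 1: one interval contains the other -> keep the larger one.
    if s1 <= s2 and e2 <= e1:
        return b1
    if s2 <= s1 and e1 <= e2:
        return b2
    # Case 2: disjoint intervals -> no merging.
    if e1 < s2 or e2 < s1:
        return None
    # Case 3: partial overlap.  Reorder so that b1 starts and ends first.
    if e2 < e1:
        (s1, e1), (s2, e2) = (s2, e2), (s1, e1)
    if (e2 - e1) < lambda_L:          # boundary gap small enough to merge
        return (s1, e2)
    return None                       # gap too large: retain both intervals
```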
Figure 3.20: Three possible cases of merging potential intervals B^1 and B^2, which are the resulting potential intervals from the baseline word-spotting algorithm: (a) B^2 ⊆ B^1; (b) no intersection between B^1 and B^2; (c) B^1 and B^2 partially intersect.

B. Likelihood Measure for Merged Boundaries

Suppose the intervals B^1, B^2, ..., B^n result from the merging process above. Then the following likelihood measure is used to calculate the likelihood of these merged boundaries. Let like(B^i_start, B^i_stop) be the likelihood of the resulting interval B^i. Then the likelihood is of the form:

like(B^i_start, B^i_stop) = blike(B^i_start, B^i_stop) + (duration likelihood)    (3.5)

where the duration likelihood is computed as follows: Since matching the word model M_p to some part of the observation sequence is essentially unconstrained, it is possible to expand or contract the subsequence assigned to a model so that it accounts for a large or a small part of the observation sequence. This problem is alleviated by adding duration constraints to the likelihood measure. A simple Gaussian duration model is used. Let P_p(D_i) be the probability density of the potential interval B^i with length D_i for the word model M_p. Then:

P_p(D_i) = (1 / (sqrt(2π) σ_p)) exp( −(D_i − D_p)^2 / (2 σ_p^2) )    (3.6)

where the mean and standard deviation for M_p are D_p and σ_p, respectively. These statistics are estimated from the training data. Then the "duration likelihood" is just the negative logarithm of P_p(D_i). The conceptual plot of the word-spotting results for the selected vertices (i.e., words) is shown in Fig. 3.21.

Figure 3.21: A conceptual plot of the word-spotting results for the selected vertices. The vertices represented by shaded circles are vertices selected through the graph partitioning approach.

3.2.2 Modified Stack Decoding

In conventional path-driven left-to-right search strategies, the evaluation of vertices takes place as they are encountered along paths. In the partitioned case, the objective is to eliminate this costly procedure by preselecting vertices for evaluation. Here, a modified left-to-right procedure is presented. This method involves some rather straightforward modifications of conventional stack decoding procedures [11], [38] to accommodate the unevaluated vertices.

The essence of the partitioning method is that only O(√N) vertices in G are selected for evaluation. This means that an alternative method must be available to "evaluate" an unselected vertex (one not in the selected set C). A simple and conservative procedure is used: the evaluation model is replaced by a time-duration model at each unselected vertex extension. When the probability of an unselected vertex extension is needed, we use, instead, the probability that the observation string corresponding to this unselected vertex extension is of a given length. These probabilities are estimated from the training data as described above. For unselected vertices, a simple Gaussian duration model is used, i.e.,
the prob- ability density for an unselected vertex. say .r,.. is of the form (3.6) where .r,, and M” are equivalent in these two discussions since model M” is resident at vertex rp. Thus, the duration model is incorporated into the partitioned graph search techniques. Ac- cordingly, a modified stack decoding algorithm for the partitioned graph search is as follows: 53 Modified Stack Decoding Algorithm {Initialization} Put the start vertex. says. in the stack to form a “null” partial path. {Recursionz Best-First Path Growing Loop} Take the top entry (the best partial path) off the stack. say X0,1-1 = 8,.L'1, - - - ..r1_1. If the best partial path is complete (i.e.. the end—of-path flag is “true”). Then, i. Output the path and increment the output hypothesis counter by one. ii. If the output hypothesis counter is equal to n. stop. Else For each possible vertex extension. say I). of the partial path (1'9” 33;): < Bitnrt) do i. If .r, is an unselected vertex. then a time-duration model is used. to update the likelihood (see below) ii. If .131 is a selected vertex, then either a whole-word HMM or a phonetic-baseform model is used (see below). iii. The pruning strategy discussed below is applied. It is clear that the stack itself is just a sorted list which supports the operations: take the best entry off the stack and insert new entries according to their likelihoods. The following must be contained in each stack entry: 1. The history of the partial path. 2. The partial path likelihood. 3. An end-of-path flag to indicate whether the entry is a complete path. 4. The beginning time in the observation sequence corresponding to the terminal vertex of the partial path. Each path extension is entered into the stack. its position determined by its likelihood. The stack is of finite length. say q. so that only the q most likely partial paths survive. The finite stack, therefore. effects the pruning operation called hard pruning in Chapter I. A second type of pruning occurs when a partial path. for which there is sufficient room in the stack. is deemed too unlikely to be viable and is removed. This process is called soft pruning. The best It paths are determined through the following process: At each iteration. paths are extended from the top of the stack down. The decoding is complete when 72 complete paths appear as entries in the top 72. entries of the stack. Likelihood updates during the extensions occur according to the following argu- ments: The method described below is based on the report by Venkatesh, et a1. [13] although differences exist in the present method from the one described there. Suppose the likelihood of a particular path through the graph G represented by the vertex string X = 131.12. - - - .17], is evaluated, and assume that the likelihood mea- sure is the probability of the occurrence of path X. given the observation string Y :2 y1,y2.~-,yT. Using Bayes rule. P(X I Y) = P(X.Y)/P(Y). Since P(Y) is 55 identical for all paths. it is sufficient to seek the path X' such that X' = arg n‘lXaX P(X.Y). (3.7) Let us assume that the best. partial path in the stack ranges over the vertices X114 = 33.1.. - - - ..r,_l. and that the best likelihood associates observations Yup] with this partial path. i.e.. t, — 1 maximizes — log P(X1,1_1.Y1,.) over all 1'. Then we wish to compute the likelihood associated with extending the path over vertex .1?) using observations Yr... where t 2 t1. 
Note that D P(Xl,l—l~Yl.t1—I ‘FI' Yt1.t) : P(Xl.I-l$Yl.f1-I)P(II'Yt1.t I Xl.l-19Yl.t1-l) (3'8) ' : P(X1.1_I,Y1..._1)P(.r1 I X1,(-1.Y1.z)P(Yt1.t IxthJ—DYlJI—l) = P1X1.1—1.Y1.c,—1)P(-FI I-Fl—1)P(Yt1.t 1131). We have made extensive use of the assumed independence of the observations in this derivation. We see that the likelihood extension is simple. The quantity P(.r( I 11-1) is the transition probability associated with the graph edge between 1.1-1 and .r), and HY,” I .171) is obtained form the vertex evaluation procedure described above, or. if an is not selected for evaluation. by inserting the likelihood that the duration of the sequence generated by the model at 1:1 is t — t1. This latter number is obtained from the duration density for the model. When the best 72. paths have been obtained. the question remains as to how to se- lect the optimal path. What remains in the stack are surviving paths which represent 56 a small subgraph of G. say G'. Depending on the scale of the problem. it is useful to “partition” G' using another set of vertices and repeat the procedures. or simply to search GI using the standard left-to-right method or some other ad hoc technique. Whatever the case. the new search problem will be very significantly scaled down with respect to the original problem. Using these methods. the search will converge to a solution very quickly. A grammatical approach is presented in next section to find the most likely path after the first-pass search. Note that the repartition of G' cannot be done in real time in the current research due to the limitations of the machine speed and memory space. Many other interesting search procedures may arise with future research. 3.2.3 Computation Savings from the First Pass The evaluation process comprises a vast majority of the computational effort of the search algorithm above. In fact, each attempt to extend a path by one vertex in- volves multiple evaluations of the vertex with respect to about T(T + 1)/2 separate substrings of the observation sequence. Y. where ‘T is the length of the complete observation sequence. In the case in which a HMM is present at the vertex, each of these “subevaluations” will require between 0(HT’). and (9(H2T’) operations (de— pending on the model structure), where H is the number of states in the HMM and T' denotes the substring lengthg. The other computations required in the search become 8It is not necessary to “start over.” but rather to supplement the information in the existing stack with further evaluations. 9Further improvement might be possible using an HMM evaluation method due to Deller and Snider [l2]. 57 insignificant in this light, and the computational expense is seen to be nearly directly proportional to the number of vertices actually evaluated. The partioning approach. therefore. can be seen to reduce this cost by a factor which is (Xx/T) with respect to a full search of G. It is difficult to quantify the benefit of the partitioning approach in terms of the cost a conventional left-to-right search. since the later is highly problem-dependent. For illustrative purposes. however. consider a language graph of N = 108 vertices. each representing a phoneme (e.g.. see [11]). If there are typically 30 phonemes in a sentence. then. on the average. we would expect to find about 33 x 105 vertices in a time slot. To evaluate only the first three time slots (2 1 word) would require about 107 vertex evaluations. 
Such an evaluation is 1000 times more expensive than the 104 vertex evaluations required in the partitioned case. and does not offer the pruning safety of using acoustical data which are distributed across the message in time. Not explicitly mentioned above is the very important fact that the partitioned search is more robust to impulsive or intermittent noise than conventional left-to-right search. The evaluation of the data. by focusing on the partitioned C-set of vertices, is distributed across the observations in time. This means that the search is less likely to be pruned in the presence of unmodeled noise. Also, the inherent advantage in this work is that there need be no special form of the underlying model for a symbol at a vertex. as long as some suitable measure of likelihood is computable for each vertex extension. 58 3.3 Second Pass: Optimal Solution It is important to keep in mind that the objective of the partitioned graph search is not necessarily to deduce a single best path through a language graph. Partitioned search provides a systematic way to divide and conquer the original large graph using a series of low complexity operations. In a large problem. the subgraph remaining in the stack after the first partition and search can be further partitioned and searched in a similar manner. The solution would be expected to rapidly converge. The it” partitioning recursion would require 0(gV1/2') evaluations. The goal in the second stage of the search is to find the most likely path in the remaining subgraph after the first stage of the search. In this work. we use the simple procedure of iterating over the existing subgraph with its inherent bigram grammar using the stack decoding algorithm described above. The number of resulting hy- potheses. n. is set to unity (“I-best” search). and all nodes are evaluated. Since the graph is very small, the computational effort involved in this procedure is not significant. In future applications. any number of procedures may be used in this “expensive” pass through the graph- Chapter 4 Implementation and Experimental Issues 4.1 The TIMIT Database To evaluate the performance of the partitioned graph search techniques. the TIMIT (Texas Instruments - MIT) speech database [29] is used. The TIMIT database has been designed to provide speech data for the acquisition of acoustic-phonetic knowl- edge and for the development and evaluation of automatic speech recognition systems. TIMIT contains a total of 6.300 sentences. 10 sentences spoken by each of 630 speakers from eight major dialect regions of the United States. Seventy percent of the speakers are male. and most speakers are Caucasian adults. There are of total 2.342 distinct sentences. which consist of two dialect “shibboleth” sentences (sa). 450 phonetically- compact sentences (sx). and 1890 phonetically-(Iiverse sentences (si). The database is divided into dialect regions: New England (drl). Northern (dr2), North Midland (dr3). South Midland (dr4). Southern (dr3). New York City (dr6), W'estern (dr7). and Army Brat (moved around) (dr8). The “sa” sentences are spoken across all speakers. they introduce an unfair bias for certain phonemes in certain contexts [21]. Therefore. :39 60 the two “sa” sentences are not used in the current experiments. The vocabulary size in this work is 6.100. Some relevant information about the use of TIMIT in this work is summarized in Table 4.6. Training set Test set No. of speakers 462 168 Dialect coverage 8 8 No. of male 326 112 No. 
of female 136 56 semen“ mm” gill-1338 :ix 11.7530 No. of sentences 3,696 1.344 Vocabulary size 4,891 2.373 No. of words 1 1.087 4.621 Table 4.6: Summary of the usage of the TIMIT database in this work. The TIMIT corpus includes time-aligned orthographic transcriptions. phonetic transcriptions. word transcriptions as well as speech waveform data for each sentence- utterance. The file structure for the TIMIT database is shown in Fig. 4.22. The TIMIT waveforms are sampled at 16 kHz. and the samples are 16-bit short integers which are in VAX/Intel byte order (least significant byte/most significant byte). Thus. it is required to swap the bytes in the samples to MSB/LSB order before they are used. Since the current experiments are performed on SUN workstations. the following command is used to skip the 1024-l)yte ASCII header and byte-swap the samples: dd if=input-file.wav bs=1024 skip=1 I dd conv=swab > output.file TIMIT employs a set of 62 phonetic labels. These labels. along with examples. are listed in Table 4.7. 61 Phone Example Phone Example Phone Example /iy/ bEEt ‘fl Ray lb/ Bob /ih/ Way Id/ Dad /eh/ Yacht /g/ Gag ley/ lhh/ Hay /p/ Pop lae/ aHead lt/ Tot /aa/ /el/ bottLE /1H—-I——~I-i—> (b) Figure 4.23: All the possible vertices with different phonetic labels for the word ask in a large graph which is similar to that of Fig. 2.1 (a). Each case represents a vertex in the large graph. Figure 4.24: The word ask in a small graph which is similar to that of Fig. 2.1 (b). 66 They often go out in the evening ldh-ey/ lao-f-tcl-t-ix-ng/ /gcl--ow/ /aa-q/ lix-n/ /dh-iy/ liy-v-n-ix-ng/ 1 honor my mom lq-aa/ /q-aa-nx-axr/ lm—aa/ /m-aa-m/ Never happen 11] life Starting . ln-eh-v-er/ lhv-ae-pcl-p-ix-er/ /1h-. -aa-fl node . . . . is thinner than I am Iih-z/ /th-ih-nx-er/ ldh.eh-n/ lay/ lac-m/ a wet boat lax/ lw-eh-tcl/ lb-ow-tcl/ Observation String y1y2y3 y4 y5y6y7y8y9y10y11y12y13 y14 y15y16y17y18 leyZO (Nor is she a wet boat) Figure 4.25: (a) An example language graph. (1)) The boundaries of the observation string are unknown. 67 store the planarity breaking arcs for future reference. Note that the planar subgraph. say G', must include all the vertices in the original graph G. On the other hand, let G(V, E) be the underlying undirected graph of G(V. A). then the undirected version of G(V, A) is the undirected graph formed by converting each edge of G(V.A) to an undirected edge and removing duplicate edges. The way to construct the graph G is as follows: A data file to store the set of sentences 2 is created (recall that each sentence begins with a dummy point 0). The sentences are stored in the data file one by one, and every word together with its phonetic transcriptions in a sentence are stored line by line in the data file according to the concatenation of the words in the sentence. After building up a data file using the method described above, the data file is read line by line from the top. Whenever a new line is read. one new vertex may be added to the partial graph (which has been built up to this point) if the phonetic transcription in this line is different from any one in the existing partial graph. At the same time, a new edge to the partial graph is added (recall that each edge represents the transition from word to word in sentences). Since the language graph is constructed in this fashion, a connected directed graph is obtained. However, the underlying undirected graph might not be planar. 
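As a rough illustration of the construction just described, the sketch below builds the directed language graph from a sentence file in Python. The file layout (one "word, phone-string" entry per line, a blank line between sentences) and all identifiers are assumptions made for the sketch; the thesis does not specify the exact file format.

```python
# Sketch of the language-graph construction described above (illustrative
# only).  Assumed file layout: one line per word, "WORD\tphone-string",
# with a blank line separating sentences; every sentence implicitly starts
# at the dummy start vertex 0.

def build_language_graph(lines):
    vertices = {}            # (word, phone string) -> vertex id
    edges = set()            # directed word-to-word transitions
    start_vertex, next_id = 0, 1
    prev = start_vertex
    for raw in lines:
        line = raw.strip()
        if not line:                       # blank line: sentence boundary
            prev = start_vertex
            continue
        word, phones = line.split("\t")
        key = (word, phones)
        if key not in vertices:            # a vertex per distinct
            vertices[key] = next_id        # word/pronunciation pair
            next_id += 1
        current = vertices[key]
        edges.add((prev, current))         # word-to-word transition edge
        prev = current
    return vertices, edges
```

Storing the edges in a set collapses repeated word-to-word transitions into a single directed edge, so the result is the connected directed graph described above; planarity of its underlying undirected graph is checked separately.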
The graph is generally too large to use the method presented in [16] for extracting the planarity breaking arcs in the graph. The following method is developed to extract the planarity breaking arcs and find the partition sets of the graph. The detailed process of the graph partitioning is described later in this chapter. Since the existence of the start vertex increases the possibility of nonplanarity. 68 we can temporarily remove the start vertex from the graph. However, all the edges incident to the start vertex can be brought back in the partitioning process. The reason why we can remove the start vertex first is that: According to the search algorithms developed in Chapter 3. the start vertex is always in the selected set C. and it is the first vertex to be evaluated on each path. Let us recall the partitioning algorithm for nonplanar graphs developed in Chapter 2. Any planarity breaking arc incident to a vertex in the set C can be brought back without violating the PST. Therefore, we can remove the start vertex first and discover the planarity breaking arcs for each component of the remaining graph. Note that the remaining subgraph might not be connected. Therefore, it might contain some connected components. If some component remains large. we can further discover the blocks10 in that component. A linear algorithm to identify the blocks of a graph is found in [4]. Since the size of each block in each component of the remaining subgraph is relatively small. we can then apply the method described in [16] to find the planarity breaking arcs for each component of the remaining subgraph. After this process. the partitioning procedure can be implemented. 4.2.2 Speech Modeling for Partitioned Graph Search Tech- niques 4.2.3.1 Cepstral Analysis The feature extraction task for this speech recognition system is based on the mel- \ 10A block is a graph which contains no_cut vertex. 69 cepstral parameters. The analysis in this research is based on a paper by Davis and Mermelstein [27], in which the mel-based cepstral coefficients. say 0... are calculated as below: '20 c7. = 2E.- cos[n(i —. )9 I =1 1 for n = l,2.---.I\’ (4.9) [V'— O 2:1 where E.- denotes the critical banal11 filter log energy outputs. and K is set to 8 in our implementation. Twenty mel-frequency components are desired on the Nyquist range 0 — 8 kHz (the sampling rate is 16 kHz). The critical-band filtering is then simulated by a set of twenty triangular bandpass filters as shown in Fig. 4.26, where ten mel- frequency components appear linearly on the Range 0 — 1 kHz and the remaining ten are distributed logarithmically on the range 1 — 8 kHz. 4.2.3.2 Vector Quantization One major issue in vector quantization (VQ) is the design of an appropriate code- book for quantization. The VQ codebook contains a set of L vectors which provide the minimum average distortion (distance) between a given training set of analysis vectors and the codebook entries. The most common measure of difference between two vectors, say c1 and C2, is the Euclidean distance (1(01. 0;) (suppose that there are K features in each vector): 1.- d(c1.c2) —_- 2W3) ——c2(i)I2. (4.10) i=1 11A critical band can be viewed as a bandpass filter whose frequency response corresponds roughly to the tuning curves of auditory neurons [76]. 70 Filter bank 1 v r 0.9 J 0.7 s 0.6 4 -§ 0 0.5 r - 0.4 . . 0.3 > I I I e 0.2 - [ [ [ < 0.1 *- -1 0 . 1 l 1 m 1 L O 1000 2000 3000 4000 5000 6000 7000 8000 Frequency (Hz) Figure 4.26: Filter bank for generating mel-based cepstral coefficients. 
Codebook generation usually involves the analysis of a large sample of training sequences. Many iterative procedures have been proposed for designing codebooks [8]. [67]. In this research. the splitting approach is adopted. which iteratively de- signs codebooks of increasing dimension using the binary search method. The binary search algorithm is summarized as follows: The set of training sequences in the K— dimensional space is first partitioned into two regions using the 2-means algorithm [67]. Then each of the two regions is divided into two subregions. This process is repeated until the space is divided into L regions. The centroid is computed for each region by averaging all of the training vectors in the partition. Then. the centroid of each region is a prototype vector in the VQ codebook. In our implementation. L is set to 128. 71 In order to improve the recognition rate. the power and differenced power are extracted from the speech signal. Power is computed from the speech waveform as follows: lvw P.=108(Zfiil (4.11) k=l where P. is the power for frame i. There are N... discrete samples in each frame. N... is set to 256 and consecutive frames are spaced 128 samples apart. The Hamming window is applied to the speech to create the frame samples. It is also necessary to normalize for speaker loudness variation. In this work, we normalize by the factor which makes the variance of the power term unity. That is, each term is divided by the square root of its variance. Then. a direct distance measure can be computed without. weighting the power term. In addition to the power information. another source of information is differenced power, which is computed as follows: Di 2 i+l —' Pi—l (412) where D. is the differenced power term of the i” frame of the utterance. Clearly. differenced power provides the information about relative changes in amplitude or loudness. Similarly, D. is normalized to make the variance of the differenced power unity. The recognition accuracy improves significantly with the power and differenced power features added. Therefore. there are ten features selected for representing each frame of the speech signal including eight mel-based cepstral parameters, power, and 72 differenced power. 4.2.3.3 HMM Training Both the whole-word HMh/Is and phone models are used in the intraword level in the language graph, and methods described in this subsection are used to train either model form. The Bakis model [8] is used for both the word models and phone models. In the implementation. the three-state Bakis model is used to train the phone models. and six-state for the word models. The state transition coefficients are such that a” = 0, Vj < i in a Bakis model. Further, additional constraints disallow jumps of more than two states. This constraint is of the form: (1.5 = 0, Vj > i+ 2. For example, the form of the ”state transition matrix A of Fig. 2.5 with six states is an an (113 0 0 0 0 (122 (£23 (124 0 0 0 0 (133 (134 (135 0 A = (4.13) 0 0 0 (£44 (145 (146 0 0 0 0 ass “56 {000001) 71'.= (4.14) In this research. the forward-backward (or Baum- 14"elch) reestimation algorithm [8]. [24]. [21] is adopted for training both the word models and phone models. It has 73 been proved by Baum et al. [21] that either the parameters of the re-estimated model remain the same when a critical point is reached or every re-estimate is guaranteed to improve the model parameters. A detailed description of the forward—backward algorithm is given in Appendix E. 
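The transition-matrix constraints of equation (4.13) are easy to state in code; the sketch below builds a six-state matrix with a_ij = 0 for j < i and for j > i + 2. The uniform row values are only placeholder initial estimates for Baum-Welch reestimation, not values taken from the thesis.

```python
import numpy as np

# Sketch of the Bakis constraint of equation (4.13): from state i the model
# may only stay in i or move to i+1 or i+2, and the final state is absorbing.
# Uniform nonzero entries are placeholder initial values for reestimation.

def bakis_transition_matrix(num_states=6):
    A = np.zeros((num_states, num_states))
    for i in range(num_states):
        hi = min(i + 3, num_states)     # allowed targets: i, i+1, i+2
        A[i, i:hi] = 1.0 / (hi - i)     # each row sums to one
    return A
```

Because a transition probability initialized to zero remains zero under the Baum-Welch reestimation formulas, this structure is preserved throughout training.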
There are of total 1.156 words (with different phonetic labels) plus one silent “word” (representing the start vertex) and 60 phones trained in this experiment. These words are selected from the TIMIT database with at least five training data. and none of them is a function word. The training data for phone models are chosen from the “sx” sentences in the TIMIT database. The total number of training data for each phone model is at most 200. Figure 4.8 shows the actual number of phonetic segments in the training set. Phone Number Phone Number Phone Number Ind 2658 /r/ 2953 IN 1317 /ih/ 2425 /w/ 1216 Id/ 1438 leh/ 1815 ly/ 635 /g/ 773 ley/ 1374 lth 462 /p/ 1668 lae/ 1414 lhv/ 337 /t/ 2340 /aa/ I468 lel/ 568 [lg] 2463 law/ 428 /s/ 3679 /dx/ 1081 /ay/ 1 198 /shl 820 /q/ 1626 lah/ 1282 /1./ 2309 Ith 787 /a0/ 1 I62 /zh/ 95 /ch/ 597 loy/ 233 /f/ 1308 [be]! 1 196 low/ 944 /th/ 476 Ich/ 2395 /uh/ 263 IV] 1 125 lch 866 luw/ 342 /dh/ I387 /pcl/ 1693 lux/ 956 /m/ 2012 Itcl/ 3454 ler/ 1036 /n/ 3565 Aid] 2667 lax/ 21 11 lng/ 736 lepi/ 538 [ix] 4387 lem! 69 lpau/ 363 laxr/ 1490 /en/ 402 lux/ 395 lax-bl 22S /eng/ 15 /1/ 2615 Table 4.8: Phonetic segments in the training set. 74 4.2.3 Graph Partitioning After the training process is completed. graph partitioning is used to select the key vertices in the language graph. A good partition set should satisfy the following criteria: 1. The vertices in the selected set C are well-trained. 2. The selected set C contains as few function words” as possible. These function words are problematic in continuous speech recognition [‘21] because they are articulated very poorly and are hard to recognize or even locate. For example, Table 4.10, Table 4.11, Table 4.12, and Table 4.13 enumerate the number of times the function words the. in, with, and a. respectively of different transcrip- tion labels in the TIMIT database. To achieve a partition satisfying the criteria above. a non—negative reward for each vertex is required. A reward is assigned to each vertex according to the following policies: 1. Among all vertices representing whole words, higher rewards are assigned to vertices trained by more training data. ‘2. A higher reward is assigned to a vertex representing a whole-word HMM than to a vertex containing a phone model. l2Function words are typically prepositions. conjunctions. pronouns, and short verbs. For example, the, a. in. with are function words ['21]. 75 3. A penalty (negative reward) is assigned to a vertex representing a function word. The function words used in this research are listed in Table 4.9. There are 42 function words indicated by Lee et al. [21]. a all and any are at be been by did find for from get give has have how in is it list many more of on one or show than that the their to use was were what why will with would Table 4.9: A list of function words appearing in this research. Phonetic Labels No. Phonetic Labels No. II Phonetic Labels No. dh-ix 688 dcl-d-iy 1 I] d-ih 2 (“W 298 q 3 th-ih 2 dh-ax 804 ah-z-ax l l til-21X 2 dh-eh 25 dh-el 3 d-ah 1 dh-ih 36 s 3 ll z-ax 1 dh-ah 46 n-ax 1 z-ax-h l iy 66 n-iy 4 dh-q-eh l dh-ax-h 34 th-ax-h 5 dcl-dh-ix l dh 24 dh-aa 3 aX-Pau-dh-ax 1 3mm (mm 3 I] sh-dh-iy 1 h-ix 5 z-dh-ix 1 I] th-ix 1 Table 4.10: Different ways the function word the is pronounced in TIMIT. Clearly, a better vertex selection is a set of higher reward. To quantify these requirements. a reward function of the following form is used for vertex .13: 76 Phonetic Labels No. IrPhonetic Labels No 0.5+Phonetic Labelsl No. 
en 219 ih-ng hv -—1x n 2 ih-n 261 ix-nL 10_0| ix- pawq- -ih- -n 1 ax-n 114 er q- ax- --h mg 1 q-ix-ng 40 q-ih-ng 1 1 hh-ix-ng 1 w-ix-n 23 3mg 1 ax-h-pau-q-ih-n 1 Cl-ih-n 86 I iX-q-ih—n 2 | hh-ih-n 1 eng 13 q-ih-cng l q-ih-m 1 ix-nx 128 ih-ax-n I hh-cm 1 q-iy-ng 2 ih-ix-n I to] 1 ax-hn 2 q-ih-nx 4 I ng 2 n I Table 4.11: Different ways the function word in. is pronounced in TIMIT. Phonetic Labels No. Phonetic Labels No. Phonetic Labels . w-ax-dh l9 w-ix-tcl-q l ix-th 2 w-ix-th 10 l uh-dh l w-ix-s 8 w-ix-dh 25 w-ax-h-th l w-dh 1 w-ih-th 26 w-th 1 w-uh-tcl-t l w-ih-s 8 w-ih-ax 1 w-ix-tcl 2 w-ax-th 46 w-ih-tCl-th l W-iX-q l hh-w-ax-h-th l w-ih-dh 4 Table J(.122: Different ways the function word with. is pronounced in TIMIT. Phonetic Labels No. Phonetic Labelsl No. Phonetic Labels No. ax-h 31 q-ax 42 q-eh 4 ax 497 61-“! 6 m 2 ix 401 ch 19 uh 6 ey 1 30 q-ix 12 hv-ey 1 0'6)’ 54 q-ax-h 1 q-ah 35 ah 103 Table 4.13: Different ways the function word a is pronounced in TIMIT. 77 Q(.r) =O‘S(£)+13T(I)+‘Y(I)F(.II) (4.15) Let Q(.r) be the non-negative reward assigned to the vertex 1:. Let 5(2) be the weight for vertex J: which is based 011 the number of occurrences of the corresponding word with its particular phonetic baseform in the TIMIT database. Generally speaking, the larger S(.r), the more incoming and outgoing transitions from vertex 1:. Let T(.1:) be the weight for the quality of training for the word at vertex 2:, e.g., the larger T(I) is, the better training the resident word has. Let F(.1:) be the penalty for the vertex .1: if it contains a function word. a, 1‘3, and 7(1') are the weighting factors. These factors satisfy the following properties: a +3 = 1, and (1 >13 (4.16) —1, if the word at vertex 2: is one of the function words in Table 4.5 0, otherwise (4.17) Experimentally, a is set to 0.6, and 1’3 is 0.4. The penalty for a function word [(1‘) is assigned to be 0.805(2). The significance of 5(1) and T(x) in the reward function is the following: Suppose the word model at vertex .1: is not well-trained. Then, T(.r) is set to zero, and 03(2) is the reward for the vertex 1:. If the word model at .r is well-trained, then more reward can be assigned to I through T(r). At the same time, 78 the existence of T(r) can be used to indicate the quality of training of the HMM for the word resident at .1:. By this, we mean that the more training data for training the model at .r, the larger T(r). If a vertex is not represented by a well-trained word HMM, then it can be represented in terms of phone models. After a reward is assigned to each vertex, the partitioning algorithm described in Chapter 2 is applied directly to the graph. Therefore, a set of vertices (i.e., those vertices in the set C) are selected for further evaluation and search. The partition results are affected by the following three factors: 1. [\1 The selection of the source vertex 3. In a planar subgraph of the language graph, we need to choose a source vertex for breadth-first search (BFS) in Step 1 of the partitioning algorithm described in Section 2.3.1. Clearly, different source vertices for the BFS cause different partition results. . The reward assignment for each vertex. The rewards assigned to the vertices can be modified by applying different sets of weighting factors a, 13, and 7(2) in equation (4.17). The resulting partition depends on the choice of the weighting factors. . The extraction of planarity breaking arcs. For a nonplanar graph, the planarity breaking arcs are first extracted to create a planar subgraph. 
The graph parti- tioning algorithm for planar graphs is then applied to this planar subgraph to get a preliminary partition set. A complete partition set is achieved by applying a method described in Section 2.3.2. However, the resulting partition depends on which set of planarity breaking arcs is extracted. The set of arcs causing nonplanarity is not unique. Our goal is to find a partition which contains a high reward in the set C by comparing several partitioning results. A set of high reward should be marked for evaluation. [11 our experiments, there are a total of 470 vertices (z 2.6% of the total number of vertices in the language graph) in the set C. Among the selected vertices, 1.91 of them are from the violation of the PST (recall the partitioning algorithm for nonplanar graphs in Chapter ‘2). The partition result is good enough because the set C contains 25% of the vertex reward in the language graph, and there are a total of 217 well-trained word models in the selected set. In the next section, these selected vertices are used to test. the performance of the graph search algorithm developed in this work. 4.2.4 Two-Pass Graph Search 111 this section, a few parameters for the word-spotting algorithm described in Section 3.2.1 are assigned. Experimentally, three thresholds, AL, A1, and Au, are set as follows to yield good spotting results: For each selected word model M”, AL is set to 0,, — Up, A; is set to 10-3. and _\u is set to 10‘2““, where DP and 0,, are estimated from the training data. The significance of these parameters is discussed in Chapter 3. The stack size (1 is virtually unlimited. Therefore, there is no hard pruning. 80 4.3 Experiments 4.3.1 Methods and Measures of Performance Partitioned graph search techniques are applied to recognition of continuous-speech taken from the TIMIT [21] database. Comparisons of the results of the following approaches are made: 1. partitioned graph search. 2. left-to-right search employing randomly selected vertices of the same number as used in the partitioned graph search cases. 3. conventional left-to-right stack search with pruning. To compare the performance between the partitioned-search-based algorithms and the conventional left-to-right methods, the measures of performance include: 1. computational complexity. (\D . recognition accuracy. 3. noise robustness. ln continuous-speech recognition, there are three types of errors: substitution errors (a word is misrecognized in the sentence), deletion errors (a correct word is omitted in the recognized sentence), and insertion errors (an extra word is added in the recognized sentence). h’leasures suggested by Lee [21] are used to assess the recognition performance. we first align the recognized word string against the correct 81 word string, and then compute the number of words correct, and the number of substitutions, deletions. and insertions. Five results are reported in computing the error rate. These are the percent correct (% Corn), percent substitutions (71;. Subs), percent deletions ((76 Dels.), percent insertions (% 1113.). and word accuracy (\Nord Acc.). Note that the percent correct does not consider insertions as errors but the word accuracy does. These performance measures are computed as follows: . Correct ‘7. C.‘ .= 101) x 4.18 ( orr Correct Sent Length ( ) _ S l.‘ (7:: Subs. = 100 x a, 1‘" (4.19) (.orrect Sent Length _ D1. 70 Dels. = 100 x ‘3 Q (4.20) Correct Sent Length 1113 (7" l ‘. = 100 _ 4.21 C “b x Correct Sent Length ( ) l . .'l. 
D 1 Error Rate = 100 X ns + S” )8 + e S (4.22) (.orrect Sent Length Word Acc. = 1 — Error Rate (4.23) Since Correct Sent Length 2 Correct + Subs + Dels. 82 Correct — Ins W (1 A = 100 . 4.24 or CC x Correct Sent Length ( ) To consider the measure of performance regarding to noise robustness, we need to know the signal-to-notse ratio (SNR). In this work, the SNR is computed as follows: Denote the SNR in (18 by N. Then, . T2 h‘.(dB) = 10 X log ELL; (4.25) 1712' where f,- is the sample at timei in the speech signal, and n, the noise at time i. The summations in equation (4.27) are taken over the ranges of is which are corrupted by noise. 4.3.2 Results and Discussion A set of ‘20 utterances taken from the dr3 dialect (North Midland) segment of TIMIT is used to evaluate the performance of the three methods listed above. In order to reduce the computational effort for the left-to-right search, a pruning strategy is used to prune unlikely partial paths after evaluating at least the first and second models on each path. The partition evaluation likewise comprises a sort of “pruning” procedure and it is assured that no path is pruned in the partition search case unless at least two evaluations have taken place on it. Noisy utterances in these experiments were created by adding noise of 0 dB SNR to randomly selected time intervals in the normal speech. The method used to corrupt the normal speech is as follows: Four fixed length blocks (3,000 samples”) of noisy samples are added to the normal speech beginning at four randomly (using a C- language version of the machine-dependent random number generator) selected times. The experimental results are shown in Table 4.14. In addition to the simple measure of performance defined above, we have added three further measures of performance specifically dealing with complexity reduction. Method I Partition Random Left-to-Right Measures Normal Nolsy Normal Noisy Normal Noisy % Corr. 79.05 % 56.08 % 59.46 % 40.54 % 60.14 % 42.56 % % Subs. 16.22 % 35.13 % 34.46 % 52.03 % 39.19 % 50.68 % % Dels. 4.73 % 8.79 % 6.08 % 7.43 % 0.67 % 6.67 % % Ins. 0.00 % 2.03 % 2.70 % 4.05 % 6.76 % 6. 67% Word Acc- 79.05 % 54.05 % 56.76 % 36.49 % 53.38 % 35.81 % Average it of surviving Paths after first PCS 763 778 1,327 1.318 976 991 or LTR pruning it of vertices evaluated at first PCS or LTR 470 470 470 470 10,061 10.061 Pruning Average node evaluated! node pruned In the first 0.05271 0.05285 0.05824 0.05814 1.17034 1.173512 and second time slots Table 4.14: Experimental results. To assess complexity, let us consider the computational effort exerted to reduce the graph by pruning in the first pass of the search. 13The average sentence length is about 45,000 samples. As a reasonable complerity 84 measure (CM), we define Average number of vertex evaluations performed Complexity Measure (CM) 2 Number of vertices pruned in the first two time slots. (4.26) In the equation above. the number of vertices pruned is obtained either effectively in the partition case, or directly in left-to-right search. This measure favors the left-to-right search in the sense that left—to-right evaluation works directly with the first two models on any path whereas the partitioned search can generally only affect them indirectly. The CM result for each noise—free utterance is shown in Fig. 4.27. Fig. 4.28 is a "blomip" of the CM comparison between the partition search and random selection methods so that the results can be seen clearly. Similar plots for the noisy speech are shown in Figs. 4.29 and 4.30. 
On the average, the computational expense of left-to—right search as measured by the CM is about 22 times more than the computational effort for partitioned search. 111 considering this CM result, it should be carefully noted that the partition and search procedure is only performed once in these experiments. If partition and search were to be repeated in successive stages. even greater cost savings could be obtained. However, in a relatively small graph like the one under consideration here, the benefit of repartitioning is not great, especially in terms of the CM. For example, we estimate that the graph is reduced to about 10% of its original size by the initial partitioning procedure (requiring 470 evaluations). If a second partition recursion could reduce the remaining subgraph by an additional 90% using, say, 47 further evaluations, then the CM measure for partitioned search would be reduced only in the third decimal 85 place. The recursive partitioning procedure will become increasingly more important and beneficial as the graph size increases. It is acknowledged that the recognition rates for each method explored in Ta- ble 4.14 are disappointingly low compared with rates obtained for other speaker- independent continuous-speech recognition systems using the TIMIT and other databases for evaluation [21], [65]. The main focus of this research has been to show the poten- tial for a steep decrease in the computational effort using partitioned graph search compared with conventional left-to-right search approaches in the continuous-speech recognition task. Much of the research effort has focused on the graph-theoretic as- pects of partitioning. and future attention to modeling and training issues. as well as more extensive experimental data, is expected to bring the recognition results in line with the state of the art. There is nothing about the partitioned search method which precludes similar modeling and training procedures to those used in very successful systems (see. e.g. [8]). As an example of the effects of possibly unfavorable modeling and relatively few test utterances in the present work, consider the following facts. Three of the test utterances were found to be particularly problematic in the recognition task. These are 5x140 (from fkmsO). si908 (from fpktO) and si1490 (from fkmsO). We suspect that poorly trained models were to blame for prematurely pruning correct paths in each trial and in each method. Table 4.15 shows the results for the basic measures of performance with these sentences removed from the study. Clearly these results bring the present study more in line with expected rates. Much is still to be learned about important modeling issues, parameter settings, and other details through future experimental research. 86 Method I Partition Random Lefi-to-nght Measures Normal Noisy Normal Noisy Normal Noisy % Corr. 92.8 % 65.6 % 69.6 % 47.2 % 70.4 % 50.4 % % Subs. 5.6 % 26.4 % 26.4 % 46.4 % 29.6 % 42.4 % % Dels. 1.6 % 8.0 % 4.0 % 6.4 % 0.0 % 7,2 % % Ins. 0.0 % 0.8 % 0.1) % 4.0 % 4.8 % 7.2 % Word Ace. 92.8 % 64.8 % 69.6 % 43.2 % 65.6 % 43.2 % Average # of surviving patm after first P68 7 10 748 1,244 1,234 1,013 1.045 or LTR pruning # of vertices evaluated at first PGS or LTR 470 470 470 470 10.061 10.061 Pruning Average node evaluated! node pruned In the first 0.05224 0.05258 0.05725 0.05736 1.17788 1.18487 and second time slots Table 4.15: Experimental results with three problematic test utterances removed. 
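The measures reported in Tables 4.14 and 4.15 can be summarized in a few lines of code. The sketch below assumes the correct, substitution, deletion, and insertion counts have already been obtained from the word-string alignment mentioned in Section 4.3.1, and follows equations (4.18)-(4.24) together with the complexity measure of (4.26); the function and argument names are illustrative.

```python
# Sketch of the performance measures of equations (4.18)-(4.24) and the
# complexity measure of (4.26).  The alignment that yields the counts is
# assumed to be done elsewhere.

def recognition_scores(correct, subs, dels, ins):
    ref_len = correct + subs + dels            # correct sentence length
    pct = lambda count: 100.0 * count / ref_len
    return {
        "% Corr.": pct(correct),
        "% Subs.": pct(subs),
        "% Dels.": pct(dels),
        "% Ins.": pct(ins),
        "Word Acc.": pct(correct - ins),       # insertions count as errors here
    }

def complexity_measure(avg_vertex_evaluations, vertices_pruned_first_two_slots):
    # average vertex evaluations per vertex pruned in the first two time
    # slots, as in equation (4.26)
    return avg_vertex_evaluations / vertices_pruned_first_two_slots
```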
Finally, we wish to exhibit an overall measure of performance for the various methods which takes into account the complexity benefits of the partitioned search. Using CM as defined in (4.26) as a measure of computational effort, the word accuracy normalized to CM is used as a good indicator of the overall performance of these search methods:

Normalized Word Accuracy (NWA) = (% Word Accuracy) / (Complexity Measure).

The results of the NWA for each search method are shown in Fig. 4.31. For the normal speech, the recognition results by partitioned graph search are significantly better than the results of conventional left-to-right methods in these experiments. Furthermore, the partitioned graph search is more robust than the left-to-right approach in noisy environments. An interesting and expected outcome is that the robustness to noise afforded by the partitioned search is apparently due to the "scattered site" evaluation which takes place. This is evident in Fig. 4.31 because the random selection technique is also much more robust to noise than left-to-right search. However, previous results clearly indicate that random selection is not nearly as effective in reducing the computational complexity as partitioning. This is due to the fact that partitioning provides a very systematic way of selecting "scattered," but high-payoff, vertices.

Figure 4.27: The complexity measure (CM) for normal speech using three different methods.

Figure 4.28: The complexity measure (CM) for normal speech using the partition and random selection methods.

Figure 4.29: The complexity measure (CM) for noisy speech using three different methods.

Figure 4.30: The complexity measure (CM) for noisy speech using the partition and random selection methods.

Figure 4.31: The overall performance as measured by the normalized word accuracy (NWA) of three different search methods.

Chapter 5
Further Discussion, Conclusions and Future Work

5.1 Summary and Further Discussion

This work has introduced a novel approach for continuous-speech recognition based on partitioned search of language graphs. In general, the most important finding of this work is that the partitioned search technique shows great potential for application to large-vocabulary continuous-speech recognition.
The general goal of this work has been to open the door to new research directions for future systems, which. if they are to be capable of recognizing speech based on practical and flexible languages. must involve very large graphs of very high perplexity. The central motivation for this partitioning operation is to reduce search complexity from (9(N) to O(\/—\7) by a judicious selection of vertices. This complexity reduction has the obvious advantage of permitting the search of larger spaces. but also some less obvious benefits which are discussed above and below. To summarize. the advantages of the partitioned graph search procedures explored in this work are as follows: 91 92 e The decoding procedure is based upon a very large graph (reflecting a large language model). The benefits of a large graph are twofold: l. A large graph will reflect the original grammar more accurately thereby improving recognition performance. 2. Models in a large graph will be trained in context and reflect the acoustic knowledge more accurately. These two “large model” benefits do not come at the expense of insufficient training because the partitioning procedure selects well-trained vertices. This feature permits a more accurate language model and better training in context. The computational complexity of partitioned search (with respect to conven- tional procedures) is greatly reduced by the inherent divide—and-conquer proce- dure. e Partitioned search can be used to increase robustness to frequently misarticu- lated words such as function words. Partitioned search is more robust to impulsive or intermittent noise than con- ventional left-to-right search, because the evaluation of the data, by focusing on the partitioned C-set of vertices, is distributed across the observations in time. This means that the search is less likely to be pruned in the presence of unmodeled noise. Significantly. partitioned search may be generally viewed as a type of n-best method, which has become quite popular in the speech recognition community in 93 recent years. An n-best approach employs a computationally inexpensive pass to pare down the problem to one involving a smaller graph of likely solutions. The smaller graph can then be subjected to intense scrutiny to find an optimal solution. 5.2 Contributions The partitioning concept is simple to understand, but finding a good partition is a highly nontrivial problem. Graph partitioning algorithms developed in this work offer an efficient and systematic method for locating the high-payoff set of vertices for distributed evaluation. The implementation of the novel partitioning algorithms are reported in this dissertation. In continuous-speech recognition tasks, the boundaries in the observation data are generally unknown. Without knowledge of the boundaries for each word in an utterance, the recognition task becomes significantly more challenging. In this work, a two-pass search algorithm for the unknown boundaries case is developed. The major contributions of this research are summarized as follows: 1. A vertex partitioning algorithm for generally nonplanar graphs is developed, and the key operation for finding a cycle for the planar graph subpartitioning is improved. ‘2. Partitioned graph search algorithms, which are applicable to the case of un- known boundaries in the observation string, are developed. 3. Partitioned graph search techniques are applied to recognition of continuous- 94 speech taken from the TIMIT [‘21] database. 
Comparisons of the results using the following approaches for recognizing continuous speech are made: (a) partitioned graph search. (b) left-to-right search employing randomly selected vertices of the same num- ber as used in the partitioned graph search cases. (c) conventional left-to-right stack search with pruning. 5.3 Future Work The following issues for future research have emerged from this work: 1. There is a “bottleneck” in the process of partitioning a very large graph. The difficult task appears in the planarity testing operation. Before a language graph is partitioned. certain planarity breaking arcs need to be extracted, and the planar embedding for a large graph is extremely time and space consuming. To solve this problem when a large vocabulary system is considered, the following method is suggested. The language graph is decomposed into several small subgraphs and each subgraph is partitioned individually. Then the individual partitioning results are combined to form a selected set for the large graph. However, the selected vertex set C may no longer satisfy (9( fl). Therefore, the very interesting and challenging problem of finding more efficient and effective methods remains for future work. 95 2. Since most language models are expected to be representable only by nonplanar graphs, it is critical that the partitioning procedure be applicable to such graphs. One ad hoc method for dealing with the nonplanarity issue is described in Chapter 2. Other methods for treating nonplanarity and general effects of nonplanarity upon performance remain important issues for future research. 3. In the present research, resident at each vertex is either a whole-word HMM or a phone-based model composed of context-independent phone HMMs. Whole- word models are often found to outperform phone-based models if sufficient data are available to train them [81]. However, this training approach is im- practical in a large speech recognition task. In the future systems, other models for representing these vertices are required. such as context-dependent phone models (e.g., triphones [21]). 4. Much is still to be learned about important modeling issues, parameter settings, and other details through future experimental research aimed at improving the overall performance of the partitioned graph search technique. 5. The complexity reduction is very significant when N is much larger than fl. Future systems must employ a very large (109 - 10” vertices or even more) graphs of very high perplexity if they are to be capable of recognizing speech based on practical and flexible languages. In order to prevent the intractable problem of training of all vertices in huge graphs, the partitioning process can be used to determine a small set of high-payoff vertices to be trained. In this case, the partitioning is not guided by the availability of training data, but a. 96 number of partition sets can be tried to find a set for which sufficient training data are available. Appendices Appendix A Graph-Theoretic Notations and Definitions A graph G(V, E) is an ordered pair consisting of a finite set of vertices V and a finite—set of edges B. Let |V| denote the number of vertices in G and IE] the number of edges. If each edge is an unordered pair of distinct vertices, then the graph is undirected. If each edge has an assigned orientation, then the graph is directed. Let (n.1,?) be an edge in a directed graph. Then (a. v) is said to join vertex u to vertex v; u is the tail of (u, v), and v is its head. 
If G is a directed graph, the underlying graph of G is the undirected graph formed by converting each edge of G to an undirected edge. Only undirected graphs are considered in the following definitions. Of course, we may extend the definitions to directed graphs by considering their underlying graphs. A walk from v0 to 12;, in G is a finite nonempty sequence whose terms are alternately vertices v,- and edges 64-1“, 1 g i S k, such that 6,-” = (v,_1,v,). If all the vertices of a walk are distinct, then the walk is called a path. On the other hand, if all the edges of a walk are distinct, the walk is called a trail. A walk from a vertex to itself is called a closed walk. A cycle. is defined as a closed trail whose origin and internal vertices are distinct. 97 98 A vertex w is reachable from a vertex v if there is a path from v to w. A graph G(V, E) is said to be connected if any vertex in G is reachable from any other vertex. If G is a connected graph. then a spanning tree of G is a subgraph T Q G which is a tree and contains all the vertices of G. The radius of a tree is the maximum distance of any vertex from the root. A graph G(V, E) is said to be planar if it can be drawn, or embedded in the plane in such a way that the edges of the embedding intersect only at their ends. A planar embedding of a planar graph is sometimes referred to as a plane graph [1]. A planar representation of a graph divides the plane into a number of connected regions called faces, each bounded by some edges of the graph. The boundary of a face is a closed walk. A face f is said to be incident with the vertices and edges on its boundary. Figure A.32 depicts the faces of a particular embedding of the graph. Let b(f) denote the boundary of the face f. For example, f; of the plane graph given in Figure A.32 is: b(f)) = "161.2'0262.7U7€7.sl-’666.1Ur Figure A.32: A plane graph with four faces. 99 Of course. any planar representation of a connected planar graph always contains one face enclosing the graph. This face, called the erteriorfacc, is f, in Figure A.32. Note that if e is a cut edge in a plane graph, then only one face is incident with 6 since the removal of the cut edge disconnects the plane graph; otherwise, there are two faces incident with 6. Similarly. a cut vertex of a connected graph is defined as a vertex whose removal disconnects the graph. For instance. 8.7.3 is a cut edge and v7 a cut vertex in Figure A.32. Since 6773 is a cut edge. only f3 is incident with it. On the contrary, em is not a cut edge: there are exactly two faces (f1 and f2) incident with it. This property is important in the developments of graph partitioning algorithms described in ("hapter 2. Finally. if G is connected and contains no cut vertices, then G is biconnectrd. Appendix B Supporting Lemmas for the PST The constructive proof of the Planar Separator Theorem (PST) depends on two fundamental lemmas. A brief description of these supporting lemmas is as follows: Lemma 3.1 Let G be any .V-verter connected planar graph having non-negative ver- ter rewards summing to no more than one. Suppose G has a spanning tree of radius 1'. Then the N rtices ofG can be partitioned into three sets A, B, and C. such that no edge joins a vertex in A with a vertex in B. neither A nor B has total reu'ard exceeding 2/3. and C contains no more than 2r + 1 vertices. one the root of the tree. The proof of the lemma proceeds by first embedding G in the plane and finding a breadth-first spanning tree [4] [28] of G. 
The proof of the lemma proceeds by first embedding G in the plane and finding a breadth-first spanning tree [4], [28] of G. Each face is then triangulated by adding additional edges, so that any nontree edge (including the newly added edges) forms a simple cycle with some of the tree edges. The length of this cycle is at most $2r + 1$ if it contains the root of the tree. By the Jordan Curve Theorem [1], the cycle divides the graph into two parts, the inside and the outside of the cycle. Lipton and Tarjan [17] show that at least one such cycle separates the graph so that neither the inside nor the outside of the cycle has total reward exceeding 2/3.

Lemma B.2 Let G be any N-vertex connected planar graph. Suppose the vertices of G are partitioned into levels according to their distance from some vertex s, and let L(l) denote the total number of vertices on level l. Given any two levels l' and l'' such that levels 0 through l' - 1 have total reward not exceeding 2/3 and levels l'' + 1 and above have total reward not exceeding 2/3, it is possible to find a partition A, B, C of the vertices of G such that no edge joins a vertex in A with a vertex in B, neither A nor B has total reward exceeding 2/3, and

$$|C| \le L(l') + L(l'') + \max\{0,\; 2(l'' - l' - 1)\}.$$

The lemma is very important for constructing a vertex partitioning algorithm for actual implementation. The proof of the lemma concerns the relationship between levels l' and l''.

(i) Suppose $l' \ge l''$. Then the lemma is obviously true if we choose all the vertices on level l' to be in the set C, let A be all the vertices below level l', and let B be all the vertices above level l'.

(ii) Suppose that $l' < l''$. Since the vertices in levels l' and l'' are deleted, the graph naturally partitions into three parts: vertices on levels 0 through l' - 1, vertices on levels l' + 1 through l'' - 1, and vertices on levels l'' + 1 and above. To find an appropriate vertex partitioning in this condition, two cases must be considered. One is the case in which the total reward between levels l' + 1 and l'' - 1 does not exceed 2/3. A proper partition is obtained by setting A to the part of the three with the most vertices, B to the remaining two parts, and C to the set of vertices in levels l' and l''. The other case is that in which the total reward between levels l' + 1 and l'' - 1 exceeds 2/3. In this case, the part between l' + 1 and l'' - 1 requires subpartitioning, which is carried out as follows. All vertices on levels l'' and above are deleted, and all vertices on levels 0 through l' - 1 are shrunk to a single vertex x of reward zero. A new graph, say G', is formed. Note that the new graph preserves planarity [17]. Apply Lemma B.1 to the new graph, and let A', B', C' be its vertex partition, the set C' being the vertices on the cycle. A proper vertex partitioning of the graph G is then obtained by letting A be the set among A' and B' with more vertices, C be the vertices in levels l' and l'' plus the set C', and B be the remaining vertices.

Appendix C
Lipton-Tarjan Approach to Finding a Cycle

The Lipton-Tarjan (L&T) approach [17] for finding a cycle to complete the vertex partitioning for planar graphs involves the following steps:

Figure C.33: Cases of Step 4 in the L&T approach for finding a proper cycle. Solid edges are tree edges; dotted edges are nontree edges.

Step 1 Construct a breadth-first spanning tree rooted at x (created by the shrinking procedure described in the proof of Lemma B.2) in the new graph G'. Record, for each vertex v, the parent of v in the tree and the total number of descendants of v, including v itself.
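As an illustration of Step 1 (not from the original), the bookkeeping can be realized with a standard breadth-first search that records each vertex's parent and then accumulates subtree sizes from the leaves toward the root. A minimal Python sketch, assuming an adjacency-list dictionary, is given below; names are ours.

```python
from collections import deque

def bfs_spanning_tree(adj, root):
    """Breadth-first spanning tree of an undirected graph.

    adj  : dict mapping each vertex to an iterable of neighbours
    root : the root vertex (the shrunken vertex x in Step 1)

    Returns (parent, descendants), where parent[v] is v's parent in the
    tree (None for the root) and descendants[v] counts v and all of its
    descendants in the tree.
    """
    parent = {root: None}
    order = []                       # vertices in BFS (level) order
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in parent:      # first visit: (u, w) is a tree edge
                parent[w] = u
                queue.append(w)

    # Accumulate descendant counts from the leaves toward the root.
    descendants = {v: 1 for v in order}
    for v in reversed(order):
        if parent[v] is not None:
            descendants[parent[v]] += descendants[v]
    return parent, descendants


# Small example.
adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
parent, desc = bfs_spanning_tree(adj, 1)
print(parent)  # {1: None, 2: 1, 3: 1, 4: 2}
print(desc)    # {1: 4, 2: 2, 3: 1, 4: 1}
```

The descendant counts allow the number of vertices on either side of any candidate cycle to be computed quickly in the later steps.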
Step 2 Embed the new graph G' in the plane. Let H' be the planar embedding of G'. Make each face of H' a triangle by adding a suitable number of additional edges.

Step 3 Choose any nontree edge, say $(v_1, w_1)$, and locate the corresponding cycle. Compute the number of vertices on each side of this cycle by scanning the tree edges incident on each side of the cycle. Determine which side of the cycle has greater reward and call it the "inside."

Step 4 Let (x, z) be the nontree edge whose cycle is the current candidate to complete the separator. If the reward inside the cycle exceeds 2/3, then find a better cycle by the following method. Locate the triangle (x, y, z) which has (x, z) as a boundary edge and lies inside the (x, z) cycle. Two cases must be considered:

1. If either (x, y) or (y, z) is a tree edge, let (v, w) be the nontree edge among (x, y) and (y, z). Then compute the reward inside the (v, w) cycle. This case is shown in Figure C.33 (a) and (b).

2. If neither (x, y) nor (y, z) is a tree edge, determine the tree path from y to the (x, z) cycle by following parent pointers from y. Let u be the vertex on the (x, z) cycle reached during this search. Then compute the number of vertices inside the (x, y) cycle and the (y, z) cycle. Let (v, w) be the nontree edge among (x, y) and (y, z) whose cycle has more reward inside it. This case is shown in Figure C.33 (c) and (d).

Repeat Step 4 until a cycle satisfying Lemma B.1 is found.

Appendix D
A Planarity Testing Algorithm

The planarity testing algorithm of Demoucron et al. [1] is shown in Figure D.34. Before using this algorithm to determine whether a given graph is planar, some preprocessing considerably simplifies the work. Note the following points:

1. If the graph is not connected, then each component should be subjected to planarity testing separately.

2. If no cycle is found, then the graph is a tree and therefore planar.

3. If $|E| < 9$ or $N < 5$, then the graph must be planar; if $|E| > 3N - 6$, then the graph must be nonplanar [1].

The following definitions are required. Let $G_i(V_i, E_i)$ be a subgraph of G. A bridge B of G relative to $G_i$ is then:

1. either an edge $(u, v) \in E$ where $(u, v) \notin E_i$ and $u, v \in V_i$, or

2. a connected component of $(G - G_i)$ plus any edge incident with this component.

We denote by $V(B, G_i)$ the vertices of attachment of B to $G_i$. Let $H_i$ be an embedding of $G_i$ in the plane. If B is any bridge of $G_i$, then B is said to be drawable in a face f of $H_i$ if $V(B, G_i)$ is contained in the boundary of f. We write $F(B, H_i)$ for the set of faces of $H_i$ in which B is drawable. The algorithm to follow is based on a very important criterion: if $F(B, H_i) = \emptyset$, then no further planar subgraph embedding can be obtained, and the algorithm terminates, indicating nonplanarity.

Given a graph G, the algorithm determines an increasing sequence $G_1, G_2, \ldots$ of planar subgraphs of G and corresponding planar embeddings $H_1, H_2, \ldots$ when G is planar. Through the algorithm, it is easy to record the faces of each subgraph $G_{i+1}$ at each iteration i, as shown in Figure D.34. The procedure is as follows:

1. If there exists a bridge B such that $F(B, H_i) = \emptyset$, then the graph G is nonplanar, and the planarity testing and planar embedding cease.

2. If there exists a bridge B such that $|F(B, H_i)| = 1$, then let f be the single face in $F(B, H_i)$. From the bridge B, choose a path $P_i \subseteq B$ and set $G_{i+1} = G_i \cup P_i$. The faces of $G_{i+1}$ can be obtained by drawing $P_i$ in the face f of $G_i$.

3. Otherwise, choose any face f and any bridge B such that $f \in F(B, H_i)$. From the bridge B, choose a path $P_i \subseteq B$ and set $G_{i+1} = G_i \cup P_i$. The faces of $G_{i+1}$ can again be obtained by drawing $P_i$ in the face f of $G_i$.
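The preprocessing tests listed at the beginning of this appendix are cheap to apply before invoking the full procedure. The following Python sketch (ours, not part of the original algorithm) assumes a connected simple undirected graph given as a vertex list and an edge list, and returns a definite answer only when the quick tests are conclusive.

```python
def quick_planarity_test(vertices, edges):
    """Apply the preprocessing shortcuts for planarity testing.

    Returns True (surely planar), False (surely nonplanar), or
    None (the full algorithm of Demoucron et al. is still needed).
    Assumes a connected simple undirected graph.
    """
    n, m = len(vertices), len(edges)

    # A connected graph with |E| <= N - 1 has no cycle, i.e. is a tree,
    # and is therefore planar.
    if m <= n - 1:
        return True
    # Small graphs: fewer than 9 edges or fewer than 5 vertices.
    if m < 9 or n < 5:
        return True
    # Necessary condition for planarity: |E| <= 3N - 6.
    if m > 3 * n - 6:
        return False
    return None


# Examples: K5 has 10 > 3*5 - 6 = 9 edges, hence it is nonplanar.
from itertools import combinations
K5_edges = list(combinations(range(5), 2))
print(quick_planarity_test(range(5), K5_edges))                  # False
print(quick_planarity_test(range(4), [(0, 1), (1, 2), (2, 3)]))  # True (a tree)
```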
Note that if G is planar, then by Euler's formula [1], $|E| - N + 2$ faces will be found. Since these faces have been found through the process shown in Figure D.34, a fixed planar embedding of the graph G can be constructed.

Figure D.34: The planarity testing algorithm of Demoucron et al.

Appendix E
Elements of the Hidden Markov Model

In speech recognition, hidden Markov models (HMMs) have become increasingly popular over the last fifteen years, since the modeling provides a mathematically rigorous approach and works very well in practice [8], [9], [21], [24]. Much of the popularity of the HMM is based on the discovery of efficient methods for estimating its parameters. The form of the HMM can be chosen according to knowledge of the application domain, and the parameters are trained from known data. To use the HMM, three basic problems (training, evaluation, and recognition) must be understood:

1. The training problem: Training an HMM is simply the estimation of the parameters of a model from a set of known training data (observation sequences), such that the parameters best describe how a given observation sequence is generated.

2. The evaluation problem: Given a model and a sequence of observations, the evaluation problem considers the probability that the observed sequence is produced by the model.

3. The recognition problem: Recognition with HMMs is the identification of an unknown observation sequence by choosing the most likely model (representing a word, a sentence, etc.) that produces the observation sequence.

A discrete-observation HMM, say M, is defined by a set of I states, J observation symbols, and three probabilistic matrices:

$$M = \{\Pi, A, B\}. \qquad (E.1)$$

Let the individual states be denoted $S = 1, 2, \ldots, I$, and the observation symbols $Z = \{z_1, z_2, \ldots, z_J\}$. Then the probabilistic matrix $\Pi$ contains the initial state probabilities:

$$\Pi = \{\pi_i\}, \qquad \pi_i = P(\text{state } i \text{ at } t = 1), \qquad (E.2)$$

where the variable t denotes discrete time. A is an $I \times I$ matrix containing the Markovian state transition probabilities:

$$A = \{a_{ij}\}, \qquad a_{ij} = P(\text{state } j \text{ at } t + 1 \mid \text{state } i \text{ at } t), \qquad (E.3)$$

and B is an $I \times J$ matrix containing the observation symbol probability distributions:

$$B = \{b_j(k)\}, \qquad b_j(k) = P(z_k \text{ at } t \mid \text{state } j \text{ at } t). \qquad (E.4)$$

Note that $a_{ij}$ and $b_j(k)$ satisfy the following properties:

$$a_{ij} \ge 0, \quad b_j(k) \ge 0, \qquad \forall\, i, j, k \qquad (E.5)$$

$$\sum_{j=1}^{I} a_{ij} = 1, \qquad \forall\, i \qquad (E.6)$$

$$\sum_{k=1}^{J} b_j(k) = 1, \qquad \forall\, j. \qquad (E.7)$$
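To make the notation concrete (our illustration, not part of the original), a discrete-observation HMM can be stored as three NumPy arrays whose entries must satisfy the stochastic constraints (E.5)-(E.7). The sketch below builds a small two-state, three-symbol model and checks those constraints; the function name and example values are ours.

```python
import numpy as np

def make_hmm(pi, A, B):
    """Bundle and validate a discrete HMM M = {Pi, A, B}.

    pi : (I,)   initial state probabilities
    A  : (I, I) state transition probabilities, rows sum to 1
    B  : (I, J) observation symbol probabilities, rows sum to 1
    """
    pi, A, B = np.asarray(pi, float), np.asarray(A, float), np.asarray(B, float)
    assert np.all(pi >= 0) and np.all(A >= 0) and np.all(B >= 0)   # (E.5)
    assert np.isclose(pi.sum(), 1.0)
    assert np.allclose(A.sum(axis=1), 1.0)                          # (E.6)
    assert np.allclose(B.sum(axis=1), 1.0)                          # (E.7)
    return pi, A, B


# A two-state, three-symbol example with left-to-right-like transitions.
pi, A, B = make_hmm(
    pi=[1.0, 0.0],
    A=[[0.7, 0.3],
       [0.0, 1.0]],
    B=[[0.6, 0.3, 0.1],
       [0.1, 0.2, 0.7]],
)
print(A.shape, B.shape)   # (2, 2) (2, 3)
```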
In the HMM, the state sequence is hidden but the output sequence can be directly observed. A set of sample model topologies is shown in Figure E.35. The HMM topology of Figure E.35(a) is called an ergodic topology because any state can be reached from any other state. However, this topology is inappropriate for speech recognition because speech consists of an ordered sequence of sounds. Figures E.35(b) through E.35(d) are all left-to-right models; the left-to-right restriction forces an ordering on the state sequence in which the path evaluation must either stay in the same state or go to a higher-numbered state. Figure E.35(b) is the general left-to-right topology. Figure E.35(c), a Bakis model [9] (stay, move, or skip one), and Figure E.35(d), a linear (stay or move) model, are commonly used for speech recognition. Although we have dichotomized HMMs into ergodic and left-to-right models, there are still many possible variations and combinations [8], [9], [21].

Figure E.35: Some examples of four-state HMM topologies: (a) ergodic, (b) general left-to-right, (c) Bakis, and (d) linear models.

Let $Y = y_1, y_2, \ldots, y_T$ be the discrete observation sequence of a training utterance for a model M. Our goal is to choose the model parameters such that $P(Y \mid M)$ is locally maximized. Let $\alpha_t(i)$ be the forward variable and $\beta_t(i)$ be the backward variable, defined as:

$$\alpha_t(i) = P(y_1, y_2, \ldots, y_t \text{ and } s_t = i \mid M) \qquad (E.8)$$

$$\beta_t(i) = P(y_{t+1}, y_{t+2}, \ldots, y_T \mid s_t = i, M) \qquad (E.9)$$

where $s_t$ denotes the state at time t. Define $\xi_t(i, j)$ as the probability of being in state i at time t and state j at time $t + 1$, given the model M and the observation sequence Y. The forward-backward algorithm [24], which can be used to adjust the model parameters $(A, B, \Pi)$ and maximize the probability of the observation sequence given the model, is sketched below:

Step 1 Initialize all the elements in $\Pi$, A, and B with reasonable random numbers satisfying the basic properties discussed above.

Step 2 For each training datum, implement Steps 3 to 5.

Step 3 Calculate the forward and backward variables inductively as follows:

$$\alpha_1(i) = \pi_i\, b_i(y_1), \qquad 1 \le i \le I$$

$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{I} \alpha_t(i)\, a_{ij}\Big] b_j(y_{t+1}), \qquad 1 \le t \le T - 1, \ \ 1 \le j \le I$$

$$\beta_T(i) = 1, \qquad 1 \le i \le I$$

$$\beta_t(i) = \sum_{j=1}^{I} a_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j), \qquad t = T - 1, T - 2, \ldots, 1, \ \ 1 \le i \le I.$$

Step 4 Calculate $\xi_t(i, j)$ and $\gamma_t(i)$:

$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{I} \sum_{j=1}^{I} \alpha_t(i)\, a_{ij}\, b_j(y_{t+1})\, \beta_{t+1}(j)}$$

$$\gamma_t(i) = \sum_{j=1}^{I} \xi_t(i, j).$$

Then, a set of reestimation formulas for $\Pi$, A, and B is:

$$\bar{\pi}_i = \gamma_1(i)$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

$$\bar{b}_j(k) = \frac{\sum_{t:\, y_t = z_k} \gamma_t(j)}{\sum_{t} \gamma_t(j)}.$$

Step 5 Set $\pi_i \leftarrow \bar{\pi}_i$, $a_{ij} \leftarrow \bar{a}_{ij}$, and $b_j(k) \leftarrow \bar{b}_j(k)$, $\forall\, i, j, k$.

Step 6 If some convergence criterion is reached, then stop. Otherwise, go to Step 2.
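To make Steps 1-6 concrete, the following NumPy sketch (ours, offered only as an illustration of the reestimation equations above; variable names are our own) runs the forward and backward recursions and the reestimation pass for a single observation sequence, using a fixed number of iterations in place of an explicit convergence test. It omits the scaling normally needed for long utterances.

```python
import numpy as np

def baum_welch(Y, pi, A, B, n_iter=20):
    """Forward-backward (Baum-Welch) reestimation for one discrete sequence.

    Y  : observation sequence as integer symbol indices, length T
    pi : (I,) initial state probabilities
    A  : (I, I) transition probabilities
    B  : (I, J) symbol probabilities
    Note: no scaling is applied, so this sketch suits short sequences only.
    """
    pi, A, B = (np.array(x, dtype=float) for x in (pi, A, B))
    I, T = len(pi), len(Y)

    for _ in range(n_iter):
        # Step 3: forward and backward recursions.
        alpha = np.zeros((T, I))
        beta = np.zeros((T, I))
        alpha[0] = pi * B[:, Y[0]]
        for t in range(T - 1):
            alpha[t + 1] = (alpha[t] @ A) * B[:, Y[t + 1]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])

        # Step 4: xi_t(i, j) and gamma_t(i).
        xi = np.zeros((T - 1, I, I))
        for t in range(T - 1):
            num = alpha[t][:, None] * A * (B[:, Y[t + 1]] * beta[t + 1])[None, :]
            xi[t] = num / num.sum()
        gamma = xi.sum(axis=2)                      # shape (T-1, I)

        # Reestimation formulas, followed by Step 5 (parameter update).
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]
        B_new = np.zeros_like(B)
        for k in range(B.shape[1]):
            mask = (np.array(Y[:T - 1]) == k)
            B_new[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
        B = B_new

    return pi, A, B


# Usage with a small two-state, three-symbol model.
Y = [0, 0, 1, 2, 2]
pi0 = [0.9, 0.1]
A0 = [[0.7, 0.3], [0.2, 0.8]]
B0 = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi1, A1, B1 = baum_welch(Y, pi0, A0, B0, n_iter=10)
print(np.round(A1, 3))
```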
Bibliography

[1] J.A. Bondy and U.S.R. Murty, Graph Theory with Applications, New York: American Elsevier Publishing, 1976.

[2] F. Harary, Graph Theory, Reading, Massachusetts: Addison-Wesley, 1969.

[3] S. Even, Graph Algorithms, Potomac, Maryland: Computer Science Press, 1979.

[4] A. Gibbons, Algorithmic Graph Theory, New York: Cambridge University Press, 1985.

[5] B.T. Lowerre and R. Reddy, "The HARPY speech understanding system," in W.A. Lea (ed.), Trends in Speech Recognition, pp. 340-360, Englewood Cliffs, New Jersey: Prentice-Hall, 1980.

[6] A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimal decoding algorithm," IEEE Trans. on Information Theory, vol. IT-13, pp. 260-269, Apr. 1967.

[7] F. Jelinek, "Fast sequential decoding algorithm using a stack," IBM J. of Research and Development, vol. 13, pp. 675-685, Nov. 1969.

[8] J.R. Deller, Jr., J.G. Proakis, and J.H.L. Hansen, Discrete Time Processing of Speech Signals, New York: Macmillan, 1993.

[9] J. Picone, "Continuous speech recognition using hidden Markov models," IEEE ASSP Magazine, pp. 26-41, July 1990.

[10] D.B. Paul, "Speech recognition using hidden Markov models," Lincoln Laboratory J., vol. 3, no. 1, pp. 41-61, Spring 1990.

[11] L.R. Bahl, F. Jelinek, and R.L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-5, no. 2, pp. 179-190, Mar. 1983.

[12] J.R. Deller, Jr. and R.K. Snider, "Reducing redundant computation in HMM evaluation," IEEE Trans. on Speech and Audio Processing, June 1993.

[13] C.C. Venkatesh, J.R. Deller, Jr., and C.C. Chiu, "A graph partitioning approach to signal decoding," manuscript.

[14] L.R. Bahl and F. Jelinek, "Decoding for channels with insertions, deletions and substitutions with applications to speech recognition," IEEE Trans. on Information Theory, vol. IT-21, no. 4, pp. 404-411, July 1975.

[15] S.E. Levinson, "Structural methods in automatic speech recognition," Proc. of the IEEE, vol. 73, no. 11, 1985.

[16] C.C. Chiu, Algorithms for Signal Decoding Using Graph Partitioning (M.S. thesis), Dept. of Electrical Engineering, Michigan State University, East Lansing, 1991.

[17] R.J. Lipton and R.E. Tarjan, "A separator theorem for planar graphs," SIAM J. of Computing, vol. 36, no. 2, pp. 177-189, Apr. 1979.

[18] H.N. Djidjev, "On the problem of partitioning planar graphs," SIAM J. of Algebraic and Discrete Methods, vol. 3, no. 2, pp. 229-240, June 1982.

[19] G.L. Miller, "Finding small simple cycle separators for 2-connected planar graphs," Proc. 16th Annual ACM Symp. on Theory of Computing, pp. 376-382, Apr. 1984.

[20] C.C. Chiu, C.C. Venkatesh, A.-H. Esfahanian, and J.R. Deller, Jr., "An algorithm for planar graph subpartitioning," submitted to J. of Combinatorial Mathematics and Combinatorial Computing.

[21] K.F. Lee, Automatic Speech Recognition: the Development of the SPHINX System, Boston: Kluwer Academic Publishers, 1989.

[22] D.A. Plaisted, "A heuristic algorithm for small separators in arbitrary graphs," SIAM J. of Discrete Mathematics, vol. 19, no. 2, pp. 267-280, Apr. 1990.

[23] S. Rao, "Finding near optimal separators in planar graphs," Proc. 28th Annual IEEE Symp. on Foundations of Computer Science, pp. 225-237, 1987.

[24] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. of the IEEE, vol. 77, no. 2, 1989.

[25] C.-H. Lee and L.R. Rabiner, "A frame-synchronous network search algorithm for connected word recognition," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1649-1658, 1989.

[26] R. Schwartz and Y.L. Chow, "The N-best algorithm: an efficient and exact procedure for finding the N most likely sentence hypotheses," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 81-84, 1990.

[27] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 28, pp. 357-366, 1980.

[28] J.A. McHugh, Algorithmic Graph Theory, Englewood Cliffs, New Jersey: Prentice-Hall, 1990.

[29] "Getting started with the DARPA TIMIT CD-ROM: An acoustic-phonetic continuous speech database," National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, 1988.

[30] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, The Design and Analysis of Computer Algorithms, Reading, Massachusetts: Addison-Wesley, 1974.

[31] A.V. Aho and J.D. Ullman, The Theory of Parsing, Translation, and Compiling, Volume I: Parsing, Englewood Cliffs, New Jersey: Prentice-Hall, 1972.

[32] E.B. Messinger, L.A. Rowe, and R.R. Henry, "A divide-and-conquer algorithm for the automatic layout of large directed graphs," IEEE Trans. on Systems, Man, and Cybernetics, vol. 21, no. 1, pp. 1-12, Jan./Feb. 1991.

[33] M.R. Fellows, "Transversals of vertex partitions in graphs," SIAM J. of Discrete Mathematics, vol. 3, no. 2, pp. 206-215, May 1990.

[34] E.R. Barnes, A. Vannelli, and J.Q. Walker, "A new heuristic for partitioning the nodes of a graph," SIAM J. of Discrete Mathematics, vol. 1, no. 3, pp. 299-305, Aug. 1988.
[35] T. Ozawa, "The principal partition of vertex-weighted graphs and its applications," Discrete Algorithms and Complexity, pp. 5-33, 1987.

[36] J.P. Hutchinson and G.L. Miller, "On deleting vertices to make a graph of positive genus planar," Discrete Algorithms and Complexity, pp. 81-98, 1987.

[37] H. Gazit and G.L. Miller, "A parallel algorithm for finding a separator in planar graphs," Proc. 28th Annual IEEE Symp. on Foundations of Computer Science, pp. 238-248, 1987.

[38] J.R. Gilbert, D.J. Rose, and A. Edenbrandt, "A separator theorem for chordal graphs," SIAM J. of Algebraic and Discrete Methods, vol. 5, no. 3, pp. 306-313, Sep. 1984.

[39] R.E. Tarjan, "Space-efficient implementations of graph search methods," ACM Trans. on Mathematical Software, vol. 9, no. 3, pp. 326-339, Sep. 1983.

[40] E.R. Barnes, "An algorithm for partitioning the nodes of a graph," SIAM J. of Algebraic and Discrete Methods, vol. 3, no. 4, pp. 541-550, Dec. 1982.

[41] N. Deo, G.M. Prabhu, and M.S. Krishnamoorthy, "Algorithms for generating fundamental cycles in a graph," ACM Trans. on Mathematical Software, vol. 8, no. 1, pp. 26-42, Mar. 1982.

[42] H.N. Djidjev, "A linear algorithm for partitioning graphs," Comptes rendus de l'Academie Bulgare des Sciences, vol. 35, pp. 1053-1056, 1982.

[43] R.J. Lipton and R.E. Tarjan, "Applications of a planar separator theorem," SIAM J. of Computing, vol. 9, no. 3, pp. 615-627, Aug. 1980.

[44] J. Hopcroft and R.E. Tarjan, "Efficient planarity testing," J. of the Association for Computing Machinery, vol. 21, no. 4, pp. 549-568, Oct. 1974.

[45] J.R. Gilbert, J.P. Hutchinson, and R.E. Tarjan, "A separator theorem for graphs of bounded genus," J. of Algorithms, vol. 5, pp. 391-407, 1984.

[46] R.C. Read, "A new method for drawing a planar graph given the cyclic order of the edges at each vertex," Research Report CORR 86-14, University of Waterloo, July 1986.

[47] R.J. Lipton and R.E. Tarjan, "Applications of a planar separator theorem," Proc. 18th Annual IEEE Symp. on Foundations of Computer Science, pp. 162-170, 1977.

[48] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, Cambridge, Massachusetts: MIT Press, 1991.

[49] G. Brassard and P. Bratley, Algorithms: Theory and Practice, Englewood Cliffs, New Jersey: Prentice-Hall, 1988.

[50] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, The Design and Analysis of Computer Algorithms, Reading, Massachusetts: Addison-Wesley, 1974.

[51] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, New York: W.H. Freeman, 1979.

[52] R. Sedgewick, Algorithms, Reading, Massachusetts: Addison-Wesley, 1988.

[53] F.S. Roberts, Applied Combinatorics, Englewood Cliffs, New Jersey: Prentice-Hall, 1984.

[54] G. Casella and R.L. Berger, Statistical Inference, Belmont, California: Wadsworth, 1990.

[55] P.E. Pfeiffer, Concepts of Probability Theory, New York: Dover Publications, 1978.

[56] A. Papoulis, Probability, Random Variables, and Stochastic Processes, New York: McGraw-Hill, 1984.

[57] R. Schwartz and Y.L. Chow, "The N-best algorithm: an efficient and exact procedure for finding the N most likely sentence hypotheses," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 81-84, 1990.

[58] D.B. Paul, "Algorithms for an optimal A* search and linearizing the search in the stack decoder," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 693-696, 1991.
[59] L.R. Bahl, P.S. Gopalakrishnan, D. Kanevsky, and D. Nahamoo, "Matrix fast match: a fast method for identifying a short list of candidate words for decoding," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 345-348, 1989.

[60] L.R. Rabiner and S.E. Levinson, "A speaker-independent, syntax-directed, connected word recognition system based on hidden Markov models and level building," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 3, pp. 561-573, 1985.

[61] H. Ney, "The use of a one-stage dynamic programming algorithm for connected word recognition," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 2, pp. 263-271, 1984.

[62] F. Jelinek, L. Bahl, and R.L. Mercer, "Design of a linguistic statistical decoder for the recognition of continuous speech," IEEE Trans. on Information Theory, vol. IT-21, no. 3, pp. 250-256, 1975.

[63] C.C. Tappert, N.R. Dixon, and A.S. Rabinowitz, "Application of sequential decoding for converting phonetic to graph representation in automatic recognition of continuous speech (ARCS)," IEEE Trans. on Audio and Electroacoustics, vol. AU-21, pp. 225-228, 1973.

[64] L.R. Bahl, P. de Souza, P.S. Gopalakrishnan, D. Kanevsky, and D. Nahamoo, "Constructing groups of acoustically confusable words," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 85-88, 1990.

[65] K.F. Lee and H.W. Hon, "Speaker-independent phone recognition using hidden Markov models," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641-1648, 1989.

[66] P.S. Cohen and R.L. Mercer, "The phonological component of an automatic speech-recognition system," IEEE Symp. on Speech Recognition, pp. 177-187, 1974.

[67] J. Makhoul, S. Roucos, and H. Gish, "Vector quantization in speech coding," Proc. of the IEEE, vol. 73, no. 11, pp. 1551-1587, 1985.

[68] R.M. Fano, "A heuristic discussion of probabilistic decoding," IEEE Trans. on Information Theory, pp. 64-74, 1963.

[69] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Englewood Cliffs, New Jersey: Prentice-Hall, 1978.

[70] J.D. Markel and A.H. Gray, Jr., Linear Prediction of Speech, Berlin, Heidelberg, New York: Springer-Verlag, 1976.

[71] E.J. Yannakoudakis and P.J. Hutton, Speech Synthesis and Recognition Systems, England: Ellis Horwood Limited, 1987.

[72] J.S. Bridle, M.D. Brown, and R.M. Chamberlain, "An algorithm for connected word recognition," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 899-902, 1982.

[73] H. Sakoe, "Two-level DP-matching - a dynamic programming-based pattern matching algorithm for connected word recognition," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 27, no. 6, pp. 588-595, 1979.

[74] A.N. Ince, Digital Speech Processing: Speech Coding, Synthesis and Recognition, Boston: Kluwer Academic Publishers, 1992.

[75] S. Furui, Digital Speech Processing, Synthesis, and Recognition, New York: Marcel Dekker, Inc., 1989.

[76] D. O'Shaughnessy, Speech Communication: Human and Machine, New York: Addison-Wesley, 1988.

[77] Y. Linde, A. Buzo, and R.M. Gray, "An algorithm for vector quantizer design," IEEE Trans. on Communications, vol. COM-28, no. 1, pp. 84-95, Jan. 1980.

[78] J.G. Wilpon, L.R. Rabiner, C.-H. Lee, and E.R. Goldman, "Automatic recognition of keywords in unconstrained speech using hidden Markov models," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 38, no. 11, pp. 1870-1878, 1990.

[79] J.G. Wilpon, C.-H. Lee, L.R. Rabiner,
and E.R. Goldman, "Application of hidden Markov models for recognition of a limited set of words in unconstrained speech," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 254-257, 1989.

[80] J.G. Wilpon, L.G. Miller, and P. Modi, "Improvements and applications for key word recognition using hidden Markov modeling techniques," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 309-312, 1990.

[81] L.R. Bahl, R. Bakis, P.S. Cohen, A.G. Cole, F. Jelinek, B.L. Lewis, and R.L. Mercer, "Recognition results with several experimental acoustic processors," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 249-251, 1979.