This is to certify that the dissertation entitled AUTOMATED EVALUATION OF LANGUAGE MODELS BASED ON RECEIVER-OPERATOR-CHARACTERISTIC ANALYSIS, presented by Yin-Pin Yang, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering. Major professor. LIBRARY, Michigan State University.
AUTOMATED EVALUATION OF LANGUAGE MODELS BASED ON RECEIVER-OPERATOR-CHARACTERISTIC ANALYSIS

By Yin-Pin Yang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Department of Electrical Engineering, 1996

ABSTRACT

The problem of designing and evaluating a language model is formulated as an optimization task parameterized by the trade-off between coverage and over-generation with respect to the task language. These parameters parallel the detection and false-alarm rates in signal detection theory. Currently successful language models, such as n-gram models and context-free grammars, provide no systematic approach to design and evaluation other than direct experimentation. Further, current model designs do not take into account the effects of the quality of the acoustic models in the system. A language model that is optimized to maximize recognition may not, in fact, optimally represent the grammar of the task language. In this research, the recognition performance of a language model is statistically characterized by specific mathematical quantities that can be optimized and evaluated similarly to receiver-operator-characteristic (ROC) analysis used in detection theory. The design process accounts for the quality of the acoustic models in the system and presents a graphical ROC-like display of language model performance without the need for extensive experiments with the recognition system.
Specifically, analogously to ROC analysis, the performance of a speech recognizer in terms of recognition rate and computational cost can be illustrated in a two-dimensional plane of coverage (of the task language) and over-generation (of out-of-task-language sentences), resulting in a performance contour plot. A given language model is parameterized by coverage and over-generation and is represented by a point in the ROC plane. While adjusting certain design parameters of the language model (e.g., amount of training, weights of sublanguages, etc.), the designer can readily observe the motion of the point and can "drive" the model in the direction of better performance. X-window-based software called the ROC-LM design tool is presented in this paper. The tool is quite general and can be used to design and evaluate all commonly used language models. Example case studies are presented in the paper.

Copyright by YIN-PIN YANG 1996

To my parents, Jong-Hsiang Yang and Re-Chu Chang, and my wife, Wen-Yu Hsu.

Acknowledgments

First and foremost, I would like to thank my thesis advisor, Professor John R. Deller, Jr., for his endless encouragement, patient guidance and generous support throughout my Ph.D. pursuit at Michigan State University. Without his help, this work could not possibly have been finished. I would like to thank all the members of my Ph.D. guidance committee, Professor Marvin Siegel, Professor Majid Nayeri (Department of Electrical Engineering), and Professor Jonathan I. Hall (Department of Mathematics), for their time and effort in discussing and reviewing my thesis. I would like to express my gratitude to my parents for all their support. I also owe a great deal to the friends I met here at Michigan State University. Most of all, I wish to thank Professor James Gallagher (Department of Education), who is a very good friend of my father, for his kind and sincere words whenever I needed help.
A very special word of thanks to my wife, Wen-Yu Hsu, for all her love, patience, encouragement and understanding over the past three and a half years. We are expecting our first baby, who is due in August. We wish the best for a new life.

Contents

List of Figures

1 Introduction and Background
 1.1 The Significance of Using ROC Analysis in Language Model Design
 1.2 Literature Review and Contributions of This Research
 1.3 Descriptive View of Language Modeling

2 Problem Formulation
 2.1 Formal View of the Recognition and Language Modeling Problems
 2.2 Recognition Problem
 2.3 LD Problem
 2.4 Integration of the LD Problem and Focus on the Decision Processes
 2.5 ROC Analysis of the "Language Detector"

3 Development of the ROC-LM Design
 3.1 Channel Probabilities in the ROC Analysis of the LM Performance
 3.2 Recognition Rate
 3.3 Representing Recognition Rate by Coverage and Over-Generation
 3.4 Modeling the Probability Density Functions
 3.5 Issue of Sentence Length
 3.6 The Weight, w, and Relative Motion of LMs
 3.7 The Scheme of the ROC-LM Design

4 The ROC-LM Design Tool
 4.1 Overview
 4.2 The ROC Display Window, ROC-WINDOW
 4.3 The ESM-RR Block
 4.4 The EVL-LM Block
 4.5 The Control Panel Block

5 Implementation and Design Cases
 5.1 The Acoustic Environments
  5.1.1 Acoustic Settings
  5.1.2 TI-46 and TIMIT Database
 5.2 Introducing the TLs
 5.3 LM Design Cases
  5.3.1 Case I: To Determine the Required Number for Training N-Gram for an Alphabet Recognizer
  5.3.2 Case II: A Hybrid Model of Bigram and Trigram and Concerns About Computational Cost
  5.3.3 Case III: Adaptive Language Modeling

6 Conclusions and Future Work
 6.1 Summary and Concluding Remarks
 6.2 Contributions
 6.3 Future Work

Appendix A: The Case of Continuous Speech Recognition

Bibliography

List of Figures

2.1 Formalizing the recognition problem and the embedded "LD problem" into a conventional signal detection framework.
2.2 An illustration of formalizing the LD problem into a conventional signal detection framework.
2.3 A conceptual illustration of how the LD model can be treated as a grammar observer. First, a practical evaluation methodology can be thought of as hypothesizing all the possible word strings (sentences) first, then performing the acoustic and grammatical evaluations.
2.4 The block diagram of the operation of the decision rule δ_R.
3.5 Illustration of the four channel probabilities: detection rate, false-alarm, miss rate and correct rejection.
3.6 Illustration of the trade-off between coverage and over-generation. More sentences are covered by the LM, but at the cost of increased perplexity of the LM. Consequently, the recognition rate is decreased and the computation cost is increased.
3.7 (a) An example of how f(x_a | d = 0) is constructed, when L = 4 and k = 2. (b) An example of how f(x_{a,0} | d = 3) is constructed, when L = 4, k_0 = 2 and k_1 = 0.
3.8 An illustration of a changing contour plot with LMs unchanged. The shift of a contour plot will make no difference to any of the LMs on the plane.
However, the rotation will result in different increments or decrements which depend on the location.
4.9 A diagram of the relationships among the four blocks: the ROC display window, the Estimation of Recognition Rate, the Evaluation of LM, and the Control Panel and Designer Interface.
4.10 A diagram of the data structure for storing word concatenation information from training context.
4.11 The layout of the ROC-LM Design Tool.
5.12 By clicking on the button ACOU PDF, the two Gaussian p.d.f.'s of acoustic level evaluations are shown on the ROC-WINDOW.
5.13 The bigram performance when the slider PRM #1, representing the number of sentences in training context, is moved from 100, 190, through 550 with PRM #2, the random seed, as 177. A trajectory marked by circles is formed. The random seed is then changed to 1,997, and PRM #1 moves backward from 550 to 100. Another trajectory marked "X" is made.
5.14 The trigram performance when the slider PRM #1, representing the number of sentences in training context, is moved from 100, 190, through 1,000 with PRM #2, the random seed, as 177. A trajectory marked by circles is formed. The random seed is then changed to 1,997, and PRM #1 moves backward from 1,000 to 100. Another trajectory marked "X" is made.
5.15 The comparison of the bigram performance and the trigram.
5.16 After initializations and rescaling, the location of BiTri-Gram and the contour plot is shown. Notice that the newly estimated G-set = {5.61, 1.44, 1.70, 1.71} is directed to its corresponding field on the ESM-RR block.
5.17 When the new contour plot is drawn, the old LM (marked "X") is transformed to a relative position with respect to the new contour plot.
5.18 Move the threshold t_g from 16.0, 13.2, 10.4, 7.6 through 6.2. BiTri-Gram moves toward the criterion line and crosses it at t_g = 7.6.
However, the price for the computational improvement is the decrement of recognition rate from 86.2% to 83.4%.
5.19 PRM #2, representing the size of TrainSet II, is set to 0 at first. Then PRM #1, which is the size of TrainSet I, is moved from 1000, 1900, 2800, 3700, 4600 through 5500. PRM #2 is then changed to 400, and the above is repeated. Another trajectory appears on the left side of the previous one. On this trajectory, when the size of TrainSet I is 3700, the recognition rate reaches its maximum.

Chapter 1

Introduction and Background

1.1 The Significance of Using ROC Analysis in Language Model Design

The objective of this research was to develop a quantified strategy for adjusting parameters in a language model (LM), in order to optimize a general performance function in a speech recognizer. The performance function consists of the recognition rate and the computational cost incurred in using the task language. A novel approach, the ROC-LM design scheme, emulating receiver-operating-characteristic (ROC) analysis [1], [2], is presented to achieve this objective. The main idea is that a LM is fundamentally quantified by two parameters, "coverage" and "over-generation," paralleling hit-rate and false-alarm rate in ROC analysis. In detection problems, ROCs depict detection rate versus false-alarm rate for various values of the signal-to-noise ratio. Analogously to ROC analysis, the performance of a LM in terms of recognition rate and computational cost may be expressed in a plot of coverage versus over-generation. In this study, a speech recognition system is modeled as a detection system with an N-interval forced choice (NIFC) and two-observer (acoustic and grammar observer) task.
Within this framework, given a LM, the recognition rate of a speech-recognition system can be calculated from the statistical model of acoustic evaluation, the perplexity of the task language, and a quantified characterization of the LM, instead of by direct, time-consuming recognition experiments. As a subsystem of the recognizer model, the LM performance can be characterized in terms of a classical binary-signal-detection-in-noise problem. It is this subsystem that is the principal focus of the work. It should be emphasized that acoustic information is incorporated in the LM design. Using the binary-detection analogy, the performances of various LMs can be analytically compared according to their coverage and over-generation. In addition, given a strategy for creating a LM, and given that the coverage and over-generation can be adjusted by changing certain parameters in this model, a trajectory can be drawn on the chart by varying one of these parameters. The designer can thereby determine appropriate values for the parameters.

1.2 Literature Review and Contributions of This Research

Currently successful approaches to language modeling can be divided into the following three main categories: context-free (type 2 in the Chomsky hierarchy [4]) [5]-[10], N-gram (Markov model, regular or finite-state, type 3) [11]-[14] and automatic-learning finite state (ALFS, also type 3) [15]-[19]. With prior syntactic knowledge being applied through the production rules, the context-free grammar provides better performance for word strings that obey the grammar rules. However, the drawback is that the grammar is so rigid (inflexible) that it often involves complex interactive parsing methods. Meanwhile, an N-gram grammar is relatively adaptable, but it suffers from the mismatch of the modeled language and the natural language, and often over-generates meaningless hypothetical sentences, resulting in wasted computation time.
Finally, the ALFS grammar is aimed at retaining adaptability and simultaneously avoiding the mismatch problem. However, the current ALFS algorithm is only successful on recognition tasks with low perplexity [18]. Another important category is adaptive language modeling [20], [22], [23], in which LMs tend to "adapt" to particular sublanguages. When a large heterogeneous language source, which is composed of smaller, more homogeneous "sublanguages," such as newspaper articles, is used for training, an adaptive LM is capable of focusing on certain sets of vocabularies and grammar rules which are related to particular subjects. An adaptive LM is thus powerful because it uses the immediate history of context so that the LM hypothesizes sentences more precisely. This research is concerned with three issues:

1. Current speech recognition research focuses only on the performance of an individual LM rather than the overall performance of a speech recognizer. The accuracy of acoustical evaluation and the properties of the task language are not involved in the training processes of a language model. In this study, the overall speech recognition behavior is mathematically modeled. Therefore, given acoustical evaluation and task language information, the criterion optimized by the LM is the recognition rate rather than entropy [20], which is commonly used in language modeling problems.

2. The trade-off between "flexibility" of a LM and recognition performance has been noted by many speech researchers [18]. However, there has not been a mathematical formulation of the phenomenon. In this work, an arbitrary LM is parameterized by two values, coverage and over-generation¹, that quantify the trade-off. The LM is then detached from the overall recognition system and communicates with the remaining part, mainly acoustic evaluations, through the two parameters.
Given requirements of a recognition task, such as recognition rate and computational cost, the required coverage and over-generation are specified. The LM is then designed to achieve these specifications. Consequently, the language modeling problem is formalized as a design problem.

3. In recent years, there has been considerable interest in the use of adaptive LMs [16], [20]. The proposed ROC design scheme can be used to quantitatively adjust the "adaptability" to given sublanguages. Weights can be assigned to each of the sublanguages to indicate their importance. Thus, different requirements of recognition correctness and computational cost can be specified by an optimization function that varies with application. Consequently, a LM can be trained to be adaptive as required by the performance function.

¹Coverage and over-generation will be defined in later chapters.

1.3 Descriptive View of Language Modeling

Roughly speaking, the syntactic grammar (or syntax) of a speech recognizer is a set of rules by which words are properly combined to form a sentence [3]. The language associated with that syntactic grammar is the set of all possible sentences which are produced by the grammar. The term "language model" is often used to indicate the software system that generates the language (sentences hypothesized as the spoken one) in accordance with the syntactic rules. In a top-down search algorithm, the LM is used to hypothesize a search space consisting of all valid sentences according to the syntax. Each of the sentences in the search space corresponds to a string of word models. In an isolated-word recognizer, by hypothesizing the acoustic word boundaries, each word model can then be evaluated against the corresponding segment of the acoustic symbol string (see [3]) of the utterance.
An evaluation algorithm then produces a likelihood for each word model string, and the word string with the highest total likelihood is selected as the recognized sentence. The main purposes of the LM in a speech recognizer are:

1. To narrow the search space to reduce computational cost. For a recognition system with vocabulary size N_V and an uttered sentence with l_max words in length, if no grammar is involved, the size of the LM search space is N_L = N_V^{l_max}. However, if a grammar with perplexity² P is employed, the size of the search space is reduced to P^{l_max}. P is normally smaller than N_V, so P^{l_max} is consequently much smaller than N_V^{l_max}, and the computation is reduced dramatically. For example, with N_V = 26 and l_max = 5, the unconstrained space contains 26^5 ≈ 11.9 million word strings, while a grammar with perplexity P = 8 reduces this to 8^5 = 32,768.

2. To improve recognition performance. The recognition of a sentence benefits from the knowledge of allowable word concatenations. For instance, in a speech recognizer based on an N-gram statistical model, the sentences "This is a book" and "This eats a book" may both be hypothesized by the model. However, since the frequency of the concatenation between "this" and "is" is higher than that between "this" and "eats," the former sentence is more probable.

²Speech researchers use perplexity as a measurement of the size of a recognition task. Roughly speaking, P is the average "branch factor" at any decision point in the language graph. A precise definition can be found in [3].

In the next chapter, LMs are formalized from the viewpoint of the detection task.

Chapter 2

Problem Formulation

In this chapter, we set forth the basic formalisms and notation to be used throughout the dissertation.

2.1 Formal View of the Recognition and Language Modeling Problems

A speech recognition system is to be designed. Let the vocabulary, V, be an exhaustive set of N_V = |V| words that potentially arise in the recognition task. For convenience, let one element of V be a "null" word.
Assuming that the maximum sentence length considered in the task is l_max < ∞, there are on the order of N_L = N_V^{l_max} strings of words (or sentences) that can be formed from the set V. Let us denote these sentences by L = {s_1, s_2, ..., s_{N_L}}. Further, for each s_i, there will be a (finite) number of acoustic representations ("pronunciations") of that sentence. Let us denote the entire set of pronunciations of sentences (over all sentences) by A = {a_1, a_2, ..., a_{N_A}}. This set also needs to be associated with the acoustic strings which, in general, may be shared by numerous sentences. From a set-theoretic point of view, the pronunciations and the acoustic strings are in one-to-one correspondence. We begin by structuring the recognition problem and the embedded "language detection problem" into a conventional signal detection framework. The discussion can be best understood with reference to Figs. 2.1 and 2.2.

2.2 Recognition Problem

The underlying probability space is (Ω, 2^Ω, P), where P reflects the "true" distribution of pairs of sentences and acoustic strings in the task. The sample points of Ω are ω_ij, and

P(ω_ij) = P(s_i, a_j) = probability of sentence s_i "pronounced as" acoustic string a_j in the task language. (2.1)

Formally, when a sentence s_i is spoken by the user of the device using "pronunciation" a_j, the point ω_ij = (s_i, a_j) is selected from Ω with probability P(ω_ij). The point ω_ij is then analyzed by the "transmitter" using a source classifier:

c_R(ω_ij) = s_i. (2.2)

The underscore below c_R indicates that this quantity is a random variable³. We have

c_R : (Ω, 2^Ω) → (L, 2^L). (2.3)

The space (L, 2^L) is also the classification space (or "parameter" space) for the detection problem. At the "receiver" (recognizer), the observation is made:

a(ω_ij) = a_j. (2.4)

The observation is a random variable mapping,

a : (Ω, 2^Ω) → (A, 2^A). (2.5)

The acoustic space, therefore, plays the role of the observation, or measurement, space in this development.
Based on the value of a, the recognition decision is made:

δ_R(a_j) = argmax_k { ln Q(s_k) Q(a_j | s_k) } = ŝ, (2.6)

where Q(·) is defined below. δ_R is then a random variable mapping,

δ_R : (Ω, 2^Ω) → (L, 2^L). (2.7)

³Strictly speaking, c_R, δ_R, a, and similar mappings in this development should have domains corresponding to the integer indices of the sentences and acoustic strings. It is notationally useful, however, to allow this small conceptual abuse.

Figure 2.1: Formalizing the recognition problem and the embedded "LD problem" into a conventional signal detection framework.

Let us denote by l the decision statistic used in the decision rule δ_R,

l(a_j, s_k) = − (ln Q(a_j | s_k)) / |a_j| − (ln Q(s_k)) / |s_k|. (2.8)

For future purposes, we note that l can be decomposed into a negative log likelihood (NLKHD), l_a, associated with the acoustic model information, and a negative log probability, l_g, associated with the language model⁴:

l(a_j, s_k) = l_a(a_j | s_k) + l_g(s_k) = ( − ln Q(a_j | s_k) / |a_j| ) + ( − ln Q(s_k) / |s_k| ). (2.9)

Where appropriate, the statistics l, l_a and l_g may also be modeled as random variables that depend on the probability structure of the problem. Next, the "evaluator" will compute rates of proper operation and penalties associated with errors. The cost function φ simply determines the correctness of the outcome:

φ : L × L → B, (2.10)

where

φ(s_i, s_k) = 1, if i = k; 0, if i ≠ k. (2.11)

The recognition rate for the system is given by

p_R = Eφ. (2.12)

⁴The subscript g denotes "grammar."

The rate, or further cost criteria, may be used to adjust the parameters of the recognition models. The measure Q may be viewed as a "noisy probability measure" over Ω, learned by training of the system with finite data. It represents an attempt to estimate the "true" (task) distribution P.
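The decision rule δ_R and the decomposition l = l_a + l_g can be made concrete with a small sketch. This is not the dissertation's software: the two-sentence hypothesis set, the Q values, and the lengths |a_j| and |s_k| below are invented for illustration, and δ_R is implemented as a minimum over the length-normalized negative log statistic.

```python
import math

# Toy "noisy" trained grammar probabilities Q(s) over a two-sentence model language
Q_s = {"this is a book": 0.6, "this eats a book": 0.4}

# Toy acoustic likelihoods Q(a|s) for one observed acoustic string a_j
Q_a_given_s = {"this is a book": 0.30, "this eats a book": 0.05}

def nll_statistic(s, n_words, n_symbols):
    """Length-normalized decision statistic l(a_j, s_k) = l_a + l_g."""
    l_a = -math.log(Q_a_given_s[s]) / n_symbols  # acoustic NLKHD term
    l_g = -math.log(Q_s[s]) / n_words            # language-model term
    return l_a + l_g

# delta_R picks the hypothesis with the smallest total NLL, which is the
# same sentence that maximizes ln Q(s_k) Q(a_j|s_k) for these lengths
scores = {s: nll_statistic(s, n_words=4, n_symbols=10) for s in Q_s}
recognized = min(scores, key=scores.get)
print(recognized)  # "this is a book": better in both l_a and l_g
```

With these numbers the first sentence wins on both the acoustic and the grammar term; in general the two terms can disagree, which is exactly the interaction between acoustic quality and LM quality that the ROC-LM analysis quantifies.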
In the language space L, there are two critical sets associated with the distributions P and Q. The task language, L_T ⊆ L, is defined as the set

L_T = { s_i | P(s_i) ≠ 0; s_i ∈ L }, (2.13)

whereas the model language, L_M ⊆ L, is defined as the set

L_M = { s_i | Q(s_i) ≠ 0; s_i ∈ L }. (2.14)

Therefore P(L_T) = 1 and Q(L_M) = 1. For the convenience of further discussions, we also define

L_MT ≝ L_M ∩ L_T, (2.15)

L̄_T ≝ L_T − L_MT, (2.16)

L̄_M ≝ L_M − L_MT. (2.17)

The recognition problem may be viewed as an "N_L-ary" detection problem. For the hypothesis structure,

U_i : c_R = s_i, i ∈ [1, N_L] (i.e., s_i is uttered). (2.18)

For the decision structure:

R_k : δ_R = s_k, k ∈ [1, N_L] (i.e., δ_R responds with s_k). (2.19)

2.3 LD Problem

The purpose of the "language detection" (LD) problem is to characterize the expediency of the hypothesizing of a LM. The LD is not actually performed during the recognition phase. Once a LM is modeled by an evaluation statistic, the same set of hypothesis sentences with their grammar likelihoods is produced for each recognition trial. In the LD problem, the initial probability space can be taken as L as generated in the problem above. The sets L_T and L_M are also subsets of L as created above. A testing sentence s_i, which is considered as a sample point from L with probability P(s_i), can be selected from either outside or inside L_T to evaluate the LM. Note that in the LD problem, sentences outside of L_T, i.e., those with P(s_i) = 0, may be "selected" for attempted detection. However, only those sentences in the set L̄_M play any significant role in this regard. In terms of detection theory, the testing sentence s_i is then analyzed by the "transmitter" (speaker) using a source classifier:

c_L(s_i) = 1_T(s_i) = 1, if s_i ∈ L_T; 0, if s_i ∉ L_T, (2.20)

Figure 2.2: An illustration of formalizing the LD problem into a conventional signal detection framework.
where 1_T is the indicator variable for the event L_T. We have

c_L : (L, 2^L) → (B, 2^B), where B is the set of binary numbers, {0, 1}. (2.21)

The space (B, 2^B) is the classification space (parameter space) for the detection problem. At the "receiver" (detector), the observation is the sample sentence itself:

a_L(s_i) = s_i. (2.22)

The decision is then made by:

δ_L(s_i) = 1_M(s_i) = 1, if s_i ∈ L_M; 0, if s_i ∉ L_M, (2.23)

where 1_M is the indicator variable for the event L_M. The decision random variable is

δ_L : (L, 2^L) → (B, 2^B), (2.24)

where (L, 2^L) and (B, 2^B) are as defined above. It is observed that (L, 2^L) plays the role of both the classification and decision spaces in the recognition problem, while (B, 2^B) plays these roles in the LD problem. The decision processes are further described below. The cost (reward) function is the product

φ_L(c_L, δ_L) = c_L × δ_L, (2.25)

and the expectation of φ_L is exactly the LD rate, that is,

p_L = Eφ_L. (2.26)

Note that for the LD problem, we have inherently specified the following hypothesis structure (letting s_i ∈ L be a testing sentence in a given trial):

H1 : c_L = 1 (i.e., s_i ∈ L_T), (2.27)
H0 : c_L = 0 (i.e., s_i ∉ L_T), (2.28)
D1 : δ_L = 1 (i.e., s_i ∈ L_M), (2.29)
D0 : δ_L = 0 (i.e., s_i ∉ L_M). (2.30)
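The LD mappings c_L and δ_L, the reward φ_L, and the LD rate p_L can be illustrated with toy sets. The sentences below are invented, and a uniform testing distribution over L_T is assumed purely for this sketch (the formulation allows any P supported on L_T):

```python
# Toy task language L_T and model language L_M (invented for illustration)
L_T = {"a b", "a c", "b c"}      # sentences with P(s) != 0
L_M = {"a b", "a c", "c c"}      # sentences with Q(s) != 0

def c_L(s):
    """Source classifier: indicator of L_T."""
    return 1 if s in L_T else 0

def delta_L(s):
    """LD decision rule: indicator of L_M."""
    return 1 if s in L_M else 0

def phi_L(s):
    """Cost (reward) function: product of truth and decision."""
    return c_L(s) * delta_L(s)

# p_L = E[phi_L], here under a uniform testing distribution over L_T
p_L = sum(phi_L(s) for s in L_T) / len(L_T)
print(p_L)  # two of the three task sentences are hypothesized by the LM
```

Here p_L = 2/3: the LM covers two of the three task sentences, while "c c", which lies in L_M but outside L_T, contributes over-generation rather than LD rate.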
2.1, without any specific operation related to the language detection problem. However, the LD problem may be formally integrated into the overall speech system in the following . way: Suppose that the speech recognizer is capable of hypothesizing a_ll s.- 6 L, not just there in LM. Each 3,- is subjected to a LD test “_6L(s,-)” and is then passed on to 63 (with corresponding acoustic observation a,) only if it passes the language test (6,13,) = I). This amounts using the LD scheme in Fig. 2.2 without any probability structure, prior to performing recognition in Fig. 2.1. The LD test serves as a formal binary “blocking mechanism” that prevents untrained sentences from receiving consideration. The integrated mechanism is illustrated in Fig. 2.3. Once a sentence passes the LD test, it is subjected to evaluate and integrate the acoustic and language information in light of observation a,- and the structure embodied in the LM and acoustic models. This process is illustrated in Fig. 2.4. 18 All Possible Word Strings Prune <—<§l>Gl-ammar T Decrsnon Grammatical Evaluation 1 Acoustic Evaluation Figure 2.3: A conceptual illustration of how the LD model can be treated as a gram- mar observer. First, a practical evaluation methodology can be thought as hypothe- sizing all the possible word strings (sentences) first then performing the acoustic and grammatical evaluations. In evaluating the set of total likelihood measures over LM for a given a,-, two assumptions are made in this work: Assumption 2.1 In the present research, the stack is assumed large enough so that stack size does not affect recognition results. In other words, the failure of a correct sentence to be recognized can only be due to the poor acoustic evaluation or the pruning by the LM. The theories developed in this research do not consider the effects of stack size. 
For a recognition system with a high-perplexity LM and poor acoustic evaluations, it is not appropriate to assume the recognition results are independent of stack size.

Assumption 2.2 Word boundaries of the uttered acoustic symbol string are known, so that the acoustic symbol string of a sentence can be correctly divided into l segments, where l is the length of the uttered sentence.

Therefore, the theories created in the following can only be applied in present form to an isolated-word recognition (IWR) system or a continuous-speech recognition (CSR) system with a powerful word-boundary detector.

Figure 2.4: The block diagram of the operation of the decision rule δ_R.

The two assumptions above restrict the immediate applicability of this work to cases of unlimited stack size and known word boundaries (USS-KB). This study can be extended by noting that the pruning due to an insufficient stack and the performance of segmentations can be mathematically modeled given certain corresponding statistics. This study emphasizes the USS-KB case to prove the validity and function of the proposed methodologies. Once this fundamental goal is achieved, these more realistic cases will be considered in future work.

2.5 ROC Analysis of the "Language Detector"

The LD problem is formulated as a convenient, quantified means for assessing the quality of a LM. Currently, a LM can only be assessed using experimentation and simulation.
Further, the prevailing (and rational) approach is to attempt to make p_L as large as possible by sufficient training (make L_M and the TL match as closely as possible). While this approach does not always maximize p_R, there has been no systematic approach to do otherwise. Two factors affect the relationship between p_R and p_L. They are N_M and the accuracy of the likelihood evaluation l(a_j, s_k), s_k ∈ L_M. p_L is usually improved with increases in N_M. In this case, if the probability of l(a_j, s_i) obtaining the best score among all the evaluations in L_M is relatively sensitive to N_M, which implies a bad acoustic evaluation performance, p_R may not be improved by improving p_L. On the other hand, if the likelihood evaluation is extremely powerful and totally insensitive to N_M, p_R can be improved by improving p_L. In the case that N_M is fixed (unchanged with respect to some adjustable parameters in the LM), p_R is improved when p_L is improved. Therefore, the design of a LM requires the integration of both sources of acoustic and language information. Usually, the acoustic space is fixed to designers when considering a speech recognition task. Hence, this research emphasizes the relationships among p_R, p_L and N_M. These are precisely the factors related in the ROC analysis of the LD problem. In the remaining chapters, we will focus on the LD problem as a means to assess the effects of the LM on recognition and to change the recognizer accordingly. The significance of analyzing the recognition problem via LD and ROC analysis is the following:

1. The recognition rate can be directly determined for various parametric permutations of the LM rather than obtained from simulations.

2. LMs are parameterized by two measurable values, "coverage" and "over-generation." Given a performance function, the performance of each coverage-over-generation pair is evaluated and a contour plot is drawn on the ROC-LM design chart.
Once the chart is obtained, the designer can easily compare and design LMs. There are two applications of the proposed ROC design scheme:

1. For comparison purposes: For example, for the same vocabulary size, a trigram model is known to have lower perplexity than a bigram model. However, the bigram model includes more grammatically correct sentences than the trigram (better coverage). The ROC-LM design chart can quantitatively show the trade-off and determine which model is better given a recognition requirement.

2. For design purposes: Suppose there are several designable parameters in a LM; the ROC-LM design chart can be used to adjust these parameters. For instance, a LM which successfully combines trigram and bigram [20] needs weights that optimize the task requirements. On the ROC-LM design chart, the direction toward better performance is clearly shown, so that the designer can readily make adjustments.

Chapter 3 Development of the ROC-LM Design Tool

In this chapter, the ROC-LM design tool is developed by starting from a simplified task in which

1. Recognition rate is the only measurement of recognition performance;

2. All sentences in the TL have the same length in words.

The ROC-LM design tool can be divided into two parts. Part one is the recognition environment, which is concerned with the performance of the acoustic evaluation and the properties of the TL. These are the factors that are assumed to be fixed once the LM design/analysis begins. Part two is concerned with the LM. These two aspects are actually interactive, communicating with each other through two parameters, coverage and over-generation (defined below), which are the two axes of the ROC-LM plane.

In the ROC-LM design, LMs are characterized by coverage and over-generation. One point on the ROC plane, which represents a (coverage, over-generation) pair, corresponds to a LM.
The design starts with measuring (or parameterizing) the LM individually by coverage and over-generation. Next, performance measures, such as recognition rates and computational cost, are estimated according to the given acoustic environments for each of the grid points on the ROC plane. Contour plots parameterized by the performance measures can then be drawn on the ROC plane. On the other hand, a LM measured by coverage and over-generation is located on the contour plot. The designer can easily observe the "motion" of the model when some of its parameters are modified, and "move" the LM to approach the design goal.

In the following, all recognition behaviors, for example, the estimation of recognition rates, the measurement of LMs and the performance of the acoustic-level evaluations, are mathematically modeled. The ROC-LM design scheme is then described at the end of the chapter.

3.1 Channel Probabilities in the ROC Analysis of the LM Performance

Here we elaborate on the LD problem described in Chapter 2 and define some future notation and vocabulary. In terms of classical binary detection theory and the formulation of Ch. 2, we can define the following hypothesis structure and "channel transition" probabilities as related to the LD problem.

1. Correct language detection rate (hit rate), related to "coverage":
$$P(D_1|H_1) = P(\hat{c}_L = 1 \mid c_L = 1) \quad (3.1)$$

2. Language false-alarm rate, related to "over-generation":
$$P(D_1|H_0) = P(\hat{c}_L = 1 \mid c_L = 0) \quad (3.2)$$

3. Miss rate:
$$P(D_0|H_1) = P(\hat{c}_L = 0 \mid c_L = 1) \quad (3.3)$$

4. Correct rejection rate:
$$P(D_0|H_0) = P(\hat{c}_L = 0 \mid c_L = 0) \quad (3.4)$$

Figure 3.5: Illustration of the four channel probabilities: detection rate, false alarm, miss rate and correct rejection.

Figure 3.5 illustrates the set relationships among these four probabilities in the language space $L$. The idea of using ROC analysis in LM design originates from an interesting trade-off phenomenon.
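As a minimal sketch, the four channel probabilities above can be estimated from paired labels $(c_L, \hat{c}_L)$; the counts below are purely illustrative, not measurements from the dissertation's experiments.

```python
# Estimate the four channel probabilities of Eqs. (3.1)-(3.4) from paired
# labels: c_L = 1 if the sentence is truly in the TL, c_hat = 1 if the LM
# accepts it. The sample counts are purely illustrative.
def channel_rates(pairs):
    """pairs: list of (c_L, c_hat) tuples with values 0 or 1."""
    def rate(decision, hypothesis):
        given = [(c, d) for c, d in pairs if c == hypothesis]
        return sum(1 for c, d in given if d == decision) / len(given)
    return {
        "detection":         rate(1, 1),  # P(D1|H1), Eq. (3.1)
        "false_alarm":       rate(1, 0),  # P(D1|H0), Eq. (3.2)
        "miss":              rate(0, 1),  # P(D0|H1), Eq. (3.3)
        "correct_rejection": rate(0, 0),  # P(D0|H0), Eq. (3.4)
    }

pairs = [(1, 1)] * 90 + [(1, 0)] * 10 + [(0, 1)] * 20 + [(0, 0)] * 80
r = channel_rates(pairs)
print(r["detection"], r["false_alarm"])  # 0.9 and 0.2
```

Note that detection and miss rates sum to one, as do false-alarm and correct-rejection rates, since each pair conditions on the same hypothesis.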
A LM needs to be "flexible" ($L_M$ sufficiently large) so that the complete complement of sentences can be hypothesized during recognition. However, if the model is so flexible ($N_M \gg N_T$) that many sentences outside of $L_T$ are hypothesized, the computation cost is increased and the recognition rate is decreased.

Figure 3.6: Illustration of the trade-off between coverage and over-generation. More sentences are covered by $L_M'$, but at the cost of increased perplexity of the LM. Consequently, the recognition rate is decreased and the computation cost is increased.

Figure 3.6 illustrates the trade-off. The TL, $L_T$, is modeled by the LM, $L_M$. $L_M$ usually does not match $L_T$ ideally. In order to increase the percentage of $L_T$ covered, another LM, $L_M'$, is generated by a looser grammar that is derived from $L_M$ by adjusting certain parameters. Now more sentences in $L_T$ are covered by $L_M'$, but usually at the price that the over-generation of $L_M'$ is also increased. The increase of coverage and over-generation implies a likely increase of the perplexity of $L_M'$. Consequently, the correct sentence (if hypothesized) has less chance to be selected and recognized in the larger LM. The recognition rate is thus likely degraded and the computational cost increased.

Based on the trade-off described above, the recognition correctness is affected by the LM through the following two factors. One is whether the correct sentence can be hypothesized ("covered") by the model; the other is how many extraneous word strings (not in $L_T$) are hypothesized, so that the recognizer makes extra effort evaluating them. Meanwhile, the more hypothesized sentences there are, the more unlikely it is that the correct sentence obtains the highest likelihood, thus reducing the recognition rate. More quantified justifications will be addressed in the next sections.
Therefore, the performance of a LM is mathematically described by two parameters, "coverage" and "over-generation." The coverage, $C_M$, of the LM is defined as the probability that any sentence, say $s_i$, generated by $L_T$ ($s_i \in L_T$), that is, legally spoken by the user of the device, is detected as a grammatical sentence ($s_i \in L_M$), that is, hypothesized as a recognition candidate by the LM. In terms of the formalism of Chapter 2, we have

$$C_M \stackrel{\text{def}}{=} P(D_1|H_1) = P(\hat{c}_L = 1 \mid c_L = 1). \quad (3.5)$$

The over-generation, $O_M$, of the LM, however, is not directly defined as the false alarm of the LD problem shown in Eq. (3.2). Because of the way in which the LD problem is structured, $P(c_L = 0) = P(H_0) = 0$, so that $P(D_1|H_0)$ is not defined. We therefore define $O_M$ to be an "anti-miss rate" of $L_M$ versus $L_T$, in which the given condition is any sentence randomly chosen from $L_M$:

$$O_M \stackrel{\text{def}}{=} P(c_L = 0 \mid \hat{c}_L = 1) = P(H_0 \mid D_1). \quad (3.6)$$

In the later sections, we will find that the over-generation $O_M$ concerns only the total number of sentences in $L_M$, and is simply a normalized measurement of the size of $L_M$.

3.2 Recognition Rate

Figure 2.1 illustrates the mechanism by which a speech recognition task produces recognition rates, represented by the notations and definitions above. First, when a speaker utters a sentence $s_i$, one of the samples in $L_T$, $s_i$ is sent to the measurement space and the classification space in parallel. For the former, since $s_i$ is unknown to the speech recognition system, the measurement space evaluates all hypotheses in the sample space $L_M$, resulting in a vector of evaluation results. The decision space then selects a most-likely sentence, say $s_j$, from the result vector. On the other branch, the classification space sends the index of the uttered sentence, the correct answer, say $s_i$, to the cost function block. The cost function then compares the two indexes and responds with the recognition correctness, and the probability of correctness represents the recognition rate. From Eq.
(2.12), the recognition rate $p_R$ is the expectation of $\underline{q}$. As mentioned previously, if $s_i \notin L_{MT}$, it is impossible for the recognizer to respond with $s_i$, so that $P(\hat{s}_R = s_i \mid s_R = s_i) = 0$ and $\underline{q} = 0$. That is,

$$p_R = E\underline{q} = P(L_{MT})E(\underline{q}|L_{MT}) + P(\bar{L}_{MT})E(\underline{q}|\bar{L}_{MT}) \quad (3.7)$$
$$= P(L_{MT})E(\underline{q}|L_{MT}) \quad (3.8)$$
$$= P(L_{MT}) \sum_{s_i \in L_{MT}} P(\hat{s}_R = s_i \mid s_R = s_i)\, P(s_R = s_i \mid L_{MT}). \quad (3.9)$$

Now let $A_i$ be the set of acoustic strings corresponding to sentence $s_i$:

$$A_i \stackrel{\text{def}}{=} \{a_j \mid P(a_j, s_i) > 0\}. \quad (3.10)$$

Suppose by experimentation with the acoustic models we have determined that the statistic

$$\underline{x}_i = \frac{-1}{|a_j| \cdot |s_i|} \ln Q(a_j \mid s_i, a_j \in A_i)\, Q(s_i), \quad (3.11)$$

where $|a_j|$ and $|s_i|$ denote the length of $a_j$ in "symbols" and the length of $s_i$ in words, respectively, is distributed approximately according to the probability density function (p.d.f.)

$$\underline{x}_i \sim f_i(x), \quad x \geq 0, \quad s_i \in L_{MT}. \quad (3.12)$$

Further, experiments have shown that, for most task languages, the random variables $\underline{x}_i$, $i = 1, 2, \cdots, N_{MT}$, are approximately identically distributed. In this research, we assume the above is true and also assume that the $\underline{x}_i$ are independent. Hence, any one of the i.i.d. $\underline{x}_i$ may be characterized by the single distribution $f(x)$.

Similarly, we define $\bar{z}_i$ to represent an "incorrect" evaluation. Let

$$\bar{A}_i \stackrel{\text{def}}{=} \{a_j \mid P(a_j, s_i) = 0\}. \quad (3.13)$$

Again, in this research, we assume that the statistic

$$\bar{z}_i = \frac{-1}{|a_j| \cdot |s_i|} \ln Q(a_j \mid s_i, a_j \in \bar{A}_i)\, Q(s_i), \quad s_i \in L_M, \quad (3.14)$$

is distributed approximately identically for all $i$, and we also assume that the $\bar{z}_i$ are independent. Let the distribution $\bar{f}(z)$ characterize the i.i.d. $\bar{z}_i$'s. In these terms, the recognition rate $p_R$ of (2.12) can be written as

$$p_R = P(L_{MT})\, P(\underline{x} < \bar{z}_k,\ \forall s_k \in L_M) \quad (3.15)$$
$$= P(L_{MT}) \int_0^\infty f(x) \left[ \int_x^\infty \bar{f}(z)\,dz \right]^{N_M - 1} dx, \quad (3.16)$$

thus circumventing the need to know the prior probabilities $P(s_R = s_i \mid L_{MT})$.

Now consider the LD rate $p_L$. Let $s_i$ also serve as a testing sentence for the LD problem. From Eq.
(2.26), the LD rate $p_L$ can be expressed as follows, noting that since $s_i$ is always chosen from $L_T$, $\underline{q}$ is always 1:

$$p_L = E\underline{q}_L = E(\underline{q}\,\hat{c}_L) = E\hat{c}_L \quad (3.17)$$
$$= P(\hat{c}_L = 1) = P(L_{MT}). \quad (3.18)$$

Therefore, we further have

$$p_R = p_L \int_0^\infty f(x) \left[ \int_x^\infty \bar{f}(z)\,dz \right]^{N_M - 1} dx. \quad (3.19)$$

The equation derived above is the recognition rate for a "binary-decision" LM, because all the p.d.f.'s $\bar{f}(z)$ are universal in $L_M$. In this case, the LM is incapable of distinguishing the event $s_k \in L_{M\bar{T}}$ from the event $s_k \in L_{MT}$ by its grammar-likelihood measurement $\underline{x}_g$. According to Fig. 3.5, $\bar{f}(z)$ is then split into two distinguishable p.d.f.'s, $\bar{f}(z|L_{MT})$ and $\bar{f}(z|L_{M\bar{T}})$, letting the sizes of $L_{M\bar{T}}$ and $L_{MT}$ be $N_{M\bar{T}}$ and $N_{MT}$, respectively. By modifying Eq. (3.19), the recognition rate for a general LM can be written as

$$p_R = p_L \int_0^\infty f(x) \left[ \int_x^\infty \bar{f}(z|L_{M\bar{T}})\,dz \right]^{N_{M\bar{T}}} \left[ \int_x^\infty \bar{f}(z|L_{MT})\,dz \right]^{N_{MT} - 1} dx. \quad (3.20)$$

The recognition rates derived above, representing the sentence correct rate in which all of the words in the uttered sentence are correctly recognized, are not the only figure of merit for a speech recognition system. For instance, other measurements such as word correct rate, insertion rate, deletion rate, and replacement rate are frequently used in the evaluation of many speech recognition systems [3]. Although only the sentence correct rate is formulated in this research, it would not be difficult to derive the other measurements by a similar methodology. For instance, the one-word replacement rate is the probability of choosing sentences which have a one-word difference from the uttered sentence.

3.3 Representing Recognition Rate by Coverage and Over-Generation

To represent $p_R$ by $C_M$ and $O_M$, first, by comparing Eqs. (3.5) and (3.18), we have

$$p_L = P(\hat{c}_L = 1 \mid c_L = 1) = C_M. \quad (3.21)$$

To characterize $O_M$, the following assumption is made:

Assumption 3.3 The distributions $P$ and $Q$ over $L$ are sufficiently uniform that the size of a subset $L'$, $L' \subseteq L_M$ (or $L' \subseteq L_T$), in $L_M$ (or $L_T$) can be approximated by

$$|L'| = Q(L')\,N_M \quad (3.22)$$
$$\text{(or } |L'| = P(L')\,N_T\text{)}. \quad (3.23)$$

Based on this assumption, $O_M$ can be interpreted as the percentage of sentences in $L_M$ belonging to $L_{M\bar{T}}$, and $C_M$ as the percentage of sentences in $L_T$ belonging to $L_{MT}$. Then we have

$$N_{MT} = (1 - O_M) \cdot N_M = C_M \cdot N_T; \quad (3.24)$$
$$N_M = \frac{C_M N_T}{1 - O_M}; \quad (3.25)$$
$$O_M = 1 - \frac{C_M N_T}{N_M}. \quad (3.26)$$

Similarly,

$$N_{M\bar{T}} = N_M - C_M \cdot N_T = \frac{C_M N_T}{1 - O_M} - C_M N_T = \frac{C_M O_M}{1 - O_M}\, N_T. \quad (3.27)$$

Therefore, $N_{MT}$ and $N_{M\bar{T}}$ in Eq. (3.20) can be expressed by $O_M$, $C_M$ and $N_T$ as follows:

$$p_R = C_M \int_0^\infty f(x) \left[ \int_x^\infty \bar{f}(z|L_{M\bar{T}})\,dz \right]^{\frac{C_M O_M}{1 - O_M} N_T} \left[ \int_x^\infty \bar{f}(z|L_{MT})\,dz \right]^{C_M N_T - 1} dx. \quad (3.28)$$

Two important issues concerning $O_M$ and $N_M$ are to be addressed.

1. $O_M$ can be viewed as a better measurement of the size of the LM, since it is normalized by $N_M$. By Eq. (3.20), $p_R$ depends on $C_M$ and $N_M$ in terms of LM performance, whereas by Eq. (3.28), $p_R$ depends on $C_M$, $O_M$ and $N_T$. The transformation between $O_M$ and $N_M$ is expressed by Eqs. (3.26) and (3.25). The following example shows the convenience of employing $O_M$ instead of $N_M$ to characterize the LM performance. Suppose there are two LMs, A and B. Both have the same coverage, 0.9, but the size of A is $10^6$ and that of B is $10^4$. The designers have no clue how badly the two LMs over-generate extra sentences unless the size of the TL is specified. Now suppose the size of the $L_T$ which A is intended to model is $10^5$ and that of B is $10^2$. By Eq. (3.26),

for A: $O_M = 1 - \dfrac{0.9 \times 10^5}{10^6} = 0.91$;

for B: $O_M = 1 - \dfrac{0.9 \times 10^2}{10^4} = 0.991$.

Thus, we conclude that 91% of A and 99.1% of B is over-generated. In this case, A is a better LM than B in terms of $C_M$ and $O_M$.

2. Practically, for most speech recognition tasks, $N_M$ is easier to estimate (as will be discussed later) than $O_M$. The recognition rate $p_R$ is thus directly calculated from $N_M$. Therefore, even if Assumption 3.3 fails, only the measurement $O_M$ is affected. The estimation of $p_R$ is still correct as long as the estimation of $N_M$ is accurate.
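The worked example above can be reproduced with a one-line helper for Eq. (3.26):

```python
# Eq. (3.26): O_M = 1 - C_M * N_T / N_M, reproducing the comparison of
# models A and B from the text.
def over_generation(C_M, N_T, N_M):
    """Over-generation of a LM of size N_M with coverage C_M on a TL of size N_T."""
    return 1.0 - C_M * N_T / N_M

O_A = over_generation(0.9, 10**5, 10**6)   # model A
O_B = over_generation(0.9, 10**2, 10**4)   # model B
print(O_A, O_B)  # 0.91 and 0.991, matching the text
```

The same helper makes the dependence on the TL size explicit: without $N_T$, the raw sizes $10^6$ and $10^4$ say nothing about how badly each model over-generates.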
3.4 Modeling the Probability Density Functions

General Formulations

In the binary detection problem, the p.d.f.'s of the observations under hypotheses $H_0$ and $H_1$ are prior knowledge in predicting the detection rate. In this research, the $N_L$-ary detection problem contains two observations, the acoustic evaluation and the grammatical evaluation. Similar to Eq. (2.9), the total likelihoods $\underline{x}$ and $\bar{x}$ can be decomposed as

$$\underline{x} = \underline{x}_a + w\,\underline{x}_g, \quad (3.29)$$
$$\bar{x} = \bar{x}_a + w\,\bar{x}_g, \quad (3.30)$$

where $w$ denotes the weight⁵ on the grammar likelihood.

⁵In most speech recognition systems, these likelihoods are linearly combined to produce a total likelihood.

The p.d.f.'s of $\underline{x}$ and $\bar{x}$ can be obtained from the p.d.f.'s of $\underline{x}_a$, $\bar{x}_a$, $\underline{x}_g$ and $\bar{x}_g$. Since the acoustic-level and grammar-level evaluations are usually assumed to be independent [3], the conditional p.d.f.'s are given by the convolutions

$$f(x|L_{MT}) = f(x = x_a + wx_g \mid L_{MT}) \quad (3.31)$$
$$= \int_0^\infty f_a(x - wx_g \mid L_{MT})\, f_g(x_g \mid L_{MT})\,dx_g; \quad (3.32)$$
$$\bar{f}(x|L_{MT}) = \bar{f}(x = x_a + wx_g \mid L_{MT}) \quad (3.33)$$
$$= \int_0^\infty \bar{f}_a(x - wx_g \mid L_{MT})\, f_g(x_g \mid L_{MT})\,dx_g; \quad (3.34)$$
$$\bar{f}(x|L_{M\bar{T}}) = \bar{f}(x = x_a + wx_g \mid L_{M\bar{T}}) \quad (3.35)$$
$$= \int_0^\infty \bar{f}_a(x - wx_g \mid L_{M\bar{T}})\, f_g(x_g \mid L_{M\bar{T}})\,dx_g. \quad (3.36)$$

The p.d.f.'s of $\underline{x}_a$, $\underline{x}_g$, $\bar{x}_a$ and $\bar{x}_g$ are estimated experimentally by employing predefined density functions to best fit the histograms of the evaluation results. In this research, Gaussian density functions are used to model the p.d.f.'s of the resulting histograms. Because of the independence assumption, the p.d.f. of $\underline{x}$ (or $\bar{x}$) is easily obtained by summing the means and taking the square root of the sum of the squared variances. Using Gaussian p.d.f.'s, the conditional p.d.f.'s of $\underline{x}_a$ and $\underline{x}_g$ at both the grammar level and the acoustic level under the different hypotheses are discussed in the following. Then, the conditional p.d.f.'s of the total likelihoods $\underline{x}$ and $\bar{x}$ in Eqs. (3.32), (3.34) and (3.36) are constructed.

Grammar Level

According to Eqs. (3.5) and (3.6), $C_M$ and $O_M$ are defined as two conditional probabilities. $C_M$ is based on the hypothesis structure, $H_1$ and $H_0$ (see Eqs. (2.27) and (2.28)), while $O_M$ is based on the decision structure, $D_1$ and $D_0$ (see Eqs. (2.29) and (2.30)).
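The Gaussian combination rule just described (means add; standard deviations combine as the square root of the sum of squared, weight-scaled variances) can be sketched as follows; all parameter values are illustrative.

```python
import math

# For independent Gaussian acoustic and grammar likelihoods, the total
# likelihood x = x_a + w * x_g is Gaussian with mean mu_a + w * mu_g and
# variance sigma_a^2 + (w * sigma_g)^2. Parameter values are illustrative.
def combine(mu_a, sigma_a, mu_g, sigma_g, w):
    mu = mu_a + w * mu_g
    sigma = math.sqrt(sigma_a ** 2 + (w * sigma_g) ** 2)
    return mu, sigma

mu, sigma = combine(mu_a=2.0, sigma_a=0.5, mu_g=1.0, sigma_g=0.4, w=0.5)
print(mu, sigma)  # 2.5 and sqrt(0.25 + 0.04)
```

This is the rule applied componentwise later in the chapter, when each Gaussian component of the acoustic mixture receives the grammar bias.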
Therefore, to characterize a LM by the coverage and over-generation, two sets of testing sentences are required. Let $T_T$ denote a set of grammatical sentences randomly chosen from the TL, and $T_M$ denote a sentence set randomly generated by the LM.

To obtain $C_M$, $T_T$ is fed into the LM to be evaluated. Suppose there are $N_{\text{pruned}}$ sentences in $T_T$ which cannot be generated by the LM, or do not have a finite likelihood; then $C_M$ is estimated as

$$C_M = 1 - \frac{N_{\text{pruned}}}{|T_T|}. \quad (3.37)$$

The rest of the sentences, which obtain finite likelihoods, are used to construct a histogram of likelihoods. The $f(x_g)$ is then obtained by employing a Gaussian function to best fit this histogram. Then we have

$$f(x_g|L_{MT}) = g(x_g; \mu_{g,1}, \sigma_{g,1}), \quad (3.38)$$

where $g(\,\cdot\,; \mu, \sigma)$ denotes a Gaussian p.d.f. with mean $\mu$ and standard deviation $\sigma$.

To obtain $O_M$, a sentence set generated by the LM could be fed into an observer which can detect whether the input sentence is grammatical in the task-language sense. $O_M$ could then be estimated as the percentage of these sentences that are not in the TL. However, this method is unrealistic in speech recognition problems, since usually the true grammar rules are unknown or not even existent. Therefore, an alternative method to obtain $O_M$ and $f(x_g|L_{M\bar{T}})$ is proposed in the following.

According to Eq. (3.26), $O_M$ can be expressed by $C_M$, $N_T$ and $N_M$. The $C_M$ can be obtained from the equations above, and $N_T$ is supposed to be known to the designers. The estimation of $O_M$ thus turns into the estimation of $N_M$. For a small LM, $N_M$ can be obtained by simply counting all possible generated sentences. For a large LM, $N_M$ can be estimated by averaging the logarithmic generating probabilities of the sentences in $T_M$, a randomly chosen subset of $L_M$, $T_M \subset L_M$. Let $G(s_i)$ represent the generating probability of a sentence $s_i$ and $|T_M|$ the size of $T_M$. $N_M$ is estimated as

$$N_M \approx \exp\!\left( -\frac{\sum_{s_i \in T_M} \ln G(s_i)}{|T_M|} \right). \quad (3.39)$$

Using Eq. (3.22),

$$f(x_g) = Q(L_{M\bar{T}})\, f(x_g|L_{M\bar{T}})$$
$$+\ Q(L_{MT})\, f(x_g|L_{MT}) \quad (3.40)$$
$$= O_M\, f(x_g|L_{M\bar{T}}) + (1 - O_M)\, f(x_g|L_{MT}). \quad (3.41)$$

Consequently,

$$f(x_g|L_{M\bar{T}}) = \frac{f(x_g) - (1 - O_M)\, f(x_g|L_{MT})}{O_M}. \quad (3.42)$$

Similarly, $f(x_g|L_{M\bar{T}})$ is modeled by a Gaussian p.d.f.,

$$f(x_g|L_{M\bar{T}}) = g(x_g; \mu_{g,0}, \sigma_{g,0}). \quad (3.43)$$

Acoustic Level

It should be noted that the acoustic symbol strings are supposed to be correctly segmented and evaluated by a word model, usually implemented by an HMM. As a consequence, the resulting likelihoods are associated with one "word" instead of one "sentence." Let us work on the word model first.

Using ideas similar to the sentence case, let $A_i'$⁶ be the set of acoustic strings corresponding to a word⁷ $v_i$, $v_i \in V$:

$$A_i' \stackrel{\text{def}}{=} \{a_j \mid P(a_j, v_i) > 0\}. \quad (3.44)$$

Let the random variable $\underline{x}_{a,1}'$ denote the statistic of the acoustic evaluations obtained by experiments:

$$\underline{x}_{a,1}' = \frac{-1}{|a_j|} \ln P(a_j \mid v_i, a_j \in A_i'). \quad (3.45)$$

Similarly, we have

$$\bar{x}_{a,0}' = \frac{-1}{|a_j|} \ln P(a_j \mid v_i, a_j \notin A_i'). \quad (3.46)$$

⁶The "prime" stands for using a "word" as one signal in this chapter.
⁷Note that "a word" is a sentence with length one. The underlying space $(\Omega, 2^\Omega, P)$ is still applicable.

From experiments, $\underline{x}_{a,1}'$ and $\bar{x}_{a,0}'$ can be modeled by a mixed function of a delta and a Gaussian p.d.f.:

$$f(x_{a,0}') = (1 - P_0')\,\delta(x - \epsilon) + P_0'\, g(x; \mu_{a,0}', \sigma_{a,0}'); \quad (3.47)$$
$$f(x_{a,1}') = (1 - P_1')\,\delta(x - \epsilon) + P_1'\, g(x; \mu_{a,1}', \sigma_{a,1}'). \quad (3.48)$$

The reason for using $\delta(\cdot)$ is that, when evaluating an HMM, sometimes the HMM cannot find any valid path for the input symbol string. Consequently, the probability of this string being generated by this HMM is zero, and its NLKHD is $\infty$. Practically, a penalty $\epsilon$ rather than $\infty$ is assigned to $x_{a,1}'$ (or $x_{a,0}'$) in this case. From the detection theory point of view, $P_0'$ represents the probability that an acoustic symbol string evaluated by a wrong word model is mistakenly classified (false alarm), while $P_1'$ represents the probability that an acoustic symbol string evaluated by a correct word model is successfully classified (detection rate).
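The mixed delta-plus-Gaussian model of Eqs. (3.47) and (3.48) can be sketched by sampling; the penalty $\epsilon$, detection rate $P_1'$ and Gaussian parameters below are illustrative rather than measured values.

```python
import random

# Sampling sketch of Eq. (3.48): with probability 1 - P1 the HMM finds no
# valid path and the NLKHD is the penalty EPS; otherwise it is Gaussian.
# All numbers are illustrative.
EPS = 10.0            # penalty assigned in place of an infinite NLKHD
P1 = 0.95             # detection rate of the correct word model
MU1, SIG1 = 1.0, 0.3  # Gaussian part of the correct-word NLKHD

def sample_correct_word_nlkhd(rng):
    if rng.random() > P1:   # no valid HMM path: assign the penalty
        return EPS
    return rng.gauss(MU1, SIG1)

rng = random.Random(0)
draws = [sample_correct_word_nlkhd(rng) for _ in range(50_000)]
frac_penalty = sum(d == EPS for d in draws) / len(draws)
print(round(frac_penalty, 3))  # close to 1 - P1 = 0.05
```

The empirical fraction of penalty outcomes recovers the delta weight $1 - P_1'$, which is exactly the quantity the histogram-fitting step estimates in practice.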
However, in the current formulation, the signal of interest is a "sentence" rather than a "word." In order to construct $f(x_a)$, the p.d.f. of the acoustic likelihoods for a sentence, $L_M$ is partitioned into $\mathcal{L} + 1$ subsets, $L_M^{(0)}, L_M^{(1)}, \cdots, L_M^{(\mathcal{L})}$, where $\mathcal{L}$ is the length of the uttered sentence in words and $L_M^{(d)}$ represents the subset containing sentences with a $d$-word difference from the correct sentence.

We start with the trivial case $d = 0$, which refers to the correct sentence. Obviously, $|L_M^{(0)}| = 1$. The random variable $\underline{x}_a$ for the $d = 0$ case is obtained by summing $\mathcal{L}$ copies of $\underline{x}_{a,1}'$:

$$\underline{x}_a = \mathcal{L} \cdot \underline{x}_{a,1}'. \quad (3.49)$$

From Eq. (3.48), the random variable $\underline{x}_{a,1}'$ is composed of the constant $\epsilon$ with probability $(1 - P_1')$ and a Gaussian random variable with total conditional probability $P_1'$. In other words, $\underline{x}_{a,1}'$ produces either the constant $\epsilon$ or a Gaussian value. To obtain $\underline{x}_a$, which is comprised of $\mathcal{L}$ copies of $\underline{x}_{a,1}'$, let $D^k$ denote the event that $k$ of the $\epsilon$'s and $(\mathcal{L} - k)$ of the Gaussian values are produced. For each $k$, $0 \leq k \leq \mathcal{L}$,

$$\underline{x}_a(k) = k\epsilon + (\mathcal{L} - k)\,\underline{x}_{a,1}'; \quad (3.50)$$
$$f(x_a \mid d = 0, D^k) = g\!\left(x_a;\ k\epsilon + (\mathcal{L} - k)\mu_{a,1}',\ \sqrt{\mathcal{L} - k}\,\sigma_{a,1}'\right). \quad (3.51)$$

The probability of the event $D^k$ is $C_k^{\mathcal{L}} (1 - P_1')^k P_1'^{(\mathcal{L} - k)}$⁸. Thereby, $f(x_a)$ given $d = 0$ can be constructed as

$$f(x_a \mid d = 0) = \sum_{k=0}^{\mathcal{L}} f(x_a \mid d = 0, D^k)\, P(d = 0, D^k) \quad (3.52)$$
$$= \sum_{k=0}^{\mathcal{L}} g\!\left(x_a;\ k\epsilon + (\mathcal{L} - k)\mu_{a,1}',\ \sqrt{\mathcal{L} - k}\,\sigma_{a,1}'\right) C_k^{\mathcal{L}} (1 - P_1')^k P_1'^{(\mathcal{L} - k)}. \quad (3.53)$$

Consult Fig. 3.7(a) for an example of the case $\mathcal{L} = 4$ and $k = 2$.

⁸$C_k^{\mathcal{L}}$ denotes the number of combinations of $k$ chosen from $\mathcal{L}$.

To construct $f(x_a \mid d > 0)$, that is, $\bar{f}(x_a)$, consider the subset $L_M^{(d)}$, $d > 0$. The random variable $\bar{x}_a$ is comprised of $d$ incorrect word models, represented by $\bar{x}_{a,0}'$, and $(\mathcal{L} - d)$ correct word models, represented by $\underline{x}_{a,1}'$:

$$\bar{x}_a(d) = d \cdot \bar{x}_{a,0}' + (\mathcal{L} - d)\,\underline{x}_{a,1}'. \quad (3.54)$$
Figure 3.7: (a) An example of how $f(x_a \mid d = 0, D^2)$ is constructed, when $\mathcal{L} = 4$ and $k = 2$. (b) An example of how $f(x_a \mid d = 3, D_0^2 D_1^0)$ is constructed, when $\mathcal{L} = 4$, $k_0 = 2$ and $k_1 = 0$.

Figure 3.7(b) shows an example of constructing $f(x_a \mid d > 0)$. Extending the idea used in the $d = 0$ case, let $k_1$ and $k_0$ denote the numbers of $\epsilon$'s produced by $\underline{x}_{a,1}'$ and $\bar{x}_{a,0}'$ respectively, and let $D_1^{k_1}$ and $D_0^{k_0}$ denote these events. Then $f(x_a \mid d > 0)$ is simply the sum over all combinations of the events $k_1$ and $k_0$. Given $d > 0$, $k_0 \in [0, d]$ and $k_1 \in [0, \mathcal{L} - d]$, we have

$$\bar{f}(x_a \mid d) = \sum_{k_0=0}^{d} \sum_{k_1=0}^{\mathcal{L}-d} g\big(x_a;\ \mu_{a,0}(d),\ \sigma_{a,0}(d)\big)\, P(D_0^{k_0})\, P(D_1^{k_1}), \quad (3.55)$$

where

$$\mu_{a,0}(d) = (k_0 + k_1)\epsilon + (d - k_0)\mu_{a,0}' + (\mathcal{L} - d - k_1)\mu_{a,1}'; \quad (3.56)$$
$$\sigma_{a,0}(d) = \sqrt{(d - k_0)\sigma_{a,0}'^2 + (\mathcal{L} - d - k_1)\sigma_{a,1}'^2}; \quad (3.57)$$
$$P(D_0^{k_0}) = C_{k_0}^{d} (1 - P_0')^{k_0} P_0'^{(d - k_0)}; \quad (3.58)$$
$$P(D_1^{k_1}) = C_{k_1}^{\mathcal{L}-d} (1 - P_1')^{k_1} P_1'^{(\mathcal{L} - d - k_1)}. \quad (3.59)$$

For $\bar{x}_a$ (i.e., $d > 0$), the number of incorrect words $d$ can range from 1 to $\mathcal{L}$. Subsequently,

$$\bar{f}(x_a) = \sum_{d=1}^{\mathcal{L}} \bar{f}(x_a \mid d)\, Q(L_M^{(d)}). \quad (3.60)$$

Recall that only the modeled language is considered in estimating the recognition rate. $Q(L_M^{(d)})$ can then be approximated by assuming $B_f$ is the averaged branch factor of the modeled language $L_M$. We have

$$B_f \stackrel{\text{def}}{=} \sqrt[\mathcal{L}]{N_M} \quad (3.61)$$

and

$$Q(L_M^{(d)}) \approx \frac{C_d^{\mathcal{L}} (B_f - 1)^d}{B_f^{\mathcal{L}}}. \quad (3.62)$$

Note that $\sum_{d=0}^{\mathcal{L}} Q(L_M^{(d)}) = 1$.

Total Likelihood

According to Eq. (3.29), the total likelihood $\underline{x}$ is the linear combination of the acoustic likelihood $\underline{x}_a$ and the grammar likelihood $\underline{x}_g$. From Eqs. (3.38), (3.43), (3.53) and (3.60), the p.d.f.'s of $\underline{x}_g$, whether in $L_{MT}$ or in $L_{M\bar{T}}$, are all Gaussian, and all $\underline{x}_a$'s are linearly combined Gaussians. Therefore, all p.d.f.'s of the total likelihoods are also linearly combined Gaussians, in which each mean can be obtained by adding the grammar bias (mean) to the mean of each Gaussian component of $f(x_a)$ (or $\bar{f}(x_a)$) in Eq. (3.60) and Eq. (3.53), and each standard deviation can be obtained by taking the square root of the sum of the squared variances. The desired p.d.f.'s, $f(x|L_{MT})$, $\bar{f}(x|L_{MT})$ and $\bar{f}(x|L_{M\bar{T}})$, are thereby obtained. Finally, by plugging the above three equations into Eq.
(3.28), the recognition rate under the Gaussian assumption can be obtained. It is important to note that the recognition rate depends on the sentence length $\mathcal{L}$ and the weight $w$.

3.5 Issue of Sentence Length

Normally, the recognition rate of a speech recognition system is sensitive to the length of the uttered sentence, $\mathcal{L}$. The "length" means "the length in words" in an IWR system and "the length in acoustic symbols" in a CSR system. In this chapter, all formulations developed above are under the assumption that $\mathcal{L}$ is fixed. However, in the real world, the TL usually does not have a unique sentence length. To avoid the dependence on $\mathcal{L}$, the following two methods are proposed.

First, the function $P_{sl}(l)$ is defined as the percentage of the test sentence set with length $l$ in the TL. For each possible $l$, the recognition rate $p_R(l)$ is estimated by Eq. (3.28) given the sentence length $\mathcal{L} = l$. Then the overall recognition rate is defined as

$$\bar{p}_R = \sum_l P_{sl}(l)\, p_R(l). \quad (3.63)$$

The other, simpler definition to eliminate the dependence on $\mathcal{L}$ is to define $\bar{\mathcal{L}}$ as the average sentence length in the TL,

$$\bar{p}_R = p_R(\bar{\mathcal{L}}). \quad (3.64)$$

The second method is simply an approximation for estimating the recognition rate. It is important to note that the coverage and over-generation are also sensitive to $\mathcal{L}$, which means that one LM, when dealing with diverse sentence lengths in the TL, needs more than one pair of $C_M$ and $O_M$ to characterize it. In this case, the ROC-LM design is not achievable because a LM is no longer characterized by only two parameters.

3.6 The Weight, $w$, and Relative Motion of LMs

Another adjustable parameter in Eq. (3.28) is the weight $w$ between the acoustic and grammar levels. $w$ can be interpreted as an indicator of the importance (or reliability) of the grammar evaluation with respect to the acoustic evaluation, and is always adjusted to optimize the recognition rate. Under the equal-likelihood assumption, the weight $w$ is set to zero to optimize the recognition rate.
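The length-weighted combination in Eq. (3.63) above can be sketched as follows; the length distribution and the per-length recognition rates are illustrative placeholders for values Eq. (3.28) would supply.

```python
# Eq. (3.63): the overall rate is the length-distribution-weighted average
# of the per-length recognition rates. Both inputs are illustrative; in the
# design tool, p_R[l] would come from evaluating Eq. (3.28) with L = l.
P_sl = {3: 0.2, 5: 0.5, 8: 0.3}    # fraction of TL sentences of each length
p_R = {3: 0.95, 5: 0.90, 8: 0.80}  # estimated recognition rate per length

p_R_overall = sum(P_sl[l] * p_R[l] for l in P_sl)
print(round(p_R_overall, 3))  # 0.2*0.95 + 0.5*0.90 + 0.3*0.80 = 0.88
```

The simpler alternative of Eq. (3.64) would instead evaluate `p_R` once at the average length, trading accuracy for one integral evaluation.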
In this case, the responses of the LM are binary, that is, $\infty$ or a constant. When $w \neq 0$, from Eqs. (3.43) and (3.38), the grammar-level evaluations are characterized by six parameters: $C_M$, $O_M$, $\mu_{g,0}$, $\sigma_{g,0}$, $\mu_{g,1}$ and $\sigma_{g,1}$. In the ROC-LM design scheme, a trajectory of LMs can be made by assuming that only $C_M$ and $O_M$ change while the rest remain invariant. However, this assumption is not practical in many applications, for example, adaptive language modeling problems. In the case that $C_M$ and $O_M$ remain the same while the rest of the parameters change, so that the ROC contour plot changes, a relative motion of that LM is calculated with respect to the new contour plot. Fig. 3.8 illustrates a changing contour plot with the LMs unchanged. A shift of the contour plot makes no difference to any of the LMs on the plane; however, a rotation results in different increments or decrements depending on the location. In Fig. 3.8, LM1 increases from 93 to 95, while LM2 increases from 93 to 97.

Figure 3.8: An illustration of a changing contour plot with the LMs unchanged (solid line: the old ROC contours; dashed line: the new ROC contours). The shift of a contour plot will make no difference to any of the LMs on the plane. However, a rotation will result in different increments or decrements which depend on the location.

3.7 The Scheme of the ROC-LM Design

To summarize, the scheme of the proposed ROC-LM design is itemized as follows:

1. The means and variances of the random variables $\underline{x}_{a,1}'$ and $\bar{x}_{a,0}'$ are obtained from the histograms of the acoustic evaluations;

2. The means and variances of the random variables $\underline{x}_{g,1}$ and $\bar{x}_{g,0}$ are obtained from the histograms of the grammatical evaluations;

3. The size of the TL, $N_T$, and the average sentence length in the TL are measured;

4. The two-dimensional ROC plane is divided into $n$ by $n$ grids, for example, $n = 5$.
Each grid point corresponds to a pair of coverage $C_M$ and over-generation $O_M$, and its recognition rate can be calculated according to Eq. (3.28). In the equal-likelihood case, $w$ is set to zero;

5. The contour plot of recognition rates is plotted on the ROC chart;

6. $C_M$ and $O_M$ are measured for an existing LM and the corresponding point is located on the ROC plane;

7. By adjusting various parameters of the LM, the $C_M$ and $O_M$ are changed, and a trajectory is made by connecting the locations of $(C_M, O_M)$ on the ROC plane;

8. If $C_M$ and $O_M$ are not changed but $\mu_{g,0}$, $\mu_{g,1}$, $\sigma_{g,0}$, or $\sigma_{g,1}$ are changed, the location of the old $(C_M, O_M)$ is mapped to a relative location with respect to the new ROC contour plot;

9. By overlapping the contour plot and the trajectory, the designer can observe the changing performance of the LM under the influence of the various parameters and can locate the optimal set of parameters in the model.

Chapter 4 The ROC-LM Design Tool

According to the ROC-LM design scheme of the previous chapter, a mouse-driven X-window version of the ROC-LM Design Tool was implemented on a Sun SPARC 10 workstation with its graphic facilities supported by MATLAB. The ROC-LM Design Tool includes three individual programs: one for calculating the estimated recognition rates, one for evaluating the LM by $C_M$ and $O_M$, and one for displaying results and controlling interfaces. The programs communicate with one another through text files which contain the required information. Several keywords (identifiers) are used in those files to identify what the following number stands for, for example, "COVERAGE: 93.1." The order of the identifiers in a text file is not important. The merit of implementing the tool as three individual programs and using identifiers to carry information is that, for instance, when different methodologies of evaluating LMs are considered, designers may simply replace the LM evaluation part of the tool, as long as the same identifiers are employed for communication.
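A minimal sketch of reading such keyword-identifier files (e.g. a line "COVERAGE: 93.1"); everything beyond the key/value convention stated in the text, such as the skipping of unrecognized lines, is an assumption of this sketch.

```python
# Parse an identifier-style communication file: each relevant line is
# "KEYWORD: number", and the order of identifiers does not matter.
# Lines without a colon are ignored (an assumption of this sketch).
def parse_identifier_file(text):
    values = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            values[key.strip()] = float(val)
    return values

sample = "COVERAGE: 93.1\nOVER_GENERATION: 12.5\n"
print(parse_identifier_file(sample))
```

Because each program only looks up the identifiers it needs, one block of the tool can be swapped out without changing the others, exactly the modularity argued for above.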
4.1 Overview

The ROC-LM Design Tool is composed of four functions:

1. A display window for the ROC design plane, ROC-WINDOW;

2. The estimation of recognition rates, ESM-RR;

3. The evaluation of the LM, EVL-LM;

4. The control panel and designer interface, CTRL-PANEL.

Figure 4.9 shows a diagram of the relationships among these four blocks⁹. The first three parts interact with one another, and the fourth is a designer interface controlling the communications and interactivity among the designer and the other three blocks. ROC-WINDOW includes a two-dimensional ROC plane with coverage, $C_M$, as its vertical axis and over-generation, $O_M$, as its horizontal axis. The main function of ROC-WINDOW is to draw the contour plot of the estimated recognition rates, and also to mark the location of the evaluated LM by its $C_M$ and $O_M$.

The estimation of recognition rates involves the acoustic evaluation, the LM evaluation excluding $C_M$ and $O_M$, and some general information such as the sentence length in words and the TL size. Given the ranges of $C_M$ and $O_M$, which can be adjusted from the control panel, the ranges are divided into 5 by 5 cells; each cell corresponds to one estimated recognition rate, resulting in a 5 by 5 matrix. This matrix is sent to the ROC-WINDOW block to draw the contour plot.

⁹The blocks ROC-WINDOW and CTRL-PANEL are implemented as one program.

Figure 4.9: A diagram of the relationships among the four blocks: the ROC display window, the Estimation of Recognition Rate (file: esm_rr.dat), the Evaluation of LM (file: eval_lm.dat), and the Control Panel and Designer Interface (file: esm_rr.prm).
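The 5 by 5 gridding described above, in which each cell contributes one estimated recognition rate, can be sketched as follows; the coverage and over-generation bounds are illustrative.

```python
# Split each selected range into n cells and take the cell centroids as the
# imaginary LMs at which the recognition rate will be estimated. The bounds
# below are illustrative, not values from the tool's experiments.
def centroids(left, right, n=5):
    step = (right - left) / n
    return [left + (i + 0.5) * step for i in range(n)]

c = centroids(0.80, 0.95)   # coverage range
o = centroids(0.10, 0.60)   # over-generation range
grid = [(ci, oj) for ci in c for oj in o]
print(len(grid))  # 25 grid points, one estimated recognition rate each
```

Evaluating the recognition-rate formula at each of the 25 points yields exactly the 5 by 5 matrix that ROC-WINDOW contours.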
On the other hand, a LM with parameters controlled by the control panel is evaluated by $C_M$ and $O_M$, and the evaluation results are sent to ROC-WINDOW, which marks the appropriate location. As described in the previous chapter, two testing sentence sets are required in evaluating LMs. Note that although the LM is assumed to be characterized by only $C_M$ and $O_M$, the means and variances of the evaluated grammar likelihoods are still required by ESM-RR to estimate the recognition rate. Therefore, there is a pathway from EVL-LM to ESM-RR that allows designers to decide whether to pass the resulting means and variances to the ESM-RR block. Note that, most of the time, the means and variances are supposed to be invariant during the modification of LMs.

Finally, CTRL-PANEL is responsible for the settings of all parameters and the commands from the designer. Details for each block are addressed in the following sections.

4.2 The ROC Display Window, ROC-WINDOW

ROC-WINDOW is implemented by employing the MATLAB Engine Library, a set of routines that allows users to call MATLAB from their own application programs, thereby employing MATLAB as a computation engine.

ROC-WINDOW is activated by three events: the output of ESM-RR is ready, the output of EVL-LM is ready, or a plotting command is sent directly from CTRL-PANEL. They are described as follows.

1. When the output of ESM-RR is ready: ROC-WINDOW receives the 5 × 5 matrix from ESM-RR through the file esm_rr.dat, and draws a new contour plot and the scales on both axes by employing plotting commands in the MATLAB Engine Library. To avoid over-writing the existing markers for the evaluated LMs, a buffer, named LM-BUFFER, is used to store the locations of the existing LMs. Each time a new contour plot is completed, LM-BUFFER pops up these reserved LMs and draws them on the plane.

2.
When the output of EVL-LM is ready
ROC-WINDOW receives the C_M and O_M from EVL-LM and draws a marker at the corresponding location on the ROC plane. The style of the marker, such as "green circle" or "red crosshair," is the designer's choice and can be changed at any time during designing. The (C_M, O_M) pair is then stored in LM-BUFFER for redrawing purposes.

3. Any plotting command from the Control Panel
All MATLAB commands can be issued through a MATLAB command window in CTRL-PANEL. Not only can ROC contour plots be displayed on this window, but the p.d.f.'s of acoustic evaluations, or histograms of grammar evaluations, can also be displayed by ROC-WINDOW.

4.3 The ESM-RR Block

The ESM-RR block is in charge of the estimation of recognition rates, which is derived based on an IWR speech system, given the sentence length and TL size. As previously described, ESM-RR is an individual program and is detachable from the ROC-LM Design Tool; that is, if a new method for estimating recognition rates is developed, or the mechanism of a speech recognition system does not fit the mathematical models developed in the previous chapter, this block can be replaced as long as the input and output formats remain the same.

ESM-RR starts by receiving the settings of the ranges of interest of C_M and O_M on the ROC plane from the file esm_rr.prm, which is sent by CTRL-PANEL. Both ranges are evenly divided by five. Then the centroid of each cell is elected as an imaginary LM characterized by its C_M and O_M. The recognition rate is then estimated by substituting the C_M and O_M into Eq. (3.28). Let c_left and c_right represent the left and right bounds of the C_M range, and o_left and o_right represent those of the O_M range. Then the five elected C_M's and O_M's are

    c(i) = c_left + (i + 0.5) (c_right - c_left)/5, for i = 0, ..., 4;   (4.1)
    o(j) = o_left + (j + 0.5) (o_right - o_left)/5, for j = 0, ..., 4.
(4.2)

A 5 x 5 matrix, with each entry being an estimated recognition rate, then results and is written to the output file esm_rr.dat.

Several issues regarding numerical analysis are addressed in the following. From Eq. (3.28), the estimation of recognition rates is mainly an integration from -∞ to +∞ of a mixed Gaussian p.d.f., f(x), multiplied by two "maskers," M_LMT(x) and M_~LM(x), which are defined as

    M_LMT(x) ≜ [ ∫_x^∞ f(u|LM_T) du ]^(N_MT - 1),   (4.3)
    M_~LM(x) ≜ [ ∫_x^∞ f(u|~LM) du ]^(N_~M).   (4.4)

Then Eq. (3.28) can be written as

    p_R = ∫_{-∞}^{+∞} f(x) M_LMT(x) M_~LM(x) dx.   (4.5)

Since f(u|LM_T) and f(u|~LM) are both p.d.f.'s, their integrals are always smaller than one; and since N_MT and N_~M are both positive integers, M_LMT(x) and M_~LM(x) always lie in the range [0, 1] for any finite x.

To implement Eq. (4.5) on a digital computer, the infinite integration domain is first replaced by a finite range, say [a, b]. From the previous chapter, f(x) is a multiple mixed Gaussian p.d.f., and by the fact that erf(2.5) ≈ 0.9996, if the integration domain outside [μ(k) - 2.5σ(k), μ(k) + 2.5σ(k)] for each Gaussian p.d.f. in Eq. (3.60) is truncated, the error is bounded within 0.04%:

    ε = | ∫_{-∞}^{+∞} f(x) M_LMT(x) M_~LM(x) dx - ∫_{∪_k [μ(k)-2.5σ(k), μ(k)+2.5σ(k)]} f(x) M_LMT(x) M_~LM(x) dx |   (4.6)
      = ∫_{outside ∪_k [μ(k)-2.5σ(k), μ(k)+2.5σ(k)]} f(x) M_LMT(x) M_~LM(x) dx   (4.7)
      ≤ ∫_{outside ∪_k [μ(k)-2.5σ(k), μ(k)+2.5σ(k)]} f(x) dx < 0.0004.   (4.8)

Therefore, a is chosen as the left bound of the "left-most" Gaussian p.d.f. in f(x), that is, when k = 0 in Eq. (3.53),10 and b is chosen as the right bound of the "right-most" Gaussian p.d.f., that is, when k = L in Eq. (3.53). Therefore,

    a = L μ_{a,1} + w μ_{g,1} - 2.5 √( L σ_{a,1}² + w² σ_{g,1}² );   (4.9)
    b = L ε + w μ_{g,0} + 2.5 w σ_{g,0}.   (4.10)

Next, by evenly partitioning the range [a, b] into n segments, and letting h = (b - a)/n and x_i = a + h(i - 0.5), the p_R in Eq. (4.5) can be approximated by

    p_R ≈ Σ_{i=1}^{n} f(x_i) M_LMT(x_i) M_~LM(x_i) h.
(4.11)

To calculate M_LMT(x) and M_~LM(x), note that f(u|LM_T) and f(u|~LM) are both mixed Gaussian (linear combinations of Gaussian p.d.f.'s); thereby, their integrals can be obtained by simply dropping the integral symbol and replacing g with ε1 in Eq. (3.55). The subscript g → ε1 on f(·) is used to denote the replacement of g with ε1, where ε1(u) is defined as

    ε1(u) ≜ ∫_u^∞ g(z) dz = 0.5 (1 - erf(u)).   (4.12)

Therefore, we have

    M_LMT(x) = [ f_{g→ε1}(x|LM_T) ]^(N_MT - 1),   (4.13)
    M_~LM(x) = [ f_{g→ε1}(x|~LM) ]^(N_~M).   (4.14)

10Note that ε ≫ μ_{a,1}.

The algorithm for calculating p_R is thus derived as follows.

ESM-RR Algorithm
{Read file: esm_rr.prm}
{Initialization}
for each 0 ≤ i ≤ 4 and 0 ≤ j ≤ 4, do
    c(i) = c_left + (i + 0.5)(c_right - c_left)/5;
    o(j) = o_left + (j + 0.5)(o_right - o_left)/5;
    r_ij = Recognition-Rate-Estimation(c(i), o(j));
end do
{Termination}
record the r_ij;

Recognition-Rate-Estimation Algorithm
{Args: c, o}
1 {Initialization: r = 0}
2 Find the integration domain [a, b], Eqs. (4.9) and (4.10);
3 h = (b - a)/n;
  for each 1 ≤ i ≤ n, do
4     x_i = a + (i - 0.5) h;
5     Calculate f(x_i);
6     Calculate M_LMT(x_i) and M_~LM(x_i) by Eqs. (4.13) and (4.14);
7     r = r + f(x_i) * M_LMT(x_i) * M_~LM(x_i) * h;
  end do
8 return r;

4.4 The EVL-LM Block

In this research, the ROC-LM Design Tool is installed with one LM, Bi-TriGram, a hybrid of bigram and trigram, to exemplify how to evaluate an LM by its C_M and O_M. Bi-TriGram is a flexible model which can easily shift between bigram and trigram, or a combination of them. There are three parameters in Bi-TriGram that control the p.d.f. of NLKHD and the C_M and O_M of the Bi-TriGram. These three parameters are the weight, penalty and threshold.

• weight, w_g: a weight between the bigram and trigram likelihoods, 0 ≤ w_g ≤ 1. Let l_b, l_t and l_g represent the NLKHD of the bigram, the trigram, and the total grammatical likelihood, respectively. We have

    l_g = (1 - w_g) l_b + w_g l_t;   (4.15)

• penalty, p_g: if the trigram does not exist in the training context, l_t is assigned a penalty,
p_g.

• threshold, t_g: if l_g is above the threshold t_g, the whole hypothesized sentence is pruned; that is, l_g = ∞.

The following algorithmic description illustrates their relationships, in which ∞ stands for a pruned sentence. In practice, 100.0 is used to represent ∞, and all l_g's larger than 100.0 are considered to be ∞. The symbol "·" in a word string, for instance w''-w'-·, is used to represent "any word in the vocabulary V."

Grammar Rules for Bi-TriGram
{Initialization;}
1 for each three consecutive words w''-w'-w in one sentence, do
2     if the bigram exists,
3         l_b = -ln (count of w''-w' / count of w''-·);
4         if the trigram exists,
5             l_t = -ln (count of w''-w'-w / count of w''-w'-·);
          else
6             l_t = p_g;
          end if
      else (if the bigram does not exist)
7         l_b = ∞;
      end if
8     l_g ← l_g + (1 - w_g) l_b + w_g l_t;
9     if l_g > t_g, prune the sentence (l_g = ∞);
  end do

The above algorithm is applied to each word concatenation in one hypothesized sentence. Now note that:

1. If the threshold t_g is set to ∞, the pruning due to a high l_g (line 9) never occurs;
2. If p_g is set to ∞, then when a consecutive word triple cannot be found in the trigram, the whole sentence is assigned ∞ (line 6) and can be considered pruned.

Consequently, by letting w_g = 1 and t_g = p_g = ∞, the above algorithm reduces to the following, and Bi-TriGram reduces to a trigram. Note that the existence of a trigram implies the existence of the corresponding bigram.

Grammar Rules for Bi-TriGram (trigram reduction)
{Initialization;}
1 for each three consecutive words w''-w'-w in one sentence, do
2     if the trigram exists,
3         l_t = -ln (count of w''-w'-w / count of w''-w'-·);
      else
4         l_t = ∞;
      end if
5     l_g ← l_g + l_t;
  end do

Similarly, when w_g = 0 and t_g = ∞, the original algorithm reduces to the following, and Bi-TriGram reduces to a bigram.

Grammar Rules for Bi-TriGram (bigram reduction)
{Initialization;}
1 for each three consecutive words w''-w'-w in one sentence, do
2     if the bigram exists,
3         l_b = -ln (count of w''-w' / count of w''-·);
      else (if the bigram does not exist)
4         l_b = ∞;
      end if
5     l_g ← l_g + l_b;
  end do

The following is an algorithm for evaluating Bi-TriGram, which consists of three phases. Note that, for some applications, designers may choose to skip the training phase and directly read a constructed database from disk, reducing the response time of EVL-LM.

EVL-LM (Bi-TriGram) Algorithm
{Initialization}
Training;
Evaluate C_M;
Evaluate O_M;
Write results to file evl_lm.dat;
{Termination}

To illustrate the training algorithm, a diagram of the data structure for storing word-concatenation information from the training context is shown in Fig. 4.10. In Fig. 4.10, the horizontal links represent bigram links, such as w1-wb1, w1-wb2, etc. The vertical links represent trigram links, such as w1-wb1-wt1, w1-wb1-wt2, etc. Note that if there exists a trigram w1-wb1-wt1, there must be a bigram wb1-wt1.

Training Algorithm
{Train-Set: training sentences chosen from the TL;}
{Initialization}
1 repeat until End-of-File, do
2     Read one word, w, from Train-Set;
3     if link w'-w exists,
4         increase the count of link w'-w;
      else
5         Insert link w'-w;
6         count of link w'-w = 1;
      end if
7     if link w''-w'-w exists,
8         increase the count of link w''-w'-w;
      else
9         Insert link w''-w'-w;
10        count of link w''-w'-w = 1;
      end if
11    w'' = w';
12    w' = w;
  end do
13 record the trained grammar;

[Figure 4.10 appears here: each vocabulary entry w1, ..., w|V| heads a horizontal chain of bigram links, and each bigram link heads a vertical chain of trigram links.]

Figure 4.10: A diagram of the data structure for storing word-concatenation information from the training context.

To evaluate C_M and O_M, the same evaluation algorithm but different testing sets are employed. From the previous chapter, C_M is based on the TL, while O_M is based on the LM. Thereby, there are two sentence sets, Test-Set I and Test-Set II, for evaluating these two parameters. The following is an algorithm for evaluating grammar likelihoods for a Bi-TriGram.
The evaluation returns the mean, μ_g, and variance, σ_g, of the grammar likelihoods, and the total number of sentences obtaining finite likelihoods (that is, not pruned).

Evaluation Algorithm
{Initialization}
1 for each word, w, in the testing set, do
2     if link w'-w exists,
3         bi_lk = -log(count of link w'-w / count of w'-·);
4         if link w''-w'-w exists,
5             tri_lk = -log(count of link w''-w'-w / count of w''-w'-·);
          else
6             tri_lk = penalty;
          end if
7         lk = lk + (1-weight)*bi_lk + weight*tri_lk;
8         if w == word-terminator,
9             nlk = lk / sentence_length;
10            if nlk ≤ threshold,
11                increase total_count;
12                store nlk in nlk_array;
              end if
13            {Reset variables;}
          end if
14        w'' = w';
15        w' = w;
      end if
  end do
16 μ = mean of nlk_array;
17 σ = variance of nlk_array;
18 return total_count, μ, σ;

According to Eq. (3.26), when the grammar rules for the TL are not available to EVL-LM in some applications, O_M can be estimated from N_M. In the current Bi-TriGram case, suppose the grammar rules of the TL are unknown, and let N_M,bigram be the size of Bi-TriGram when the threshold t_g = ∞, that is, the size of the bigram part. Then, when t_g is finite, we have

    N_M / N_M,bigram ≈ total_count / |Test-Set II|.   (4.16)

Therefore, N_M is estimated as

    N_M = N_M,bigram · total_count / |Test-Set II|.   (4.17)

N_M,bigram is estimated as follows. Let l̄_g,b denote the averaged bigram negative log-likelihood for sentences in Test-Set II. Then the averaged probability of generating one sentence from the bigram is 10^(-l̄_g,b). Thereby, the size N_M,bigram is the inverse of this probability,

    N_M,bigram = 10^(l̄_g,b).   (4.18)

The algorithms for evaluating C_M and O_M are thus listed as follows.

Evaluate C_M Algorithm
{Test-Set I: randomly choose m test sentences from the TL;}
{Initialization}
1 [count, μ_g, σ_g] = call Evaluation;
2 μ_g,1 = μ_g;
3 σ_g,1 = σ_g;
4 C_M = count / m;
5 write C_M, μ_g,1, σ_g,1 to file evl_lm.dat;
end.
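The per-sentence loop above can be sketched compactly. This is a minimal illustration, not the dissertation's implementation: it assumes conditional-probability tables in place of the count-based link structure, and the parameter values are made up.

```python
import math

# Hypothetical sketch of the Evaluation Algorithm above: interpolated
# bigram/trigram negative log-likelihood per sentence, normalized by
# sentence length, with a penalty for unseen trigrams and threshold
# pruning. bi_p / tri_p map word pairs / triples to conditional
# probabilities (stand-ins for the count ratios in the algorithm).

def sentence_nlk(words, bi_p, tri_p, weight, penalty):
    """Normalized negative log-likelihood of one sentence (lower = better)."""
    lk = 0.0
    for w2, w1, w in zip(words, words[1:], words[2:]):
        bi_lk = -math.log(bi_p[(w1, w)]) if (w1, w) in bi_p else float("inf")
        tri_lk = -math.log(tri_p[(w2, w1, w)]) if (w2, w1, w) in tri_p else penalty
        lk += (1 - weight) * bi_lk + weight * tri_lk
    return lk / len(words)

def coverage(sentences, bi_p, tri_p, weight, penalty, threshold):
    """Fraction of test sentences kept (finite NLK within the threshold);
    this is the C_M estimate when the sentences are drawn from the TL."""
    kept = [s for s in sentences
            if sentence_nlk(s, bi_p, tri_p, weight, penalty) <= threshold]
    return len(kept) / len(sentences)

if __name__ == "__main__":
    bi_p = {("a", "b"): 0.5, ("b", "c"): 0.5}   # illustrative tables
    tri_p = {("a", "b", "c"): 0.25}
    s = ["a", "b", "c"]
    print(round(sentence_nlk(s, bi_p, tri_p, weight=0.0, penalty=10.0), 4))  # → 0.231
```

With `weight=0.0` the model degenerates to the bigram, mirroring the bigram reduction of the grammar rules; `weight=1.0` with an infinite penalty mirrors the trigram reduction.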
Evaluate O_M Algorithm
{Test-Set II: randomly generate m test sentences from the LM;}
{Initialization}
1 [count, μ_g, σ_g, l̄_g,b] = call Evaluation;
2 μ_g,0 = μ_g;
3 σ_g,0 = σ_g;
4 N_M = 10^(l̄_g,b) · count / m;
5 write N_M, μ_g,0, σ_g,0 to file evl_lm.dat;
end.

Note that since the information of |L_T| is held in CTRL-PANEL, it is CTRL-PANEL's responsibility to convert N_M to O_M by applying Eq. (3.26).

4.5 The Control Panel Block

The CTRL-PANEL block is implemented by employing X Window System programming, or X for short, and the Motif toolkit. X provides a standard window environment that allows application programmers to spend less time porting to a graphical user interface (GUI) platform. Motif, one of the leading commercially available GUIs, provides the user-interface components, such as pushbuttons, dialogs, etc., needed to realize the visual on-line ROC design scheme.

The main functions of CTRL-PANEL are to collect all required parameters from the designer and to dispatch commands to EVL-LM, ESM-RR and ROC-WINDOW. The layout of CTRL-PANEL is shown in Fig. 4.11. The following are the function descriptions of each element on the panel.

There are three types of controls on CTRL-PANEL: buttons, sliders, and text fields. Their functions are described as follows. The label on a button used for a single command has the same name as the command. For example, to apply an ROC contour plot on the ROC plane, the designer clicks on the button that says "CONTOUR." Sliders are used to set a numeric value and to give a visual indication of the setting. Every slider is associated with a type-in field and a label on its left side. The type-in field displays the corresponding value immediately when the slider is moved, and it also allows users to type in a number in case the slider fails to provide it. Text fields are used when input from the keyboard is required. If the text field is a command line, pressing the return key activates the command.
For example,

[Figure 4.11 appears here: the tool layout, with the ROC WINDOW; the ROC-Contour Control Panel (ESM-RR) fields MU_A0, SIGMA_A0, P'_A0, MU_A1, SIGMA_A1, P'_A1, EPSILON, Sentence Length, and TL Size; the Language Model Design Panel (EVL-LM) with Parameters #1-#3; the Control and Designer Interface (CTRL-PANEL) with the coverage and over-generation ranges, LM Locator, and MATLAB Command field; and the buttons CONTOUR, ACOU PDF, EXIT, APPLY LM, CLEAR LM, NLK PDF, ULC R.R., and CLR ULCs.]

Figure 4.11: The layout of the ROC-LM Design Tool.

typing "text(0.5,0.5,'here')" followed by a return key in the field labeled MATLAB Command will send the command to the MATLAB Engine and write the string "here" at [0.5, 0.5] on the ROC plane.

From the layout in Fig. 4.11, all of the controls fall into two groups. The upper-left group, the ESM-RR group, concerns the information for ESM-RR, and the bottom group, the EVL-LM group, concerns the information for EVL-LM. The following are the descriptions of each control on CTRL-PANEL.

P'_A1 and P'_A0
    P'_1/P'_0 in Eqs. (3.48) and (3.47), representing the probability of a correct/incorrect word HMM finding a valid path for the input acoustic symbol string and returning a likelihood.

MU_A1 and MU_A0
    μ_a,1/μ_a,0 in Eqs. (3.48) and (3.47), representing the means of the word-level HMM evaluations when correct/incorrect HMMs are used, excluding those evaluations that returned penalties.

SIGMA_A1 and SIGMA_A0
    σ_a,1/σ_a,0 in Eqs. (3.48) and (3.47), representing the standard deviations of the word-level HMM evaluations when correct/incorrect HMMs are used, excluding those evaluations that returned penalties.

PENALTY
    The ε in Eqs. (3.48) and (3.47).

WEIGHT
    The w in Eq. (3.29). Refer to Sec. 3.6 for adjusting w if undecided.

MU_G1 and MU_G0
    μ_g,1/μ_g,0 in Eqs.
(3.38) and (3.43), representing the means of the LM evaluations when grammatical/non-grammatical sentences are used, excluding those evaluations that returned penalties.

SIGMA_G1 and SIGMA_G0
    σ_g,1/σ_g,0 in Eqs. (3.38) and (3.43), representing the standard deviations of the LM evaluations when grammatical/non-grammatical sentences are used, excluding those evaluations that returned penalties.

Sentence Length
    The average sentence length, or the length that appears most frequently, is applied in this work, Eq. (3.64). See the discussion in Sec. 3.5.

Sizeof(Task L)
    The N_T in Eqs. (3.28) and (3.26). |L_T| combined with L indicates the difficulty of a speech recognition task. Experiments show that the estimation of the recognition rate is not very sensitive to |L_T|; a rough estimate of |L_T| is enough.

Over-Gen Range, OVR-G LF and OVR-G RT
    The range of O_M on the ROC plane over which to estimate recognition rates, always between zero and one. Note that O_M cannot be one unless C_M equals zero; see Eq. (3.25).

C_M Range, COVER LF and COVER RT
    The range of C_M on the ROC plane over which to estimate recognition rates, always between zero and one. Similarly to the above, C_M = 0 never happens.

CONTOUR
    When this button is clicked, CTRL-PANEL collects the information in the ESM-RR group, writes it to the file esm_rr.prm, and activates the ESM-RR block to estimate recognition rates at the 5 x 5 grid points on the ROC plane. The resulting matrix is then delivered to ROC-WINDOW to draw a new contour plot. If one of μ_g,0, σ_g,0, μ_g,1, σ_g,1 differs from its value when the old contour plot was made, and if the weight w is not zero, it is an occasion for relative motion: the old LMs are transformed to their relative positions with respect to the new contour plot.

ULC R.R.
    The button CONTOUR produces 5 x 5 recognition rates each time. In contrast, ULC R.R., an abbreviation for upper-left-corner recognition rate, allows designers to estimate the recognition rate at one particular location, (OVR-G LF, COVER RT), on the ROC plane. ULC R.R. uses the
In contrast, ULC RR, abbreviated by upper-left-corner recognition rate, pro- vides designers to estimate recognition rate on one particular location, which is (OVR—G LF, COVER RT) on the ROC plane. ULC RR uses 72 vertical axis as the estimated recognition rate and horizontal axis is the count for each estimation. For example, a designer is interested to show the effect on recognition rates when a parameter 6 is changed. Each time the c is changed, click on the ULC RR button. ULC RR will provide a changing recognition rate associated with (OVR-G LF, COVER RT) versus the changing e. Designers can thus observe the tendency of the evolving recognition rate. CLEAR UCL Clear the UCL RR buffers. ACOU PDF The p.d.f.’s of acoustic evaluations will be drawn on ROC-WIN DOW ac- cording to f(:rf,0) and f(a:f,1) in Eq. (3.47) and (3.48) respectively. EXIT Exit the ROC Design Tool. PRM #1, #2 and #3 The ROC Design Tool allows three adjustable parameters fro designing 3. LM. The setting for each parameter associates with a text field to briefly describe the parameter, a numeric field for displaying its current value, two numeric fields showing its left and right bounds, and a slider between them to adjust the current value between the left and right bounds. 73 When all of the three settings are ready, the designer may click on the APPLY LM button to activate the EVL-LM block. APPLY LM When this button is clicked, CTRL-PAN EL collects the the numeric fields for three parameters and activates EVL-LM. As mentioned above, if the first numeric field is empty, EVL-LM will be notified by an initialization mode rather than activation mode. CLEAR LM Clear the buffer LM-BUFFER. All makers for LMs on ROC plane will be erased. LM Locater Designers may choose from the following six types of markers: ro : red circle; rx : red X; r+ : red crosshair; go : green circle; g+ : green X; gx : green crosshair. 
NLK PDF
    The histograms of the grammar-level evaluation results (the grammar likelihoods) will be drawn on ROC-WINDOW. EVL-LM is required to write the grammar likelihoods to the ASCII files nlkcv.dat and nlkog.dat when evaluating C_M and O_M, respectively.

MATLAB CMD
    Designers may issue MATLAB commands through this field. A return key is needed to activate the command.

Chapter 5

Implementation and Design Cases

The main purpose of the proposed ROC-LM Design Tool is to quantitatively describe the interactions between the performance of the LM and the recognition environment, which includes the acoustic performance and the complexity of the TL. To demonstrate the proficiency of the ROC-LM Design Tool in achieving these goals, four types of LMs, two simulated TLs, and two acoustic environments are used in this chapter. Using ROC-LM, the trade-off between coverage, C_M, and over-generation, O_M, for each underlying LM, with the recognition rate and computational cost as its concerns, can be "seen" on the graphical screen of the tool. In addition, the issues of computational cost and of adaptive LMs can also be shown visually using this tool.

5.1 The Acoustic Environments

5.1.1 Acoustic Settings

The acoustic environment refers to the settings of the first seven parameters in the ESM-RR block of the ROC-LM Tool. Let A denote these seven parameters modeling one set of acoustic evaluation performance:

    A = { μ_a,0, σ_a,0, P'_0, μ_a,1, σ_a,1, P'_1, ε }.   (5.1)

The first six parameters in A are obtained from the histograms of the acoustic evaluations on the correct and incorrect evaluation sets. In this research, these acoustic evaluations employ the following methods.11

1. Mel-Cepstrum
The feature extraction task for these acoustic evaluations is based on the mel-cepstral parameters. Twenty mel-frequency components are desired on the Nyquist range 0-8 kHz (the sampling rate is 16 kHz).

2.
Vector Quantization
Ten features are selected to compose a vector representing each frame of the speech signal, including eight mel-based cepstral parameters, power, and differenced power. The vector quantization codebook contains a set of 128 vectors which provide the minimum average distance between a given training set of analysis vectors and the codebook entries.

3. HMM Training
The six-state Bakis model [3] is used for the word model. Further, additional constraints disallow jumps of more than two states. The forward-backward (or Baum-Welch) re-estimation algorithm [3], [25], [26] is adopted for training.

11These methods have been developed and verified in the Speech Processing Lab of Michigan State University by Chiu [24], 1992. For more details on features and techniques used in speech recognition, see [3].

5.1.2 TI-46 and TIMIT Databases

Two speech corpora are employed in this study:

1. TI-46 Word Speech Database, the Speaker-Dependent Isolated Word Corpus (TI-46), which contains a corpus of isolated words designed and collected at Texas Instruments (TI) in 1980 [27], [28];
2. TIMIT (Texas Instruments - MIT) Speech Database, which provides continuous speech data for the acquisition of acoustic-phonetic knowledge [29].

The TI-46 corpus contains 16 speakers and 46 isolated words per speaker. In comparison, the TIMIT corpus is a continuous speech database containing 630 speakers from eight major dialect regions, with a vocabulary size of 6,100. Obviously, the former has better acoustic performance than the latter. The figures that follow show the experimental histograms of the evaluations under the correct and incorrect hypotheses.

The seventh parameter, ε, represents the penalty assigned to an HMM word model that fails to find a valid path for the input acoustic symbol string. Initially, ε is arbitrarily set to 10. The ROC-LM Design Tool provides a function to adjust the parameters in the settings of the acoustic environment.
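The vector-quantization step described above amounts to a nearest-codebook search per frame. The sketch below is illustrative only: the two-dimensional codebook and squared-Euclidean distance are my simplifications of the actual system's 128 ten-dimensional codebook vectors.

```python
# Illustrative nearest-codebook vector quantization, a sketch of the
# VQ step in Sec. 5.1.1. The tiny 2-D codebook here is hypothetical.

def quantize(frame, codebook):
    """Return the index of the codebook vector closest to `frame`
    under squared Euclidean distance."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))

if __name__ == "__main__":
    codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
    print(quantize((0.9, 1.2), codebook))  # → 1
```

Each speech frame thus becomes a single codebook index, producing the discrete symbol string that the word HMMs evaluate.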
Refer to the description of the button ULC R.R. in the previous chapter. The seven acoustic parameters for TI-46 and TIMIT are:

    A_TI-46 = { 5.20, 1.80, 0.01, 0.85, 1.05, 0.97, 20.1 };   (5.2)
    A_TIMIT = { 4.82, 2.55, 0.03, 0.85, 0.92, 0.93, 6.76 }.   (5.3)

5.2 Introducing the TLs

Two TLs are considered in this work: English word-spelling rules and an artificial digit-string (ADS) language. The ADS language is defined below. The first language is used in the alphabet recognizer, which is considered an important application for an IWR system. In alphabet recognizers, a letter is treated as a "word" and a word is a "sentence." In order to eliminate impossible concatenations of two letters (for example, the letter "j" is never followed by the letter "g"), an "LM" such as a bigram is usually employed to represent the spelling rules and predict the next letter, thus improving the recognition. The source of the words (sentences) used for training or testing the LM is the sports column of the San Jose Mercury News. Repeated words are allowed, which means some "sentences" occur frequently.

The second TL, the ADS language, is suggested in [3, Ch. 10, Problem 10] as a pedagogical language example. The first reason for using the ADS language is that it is quantitatively well defined from the bottom lexical level through the pragmatic level, such that the size and generating rules on each level can be exactly calculated by the ROC-LM Design Tool. This is important because, by definition, the O_M is the percentage of the LM belonging to the TL. Therefore, given a sentence randomly generated from the LM, ROC-LM requires an evaluation model to determine whether or not this sentence is in the TL, thus obtaining the O_M rate. For spoken English or English spelling rules as the TL, there are no such specific rules with which to carry out this evaluation. Although the O_M can be estimated from the estimate of |LM|, Eq. (3.26), it is still more precise if the O_M can be obtained directly from evaluating the LM.
The following is the definition of the ADS language [3]:

Consider a "language" for which the symbols are 0, 1, 2, ..., 9, +, -. The formal components of the language are as follows:

1. Lexical. A "word" consists of any number of digits concatenated by any sequence of + and - signs in any order. The first or last symbol in a word is never a sign. The total numerical value of a word must be nine. For example, 4+5-0 is a word in the language.

2. Syntactic. A "sentence" is any sequence of words such that any word ending in an odd digit must be followed by a word beginning in an even digit, and vice versa. For example, 3+6+4/7+3-1/2+0-5+8+4 is a proper sentence.

3. Semantic. To be "meaningful," a sentence must contain words that become increasingly longer. For example, 9/2+7+1-1/4+5-5+5+0-0+2-1-1 is a meaningful sentence.

4. Pragmatic. When "speaking" to a child, each word in a sentence is usually no more than five digits long. There are no constraints for adult listeners.

5.3 LM Design Cases

The following are three typical examples of designing LMs. These three cases are used to verify the trade-off theory and demonstrate usage of the ROC-LM Design Tool. Note that there are in total 21 adjustable parameters on the ROC-LM Control Panel, and there is no specific design scheme and no guarantee of locating the optimal parameter settings. However, there is one simple rule for adjusting parameters: the most significant parameter for improving the recognition rate is the one that moves the LM on the ROC plane in the direction normal to the local contour curves.

5.3.1 Case I: Determining the Required Number of Sentences for Training an N-Gram for an Alphabet Recognizer

Problem Description

The first design case is to find the best n-gram (n = 2 or n = 3) and an appropriate number of training sentences for an alphabet recognizer. The alphabet recognizer is used for retrieving the spelling of a word, and is an important application of IWR systems.
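The ADS lexical and syntactic rules defined in Sec. 5.2 can be checked mechanically, which is what makes exact O_M evaluation possible for this TL. The sketch below is my reading of the rules: the "/" word separator and single-digit tokens are assumptions inferred from the book's examples, not stated in the source.

```python
import re

# Hypothetical membership test for the ADS "language" of Sec. 5.2.
# Assumes words are single digits joined by +/- and sentences use "/"
# as a word separator, as in the examples (e.g. 4+5-0).

def is_word(w):
    """Lexical rule: digits joined by signs, no leading/trailing sign,
    and a total signed value of nine."""
    if not re.fullmatch(r"\d([+-]\d)*", w):
        return False
    total = int(w[0])
    for i in range(1, len(w), 2):
        d = int(w[i + 1])
        total += d if w[i] == "+" else -d
    return total == 9

def is_sentence(s):
    """Syntactic rule: a word ending in an odd digit must be followed by
    a word beginning in an even digit, and vice versa."""
    words = s.split("/")
    if not all(is_word(w) for w in words):
        return False
    return all(int(prev[-1]) % 2 != int(nxt[0]) % 2
               for prev, nxt in zip(words, words[1:]))

if __name__ == "__main__":
    print(is_word("4+5-0"))        # → True
    print(is_sentence("4+5-0/9"))  # → True (a made-up two-word sentence)
```

A checker like this lets randomly generated LM sentences be classified as in or out of the TL, giving the O_M rate directly rather than via the |LM| estimate of Eq. (3.26).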
Although the vocabulary size in this case is as limited as 26, the recognition task is usually difficult due to the acoustic similarities among alphabet symbols. An n-gram is usually employed to provide the information associated with the spelling rules. The size of the training context for an n-gram is always a trade-off problem for the LM designer. Too few training sentences result in insufficient word-concatenation information and cause a low C_M; meanwhile, too many training sentences result in over-hypothesizing by the LM and cause a high O_M. In this design case, bigram, trigram and a combination of them, BiTri-Gram (mentioned in Chapter 3), are evaluated and compared on the ROC plane. The acoustic environment set-ups are obtained from the TI-46 database. For each n-gram, a trajectory is formed along with the variation of the total number of training sentences.

Parameter Settings

The following is a list of the initial settings required by the ROC-LM Control Panel. The acoustic parameter set-ups use A_TI-46. The ε and w are initialized as listed below. Since w = 0, the parameters μ_g,0, σ_g,0, μ_g,1 and σ_g,1 will not affect recognition rates and are arbitrarily initialized. The sentence length L is set as the average "word" length in the testing context. The TL size is assumed to be 1,000. Finally, the initial scopes of the C_M and O_M are both set to (0.0, 1.0) to obtain a gross view of the ROC contour plot.

    MU_A0     μ_a,0    5.20      MU_A1     μ_a,1    0.85
    SIGMA_A0  σ_a,0    1.05      SIGMA_A1  σ_a,1    1.80
    P'_A0     P'_0     0.01      P'_A1     P'_1     0.97
    EPSILON   ε        -10.00    WEIGHT    w        0.00
    MU_H0     μ_g,0    1.00      MU_H1     μ_g,1    1.00
    SIGMA_H0  σ_g,0    1.00      SIGMA_H1  σ_g,1    1.00
    SNT LEN   L        8         TL SIZE   |L_T|    1000
    COVER LF  c_left   0.00      OVR-G LF  o_left   0.00
    COVER RT  c_right  1.00      OVR-G RT  o_right  1.00

Design Procedures

1. First, we may click on the button ACOU PDF to show the two Gaussian p.d.f.'s of the acoustic-level evaluations (Fig. 5.12).

2. After the button CONTOUR is clicked, a contour plot is shown on the ROC-WINDOW.
Since the performance of the acoustic evaluations is almost perfect, the recognition rates are dominated by C_M, resulting in a set of flat ROC contour curves.

3. Next, a "middleman" program, bigram_train, is run; it conveys two parameter settings, the number of training sentences n and the random seed r, from the ROC-LM Control Panel to the language model bigram, then prepares a training context for the LM evaluation program bigram. Note that bigram_train -init will bring up the description and initial settings for n and r.

4. By clicking on the button APPLY LM, a circle appears at (0.994, 0.482) on the ROC plane, which means O_M is 0.994 and C_M is 0.482. So we change the range of the ROC plane by moving COVER LF to 0.44 and OVR-G LF to 0.992.

5. Then the slider PRM #1, representing n, is moved through the following sequence, and APPLY LM is clicked each time n is changed:

    100 → 190 → 280 → 370 → 460 → 550

A trajectory is made on the plane, as shown in Fig. 5.13.

6. Then the random seed r is changed to show how sensitive the bigram is when a different training context is applied. PRM #2 is moved from 177 to 1997, and the LM Locator is switched from "red circle" to "red X." Next, PRM #1 is moved through the above sequence backward. Another trajectory is made, as shown in Fig. 5.13.

7. The same procedure is applied for the program trigram_train, except that PRM #1 is changed from 100 to 1,000. Results are shown in Fig. 5.14.

Results and Discussion

The comparison between the bigram performances and trigram performances is shown in Fig. 5.15. It is observed that the area the trigram occupies spreads around

[Figure 5.12 appears here: the tool screen with the two acoustic-evaluation p.d.f.'s plotted in the ROC WINDOW.]

Figure 5.12: By clicking on the button ACOU PDF, the two Gaussian p.d.f.'s of acoustic-level evaluations are shown on the ROC-WINDOW.
the southwest of that of the bigram. If the size of the training context is limited to, for instance, 550 sentences, the bigram has a better recognition rate (around 72%) than the trigram (around 54%) due to its higher C_M. The trigram always has a lower O_M, which could benefit the recognition rate if the training context were unlimited and the trigram's C_M reached the same height as the bigram's. However, in this case, the benefit is not promising because of the flat contour plot, which implies that the C_M dominates the recognition rate.

[Figure 5.13 appears here: the bigram trajectories on the ROC plane.]

Figure 5.13: The bigram performance when the slider PRM #1, representing the number of sentences in the training context, is moved from 100, 190, through 550, with PRM #2, the random seed, at 177. A trajectory marked by circles is formed. The random seed is then changed to 1997, and PRM #1 moves backward from 550 to 100. Another trajectory, marked by "X," is made.

[Figure 5.14 appears here: the trigram trajectories on the ROC plane.]

Figure 5.14: The trigram performance when the slider PRM #1, representing the number of sentences in the training context, is moved from 100, 190, through 1,000, with PRM #2, the random seed, at 177. A trajectory marked by circles is formed. The random seed is then changed to 1,997, and PRM #1 moves backward from 1,000 to 100. Another trajectory, marked by "X," is made.
Therefore, it can be predicted that if the contour plot rotates counter-clockwise, which implies that OM is relatively important, the trigram may perform better than the bigram due to its lower OM. Concerning the random seed, which represents a different choice of training context, it is shown that the smaller the training context, the more sensitive the model is to the selection of that context. For example, there is about a 4.5% difference in recognition rate for a bigram using different training contexts of size 100. When the size is increased to 370, the difference is within 1%.

5.3.2 Case II: A Hybrid Model of Bigram and Trigram and Concerns About Computational Cost

Problem Description

At present, n-gram models, especially bigram and trigram models, are recognized as effective LMs and have been widely used in speech recognition systems. As mentioned in Chapter 3, the ROC-LM Design Tool is installed with a hybrid of bigram and trigram, the BiTri-Gram. In this case, the same recognition task, an alphabet recognizer, and the same recognition set-ups are used to show the performance of the BiTri-Gram.

Figure 5.15: The comparison of the bigram and trigram performances.

Computational cost is often an important consideration in many applications. In this research, since the stack size is assumed to be infinite, the computational cost is reasonably assumed to be proportional to the size of the LM, that is, the total number of hypothesized sentences.
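The BiTri-Gram's scoring and pruning can be sketched as follows: where a trigram was observed in training, the score is a weighted combination of bigram and trigram log-probabilities; where only the bigram exists, the bigram term is adjusted by a penalty; a threshold then controls how many hypotheses survive, and hence |LM|. The function names, the probability floor, and the sign convention of the penalty are assumptions for illustration; the exact BiTri-Gram definition is given in Chapter 3.

```python
import math

def bitri_score(sentence, bigram_p, trigram_p, weight, penalty):
    """Hypothetical BiTri-Gram log-likelihood: a linear combination of
    bigram and trigram log-probabilities where the trigram was seen in
    training, and the bigram score with a penalty where it was not."""
    score = 0.0
    padded = ('<s>', '<s>') + tuple(sentence)
    for i in range(2, len(padded)):
        w2, w1, w = padded[i - 2], padded[i - 1], padded[i]
        bi = bigram_p.get((w1, w), 1e-9)  # floor for unseen bigrams
        tri = trigram_p.get((w2, w1, w))
        if tri is not None:
            score += weight * math.log(bi) + (1 - weight) * math.log(tri)
        else:
            # sign chosen so missing trigrams lower the score; the
            # dissertation's convention may differ
            score += math.log(bi) - penalty
    return score

def generate_lm(hypotheses, threshold, **kw):
    """The threshold controls |LM|: only hypotheses scoring at least
    -threshold survive, so a very large threshold keeps every sentence
    the bigram part admits."""
    return [h for h in hypotheses if bitri_score(h, **kw) >= -threshold]
```

With the threshold large enough, every bigram-admitted hypothesis survives regardless of the weight and penalty, mirroring the text's observation that the LM then "stays" at the same location on the ROC plane.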
In this design case, the BiTri-Gram uses a parameter, the threshold, to control the size of the generated LM. Suppose the threshold is set to ∞; then the size of the BiTri-Gram is exactly the size of the bigram, and their CMs are identical too. The reason is that when both the bigram and the trigram exist in the training context, the likelihood of the BiTri-Gram is a linear combination of the bigram and trigram likelihoods; when the bigram exists but the trigram does not, the BiTri-Gram likelihood equals the bigram likelihood plus a penalty. The same CM and the same LM size imply the same OM. In conclusion, when the threshold is infinite, no matter what values the weight (between bigram and trigram) and the penalty take, the CM and OM are always the same, and the LM "stays" at the same location on the ROC plane. Nevertheless, the p.d.f.'s of the grammar likelihoods, that is, μ_{g,0}, σ_{g,0}, μ_{g,1} and σ_{g,1}, may change, and the recognition rate may benefit from more "distinguishable" p.d.f.'s of the grammar likelihoods. Although the CM and OM remain the same, the recognition rate contour plot changes; a trajectory can be made by transforming the previous locations of the LMs to their new locations relative to the new contours. Refer to the discussion of the "relative motion of LM" in Chapter 3.

Design Procedures

1. As in Case I, ATI-46 is employed. According to the result of the previous case, 550 and 1,997 are chosen as the size of the training context and the random seed, respectively.

2. The hybrid of bigram and trigram, BiTri-Gram, is run with its three controllable parameters: the weight wg, the penalty pg, and the threshold tg, which are defined in the previous chapter. As before, after initialization and rescaling, the location of the BiTri-Gram and the contour plot are shown in Fig. 5.16. Notice that the newly estimated set { μ_{g,0}, σ_{g,0}, μ_{g,1}, σ_{g,1} } (call it the G-set) is directed to its corresponding field on the ESM-RR block: G-set = { 5.61, 1.44, 1.70, 1.71 }.

3. Now PRM #1, wg, is moved from 1.0 to 0.66.
After clicking on APPLY LM, the G-set changes to { 4.17, 0.97, 1.46, 1.16 }, but the CM and OM remain the same.

4. This is the illustration of the "relative motion." Since the G-set has changed, the contour plot needs to be re-estimated. By clicking on CONTOUR, the new contour plot is drawn and the "old" LM is transformed to a "relative position" with respect to the new contour plot. See Fig. 5.17.

5. It is shown that as wg moves from 1 toward 0, the recognition rate decreases, indicating that wg = 1 is the best choice. Then, the penalty pg is considered. By the same scheme, pg = 14.0 is found to optimize the recognition rate.

6. The contour plot of |LM| is drawn on the ROC plane (dashed lines); |LM| is given by Eq. 3.25. A BiTri-Gram is employed, starting with wg = 1.0, pg = 14.0, tg = 100.0 (representing ∞). Suppose there is an upper bound, say 2 × 10^6 |LT|, on the computational cost. As shown on the ROC plane, the BiTri-Gram does not meet this requirement. See Fig. 5.18.

7. Now the threshold tg is moved to reduce the OM. Move tg from 16.0, 13.2, 10.4, 7.6 through 6.2. It can be observed that the BiTri-Gram moves toward the criterion line and crosses it at tg = 7.6, as shown in Fig. 5.18. However, the price for the computational improvement is the decrement of the recognition rate from 86.2% to 83.4%.

Results and Discussion

To start with, concerning the hybrid of bigram and trigram and the relative motion: the experimental results show that when wg and pg are modified with tg = 100 (∞), the CM and OM remain the same while the G-set changes. Fig. 5.17 shows that when the G-set is changed, all contour curves propagate toward the northwest, and relatively the LM moves toward the southeast, which is the opposite of the desired direction.
Also note that since the whole contour plot does not rotate much, the three LMs lie on a straight line, which implies that each time the G-set changed, the recognition rates at each grid point decreased by the same amount, so that the "shapes" of the contour plots do not differ much.

Secondly, when the threshold tg is finite, it can be observed from Fig. 5.18 that decreasing tg decreases both the CM and the OM. In this case, since the tangent vector of the contour curves is flatter than the moving direction of the LMs as tg decreases, we conclude that decreasing tg results in a decrease of the recognition rate. It is interesting to note that, if the contour curves were steeper than the moving direction, a decrease in tg would improve the recognition rate. Also note from Fig. 5.18 that there is a big drop from the LM labeled "[2]" to the LM labeled "[3]." At this point, the moving direction is almost normal to the contour curves, which means the recognition rate is sacrificed for only a small improvement in computational cost.

Figure 5.16: After initialization and rescaling, the location of the BiTri-Gram and the contour plot are shown. Notice that the newly estimated G-set = { 5.61, 1.44, 1.70, 1.71 } is directed to its corresponding field on the ESM-RR block.

Figure 5.17: When the new contour plot is drawn, the old LM (marked "X") is transformed to a relative position with respect to the new contour plot.

Figure 5.18: Move the threshold tg from 16.0, 13.2, 10.4, 7.6 through 6.2. The BiTri-Gram moves toward the criterion line and crosses it at tg = 7.6. However, the price for the computational improvement is the decrement of the recognition rate from 86.2% to 83.4%.

5.3.3 Case III: Adaptive Language Modeling

Problem Description

In recent years, adaptive LMs have become popular among speech processing researchers [20], [23] because of their ability to adapt to the style or topics of a document. Assume that from a partial document, denoted by h, h ∈ LT, an adaptive unigram, bigram, or trigram distribution can be written as pd(w|h), pd(w'-w|h), or pd(w''-w'-w|h), respectively, where d denotes the dynamic model, w is any word from the vocabulary, w' is the previous word, and w'' is the word before that. The problem encountered in adaptive modeling design is that h usually does not contain enough sentences for training. Therefore, the "static" model is still needed to supply more information about word concatenations.
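A common way for the static model to supply what the sparse dynamic estimate pd(w|h) lacks is linear interpolation, as in the adaptive-LM literature cited above; the weight `lam`, the function names, and the count-based dynamic estimate below are illustrative assumptions, not the dissertation's exact scheme.

```python
def adaptive_prob(w, prev, dyn_bigrams, dyn_totals, static_prob, lam=0.5):
    """Interpolate a dynamic bigram p_d(w | prev), estimated by counts
    from the partial document h, with a static model trained on a large
    corpus.  For contexts unseen in h, the dynamic term is zero and
    only the static term contributes."""
    if dyn_totals.get(prev, 0) > 0:
        p_dyn = dyn_bigrams.get((prev, w), 0) / dyn_totals[prev]
    else:
        p_dyn = 0.0
    return lam * p_dyn + (1 - lam) * static_prob(prev, w)
```

Raising `lam` trusts the local context more, which is exactly the trade-off this design case sweeps by varying the sizes of the static and local training sets.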
Thus, how many training sentences from the static model are needed to optimize the performance again involves a trade-off problem. The design goal of this case is to find the optimal size of the training context for training the static model. The BiTri-Gram is still employed in this case, except that it is trained in a different way.

There are two sets of training sentences. One is the local context, TrainSet II, which is generated according to the pragmatic rules and is the one the LM intends to adapt to. The other is the static context, TrainSet I, which is generated by semantic rules only. The bigram part inside the BiTri-Gram is trained by both TrainSet I and TrainSet II, and the trigram part is trained by TrainSet II only. The ADS language, which is well defined from the lexical level to the pragmatic level, is employed to demonstrate how an adaptive LM benefits from the knowledge and classification of the TL.

Design Procedures

1. ATIMIT is employed. The sentence length of the TL, SNT LEN, is set to three. The TL size is set to 415, which is counted by the program according to the pragmatic generating rules of the ADS language.

2. The size of TrainSet I and the size of TrainSet II are the two controllable parameters of the program adaptive_train, which first trains the BiTri-Gram with TrainSet I and TrainSet II, then evaluates its CM and OM on TestSet II, which is generated according to the pragmatic rules.

3. PRM #2, representing the size of TrainSet II, is set to 0 at first. Then, PRM #1, the size of TrainSet I, is changed from 1000, 1900, 2800, 3700, 4600 through 5500. A trajectory is drawn in Fig. 5.19.

4. Then, PRM #2 is changed to 400 and the above sweep is repeated. Another trajectory is shown on the left side of the previous one. On this trajectory, when the size of TrainSet I is 3700, the recognition rate reaches its maximum.

Results and Discussion

From Fig. 5.19, with the assistance of TrainSet II, the whole trajectory is shifted toward the northwest.
Not only the CM but also the OM is improved, thus improving the recognition rate as well as the computational cost. Note that the first trajectory stands for a pure bigram. This experiment shows how knowledge from the pragmatic level can benefit recognition performance, and this is what adaptive LMs take advantage of to improve recognition.

Figure 5.19: PRM #2, representing the size of TrainSet II, is set to 0 at first. Then, PRM #1, the size of TrainSet I, is moved from 1000, 1900, 2800, 3700, 4600 through 5500. Then, PRM #2 is changed to 400 and the sweep is repeated. Another trajectory appears on the left side of the previous one. On this trajectory, when the size of TrainSet I is 3700, the recognition rate reaches its maximum.

Chapter 6

Conclusions and Future Work

6.1 Summary and Concluding Remarks

Converting a LM problem into a design task arises from the trade-off between coverage and over-generation, paralleling the hit rate and false-alarm rate in detection theory. Currently successful approaches to modeling language, such as n-gram models, context-free grammars, etc., cannot formalize this trade-off. The main objective of this research was the development of a systematic scheme for the design of a LM. The idea of the design scheme originates from ROC analysis in detection theory. Analogous to ROC analysis, the performance of a speech recognition task, in terms of recognition rate and computational cost, is expressed on a plot of coverage versus over-generation, resulting in a performance contour plot. On the other hand, a
LM can be parameterized by coverage and over-generation and occupies a spot on the two-dimensional plot. Thus, by adjusting certain parameters in the LM, designers can readily observe the motion of the spot and can "drive" the spot in the direction of better performance.

An X-window software package called the ROC-LM Design Tool was created to implement the proposed design scheme. To demonstrate how this tool functions, several standard design cases for LMs were performed. Simulation results show that the design tool is useful for both design and comparison purposes. The recognition rates can be calculated in virtually real time, enabling designers to observe the influence of each factor in the LMs on the resulting recognition rates and thereby achieve their design goals.

6.2 Contributions

The main theme of this study is the application of detection theory and the ROC technique to the speech recognition task. Emulating the methodologies developed in detection theory, the recognition rate is estimated from given prior knowledge, such as the statistics of the acoustic and grammatical evaluations, the length of uttered sentences in words, and the size and statistics of the TL. As a consequence, a general and important contribution of this work is that the overall recognition behavior is modeled by mathematical formulations. The goal of this work has been to open the door to new research directions. Other contributions of this research are summarized as follows:

1. Mathematical definition of all language modeling tasks.

2. Demonstrative evidence of the influence of all well-defined quantities on the recognition rate, given that the speech recognition process is mathematically modeled.

3. Implementation of the ROC-LM Design Tool to demonstrate the proposed design scheme.

6.3 Future Work

In further study, first, the CSR problem must be solved. The known-boundary assumption is very restrictive for current speech recognition systems.
A rough introductory idea for mathematically formalizing the CSR case is addressed in Appendix A.

Secondly, one of the significant aspects of this research is that a totally new viewpoint on the language modeling problem has been proposed. It is interesting to explore further measurements employed in detection theory, for instance detectability and sensitivity, and to apply them to the speech recognition task.

Thirdly, the adaptive LM has demonstrated its powerful ability to hone in on the particular sublanguage used in each article. The definition of a sublanguage requires certain prior knowledge, which should be classified as pragmatic knowledge. The tendency in speech recognition tasks is to employ not only phonetic and grammatical knowledge, but also higher levels of knowledge, such as those at the pragmatic level. It is challenging and interesting work to integrate the whole recognition behavior from the bottom (acoustic) level, through the grammar and semantic levels, to the pragmatic level, by utilizing the well-developed techniques of detection theory.

Finally, since the recognition performance has been formalized, ANN (Artificial Neural Network) engineers would be interested to know that the optimization function which the neurons seek to optimize has been well defined. The proposed idea derives from the fact that the ROC design tool cannot really find an optimal LM for a given recognition environment, but it can find the optimal values for the undecided parameters of an existing language model. Neural networks, which have very flexible structures with many undetermined weights and are equipped with many efficient training schemes, are the best candidates to optimize as initial models. In summary, research topics for future work are listed below:

1. Solution of the CSR problem.

2. Exploration of more techniques from detection theory to apply to language modeling.

3.
Integration of recognition behaviors, from the acoustic level through the pragmatic level.

4. Integration with Artificial Neural Networks.

Appendix A

The Case of Continuous Speech Recognition

To generalize the IWR system to the CSR system, some information about the segmentation performance must first be retrieved from experiments. From the experimental statistics, a stochastic model is created and the estimated recognition rate, which depends on the segmentation results, can be derived for the CSR system. Suppose a known-boundary acoustic symbol string, a = a_1 ... a_{b_1}, a_{b_1+1} ... a_{b_2}, ..., a_{b_{L-1}+1} ... a_{L_a}, is uttered, where L_a is the length of the acoustic symbol string. Let b_1, ..., b_{L-1}, with 1 < b_1 < b_2 < ... < b_{L-1} < L_a, denote the (L - 1) word boundaries, where L is the length of the sentence in words. Then, the segmentation performance can be measured by evaluating the acoustic likelihoods under all possible segmentation hypotheses. Define d as the difference between the true boundary and the hypothesized boundary. The p.d.f. of the acoustic score (likelihood) can be obtained by varying d. In the CSR case, the sentence length L is unknown, so every possible L is hypothesized before hypothesizing boundaries.

As a consequence, in the CSR case, the total number of possible hypothesized sentences is no longer |V|^L as in the IWR case but sum_{L=1}^{L_max} |V|^L, where L_max is the maximum sentence length. For each uttered sentence of length L, there are C(L_a - 1, L - 1) segmentation hypotheses in total.^12 Therefore, theoretically the recognition rate can be calculated from the total number and the p.d.f.'s of all hypotheses. In conclusion, the ROC language model design tool is feasible for the CSR case, but it requires a great amount of computation. More theory and experiments are needed to simplify the calculations. We leave this task to future work.

^12 This is computationally prohibitive unless done in a systematic way, such as the use of an HMM search algorithm.
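The hypothesis counts above can be checked numerically. The first function implements sum_{L=1}^{L_max} |V|^L directly; the binomial segmentation count assumes the boundaries are chosen as L - 1 of the L_a - 1 interior positions, which is this sketch's reading of the count, and the function names are illustrative.

```python
from math import comb

def total_sentence_hypotheses(vocab_size, l_max):
    """Total hypothesized sentences in the CSR case:
    sum over L = 1 .. L_max of |V|**L."""
    return sum(vocab_size ** L for L in range(1, l_max + 1))

def segmentation_hypotheses(l_a, l):
    """Ways to place L - 1 word boundaries among the L_a - 1 interior
    positions of an acoustic string of length L_a (an assumed reading
    of the segmentation count)."""
    return comb(l_a - 1, l - 1)
```

For the 26-letter alphabet task, L_max = 2 already gives 702 sentence hypotheses, which illustrates why footnote 12 calls exhaustive evaluation prohibitive without a systematic search.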
Bibliography

[1] J.P. Egan, Signal Detection Theory and ROC Analysis, New York: Academic Press, 1975.

[2] I. Selin, Detection Theory, Princeton: Princeton University Press, 1965.

[3] J.R. Deller, Jr., J.G. Proakis, and J.H.L. Hansen, Discrete-Time Processing of Speech Signals, New York: Macmillan, 1993.

[4] N. Chomsky, "On certain formal properties of grammars," Information and Control, vol. 2, pp. 137-167, 1959.

[5] M. Weintraub, H. Murveit, M. Cohen, P. Price, J. Bernstein, G. Baldwin, and D. Bell, "Linguistic constraints in Hidden Markov model based speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 699-702, 1989.

[6] H. Murveit and R. Moore, "Integrating natural language constraints into HMM-based speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 573-576, 1989.

[7] Y.L. Chow and S. Roukos, "Speech understanding using a unification grammar," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 727-730, 1989.

[8] K. Kita, T. Kawabata, and H. Saito, "HMM continuous speech recognition using predictive LR parsing," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 703-706, 1989.

[9] K.Y. Su, T.H. Chiang, and Y.C. Lin, "A unified framework to incorporate speech and language information in spoken language processing," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-185-I-188, 1992.

[10] J. Kupiec, "Hidden Markov estimation for unrestricted stochastic context-free grammars," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-177-I-180, 1992.

[11] L. Bahl, F. Jelinek, and R.L. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-5, pp. 179-190, 1983.

[12] J.M. Baker, "DragonDictate-30K: Natural language speech recognition with 30,000 words," Proceedings of Eurospeech-89, pp. 161-163, 1989.

[13] H. Cerf-Danon, S. De Gennaro, M. Ferretti, J. Gonzalez, and E. Keppel, "TANGORA - A large vocabulary speech recognition system for five languages," Proceedings of Eurospeech-91, pp. 183-192, 1991.

[14] G. Maltese and F. Mancini, "An automatic technique to include grammatical and morphological information in a trigram-based statistical language model," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-157-I-160, 1992.

[15] P. Garcia, E. Vidal, and F. Casacuberta, "Local languages, the successor method, and a step towards a general methodology for the inference of regular grammars," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 841-845, 1987.

[16] H. Rulot, N. Prieto, and E. Vidal, "Learning accurate finite-state structural models of words through the ECGI algorithm," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 643-646, 1989.

[17] F. Casacuberta, "Some relations among stochastic finite-state networks used in automatic speech recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 691-695, 1990.

[18] N. Prieto and E. Vidal, "Automatic learning of structural language models," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 789-792, 1991.

[19] E. Giachin, "Automatic training of stochastic finite-state language models for speech understanding," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-173-I-176, 1992.

[20] R. Lau, R. Rosenfeld, and S. Roukos, "Trigger-based language models: A maximum entropy approach," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. II-45-II-48, 1993.

[21] M. Meteer and J.R. Rohlicek, "Statistical language modeling combining n-gram and context-free grammars," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. II-37-II-40, 1993.

[22] L.G. Miller and A.L. Gorin, "A structured network architecture for adaptive language acquisition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-201-I-204, 1992.

[23] S. Della Pietra, V. Della Pietra, R.L. Mercer, and S. Roukos, "Adaptive language modeling using minimum discriminant estimation," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-633-I-636, 1992.

[24] C.C. Chiu, Large-Vocabulary Continuous Speech Recognition Using Partitioned Graph Search, Ph.D. Dissertation, Department of Electrical Engineering, Michigan State University, East Lansing, 1993.

[25] K.F. Lee, Automatic Speech Recognition: The Development of the SPHINX System, Boston: Kluwer Academic Publishers, 1989.

[26] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, 1989.

[27] G.R. Doddington and T.B. Schalk, "Speech recognition: Turning theory to practice," IEEE Spectrum, pp. 26-32, 1981.

[28] T.B. Schalk, "The design and use of speech recognition data bases," Proceedings of the Workshop on Standardization for Speech I/O Technology, pp. 211-214, March 18-19, 1982.

[29] "Getting started with the DARPA TIMIT CD-ROM: An acoustic-phonetic continuous speech database," National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, 1988.