11r1rbi< 1. > r~i‘.—._

:vg.

may

'_V
. .-

‘ J1‘I’rlﬁ4

 

g... n.‘

"F‘Kf.

mf

”‘3
‘1‘: fxmv‘

1 .
,mm-~.

r. .\m‘1\"‘~-

”Ural - )1

(a z“!
- . V
J;-. “‘ 1:].

I
('7‘, Anna... 1,.
.5" ~14 q-Ji. "gt?“ #14:: v
(if! r:-_’ I I".

 

1 v
vfr;?-—_..-?._ . ”2f.
'“ ILL“. "-1" itwyu-W‘TY M w:
2‘15! ..
".'rru‘-;!;r.-.r~--M

‘ v

.5. .-—-'- ”0,, an‘k u»

5’.

Virivérx.
'"' zi'xxe'EI-é”

',.. .7...-
pxr" a *R':'f£-I
v" “pl-var?!“ r: 'rr'
no y: w.,..—.,.,.,..., 1:11;;
' . '."‘.:' 1.11.}.
:2"..- .- if”:

my
....
..',. '1r,£.:;:.:.-,.;.':‘

1..
an“. .
'x'wv

391;?w-M er.» 2‘» av
mvrrvz‘w‘ _

‘11“:4"

1-er ~-

.
,.
.5 1:1: 573:?“ 1.._

:»‘L’.-.-.,':‘
.7) .

. .., o., .
w
. l-«l . 1—1.

1-? . 51‘s."??- ,
. ~,..
1213...:
'1.13:.....'....w.....':'W
. 1 1 1 ..1:..-,.. ....,..
.;.'....v " r , . 11:51.1; 15.4.3.3?
.- ~, .. W.- -117“
'9 w- “am?”
1.-.. .

I."

m-ﬂw'vwﬁ.‘

.

.. if” 6
'In a'n’tb’gp-
LJ .751!

Jib-u".

tt‘a - .‘v
u".

 

£21.; 0.; .2“? ~..'.‘...

«rug; In; “5"“?!-
723::

a:

‘4?-
N

MALI}. ’_
":1";me “a?” ..
‘51:! .311! .4. :55," .1.
It...”
... W...
..

,
5......1”, .'.,...
gun“. ~- ELr11; -.

5%....”qém 1.1-.

~.—...~rﬂ:r.r:¢wfﬂ We

 

MICHIGAN STATE E SITY LIBRAR S
i ’H

1: Hum numui’i‘iﬂm :muuuw‘iuu

3 1293 00876 4932

I!

 

 

 

 

 

 

 

 

This is to certify that the

thesis entitled

THE WHITAKER DATABASE 0F DYSARTHRIC SPEECH:
Creation and Baseline Recognition Study

presented by

Ming—Shou Liu

has been accepted towards fulﬁllment
of the requirements for

 

Master's degreein Electrical
Engineering

 

Major professor

 

DateW?L

0-7639 MS U is an Afﬁrmative Action/Equal Opportunity Institution

 

LIBRARY
Michigan State
University

 

 

 

PLACE IN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before due due.

DATE DUE DATE DUE DATE DUE

      

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CWMM 1

ﬁ-———___

THE WHITAKER DATABASE
OF
DYSARTHRIC SPEECH:

Creation and Baseline Recognition Study
By

Ming-Short Liu

A THESIS

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

MASTER OF SCIENCE
Department of Electrical Engineering

1991

ABSTRACT

THE WHITAKER DATABASE
OF
DYSARTHRIC SPEECH:

Creation and Baseline Recognition Study
By

Ming-Shou Lin

This research represents the culmination of a three year project sponsored by the
Whitaker foundation of which the primary purpose was to conduct research related to
the development of a PC-based isolated word recognition (IWR) system for persons
with severe motor and speech disabilities. This dissertation describes three aspects

of the final stages of the work:

1. The creation of an isolated word database of dysarthric speech (Whitaker Database

(WD)) which is publicly accessible over the internet computer network.
2. A baseline recognition study on the WD using a hidden Markov model approach.

'3. Formulation of an IWR system concept and plans for its development and future

enhancements.

To my Mother
A-Chu Yang

For her love, support and sacriﬁce

ii

ACKNOWLEDGMENTS

First and most important, I would like to thank my advisor John R. Deller, Jr. for his
patience and support in spite of his busy schedule. His direction was very important
in helping me step into the Speech Processing world.

Secondly, I would like to thank all the members of my thesis committee: Dr. B.
Ho, Dr. Roland Zapp, Dr. Bon K. Sy of Queens College of the City University of
New York, and Dr. John R. Deller, Jr. I would also like to thank Dr. Linda Ferrier of
Northeastern University for her permission to use her report in Section 2.2.2 on the
various dysarthric speakers.

Many recognition procedures and utilities coded by Ross K. Snider were helpful
in the development of this thesis. I would also thank my friend Pei-Chun Chen for
her great encouragement and support.

I gratefully acknowledge the ﬁnancial support provided by a grant from the Whitaker
Foundation and the collaboration with the Speech and Language Pathology and Au-

diology department at Northeastern University, Boston.

iii

Contents

1 Introduction and Background
1.1 The Purpose and Signiﬁcance of this Research .............

1.2 Previous Work and Relation to the Current Project ..........

2 Collection and Creation of the Whitaker Database

2.1 Introduction to the Database ......................
2.1.1 Data Acquisition System .....................
2.1.2 Composition of the Whitaker Database (WD) .........

2.2 Summary .................................
2.2.1 Characteristics of the Whitaker Database ............
2.2.2 Characteristics of the Speakers .................

2.3 How To Access the Database .......................

3 Technical Description of the System

3.1 Feature Extraction ............................
3.2 Vector Quantization (VQ) ................ ‘ ........
3.3 The Hidden Markov Model (HMM) ...................

4 Speech Recognition Experiments

4.1 Size of the Codebook ...........................
4.2 HMM Structure ..............................
4.3 Acoustic Parameterization ........................
4.4 Silent Portion Extraction .........................
4.5 Number of Training Utterances .....................
4.6 Number of States in the HMM ......................

iv

CDQDQOCDCJ‘UU‘

11

14
14
14
15

17
17
19
20
22
23
24

The Prototype IWR System for Dysarthric Speech

Conclusion
6.1 Summary .................................

6.2 Future Work .................. . ..............
Experimental Result for Speaker DC

Program Listing: LP Parameter Generating Program

Program Listing: Cepstral Parameter Generating Program

Program Listing: Codebook Generating Program

26

29
29
30

36

42

49

57

List of Tables

macaw

10

11

12

13

Grandfather word list. .......................... 9
Results of different codebook sizes using the TI-46 database. ..... 18
Results of different codebook sizes using the Grandfather database. . 18

Recognition performance with different models using the TI-46 database. 19
Recognition performance with different models using the Grandfather
database. ................................. 20
Vocabulary for the comparison of WRLS and autocorrelation methods
of LP parameter computation. These words are not in the WD for
reasons explained in the text. ...................... 21
Recognition results comparing WRLS and autocorrelation methods of
computing LP parameters ......................... 21
Recognition results comparing LP and mel-cepstrum using the TI-46
database. ................................. 21
Recognition results comparing LP and mel-cepstrum using the Grand-
father database ....................... ' ........ 21
Effect of silent portion extraction on recognition performance using the
TI-46 database. .............................. 23
Effect of silent portion extraction on recognition performance using the
Grandfather database. .......................... 23
Effect of number of training observation sequences on recognition using
the TI-46 database ............................. 24
Effect of number of training observation sequences on recognition using

the Grandfather database. ........................ 24

vi

14

15

Effect of number of states in HMM on recognition performance using

the TI-46 database .............................

Effect of number of states in HMM on recognition performance using

the Grandfather database. ........................

vii

25

List of Figures

1 Equipment setup for sampling ....................... 6
2 Frequency response of anti-aliasing ﬁlter ................. 7
3 Directory structure of Whitaker Database in the computer network. . 12

viii

1 Introduction and Background

1.1 The Purpose and Signiﬁcance of this Research

Many signiﬁcant advances have been achieved in both speaker-dependent and speaker-
. independent speech recognition in the past three decades, (see, e.g. [1, 2, 3, 4, 5, 6,
14, 15, 23, 31]). Most research, however, has been concerned with the recognition of
normal speech. The difficult problem of applying speech recognition technology to
assisting persons with speech disabilities to communicate effectively is still an open
issue for researchers, as indicated by the small amount of literature on the topic and
the relatively small number of systems available to users (e.g. ~ANTIC [10], CDC
[11])-

The inability to speak and write can be caused by a number of neuromuscular
diseases, such as cerebral palsy (CP), aphasia, amyotrophic lateral sclerosis (ALS),
multiple sclerosis (MS), Parkinson’s disease, muscular dystrophy, laryngectomy, and
others [33]. In this study we have focused upon the CP population which comprises a
signiﬁcant proportion of the total population of profoundly speech and motor disabled
persons. CP is a prevalent condition, present in approximate one of every 330 live
births [32]. Anyone working with these people has observed that many individuals
persistently try to express their needs and feelings vocally, even though many attempts

may fail. However, due to the difﬁculty of controlling their articulator movements

and voicing in uttering messages, it is frequently impossible for them to produce
intelligible and ﬂuent continuous speech. The goal of this study in a general sense is
to adapt existing electronic technology to devices which will assist such persons to
express ideas and feelings, to have normal social lives and interpersonal interactions,

and to function in the mainstream of society.

1.2 Previous Work and Relation to the Current Project

This work represents the culmination of a three year research effort sponsored by the
Whitaker Foundation of which one of the subgoals was to conduct research related to
the development of a PC-based isolated word recognition (IWR) system for severely
dysarthric speech. Previous work on this project has been reported in the papers
of Sy, Hsu, Deller et a1. [4, 5, 12, 30, 31]. In particular, Hsu’s thesis research was
concerned with the development (on a mainframe system) of hidden Markov model
(HMM) [25] based IWR software, and its testing using a 200 word database collected
from an moderately dysarthric (cerebral palsied) individual, and a digit (10 words)
study involving four other persons whose speech spans a spectrum of dysarthria [7, 12].

The subsequent work of Snider [6, 28] and this author has generally been concerned
with scaling down Hsu’s software to operate on a reasonably ordinary personal com-
puter (PC) in real time, and with extensive testing of the resulting algorithms. In this
process, we have made a point of carefully collecting and organizing a large database
of isolated word dysarthric speech (Whitaker Database (WD)) with which to test the
system. The WD has been made publicly accessible to other research centers over
the internet computer network. Whereas Snider’s work was principally concerned
with scaling and programming the PC-based software, and with developing sampling

and editing software for manipulating the new data, this author has been chieﬂy con-

cerned with the creation of the database, and with the testing and enhancement of
the recognition software. The result has been the completion of enhanced, ﬂexible
PC-based IWR software which can now be tested “in the ﬁeld” in conjunction with
a system concept to be described.

Accordingly, this thesis consists of three parts which describe the three research

components noted above:
1. Creation and distribution of the WD,
2. Execution of a baseline recognition study using HMM-based software, and

3. Reﬁnement of the PC-based HMM IWR software for dysarthric speech, and

development of plans and strategies for its distribution and testing.

We note that two speciﬁc engineering developments from previous work will be
used in this thesis. They are an algorithm due to Deller and Hsu [4] and Deller and
Snider’s diagonalization strategy [6, 28]. The ﬁrst implementation of the recognition
software was developed by Hsu in his doctoral work [12]. A fast and simple adaptive
Weighted Recursive Least Square (WRLS) algorithm was derived for the purpose of
feature extraction at the acoustic level. This algorithm enjoys a small improvement
in computational complexity over the conventional WRLS algorithm. The adaptive
method also provides several useful by-products in the context of the recursion which
the conventional one usually does not have [5, 8]. In the word-level processing, an
enhanced HMM based approach was developed to operate under the constraints of
having highly variable speech as well as a lack of statistical information about the
speech. .

The second engineering development from previous work is as follows: Given sev-

eral HMM’s and the observation sequence 0, we need to choose the word model which

has the highest likelihood P(O|M) [25]. A frequently used algorithm to evaluate the
HMM for maximum likelihood criterion is the Baum-Welsh “F orward-Backward” pro
cedure [25]. The Forward-Backward procedure generally requires 0(N2) operations
per observation for an N state, fully connected, HMM. Deller and Snider [6, 28] found
that the number of calculations can be reduced to 0(N) by diagonalizing the matrix1
A in the HMM. All the evaluation work in this thesis is based on this diagonalized

matrix.

 

1A = {(1.3} is called the state transition probability distribution,

2 Collection and Creation of the Whitaker Database

2.1 Introduction to the Database
2.1.1 Data Acquisition System

The utterances were spoken by 6 speakers and recorded on TDK type II tape cas-
settes. A TEAC W-450R stereo cassette deck with Dolby-C noise reduction was used.
The recording took place in the Department of Speech and Language Pathology and
Audiology at Northeastern University in Boston and was supervised by Dr. Linda
J. Ferrier, Assistant Professor in that department. All data were recorded in an
acoustically isolated booth.

The recordings were played back using a duplicate TEAC tape deck and then sam-
pled in the Speech Processing Laboratory in the Department of Electrical Engineering
’ at Michigan State University. The MetraByte “STREAMER” data acquisition sys-
tem was used to facilitate the sampling. The equipment setup for sampling is shown
in Fig. 1. The filter used is an active bandpass2 fourth order Butterworth ﬁlter with
a IOWpass cutoff frequency of 4.7kHz (the sample rate is lOkHz). The frequency re—
sponse of the filter is shown in Fig. 2. A MetraByte DASI6F 12-bit analog to digital
(A/ D) conversion board was set to accept a signal with dynamic range of :10 volts.
To make certain that the input to the A / D board did not exceed the dynamic range
of the board, the input signal was monitored with an oscilloscope and the gain of the
amplifier adjusted appropriately. Data are stored in 16 bit records, one per sample.
Encoded in the 16 hit record are 12 bits of measurement data and 4 bits that specify

the channel.

 

2This ﬁlter is effectively lowpass for speech which rarely contains signiﬁcant energy below about
75 Hz.

 

 

Back of NAD stereo ampliﬁer 3130

 

 

 

 

 

 

 

 

 

 

 

 

R L
A o 53
B O O O
_l
P S l
ower up p y Butterworth ﬁlter Terminal Accessory Board
-15v gnd +15v 4') all”;
69 0 GD & Signal in .2 CHM/é”? ”‘
‘ Filtered signal out
>-@ .a
a»:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

.. Q @
-'§v +liv

 

 

 

HP 54201D
Oscilloscope

l’l’

 

 

 

 

 

 

 

 

 

 

 

 

Figure 1: Equipment setup for sampling.

(I!)

4th

Order Bondposs Butterworth Filter

 

20

15-1

10"

DOD

Cl C] UCJUCICD

 

 

 

 

-5...

-10 ..

 

L3

 

 

 

 

Log(Frequency in Hz)

. J I

3.8

Figure 2: Frequency response of anti-aliasing ﬁlter.

4.2

At the beginning of the project, the sampling process was carried out word by
word. That is, we located the word on the audio tape, made a ﬁle for it and then sam-
pled the word. It took about 90 seconds per word to complete this task. This method
is time-consuming and unrealistic because 17,895 words needed to be processed. In
order to solve this problem, Snider wrote a program called “Wavemark”. With this
routine, utterances from an entire cassette tape can be sampled then stored in a large
ﬁle (about 30 Mbytes). “Wavemark” can then also be used to extract the words from
the large ﬁle with a screen editing facility [28]. This procedure reduces the per word
processing time by a factor of about 10. The program also has a provision for playing

back (audio) any selected portion of an utterance. Details are found in [29].

2.1.2 Composition of the Whitaker Database (WD)

The word sets in the WD are partitioned into the T146 word list and Grandfather
word list. These word list were selected for the WD to provide one partition of
vocabulary which has become “standard” in speech recognition studies, and one which
is signiﬁcant for its speech science attributes.

There are 46 words in the TI-46 word list. They are utterances of the 26 letters of
the alphabet, 10 digits (zero to nine) and 10 the words “start”, “stop”, “yes”, “no”,
“go”, “help”, “erase”, “rubout”, “repeat” and “enter”. This word list is suggested as
a standard by Texas Instruments [9]. The Grandfather word list consists of 35 words
which are shown in Table 1. The set is called “Grandfather” because it was taken
from a passage commonly used by speech therapists which begins with the sentence
“Let me tell you about my grandfather . . . ”. These words are chosen by Dr. Ferrier
due to their phonetic diversity [13].

There are 27 cassette tapes in the Speech Processing Laboratory. Each word in the

TI-46 and Grandfather word list was uttered 30 to 45 times by one of the six speakers.

 

 

 

missing several to well thinks long
my old you ever an frock
coat usually still he dresses about
years is wish know himself buttons
all grandfather as swiftly black beard
in yet nearly clings ninety-three

 

 

Table 1: Grandfather word list.

Each utterance of each word ultimately became a distinct ﬁle. The collection of all
the ﬁles sampled from these tapes comprise the Whitaker Database (WD). Each ﬁle
consists of integer samples with dynamic range from -2048 to +2048. All the ﬁles are
ASCII with <CR><LF> after each integer.

2L2 Sununany
2.2.1 Characteristics of the Whitaker Database

e The vocabulary sets in the WD are the T146 and Grandfather word lists indi-
cated in Table 1.

e There are 17,895 ASCII ﬁles in the database, each ﬁle represents an utterance
of a single word. The end points of each word were detected by hand using the

“wavemark” utility described above.

e Each sample point in each ﬁle is represented by an integer and is followed by

<CR><LF>.

e The dynamic range of the integer samples is from -2048 to +2048.

2.2.2 Characteristics of the Speakers

The following clinical assessments of the speakers are taken from a report by Dr.

Linda J. Ferrier, Assistant Professor of Speech and Language Pathology and Au-

9

diology, Northeastern University, Boston. Dr. Ferrier is the clinical consultant to
this project. She also received Whitaker funding to support her interaction with the
subject population, analysis of data from a clinical perspective, and writing clinical
assessments of the subjects’ speech and language disorders. The author appreciates

Dr. Ferrier’s permission to use the following descriptiOns3:

1. Speaker DC is a 48-year-old male with a diagnosis of spastic athetoid cerebral
palsy (CP). His intelligibility is mildly impaired, and his voice has a typical

strained-strangled quality and consonants are imprecise.

2. Speaker CJ is a 41-year-old male with a diagnosis of athetoid CP. His intelligi-
bility is moderately impaired but decreases with fatigue. Speech characteristics
include slow rate of speech, imprecise consonants, vowel distortions, and little
variation in pitch or loudness. He is consistently over-loud. Vowel distortions

appear to be caused by deviation of the mandible to the left.

3. Speaker LE4 is a 40-year-old male with spastic CP with dysarthria. His intelli-
gibility is in the moderate to severely impaired range, speech is slow with little
variation in loudness or pitch. He has particular difﬁculty with transition from
one consonant to the next in consonant clusters. and some difﬁculty initiating

sounds and dysﬂuencies often occur at the beginning of the words.

4. Speaker ED is a 39-year-old male with spastic CP. His intelligibility is mildly to
moderately impaired, and his voice shows occasional pitch breaks, inappropriate
nasality, and he is occasionally dysﬂuent. He has poor breath support for speech.

His amplitude is low and there is little variation in pitch or loudness.

 

3Our subjects all have some form of cerebral palsy, but there is nothing speciﬁc to this disorder
in our work.
4LE is the main speaker in Hsu’s previous work [12].

10

5. Speaker LL is a 47-year-old male with quadriplegic CP, mixed spastic/ataxic.
His intelligibility is severely impaired, utterances are short, consonants are im-

precise.

6. Speaker PW is a 28-year-old male with severe athetoid CP. His intelligibility is
severely impaired, consonants and vowels are extremely distorted, and loudness

is extremely variable.

2.3 How To Access the Database

The Whitaker database is accessible through the internet computer networks. The
database can be obtained from a MSU ﬁle server through anonymous ftp. The
database is in the subdirectory “speech” under the directory “pub”. Six subdirec-
tories “DC”, “CJ”, “LE”, “BD”, “LL”, “PW” are in the database, the directory
structure is shown in Fig. 3. The ﬁle naming convention is as follows: “coat.0502”
means this word is the ﬁfth utterance of the word “coat”. The last two digits in the
ﬁle name are used for internal grouping, and the user may ignore them. Files are
compressed so that they will not require excessive space. All the ﬁles (utterances)
uttered by a single speaker are “tarred” together so that only one instruction can be
used to obtain all the ﬁles in a tape. The name of the “tarred” ﬁles are “t.tar” and
“g.tar”, where “t” means the TI-46 word list and “g” means the Grandfather word
list.

In summary, the steps for accessing the WD are as follows:

1. ftp archive.egr.msu.edu, both the login name and password are anonymous

 

5The method of accessing the speech data is subject to change due to periodic changes in the
computing facilities. Please contact the author by e-mail if there is any problem in accessing the
data. Electronic mail addresses are: lium©frith.egr.msu.edu or deller©eecae.ee.msu.edu

11

speech

/\ /\ /\ A /\/\..

Liar g.tar ttar g.tar LtaI' g.tar Liar g.ta.rt.tar g.tar

Figure 3: Directory structure of Whitaker Database in the computer network.

12

. cd pub

. cd speech

. cd DC if you are interested in the speaker DC.

. get t.tar if you are interested in TI-46 database.

. tar va t.tar, this is to “untar” the ﬁle. “x” means extract.

13

3 Technical Description of the System

A wide variety of approaches to the recognition of human speech has been proposed
in the past three decades. In this chapter, we brieﬂy describe the techniques which
were applied in this research. Details of the underlying technical methods can be

found in many references (e.g. see [3]) and are not further addressed here.

3.1 Feature Extraction

To extract a feature one has to look at a small segment or frame of speech. We deﬁne
a frame of speech to be the product of a shifted window with the speech sequence.
For a sample rate of lOkHz, we use a Hamming window of length 256, which is shifted
50 samples for each feature computation.

Two types of vector features are employed in this work:

Linear prediction (LP) [16, 19] has been applied extensively in parameterizing
speech samples. Here we use 14‘” order LP parameters resulting from the autocor-
relation method, where L-D recursion [3, 22, 26] is used to solve the autocorrelation
equation. A computer program which computes the LP parameters is found in Ap-
pendix B.

The other method applied to parameterize the waveform is mel-cepstral analysis.
[3, 21]. We use a 10‘“ order cepstrum produced using a 1024 point FFT [24]. A

computer program for this approach is found in Appendix C.

3.2 Vector Quantization (VQ)

The recognition approach taken here is based on discrete symbol hidden Markov mod-
els (HMM) of isolated words. Accordingly the observations used are discrete symbols

chosen from a ﬁnite set. A vector quantizer is required to map each continuous

14

parameter vector into a ﬁnite integer index [17, 25].
Two distance measures are used to measure feature similarities in the VQ process.

In the LP case, we use the Itakura’s distance [26]:
aRa"
d “ = — - 1
(at a) log [amt] ( )

Where a is a reference LP vector6 and a is an estimated LP vector7.

Unlike LP coefﬁcients, the cepstral parameters may be interpreted as coefﬁcients
of a Fourier series expansion of the periodic log spectrum. Accordingly, they are
based on a set of orthonormal functions; thus we can simply choose the Euclidean
distance between mel-cepstral vectors as the distance measure [3].

An 128 symbol codebook was developed using the n-means algorithm. In our
work, we employ the binary clustering approach, i.e., we let It = 2. A computer
program to geneate the 128 symbol codebook is found in Appendix D. The binary
structure has the advantage that it reduces the number of searches from L to log2 L

[12, 17], where L is the number of symbols in the codebook.

3.3 The Hidden Markov Model (HMM)

The hidden Markov model (HMM) has been used in automatic speech recognition
successfully in recent years for modeling speech waveforms [4, 7, 12, 14, 15, 23, 28] at
various acoustic levels (word and subword) as well as for modeling languages. Some
computationally efﬁcient algorithms have been developed in the previous work by
Snider to evaluate the likehood of the HMM. A major advantage of using an HMM

in the problem of dysarthric speech recognition is that the HMM is a stochastic

 

6In the present case, an entry in the codebook.
7In the present case, derived from a frame of speech.

15

modeling approach which can automatically handle the “large” variability in speech

for recognition purposes.

16

4 Speech Recognition Experiments

In this chapter, we focus on a baseline recognition study on the WD using a hidden
Markov modeling approach in an effort to learn more about the characteristic features
of dysarthric speech which affect recognition performance. If not specially stated, we
use eight utterances for training, seven for testing, a 128 symbol codebook, and six
state Bakis HMMs. The percentage given is the ratio of the number of correctly
recognized words to the total number of testing words. An example comprehensive

experimental result for speaker DC is found in Appendix A.

4.1 Size of the Codebook

Since the recognition system is based on discrete symbol HMMs of isolated words,
a vector quantizer is required to map each continuous parameter vector into a ﬁnite
integer index. The number of indices (code vectors) used should correspond to the
number of meaningful clusters in the feature vectors in the population. Very roughly
speaking, these codes (clusters) represent distinct acoustic tokens. If too few are
used, many dissimilar features will be quantized into the same token. If too many are
used, superﬂuous and ambiguous codes exist. Either situation potentially degrades
performance. With normal speech typically 64-128 codes provide good performance
[for speaker independent recognition. The following experiments were implemented to
determine whether fewer codes would improve recognition of dysarthric speech, under
the hypothesis that fewer acoustic tokens may exist in some speakers’ utterances.
Experimental results given in Table 2 for the TI—46 database and Table 3 for the
Grandfather database show the effects of different size codebooks on the recognition
rate. A quick glance at the recognition rate would seem to indicate that a larger

codebook is better. Closer inspection reveals that this is not always the case. Large

17

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

16 symbol 128 symbol
correct top 2 top 4 correct top 2 top 4
Speaker DC 72.05% 82.92% 90.68% 89.13% 95.96% 98.45%
Speaker CJ 81.06% 92.24% 97.52% 81.99% 92.24% 95.96%
Speaker LE 58.39% 70.50% 84.16% 68.94% 81.06% 90.37%
Speaker BD 76.09% 88.51% 95.65% 77.02% 90.68% 96.27%
Speaker LL 47.21% 65.53% 80.43% 56.83% 73.60% 83.85%

 

Table 2: Results of different codebook sizes using the TI-46 database.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

16 symbol 128 symbol
correct top 2 top 4 correct top 2 top 4
Speaker DC 90.61% 98.78% 98.78% 91.43% 97.55% 99.59%
Speaker CJ 61.63% 84.75% 91.02% 86.53% 95.51% 97.55%
Speaker LE 73.06% 84.49% 93.06% 74.29% 83.67% 93.06%
Speaker BD 68.98% 82.04% 93.06% 79.18% 90.61% 93.47%
Speaker LL 57.55% 76.33% 84.90% 60.00% 74.29% 85.31%

 

 

 

 

 

Table 3: Results of different codebook sizes using the Grandfather database.

codebooks do not work as well for the discriminating vowel sounds. For example,
the recognition of the dipthong /ai/ (utterances of letter “a”) for small codebooks
(16 symbols) is better than that for a large codebook (128 symbols), but a larger
codebook is necessary for the fricatives. For example, the utterance / pi / (letter “p”)
is frequently recognized as / i / (letter “e”) if the 16 symbol codebook is used.

The reason why a large codebook does not work for the vowel case is generally ex-
plained as follows: dysarthric speakers have difﬁculty in controlling their articulators,
and multiple symbols in the codebook which are close “acoustically” can accordingly
represent the same vowel sound. Increasing the number of symbols will not increase
the recognition rate. In fact, as noted above, too many symbols made degrade per-
formance. However, fewer symbols do not provide the acoustic diversity necessary to

represent frictives, for example.

18

 

 

 

 

 

 

 

 

 

 

 

 

ergodic model left-to-right model
correct top 2 top 4 correct top 2 top 4
Speaker DC 84.47% 92.86% 97.51% 89.13% 95.96% 98.45%
Speaker CJ 80.75% 90.68% 94.10% 81.99% 92.24% 95.96%
Speaker LE 64.60% 76.09% 89.13% 68.94% 81.06% 90.37%
Speaker BD 70.81% 83.54% 92.86% 77.02% 90.68% 96.27%
Speaker LL 55.90% 68.94% 81.37% 56.83% 73.60% 83.85%

 

 

 

 

 

 

 

 

Table 4: Recognition performance with different models using the TI-46 database.

4.2 HMM Structure

One of the important factors that was found to greatly affect the recognition rate is the
HMM model structure. In this study, two types of model structure were considered,
the “ergodic” and “left-to-right (Bakis)” model [25]. For IWR in which (at least)
one HMM is designated for each word in the vocabulary, it should be clear that a
left-to-right model is more appropriate than an ergodic model, since time and model
states are associated in a natural manner [25]. In addition to the property that the
state transitions always occur from left to right in the Bakis model, an additional
constraint is placed on the state transition coefﬁcients to make sure that “large”
changes in state indices do not occur. That is, a;,- = 0 if j > A. In our system, we
take A = 2. Experimental results in Table 4 and Table 5 also show that the Bakis
model yields better performance than the ergodic model.

This result is contrary to Hsu’s ﬁndings. Hsu found the ergodic structure to be
slightly preferable to the Bakis structure [7, 12]. His results, however, were based on a
digit database collected from speaker LE. In fact, if we examine only the recognition
rate of digits for speaker LE, ergodic structure and Bakis structure produce the same
recognition rate in this study as well. Thus, we could conclude that although the
Bakis model is intuitively more appropriate for normal speech, the choice of Bakis vs.

ergodic model in the dysarthric case may be vocabulary and speaker-dependent.

l9

c to- t
correct top top correct top top
.5 95. 97.55 1. 97.55 99.
82.04 90.61 95.51 86.53 95.51 97.55

71. 81. 89. 74. 83. 93.06
BD 68.5 84.08 93.06 79.18 90.61 .4
LL 55.1 70.61 81. 60. 74. 85.31

 

Table 5: Recognition performance with different models using the Grandfather
database.

4.3 Acoustic Parameterization

The choice of parametric vector representation for the acoustic waveform is an im-
portant factor in automatic speech recognition. We have used weighted recursive
least squares (WRLS) estimation (with weights chosen to implement a forgetting fac-
tor [4, 12]) and autocorrelation methods to compute LP parameters, and cepstral
analysis. Results in Table 7 show that there is no signiﬁcant difference between the
WRLS and autocorrelation LP methods. In this experiment we use ﬁve utterances
for training and ﬁve for testing. The words which were used to compare the WRLS
and autocorrelation methods are shown in Table 6 which consists of 40 words spoken
by speaker LE. These words are a subset of a 200 word database which is reported in
the Ph.D. dissertation of Hsu8 [12]. From Table 8 and Table 9, we see that the ex-
periments show that the mel-based cepstrum signiﬁcantly improves the performance
with respect to the LP case. This result is consistent with a ﬁnding of Davis and

Mermelstein [2] on IWR of normal speech”.

 

8Except. for the window employed, the WRLS and autocorrelation methods are nearly equivalent
procedures. These experiments were performed prior to the creation of the WD as a quick check of
the expected similarity of performance between the two methods.

9The result of Davis and Mermelstein was based on dynamic time warping

20

 

 

 

 

a american about becomes
bicycle calculus child doesnt
drink enough existed from
father gauge go has
home in just knows
landmark muscle movies notion
never old opinion paycheck
problem question rattle shaky
sounds topic today tell
usually vote who with

 

 

 

 

Table 6: Vocabulary for the comparison of WRLS and autocorrelation methods of
LP parameter computation. These words are not in the WD for reasons explained in

 

 

 

 

 

the text.
ergodic model Bakis model
correct top 2 top 4 correct top 2 top 4
WRLS 54.00% 65.00% 74.00% 59.50% 67.00% 77.00%
autocorrelation 56.00% 68.00% 74.00% 59.00% 71.00% 80.50%

 

 

 

 

 

 

 

 

 

 

Table 7: Recognition results comparing WRLS and autocorrelation methods of com-

puting LP parameters.

 

LP

Mel- Cepstrum

 

correct

top 2

top 4

correct

top 2

top 4

 

Speaker DC

84.78%

93.79%

 

98.14%

89.13%

95.96%

98.45%

 

Speaker CJ

78.88%

89.44%

94.10%

81.99%

92.24%

95.96%

 

Speaker LE

59.63%

72.05%

84.47%

68.94%

81.06%

90.37%

 

Speaker BD

74.53%

86.34%

97.20%

77.02%

90.68%

96.27%

 

 

 

 

Speaker LL

43.48%

 

 

56.52%

 

72.67%

 

 

56.83%

 

73.60%

 

83.85%

 

 

Table 8: Recognition results

database.

comparing LP and mel-cepstrum using the TI-46

 

LP

Mel- Cepstrum

 

correct

top 2

 

top 4

correct

top 2

top 4

 

Speaker DC

89.39%

95.92%

98.37%

91.43%

97.55%

99.59%

 

Speaker CJ

73.47%

85.71%

93.06%

86.53%

95.51%

97.55%

 

Speaker LE

67.35%

79.59%

89.39%

74.29%

83.67%

93.06%

 

Speaker BD

74.29%

84.90%

91.43%

79.61%

90.61%

93.47%

 

Speaker LL

57.96%

 

 

 

 

71.84%

 

86.12%

 

60.00%

 

74.29%

 

85.31%

 

 

Table 9: Recognition results comparing LP and mel-cepstrum using the Grandfather

database.

21

 

 

 

4.4 Silent Portion Extraction

For many dysarthric speakers, “steady state” vowel-like phonemes are the easiest
sounds to produce because they do not require dynamic movement of the vocal sys-
tem. Conversely, phonetic transitions in speech are more difﬁcult to produce for
dysarthric individuals because they require ﬁne muscle control to move the articula-
tors. Many dysarthric individuals are not able to consistently and reliably make such
transitions between two phonemes due to lack of muscle control. Consequently, it is
reasonable to assume that acoustic transitions in dysarthric speech are of much larger
variance than stationary regions. Hsu tested this hypothesis by pursuing a method
to clip out the dynamic regions from the speech in order to decrease the variability.
These experiments revealed signiﬁcant performance improvement as a result of this
procedure [7, 12].

Early experiments conducted during Snider’s work [28] suggested that this clip-
ping procedure might have been effective principally because it was removing short
silent regions from the acoustics. To test this hypothesis, a silence detection strategy
based on the zero-crossing and energy thresholds was employed to remove short silent
regions. The thresholds were carefully selected so that the technique would extract
only silence regions without removing the weak frictives and other low-amplitude por-
tions of the speech. However, most experiments reported in Table 10 and Table 11
do not support Snider’s hypothesis. These results suggest that silence portion extrac-
tion algorithm does not beneﬁt the system performance and Hsu’s improvement from
the clipping procedure is apparently not due to silence extraction alone as Snider

suspected.

22

D
LL

Table 10: Effect of silent portion extraction on recognition performance using the

TI-46 database.

t portion rem

correct
86.65
82.

4.
53.

top
95.34
91.

1.06
88.51
64.91

top 4
98.14
97.

79.1

correct
89.1

81 .

68.

56. _

t portion

top
95.96
92.24
1.

73.

t
top
98.45

95.

83.85

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Silent portion removed Silent portion kept
correct top 2 top 4 correct top 2 top 4
Speaker DC 91.43% 96.33% 98.78% 91.43% 97.55% 99.59%
Speaker CJ 84.90% 93.88% 97.55% 86.53% 95.51% 97.55%
Speaker LE 72.65% 83.67% 90.61% 74.29% 83.67% 93.06%
Speaker BD 75.10% 88.16% 95.10% 79.18% 90.61% 93.47%)"
Speaker LL 60.00% 71.43% 83.27% 60.00% 74.29% 85.31%

 

 

 

 

 

 

 

 

 

Table 11: Effect of silent portion extraction on recognition performance using the
Grandfather database.

4.5 Number of Training Utterances

Training of each HMM was based on the Baum-Welch reestimation procedure for
multiple observation sequences [25]. The problem of having little training data with
which to accurately characterize the statistical distributions in the HMM, which is
common to most HMM training problems, is extraordinary in the dysarthric speech
problem. The experimental results in Table 12 and Table 13 show that the number
of training sequences has a signiﬁcant effect on the recognition rate. However, the
number of observation sequences used for training is limited, since any attempt to
collect large bodies of speech data by lengthy recording sessions is impractical. Such
sessions are mentally and physically fatiguing for many persons, a fact which only
contributes to the variability one is trying to characterize by collecting more data. In

order to get the best performance from the system, we suggest a retraining strategy.

23

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

5 observation sequences 8 observation sequences

correct top 2 top 4 correct top 2 top 4
Speaker DC 81.99% 91.61% 95.03% 89.13% 95.96% 98.45%
Speaker CJ 79.81% 89.13% 94.41% 81.99% 92.24% 95.96%
Speaker LE 62.11% 75.16% 87.27% 68.94% 81.06% 90.37%
Speaker BD 69.25% 82.61% 91.61% 77.02% 90.68% 96.27%
Speaker LL 47.83% 61.49% 77.33% 56.83% 73.60% 83.85%

 

Table 12: Effect of number of training observation sequences on recognition using the

TI—46 database.

LL

5
correct
.76
74.69

.41

48.5

sequences

top
89.
.4

64.

top 4
.55
95.
85. 1

8
correct
91 .

86.

4

.1
60.

sequences

top
.55
95.51
.6

74.

top 4

97.55
93.

.4
85.31

 

 

 

Table 13: Effect of number of training observation sequences on recognition using the
Grandfather database.

Whenever the recognition is incorrect or correct but the likelihood of the recognized
word is not sufﬁciently different from that of other candidates, we retrain the model

by incorporating the new observations into the existing HMM.

4.6 Number of States in the HMM

It is clear that the Markov structure cannot correctly reﬂect the temporal speech
waveform unless enough states are involved. One idea is to let the number of states
correspond roughly to the number of phonemes within words, hence models with
two to 10 states would be appropriate [25]. For computational efﬁciency, however,
including fewer states is favorable.

The experimental results show that for short words, especially single syllable

words, using fewer states results in better performance. This is consistent with the

24

Table 14: Effect of number of states in HMM on recognition performance using the

TI-46 database.

 

 

 

 

 

 

 

 

 

6 states 8 states 10 states
Speaker DC 89.13% 90.99% 88.51%
Speaker CJ 81.99% 83.85% 84.78%
Speaker LE 68.94% 70.50% 68.94%
Speaker BD 77.02% 77.95% 75.47%
Speaker LL 56.83% 57.14% 57.45%

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6 states 8 states 10 states
Speaker DC 91.43% 92.24% 92.65%
Speaker CJ 86.53% 84.49% 88.16%
Speaker LE 74.29% 72.65% 75.92%
Speaker BD 79.18% 76.73% 77.14%
Speaker LL 60.00% 60.41% 56.73%

 

 

 

 

 

 

 

 

 

Table 15: Effect of number of states in HMM on recognition performance using the
Grandfather database.

assumption that the number of states roughly reﬂects the number of phonemes within
words. Since the TI-46 word list contains 26 alphabetic characters and 10 digits (most
of which are short words), the effect of increasing the number of states in using TI-46
is not as obvious as that in using the Grandfather word list. Note, however, that for
Speaker LL who routinely produces short sounds, the use of fewer states results in

better performance consistent with expectation.

25

5 The Prototype IWR System for Dysarthric Speech

The long term goal of this research is the development of an “artiﬁcially intelligent”
communication aid to serve the needs of a person who is severely speech disabled and
whose motor skills will only permit simple responses in answering “interrogations” by
the device.

In research related to the speech recognition function of such a device, experi-
ments with the dysarthric speech database yielded results which were highly sensitive
to many analysis parameters, in particular, the settings of Hsu’s transition clipping
procedure [12]. Whereas Hsu’s hypothesis was that the clipping procedure was effec-
tive because it removed transitional acoustics from the observation sequence [7, 12],
preliminary experiments conducted during Snider’s work [28] suggested that the clip-
ping procedure might be beneﬁcial principally because it was removing “gaps” or
short silent regions from the acoustics. In most of the experiments in Section 4.4,
however, discarding the silent regions is seen to cause a decrement in performance,
though this is not generally true. These results signiﬁcantly affected our thinking
about the proper course of action in the development of the communication aid.

The conclusion from these mixed results is that building a “ﬁxed box” for all
the speech disabled individuals is not possible nor appropriate because the choice of
parameters to improve the recognition rate is highly speaker-dependent. Our future
plan is to cooperate with clinical centers in the development of customized systems for
a few selected speech-disabled clients with “small” task requirements, for example,
issuing a small set of verbal commands to an assistive device. “Customized” means
that the inclusion of speciﬁc modules and parameter-choices in the system will be
based on the needs and speech characteristics of the client. The clinical center will

transmit digitized speech data over the electronic mail service (email) on the computer

26

network to the Speech Processing Laboratory. These data along with knowledge of
the needs of the clients will be used to create a customized recognition system (soft-
ware) which will then be returned to the clinical center over the network. Periodic
updates (adaptation) of the software can be accomplished by the same means, par-
ticularly if the system is designed to record information about recognition errors and
representative confused utterances. Such adaptation can also be achieved “on-line” if
the system is apropriately designed. In parallel, of course, an opportunity exists for
further research and development as we gain experience from this endeavor.

In this research, we have developed a fundamental speech recognition software
module which, in keeping with the basic philosophy expressed above, remains ﬂexible
for user-speciﬁc customization. In addition, the “front” and “back” ends of the device
remain unspeciﬁed, to be customized for individual users. For example, the basic
operation could be as follows: 1) The user hits one key ﬁrst, utters the words vocally,
and then hits the key again to indicate the end of the utterance. 2) The software
will then process the incoming speech by coding the speech signal, quantizing, and
computing the probability for each prestored model. 3) Finally, the software presents
a list of probable words in the decreasing order of their likelihood measure for the
user to afﬁrm, to deny, or from which to make a selection (see [30, 31]).

One relatively straightforward technical problem remains in the development of a
complete prototype recognition system. The system is running on a general-purpose
PC, and thus, the speed of the recognition is limited. Even on a “high-end” PC based
on an Intel 80486 microprocessor with math coprocessor support, it takes about 15
seconds, for example, to recognize a word from a 46 word vocabulary (TI-46) with the
current system. In order to achieve the real-time speech recognition system, we can
include Snider’s compression approach [6, 28] to reduce the computational complexity,

or employ a programmable signal processing board to maximize the speed, to achieve

27

real-time operation.

The performance of this prototype system depends on numerous inter-related
factors. Although our approach can easily be adjusted to adapt to different dysarthric
cases and maximize the performance, further study of several approaches to enhance

the recognition system is in progress and will be discussed in the next section.

28

6 Conclusion

6.1 Summary

Recognition of the speech of severely dysarthric individuals requires a technique which
is robust to extraordinarily high variability and very little training data. Many ex-
perimental results show that the recognition of dysarthric speech is a distinctly differ-
ent problem from that of normal speech, and new strategies and approaches will be
needed. Because the personal needs and the degree of dysarthria of the speakers are
different, this effort has suggested that a ﬂexible system in which system parameters
can be selected on an individual basis is preferable to a “ﬁxed” system.

The principal contributions of the this research are:

1. The creation of the “Whitaker Database”: The WD provides a well-organized
speech data set which is accessible over the internet computer network. The
words in the database were carefully selected for their phonetic richness and
complexity. It is hoped that this database will serve as a standard for researchers
around the world with which many systems can be compared and meaningfully

evaluated.

2. ’ Extensive experimental studies on the WD to determine effects of various recog-
nition parameters and strategies on performance. These studies resulted in the
conclusion that “customization” of the recognition system to individual speak-
ers is the proper design philosophy. This, in turn led to the conception of an
“on-line” development and testing paradigm to be employed in cooperation with

clinical centers in future work.

29

6.2 Future Work

In view of the current research on IWR for dysarthric speech, several issues which
should be addressed in future work have been identiﬁed.

First, collection of speech data by lengthy recording sessions is a stressful ex-
perience for many dysarthric speakers, and resulting. mental and physical fatigue
and frustration introduce more variability. Consequently, training data are severely
limited. We have suggested a “retraining” strategy as an area of future work in Sec-
tion 4.5. The recognizer must have a convenient way to let the speaker identify the
correctness of the recognized word and decide if the retraining process is required.
Of course, the system should have the ability to decide automatically whether or not
the retraining process is required when the recognition is correct in order to make it
more robust.

Second, it is common for some speakers to introduce unnatural and irregular
pauses within words. We introduced the idea of building a silent model as in Section
4.4. In this case, the incoming speech is represent as an arbitrary sequence of phone

and silent models:
signal = (silent) - phone - (silent) - phone - ( silent)

where the silent part is optional and may appear in general between any two phones
in the signal. A similar strategy was proposed by Levinson in an HMM-based level-
building connected word recognition system as a means for accounting for inter-word
silence [15]. The signiﬁcant beneﬁts from this algorithm are: 1) It can be used to
automate the end-point detection process, and 2) It avoids removing the transition
information of dysarthric speech which can occur if silence regions are removed using
conventional silence detection approaches [27].

Third, recognition based on sub-word (e.g. phoneme) modeling would alleviate

30

some of the problems encountered in collecting sufﬁcient training data. In fact, such
an approach might be a natural solution for some speakers who tend to use only a
small number of phones. A natural extension of this idea would be to incorporate
a grammar and begin recognition of “continuous” speech or at least isolated word
sentence or phrase utterances. While the use of a grammar and these “higher-level”
considerations were beyond the scope of the present work, signiﬁcant beneﬁts may

result from their use in future research.

31

References

[1] Bahl, L.R., P.F. Brown, P.V. de Souza, and R.L. Mercer, “A new algorithm for
the estimation of hidden Markov model parameters,” Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing, New York,

Vol. 1, pp. 493-496, 1988.

[2] Davis, SB. and P. Mermelstein, “Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences,” IEEE Trans-
actions on Acoustics, Speech, and Signal Processing, vol. 28, pp.357-366, 1980.

[3] Deller, J .R., J .G. Proakis and J .H.L. Hansen, Discrete Time Processing of Speech
Signals, New York: Macmillian, 1993.

[4] Deller, J.R. and D. Hsu, “On the use of HMM’s to recognize cerebral palsy
speech: Isolated word case,” Proceedings of the IEEE International Conference
on Acoustics, Speech, and Signal Processing, Glasgow, vol. 1, pp. 290-293, 1989.

[5] Deller, J .R. and D. Hsu, “An alternative adaptive sequential regression algorithm
and its application to the recognition of cerebral palsy speech,” IEEE' Trans.
Circuits and Systems, vol. 34, pp. 782-787, 1987.

[6] Deller, J .R. and R.K. Snider, “ ‘Quantized’ hidden Markov models for efﬁcient
recognition of cerebral palsy speech,” Proceedings 1990 IEEE International Sym-
posium on Circuits and Systems, New Orleans, vol. 3, pp. 2041-2044, 1990.

[7] Deller, J. R, D. Hsu and L. J. Ferrier, “On the use of hidden Markov models
for the recognition of dysarthric speech,” Computer Methods and Programs in
Biomedicine, in press.

[8] Deller, J.R. and GP. Picaché “Advantages of a Givens rotation approach to
temporally recursive linear predition analysis of speech,” IEEE Transactions on
Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 429-431, 1989.

[9] Doddington, GR. and TB. Schalk, “Speech recognition: Turning theory to
. practice,” IEEE Spectrum, vol. 18, pp. 26-32, 1981.

[10] Foulds, R.A., G. Balesta, W.J. Crochetiere and C. Meyer, “The Tufts non-vocal
communication program,” Proceedings 1976' Conference on Systems and Devices
for Disabled, pp. 14-17.

[11] Heckathorne, CW. and D.S. Childress, “Applying anticipatory text selection in a
writing aid for people with a severe motor impairment,” Micro (IEEE Computer
Society), vol. 3, pp. 17-23, 1983.

32

[12] Hsu, D., Computer Recognition of Nonverbal Speech Using Hidden Markov Model
(Ph.D. Dissertation), Northeastern University, Boston, 1988.

[13] Johnson, W., F.L. Darley, and DC. Spreisterbach, Diagnostic Methods in Speech
Pathology, New York: Harper & Row, 1963.

[14] Lee, K.-F., Automatic Speech Recognition, the Development of the SPHINX sys-
tem, (Ph. D Dissertation), Carnegie-Mellon University, 1989.

[15] Levinson, S.E., “Structural methods in automatic speech recognition,” Proceed-

ings of the IEEE, vol. 73, pp. 1625-1650, 1985.

[16] Makhoul, J., “Linear prediction: A tutorial review,” Proceedings of the IEEE,
vol. 63, pp. 561-580, 1975.

[17] Makhoul, J., S. Roucos and H. Gish, “Vector quantization in speech coding,”
Proceedings of the IEEE, vol. 73, pp. 1551-1587, 1987.

[18] Mann, H.B. and A. Wald, “On the statistical treatment of linear stochastic
difference equation,” Econometrica, vol. 11, pp. 173-220, 1943.

[19] Markel, JD. and AH. Gray, Jr., Linear Prediction of Speech, New York:
Springer-Verlag, 1976.

[20] Niemann, H., M. Lang and G. Sagerer, Recent Advances in Speech Understanding
and Dialog Systems, New York: Springer-Verlag, 1987.

[21] O’Shaughnessy, D., Speech Communication: Human and Machine, Reading,
Massachusetts: Addison Wesley, pp. 420 - 424, 1987.

[22] Parsons, T.W., Voice and Speech Processing, New York: McGraw-Hill, 1986.

[23] Picone, J ., “Continuous speech recognition using hidden Markov models,” IEEE
ASSP Magazine, vol. 62, pp. 29 - 41, 1990.

[24] Press, W.H., B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, Numerical
Recipes in C, New York: Cambridge University Press, 1988.

[25] Rabiner, L.R., “A tutorial on hidden Markov models and selected applications
in speech recognition,” Proceedings of the IEEE, vol. 77, pp. 257-285, 1989.

[26] Rabiner, L.R. and R.W. Schafer, Digital Processing of Speech Signals,
Englewood-Cliffs, New Jersey: Prentice-Hall, pp.120-135, 1978.

[27] Rosenthal, L.H., L.R. Rabiner, R.W. Schafer, P. Cummiskey, and J .L. Flanagan,
“A multiline computer voice response system utilizing ADPCM coded speech,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 22, no. 5,

pp.339-352, 1974.

33

[28] Snider, R.K., Efﬁcient Discrete Symbol Hidden Markov Model Evalution Using

Transformation and State Reduction (MS. Thesis), Michigan State University,
1990.

[29] Snider, R.K., Laboratory Manual, Speech Processing Laboratory, Michigan State
University, 1991.

[30] Sy, B.K., A Knowledge-Based Message Generation System for Nonverbal Severely
Motor Disabled Persons: Design and Prototype Testing (Ph.D. Dissertation),
Northeastern University, Boston, 1988.

[31] Sy, B.K. and J .R. Deller, “An AI-based communication system for motor and
speech disabled persons: Design and methodology and prototype testing,” IEEE'
Transactions on Biomedical Engineering, vol. 36, no. 5, 1989.

[32] U CP Prospectus, United Cerebral Palsy of Chicago, 1983.
[33] Van Hattum, R.J., Communication Disorers, New York: MacMillan, 1980.

[34] Wilpon, J .G. and L.R. Rabiner, “Application of hidden Markov models to au-

tomatic speech endpoint detection,” Computer Speech and Language, vol. 2, pp.
321 - 341, 1987.

34

APPENDICES

35

A Experimental Result for Speaker DC

Shown in this appendix is an example experimental result for speaker DC using the
TI-46 word list. In this experiment, eight utterances were used for training, seven for
testing, a 128 symbol codebook, and 6 state Bakis model. The ﬁrst column represents
correct words, the second column are recognized words, the third column represents
the number of correct recognitions to that point in the table, the fourth column
are the number of times the correct word appeared in the two words to which the
recognizer assigned highest likelihood to that point in the table (column ﬁve and six
are similar results for top four and eight candidates), and the last column represents
the total number of testing words.

36

(sac—4HHHHHHH:nmmmmmmommoommmnmmmmmmmmmmmmococooonnnnnnnmwmmmwtnszwwzvzvnv

ENTER

‘<01>2V3’?

H<

REPEA

MNUQMM'UUUUUDUUOOOOOOOWCUG’

mcqc.HiarsriH14rim:nn:m:nu:m¢DG)C)Q¢DC)Q'ﬂr4m

m
['1

p-l

p-l

p-l

p-Z

p-3

p-4

p-5

p-6

p-6

p-6

p-6

p-7

p-8

p-9

p-lO
p-ll
p-12
p-13
p-l4
p-lS
p-16
p-17
p-18
p-18
p-l9
p-20
p-21
p-22
p-22
p-23
p-24
p-24
p-24
p-25
p-26
p-26
p-26
p-27
p-28
p-29
p-29
p-30
p-31
p-32
p-33
p-34
p-35
p-36
p-37
p-38
p-39
p-40
p-41
p-42
p-43
p-44
p-45
p-46
p-47
p-48
p-49
p-SO
p-Sl
p-SZ
p-53
p-53

p2=1
p2-1
p2-1
p2-2
p2-3
p2-4
p2-5
p2-6
p2-7
p2-7
p2-8
p2-9

p2-12
p2-13
p2-14
p2-15
p2-16
p2-17
p2-18
p2-19
p2-20
p2-21
p2-22
p2-23
p2-24
p2-25
p2-26
p2-27
p2-28
p2-28
p2-29
p2-30
p2-31
p2-31
p2-31
p2-32
p2-33
p2-34
p2-35
p2-36
p2-37
p2-38
p2-39
p2-40
p2-41
p2-42
p2-43
p2-44
p2-45
p2-46
p2-47
p2-48
p2-49
pZ-SO
p2-51
p2-52
p2-53
p2-54
p2-55
p2-56
p2-57
p2-58
p2-59
p2-59

37

p4-1
p4-l
p4-1
p4-2
p4-3
p4-4
p4-5
p4-6
p4-7

p8-1
p8-2
p8-3
p8-4
p8-5
p8-6
pB-7
p8-8
p8-9

tot=l
tot-2
tot-3
tot-4
tot-5
tot-6
tot-7
tot-8
tot-9

p4-8
p4-9

p4-10 p8-12
p2-10 p4-11 p8-l3 tot-13
pZ-ll p4-12 p8-14

p4-13
p4-14
p4-15
p4-16
p4-17
p4-18
p4-19
p4-20
p4-21
p4-22
p4-23
p4-24
p4-25
p4-26
p4-27
p4-28
p4-29
p4-30
p4-31
p4-32
p4-33
p4-33
p4-34
p4-35
p4-36
p4-37
p4-38
p4-39
p4-40
p4-41
p4-42
p4-43
p4-44
p4-45
p4-46
p4-47
p4-48
p4-49
p4-50
p4-51
p4-52
p4-53
p4-54
p4-55
p4-56
p4-57
p4-58
p4-59
p4-60
p4-61
p4-62
p4-63

p8-15
p8-16
p8-17
p8-18
p8-19
p8-20
p8-21
p8-22
p8-23
p8-24
98-25
p8-26
p8-27
p8-28
p8-29
p8-30
p8-31
p8-32
p8-33
p8-34
p8-35
p8-35
p8-36
p8-37
p8-38
p8-39
p8-40
p8-41
p8-42
p8-43
p8-44
p8-45
p8-46
p8-47
p8-48
p8-49
p8-50
p8-51
p8-52
p8-S3
p8-54
pB-SS
p8-56
p8-57
p8-58
p8-59
p8-60
p8-61
p8-62
p8-63
p8-64
p8-65

p8-10 tot-10
p8-11 tot-11
tot-12

tot-14

tot-15
tot-16
tot-17
tot-18
tot-19
tot-20
tot-21
tot-22
tot-23
tot-24
tot-25
tot-26
tot-27
tot-28
tot-29
tot-30
tot-31.
tot-32
tot-33
tot-34
tot-35
tot-36
tot-37
tot-38
tot-39
tot-40
tot-41
tot-42
tot-43
tot-44
tot-45
tot-46
tot-47
tot-48
tot-49
tot-50
tot-51
tot-52
tot-53
tot-54
tot-55
tot-56
tot-57
tot-58
tot-59
tot-60
tot-61
tot-62
tot-63
tot-64
tot-65
tot-66

mmmmmmwwwwwwwooooooommmmmmmooooooozzzzzzzxzzzzzzr—«rrarr—«r—«xwmxxquuqe

zzxzxzzthhvBavvmxmxmwmumeg

22

STO

*<
m
mmwxmmwwt'wwwwooooooommmvomowooooooo

p-S4
p-55
p-56
p-57
p-58
pr59
p560
p-61
p-62
p-63
p-64
p-GS
p-66
p-67
p-68
p-69
p-70
p-71
p-72
p-73
p-74
p-75
p-76
p-77
p-78
p-79
p-80
p-BO
p-BO
p-80
p-81
p-82
p-83
p-84
p-85
p-86
p-87
p-88
p-89
p-90
p-9O
p-91
p-92
p-93
p-94
p-9S
p-96
p-97
p-98
p-99
p-100
p-lOl
p-102
p-103
p-104
p-104
p-105
p-105
p-106
p-107
p-108
p-109
p-109
p-llO
p-llO
p-lll

p2-60
p2-61
p2-62
p2-63
p2-64
p2-65
p2-66
p2-67
p2-68
p2-69
p2-70
p2-71
p2-72
p2-73
p2-74
p2-75
p2-76
p2-77
p2-78
p2-79
p2-80
p2-81
p2-82
p2-83
p2-84
p2-85
p2-86
p2-87
p2-87
p2-88
p2-89
p2-90
p2-91
p2-92
p2-93
p2-94
p2-95
p2-96
p2-97
p2-98
p2-99
p2-100
p2-101
p2-102
p2-103
p2-104
p2-105
p2-106
p2-107
p2-108

p4-64
p4-65
p4-66
p4-67
p4-68
p4-69
p4-70
p4-71_
p4-72
p4-73
p4-74
p4-7S
p4-76
p4-77
p4-78
p4-79
p4-80
p4-81
p4-82
p4-83
p4-84
p4-85
p4-86
p4-87
p4-88
p4-89
p4-90
p4-91
p4-92
p4-93
p4-94
p4-95
p4-96
p4-97
p4-98
p4-99
p4-100
p4-101
p4-102
p4-103
p4-104

p8-66
p8-67
p8-68
p8-69
p8-7O
p8-71
p8-72
p8-73
p8-74
p8-75
p8-76
p8-77
p8-78
p8-79
p8-80
p8-81
p8-82
p8-83
p8-84
p8-85
p8-86
p8-87
p8-88
p8-89
p8-9Q
p8-91
p8-92
p8-93
p8-94
p8-95
p8-96
p8-97
p8-98
p8-99

tot-67
tot-68
tot-69
tot-70
tot-71
tot-72
tot-73
tot-74
tot-7S
tot-76
tot-77
tot-78
tot-79
tot-80
tot-81
tot-82
tot-83
tot-84
tot-85
tot-86
tot-87
tot-88
tot-89
tot-9O
tot-91
tot-92
tot-93
tot-94
tot-95
tot-96
tot-97
tot-98
tot-99
tot-100

p8-100 tot-101
p8-101 tot-102

p8-102
p8-103
p8-104
p8-105
p8-106

tot-103
tot-104
tot-105
tot-106
tot-107

p4-105
p4-106
p4-107
p4-108
p4-109
p4-110
p4-111
p4-112
p4-113

p8-107
p8-108
p8-109
p8-110
pB-lll
p8-112
p8-113
p8-114
p8-115

p2-109
p2-110
p2-111
p2-112
p2-113
p2-113
p2-114
p2-115
p2-116
p2-117
p2-118
p2-119
p2-120
p2-121
p2-122
p2-123

38

p4-114
p4-115
p4-116
p4-117
p4-118
p4-119
p4-120
p4-121
p4-122
p4-123
p4-124
p4-125
p4-126
p4-127
p4-128
p4-129

p8-116
p8-117
p8-118
p8-119
p8-120
p8-121
p8-122
p8-123
p8-124
p8-125
p8-126
p8-127
p8-128
p8-129
p8-130
p8-131

tot-108
tot-109
tot-110
tot-111
tot-112
tot-113
tot-114
tot-115
tot-116
tot-117
tot-118
tot-119
tot-120
tot-121
tot-122
tot-123
tot-124
tot-125
tot-126
tot-127
tot-128
tot-129
tot-130
tot-131
tot-132

urea:NceeaNr<~<Kv<~<Kt<><X:K><X:K><2:E132:81§£<:<‘<<$<<3<<3C26<2C26<3raevaraevaram

*3
:1:
{3
mm

ZER

><X1K><818132=85:8<3<‘<<:<<=<3Q<3CJG<DCIGIQFiarariﬂ

REPEA

NCQDINEQb18r<Hth<h<Kr<><N><

p-112
p-112
p-113
p-114
p-llS
p-116
p-ll?
p-118
p-119
p-120
p-120
p-121
p-122
p-123
p-124
p-125
p-126
p-127
p-128
p-129
p-130
p-131
p-132
p-133
p-134
p-135
p-136
p-137
p-138
p-139
p-140
p-141
p-142
p-143
p-144
p-145
p-146
p-147
p-148
p-149
p-lSO
p-lSl
p-152
p-152
p-153
p-154
p-lSS
p-156
p-157
p-158
p-159
p-160
p-161
p-162
p-163
p-164
p-165
p-166
p-167
p-168
p-169
p-170
p-171
p-172
p-173
p-174

p2-124
p2-125
p2-126
p2-127
p2-128
p2-129
p2-130
p2-l31
p2-132
p2-l33
p2-134
p2-135
p2-136
p2-137
p2-138
p2-139
p2-140
p2-141
p2-142
p2-143
p2-l44
p2-145
p2-146
p2-147
p2-l48
p2-149
p2-150
p2-151
p2-152
p2-153
p2-154
p2-155
p2-156
p2-157
p2-158
p2-159
p2-160
p2-161
p2-162
p2-163
p2-164
p2-165
p2-166
p2-167
p2-168
p2-169
p2-l70
p2-171
p2-172
p2-173
p2-174
p2-175
p2-176
p2-177
p2-178
p2-179
p2-180
p2-181
p2-182
p2-183
p2-184
p2-185
p2-186
p2-187
p2-188
p2-189

39

p4-130
p4-131
p4-132
p4-133
p4-134
p4-135
p4-136
p4-137
p4-138
p4-139
p4-140
p4-141
p4-142
p4-143
p4-144
p4-l45
p4-146
p4-147
p4-148
p4-l49
p4-150
p4-151
p4-152
p4-153
p4-154
p4-155
p4-156
p4-157
p4-158
p4-159
p4-160
p4-161
p4-162
p4-163
p4-164
p4-165
p4-166
p4-167
p4-168
p4-169
p4-170
p4-17l
p4-172
p4-173
p4-174
p4-17S
p4-l76
p4-177
p4-178
p4-l79
p4-180
p4-181
p4-182
p4-183
p4-184
p4-185
p4-186
p4-187
p4-188
p4-189
p4-190
p4-191
p4-192
p4-193
p4-194
p4-195

p8-132
p8-133
p8-l34
p8-135
p8-136
p8-l37
p8-138
p8-l39
p8-l40
p8-141
p8-142
p8-143
p8-144
p8-145
p8-l46
p8-147
p8-148
p8-149
p8-150
p8-151
p8-152
p8-153
p8-154
p8-155
p8-156
p8-157
p8-158
p8-159
p8-160
p8-161
p8-162
p8-163
p8-164
p8-165
p8-166
p8-167
p8-168
p8-169
p8-170
p8-171
p8-172
p8-173
p8-174
p8-175
p8-176
p8-177
p8-178
p8-179
p8-180
p8-181
p8-182
p8-183
p8-184
p8-185
p8-186
p8-187
p8-188
p8-189
p8-l90
p8-191
p8-192
p8-193
p8-194
p8-195
p8-196
p8-197

tot-133
tot-134
tot-135
tot-136
tot-137
tot-138
tot-139
tot-140
tot-141
tot-142
tot-143
tot-144
tot-14S
tot-146
tot-147
tot-148
tot-149
tot-150
tot-151
tot-152
tot-153
tot-154
tot-155
tot-156
tot-157
tot-158
tot-159
tot-160
tot-161
tot-162
tot-163
tot-164
tot-165
tot-166
tot-167
tot-168
tot-169
tot-170
tot-171
tot-172
tot-173
tot-174
tot-175
tot-176
tot-177
tot-178
tot-179
tot-180
tot-181
tot-182
tot-183
tot-184
tot-185
tot-186
tot-187
tot-188
tot-189
tot-190
tot-191
tot-192
tot-193
tot-194
tot-19S
tot-196
tot-197
tot-198

ZERO

START
START
START
START
START
START
START
STOP
STOP
STOP
STOP
STOP

p-l75
p-176
p-176
p-177
p-178
p-179
p-180
p-181
p-182
p-183
p-184
p-185
p-186
p-187
p-l88
p-189
p-190
p-191
p-192
p-193
p-194
p-l95
p-196
p-197
p-198
p-199
p-200
p-201
p-202
p-203
p-204
p-205
p-206
p-207
p-207
p-207
p-207
p-208
p-209
p-209
p-210
p-210
p-210
p-le
p-212
p-213
p-214
p-ZlS
p-216
p-217
p-218
p-219
p-220
p-221
p-222
p-223
p-224
p-225
p-226
p-227
p-228
p-229
p-230
p-231
p-232
p-233

p2-190
p2-19l
p2-192
p2-193
p2-194
p2-195
p2-l96
p2-197
p2-198
p2-199
p2-200
p2-201
p2-202
p2-203
p2-204
p2-205
p2-206
p2-207
p2-208
p2-209
p2-210
p2-211
p2-212
p2-213
p2-214
p2-215
p2-216
p2-217
p2-218
p2-219
p2-220
p2-221
p2-222
p2-223
p2-224
p2-225
p2-226
p2-227
p2-228
p2-228
p2-229
p2-229
p2-230
p2-231
p2-232
p2-233
p2-234
p2-235
p2-236
p2-237
p2-238
p2-239
p2-240
p2-24l
p2-242
p2-243
p2-244
p2-245
p2-246
p2-247
p2-248
p2-249
p2-250
p2-251
p2-252
p2-253

40

p4-196
p4-197
p4-l98
p4-199
p4-200
p4-201
p4-202
p4-203
p4-204
p4-205
p4-206
p4-207
p4-208
p4-209
p4-210
p4-211
p4-212
p4-213
p4-214
p4-215
p4-216
p4-217
p4-218
p4-219
p4-220
p4-221
p4-222
p4-223
p4-224
p4-225
p4-226
p4-227
p4-228
p4-229
p4-230
p4-231
p4-232
p4-233
p4-234
p4-235
p4-236
p4-236
p4-237
p4-238
p4-239
p4-240
p4-241
p4-242
p4-243
p4-244
p4-245
p4-246
p4-247
p4-248
p4-249
p4-250
p4-251
p4-252
p4-253
p4-254
p4-255
p4-256
p4-257
p4-258
p4-259
p4-260

p8-l98
p8-199
p8-200
p8-201
p8-202
p8-203
p8-204
p8-205
p8-206
p8-207
p8-208
p8-209
p8-210
p8-211
p8-212
p8-213
p8-214
p8-215
p8-216
p8-217
p8-218
p8-219
p8-220
p8-221
p8-222
p8-223
p8-224
p8-225
p8-226
p8-227
p8-228
p8-229
p8-230
p8-231
p8-232
p8-233
p8-234
p8-235
p8-236
p8-237
p8-238
p8-239
p8-240
p8-241
p8-242
p8-243
p8-244
p8-245
p8-246
p8-247
p8-248
p8-249
p8-250
p8-251
p8-252
p8-253
p8-254
p8-255
p8-256
p8-257
p8-258
p8-259
p8-260
p8-261
p8-262
p8-263

tot-199
tot-200
tot-201
tot-202
tot-203
tot-204
tot-205
tot-206
tot-207
tot-208
tot-209
tot-210
tot-211
tot-212
tot-213
tot-214
tot-215
tot-216
tot-217
tot-218
tot-219
tot-220
tot-221
tot-222
tot-223
tot-224
tot-225
tot-226
tot-227
tot-228
tot-229
tot-230
tot-231
tot-232
tot-233
tot-234
tot-23S
tot-236
tot-237
tot-238
tot-239
tot-240
tot-241
tot-242
tot-243
tot-244
tot-245
tot-246
tot-247
tot-248
tot-249
tot-250
tot-251
tot-252
tot-253
tot-254
tot-255
tot-256
tot-257
tot-258
tot-259
tot-260
tot-261
tot-262
tot-263
tot-264

REPEAT
REPEAT
REPEAT
REPEAT
REPEAT
REPEAT
ENTER
ENTER
ENTER
ENTER
ENTER
ENTER
ENTER

RUBOUT
RUBOUT
RUBOUT
Y
RUBOUT
STOP
RUBOUT
REPEAT
REPEAT
REPEAT
REPEAT
REPEAT
REPEAT
REPEAT
ENTER
ENTER
ENTER
ENTER
ENTER
ENTER
ENTER

p-234
p-235
p-236
p-237
p-238
p-239
p-24O
p-241
p-242

p-243

p-244
p-245
p-246
p-247
p-248
p-249
p-249
p-ZSO
p-251
p-251
p-252
p-253
p-254
p-255
p-256
p-257
p-258
p-259
p-260
p-261
p-262
p-263
p-264
p-265
p-266
p-267
p-268
p-269
p-27O
p-271
p-271
p-272
p-272
p-273
p-274
p-275
p-276
p-277
p-278
p-279
p-280
p-281
p-282
p-283
p-284
p-285
p-286
p-287

recognition rate - 89.1304

p2-254
pZ-ZSS
p2-256
p2-257
p2-258
p2-259
p2-260
p2-261
p2-262
p2-263
p2-264
p2-265
p2-266
p2-267
p2-268
p2-269
p2-269
p2-270
p2-271
p2-272
p2-273
p2-274
p2-275
p2-276
p2-277
p2-278
p2-279
p2-280
p2-281
p2-282
p2-283
p2-284
p2-285
p2-286
p2-287
p2-288
p2-289
p2-290
p2-291
p2-292
p2-292
p2-293
p2-294
p2-295
p2-296
p2-297
p2-298
p2-299
p2-300
p2-301
p2-302
p2-303
p2-304
p2-305
p2-306
p2-307
p2-308
p2-309

percent

41

p4-261
p4-262
p4-263
p4-264
p4-265
p4-266
p4-267
p4-268
p4-269
p4-270
p4-271
p4-272
p4-273
pA-274
p4-275
p4-276
p4-276
p4-277
p4-278
p4-279
p4-280
p4-281
p4-282
p4-283
p4-284
p4-285
p4-286
p4-287
p4-288
p4-289
p4-290
p4-291
p4-292
p4-293
p4-294
p4-295
p4-296
p4-297
p4-298
p4-299
p4-300
p4-301
p4-302
p4-303
p4-304
p4-305
p4-306
p4-307
p4-308
p4-309
p4-310
p4-311
p4-312
p4-313
p4-314
p4-315
p4-316
p4-317

p8-264
p8-265
p8-266
p8-267
p8-268
p8-269
p8-270
p8-271
p8-272
p8-273
p8-274
p8-275
p8-276
p8-277
p8-278
p8-279
p8-279
p8-280
p8-281
p8-282
p8-283
p8-284
p8-285
p8-286
p8-287
p8-288
p8-289
p8-290
p8-291
p8-292
p8-293
p8-294
p8-295
p8-296
p8-297
p8-298
p8-299
p8-300
p8-301
p8-302
p8-303
p8-304
p8-305
p8-306
p8-307
p8-308
p8-309
p8-310
p8-311
p8-312
p8-313
p8-314
p8-315
p8-316
p8-317
p8-318
p8-319
p8-320

tot-265
tot-266
tot-267
tot-268
tot-269
tot-270
tot-271
tot-272
tot-273
tot-274
tot-275
tot-276
tot-277
tot-278
tot-279
tot-280
tot-281
tot-282
tot-283
tot-284
tot-285
tot-286
tot-287
tot-288
tot-289
tot-290
tot-291
tot-292
tot-293
tot-294
tot-295
tot-296
tot-297
tot-298
tot-299
tot-300
tot-301
tot-302
tot-303
tot-304
tot-305
tot-306
tot-307
tot-308
tot-309
tot-310
tot-311
tot-312
tot-313
tot-314
tot-315
tot-316
tot-317
tot-318
tot-319
tot-320
tot-321
tot-322

B Program Listing: LP Parameter Generating
Program

This program computes the LP parameters of an input speech ﬁle using the autocor-
relation method, and then quantizes the LP parameters.

42

#include <stdio.h>
Oinclude <ctype.h>
Oinclude <string.h>
tinclude <time.h>
linclude <math.h>

#define N 256 /* Hamming window size */

tdefine NUM 50000 /* maximum allowed for speech data */
{define MO 14 /* Model Order of LPG */

Odefine NLEVELS 7 /* number of levels in binary codebook */
{define LEVELINDX 126 /* ZANLEVELS-Z */

#define TOTVECT 254 /* number of vectors in codebook */
Odefine CODEFILE "codedc36.dat" /* output file directive */

int count:

int index2[100][2]:

double speech_data[NUM]:

double window[N]:

double aiMO+1]: /* estimated LP parameters */
double r[MO+1]: /* short term autocorrelation */
double codebookiTOTVECT][MO]:

char codefilelBO];

FILE *outfile:

/*i’i’it****§***t§***tititiiﬂitti*****i*******t***********t***i***************

* Program name: lpcqnt_ a. c ‘ *'
* Command : run386 lpcqnt_ a cdbkfile. dat *
* Description : Computing the LP parameters using autocorrelation *
* method with 256 points *
* Hamming window as a frame. *
* Date : July 19, 1990 *

**ﬂ****************iiﬁtiﬁiiiti‘tii‘ttiii'i'ﬂiiitii’fttttti**t*i****t***t******ft/

mainiargc,argV)
int argc:
char *argvi]:

{

int i,j,k,h,t;
int numread:
int M,p,sum,m:
char infilesl80],infilenamelIBO],infilename[80],filenameilOO]:
char instring[40],numstring[5],inputl[30]:

short bufferi64]:

FILE *infilel,*infile2,*infile,*file:

int total_data:

void lpc_computation():

void codeboox_entry();

if( argc < 1 i
{

print£(”***** After program name enter two file names *****\n");

printf("1. The first file name is the name of the codebook. \n"):

printf("2. The second is the name of the file that contains the paths and\n"):
printfi” names of all the data files to be qauntized.\n\n");
print£("Example: "i:

printf("lpc2 codebook. dat allfiles. dat\n", argv[0]);

eXit (O) I

l
stGCyicodefile,argv[1]):
codebooh_entry(i:
count-O;
strcpyiinfilename,"tstti. dat");

printfi'Start data input - infilename - %s\n", infilename);
if ( (infile - £open(infilename, "r")) - NULL)

1

43

printf("fopen failed for infilename %s.\n",infilename):
exit(0):

}
while( fscanf(infile,"%s\n",instring) !- EOF)

{
for(h-l: h<8: h++i

i
strcpy(filename,"c:\\dc36\\test\\bin\\"):
strcat(filename,instring):
strcat(filename,".36"):

switch(h)

{

case 1 strcpy(numstring,”9"): break:
case 2 strcpyinumstring,"a"); break:
case 3 : strcpyinumstring,"b"): break:
case 4 : strcpyinumstring,"c"): break:
case 5 . strcpyinumstring,"d"): break:
case 6 strcpy(numstring,"e”); break:
case 7 strcpy(numstring,"f"); break;

default : break:

strcat(filename,numstring):

_pmode - 0x8000:

if ( (file - fopen(filename, "r")) - NULL)
{
printf(”fopen failed for filename %s\n",filename):
exit(0): -

}
printf(”reading in data from file %s ..... \n",fi1ename):
k-t-O:
do

{

numread - fread((void *)buffer, sizeof(short),64,file):
if(numread -- 0)

break:
for(i-O: i<64: i++)

{
speech_data[t] - (int)buffer[i]:
t++:

} .

}while( feof(infile) - 0 || numread -- 64 ): /** for with.cp5 file **/
fclose(filei:
strcpy(filename,"\\dc36\\test\\qnt\\"):
strcat(filename,instring):
strcat(filename,".vq"):
strcat(filename,numstring):
_pmode - 0x4000:
outfile - fopen(filename,”w"i:
printfi' Quantizing data...\n"):
count-0:
lpc_computation(ti:
printf("%u Quantized lpc vectors written to file %s\n",count,filename):
fclose(outfile);
i

}

/tii*****ﬁ*******i********i****i**i******t*if**ﬂ**ti*ti***i***i*******tiii

* This function computes the lpc vectors from the speech data and *

* then vector quantizes the lpc vectors *
ttti*tttttttitttttiiit*itiittttttiittttitiiiiiiiittit**i*t**t********ttttl

void lpc_computation(total_data)
int total_data:

int n,i:
void LD_recursion():

44

void shift_window();
void vector_guantize():

a[0]-1:
for(n-0: (n+N)<total_data: n-n+50)

shift_window(n):
LD_recursion():
vector_quantize():
}
}

/********i******************iittititiiitiif*****************************ﬂ***

* This function takes 256 points from sampled data using Hamming window *
* to implement short term LP analysis *
*i***t******it****tittttti*ttttt*t*************************tt********t****t/
void shift_window(n)
int n:
i
int i:
for (i-O: i<N: i++l

{
windowIiJ-speech_data[n]*(0.54-0.46*cos(2*PI*i/N));
n++:
}
}

/*t*****i*********ﬂ*ii******i*************i*****iifit****************i****ﬁﬁ

* This routine use Levinson-Durbin Recursion to get the LP parameters *
itittittt***************tt***tittttt*itt*tttt*t********tt**ti**ttttttttittr/

void LD_recursion()

{

int 1,1:

double ai,aj,temp:

double e: /* xi, the average energy in the predition residual */
double k: /* kappa, reflection coefficient */

void comp_corr(): /* compute short term autocorrelation */

comp_corr();
/* initialization */
e-riO]:
for (1-1: l<-MO: l++)
{
/* step 1 */
temp-0.0:
for (i-l: i<l: i++)
temp-temp+a[i]*r[l-i]:
k-(r[l]-temp)/e:
/* step 2 */
a[l]-k:
for (i-l: i<-l/2: i++)
{
ai-ali];
aj-aIl-i]:
a[i]-ai-k*aj:
a[l-i]-aj-k*ai:
}
/* step 3 */
e-e*(l.-k*k):
}
i

/***t*****tii*******ﬁ*ﬂ*i********tf*********************************ii******

* This procedure computes short-term autocorrelation *
*********iittitttt*ittitttitttttttt**ttttttt*********ﬂ*********tttttttttiit/

void comp_corr()

45

int ilj;
for (i-O: i<-MO: i++)

rlil-O:
for (j-O: j+i<N: j++i
r[i]-r[i]+window[j]*window[j+i];
r[i]-r[i]/N:
l
l

/***********i*******i****************************ﬁ*******************ititit

* This routine vector quantizes the computed lpc vector with respect to *
* the given codebook. *
it*ﬁ*******itttt**************************ti********t****************t*itt/

void vector_guantize()

{

int i,index1,index2,vq;

double idml,idm2:

double itakura_dist_meas();

int level_index():

indexl - O:
indexZ - 1:
idml - itakura_dist_meas(codebook[index1]);
idm2 - itakura_dist_meas(codebook[index2]);
if(imu.>i¢ﬂ )

indexl - indexZ:

for(i-l: i<NLEVELS: i++)
{
indexl - (indexl - level_index(i))*2 + level_index(i+l):
indexZ - indexl + l:
idml - itakura_dist_meas(codebook[indexl]):
idm2 - itakura_dist_meas(codebook[index2]);
if( idml > idm2 )
indexl - index2:

}
vq - indexl - LEVELINDX:
fprintf(outfile,"%d\n",vq);
count-count+l:

}

/*itttﬂtﬁit************t***********t**it****#***t*ii**ﬁt*i***************it

* This routine calculates which vector to compare next in the codebook *
* once a vector index in the previous level is given *
*tttttttttttitttttttttttitttttttttttﬂirt*Qtttttttitt*ttttttttttttttiittttt/

int level_index(k)

int k:

{

int num:

num - (int)pow((double)2,(double)k) - 2:

return num:

}

/*********i****t*****************t*itﬂ!***tit******************************i

* This routine calculates the Itakura Distance Measure between the *
* computed lpc vector and a vector from the codebook *
*t*******ittt**********t****ifit*titttti*ti**ﬁ*i*********t**t*************t/
double itakura_dist_meas(array) '
double arrayl]:

int i,j:

double templIMo+l],temp2[M0+1]:
double al[MO+l],entry[M0+l]:

46

double idml,idm2,idm:

entryi01-1.0:
al[0]-l.0:
for (i-l: i<-MO: i++l

entryli]-0.-array[i-l]:
aliil-0--alil:
l
for (i-O: i<-MO: i++)
{
templ[i]-0:
temp2[i]-0:
for (j-O: j<-MO: j++i
if (i<j)
{

templIi]-temp1[i]+a1[j]*rij-i]:
temp2[i]-temp2[i]+entry[j]*r[j-i]:
}
else
{
templii]-templ[i]+a1[j]*r[i-j]:
temp2[i]-temp2[i]+entry[j]*r[i-j]:
}
}
idml-0:
idm2-O:
for (i-O: i<-MO: i++)
{
idml-idml+templ[i]*al[i]:
idm2-idm2+temp2[i]*entry[i]:

}
idmplogiidm2)-log(idml):
return idm:

}

/**************************************************ii*********i‘i’t‘ktttii‘kiti’ttt/

void codebook_entry()

i

FILE *infile3:

int i,j,k,m:

char inputlli801,input12[80],inputl3i80]:
void extractword():

infile3 - fopeniCODEFILE,”r"):

printf("\nReading %s\n",CODEFILE):
m-O:
for(i-l: i<-NLEVBLS: i++)
{
fgets(inputll,80,infile3):
printf("%s",inputlli:
for(j-0: j<(int)pow((double)2,(double)i): j++i
{
fgets(inputll,67,infile3):
fgets(input12,67,infile3):
fgets(input13,80,infile3):
extractwordiinputll,input12,input13,m):
for (k-O: k<l4; k++) printf("%f ”,codebookim][k]);
printf("\n”): .
m++:
}

l
fcloseiinfileB):
}

47

/it*ttitittititttttittttttitititittiiitttitttttttwwwtwttttttitttwttwttwttwttt/

void extractword(inl,in2,in3,m)

char inl[80],in2[80],in3[80];

int m:

(
int i,j,k:
char templ[30],temp2[30],temp4[30]:
for( j-O: j<30: j++l

{
templij] - '\0':
temp2ij] - '\0':
temp4ij] - ’\0’:
)
for(i-O: i<6: i++)
i
k-O:
forij-i*ll: j<(i+l)*ll: j++i
{

templikl - inlij]:
temp-21k] - inZijl:
k++:

}
codebook[m][i] - atof(temp1):
codebook[m][i+6] - atof(temp2):
l
for(i-O: i<2: i++i
{
k-O:
for(j-i*ll: j<(i+l)*11: j++)

temp41k1 - in3ijl:
k++:

}
codebookim][i+12] - atof(temp4):
}

48

C Program Listing: Cepstral Parameter Gener-
ating Program

This program computes the mel-cepstral parameters of an input speech ﬁle using a
1024 point F FT, and then quantizes the cepstral parameters.

49

<stdio.h>
<ctype.h>
<string.h>
<time.h>
<math.h>

#include
#include
#include
Oinclude
tinclude
#define N
#define NUM 50000
Odefine MO 10
#define NLEVELS 7
#define LEVELINDX 126
{define TOTVECT 254
#define CODEFILE "codele27.dat"
#define FFT 1024

256

int count:

int freql22]:

int speech_data[NUM]:

double window[PPT];

double c[MO+1]:

double codebook[TOTVECT][MO];
FILE *outfile:

/*
/*
It
/*
/*
/i
/*
/*

It

/*
It

/*

Hamming window size */

maximum allowed for speech data */
model order of cepstrum */

number of levels in binary codebook */
Z‘NLEVELS-Z */

number of vectors in codebook */
codebook file */

number of point for FFT */

mel_frequency */

256 samples plus zero padding */
cepstrum parameters */

pointer to quantized file */

l****it*************i****t**i**it****ii***************iﬁ***ﬁ***if***********

* Program name:
* Command :
Description :

ceps_qnt.c
run386 cepstrum

*
t
* parameters
*

Date : August 1, 1990

cepstral analysis training data with 1024 points PET and
silent portion kept and then quantize these cepstrum

\

#10! t i

i

tiff************t**i*****t****i**********t**********t*tt**ti*t***t******ﬁ*tl

main()

{
int i,j,k,h,t:
int numread:
int M,p,sum,m,mt:

int indataINUM],index2[100][2]:
char infilesIBO],infilenamel[80],infilename[80],filename[1001:
char instringi40],numstring[5],inputll30]:

short bufferi64]:

FILE *infile2,*infile,*file:
int total_data:

void cepstrum_comp():

void codebook_entry():

void mel_freq():

codebook_entry():
mel_freq():

count-0:
for (i-O; i<FFT:
windowIil-0.0:

i++)

strcpy(infilename,"tstti.dat"):

printf("$tart data input -
if (
{

infilename
(infile - fopen(infilename,

- %s\n",infilename):
"r")) - NULL)

printf("fopen failed for infilename ts.\n",infilename):

exit(0):

1
while( fscanf(infile,"%s\n",instring)

for(h-l: h<9: h++)

{

!- EOF)

strcpy(filename,"c:\\le27\\train\\bin\\");
strcat(filename,instring);

strcat(filename,".27");

50

switch(hl
{

case 1 strcpy(numstring,"l"):break:
case 2 . strcpy(numstring,"2"):break:
case 3 : strcpyinumstring,"3"):break:
case 4 : strcpy(numstring,"4"):break:
case S : strcpy(numstring,"5"):break:
case 6 : strcpy(numstring,"6"):break:
case 7 : strcpy(numstring,"7"):break;
case 8 : strcpy(numstring,"8"):break:
default : break:

strcat(filename,numstring):
_pmode - 0x8000:
if ( (file - fopentfilename, "r")) - NULL)

{
printf(”fopen failed for filename %s\n",filename):
exit(0):

printf("reading in data from file %s ..... \n",filename):
k-t-O:
do

{
numread - freadiivoid *)buffer, sizeof(short),64,file):

ifinumread - 0)
break:
for(i-0: i<64: i++i

{
speech_data[t] - (int)buffer[i]:
tH:

}

)while( feof(infile) - 0 ll numread -- 64 i: /* for with.cp5 file */
fcloseifile):
strcpyifilename,“\\le27\\train\\qnt\\"):
strcat(filename,instring);
strcatifilename,".vq"):
strcat(filename,numstring):

ode - 0x4000:
outfile - fopen(filename,"w");
printfi" Quantizing data...\n"):
count-0:
cepstrum_compiti:
printf("%u Quantized lpc vectors written to file %s\n",count,filename):
fclose(outfile):
l

}

/it***********i*******iii****iti****t**t**t*t*****ti**#*****it*t*i*********t**

* This function computes the cepstrum parameters from the speech data and *
* then vector quantizes the cepstrum parameters *
it*ttiit*tiiit*ttttﬁ******ﬁtittt**ii**ttt*ttttt*i*i**t*t*i****t**********tti*/
void cepstrum_comp(total_data)
int total_data:
{
int n,i:
float f[2*FFT+1]:
double mel[21],rf[FFT/2+1]:
void shift_window():
void stdft():
void mel_energy():
void mel_cepstrum():
void vector_quantize():

forth-0: (n+N)<total_data: n-n+50)
{

51

shift_window(n):
for (i-l: i<-FFT: i++i

f[2*i]-0.0:
f[2*1-11-(float)windowii-l]:

i
stdftif,FFT,1);
for (i-l: i<-(FFT/2): i++)
rf[1]-sqrt((double)f[2*i—1]*(double)f[2*i-1]+(double)f[2*i]*(double)f[2*i]
mel_energy(mel,rf):
mel_cepstrum(mel);
vector_guantize():
l
/*********************t***********ti**itt********************t******t****tﬂ**t
* This function takes 256 points from sampled data using Hamming window and *
* put zero in remaining position to implement 2048 points FFT *
**t*********t******titttittt**********t**#*******ttitttt*****i**t*i#********i/
void shift_window(n)
int n:

(
int i:

for (i-O: i<N: i++)

windowIil-speech_data[n]*(0.54-0.46*cos(2*PI*i/N));
n++:

}

/******it*********i*******i**********t***********ﬂi*tti***************ﬁ******

* This routine use radix-2, 2048 points FFT to implement short term DPT *
itti******t****i*********itittttttﬂﬁttttttit*****ttittrtt**it***t*****it***t/

tdefine SWAP(a,b) tempr-(a):(a)-(b):(b)-tempr

void stdft(data,nn,isign)

float datai]:

int nn,isign:

{
int n, mmax,m,j, istep,i:
double wtemp, wr, wpr, wpi, wi, theta:
float tempr, tempi:

n-nn << 1:

j-l:
for (i-1:i<n:i+-2) {
if (j > i) {
SWAP(data[j],data[i]):
SWAP(data[j+l],data[i+1]);
}
m-n >> 1:
while (m >- 2 && j > m) {
' j -- m:
m >>- 1:
}
j +- m:
}
mmax-Z:
while (n > mmax) {
istep-2*mmax:

theta-6.28318530717959/(isign*mmax):
wtemp-sin(0.5*theta):

wpr - -2.0*wtemp*wtemp:
wpi-sin(theta):

wr-l.0:

52

wi-0.0:
for (m-1:m<mmax:m+-2) (
for (i-m;i<-n:i+-istep) {
j-i+mmax:
tempr-wr*data[j]-wi*data[j+1]:
tempi-wr*data[j+l]+wi*data[j]:
data[j]-data(i]-tempr;
dataij+1]-data[i+1]vtempi:
datali] +- tempr:
data[i+1] +- tempi:
l
wr-(wtemp-wr)*wpr-wi*wpi+wr:
wi-wi*wpr+wtemp*wpi+wi:

mmax-istep:

}
#undef SWAP

/*i**ﬁ*****f*ii*******i*t*****i***iit********t*****i*******i*************t***t

* This routine computes the MEL-frequencies, then computes the critical *
* band energy and put these values in the same array. *
********itittitittitittitttitt*tiﬂ*t*******i**it*i*it***tittiﬁtttt*t*********/

void me1_energy(mel,f)

double mel[]:

double ft]:

{
double ratio,r:

int i,j:

for (i-l: i<-20: i++)
{
ratio-1.0/(freq[i]-freq[i-l]):
melli]-0.0:
for (j-l: j<(freq[i]-freq[i-l]): j++) .
{

r-ratio*j:
mel[il-mel[i]+r*r*f[freq[i-1]+j]*fifreq[i-l]+j]:

l
ratio-1.0/(freq[i+1]-freq[i]):

for (j-O: j<(freq[i+1]-freq[i]): j++i
{

r-l-ratio*j:
melii]-mel[i]+r*r*f[freq[i]+j]*flfreqii]+j]:
}
melti]-loglO(mel[i]):
l
i
/*t*t***********************************i**ti*********i*************it**t****

* This routine computes MEL-based cepstral coefficients with critical band *
* filtering. *

*****t**ﬁ*i**i****i*******tttit******titttwtttt***tt*ttitt**tt*******t****t*/
void mel_cepstrum(mel)
double melt]:
{
int n,k:
double a:

for (n-l: n<-MO: n++)

{
cln]-0.0:
for (k-l: k<-20: k++)

a-n*(k-0.S)*PI/20.0:
cln]-c[n]+mel[k]*cos(a):

53

}

/***i********ﬂt***i*itﬂit******it*tt*******t*******itﬁt*t***********t*t**itt*i

* This routine vector quantizes the cepstrum parameters with respect to the *
* given codebook. *
tttiiiittiiitiitiiiiiiti*itttiii*ititiitii**t*tttt*f*t***t*tfitttitttitttttttl

void vector_guantize()

{

int i,index1,index2,vq:

double idml,idm2:

double euclidean_dist_meas();

int level_index();

indexl - O:
indexZ - 1:
idml - euclidean_dist_meas(codebook[indexl]):
idm2 - euclidean_dist_meas(codebooklindexZJ);
if( idml > idm2 )

indexl - index2:

for(i-1: i<NLEVELS: i++)

indexl - (indexl - level_index(i))*2 + level_index(i+l):
index2 - indexl + 1:
idml - euclidean_dist_meas(codebook[indexl]):
idm2 - euclidean_dist_meas(codebooinndexZ));
if( idml > idm2 )
indexl - index2:

- }
vq - indexl - LEVELINDX:
fprintfioutfile,"$d\n",vq):
count-count+l:

l

/*iﬁ**t****ﬁﬁ**iitﬁf***i*ﬁifﬁitii*iittiiiﬂifiﬁiittitiii**t***9i****f****titit!

* This routine calculates which vector to compare next in the codebook once *
* a vector index in the previous level is given , *
**********ﬁ******i***tttttittwttittt*t**t*********tt**w****t*************t**t/

int level_index(k)

int k:

{

int num:

num - (int)pow((double)2,(double)k) - 2:

return num:

}

/tiit****ti*ii'fi****t****i***i******t*****i*i*****t*t**tt****i*******tt******

* This routine calculates the Euclidean Distance between the cepstrum *

* parameters and the one in the codebook. *
*tttttttittit!*itttiti*****t**ti*t*t******t**i***t********************tttittt/

double euclidean_dist;meas(array)
double arrayll:

{
int 1:
double idm:

idm-0.0:
for (i-O: i<MO: i++)
idm-idm+(c[i+l]—array[i])*(c[i+1]-array[i]):
return idm:
l

/t*t*******t********iﬁttiﬁﬁttittiiﬁtﬁiititﬂ****************ttiiitﬁtt****t****i/

void codebook_entry(i

54

{
FILE *infile3:

int i,j,k,m:
char input11[80],input12[801:
void extractwordi):

infile3 - fopen(CODEFILE,"r"):

printf("\nReading %s\n",CODEFILE):
m-O:
forii-l: i<-NLEVELS: i++)
{
fgets(inputll,80,infile3):
printf("%s“,input11):
for(j-o; j<(int)pow((doub1e)2,(double)i); j++)
I
fgets(input11,67,infile3):
fgets(input12,80,infile3):
extractword(inputll,input12,m):
m++ :

}

fclose(infile3):
i

/*t******************************itit************t*********ttI

void extractword<in1,in2,m)
char inl[80],in2[80]:
int m:
{
int i,j,k:
char templIBO],temp2l30],temp3[101,temp4[30]r
fori jéo: j<30: j++)
{

templ[j] - ’\0’:
temp2[j] - '\0':
temp4[j] - '\0':
}
fori j-O: j<10: j++i

temp3[j] - '\0’:
}
forii-O: i<6: i++i

{
k-O:
for(j-i*11: j<ii+1)*ll: j++i

{
templlk] - inlij]:
k++:

}
codebookim][i] - atofitempl):

}
forii-O: i<4: i++i

{
k-O:
for( j-i*1l: j<(i+1)*11+1: j++l

{
temp4[k] - in2[jl:

k++:
i
codebookim][i+6] - atof(temp4);
}
}
/itit**i********t**t*t*****ititit*********t*tt*******ﬁ**t**i*tittttttittttittt

* This routine computes the MEL-frequencies. *
1*ittitttttttttttttttiittittititt*iii!*tttttitittttitittttttttiiiititttittﬁtt/

55

void mel_freq()

{
double interval,scale,mel_f:
int i:

interval-10000.0/FFT:
scale-(loglO(5000.0)-3)/11.0:
freq[O]-O:
for (i-l: i<ll: i++l
{
mel_f-100.0*(i):
if ( fmod(mel_f,interval) < interval/2 )
freqlil-floor(me1_f/interval):
else
freqlil-floor(mel_f/interval)+1:
mel_f-3.0+scale*i:
mel_f-pow(10.0,me1_f):
if ( fmod(mel_f,interval) < interval/2 )
freq[i+10]-floor(mel_f/interval):
else
freq[i+lO]-floor(mel_f/interva1)+1:
l
mel_f-3.0+scale*ll:
mel_f-pow(10.0,mel_f):
if ( fmod(me1_f,interval) < interval/2 )
frquZIJ-floor(mel_f/interval):
else
freq[21]-floor(mel_f/interval)+1:

56

D Program Listing: Codebook Generating Pro-
gram

This program produces a seven-level, binary tree codebook for cepstral parameters.
The input of this program is a large ﬁle which consists of cepstral parameters of all
the words spoken by one speaker in the TI-46 or Grandfather word list.

57

#include <stdio.h>
tinclude <ctype.h>
#include <string.h>
#include <math.h>

#define MO 10 /* Model Order of Cepstrum */

#define NLEVELS 7 /* number of levels in binary codebook */
#define SYMBOL 128 /* 22NLEVELS */ '

#define CEP 75000 /* number of cepstrum parameters */

int groupiCEP][2],change:

long total_count,counta,countb:
double table[CEP][MO]:

double centroid[NLEVELS+1][SYMBOL][MO]:
FILE *outfile:)

/*i‘ttﬂ’ti’t*******i************************t'k'k'k'ktti'kii‘kti****t********ii’i‘ii*titt

* Program name: cdbkgen.c *
* Command : cdbkgen *
* Description : generate a N-level codebook by using the cepstral analysis *
* Date : August 14, 1990 *

**t*****t********i*******t*****ﬁ**************iti’tﬂtti'ﬁittitiiiii‘ktt*tit‘kttii/

main()
{ .
int i,j,k,level,nt:
int aa,bb,cc:
long m:
long readcode():
double distance():
void separatei):
void compute_centroid():
void perturb():

outfile - fopen(”codebd26.dat","w"):

total_count - readcode():

printf("total_count-%1d\n",total_count):

for (m-O: m<total_count: m++i
stompiml[Oi-quUlellll-O:

compute_centroid(0,0,0):

printf("first centroid is %10.6f\n",centroid[0][0][0]):

for(level-O: level<NLEVELS: level++)

l
\

for(i-0: i<(int)pow((double)2,(double)level): i++)

perturb(level,i,centroid[level][1]):
nt-O:
do
{
change-0:
separate(level,i);
nt++:
compute_centroid(level+1,2*i,i):
compute_centroid(level+l,2*i+1,i):
printf(”level-%d,symbol-td,iteration-td,counta-%ld,countb-%ld\n",levelp
} while (change-l):
i
for (m-O; m<total_count: m++i
} groupimliol-QIOUPimllll:
for(i-1:i<-NLEVELS:1++)
{
fprintf(outfile,”level %u ....... \n",i):
for(j-0: j<(int)pow((double)2,(doubleii): j++)

for(k-0:k<MO:k++)
fprintfioutfile,"% 10.6f ",centroidIi][j][k]):

58

fprintf(outfile,"\n");
}

i
fclose(outfile): \

}

/t*******************i*******iﬂit*iit*tti******i#*****tfﬂt*ttii***tittttti*t

* This procedure reads the cepstral parameters file and put these vectors *
* into an array. *
*ttttitttitfﬁtttitt**t**tt**t**tttttitit**********t*i*t**tt**t**tt*ttt*tttt/
long readcode()
{
FILE *infile:
int j,k:
long 1:
char code[20]:

infile - fopen("bd26.dat","r");
i-O:
while (feof(infile)--O)

{
for (j-O: j<MO: j++)
{

fscanf(infile,"%s",code):
table[i][j]-atof(code):

1

i++:

fscanf(infile,”\n");

l
fcloseiinfile):
return i:

}

/iﬁ*****i*ﬁ****i******iﬁ*******itii*****ittiiittitttﬂ******ﬂ***i**************

* This procedure computes the centroid in a cluster *
*itttttiittitt*tttttmitttittttit*t*tttttttttiittit*tiittttittiﬁtittrittitiitt/

void compute_centroid(level,symbol,now)
int level,symbol,now:
{

int 1:

long j,k:

k-O:
for (i-O: i<MO: i++)
l
centroidllevel][symbol][i]-0.0:
for (j-O; j<total_count: j++i
if ((groupljlll] - symbol) at (group[j][0]--now))
{

if (i-Oi
k-k+1:
centroidilevel][symbol][ii-centroidilevel][symbol][i]+table[j][i]:

}
centroidilevel][symbol][i]-centroid[level][symbol][i]/k:
}

/*****t***t***tttiiit********it*************ttiti*******************t******tit

* This procedure splits the centroid into 2 vectors *
*itt*ﬁtttitttt***tiiit*tti*tttt**t*****ﬁ**ﬁt*tttttttiittttttittitiittttt*titt/
void perturb(1evel,symbol,vectora)
int level:
int symbol:
double vectora[MO]:

i
register int j:

59

for (j-O: j<MO: j++)

{
centroid[leve1+1][symbol*2][j]-vectora[j]*l.01:
centroid[level+l][symbol*2+l][jJ-vectoratj]*0.99:

/i************t****i***********it*********************************************

* This procedure separates one group into 2 clusters *
*t********ﬁ*********it*****************************i****************t********/
void separate(level,symbol)
int level,symbol:
{
long i:
double dist1,dist2:
double distance():

counta-countb-O:
for (i-O: i<total_count: i++)
{

if (group[i][O]-symbol)

{

distl-distance(table[i],centroidtlevel+l][2*symboli);
dist2-distance(table[i],centroidtlevel+1][2*symbol+1]):
if ( distl < dist2 )
f
if (group[i][l] !- 2*symbol)
change-1:
groupti][1]-2*symbol:
counta-counta+l:
}
else
{
if (group[i][1] !- 2*symbol+l)
change-1:
group[i][l]-2*symbol+l:
countb-countb+1:

}

/ititit****************tiff*iti***************************t**t****************

* This procedure computes the Euclidean distance *
tﬂ**********************ﬁ**1t**************t*********************************/
double distance(vector1,vector2)
double vector1[1,vector2[]:
{
int 1:
double dist:

dist-0.0:

for (i-O: i<MO: i++)
dist-dist+(vectorl[i]~vector2[i])*(vector1[i]-vector2[i]):

return dist:

60

"Tillﬂllﬂlllﬁtﬂﬂllllﬂlﬂlﬂ