.‘.

.
.._.
,.,..=if

....
3.2:

.. q- “wry-u‘

.,.....4.

‘3‘ «.v

-‘o

1...,
_..:...
V. . f

I.
i.

...:
..:

.._..._‘
........

_......
V... ...
..:.._..‘.

_ . ..... ._..
.._ ,,: 1. :o .1
‘ <..~‘ ,.._, _~:.....‘r~,~..~...~v.._k
...:.:_....;.....: .1 .

 

.. . 41:; vx

 

Lth ‘

 

This is to certify that the
thesis entitled
Automatic Speech Recognition Based On
A New Segmentation Procedure

presented by

Earl J. Craighill

has been accepted towards fulﬁllment
of the requirements for

___P_hL degree in _EE_

W M“ ‘ \4ALW‘

Major professor

 

Date 1/13//7/

07639

 

MEN—$549

 

 

 

 

ABSTRACT

AUTOMATIC SPEECH RECOGNITION BASED ON
A NEW SEGMENTATION PROCEDURE

By

Earl J. Craighill

A procedure for segmentation of an acoustical speech signal is cru-
cial to the design of any system for automatic Speech recognition (ASR),
yet no adequate scheme currently exists. This study proposes and inves—
tigates the implementation of a procedure for segmenting input in the form
of connected speech from divers speakers using unlimited vocabularies.

A segmentation procedure which assigns linguistic elements, such as
phonemes, to contiguous acoustical signal intervals would be hopelessly
complex because of the many-to~many correspondence between currently used
linguistic elements and portions of the acoustical signal. Instead, we
propose a method for dividing the acoustical signal into analysis epochs
with minimal linguistic specification so that they are independent of
speaker and context.

Each epoch is defined by homogeneous signal characteristics; that is,
a generation model is identified with associated parameters, and nonlinear
time—varying differential equations are derived for these parameters. The
equations are used to track the parameter values, and an epoch boundary is
set at the point where they no longer predict (within a threshold) the
characteristics of the observed speech signal. From the functional forms

of the differential equations, we derive further processing algorithms

I"
I'
I

Earl J. Craighill

(analogous to data—dependent adaptive filters) for each epoch. Identi—
fication of the functional forms gives a gross linguistic classification
which forms the basis for classification of the epoch.

The differential equations are characterized in terms of sliding
moment averages of envelope and zero-crossing estimates on bandpass—
filtered speech signals. This method of estimation is amenable to low-
cost hardware implementation and requires few computations; thus, connected
speech may be analyzed in real time without overloading a standard general—
purpose computer. Asynchronous, real-time classification is achieved by
decomposition of the decision algorithm by a process similar to that used
in Kilmer's model of the reticular formation.

Overlapping bandpass filters are used to give an initial separation
of acoustical features. Experimental evidence shows how this reduces
the speaker dependence of further acoustical measurements. A decision
logic structure is specified and discussed, showing that it is possible
to select appropriate preprocessing procedures to focus attention on
significant features of an acoustical signal epoch and to accentuate sig—
nal characteristics closely correlated with linguistic features. This
preprocessing, when coupled with the syntactical structures developed
from theoretical linguistics, is hopefully a first step in recognizing

human connected speech from different speakers.

AUTOMATIC SPEECH RECOGNITION
BASED ON A NEW SEGMENTATION PROCEDURE

By

Earl John Craighill

A THESIS
Submitted to
Michigan State University

in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1971

 

ACKNOWLEDGEMENTS

The Author wishes to thank Professor William Kilmer for his continuing
support, guidance and patience during the preparation of this thesis. These
qualities were shown both in and out of the classroom and are deeply appre-
ciated.

For their constructive comments and evaluation of this thesis, the
Author wishes to acknowledge the other members of his doctoral committee:

R. C. Dubes, T. Guinn, C. L. Park, R. F. Reid, and H. Salehi and also
colleagues at Stanford Research Institute: W. F. Foy and W. P. Rupert.

The research was supported at Michigan State University under Air
Force Contracts No. Af—AFOSR—1023—66,67,68 and at Stanford Research Insti-
tute under IR&D Project No. 656531-329.

Without the support and encouragement of many people, this thesis would
not have been finished. A few of these people are: my supervisor, D. F.
Babcock; secretaries, A. Guinn, S. Peterson and K. Spence; my mother and

father and my wife Karilyn.

TABLE OF CONTENTS

 

Page
LIST OF TABLES v
LIST OF FIGURES vi
I INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . l
A. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 1
B. The Structure and Interrelations of Acoustical Features

in Human Speech Signals . . . . . . . . . . . . . . . . . . 7

C. Segmentation of the Acoustic Speech Signal into Analysis
Epochs . . . . . . . . . . . . . . . . . . . . . . . . . . 17
D. Preprocessing of the Acoustical Speech Signal . . . . . . . 20
E. Decomposition of Pattern Recognition Algorithms . . . . . . 39

II REPRESENTATION OF TIME—VARYING SIGNALS . . . . . . . . . . . . . 47

A. Analytic Signals . . . . . . . . . . . . . . . . . . . . . . 47
B Sliding Fourier Series . . . . . . . . . . . . . . . . . . . 57
C. Response of Linear Filters to Analytic Signals . . . . . . . 65
D Estimation and Segmentation of Instantaneous Signal

Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 85

III THE USE OF LINGUISTIC THEORY FOR THE DECODING OF SPEECH

ACOUSTICAL SIGNALS . . . . . . . . . . . . . . . . . . . . . . . 109
A. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 109
B. Stratification Model for Generative Phonology . . . . . . . 121
C. Recognition Phonology . . . . . . . . . . . . . . . . . . . 130
IV RECOGNITION STRUCTURES FOR REAL—TIME SPEECH PROCESSING . . . . . 138
A. Reduction of Dimensionality Using Bayes' Formulation . . . . 138
B. Quasi-Independent Probability Distributions . . . . . . . . 146
C. Specification of First—Level Decision Structure . . . . . . 152
D. Proposed First—Level Recognition Block Diagram . . . . . . . 159
V CONCLUSIONS AND RECOMMENDATIONS FOR FURTHER STUDY . . . . . . . 166
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . 170

iii

APPENDIX A

APPENDIX B

APPENDIX C

APPENDIX D

APPENDIX E

Description of Sapir's Pseudo-Language

Recording Apparatus Used to Collect Experimental Data

A Program for TIMSER Interactive Analysis of Time Series
Sliding Power Spectra Showing Vowel Transition

Instantaneous Estimators of Time—Varying Parameters

iv

181

185

192

LIST OF TABLES

Table E-l Envelope Derivative Chebyshev Weighted Errors Using 221
Hilbert Envelope Estimators

Table E—2 Envelope Derivative Chebyshev Weighted Errors Using 222
Absolute Value Estimator

Table E—3 Effects of Frequency Change on Envelope Derivative 223
Estimators—l ms Subinterval

Figure

1

9a, 9b

10

11

12

13

14

16

LIST OF FIGURES

Title

Typical ASR System Based on Discrete Encoding
Model

Sonogram of English Word, Rudder,.Showing High
Value of Frequency Derivative

Speech Acoustical Signal Showing Short Transient
Phenomena

Consistent Time Waveforms for Several Speakers
from Different Bandpass Filters

Schematic of Multifilter Recognition Logic
Quasi—Statistical Formulation of Local PR Algorithm

Short Transient Phenomenon Which is Difficult to
Analyze with Fourier Series

Idealized Fourier Coefficient Response to Varying
Frequency Input

Magnitude of Fourier Coefficient Outputs for Time—
Varying Frequency Input —— Actual and Quasi—
Stationary Terms Using Instantaneous Frequency of

Input

Formant Envelope and Frequency Transition Causing
Delay Distortion

Correspondence of Z-Plane Spirals and S—Plane Lines
for the Chirp Z-Transform (from Rabiner, Schafer,

and Rader)

Time~Varying Filter for Formant Parameter
Estimation

Bandwidth Requirements for Large Frequency
Derivatives

Estimation Procedure for Time—Varying Parameters
for Bandpass Filtered Speech Signals

Representations of Bandpass Filtered Speech Signals

Smoothed Differentiator Transfer Function

vi

Page

28

30-31

32

39
44

60

63

73-74

75

79

83

89

91—93

95

Figure

17a,

18

19
20
21
22

23

24

25
26

27

8-2

3-3

C-l

17b

Title

Standard Deviation Versus Mean for Envelope
(Lower) and Frequency (Upper)

Segmentation Results Shown with Bandwidth
Estimators [dhuath] (male)

Segmentation Results for [umbif] (female)
Formal Language Model

Level and Ranks of a Generative Phonology
Relationship of Units Within a Stratum

Composition Rules and Example of Their Application
on Morph— Stratum

Several Linguistic Phenomena Described by Alter—
nation Rules

Model of Reco—Generative Phonology
Recognition System Without Feedback

Recognition of Vowels from Normalized Second
Formant Information

Apparatus for Recording Speech Signals on Analog
Tape

Apparatus for Multiplexing and Digitizing Data
from Analog Tape

Overlapping Filter Bank

Operational Diagram of the TIMSER Ensemble
Dhuath 16 BE 1 Real—Time Wideband Signal
Dhuath 16 BE 1 Filter Bandwidth 458—1167 Hz
Dhuath 16 BE 1 Filter Bandwidth 1467—2917 Hz
Dhuath 16 BE 1 Filter Bandwidth 577-1867 Hz
Operations for Parameter Estimation

Two Envelope Derivative Estimators

Chebyshev Weighted Error for Envelope Derivative
Estimators as a Function of Sliding Average Length

vii

Page

97-98

100

127

128

133

182
182
184

186

197
201

205

218

224

Figure Title
E-4 Chebyshev Weighted Error for Envelope Derivative

Estimators as a Function of Sub-Interval Length

E-5 Chebyshev Weighted Error for Envelope Derivative
Estimators as a Function of Sliding Average

viii

Page

225

226

 

INTRODUCTION

I—A Overview

A procedure for segmentation of an acoustical signal is crucial to
the design of automatic speech recognition (ASR) systems. As yet, however,
no adequate procedure exists for real—time automatic recognition of con—
nected human speech from several speakers. Principles from communication
theory and linguistic theory must be incorporated in order to derive an
efficient segmentation procedure. The language of modern communication
theory, familiar to the electrical engineer, most appropriately describes
the input with which we are concerned. For this study, we limit the input
to connected phrases of naturally spoken human language that have been
transduced into time—varying analog voltages. The output of an ASR system,
usually in the form of a sequence of linguistic elementsf is generally
described in the framework of linguistic theory, primarily phonology.

At first glance, the goals of communication theory and phonology
(namely, an accurate description of the current state of the process,
acoustical signal, or sequential linguistic elements) seem to be com—
patible. However, when one considers the large number of variations in
the acoustical signal possible for any given linguistic element, the
situation becomes hopelessly complex. Many attempts have been made to

eliminate this variation and thereby preserve only the meaningful

*These linguistic elements may be phonemes, distinctive features or
words. We are specifically thinking of only one level of classifi—
cation rather than a composite process such as identification of
phonemes and then morphemes. Our recommendation for a first element
is smaller than the usual phoneme or distinctive feature.

 

relationships of the linguistic elements. Successful decoding of this
complex acoustical signal by human listeners involves at least the
application of knowledge acquired from previous experiences of hearing
and speaking natural language and the listener's expectation of what will
be said. Thus, at this level, the basic assumptions of engineering
communication theory are no longer valid, and there is no applicable
strong property of ergodicity.

The purpose of this thesis is to describe a segmentation procedure
which not only specifies basic units for recognition but also gives an
adequate description of the complicated speech acoustical signal. This
description is prescribed by the requirements of further linguistic
decoding (words, phrases, ...). Further, the segmentation procedure
that identifies lower units will direct the higher levels of decoding
so that the search space is kept within practical bounds. The segmen—
tation procedure requires three subsystems based on a parametric
generation model of the acoustical signal:

(1) Initial estimation of parameters.

(2) A classification based on parameter estimates for

signal types.
(3) Selection of appropriate time-varying filters operating
on the input to give refined parameter measurements.
The requirements of these diverse topics are discussed in terms of a
representation of the acoustical signal which is developed from the
viewpoint of time-varying differential operators. Its use in deriving
estimators and detecting initial changes in these estimators is verified

experimentally.

 

In the remaining sections of this chapter, currently used segmen—

tation procedures are discussed in light of the complex nature of the
information—bearing features present in the human speech acoustical signal.
A parallel interrelated feature structure is described that is capable
of recognizing a shift of the pertinent information from one feature

to another. Linguistic information is conveyed with respect to two
levels (the vowels of an utterance form a primary, and the consonants

are incorporated by perturbations of this primary substrate). In order
to unravel this complicated structure, broad classes of speech sounds
that represent different types of signal characteristics must be defined;
this classification can then be used to direct further analysis for
recognition. By this method, formant* theory is related to higher levels
of linguistic decoding. Various preprocessing schemes are considered
which are commonly applied to ASR'systems for the purpose of isolating
individual formants. To satisfy the requirement for real—time operation,
a preprocessing scheme is chosen which uses a bank of overlapping wide-
band filters (with sufficient bandwidth to avoid distortion) to remove
noise and to provide a compact representation of the salient features
required for the recognition task. Real—time operation requires decom—
position of the decision process resulting in fewer computations and a
recognition structure tailored to the complicated overlapping nature of

the speech signal.

 

a:
By a formant, we mean the resulting time waveform for one cavity of
the vocal tract excited by glottal pulses, or frication noise.

 

 

In Chapter Two, the acoustical properties of the Speech signal

are modeled as a composite nonstationary stochastic process and the
mathematics of communication theory are used formally to describe the
process's complicated nature. One isolated formant is modeled by a
time—varying differential operator involving envelope, frequency, and
bandwidth parameters. The inadequacies of fixed—frequency types of
analysis (such as sliding Fourier transforms) are discussed, and require—
ments for low—distortion filtering are derived. Then the transient
response of linear filters to envelope and frequency changes found in
typical acoustical signals is derived in a way that offers new insight
into the behavior of analysis procedures and defines requirements for
the preprocessing Wideband filters. Formulas for real—time pointwise
estimators of the significant parameters are derived, and a predictive
differential equation segmentation procedure is specified which will
specify epochs in the acoustical signal having homogeneous signal
characteristics.

In Chapter Three, this segmentation procedure is discussed within
the framework of traditional linguistic theories. The complicated
structure of human communications requires additional mechanisms

(1) To determine the linguistically significant changes

in signal parameters,and
(2) To incorporate contextual information into the decision
process (which, in turn, resolves ambiguities and
directs further classification).
Structural theories are modified to include recognition and to show

the effects of linguistic rules on lower elements (effects of stress

on vowels, etc.). The use of the segmentation recognition procedure
proposed here is basic to a feed forward system, thus eliminating compli—
cated feedback analysis—by—synthesis techniques.

In Chapter Four, the formant representation and segmentation
results allow application of state—of-the—art detection/recognition
techniques* to a restricted speech signal (without the complex inter—
relationships between features). Study of the Bayes minimum risk
solution reveals that the primary concept is a probability mixture
formula for the outputs of nonlinear estimation filters, each tailored
to a possible generating model for the input signal and (correlated)
noise. Several difficulties are noted for implementation of this opti—
mal solution: realization of the nonlinear filters, correlation between
different (suboptimum) filter outputs, and conflict between classifications
on different filter outputs.

It is concluded that a heuristic recognition scheme tailored more
to the filter bank used in this study would be a better choice. Tech—
niques are developed to reduce the dependence among the probabilities
computed on the different filter outputs.

A first—level recognition system which can operate asynchronously
in real time is described. A nonlinear iterative structure determines
‘which filters have pertinent formant information. Specialized algorithms
derived from linguistic rules are then applied to these filter outputs

to determine the needed information for classification of this particular

*
Section I—E contains a discussion of terminology that is used in this
study for the pattern recognition discussions.

signal epoch. The output is a classification which is compatible with
higher levels of linguistic analysis. A second stage with formant
tracking filters guided by the initial classification gives the ability
to focus attention on only the desired acoustical features. Thus, the
complex acoustical signal can be segmented in time into homogeneous
epochs and also concurrent features of varying frequency with well-defined
mathematical models and time—varying parameters.

A total system design incorporating this segmentation procedure
as a first step will facilitate the use of human speech as input to
machines for robot control, text manipulation, command and control of

space vehicles, and many other man/machine tasks.

I—B THE STRUCTURE AND INTERRELATIONS 0F ACOUSTICAL FEATURES-IN

HUMAN SPEECH SIGNALS

The object of an ASR system is to determine recurrent elements from
measurements made on acoustical speech signals. Figure 1 shows a composite
of several approaches to Automatic Speech Recognition based on the theo—
retical encoding of speech shown in the upper block. This theoretical
encoding is motivated by Hockett's1 disCussion of a GHQ (grammatical head—
quarters) emitting a discrete flow of morphemes which are encoded into a
discrete flow of phonemes. Then, a speech transmitter converts the dis—

crete flow of phonemes into a continuous speech signal.

The determination of parameter values for each idealized element

<3

is motivated by the following studies. Peterson and Barney measured
first and second formant frequencies of nine English vowels in a fixed

:5
consonantal context (the word h_____d). Gerstman re—worked their data,
normalizing for each speaker, showing a sufficient amount of separability
of the measurements for vowel classification (in a fixed context for
isolated words), The correspondence between a fixed frequency or hub
of origin and consonants was first proposed by Potter, Kopp, and Green.4
Classification of stop consonants by association with a frequency value
was modified by Cooper et alb and Yilmazi They proposed consistent
measurements for stop—consonant classification could be made relative to
the following vowel fonnant frequencies. The slurring box accounts for
perturbations (hopefully slight) of these parameter values caused by
environment and speaker variations.

The first step in recognition is a division of the acoustical speech

into time epochs. The segments studied may be separated by epochs (portions

WORD
I THEORETICAL ENCODING

 

 

ENCODING PROCESS

Q Q Q ° ' ' SEQUENCE OF ELEMENTS

EACH ELEMENT GIVE A SET

OF PARAMETER VALUES
[ SLURRING BOX

1

VALUES TO GIVE CONTINUOUS

J ALTERS SEQUENCE OF PARAMETER
CONTROL OF ARTICULATORS

 

 

I

ACOUSTICAL SPEECH SIGNAL

ASR SYSTEM

 

 

i

l SEGMENTATION ] SEGMENT TIME POINTS

 

LINEAR SEQUENCE OF EPOCHS

1 OF SIGNAL
7 7 EACH BLOCK IS ANALYZED AND
I - MEASUREMENTS OF PARAMETERS

. ARE MADE

[ ] MEASUREMENTS ARE COMPARED TO

 

 

TEMPLATES (STORED REF', ETC.)
FOR IDEALIZED ELEMENTS

E] D SEQUENCE OF IDEALIZED ELEMENTS

 

 

FIGURE 1 TYPICAL ASR SYSTEM BASED ON DISCRETE ENCODING MODEL

of signal) rather than points, as in the case of Reddy7 who analyzes
only steady—state portions (i.e., portions with constant values of
envelope and frequency) and ignores the transition portions between
them. The opposite approach is taken by Dixon et al.“ in their analysis
and segmentation procedure. They define a new element called the
transeme, which is a "dynamic segment describable on a production basis
as the transition from one relatively steady—state articulatory con—
figuration to another."

The criterion for segmentation and further analysis may not be
related to linguistic elements at all, as in the case of Gazdag.9 His
segmentation points are determined completely in terms of the measurement
procedure that he uses to analyze the speech waveform; hence they are
independent of any exterior linguistic criterion. ASR systems developed
along these lines have no ability to ignore Speaker and environment
variations or free phonetic variation; i.e., in midwest English, prevoicing
before [b] or [d] is optional. Usually a separate "case" (pattern
class) is set up for each; hence the success that these various ASR
systems have in isolated sound situations or in one—perSOn conversational
speech cannot easily be extended to connected conversational speech
for many speakers.

Harris10 has discussed the extremely difficult problem of trying
to define linguistic elements as direct descriptions of portions of
the flow of speech. He finds it convenient in his analysis to define
certain elements which extend over quite long periods and others which
extend over short periods. ”In the course of reducing our elements to

simpler cmnbinations of more fundamental elements, we set up entities

10

such as junctures and long components which can only with difficulty be
considered as variables directly representing any member of a class of
portions of the flow of speech." (p. 18) A similar formalization in the
early work of Fant, Jakobson et a1?1 describes distinctive features that
are parallel rather than serial descriptions of the acoustical waveform.
Extensions of this approach by Chomsky and Halle12 are discussed at the
end of this section. Bobrow, Klatt, and Hartley 1“ have proposed an ASR
system based on this idea and derived independent parallel features from
the acoustical signal and performed classification on those features.
Other ASR systems using independent features have been proposed by Hill14
is

and FOCht. Bobrow et aL discuss the difficulties of recognizing conver—
sational speech for divers speakers in terms of:

(1) Consistency of each speaker in repeating words

for training (giving rise to phonetic variation)
(2) Speaker—dependent variation in their measurements

(shifts in formant frequency location)

(3) Segmentation of longer utterances.

These difficulties are caused in part by the extremely complex
nature of parallel features and the interrelations between them. Ohmanlﬁ
has studied various vowel/consonant/vowel (VCV) combinations and has
stated that it is impossible to treat even these short utterances as
three successive gestures. It is possible to analyze them only by con—
sidering the stop—consonantal gesture as superimposed on the substratum

1’7
determined by the two vowels and the transition between them. Houde

 

has investigated this further by means of X—ray movies of the configu—
ration of the tongue during articulation. The dynamic trajectories of
points on the tongue during articulation of VchV nonsense words can

be decomposed into target—directed (targets are long duration steady—

state vowel positions) and deviation (900 to target—direction) components.*
Five facts are clear:

(1) The deviation component is characteristic of the con—
sonant ([b] and [g] were used).

(2) The characteristic deviation for [b] and [g] was not
toward a target or hub but rather a consistent defor—
mation of articulator (primarily tongue) configuration.

(3) Targets of preceding vowels are changed by the conse—
nant (i.e., I in Tlge] has a different steady state
position than I in [I b e J).

(4) Stress placement affects vowel target positions.

(5) Timing of target—directed component was dependent only
on distance between target positions and not on speed
of articulation, speaker or consonantal environment

for the limited data investigated.

We can discuss these results in a way more compatible with lin—
is
guistic theory by use of Lamb's concept of a medium as a most unrestricted

(or most predictable) form and then describe the pertinent features which

convey information as perturbations of that medium. He defines a phonetic

 

3|:
This decomposition is slightly different from Houde's, in order to
demonstrate the concept of overlapping features.

12

feature as distinctive if its presence is not determined by its environ—
ment. This idea may be extended to explain the Ohman and Houde data by
stating that the vowel—to—vowel transition is actually the medium for

the consonantal distinctions.

We should define acoustical features more generally than just
those defining linguistic events.* These acoustical features may be
classified as:

(l) Linguistic

(2) Speaker signature

(3) Speaker emotional state.
The interrelationship of all these features that are present simultane—
ously, preceding or following in time, may be correlated with the dominant
(distinctive) feature, but this correlation is usually situation (speaker,
context) dependent and thus can introduce much variation in determining
recurrent elements. It has been pointed out by Harris that time of start
and stop of different acoustical features may not be coincident in time.
Thomas1U suggests that a Speaker is able to adjust only one formant fre—
quency; other frequencies are allowed to fall where they may. He states
further that this formant is always the second, but the data presented by
Ohman does not support this. Rupert20 has studied isolated words spoken

by three males and two females; he suggested that:

 

*By "linguistic" we mean the specific content of the Speech waveform

that is being used to communicate a discourse or text. For the purposes
of man/machine communication, this definition will be sufficient. We

do not wish to get into a discussion of various gestures, intonations,
etc., which can also convey information.

 

13

(1) Each speaker does consistently control at least one
acoustical (linguistic) feature which is usually less
than the entire acoustical signal (i.e., one or two
formants).

(2) Although the controlled feature(s) (say, second for—
mant) may not be the same in absolute value for all
speakers, the time patterns are similar and can be

identified by their recurrent nature.

(3) There is a high degree of recurrence across speakers
of these controlled features.

(4) Other acoustical features (may be correlated with
linguistic) that occur vary considerably according to

speaker, phonetic environment, etc.

Ohman has proposed a motor—control model to partially explain his
data as saying that for a VcV sound there are independent signals (or
parameters in our theoretical model) for the first vowel, the consonant
and the second vowel. The various muscles work in a coordinated fashion
to produce continuous changes in articulatory configuration. This approach
has actually been used to some extent in the work of Reddy. He first
classifies his segments into phoneme classes (vowel, fricative, stop,
nasal, liquid) and then performs a specialized analysis on each segment
which is directed by the phoneme class label.

Based on this discussion we formulate the following premises about
a feature description of the speech signal:

(1) Only a subset of the acoustical features present in a

time epoch of speech are linguistically significant;

 

(2)

(3)

(4)

14

this subset can be recognized by the precise repeatable
nature of its members. We do not mean precise values
(formant frequencies : 500 hz, 1500 hz, and 2400 hz)
but rather, precise time behavior within physical
(motor control) and linguistic* constraints.

Epochs of the acoustical signal can be equivalenced

to classes determined by a subset of linguistic acous—
tical features. These classes can be defined(by

the choice of the subset of features) in such a way
that they are situation (context, speaker) indepen-
dent. Roughly, the class labels are a generalization
of the consonant, vowel labels used by linguists and
also a refinement of Reddy's phoneme classes and
Rupert's production modes (PM's).

Further feature analysis is simplified considerably,
and a more precise syllable (canonical form) analysis
can be performed by a directed—search technique based
on the above classification. This removes the inherent
circularity in many classification schemes involving
normalization (analogous to the Visual recognition
problem of finding an object of interest to focus on
while it is out of focus).

Once vowel (peak of syllable —— Hockett) classes are
specified, they set up a primary formant transition

structure.

 

* ..
As noted by Ohman, consonantal variations of formant transitions are
different for Russian speakers than for English speakers.

(5)

(6)

15

Consonantal modifications are with respect to the
primary formant structure and hence will be termed
secondary.

There is interaction between primary and secondary
acoustical features, but the class labels can be

assigned independent of this interaction.

The concept of precisely controlled features determined by phonetic

environment at first appears similar to the distinctive features matrices

proposed by Chomsky and Halle as the final linguistic idealized description

of the speech waveform. However, there are two crucial distinctions:

(l)

(2)

Significant features are chosen, and other (redundant)
features are eliminated based on the simplicity of
description and reduction of logical complexity in the
encoding process. In speech recognition, the human

is generally unaware of mathematical formulations when
he is learning to speak; hence, the features he selects
to emphasize and control precisely are chosen for
communication with another human being and immunity

to noise for that communication. Hence, an ASR system
must determine the precisely controlled features that
are present rather than formulate hypotheses about
which ones would be easiest to analyze if they were
present.

Their concept of opposition is with respect to elements
that can occupy the same time epoch (minimal pair).

This involves a comparison of definite (albeit

 

situation—dependent) measurements of the present input
with some representative set of measurements for the
opposing element. Many investigators have noted the
difficulty in this approach (Hemdal and Hughesgl).

The relative opposition concept of Rupert and Yilmaz
does not have this difficulty, because a time epoch is
compared to the preceding and successive epochs for
its relevant opposition measurements. Hence, normali—

zation becomes less of a problem.

In the following sections, we will expand these premises and show experi—
mental evidence indicating a different description of the acoustical
speech signal is necessary for an ASR system which more accurately

measures timing and frequency characteristics.

 

I—C SEGMENTATION OF THE ACOUSTIC SPEECH SIGNAL INTO ANALYSIS EPOCHS

The optimistic goal of some segmentation procedures is to define
time points and the acoustical signal such that the resulting sequence
of signal epochs will correspond to a sequence of idealized linguistic
elements. One then simply decides which linguistic elements each
epoch is most like. In the previous section we discussed this approach
and the resulting difficulties, especially in conversational speech
involving long phrases. Bobrow et a1. state that the purpose of seg—
mentation should be a selection of appropriate measurements to be
made, dependent on the phonetic context. Reddy's phoneme classes are
directive in the sense that they select appropriate decision procedures
to be uSed in analyzing each of his segments. We are thus led to a
procedure that will define time boundaries and also prescribe a par—
ticular type of analysis to be performed between these time boundaries.
The resulting epochs may not necessarily correSpond one-to—one to the
final sequence of linguistic elements. As an example, we might consider
a word such as ”back" spelled phonetically [b a k] that has been modified
by tape cutting at the beginning and the end to remove all noise bursts
related to the consonants. The resulting acoustical signal would contain
only a vowel—like portion, and only two time boundaries would occur at
the beginning and end of this epoch. However, if the tape cutting has
not been too severe, a person would still perceive the entire word; hence,
further analysis should determine from the transitions that the generating
sequence of linguistic elements is more like three: consonant/vowel/

censonant, rather than one vowel.

17

 

 

18

A segmentation procedure should also identify the significant
controlled acoustical feature within the time boundaries. Rupert
discusses how this reduces the variability induced by situation—dependent
acoustical features. This would amount to attention focusing that
includes as a special case formant tracking. By ignoring all but the
distinctive controlled features, a large amount of noise rejection can
be accomplished. Further segmentation need not be impaired by this
attention focusing, because, as proposed by Rupert, it should be the
precisely controlled features that govern the segmentation. However,
the beginning of new features outside the area of attention must be able
to "capture" the recognition choice so that a feature does not dominate
long after it has ceased being significant.

The object of our segmentation procedure, to act as a direction
for analysis, must then be able to isolate homogeneous epochs of signal,
since in order to make reliable measurements we must have a tailored
measurement algorithm (i.e., it is extremely difficult to track a for—
mant during a fricative or noise—like portion of the acoustical signal,

Thomaslg).

This suggests a representation of the acoustical waveform
that shows isolated acoustical features and gives an adequate description
of the signal properties so that segmentation and class identification
can be performed.

The concept of homogeneous segments must be augmented somewhat
because of the special nature of speech signals. In order to analyze
a generalized acoustical signal generated by a complex scheme, as in

human speech, one could use standard communication theory techniques

of identifying a state model for each epoch (i.e., a set of differential

 

19

equations, n—degree polynomial fit, etc.) and then say the epoch has

physical homogeneity as long as the model is valid. Then the switching

times or segmentation points will correspond to changes in models. We
must also consider linguistic homogeneity as discussed previously;
there are several portions (acoustical features) of the total speech
signal which are not linguistically significant. Therefore, the homo—
geneous property is with respect to both the physical measurements

of the signal and the linguistic significance of these measurements.

 

 

 

 

I-D PREPROCESSING OF THE ACOUSTICAL SPEECH SIGNAL

Preprocess1ng of acoustical speech signals, when inspired by modern
communication theory techniques, has been dictated more by what is avail—
able rather than by what is appropriate. Researchers have attempted to
justify application of existing techniques by analogy with color (light
frequency) perception (Yilmaz) or human perceptual experiments. The
former approach can be though of as looking at the world through rose—
colored (harmonic) glasses. The latter technique must be used with
caution, since the capabilities of the human brain are not available
in an ASR system.

The complicated nature of speech signals involves a predominant
pitch frequency, which does not contain linguistic information (at a
lower unit level), plus several components with time—varying frequencies.
An acceptable analysis is possible but requires much computation (Schafer
and Rabiner 2). A real—time ASR system intended to make efficient use
of a machine cannot afford this luxury. The problem involves more than
waiting for a faster computer or a trickier algorithm when one wants to
recognize connected speech from several speakers. In this section we
will discuss the complicated nature of human speech signals and form a
basis for specification of a preprocessing scheme tailored to the nature
of ASR requirements.

The primary goal of preprocessing is to specify a transformation
(filtering) which will: (1) remove noise (including other, confounding
features of speech as discussed in the previous section); and (2) provide
a compact representation of the salient features required for the recog—

nition task. We cannot expect a straightforward application of standard

20

21

techniques based on homogeneous models* to achieve these goals. The
generation of the acoustical speech signal is best modeled as a composite
stochastic process (that is, a heterogeneous mixture of several interdepen—
dent time—varying systems). In addition, experiments measuring human
perception of acoustical events indicate that man's ability to discriminate
frequency is more acute than his perception of differences in intensity
(Flanaganfﬂ' We will show that the commonly used filtering techniques

have poor frequency resolution, which adversely affects ASR system per—
formance in natural human conversation.

If we assume that the signal is generated by a homogeneous process,
the most efficient transformation would match this generation process, as
attempted by Weiner—Hopf or Karhunen—Loeveg5 filtering. The difficulty
(and success) in using these methods depends on the initial selection of
the representation criterion and representation constraints.

The formation of the input signal minimizes, according to the chosen
criterion, the differences between the output and an idealized signal. The
criterion chosen has a considerable effect on the final form of the filter.
There are many problems in which the mean squared error formulation is
required in order to obtain any useful mathematical results. However,
another criterion may be better suited to a particular estimation problem.
For example, a filter designed for minimum mean squared error would be
used successfully in the case of a stochastic signal (fricative), where
the mean value and bandwidth of the frequency energy distribution are

sufficient statistics. On the other hand, in the case of a vowel formant

 

One characterization of a homogeneous process is a set of differential
equations of a prescribed form with (time—varying) parameters and a
fixed forcing function.

 

22

the peak of the frequency energy distribution is much more important than
the mean value, necessitating a maximal likelihood criterion. Thus,even
assuming that we can apply the more sophisticated techniques of communication
theory to the speech preprocessing problem, we will generally need more than
one "optimumH filter for a speech signal because of the changing nature of
the speech acoustical signal.

The set of all possible inputs must be limited (by the filtering
operation) in order to achieve rejection of noise and unwanted signals.
This "allowable" subset is usually defined by a set of constraints (dif-
ferential equation in the Kalmanr}formulation). Along with providing rejec—
tion capabilities, this would make the recognition problem easier by limiting
the search space. However, the set of constraint equations, in order to
be useful, must be a very accurate description of the instantaneous (rather
than some average) "state" of the speech signal, implying that the classifi—
cation must be known in advance in order to perform the preprocessing trans—

27

formation. Halle has proposed a feedback type ASR system (analysis by
synthesis) to perform this circular classification. However, in view of
the large number of computations implied by such a procedure and the pre—
vious discussion of the nature of the speech signal, we would propose the
following: At the marking of a change in the speech signal decide which
of several classes the new epoch belongs to and which "portion" of the
total signal energy contains the significant information. Then, tailor
a "filter" to this portion and perform the required transformation for as
long as the desired features remain in the signal (determined by observing

the results of the transformation).

 

23

We have already discussed how different criteria lead to several
filters or transformations. Also, the parallel nature of the acoustic
feature in a Speech acoustical signal indicates multifiltering as a first

step. We can summarize some of the requirements of a multifiltering

pre—processing to remove noise and unwanted signals.

(1) Simile — Preservation of the necessary characteristics of
a selected portion of the total acoustical signal. The
subspace resulting from the filter transformation should,
at this stage, preserve the input's characteristics (for
instance, if the filter were a bandpass, time—invariant
filter, this criterion would require preservation of the
amplitude and phase relationships of the input within
the 3 dB bandwidth of the filter).

(2) Rejection - Removal of extraneous acoustical characteristics,
including background noise and other speech features, such
as other formants or the pitch component (for bandpass
filters, this would require extremely good attentuation
outside the 3 dB bandwidth).

(3) Continuity _ At least one of the filters should contain a
feature throughout its duration (for bandpass filters with
a vowel glide of the second formant in the input signal,
that extends from 1400 Hz to 2800 Hz, at least one of the
bandpass filters should have 3 dB bandwidth to encompass
this range). This is desirable because we do not want
artifact boundaries particular to a specific set of

filters introduced when a feature transverses filter

24

boundaries. If this condition is not satisfied, a much
more complicated decision network must be used to eli—
minate these artifact boundaries.

Further complications arise because of the wide frequency range,
extending over many octaves, and the extreme variations in amplitude. Five
contiguous l/l octave filters are required to cover the intelligible range
of speech (one more if high—quality speech transmission is required), and
the amplitude ranges over 120 dB with short—term variations on the order
of 20—30 dB. One of the most popular instruments for displaying and repre-
senting speech signals is the sonagram, a 2-dimensional graphical display
of frequency versus time, with intensity indicated by shading on the dis—
play. It has been shown that the sonagram is a physical approximation of
the generalized sliding Fourier series (Lernenj ), that is, a Fourier series
computed over a time interval that is stepped along the acoustical signal.
The difficulties in analyzing speech can be discussed in terms of the
sliding Fourier series and the parameters involved. First, the length of
the interval over which the series coefficients are computed must be
greater than the period of the lowest frequency component of interest.
Measurement of formant frequencies is further complicated during vowel—
like portions by the pitch frequency (proportional to the repetition rate
of the glottal pulses). The range of these pitch frequencies is from 80
to 400 Hz. The time period over which the Fourier series coefficients are
computed must be greater than the pitch period (say two or three times the
largest, «125—30 ms), or a great deal of variation will occur depending on

*
the phase of the pitch frequency . Thus, there is a lower bound on

 

*
The ideal situation would be to synchronize the Fourier series computa—

tion period with the pitch periods. This requires a pitch detector and
a device to decide on presence of pitch periods. The resulting frequency
resolution is still on the order of the pitch frequency.

 

frequency resolution on the order of the pitch frequency. Sliding Fourier

power spectra for both Wideband (65—6500 Hz) and bandpass filtered vowel

glides are shown in Appendix D. The irregular form of the spectra is due
to the pitch component. Also, the high power of this component relative
to higher frequency components (which carry the linguistic information)
requires a significant dynamic range (50 dB is shown in Fig. D-l); even
then, formant frequencies are difficult to identify. It would be expected
that bandpass filtering should isolate these peaks, as is seen in Figure D—2.
However, we should note that there are several problems that still are not
solved:
(1) When two energy peaks are in the same filter, a decision
must be made as to which peak corresponds to a formant
and whether the other peak is simply a harmonic of the
pitch frequency or a second formant. Ideally, it would
be nice to treat one formant in every filter; however,
this is overly optimistic.
(2) Measurement Resolution ~ This is possibly a special
case of (l) in that the measurement scheme (sliding
Fourier series, for instance) has a certain resolution;
i.e., a certain minimum distance must be present between
two peaks for them to he recognized as two separate peaks.
The problem that can occur here is that different speakers
may have different spacing, so that for one a two-formant
"sound" may appear as a broad single peak while for the
other the same "sound" will appear as two close narrow
peaks.
(3) Frequency Glides (large values of derivatives of frequency)

that move in and out of filters and across filter boundaries.

 

26

The ideal approach, of course, is to treat a feature
as a continuous event, independent of the filter band—
widths, so that artifacts would not be introduced.
(4) Correlation of Formants in Adjoining Filters — Since
the filters are overlapping, the formant could be
present in two filters; important types of information
29
may be found by comparing adjoining filters (Hanne ).
(5) Requirements (2) and (3) above are actually contradic—
tory and cause, in the case of bandpass filters, a situ—
ation where in order to contain a formant glide within
one filter, the bandwidth would be entirely too wide
for adequate rejection and for emphasis of the many
types of speech features encountered.
(6) The effects of the pitch component are not completely
removed by 25—30 ms computations, time window tapering,
or bandpass filtering as has been suggested by researchers.
These problems for bandpass filters, or, as has been shown by
22
Schafer and Rabiner, for even more sophisticated types of frequency
analysis, are caused by the inappropriate nature of any fixed—frequency
type of analysis for speech processing. The criteria for using such
analysis on (1) steady—state phenomena, such as constant vowels or nasals,
(2) vowel glides (great changes in frequency of formants) and (3) noise—
like signals, very quick , random transient—type phenomena, are
in general quite incompatible. Further, it has been shown by Hanne that
for several measurement schemes, the estimation of formant frequencies
(natural modes) of the acoustical signal approaches a harmonic of the pitch

frequency rather than the true'value.

27
30

A recent article by Lecours and Sparkes has indicated that narrow—
band filters enhance the frequency pattern of vowels, whereas Wideband
filters more accurately show the transient time behavior of stop conso—
nants (rapid envelope onset —— a fact well known to users of sonagraphs).
Hanne has pursued this prefiltering idea further with a more sophisticated
system of overlapping filters to estimate first formant frequencies within
3 percent. Flanagan'ska’ml study indicates that this approach is closer
to the frequency estimation error in human recognition. Thomaslv has also
used Wideband filters to emphasize frequency regions to show second—formant
variations more clearly. Both Hanne and Thomas have argued that the effect
of filtering speech signals can be predicted or inferred from usual steady—
state filter analysis. However, Fig. 2 shows a sonagram of a common English
word, indicating a frequency derivative on the order of 10,000 Hz per second.
This high value of frequency derivative is known to give quite unpredictable
and unexpected outputs from time—invariant linear filters (Baghdadyél,

32 as

Wiener and Leone , Cannon and Duncan ). One should reexamine the criterion
for filter bandwidth in terms of the time—varying properties that can occur
in speech signals. The inverse relationship between rise time and band—
width indicates that a fixed bandwidth bank of filters must be a compromise
at best. The effect of an analysis period on the order of 25—30 ms is to
average or smear quick transient phenomena. Discu551on of recognition
errors in various systems using this type of techniques (Reddy7) indicates
that many consonants, especially stop consonants, are missed due to this
smearing or averaging. The usual reason given for the recognition errors

is the low energy and short duration of these speech sounds. One possible

solution would be to vary the computing period inversely with frequency

   

 

“95mm:

x. ,n‘
jg 1‘. 2100—1200 _

  

  

I 90 ms — 10,000 Hz/sec

246 ms 365 ms
275 ms

FIGURE 2 SONOGRAM OF ENGLISH WORD, RUDDER,
SHOWING HIGH VALUE OF FREQUENCY
DERIVATIVE

29

(long periods for low frequency and short periods for high). The resulting
coefficients would not be for an orthogonal expansion,and also vastly
differing waveforms can occur in the same frequency region.

Figure 3 shows both Wideband (65 to 6500 Hz) and bandpassed time
acoustical signals from recordings of four speakers saying medial [b]
from [umbif] (see Appendix A for a description of the experimental
pseudo—language used).

Bandpass filters can emphasize characteristics in the real—time
waveform of extremely short transient-type bursts (release of the stop
consonant [b] for different Speakers, both male and female). Although the
Wideband waveforms (Figure 3, first page) show very little similarity, it
is possible by bandpass filtering to find similar waveforms for the different
speakers (Figure 4). The rejection of other features in the acoustical
signal, as well as noise, by the filtering has made this posSible. It
will be noted that the most consistent and similar waveforms across
speakers need not, and often do not, occur in the same frequency range
(filter).

It has been argued by researchers that other acoustical clues for
the perception of a stop consonant exist, namely the transition into the
following vowel. Cooper et a1.5 investigated perception of synthetic

1F .
J has shown w1th

initial vowels with frequency glide onsets. Ohman
sonagrams of actual speech that these results may not apply to connected
human speech sounds. His data showed that, for medial stOp consonants,
the common notion of a formant hub does not hold; that is, there is no

consistent point of origin for a given consonant, say [b], to which and

from which vowel formants tend.

3O

<Zm:>.OZwIn_ Fzm:mz<m.r ...mOIm OZ_>>OIm 4.4205 4<U_.rmDOU< ImeEw m meOC

,3, o. i .a- E 3.5, time as: 45.. 9.1 8.!

I[I A IIIFI»
1:2-

..Ex . 3:. 1.!

 

l5. .ﬂ: £23

a

[In

a a. x :2! . :- 9 .3.“ w!» 41.. a5} uni

2.! .1»: m 3: ES: 95. a 3: 1;! I... m 3: [EB

_ n

i E .. .ll:_lu_1.ni m!» 12.95: Xi

Ice: .3: u was Haul I I

n): o 1: E33

   

l5:

. x L . 3.5 :13 .2; 5.. 9:. Si
IIII. . I IIIIIIJ»
3:95 El vn NJ: [.693 linn VN at. ring

9-... .~ :3 ISI-

 

31

.f...

.2: 33

:ﬂ. c- "i_ __ . Ud1Um ‘2 560— a. A“. 1" ICC-aﬂ

— - o a
u'lnlncUIIIIOIIIQIIaI‘l-nn.0u30I'0l

.03 v w...“

Eng

 

 

 

.-lil I I
‘2: «of .5! 2 .0! m 3: 2. an;
_

l\/\I\l\l\/\l\l\/.\/\/\I\IIIII IDIII >1
_

rt «a .u. o. slung: Eng...

 

_ 1o 2 .0! ch 3; .E «.8-

 

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

 

 

.I_ ‘- .1-_ f u Mich .3 rec 1. v?! I... (‘1‘:

Bios: {no}

 

o u n a - u I o o a n
r oooooooo s nnnnnnnnn ‘7 uuuuuuuu I? ooooooooo on: p ns bl if v p L

 

 

 

 

 

tuna pia— Blawa; Enmnu
I: "a . w. 0. no, 0 ma: 9.. "who
I: noon — 1w 0. «O! a mg: [I Hi

 

 

 

 

 

 

.I u. i an.“ wag a! g“ 0. ma :50 ‘00:

,0! v 3..

- u . u . u o . u .
fins... bl urn f In D b 5 > P h h

 

 

 

E. 00!. .2. a. .0! m 5.. are
E. on; . we 9 6! o 3.. E. 09...
I. 03m . ...w 2 .0! on 3: l. as

 

 

 

:2 o. .3. . . 3(8 E 29 o. :2 .In 9008

a u n . o g .
f P P 0... LI :7 P I .tot

'30— 0—03. mg;

 

 

 

 

 

 

It .5 .13 .— Salome; 11.90
" LI, 'FI
“l 1“

1.38 —u¢2 9089:: Fiona

{:3 :32 20823.; I...’

 

 

 

 

 

.39! 0.53 i 22.3! ..n o. .I. 9334‘ u to!)

u u - o a I
r OOOOOOOO .000: h» V I 3 L: > I t .-

B’CUJC v5.03

 

 

.3: alas;

I... —m~:

 

in. u .40? L.

 

 

8.0 win I. noon

pun .—

frag

 

.l .8» , :u 2 8! ch 3..

 

 

 

 

 

..wd‘uuxh'.ouad:89

. . a . u . o o s .
.1 ........ s ......... P ........ I? ......... I ........ p by b r p

 

flag

 

'1' o. B’m win [nth

 

 

“.538 .8: 8.1.... (.39

In! Flu: aivnwau

 

 

32

 

 

 

487.5 ms FILE 24 M04 A 19 EH 1 512.5 ms
549.0 ms FILE 6 M06 19 BE 1 574.0 ms
695.1 ms FILE 5 MD8 19 MJ 1 720.1 ms

550.0 ms FILE 4 MD6 19 BA 1 575.0 ms

OUTPUTS OF OVERLAPPING BANDPASS FILTERS [M] TO [8]]

 

‘w-JI“

FIGURE 4 CONSISTENT TIME WAVEFORMS FOR SEVERAL SPEAKERS
FROM DIFFERENT BANDPASS FILTERS

 

 

33

The choice of a set of filters for preprocessing the acoustical
signal ranges from a set tailored to several classes of acoustical signals,
possibly along with different representation criteria, to a set of contigu—
mm narrow bandpass filters. The second approach has been extremely popular,
especially for speech synthesis using vocoders. A familiar characteristic
of narrowband filters, i.e., ringing, when excited with a sharp increase
or decrease in amplitude or frequency is not consistent with the require-
mmw of simile. For a period of time after a sudden change in amplitude
or frequency, the output of the filter is not representative of the input.
ﬁns problem will be discussed in the next chapter.

To avoid these difficulties, we have chosen a Wideband (half—power
bandwidth greater than one third the center frequency) overlapping filter
bank (See Appendix B). The particular choice of the number of filters
and the bandwidth of each filter was made in order to satisfy the three
Mmted requirements. We shall see that the type of multiband filtering
used here fulfills these requirements to a certain degree but has several
limitations which must be corrected in the decision algorithm that follows
the multiband filtering. The reason for these limitations is obvious.

A time—invariant filter based on steady—state sinusoidal considerations
OIJViously is not representative of the speech acoustical signal. However,
there are several reasons for this choice over the admittedly better set
of tailored filters, These reasons include:

(1) The hardware is readily accessible

(2) A large number of investigators have used Wideband

preprocessing
{v

Gazdag ,

filtering in proposing and implementing

an I
schemes, including Hanne , Reddy , Thomas ,

Shafer et a1 and Yilmaz .

 

34

(3) Adequate representations have not been tailored
to the time—varying acoustical signal,

(4) Few decision structures have been studied which are

tailored to this type of multiband filtering
preprocessing.

Another popular related analysis tool is the Fourier transform,
especially since the introduction of the "Fast Fourier Transform" algorithm
byCooley andTukey."J The Fourier transform equations can be modified so
Umt each coefficient computation may be thought of as a (digital) filter
Operation. Hence, the complete transform computation may be considered
a multi—bandpass filter processing.*

Much can be learned by considering a multi—bandpass filtering
scheme with the intention of using it only as a first step and deriving

from it further requirements for a tailored multi-filtering scheme.

A popular approach for parameterization of the filter outputs is
t0 compute coefficients for an orthogonal series representation. However,
me Criterion commonly used for these computations is complete represen—
taltion of the entire signal and minimization of the error between the
orthogonal series and the original signal. This is not what is needed
f0? an input to an ASR system. We would rather like to see only those
Imrameters necessary for recognition. FlanaganM has modelled the Speech-
generation process as either a two—pole linear filter excited by the
ﬂOttal DUISeS (for vowels and oral continuants) or a filter excited by
Mute noise with variable bandwidth in the center frequency (for frica—

’ - ' . - r (
nves. stop consonants). The various parameters of input, envelope an!

F“‘--—- .
In the next chapter we will see that the Fourier coeffi01ent computaﬁion
acts like a narrowband digital filter and hence is subJeet to ringi g.

g d

 

 

35

filter bandwidth and center frequencies are considered to be time-varying.
Thus, he would propose two parameters for each of our acoustical features,
related to center frequency and bandwidth. Rupert has suggested that there
are also consistent spectral shapes to the acoustical features which have
been only slightly considered by previous investigators. These shape
functions appear to be easily described by at most four parameters; say,
the first four moments of the spectral density. They were first derived
from sonagrams, but inspection of machine-calculated power spectra
(Appendix D) shows that they may be more artifacts of the hardware than
ccnmistent features of speech acoustical signals. However, Sittonﬂg has
studied the first four moments of reciprocal zero—crossing distributions
and found more consistent results.

Thus, one is led to different estimates of center frequency (and

higher moments) for a narrowband (unimodal) spectral density. Zero—crossing

counts immediately come to mind. There are many schemes and investigations
". l‘.

of zero-crossings for analysis of speech signals (Cherry and Philips. ).
However, these measures were usually made on the total signal and, as
can be seen by considering the sum of two sinusoids with variable ampli—
tudes, the resulting output can be very difficult to interpret unless
the signal has its spectral energy concentrated in a narrow frequency
band. Thomas has used zero-crossing analysis on the output of his
bandpass filter to estimate second formants; he finds an extremely good
representation for vowels and indicates trouble only fer very low power
portions of the acoustical signal (fricatives, stop consonants). The
use of bandpass filters followed by zero—crossing counts to estimate the
a?
frequency structure of formants has been demonstrated (Peterson ,

29
Hanne ).

A

36

Recently, Scarnﬂi has discussed the fine structure of zero—crossings
for speech—like signals having formants and pitch frequency components.
He uses wide (1 octave) filters to isolate formants and shows the effect
of pitch periods on formant frequency estimation. The errors involved in
zero—crossing analysis are on the order of l/number of zero crossings and
therefore proportional to frequency. The case with Fourier series analysis
is different, in that the frequency—location error is fixed at ié the
lowest frequency component (in this case, the pitch frequency).

Zero—crossing counts can be related to instantaneous frequencies

31 as

(Baghdady , Lerner ) and thus incorporated into a discussion of quasi—
stationary response of linear filters. However, few investigators have
pursued this approach in the case of speech signals. Reddy7 uses zero—
crossing measures as an estimation of steady-state frequencies and also

some envelope measurements (primarily relative envelope changes). We

will discuss on a slightly more theoretical basis the relationships

between zero—crossing measures and instantaneous envelope measurements in
the next chapter. There are obvious benefits to be derived from the use
of both derived time series in that the interpretation of zero—crossing
counts is greatly enhanced by specification of the nature of the speech
Signal (i.e., if it is a vowel portion or a fricative portion, etc.),
which can be determined by investigation of the envelope time series.

The subject we will investigate in the following chapters involves
prefiltering by a bank of overlapping bandpass filters with the criterion
that significant acoustical features appear in at least one of the filters
over their duration. This presents a new type of recognition problem,
involving the logic to decide which filter has the significant output and
to perform a preliminary classification as discussed previously. This

is the topic of the next section.

I-E DECOMPOSITION OF PATTERN RECOGNITION ALGORITHMS

The use of multiband overlapping filters to preprocess speech
signals presents a specialized type of pattern—recognition.processor, For
the sake of clarity, we will adopt the widely used mathematical formulation
in our discussion of this problem: The inputs to pattern—recognition
devices are parameters, distinguishing characteristics of a physical

event. A measurement is the numerical value of a parameter. A pattern

 

vector, then, is an ordered set of measurements of a physical event; each
measurement can be thought of as a component. The distance in pattern
vector space between two vectors is a geometric measure of their closeness.
A typical, but not always appropriate, distance is the standard Euclidean

sum of squared differences of each component._ A pattern—recognition

 

algorithm is an assignment of class labels to the pattern vectors. In a

 

typical pattern-recognition algorithm, each input pattern vector to be
classified is compared with a number of reference vectors by a distance
measure. The input vector is then assigned the label of that reference
vector for which the distance is minimized. An ideal pattern—recognition
algorithm would result in a dichotomization of the pattern vector space
with unique class labels for each disjoint region. In the cases where
this is not possible, the output of our pattern—recognition (PR) algorithm

can be a degree of presence (DOP) vector, which has one component for each

 

class label. The DOP vectors indicate the relative assignment for each
class (say, normalized distances) and hence are a generalization of the
single class label output.

A directed search is a special type of pattern—recognition algorithm

 

that trades sequential operations for multidimensional single operations;

37

A

38

i.e., in the reference vector comparison case, a subset of reference vectors
is selected by first examining few components and eliminating large port—
tions of the pattern vector space from further search. Plasticity is a
description of a particular type of pattern—recognition algorithm that
allows changes in the pattern vector to DOP vector mapping, depending

on a subset or all of the pattern vectors (the terms "learning" and
"adapting" have been used for this process). A deterministic pattern—
recognition algorithm is one which has no plasticity; that is, an a priori
ﬁxed mapping of vectors into classes, possibly by setting thresholds on
mamurements. Normalization is a process which we will distinguish from

Um pattern—recognition algorithm as being more concerned with the deri—
vation of the parameter measurements. Although there are analogous types
ofstandardization processes that do occur in pattern—recognition algorithms,
it will facilitate the discussion to make this distinction.

We can now consider a schematic of the logic required for a pattern—
recognition algorithm for our multi—bandpass filters and its operation.

In Figure 5, the output of each bandpass filter goes into a measure—
ment device, producing an n~dimensional pattern vector for a time epoch
(physical-event) of the acoustical signal. These may be coefficients of
an orthogonal expansion over a certain time interval, coefficients of
a differential equation or another set of appropriate measurements (mean
ValueS, maximum value derivatives, maximum value standard deviations, etc.).
For a Continuous output of the bandpass filter, these types of measurement
require time interval marks, which we will assume for now are generated
elsewhere or are a Part of the measurement scheme. The output DOP vector

iS of ..
dimen51on r, the number of speech sound classes, discussed in

L

A

Spefﬁ‘
Input

 

 

r.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I
BANDPASS MEASUREMENT LOCAL
FILTER DEVICE : LOGIC
n
' °n—dimensiona| ' r~dim
. BanI< 0f . pattern . DOP
"1 THIQrS vector vector
u o 0
I
BANDPASS MEASUREMENT LOCAL
FILTER DEVICE LOGIC
n

 

 

 

 

 

 

 

 

GLOBAL
L()(3|C

 

FIGURE 5 SCHEMATIC OF MULTIFILTER RECOGNITION LOGIC

 

A

 

40
Section D (on the order of 4—6). When referring to operations or pro—
perties of individual filter outputs, we will denote these as local, and
when talking about properties of the entire bank of filters, we will denote
these as global. By the particular choice of our filters, we see that a
local property is one that is restricted to a certain frequency range.
We will talk formally about "closeness" of pattern vectors in terms of

as

clusters in the sense of Ball and Hall. That is, we will say a set of
pattern vectors is clustered if the intra—cluster distances are small
(relative to a threshold, or to inter—cluster distances). The homogeneous
property, which we introduced in our definition of acoustical semnents,
is with respect to both the physical measurements of the signal and the
linguistic significance of these measurements. We might reformulate
that property in terms of our definitions; physical measurements have
some significance and consistency if they form a cluster (denoted as a

physical cluster) in the pattern vector space. It may not always be the

case that these physical clusters have linguistic significance. For
example, a frequency measurement on a low—order filter primarily exhibits
the pitch frequency. In this case, the physical clusters would correspond
to different pitch frequencies and not to different linguistic events. At
the opposite extreme, a physical cluster might be related to two distinct
linguistic events, such as a medial Lb], which has a very small amount of
silence before the burst release, or a great amount of background noise
such that it is difficult to distinguish from a fricative such as [f].

The resulting measurements for both the [b] and the [f] would tend to

lie "close" to each other and, hence, lie in one physical cluster. Thus,
the linguistic clusters would correspond to one physical cluster. At

first, it appears that appropriate class labelling of the physical clusters

41

would define the linguistic significance; however, as indicated previously,
the difficult task of assigning an exterior linguistic criterion to physical
measurements subject to speaker, environment, and free phonetic variations
will require a more sophisticated, plastic type of correspondence. The
intention of keeping the actual decision algorithm simple so that it may

be implemented in real time (with a minimal amount of computation) requires
a better solution to the problem than simply keeping track of all the
physical clusters and then making a correspondence to a set 01 linguistic
labels. This type of approach requires, for example, storage of a large
number of reference vectors (say, one for each physical cluster), comparison
to these at each step of the decision algorithm, and a continual updating

of these reference vectors due to slowly drifting measurements. In our
problem, this approach is not feasible because of the variations due to different
speakers. Bobrow and Klatt13 have shown a decision algorithm (applied

to the speech recognition problem) which is a directed search using

decision—tree type logic that reduces the computational limitations

(amount of storage, number of comparison speed of classification) of the
Iusual multidimensional pattern recognition algorithm. Their procedure,
applied to a speech measurement situation in which the variations discussed
above are removed, would result in an effective ASR algorithm. Their
technique, of course, will fail in the situation where a large number of
reference vectors are saved for comparison.

The concept of precisely controlled features can be related here,
also, to physical clusters, in that if other perturbing influences are
removed, these precisely controlled features should result in "tight"
physical clusters. This approach in itself should reduce considerably

the amount of variation and hence the number of physical clusters needed

 

42

for description. This is then what we mean by attention focusing; i.e.,
the selection of a portion of the speech signal with precisely controlled

features and tight physical clusters for further processing.

The complexity of the decision logic in Fig. 5 for an ASR system is
dependent upon whether a decision for assigning a class label can be
dichotomized into a number of local decisions followed by a global decision
. 40
(analogous to the Zeiger decomposition of automata), i.e., Is the dimen—
sionality of the pattern vectors on the order of mxn or n (where m is the
number of modules and n is the number of measurements in the input pattern
vector for each module)? In the situation where two estimation criteria
are appropriate (not necessarily simultaneously) for an n parameter problem,
hence leading to two "filters", (as discussed in Section])), we would say
the dimensionality is n rather than Zn, but "shifts" according to the input.
The local decision would be based on "best" estimate according to the local
criterion and the global decision would then be the choice of which esti—
mator was most appropriate by examining the variance of the parameter

estimator, for instance.

This variance measure of the estimation process can be generalized
to handle the many more difficult and varied situations in ASR systems. We
can also measure the quality of the DOP vector, e.g., the peakedness measure
41
introduced by Kilmer et a1 . A quality measure of the specific classifi—
cation of an input pattern vector indicates the significance of the esti—

mation of the measurements and consistency of the pattern vector w.r.t

previous classifications.

A

43

Knowing that the complexity of PR algorithms goes up exponentially
with the number of dimensions, a decomposition can result in real—time
computations. The discussion of the previous sections indicates that this
is the case for speech, in that the entire Wideband acoustical signal
is not precisely controlled and does not contribute in its entirety to
the linguistic information. The choice of a logic structure, then, depends
on this decomposition. We propose to show in Chapter IV that this is
valid and indeed enhances the physical measurements in such a way as to
reduce variations and improve the probability of success of classification.

Kilmer et al. have studied parallel recognition structures of the
type shown in Fig. 6 and have demonstrated that an iterative nonlinear
Shakedown net (called S-RETIC)* is capable of arriving at a consensus
of opinion among the local pattern—recognition elements (denoted modules),
solving conflicts that may arise and selectively tuning to particular
modules that have made a high—quality decision. We feel that this type
of logic—structure is ideally adapted to the requirements of an ASR
system. In particular, the bandpass overlapping filters have a mixture
of correlation with neighboring filters and a high degree of local specifi—
city because of the precisely—controlled features in speech signals (cor—
responding to the local redundancy of potential command concept of the
S—RETIC). The parallel computations involving low—dimensionality (on the
order of the dimensionality of each module) allow a minimal amount of
computation.

*By S~RETIC, we mean the algorithm that performs the iterative nonlinear
shakedowu as described in Kilmer et a]. (1967) and not the complete

simulation study. Effectively, we denote S—RETIC for the computer program
which corresponds to the B parts of the modules with their interconnections.

 

44

 

  

x, —> MODULE K P1IC1/Xk) —-—> 1
INPUT .
PATTERN e Xk . DOP VECTOR
VECTOR °

Pr(Cr/Xk) H r

 

 

mi FROM MODULE j (Not all lines may be used)

FIGURE 6 QUASI—STATISTICAL FORMULATION OF PR ALGORITHM

 

45
In order to get some feeling as to how S—RETIC arrives at its decision
and also to consider an alternative procedure for using a number of pattern—
recognition elements in unision, we can consider the probability distribu—
42 43
tion approximation techniques first discussed by Lewis and Brown . In

order to apply their techniques to PR problems, we will consider each

component of the DOP vector as being a conditional probability distribution

Pl(CL/xk)' L = l, . . ., r, defined over the (module) input pattern vector
<

space, Xk (xkexk) for each class, O{ (see Fig. 6). The DOP vector is com—

k , , k
puted from stored conditional distributions Pk(X /C¥) [or an input x by

Bayes formula (assume P(C£) = l/r).
r
k k ' k
' . z . / P C I~E~1)
1369/“ ) Pku mp. Z k(x / p (
£21

The only requirements on the stored distributions is that they be non—

negative for all C XR and normalized such that

L,
E; .P (xk/C ) : l t : 1, . . ., r (I—E—Z)
k t
k
X
. , k
We can apply Lew1s and Brown's techniques to Pk(X /CL)’ kzl, . . . m.

for one class by considering each pattern-recognition module as computing
a low order approximation to the true distribution. Chow45 defines the
structure of a pattern recognition algorithm as the function form of the
probability distributions, particularly the condition dependencies among
the components of the pattern vectors. He describes the Lewis-Brown
approximation as structure adaptation. Hence, a parallel net of modules
with lateral communication between local PR computations allows at least
m different structures for each class. S—RETIC then selects the appro—

priate structure.

46

So far in our discussion, we have been considering decision struc—
tures that, except for the possibility of operating with minimal computations
and less complexity, appear similar to those termed template—matching in
Section B. This static type of pattern classification has little hope of
working with connected conversational speech. The structure we are proposing
has more flexibility built into it and operates like the PR algorithm we
have described for isolated sounds where timing marks are well defined. The
philosophy behind the design of the STL—RETIC program was to operate in I
an asynchronous manner, rolling over from one decision to another based
on input changes. This structure is exactly the type that is needed for
dynamic speech recognition; when one classification is chosen, such as
silence preceding a word, and a new feature begins. It has been demonstrated
by Kilmer that the change in the input (as reflected by the change in the
local DOP vectors) is sufficient to cause a change in the overall global
DOP vector. It will possibly be necessary to also determine changes in the
input measurements. We propose to do this by detecting inherent changes
in the physical characteristics of the signal and then deciding if these
changes are significant enough to cause a recomputation of the global
decision.

We will return to these questions in Chapter IV. First, however, we
consider in Chapter II the nature of the acoustical waveform and discuss
a procedure for detecting inherent changes in that waveform. In order to
Specify a training procedure for a plastic PR algorithm, an external classi—
fication criterion is needed. The lack of a one—to—one correspondence
between acoustical and linguistic events rules out completely unsupervised
learning. In Chapter III, structural linguistics is discussed in order to

provide this criterion.

II REPRESENTATION OF TIME-VARYING SIGNALS

Representation of signals that result from transformations by a
time—varying differential operator of standard signals present many
difficulties, particularly to engineers with backgrounds in linear time—
invariant differential operator analysis. Two representations are

commonly used, the analytic signafﬁsand the sliding Fourier transform

methodsgg

II-A Analytic Signals

 

The analytic signal representation is an attempt to define pre-
cisely the empirical notions of envelope and frequency. The primary
advantage of this representation is that it separates the envelope and
phase portions of the signal; in addition, the resulting Spectrum is one
sided (i.e., there is no mirror negative frequency portion). This cor—
responds to most spectra "pictures" and makes various moment calculations
practical.

The Spectrum of a real signal u(t) for t €(—m,m) is the Fourier

transform

(1 J

U(jw) = J u(t) e3.ULJt dt .*

~CI)

The Hilbert transform of the real signal x(t) defined on the interval

”m < t < m as the Cauchy principal value of the integral

9 1 O) x(o.)

h ,
x (t) E t_0 'do , “m < t s w (II-A-l)

 

"CO

 

*
We will adopt the convention of denoting the spectrum of a real function
of time by capital letters.

47

48

h
is another useful transform. The new real signal x (t) has the following

properties (Titchmarsh47):

(1) x(t) = cos(wt + e) xh(t) = sin(wt + o)
(2) Under rather general conditions if x : yh; xh : -y
(3) th : amt); r > 0

z o , 1“ z o

: JX(f); t s 0

We can now define the analytic signal corresponding to x(t)

 

 

x(t) : x(t) + jxh(t) (II—A-Za)
: a(t) eja(t) (II—A—Zb)
where
a(t) :AAlfx2(t) + xh~(t) (II—A-Zc)
u(t) : arctan{xh(t)/x(t)} (II-A—Zd)

The analytic signal x(t) has the one—sided spectra mentioned before, because
of Property (3) and the definition. This signal is complex (the real portion
is the original signal). Since the process of taking the real part of a
complex function is a linear operation, it commutes with other linear

operations such as convolution, differentiation, and integration.

49

Equation II—A-Zb gives us an interpretation of the analytic signal
representation as a phasor in the complex plane with time-varying magni—
tude and angle (with respect to the real axis). We may denote these quan-
tities as the envelope and phase functions, terms motivated by the use of
the analytic signal in various modulation studies (Baghdadyal, Weiner

and Leonsz). The instantaneous frequency is defined as the time

derivative of the phase function.
d
wi(t) = da(t)/dt (II—A—3)

The analytic signal, although giving an instantaneous time descrip—
tion, can be used effectively for only a limited set of signals, namely
those with slowly varying envelope and frequency functions. In order to
enlarge this set of signals, we will introduce another definition which
will be useful in discussing second—order time—varying differential opera—
tors. The derivative of an analytic signal may be written as a product
of the analytic signal and a new signal, bX(t), which we will denote the

prebandwidth signal.

 

 

 

d§(t) d { ja(t)t _ 1 da(t) , dd(t)‘ A
_dt__ = a? a(t)e J _ {a(t) dt + J _dt__} x(t) (II—A~4)
where
d 1 d a(t) . da(t)
bx(t) = a(t) dt + 3 dt

*
The name of this function follows the convention of Deutsch and

camera“6 definition of effective bandwidth. First shift the Spectrum

 

 

* A
Deut5ch48 denotes x(t) as the pre-envelope signal because its magnitude is
the envelope. '

A .
of x(t) to its center frequency. This frequency shift can be included

in bx(t) by a property of Fourier transforms

Xs(jw)g X{j(w + wo)}(})- e_‘jw0t x(t) = §S(t) (II—A—5a)

b (t) a(t)/a(t) + j(&(t) - w

x5

O) (II—A-Bb)

When we is the center frequency of X(jw), the complex portion of b s
x
reflects the time variations of instantaneous frequency about the mean.

The effective bandwidth, BW, is the second moment of the Spectrum

about the mean.

00 2

 

 

00 A
BW‘ g f;m w ‘X (w)‘ dw : _m lth (tl dt
I lxsw)!2 dw J_ ‘xs(t)'2dt
..m 00

(I)

As
I Jb S(t) x (t) dt
—co X

L a2 (t) at
(I)

2

 

 

The magnitude of bXS is an upper bound for the effective bandwidth

by the Schwarz inequality and thus is a measure of an instantaneous

 

bandwidth

sz

CD CD
, 2 2 s ,
j;m lbXS (t)| a (t) dt Jlm a (t) dt

{bxs (t)|2 dt (II—A—7)

[A
chﬁ

51

A

Another interesting relationship between bx(t) and x(t) is (for x(t) # 0):

d A A d “
: --— : —-— 1 t --
bx(t) dt (t) // x(t) dt { 0s X( )} (II A 8)
In speech analysis, a logarithmic scale for amplitude (loudness) has
often been used. By taking a derivative (with appropriate definitions
for the complex logarithm) we can replace the transcendental function with
a function more easily computed on a digital machine.

Now, consider a second—order time—varying linear differential

equation (DE).

Q + a1(t)x +a (t) Q :: a(t) (II—A—9)

where a1(t) and ao(t) are real functions denoting the time—varying para—
meters (for example, of a formant-producing cavity in Speech generation).
A

u(t) is an excitation function which may be stochastic (fricatives) or

deterministic (glottal pulses). Introducing the prebandwidth function,

[ bx(t) + b:(t) + al(t) bx(t) + a2(t) ] x(t) : 13(t) (II—A~10)

A
The homogeneous solution of the reduced DE (u(t) : 0) involves solution

of a Ricatti equation for bx’ which can be solved if a1 and a2 are constant.

' *
= b + (b + (b +~<2 )(b + c )) = O
X X X 1 X 1

52

where

0
ll
NIH

a + a2/4 — a (II—A—llb)
l l 2

c = %a + /a2/4 — a (II—A—lla)

1 1 1 2

c1 and c: are the pole locations for the time-invariant system given by
Eqn. (II—A—lO). When the constants c1 and c: are complex, the magnitude
of st, the shifted prebandwidth function, has the damping factor a1/2,
which is an accepted "bandwidth" for this system. Thus, our definition is
useful in relating bandwidth to a system that may have an infinite value
for BW (this happens for certain values of a1 and a2).

When a1(t) and a2(t) vary Slowly with time, so that bX R10, we
can still define cl(t) and c:(t) by Eqn. (II—A—ll) and we can define time—
varying poles without Fourier transforms. In general, Eqn. (II—A—lO) must
be solved by numerical integration, but the function bX is related to the
crucial parameters of a system described by Eqn. (II—A—9) and can provide
insight into the system's behavior. Analysis of higher-order time—varying
systems by this approach is not as easy as the analysis of time—invariant
systems, where reduction to second—order systems is achieved by partial
fraction expansions. The lack of a superposition principle, plus the com—
putational difficulty with sums of analytic functions, further complicates
the generalization.

The analysis of the dynamic characteristics of one isolated for—
mant is possible (and more tractable) with the introduction of the
prebandwidth function. Real differential equations (DE) for the envelope
and frequency functions can be derived by Substituting the definition of
bx from Eqn. (II-A—4) into Eqn. (II-A—lO) and separating the result into

real and imaginary parts, giving

53

[:(t) + a1(t)é(t)+a{a2(t) - (19%)} a(t)] [COS{Q(t)} — sin{°1(t)ﬂ
= a(t) cos {y(t)l (II—A—lZa)

[®(t) + 2w(t)é(t)/a(t) + a1(t)w(t)]a(t)[cos{a(t)} + sin{d(t)l]

= a(t)sin{y(_t)} (II-A-le)

where
§(t) = a(t)eja(t)

3(t) = g(t)eJY(t)

w(t) = d<t>

The equation for the envelope (II—A—lZa) is of the same form as
the total signal DE with a "natural frequency" reduced by w2(t). The DE
for the frequency is nonlinear in w and a and shows the effect of damping

on the natural frequency.

We can change (II—A-12a) by substituting for the second derivative

of the envelope

on d . . r)
a(t)/a(t) = a; {a(t)/a(t)} + {a(t)/a(t)}“.

Then we can rewrite (II—A—12) as

d . . 2 _ . a
a? {a/a} = g/aCcosy/{cosa - sind}] + w - ap — ala/a — {a/a} (II—A—lBa)

d a
asz} = g/a[siny/{cosa + sina]] — Zwa/a + alw , (II—A—13b)

 

If we identify w and a/a as state variables, then Eqn. (II—A-13)
is in the form of a nonlinear vector differential equation. For Speech
acoustical Signal representation, these state variables are invariant to
amplitude scale changes as seen from their differential equations; further,

they form the real and imaginary part of the prebandwidth function.

AS noted in Chapter 1, Speech acoustical signals fall into a number
of classes, depending on the values of the four Signal parameters al(t),
a2(t), g(t), and Y(t) in our single formant model. InSpection of Eqn. (II—A-13)
indicates that the derivatives of the two state variables depend only on
the state variables and these four time—varying parameters. Thus, if we
were to Specify the two state variables and their derivatives as functions
of time, we could perform the Speech Signal classification. This procedure
does not require us to solve the complex nonlinear differential equations
or to perform any type of matrix inversion that would be necessary to iden—

tify the time—varying model parameters.

When a(t) is a train of unipolar glottal pulses (each being 2 to
12 ms in duration), u(t) can be represented by the excitation enve10pe g(t),
For this Situation, the sinusoidal oscillation terms can be removed from
Eqn. (II-A—12). This is achieved by the physical process of envelope
detection and lOWpass filtering. In Section D, this filtering Operation
is investigated and a criterion for selecting the cutoff frequency is given
to minimize distortion of the solution of the differential equation and
maximize the smoothing of the oscillation terms.

When the excitation signal is stochastic, we cannot obviously

reduce the complexity of the differential equation (i.e., g(t) may

55

not adequately represent the total characteristics and y(t) may also be
required to adequately describe the random fluctuations). Under certain
conditions, it is possible to assume that the excitation function a(t) is a
Gaussian random process with expected value of 0(E(G) = O) and has independent
increments with a uniform energy versus frequency distribution (white noise).
The differential operator described by Eqn. (II—A—lO) will then specify an
autocorrelation function for x(t). Kelly and Reed49 Show that the envelope
and phase functions for x(t) and their derivatives have the following

A
probability densities for each fixed t when x(t) is a stationary process.

p(a, 01, h, w) = p(a)p(01)p(a)p(w/a) (II—A—l4)
where

p(a) T R(O) Rayleigh with mean 0, E{x2} : CB

p(a) T N(0, Bxe) Normal with mean 0 and variance Bxg.

p(a) I U(0, 2“) Uniform between 0 and 2h

p(w/a)-’ NUT), Bxg/az)
_ d
U.) = E{ IUJI}

” gEtxgl/[Etkgﬂ — we.

c.

B
x

This indicates that the angle, envelope, and envelope derivative
are statistically independent for each t (independent random variable).
Thus, no information is lost by removing the OSCillatory terms in
Eqns.(II-A—12) and (II-A~13). For bandpass spectral densities (like those
we are considering), where the energy is concentrated in a range Aw about E,
the envelope and phase function energy distributions are concentrated

25

in a similar range about w = O (Davenport and Root ). Also, the uniform

distribution of the phase contains no parameters of the generating equations.

 

 

a
Abramsonbo has defined Bx for stochastic processes as the mean

Square bandwdith. For ergodic stationary processes it is equal to the effec—
tive bandwidth, BW2, given by Eqn. (II-A-6), which is applicable to deter—
ministic processes. Thus the instantaneous bandwidth function, bx(t) is
related to bandwidth measurements for deterministic and stochastic
(stationary) processes. Further, for second—order differential operators,

Eqn. (II—A—lO), all the parameters of the process can be determined from

first—order probability distributions (cf. Eqn. II—A—14), It is not necessary

to estimate autocorrelation functions or spectral relationships between
bx(t) and the parameters of the differential operator (Eqn. II-A—lO). Since
this operator determines the autocorrelation fumztion, these remarks apply

to nonstationary processes also.

Many speech sounds can be modeled by stochastic processes with sta—
tionary autocorrelation functions (giving time-invariant Spectral densities).
However, the short duration and low relative energy of these sounds does

not allow a ”steady—state' Spectral density approach. Thus we must consider
transient responses. In the next section we will discuss the problems of

uSing Spectral estimation techniques and the transient response of linear

SyStems to envelope and frequency changes.

II—B. Sliding Fourier Series

 

The recent development of the Cooley-Tukeya4 algorithm for fast
digital computation of Fourier series coefficients has caused much interest
in Fourier frequency analysis. Modern communication literature
uses "Fourier analysis” to refer to a particular use of any set of
orthogonal functions to approximate a given signal by the following

form:
f(t) ~Z akcpk(t) (II-B—l)
kZO

where the set of functions {wk(t)jk20 is such that for some interval of

time [a,b] and some weight functions h(t) (definition of orthogonal functions)

.b
. 2
Jah(t) en(t) ¢m(t) dt = cnénm (h(t) > o) (II—B—2)
where 6 = 1 n=m
nm
= 0 otherwise;

the ak'S are constants.

For any N, and for any given finite energy function f(t), the integral

weighted squared error defined by
N
h - — > t 3
j: (t) ‘rm akCPk( )l at

is minimized by the constants

b

ak = I; h(t)f(t)mk(t)dt . (II-B-3)

57

 

- V

58

The most popular orthogonal set is the set of trigonometric functions, with
h(t) = 1 over [a,b]. However, the trigonometric functions have finite
energy only over finite intervals. Therefore, the class of functions we can
represent by Eqn (II-B—l) with trigonometric functions must be non—zero only
on a finite interval.

A finite energy representation over an infinite interval is achieved

by defining the truncated time function

f(t) -T/2 S t S T/2 (II-B-4)

H

fT(t)

HQ.

0 otherwise

*
and then repeating fT(t) every T seconds. A Fourier series of the form of

Eqn. IIeB—l can be used, with

d
(t) : cos kw t
o
d d
2 . t ' (,0 2217
@2k_l(t) Sin kwo , o /T

Some of the properties of the finite Fourier series are:

‘T/2 . _
‘ kw t
(1) a2' : Re[ f(t) e3 0 dt
k d-T/Z
/2 'kw
a : —ImJT f(t) eJ 0t dt
Zk'l —T/2
(2) f(t) V‘ ra cos(km t) a Sin kw t} i q Q 0
”’ ’ 1 2k 0 t Zk—l o ' ‘-1 -
k40
V j(kw t + (pk) Z { ‘ }
~i “ 0 2 c 03 w )
Re A Cke kc k Ot + qk
C 9 ‘l1 s + a d
k “ X‘2k ’ 2k—l

 

*
This representation is a good approximation only over the interval [~T/2,

T/2],

59

t

m g arctan{a /a
k 2k—l ZkJ

Notice in property (2) the resemblance to the form for analytic signals.
*
The analytic Signal corresponding to this series is

Y" 83' {kwot + cpk}
f(t) a. 2 k (11-3-5)

Now, consider some implications of these properties for time—varying
Signals, especially signals with varying frequency. Looking at
Property (2) again, the series is a sum of cosine functions with
constant amplitude and constant phase. (Guilleminbl) states that the
approximation of arbitrary functions by this type of series is due

to constructive (and destructive) interference between sinusoidal
functions of different frequencies. The natural association of the
Fourier coefficients with a frequency distribution (analogous to
Laplace and Fourier transform theory) causes some problems due to

the interference phenomena. Figure 7 shows a particular waveform

defined over a finite period ‘Ta’ T The transform of y(t)
L

i
bi

 

r g .1 . ‘ .
(assume I‘a _ 3Tb) lb .
i r2|!(f f )T /4}
S F eijiyﬂl 5 n1 «t b ft. : l/T.
~ = n _ 4 ‘L

 

To put the series in true analytic form, Baghdady considers each term

as a phasor and defines the amplitude and phase function for the
resulting phasor sum, a construction that may have some intuitive appeal
but is no help at all computationally.

60

Wt)

 

FIGURE 7 SHORT TRANSIENT PHENOMENON WHICH IS DIFFICULT
TO ANALYZE WITH FOURIER SERIES

 

61

and indicates that Fourier coefficients computed over [0,T], for Tsz, would
be significantly nonzero for several values of k other than k0 : T/Tt'
The nonzero coefficients are necessary to cancel out ck0 cosianot + mkol
over 0 S t S T/2. The distribution of energy among the ck's is mis—
leading to an intuitive concept of frequency associated with y(t).

A remedy that has been suggested for these problems is to make
T smaller (less than Tb/2) and compute a sliding Fourier series (i.e.,
starting the computation at increasing times). The resulting computations
can be interpreted as "time—varying" ck's and Vk's. (However, this
approach adversely affects the computational savings of the Cooley—Tukey

method.) We may then ask if a representation of the form

, wkm
x(t) A4 E: ck(t) e (II—B—6)
kBO

would combine the properties of the analytic function and Fourier series.
We can get some insight into the behavior of this series in the case when

¢k(t) = wkt + 9k. The Fourier transform of f(t) in that case is

‘ JG
X(jw) A. 2“ E. e k Ck(w—wk) ; W > 0 (II—B—7)

R40

wh .
ere Ck(w) 15 the Fourier transform of ck(t). Thus, the convolution sum

1
n Eqn, (1 I~B—7) has smeared all the ck(t) functions together.

Ari (Example of a set of ck(t)'s results from the "sliding" defi—

niti
on of Ifourier coefficients,

1.... ' L-

62

t
ck(t) = Loo §(o)¢k(t—o)do (II-B—8a)

where tk(0) = ¢k(0), one of a set of orthogonal functions and the "duration

(non—zero time interval or effective time width) of t is much less than

A
that of x. In particular

t ~ ,_
Ck(t) = l- I ;’:(o)erk(t Cy)“do (,II—B—8b)

t
-w ~ ~‘w o
z e‘J kt I §(o)e J k do
'L—T
We see that the calculation of sliding Fourier trigonometric coefficients
can be interpreted as the output of the linear filter with input x(t) and

impulse response

h(t) = e OStST (II—B—Q)

= 0 otherwise

We might ask how ck(t) would look for various situations, especially

for time—varying frequencies (as in Speech formants, FM modulation

systems, etc.). To answer that question precisely, we must develop
Some methods of looking at the response of linear filters to a

general class of inputs. Before developing such a method, we might suggest

what the ck(t)'s Should display.
A
Suppose the input x(t) is a constant amplitude sine func—

_ *
tion with a linearly varying frequency, wi(t) (wi(tk) = wk, k=1,2,3,4).
‘________________

)I:
We denote an instantaneous frequency function by wi (t) when it may be
Confnsed with values of frequency.

1'

I

01 (t)

63

 

 

 

 

 

FREQUENCY INPUT

ego
t2 T:
“(23(1)
[3 #
Ic4(t)
t4\
FIGURE 8 IDEALIZED FOURIER COEFFICIENT RESPONSE TO VARYING

64

Then, each ck(t) corresponds to a frequency wk’ k=l...4, which should
ideally look like Figure 8. In the next section we Show that this is
possible only with restrictions which are too severe for the class of

speech acoustical signals.

 

II—C Response of Linear Filters to Analytic Signals

 

When inputs to a linear filter (used to separate different for—
mants in speech signals, say) contain amplitude and frequency derivatives
of significant magnitude, the usual transform—superposition method of
analysis becomes unwieldy, especially in determining transient response.

1 Leon and Weiner,32 and Cannons;3 have suggested a different

Baghdady,3
approach to this problem; they use the analytic signal and convolution
integral to Show the nature of the output of a linear filter in a more
enlightening manner. Their approach is a generalization of standard sinu—
soidal analysis using Fourier series. If the input to a filter is a

sinusoid that starts at t : 0

[j wot]
x(t) : ae t

IV
0

(II—C-la)

and the filter has Fourier transform H(jw), which is rational, with

simple poles at the point 5 = S S , Sn, then the output of the

1; 2, ..-
filter is
n
jwot --‘ Skt
o(t) : aH(jw )e + a Ak e
0 (II—C—lb)
k=l
with
(S-s ) H(s)
k
Ak
s—jw _
0 ask

The first term in Eqn. (II’C—lb) is the Steady—state or stationary
solution, and the second is the transient term. The stationary solution
is simply the input multiplied by the Fourier transform of the filter

evaluated at the input frequency. When the input has a time-varying

65

66

amplitude and/or frequency, the form of Eqn. (II-C—lb) is duplicated by

0(t) = a(t)eja(t) H(jw(t)) + e (II-C-lc)
where
0(t) ' is the output of the filter
a(t)eja(t) is the analytic signal form of the input
H(.) is the complex Fourier transform of the filter
impulse response
w(t) is the instantaneous frequency of the input
6 is the transient or distortion term.

The first term, called the quasi-stationary term, is merely a complex

number times the input, giving an amplitude and phase change. Thus, the

idea of ”frequency selection” by filtering has a definite meaning when

6 is small compared to the quasi—stationary term. The transient or dis—
tortion term results from the filter's attempt to "follow" the changing
input. Baghdady (and others) have bounded the distortion term and restricted
the set of inputs to satisfy the bound in order to use the quasi—stationary
term as an approximation to the output of the filter. The class of linear
filters was limited in these studies to those described by rational functions

of the frequency variable.

For the representation problems we are considering, this class
of filters is not general enough (a. ”Fourier coefficient” filter is
not of that type), nor do we have control over the class of inputs in
the same manner. We will find the following definitions notationally

(and possibly,intuitively) convenient.

67

The Fourier transform pair for a real function h(t) is

co

H(jy) = §_m h(t) (9—3“)t dt (II—C—Za)
h(t) : I m H(jY) e'm’t dy w : 2ny (II—C-Zb)

Baghdady, Leon, and Cannon now define the quasi—stationary response of
the filter as (for input instantaneous frequency, wi(t))
m -Jw.(th3

H(jwi(t)) 3f h(0)e 1 do (11_c_3)

___CD

However, this is not a precise definition of a filter reSponse to the
instantaneous frequency unless the frequency changes slowly. Assume

that h(t) is nonzero only over a finite interval LO’ThJ' Then,

*
wi(t+0) for 0 s a 5 Th 15 given (for mi analytic in [0,Th]) by

. . k
w.(t+0) = w.(t) + w.(t)0‘+-E: 24 w(k)(t)
1 1 1 k!

k22

and so a more exact definition results by using w_(t+U),
l

T Gk+l w (k)(t))

. h .ML(tﬂ3 -( k‘ 1
d f '
2 J h(o)e 1 n'e
O kc—‘l

H(jw.(t)) J d0 (11-0-3')
1

This (k3finiticnl ha unwitﬂ(h¢ for silluitions “dtﬂi signiflxuuit fre—
quency derivatives, although it is more accurate than Eqn. (II—C-3).

Of course, the two definitions are compatible if Thwi(t) \R wi(t).

 

*
We use the notation w for the "irst derivative of w with reSpect to
its dependent variable and w for higher derivatives.

68

Our approach will be to use Eqn. (II—C—3) as a definition, but

with a generalized frequency term, i.e.

d Th Jy(t,to)o
H(j¢(t,t0)) = J h(o)e do (II~C—4)
0
where
d
=3 S S '
w(t’to) f(wi(t+t0)) 0 t0 Th to fixed

We can illustrate by an example. The Fourier transformation of Eqn. (II-B—9)
is:

3w (0) _.w0
' e k e J do

H 'w
k(J )

<i(w—wk)T/2

sin(w—w )T/2
: e k

(II—C—Sa)

 

(w~wk)T/2

and for the frequency function t(t,to)

‘J[¢(t’t )-w lT/Z - t t )~w T/2
Hk(3¢(t)) : e o k Sln(t( . o k)#_ (IluC—Sb)

(y(t,tO)—wk)T/2

Thus, the coefficient c has a maximum value whenever t(t,t0) = w

k k

as we had shown in Figure 8.

The calculation of the distortion term will be facilitated

by the definition

t -Jwﬂ
H(jw,t) = J h(o) e do (II—C—6a)
__OO
—Jwt t jw(t—o)
= e I h(O)e do (II—Ce6b)
-00

{h(t) * ejwt} / ejwt ‘ (II—C-6c)

69

53 H
Kharkevich calls this function the "running Spectrum. It can be
shown that this definition introduces artifacts into the spectrum,

although it does have the limiting preperty
m -jwo
(1) H(jw,m) = H(jw) : e h(o)do

Another property is

(2) ‘37:" H(jw,t) = h(t) e‘jwt

From Eqn. (II-C-GC) we can see that the running Spectrum is a normalized
transient response of a filter with impulse reSponse h(t) to an analytic
sinusoid signal.

The output of a filter with impulse response h(O) and input

x(t) = a(t)ejw(t) is

t .
a t—o

0(t) = j. a(t—G)eJ ( ) h(0)do (II—C—7)

—00
We make the following assumptions
h(o) = o; O < o; o > Th (II-C-8a)
€ .
. -JUJO
le f0 h(O)e do = O (II—C—8b)

6:_'0

Equation (II—C-88)is realistic, since most digital computer applications
require this truncation. Equation (II—C—8b) simplifies the exposition
by not allowing terms of the form Km6(t) in h(t). Using integration by

parts, the output becomes

70

f Th 30(t-G)
0(t) : J a(t—o)e h(o)do
r- T JQ( t—O) .j‘llo \
h d . l
= J C) a(t-o)e {e 65 H(JW,O)I do
J[a(t—o)+to] Th
2 [a(t—o)e H(j¢,g)]
0
d

JTh ES a(t-o) jWG

O -——3(?:5y— + deg a(t—o) — t1] x(t—o)e H(jw’0)dg

By the assumption in Eqn. (II-C—Bb) the first term evaluated at zero is

ZGI‘O.

l

J[0(L—Th) 4 Wle
:‘I( t—Th) e H( .j W)

T

. h f 311:0
+ J {bx(t—O)-Jtlx(t—o)th(o) * e } do
0

0(t)‘

H

0 (t) + 0 (t) (II—C-Q)
q d

We denote by oq(t) the quasi-stationary portion of the output
transient reSponse and by od(t) the distortion term. The quasi—stationary
term shows, explicitly, that the output is delayed from the input by an
amount on the order of the interval over which h(t) # O. A reference
different than the one commonly used minimizes phase distortions
occurring in od(t) compared to use of the usual reference, t. The dis—
tortion term integrand is the prebandwidth function for the input times

f “to
1h(0) * eJq I , a transient response term for the filter. For exponential

filter functions (resulting from rational transfer functions), this term is

71

(II-C—lO)

 

:7
A
('1‘
v
*-
(D
II

which correSponds to one factor in the distortion term in Cannon and

Duncan's result when t is the instantaneous frequency.

The interpretation of H(jw,O) as a transient response (Eqn. II-C—6)
shows us that the distortion term is a weighted average of the filter's
ability to track frequency and amplitude changes. The term t(t,t0) is
indicative, also, of the precautions necessary in interpreting the response.
That is, for t(t,to) = wi(t) + towi(t) , we have a "pseudo—frequency,"
to®i(t), biasing the instantaneous frequency wi(t). An attempt to include
this bias in the distortion terms complicates the result tremendously.

H(.) evaluated at the biased instantaneous frequency term is actually
the predominant output when to®i(t) is signficant. (See the following

example.)

We could ask whether tomi(t) is ever significant in the class
of signals we wish to represent. Figure 2 (in Sec. I-D) shows a typical
formant frequency transition from samples of the spoken word ”rudder."
This frequency transition has been inferred from a sonograph display. The
range (over several Speakers) of the frequency derivative wi(t) is from
5000 to 15000 Hz/sec or 5 to 15 Hz/msec. So, computation times on the

order of 20 to 30 ms can have biases of i 100 to 450 Hz. If we take

an idealized "formant transition" of the form:

£(t) = em“) (II-C-ll)

72

 

where ¢(t) = 2000 Hz 0 S t S .020
2n

= 2000 - 300[3(t/.030)2 — 2(t/.030)3] .020 s t s .050

= 1700 Hz. .050 s t s .070

p(t) gives a cubic transition from 2000 Hz to 1700 Hz with a maximum

second derivative of 10,000 Hz/sec. (see Figure 9a).

Figure 9 compares the magnitude of the actual output, 0(t), with
the magnitude of the quasi—stationary term for five Fourier coefficient
filters with 201MB computing period. Also shown is a curve of the envelope
maxima across the five filters. Figure 9a shows the quasi-stationary
term evaluated at the input instantaneous frequency, w = @'(t). Figure 9b
shows the quasi—stationary term evaluated at a biased instantaneous

frequency.

W = wi(t) + Th/Zdi(B) , T : 20 ms (II-C-lZ)

As is seen, this biased term gives a good correspondence between the

quasi-stationary envelope maxima and the actual output envelope maxima.
(Note that this delay distortion is not due to nonlinear delay versus
frequency characteristics),

The implications of this analysis for the signals we are con—

sidering are obvious. Sliding Fourier spectra with computation periods

on the order of 20 ms cannot adequately show frequency changes in the

input without bias.

73

H3d2_ u_O >OZmDOmmd mDOmZ<FZ<FmZ_
0253 m_>_m_m:. >m<ZOC<lem<DO QZ< 4<DFO<IFDQZ_ >OZwDOmmu.
OZ_>m<>|m:>:.r mOd wFDdHDO HZEOEHEOO mmEDOa do mODEZO/iz m mmDOE

 

 

 

 

 

 

    

 

 

    

 

 

 

828%
08. Ba 08. OS. So. one So. So.
. / a X] \! _ NIoom.
/I|\Iu ' \\
/_ _ _ A l// \u ~18:
/ _ / \I
/ \
/ _ \llb.\
\
/ _ \ /./
K _ / . Ni 82
, i / ,
/I/
, _ x/
_ / v Saz. do 55:85
I _ :I. / \lm30w232<52_ . :t
r s N:
89
v _ «532.2
_ wnod>zw
ILlL r _ , x v 1 .3...
08 08 So 4 I II/
o o /I f :5:
> 2“; we“. / E42925 .930
3825553 so”: _ 5&8 d<ao<i\ / no 33:20.22
2:5;42 waod>2m _ me 33:29:... //
no >58 ll'
H H H _ _ P _

 

 

74

Aomtiocoov FDaZ_ ”.0 >UZwDmed mDOmZ<HZ<PmZ_
0253 mSEmH >m<ZO_._.<.rwl_w<DO QZ< 4<DFU<|FD¢Z_ >UZmDOmma

OZ_>m<>nm:>_; m0”. 8.33.30 ._.Zm_o_n_dmoo mmEDOu. do mDDtZO/iz

 

 

 

 

 

 

 

 

 

    

 

o, mmDoE
828$
80 So So 08. So. one. 08. OS.
_ i N: 82
_ W _ _ _ N: 8:
z _
z _
_
s _
a. _
.. w N: 82
, _
a
. _
. _
. _
. _
n _ N: 82
.
. _
— .
_ “ <s__x<s_
. 393%
a _ \\
_ _ — _ _ N
one. 80. So 0 _ I 83
o 3 _ .2wa
mnemm<wm<wwh_i20mu _ // £220.55 .930
I
to 33:29::
<_2_x<s_ mmod>2m _ SE30 .232 r
_

I
I
”.0 ><4wo do wQDCZO<E

_ _ _ _ _

 

 

75

Another example of enve10pe delay changes due to changing frequency
can be seen in Figure 10. Power Spectra for a vowel transition of band-
limited Speech (module 6; bandpass, 577—1867 Hz) are shown. The vowel
transition for a male speaker, from unstressed /u/ to stressed /a/,
is also shown in Appendix [L The sliding Spectra are computed every 15 ms
over a 25 ms interval. Two vowel peaks are present in this filter, with
one peak changing in center frequency from 902 Hz at 405.1 ms to 1445.Hz
at 53C’ms. The estimated frequency derivative at 480 ms is 9200 Hz/S.
The absence of a significant second peak at 480 ms (relative to the lower
peak) can be explained by envelope delay, which is caused by the bias fre—
quency due to the great change in both frequency and envelope of the cor—
responding formant. These ”holes" occur frequently in Spectra of Speech

Signals, as iS noted by Schafer and Rabinergz

and require complicated logic
to avoid errors in formant peak tracking systems. The technique used by
Schafer and Rabiner gives better frequency resolution at the cost of
numerous computations (4 minutes on a GE—635 computer to compute two
formants and pitch period for two seconds of speech). They first compute

a cepstrug to reduce the influence of the pitch frequency and then display
the magnitude of the cepstrum transform along a spiral arc (see Figure 11)
rather than along the unit circle. The spiral arc correSpondS to a straight

line in the s—plane. (Schafer and Rabiner call this procedure the chirp

z—transform.) Improved frequency resolution results from passing close

 

:5:
A Cepstrum is computed taking the log of the Fourier transform. Two

convolved time waveforms can then be separated if their frequency
dIStributions are approximately disjoint.

76

TIME

 

577 Hz 1867 Hz

FILTER 3dB . I
BANDWIDTH I

\————’-"\’/"/\‘

    
 
 

PWR 4630.0 529.7 ms

 

 

 

1445 Hz

I PWFI 5279.8 504.8 ms

 

 

 

I 1395 Hz

 

l PWR 4255.0 479.9 ms

 

PWR 2359.7 455.0 ms

—_——-_——r

 

981 Hz I

    

PWR 1759.3

 

  
 

PWR 2013.2 405.1 ms

 

 

 

 

“T I I I I I I I I
40 H7 FILE 3 MD6 LIN MAGN FREQUENCY —— 3333 Hz
BANDPASS FILTERED SPEECH [ul T0 In] TRANSITION

 

 

(See Appendix D for description 01 labeling.)

FIGURE 10 FORMAT ENVELOPE AND FREQUENCY TRANSITION
CAUSING DELAY DISTORTION

77

II

3 PLANE S PLANE

II
/\
x x
POLES OF

x* (3)
=— x

 

X
POLES OF X(s) /

 

 

FIGURE 11 CORRESPONDENCE OF Z—PLANE SPIRALS AND S-PLANE LINES
FOR THE CHIRP Z—TRANSFORM (From Rabiner, Schafer and Radar)

78

to the poles of X(jw). We can express this concept in our formulation
by letting y be a complex variable rather than purely imaginary. The

choice of t to minimize od(t) would be an approximation bX(t) throughout

A
[t—Th, t]. When x(t) is generated by a second—order linear time—invariant
A

operator, the real part of bx is the real part of the complex pole in X(jw),
We can see, then, how frequency resolution is enhanced, although
envelope delay is still present since no provision was made for frequency
derivative compensation. Schafer and Rabiner's techniques require a classi—
fication process to limit the input signals to male, nonnasal vowels and
an iterative process to find the best Spiralarc. to achieve good discrimi—
nation. If the characteristics of the input were known, we could reduce
the number of computations by appropriate choice of filter transfer func—
tion. This is the technique used in modern\scan-frequency analysers where
the phase of the IF filter transfer function is matched to the frequency
derivative (scan rate) of the input, (Kincholoeb4 ). If we knew bx(t)
and the signal class, we could parallel Kincholoe's techniques by adjusting
a time—varying filter to select a formant by center frequency tracking,
minimize delay distortion by adjustment of the filter phase function to
match the frequency derivative, and improve frequency discrimination
(rejection of other formants) by bandwidth matchingf Such a Scheme is

shown in Figure 12,

The estimation of bx(t) requires classification of the input
signal (as we discussed in Chapter I) and results in a ”rough" initial

estimate bx(t) which is used to generate a mixing Signal CXpI-bx(t)I.

 

*
As noted by Kincholoe, the matched phase function would attenuate adjacent
formants whose frequency derivatives are not matched in the same manner
shown in Figure 9.

79

 

29.515me mmkm2<m<m ._.<_>_m0u_ mOm mmkuzd OZ_>m<>:m_>:._.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

NF mmDOE
>28 85 Exp
A _ ”.0
7 3.8
3.3 ‘ zoF<EEmm
ll 332:8 2 88 ill
v 32 :
3:23
0
; Exm-

 

 

 

 

><4mo meJI «so
II... n H c
4. we t4<_m<>z_ _
wEC.

 

 

 

 

 

 

 

 

 

 

 

202540; .mm<._o

 

 

 

80

The filter can then be specified using standard Laplace transform tec
niques where the dependent complex variable of the transfer function
the difference between the mixing signal's complex ”frequency" {g;(t)
and that of the input. The estimate is improved by a feedback loop.

The delay distortion caused by frequency and amplitude changes is est
mated and then corrected by a variable ideal delay. Equation (II*C—€
can be used to analyze the feedback 100p, but it can also provide a

Synthesis procedure for a digital algorithm which significantly reduc

 

the computations necessary to implement the scheme shown in Figure 12

Assuming that bx(t) is given, the majority of the computations are re
to implement the filtering (mixing and delay require one operation, e

per point of time).

There are two types of digital filter algorithms, transversa
recursive. Transversal filters compute an output value from delayed
values and are basically discrete convolutions (or correlations) of t

form:
N—l
2 ‘ . k2]. 2 co. .-
o leijlk-j , , (II
b

J:

The number of operations (one addition and one multiplication) per po

*
of time is N. Recursive filters compute an output value from delayed

and output values. The algorithm is derived from the z transform of

:1:

The Cooley—Tukey algorithm for computing Fourier coefficients is of
form and for this Special case requires only «dogr N, where r is the
greatest divisor of N.

81

filter time function.

 

-1 -m
- — Z . . . a z
0(z 1) _ P(z 1) _ a0 + a1 + + m
- — — — _ -n
I(z 1) Q(z 1) l + blz l + . . . + bnz
m n
= ) 1 ‘ 0 I —c-14
Ok 1.. a), k-j + A bj k—j ( I )
jzo J20
.. -' S
where z 1 e‘JA‘ is an ideal delay of time A
m is the number of zeroes
n is the number of poles.

The number of operations per time point is m+n.

We can use the quasistationary term from Eqn. II-C-9 to approxi—
mate the filter Operation in one operation per time point. The prefiltering
classification and estimation of bx(t), along with feedback correction,
allow this approximation to yield precise frequency tracking (the amplitude
distortion is not relevant). The appropriate (narrowband) filter characteris—
tics are stored by means of the complex transfer function H('). The value
of the input at each time instant is multiplied by the value of this func—
tion at the estimated bias frequency. This method cembines the relatively
low number of operations of the recursive filter with a desirable feature
of the transversal filter. This feature is its ability to change the filter
coefficients. If this is done with a recursive filter, an additional
transient distortion is introduced. Thus we can achieve an approximate
time—varying digital transfer function with a low number of operations,

given an estimate of bx(t) and a classification of the input.

 

82

The classification system must be able to determine rough, but
unbiased estimates of parameters of the incoming signal. The overlapping
filter bank discussed in Chapter I can provide a basis for the estimation
with some restrictions. In order to maintain similg between the outputs
and input of a filter (within the effective bandwidth), Thwi(t) must be
less than the acceptable frequency resolution error. Thus a "worst case"
bandwidth requirement can be derived which would introduce negligible
frequency bias for all Speech signals (although the bandwidth would be

excessive for some).

InSpection of sonograms of English words Spoken by several
Speakers indicates that the maximum value of ®i(t), 15,000 Hz/second,
occurs frequently from 800 Hz up to 3,000 Hz. (Above this frequency it
is hard to make reliable inferences.) Figure IKSshows bandwidth requirements
for several percentage resolution errors. The bandwidth is determined from T

h

(approx. rise—time) for linear-phase filters by the relation

BT .s 1 (II-C-l3)
where
B = J A(w) dw/A(o)
O
{.00
T = J h(s) ds/h(o) as .8T
h
0
f m -'wt
A(w) = ‘kJ h(t)e 3 dt
‘00

B gives a measure of bandwidth that is approximately equal to the half
power and effective bandwidths (for filters with very sharp rolloff like

those we are using, this approximation is better). T is a measure of

 

*
Defined in Section I-D, p. 23.

 

83

 

mHZmeEDOmm

>OZmDOmmd m0m<4 m0”. mFZmEmEDOmm Il_.0_>>DZ<m mp mmDOE

NIIIFQ _>>OZ<m

003 com F com — co. _ 000, 00m com com com 00m cow com com

.‘

_ _

>095 m:.:. 2. v_Z<m mmkﬁu.
Oz_dn_<.._mm>0 QZ<m m9? ll.

oaanoqﬂ

_#4____~__

       
  

\

NIOOOF

mmkd‘m mOmmw 02..
Z<OmZZ<IE

NI Doom

me2 m3

NI ooom

 

‘

>OZwDmeu

84
rise time, usually between the 10 percent and 90 percent points on a
step—response envelope curve. Figure 13 Shows that our choice of
bandwidths (Sec. I-D) is adequate in view of the inference from

33

Flannagan's data that a just noticeable difference in frequency
for human experiments ranges from approximately 5 percent at 1000 Hz
to approximately 3 percent at 2000 Hz. The data for this experiment
results from individual variation of the first and second formant
frequencies in a four—formant synthesized vowel.

In the next section we look at the outputs of such a filter

bank and attempt to segment the speech signal into homogeneous epochs

with center frequency and bandwidth as parameters.

 

II-D ESTIMATION AND SEGMENTATION OF INSTANTANEOUS SIGNAL PARAMETERS

The preceding section demonstrates how complex acoustical signals,
such as those encountered in speech analysis, are represented most appro—
priately by instantaneous time functions related to the envelope, instan—
taneous frequency, and pre—bandwidth function. Differential equations
for these functions have been derived for a single isolated formant. The
bandpass pre—filtering that we have specified in Appendix B attempts to
isolate formants. However, the inadequacies of fixed—frequency bandpass

filters and the presence of inherent background noise in any realistic

 

environment indicate that these differential equations will not be an
exact representation. Therefore, a general form for these differential
equations that can be expected to describe the signal parameters as seen
on the outputs of our bandpass filters is more appropriate. In the

following, we will denote the ratio a/a as br (the real part of bx)'

l r r

i— b = f (b . w, g/a, v, n , t) (II-D—la)
dt .t 1

1w r

i7 : 1‘,)(b , w, g/a, ‘Y. T1,. 0 (II-D 1b)

where fl and [O are nonlinear time—varying functions for the derivatives
of the state variables. TR and WE are stochastic processes which represent

the unwanted Signals and other noise.

The classical theorems on "best” estimators deal with asymptotic

properties as the number of samples becomes large. These results are of

85

II-D ESTIMATION AND SEGMENTATION OF INSTANTANEOUS SIGNAL PARAMETERS

The preceding section demonstrates how complex acoustical signals,
such as those encountered in Speech analysis, are represented most appro—
priately by instantaneous time functions related to the envelope, instan—
taneous frequency, and pre—bandwidth function. Differential equations
for these functions have been derived for a single isolated formant. The
bandpass pre-filtering that we have specified in Appendix B attempts to
isolate formants. However, the inadequacies of fixed—frequency bandpass
filters and the presence of inherent background noise in any realistic
environment indicate that these differential equations will not be an
exact representation. Therefore, a general form for these differential
equations that can be expected to describe the signal parameters as seen
on the outputs of our bandpass filters is more appropriate. In the

following, we will denote the ratio a/a as hr (the real part of bx)'

d r

— b = f (hr, w, g/a, v, T] . t) (II-D—la)
dt ,t 1

iw

% = 1;,(br, w, g/a, v, TL. L) (II—D 1b)

where [1 and fa are nonlinear time—varying functions for the derivatives
of the state variables. TE and WE are stochastic processes which represent

the unwanted signals and other noise.

The classical theorems on "best” estimators deal with asymptotic

properties as the number of samples becomes large. These results are of

85

86

little help in estimating instantaneous values. A multiple regression
analysis would fit a polynomial of Specified degree to the observations
over a fixed interval. However, this method requires a priori knowledge
that is not available (maximum degree of the polynomial and a fixed
interval for fit) and much computation (usually a matrix inversion

(Donahue7l). Thus, pointwise estimates are required.

*

For time—invariant differential Operators with either stochastic

or deterministic excitation, the two common parameters are mean frequency
(E) and bandwidth (BWZ). The mean frequency for analytic signals is well

A
defined in terms of the spectrum X(w). We can derive a formula in terms

 

of the time functions a(t), a(t) and w(t).

I uX(w) X*(w)dw

oo 03
_ m 1 P A* d A .
w = m = e J x —— (t) dt// I ad(t)dt
2 J 0 dt
I a (t)dt ~®

..CD

 

H:

where the second integral is due to a step discontinuity at the origin.

 

.(t) . 2 m * d A + m a
a(t) + Jw(t)] |x(t)l dt +-J¥f (t) a? x(o )dt I a (t)dt

..(I)

._ ”a - __1__"°°-_ _ shop] I‘m.
w : JO? (t)w(t)dt + ZﬂjLJ0a(t)a(t)dt + ——§——- J_ma (t)dt.

 

*
Because the process is ergodic we W111 use time averages rather than
expectations.

87

A
Since we assume that x is a well-behaved, finite energy function, a2(®) : O.
_ .00 CO
[ 2 F 2 ,
w = J a (t)w(t)dt J a (t)dt (II-DeZa)
0 _w

The effective bandwidth can be converted to a similar form
(from II—A—7)).

J lb (t)l2a2(t)dt
Bwr2 z 0 x9

‘m

J0a2(t)dt

 

 

 

..00 ,3 ‘10.)
. J br (t) a2(t)dt J0(w(t)—E)2ad(t)dt
de = O + (II—D—Zb)
Cma2(t)dt m 2
l . a (t)dt
.0 0

 

Thus for constant coefficient Operators we have weighted time average
formulas for intuitive parameters. For time—varying Operators, we
are not so fortunate. In order to derive formulas we need an assump—
tion that is Often true for physical systems. We call a process
locally ergodic if we may reasonably approximate ensemble averages by

sliding time averages, i.e.
t
‘t 1.‘
E{c(t)f R: - J c(G)dO (II—D-3)
I T t

-T
Basically the assumption is that the time behavior Of the parameter c(t)
is "smooth" with reSpect to the statistical variations. This procedure
is incorporated in many engineering systems, and we are merely recognizing
this Often—invoked assumption explicitly. The determination of T is a

key to this approach and depends on the nature Of the processes. We will

discuss its choice later.

 

 

88

Equation (II-D—3) can now be rewritten to give averaging equations

for time-varying Operators.

 

 

t t
a(t) s I a2(t)w(t)dt/j a2(t)dt (11—13—421)
t—T t-T
t a t .
J br (t)a2(t)dt J (w(t)—U3(t))‘ia2(t)dt
13w"3(t)rsb'd (t) = t—T + t-T (II-D—4b)
XS t t
j a2(t)dt I a2(t)dt
t-T t-T

Notice that (II—D-4b) gives a sliding time average Of b S(t)
X
and hence the time average BW(t) is denoted bXS(t). In Appendix E, the
relationship between Sliding standard deviations and derivatives is shown.
TO summarize the arguments in Appendix E, the most estimator for the enve—
10pe is derived from the Iﬂlbert transform. The absolute value estimator
gives some distortion, primarily during epochs with changing frequency,
but requires much less computation than the Wilbert transform estimator.
For real—time recognition Of connected speech, the following
estimation procedure (shown in Figure 14 and discussed in detail in
Appendix E) is proposed. The output of a (wide) bandpass filter is passed
to absolute value enve10pe and zero crossing frequency estimators. LOWpass
filters then remove unwanted oscillations. In Appendix E, the best choice
for the time constant of these filters (called subinterval length) is shown
to be on the order Of l to 2 ms.
A Sliding mean and standard deviation is then computed on the
output of each bandpass filter. This procedure has been chosen for its

adaptability to real—time operation, its low—cost hardware implementation,

 

89

 

 

 

ZO_._.<_>wQ
QI<OZ<Fm
02515

 

 

 

 

 

Z<m:>_
02.9.5

 

 

 

 

 

 

mmﬁh _ m
mm<a>>04

m4<ZO_w IQmmam Qmmmﬁhl mm<doz<m mOm
mmmHmSEm/od OZ_>m<>Im:>_:. m0”. meOwQOmd 20:32me

 

 

 

 

 

 

 

ZO_H<_>wO
QI<DZ<Fm
02.9.5

 

 

 

 

Z<m:2
02.9.5

mOF<Z_S:mom_O
DZGmOKO
nOmmN

 

 

 

 

 

 

 

 

m m2... _ d
mw<d>>04

3 mmDOE

 

 

mmCHZm
mm<aoz<m

AISQE

 

 

 

 

 

 

 

m34<>
mquOmmd.

 

 

 

 

Ex

 

90

and the minimal degradation Of any further processing that may be
>k
required.

Pictures Of bandpass—filtered Speech acoustical signals indicate
that fricatives such as [f] from [umbif] can be analyzed in a Similar
fashion. The instantaneous envelope and frequency estimators (both
real—time and derivative) retain the stochastic nature Of the bipolar
signal (Figure 15). The subinterval length of 1.2 ms and
sliding average length of 10 ms appears to be adequate for the frequency

range shown (see Appendix C for filter bandwidths). Comparison of the

bipolar signal (Figure 15a) and instantaneous estimators (Figure 15b)

 

indicates that a narrowband assumption, which has been incorporated in
the local ergodic assumption, is appropriate.

Consideration Of many cases for different Speakers and utterances
indicates that the zero crossing-absolute magnitude enve10pe repreSentation
occasionally fails to represent bandpass—filtered Signals adequately.

The primary case where an ambiguous representation arises is when two
energy peaks occur in the same filter (for example, the case discussed

in Chapter I and illustrated in Appendix D). In the filter of bandpass
577—1867 Hz (Module 6), a (relatively) strong energy peak continues at
750—800 Hz from 370 ms to 700 ms. At 430 ms, a second energy peak begins
t0 "move" away from 902 Hz toward 1445 Hz. ApprOpriate choice Of filter

band bandwidths could isolate these peaks; however, this approach would

M

*

For instance, if frequency resolution must be increased, the sliding
mean length can easily be increased, averaging the previous time series
again. However, more frequency resolution cannot be gained by further
aVeraging of the output Of a sliding Fourier series computation; a new
tranSform must be computed.

91

 

H rm 9 EB m: 59 2. mom .30 man: .2205 $5935 530%

o o g o o o c or of
r-..'-.--‘ ....... t .......... r ........ r ......... h ........................................

m: N. mmw two: :mr m4: m: m. cow.

éréiagigéegggég

m: .m. 02. we: rm mi... m: m. m5

égﬁéégé. a. ..éiéesisiézi

m... mnwmmw ................................................ a on--.mmzws.ﬂ:: ....... m. mmdwm
n-.wm..~...8.m ................................................ m. m::-....m:.m.s.ml..-.--.--...m.p. was.

§§§§§§§

 

 

S)2

 

Aemscﬁcoov mJ_<ZO_m IOmmam memhﬁm mm<aoz<m m0 mZO_._.<._.wawmn.wm

mm04m>zw mtjommc
as 80 .u wt: .u a 1w 2 wt: ... ~ In @—
8w
g; II «I
J (\Itfu‘ 33K m

u.Em 0N5 .u

 

rcr

.f

4

n o
. .- och

 

 

we own. .

on.
I.

S\‘ (K)

@—

mmDOE

35.3.2 ...0 cozatogn .3 U xicwna< ommv

mx: ... M rm 0,

5 \l/
Xx I .1.

 

225:8 nmcozaew >m 8.2.555 m>C¢>Emm wand)?»

2 00¢ .u pf: .-mH 1m m“ an 02 .- mx: {w— Im 3 me o2.

mt»

 

ti ’3 .61. c(n‘l)..l.lr\.’||vr \6. 4). .0. . 0.0w .. 4.x .
s. .5 2 .1 I a. .... .... .. «it. . ...... .. .. m (r.
..t . .4 .

a
n 0
.-mx

 

 

I‘.

I.u-ms. rm ma

 

o

.o 0;

m5 Own .|
Oﬂq

mE ovm . I
ONH

w::

m»:

 

\
1“?- VII)..(
’1

‘- .
)

mExthw 5750mm; ozawomu omwm

m6 00¢ .v

wt: .n H Im m" an 02 .- mi: .s _ 1w 2

 

 

mE omv .l

 

000
..4 ... .
..a “v.4. .Ix Ila-Ir . s11
)‘r\'l'h'.“.\|’a\ 1‘ ks

.........(<.x.........L/ru....> A

4"}! “a.

 

 

@Nv

m \31]...

1V \Afsxlu}..}|a

 

{~1me

0
.
u

o
.c Q;

 

.-ws 1m ms

 

wx:
02

N 1. )JJI\.I.J\.IJS .

we ovw ..

.u .. rm m”

 

x («x

 

93

me 00m .

 

I?
o g 0 b a
. N... .2 . ...\.. . o.» !
O 0 I\ I00
0. t o. ‘0 r ’0’ f. 0%. Col‘ on.
I ‘ .00.I O Q
o. I I \f' I n a, I It
... Jo . .. h.
o o co. ’0 o
O o 0.

u:na .-ma :m ma

 

 

 

{2 Im mu

 

I .
O I
.0 .
O I
‘ Oh I
o u u e I O
‘ -.. O I a. ‘ 0' o
.r . .. .r... ... .
Cu O o. O I .
.. ...: t a... 33...“ t
o f. 0 O. '0 I .
O 0 o
O O
C
as com .
s m2:
.
0- no u. I... O“. h 0.
.1... "o. 0’ a. n 00-,
O ' 'I00 0 a a O I o
o o - o. I

’0
R
g!

I
\.
o
I

I

 

zomuouzam rennxnznum mo swan 1.6mm

.-mz :m m" nuns. -- m>n- .w-mu

Adah—mam. *0 :OZOCOmOU .2 U

.E o.“ .- m>n-

cw m”

5283 8m.
.-wn cm a“

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0 ‘ I O
o If I L
woztzoax 76:95.; inﬁrm/am no wdaxnhmm >zm ngo
as om» .- ms: {91m 2 9. Sq .- m1: .umw
80 ON?

 

 

 

 

1m 9

.... ox. .- m...”

we own . - ms:
08 wmq o2 4.
r. s .. m .1. . .... N
. .. . . ..... . .. .. .... .... .
aaﬁn.. is .a an . -n.z . at. - .mn ? ﬁdanx.~w w r3 m
..s .. .. .. . .s . . .. .. ... . ... -. ... ... m .... ... 4; ﬂ . . ... :2.
a ... ~ ... . v . . w c .. . a) .. . a. w u . . . w
n H o u . o o . ..- . iv... II. a o. l n ......u on u.
w .JS H m w .. m + As Jm . w
.> . > > a . x
. m
m m m o
1 n u
u . o l n x I. ..U H X
onFuzza renmxazqm do pace >¢azuoaxH
.-mz Io ma as amt .- mzzk .-m_ In as a. 0.. .- m>m- .-mm :w mm .5 cam .- ma”- .-mm mm a”
cow 3v 1 1 ON“ .
r m m a
W IOI‘O 0" ' 8. At- DIG. ’ I In.5 w '0 I ... 3" . ”I O m " I... 3' I. CI! 0. I x
D. I. .ov-Ilo upatna L ‘ O I. r. C\o ‘ . f o .o o o 0... a ‘ '0 he
.- Onlr d . .‘Q I I . 8 00 a... .0 . (‘- ft GIN-‘a.o o .1 0v o . .Iul .
w a - . .... w . an - m w
.. .. ...
x z .... . y.
m
m m
a m o o
.. m .- a x . n a x

 

.nmu cm 2

 

 

y I W .0 I I. O o
u ,- n ) m .I 0 C I. 0 U m
. . C. . u. .. . (.15.on 91.3“}; \o s . I; 5‘... ...-NJ! i .0. .l .1. .... I
o. o O. a an. I . In“ | I. c u. - O- \ o O. O . ...-u l-
o . ..... . ca 0. «.31.»... .. .. . w rave? $31 ..Aq no r . .5 r1. . .. ... . .
Q o .- .I.. ‘0 a m n .0 a .0 w
w ..m ”w” a. x
z u n .hu . x x . :

CLO

 

LO

 

 

94

require a fixed bandwidth filter tailored to each Speaker and utterance.
For our particular choice of filters (described in Appendix B), the zero
crossing-absolute magnitude representation (Figure 17) follows the
stronger low—frequency energy peak.

One method of isolating the more interesting high—frequency
energy peak is by first computing the time derivative of the bandpass—
filtered acoustical signal and then the zero crossing—absolute magnitude
representation. Several factors recommend this approach. Cherry and
Phillips indicate an increase from 65 to 92 percent intelligibility by

using the derivative (hardware derived) Of the wideband acoustical

 

Signal for their zero—crossing intelligibility studies. Thomas, referring
to this increase, states that the pre—processing accentuates the second
formant, which (he proposes) contains the significant linguistic information.
For isolated formants, the increased intelligibility can be
due to an emphasis Of information-bearing parameters which are related
d§(t)

A
tO the prebandwith function (recall that dt : bx(t)x(t) ). In

 

the wideband Signal case, high frequencies are emphasized, as we see if

we consider the transfer function of an ideal differentiator (linearly
increases with frequency). Most physical differentiators* are necessarily
approximations and incorporate smoothing to high—frequency variation.

A typical transfer function of this approximation is depicted in

Figure 16, where lower frequencies are deemphasized (with reSpect to

higher frequencies). Thus, Cherry and Phillips’ results can be

 

*
In this study, a cubic interpolation is made between extrema Of the band—
pass filtered Speech signal, and this interpolation equation is differen—
tiated.

95

 

ml

 

f

FIGURE 16 SMOOTHED DIFFERENTIATOR TRANSFER FUNCTION

96

explained if Thomas' hypothesis is true. Resulting zero—crossing-absolute
amplitude representations are thus able to "capture" other energy peaks.

The frequency estimate (upper left plot of Figure 17b) for Module 7
(derivative of Module 6 filter) and utterance 16 BE 1 clearly shows the
frequency transition that was difficult to find in the Fourier series
(Appendix D), or in the zero crossing frequency estimate for the undif-
ferentiated Signal (upper left plot of Figure l7a)T The resulting transition
is depicted even in a situation where bandpass filter selection was not
appropriate (for this particular case). The form of the transition is

what one might assign by eye to the sliding Fourier series in Appendix D

 

and also looks very similar to the dynamic articulator (tongue) trajectories
depicted by Houde for vowel—to-vowel transitions.

Figure 17 shows another feature of the absolute amplitude—zero
crossing representation. The sliding standard deviation is plotted against
the sliding mean for the absolute amplitude (lower right) and zero crossings
(upper right). In both cases (17a and 17b), the bivariate samples
form a tight group during the first vowel segment (before 430 ms) and
then cross a ”bridge" toward a new group during the transition. The
differentiated zero crossing case (Figure l7b——upper right) is the most
dramatic. The two—dimensional plots can only approximate the actual
four—dimensional situation, but it is still possible to recognize a

coherent time behavior that is not apparent with standard preprocessing.

 

*

The series of numbers, 0—9, indicates contiguous sample points simul—
taneously on all four plots. The "INDEX OF ZERO" gives the time of the
starting zero in milliseconds.

97

 

 

 

 

 

 

 

 

 

 

 

 

 

X q ‘0 X I '9 '.09
2c FREQ zc FREQ sov '
03 D '~I
.2 w 7
.. ,3. M .....,,a,.e.9
LI I. .°.° ‘. 5::1'- . l} . O '. .
R L...'...'..' 2...... R ' °.
2 2
16 BE 1 -. TIME -.480ms 15 BE 1 -. {JanuR 2 xu—.
- 2c FREQ
x 1 -. x 1 -. . . .
ABS ENV ABS ENV sov .89
01 2 .. _7 .
M M o. :i‘ C 8
-\ - .
T ‘ a . LI
LI 4 1235
R R £1:
" 0-9 - 5535f
Q .‘
2 u- M‘f 2 g”
16 BE 1 -. TIME -.4eoms' 16 BE 1 -. 01M 4 R 2 x 5-.
SPEECH DATE 12% INTERVRL Res ENVELOPE RND ZERO CROSSINGS ABS ENV
* INDEX OF ZERO 182
BANQ PASS FILTER“) §PEE§H my“. mp4 aw 233-I467 Hz

 

[See Appendix C for descriptiOn OI labeling.)

FIGURE 17a STANDARD DEVIATION VERSUS MEAN FOR ENVELOPE
(LOWER) AND FREQUENCY (UPPER)

98

 

 

 

 

 

 

 

 

 

 

 

(See Appendix C for description OI labeling.)

STANDARD DEVIATION VERSUS MEAN FOR ENVELOPE
(LOWER) AND FREQUENCY (UPPER) (Concluded)

FIGURE 17b

X tr‘v ' X 1 _.l
2‘3 FREQ z-c FREQ sov
03 ...... [g4 .
.9 9
M i‘) | ' . o .
‘ ~ O F3
5 2 . .. ...-...... .12.: O —T.;_::.:_-' 5 O. . .0. 3 o o o
.‘.'....‘ I" 2?: .2. '
R R .- Ev
2 2
16851—- TIME --480ms 16 as oavsR 2 x ll-,
ZC FREQ
X 1 "I X l —.|
ABS ENV ABS ENV SDV
DI DZ
M V
5 s
r") a
R P. ‘2‘. ‘8 -
.- ‘3
2 0:34 IT: 2 '2‘"
A vv an“ .
W
16 351-. Tiff-E -.480ms 16 BE 2 —. 31v 5 :2 2 x 5-.“
. - ,_ , ABS ENV
SPEECH DRTR 1.2Ns 1N.ERV9L R35 ENVELGFE RND ZERa CROSSINGS
INDEX OF ZERO 161
DERIVATIVE 0F BANDPASS FILTERED SPEECH SIGNAL __M__D_§ BEVWZEEIASYHZ

 

 

 

 

99

This dynamic behavior is further diSplayed by various estimators
related to the bandwidth function. Two utterances are considered, the
. . _ h
B to E vowel tranSition from [duat ] [16 BE 1, Figure 17] and the utterance

[umbif] [19 EH 1, Figure 15]. The following estimators are derived from
Eqns. (D—2) and (D-4):
1. Real part—-sliding standard deviation divided by
sliding mean Of envelope
a. For the real—time bandpass signal
b. For the derivative of the bandpass signal

2. Imaginary part——Sliding standard deviation of the
zero crossing frequency

 

3. DER/ENV——sliding mean of derivative envelope divided
by Sliding mean Of envelope.

These estimators are shown in Figures 15c and 18. Several points are evident
from the figures.

1. Bandwidth function estimates all have a stable "nature"
for certain epochs with significant perturbations at
the boundaries.

2, These epochs correspond (for some of the time series)
to natural Speech signal "groupings” (say, Reddy’s
phoneme classes).

3. The ”nature" can be grossly defined in a consistent
manner for strong deterministic signal groups (vowels)
as Opposed to weak stochastic (fricative) groups by
the deviations of the bandwidth function about an epoch
mean value.

4. The bandwidth function is relatively normalized across rather
large amplitude variations while still Showing variation

for different groups.

 

100

 

A295 353:3 me...<_2_.hmm I._.Q_>>OZ<m I._._>> Z>>OIm mHJDmmm ZO_._.<._.ZmS_Omm m— meOE

63:82 so cOZQIOmmU L8 0 Encoded. $9

82885 DEN ozq mmod>zm mmc s¢>mmt£ wzm; «ED 55%
951w 0...; EN onod u mwmzh
mpaxpmm armamwd 9.339% 00mm maod:ru mwnuomma

252 .- ms: .- .. mm s as: .- my: .- l mm m: ”32 .- mi- .- x mm s. 8. .- ti: .- a. mm m:

m m . j m ...)!Hc N
. .... .

LI

 

 

 

..I. . ... ... .n. m .l.n|m..l-.. ..I... .....l. .. m a 3
I. cl - I out. I. I “to h I. ) II o \J. ‘a I
.. .. m In... m .. ..

o a o o
.-rx .-rx .-Hx {ﬁx
52.62% “a mExﬁmm Efﬁe 29-9.8 3.22% no Han. same
...... of .- Ex: .-2 mm m: E 2:. .- E: .-m: mm 2 2 o: .- Mr: {,2 mm m: 2 a; .- Ox: (mu. m. 2

 

 

 

 

 

 

 

. .. m m “...... .......I. . m . .IIr N

I II .I . II
‘I II I I. I I. II .II I II
o. 0.. l. o n a I 0- .
. .... s I. .Hl . . c . c . o . cl. 0. 3 .I. . I
l. .n\.\(. . ... o ..o . I. .u .n .. I..
o . \ 5’ ﬂ . \IA’. . D’ a W . . . . . n... ...-l. W
o \. .I ... . Ks \l L. I." o I o
.. . . . . ....\ .v .x. ...\ - ..
r 5... .... ...... . .... .. .... . .... .. ..
6 ..n J R.” T . r
-- I \o
N I VIN

 

('I If)
ﬂ
('3 (V)

-.. .-uw (nun. .-..r.

onruznd tnnmrgxcm do ..«cn >mazuocau .

 

 

 

2 cs .- ms: .-2 mm 2 F a; .- we: {9 mm m: E 02 .- s: .-m.

 

mm ml. m... of. .- mx: .-mﬂ. mm m"

 

 

 

 

m m m m
? I. L!
. I . . I . ‘1. .. .I‘.‘
n I l l I II I . II. “I I... .. l a o. .I ... I . u s L
.. I I I I I - I. C... I I I ‘1’ ’ I. I . II
I o I I I m I I II- . W I . I o I II. n .0 u I I. . I” I II I’ I m
\I I o l O I ...... I E III I I- ”an. . I I 0-. Q . I ... I Wu 04.3 I II P
. . .... .. u ... . .. n. . . . .. . . .... ....t. . ..v .
... .. ..... .u. y . . . . .. . . . . m .
' I - OI - ' (A I ' I ‘0
. - x z ... .. r. x
r. . r w w
n .... n n m
. I V K I I V I a I a k a I H ‘

 

 

 

101

These results indicate that the first step in a procedure for
segmenting connected speech is to identify the points in the (filtered
speech) signal where "fundamental changes" occur in the "nature" of the
signal. The precise definition of the terms "fundamental change" and
"nature" involves specification of a real—time clustering* algorithm and
the four time series which give a dynamic representation of the signal.
The changes and nature are relative to information we can derive from the
particular signal we have at this time (thus termed real time). Since
we are dealing with signals that are heterogeneous in nature, any general

assignment of functional models to simplify the representation or

 

reduce computations would surely cause higher error rates, at least part
of the time (for further inferences based on the functional models, for
instance) or ambiguous interpretations of derived measurements. The
Clustering procedure is real time, Operating on data as they arrive
without requiring further passes through the data; self—normalizing and
not dependent on a priori knowledge; conceptually simple (in terms of
number of adjustable parameters); requires little storage and few compu—
tations; and gives a more revealing stabilized (in terms of stochastic
variability) dynamic representation of the original output along with

the marking of points of significant change.

 

*The procedure is termed "clustering” in order to relate a process for
dynamic (differential equation generators) transient phenomena to the
usual static data clustering techniques (ISODATA, Ball and Hall, 1967).
A precise relationship between the static and dynamic clustering exists
when one can choose a functional model for a set of differential
equations and then estimate the parameters of this model. The set of
all parameters would then, for a given time cpoeh, be one vector of the
type that is discussed in static clustering procedures.

 

102

In the defined state Space, the time trajectory of the differen—
tial equations varies about some mean value and the clustering would
define limits about that mean value which expand and contract, depending
on the time—varying parameters of the differential equations. For the
single formant model (and for other higher ordered systems as well), a
time-varying mean value and standard deviation can represent the time
series state variable value and its first derivative. To show this

variation, consider a normalized variable 2 at time n by the formulas

Zn 2 n n (II—D—S)

for the two time series (envelope y; and frequency y;) and describe the
variations in terms of the distribution of this error term. InSpection

of this (punitity indiealins that lunnmilization is ;xnfﬁ3nned by tJu)(livi—

sion by the sliding standard deviation. The segmentation procedure asks

the question: Is the differential model, defined by our two time series

for each state variable, adequate to describe the variations in the input
signal. For that reason, we will consider predictive instead of synchronous
normalized variables. That is, instead of using values of 2 at time n,

we will look at distribution of the expected next value of z. If we

write this out in a slightly different form, it becomes:

y‘ 2 m' + 0 y‘ (II—D—6)

 

 

103

Defining n3 as the difference between the mean value and the observed

value at each time n,
y —y zE’JzJ—TT: j:1,2 (II-D—7)

which is a discrete version of (II—D—l). The terms on the right side of
(II—D—7) are functions of the state variables, excitation parameters, a

stochastic term and time (n 2 1).

These equations can occur in some classical estimation problem

formulations:

(l) deterministic but unknown equation

= M + f (M )
n s n

n+1
(2) observation equation (sample function generation)
: X — X : "
yn m+l n g(mn) + gnf(mn)
where

gn are independent, identically distributed random

variables for each n, independent of mn and x
n

at time n with moments ul, u2, u3, u4,
(3) observation equation (ensemble generation)
d
\ = = O .. :: X
n+1 mn n/n mn Ez{\n}
a d 2
on = Ez{(xn_mn) }

 

104
where
Zn is a random variable with moments v1, v2,
V3, V4, ...

The difference between ii) and iii) is primarily one's point of view
(derivatives versus expected values of moments). The relation between
these, which is empirically Shown in Appendix E, can be derived by taking

CXpected values of ii) and iii),

kl_ ‘.

E§LXn+J.- Kn} :: g(mn) +f(mn) ul

1!

E {U z } 2 G v (II—D—8)
z n n n 1

Thus, the [our time series for enve10pe and frequency and their
derivatives adequately represent the bipolar bandpass signals and the
deviations from these time series can be exhibited in the normalized

predictive variables defined by (II—D—G) where y; : xi are the subinterval

averages and Ki is the sliding mean and Si is the sliding standard deviation
for envelope (3' =1) and frequency (,j:2). Then the points where these four
time series no longer represent the input signal can be determined by a
statistical test based on the distribution of the nonualized predictive

variables. This distribution can be estimated by use of the samples np

to t lIHL‘ 11

.J' .v‘—Ff‘. ,

[r 2 r r-l \:1,2 r:1,2,...n—l (II—H—Q)
35
r—l

In general, these values will not be symmetrical about zero because of

the nonlinearities. We need an estimation technique more powerful than

 

105

the currently pOpular procedures which are based on the normal distribution.
Because of the local ergodic assumption, there will be continuous changes in
the parameters rather than "jumps" between two or more ranges of values. Thus,
the distribution for each epoch will be unimodal (bimodal distributions will
yield two epochs), and a modified t—test with an Edgeworth approximation to
the distribution is apprOpriate.

The segmentation procedure, then, uses normalized predictive samples
derived from sliding mean and standard deviation time series to estimate
four moments, the coefficients of an Edgeworth series. If the probability of
occurrence (from the Edgeworth distribution) of the normalized values of
envelope and frequency exceed a predetermined threshold, then that sample
is included in the present epoch. If the probability falls below this
threshold, then the sample is declared a ndelshot. This procedure is useful
in identifying (and eliminating) data values of questionable use (such as
parity errors, computational errors, external impulse noise) which arise
quite often in digital processing of acoustical signals.

The definition of a segment point is an extension of the concept
of a wildshot. If the data continue to give low probability, it is
quite natural to assume that their "nature” has changed and that a new
epoch should be marked. This is controlled by two factors, the number
of wild shots and the length of time within which this number of wild-
shots must occur. (For example, two wild shots during four time
units may define a segment point.) Examples of the segment points
resulting from a computer algorithm based on this procedure are shown

in Figures 18 and 19. Figure 18 demonstrates dramatically

 

 

106

that the procedure is most sensitive to changes in bandwidth estimators
and not in the enve10pe or frequency estimators. This is not a limitation
since simple thresholds can detect significant changes in these time
series. Figure 19 shows how the fixing of locations for segment points
depends on the choice of the criterion and threshold. There are several
important observations related to the 7 regions depicted in Figure 19:

(l) Erroneous data (caused by a computation error)
are detected and flagged (2)

(2) The values of threshold and segment criteria
depend on the instantaneous nature of the signal
(1, 3, 6, and 7)

(3) Immediately after a segment point, a higher threshold
should be set to eliminate false alarms (e.g., the
threshold might decay exponentially to the set value)

(5, 6)

(4) The definition of homogeneity of the epochs is insen-
sitive to amplitude variations even during highly
transient behavior (4)

(5) Comparison with Figure 18 indicates that male/female

differences do not affect the algorithm.

In summary, the acoustical speech signal is viewed as a composite
nonstationary stochastic process and the mathematics of communication theory
is used formally to describe and discuss its complicated nature. One
isolated formant is modelled by a time—varying differential operator with
stochastic or deterministic driving functions. The parameters of this

model are related to steady—state concepts of envelope, frequency and bandwidth.

107

 

.mEEmt SEES mOu mHJDmmm ZO_._.<HZmS_0mw my mmDOE

23:39 *0 cotatemmc cow 0 x_UceaQ< mmmv
moo. .2 95.33.: a;

 

 

 

 

 

 

 

 

 

 

 

25:... wt: .: H am ms 2.0m» .- m5». .: H am 2 2:03 .: my? .: H mm 9 meovm I m)? .o N cm a"
1833...?) . .7)a:¢\;<fw l \. ......(sé _ éﬂﬂ
. n +- _ .1 . m ’71}... ...«I\ ..\J. . N . _ . ﬂ
. . A. Q, MI" . _ y
... _ M . _ ... l L ..Y...\.. .. . M _
JR.” TL. .u... u. u H U _ swam: . .. . aunt. .
... . : . . .. . . .. i .
_ .L *t 0:. a l \u
_ N. ..L1 . ... _
j l .... ...m r. 1 r _ 2
$5.1... _ . ...... m
x . : .. x x a x
0"
_ a M a a
_ o o o r o
..-o; 5-2x. 7.3;. is:

moo. ._.< whormois ¢x~

 

 

 

 

 

 

 

«Eowm .v mz: .: a am 2 mEONN .: W2». .: H am 2 «some; . .wz: .. N am ma 2.9% .- wimp .: N am m“

. .. 11?... ... . ...... ...... .. .. l 1%.... its)...“
_ _ T. _ __ _ . l m .111:4Jl\ ....x _ w .. __ W a

. mu _ _ _ W .t _ . $9.. _ m . _

~ F ......f.‘ t _W M1 . a M . .1... _ .... h .
: K04 m ﬁ’ .L w a _ ......urr..o.l\n../..\\...w~_ m .osul .. h. . u m

..ﬂ k.‘ a . w . .. ... w. m ..\ w
— 1 _ _ m not... I‘MW m .1. .. n j .u 1 .~ . j
.l. ‘1' ”do.
. . .. : . . _ . e a _. . e .
l".v _ .
a _ H m H
AU 0 @ @ @ A n o o
.u o: .: DC. .: oux. .a D;

 

0.0. .2 msoxmoﬁs mxm

 

 

 

 

 

 

 

 

 

26$.-. wz: .- 1mm 8 25.2 .- ms: .- i am 2 25? .- m1: .- H am 2 253.- ex: .- N am E
. 7:) . 41. J. _ . ll)». . «3 Ni.
.. ...).... m it...» _ m . .. a}. . S .
t. _ n .I d 4.." _ M w u . _
.. l H “v.5 _ .. _ w . _
. ~ ... 0\«.o . _ a: rat”. ._ _ _ _ . _
M W fuﬂ....~..r..:.ol \o. I. ‘ s. ..U m ...” I ... . \o ..M u m
.n. 3 w ... w . 7
. _ a... n
M I . 1 a. m I
..I.. _ _ m
_ ﬂ a u a
__ a o o
..l DHK . :0 D—X. .- aux

108

A derivatkniof the transient response for linear filters has been used to
show how fixed-frequency analysis (such as sliding Fourier transforms) are
inadequate and misleading for representation of Speech signals (eSpecially
the transient portions of these signals). This analysis also defines
requirements for the preprocessing wideband filters and the sliding averages
used for the pointwise estimators. Formulas for these pointwise estimators
of envelope, frequency and bandwidth are derived, and a predictive dif-
ferential equation segmentation procedure defines as an epoch those samples

of the acoustical Speech which have homogeneous characteristics.

The envelope and frequency estimators are standard absolute-value
and zero—crossing counts averaged over a short interval to Show the dynamic
behavior of the parameters. Additional preprocessing (derivatives of the
bipolar signal) increases the ability to isolate Significant features. Using
differential equation and statistical moment methods, we have defined time—
varying bandwidth estimators which have a stable behavior during epochs
corresponding to natural speech classes. They reveal information about the
diriving function as well as the differential operator and are normalized

with reSpect to large variations in envelope and frequency.

The segmentation procedure, defined on sliding mean and standard
deviation time series for envelope and frequency, is most dependent on
variations in the bandwidth function. It also eliminates erroneous (out-of—
place) data, is invariant to scale changes (due to male-female and amplitude
differences, etc.) but requires a sephisticated (plastic) threshold which
depends on the recognition and use of the epochs. This is the subject of

the following chapters.

 

CHAPTER III

THE USE OF LINGUISTIC THEORY FOR
THE DECODING OF SPEECH ACOUSTICAL SIGNALS

III-A Introduction

The difficulties of making a correspondence between physical
measurements and linguistic events (which we denote as code units), as
discussed in Chapter I, lead us to consider the use of contextual con-
straints to reduce the ambiguities. Linguistic theory has been formulated
in an attempt to characterize the complicated internal relationships within
one language. In order to make any progress along this course, we must
distinguish between characterizing the total knowledge that a language
user may ideally acquire (all grammatical sentences, for example) and
characterizing the production or perception of particular utterances
(referred to as competence and performance, respectively). Chomsky and
others have proposed finite models that generate all grammatical sentences
of a language. An ASR system must be considered in the second sense,
however, and one can justifiably ask whether formal language models merely
cloud the issues. For communication between two humans this may be true,
but the intended use of our ASR system is with a computer system incorporating
a quasi-formal language (FORTRAN, AIGOL) to perform functions. The con-
straints imposed by dynamic real—time decoding of human language (even
in a restricted context such as simple declarative sentences) demand a
slightly different interpretation of the procedures suggested by linguistic

theories.

110

In discu881ng the incorporation of contextual conSLraints into

an ASR system to improve its operation we will consider the following

topics:

(1) Error—types and correction

(2) Natural use of spoken language

(3) System memory and real—time operation requirements

(4) Flexibility of recognized vocabulary add-on and
delete

(5) Junctures in connected Speech

(6) Interface with existing computer operations

(7) Divers speakers.

(1) Error types and correction--Norma1 definitions of "error

 

(caused by misclassification due to noise, etc.) do not ade—
quately represent natural speech situations by postulating that
individuals do not always produce the ”correct" phoneme (or
equivalent code unit) sequence because of variations in dialect,
accent, speed of articulation, etc. Even "perfect" classifi—
cation into code units will yield sequences that differ signifi-
cantly from stored messages. A more general definition of
”error" should include production as well as misclassification
types of errors. Code unit production errors can be further
broken down into:

1. Substitution (miSSpelling) errors

2. Omission errors

3. Insertion errors.

 

111

An ASR system which corrects production errors as well as
misclassification errors could conceivably work on the
principle of identifying in the "best" way the code units
and then correcting (by the use of contextual constraints
or other means) the code unit sequence (Alterbs, ReddysG).
For limited vocabularies and one Speaker, the correction
can be performed by correlation with stored messages
(Bobrow et al.13). However, such an approach requires
much computation and can require meaningless measurements
(computing pitch periods in a fricative segment, finding
formants in a silent segment before step—releases). Fur-
ther, omission errors can be exceptiOnally disastrous because
they can cause synchronization errors. (Alter attempts to
protect against omission errors by requiring Specialized
names for FORTRAN variables.) A better approach would be

to have a directed classification, where only the required

 

and meaningful measurements are computed.

(2) Natural Use of Spoken Language—-One can avoid some of

 

the problems discussed above by an apprOpriate choice of
vocabulary (throwing out problem words) or by requiring the
Speaker to articulate his words precisely. However, these
alternatives are not compatible with a Situation in which
the normal use of language must be preserved (e.g., in
storing a dialogue between two humans, even with limited
vocabularies). In addition, it is known that considerable

training is required for someone to be able to control his

 

112
articulations in a consistent manner. We are, then, forced
to emulate the human's ability to decode Speech acoustical
signals and to determine the cues important to him (and
communicated by him) rather than those convenient for a

machine.

(3) System Memory and Real—Time Operation Requirements——
Modern computers with large random—access auxiliary memories
permit the use of large vocabularies, but ASR systems for
these vocabularies must provide a directed search to reduce
computation time and computing requirements (e.g., by limiting
the number of templates to be compared). It seems undesirable
to tie up an entire computer to decode a speech input and

then not to be able to do something with the decoded input.
Engineering solutions for ASR systems are often ad hoc
procedures which are "tailored" for memory economy and compu—
tation Speed and which may work quite well for the Situation
and vocabulary for which they were Specifically designed.
However, extension of these procedures to larger vocabularies,
more Speakers, etc., is often a patchwork procedure resulting
in a hodge—podge system which may or may not preserve the

original economy.

(4) Flexibility of Recognized Voeabulary Add-on and Delete——
Changing the vocabulary by adding or deleting one word (incre—
mental expansion) is not easy with ad hoc Systems. Either a new
section must be incorporated, or large amounts of reprogramming

are required. The problem of code unit sequences with errors enters

here too. A good closeness measure for code units must be defined in

 

order to use a system without adding Special case analyses for
each "perturbed" sequence. The search for an appropriate close—
ness measure for human perception of phonemes, for example, has
not been fruitful.

Specific problem procedures have an additional difficulty
with errors. Suppose the English word "blink" is decoded as
"blik" in one case or "bnik" in another. The ASR system must
find a best fit for these sequences from its past experience.

In normal English, both ”blik” and "bnik" are not found. Therefore,
the only action taken would be to find the best fit. There is

no way to determine that "bnik" can Beyer occur in English

(which says something aboutiﬂe "errors present") and that "blik"

is a possible English word and possibly a new event. Ad hoc

procedures have no way of deriving facts about the procedure

 

operation. This is similar to storing a multiplication table

when a multiplication rule would be more general and compact.

(5) Junctures in Connected Speech——Decoding of continuous Speech

 

texts requires the introduction of junctures (spaces, commas,
periods) not needed in isolated word situations. The ambiguity
resulting from lack of junctures can be seen in the following
three examples:

1
1. (Space) The phoneme sequence /ae ne m/ decoded as

"a name" or "an aim”

2. (part of "They are flying planes.'
Speech)

(”They” refers to

gliders or peeple?)

3. (period) "John is trying to understand this sentence
is a problem” (the period could be placed

either before "to” or after "understand").

114

Code-unit——code~word decoding when length of code word is
not fixed has been discussed under the subject of uniquely deci—
pherable (UD) error-free codes. The constraints on such a code
(AshE‘7 ) are much too exacting for a natural speech situation.
The constraints can be paraphrased for instantaneous codes, which
are special types of UD codes. No code word can be a prefix of any
other code word. Any subSet of a natural language meeting this
requirement would have little communicative power left. A UD
code that is not instantaneous may be found for some situation,
but real-time operation would suffer, as is seen in the following

example from Ash.

Let wl . . . wn 1 be code words, and 0,1 be code units:
+
“I 0
w1
«1 01
w2
w «I 0 . . 01

11+l K—;~Wi~.aJ
n

This is a'UD code, but it is not instantaneous. The unique
decoding of the first code unit of the sequence 0 . . . 01 into
. . A I n+1
w1 must wait until n+1 elements are encountered.
Synchronization errors causing emission and insertion are
disastrous for a UD code, and substitution errors could be disas—
‘trous without error correction before decoding into code words.
'Fhis situation is not likely to be successful for correcting pro—

(Jiurtion errors.

((5) Interface with Existing Computer Operations——One of the

 

useful applications of an ASR system is in conjunction

with artificial intelligence tasks. We will denote these

 

 

115

tasks as text manipulation (TM) programs (information
storage and retrieval, questions and answer programs).
Presently existing TM programs have no flexibility in
interpreting their input representation (typewriters,
punched cards), i.e., using semantic information to
resolve ambiguities. An ASR system could conceivably
be independent of TM operations. However, the need

for interaction for proper interpretation of natural
language texts is apparent, and a system should not be
limited by ignoring the possibilities. In addition, as
mentioned previously, most computer Operations make use
of quasi—formal languages for communication with users
as well as definition of tasks, making the interactive

possibilities of ASR systems very attractive.

(7) Divers Speakers——Many of the problems of incrementally

 

eXpandable vocabularies and production errors are caused by
allowing many Speakers to use the ASR system. Two different
situations are possible: In the first, the Speaker is known
previously and, hence, Specific adjustments can be made; in the
second the Speaker is not known previously and a Speaker nor—
malization period is necessary at the beginning of the use period.
The variations induced by Operating with several Speakers (or
with the same speaker in different emotional states, for example)
constitute one of the major difficulties of applying idealized
formal language models. No explicit account is given in the
present theories for these variations. It is evident that

these variations cannot be completely removed without interaction

116

between the grammatical system and the lower expressive system

(describing the code units).

The seven interrelated topics discussed above give some indication
of the problems involved with incorporating contextual constraints into
ASR systems. After an eXposition of some of the concepts of formal language
modeling (from Chomsk§£5we will proposed a different approach for
operating with these requirements.

We will define a (specific) language L as a set (possibly infinite)
of texts (each text may be one word, one sentence, one paragraph, etc.),
finite in length, constructed from a finite set of elements (code units).

The fundamental aim of a linguistic system is the separation of grammatical

 

texts which are members of L from ungrammatical texts not in L. A

 

grammar is a model of L in that it produces texts coming from L. The
usefulness of the model is determined by how the texts it produces are
related to the grammatical texts. A linguistic theory abstracts general
principles about successful grammars and thus gives us a way to compare
grammars for different languages. ASR systems perform a linguistic
analysis and are thus able to be studied in linguistic theory. Ad hoc
systems are not easy to fit into a general theory.

But why attempt to fit an ASR System into a theory of grammars that
generate infinite sets of texts when the number of reSponses must be
finite? A review of the preceding discussion should answer this question.
First, impossible ("bnik") events are distinguished from improbable (”blik")
events. Secondly, the problems of incremental eXpansion are Simpler. We
may be recognizing only a finite subset of any grammar's finite set of

producible texts, but we can change the membership of the subset within

 

lIIIIIIIIIIIIIIIIIIIllIIIII::———————————————————————_______E,_,=L_.,

117

the larger set much more easily than we could with some bounded finite
set.

We have already seen a dichotomization of a total system into
TM and ASR systems. This is in accord with the arguments of linguistic
theoreticians for Such a division.* Linguistic theory does unify these

these two seemingly independent tasks. The two parts (or Chomsky's

components) are:

(l) The allowable sequences of words or grammatical sentences

for a specific language

(2) The composition of the "words".

Rules are given to form sentences from ideas and sounds from ”morpheme"
representations. In addition to the organizational advantages we gain from
such structuring, we can also give more precise definitions to such nebulous
terms as "morpheme", ”sentence", and "meaning". The vagueness results from
the many uses in different theories in striving to overcome the inadequacy
of "the written word” in representing spoken utterances.

A morpheme is generally accepted to be the smallest unit that
conveys meaning. Its more precise specification is discussed within the
linguistic theory presented later. These morphemes form the interface
between the two levels in Figure 20(correSponding to code words in Figure l).
The purpose of the grammatical system is the coding of ideas into morpheme
sequences, and the purpose of the phonological System is the coding of
morpheme sequences into Speech signals. The primary weakness of this

V

definition of morpheme is the use of 'meaning". There has been a

revival of interest in semantics, the study of meaning, but we still have

*
”A language is a tremendously involved system and it is quite obv1ous

that any attempt to present directly the set of grammatical phoneme
sequences would lead to a grammar so complex it would be practically
useless.” Chomsky

118

UNITS OF LOWER LEVEL

 

 

 

 

 

 

 

 

 

UPPER
LEVEL RULES FOR GENERATION
CONTENT OF SEOUENCES

UNITS OF UPPER LEVEL
LOWER
LEVEL RULES FOR GENERATION
EXPRESSION OF SEOUENCES
INPUT

FIGURE 20 FORMAL LANGUAGE MODEL

119
no concrete guidelines for study of natural language use by humans. When
we limit ourselves to communication with a machine, the situation can
be a bit more promising. The restricted definition of "meaning" is, in
this situation, in terms Of machine response. Two morpheme sequences
have different meanings if they give different responses. Machine reSponses

could be different programs for compiling, different descriptors for

indexing, etc. In restricted cases, such as the computer applications
we are discussing here, an unambiguous workable definition Of meaning
can be given. It must be dependent on the actual linguistic theory used
for its precise Specification.

In the next two sections, we will make use of a linguistic

 

theory (Lamb's stratificational theorylg) in Specifying an ASR system
with the desired features. The language model preposed by Lamb must be
augmented to give explicit formulation of "supra—segmental" features.
We suggest that the difficult task of juncture identification requires
these features. Most linguists agree and give implicit status to these
features by calling them supplementary. The features we will use are:

(l) Juncture phon— modification

(2) Speaker expression

(3) Intonation and stress patterns.

The stratificational approach is used here in deference to trans—
formational grammars because of the compatibility with ASR system require—
ments discussed previously and summarized here.

In the stratificational theory:

(1) Different strata reflect the distinction between

representations of ideas where loops occur without

regard to direction or to the linear time—ordered

 

(2)

(3)

“ms, 'the(m

junctu reS,

121)

representation closest to the speech acoustical
signal (i.e. John hurt himself ~ JOhE:::::£:h'urt.
We use this dichotomization to treat the predictable
”error—like” phenomena discussed previously.

Grammar rules are explicitly identified by the types
of sequences on which they may operate, the number of
units at which they must look, and the types of phe—
nomena used to index them.

Only those rules required for the particular text

need be applied, and we need not apply all rules. The

primary classification indexes the Subset of rules.

tput of our ASR system will be sequences of morphemes with

This input can be used with several TM programs now available.

III—B Stratification Model for Generative Phonology

 

In current linguistic theory, the two parts discussed above in
the Introduction are called the grammatical system and the phonological
system. (LamblH refers to these levels as the upper and lower strata.)
As we saw, our particular requirement for the interface between the two systems
is a sequence of context-free morphemes plus junctures. It is not clear
that there exists a complete set Of rules to perform this task for unrestriced
use Of American English or even that finding such a set is feasible, but
we wish to Show a structure which will use such a hypothetical set of

context rules for possibly a subset of English. Work with well—Specified

 

pseudo—languages indicates that this goal is a realistic one.

We go along with Lamb18 in defining the following articulatory

feature types:

(1) Universal for all human Spoken language
(2) Automatically present when specific language is
Spoken (but not accounted for in (1))
(3) Distinctive presence not predicted by environment
and expressing meaning to a listener
(4) Nondistinctivc but autOmatically accompanying
a
distinctive features.
Thus, we can envision Spoken communication as a.neutral medium with dis—

tinctive variations which convey the desired information. This model

leads us to rank the two representations on a given stratum vertically;

 

*
Though not explicitly stated by Lamb, we believe that these types are
sometimes quite variable and may be consistent only in one Speaker's
pronunciation (ideolect). This concept is discussed in the next section.

121

122

*
these ranks will be specified by the suffixes defined below:

—on -— an objective (as possible) statement of the events
-eme -— the result of an analysis of the events with reSpect
to a Specific language for the purpose of identifying

only distinctive events.

The property of distinctiveness is a very important one which,
in one form or another, is fundamental to language usage and theoretical
models to describe language. We cannot have any concrete understanding
of the behavior of the two ranks without further Specifying "events".

The difficulties that arise in attempting to resolve this problem involve

 

other fundamental concepts and lead to the primary contribution of Lamb,
his separate strata. One of the most controversial topics in modern
linguistics involves units and the segmentation of the Speech Signal

that results from defining units.

H

It would seem natural to call the ”s on the end of ”boys” a
type of unit (since it indicates plural) different from that of the "S"

in "stone”, since the first is a unit of meaning and the second is only

one element in a sequence of Similar elements which have some meaning
because they are grouped together in a particular order. The phonological
system gives rules for constructing these sequences (for a given language)

from distinctive units. (For example, words can be formed from the 26

 

I

The original distinction was between "—etic” and "-emic' ("phonetic being
an objective description of sounds that are perceived and "phonemic"
being what is linguistically distinctive). The term "phonon" replaces
"allophone" or "phone", and "morphon" replaces "morphophoneme" (thank
goodness!). The new terminology is more symmetrical and less burdensome
(although the term "phonon" has been used by physicists in another
"context").

 

 

123

English graphemes.) However, the confusion that may result from a stin—

ation such as the one in our example, plus the requirements of 'context-
free" strings demands a separate code for meaning units (morphemes) and

lower units which make up these meaning units.

One other example may motivate our discussion of Lamb's model.
The Spelling of English words is often effected by neighboring meaning

units (context sensitive). The natural example is the way in which nouns

I

are affected when they become plural ("knife" to ”knives", 'wife" to "wive§').

This is a typical form of linguistic alternation. The pronunciation of

V! 1!

these words presents another problem, because the final 3 is pronounced

 

H H

like a z . This change is independent of the preceding alternation
("bed" to "beds") if performed in the correct order (change I a v; then
2 after voiced,s after unvoiced). Thus, there is a need to separate

these two processes.

Lamb Specified two strata in his model of phonology: the
morph— stratum and the phon- stratum. It is very difficult to give precise
definitions to these terms (at least in anything short of a full article).
More often, a construction or generative algorithm is given which results
in a precise quantity. Each of Lamb's strata has —emic units as upper
interface and —onic units as lower interface (cf. Figure 21). The
definition of the stratum label is best understood in terms of its
upper distinctive unit. A morpheme is generally accepted as a minimal
element, either a meaning unit or a grammatical unit. The following
is a list of definitions for a phoneme; each of them contains a small

grain of truth:

 

 

 

 

 

124
RANKS
_eme / .0”
I I INTERFACE
I
I MORPHEME I
l l
l I
I I
I I
MORPH— | I
STRATA l I
| l
' I
I I
' I
I MORPHON I
I / I
. INTERFACE
T I
I PHONEME I
I I
' l
I I
I I
I l
PHON— I I
STRATA | I
I
I I
I l
' I
I I
: PHONON l
l
I I INTERFACE

/

FIGURE 21 LEVEL AND RANKS OF A GENERATIVE PHONOLOGY

125

(1) "A bundle of distinctive features"
(2) "A class of phones in free variation or complementary

distribution'

(3) "A minimal term in phonological opposition".

We \vc311ilci prefer a definition analogous to that for the morpheme: "A
ph011c2u1e3 _is a minimal element of distinctive expression". We hope that

this; 113 as truthful as any other Short definition.

Each stratum, then, is simply a correSpondenee or set of inter-

PFelJEltxi()n rules for relating its upper and lower units. The purpose of

 

the Ciilcliotomization (although a huge one—shot scheme is often suggested),

just £155 in the discussion of the grammatical system and the phonological

SySt(3"l, is to allow independent analysis in each stratum. Because the

units; Cit the interface must be context—free sequences (at least with the

prescerlt: state of the art), the operation within each stratum is to

H
una ,
IE1VC1" the context dependencies. An example on the phon— stratum

w
0u1(1 1)(> helpful. Many dialects of American English do not articulate

the ‘ - .. . . .
1r11tial vowel in "before” precisely; it becomes /:v as in "bufl"

rath ‘ . . 11 .H . . . . . . .
(311 13han /1/ as in beef . The difference is distinctive for the

I)a111 " H I! H - . H H
lDIJff and beef but ev1dently not in before . At the upper

illtol‘ ~r .
[‘1<:C of the phon— level the coding would be the same for both

dial
QC La1 variations of "before” (i.e. /bifr/) but would distinguish

/q)of/ from "beef" /bif/.

The actual mechanics of this context unravelling involves two

tVDQR
()1 rules and the unit epoch (duration in time) of each stratum.

711110 1 ‘
(”llérth of the morpheme corresponds (approximately) to the syllable, and

the 1
C3I1E:th of the phoneme correSponds (even more approximately) to the

 

126

givipuienne or Arabic letter of printed language. In line with our defi~

Iiitxicui (3f the ~on unit as an objective description of events, the

Inorqoruoii is the same length as the phoneme and the phonon is Smaller than

the p hon eme .

The two types of rules, called realization and composition,

estak)l.j43h relations between these units (cf. Figure 22).

(l) Realization rules are the code for the —eme unit in terms

of the smaller -on units. Conditioning by neighboring

—on units is accounted for here (as in our example above).

(2)

Composition rules are the code for transforming the

—on unit of a higher level into the ~eme unit of the

 

lower level. Conditioning by virtue of belonging to

a unit of the higher level (i.e., stress on a morph—

length unit affects vowel phonemes) is accounted for

here. Alternation caused by linguistic constraints

is also accounted for here.

We C(11)

(wif(3:)
as /VV{1_j~

vowe]~

This
and t; he

morbhc) 11 s occurr

impo

I‘eturn to our example. Suppose we have the morpheme string ——

(Pt) —- to be encoded. The realization rules would code (wife)

1“/ and (Pi) as /S/, with the conditioning rules selecting the

i
EQII idc after /w/. The composition rules would change /wa f/ to

i
wa
/ \”/' E>ecause of the alternation caused by the plural (CL FigureZB).

C3><€1mple also shows another distinction between the morph- level

phon— level mentioned above. Notice that the alternation of

ed only within the syllable. This is the restriction
as
(3‘1 on this stratum. The alternation of the plural morphon /s/ to

(311use of the/v/ ending of the previous syllable is performed in

l>1 . . .. . .
1(311~ level, so the length of influence of each level’s rules is

127

k /
/

”e m e REALIZATION
RULES

 

COMPOSITION RULES

  
   

 

ALTE RNATION

CONDITIONING

 

 

*me O '2 LOWER STRATA

FIGURE 22 RELATIONSHIP OF UNITS WITHIN A STRATUM

 

 

128

 

<P<mhwuimm02 ZO
ZO_._.<O_.._n_n_< m_wI._. “.0 m4n22<Xm 02.4 mmst ZO_._._wOn:>_OU mm mmDOE

 

 

 

 

.395 5sz m eo cc: 9 36:23
e0 £033 pcm 895 2:3 :5: 3E:
mcconcmac 8032: l EmEcoc$cm

.mmamccerm 2930a *0 95

2E 3.53 8.8.3. EoEcoFEo >n 9:826:00

 
 

2E Fm):

 

 

25. a0 ”20:32: a seesaw ease: 3 2953mm

zo:<zmm5<

U.u._._.OZ

 

129

defiii<3<j. gas within syllables for the morph— and within clusters (vowel

or consonant) for the phon—.

Several features of this model which are particularly attractive

for 21111;cnnation can be summarized:

(1)

(2)

(3)

Relation of units —— time length and types —— stresses
the relation of a sequence to its members, to the
sequence of separate time epOchs, and to the combina—
tion of components (features) in one time epoch.
The operations within a stratum occur nearly independent
of the other strata. This is done by separating the
objectives and operation on units within each upper unit.
(Again the extent to which any natural language, eSpe—
cially English, can be so described, is not known at the
present time.)
The correspondence rules within a stratum are typed;
that is, an algorithm which would implement the rule
can be very Specific with respect to inputs, outputs,
and procedures. We have
(a) Realization rules —— encoding of sequences into

higher units
(b) Composition rules

1. Alternation —— alternative "spellings"

2. Conditioning -— rewrite, depending on

a. Neighboring units

b. Membership in higher unit.

 

 

III~(3 Recognition Phonology

 

The stratificational model discussed in the previous section
law rnzxrljy properties useful to a recognition system. The dichotomization
into .rieezix'ly independent strata with Specification of interaction by means
of scaxzcexral types of rules is extremely useful in Specifying the training
and iissee of an automated recognition system. The stratificational model is
mn13c>lr1363d to be a two—way model (ChomskySB), for recognition as well as

geneax‘21t:j_on, but we find that this is not entirely true. Three problems

arise :

(1) The lowest unit (closest to acoustical signals) of
a generative phonology (Lamb's or any other) is still
in terms of abstract quantized units that reflect
economical encoding rather than good correspondence
with features of the acoustical Signal.
(2) This ideal sequence is still ambiguous (in general)
unless the Specific rule used at each point in the
encoding is also known. In recognition Situations
we do not know the rules used until the correct
sequence is known.
(3) Formal language representations only Show redundant
features in a secondary or "tacked on" fashion for
the same reasons of economy mentioned in Item 1.
The more realistic situation that apparently operates
in human communication will be discussed below.
The higgr‘lbl redundant nature of the correspondence between acoustic features
and pexweeeived sounds suggests a slightly different approach than looking

130

 

 

 

 

 

for "primary and secondary features". Human speakers generally have
individllal language pronunciations, called ideolects; i.e. , one person
might find that a particular articulatory situation causes a noise burst

of a specific center frequency with no modification of the following

 

vowel , and his listener agrees that he "heard" a "b". Another speaker
finds that precise modification of the following vowel with no Specific
noise burst elicits the same response. One cannot say that there is a
Primary feature here; the listener's responses to both Speakers are equally
positive. We might call this property of the listener a dialectal generali—

zation- Each person may learn a particular set of features that must be

 

controlled precisely in order to communicate. The remainder of the features
(redUndant in Lamb's terms for this speaker) are not precisely controlled;

thus they may vary considerably with respect to many speakers. This variation

will occur above that caused by lack of speaker normalization (suggested

by Thomas ,19 Gerstmana). Perceptual experiments with repeated words corro—

borate this conclusion, and the work of RupertZO shows that this approach

15 needed for situations involving diverse speakers. In the light of this

disc - . . . . . .
usslon, we propose a model which aVOids the deficienCies mentioned

above _

To overcome the first inadequacy, the same arguments that lead
Lamb to a two—strata view of generative phonologies suggest a three—
strata mono} of recognition phonology. This addition may also be useful
in a generative phonology, as Lamb has suggested.

Note first that acoustical features can be considered as either
$0801th Or relative with respect to speakers; i.e., schemes can be

deViSe
Q Which measure stridency (to distinguish vowel—like and fricative—

132

liker) , (:liecked (stop—release) silence and local envelope maxima without
speal<eex¢ liormalization, whereas specific formant placement, duration, stress
mui 1111:()r1ation are very speaker dependent. We can then define a third stratum
with 111)}:eer units called acoustemes. An acousteme is a minimal distinctive
unit C:c>1ﬂresponding to a homogeneous (with respect to both relative and
absci1111:£3 acoustic features) epoch of the acoustic signal. Some implications

of this definition are:

(1) It is a specific definition not only with reSpect to
a particular language but also with respect to a par—
ticular Speaker and utterance; i.e., different utterances
of a given phrase, even from the same speaker, could give
different sequences of acoustemes.

(2) The distinctiveness property requires that only the
controlled features can be involved.

(3) The segmentation is performed with respect to controlled

feature changes and hence induces a useful criterion.

The 14111_ts; thus defined, while more accurately representing the specific
acous t:i(3a1 signal, also behave like units of higher strata and exhibit
many ()1? the same linguistic phenomena,

The four terms diagrammed in Figure 24 (diversification——A may
becomes 1’ or Q; neutralization-—B and A may become Q; zero realization——
C may n0t have a corresponding unit; empty realization——S may be filled
in) Gaul ‘bezused to describe the various ways in which a speaker actually
perfcxrnls the dynamic task of selecting which features are to be controlled
and wliiczli are to "float”. Examples of these are found in the work of

Ruperta 0 u n
on isolated words. Several of the phenomena occur in before .

 

 

brow

 

. Diviersification

Neutralization

Zero realization

. E mpty realization

FIGURE 24

 

 

133

T"? ________

ALTE RNATION PATTE RN

f i —emes

 

 

 

SEVERAL LINGUISTIC PHENOMENA DESCRIBED
BY ALTERNATION RULES

 

 

 

134
Diversification is seen by the different types of formant structure in

the diphthong on the end. Zero realization is almost always seen in

initial "13" with the lack or prerelease voicing. Empty realization is

exempl ified in the extra state or fill—in after the release of "b".

Neutral ization, which is evident on higher levels ("bitter" becomes

"bidder" ) , is an alternate explanation of the modification of the

diph tli on g .

Lack of knowledge of the Specific rules used to generate the

acoustical signal and the resulting ambiguity in decoding requires another

n10dification. This ambiguity primarily causes extreme difficulty in place—

ment of junctures (word, phrase, sentence) in the morpheme sequence. An

attempt at recovering this information can be made by attaching another

rank to the model. The third type of information primarily affects lower

Strata , such as stress and intonation patterns. This information has not

been included in any word recognition system known to date. It is well

known that these patterns delineate phrases and sentences. Other types

of ' ~ . . . . .
1n IOrmation occur in smaller units; hence this rank operates on different

Strata also

For the present, we will label this rank the hyper- —on,
Wh1011 :iridicates how the information is abstracted at each stratum. The
-Oni(’ ul‘lits are the most objective description of the events. The -emic
units Elrwe generalizations which show the distinctive events; the hyper—
-OniC: 111‘lits are derived from the —onic units and show events which affect
lowel‘ llrlits. For instance, stress is a feature of a whole morpheme but
affekytgg (generally) only the vowel phonemes. Figure 25 shows the
amnion L ed model .

(l) Hyper—morphon features include stress and intonation
patterns.

 

135

Ewhw>w

 

>OOJOZOId w>_._.<mmeOuOme no 4....502 mm meOI

4.1205

80mm“? 45.5303
\Ill' 205303 m we;

/

 

wEmPMDOUd. EDF<mFm
\ nkmDOUd‘
wkowuu/x ZOZOIm
ﬁll, ZOIamOS. mma>I /
23F<mkm
\OIQ uwZOIm
inll' ZOZOId Ewa>I
EDkdmhm
4<O_._.<_2_2<m0 wEwIamOE IamOS.

COI 58>: COI 0E?

m¥2<m

136

(2) Hyper—phonon features include juncture phonological
variation.
(3) Hyper—acouston features include Speaker identity and

speaker emotional state.

What this implies in a recognition system is a different type of
structure than that suggested by a generative approach, namely a directed

algorithm in which predominant features control the search for more fuzzy

 

features and decide which of them are actually needed. This can be accomplished
in the generative model simply by attaching a priority to the set of rules

so that we can dynamically select the proper rules to be applied. The

 

priority would be a function of the high reliability features (in a degree

of presence) filled in by long-term statistical expected events.

Figure 25 shows the augmented generative phonology model which
may have the two—way property necessary for recognition. At the least,
this figure points out the major problem in recognition of natural language
type vocabularies, the number of ”feedback” paths. That is, if one starts
with the acoustical signal and tries to proceed upward through the model,
every step is affected by a higher level. Decisions made on the first
stratum will cause a certain interpretation of higher strata which will
feed back a conditioning of the lower stratum causing a different decision,

and so on.

We will propose a system based on this model without the feedback

and discuss it in the next section. It will look roughly like Figure 2&5.

 

mQZmJOmm
msmxemos

137

¥0<mowwu FDOIE>> Emkm>m ZO_._._ZOOOmm

 

All.

 

23F<mkw
ImmOS.

 

 

 

 

+

 

23F<mhw
ZOIn.

 

23F<mkm

 

 

 

a

 

IFmDOO<

 

 

 

+

mm meOE

E395 or: coEmanOowO .N

cotombxm .mEmEmmm .33 6:96.". .P

cocﬂcmmmamm
o<

 

 

 

 

352 mEC. DEW
coauofmmﬂo

 

 

 

 

mmmDF<wu
ZOZOIa mwm>I

 

 

 

 

 

20:; .400me
04.

 

 

 

mmeP<wm
ZOIamO—z Email,

 

H...uZ:.$OIW

lull FDmZ.
Iowwmw

 

 

 

 

IV RECOGNITION STRUCTURES FOR REAL-TIME SPEECH PROCESSING

Dynamic recognition of connected Speech is much more difficult than
most pattern—recognition problems. In the first place, the complex
acoustical signal requires some modification or filtering to accentuate
its significant characteristics. Secondly, the unknown nature of the
precise generating model at each instant poses an identification problem.
Thirdly, the dynamic changing nature of the information content of the
acoustical signal requires a sequential decision structure. In this
chapter, we will attempt to define an adequate structure for real-time
recognition of the acoustical Speech signal according to the models

developed in Chapter II and the criteria developed in Chapter III.

IV A. Reduction of Dimensionality Using Bayes' Formulation
Speech—recognition algorithms that permit real—time computation
require low—dimensional representations (input pattern vectors). We will
use modern communication techniques to show how the representation/
recognition schemes develOped here can reduce the normal dimensionality
of the input acoustical signals. The results of Section II-A indicate
that we may represent each single formant present in the acoustical signal
by a two—dimensional state vector with an associated differential equation
(Eqn. II—A—l3). Further, the results of Section II—D Show that our
segmentation procedure permits a time partition of the acoustical signal
into epochs, each of which can then be classified in sequential fashion.
Lainiotis59 considered signal detection and recognition in a
recent paper and attempted to determine a "natural" dimensionality. His

problem formulation, however, does not account for the complicated

138

 

 

139

relationships between features described in Section I-B. Thus, we have
to restrict the conceptual model of information—bearing features to fit
into this signal—detection model. First we must assume that one formant
contains the significant linguistic information at any one time. This
idea is proposed in the literature by Thomas19 and implied in the theo—
retical linguistic work of Fant, Jakobsen, and Halle,11 describing their
distinctive feature matrices. Thus, one regards other formants present
in the acoustical signal as correlated noise and therefore only a distur—
bing character for the true signal (dominant formant). In addition, we
must restrict the type of driving function allowed in the state variable
differential equations. Fricatives and nasals can be adequately modeled
with white excitation processes, but vowel—driving functions (pitch
pulses) are not easily accounted for. For this particular formulation,
one can consider only whiSpered vowels. Yilmaz6 and others have followed
this approach in their studies, since there is some degree of intelli—
gibility in whiSpered vowels.

We must emphasize at this point that these restrictions,
although discussed in the literature, are made merely to permit the
mathematical formulation of sequential detection theory. However, we can
consider the implications of the theoretical results based on this
restricted Speech model and use these results to make further inferences
about the more general speech signal case. Suppose we are given the

following noisy observations:

y(t, 520) = H(t, g0) {(t, EU) + v(t) o e I‘

140

of the stochastic signal (one of the ensemble of possible Speech "sounds")

represented by a state vector x(t, go), where

H(t,80) is the transmission matrix

v(t) is a white guassian noise process, independent of x,

with zero mean and unit variance

is a parameter vector specifying the differential equations

“‘0'
generating x (Eqn. II—A—13)
I; is the index set for 0. Each 0 refers to a single
discrete value of the parameter vector 90 (i.e. the
range of each component must be discrete)
k
d 'ls'tl' t'st- t<t <t <...<t <t.
an a set of Signa w1 Cling 1me { 1}i=1 1 2 3 k-l k

The natural dimensionality, or structure, is then determined by the para—
meter vector 20' In addition to having different values for each component
for different 0, 90 may have fewer or more components or may reference
different noise structures (one or two correlated formants).

Lainiotis shows that the pattern—recognition/detection problem
of determining the presence of one of M signals from these noisy obser—
vations when the signals are generated by differential equations of
unknown functional form has a Bayes minimum—risk solution of the following
form:

(1) A bank of nonlinear Kalman—Bucy filters is derived

based on Elegy possible form of the differential
equations, that is one for each 0 6 I8. The output

of each K—B filter is a conditional mean; i.e.

 

 

141

“’(t/vt 'Vk-1 Ua) is the ex e ted value of the state
330 k? a ) 0 p0

vector for each possible hypothesis Ha (signal ya(t)
is present) a = 1, ..., m, conditioned on the present
and all

z (t t) t S t

signaling interval, Vk k—l’ k’
. . . k-l .
past Signaling intervals, Va 2 (uh/Ha was active)

a
and the value of the parameter vector 20.

(2) The expected value of the signal process is derived
from the conditional means from each K—B filter by a

mixture probability formula

 

rv t k-l 28 a t k-l
= 6 v
yo,(t|vk, Va ) P(—o| k, V0, )

06
IO

9 ) (IV—A—l)

k-l O!
O! ’ -U

0 ea N t
H(t,_O) xa(t|vk, v

k-l

a
a , 20') is the conditional mean of
1

where E (tlvt V
—a k’
thezlposteriori distribution computed by means on a

. * . . . 60 el

nonlinear time—varying filter (Kushner, Kalinapur )

d
H(t,§0) is a matrix which converts the state vector

«1 rv a t k—l
5 into the observed signal y and P(901\)k,VOZ ) is the

learneda posteriori probability of the parameter vector
a
value 90 conditioned on the present signal interval and

all past signal intervals when HCY was active.

“x

*
Lainiotis specifies only linear differential equations, but the extension

t0 the nonlinear case is obvious.

142

(3) The likelihood ratio for each hypothesis, Ha’ is then
computed using a correlator/estimator formula (Kailathsg)

with the input signal and the conditional mean'

k—l

a ), and a standard Bayes criterion is used

3:13!“le V
for the decision.

Even for the restricted speech model, and assuming the nonlinear
estimation filters can be implemented, this is still an inadequate
solution. The primary difficulty is the mixture formula (Eqn. IV—A—l).
Although this formula is optimum in the Bayes sense for randomly selected

3' *
generating models (Wainstein and Zubakovfa), one would not expect the

 

convergeda posteriori probabilities of each model parameter vector

value 2:1 to be either 1 or O; i.e. the formula (Eqn. IV-A—l) reduces to

selection of one filter. Then the conditional mean, y& , will be a

sum of an output from the "correct" filter (given the particular signal

and noise conditions) and others that are based on noise or other

unwanted signals. For high signal—to-noise ratios (ratio of inner to

outer distances for a pattern-recognition case), with apprOpriate models,

l-O probabilities may be learned; in Speech, however, this is very

difficult, because of the large class of "signal—like" noise processes.
Implementation of the optimum solution involves many difficulties.

Some of these problems are discussed below:

 

*
The comparison for a parameter measurement/detection problem is between a
Inixture formula and choice of the maximum probability (likelihood ratio).
For large m (number of filters),a factor of two minus signal—to—noise ratio

is required to maintain the same probability of error.

143

1. Implementation of Filter Bank
The nonlinear estimation filters are physically unrealizable

(they require an infinite number of components). Sub-

optimum techniques are available (Byrd6$), but they sacrifice the ability
of the optimum filters to match the transient response in order to have
appropriate asymptotic behavior. As was pointed out in Chapter II, con—
nected Speech signals from divers speakers contain critical dynamic por—
tions requiring good transient response in the preprocessing stages. Thus,
the heuristic criteria develOped in Chapter II are more appropriate.

The optimum formulation requires a model of the desired
signal plus unwanted (correlated) signals and noise for each filter. For
the suboptimum case, however, this is not always desirable, eSpecially
if one is not sure of the exact structure of the undesirable signals.
Groneﬁshas shown that under certain conditions the performance of linear
threshold elements with adjustable weights can be decreased by increasing
the number of inputs. In attempting to classify isolated words from one
speaker using the zero-crossing counts and energy levels from eight
bandpas filters, performance measures increased as a function of the
number of inputs but then began to decrease. The posited explanations
were:

(1) Assumptions about the pattern statistics were
not correct.

(2) The number of sample patterns was insufficient.

(3) Incorrect structure or training algorithms were
used.

(4) Training was not allowed to continue to convergence.

144

Determination of the proper number of inputs to give a maximum
performance is difficult to accomplish except in special cases. One
example from Groner may help.

When one is using Euclidean distance differences to classify

pattern vectors, an additional measurement degrades performance when

 

n n n
2n+l
2; > z ..
Wi + 4o n2 . Q13
1:]. 1:1 3:1
where [wij] i,j = l, .... n is the correlation matrix of
the existing n measurements
wb is the variance of the new
new measurement
qt i = l, ..., n is the correlation of the new

measurement with each old

measurement.

Hence the performance is degraded by the addition of a new
measurement which is correlated with the others and which adds noise

(proportioned to W6) to the recognition process.

*
2. Independence of Filter Bank Outputs
The mixing formula (IV—A—l) is based on the assumption of

randomly selected generating models and optimum least-mean—squared-

 

*
Probabilities computed on two input sets X3 and X2 are independent if

p(xk,x£) = P(xk)P(x£).

 

 

 

145

error estimation filters. The highly structured and situation—dependent
interrelationships of acoustical features make the former assumption
very suSpect. Further, the choice of suboptimum filters again indicates
a set of dependent probabilities. In order to achieve the superior
performance of a mixture formula, Wainstein and Zubakov apply the central
limit theorem for a sum of independent random variables. Thus it would
be beneficial to the performance if the mixture probabilities were
independent. A second observation about such sums is pertinent here.
The study of robust estimators shows that convergence to a stable value
is quicker for arbitrarily distributed random variables if "outliers"
(events significantly removed from the mean) are not included. In this
context, probabilities assigned to certain filter outputs can be "outliers"
due to reasons cited in the discussion of the implementation of the
filter bank. The long training period that may be required even for
optimum classifiers (which is lengthened due to outliers) is especially
detrimental in the Speech situation. The plastic structure must be
responsive to "drifts" and slow changes in the input's salient features.
Thus, for the suboptimum filter bank specified in Chapter II
and Appendix B, we need to investigate recognition structures which form
near independent probability estimates and mixture formulae which reduce
the undesirable effects of outliers. We will Show in Section B how
the Lewis‘B—Brown¢b probability approximation technique attempts to compute
independent probabilities and in Section C how the S—RETIC algorithm

of Kilmer+1 operates to eliminate outliers.

IV B. Quasi—Independent Probability Distributions

The discussion in the last section indicated that the set of
azxxneriori probabilities computed on the outputs of the preprocessing
filters should be independent in order to increase recognition performance.
This will be difficult to achieve because of the overlap of the input

sets——one for each probability computer--and also because of the correlated

nature of the inputs.

«2 4a
The Lewis-Brown ’

iterative technique can be used to reduce
the dependence between the probabilities. The notation follows that

of Section I-E. Suppose we have (for a given class CL) 3 set of m low-

order distributions {Pk}:_1 such that
Pk(x) 4 O k = l, ..., m for all x
ka(x) dX r: 1 k :1, ..., m
x€Xk

where Xk* is the set of n inputs for the kth probability computer. Then,
if we consider the entire m x n dimensional pattern vector for a given
class and hypothesize a "true" distribution, each low—order probability
distribution, Pk(Xk), satisfies a marginal property; integration of the
"true" probability distribution over all components not contained in

k k
X equals Pk(X ). Brown gives an iterative procedure for determining,

 

For Speech, one input, xi, might be one component of a four-component
state vector representing the output of one filter of an m-filter

k
(overlapping) bank. X , then, is the state vector (n=4) for each filter.

146

 

 

 

147
among all products of low—order distributions that satisfy the marginal
property, the one that minimizes an information measure of the close—
neSs to the "true" probability distribution. Brown defines an
iterative procedure as follows: Given an initial (a priori) m x n
. . . o . .th . . .
distribution P (x), define the j iteration, j = l, 2, 3, ...

probability distribution, pJ

{P }m

k k2].

, from the Set of low-order distributions

33(X)

gé'1(x) Pk(x) / Pi’l(x)] (Iv—B-l)

th th
That is, multiply the (j-l) probability distribution by the k low—order

distribution, where k j modulo m, and divide by the marginal distribution.

._ ._1
P? 1(x) = {‘ P‘J (x)dx (IV—B—Z)
< .

k
x e X

Brown shows that the distribution PJ

does satisfy the marginal requirement
for all j and does converge to a limiting distribution with the minimum
information preperty.

At first it appears that Pj will contain low—order distributions

raised to a power but if we rewrite the marginal distribution (B—2) we can

see that this is not so.

PJ_1(x) _ Pk(x) gj—l(x)
gﬂ‘2(x) ° k

where the g's will be defined. Substitution of this into Eqn. (IV—B—l)

gives (after m iterations)

148

Ill

P300 = r1 2km / aid) 3 = m, n+1, (IV-B-3)
k=1
where

gj(X) = gj-l (X) j # k modulo m

k k

. m .
gJ(X) = f U P (X) / gJ(X) dx j = k modulo m
k " k t=i I; it

Note that gi(x) is a function of x e Xk and hence the gi functions tend to
make {Pk}::1 a set of independent probability distributions so that a product
rule for recombination applies. The computation of gi requires, for a given
module, an integration over the set of measurements not contained in the input
to that module. The gi functions have the same limiting prOperty as discussed

in Brown. To see this, define the limiting probability distribution

Pr(x) = Lim Pj(x) (IV—B-4)

j—aCD
r
and recall from Brown that g has the following marginal properties

[Ilr(x)dx : Pk(x) k = l, ..., m (IV—B—5)

k
x €3X

r
Substituting for E from Eqn.(IV—B-3) and Eqn. (IV-B—4) (with proper

assumptions to give interchange of limits, integrals, and products)

149

I Pr(x)dx : Lim j‘ Pj(x)dx

: him gk (yj: Pk(x) k = l, ., m
and
'—1
g: (X)
Lim j = l (IV—B—G)
if” gk(X)

Thus, it is necessary only to compute the iterative definitions

J

1. An example may clarify the role that the g; functions play. Let
<

of g

P (x x
o

3 .
2, x3) be an a priori distribution over X(:R . Let the two-dimenSional

1 )
(n22) lower-order distributions (m—Z) be (where P with no indices denotes

the marginal distribution of the indicated arguments)

l
Pl(x) : P(xl,x2) X = (x1, x2)
P (x) : P(x x ) X2 : (x x )
z 2’ 3 2’ 3
Then
J .
gl(x) 2 1 J : 1) 3) o.-
s‘im) = P(x2) j = 2, 4,
and
Pr : Pj : P(X X )P (X /X ) J :: 2 3
1’ 2 3 2 ’ ’

Note that any a priori distribution is allowed and does not affect the

final result. The effect of the gi functions in this simple example is to

change the marginal distribution into a conditional distribution. Chow's45

 

 

150

approximate scheme for learning conditional dependencies is analogous
(allowing conditioning on one variable only) but the use of the gi functions
allows one to determine the structure of the problem for any set of
lower—order distributions (possibly in a theoretical sense only, as the
computations may become unwieldy eSpecially if the lower—order distributions
change).

In summary, we have shown how an iterative procedure for approxie
mating probability distributions is a mathematical model for learning
conditional dependencies such as those found between Kilmer'sqq‘STC—RETIC
modules. The reduced formulae develOped here require only integration
and multiplication and no powers (as in the original scheme). These
iterative formulae develop only the conditional dependencies and do not
depend on measurements that are independent. That is, if Xk and Xq
are nonoverlapping,independent input sets, then the integration set to compute
gi need not include Xq. The resulting approximation formula is a product

which implies independent measurement sets. Thus, they form an appropriate

set of mixture probabilities discussed in the last section.

 

 

IV C. Specification of First—Level Decision Structure

 

At this point, we have Specified a set of m state vector represen~
tations of the input signal, each state vector having dimension n, and a set
of a posteriori distributions for the probability of each state vector,
given one classification C&, t : l, ..., r. We wish to decide, on the
basis of this information (and possibly other information which needs to be
Specified), the appropriate subset of state vectors that best repreSents
the pertinent features in the input signal. We can write a general
formula to compute r numbers to decide between the different classifications
(hypotheses), including the Bayesian approach developed in the last

two sections.

5 : zlfk(pkt) t : 1, ..., r (IV-C-l)

where fk is a monotonic, nondecreasing, continuous function and P

kt

ispk(C£/Xk), the a posteriori probability of class C iven the input

t g
k
set X . This formula includes a large number of likelihood functions.

We will discuss these different formulations and relate them to the

specific problem of Speech recognition.

The usual (lmncmrt)flnonotonic functitnns<3f the probabilities
is the natural logarithm, which converts a product of independent
probability distributions, as discussed in the previous section, into a
summation. Since the function is monotonically nondecreasing, a decision
test based on the probabilities alone will have similar results for a

function of those probabilities. Another function of this type is discussed

151

152

in Kilmer.+1 There, the purpose was to emphasize probabilities that weredif—

ferent from a. uniform value, l/r. Thus, if a given P& was significantly

k
greater than or less than %, the f function would tend to emphasize this
particular probability.

The formulation (Eqn. IV—C-l) also allows several types of cost
factors to be included in the decision quantities. Various cost factors
are discussed in the literature. One of the most pertinent to the study
is an information measure that is related to the amount of information in
the a posteriori distribution p(Cﬂ), given xk, t = 1, ..., r. Here the
implication is that a module input should be considered very strongly in
the decision if there is a significant peaked distribution among the var-
ious categories. Another possible interpretation of a cost function of
the input Xk, especially for suboptimal systems operating in noisy environ-
ments, is a quality measure which could be determined in two ways: first,
in terms of the distance from the cluster centers of the input variable.
This would indicate whether the input were quite far from the majority of
inputs seen previously with respect to the given set of learned categories
(where we are not concerned with unknown or new input classes). This type
of cost factor would indicate that low values of a posteriori probability
have less influence, especially in the case of an insufficient or small
number of training patterns. This is so because the majority of known
probability distribution estimation techniques give much worse estimates
of the tails of a distribution (events with low probability of recurrence)
than they do of more densely populated modes. Another type of quality
measure based on the physical characteristics or measurements (low signal—

to—noise ratio of input, extremely high background interference, etc.)

 

153

would lessen the effect of noisy inputs. These types of cost factors can

easily be incorporated into the formulation (Eqn. IV-C~1).

The third thing that may be included in this formulation is
prior distributions. Lainiotis only used the prior distributions as
thresholds for comparison of likelihood ratios that he generated. In
Speech it is well known that successive Speech segments are highly dependent
(redundancy of about 33 percent); hence there is much information in
the probability of a given segment, given the last decision or classification
of the preceding segment. Thus, we must augment Bayes' formula that was

stated in Section I—E to include this conditional probability.

k k—l f k ] [ k—l ‘
P(C&/X ,x )—LP(X A), / (p(c,/c,_l)p(c&_l/x )j (IV—C-2)

This Should be incorporated in such a way that when the a posteriori
probabilities computed on the present input do not contain sufficient
inforuudjinl to give zliwiliable estinm1h3 of the [nngNit category, (in?
conditional distributions should be used. Even with the different
interpretation given to cost factors and prior distributions it is
possible to formulate a recognition problem for a restrictive speech
signal within the Bayesian framework, as discussed in the previous two
sections. However, there are several events,eSpecially for subOptimal
systems Operating in noisy environments, that will have a probability
), t = 1, ..., r; k = 1,

matrix P = .., m, which does not give

I)
( tk
acceptable decisions using Eqn. IV-C—l. This can be due to conflict

between modules having high probabilities for one class and other

modules having high probabilities for another class. This situation involving

154

"outliers," as discussed pnadoqum can occur because of: (l) inappropriate
assumptions; (2) presence of noise in the input that is very much signal—
like (white noise that looks like a fricative, sinusoidal inputs that

look like vowels or nasals, etc.), or (3) a dichotomization of inputs;

that is one module may have the same input for two completely different
classes of the input Signal. An example of this would be during a nasal,
when a high—frequency—bandpass filter might have a strong formant that

looks vowe1~like, whereas the presence of no signal in other filters

and low energy of pitch frequency component would indicate that this

signal interval is not a vowel. This type of correlation, of course,

should be incorporated by the Lewis~Brown+2’+8

approximation technique, but
it may not be a sufficient mechanism.

Wainstein and Zubakoéxhave used the central limit theorem with
reSpect to likelihood ratio formulae, such as Eqn. IV—C—l for the fol—
lowing reasons: given that the individual terms (probabilities, like—
lihood ratios) are independent events and given certain restrictions
on the tails of the distribution of these events (that they are well
behaved and go to zero sufficiently fast as the value of the event goes
to infinity), the distribution of the sum tends toward the normal
distribution. As is well known, this is an asymptotic prOperty, but it
illustrates two things: (1) convergence to a definite value, and (2) this
value is not a local minimum, since the asymptotic distribution is uni—
modal (these theorems have been proven for a larger class of distributions

than the normal, but with similar convergence and unimodal properties).

Thus one can expect a stable rule for finding a maximum value with a
guaranteed convergence property. The problem that occurs with low proba—

bility outliers is that it will take a large number of terms in the summation

 

155

to counteract its effects. In our situation, where there is such a mix-
ture of probability distributions and a large number of possibilities for
generating such outliers, one cannot sit back and hope that they will only
occur with a small probability. Several theorems have been proven (Hertz7o)
where the tail behavior of the even distributions has been relaxed by
eliminating outliers and still obtaining the central limit theorem. This

of course is intuitively the correct thing to do in order to maintain the

convergence and unimodel properties.

The discussion of causes for outliers' occurrence leads one

to consider two approaches: One is to use a Bayes decision formulation

 

but compute a larger dimension probability distribution, possibly over
the entire m X n dimensional space. The discussion of the first chapter
has indicated empirical objections to this approach. With respect to
the discussion in the first two sections of this chapter, the module
concept can be justified by stating that the filtering representation
scheme presented in Chapter Two is better matched to the natural dimen-
sionality of each feature and thus is better able to eliminate unwanted
signals and noise and thus to isolate individual features. Second, as
pointed out by Groner, too many inputs to a suboptimal design Bayes
decision network very often add noise and thus degrade the overall
classification performance. Thus, it would seem very natural that the
first level of decision logic would be to extract the features as separate
entities and then, on the basis of this extraction, look for the inter-

relationships and more detailed properties of the features.

156

The other possible decision structure is to allow interconnections
between the modules that allow lateral passage of gross information. Kilmer“r1
has considered this problem and has related the S—RETIC modal computations
to nonlinear summative schemes such as Eqn. IV—B—l. He Shows that, based

on three symmetries that are assumed for such Systems, "S—RETIC computes

a mode [detection/classification] function, F, that no S-RETIC net without
a and 6 [lateral] connections but with nonlinear summative output scheme
could compute even though it is allowed more equipment." The three

symmetries that are assumed for these systems are as follows (note that

the first two are typical for Bayesian schemes of the type discussed):

 

(1) We must be able to compute the same classification
decision regardless of which module has the proper
information. This is especially necessary in
slurecl), as; is (iviihsnt. in lﬁlgUIW: 4, Slluﬁv [Jle tianki
module will not always have the appropriate classi—
fication information, eSpecially when different
Speakers are eXpected to be using the system.
Further, Figure 17 indicates how different processing
schemes will isolate the pertinent information,
dependent on the surrounding feature environment.

(2) The evaluation scheme must be the same for any
classification decision (the computation of 8% is

independent of t, t = l, ..., r).

 

 

157

(3) Strength—of—effect symmetry. Given prior distri—
butions, an average (summative) decision across the
net and conflicting decisions, any two can overcome
the third or any one can overcome the other two, whether
the other two are in favor of the same classification
or conflicting classifications. This symmetry states
that the decision rule must give equal weight and
operate in an equal fashion in judging the effects
of these three possible situations that can occur

simultaneously.

 

[The last symmetry requires the lateral communication, since
Bayesian schemes have the first two symmetries (any of the formulae from
Wainstein and Zubakov discussed previously) but when faced with the type
of situation depicted in (3) will not operate in a consistent or appropriate
manner.) To paraphrase Kilmer's statement, in light of the outlier
situation, it is seen that the lateral communication is necessary to
decide among the probability matrix and the prior distribution matrix,
which modules should work in conjunction and be averaged together to
determine the output and which should be considered outliers and eliminated.
The S—RETIC algorithm is iterative and thus the intuitive arguments we
are presenting here are intended for understanding rather than analysis,
but the importance of Kilmer's statement about decision algorithms of this
nature is that the structure must be implemented in this way or else the

performance will suffer.

158

In the next section, we will specify a first—level recognition
system that operates according to these principles and incorporates the
S—RETIC type of decision logic. We will see how the state variable
representation presented here can be incorporated in a dynamic real-time

asynchronous decision network.

 

 

 

 

 

IV D. Proposed First-Level Recognition Block Diagram

The purpose of this study is to Specify a mathematical model and
a system block diagram that are tailored to the acoustical speech signal,
rather than the converse. The deficiencies of state-of—the—art solutions
derived by means of a Bayes minimum—risk criterion have been discussed:
It is necessary to use a restricted speech model; it is very difficult to
implement the nonlinear estimation filters that are required; a high
dimensionality is required because of the complicated interrelationships
of the speech signals. Even the optimal filters' outputs will be dependent,
in a probabilistic sense; the mixture probability formula allows the
possibility of adding together nonsimilar waveforms, based solely on the
learned probability of presence. Adding to these difficulties those that
have been discussed for suboptimal solutions, which can give rise to
outlier probabilities for particular classes, one is left with a very
negative picture. There are several other requirements of a recognition
system that are difficult to include in a Bayes formulation, which will
be mentioned here to help Specify the recognition system:

(1) Significance of the marked change——The segmentation

marks that are derived from the inherent signal

characteristics must be monitored with reSpect to

past occurrences of the Speech signal to determine
whether the marked change is due to noise (parity
error...), another energy peak entering the filter
bandwidth, the actual start of a new feature, a change
from one feature to another, or the finish of a

feature.

159

 

 

160

(2) Correlation with overall system behavior——Each
module decision must be compared with all other modules
to determine if this is a new feature, whether an
energy peak has moved from one filter to another,
or whether (one of) the predominant feature(s)
has finished.
(3) Precisely controlled features——The main criterion
for classifying a pertinent feature is whether it
is repeatable for different Speakers and contexts,

whether it is a transition of prescribed form, and

 

whether the terminal state of the transition is
predictable before the end in case the segment is
terminated.

The Bayesian formulation of course has a different philosophy
toward marked changes, in that they are assumed to be a true segment and
the mixture probability formula is used to decide on the actual significance.
Since in a speech recognition system these marked changes also have
linguistic meaning in higher levels (determining the consonant/vowel
relatiaiships, directing higher—level analysis, ...), there must also be
a decision on their validity. As is well known, the overall system
behavior must not be degraded in allowing individual modules to make
classification decisions. It has been demonstrated that the S-RETIC will
work in a correlated fashion as a total system rather than m individual
systems, each screaming for its own way (as in pandemonium machines).

This is a very serious requirement which will not necessarily be satisfied

by using a simple mixture formula. The particular method of training the

161

classification network is well Specified in our Section I—C and the work
of Rupert“). We can see that the requirements of precisely controlled
features are very pertinent to determining a consistent system performance.
The work of Houde-iv7 also indicates that recognizing transients can be
performed because of the consistent and precise form of articular transitions.
Since it appears from the preprocessing pictures that these transitions
also exist in the acoustical waveform, we can see that this is a desirable
and necessary requirement for efficient recognition.

As was mentioned previously in the discussion of the filter bank,
the fixed-frequency filters that were used are not tailored to actual

Speech characteristics, especially during frequency—transition epochs.

 

As can be seen from the three requirements stated above for the recognition
system, there might also be difficulties for specific systems that have

set filter bandwidths, in that energy peaks can move across filter boun—
daries. Depending on the skirt reSponse of the filter it may be very
difficult, for a particular filter, to distinguish an energy peak which
moves into a filter from one that simply begins in that filter. For this
reason and to avoid a very complicated classification system which must
make these additional decisions, we Should make use of the time—varying
tracking filters discussed in Section II—D. We will outline a procedure
for their use in conjunction with the classification system. First we
make the assumption that at any one given instant of time the filter bank
is constructed such that there is at least one filter that isolates the
pertinent feature information (here the use of the word "filter" indicates
the derivative calculations as well as the actual bandpass filter Operation,

since the combination of filtering and differentiation is sometimes

162

required to isolate the desired feature). Given that assumption, we can
then Specify a tracking filter, shown in Figure 12, in the following way.
At a marked change, we make a classification of the overall system input
(i.e., each module decision is calculated and then an overall global
decision is arrived at from these local decisions). Then, based on this
decision, selected filters which have the pertinent features are activated
to start tracking. The estimates of frequency and bandwidth are used, as
indicated in Figure 12, to modify the input signal further to emphasize
the particular pertinent features. Thus, other formants entering this

particular filter will not affect the tracking filter output. Also,

 

it will be possible to allow the tracking filter to operate across the
filter bank boundaries. The combination of this tracking filter with the
fixed—frequency filter bank will then lock on certain features and follow
them throughout their duration, emphasizing the chracteristics which may
be needed for higher—level classification.

These requirements allow us to Specify a recognition and pre—
processing structure which matches the nature of the Speech signal and
allows higher—level linguistic classification. This structure is shown
in Figure 27. The wideband Speech Signal is processed by the overlapping
filter bank. Each filter output is operated on by a measurement device
similar to that described in Section II-E. The inherent Signal changes
are detected to give derivative segmentation indicators. The measurement
outputs from Al,1 go to A21, which is the acousteme class selection. Here
the stored precisely controlled feature information is compared to the
input and local class decisions are made. Based on these local class
decisions, the outputs of A21 correSponding to degree of presence (DOP)

vectors shown in Figure 6 are compared with the derivative segmentation

 

 

 

 

163

 

Emkw>w 20:200me IOmmam 92:.

..(mm mDOZOmIOZ>m< “.0 4w>wa_ meI m0“. 2<m0<5 v.004m

 

 

m wUZwDme

O¢.U( #234350»

 

 

_ G

 

 

 

 

 

 

 

T

30¢):
05.

 

 

 

 

 

 

 

  

 

 

Tl

 

19.2%
20.5on
SOCZELS

 

 

..Ouz.
OE.

 

mm mmDOE

 

 

 

 

 

 

 

 

 

 

 

 

u

 

 

 

 

 

 

 

J _r

_ 8“»meUM r

i wM4xn

c w\.uvmlnw¢. I ~

_ Ill

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I no hzaoooa‘ 35.» II
90 O
01 53.1.3? wzor. Z4tom300 .
.833 35.554
L
(5&5
.20.: u
o» :5!
> tocazﬂzaoz.
Ouz.
ilzwxuwméwanm
‘llllllll
04
m-)i¢dOU to
205350 u<

mom 1o?!

:0 -..:«our

..qrrwiowm

mead
0 Te
.mowcwaOum . _ genome _ zozbwﬁm .
tuna»... u_!«t»c . ._ .24 «gran
$023me u< oifm . oz<
0230:5on 2330'. i 21.:«004
10.9qu

 

iKOn mama.»
Z. WZOCUwCB CI .91
mZOC<>3aIOU Owﬂ_JOwu

  

 

 

 

 

 

 

 

 

 

ZOL<3¢OKZ.
Q. 024
ZO:<U.ZD§OU
4u44(¢(&
1%.): km?
Z>>OD ml¢Iw

:35.)
u}:
.2504

 

 

9 A
OZ<m mop;
‘llll
. .Ewhni
JJFZwiﬁcmm dwaDm .
,flll'
N «waJ:
.‘llll
' Kwkfu
A_.
.
_ UN
20:123..wa .34 l
L w)..—(>”Ew0 _ r4 Ku»4_u
_

  

 

 

 

 

\.

 

 

 

 

 

 

 

 

 

 

 

ZO_kq»Zw}Own

 

 

 

 

 

 

 

 

w).»<>..awo

 

 

1
Fagin
. .d
a
\)
/.\ .
Jo_»uw-wm
ﬁduw F
mswrwaooa ..J F
lillllllL

 

 

 

il<m awkod
DZ.§4K~)O

 

 

 

 

 

 

 

 

 

— (whizu

«...: m

Ulqwma C

164

information to give a local time marker. This local time marker incor—
porates the predictive segmentation on the state vectors, which correspond
to physical clusters, and also probability changes in the DOP vectors,
corresponding to linguistic clusters. The two types of time markers are
integrated to give a local time marker for the ith module. Based on the
table of DOP vectors and the local time marker information, a Shakedown

net with parallel communication and initial condition information deter-
mines an overall global decision and also decides which modules contain the
pertinent acoustical features. This decision network also computes a global

time marker, which is a segmentation point in the input acoustical signal

 

and also an acoustical class decision. Based on this information, a
library is searched for the required computations to determine further
classification. The decision and pointer information is also used with
an attention—focusing device (called the attention selection switch) which
selects pertinent filters and performs the tracking operation on these
filters, based on the bandwidth and frequency information . The lower
box, which is discussed in Chapter III, computes suprasegmental features
such as pitch contours and stress placement, based on the filter outputs,
the wideband Signal, global time markers and ac decisions. The supra—
segmental information is used in higher-level decisions Shown here in
the terminology of Rupert's Relative Oppositions, to give a first—level
output which is an acousteme class label and a list of RO'S to completely
determine the first~level output and input to the next level.

The block diagram gives an integrated philosophy of speech recog—
nition, rather than a particular implementaton. The requirements that were
Specified throughout this report have led to this type of structure. The

particular names and implementation, although specific, indicate the

165

necessary sort of structure that is required. We have tried to indicate
from as much experimental evidence as is available the feasibility and
practicality of this particular approach. However, nothing Short of a

full—scale implementation and testing will actually prove its worth.

 

 

 

 

V CONCLUSIONS AND RECOMMENDATIONS FOR FURTHER STUDY

This study has Specified a segmentation procedure which also is the
first level of analysis of a total system intended to give a commanded
reSponse based on recognition of thespeech input. Section I—A contains
a summary of the arguments presented which lead to this Specification.
The deficiencies of current segmentation procedures in trying to mark
Signal epochs in the Speech acoustical signal exist because they are not
adequately tailored to the complex speech acoustical Signal. The

complicated interrelationships of pertinent linguistic features in the

acoustical signal necessitates a more sophisticated procedure; that is,

 

one which first isolates individual, primary features and then further
processes them to determine secondary features, defined as perturbations
of the primary features. The general purpose of this study is to specify
a first level recognition system which isolates individual formants, then,
guided by a syntactic and semantic structure, computes the necessary
measurements. The purpose of this thesis is to describe a segmentation
procedure within this general specification which not only specifies
basic units for recognition but also gives an adequate description of

the complicated speech signal. Further, the segmentation procedure that
identifies lower units will direct the higher levels of decoding so that

the search space is kept within practical bounds.

By formant, we mean the resulting time waveform [or one cavity of
the vocal tract excited by glottal pulses, or frication noise. Thus,
vowels and continuants may have three predominant formants; some frica-
tives have one formant, and silent portions have none. Each formant is

166

167

assumed to be generated by a second order time-varying differential

equation driven by either a pitch pulse or white noise source.

First

order differential equations are derived for four state variables.

terms of these state variables, a procedure for isolation and acdentuation

of the formants present in the acoustical signal at any one time is given

which:

(1)

(2)

(3)

(4)

(5)

(6)

Uses a wideband overlapping fixed—frequency filter bank
and real—time processing to attempt to isolate individual

formants (the criterion is that at least one filter pro—

 

cessing combination sufficiently isolates each formant).
Derives gross measurements of the state variables from the
observed acoustical signal.

From these measurements, identifies a model for the
current state of the acoustical signal and thus Speci-
fies how many formants are present and which estimation
formulae are appropriate.

Selects the filter/processing combination which best
isolates each formant.

Tailors an estimation procedure to track each formant and
give reliable estimates of the state variables to be used
for furhter analysis of the speech signal.

Uses a predictive comparison to determine when a given

model is no longer valid.

 

 

 

168

Experimental evidence shows that this procedure can achieve the
ation of the pertinent features from the nonessential (possibly
elated) portions of the speech acoustical Signal and also isolation
background noise. Also, the processing can reduce speaker depen—
e (examples given in Chapter 2).
This study also specifies a real—time procedure, which is a combin-
i of analog and digital processing, to accomplish the feature rep-
itation. It involves:
(1) Estimation of the parameters of a dynamic time-varying
differential equation model for single formants; and
[2) Use of these estimates to isolate and accentuate the
single formants.
representation shows the varying parameters of frequency, bandwidth,
melitude and gives a compact (low memory requirement) representation
11y the information necessary for further linguistic processing.
levelopment of this real—time procedure is partially mathematical
id in part on an attempt to relate the results to theoretical studies
timation of time—varying parameters), but it is mainly empirical,
se of the inadequacy of any mathematically tractable model to Show
omplete, complicated nature of the Speech acoustical signal.
Linguistic theory is needed in addition to this acoustical signal
ssing technique to give a proper learning criterion for the pattern—
nition portions of the algorithm. The inadequacies of existing
etical linguistic studies to properly account for complicated acoustical
1 properties are pointed out, and an alternative theoretical

work is described to incorporate these properties.

 

 

 

 

169

hﬂsstudy has resulted in the block diagram specification of a first—

It is evident that many of the results can be

level recognit ion system .
Existing

cmmmaimm by implementation and testing of the preposed system.

ummnaranalysis programs can be used to design and check out the various

Further theoretical work is suggested

paﬂs afthe recognition system.
The adequate analysis of the

bynmchcﬁ the discussion in this study.

trmwimuzresponse of nonlinear time—varying estimation filters (extensions

afKahmm—Bucy filters), can possibly be achieved through use of the
quasiémeady—state formulation and prebandwidth function definitions

givenzh1Chapter II. The detection/estimation problem formulated in

Chapmn‘IV may possibly be treated through a combination of these techniques
:Rn'the particular models developed here to give significant theoretical

results in the relatively new areas of nonlinear filtering and time-varying

Signal recognition. Also the formulae for deriving bandwidth estimates
and the

can be very useful for investigation of time—varying systems,
linguistic studies that have been Specified can be continued further for

natural spoken American English to define and evaluate linguistic elements
that aiwelnore closely related to the acoustical signal. The effective

use of’sniprasegmental features such as pitch and intonation has only

suggested in this study but will surely be useful for further progress

been
The techniques deve10ped here for

.n recognition of connected Speech.

epresenting 'the acoustical signal give a very practical method for

onunateyr :analysis of (suprasegmental) features such as stress or intona-

through linguistic and communi—

Lon patterns. The integrated approach,

ti<1n ‘trueories, gives methods to attack the complicated problem of

ecifying the interactions and effects of these features on all levels

linguistic element recognition.

 

 

BIBLIOGRAPHY

t—I

O

CH

BIBLIOGRAPHY

 

C.F.Hockett, "A Manual of Phonology, Memoir ll," Internatl. J
Am.ldnguistics (Waverly Press, Inc., Baltimore, Maryland, 1955).

G.E.Peterson and H. L. Barney, "Control Methods Used in a Study
175-184 (1952).

(d Um Vowels," J. Acoust. Soc. Am., Vol. 24, pp.

Lmns J. Gerstman, "Classification of Self-Normalized Vowels," IEEE
Vol. AU—l6, No. 1, pp. 78-80 (March 1968).

Trmmn Audio Electroacoust.,

G. A. Kepp, and H. C. Green, Visible Speech (D. Van

R.(L Potter,
New Jersey, 1947).

Nostrand Co. , Inc. , Princeton,

C Delattre, A. M. Liberman, J. M. Borst, and

1% S. Cooper, P. .
In J. Gerstman, ”Some Experiments on the Perception of Synthetic
Speech Sounds,” J. Acoust. Soc. Am., Vol. 24, pp. 597—606 (1952).

IL Yilmaz, ”A Program of Research Directed Toward the Efficient

and Accurate Machine Recognition of Human Speech: A Theory of
Speech Perception,” Final Report No. 2, Contract NAS 12-129,
Arthur D. Little, Inc., Cambridge, Massachusetts (November l967),
"Phoneme Grouping for Speech Recognition," J. Acoust.

1% ll. Reddy,
41, No. 5, PP. 1295—1300 (1967).

Soc. Am., Vol.
C. C. Tappert, N. R. Dixon, D. H. Beetle, Jr., and W. D. Chapman,
'Vi Dynamic—Segment Approach to the Recognition of Continuous Speech:

an Exploratory Program,’ Final Report, Technical Report RADC—TR—68—177,
(knitract F30602—67-C-0123, Project 4027, International Business
Matdnhies Corp., Systems Development Division, Research Triangle

Paid<, North Carolina (June 1968).

Technical Report 9,

"A Method of Decoding Speech,"
Illinois (June 1966).

J . Ga zdag ,
Ihiiversity'()f Illinois, Ilrbana,

Al? Graint 7—66,
Z. S. 'Harris, Structural Linguistics (University of Chicago Press,
CHIiCEHIO, Illinois, l961).

Preliminaries to Speech

and M. Halle,
1965).

R. Jakobson, C. Bunnar M. Fant,
/\n2113nsis (MIT Press, Cambridge, Massachusetts,

 

Halle, The Sound Pattern of English (Harper and

N. Chomsky and M.
Row, New York, New York, 1968).
Klatt, "A Limited Speech Recog—
1819, Final Report, Contract

1). C}. Isobrow, A. K. Hartley, D. H.
Inc., Cambridge, Massachusetts

riitgicni System 11," BEN Report No.
NAS 112—138, Bolt Beranek and Newman,

(Aiiri.l 1969).
170

 

M.

16

II

18.

19.

20.

32.

171

D.R.Hill, "An ESOTerIC Approach to Some Problems in Automatic
SwediRecognition," Int. J. for Man—Machine Studies, Vol. 1, No. 1,

(1969).

h R.Focht, "Single Equivalent Formant Extractor System," Contract
lMSém—SSZ, Philco—Ford Corp., Blue Bell, Pennsylvania (November 1967).
S.E.G. Ohman, "Coarticulation in VCV Utterances: Spectrographic
lwamnements," J. Acoust. Soc. Am., Vol. 39, No. 1, pp. 151—168

(Jmnmry 1966).
It A.Houde, "A Study of Tongue Body Motion During Selected Speech
Somum," SCRL Monograph Number 2, Air Force Office of Scientific
stmnth Grant No. AF-AFOSR-1252-67, Speech Communications Research
Lahnatory, Inc., Santa Barbara, California (August 1968).

Lamb, "Prolegomena to a Theory of Phonology,

" Language, Vol. 42,

 

3.1L

NO.1L.PP. 536—573 (1966).

I. B. Thomas, "The Significance of the Second Formant in Speech Intel—

ligibility," Technical Report 10, Contract AF—33(615)—3890, University
Illinois (July 1966).

of I llinois , Urbana ,
W. P. Rupert, "A Representation for the Information-Carrying Units of
Natural Speech" (Ph.D. Thesis, Montana State University, Bozeman,

Montana, April 1969).
J. F. Hemdal and G. W. Hughes, "A Feature Based Computer Recognition
on Models

Program for the Modeling of Vowel Perception," Proc. Symp.
for the Perception of Speech and Visual Form, AFCRL (NoVember 1964).

R. W. Shafer and L. R. Rabiner, "A System for Automatic Formant Analysis
Inc., Murray Hill,

of Voiced Speech,” Bell Telephone Laboratories,

New Jersey.

.I. In iFlanagan, Speech Analysis Synthesis and Perception (Academic

Press, Inc., New York, New York, 1965).

"A Difference Limen for Vowel Formant Frequency,"

.1. ll. Flanagan,
Vol. 27, No. 3, pp. 613—617 (May 1955).

J. Acoust. Soc. Am.,
Root, An Introduction to the Theory

IV. 13. .Davenport, Jr., and W. In
Inc., New

01? Raxudom Signals and Noise (McGraw—Hill Book Company,

York, New York, 1958).

1(aJJnan and R. S. Bucy, ”New Results in Linear Filtering and Pre—
83, pp. 95—

It.
' Trans. ASME Ser. D, J. Basic Eng., V01.

die 1: ion Theory ,'

1 08 (March 1961) .
a Model and a Program

IiaJ.le and K. N. Stevens, "Speech Recognition:
155—159 (1962).

RI.
1231' Itensearch," IEEE Trans. Info. Thy., Vol. IT—8, pp.

 

 

172

'1

R. M. Lerner, "Representation of Signals, Chapter 10 in Lectures
on Communication System Theory, E. J. Baghdady, ed., pp. 203—242
(McGraw-Hill Book Co., Inc., New York, New York, 1961).

 

J. R. Hanne, "Formant Analysis," Report 12, NR 049—122, Contract
Nonr 1224(22), Communication Sciences Laboratory, University of
Michigan, Ann Arbor, Michigan (March 1965).

M. Lecours and J. J. Sparkes, "Adaptive Spectral Analysis for Speech
Sound Recognition," IEEE Trans. Audio and Electroacoust., Vol. AU—l6,
No. 4, p. 523 (December 1968).

 

H

E. J. Baghdady, "Analog Modulation Systems, Chapter 19 in Lectures
on Communication System Theory, E. J. Baghdady, ed., pp. 439-555
(McGraw-Hill Book Co., Inc., New York, New York, 1961).

 

B. J. Leon and D. D. Weiner, "The Quasi-Stationary ReSponse of
Linear Systems to Modulated Waveforms," National Science Foundation
Contract GPn581, Purdue University, Lafayette, Indiana (May 1964).

 

J. D. Duncan and L. E. Cannon, "Investigation of Pre-Detection Fil—

tering Techniques,” Quarterly Report, Contract DA28-O43AMC-01548(E),
Electronics Research Lab, Montana State University, Bozeman, Montana
(December 1965).

J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation
of Complex Fourier Series," Mathematics of Computation, Vol. 19,
No. 90, pp. 297—301 (April 1965).

 

H. L. Resnikoff and G. A. Sitton, "Linguistic Segmentation of Acoustic
Speech Waveforms," paper presented to the Seventy—Fifth Meeting of
the Acoustical Society of America, Ottawa, Canada, 21—24 May 1968.

E. C. Cherry and V. J. Phillips, "Some Possible Uses of Single
Sideband Signals in Formant—Tracking Systems,” J. Acoust. Soc. Am.,
Vol. 33, No. 8, pp. 1067—1077 (August 1961).

 

H

E. Peterson, "Frequency Detection and Speech Formants, J. Acoust.

Soc. Am., Vol. 23, pp. 668-674 (1951).

 

R. W. A. Scarr, ”Zero Crossings as a Means of Obtaining Spectral
Information in Speech Analysis,” IEEE Trans. Audio and Electroacoust.,
Vol. AU—l6, No. 2, pp. 247-255 (June 1968).

 

G. H. Ball and D. J. Hall, "A Clustering Technique for Summarizing
Multivariate Data," Beh. Sci., Vol. 12, No. 2, pp. 153—155 (March
1967).

H. P. Zeiger, ”Cascade Synthesis of Finite—State Machines," IEEE
Conf. Record on Switching Oct. Thy. and Leg. Design, pp. 45—51
(October 1965).

 

41.

42.

43.

44.

45.

46.

47.

48.

49.

52.

53.

173

W. L. Kilmer, W. S. McCulloch, and J. Blum, "Some Mechanisms for

a Theory of the Reticular Formation," Final Scientific Report, Air
Force Office of Scientific Research Grant AF-AFOSR—1023—66, Division
of Engineering Research, Michigan State University, East Lansing
Michigan (February 1967).

P. M. Lewis, "Approximating Probability Distributions to Reduce
Storage Requirements,” Information and Control, Vol. 2, pp. 214~225
(1959).

 

D. T. Brown, "A Note on Approximations to Discrete Probability
Distributions," Information and Control, Vol. 2, pp. 386-392 (1959).

 

W. Kilmer, W. McCulloch, and J. Blum, ”Embodiment of a Plastic
Concept of the Reticular Formation," in Proc. Symposium on Bio—
cybernetics of the Central Nervous System (Little and Brown, Boston,
Massachusetts, 1968).

 

 

C. K. Chow and C. N. Liu, "An Approach to Structure Adaptation in
Pattern Recognition," IEEE Trans. Sys. Sci. and Cyber., Vol. SSC—2,
No. 2, pp. 73—80 (December 1966).

 

 

D. Gabor, "Theory of Information,” J. Inst. Elec. Engrs., Part III,
Vol. 93, pp. 429-457 (November 1946).

 

E. C. Titchmarsh, Introduction to the Theory of Fourier Integrals
(Clarendon Press, Oxford, England, 1948).

 

R. Deutsch, Nonlinear Transformations of Random Processes (Prentice-
Hall, Inc., Englewood Cliffs, New Jersey, 1962).

 

E. J. Kelly and I. S. Reed, ”Some Preperties of Stationary Gaussian
Processes," Technical Report No. 157, Massachusetts Institute of
TechnolOgy, Lexington, Massachusetts, 1957.

N. Abramson, "Nonlinear Transformations of Random Processes," IEEE
Trans. Info. Thy. Vol. IT—13, No. 3, pp. 502—505 (July 1967).

 

E. A. Guillemin, Theory of Linear Physical Systems (John Wiley &
Sons, Inc., New York, New York, 1963).

 

J. D. Bruce, "Discrete Fourier Transforms, Linear Filters, and
Spectrum Weighting," IEEE Trans. Audio Electroacoust., Vol. AU-16,
No. 4, pp. 495-499 (December 1968).

 

A. A. Kharkevich, Spectra and Analysis (Consultants Bureau, New
York, New York, 1960).

 

W. R. Kincheloe, Jr., "The Measurement of Frequency with Scanning
Spectrum Analyzers,” Technical Report No. 557—2, Contract AF30(602)—
2398, Systems Techniques Laboratory, Stanford Electronics Labora—
tories, Stanford, California (October 1962).

55.

56.

57.

58.

59.

60.

61.

62.

63.

64.

65.

66.

67.

68.

174

R. Alter, "Utilization of Contextual Constraints in Automatic
Speech Recognition," IEEE Trans. on Audio and Electroacoustics,
Vol. AU-16, No. l, p. 6 (March 1968).

D. R. Reddy and A. E. Robbinson, "Phoneme to Grapheme Translation
of English," IEEE Trans. on Audio and Electroacoustics, Vol. AU-l6,
No. 2, p. 6 (June 1968).

R. Ash, Information Theory (Intersciences Publishers, New York,
New York, 1967).

 

N. Chomsky, Syntactic Structures (Janua Linguarum, Series Minor,
No. 4, Mouton Publishing, The Hague, Netherlands, 1957).

D. Lainiotis, "Sequential Structure and Parameter-Adaptive Pattern
Recognition—-Part One: Supervised Learning," IEEE Trans. on
Information Theory, Vol. IT—l6, No. 5, pp. 548—556.(September 1970)

 

 

H. J. Kushner, "Dynamical Equations for Optimal Nonlinear Filtering,"
J. Differential Equations, Vol. 3, pp. 179—190 (1967).

 

 

G. Kalinapur and C. Striebel, "Stochastic Differential Equations
Occurring in the Estimation of Continuous Parameter Stochastic
Processes," Technical Report No. 103, University of Minnesota,
Minneapolis, Minnesota (September 1967).

T. Kailath, "A General Likelihood-Ratio Formula for Random Signals
in Gaussian Noise," IEEE Trans. on Information Theory, Vol. IT-15,
No. 3. pp. 350—361 (May 1969).

 

L. A. Wainstein and V. D. Zubakov, Extraction of Signals from Noise
(translated by R. A. Silverman) (Prentice—Hall Book Company, New
Jersey, 1962).

 

M. W. Byrd, "Asymptotic Convergence of Nonlinear Continuous Time
Filters, Ph.D. dissertation for Michigan State University, East
Lansing, Michigan (1969).

G. F. Groner, "Statistical Analysis of Adaptive Linear Classifiers,"
SEL Report No. SEL-64-026 (TR No. 6761—1), Stanford Electronics Lab,
Stanford University, Stanford, California (April 1969).

E. Sapir, "Sound Patterns of Speech,"
(June 1925).

Language, Vol. 1, pp. 37-51

K. L. Pike, Phonemics: A Technique for Reducing Languages to
Writing (University of Michigan Press, Ann Arbor, Michigan, 1947).

T. H. Crystal and L. Ehrman, "Design and Applications of Digital
Filters with Complex Coefficients," IEEE Trans. on Audio and Electro—
acoustics, Vol. AU—16, No. 3, pp. 315—320 (September 1968).

 

 

 

69.

70.

71.

175

E. Parzen, Stochastic Processes (Holden-Day, Inc., San Francisco,
California, 1962).

E. S. Hertz, "On Convergence Rates in the Central Limit Theorem,"
Ann. Math. Stat., Vol. 40, No. 2, pp. 475—479 (April 1969).

 

P. J. Donoghue, "System Identification by Bayesian Learning,"
Ph.D. thesis, Department of Electrical Engineering, Michigan State

University, East Lansing, Michigan (1968).

 

 

 

 

 

APPENDICES

 

 

 

 

APPENDIX A

Description of Sapir's Pseudo-Language
The choice of a data base is governed by two conflicting criterion.
In order to allow computer analysis, it must be of a limited size, but
also, the data base must be representative; i.e., it must Show:
(1) phonetic patterns determined by some consistent rules
as are normally found in natural languages

(2) phonetic "Slurrings" by native speakers

 

(3) several Speakers pronunciation, including male and female
and various accents
(4) long phrase environments with stress and intonation
patterns.
It is very difficult to find such a data base among common American English
because of the assimilation of words from other languages and the many
dialects that exist. For that reason, we have chosen a pseudo—language
from which we will draw our experimental utterances. These linguistic
forms were constructed by Donald Stark from data suggested in Edward Sapiif36
"Sound Patterns for Speech", (June, 1925) (See Pikey7p. 156). This data
was intended to serve as an illustration of a classical phonetic analysis
and, as such, points out several of the more difficult problems that would

be encountered by an ASR system operating on a natural language. _The

phonetic chart for this pseudo—language is shown below.

176

177

Phonetic Chart for Sapir's Language B

a e e i u o o
Vocoids (a°) (6') (e-) (i') (no) (0°) (0°)
Semi-Vocoids (?) h (w) (y) (1) m n
Non-Vocoids p t k
Stops (ph) (th) (kh)
b d g
Fricatives (f) (9) S (x)
v d z g

Several rules, with regard to conditional variants for this language,

 

were proposed by Sapir:

(1) long vowels (denoted by v-) can arise only when the
syllable is opened and stressed
(2) The glotal stop (?) is not an organic constant, but, as
in North German, an attack of initial vowels. This rapid
onset is lost in mid-utterance position
(3) W and y are merely semi—vocalic developments of u and i
that correspond to a glide between adjoining vowels
(4) L arises merely as a dissimilated variant of n
(5) Aspirated (denoted by C“) p, t, and k are characteristic
of this pattering at the end of the word. It is a reverse
of the American—English habit
(6) F, O, and x similarly arise from the unvoicing of final
v, d, and g. Z and s also alternate in this way, but
there is a true S besides.
The linguistic forms are given below. The English "translations" are
Inerely to give an indication of how the utterances are composed from the

shorter forms by typical natural language rules.

178

Table A—2. Experimental Linguistic Forms from Sapir's Language B

 

l. L'hoo.sek“] 'She is tired' l4. ['naeaagnn] 'smoke'
2. ['hoo.SexJ 'bear' 15. L' aaaa.om] 'working'
3. ['hoo.zex] 'onion' 16. [dw.'ath] 'horse'
4. [bo.'gif] 'to answer' 17. ['?eg.n kh] 'bloody'
5. [p0.'gin] 'dish' 18. ['voo.e0] 'man'
6. ['gaa.yph] 'round' 19. [?um.'bif] 'four'
7. ['gaa.na] 'tarantula' 20. ['?e .go] 'fire'
8. ['?ai.ba] 'white' 21. [hii.'duul 'cloudy'
9. [?a1.baa] 'knife' 22. ['taa.ha] 'Square'
10. ['?m21.bas] 'radish' 23. ['daa.os] 'water'
11. ['duu.e] 'two' 24. L'kaa. oph] 'acrid'
12. L'?el.bas] 'three' 25. ['haa] 'you'
13. ['71 .go] 'even though' 26. [po.'gin] 'I wash'
27. [maak.'soth. 'al.ba] 'white stones'
28. [’zOl.gi. um.'bif] 'four houses'
29. ['daa.oz. o.'ke0] 'She carries water'
30. ['h00.zeg. 'duu.e] 'two onions'
31. ['voo.ed. ' aaa2.om] 'the man is working'
32. [dw. 'ath. um. 'biv.am] 'his four horses'
33. [po.'gil. 'gaa.yph] 'round dish'

All symbols that are used are taken from Pike and are given with their
English equivalents in that book. The non—English sound, g, is a voiced x
or lambda, as it is commonly called by phoneticians. It is a sound common
to several African languages. The glotal stop, ?, appears in some German

dialects and also, may be a part of American—English pronunciation that

has been neglected. The choice of speakers were four: one American male

179

(EH), one African male (BE), one African female (BA), one American female
(MJ). These people had previous training in linguistics, but were not
accomplished phoneticians, so that their pronunciations of these words
were more natural, not academically stilted. Their linguistic background

allowed them to comment on the exact nature of those pronunciations.

This data does not have any r'S or 1's, as commonly found in American
English. The result is that this is a slightly easier case because of
the lack of complicated vowel glides. They do exist, however, between
adjacent vowels so that this is not a completely academic data base. All
utterances that are referenced in this paper are indexed by the following
method: a seven—character label is assigned, the first two characters
correspond to the utterance number given in the list above, a blank
separates them from the next two characters, which are the initials
assigned to the individual making the utterance, and the last two
characters are assigned to the repetition number of the utterance
(i.e., if the speaker has said the same utterance three times, then the
last number will reflect this as l, 2, or 3 -- for example, 16 EH 2 is
the Sixteenth utterance in the list, dwath, said by Earl Herrick (EH)
and it is his second repetition.
This data base is sufficient for the study of very many questions
arising in automatic recognition of natural speech.
(1) Automatic machine determination of existing phonetic
patterns. There is a sufficient data set available
to perform that experiment.
(2) The actual phonetic variations caused by different
speakers of this basic data. This is the primary

question that we are investigating in this research.

 

 

 

 

180

(3) Measurement of stress and intonation and other supra—
segmental features.
(4) Determination of stress variation of vowels.

(5) Determination of syllable and word boundaries.

There are a total of 258 utterances on the analog tape from the
four speakers with repetitions. A subset of these were chosen for an
initial reCOgnition experiment and these were subsequently filtered and
digitized as described in Appendix B. That subset is:

c - h
4. [b0. gif] 16. [dw at ]
18. ['voo.e0] 19. [9um.'bif]

h
27. [ma3k.'sot . 'a1.ba]

 

28. ['zoi.g1 um.'bif]
31. ['voo.ed .' ae ae . em]

32. [dw.'ath . um.'biv .am]

 

 

    

"5’ , . ‘
.q'
1;“,
73"“! A
' » . 11"?“

APPENDIX B

Recording Apparatus Used to Collect Experimental Data

The recording of the experimental data on analog tape was performed

on the apparatus shown schematically in Fig. 8—1.

The subjects were seated in a quiet office and arranged around a
microphone so that they were talking in a conversational—type atmosphere. The
microphone used was an EV Model 654 Dynamic Non—Directional Microphone and it

was input to a Type 122 Tektronix Pre—Amplifier, then to Channel 2 of an

 

FR 1100 Ampex Recorder with l/2-inch instrumentation tape at a speed of

thirty inches per second. In order to minimize timing variations in re-
cording/reproducing due to wow and flutter and possibly the use of different
recorders with possibly different speed adjustments, a timing waveform was

also recorded on another Channel Simultaneously. The timing circuit was

provided by a Tektronix Type 114 Pulse Generator fed directly into

Channel 5 of the Ampex FR 1100 Tape Recorder. The input was a 20—micro—

second pujgp repeated every SO microseconds for a timing frequency of 20 kHzf This
analog tape was then processed using the equipment shown schematically in

Fig. B—2 and B—3. First, the analog tape was marked with start pulses on

Channel 7 indicating beginning points for the words that were to be pro—

cessed and converted to digital samples. Six separate passes of the analog

tape were used to get the various filtered combinations that were required.

The use of the start pulses assured a uniform beginning point on all the

 

no:
The output variation in a lO—KC square wave, with reference to the input
for a record/reproduce situation, was as much as 80 to 100 microseconds.

181

 

 

V
V 654

182

 

Tektronix 122 Channel 2

 

 

.crophone

.ming Waveform

:art Pulse

 

 

Preamplifier

 

 

Tektronix 114 Channel 5

 

Puls e Genera tor

 

 

 

 

Tektronix 114 Ghannel 7
PukxaGenenﬁnr

 

 

 

 

Ampex FR 1100
Recorder

 

Fig. 8—1. Apparatus for Recording Speech Signals on Analog Tape

 

FR 1100
Ampex
Recorder

 

 

 

 

 

 

 

 

Fig. B—2.

 

 

 

 

 

 

 

 

CDC 3100
Computer

Manual
Interrupt
Ch 7 Start Pulse
Ch 5 Timing Waveform
Ch FKhr‘on—Hi te ‘—
tg 315AR Filter “~«v
75—6500 Hz
«-—~~——‘ ...-..ﬂ...“ --- ..-- -._-,...,I
mmvwmmw~~v —— Multi 1 er
3-x Filter p ex
From 1 and A/D
Bank
to 12 converter
outpum

 

 

 

 

 

 

 

 

 

Apparatus for Multiplexing and Digitizing

Data from Analog Tape

 

 

 

 

 

 

183

[n Fig. B~J, the following operations are shown:

I)

The analog tape was mounted on another FR 1100 Ampex
Recorder. The reproduced output was passed through a
Khron—Hite 315AR variable bandpass filter set to lower
limits of 75 Hz and upper limit of 6500 Hz, then to a
Bruel—Krujer bank of third octave filters. This bank
is shown in Fig. B-3 with further hardware and soft-
ware recombination of these filter outputs to achieve
the bandpass overlapping filter outputs as desired.

The B-K filter bank was chosen because of its linear—
phase characteristics, 50 dB per octave skirts for
adequate bandpass filtering and ability to sum adjacent
filters to increase bandwidth. Sampling was done on a
CDC 3100 computer giving continuous A/D operation on an
analog tape using a l2-channel multiplexer and a lO—bit
A/D converter. The data were then sorted on digital

magnetic tapes and used for further processing.

The A/D sampling rate was controlled by the timing wave—
form reproduced from the analog tape, minimizing variations
from the desired 50 microsecond sample interval.

Start pulses from Channel 7 initiated A/D operations (via
30—interruptin the A/D converter). A manual interrupt was

used for termination.

 

 

 

 

 

mIPD§>OZ<m mom: OZ< mom It; v_Z<m ”3.5.”. OZEm<4mw>O

mike—30245 mo m.
NI Sam

4.

NP 00.2

NI gm
11

~I gv
: .0” DOS.
~I mmpm

I]

 

m .m COS.

184
i

 

 

n .0 00.2

n. .v 00$
2... coop

 

 

 

~I nmm

n .N 00.2;

 

 

mIFO =SOZ(m mu m

~I 80m

4.

my 005.

2.. 23 IT

: .0, DOS.

2... nnmm '4'

~I $9

J]

m .m 00.2

m .v 00.2

 

If

N: 2m
N: mmv LF

~I non l—I

 

~I Dun — 00$.

NI Ms

 

 

 

 

¥Z<m Imbfu 02_n_a<..mw>0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

ozjn§<w
20:59. 0-4 20:69..
th.O.O ozgqumquDE 004(24
«.00:
‘I.
88
: no: 89.
‘1
So .D. 88
E
41
, or 00.2 00mm
88
m nos. 08.
50
m a:
«4
‘IL m 00.2 8o—
08
, h CO, 8%
So
b 8m
#1
n , m 00: 8e
m 00.2
‘1.
So
«4 m2<
v 092 08 o
n cos.
‘1 w>:<>_mme,
R
N no: 08
T
a 00.2
I n=2(
no

 

 

 

 

 

U

u

u

U

U U U U

58v

.Om—m

.83

.93,

5mm—

5:”

6.3

5.0—

.WNF
.8?

th wKDOE

 

 

 

Imb... _ u
~I
OOmwlmw

 

 

 

hDaZ_
Iuwwum

APPENDIX C

TIMSER

A Program for Interactive Analysis of Time Series

MSER-—Techniques for interactive Manipulation of §§quential
ns——is a program ensemble that runs on the CDC 3300 at SRI.
developed to allow a user to edit and transform time series
tively, observing the results on a CRT display. The time

of primary interest are bipolar, one—dimensional time series

8 the result of A/D operation on an analog voltage)and unipolar

riate time series (such as the envelope and zero—crossing count

 

ries derived from the bipolar sampled analog signal). Figure l
n overview of the operations available to perform these two

f analysis.

ata

ﬂTMSR—4Rne-Dimensional Bipolar Time Series; i.e., the output

 

an A/D conversion of an analog waveform from magnetic tape.

3 system allows inputting data from up to three tapes, which may
fierent sampling time intervals, accuracy (range of data), and
length. No capability is available now to unpack multiplexed data;
, it could be implemented by modifying a single subroutine. The
3rs that are needed to read each magnetic tape are read in from
:ard. These tape parameters are fixed for each magnetic tape;

30

n/

(1) Tape ID (4 BCD Characters);

(2) Length of record;

(3) Sample time interval;

(4) Logical unit; and

(5) Range of the data, 8 bits (i 256) to 12 bits (i 2048).

185

 

 

 

 

186

F

 

mem—m wS__._. w._.<.m<>_._.._DS_ “.0 ><._aw_o OZ< m_w>._<z<
m0”. meEwaw 2<m00mm ><.Em_o m>_._.o<mm._.z_ummm_>:._.

 

ZOdeDaEOU
m2; 23m

   

    

  

Emhzzun.

 

ImOdmI
mon hlwl<§mwa

hmﬂrﬂo

         

WEIh_IOOJ<
OZ_OO<Ju
BPdEOde

>mok0wmﬁ
man

    

_
_ wmmmeqGQu

   

mmwhw5414n

H H H >(Jn90 ecu

 

85.228
oz<0m>uy .L

 

_
_ 5... ..IL - . _
mozSLo _ ._ . _ _

   

L

EJCOIUEJ

mmwuqu
.(Jamﬁ
NEOU

ail)?
in?!»

 

 

hmIaEwk2_
EWSkOCa

:

 

 

 

 

 

OKdOmywx

_
_
_
_|

   

   

mamauum

kaz_ia

«20. rqugono
owx_a

_ _

mﬂwkwidudu

_ >4Jam.u r043
_ $3:sz wozqru

[IIIIL

FlllllllllllIL

To mmDOE

   

    
  

mzoiahsaiou
wSCHZDm

mJC
LusdaUm
pwoknc

 

 

         
          

>IOFme5
win

 

 

 

 

     

 

)

   

me.u prZ
-QIawnkworan

L
Ji_
_

F

mn24

230 U .

0n40m>w¢

kwda

amp).

Imihgdm

 

 

_

w

 

 

meJU w 2wqo
9279.133

      

         
       

Q

187

‘ucture of the magnetic tapes is typical for A/D operations; namely,
‘ry number of records per file and data blocked by end-of-file marks,
arbitrary number of files. A compass subroutine enables quick

ng for files.

EgTMSRifUnipolar Multiple Time Series; i.e., results of sampling

 

.terrelated analog waveforms or results of processing one—dimensional

gital time series.

Lese time series are stored in random—access files that have

.ble structure, depending on four parameters:

(1) Length of record
(2) Number of records
(3) Number of modules

(4) Number of dimensions.

four parameters are contained in a header at the start of each file

e assumed to be constant for that file. Use of a virtual-core storage
(PUTGET) permits storage of up to 200 files. Maximum record length

ited to 200; however, the rest of the parameters are bounded only by

ble disk storage. The number of dimensions is the number of different

eries in this file. The number of modules is a sub-file structure

llows a within—file breakdown of data. For instance, there may be

1 processing schemes for one multiple time series which the user

to compare. The number of records parameter refers to each module.

)tions Available to Both Programs

 

.crofilm Hard Copy

 

is possible to obtain hard copy of the actual picture on the
splay. This is done by dumping the octal diSplay buffers onto
Lc tape and then converting them to microfilm pictures on the

) display,later, as a batch job.

 

 

 

 

188

pmments and Titles

 

he user can type in a title or a comment on the CRT diSplay;
comments are transmitted to the hard copy. This is useful for
ting where you are in your analysis and for titling microfilm

BS.

un—Time Computations

 

limited incremental compiler, allowing the user to manipulate,

c, scale, or transform the time series, will soon be added. The
ill be able to type up to 10 algebraic equations (which may perform
ear transformation, lead—lag averaging, magnitude, absolute value,

of squares computations, normalizing by maximum value, etc.)

 

gand Transformation Capabilities

 

\WTMSR

)ata on magnetic tapes is transferred to a circular virtual~core
', that is, a buffer with a fixed length; when this length is
led the first data introduced is overwritten. The user can select
no of three magnetic tapes from the keyboard; at this time, a data
.s read in with the tape parameters discussed above. Arbitrary
:ion of files and records from that tape can be made so that the
.ar buffer may contain an allocation of data not necessarily the
[S the original tape. From the circular buffer the time series are
'ted to octal display buffers for the CRT display. A pointer is
;o determine the origin of display in the circular buffer. It is
>le to change plot parameters, including the number of points in
trve, the number of curves on the screen, and the scale of the

The pointer can automatically increment through the buffer allow-
Lpid editing. The user may also reference (save) one or more curves
2 screen and edit others for comparison. Once a curve is saved,
rr modification of the plot parameters will not affect it. Various

[8 are available for rapid editing of data from several tapes.

 

A

 

189

III euiddtion to the editing capability, RAWTMSR allows some computations
9 performed on the time series. There are two types:

(IL) Fixed Computations--including Fourier series and a
smoothed time derivative computation. These compu—
tations can be made on any selected portion of the
data in the circular buffer by setting limit pointers.
The resulting time series are stored in a temporary
scratch buffer, located in core, and immediately dis-
played above the last reference curve. These curves
are automatically referenced so that further computa-
tions or moving of time series will not remove them.

(13) Run—Time Computations--These computations will be

 

performed by the incremental compiler. They can be
performed either on data from the circular buffer,
again indicated by beginning and ending pointers, or
on data in the scratch buffer in core.
The result of any of the above computations will be written over
ything in the scratch buffer and immediately displayed. Permanent
cords of the computations can be made either by the hard copy option

' by printing the scratch buffer contents.

For example, the combination of these fixed run—time computations
an result in the following display: First, pointers are set in the
cratch file and a Fourier transform is computed. Then the incremental
ompiler is called, and a logarithmic transformation of the magnitude of
he Wander series is computed and normalized. Then another Fourier
;ransﬁnmn on the resulting time series, is computed and displayed. The
mmulthnzwaveform, called a cepstrum, is useful in speech analysis.

Hm pnnxdure of introducing the data in the scratch file, assigning
txginnhm;and ending pointers, and calling a subroutine to do the compu—
thg iscmmmon to many forms of time series analysis; namely, autocorrela-
thnicmmmtations, convolution operations with matched filters, etc. and

alkms ageneral structure for incorporation of additional Operations.

 

 

 

190

XX ixyrxical hard copy (microfilm) picture (Fig. 4 in the text, repeated
> irujixzates some of the editing capabilities. Wave forms from four
erwerrt :files (4, 5, 6, 24) and three different tapes (MD4, MD6, MD8) are
layed simultaneously. Each waveform is labeled with the beginning and
ng time references (from the start of each file) computed from the

|1£3 irrteryal parameter for each tape. File names (19 BA 1, 19 MJ 1,

3F} 1, 19 F311) and a general comment (CUTPUTS 0F ﬂVERLAPPING BAND PASS
IHBRS [1W1 Tﬂ [Bl]) are also displayed.

PTKYUMSFi

13MB PROTMSR program allows a study of the interrelationships of time
ixxs selected frmn random access files. Four two-dimensional
.tteI'1910ts are diSplayed. The selection of the plot parameters for
ﬂ1<1f these four plots allows the plotting of an individual time series
rsus its index, the scatter plots of two time series (from the same
1e, but not necessarily the same record or module), and comparisons of
atter plots from four different files. The structure of the files on the
sk is completely determined by the parameters (discussed above) in the
:ader; thus, it may vary from one file to another. An index function
ivolving four parameters (instead of the three commonly available in
ortran) is used in a separate subroutine called INDEX so that
he usual Fortran requirements of predetermined dimensions and maximum
alue of each dimension are not necessary. A file directory showing the
tarious parameters zuui file ixknrtification (Rani fronliﬂua header is zuniilable
itinxn‘option for selecting the files. Various options are available

to Huﬂlitate the comparison of the four scatter plots;

0) A time sequence option,which allows ten points on
each scatter plot to be labeled 0 through 9 according
to their sequential index. These ten labels are
then incremented through the scatter plot, showing the

sequential relations.

C» An overlay option, which plots all [our scatter plots

0]] COIIIIIlOll ilXt‘S.

 

191

(3) Automatic incrementing of either dimensions, modules,

or records for rapid editing.

ui—dzinue compiler can be used in this program to generate a new file and
a“; seat (Df derived time series from any of the existing files. Similar
es <3f ‘transformations, as discussed before, are available here.

1\ tyqxical picture (Fig. 17b in the text, repeated here) shows four
nultaneous time series plots, the two on the left are univariate plots
1 time two on the right are bivariate (scatter) plots. Labels on the
ttxnn are added by the user. Each plot is labeled with a file name
6 EH3 1) three index parameters (D1 M5 R2) or TIME (indicating the
idexﬂ writh maximum time shown (480 ms). Names of the D index are also
iouni (ABS ENV). Scale for each axis is shown by a factor (X 4) which

iltiplies the original data.

 

APPENDIX D

SLIDING POWER SPECTRA

Sliding power spectra are computed from the A—D tapes described in
Appendix B‘by means of the TIMSER display program ensemble. Each curve
displays the square root of the power spectra computed over a fixed time
interval (25 milliseconds for all curves in this appendix).

Spectrum smoothing is done by multiplication of the time waveform

by the following taper function:

2 2
(1-x ) , —ISxS1

 

This smoothing is performed to minimize the effects of the pitch frequency
and give better side lobe response (see Blackman and Tukey).

The labels on each picture give fundamental frequency (40 hz), file
number and tape ID (corresponding to module), linear or log magnitude
plot, maximum frequency, utterance and speaker label, and filter band—
width. The label for each curve shows the start time and the square root
of power (all curves in this appendix are stepped 15 milliseconds).

Each curve is normalized to the maximum frequency component. The
linear magnitude plots show percentage of the maximum component. The
log magnitude plots show dB relative to the maximum component (50 dB

range).

192

 

 

 

PWR 3072.9

 

 

PWR 2906.1

 

PWR 3128.6

 

 

PWR 2656.0

 

PWR 2289.0

 

 

PWR 2093.7 370.4 ms

 

r I 1 1 r v ‘ ‘l r r —,
' I I I I I I I I I

I
40 Hz FILE 3 RTWB LOG MAGN 6500 Hz
DHUATH 16 BE ‘1 REAL TIME WIDE BAND SIGNAL

 

 

FIGURE Dvl DHUATH 16 BE 1 REAL—TIME WIDEBAND SIGNAL

 

 

 

 

PWR 3457.1 535.3 ms

 

PWR 4129.5 520.3 ms

 

  

  

PWR 4792.2

 

   

PWR 4467.3

 

 

   

PWR 4486.9

 

 

 

:' I
40 Hz
DHUATH 16 BE 1

PWR 3983.6

0 I l
FILE 3 RTWB
REAL TIME WIDE BAND SIGNAL

I
LOG MAGN 6500 Hz

 

 

FIGURE D—1

DHUATH 16 BE 1 REAL—TIME WIDEBAND SIGNAL (Continued)

 

 

 

 

 

195

 

 

 

PWR 208.4

 

PWR 272.2

 

PWR 743.5

 

 

PWR 1426.8 580.3 ms

 

 

PWR 1971.5 565.3 ms

 

PWR 2670.0

 

 

P""' 1 1 1 r v ' 1 r r *1
I I I

I O I I
40 Hz FILE 3 RTWB LOG MAGN 6500 Hz
DHAUTH 16 BE 1 REAL TIME WIDE BAND SIGNAL

 

 

 

 

FIGURE D—I DHUATH 16 BE 1 REAL-TIME WIDEBAND SIGNAL (Continued)

 

 

196

 

 

PWR 139.6 715.3 ms

PWR 161.2 700.3 ms

PWR 194.3 685.3 ms

PWR 169.9 670.3 ms

PWR 140.6 655.3 ms

PWR 225.1 640.3 ms
r -------- t """""" 1 """""" ‘I """""" r """""" 7 """"" 1 """""" a """""" r """"" r """" “I
I I I I I I I I I I I
40 HZ FILE 3 RTWB LOG MAGN 6500 HZ

DHUATH 16 BE 1 FILTER BANDWIDTH 70 TO 6500 Hz

 

FIGURE D—I DHUATH 16 BE 1 REAL-TIME WIDEBAND SIGNAL (Concluded)

 

 

197

 

.... ......................... ..- .......................

 

PWR 2341.7 445.4 ms

 

 

PWR 1823.4

 

PWR 1881.1 415.4 ms

 

PWR 1579.0

 

 

 

PWR 1189.6 385.4 ms

 

 

 
 

PWR 1032.7

  

40 Hz FILE 3 MD4 LIN MAGN 3333 Hz

DHUATH 16 BE 1 FILTER BANDWIDTH 458—1467 HI

 

 

FIGURE D—2 DHUATH 16 BE 1 FILTER BANDWIDTH 4584467 Hz

 

 

 

 

 

 

 

 

 

 

 

198
PWR 3504.2 5353 ms
PWR 3548.0 520.3 ms
'1 A Av VA

PWR 3435.6

PWR 3351.1

PWR 4761.9

I 0
FILE 3 MD4
DHUATH 16 BE 1 FILTER BANDWIDTH 458—1467 Hz

 

 

PWR 4285.6 .

r ......... ‘ ......... ‘ ......... ﬂ .......... r ......... Y ......... ‘.-

505.4 ms

 

490.4 ms

475.4 ms

 

....... 1------00--..--.--OOQDr-QOOD-.-q

LIN MAGN 3333 Hz

 

FIGURE D—2
(Continued)

DHUATH 16 BE 1 FILTER BANDWIDTH 458—1467 Hz

 

 

 

 

199

 

 

 

PWR 128.8 625.3 ms
AMA VA
PWR 158.3 610.3 ms

 

PWR 412.3 595.3 ms

 

PWR 848.3 580.3 ms

 

 

 

 

PWR 1514.1 565.3 ms
,— AT W
PWR 2798.0 550.3 ms
‘ W
' """"" 1 -------- 1 --------- a --------- r °°°°°°° I '''''''' 'I -------- "1 ---------- I ---------- I- -------- -I
I I I I I I I I 0 I I
40 Hz FILE 3 MD4 LIN MAGN 3333 Hz

DHUATH 16 BE 1 FILTER BANDWIDTH 458—1467 Hz

 

 

FIGURE D—~2 DHUATH 16 BE 1 FILTER BANDWIDTH 458—1467 Hz
(Continued)

 

 

 

200

 

 

PWR 90.4 ' 715.3 ms
PWR 100.2 700.3 ms
PWR 79.6 685.3 ms

 

PWR 74.6 670.3 ms

 

PWR 71.4 655.3 ms

PWR 105.0

 

40 Hz FILE 3 MD4 LIN MAGN 3333 H2

DHUATH 16 BE 1 FILTER BANDWIDTH 458—1467 Hz

 

FIGURE D—2 DHUATH 16 BE 1 FILTER BANDWIDTH 458—1467 Hz
(Concluded)

 

 

 

‘

201

 

 

PWR 3165.1 445.4 ms

 

PWR 2959.9 430.4 ms

 

 

PWR 3484.2 415.4 ms

 

 

PWR 2806.4 400.4 ms

 

 

 

 

 

 

 

 

PWR 1924.8 385.4 ms
M ‘
PWR 1527.2 370.4 ms
WA‘A : A M :VA
7 """"" 1 """""""""" "I """"" f' """"" Y """"" 'I """"" 1 """"" I """"" f """" "I
40 Hz FILE 3 M06 LIN MAGN 3333 Hz

DHUATH 16 BE 1 FILTER BANDWIDTH 577-1867 Hz

 

 

FIGURE D~3 DHUATH 16 BE 1 FILTER BANDWIDTH 57741867 Hz

 

 

202

 

 

 

PWR 7072.7 535.3 ms
7 At;-
PWR 7784.1 520.3 ms

 

PWR 8072.0

PWR 6663.8 490.4 ms

 

 

PWR 6394.7 475.4 ms

 

 

     

PWR 4690.8

 

4° “2 FILE 3 M06 LIN MAGN 3333 Hz
DHUATH 16 BE 1 FILTER BANDWIDTH 577—1867 Hz

 

 

FIGURE D—3 DHUATH 16 BE 1 FILTER BANDWIDTH 577-1867 Hz
(Continued)

 

 

 

 

203

 

-' ----------- o ....................................................................................... —

PWR 188.6 625.3 ms

PWR 286.3 610.3 ms

 

PWR 741.2 595.3 ms

 

 

 

 

M 13905 580.3 ms
4W.
PWR 5874.9 550.3 ms
7 """"" I """"" 1 """""" 'I """""" r """"" V """"" '1 °°°°°°° 1 """"" I """"" I’ """"" W
40 Hz FILE 3 MD6 LIN MAGN 3333 Hz

DHUATH 16 BE 1 FILTER BANDWIDTH 577—1867 Hz

 

FIGURE D—3 DHUATH 16 BE 1 FILTER BANDWIDTH 577—1867 Hz
(Continued)

 

 

 

 

 

204

 

   

PWR 123.7 715.3 ms

   

 

PWR 141.5 700.3 ms

PWR 149.4 685.3 ms

 

 

PWR 156.5 670.3 ms

 

 

PWR 113.6 655.3 ms

PWR 132.9 640.3 ms
.r --------- ,..-- -----s --------- a ---------- I- -------- I --------- 1 --------- 1 ---------- I ---------- r -------- -I
O I I I I I I I I I
40 Hz FILE 3 M06 LIN MAGN 3333 Hz

DHUATH 16 BE 1 FILTER BANDWIDTH 577-1867 Hz

 

 

FIGURE D-3 DHUATH 16 BE 1 FILTER BANDWIDTH 577«1867HZ
(Concluded)

 

 

 

 

 

 

 

205

 

 

 

 

 

 

 

 

 

 

 

 

 

PWR 1343.6 445.1 ms
‘7" Av—%
PWR 912.4 430.1 ms
PWR 972.8 415.1 ms -
rA" __ l
PWR 619.9 400.1 ms
__ A —--— v A ~
PWR 470.7 385.1 ms
PWR 227.1 370.1 ms
r --------- 1 --------- 1 --------- w- --------- r --------- v --------- 1 -------- I """"" r """"" r """"" "'I
I Q 0 I I I I I I I
40 HZ FILE 3 M08 LIN MAGN 5000 Hz

DHUATH 16 BE 1 FILTER BANDWIDTH 1467—2917 Hz

 

FIGURE D-4

DHUATH 16 BE 1 FILTER BANDWIDTH 1467—2917Hz

 

 

 

 

 

 

 

206

 

 

 

 

 

 

 

 

 

 

 

 

 

PWR 2399.3 535.1 ms
‘vb._ v A AW A
PWR 3817.7 520.1.ms
k_— v A A
PWR 4932.5 505.1 ms
PWR 3744.0 490.1 ms
M: v v .A A v. ~‘v‘vA‘M
PWR 2944.3 475.1 ms
A M
PWR 1896.2 460.1 ms
*v-— #2, A ~— 4— ‘v‘v‘s
7" """"" 1 """"" ﬂ """"" ‘I """"" l' """"" Y """"" 1 """"" ‘I """"" I‘ """"" f """" .1
I I I I I O 0 I O I I
40 Hz FILE 3 MD8 LIN MAGN 5000 Hz

DHUATH 16 BE 1 FILTER BANDWIDTH 1467—2917 Hz

 

FIGURE D—4 DHUATH 16 BE 1 FILTER BANDWIDTH 1467—2917Hz
(Continued)

 

 

 

A

 

 

 

 

207

 

 

.‘OOOOOOOOOI...—00.00.0000... oooooooooooooooooooooooooooooooooooooooooooooooooo o ..... ...-00.0.0000...-

PWR 178.1 625.1 ms

 

. ...... -. ......... ... ........................................... Q ......................... . ....... ...-

PWR 235.3 610.1 ms

PWR 369.5 595.1 ms

 

PWR 492.8 580.1 ms

 

PWR 1088.0 565.1 ms

 

PWR 2370.6 550.1 ms
'0.-. ..... ' ......... ‘ ......... C'— --------- r ......... t ......... ‘ ......... a .......... r ......... rung-....“
I I I I I 0 I I I I I
40 Hz FILE 3 M08 LIN MAGN 5000 Hz

DHUATH 16 BE 1 FILTER BANDWIDTH 1467—2917 Hz

 

FIGURE D—4- DHUATH 16 BE 1 FILTER BANDWIDTH 1467-2917HZ
(Continued)

 

 

 

A

 

 

 

 

APPENDIX E

INSTANTANEOUS ESTIMATORS 0F TIME—VARYING PARAMETERS

It has been shown that representation of a class of acoustical signals

can be reduced to estimation of the time—varying frequency and envelope and

their derivatives. A common estimator of the instantaneous frequency is

a sliding average of the zero crossings of x.

 

t
w(t) : K J z(o)do (E—la)
1 t4
*
Or for discrete samples
n
113 = KE 2. (E—lb)
n 1 k J
J

Where K is a normalizing constant
1

Z = 1 if X. S 0 and Xj > 0
1

or x, Z 0 and x, < O
J-i J

0 otherwise.

II

A
A reasonable estimator of the envelope of x(t) is the sliding mean

of the absolute value of the real part.

t
[a(t) = % J |x(o)ldo (E—Za)
t-T

 

‘We will denote all estimators by adding a tilde (the estimate of w 151:)
and the sliding sum of length R over the index 3 from n—k+1 to n as

u.f\/jp

209

 

‘

 

 

210

for discrete samples
n
N 1 ‘5 1
a 2 — _
n ka ‘le (E 2b)
3

A
Axuyther estimator for the envelope of x(t) can be derived from equa—
A
on (II—JL—Bc), that is, an average of the magnitude of x(t). The Hilbert
'ansform of an arbitrary function can be obtained by means of a complex
56
.gital filter (Crystal and Ehrman) operating on the real signal.

The computation of these estimators involves a non-linear, no—

emory operation followed by a low—pass filter.’

he problem of removing the oscillatory terms from the state variable dif—
erential equations as discussed in Sec. II—A and the selection of T (or k)
s analogous to the selection of the cutoff frequency for the lOWpass filter
shomuin Figure E—l. An effective measure for stationary signals is the mean

so
Squmxabandwidth. Abramson has shown that the mean squared bandwidth (see II-A-

14) dfxwt), the result of the nonlinear, no—memory Operation, is Computed by

*
thezknlowing formula:

B' :: E{(V’E2%}E{X2} B: (E-3)
E v0

<0

 

‘_

*

Wecbmﬁw the derivative of a function, v, with respect to its argument
as v’.

 

 

 

211

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

x(t) NON'L'NEQS VIl) LOW~PASS y(t) ESTIMATE
NO—IVIEIVIO FILTER OF MEAN
OPERATION

ESTIMATE OF ,
DERIVATIVE

 

 

 

FIGURE E~~1 OPERATIONS FOR PARAMETER ESTIMATION

 

212

For the situations we are considering, Bi is equal to a constant related to
the non-linear operation times the mean squared bandwidth of the input.
For example, the bandwidth of the envelope function (using II—A-l4) for

stationary Gaussian processes

941-.) = a(t)eja(t)

13:2...4— : 1398 2 (E—4)
E {32} X

A
where xs is the shifted (low pass) version of x. Use of a full wave
detector (absolute value) as an estimator of the envelope gives a band—

width (Abramson)
- E{(g')2} E{x2} B? = I E{ (13:7 )2} E{x2} B:
EI g2} BI leg}

 

 

: B.f (E—S)

The mean frequency can be derived by converting v(t) to an analytic

signal with discontinuous phase

C(t) = b(t)e‘jp(t) |x(t)l -I- jlxh(t)l (13-6)
x(t) : a(t)eja(t)
where
h(t) = a(t)
5(1) = 0’(t) + C(t)

g(t) is a step function which increases by W whenever x(t) = 0.

 

Using (II—D—2), the mean frequency, O, is given by:
(D

00

f b2(t)é(t)dt J a2(t)oz(t)dt I a2(t)dI;(t)
o 0 [1

SI
I
I

+

_ ——-——-—————
(D 00 CD

— L b2(t)dt J0 a2(t)dt I0 a2(t)dt

 

where [1: {tlx(t) = 0}

Since the discontinuities at t e [7 are steps (first order), the last

integral is zero. Consequently, 5 becomes

(1)

_ I a”(t)oz(t)dt
w = J...— (13-7)

CO

Jo a2(t)dt l

 

For signals generated by time—varying differential operators, the
mean square bandwidth is not an effective criterion. Rather, the instan—
taneous fluctuations of the bandwidth must be considered. We can fix an
69

upper bound by using a Chebyshev inequality for stochastic processes (Parzen)

for the time interval [t t j .
1 2

 

T, 1 ~ _ 1
ﬂag, besmI > Ml s —— Eta... bewr} ..-...
EﬁjztStz lbxs(t)l2] S %[E{\bxs(t1)|2} + Eﬂbxsug) 2}]
ta 21% 2 E
+ LlET bxs(t) J E{ bxs(t) } dt (E—8b)

214

[.A J is the probability of the event A

[ - 1 is the statistical expectation with respect to P.

bxs(t)' is constant (time—invariant generating equation), we have

r
amping coefficient b )

[t StSt ‘bxs(t)| > Am] S br/(Aw)2

have seen, the magnitude of bXS is related to both the effective band—

for deterministic system impulse reSponses and effective bandwidth for

arying differential operators with white noise driving functions. We may

Aw to the bandwidth of our bandpass pre-filters and then use this rela—

ip to determine a length of average [tl,t2] (which is related to the

frequency). The bound, then, depends on not only the instantaneous

of our state variables, but also the instantaneous values of their

:ives. We can use these relationships to investigate the properties

:ific estimators.

1e estimation of the derivatives of the state variables is, in general,

;y for noisy observations. A more stable estimator is derived by alge-
Bu

Ianipulation of the stochastic derivative (Parzen) of a time series

=1} with finite second moment. (yn may be the discrete envelope samples,

or the zero crossing samples, Zn)

HQ.

L.I.M. Yn ’ Yn~1
A40 A

 

..I.M. is the usual limit in mean definition and A is the sample time

each discrete sample. For computer applications, A is fixed and the

average 0f the square 0f Ayn is more appropriate (for locally ergodic

es).

 

 

n
~ 2.1: (y _, r (m
n M k J' J-1
J
n
l ~g 1 2 9
= EA an + E yn—k - yn . 2 Elk nyJ-l _ m 1
J'
where
n
a; 312 y
n k k j
J
n
gig-£2 (y _m.§
n k k j n
3'

Thus, the sliding variance is a factor in the mean square sliding stochastic
derivative (and has a shorter name). Reliable estimation of a significant
derivative requires a small value of k (the number of points averaged) while
reduction of stochastic variation requires a large value. By using the
sliding variance, these requirements are partially reconciled by eliminating
terms primarily due to stochastic noise. Also, the sliding variance is

more stable than a simple difference of the sliding mean which reduces to
1 /
E (yn _ yn-k)‘

Let us summarize the alternatives for selection of a total estimation

process.
1.) Sub—interval length — This is the number of points to be
summed corresponding to the first low—pass filter in
Figure E—l. In order to compute the proper sliding
averages, these intervals are non—overlapping rather than
sliding.
ii.) Sliding average length — The value, k, in the formulas

for the various estimators relates to both standard

deviation and mean value.

216
) Envelope estimator — Either absolute value of the input
signal or a Hilbert envelope (the square root of
the sum of the squares of the input signal and its
Hilbert transform).
Derivative estimators — For each of the envelope esti—
mators we may define three derivative estimators:

One—Point Difference for Sliding Mean

A] n n—l
(1) I; = .1.\ ya - ilk 33 (E—lOa)
1 kLk J k J
J J

Sliding Standard Deviation

n ‘ n 1/2

T) '1: . a a ll, 3 2]
2 : — r, _ — E—lOb
() 32(n) [kLk(}J) (kLk yj) ( )
J J
Mean Square Derivative

n
(3) 5,1(n)=L%Zk(y:— (11)]

a . . .
where yJ 18 the J-th estimate of the envelope.

1/2

)tO that the last two estimators only give the magnitude of the

2 but not the sign.

'der to make choices between these different alternatives, we
Iresentation criterion to compare waveforms (which will be the
timates of the underlying signal properties). We note that mean
or is inappropriate for the types of comparisons we wish to
The reason is its insensitivity to very sharp derivative
An alternative criterion can be derived by use of the Chebyshev
discussed in Equation (E—8). By algebraic manipulation of
ality using the weighted difference between the two waveforms,

ive at a criterion that gives a better comparison. For two

 

 

A

 

waveforms, y (n) and y (n), n = l, 2, . . . N define the Chebyshev
1 2

weighted error by:
. 2 1/2 1/2

N 2
g _[_1_(e2(1) + e2(N) > + 1; [C(11) _ e(n-l) j < e(n))
C _ 2 N N (n) _ y (:14) y (n)/
y:(l) y:( ) “:2 ye a 2 (E_11)

where

an): a(m -an)

The assumption of local ergodicity must be invoked to relate this measure

to the probability of exceeding a bound as in Eqn. (E—8) (much the same as

the justification for mean square criteria). However, Eqn. (E-ll) can be

used to compare estimators. In Figure —2, two estimates and a smooth

envelope are shown. The rapid variations are averaged by the mean square error
computation so that the value for the two estimators is approximately

equal (0.098 and 0.101). However, using the Chebyshev weighted measure,

the difference in the two estimators is apparent, indicated by a calculated
value of 0.2974 for the rapidly varying one and 0.1689 for the smooth

one.

Envelope and frequency estimators must work in different situations
ranging from slow but large magnitude variation and possibly a smooth fre—
quency transition (such as during vowel formant portions) to rapid, small
amplitude and frequency changes (which occur during fricatives). We will
first consider the typical vowel onset which occurs in the order of 50
to 100 milliseconds (see Figure 3). In order to compare our different

estimators, we will use the following idealized vowel onset waveform:

mmOP<Eme m>_._.<>_mwo meJm>Zw O>>._. me meOE

ovod omod oNod o 5.0 08.0

_ . _ _

mE omod

 

II ...Eo u mo<mm>< 02.9.5 \\
o I 20.55%. maoJmSzM
mud n a. x \
ma04w>2m 33084

w>.k<>_mwo wd04m>2m 4<DHU<

wE m. n ._<>mm._.z_ me .

II 822:8 IR . . . II 8
Q

7 D
a 4 2: o; n .3555. mam

 

 

 

E w

219

Let §(t) = a(t)eja(t) OStS.050 (E-lZ)
Where
t
u(t) = 80 + jo arc tan Ia(s)/1001n(s) ds

d(t) = w t + B It {Ms/.05)8 _ 2(s/.05)2} ds
0 O 0

a is initial amplitude (= 10. )

w is initial frequency (: 2n 2000)

B is frequency deviation

H(-)is a I.I.D. r.v. with Gaussian distribution with mean independent
of a(') and a(-) at time t with mean E(n) : l and standard deviation Oa(n) = a
By selecting values for the parameters; frequency deviation; BO, and amplitude
noise standard deviation; a”, we can generate time functions with complex
nonlinear behavior in order to investigate the stability properties of the
estimators we have chosen. Figure E—2 shows the ideal envelope derivative
(B0 = a2 : 0) and two envelope derivative estimators for the Hilbert envelope
of x(t).

Even for this idealized model, we have five parameters to change in
order to investigate the properties of the three envelope derivative esti-
mators: (l) absolute or Hilbert envelope, (2) variation of subinterval
length, (3) variation of sliding average length, (4) amount of frequency
deviation, and (5) envelope standard deviation. Typical values for varia-
tions of number of these parameters are shown in Tables E—l, E—2, and
E—3 and Figures E-2, E-3, and E-4. Table E—l and E—2 show variation of
the sliding average length for Bo=0 and various values of envelope standard
deviation for both absolute and Hilbert envelope derivative estimators.

Figure E—3 shows a typical plot from Table E-2. Table E—3 indicates variation

220

induced in envelope derivative estimators by frequency changes. Figure E-4
shows the Chebyshev weighted error for the three envelope derivative esti—
mators as the subinterval length is increased (evaluation of the stability

of the derivative estimators is sufficient to tell us about the estimation of
the enve10pe itself). Rather than discuss all these data in detail, we

will state the choice of estimator subinterval length and sliding average

length and give reasons for that choice.

We note from Table E—l and E—Z that the Hilbert and absolute
envelope estimators have almost exactly the same values of Chebyshev
weighted error when there is no frequency deviation. Because of the
additional complexity in computing the Hilbert envelope, this would
recommend the absolute value envelope as an estimator. However, Table
E—3 shows that for frequency deviations the absolute value estimator
gives an order of magnitude higher Chebyshev weighted error than the
Hilbert envelope estimator.

Note that the behavior or the sliding standard deviation as an
estimator of envelope derivative behaves much more stably and gives, in
most cases, a lower Chebyshev weighted error. Table E—3 for absolute
value envelope estimator shows this very dramatically. For this estimator,
sub—interval lengths on the order of 0.5 to 2 milliseconds give approximately
the same Chebyshev weighted error. This result can be anticipated from the
form of the mathematical relationships between the three derivative esti—
mators since both the l—point sliding difference and mean square derivative
have unaveraged terms that vary as the random samples. For this reason,
as shown in Figure E—tl they are very dependent on the variation of the

sub— interval average values.

 

221

TABLE E-l ENVELOPE DERIVATE CHEBYSHEV WEIGHTED
ERRORS USING HILBERT ENVELOPE ESTIMATOR

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

mmdope Estimator Envelope Estimator
uﬂ Dev. 1 2 3 Std. Dev. 1 2 3
.2 .0540 .0916 .0514 .2 .0615 .0470 .0616
.1 .1076 .1830 .1093 .1 .1230 .0939 .1227
.6 .1618 .2712 .1619 .6 .1815 .1409 .1832
.8 .2150 .3650 .2208 .8 .2161 .1879 .2132
1.0 .2687 ' .1553 ;.2769 1.0 .3076 , .2348 .3037 j
1 ms sliding average 6 ms sliding average
1 ms sub interval 1 ms sub interval
ﬂWﬂope Estimator Envelope Estimator
rd Dev. 1 2 3 Std. Dev. 1 2 3
I I I
3 0256 .0396 .0280 .9 1 .0111 2 .0226 i .0119
-1 .0515 .0791 .0553 .1 f .0827 § .0156 1 .0830 f
: I = “i
-6 .0776 .1196 .0816 .6 g .1212 3 0688 § 1232 1
-8 . .1011 f .1591 § .1070 i .8 i .1658 1 .0922 g .1622
1-0 i .1308 ' .1995 E .1116 g 1.0 i .2071 I .1157 E .2002
8 ms sliding average 10 ms sliding average

1 ms sub interval 1 ms sub interval

 

Envelope
.td. Dev.

.2

Envelope
Std. Dev.

222

TABLE E—2 ENVELOPE DERIVATIVE CHEBYSHEV WEIGHTED
ERRORS USING ABSOLUTE VALUE ESTIMATOR

 

 

 

Estimator
l 2 3
.0543 .0916 .0518
.1085 1830 .1102
.1628 2741 .1662
.2170 3649 1 2226
.271: 4552 E.2791

 

4 ms sliding average

1 ms sub interval

Estimator

 

I
I

 

l 2 J
.0256 E .0396 0280
.0514 g .0793 0552
.0776 E .1192 .0817
.1040 g .1593 1072 ‘
.1307 .1995 .1318 E

8 ms sliding average

1 ms sub interval

Envelope
Dev.

Std.

‘)
...4

1.

Envelope
Std. Dev.

i\;

0

 

 

 

Estimator
1 2 3

I
.0614 .0470 0616
.1228 .0941 1226
.1843 .1412 .1831
.2458 .1883 .2433
.3073 .2353 E .3037

6 ms sliding average

1 ms sub interval

Estimator

 

 

 

l 2 3
.0413 .0227 .0419
0827 .0456 .0830
1211 .0687 I .1231 I
1657 .0921 g .1622 5
.2072 .1157 1 .2002 A

10 ms sliding average

‘1 ms sub interval

I1" 1

Sliding
\verage

4

6

8

10

Ereq.

Dev.

223

TABLE E-3 EFFECTS OF FREQUENCY CHANGE 0N ENVELOPE
DERIVATIVE ESTIMATORS-l MS SUBINTERVAL

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Est‘ '
stimator Sliding Estimator

l 2 3 Average 1 2 3
1.805 1.831 .031 4 .2066 .2599 .2057
1.405 1.172 .644 6 .1077 .1369 .1122
1.155 .8593 .432 8 .0990 .0693 .0993
.8641 .6001 .199 10 .0836 .0383 .0826
Absolute envelope derivative Hilbert envelope derivative
vs. sliding average vs. sliding average
Freq. dev. - 0.25 Freq. dev. : 0.25

Estimator Estimator
Freq.

1 .5 3 DCV . 1 ‘3 1
.75 H) .2745 .I.286 .05 .0129 .LH(X1 .0136
.6717 .1400 .9978 .10- 0750 0255 0852
.6482 .2533 1.295 .15 .1685 .0115 .2001
1.035 .3862 2 512 .20 1351 0:88 1172
.8641 .6001 1.199 .25 .0836 .0383 .0826
Absolute envelope derivative Hilbert envelope derivative
vs. freq. dev. vs. freq. dev;

sliding average x 10 ms sliding average : 10 ms

 

FIGURE E—3

I ms SUB INTERVAL I
ABSOLUTE ENVELOPE

ENVELOPE STD. DEVIATION = 0.6

FREQUENCY DEVIATION = 0

l l l I I I I I l _

5 10 ms

CHEBYSHEV WEIGHTED ERROR FOR ENVELOPE DERIVATIVE
ESTIMATORS AS A FUNCTION OF SLIDING AVERAGE LENGTH

225

IHOZmJ ._<>mw.rz_nm3m ”.0 ZOEbZDu

 

< m< 38.32.58 m>:<>_mwo mmodm>2m moi memmm 856m; >wzm>mmzo 2.1 $56;
MW 06 9m Qv 0d QN QP QO
1 _ 2 _ _ 2 a 88
Il.mQHO
N
IIAREd

IlmKQo

Ilnﬁﬁo

 

mOmmw

226

m0<mm>< 02.07% m0 202.023“.

< w< meF<E_._.mm w>_._.<>_mmo mm04w>2m m0“. mOmmw

mE NP op m o

\

Dm._.10_m>> >mIm>meU mum mmDOE

 

I 4 _ 2 2

 

 

I1 mmod

ZO_._.<E<> wm04m>2w OZ
mNd n ZD__<_>wo >OZwDmeu
meJm>Zm wPDJOmmd

J<>mw._.2_ mDm mE F
l omod

I1 mnod

 

227

Figures E-4 and E—5 show an increase in the estimation error for
larger values of subinterval length and sliding average. Reference to
Figure 13 and the corresponding discussion of distortion in linear filters
induced by large frequency changes explain this increase. The intuitive
notion of longer averaging time can be misleading in this complex situation.
The error rates in Figure E-4 and E-5 were derived for a(t) = 100.

There appear to be two sources of variance in envelope estimation:
the first is induced by the small number of samples,which would require
longer averaging times,and the second is the distortion caused by the
time—varying parameters, which would require shorter averaging times. We
must select a compromise value,which appears to be approximately 1 ms
subinterval length and 6—10 ms sliding average length.

We may conclude that,for this idealized Speech acoustical signal,
the most stable estimator is the Hilbert envelope with sliding standard
deviation as a derivative estimator. The sliding mean of the absolute
value of x(t) gives some envelope estimation distortion, primarily during

epochs with changing frequency, but requires much less computation.

 

 

 

Illa-.11 3.1.9. in

 

 

 

 

i

 

 

1111

6

“III!

9
6
4
0
3
0
3

1211

ll
3

CAN STATE UNIVERSITY LIBRARIES

MICHI
l

 

1.3.1.:

 

 

.1
.. . . y . ., . I

 
 

 
 
 
 
 

 
 

         

 

 

  

 

X r... ... w...”
73:92 ...-ﬁt:
3:! 23:0]! I
. .1! E ....
ER .13... ..Z.