This is to certify that the thesis entitled

Automatic Speech Recognition Based on a New Segmentation Procedure

presented by

Earl J. Craighill

has been accepted towards fulfillment of the requirements for the Ph.D. degree in EE.

Major professor

Date: 1/13/71

ABSTRACT

AUTOMATIC SPEECH RECOGNITION BASED ON A NEW SEGMENTATION PROCEDURE

By Earl J. Craighill

A procedure for segmentation of an acoustical speech signal is crucial to the design of any system for automatic speech recognition (ASR), yet no adequate scheme currently exists. This study proposes and investigates the implementation of a procedure for segmenting input in the form of connected speech from divers speakers using unlimited vocabularies.

A segmentation procedure which assigns linguistic elements, such as phonemes, to contiguous acoustical signal intervals would be hopelessly complex because of the many-to-many correspondence between currently used linguistic elements and portions of the acoustical signal. Instead, we propose a method for dividing the acoustical signal into analysis epochs with minimal linguistic specification so that they are independent of speaker and context. Each epoch is defined by homogeneous signal characteristics; that is, a generation model is identified with associated parameters, and nonlinear time-varying differential equations are derived for these parameters. The equations are used to track the parameter values, and an epoch boundary is set at the point where they no longer predict (within a threshold) the characteristics of the observed speech signal. From the functional forms of the differential equations, we derive further processing algorithms (analogous to data-dependent adaptive filters) for each epoch. Identification of the functional forms gives a gross linguistic classification which forms the basis for classification of the epoch.

The differential equations are characterized in terms of sliding moment averages of envelope and zero-crossing estimates on bandpass-filtered speech signals. This method of estimation is amenable to low-cost hardware implementation and requires few computations; thus, connected speech may be analyzed in real time without overloading a standard general-purpose computer. Asynchronous, real-time classification is achieved by decomposition of the decision algorithm by a process similar to that used in Kilmer's model of the reticular formation.

Overlapping bandpass filters are used to give an initial separation of acoustical features. Experimental evidence shows how this reduces the speaker dependence of further acoustical measurements. A decision logic structure is specified and discussed, showing that it is possible to select appropriate preprocessing procedures to focus attention on significant features of an acoustical signal epoch and to accentuate signal characteristics closely correlated with linguistic features. This preprocessing, when coupled with the syntactical structures developed from theoretical linguistics, is hopefully a first step in recognizing human connected speech from different speakers.
AUTOMATIC SPEECH RECOGNITION BASED ON A NEW SEGMENTATION PROCEDURE

By

Earl John Craighill

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1971

ACKNOWLEDGEMENTS

The author wishes to thank Professor William Kilmer for his continuing support, guidance, and patience during the preparation of this thesis. These qualities were shown both in and out of the classroom and are deeply appreciated.

For their constructive comments and evaluation of this thesis, the author wishes to acknowledge the other members of his doctoral committee: R. C. Dubes, T. Guinn, C. L. Park, R. F. Reid, and H. Salehi, and also colleagues at Stanford Research Institute: W. F. Foy and W. P. Rupert.

The research was supported at Michigan State University under Air Force Contracts No. AF-AFOSR-1023-66, 67, 68 and at Stanford Research Institute under IR&D Project No. 656531-329.

Without the support and encouragement of many people, this thesis would not have been finished. A few of these people are: my supervisor, D. F. Babcock; secretaries, A. Guinn, S. Peterson, and K. Spence; my mother and father; and my wife, Karilyn.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

I INTRODUCTION
   A. Overview
   B. The Structure and Interrelations of Acoustical Features in Human Speech Signals
   C. Segmentation of the Acoustic Speech Signal into Analysis Epochs
   D. Preprocessing of the Acoustical Speech Signal
   E. Decomposition of Pattern Recognition Algorithms

II REPRESENTATION OF TIME-VARYING SIGNALS
   A. Analytic Signals
   B. Sliding Fourier Series
   C. Response of Linear Filters to Analytic Signals
   D. Estimation and Segmentation of Instantaneous Signal Parameters

III THE USE OF LINGUISTIC THEORY FOR THE DECODING OF SPEECH ACOUSTICAL SIGNALS
   A. Introduction
   B. Stratification Model for Generative Phonology
   C. Recognition Phonology

IV RECOGNITION STRUCTURES FOR REAL-TIME SPEECH PROCESSING
   A. Reduction of Dimensionality Using Bayes' Formulation
   B. Quasi-Independent Probability Distributions
   C. Specification of First-Level Decision Structure
   D. Proposed First-Level Recognition Block Diagram

V CONCLUSIONS AND RECOMMENDATIONS FOR FURTHER STUDY

BIBLIOGRAPHY
APPENDIX A   Description of Sapir's Pseudo-Language
APPENDIX B   Recording Apparatus Used to Collect Experimental Data
APPENDIX C   TIMSER: A Program for Interactive Analysis of Time Series
APPENDIX D   Sliding Power Spectra Showing Vowel Transition
APPENDIX E   Instantaneous Estimators of Time-Varying Parameters

LIST OF TABLES

Table E-1   Envelope Derivative Chebyshev Weighted Errors Using Hilbert Envelope Estimators
Table E-2   Envelope Derivative Chebyshev Weighted Errors Using Absolute Value Estimator
Table E-3   Effects of Frequency Change on Envelope Derivative Estimators, 1 ms Subinterval

LIST OF FIGURES

Figure 1    Typical ASR System Based on Discrete Encoding Model
Figure 2    Sonogram of English Word, Rudder, Showing High Value of Frequency Derivative
Figure 3    Speech Acoustical Signal Showing Short Transient Phenomena
Figure 4    Consistent Time Waveforms for Several Speakers from Different Bandpass Filters
Figure 5    Schematic of Multifilter Recognition Logic
Figure 6    Quasi-Statistical Formulation of Local PR Algorithm
Figure 7    Short Transient Phenomenon Which Is Difficult to Analyze with Fourier Series
Figure 8    Idealized Fourier Coefficient Response to Varying Frequency Input
Figures 9a, 9b  Magnitude of Fourier Coefficient Outputs for Time-Varying Frequency Input: Actual and Quasi-Stationary Terms Using Instantaneous Frequency of Input
Figure 10   Formant Envelope and Frequency Transition Causing Delay Distortion
Figure 11   Correspondence of Z-Plane Spirals and S-Plane Lines for the Chirp Z-Transform (from Rabiner, Schafer, and Rader)
Figure 12   Time-Varying Filter for Formant Parameter Estimation
Figure 13   Bandwidth Requirements for Large Frequency Derivatives
Figure 14   Estimation Procedure for Time-Varying Parameters for Bandpass Filtered Speech Signals
Figure 15   Representations of Bandpass Filtered Speech Signals
Figure 16   Smoothed Differentiator Transfer Function
Figures 17a, 17b  Standard Deviation Versus Mean for Envelope (Lower) and Frequency (Upper)
Figure 18   Segmentation Results Shown with Bandwidth Estimators, [dhuath] (male)
Figure 19   Segmentation Results for [umbif] (female)
Figure 20   Formal Language Model
Figure 21   Level and Ranks of a Generative Phonology
Figure 22   Relationship of Units Within a Stratum
Figure 23   Composition Rules and Example of Their Application on Morph Stratum
Figure 24   Several Linguistic Phenomena Described by Alternation Rules
Figure 25   Model of Reco-Generative Phonology
Figure 26   Recognition System Without Feedback
Figure 27   Recognition of Vowels from Normalized Second Formant Information
Figure B-1  Apparatus for Recording Speech Signals on Analog Tape
Figure B-2  Apparatus for Multiplexing and Digitizing Data from Analog Tape
Figure B-3  Overlapping Filter Bank
Figure C-1  Operational Diagram of the TIMSER Ensemble
Figure D-1  Dhuath 16 BE 1 Real-Time Wideband Signal
Figure D-2  Dhuath 16 BE 1 Filter Bandwidth 458-1167 Hz
Figure D-3  Dhuath 16 BE 1 Filter Bandwidth 1467-2917 Hz
Figure D-4  Dhuath 16 BE 1 Filter Bandwidth 577-1867 Hz
Figure E-1  Operations for Parameter Estimation
Figure E-2  Two Envelope Derivative Estimators
Figure E-3  Chebyshev Weighted Error for Envelope Derivative Estimators as a Function of Sliding Average Length
Figure E-4  Chebyshev Weighted Error for Envelope Derivative Estimators as a Function of Sub-Interval Length
Figure E-5  Chebyshev Weighted Error for Envelope Derivative Estimators as a Function of Sliding Average

I INTRODUCTION

I-A Overview

A procedure for segmentation of an acoustical signal is crucial to the design of automatic speech recognition (ASR) systems.
As yet, however, no adequate procedure exists for real-time automatic recognition of connected human speech from several speakers. Principles from communication theory and linguistic theory must be incorporated in order to derive an efficient segmentation procedure. The language of modern communication theory, familiar to the electrical engineer, most appropriately describes the input with which we are concerned. For this study, we limit the input to connected phrases of naturally spoken human language that have been transduced into time-varying analog voltages. The output of an ASR system, usually in the form of a sequence of linguistic elements,* is generally described in the framework of linguistic theory, primarily phonology.

* These linguistic elements may be phonemes, distinctive features, or words. We are specifically thinking of only one level of classification rather than a composite process such as identification of phonemes and then morphemes. Our recommendation for a first element is smaller than the usual phoneme or distinctive feature.

At first glance, the goals of communication theory and phonology (namely, an accurate description of the current state of the process, acoustical signal, or sequential linguistic elements) seem to be compatible. However, when one considers the large number of variations in the acoustical signal possible for any given linguistic element, the situation becomes hopelessly complex. Many attempts have been made to eliminate this variation and thereby preserve only the meaningful relationships of the linguistic elements. Successful decoding of this complex acoustical signal by human listeners involves at least the application of knowledge acquired from previous experiences of hearing and speaking natural language and the listener's expectation of what will be said. Thus, at this level, the basic assumptions of engineering communication theory are no longer valid, and there is no applicable strong property of ergodicity.

The purpose of this thesis is to describe a segmentation procedure which not only specifies basic units for recognition but also gives an adequate description of the complicated speech acoustical signal. This description is prescribed by the requirements of further linguistic decoding (words, phrases, ...). Further, the segmentation procedure that identifies lower units will direct the higher levels of decoding so that the search space is kept within practical bounds. The segmentation procedure requires three subsystems based on a parametric generation model of the acoustical signal:

(1) Initial estimation of parameters.
(2) A classification based on parameter estimates for signal types.
(3) Selection of appropriate time-varying filters operating on the input to give refined parameter measurements.

The requirements of these diverse topics are discussed in terms of a representation of the acoustical signal which is developed from the viewpoint of time-varying differential operators. Its use in deriving estimators and detecting initial changes in these estimators is verified experimentally.

In the remaining sections of this chapter, currently used segmentation procedures are discussed in light of the complex nature of the information-bearing features present in the human speech acoustical signal. A parallel interrelated feature structure is described that is capable of recognizing a shift of the pertinent information from one feature to another.
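The boundary rule of the proposed procedure can be sketched concretely. The following is a minimal illustration in Python/NumPy, not the procedure derived in Chapter Two: the function name, the linear extrapolation used as the predictor, and the single fixed threshold are all simplifying assumptions standing in for the nonlinear time-varying differential equations and their estimators.

```python
import numpy as np

def segment_epochs(params, threshold):
    """Divide a parameter track into analysis epochs.

    params    : (T, d) array of per-frame parameter estimates (e.g.,
                envelope and zero-crossing frequency from one bandpass
                filter channel)
    threshold : prediction-error level that declares an epoch boundary
    """
    boundaries = [0]
    for t in range(2, len(params)):
        # Linear extrapolation from the two preceding frames stands in
        # for the differential-equation parameter tracker.
        predicted = 2.0 * params[t - 1] - params[t - 2]
        # An epoch boundary is set where the current model no longer
        # predicts the observed parameters within the threshold.
        if np.linalg.norm(params[t] - predicted) > threshold:
            boundaries.append(t)
    return boundaries
```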
Linguistic information is conveyed with respect to two levels: the vowels of an utterance form a primary substrate, and the consonants are incorporated as perturbations of it. In order to unravel this complicated structure, broad classes of speech sounds that represent different types of signal characteristics must be defined; this classification can then be used to direct further analysis for recognition. By this method, formant* theory is related to higher levels of linguistic decoding. Various preprocessing schemes are considered which are commonly applied to ASR systems for the purpose of isolating individual formants. To satisfy the requirement for real-time operation, a preprocessing scheme is chosen which uses a bank of overlapping wideband filters (with sufficient bandwidth to avoid distortion) to remove noise and to provide a compact representation of the salient features required for the recognition task. Real-time operation requires decomposition of the decision process, resulting in fewer computations and a recognition structure tailored to the complicated overlapping nature of the speech signal.

* By a formant, we mean the resulting time waveform for one cavity of the vocal tract excited by glottal pulses or frication noise.

In Chapter Two, the acoustical properties of the speech signal are modeled as a composite nonstationary stochastic process, and the mathematics of communication theory are used formally to describe the process's complicated nature. One isolated formant is modeled by a time-varying differential operator involving envelope, frequency, and bandwidth parameters. The inadequacies of fixed-frequency types of analysis (such as sliding Fourier transforms) are discussed, and requirements for low-distortion filtering are derived. Then the transient response of linear filters to envelope and frequency changes found in typical acoustical signals is derived in a way that offers new insight into the behavior of analysis procedures and defines requirements for the preprocessing wideband filters. Formulas for real-time pointwise estimators of the significant parameters are derived, and a predictive differential-equation segmentation procedure is specified that delimits epochs of the acoustical signal having homogeneous signal characteristics.

In Chapter Three, this segmentation procedure is discussed within the framework of traditional linguistic theories. The complicated structure of human communications requires additional mechanisms (1) to determine the linguistically significant changes in signal parameters, and (2) to incorporate contextual information into the decision process (which, in turn, resolves ambiguities and directs further classification). Structural theories are modified to include recognition and to show the effects of linguistic rules on lower elements (effects of stress on vowels, etc.). The use of the segmentation recognition procedure proposed here is basic to a feedforward system, thus eliminating complicated feedback analysis-by-synthesis techniques.

In Chapter Four, the formant representation and segmentation results allow application of state-of-the-art detection/recognition techniques* to a restricted speech signal (without the complex interrelationships between features). Study of the Bayes minimum risk solution reveals that the primary concept is a probability mixture formula for the outputs of nonlinear estimation filters, each tailored to a possible generating model for the input signal and (correlated) noise.

* Section I-E contains a discussion of terminology that is used in this study for the pattern recognition discussions.
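In standard notation, the probability mixture concept can be written out as follows; this is a textbook Bayes formulation offered as a gloss, not the development given in Chapter Four:

```latex
% Mixture over r candidate generating models (speech sound classes):
\[
  p(\mathbf{x}) \;=\; \sum_{i=1}^{r} P(\omega_i)\, p(\mathbf{x} \mid \omega_i),
  \qquad
  \hat{\omega} \;=\; \arg\max_{\omega_i} \; P(\omega_i)\, p(\mathbf{x} \mid \omega_i),
\]
% where each class-conditional density p(x | w_i) is evaluated by a
% nonlinear estimation filter matched to generating model w_i, and the
% minimum-risk decision under 0-1 losses selects the largest weighted term.
```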
Several difficulties are noted for implementation of this optimal solution: realization of the nonlinear filters, correlation between different (suboptimum) filter outputs, and conflict between classifications on different filter outputs. It is concluded that a heuristic recognition scheme tailored more to the filter bank used in this study would be a better choice. Techniques are developed to reduce the dependence among the probabilities computed on the different filter outputs.

A first-level recognition system which can operate asynchronously in real time is described. A nonlinear iterative structure determines which filters have pertinent formant information. Specialized algorithms derived from linguistic rules are then applied to these filter outputs to determine the needed information for classification of this particular signal epoch. The output is a classification which is compatible with higher levels of linguistic analysis. A second stage with formant tracking filters guided by the initial classification gives the ability to focus attention on only the desired acoustical features. Thus, the complex acoustical signal can be segmented in time into homogeneous epochs and also into concurrent features of varying frequency with well-defined mathematical models and time-varying parameters. A total system design incorporating this segmentation procedure as a first step will facilitate the use of human speech as input to machines for robot control, text manipulation, command and control of space vehicles, and many other man/machine tasks.

I-B THE STRUCTURE AND INTERRELATIONS OF ACOUSTICAL FEATURES IN HUMAN SPEECH SIGNALS

The object of an ASR system is to determine recurrent elements from measurements made on acoustical speech signals. Figure 1 shows a composite of several approaches to automatic speech recognition based on the theoretical encoding of speech shown in the upper block. This theoretical encoding is motivated by Hockett's1 discussion of a GHQ (grammatical headquarters) emitting a discrete flow of morphemes which are encoded into a discrete flow of phonemes. Then, a speech transmitter converts the discrete flow of phonemes into a continuous speech signal.

FIGURE 1 TYPICAL ASR SYSTEM BASED ON DISCRETE ENCODING MODEL
[Block diagram. Theoretical encoding: word -> encoding process -> sequence of elements -> a set of parameter values for each element -> slurring box (alters the sequence of parameter values to give continuous control of the articulators) -> acoustical speech signal. ASR system: segmentation of the signal at time points into a linear sequence of epochs -> analysis and parameter measurements on each block -> comparison of measurements with templates (stored references, etc.) for idealized elements -> sequence of idealized elements.]

The determination of parameter values for each idealized element is motivated by the following studies. Peterson and Barney2 measured first and second formant frequencies of nine English vowels in a fixed consonantal context (the word h_____d). Gerstman3 reworked their data, normalizing for each speaker, and showed a sufficient amount of separability of the measurements for vowel classification (in a fixed context for isolated words). The correspondence between a fixed frequency or hub of origin and consonants was first proposed by Potter, Kopp, and Green.4 Classification of stop consonants by association with a frequency value was modified by Cooper et al.5 and Yilmaz.6 They proposed that consistent measurements for stop-consonant classification could be made relative to the following vowel formant frequencies. The slurring box accounts for perturbations (hopefully slight) of these parameter values caused by environment and speaker variations.
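The Peterson-Barney/Gerstman result lends itself to a small illustration. The sketch below (Python/NumPy) classifies a vowel by nearest centroid in a per-speaker normalized F1-F2 plane; it is hypothetical throughout: the centroid table contains placeholder values, not Gerstman's published numbers, and min-max range normalization is only one simple reading of his per-speaker rescaling.

```python
import numpy as np

# Hypothetical centroids for a few English vowels in normalized
# coordinates (0-1 within each speaker's F1 and F2 range).
NORMALIZED_CENTROIDS = {
    "i":  (0.05, 0.95),
    "ae": (0.85, 0.55),
    "a":  (0.95, 0.20),
    "u":  (0.10, 0.05),
}

def classify_vowel(f1, f2, f1_range, f2_range):
    """Nearest-centroid vowel classification after per-speaker range
    normalization; f1_range and f2_range are (min, max) formant values
    observed for this speaker, e.g., from a calibration passage."""
    def norm(f, lo_hi):
        lo, hi = lo_hi
        return (f - lo) / (hi - lo)
    x = np.array([norm(f1, f1_range), norm(f2, f2_range)])
    dists = {label: np.linalg.norm(x - np.array(c))
             for label, c in NORMALIZED_CENTROIDS.items()}
    return min(dists, key=dists.get)
```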
The segments studied may be separated by epochs (portions WORD I THEORETICAL ENCODING ENCODING PROCESS Q Q Q ° ' ' SEQUENCE OF ELEMENTS EACH ELEMENT GIVE A SET OF PARAMETER VALUES [ SLURRING BOX 1 VALUES TO GIVE CONTINUOUS J ALTERS SEQUENCE OF PARAMETER CONTROL OF ARTICULATORS I ACOUSTICAL SPEECH SIGNAL ASR SYSTEM i l SEGMENTATION ] SEGMENT TIME POINTS LINEAR SEQUENCE OF EPOCHS 1 OF SIGNAL 7 7 EACH BLOCK IS ANALYZED AND I - MEASUREMENTS OF PARAMETERS . ARE MADE [ ] MEASUREMENTS ARE COMPARED TO TEMPLATES (STORED REF', ETC.) FOR IDEALIZED ELEMENTS E] D SEQUENCE OF IDEALIZED ELEMENTS FIGURE 1 TYPICAL ASR SYSTEM BASED ON DISCRETE ENCODING MODEL of signal) rather than points, as in the case of Reddy7 who analyzes only steady—state portions (i.e., portions with constant values of envelope and frequency) and ignores the transition portions between them. The opposite approach is taken by Dixon et al.“ in their analysis and segmentation procedure. They define a new element called the transeme, which is a "dynamic segment describable on a production basis as the transition from one relatively steady—state articulatory con— figuration to another." The criterion for segmentation and further analysis may not be related to linguistic elements at all, as in the case of Gazdag.9 His segmentation points are determined completely in terms of the measurement procedure that he uses to analyze the speech waveform; hence they are independent of any exterior linguistic criterion. ASR systems developed along these lines have no ability to ignore Speaker and environment variations or free phonetic variation; i.e., in midwest English, prevoicing before [b] or [d] is optional. Usually a separate "case" (pattern class) is set up for each; hence the success that these various ASR systems have in isolated sound situations or in one—perSOn conversational speech cannot easily be extended to connected conversational speech for many speakers. Harris10 has discussed the extremely difficult problem of trying to define linguistic elements as direct descriptions of portions of the flow of speech. He finds it convenient in his analysis to define certain elements which extend over quite long periods and others which extend over short periods. ”In the course of reducing our elements to simpler cmnbinations of more fundamental elements, we set up entities 10 such as junctures and long components which can only with difficulty be considered as variables directly representing any member of a class of portions of the flow of speech." (p. 18) A similar formalization in the early work of Fant, Jakobson et a1?1 describes distinctive features that are parallel rather than serial descriptions of the acoustical waveform. Extensions of this approach by Chomsky and Halle12 are discussed at the end of this section. Bobrow, Klatt, and Hartley 1“ have proposed an ASR system based on this idea and derived independent parallel features from the acoustical signal and performed classification on those features. Other ASR systems using independent features have been proposed by Hill14 is and FOCht. Bobrow et aL discuss the difficulties of recognizing conver— sational speech for divers speakers in terms of: (1) Consistency of each speaker in repeating words for training (giving rise to phonetic variation) (2) Speaker—dependent variation in their measurements (shifts in formant frequency location) (3) Segmentation of longer utterances. 
These difficulties are caused in part by the extremely complex nature of parallel features and the interrelations between them. Ohman16 has studied various vowel/consonant/vowel (VCV) combinations and has stated that it is impossible to treat even these short utterances as three successive gestures. It is possible to analyze them only by considering the stop-consonantal gesture as superimposed on the substratum determined by the two vowels and the transition between them. Houde17 has investigated this further by means of X-ray movies of the configuration of the tongue during articulation. The dynamic trajectories of points on the tongue during articulation of VCV nonsense words can be decomposed into target-directed (targets are long-duration steady-state vowel positions) and deviation (90° to the target direction) components.* Five facts are clear:

(1) The deviation component is characteristic of the consonant ([b] and [g] were used).
(2) The characteristic deviation for [b] and [g] was not toward a target or hub but rather a consistent deformation of articulator (primarily tongue) configuration.
(3) Targets of preceding vowels are changed by the consonant (i.e., I in [Ige] has a different steady-state position than I in [Ibe]).
(4) Stress placement affects vowel target positions.
(5) Timing of the target-directed component was dependent only on distance between target positions and not on speed of articulation, speaker, or consonantal environment for the limited data investigated.

* This decomposition is slightly different from Houde's, in order to demonstrate the concept of overlapping features.

We can discuss these results in a way more compatible with linguistic theory by use of Lamb's18 concept of a medium as a most unrestricted (or most predictable) form; the pertinent features which convey information are then described as perturbations of that medium. He defines a phonetic feature as distinctive if its presence is not determined by its environment. This idea may be extended to explain the Ohman and Houde data by stating that the vowel-to-vowel transition is actually the medium for the consonantal distinctions.

We should define acoustical features more generally than just those defining linguistic events.* These acoustical features may be classified as:

(1) Linguistic
(2) Speaker signature
(3) Speaker emotional state.

* By "linguistic" we mean the specific content of the speech waveform that is being used to communicate a discourse or text. For the purposes of man/machine communication, this definition will be sufficient. We do not wish to get into a discussion of various gestures, intonations, etc., which can also convey information.

The interrelationship of all these features, present simultaneously or preceding or following in time, may be correlated with the dominant (distinctive) feature, but this correlation is usually situation (speaker, context) dependent and thus can introduce much variation in determining recurrent elements. It has been pointed out by Harris that the times of start and stop of different acoustical features may not be coincident. Thomas19 suggests that a speaker is able to adjust only one formant frequency; other frequencies are allowed to fall where they may. He states further that this formant is always the second, but the data presented by Ohman does not support this. Rupert20 has studied isolated words spoken by three males and two females; he suggested that:
(1) Each speaker does consistently control at least one acoustical (linguistic) feature which is usually less than the entire acoustical signal (i.e., one or two formants).
(2) Although the controlled feature(s) (say, second formant) may not be the same in absolute value for all speakers, the time patterns are similar and can be identified by their recurrent nature.
(3) There is a high degree of recurrence across speakers of these controlled features.
(4) Other acoustical features (which may be correlated with linguistic ones) that occur vary considerably according to speaker, phonetic environment, etc.

Ohman has proposed a motor-control model to partially explain his data, saying that for a VCV sound there are independent signals (or parameters, in our theoretical model) for the first vowel, the consonant, and the second vowel. The various muscles work in a coordinated fashion to produce continuous changes in articulatory configuration. This approach has actually been used to some extent in the work of Reddy. He first classifies his segments into phoneme classes (vowel, fricative, stop, nasal, liquid) and then performs a specialized analysis on each segment which is directed by the phoneme class label.

Based on this discussion we formulate the following premises about a feature description of the speech signal:

(1) Only a subset of the acoustical features present in a time epoch of speech are linguistically significant; this subset can be recognized by the precise, repeatable nature of its members. We do not mean precise values (formant frequencies of 500 Hz, 1500 Hz, and 2400 Hz) but rather precise time behavior within physical (motor control) and linguistic* constraints.

(2) Epochs of the acoustical signal can be equivalenced to classes determined by a subset of linguistic acoustical features. These classes can be defined (by the choice of the subset of features) in such a way that they are situation (context, speaker) independent. Roughly, the class labels are a generalization of the consonant and vowel labels used by linguists and also a refinement of Reddy's phoneme classes and Rupert's production modes (PM's).

(3) Further feature analysis is simplified considerably, and a more precise syllable (canonical form) analysis can be performed by a directed-search technique based on the above classification. This removes the inherent circularity in many classification schemes involving normalization (analogous to the visual recognition problem of finding an object of interest to focus on while it is out of focus).

(4) Once vowel (peak of syllable, in Hockett's sense) classes are specified, they set up a primary formant transition structure.

(5) Consonantal modifications are with respect to the primary formant structure and hence will be termed secondary.

(6) There is interaction between primary and secondary acoustical features, but the class labels can be assigned independent of this interaction.

* As noted by Ohman, consonantal variations of formant transitions are different for Russian speakers than for English speakers.

The concept of precisely controlled features determined by phonetic environment at first appears similar to the distinctive feature matrices proposed by Chomsky and Halle as the final linguistic idealized description of the speech waveform. However, there are two crucial distinctions:

(1) Significant features are chosen, and other (redundant) features are eliminated, based on the simplicity of description and reduction of logical complexity in the encoding process.
In speech recognition, however, the human is generally unaware of mathematical formulations when he is learning to speak; hence, the features he selects to emphasize and control precisely are chosen for communication with another human being and for immunity to noise in that communication. Hence, an ASR system must determine the precisely controlled features that are present rather than formulate hypotheses about which ones would be easiest to analyze if they were present.

(2) Their concept of opposition is with respect to elements that can occupy the same time epoch (minimal pair). This involves a comparison of definite (albeit situation-dependent) measurements of the present input with some representative set of measurements for the opposing element. Many investigators have noted the difficulty in this approach (Hemdal and Hughes21). The relative opposition concept of Rupert and Yilmaz does not have this difficulty, because a time epoch is compared to the preceding and successive epochs for its relevant opposition measurements. Hence, normalization becomes less of a problem.

In the following sections, we will expand these premises and show experimental evidence indicating that a different description of the acoustical speech signal is necessary for an ASR system which more accurately measures timing and frequency characteristics.

I-C SEGMENTATION OF THE ACOUSTIC SPEECH SIGNAL INTO ANALYSIS EPOCHS

The optimistic goal of some segmentation procedures is to define time points in the acoustical signal such that the resulting sequence of signal epochs will correspond to a sequence of idealized linguistic elements. One then simply decides which linguistic element each epoch is most like. In the previous section we discussed this approach and the resulting difficulties, especially in conversational speech involving long phrases. Bobrow et al. state that the purpose of segmentation should be a selection of appropriate measurements to be made, dependent on the phonetic context. Reddy's phoneme classes are directive in the sense that they select appropriate decision procedures to be used in analyzing each of his segments. We are thus led to a procedure that will define time boundaries and also prescribe a particular type of analysis to be performed between these time boundaries. The resulting epochs may not necessarily correspond one-to-one to the final sequence of linguistic elements.

As an example, we might consider a word such as "back", spelled phonetically [b a k], that has been modified by tape cutting at the beginning and the end to remove all noise bursts related to the consonants. The resulting acoustical signal would contain only a vowel-like portion, and only two time boundaries would occur, at the beginning and end of this epoch. However, if the tape cutting has not been too severe, a person would still perceive the entire word; hence, further analysis should determine from the transitions that the generating sequence of linguistic elements is more like three (consonant/vowel/consonant) than one vowel.

A segmentation procedure should also identify the significant controlled acoustical feature within the time boundaries. Rupert discusses how this reduces the variability induced by situation-dependent acoustical features. This would amount to attention focusing that includes formant tracking as a special case. By ignoring all but the distinctive controlled features, a large amount of noise rejection can be accomplished.
Further segmentation need not be impaired by this attention focusing because, as proposed by Rupert, it should be the precisely controlled features that govern the segmentation. However, the beginning of new features outside the area of attention must be able to "capture" the recognition choice so that a feature does not dominate long after it has ceased being significant.

The object of our segmentation procedure, to act as a direction for analysis, must then be able to isolate homogeneous epochs of signal, since in order to make reliable measurements we must have a tailored measurement algorithm (i.e., it is extremely difficult to track a formant during a fricative or noise-like portion of the acoustical signal; Thomas19). This suggests a representation of the acoustical waveform that shows isolated acoustical features and gives an adequate description of the signal properties so that segmentation and class identification can be performed.

The concept of homogeneous segments must be augmented somewhat because of the special nature of speech signals. In order to analyze a generalized acoustical signal generated by a complex scheme, as in human speech, one could use standard communication theory techniques of identifying a state model for each epoch (i.e., a set of differential equations, an n-degree polynomial fit, etc.) and then say the epoch has physical homogeneity as long as the model is valid. The switching times, or segmentation points, will then correspond to changes in models. We must also consider linguistic homogeneity, as discussed previously; there are several portions (acoustical features) of the total speech signal which are not linguistically significant. Therefore, the homogeneous property is with respect to both the physical measurements of the signal and the linguistic significance of these measurements.

I-D PREPROCESSING OF THE ACOUSTICAL SPEECH SIGNAL

Preprocessing of acoustical speech signals, when inspired by modern communication theory techniques, has been dictated more by what is available than by what is appropriate. Researchers have attempted to justify application of existing techniques by analogy with color (light frequency) perception (Yilmaz) or by human perceptual experiments. The former approach can be thought of as looking at the world through rose-colored (harmonic) glasses. The latter technique must be used with caution, since the capabilities of the human brain are not available in an ASR system. The complicated nature of speech signals involves a predominant pitch frequency, which does not contain linguistic information (at a lower unit level), plus several components with time-varying frequencies. An acceptable analysis is possible but requires much computation (Schafer and Rabiner22). A real-time ASR system intended to make efficient use of a machine cannot afford this luxury. The problem involves more than waiting for a faster computer or a trickier algorithm when one wants to recognize connected speech from several speakers. In this section we will discuss the complicated nature of human speech signals and form a basis for specification of a preprocessing scheme tailored to the nature of ASR requirements.

The primary goal of preprocessing is to specify a transformation (filtering) which will (1) remove noise (including other, confounding features of speech, as discussed in the previous section) and (2) provide a compact representation of the salient features required for the recognition task. We cannot expect a straightforward application of standard techniques based on homogeneous models* to achieve these goals. The generation of the acoustical speech signal is best modeled as a composite stochastic process (that is, a heterogeneous mixture of several interdependent time-varying systems). In addition, experiments measuring human perception of acoustical events indicate that man's ability to discriminate frequency is more acute than his perception of differences in intensity (Flanagan23). We will show that the commonly used filtering techniques have poor frequency resolution, which adversely affects ASR system performance in natural human conversation.

* One characterization of a homogeneous process is a set of differential equations of a prescribed form with (time-varying) parameters and a fixed forcing function.

If we assume that the signal is generated by a homogeneous process, the most efficient transformation would match this generation process, as attempted by Wiener-Hopf or Karhunen-Loève25 filtering. The difficulty (and success) in using these methods depends on the initial selection of the representation criterion and representation constraints. The transformation of the input signal minimizes, according to the chosen criterion, the differences between the output and an idealized signal. The criterion chosen has a considerable effect on the final form of the filter. There are many problems in which the mean squared error formulation is required in order to obtain any useful mathematical results. However, another criterion may be better suited to a particular estimation problem. For example, a filter designed for minimum mean squared error would be used successfully in the case of a stochastic signal (fricative), where the mean value and bandwidth of the frequency energy distribution are sufficient statistics. On the other hand, in the case of a vowel formant, the peak of the frequency energy distribution is much more important than the mean value, necessitating a maximum likelihood criterion. Thus, even assuming that we can apply the more sophisticated techniques of communication theory to the speech preprocessing problem, we will generally need more than one "optimum" filter for a speech signal because of the changing nature of the speech acoustical signal.

The set of all possible inputs must be limited (by the filtering operation) in order to achieve rejection of noise and unwanted signals. This "allowable" subset is usually defined by a set of constraints (differential equations in the Kalman26 formulation). Along with providing rejection capabilities, this would make the recognition problem easier by limiting the search space. However, the set of constraint equations, in order to be useful, must be a very accurate description of the instantaneous (rather than some average) "state" of the speech signal, implying that the classification must be known in advance in order to perform the preprocessing transformation. Halle has proposed a feedback-type ASR system (analysis by synthesis) to perform this circular classification. However, in view of the large number of computations implied by such a procedure and the previous discussion of the nature of the speech signal, we would propose the following: at the marking of a change in the speech signal, decide which of several classes the new epoch belongs to and which "portion" of the total signal energy contains the significant information.
Then, tailor a "filter" to this portion and perform the required transformation for as long as the desired features remain in the signal (determined by observing the results of the transformation).

We have already discussed how different criteria lead to several filters or transformations. Also, the parallel nature of the acoustical features in a speech acoustical signal indicates multifiltering as a first step. We can summarize some of the requirements of a multifiltering preprocessing to remove noise and unwanted signals:

(1) Simile - Preservation of the necessary characteristics of a selected portion of the total acoustical signal. The subspace resulting from the filter transformation should, at this stage, preserve the input's characteristics (for instance, if the filter were a bandpass, time-invariant filter, this criterion would require preservation of the amplitude and phase relationships of the input within the 3 dB bandwidth of the filter).

(2) Rejection - Removal of extraneous acoustical characteristics, including background noise and other speech features, such as other formants or the pitch component (for bandpass filters, this would require extremely good attenuation outside the 3 dB bandwidth).

(3) Continuity - At least one of the filters should contain a feature throughout its duration (for bandpass filters with a vowel glide of the second formant in the input signal that extends from 1400 Hz to 2800 Hz, at least one of the bandpass filters should have a 3 dB bandwidth that encompasses this range). This is desirable because we do not want artifact boundaries, particular to a specific set of filters, introduced when a feature traverses filter boundaries. If this condition is not satisfied, a much more complicated decision network must be used to eliminate these artifact boundaries.

Further complications arise because of the wide frequency range, extending over many octaves, and the extreme variations in amplitude. Five contiguous one-octave filters are required to cover the intelligible range of speech (one more if high-quality speech transmission is required), and the amplitude ranges over 120 dB with short-term variations on the order of 20-30 dB.

One of the most popular instruments for displaying and representing speech signals is the sonagram, a two-dimensional graphical display of frequency versus time, with intensity indicated by shading on the display. It has been shown that the sonagram is a physical approximation of the generalized sliding Fourier series (Lerner), that is, a Fourier series computed over a time interval that is stepped along the acoustical signal. The difficulties in analyzing speech can be discussed in terms of the sliding Fourier series and the parameters involved. First, the length of the interval over which the series coefficients are computed must be greater than the period of the lowest frequency component of interest. Measurement of formant frequencies is further complicated during vowel-like portions by the pitch frequency (proportional to the repetition rate of the glottal pulses). The range of these pitch frequencies is from 80 to 400 Hz. The time period over which the Fourier series coefficients are computed must be greater than the pitch period (say, two or three times the largest, roughly 25-30 ms), or a great deal of variation will occur depending on the phase of the pitch frequency.* Thus, there is a lower bound on
frequency resolution on the order of the pitch frequency.

* The ideal situation would be to synchronize the Fourier series computation period with the pitch periods. This requires a pitch detector and a device to decide on the presence of pitch periods. The resulting frequency resolution is still on the order of the pitch frequency.

Sliding Fourier power spectra for both wideband (65-6500 Hz) and bandpass-filtered vowel glides are shown in Appendix D. The irregular form of the spectra is due to the pitch component. Also, the high power of this component relative to higher frequency components (which carry the linguistic information) requires a significant dynamic range (50 dB is shown in Fig. D-1); even then, formant frequencies are difficult to identify. It would be expected that bandpass filtering should isolate these peaks, as is seen in Figure D-2. However, we should note that there are several problems that still are not solved:

(1) When two energy peaks are in the same filter, a decision must be made as to which peak corresponds to a formant and whether the other peak is simply a harmonic of the pitch frequency or a second formant. Ideally, it would be nice to treat one formant in every filter; however, this is overly optimistic.

(2) Measurement resolution - This is possibly a special case of (1), in that the measurement scheme (sliding Fourier series, for instance) has a certain resolution; i.e., a certain minimum distance must be present between two peaks for them to be recognized as two separate peaks. The problem that can occur here is that different speakers may have different spacing, so that for one speaker a two-formant "sound" may appear as a broad single peak while for another the same "sound" will appear as two close narrow peaks.

(3) Frequency glides (large values of derivatives of frequency) that move in and out of filters and across filter boundaries. The ideal approach, of course, is to treat a feature as a continuous event, independent of the filter bandwidths, so that artifacts would not be introduced.

(4) Correlation of formants in adjoining filters - Since the filters are overlapping, the formant could be present in two filters; important types of information may be found by comparing adjoining filters (Hanne29).

(5) Requirements (2) and (3) above are actually contradictory and cause, in the case of bandpass filters, a situation where, in order to contain a formant glide within one filter, the bandwidth would be entirely too wide for adequate rejection and for emphasis of the many types of speech features encountered.

(6) The effects of the pitch component are not completely removed by 25-30 ms computations, time window tapering, or bandpass filtering, as has been suggested by researchers.

These problems for bandpass filters, or, as has been shown by Schafer and Rabiner,22 for even more sophisticated types of frequency analysis, are caused by the inappropriate nature of any fixed-frequency type of analysis for speech processing. The criteria for using such analysis on (1) steady-state phenomena, such as constant vowels or nasals, (2) vowel glides (great changes in frequency of formants), and (3) noise-like signals (very quick, random transient-type phenomena) are in general quite incompatible. Further, it has been shown by Hanne that for several measurement schemes, the estimation of formant frequencies (natural modes) of the acoustical signal approaches a harmonic of the pitch frequency rather than the true value.
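For reference, the analysis criticized here can be stated in a few lines (Python/NumPy; the window and step lengths are illustrative). The frequency-bin spacing is the reciprocal of the window length, so a 25 ms window fixes the resolution at 40 Hz, on the order of the pitch frequency, exactly the lower bound discussed above.

```python
import numpy as np

def sliding_power_spectrum(x, fs, window_ms=25.0, step_ms=10.0):
    """Sliding Fourier power spectra of signal x sampled at fs Hz."""
    n = int(fs * window_ms / 1000.0)        # samples per analysis window
    step = int(fs * step_ms / 1000.0)
    frames = []
    for start in range(0, len(x) - n, step):
        frame = x[start:start + n] * np.hanning(n)   # taper the window
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)  # bins spaced 1/window apart
    return freqs, np.array(frames)
```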
A recent article by Lecours and Sparkes30 has indicated that narrowband filters enhance the frequency pattern of vowels, whereas wideband filters more accurately show the transient time behavior of stop consonants (rapid envelope onset, a fact well known to users of sonagraphs). Hanne has pursued this prefiltering idea further with a more sophisticated system of overlapping filters to estimate first formant frequencies within 3 percent. Flanagan's study indicates that this approach is closer to the frequency estimation error in human recognition. Thomas19 has also used wideband filters to emphasize frequency regions to show second-formant variations more clearly. Both Hanne and Thomas have argued that the effect of filtering speech signals can be predicted or inferred from usual steady-state filter analysis. However, Fig. 2 shows a sonagram of a common English word indicating a frequency derivative on the order of 10,000 Hz per second. This high value of frequency derivative is known to give quite unpredictable and unexpected outputs from time-invariant linear filters (Baghdady, Wiener and Leone, Cannon and Duncan). One should reexamine the criterion for filter bandwidth in terms of the time-varying properties that can occur in speech signals. The inverse relationship between rise time and bandwidth indicates that a fixed-bandwidth bank of filters must be a compromise at best.

The effect of an analysis period on the order of 25-30 ms is to average or smear quick transient phenomena. Discussion of recognition errors in various systems using this type of technique (Reddy7) indicates that many consonants, especially stop consonants, are missed due to this smearing or averaging. The usual reason given for the recognition errors is the low energy and short duration of these speech sounds. One possible solution would be to vary the computing period inversely with frequency
His data showed that, for medial stOp consonants, the common notion of a formant hub does not hold; that is, there is no consistent point of origin for a given consonant, say [b], to which and from which vowel formants tend. 3O .OZwIn_ Fzm:mz>OIm 4.4205 41 _ rt «a .u. o. slung: Eng... _ 1o 2 .0! ch 3; .E «.8- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII .I_ ‘- .1-_ f u Mich .3 rec 1. v?! I... (‘1‘: Bios: {no} o u n a - u I o o a n r oooooooo s nnnnnnnnn ‘7 uuuuuuuu I? ooooooooo on: p ns bl if v p L tuna pia— Blawa; Enmnu I: "a . w. 0. no, 0 ma: 9.. "who I: noon — 1w 0. «O! a mg: [I Hi .I u. i an.“ wag a! g“ 0. ma :50 ‘00: ,0! v 3.. - u . u . u o . u . fins... bl urn f In D b 5 > P h h E. 00!. .2. a. .0! m 5.. are E. on; . we 9 6! o 3.. E. 09... I. 03m . ...w 2 .0! on 3: l. as :2 o. .3. . . 3(8 E 29 o. :2 .In 9008 a u n . o g . f P P 0... LI :7 P I .tot '30— 0—03. mg; It .5 .13 .— Salome; 11.90 " LI, 'FI “l 1“ 1.38 —u¢2 9089:: Fiona {:3 :32 20823.; I...’ .39! 0.53 i 22.3! ..n o. .I. 9334‘ u to!) u u - o a I r OOOOOOOO .000: h» V I 3 L: > I t .- B’CUJC v5.03 .3: alas; I... —m~: in. u .40? L. 8.0 win I. noon pun .— frag .l .8» , :u 2 8! ch 3.. ..wd‘uuxh'.ouad:89 . . a . u . o o s . .1 ........ s ......... P ........ I? ......... I ........ p by b r p flag '1' o. B’m win [nth “.538 .8: 8.1.... (.39 In! Flu: aivnwau 32 487.5 ms FILE 24 M04 A 19 EH 1 512.5 ms 549.0 ms FILE 6 M06 19 BE 1 574.0 ms 695.1 ms FILE 5 MD8 19 MJ 1 720.1 ms 550.0 ms FILE 4 MD6 19 BA 1 575.0 ms OUTPUTS OF OVERLAPPING BANDPASS FILTERS [M] TO [8]] ‘w-JI“ FIGURE 4 CONSISTENT TIME WAVEFORMS FOR SEVERAL SPEAKERS FROM DIFFERENT BANDPASS FILTERS 33 The choice of a set of filters for preprocessing the acoustical signal ranges from a set tailored to several classes of acoustical signals, possibly along with different representation criteria, to a set of contigu— mm narrow bandpass filters. The second approach has been extremely popular, especially for speech synthesis using vocoders. A familiar characteristic of narrowband filters, i.e., ringing, when excited with a sharp increase or decrease in amplitude or frequency is not consistent with the require- mmw of simile. For a period of time after a sudden change in amplitude or frequency, the output of the filter is not representative of the input. fins problem will be discussed in the next chapter. To avoid these difficulties, we have chosen a Wideband (half—power bandwidth greater than one third the center frequency) overlapping filter bank (See Appendix B). The particular choice of the number of filters and the bandwidth of each filter was made in order to satisfy the three Mmted requirements. We shall see that the type of multiband filtering used here fulfills these requirements to a certain degree but has several limitations which must be corrected in the decision algorithm that follows the multiband filtering. The reason for these limitations is obvious. A time—invariant filter based on steady—state sinusoidal considerations OIJViously is not representative of the speech acoustical signal. However, there are several reasons for this choice over the admittedly better set of tailored filters, These reasons include: (1) The hardware is readily accessible (2) A large number of investigators have used Wideband preprocessing {v Gazdag , filtering in proposing and implementing an I schemes, including Hanne , Reddy , Thomas , Shafer et a1 and Yilmaz . 
(3) Adequate representations have not been tailored to the time-varying acoustical signal.
(4) Few decision structures have been studied which are tailored to this type of multiband filtering preprocessing.

Another popular related analysis tool is the Fourier transform, especially since the introduction of the "Fast Fourier Transform" algorithm by Cooley and Tukey. The Fourier transform equations can be modified so that each coefficient computation may be thought of as a (digital) filter operation. Hence, the complete transform computation may be considered a multi-bandpass filter processing.* Much can be learned by considering a multi-bandpass filtering scheme with the intention of using it only as a first step and deriving from it further requirements for a tailored multi-filtering scheme.

* In the next chapter we will see that the Fourier coefficient computation acts like a narrowband digital filter and hence is subject to ringing.

A popular approach for parameterization of the filter outputs is to compute coefficients for an orthogonal series representation. However, the criterion commonly used for these computations is complete representation of the entire signal and minimization of the error between the orthogonal series and the original signal. This is not what is needed for an input to an ASR system. We would rather see only those parameters necessary for recognition. Flanagan has modeled the speech-generation process as either a two-pole linear filter excited by the glottal pulses (for vowels and oral continuants) or a filter excited by white noise with variable bandwidth and center frequency (for fricatives, stop consonants). The various parameters of input envelope and filter bandwidth and center frequencies are considered to be time-varying. Thus, he would propose two parameters for each of our acoustical features, related to center frequency and bandwidth.

Rupert has suggested that there are also consistent spectral shapes to the acoustical features, which have been only slightly considered by previous investigators. These shape functions appear to be easily described by at most four parameters; say, the first four moments of the spectral density. They were first derived from sonagrams, but inspection of machine-calculated power spectra (Appendix D) shows that they may be more artifacts of the hardware than consistent features of speech acoustical signals. However, Sitton has studied the first four moments of reciprocal zero-crossing distributions and found more consistent results.

Thus, one is led to different estimates of center frequency (and higher moments) for a narrowband (unimodal) spectral density. Zero-crossing counts immediately come to mind. There are many schemes and investigations of zero-crossings for analysis of speech signals (Cherry and Philips). However, these measures were usually made on the total signal, and, as can be seen by considering the sum of two sinusoids with variable amplitudes, the resulting output can be very difficult to interpret unless the signal has its spectral energy concentrated in a narrow frequency band. Thomas has used zero-crossing analysis on the output of his bandpass filter to estimate second formants; he finds an extremely good representation for vowels and indicates trouble only for very low power portions of the acoustical signal (fricatives, stop consonants). The use of bandpass filters followed by zero-crossing counts to estimate the frequency structure of formants has been demonstrated (Peterson, Hanne29).
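The zero-crossing estimate itself is simple enough to state directly. The sketch below (Python/NumPy; the frame length is an illustrative assumption) also makes plain why the measure is meaningful only after bandpass filtering has confined the energy to roughly one formant:

```python
import numpy as np

def zero_crossing_frequency(x, fs, frame_ms=10.0):
    """Frame-by-frame frequency estimate from zero-crossing counts on a
    narrowband (bandpass-filtered) signal x sampled at fs Hz."""
    n = int(fs * frame_ms / 1000.0)
    estimates = []
    for start in range(0, len(x) - n + 1, n):
        frame = x[start:start + n]
        # Each sign change marks half a cycle of the dominant component.
        crossings = np.count_nonzero(
            np.signbit(frame[1:]) != np.signbit(frame[:-1]))
        # f = crossings / (2 T), with frame duration T = n / fs.
        estimates.append(crossings * fs / (2.0 * n))
    return np.array(estimates)
```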
Recently, Scarr has discussed the fine structure of zero-crossings for speech-like signals having formants and pitch frequency components. He uses wide (one-octave) filters to isolate formants and shows the effect of pitch periods on formant frequency estimation. The errors involved in zero-crossing analysis are on the order of 1/(number of zero crossings) and are therefore proportional to frequency. The case with Fourier series analysis is different, in that the frequency-location error is fixed at one half the lowest frequency component (in this case, the pitch frequency).

Zero-crossing counts can be related to instantaneous frequencies (Baghdady, Lerner) and thus incorporated into a discussion of the quasi-stationary response of linear filters. However, few investigators have pursued this approach in the case of speech signals. Reddy uses zero-crossing measures as an estimation of steady-state frequencies and also some envelope measurements (primarily relative envelope changes). We will discuss on a slightly more theoretical basis the relationships between zero-crossing measures and instantaneous envelope measurements in the next chapter. There are obvious benefits to be derived from the use of both derived time series, in that the interpretation of zero-crossing counts is greatly enhanced by specification of the nature of the speech signal (i.e., whether it is a vowel portion or a fricative portion, etc.), which can be determined by investigation of the envelope time series.

The subject we will investigate in the following chapters involves prefiltering by a bank of overlapping bandpass filters with the criterion that significant acoustical features appear in at least one of the filters over their duration. This presents a new type of recognition problem, involving the logic to decide which filter has the significant output and to perform a preliminary classification as discussed previously. This is the topic of the next section.

I-E DECOMPOSITION OF PATTERN RECOGNITION ALGORITHMS

The use of multiband overlapping filters to preprocess speech signals presents a specialized type of pattern-recognition processor. For the sake of clarity, we will adopt the widely used mathematical formulation in our discussion of this problem. The inputs to pattern-recognition devices are parameters, the distinguishing characteristics of a physical event. A measurement is the numerical value of a parameter. A pattern vector, then, is an ordered set of measurements of a physical event; each measurement can be thought of as a component. The distance in pattern vector space between two vectors is a geometric measure of their closeness. A typical, but not always appropriate, distance is the standard Euclidean sum of squared differences of each component. A pattern-recognition algorithm is an assignment of class labels to the pattern vectors. In a typical pattern-recognition algorithm, each input pattern vector to be classified is compared with a number of reference vectors by a distance measure. The input vector is then assigned the label of that reference vector for which the distance is minimized. An ideal pattern-recognition algorithm would result in a dichotomization of the pattern vector space with unique class labels for each disjoint region. In the cases where this is not possible, the output of our pattern-recognition (PR) algorithm can be a degree of presence (DOP) vector, which has one component for each class label. The DOP vectors indicate the relative assignment for each class (say, normalized distances) and hence are a generalization of the single-class-label output.
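A minimal Python sketch of this nearest-reference scheme with a DOP output follows; the class names, reference vectors, and the inverse-distance normalization are illustrative assumptions, not the measurement set used in this study.

    import numpy as np

    def dop_vector(x, references):
        """Degree-of-presence vector from distances to class reference vectors.

        references: dict of class label -> reference pattern vector.
        Returns normalized closeness values (summing to one); the single
        class label decision is the component with the largest value."""
        labels = list(references)
        d = np.array([np.linalg.norm(x - references[c]) for c in labels])
        closeness = 1.0 / (d + 1e-9)          # small epsilon guards d == 0
        return labels, closeness / closeness.sum()

    refs = {"vowel":     np.array([0.9, 0.2, 0.1]),
            "fricative": np.array([0.1, 0.8, 0.3]),
            "silence":   np.array([0.0, 0.0, 0.0])}
    labels, dop = dop_vector(np.array([0.7, 0.3, 0.1]), refs)
    print(dict(zip(labels, dop.round(3))))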
A directed search is a special type of pattern-recognition algorithm that trades sequential operations for multidimensional single operations; i.e., in the reference-vector comparison case, a subset of reference vectors is selected by first examining a few components and eliminating large portions of the pattern vector space from further search. Plasticity describes a particular type of pattern-recognition algorithm that allows changes in the pattern-vector-to-DOP-vector mapping, depending on a subset or all of the pattern vectors (the terms "learning" and "adapting" have been used for this process). A deterministic pattern-recognition algorithm is one which has no plasticity; that is, an a priori fixed mapping of vectors into classes, possibly by setting thresholds on measurements. Normalization is a process which we will distinguish from the pattern-recognition algorithm as being more concerned with the derivation of the parameter measurements. Although analogous standardization processes do occur in pattern-recognition algorithms, it will facilitate the discussion to make this distinction.

We can now consider a schematic of the logic required for a pattern-recognition algorithm for our multi-bandpass filters and its operation. In Figure 5, the output of each bandpass filter goes into a measurement device, producing an n-dimensional pattern vector for a time epoch (physical event) of the acoustical signal. These may be coefficients of an orthogonal expansion over a certain time interval, coefficients of a differential equation, or another set of appropriate measurements (mean values, maximum value derivatives, maximum value standard deviations, etc.). For a continuous output of the bandpass filter, these types of measurement require time interval marks, which we will assume for now are generated elsewhere or are a part of the measurement scheme. The output DOP vector is of dimension r, the number of speech sound classes discussed in Section D (on the order of 4 to 6).

[Figure 5: Schematic of multifilter recognition logic. Speech input drives a bank of filters; each bandpass filter feeds a measurement device (n-dimensional pattern vector) and a local logic (r-dimensional DOP vector), and the local logics feed a global logic.]

When referring to operations or properties of individual filter outputs, we will denote these as local, and when talking about properties of the entire bank of filters, we will denote these as global. By the particular choice of our filters, we see that a local property is one that is restricted to a certain frequency range. We will talk formally about "closeness" of pattern vectors in terms of clusters in the sense of Ball and Hall. That is, we will say a set of pattern vectors is clustered if the intra-cluster distances are small (relative to a threshold, or to inter-cluster distances). The homogeneous property, which we introduced in our definition of acoustical segments, is with respect to both the physical measurements of the signal and the linguistic significance of these measurements. We might reformulate that property in terms of our definitions: physical measurements have some significance and consistency if they form a cluster (denoted a physical cluster) in the pattern vector space.
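The working definition of a physical cluster can be made operational in a few lines. The sketch below flags a set of pattern vectors as clustered when the mean distance to the centroid falls below a threshold; the threshold, the two-dimensional toy data, and the centroid-based criterion are illustrative assumptions (Ball and Hall's ISODATA procedure uses a richer set of criteria).

    import numpy as np

    def is_cluster(vectors, threshold):
        """Call a set of pattern vectors a cluster when the mean
        intra-cluster distance to the centroid is below a threshold."""
        v = np.asarray(vectors)
        centroid = v.mean(axis=0)
        intra = np.linalg.norm(v - centroid, axis=1).mean()
        return intra < threshold, intra

    rng = np.random.default_rng(0)
    tight = rng.normal(loc=[1.0, 2.0], scale=0.05, size=(20, 2))
    loose = rng.normal(loc=[1.0, 2.0], scale=0.8, size=(20, 2))
    print(is_cluster(tight, threshold=0.2))   # (True, ...)
    print(is_cluster(loose, threshold=0.2))   # (False, ...)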
It may not always be the case that these physical clusters have linguistic significance. For example, a frequency measurement on a low-order filter primarily exhibits the pitch frequency. In this case, the physical clusters would correspond to different pitch frequencies and not to different linguistic events. At the opposite extreme, a physical cluster might be related to two distinct linguistic events, such as a medial [b] which has a very small amount of silence before the burst release, or so much background noise that it is difficult to distinguish from a fricative such as [f]. The resulting measurements for both the [b] and the [f] would tend to lie "close" to each other and, hence, lie in one physical cluster; thus two linguistic clusters would correspond to one physical cluster. At first, it appears that appropriate class labelling of the physical clusters would define the linguistic significance; however, as indicated previously, the difficult task of assigning an exterior linguistic criterion to physical measurements subject to speaker, environment, and free phonetic variations will require a more sophisticated, plastic type of correspondence.

The intention of keeping the actual decision algorithm simple enough to implement in real time (with a minimal amount of computation) requires a better solution than simply keeping track of all the physical clusters and then making a correspondence to a set of linguistic labels. That approach requires, for example, storage of a large number of reference vectors (say, one for each physical cluster), comparison to these at each step of the decision algorithm, and continual updating of these reference vectors due to slowly drifting measurements. In our problem, this approach is not feasible because of the variations due to different speakers. Bobrow and Klatt have shown a decision algorithm (applied to the speech recognition problem) which is a directed search using decision-tree type logic that reduces the computational limitations (amount of storage, number of comparisons, speed of classification) of the usual multidimensional pattern recognition algorithm. Their procedure, applied to a speech measurement situation in which the variations discussed above are removed, would result in an effective ASR algorithm. Their technique, of course, will fail in the situation where a large number of reference vectors must be saved for comparison.

The concept of precisely controlled features can be related here, also, to physical clusters, in that if other perturbing influences are removed, these precisely controlled features should result in "tight" physical clusters. This approach in itself should reduce considerably the amount of variation and hence the number of physical clusters needed for description. This is what we mean by attention focusing; i.e., the selection of a portion of the speech signal with precisely controlled features and tight physical clusters for further processing.

The complexity of the decision logic in Fig. 5 for an ASR system depends upon whether a decision for assigning a class label can be dichotomized into a number of local decisions followed by a global decision (analogous to the Zeiger decomposition of automata); i.e., is the dimensionality of the pattern vectors on the order of m x n or n (where m is the number of modules and n is the number of measurements in the input pattern vector for each module)?
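The directed-search idea referred to above can be sketched compactly. The code below is not Bobrow and Klatt's tree, which is not reproduced in this study; it only illustrates the principle of examining a few components first to eliminate large portions of the reference set, with made-up reference vectors and tolerances.

    import numpy as np

    def directed_search(x, references, coarse_dims, coarse_tol):
        """Directed search: a coarse test on a few components eliminates
        most references; full distances are computed only on survivors."""
        survivors = [c for c, r in references.items()
                     if np.all(np.abs(x[coarse_dims] - r[coarse_dims]) < coarse_tol)]
        if not survivors:
            survivors = list(references)          # fall back to a full search
        return min(survivors, key=lambda c: np.linalg.norm(x - references[c]))

    refs = {"b": np.array([0.1, 0.9, 0.4, 0.2]),
            "f": np.array([0.2, 0.8, 0.9, 0.7]),
            "a": np.array([0.9, 0.1, 0.3, 0.5])}
    x = np.array([0.15, 0.85, 0.5, 0.3])
    print(directed_search(x, refs, coarse_dims=[0], coarse_tol=0.3))  # -> 'b'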
In the situation where two estimation criteria are appropriate (not necessarily simultaneously) for an n-parameter problem, hence leading to two "filters" (as discussed in Section D), we would say the dimensionality is n rather than 2n, but "shifts" according to the input. The local decision would be based on the "best" estimate according to the local criterion, and the global decision would then be the choice of which estimator was most appropriate, by examining the variance of the parameter estimator, for instance. This variance measure of the estimation process can be generalized to handle the many more difficult and varied situations in ASR systems. We can also measure the quality of the DOP vector, e.g., the peakedness measure introduced by Kilmer et al. A quality measure of the specific classification of an input pattern vector indicates the significance of the estimation of the measurements and the consistency of the pattern vector with respect to previous classifications.

Knowing that the complexity of PR algorithms goes up exponentially with the number of dimensions, a decomposition can result in real-time computations. The discussion of the previous sections indicates that this is the case for speech, in that the entire wideband acoustical signal is not precisely controlled and does not contribute in its entirety to the linguistic information. The choice of a logic structure, then, depends on this decomposition. We propose to show in Chapter IV that this is valid and indeed enhances the physical measurements in such a way as to reduce variations and improve the probability of successful classification.

Kilmer et al. have studied parallel recognition structures of the type shown in Fig. 6 and have demonstrated that an iterative nonlinear shakedown net (called S-RETIC)* is capable of arriving at a consensus of opinion among the local pattern-recognition elements (denoted modules), solving conflicts that may arise and selectively tuning to particular modules that have made a high-quality decision. We feel that this type of logic structure is ideally adapted to the requirements of an ASR system. In particular, the overlapping bandpass filters have a mixture of correlation with neighboring filters and a high degree of local specificity because of the precisely controlled features in speech signals (corresponding to the local redundancy of potential command concept of the S-RETIC). The parallel computations involving low dimensionality (on the order of the dimensionality of each module) allow a minimal amount of computation.

* By S-RETIC, we mean the algorithm that performs the iterative nonlinear shakedown as described in Kilmer et al. (1967) and not the complete simulation study. Effectively, we denote by S-RETIC the computer program which corresponds to the B parts of the modules with their interconnections.

[Figure 6: Quasi-statistical formulation of the PR algorithm. An input pattern vector x^k enters module k, which outputs the DOP vector with components P_1(C_1/x^k), ..., P_r(C_r/x^k); lines m_i^j from module j provide lateral inputs (not all lines may be used).]

In order to get some feeling as to how S-RETIC arrives at its decision, and also to consider an alternative procedure for using a number of pattern-recognition elements in unison, we can consider the probability distribution approximation techniques first discussed by Lewis and Brown. In order to apply their techniques to PR problems, we will consider each component of the DOP vector as being a conditional probability distribution P_k(C_ℓ/x^k), ℓ = 1, ..., r, defined over the (module) input pattern vector space X^k (x^k ∈ X^k) for each class C_ℓ (see Fig. 6).
The DOP vector is computed from stored conditional distributions P_k(x^k/C_ℓ) for an input x^k by Bayes' formula (assume P(C_ℓ) = 1/r):

P_k(C_\ell / x^k) = P_k(x^k / C_\ell) \Big/ \sum_{t=1}^{r} P_k(x^k / C_t)        (I-E-1)

The only requirements on the stored distributions are that they be nonnegative for all C_ℓ, x^k and normalized such that

\sum_{x^k \in X^k} P_k(x^k / C_t) = 1, \qquad t = 1, \ldots, r        (I-E-2)

We can apply Lewis and Brown's techniques to P_k(x^k/C_ℓ), k = 1, ..., m, for one class by considering each pattern-recognition module as computing a low-order approximation to the true distribution. Chow defines the structure of a pattern recognition algorithm as the functional form of the probability distributions, particularly the conditional dependencies among the components of the pattern vectors. He describes the Lewis-Brown approximation as structure adaptation. Hence, a parallel net of modules with lateral communication between local PR computations allows at least m different structures for each class. S-RETIC then selects the appropriate structure.
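A minimal Python sketch of the computation in Eqn. (I-E-1) under the uniform-prior assumption follows; the stored conditional values are invented for illustration.

    import numpy as np

    def dop_from_conditionals(likelihoods):
        """Eqn. (I-E-1) with uniform priors P(C_l) = 1/r:
        P_k(C_l / x^k) = P_k(x^k / C_l) / sum_t P_k(x^k / C_t)."""
        p = np.asarray(likelihoods, dtype=float)
        return p / p.sum()

    # stored conditional values P_k(x^k / C_t) for one module and one input;
    # illustrative numbers (each stored distribution is assumed normalized
    # over X^k, as required by Eqn. (I-E-2))
    likes = [0.30, 0.05, 0.10, 0.02]
    print(dop_from_conditionals(likes))   # components sum to one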
So far in our discussion, we have been considering decision structures that, except for the possibility of operating with minimal computations and less complexity, appear similar to those termed template-matching in Section B. This static type of pattern classification has little hope of working with connected conversational speech. The structure we are proposing has more flexibility built into it, yet operates like the PR algorithm we have described for isolated sounds, where timing marks are well defined. The philosophy behind the design of the STL-RETIC program was to operate in an asynchronous manner, rolling over from one decision to another based on input changes. This structure is exactly the type needed for dynamic speech recognition, as when one classification has been chosen, such as silence preceding a word, and a new feature begins. It has been demonstrated by Kilmer that a change in the input (as reflected by a change in the local DOP vectors) is sufficient to cause a change in the overall global DOP vector. It will possibly be necessary also to detect changes in the input measurements. We propose to do this by detecting inherent changes in the physical characteristics of the signal and then deciding if these changes are significant enough to cause a recomputation of the global decision. We will return to these questions in Chapter IV. First, however, we consider in Chapter II the nature of the acoustical waveform and discuss a procedure for detecting inherent changes in that waveform. In order to specify a training procedure for a plastic PR algorithm, an external classification criterion is needed. The lack of a one-to-one correspondence between acoustical and linguistic events rules out completely unsupervised learning. In Chapter III, structural linguistics is discussed in order to provide this criterion.

II REPRESENTATION OF TIME-VARYING SIGNALS

Representation of signals that result from transformations of standard signals by a time-varying differential operator presents many difficulties, particularly to engineers with backgrounds in linear time-invariant differential operator analysis. Two representations are commonly used, the analytic signal and the sliding Fourier transform methods.

II-A Analytic Signals

The analytic signal representation is an attempt to define precisely the empirical notions of envelope and frequency. The primary advantage of this representation is that it separates the envelope and phase portions of the signal; in addition, the resulting spectrum is one-sided (i.e., there is no mirror negative-frequency portion). This corresponds to most spectra "pictures" and makes various moment calculations practical.

The spectrum of a real signal u(t) for t ∈ (-∞, ∞) is the Fourier transform*

U(j\omega) = \int_{-\infty}^{\infty} u(t)\, e^{-j\omega t}\, dt .

* We will adopt the convention of denoting the spectrum of a real function of time by capital letters.

The Hilbert transform of the real signal x(t), defined on the interval -∞ < t < ∞ as the Cauchy principal value of the integral

x^h(t) \equiv \frac{1}{\pi}\, P \int_{-\infty}^{\infty} \frac{x(\sigma)}{t - \sigma}\, d\sigma , \qquad -\infty < t < \infty        (II-A-1)

is another useful transform. The new real signal x^h(t) has the following properties (Titchmarsh):

(1) x(t) = cos(\omega t + \theta) implies x^h(t) = sin(\omega t + \theta).
(2) Under rather general conditions, if x = y^h, then x^h = -y.
(3) X^h(f) = -jX(f) for f > 0, and X^h(f) = jX(f) for f < 0.

We can now define the analytic signal corresponding to x(t):

\hat{x}(t) = x(t) + j\, x^h(t)        (II-A-2a)
          = a(t)\, e^{j\alpha(t)}        (II-A-2b)

where

a(t) = \sqrt{x^2(t) + (x^h)^2(t)}        (II-A-2c)
\alpha(t) = \arctan\{x^h(t)/x(t)\}        (II-A-2d)

The analytic signal x̂(t) has the one-sided spectrum mentioned before, because of Property (3) and the definition. This signal is complex (the real portion is the original signal). Since taking the real part of a complex function is a linear operation, it commutes with other linear operations such as convolution, differentiation, and integration.

Equation (II-A-2b) gives us an interpretation of the analytic signal representation as a phasor in the complex plane with time-varying magnitude and angle (with respect to the real axis). We may denote these quantities as the envelope and phase functions, terms motivated by the use of the analytic signal in various modulation studies (Baghdady, Weiner and Leon). The instantaneous frequency is defined as the time derivative of the phase function:

\omega_i(t) \triangleq d\alpha(t)/dt        (II-A-3)

The analytic signal, although giving an instantaneous time description, can be used effectively for only a limited set of signals, namely those with slowly varying envelope and frequency functions. In order to enlarge this set of signals, we will introduce another definition which will be useful in discussing second-order time-varying differential operators. The derivative of an analytic signal may be written as a product of the analytic signal and a new signal, b_x(t), which we will denote the prebandwidth signal:*

\frac{d\hat{x}(t)}{dt} = \frac{d}{dt}\left\{ a(t)\, e^{j\alpha(t)} \right\} = \left\{ \frac{1}{a(t)}\frac{da(t)}{dt} + j\,\frac{d\alpha(t)}{dt} \right\} \hat{x}(t)        (II-A-4)

where

b_x(t) \triangleq \frac{1}{a(t)}\frac{da(t)}{dt} + j\,\frac{d\alpha(t)}{dt} .

* The name of this function follows the convention of Deutsch's definition of effective bandwidth. Deutsch denotes x̂(t) as the pre-envelope signal because its magnitude is the envelope.

First shift the spectrum of x̂(t) to its center frequency. This frequency shift can be included in b_x(t) by a property of Fourier transforms:

X_s(j\omega) \triangleq X\{ j(\omega + \omega_0) \} \;\leftrightarrow\; e^{-j\omega_0 t}\, \hat{x}(t) = \hat{x}_s(t)        (II-A-5a)

b_{x_s}(t) = \dot{a}(t)/a(t) + j\,(\dot{\alpha}(t) - \omega_0)        (II-A-5b)

When ω₀ is the center frequency of X(jω), the complex portion of b_{x_s} reflects the time variations of the instantaneous frequency about the mean.
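These definitions translate directly into a few lines of numerical code. The sketch below, assuming a synthetic damped tone and a 10-kHz sampling rate (illustrative choices), computes the envelope, instantaneous frequency, and prebandwidth function of Eqns. (II-A-2) through (II-A-4) with a discrete Hilbert transform.

    import numpy as np
    from scipy.signal import hilbert

    fs = 10_000
    t = np.arange(0, 0.05, 1 / fs)
    # damped tone: an assumed stand-in for one formant epoch
    x = np.exp(-80 * t) * np.cos(2 * np.pi * 1000 * t)

    xa = hilbert(x)                      # analytic signal x + j x^h
    a = np.abs(xa)                       # envelope a(t), Eqn. (II-A-2c)
    alpha = np.unwrap(np.angle(xa))      # phase alpha(t), Eqn. (II-A-2d)
    w_i = np.gradient(alpha, 1 / fs)     # instantaneous frequency, Eqn. (II-A-3)
    # prebandwidth function b_x(t) = a'(t)/a(t) + j alpha'(t), Eqn. (II-A-4)
    b_x = np.gradient(a, 1 / fs) / a + 1j * w_i

    mid = len(t) // 2
    print(f"f_i ~ {w_i[mid] / (2 * np.pi):.0f} Hz, Re b_x ~ {b_x[mid].real:.0f} 1/s")

For the damped tone, the real part of b_x recovers the damping factor and the imaginary part the oscillation frequency, which is the property exploited in the single-formant model below.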
The effective bandwidth, BW, is the second moment of the spectrum about the mean:

\mathrm{BW}^2 \triangleq \frac{\int_{-\infty}^{\infty} \omega^2\, |X_s(\omega)|^2\, d\omega}{\int_{-\infty}^{\infty} |X_s(\omega)|^2\, d\omega} = \frac{\int_{-\infty}^{\infty} \left| d\hat{x}_s(t)/dt \right|^2 dt}{\int_{-\infty}^{\infty} |\hat{x}_s(t)|^2\, dt} = \frac{\int_{-\infty}^{\infty} |b_{x_s}(t)|^2\, a^2(t)\, dt}{\int_{-\infty}^{\infty} a^2(t)\, dt}        (II-A-6)

By the Schwarz inequality the magnitude of b_{x_s} brackets the effective bandwidth, and it is thus a measure of an instantaneous bandwidth:

\left[ \frac{\int_{-\infty}^{\infty} |b_{x_s}(t)|\, a^2(t)\, dt}{\int_{-\infty}^{\infty} a^2(t)\, dt} \right]^2 \;\le\; \mathrm{BW}^2 \;\le\; \sup_t |b_{x_s}(t)|^2        (II-A-7)

Another interesting relationship between b_x(t) and x̂(t) is (for x̂(t) ≠ 0):

b_x(t) = \frac{d\hat{x}(t)}{dt} \Big/ \hat{x}(t) = \frac{d}{dt}\left\{ \log \hat{x}(t) \right\}        (II-A-8)

In speech analysis, a logarithmic scale for amplitude (loudness) has often been used. By taking a derivative (with appropriate definitions for the complex logarithm), we can replace the transcendental function with a function more easily computed on a digital machine.

Now, consider a second-order time-varying linear differential equation (DE):

\ddot{\hat{x}} + a_1(t)\, \dot{\hat{x}} + a_2(t)\, \hat{x} = \hat{u}(t)        (II-A-9)

where a₁(t) and a₂(t) are real functions denoting the time-varying parameters (for example, of a formant-producing cavity in speech generation), and û(t) is an excitation function which may be stochastic (fricatives) or deterministic (glottal pulses). Introducing the prebandwidth function,

\left[ \dot{b}_x(t) + b_x^2(t) + a_1(t)\, b_x(t) + a_2(t) \right] \hat{x}(t) = \hat{u}(t)        (II-A-10)

The homogeneous solution of the reduced DE (û(t) = 0) involves the solution of a Riccati equation for b_x, which can be solved if a₁ and a₂ are constant:

\dot{b}_x + (b_x + c_1)(b_x + c_1^*) = 0

where

c_1 = \tfrac{1}{2} a_1 + \sqrt{a_1^2/4 - a_2}        (II-A-11a)
c_1^* = \tfrac{1}{2} a_1 - \sqrt{a_1^2/4 - a_2}        (II-A-11b)

c₁ and c₁* are the pole locations for the time-invariant system given by Eqn. (II-A-10). When the constants c₁ and c₁* are complex, the magnitude of b_{x_s}, the shifted prebandwidth function, has the damping factor a₁/2, which is an accepted "bandwidth" for this system. Thus, our definition is useful in relating bandwidth to a system that may have an infinite value of BW (this happens for certain values of a₁ and a₂). When a₁(t) and a₂(t) vary slowly with time, so that ḃ_x ≈ 0, we can still define c₁(t) and c₁*(t) by Eqn. (II-A-11), and we can define time-varying poles without Fourier transforms. In general, Eqn. (II-A-10) must be solved by numerical integration, but the function b_x is related to the crucial parameters of a system described by Eqn. (II-A-9) and can provide insight into the system's behavior. Analysis of higher-order time-varying systems by this approach is not as easy as the analysis of time-invariant systems, where reduction to second-order systems is achieved by partial fraction expansions. The lack of a superposition principle, plus the computational difficulty with sums of analytic functions, further complicates the generalization.
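The pointwise pole computation of Eqn. (II-A-11) is easily sketched numerically. In the Python fragment below, the damping coefficient and the formant glide are invented illustrative values; the point is only that slowly varying a₁(t), a₂(t) yield time-varying poles with no Fourier transform involved.

    import numpy as np

    def poles(a1, a2):
        """Pole locations c1, c1* of Eqn. (II-A-11), evaluated pointwise
        for (possibly time-varying) coefficients a1(t), a2(t)."""
        disc = np.asarray(a1, dtype=complex) ** 2 / 4 - a2
        root = np.sqrt(disc)                       # complex square root
        return a1 / 2 + root, a1 / 2 - root

    t = np.linspace(0, 0.03, 4)
    a1 = 160.0 * np.ones_like(t)                   # damping (constant here)
    f = 2000 - 300 * t / 0.03                      # formant gliding 2000 -> 1700 Hz
    a2 = (2 * np.pi * f) ** 2
    c1, c1c = poles(a1, a2)
    for ti, c in zip(t, c1):
        print(f"t = {ti * 1e3:4.0f} ms: damping {c.real:.0f} 1/s, "
              f"freq {abs(c.imag) / (2 * np.pi):.0f} Hz")

The real part of c₁ is the damping factor a₁/2, and the imaginary part tracks the gliding resonance, consistent with the remarks above.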
The analysis of the dynamic characteristics of one isolated formant is possible (and more tractable) with the introduction of the prebandwidth function. Real differential equations for the envelope and frequency functions can be derived by substituting the definition of b_x from Eqn. (II-A-4) into Eqn. (II-A-10) and separating the result into real and imaginary parts, giving

\left[ \ddot{a}(t) + a_1(t)\dot{a}(t) + \{ a_2(t) - \omega^2(t) \}\, a(t) \right] \left[ \cos\{\alpha(t)\} - \sin\{\alpha(t)\} \right] = g(t)\, \cos\{\gamma(t)\}        (II-A-12a)

\left[ \dot{\omega}(t) + 2\omega(t)\dot{a}(t)/a(t) + a_1(t)\,\omega(t) \right] a(t) \left[ \cos\{\alpha(t)\} + \sin\{\alpha(t)\} \right] = g(t)\, \sin\{\gamma(t)\}        (II-A-12b)

where x̂(t) = a(t)e^{jα(t)}, û(t) = g(t)e^{jγ(t)}, and ω(t) = α̇(t).

The equation for the envelope, (II-A-12a), is of the same form as the total-signal DE, with a "natural frequency" reduced by ω²(t). The DE for the frequency is nonlinear in ω and a and shows the effect of damping on the natural frequency. We can change (II-A-12a) by substituting for the second derivative of the envelope,

\ddot{a}(t)/a(t) = \frac{d}{dt}\{ \dot{a}(t)/a(t) \} + \{ \dot{a}(t)/a(t) \}^2 .

Then we can rewrite (II-A-12) as

\frac{d}{dt}\{\dot{a}/a\} = \frac{g}{a}\left[ \cos\gamma \big/ \{\cos\alpha - \sin\alpha\} \right] + \omega^2 - a_2 - a_1\,\dot{a}/a - \{\dot{a}/a\}^2        (II-A-13a)

\frac{d}{dt}\{\omega\} = \frac{g}{a}\left[ \sin\gamma \big/ \{\cos\alpha + \sin\alpha\} \right] - 2\omega\,\dot{a}/a - a_1\,\omega        (II-A-13b)

If we identify ω and ȧ/a as state variables, then Eqn. (II-A-13) is in the form of a nonlinear vector differential equation. For speech acoustical signal representation, these state variables are invariant to amplitude scale changes, as seen from their differential equations; further, they form the imaginary and real parts of the prebandwidth function. As noted in Chapter I, speech acoustical signals fall into a number of classes, depending on the values of the four signal parameters a₁(t), a₂(t), g(t), and γ(t) in our single-formant model. Inspection of Eqn. (II-A-13) indicates that the derivatives of the two state variables depend only on the state variables and these four time-varying parameters. Thus, if we were to specify the two state variables and their derivatives as functions of time, we could perform the speech signal classification. This procedure does not require us to solve the complex nonlinear differential equations or to perform any type of matrix inversion that would be necessary to identify the time-varying model parameters.

When û(t) is a train of unipolar glottal pulses (each being 2 to 12 ms in duration), û(t) can be represented by the excitation envelope g(t). For this situation, the sinusoidal oscillation terms can be removed from Eqn. (II-A-12). This is achieved by the physical process of envelope detection and lowpass filtering. In Section D, this filtering operation is investigated, and a criterion for selecting the cutoff frequency is given to minimize distortion of the solution of the differential equation and maximize the smoothing of the oscillation terms. When the excitation signal is stochastic, we obviously cannot reduce the complexity of the differential equation (i.e., g(t) may not adequately represent the total characteristics, and γ(t) may also be required to describe the random fluctuations adequately).

Under certain conditions, it is possible to assume that the excitation function û(t) is a Gaussian random process with expected value 0 (E{û} = 0) and with independent increments and a uniform energy-versus-frequency distribution (white noise). The differential operator described by Eqn. (II-A-10) will then specify an autocorrelation function for x̂(t). Kelly and Reed show that the envelope and phase functions for x̂(t) and their derivatives have the following probability densities for each fixed t when x̂(t) is a stationary process:

p(a, \dot{a}, \alpha, \omega) = p(a)\, p(\dot{a})\, p(\alpha)\, p(\omega/a)        (II-A-14)

where

p(a):  Rayleigh, with E{x²} = σ²;
p(ȧ):  normal with mean 0 and variance B_x²σ²;
p(α):  uniform between 0 and 2π;
p(ω/a): normal with mean ω̄ = E{ω} and variance B_x²σ²/a².

This indicates that the angle, envelope, and envelope derivative are statistically independent for each t (independent random variables). Thus, no information is lost by removing the oscillatory terms in Eqns. (II-A-12) and (II-A-13).
For bandpass spectral densities (like those we are considering), where the energy is concentrated in a range Δω about ω̄, the envelope and phase function energy distributions are concentrated in a similar range about ω = 0 (Davenport and Root). Also, the uniform distribution of the phase contains no parameters of the generating equations. Abramson has defined B_x² for stochastic processes as the mean square bandwidth. For ergodic stationary processes it is equal to the effective bandwidth, BW², given by Eqn. (II-A-6), which is applicable to deterministic processes. Thus the instantaneous bandwidth function b_x(t) is related to bandwidth measurements for deterministic and stochastic (stationary) processes. Further, for second-order differential operators, Eqn. (II-A-10), all the parameters of the process can be determined from first-order probability distributions (cf. Eqn. II-A-14). It is not necessary to estimate autocorrelation functions or spectral relationships between b_x(t) and the parameters of the differential operator (Eqn. II-A-10). Since this operator determines the autocorrelation function, these remarks apply to nonstationary processes also.

Many speech sounds can be modeled by stochastic processes with stationary autocorrelation functions (giving time-invariant spectral densities). However, the short duration and low relative energy of these sounds do not allow a "steady-state" spectral density approach. Thus we must consider transient responses. In the next section we will discuss the problems of using spectral estimation techniques and the transient response of linear systems to envelope and frequency changes.

II-B Sliding Fourier Series

The recent development of the Cooley-Tukey algorithm for fast digital computation of Fourier series coefficients has caused much interest in Fourier frequency analysis. Modern communication literature uses "Fourier analysis" to refer to a particular use of any set of orthogonal functions to approximate a given signal in the following form:

f(t) \approx \sum_{k \ge 0} a_k\, \varphi_k(t)        (II-B-1)

where the set of functions \{\varphi_k(t)\}_{k \ge 0} is such that, for some interval of time [a, b] and some weight function h(t) (the definition of orthogonal functions),

\int_a^b h(t)\, \varphi_n(t)\, \varphi_m(t)\, dt = c_n\, \delta_{nm} \qquad (h(t) > 0)        (II-B-2)

where δ_nm = 1 for n = m and 0 otherwise; the a_k's are constants. For any N, and for any given finite-energy function f(t), the integral weighted squared error defined by

\int_a^b h(t)\, \Big| f(t) - \sum_{k=0}^{N} a_k\, \varphi_k(t) \Big|^2 dt

is minimized by the constants

a_k = \frac{1}{c_k} \int_a^b h(t)\, f(t)\, \varphi_k(t)\, dt .        (II-B-3)

The most popular orthogonal set is the set of trigonometric functions, with h(t) = 1 over [a, b]. However, the trigonometric functions have finite energy only over finite intervals. Therefore, the class of functions we can represent by Eqn. (II-B-1) with trigonometric functions must be nonzero only on a finite interval. A finite-energy representation over an infinite interval is achieved by defining the truncated time function

f_T(t) \triangleq f(t), \quad -T/2 \le t \le T/2; \qquad 0 \text{ otherwise}        (II-B-4)

and then repeating f_T(t) every T seconds.* A Fourier series of the form of Eqn. (II-B-1) can be used, with

\varphi_{2k}(t) = \cos k\omega_0 t, \qquad \varphi_{2k-1}(t) = \sin k\omega_0 t, \qquad \omega_0 \triangleq 2\pi/T .

* This representation is a good approximation only over the interval [-T/2, T/2].
Some of the properties of the finite Fourier series are:

(1)  a_{2k} = \mathrm{Re} \int_{-T/2}^{T/2} f(t)\, e^{jk\omega_0 t}\, dt, \qquad a_{2k-1} = -\mathrm{Im} \int_{-T/2}^{T/2} f(t)\, e^{jk\omega_0 t}\, dt

(2)  f(t) \approx \sum_{k \ge 0} \left\{ a_{2k} \cos(k\omega_0 t) + a_{2k-1} \sin(k\omega_0 t) \right\} = \sum_{k \ge 0} \mathrm{Re}\left\{ c_k\, e^{j(k\omega_0 t + \varphi_k)} \right\}

where c_k^2 = a_{2k}^2 + a_{2k-1}^2 and \varphi_k = \arctan\{ a_{2k-1}/a_{2k} \}.

Notice in property (2) the resemblance to the form for analytic signals. The analytic signal corresponding to this series is*

\hat{f}(t) \approx \sum_k c_k\, e^{j(k\omega_0 t + \varphi_k)}        (II-B-5)

* To put the series in true analytic form, Baghdady considers each term as a phasor and defines the amplitude and phase function for the resulting phasor sum, a construction that may have some intuitive appeal but is no help at all computationally.

Now, consider some implications of these properties for time-varying signals, especially signals with varying frequency. Looking at Property (2) again, the series is a sum of cosine functions with constant amplitude and constant phase. Guillemin states that the approximation of arbitrary functions by this type of series is due to constructive (and destructive) interference between sinusoidal functions of different frequencies. The natural association of the Fourier coefficients with a frequency distribution (analogous to Laplace and Fourier transform theory) causes some problems due to the interference phenomena. Figure 7 shows a particular waveform y(t), a tone burst of period T_t and duration T_b within a finite analysis interval T_a (assume T_a = 3T_b); its spectrum is a sin x/x-type function of (f - f_t)T_b concentrated about f_t = 1/T_t,

[Figure 7: Short transient phenomenon which is difficult to analyze with Fourier series; waveform plot not reproduced.]

and this indicates that Fourier coefficients computed over [0, T], for T on the order of T_a, would be significantly nonzero for several values of k other than k_0 = T/T_t. The nonzero coefficients are necessary to cancel out c_{k_0} \cos\{k_0\omega_0 t + \varphi_{k_0}\} over the portion of [0, T] where y(t) is zero. The distribution of energy among the c_k's is misleading to an intuitive concept of frequency associated with y(t).

A remedy that has been suggested for these problems is to make T smaller (less than T_b/2) and compute a sliding Fourier series (i.e., starting the computation at increasing times). The resulting computations can be interpreted as "time-varying" c_k's and φ_k's. (However, this approach adversely affects the computational savings of the Cooley-Tukey method.) We may then ask if a representation of the form

\hat{x}(t) \approx \sum_{k \ge 0} c_k(t)\, e^{j\varphi_k(t)}        (II-B-6)

would combine the properties of the analytic function and the Fourier series. We can get some insight into the behavior of this series in the case when φ_k(t) = ω_k t + θ_k. The Fourier transform of f(t) in that case is

X(j\omega) \approx 2\pi \sum_{k \ge 0} e^{j\theta_k}\, C_k(\omega - \omega_k), \qquad \omega > 0        (II-B-7)

where C_k(ω) is the Fourier transform of c_k(t). Thus, the convolution sum in Eqn. (II-B-7) has smeared all the c_k(t) functions together.

An example of a set of c_k(t)'s results from the "sliding" definition of Fourier coefficients,

c_k(t) = \int_{-\infty}^{t} \hat{x}(\sigma)\, \psi_k(t - \sigma)\, d\sigma        (II-B-8a)

where ψ_k(σ) is one of a set of orthogonal functions and the "duration" (non-zero time interval, or effective time width) of ψ_k is much less than that of x̂. In particular,

c_k(t) = \int_{t-T}^{t} \hat{x}(\sigma)\, e^{j\omega_k (t - \sigma)}\, d\sigma = e^{j\omega_k t} \int_{t-T}^{t} \hat{x}(\sigma)\, e^{-j\omega_k \sigma}\, d\sigma        (II-B-8b)

We see that the calculation of sliding Fourier trigonometric coefficients can be interpreted as the output of the linear filter with input x̂(t) and impulse response

h(t) = e^{j\omega_k t}, \quad 0 \le t \le T; \qquad 0 \text{ otherwise}        (II-B-9)

We might ask how c_k(t) would look for various situations, especially for time-varying frequencies (as in speech formants, FM modulation systems, etc.). To answer that question precisely, we must develop some methods of looking at the response of linear filters to a general class of inputs. Before developing such a method, we might suggest what the c_k(t)'s should display. Suppose the input x̂(t) is a constant-amplitude sine function with a linearly varying frequency ω_i(t), with ω_i(t_k) = ω_k, k = 1, 2, 3, 4.* Then each c_k(t) corresponds to a frequency ω_k, k = 1, ..., 4, and should ideally look like Figure 8. In the next section we show that this is possible only with restrictions which are too severe for the class of speech acoustical signals.

* We denote an instantaneous frequency function by ω_i(t) when it may be confused with values of frequency.

[Figure 8: Idealized Fourier coefficient response to a varying input frequency; each |c_k(t)| peaks near the time t_k at which ω_i(t) passes through ω_k. Plot not reproduced.]
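The filter interpretation of Eqns. (II-B-8b) and (II-B-9) can be verified directly. The Python sketch below convolves a tone with the complex impulse response of one coefficient filter; the 10-kHz sampling rate and 20-ms period are assumptions, and the real test signal stands in for the analytic input of the text.

    import numpy as np

    fs = 10_000
    T = 0.020                                  # 20-ms computation period
    N = int(T * fs)
    w_k = 2 * np.pi * 3 / T                    # coefficient k = 3, omega_k = 2*pi*k/T

    t = np.arange(0, 0.1, 1 / fs)
    x = np.cos(2 * np.pi * 150 * t)            # input tone at k/T = 150 Hz

    # impulse response of Eqn. (II-B-9): h(t) = exp(j w_k t) on [0, T]
    h = np.exp(1j * w_k * np.arange(N) / fs) / fs   # 1/fs approximates dt
    c_k = np.convolve(x, h)[:len(x)]           # sliding coefficient c_k(t)

    print(f"|c_k| settles near {np.abs(c_k[N:]).mean():.3f} (T/2 = {T/2})")

After the filter fills with the matched tone, |c_k(t)| holds at T/2, the steady value predicted by evaluating Eqn. (II-B-8b) for a stationary sinusoid.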
II-C Response of Linear Filters to Analytic Signals

When the inputs to a linear filter (used to separate different formants in speech signals, say) contain amplitude and frequency derivatives of significant magnitude, the usual transform-superposition method of analysis becomes unwieldy, especially in determining the transient response. Baghdady, Leon and Weiner, and Cannon have suggested a different approach to this problem; they use the analytic signal and convolution integral to show the nature of the output of a linear filter in a more enlightening manner. Their approach is a generalization of standard sinusoidal analysis using Fourier series.

If the input to a filter is a sinusoid that starts at t = 0,

x(t) = a\, e^{j\omega_0 t}, \qquad t \ge 0        (II-C-1a)

and the filter has a rational Fourier transform H(jω) with simple poles at the points s = s₁, s₂, ..., s_n, then the output of the filter is

o(t) = a\, H(j\omega_0)\, e^{j\omega_0 t} + a \sum_{k=1}^{n} A_k\, e^{s_k t}        (II-C-1b)

with

A_k = \frac{(s - s_k)\, H(s)}{s - j\omega_0} \bigg|_{s = s_k}

The first term in Eqn. (II-C-1b) is the steady-state or stationary solution, and the second is the transient term. The stationary solution is simply the input multiplied by the Fourier transform of the filter evaluated at the input frequency. When the input has a time-varying amplitude and/or frequency, the form of Eqn. (II-C-1b) is duplicated by

o(t) = a(t)\, e^{j\alpha(t)}\, H(j\omega(t)) + \epsilon        (II-C-1c)

where o(t) is the output of the filter, a(t)e^{jα(t)} is the analytic-signal form of the input, H(·) is the complex Fourier transform of the filter impulse response, ω(t) is the instantaneous frequency of the input, and ε is the transient or distortion term. The first term, called the quasi-stationary term, is merely a complex number times the input, giving an amplitude and phase change. Thus, the idea of "frequency selection" by filtering has a definite meaning when ε is small compared to the quasi-stationary term. The transient or distortion term results from the filter's attempt to "follow" the changing input. Baghdady (and others) have bounded the distortion term and restricted the set of inputs to satisfy the bound in order to use the quasi-stationary term as an approximation to the output of the filter.
The class of linear filters was limited in these studies to those described by rational functions of the frequency variable. For the representation problems we are considering, this class of filters is not general enough (a "Fourier coefficient" filter is not of that type), nor do we have control over the class of inputs in the same manner. We will find the following definitions notationally (and possibly intuitively) convenient. The Fourier transform pair for a real function h(t) is

H(j\gamma) = \int_{-\infty}^{\infty} h(t)\, e^{-j\omega t}\, dt        (II-C-2a)

h(t) = \int_{-\infty}^{\infty} H(j\gamma)\, e^{j\omega t}\, d\gamma, \qquad \omega = 2\pi\gamma        (II-C-2b)

Baghdady, Leon, and Cannon now define the quasi-stationary response of the filter as (for input instantaneous frequency ω_i(t))

H(j\omega_i(t)) \triangleq \int_{-\infty}^{\infty} h(\sigma)\, e^{-j\omega_i(t)\sigma}\, d\sigma        (II-C-3)

However, this is not a precise definition of a filter response to the instantaneous frequency unless the frequency changes slowly. Assume that h(t) is nonzero only over a finite interval [0, T_h]. Then ω_i(t + σ) for 0 ≤ σ ≤ T_h is given (for ω_i analytic in [0, T_h]) by*

\omega_i(t + \sigma) = \omega_i(t) + \dot{\omega}_i(t)\,\sigma + \sum_{k \ge 2} \frac{\sigma^k}{k!}\, \omega_i^{(k)}(t)

and so a more exact definition results by using ω_i(t + σ):

H(j\omega_i(t)) \triangleq \int_0^{T_h} h(\sigma)\, \exp\left\{ -j \left[ \omega_i(t)\,\sigma + \sum_{k \ge 1} \frac{\sigma^{k+1}}{(k+1)!}\, \omega_i^{(k)}(t) \right] \right\} d\sigma        (II-C-3')

This definition has drawbacks for situations with significant frequency derivatives, although it is more accurate than Eqn. (II-C-3). Of course, the two definitions are compatible if T_h\,\dot{\omega}_i(t) \ll \omega_i(t).

* We use the notation ω̇ for the first derivative of ω with respect to its dependent variable and ω^{(k)} for higher derivatives.

Our approach will be to use Eqn. (II-C-3) as a definition, but with a generalized frequency term; i.e.,

H(j\psi(t, t_0)) \triangleq \int_0^{T_h} h(\sigma)\, e^{-j\psi(t, t_0)\sigma}\, d\sigma        (II-C-4)

where

\psi(t, t_0) \triangleq f(\omega_i(t + t_0)), \qquad 0 \le t_0 \le T_h, \quad t_0 \text{ fixed.}

We can illustrate by an example. The Fourier transformation of Eqn. (II-B-9) is

H_k(j\omega) = \int_0^{T_h} e^{j\omega_k \sigma}\, e^{-j\omega\sigma}\, d\sigma        (II-C-8a)

\lim_{\epsilon \to 0} \int_0^{\epsilon} h(\sigma)\, e^{-j\omega\sigma}\, d\sigma = 0        (II-C-8b)

Equation (II-C-8a) is realistic, since most digital-computer applications require this truncation. Equation (II-C-8b) simplifies the exposition by not allowing terms of the form K_m δ(t) in h(t). Writing the input as x̂(t) = a(t)e^{jΩ(t)} and using integration by parts (the boundary term at σ = 0 vanishing by the assumption in Eqn. (II-C-8b)), the output

o(t) = \int_0^{T_h} a(t - \sigma)\, e^{j\Omega(t - \sigma)}\, h(\sigma)\, d\sigma

becomes

o(t) = o_q(t) + o_d(t)        (II-C-9)

with

o_q(t) = a(t - T_h)\, e^{j[\Omega(t - T_h) + \psi T_h]}\, H(j\psi)

o_d(t) = \int_0^{T_h} \left[ b_x(t - \sigma) - j\psi \right] \hat{x}(t - \sigma)\, \left[ h(\sigma) * e^{j\psi\sigma} \right] d\sigma .

We denote by o_q(t) the quasi-stationary portion of the output transient response and by o_d(t) the distortion term. The quasi-stationary term shows, explicitly, that the output is delayed from the input by an amount on the order of the interval over which h(t) ≠ 0. A reference different from the one commonly used minimizes phase distortions occurring in o_d(t) compared to use of the usual reference, t. The distortion-term integrand is the prebandwidth function of the input times [h(σ) * e^{jψσ}], a transient response term for the filter. For exponential filter functions (resulting from rational transfer functions), this term is

h(\sigma) * e^{j\psi\sigma} = \frac{e^{j\psi\sigma} - e^{s_k \sigma}}{j\psi - s_k}, \qquad h(\sigma) = e^{s_k \sigma},        (II-C-10)

which corresponds to one factor in the distortion term in Cannon and Duncan's result when ψ is the instantaneous frequency.
The interpretation of H(jψ, σ) as a transient response (Eqn. II-C-6) shows us that the distortion term is a weighted average of the filter's ability to track frequency and amplitude changes. The term ψ(t, t₀) is indicative, also, of the precautions necessary in interpreting the response. That is, for

\psi(t, t_0) = \omega_i(t) + t_0\, \dot{\omega}_i(t),

we have a "pseudo-frequency," t₀ω̇_i(t), biasing the instantaneous frequency ω_i(t). An attempt to include this bias in the distortion terms complicates the result tremendously. H(·) evaluated at the biased instantaneous frequency is actually the predominant output when t₀ω̇_i(t) is significant (see the following example).

We could ask whether t₀ω̇_i(t) is ever significant in the class of signals we wish to represent. Figure 2 (in Sec. I-D) shows a typical formant frequency transition from samples of the spoken word "rudder." This frequency transition has been inferred from a sonograph display. The range (over several speakers) of the frequency derivative ω̇_i(t) is from 5,000 to 15,000 Hz/sec, or 5 to 15 Hz/msec. So computation times on the order of 20 to 30 ms can have biases of 100 to 450 Hz. Let us take an idealized "formant transition" of the form

\hat{x}(t) = e^{j\varphi(t)}        (II-C-11)

where

\dot{\varphi}(t)/2\pi = 2000 \text{ Hz}, \qquad 0 \le t \le 0.020;
\dot{\varphi}(t)/2\pi = 2000 - 300\left[ 3\tau^2 - 2\tau^3 \right] \text{ Hz}, \quad \tau = (t - 0.020)/0.030, \qquad 0.020 \le t \le 0.050;
\dot{\varphi}(t)/2\pi = 1700 \text{ Hz}, \qquad 0.050 \le t \le 0.070.

φ(t) gives a cubic transition from 2000 Hz to 1700 Hz with a maximum second derivative of 10,000 Hz/sec (see Figure 9a). Figure 9 compares the magnitude of the actual output o(t) with the magnitude of the quasi-stationary term for five Fourier coefficient filters with a 20-ms computing period. Also shown is a curve of the envelope maxima across the five filters. Figure 9a shows the quasi-stationary term evaluated at the input instantaneous frequency, ψ = φ̇(t). Figure 9b shows the quasi-stationary term evaluated at a biased instantaneous frequency,

\psi = \omega_i(t) + (T_h/2)\, \dot{\omega}_i(t), \qquad T_h = 20 \text{ ms}        (II-C-12)

As is seen, this biased term gives a good correspondence between the quasi-stationary envelope maxima and the actual output envelope maxima. (Note that this delay distortion is not due to nonlinear delay-versus-frequency characteristics.)

The implications of this analysis for the signals we are considering are obvious. Sliding Fourier spectra with computation periods on the order of 20 ms cannot adequately show frequency changes in the input without bias.

[Figure 9 (a, b): Magnitude of Fourier coefficient outputs for a time-varying sinusoidal input frequency, comparing the actual output envelopes with the quasi-stationary term evaluated at (a) the instantaneous frequency of the input and (b) the biased instantaneous frequency of Eqn. (II-C-12); plots not reproduced.]

[Figures 10 through 12, including the time-varying filter scheme of Figure 12 with classification, mixing, variable delay, and stored transfer function, are not reproduced.]
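The bias can be reproduced numerically. The following Python sketch synthesizes the idealized transition of Eqn. (II-C-11), passes it through several 20-ms Fourier coefficient filters of the form (II-B-9), and prints the coefficient magnitudes mid-transition together with the bias term of Eqn. (II-C-12); the 20-kHz sampling rate and the particular filter center frequencies are assumptions for illustration.

    import numpy as np

    fs = 20_000
    Th = 0.020                                   # 20-ms computation period
    N = int(Th * fs)

    # frequency law of Eqn. (II-C-11): cubic glide 2000 -> 1700 Hz
    t = np.arange(0, 0.070, 1 / fs)
    f = np.full_like(t, 2000.0)
    seg = (t >= 0.020) & (t <= 0.050)
    tau = (t[seg] - 0.020) / 0.030
    f[seg] = 2000.0 - 300.0 * (3 * tau**2 - 2 * tau**3)
    f[t > 0.050] = 1700.0
    x = np.exp(1j * 2 * np.pi * np.cumsum(f) / fs)

    def ck_mag(fk):
        """|c_k(t)| for the Fourier coefficient filter of Eqn. (II-B-9)."""
        h = np.exp(1j * 2 * np.pi * fk * np.arange(N) / fs) / fs
        return np.abs(np.convolve(x, h))[:len(x)]

    i = np.searchsorted(t, 0.035)                # mid-transition sample
    fdot = np.gradient(f, 1 / fs)
    print(f"true f_i(t) = {f[i]:.0f} Hz, "
          f"predicted bias (Th/2)|f_i'| = {Th / 2 * abs(fdot[i]):.0f} Hz")
    for fk in (1700, 1800, 1850, 1900, 2000):
        print(f"  filter at {fk} Hz: |c_k| = {ck_mag(fk)[i]:.4f}")

Mid-transition the largest |c_k| does not occur at the filter nearest the true instantaneous frequency, which is the effect summarized above.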
The filter can then be specified using standard Laplace transform techniques, where the dependent complex variable of the transfer function is the difference between the mixing signal's complex "frequency" and that of the input. The estimate is improved by a feedback loop. The delay distortion caused by frequency and amplitude changes is estimated and then corrected by a variable ideal delay. Equation (II-C-9) can be used to analyze the feedback loop, but it can also provide a synthesis procedure for a digital algorithm which significantly reduces the computations necessary to implement the scheme shown in Figure 12. Assuming that b_x(t) is given, the majority of the computations are required to implement the filtering (mixing and delay require one operation each per point of time).

There are two types of digital filter algorithms, transversal and recursive. Transversal filters compute an output value from delayed input values and are basically discrete convolutions (or correlations) of the form

o_k = \sum_{j=0}^{N-1} c_j\, i_{k-j}        (II-C-13)

The number of operations (one addition and one multiplication) per point of time is N.* Recursive filters compute an output value from delayed input and output values. The algorithm is derived from the z-transform of the filter time function:

\frac{O(z^{-1})}{I(z^{-1})} = \frac{P(z^{-1})}{Q(z^{-1})} = \frac{a_0 + a_1 z^{-1} + \cdots + a_m z^{-m}}{1 + b_1 z^{-1} + \cdots + b_n z^{-n}}

o_k = \sum_{j=0}^{m} a_j\, i_{k-j} - \sum_{j=1}^{n} b_j\, o_{k-j}        (II-C-14)

where z^{-1} = e^{-s\Delta} is an ideal delay of time Δ, m is the number of zeroes, and n is the number of poles. The number of operations per time point is m + n.

* The Cooley-Tukey algorithm for computing Fourier coefficients is of this form and for this special case requires only log_r N operations per point, where r is the greatest divisor of N.

We can use the quasi-stationary term from Eqn. (II-C-9) to approximate the filter operation in one operation per time point. The prefiltering classification and estimation of b_x(t), along with feedback correction, allow this approximation to yield precise frequency tracking (the amplitude distortion is not relevant). The appropriate (narrowband) filter characteristics are stored by means of the complex transfer function H(·). The value of the input at each time instant is multiplied by the value of this function at the estimated bias frequency. This method combines the relatively low number of operations of the recursive filter with a desirable feature of the transversal filter: its ability to change the filter coefficients. If this is done with a recursive filter, an additional transient distortion is introduced. Thus we can achieve an approximate time-varying digital transfer function with a low number of operations, given an estimate of b_x(t) and a classification of the input.
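The operation counts can be seen in a direct implementation. The Python sketch below codes the transversal form (II-C-13) and the recursive form (II-C-14) naively; the coefficients and the random test signal are arbitrary illustrations, not filters from this study.

    import numpy as np

    def transversal(c, x):
        """Eqn. (II-C-13): o_k = sum_j c_j i_{k-j}; N operations per point."""
        o = np.zeros(len(x))
        for k in range(len(x)):
            for j in range(len(c)):
                if k - j >= 0:
                    o[k] += c[j] * x[k - j]
        return o

    def recursive(a, b, x):
        """Eqn. (II-C-14): o_k = sum_j a_j i_{k-j} - sum_j b_j o_{k-j};
        m + n operations per point."""
        o = np.zeros(len(x))
        for k in range(len(x)):
            for j, aj in enumerate(a):
                if k - j >= 0:
                    o[k] += aj * x[k - j]
            for j, bj in enumerate(b, start=1):
                if k - j >= 0:
                    o[k] -= bj * o[k - j]
        return o

    x = np.random.default_rng(1).normal(size=32)
    print(transversal(np.ones(4) / 4, x)[:4])   # 4 operations per point
    print(recursive([0.25], [-0.75], x)[:4])    # 1 + 1 operations per point

The quasi-stationary method described above replaces both inner loops with a single complex multiplication per point, at the price of requiring an estimate of b_x(t) and a classification of the input.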
rise—time) for linear-phase filters by the relation BT .s 1 (II-C-l3) where B = J A(w) dw/A(o) O {.00 T = J h(s) ds/h(o) as .8T h 0 f m -'wt A(w) = ‘kJ h(t)e 3 dt ‘00 B gives a measure of bandwidth that is approximately equal to the half power and effective bandwidths (for filters with very sharp rolloff like those we are using, this approximation is better). T is a measure of * Defined in Section I-D, p. 23. 83 mHZmeEDOmm >OZmDOmmd m0m<4 m0”. mFZmEmEDOmm Il_.0_>>DZ>OZ095 m:.:. 2. v_Z0 QZOZwDmeu 84 rise time, usually between the 10 percent and 90 percent points on a step—response envelope curve. Figure 13 Shows that our choice of bandwidths (Sec. I-D) is adequate in view of the inference from 33 Flannagan's data that a just noticeable difference in frequency for human experiments ranges from approximately 5 percent at 1000 Hz to approximately 3 percent at 2000 Hz. The data for this experiment results from individual variation of the first and second formant frequencies in a four—formant synthesized vowel. In the next section we look at the outputs of such a filter bank and attempt to segment the speech signal into homogeneous epochs with center frequency and bandwidth as parameters. II-D ESTIMATION AND SEGMENTATION OF INSTANTANEOUS SIGNAL PARAMETERS The preceding section demonstrates how complex acoustical signals, such as those encountered in speech analysis, are represented most appro— priately by instantaneous time functions related to the envelope, instan— taneous frequency, and pre—bandwidth function. Differential equations for these functions have been derived for a single isolated formant. The bandpass pre—filtering that we have specified in Appendix B attempts to isolate formants. However, the inadequacies of fixed—frequency bandpass filters and the presence of inherent background noise in any realistic environment indicate that these differential equations will not be an exact representation. Therefore, a general form for these differential equations that can be expected to describe the signal parameters as seen on the outputs of our bandpass filters is more appropriate. In the following, we will denote the ratio a/a as br (the real part of bx)' l r r i— b = f (b . w, g/a, v, n , t) (II-D—la) dt .t 1 1w r i7 : 1‘,)(b , w, g/a, ‘Y. T1,. 0 (II-D 1b) where fl and [O are nonlinear time—varying functions for the derivatives of the state variables. TR and WE are stochastic processes which represent the unwanted Signals and other noise. The classical theorems on "best” estimators deal with asymptotic properties as the number of samples becomes large. These results are of 85 II-D ESTIMATION AND SEGMENTATION OF INSTANTANEOUS SIGNAL PARAMETERS The preceding section demonstrates how complex acoustical signals, such as those encountered in Speech analysis, are represented most appro— priately by instantaneous time functions related to the envelope, instan— taneous frequency, and pre—bandwidth function. Differential equations for these functions have been derived for a single isolated formant. The bandpass pre-filtering that we have specified in Appendix B attempts to isolate formants. However, the inadequacies of fixed—frequency bandpass filters and the presence of inherent background noise in any realistic environment indicate that these differential equations will not be an exact representation. Therefore, a general form for these differential equations that can be expected to describe the signal parameters as seen on the outputs of our bandpass filters is more appropriate. 
II-D ESTIMATION AND SEGMENTATION OF INSTANTANEOUS SIGNAL PARAMETERS

The preceding section demonstrates how complex acoustical signals, such as those encountered in speech analysis, are represented most appropriately by instantaneous time functions related to the envelope, instantaneous frequency, and prebandwidth function. Differential equations for these functions have been derived for a single isolated formant. The bandpass prefiltering that we have specified in Appendix B attempts to isolate formants. However, the inadequacies of fixed-frequency bandpass filters and the presence of inherent background noise in any realistic environment indicate that these differential equations will not be an exact representation. Therefore, a general form for these differential equations that can be expected to describe the signal parameters as seen on the outputs of our bandpass filters is more appropriate. In the following, we will denote the ratio ȧ/a as b^r (the real part of b_x):

\frac{d}{dt} b^r = f_1(b^r, \omega, g/a, \gamma, \eta_1, t)        (II-D-1a)

\frac{d}{dt} \omega = f_2(b^r, \omega, g/a, \gamma, \eta_2, t)        (II-D-1b)

where f₁ and f₂ are nonlinear time-varying functions for the derivatives of the state variables, and η₁ and η₂ are stochastic processes which represent the unwanted signals and other noise.

The classical theorems on "best" estimators deal with asymptotic properties as the number of samples becomes large. These results are of little help in estimating instantaneous values. A multiple regression analysis would fit a polynomial of specified degree to the observations over a fixed interval. However, this method requires a priori knowledge that is not available (the maximum degree of the polynomial and a fixed interval for the fit) and much computation (usually a matrix inversion; Donahue). Thus, pointwise estimates are required.

For time-invariant differential operators with either stochastic or deterministic excitation, the two common parameters are the mean frequency (ω̄) and the bandwidth (BW²).* The mean frequency for analytic signals is well defined in terms of the spectrum X̂(ω); converting the frequency moment to the time domain,

\bar{\omega} = \frac{\int_{-\infty}^{\infty} \omega\, |\hat{X}(\omega)|^2\, d\omega}{\int_{-\infty}^{\infty} |\hat{X}(\omega)|^2\, d\omega} = \frac{\mathrm{Im} \int_0^{\infty} \hat{x}^*(t)\, \frac{d\hat{x}(t)}{dt}\, dt}{\int_0^{\infty} a^2(t)\, dt} ,

and noting that Im{x̂*(t) dx̂/dt} = a²(t)ω(t) (the term contributed by the step discontinuity at the origin vanishes for a well-behaved finite-energy signal, since a²(∞) = 0), gives a formula in terms of the time functions:

\bar{\omega} = \int_0^{\infty} a^2(t)\, \omega(t)\, dt \Big/ \int_0^{\infty} a^2(t)\, dt        (II-D-2a)

* Because the process is ergodic we will use time averages rather than expectations.

The effective bandwidth can be converted to a similar form (from Eqn. (II-A-7)):

\mathrm{BW}^2 = \frac{\int_0^{\infty} |b_{x_s}(t)|^2\, a^2(t)\, dt}{\int_0^{\infty} a^2(t)\, dt} = \frac{\int_0^{\infty} (b^r)^2(t)\, a^2(t)\, dt}{\int_0^{\infty} a^2(t)\, dt} + \frac{\int_0^{\infty} (\omega(t) - \bar{\omega})^2\, a^2(t)\, dt}{\int_0^{\infty} a^2(t)\, dt}        (II-D-2b)

Thus for constant-coefficient operators we have weighted time-average formulas for intuitive parameters. For time-varying operators, we are not so fortunate. In order to derive formulas we need an assumption that is often true for physical systems. We call a process locally ergodic if we may reasonably approximate ensemble averages by sliding time averages, i.e.,

E\{c(t)\} \approx \frac{1}{T} \int_{t-T}^{t} c(\sigma)\, d\sigma        (II-D-3)

Basically the assumption is that the time behavior of the parameter c(t) is "smooth" with respect to the statistical variations. This procedure is incorporated in many engineering systems, and we are merely recognizing this often-invoked assumption explicitly. The determination of T is a key to this approach and depends on the nature of the processes. We will discuss its choice later.

Equation (II-D-3) can now be rewritten to give averaging equations for time-varying operators:

\bar{\omega}(t) \approx \int_{t-T}^{t} a^2(\sigma)\, \omega(\sigma)\, d\sigma \Big/ \int_{t-T}^{t} a^2(\sigma)\, d\sigma        (II-D-4a)

\mathrm{BW}^2(t) \approx \bar{b}^2_{x_s}(t) = \frac{\int_{t-T}^{t} (b^r)^2(\sigma)\, a^2(\sigma)\, d\sigma}{\int_{t-T}^{t} a^2(\sigma)\, d\sigma} + \frac{\int_{t-T}^{t} (\omega(\sigma) - \bar{\omega}(\sigma))^2\, a^2(\sigma)\, d\sigma}{\int_{t-T}^{t} a^2(\sigma)\, d\sigma}        (II-D-4b)

Notice that (II-D-4b) gives a sliding time average of b_{x_s}(t), and hence the time-average BW(t) is denoted b̄_{x_s}(t). In Appendix E, the relationship between sliding standard deviations and derivatives is shown. To summarize the arguments in Appendix E, the best estimator for the envelope is derived from the Hilbert transform. The absolute-value estimator gives some distortion, primarily during epochs with changing frequency, but requires much less computation than the Hilbert transform estimator.

For real-time recognition of connected speech, the following estimation procedure (shown in Figure 14 and discussed in detail in Appendix E) is proposed. The output of a (wide) bandpass filter is passed to absolute-value envelope and zero-crossing frequency estimators. Lowpass filters then remove unwanted oscillations. In Appendix E, the best choice for the time constant of these filters (called the subinterval length) is shown to be on the order of 1 to 2 ms. A sliding mean and standard deviation is then computed on the output of each bandpass filter. This procedure has been chosen for its adaptability to real-time operation, its low-cost hardware implementation, and the minimal degradation of any further processing that may be required.*

[Figure 14: Proposed estimation procedure for time-varying signal parameters: bandpass filter output to absolute-value envelope and zero-crossing frequency estimators, lowpass (subinterval) smoothing, then sliding mean and standard deviation; block diagram not reproduced.]

* For instance, if frequency resolution must be increased, the sliding mean length can easily be increased, averaging the previous time series again. However, more frequency resolution cannot be gained by further averaging of the output of a sliding Fourier series computation; a new transform must be computed.
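The proposed pipeline is short enough to sketch end to end. In the Python fragment below, the sampling rate, the test tone, and the 130-Hz lowpass cutoff (roughly a 1.2-ms subinterval time constant) are illustrative assumptions; the structure follows the block order of Figure 14.

    import numpy as np
    from scipy.signal import butter, lfilter

    FS = 10_000                       # assumed sampling rate, Hz

    def lowpass(x, fc, fs=FS):
        """Subinterval smoothing (fc ~ 130 Hz approximates a 1.2-ms constant)."""
        b, a = butter(2, fc, btype='low', fs=fs)
        return lfilter(b, a, x)

    def sliding_stats(x, win):
        """Sliding mean and standard deviation over `win` samples."""
        k = np.ones(win) / win
        m = np.convolve(x, k, mode='same')
        v = np.convolve(x * x, k, mode='same') - m * m
        return m, np.sqrt(np.maximum(v, 0.0))

    # stand-in for one bandpass-filter output: a 1-kHz tone with a slow wobble
    t = np.arange(0, 0.3, 1 / FS)
    x = np.cos(2 * np.pi * (1000 * t + 5 * np.sin(2 * np.pi * 5 * t)))

    env = lowpass(np.abs(x), fc=130)                     # absolute-value envelope
    zc = np.abs(np.diff(np.signbit(x).astype(int), prepend=0))
    freq = lowpass(zc * FS / 2.0, fc=130)                # zero-crossing frequency
    m_f, s_f = sliding_stats(freq, win=int(0.010 * FS))  # 10-ms sliding statistics
    m_e, s_e = sliding_stats(env, win=int(0.010 * FS))
    mid = len(t) // 2
    print(f"frequency: mean ~ {m_f[mid]:.0f} Hz, std ~ {s_f[mid]:.1f} Hz")
    print(f"envelope:  mean ~ {m_e[mid]:.3f},   std ~ {s_e[mid]:.3f}")

Every stage is a fixed short convolution or recursion, which is what makes the procedure attractive for low-cost hardware and real-time operation.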
The absolute value estimator gives some distortion, primarily during epochs with changing frequency, but requires much less computation than the Wilbert transform estimator. For real—time recognition Of connected speech, the following estimation procedure (shown in Figure 14 and discussed in detail in Appendix E) is proposed. The output of a (wide) bandpass filter is passed to absolute value enve10pe and zero crossing frequency estimators. LOWpass filters then remove unwanted oscillations. In Appendix E, the best choice for the time constant of these filters (called subinterval length) is shown to be on the order Of l to 2 ms. A Sliding mean and standard deviation is then computed on the output of each bandpass filter. This procedure has been chosen for its adaptability to real—time operation, its low—cost hardware implementation, 89 ZO_._.<_>wQ QI_ 02.9.5 mmfih _ m mm>04 m4m<>Im:>_:. m0”. meOwQOmd 20:32me ZO_H<_>wO QI>04 3 mmDOE mmCHZm mm mquOmmd. Ex 90 and the minimal degradation Of any further processing that may be >k required. Pictures Of bandpass—filtered Speech acoustical signals indicate that fricatives such as [f] from [umbif] can be analyzed in a Similar fashion. The instantaneous envelope and frequency estimators (both real—time and derivative) retain the stochastic nature Of the bipolar signal (Figure 15). The subinterval length of 1.2 ms and sliding average length of 10 ms appears to be adequate for the frequency range shown (see Appendix C for filter bandwidths). Comparison of the bipolar signal (Figure 15a) and instantaneous estimators (Figure 15b) indicates that a narrowband assumption, which has been incorporated in the local ergodic assumption, is appropriate. Consideration Of many cases for different Speakers and utterances indicates that the zero crossing-absolute magnitude enve10pe repreSentation occasionally fails to represent bandpass—filtered Signals adequately. The primary case where an ambiguous representation arises is when two energy peaks occur in the same filter (for example, the case discussed in Chapter I and illustrated in Appendix D). In the filter of bandpass 577—1867 Hz (Module 6), a (relatively) strong energy peak continues at 750—800 Hz from 370 ms to 700 ms. At 430 ms, a second energy peak begins t0 "move" away from 902 Hz toward 1445 Hz. ApprOpriate choice Of filter band bandwidths could isolate these peaks; however, this approach would M * For instance, if frequency resolution must be increased, the sliding mean length can easily be increased, averaging the previous time series again. However, more frequency resolution cannot be gained by further aVeraging of the output Of a sliding Fourier series computation; a new tranSform must be computed. 91 H rm 9 EB m: 59 2. mom .30 man: .2205 $5935 530% o o g o o o c or of r-..'-.--‘ ....... t .......... r ........ r ......... h ........................................ m: N. mmw two: :mr m4: m: m. cow. éréiagigéegggég m: .m. 02. we: rm mi... m: m. m5 égfiéégé. a. ..éiéesisiézi m... mnwmmw ................................................ a on--.mmzws.fl:: ....... m. mmdwm n-.wm..~...8.m ................................................ m. m::-....m:.m.s.ml..-.--.--...m.p. was. §§§§§§§ S)2 Aemscficoov mJ_zw mtjommc as 80 .u wt: .u a 1w 2 wt: ... ~ In @— 8w g; II «I J (\Itfu‘ 33K m u.Em 0N5 .u rcr .f 4 n o . .- och we own. . on. I. S\‘ (K) @— mmDOE 35.3.2 ...0 cozatogn .3 U xicwna< ommv mx: ... M rm 0, 5 \l/ Xx I .1. 225:8 nmcozaew >m 8.2.555 m>C¢>Emm wand)?» 2 00¢ .u pf: .-mH 1m m“ an 02 .- mx: {w— Im 3 me o2. 
For our particular choice of filters (described in Appendix B), the zero crossing-absolute magnitude representation (Figure 17) follows the stronger low-frequency energy peak.

One method of isolating the more interesting high-frequency energy peak is to compute first the time derivative of the bandpass-filtered acoustical signal and then the zero crossing-absolute magnitude representation. Several factors recommend this approach. Cherry and Phillips indicate an increase from 65 to 92 percent intelligibility by using the derivative (hardware derived) of the wideband acoustical signal for their zero-crossing intelligibility studies. Thomas, referring to this increase, states that the pre-processing accentuates the second formant, which (he proposes) contains the significant linguistic information. For isolated formants, the increased intelligibility can be due to an emphasis of information-bearing parameters which are related to the bandwidth function (recall that dx̂(t)/dt = b_x(t) x̂(t)).
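The relation just quoted follows directly from writing the analytic signal in envelope-and-phase form; a short derivation in our notation (the thesis develops b_x(t) in Chapter II):

    x̂(t) = a(t) e^{iφ(t)}

    dx̂(t)/dt = [ ȧ(t)/a(t) + i φ̇(t) ] x̂(t) = b_x(t) x̂(t)

The real part of the bandwidth function is thus the relative envelope slope, and the imaginary part is the instantaneous frequency; differentiation weights the signal by b_x(t), emphasizing intervals of rapid amplitude or frequency change.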
In the wideband signal case, high frequencies are emphasized, as we see if we consider the transfer function of an ideal differentiator (it increases linearly with frequency). Most physical differentiators* are necessarily approximations and incorporate smoothing of high-frequency variation. A typical transfer function of this approximation is depicted in Figure 16, where lower frequencies are deemphasized (with respect to higher frequencies). Thus, Cherry and Phillips' results can be explained if Thomas' hypothesis is true. The resulting zero crossing-absolute amplitude representations are thus able to "capture" other energy peaks.

* In this study, a cubic interpolation is made between extrema of the bandpass-filtered speech signal, and this interpolation equation is differentiated.

[Figure 16: Smoothed differentiator transfer function; original plot not recoverable from the scan.]

The frequency estimate (upper left plot of Figure 17b) for Module 7 (derivative of the Module 6 filter) and utterance 16 BE 1 clearly shows the frequency transition that was difficult to find in the Fourier series (Appendix D), or in the zero crossing frequency estimate for the undifferentiated signal (upper left plot of Figure 17a).* The transition is depicted even in a situation where the bandpass filter selection was not appropriate (for this particular case). The form of the transition is what one might assign by eye to the sliding Fourier series in Appendix D, and it also looks very similar to the dynamic articulator (tongue) trajectories depicted by Houde for vowel-to-vowel transitions.

Figure 17 shows another feature of the absolute amplitude-zero crossing representation. The sliding standard deviation is plotted against the sliding mean for the absolute amplitude (lower right) and the zero crossings (upper right). In both cases (17a and 17b), the bivariate samples form a tight group during the first vowel segment (before 430 ms) and then cross a "bridge" toward a new group during the transition. The differentiated zero crossing case (Figure 17b, upper right) is the most dramatic. The two-dimensional plots can only approximate the actual four-dimensional situation, but it is still possible to recognize a coherent time behavior that is not apparent with standard preprocessing.

* The series of numbers, 0-9, indicates contiguous sample points simultaneously on all four plots. The "INDEX OF ZERO" gives the time of the starting zero in milliseconds.

[Figure 17a: Standard deviation versus mean for envelope (lower) and frequency (upper), bandpass-filtered speech signal, 233-1467 Hz, utterance 16 BE 1; see Appendix C for a description of the labeling. Figure 17b: the same plots for the derivative of the bandpass-filtered speech signal (Module 7); original plots not recoverable from the scan.]
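A sketch of the smoothed differentiator described in the footnote above: a cubic is fitted between successive extrema of the bandpass-filtered signal and the fitted polynomial is differentiated analytically. The extremum detection and the fallback for very short segments are our simplifications, not the thesis's exact implementation.

    import numpy as np

    def smoothed_derivative(x, fs):
        # Differentiate a bandpass signal by fitting a cubic between
        # successive extrema and differentiating the fitted polynomial.
        dx = np.diff(x)
        ext = np.where(np.diff(np.sign(dx)) != 0)[0] + 1   # local extrema
        ext = np.concatenate(([0], ext, [len(x) - 1]))
        y = np.zeros_like(x, dtype=float)
        t = np.arange(len(x)) / fs
        for a, b in zip(ext[:-1], ext[1:]):
            if b <= a:
                continue
            seg = slice(a, b + 1)
            if b - a + 1 < 4:
                # too few points for a cubic; simple finite difference
                y[seg] = np.gradient(x[seg], 1.0 / fs)
                continue
            c = np.polyfit(t[seg], x[seg], 3)              # cubic fit
            y[seg] = np.polyval(np.polyder(c), t[seg])     # its derivative
        return y

Because the fit spans each half cycle, the derivative is smoothed relative to sample-by-sample differencing, giving the deemphasis of low frequencies sketched in Figure 16.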
This dynamic behavior is further displayed by various estimators related to the bandwidth function. Two utterances are considered, the B-to-E vowel transition from [duath] [16 BE 1, Figure 17] and the utterance [umbif] [19 EH 1, Figure 15]. The following estimators are derived from Eqns. (D-2) and (D-4) (a sketch of their computation follows this discussion):

1. Real part: sliding standard deviation divided by sliding mean of the envelope
   a. For the real-time bandpass signal
   b. For the derivative of the bandpass signal
2. Imaginary part: sliding standard deviation of the zero crossing frequency
3. DER/ENV: sliding mean of the derivative envelope divided by sliding mean of the envelope.

These estimators are shown in Figures 15c and 18. Several points are evident from the figures.

1. Bandwidth function estimates all have a stable "nature" for certain epochs, with significant perturbations at the boundaries.
2. These epochs correspond (for some of the time series) to natural speech signal "groupings" (say, Reddy's phoneme classes).
3. The "nature" can be grossly defined in a consistent manner for strong deterministic signal groups (vowels), as opposed to weak stochastic (fricative) groups, by the deviations of the bandwidth function about an epoch mean value.
4. The bandwidth function is relatively normalized across rather large amplitude variations while still showing variation for different groups.

[Figure 18: Segmentation results shown with bandwidth function estimates; original plots not recoverable from the scan.]

These results indicate that the first step in a procedure for segmenting connected speech is to identify the points in the (filtered speech) signal where "fundamental changes" occur in the "nature" of the signal.
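The three bandwidth-related estimators listed above can be built directly on sliding_mean_std from the earlier sketch; here denv is the subinterval envelope computed on the differentiated signal, and the names and the small guard constant are ours.

    def bandwidth_estimators(env, freq, denv, win=8, eps=1e-12):
        # Estimators related to the bandwidth function (cf. Eqns. D-2, D-4).
        m_env, s_env = sliding_mean_std(env, win)
        m_freq, s_freq = sliding_mean_std(freq, win)
        m_denv, _ = sliding_mean_std(denv, win)
        real_part = s_env / (m_env + eps)   # 1. sliding sd / mean of envelope
        imag_part = s_freq                  # 2. sliding sd of ZC frequency
        der_env = m_denv / (m_env + eps)    # 3. mean derivative env / mean env
        return real_part, imag_part, der_env

The division by the sliding mean is what makes these estimates relatively insensitive to overall amplitude, as noted in point 4 above.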
The precise definition of the terms "fundamental change" and "nature" involves specification of a real-time clustering* algorithm and the four time series which give a dynamic representation of the signal. The changes and nature are relative to information we can derive from the particular signal we have at this time (thus the term real time). Since we are dealing with signals that are heterogeneous in nature, any general assignment of functional models to simplify the representation or reduce computations would surely cause higher error rates, at least part of the time (for further inferences based on the functional models, for instance), or ambiguous interpretations of derived measurements.

The clustering procedure is real time, operating on data as they arrive without requiring further passes through the data; it is self-normalizing and not dependent on a priori knowledge; it is conceptually simple (in terms of number of adjustable parameters); it requires little storage and few computations; and it gives a more revealing, stabilized (in terms of stochastic variability) dynamic representation of the original output, along with the marking of points of significant change.

* The procedure is termed "clustering" in order to relate a process for dynamic (differential equation generators) transient phenomena to the usual static data clustering techniques (ISODATA, Ball and Hall, 1967). A precise relationship between the static and dynamic clustering exists when one can choose a functional model for a set of differential equations and then estimate the parameters of this model. The set of all parameters would then, for a given time epoch, be one vector of the type that is discussed in static clustering procedures.

In the defined state space, the time trajectory of the differential equations varies about some mean value, and the clustering would define limits about that mean value which expand and contract, depending on the time-varying parameters of the differential equations. For the single formant model (and for other higher-ordered systems as well), a time-varying mean value and standard deviation can represent the time series state variable value and its first derivative. To show this variation, consider a normalized variable z at time n defined by

    z_n^j = (y_n^j - m_n^j) / s_n^j ,   j = 1, 2                    (II-D-5)

for the two time series (envelope, y^1, and frequency, y^2), and describe the variations in terms of the distribution of this error term. Inspection of this quantity indicates that normalization is performed by the division by the sliding standard deviation.

The segmentation procedure asks the question: is the differential model, defined by our two time series for each state variable, adequate to describe the variations in the input signal? For that reason, we will consider predictive instead of synchronous normalized variables. That is, instead of using values of z at time n, we will look at the distribution of the expected next value of z. If we write this out in a slightly different form, it becomes

    y_{n+1}^j = m_n^j + s_n^j z_{n+1}^j                              (II-D-6)

Defining the normalized predictive variable as the difference between the observed value and the mean value at each time n, divided by the sliding standard deviation,

    z_{n+1}^j = (y_{n+1}^j - m_n^j) / s_n^j ,   j = 1, 2             (II-D-7)

which is a discrete version of (II-D-1). The terms on the right side of (II-D-7) are functions of the state variables, excitation parameters, a stochastic term, and time (n ≥ 1).
These equations can occur in some classical estimation problem formulations:

(1) deterministic but unknown equation

    m_{n+1} = m_n + f(m_n)

(2) observation equation (sample function generation)

    y_n = x_{n+1} - x_n = g(m_n) + ξ_n f(m_n)

where the ξ_n are independent, identically distributed random variables for each n, independent of m_n and x_n at time n, with moments μ_1, μ_2, μ_3, μ_4, ...

(3) observation equation (ensemble generation)

    x_{n+1} = m_n + σ_n z_n ,   m_n = E_z{x_n} ,   σ_n^2 = E_z{(x_n - m_n)^2}

where z_n is a random variable with moments ν_1, ν_2, ν_3, ν_4, ...

The difference between (2) and (3) is primarily one's point of view (derivatives versus expected values of moments). The relation between them, which is empirically shown in Appendix E, can be derived by taking expected values of (2) and (3):

    E_ξ{x_{n+1} - x_n} = g(m_n) + f(m_n) μ_1
    E_z{σ_n z_n} = σ_n ν_1                                           (II-D-8)

Thus, the four time series for envelope and frequency and their derivatives adequately represent the bipolar bandpass signals, and the deviations from these time series can be exhibited in the normalized predictive variables defined by (II-D-6), where y_n^j = x_n^j are the subinterval averages, m_n^j is the sliding mean, and s_n^j is the sliding standard deviation for envelope (j = 1) and frequency (j = 2). Then the points where these four time series no longer represent the input signal can be determined by a statistical test based on the distribution of the normalized predictive variables. This distribution can be estimated by use of the samples up to time n:

    z_r^j = (x_r^j - m_{r-1}^j) / s_{r-1}^j ,   j = 1, 2 ;   r = 1, 2, ..., n-1   (II-D-9)

In general, these values will not be symmetrical about zero because of the nonlinearities. We need an estimation technique more powerful than the currently popular procedures which are based on the normal distribution. Because of the local ergodic assumption, there will be continuous changes in the parameters rather than "jumps" between two or more ranges of values. Thus, the distribution for each epoch will be unimodal (bimodal distributions will yield two epochs), and a modified t-test with an Edgeworth approximation to the distribution is appropriate.

The segmentation procedure, then, uses normalized predictive samples derived from the sliding mean and standard deviation time series to estimate four moments, the coefficients of an Edgeworth series. If the probability of occurrence (from the Edgeworth distribution) of the normalized values of envelope and frequency exceeds a predetermined threshold, then that sample is included in the present epoch. If the probability falls below this threshold, then the sample is declared a wildshot. This procedure is useful in identifying (and eliminating) data values of questionable use (such as parity errors, computational errors, and external impulse noise) which arise quite often in digital processing of acoustical signals.

The definition of a segment point is an extension of the concept of a wildshot. If the data continue to give low probability, it is quite natural to assume that their "nature" has changed and that a new epoch should be marked. This is controlled by two factors, the number of wildshots and the length of time within which this number of wildshots must occur. (For example, two wildshots during four time units may define a segment point.) Examples of the segment points resulting from a computer algorithm based on this procedure are shown in Figures 18 and 19.
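A condensed sketch of the wildshot and segment-point logic applied to one of the four time series. A plain Gaussian-style threshold on the normalized predictive sample stands in for the Edgeworth-series probability test of the text, and the parameter values simply echo the two-wildshots-in-four-units example above.

    def segment_points(x, win=8, z_thresh=3.0, max_wild=2, span=4):
        # x: 1-D NumPy array (one subinterval time series).
        # Mark a wildshot where the normalized predictive sample (II-D-9)
        # is improbable; declare a segment point when max_wild wildshots
        # fall within span consecutive samples.
        marks = []
        wild = []
        for n in range(win, len(x)):
            past = x[n - win:n]
            m = past.mean()
            s = past.std() + 1e-12
            z = (x[n] - m) / s                  # normalized predictive sample
            if abs(z) > z_thresh:               # low-probability sample: wildshot
                wild = [t for t in wild if n - t < span] + [n]
                if len(wild) >= max_wild:
                    marks.append(n)             # epoch boundary (segment point)
                    wild = []
        return marks

In the full procedure the test is applied jointly to the envelope and frequency series and their derivatives, and the threshold is raised temporarily after each marked point, as noted in the observations below.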
Figure 18 demonstrates dramatically that the procedure is most sensitive to changes in the bandwidth estimators and not in the envelope or frequency estimators. This is not a limitation, since simple thresholds can detect significant changes in these time series. Figure 19 shows how the fixing of locations for segment points depends on the choice of the criterion and threshold. There are several important observations related to the 7 regions depicted in Figure 19:

(1) Erroneous data (caused by a computation error) are detected and flagged (2)

(2) The values of threshold and segment criteria depend on the instantaneous nature of the signal (1, 3, 6, and 7)

(3) Immediately after a segment point, a higher threshold should be set to eliminate false alarms (e.g., the threshold might decay exponentially to the set value) (5, 6)

(4) The definition of homogeneity of the epochs is insensitive to amplitude variations even during highly transient behavior (4)

(5) Comparison with Figure 18 indicates that male/female differences do not affect the algorithm.

In summary, the acoustical speech signal is viewed as a composite nonstationary stochastic process, and the mathematics of communication theory is used formally to describe and discuss its complicated nature. One isolated formant is modelled by a time-varying differential operator with stochastic or deterministic driving functions. The parameters of this model are related to steady-state concepts of envelope, frequency, and bandwidth.

[Figure 19: Segmentation results for a female speaker; original plots not recoverable from the scan.]

...

... helpful. Many dialects of American English do not articulate the initial vowel in "before" precisely; it becomes /ə/ as in "buff" rather than /i/ as in "beef". The difference is distinctive for the pair "buff" and "beef" but evidently not in "before". At the upper interface of the phon- level, the coding would be the same for both dialectal variations of "before" (i.e., /bifr/) but would distinguish "buff" /bəf/ from "beef" /bif/.

The actual mechanics of this context unravelling involves two types of rules and the unit epoch (duration in time) of each stratum. The length of the morpheme corresponds (approximately) to the syllable, and the length of the phoneme corresponds (even more approximately) to the grapheme or Arabic letter of printed language. In line with our definition of the -on unit as an objective description of events, the morphon is the same length as the phoneme, and the phonon is smaller than the phoneme.

The two types of rules, called realization and composition, establish relations between these units (cf. Figure 22).

(1) Realization rules are the code for the -eme unit in terms of the smaller -on units. Conditioning by neighboring -on units is accounted for here (as in our example above).

(2) Composition rules are the code for transforming the -on unit of a higher level into the -eme unit of the lower level. Conditioning by virtue of belonging to a unit of the higher level (i.e., stress on a morph-length unit affects vowel phonemes) is accounted for here. Alternation caused by linguistic constraints is also accounted for here.

We can return to our example. Suppose we have the morpheme string (wife)-(Pl) to be encoded. The realization rules would code (wife) as /wajf/ and (Pl) as /s/, with the conditioning rules selecting the proper vowel after /w/.
The composition rules would change /wajf/ to /wajv/ because of the alternation caused by the plural (cf. Figure 23).

This example also shows another distinction between the morph- level and the phon- level mentioned above. Notice that the alternation of (wife) occurred only within the syllable. This is the restriction on rules on this stratum. The alternation of the plural morphon /s/ to /z/, because of the /v/ ending of the previous syllable, is performed in the phon- level, so the length of influence of each level's rules is bounded by that level's unit epoch.

[Figure 22: Relationship of units within a stratum (realization rules, composition rules, alternation, conditioning, lower strata). Figure 23: Examples of the two types of rules for the (wife)-(Pl) string; original diagrams not recoverable from the scan.]

...

... is intended to be a two-way model (Chomsky), for recognition as well as generation, but we find that this is not entirely true. Three problems arise:

(1) The lowest unit (closest to acoustical signals) of a generative phonology (Lamb's or any other) is still in terms of abstract quantized units that reflect economical encoding rather than good correspondence with features of the acoustical signal.

(2) This ideal sequence is still ambiguous (in general) unless the specific rule used at each point in the encoding is also known. In recognition situations we do not know the rules used until the correct sequence is known.

(3) Formal language representations only show redundant features in a secondary or "tacked on" fashion, for the same reasons of economy mentioned in Item 1. The more realistic situation that apparently operates in human communication will be discussed below.

The highly redundant nature of the correspondence between acoustic features and perceived sounds suggests a slightly different approach than looking for "primary and secondary features". Human speakers generally have individual language pronunciations, called idiolects; i.e., one person might find that a particular articulatory situation causes a noise burst of a specific center frequency with no modification of the following vowel, and his listener agrees that he "heard" a "b". Another speaker finds that precise modification of the following vowel with no specific noise burst elicits the same response. One cannot say that there is a primary feature here; the listener's responses to both speakers are equally positive.

We might call this property of the listener a dialectal generalization. Each person may learn a particular set of features that must be controlled precisely in order to communicate. The remainder of the features (redundant in Lamb's terms for this speaker) are not precisely controlled; thus they may vary considerably with respect to many speakers. This variation will occur above that caused by lack of speaker normalization (suggested by Thomas,19 Gerstman). Perceptual experiments with repeated words corroborate this conclusion, and the work of Rupert20 shows that this approach is needed for situations involving diverse speakers.

In the light of this discussion, we propose a model which avoids the deficiencies mentioned above. To overcome the first inadequacy, the same arguments that lead Lamb to a two-strata view of generative phonologies suggest a three-strata model of recognition phonology. This addition may also be useful in a generative phonology, as Lamb has suggested.
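To make the two rule types concrete, here is a toy rendering of the (wife)-(Pl) example above. The notation and the tiny rule inventory are purely illustrative assumptions, not Lamb's formal machinery.

    # Realization rules: code each -eme unit as a string of -on units.
    REALIZATION = {'wife': 'wajf', 'Pl': 's'}

    VOICED = set('aeiouvbdgzmnlrwj')

    def compose(morphemes):
        # Composition rules: the stem alternation (f -> v before the plural)
        # applies within the syllable on the morph- stratum; voicing the
        # plural /s/ -> /z/ after the voiced /v/ belongs to the phon- level.
        units = [REALIZATION[m] for m in morphemes]
        if 'Pl' in morphemes and units[0].endswith('f'):
            units[0] = units[0][:-1] + 'v'        # /wajf/ -> /wajv/
        if units[-1] == 's' and units[-2][-1] in VOICED:
            units[-1] = 'z'                       # /s/ -> /z/
        return '/' + ''.join(units) + '/'

    print(compose(['wife', 'Pl']))                # /wajvz/  ("wives")

The point of the sketch is only the division of labor: each rule fires within the unit epoch of its own stratum.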
Note first that acoustical features can be considered as either absolute or relative with respect to speakers; i.e., schemes can be devised which measure stridency (to distinguish vowel-like and fricative-like), checked (stop-release) silence, and local envelope maxima without speaker ...

... corresponding to a homogeneous (with respect to both relative and absolute acoustic features) epoch of the acoustic signal. Some implications of this definition are:

(1) It is a specific definition not only with respect to a particular language but also with respect to a particular speaker and utterance; i.e., different utterances of a given phrase, even from the same speaker, could give different sequences of acoustemes.

(2) The distinctiveness property requires that only the controlled features can be involved.

(3) The segmentation is performed with respect to controlled feature changes and hence induces a useful criterion.

The units thus defined, while more accurately representing the specific acoustical signal, also behave like units of higher strata and exhibit many of the same linguistic phenomena.

The four terms diagrammed in Figure 24 (diversification: A may become P or Q; neutralization: B and A may become Q; zero realization: C may not have a corresponding unit; empty realization: S may be filled in) can be used to describe the various ways in which a speaker actually performs the dynamic task of selecting which features are to be controlled and which are to "float". Examples of these are found in the work of Rupert20 on isolated words. Several of the phenomena occur in "before".

[Figure 24: Several linguistic phenomena (diversification, neutralization, zero realization, empty realization) described by alternation rules; original diagram not recoverable from the scan.]

Diversification is seen by the different types of formant structure in the diphthong on the end. Zero realization is almost always seen in initial "b" with the lack of prerelease voicing. Empty realization is exemplified in the extra state or fill-in after the release of "b". Neutralization, which is evident on higher levels ("bitter" becomes "bidder"), is an alternate explanation of the modification of the diphthong.

Lack of knowledge of the specific rules used to generate the acoustical signal, and the resulting ambiguity in decoding, requires another modification. This ambiguity primarily causes extreme difficulty in placement of junctures (word, phrase, sentence) in the morpheme sequence. An attempt at recovering this information can be made by attaching another rank to the model. The third type of information primarily affects lower strata, such as stress and intonation patterns. This information has not been included in any word recognition system known to date. It is well known that these patterns delineate phrases and sentences. Other types of information occur in smaller units; hence this rank operates on different strata also. For the present, we will label this rank the hyper- -on, which indicates how the information is abstracted at each stratum. The -onic units are the most objective description of the events. The -emic units are generalizations which show the distinctive events; the hyper- -onic units are derived from the -onic units and show events which affect lower units. For instance, stress is a feature of a whole morpheme but affects (generally) only the vowel phonemes. Figure 25 shows the augmented model.
(1) Hyper-morphon features include stress and intonation patterns.

[Figure 25: The augmented (three-strata) recognition phonology system; original diagram not recoverable from the scan.]

...

    w_0 + Σ_{i=1}^{n} Σ_{j=1}^{n} q_i w_{ij} q_j

where [w_ij], i, j = 1, ..., n, is the correlation matrix of the existing n measurements, w_0 is the variance of the new measurement, and q_i, i = 1, ..., n, is the correlation of the new measurement with each old measurement. Hence the performance is degraded by the addition of a new measurement which is correlated with the others and which adds noise (proportional to w_0) to the recognition process.

2. Independence of Filter Bank Outputs

The mixing formula (IV-A-1) is based on the assumption of randomly selected generating models and optimum least-mean-squared-error estimation filters.* The highly structured and situation-dependent interrelationships of acoustical features make the former assumption very suspect. Further, the choice of suboptimum filters again indicates a set of dependent probabilities. In order to achieve the superior performance of a mixture formula, Wainstein and Zubakov apply the central limit theorem for a sum of independent random variables. Thus it would be beneficial to the performance if the mixture probabilities were independent.

* Probabilities computed on two input sets X^k and X^ℓ are independent if P(x^k, x^ℓ) = P(x^k) P(x^ℓ).

A second observation about such sums is pertinent here. The study of robust estimators shows that convergence to a stable value is quicker for arbitrarily distributed random variables if "outliers" (events significantly removed from the mean) are not included. In this context, probabilities assigned to certain filter outputs can be "outliers" for reasons cited in the discussion of the implementation of the filter bank. The long training period that may be required even for optimum classifiers (which is lengthened by outliers) is especially detrimental in the speech situation. The plastic structure must be responsive to "drifts" and slow changes in the input's salient features.

Thus, for the suboptimum filter bank specified in Chapter II and Appendix B, we need to investigate recognition structures which form near-independent probability estimates and mixture formulae which reduce the undesirable effects of outliers. We will show in Section B how the Lewis42-Brown43 probability approximation technique attempts to compute independent probabilities and in Section C how the S-RETIC algorithm of Kilmer41 operates to eliminate outliers.

IV B. Quasi-Independent Probability Distributions

The discussion in the last section indicated that the set of a posteriori probabilities computed on the outputs of the preprocessing filters should be independent in order to increase recognition performance. This will be difficult to achieve because of the overlap of the input sets (one for each probability computer) and also because of the correlated nature of the inputs.

The Lewis42-Brown43 iterative technique can be used to reduce the dependence between the probabilities. The notation follows that of Section I-E. Suppose we have (for a given class C_t) a set of m low-order distributions {P_k}, k = 1, ..., m, such that

    P_k(x) ≥ 0 ,   k = 1, ..., m , for all x

    ∫_{x ∈ X^k} P_k(x) dx = 1 ,   k = 1, ..., m

where X^k is the set of n inputs for the k-th probability computer.*

* For speech, one input, x_i, might be one component of a four-component state vector representing the output of one filter of an m-filter (overlapping) bank. X^k, then, is the state vector (n = 4) for each filter.
Then, if we consider the entire m × n dimensional pattern vector for a given class and hypothesize a "true" distribution, each low-order probability distribution P_k(X^k) satisfies a marginal property: integration of the "true" probability distribution over all components not contained in X^k equals P_k(X^k). Brown gives an iterative procedure for determining, among all products of low-order distributions that satisfy the marginal property, the one that minimizes an information measure of the closeness to the "true" probability distribution.

Brown defines the iterative procedure as follows. Given an initial (a priori) m × n dimensional distribution P^0(x), define the j-th iteration, j = 1, 2, 3, ..., of the probability distribution P^j from the set of low-order distributions {P_k}, k = 1, ..., m:

    P^j(x) = P^{j-1}(x) [ P_k(x) / P_k^{j-1}(x) ]                    (IV-B-1)

That is, multiply the (j-1)-th probability distribution by the k-th low-order distribution, where k ≡ j modulo m, and divide by the marginal distribution

    P_k^{j-1}(x) = ∫ P^{j-1}(x) dx̄^k                                (IV-B-2)

the integral being taken over the components x̄^k not contained in X^k. Brown shows that the distribution P^j does satisfy the marginal requirement for all j and does converge to a limiting distribution with the minimum information property. At first it appears that P^j will contain low-order distributions raised to a power, but if we rewrite the marginal distribution (IV-B-2) as

    P_k^{j-1}(x) = P_k(x) g_k^{j-1}(x)

where the g's will be defined, we can see that this is not so. Substitution of this into Eqn. (IV-B-1) gives (after m iterations)

    P^j(x) = Π_{k=1}^{m} P_k(x) / g_k^j(x) ,   j = m, m+1, ...       (IV-B-3)

where

    g_k^j(x) = g_k^{j-1}(x) ,   j ≢ k modulo m
    g_k^j(x) = ∫ Π_{t=1, t≠k}^{m} [ P_t(x) / g_t^j(x) ] dx̄^k ,   j ≡ k modulo m

Note that g_k^j(x) is a function of x ∈ X^k, and hence the g_k^j functions tend to make {P_k}, k = 1, ..., m, a set of independent probability distributions, so that a product rule for recombination applies. The computation of g_k^j requires, for a given module, an integration over the set of measurements not contained in the input to that module. The g_k^j functions have the same limiting property as discussed in Brown. To see this, define the limiting probability distribution

    P^r(x) = lim_{j→∞} P^j(x)                                        (IV-B-4)

and recall from Brown that P^r has the following marginal properties:

    ∫ P^r(x) dx̄^k = P_k(x) ,   k = 1, ..., m                        (IV-B-5)

Substituting for P^r from Eqn. (IV-B-3) and Eqn. (IV-B-4) (with proper assumptions to give interchange of limits, integrals, and products),

    ∫ P^r(x) dx̄^k = lim_{j→∞} ∫ P^j(x) dx̄^k = lim_{j→∞} [ g_k^{j-1}(x) / g_k^j(x) ] P_k(x) ,   k = 1, ..., m

and

    lim_{j→∞} g_k^{j-1}(x) / g_k^j(x) = 1                            (IV-B-6)

Thus, it is necessary only to compute the iterative definitions of g_k^j. An example may clarify the role that the g_k^j functions play. Let P(x_1, x_2, x_3) be an a priori distribution over X ⊂ R^3. Let the two-dimensional (n = 2) lower-order distributions (m = 2) be (where P with no indices denotes the marginal distribution of the indicated arguments)

    P_1(x) = P(x_1, x_2) ,   X^1 = (x_1, x_2)
    P_2(x) = P(x_2, x_3) ,   X^2 = (x_2, x_3)

Then

    g_1^j(x) = 1 ,   j = 1, 3, ...
    g_2^j(x) = P(x_2) ,   j = 2, 4, ...

and

    P^r = P^j = P(x_1, x_2) P(x_3 / x_2) ,   j = 2, 3, ...

Note that any a priori distribution is allowed and does not affect the final result. The effect of the g_k^j functions in this simple example is to change the marginal distribution into a conditional distribution.
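A discrete numerical sketch of the iteration (IV-B-1)-(IV-B-2) for the example just given: probability tables over three binary variables, with low-order tables P1 over (x1, x2) and P2 over (x2, x3). The array shapes and the hypothetical "true" distribution are our assumptions; the final check confirms convergence to the product form P(x1, x2) P(x3 | x2).

    import numpy as np

    def brown_iterate(P0, P1, P2, sweeps=10):
        # P^j = P^{j-1} * P_k / (marginal of P^{j-1} on X^k), k = j mod 2.
        P = P0.copy()
        for j in range(sweeps):
            if j % 2 == 0:
                marg = P.sum(axis=2)                 # marginal on (x1, x2)
                P = P * (P1 / marg)[:, :, None]
            else:
                marg = P.sum(axis=0)                 # marginal on (x2, x3)
                P = P * (P2 / marg)[None, :, :]
        return P

    rng = np.random.default_rng(0)
    T = rng.random((2, 2, 2))
    T /= T.sum()                                     # hypothetical "true" distribution
    P1, P2 = T.sum(axis=2), T.sum(axis=0)            # its low-order marginals

    P = brown_iterate(np.full((2, 2, 2), 1 / 8), P1, P2)

    # The limit has the minimum-information form P(x1, x2) P(x3 | x2):
    target = P1[:, :, None] * (P2 / P2.sum(axis=1, keepdims=True))[None, :, :]
    print(np.allclose(P, target))                    # True

As the text notes, the uniform prior is immaterial: after one full sweep over the two low-order tables the iteration has already reached the limiting distribution.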
Chow's45 approximate scheme for learning conditional dependencies is analogous (allowing conditioning on one variable only), but the use of the g_k^j functions allows one to determine the structure of the problem for any set of lower-order distributions (possibly in a theoretical sense only, as the computations may become unwieldy, especially if the lower-order distributions change).

In summary, we have shown how an iterative procedure for approximating probability distributions is a mathematical model for learning conditional dependencies such as those found between Kilmer's41 STC-RETIC modules. The reduced formulae developed here require only integration and multiplication and no powers (as in the original scheme). These iterative formulae develop only the conditional dependencies and do not depend on measurements that are independent. That is, if X^k and X^q are nonoverlapping, independent input sets, then the integration set to compute g_k^j need not include X^q. The resulting approximation formula is a product, which implies independent measurement sets. Thus, they form an appropriate set of mixture probabilities as discussed in the last section.

IV C. Specification of First-Level Decision Structure

At this point, we have specified a set of m state vector representations of the input signal, each state vector having dimension n, and a set of a posteriori distributions for the probability of each state vector, given one classification C_t, t = 1, ..., r. We wish to decide, on the basis of this information (and possibly other information which needs to be specified), the appropriate subset of state vectors that best represents the pertinent features in the input signal. We can write a general formula to compute r numbers to decide between the different classifications (hypotheses), including the Bayesian approach developed in the last two sections:

    S_t = Σ_{k=1}^{m} f_k(p_kt) ,   t = 1, ..., r                    (IV-C-1)

where f_k is a monotonic, nondecreasing, continuous function and p_kt is p_k(C_t / X^k), the a posteriori probability of class C_t given the input set X^k. This formula includes a large number of likelihood functions. We will discuss these different formulations and relate them to the specific problem of speech recognition.

The usual (concave) monotonic function of the probabilities is the natural logarithm, which converts a product of independent probability distributions, as discussed in the previous section, into a summation. Since the function is monotonically nondecreasing, a decision test based on the probabilities alone will have similar results for a function of those probabilities. Another function of this type is discussed in Kilmer.41 There, the purpose was to emphasize probabilities that were different from a uniform value, 1/r. Thus, if a given p_kt was significantly greater than or less than 1/r, the f_k function would tend to emphasize this particular probability.

The formulation (Eqn. IV-C-1) also allows several types of cost factors to be included in the decision quantities. Various cost factors are discussed in the literature. One of the most pertinent to this study is an information measure that is related to the amount of information in the a posteriori distribution p(C_t), given X^k, t = 1, ..., r. Here the implication is that a module input should be considered very strongly in the decision if there is a significantly peaked distribution among the various categories. Another possible interpretation of a cost function of the input X^k, especially for suboptimal systems operating in noisy environments, is a quality measure, which could be determined in two ways. The first is in terms of the distance from the cluster centers of the input variable; this would indicate whether the input were quite far from the majority of inputs seen previously with respect to the given set of learned categories (where we are not concerned with unknown or new input classes). This type of cost factor would give low values of a posteriori probability less influence, especially in the case of an insufficient or small number of training patterns. This is so because the majority of known probability distribution estimation techniques give much worse estimates of the tails of a distribution (events with low probability of recurrence) than they do of more densely populated modes. The second type of quality measure, based on the physical characteristics of the measurements (low signal-to-noise ratio of input, extremely high background interference, etc.), would lessen the effect of noisy inputs. These types of cost factors can easily be incorporated into the formulation (Eqn. IV-C-1).
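A sketch of Eqn. (IV-C-1) with two choices for f_k: the natural logarithm, and an emphasis function stressing departures from the uniform value 1/r in the spirit of Kilmer's scheme. The particular emphasis form and the guard constant are our illustration, not a formula from the thesis.

    import numpy as np

    def decide(P, rule='log'):
        # P[k, t]: a posteriori probability of class t computed by module k.
        # Returns the r decision numbers S_t of Eqn. (IV-C-1) and the winner.
        if rule == 'log':
            S = np.sum(np.log(P + 1e-12), axis=0)    # product -> sum of logs
        else:
            r = P.shape[1]
            d = P - 1.0 / r                          # departure from uniform
            S = np.sum(np.sign(d) * d * d, axis=0)   # emphasize large departures
        return S, int(np.argmax(S))

Cost factors of the kinds discussed above would enter as per-module weights multiplying each f_k term before the sum.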
Another possible interpretation of a cost function of the input Xk, especially for suboptimal systems operating in noisy environ- ments, is a quality measure which could be determined in two ways: first, in terms of the distance from the cluster centers of the input variable. This would indicate whether the input were quite far from the majority of inputs seen previously with respect to the given set of learned categories (where we are not concerned with unknown or new input classes). This type of cost factor would indicate that low values of a posteriori probability have less influence, especially in the case of an insufficient or small number of training patterns. This is so because the majority of known probability distribution estimation techniques give much worse estimates of the tails of a distribution (events with low probability of recurrence) than they do of more densely populated modes. Another type of quality measure based on the physical characteristics or measurements (low signal— to—noise ratio of input, extremely high background interference, etc.) 153 would lessen the effect of noisy inputs. These types of cost factors can easily be incorporated into the formulation (Eqn. IV-C~1). The third thing that may be included in this formulation is prior distributions. Lainiotis only used the prior distributions as thresholds for comparison of likelihood ratios that he generated. In Speech it is well known that successive Speech segments are highly dependent (redundancy of about 33 percent); hence there is much information in the probability of a given segment, given the last decision or classification of the preceding segment. Thus, we must augment Bayes' formula that was stated in Section I—E to include this conditional probability. k k—l f k ] [ k—l ‘ P(C&/X ,x )—LP(X A), / (p(c,/c,_l)p(c&_l/x )j (IV—C-2) This Should be incorporated in such a way that when the a posteriori probabilities computed on the present input do not contain sufficient inforuudjinl to give zliwiliable estinm1h3 of the [nngNit category, (in? conditional distributions should be used. Even with the different interpretation given to cost factors and prior distributions it is possible to formulate a recognition problem for a restrictive speech signal within the Bayesian framework, as discussed in the previous two sections. However, there are several events,eSpecially for subOptimal systems Operating in noisy environments, that will have a probability ), t = 1, ..., r; k = 1, matrix P = .., m, which does not give I) ( tk acceptable decisions using Eqn. IV-C—l. This can be due to conflict between modules having high probabilities for one class and other modules having high probabilities for another class. This situation involving 154 "outliers," as discussed pnadoqum can occur because of: (l) inappropriate assumptions; (2) presence of noise in the input that is very much signal— like (white noise that looks like a fricative, sinusoidal inputs that look like vowels or nasals, etc.), or (3) a dichotomization of inputs; that is one module may have the same input for two completely different classes of the input Signal. An example of this would be during a nasal, when a high—frequency—bandpass filter might have a strong formant that looks vowe1~like, whereas the presence of no signal in other filters and low energy of pitch frequency component would indicate that this signal interval is not a vowel. 
Wainstein and Zubakov have used the central limit theorem with respect to likelihood ratio formulae such as Eqn. IV-C-1 for the following reasons: given that the individual terms (probabilities, likelihood ratios) are independent events, and given certain restrictions on the tails of the distributions of these events (that they are well behaved and go to zero sufficiently fast as the value of the event goes to infinity), the distribution of the sum tends toward the normal distribution. As is well known, this is an asymptotic property, but it illustrates two things: (1) convergence to a definite value, and (2) this value is not a local minimum, since the asymptotic distribution is unimodal (these theorems have been proven for a larger class of distributions than the normal, but with similar convergence and unimodality properties). Thus one can expect a stable rule for finding a maximum value with a guaranteed convergence property. The problem that occurs with a low-probability outlier is that it will take a large number of terms in the summation to counteract its effects. In our situation, where there is such a mixture of probability distributions and a large number of possibilities for generating such outliers, one cannot sit back and hope that they will only occur with a small probability. Several theorems have been proven (Hertz70) where the tail behavior of the event distributions has been relaxed by eliminating outliers while still obtaining the central limit theorem. This of course is intuitively the correct thing to do in order to maintain the convergence and unimodality properties.

The discussion of causes for outliers' occurrence leads one to consider two approaches. One is to use a Bayes decision formulation but compute a larger-dimension probability distribution, possibly over the entire m × n dimensional space. The discussion of the first chapter has indicated empirical objections to this approach. With respect to the discussion in the first two sections of this chapter, the module concept can be justified by stating that the filtering representation scheme presented in Chapter Two is better matched to the natural dimensionality of each feature, and thus is better able to eliminate unwanted signals and noise and to isolate individual features. Second, as pointed out by Groner, too many inputs to a suboptimal-design Bayes decision network very often add noise and thus degrade the overall classification performance. Thus, it would seem very natural that the first level of decision logic would be to extract the features as separate entities and then, on the basis of this extraction, look for the interrelationships and more detailed properties of the features.

The other possible decision structure is to allow interconnections between the modules that allow lateral passage of gross information. Kilmer41 has considered this problem and has related the S-RETIC modal computations to nonlinear summative schemes such as Eqn. IV-C-1. He shows that, based on three symmetries that are assumed for such systems, "S-RETIC computes a mode [detection/classification] function, F, that no S-RETIC net without α and β [lateral] connections but with nonlinear summative output scheme could compute even though it is allowed more equipment." The three symmetries that are assumed for these systems are as follows (note that the first two are typical for Bayesian schemes of the type discussed):

(1) We must be able to compute the same classification decision regardless of which module has the proper information. This is especially necessary in speech since, as is evident in Figure 4, a given filter bank module will not always have the appropriate classification information, especially when different speakers are expected to be using the system. Further, Figure 17 indicates how different processing schemes will isolate the pertinent information, dependent on the surrounding feature environment.

(2) The evaluation scheme must be the same for any classification decision (the computation of S_t is independent of t, t = 1, ..., r).
The three symmetries that are assumed for these systems are as follows (note that the first two are typical for Bayesian schemes of the type discussed): (1) We must be able to compute the same classification decision regardless of which module has the proper information. This is especially necessary in slurecl), as; is (iviihsnt. in lfilgUIW: 4, Sllufiv [Jle tianki module will not always have the appropriate classi— fication information, eSpecially when different Speakers are eXpected to be using the system. Further, Figure 17 indicates how different processing schemes will isolate the pertinent information, dependent on the surrounding feature environment. (2) The evaluation scheme must be the same for any classification decision (the computation of 8% is independent of t, t = l, ..., r). 157 (3) Strength—of—effect symmetry. Given prior distri— butions, an average (summative) decision across the net and conflicting decisions, any two can overcome the third or any one can overcome the other two, whether the other two are in favor of the same classification or conflicting classifications. This symmetry states that the decision rule must give equal weight and operate in an equal fashion in judging the effects of these three possible situations that can occur simultaneously. [The last symmetry requires the lateral communication, since Bayesian schemes have the first two symmetries (any of the formulae from Wainstein and Zubakov discussed previously) but when faced with the type of situation depicted in (3) will not operate in a consistent or appropriate manner.) To paraphrase Kilmer's statement, in light of the outlier situation, it is seen that the lateral communication is necessary to decide among the probability matrix and the prior distribution matrix, which modules should work in conjunction and be averaged together to determine the output and which should be considered outliers and eliminated. The S—RETIC algorithm is iterative and thus the intuitive arguments we are presenting here are intended for understanding rather than analysis, but the importance of Kilmer's statement about decision algorithms of this nature is that the structure must be implemented in this way or else the performance will suffer. 158 In the next section, we will specify a first—level recognition system that operates according to these principles and incorporates the S—RETIC type of decision logic. We will see how the state variable representation presented here can be incorporated in a dynamic real-time asynchronous decision network. IV D. Proposed First-Level Recognition Block Diagram The purpose of this study is to Specify a mathematical model and a system block diagram that are tailored to the acoustical speech signal, rather than the converse. The deficiencies of state-of—the—art solutions derived by means of a Bayes minimum—risk criterion have been discussed: It is necessary to use a restricted speech model; it is very difficult to implement the nonlinear estimation filters that are required; a high dimensionality is required because of the complicated interrelationships of the speech signals. Even the optimal filters' outputs will be dependent, in a probabilistic sense; the mixture probability formula allows the possibility of adding together nonsimilar waveforms, based solely on the learned probability of presence. 
Adding to these difficulties those that have been discussed for suboptimal solutions, which can give rise to outlier probabilities for particular classes, one is left with a very negative picture. There are several other requirements of a recognition system that are difficult to include in a Bayes formulation, which will be mentioned here to help specify the recognition system:

(1) Significance of the marked change. The segmentation marks that are derived from the inherent signal characteristics must be monitored with respect to past occurrences of the speech signal to determine whether the marked change is due to noise (parity error, ...), another energy peak entering the filter bandwidth, the actual start of a new feature, a change from one feature to another, or the finish of a feature.

(2) Correlation with overall system behavior. Each module decision must be compared with all other modules to determine if this is a new feature, whether an energy peak has moved from one filter to another, or whether (one of) the predominant feature(s) has finished.

(3) Precisely controlled features. The main criterion for classifying a pertinent feature is whether it is repeatable for different speakers and contexts, whether it is a transition of prescribed form, and whether the terminal state of the transition is predictable before the end in case the segment is terminated.

The Bayesian formulation of course has a different philosophy toward marked changes, in that they are assumed to be a true segment and the mixture probability formula is used to decide on the actual significance. Since in a speech recognition system these marked changes also have linguistic meaning at higher levels (determining the consonant/vowel relationships, directing higher-level analysis, ...), there must also be a decision on their validity. As is well known, the overall system behavior must not be degraded by allowing individual modules to make classification decisions. It has been demonstrated that the S-RETIC will work in a correlated fashion as a total system, rather than as m individual systems each screaming for its own way (as in pandemonium machines). This is a very serious requirement which will not necessarily be satisfied by using a simple mixture formula. The particular method of training the classification network is well specified in our Section I-C and the work of Rupert.20 We can see that the requirements of precisely controlled features are very pertinent to determining a consistent system performance. The work of Houde47 also indicates that recognizing transients can be performed because of the consistent and precise form of articulatory transitions. Since it appears from the preprocessing pictures that these transitions also exist in the acoustical waveform, we can see that this is a desirable and necessary requirement for efficient recognition.

As was mentioned previously in the discussion of the filter bank, the fixed-frequency filters that were used are not tailored to actual speech characteristics, especially during frequency-transition epochs. As can be seen from the three requirements stated above for the recognition system, there might also be difficulties for specific systems that have set filter bandwidths, in that energy peaks can move across filter boundaries. Depending on the skirt response of the filter, it may be very difficult, for a particular filter, to distinguish an energy peak which moves into the filter from one that simply begins in that filter.
For this reason, and to avoid a very complicated classification system which must make these additional decisions, we should make use of the time-varying tracking filters discussed in Section II-D. We will outline a procedure for their use in conjunction with the classification system. First we make the assumption that at any one given instant of time the filter bank is constructed such that there is at least one filter that isolates the pertinent feature information (here the use of the word "filter" indicates the derivative calculations as well as the actual bandpass filter operation, since the combination of filtering and differentiation is sometimes required to isolate the desired feature). Given that assumption, we can then specify a tracking filter, shown in Figure 12, in the following way. At a marked change, we make a classification of the overall system input (i.e., each module decision is calculated and then an overall global decision is arrived at from these local decisions). Then, based on this decision, selected filters which have the pertinent features are activated to start tracking. The estimates of frequency and bandwidth are used, as indicated in Figure 12, to modify the input signal further to emphasize the particular pertinent features. Thus, other formants entering this particular filter will not affect the tracking filter output. Also, it will be possible to allow the tracking filter to operate across the filter bank boundaries. The combination of this tracking filter with the fixed-frequency filter bank will then lock on certain features and follow them throughout their duration, emphasizing the characteristics which may be needed for higher-level classification.

These requirements allow us to specify a recognition and preprocessing structure which matches the nature of the speech signal and allows higher-level linguistic classification. This structure is shown in Figure 27. The wideband speech signal is processed by the overlapping filter bank. Each filter output is operated on by a measurement device similar to that described in Section II-E. The inherent signal changes are detected to give derivative segmentation indicators. The measurement outputs from A1,1 go to A2,1, which is the acousteme class selection. Here the stored precisely-controlled-feature information is compared to the input, and local class decisions are made. Based on these local class decisions, the outputs of A2,1 corresponding to the degree of presence (DOP) vectors shown in Figure 6 are compared with the derivative segmentation indicators.

[Figure 27: Proposed first level of an asynchronous real-time speech recognition system; original diagram not recoverable from the scan.]

...

[Appendix B figure: the overlapping bandpass filter bank modules for the A/D speech tapes; original diagram not recoverable from the scan.]
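A schematic sketch of the tracking-filter activation described in this section (cf. Figure 12): after a global decision at a marked change, a selected filter re-centers on the running frequency estimate and opens or closes with the bandwidth estimate, so it can follow an energy peak across fixed filter-bank boundaries. The first-order update rule and the smoothing constant are our illustration only.

    def track(freq_est, bw_est, f0, b0, alpha=0.2):
        # Recursively re-tune a tracking filter's center frequency and
        # bandwidth from the running epoch estimates (first-order smoothing).
        fc, bw = f0, b0
        for f, b in zip(freq_est, bw_est):
            fc += alpha * (f - fc)      # follow the energy peak
            bw += alpha * (b - bw)      # open/close with the bandwidth estimate
            yield fc, bw                # parameters for the next filter section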
APPENDIX C

TIMSER: A Program for Interactive Analysis of Time Series

TIMSER (Techniques for Interactive Manipulation of SEquential ...) is a program ensemble that runs on the CDC 3300 at SRI. It was developed to allow a user to edit and transform time series interactively, observing the results on a CRT display. The time series of primary interest are bipolar, one-dimensional time series (such as the result of A/D operation on an analog voltage) and unipolar multivariate time series (such as the envelope and zero-crossing count time series derived from the bipolar sampled analog signal). Figure 1 gives an overview of the operations available to perform these two kinds of analysis.

RAWTMSR: One-Dimensional Bipolar Time Series; i.e., the output of an A/D conversion of an analog waveform from magnetic tape.

The system allows inputting data from up to three tapes, which may have different sampling time intervals, accuracy (range of data), and record length. No capability is available now to unpack multiplexed data; however, it could be implemented by modifying a single subroutine. The parameters that are needed to read each magnetic tape are read in from a card. These tape parameters are fixed for each magnetic tape:

(1) Tape ID (4 BCD characters);
(2) Length of record;
(3) Sample time interval;
(4) Logical unit; and
(5) Range of the data, 8 bits (± 256) to 12 bits (± 2048).

[Figure 1: Displays and display commands for the analysis of bipolar and multivariate time series; original diagram not recoverable from the scan.]

The structure of the magnetic tapes is typical for A/D operations; namely, an arbitrary number of records per file and data blocked by end-of-file marks, with an arbitrary number of files. A COMPASS subroutine enables quick searching for files.

PROTMSR: Unipolar Multiple Time Series; i.e., results of sampling interrelated analog waveforms or results of processing one-dimensional digital time series.

These time series are stored in random-access files that have a variable structure, depending on four parameters:

(1) Length of record
(2) Number of records
(3) Number of modules
(4) Number of dimensions.

These four parameters are contained in a header at the start of each file and are assumed to be constant for that file. Use of a virtual-core storage scheme (PUTGET) permits storage of up to 200 files. Maximum record length is limited to 200; however, the rest of the parameters are bounded only by available disk storage. The number of dimensions is the number of different time series in this file. The number of modules is a sub-file structure that allows a within-file breakdown of data. For instance, there may be several processing schemes for one multiple time series which the user wishes to compare. The number-of-records parameter refers to each module.

Options Available to Both Programs

Microfilm Hard Copy. It is possible to obtain hard copy of the actual picture on the CRT display. This is done by dumping the octal display buffers onto magnetic tape and then converting them to microfilm pictures on the ... display, later, as a batch job.

Comments and Titles. The user can type in a title or a comment on the CRT display; these comments are transmitted to the hard copy. This is useful for noting where you are in your analysis and for titling microfilm pictures.
Run-Time Computations

A limited incremental compiler, allowing the user to manipulate, scale, or transform the time series, will soon be added. The user will be able to type up to 10 algebraic equations (which may perform linear transformation, lead-lag averaging, magnitude, absolute value, sum-of-squares computations, normalizing by maximum value, etc.).

Editing and Transformation Capabilities

RAWTMSR

Data on magnetic tapes is transferred to a circular virtual-core buffer; that is, a buffer with a fixed length such that, when this length is exceeded, the first data introduced is overwritten. The user can select any one of three magnetic tapes from the keyboard; at this time, a data file is read in with the tape parameters discussed above. Arbitrary selection of files and records from that tape can be made, so that the circular buffer may contain an allocation of data not necessarily the same as the original tape. From the circular buffer the time series are converted to octal display buffers for the CRT display. A pointer is used to determine the origin of display in the circular buffer. It is possible to change plot parameters, including the number of points in a curve, the number of curves on the screen, and the scale of the plot. The pointer can automatically increment through the buffer, allowing rapid editing. The user may also reference (save) one or more curves on the screen and edit others for comparison. Once a curve is saved, further modification of the plot parameters will not affect it. Various options are available for rapid editing of data from several tapes.

In addition to the editing capability, RAWTMSR allows some computations to be performed on the time series. There are two types:

(1) Fixed Computations -- including Fourier series and a smoothed time derivative computation. These computations can be made on any selected portion of the data in the circular buffer by setting limit pointers. The resulting time series are stored in a temporary scratch buffer, located in core, and immediately displayed above the last reference curve. These curves are automatically referenced so that further computations or moving of time series will not remove them.

(2) Run-Time Computations -- these computations will be performed by the incremental compiler. They can be performed either on data from the circular buffer, again indicated by beginning and ending pointers, or on data in the scratch buffer in core.

The result of any of the above computations will be written over anything in the scratch buffer and immediately displayed. Permanent records of the computations can be made either by the hard copy option or by printing the scratch buffer contents.

For example, the combination of these fixed and run-time computations can result in the following display: First, pointers are set in the scratch file and a Fourier transform is computed. Then the incremental compiler is called, and a logarithmic transformation of the magnitude of the Fourier series is computed and normalized. Then another Fourier transform, on the resulting time series, is computed and displayed. The resulting waveform, called a cepstrum, is useful in speech analysis. The procedure of introducing the data in the scratch file, assigning beginning and ending pointers, and calling a subroutine to do the computation is common to many forms of time series analysis -- namely, autocorrelation computations, convolution operations with matched filters, etc. -- and allows a general structure for incorporation of additional operations. The cepstrum sequence is sketched below.
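A numpy sketch of the cepstrum sequence just described: Fourier transform, normalized log magnitude, then a second transform. The frame below (a 120 Hz harmonic series under a decaying envelope) and the sample rate are illustrative only.

    import numpy as np

    fs = 10000
    t = np.arange(1024) / fs
    frame = sum(np.exp(-k / 6.0) * np.cos(2 * np.pi * 120.0 * k * t)
                for k in range(1, 30))

    spectrum = np.fft.rfft(frame)                  # first Fourier transform
    logmag = np.log(np.abs(spectrum) + 1e-12)      # log magnitude ...
    logmag -= logmag.max()                         # ... normalized to its peak
    cepstrum = np.abs(np.fft.rfft(logmag))         # transform of the log spectrum

    # The low-quefrency end of `cepstrum` carries the smooth spectral envelope,
    # while the periodic ripple of the 120 Hz harmonic comb shows up as a
    # sharp peak further out -- the source/filter separation that makes the
    # cepstrum useful in speech analysis.
    print(int(np.argmax(cepstrum[10:])) + 10)      # index of the pitch peak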
A typical hard-copy (microfilm) picture (Fig. 4 in the text, repeated here) indicates some of the editing capabilities. Waveforms from four different files (4, 5, 6, 24) and three different tapes (MD4, MD6, MD8) are displayed simultaneously. Each waveform is labeled with the beginning and ending time references (from the start of each file) computed from the sample-interval parameter for each tape. File names (19 BA 1, 19 MJ 1, 19 BF 1, 19 FJ 1) and a general comment (OUTPUTS OF OVERLAPPING BAND PASS FILTERS [W1 TO B1]) are also displayed.

PROTMSR

The PROTMSR program allows a study of the interrelationships of time series selected from random-access files. Four two-dimensional scatter plots are displayed. The selection of the plot parameters for each of these four plots allows the plotting of an individual time series versus its index, the scatter plots of two time series (from the same file, but not necessarily the same record or module), and comparisons of scatter plots from four different files. The structure of the files on the disk is completely determined by the parameters (discussed above) in the header; thus, it may vary from one file to another. An index function involving four parameters (instead of the three commonly available in Fortran) is used in a separate subroutine called INDEX, so that the usual Fortran requirements of predetermined dimensions and a maximum value for each dimension are not necessary. A file directory showing the various parameters and file identification read from the header is available as an option for selecting the files. Various options are available to facilitate the comparison of the four scatter plots:

(1) A time-sequence option, which allows ten points on each scatter plot to be labeled 0 through 9 according to their sequential index. These ten labels are then incremented through the scatter plot, showing the sequential relations.

(2) An overlay option, which plots all four scatter plots on common axes.

(3) Automatic incrementing of either dimensions, modules, or records for rapid editing.

The run-time compiler can be used in this program to generate a new file and a new set of derived time series from any of the existing files. Similar types of transformations, as discussed before, are available here.

A typical picture (Fig. 17b in the text, repeated here) shows four simultaneous time series plots; the two on the left are univariate plots and the two on the right are bivariate (scatter) plots. Labels on the bottom are added by the user. Each plot is labeled with a file name (16 BE 1) and three index parameters (D1 M5 R2) or TIME (indicating the index), with the maximum time shown (480 ms). Names of the D index are also shown (ABS ENV). The scale for each axis is shown by a factor (X 4) which multiplies the original data.

APPENDIX D

SLIDING POWER SPECTRA

Sliding power spectra are computed from the A-D tapes described in Appendix B by means of the TIMSER display program ensemble. Each curve displays the square root of the power spectrum computed over a fixed time interval (25 milliseconds for all curves in this appendix). Spectrum smoothing is done by multiplication of the time waveform by the taper function

$$(1 - x^2)^2, \qquad -1 \le x \le 1$$

This smoothing is performed to minimize the effects of the pitch frequency and to give better side-lobe response (see Blackman and Tukey).
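As a concrete sketch of this computation (the 25 ms frame and the taper are from the text; the step size anticipates the 15 ms step noted below, and the sample rate and demo signal are assumptions):

    import numpy as np

    def sliding_spectra(x, fs, frame_ms=25.0, step_ms=15.0):
        n = int(fs * frame_ms / 1000.0)
        step = int(fs * step_ms / 1000.0)
        u = np.linspace(-1.0, 1.0, n)
        taper = (1.0 - u ** 2) ** 2          # smoothing window from the text
        curves = []
        for start in range(0, len(x) - n + 1, step):
            spec = np.abs(np.fft.rfft(x[start:start + n] * taper))  # sqrt(power)
            curves.append(spec / spec.max())  # normalized to maximum component
        return np.array(curves)

    fs = 13000
    t = np.arange(int(0.1 * fs)) / fs
    demo = np.cos(2 * np.pi * 800 * t) + 0.3 * np.cos(2 * np.pi * 2400 * t)
    print(sliding_spectra(demo, fs).shape)   # one row per 15 ms-stepped curve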
The labels on each picture give the fundamental frequency (40 Hz), file number and tape ID (corresponding to module), linear or log magnitude plot, maximum frequency, utterance and speaker label, and filter bandwidth. The label for each curve shows the start time and the square root of power (all curves in this appendix are stepped 15 milliseconds). Each curve is normalized to the maximum frequency component. The linear magnitude plots show percentage of the maximum component. The log magnitude plots show dB relative to the maximum component (50 dB range).

[Figure D-1: DHUATH 16 BE 1, real-time wideband signal (file 3, RTWB; log magnitude to 6500 Hz; filter bandwidth 70-6500 Hz)]

[Figure D-2: DHUATH 16 BE 1, filter bandwidth 458-1467 Hz (file 3, MD4; linear magnitude to 3333 Hz)]

[Figure D-3: DHUATH 16 BE 1, filter bandwidth 577-1867 Hz (file 3, MD6; linear magnitude to 3333 Hz)]

[Figure D-4: DHUATH 16 BE 1, filter bandwidth 1467-2917 Hz (file 3, MD8; linear magnitude to 5000 Hz)]
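The log-magnitude scaling used in these plots, as a small numpy sketch (the clip level follows the 50 dB range stated above; the function name is illustrative):

    import numpy as np

    def log_magnitude(spec, floor_db=-50.0):
        db = 20.0 * np.log10(np.maximum(np.abs(spec), 1e-12))
        db -= db.max()                 # 0 dB at the maximum component
        return np.clip(db, floor_db, 0.0)

    print(log_magnitude(np.array([1.0, 0.1, 0.001])))  # -> [0, -20, -50]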
APPENDIX E

INSTANTANEOUS ESTIMATORS OF TIME-VARYING PARAMETERS

It has been shown that representation of a class of acoustical signals can be reduced to estimation of the time-varying frequency and envelope and their derivatives. A common estimator of the instantaneous frequency is a sliding average of the zero crossings of x:

$$\tilde{\omega}(t) = K \int_{t-T}^{t} z(\sigma)\,d\sigma \tag{E-1a}$$

or, for discrete samples,

$$\tilde{\omega}_n = K \sum_{j=n-k+1}^{n} z_j \tag{E-1b}$$

where K is a normalizing constant and

$$z_j = \begin{cases} 1 & \text{if } x_{j-1} \le 0 \text{ and } x_j > 0, \text{ or } x_{j-1} \ge 0 \text{ and } x_j < 0 \\ 0 & \text{otherwise.} \end{cases}$$

A reasonable estimator of the envelope of x(t) is the sliding mean of the absolute value of the real part:

$$\tilde{a}(t) = \frac{1}{T} \int_{t-T}^{t} |x(\sigma)|\,d\sigma \tag{E-2a}$$

We will denote all estimators by adding a tilde (the estimate of $\omega$ is $\tilde{\omega}$), and we write the sliding sum of length k over the index j, from n-k+1 to n, as $\sum_k^n$. For discrete samples,

$$\tilde{a}_n = \frac{1}{k} \sum_k^n |x_j| \tag{E-2b}$$

Another estimator for the envelope of x(t) can be derived from Equation (II-A-3c); that is, an average of the magnitude of the analytic signal. The Hilbert transform of an arbitrary function can be obtained by means of a complex digital filter (Crystal and Ehrman) operating on the real signal.

The computation of these estimators involves a nonlinear, no-memory operation followed by a low-pass filter.

[Figure E-1: Operations for parameter estimation -- x(t) passes through a nonlinear no-memory operation to give v(t), then through a low-pass filter to give y(t), from which the estimates of the mean and of the derivative are taken]

The problem of removing the oscillatory terms from the state-variable differential equations, as discussed in Sec. II-A, and the selection of T (or k), is analogous to the selection of the cutoff frequency for the low-pass filter shown in Figure E-1. An effective measure for stationary signals is the mean square bandwidth. Abramson has shown that the mean squared bandwidth (see II-A-14) of v(t), the result of the nonlinear, no-memory operation, is computed by the following formula:

$$B_v^2 = \frac{E\{(v')^2\}\,E\{x^2\}}{E\{v^2\}}\,B_x^2 \tag{E-3}$$

(We denote the derivative of a function v with respect to its argument as v'.) For the situations we are considering, $B_v^2$ is equal to a constant, related to the nonlinear operation, times the mean squared bandwidth of the input. For example, the bandwidth of the envelope function (using II-A-14) for stationary Gaussian processes

$$x_s(t) = a(t)e^{j\alpha(t)}$$

is

$$B_a^2 = B_{x_s}^2 \tag{E-4}$$

where $x_s$ is the shifted (low-pass) version of x. Use of a full-wave detector (absolute value), $g(x) = |x|$, as an estimator of the envelope gives a bandwidth (Abramson)

$$B_{|x|}^2 = \frac{E\{(g')^2\}\,E\{x^2\}}{E\{g^2\}}\,B_x^2 = B_x^2 \tag{E-5}$$

since $(g')^2 = 1$ and $E\{g^2\} = E\{x^2\}$.

The mean frequency can be derived by converting v(t) to an analytic signal with discontinuous phase:

$$\xi(t) = b(t)e^{j\beta(t)} = |x(t)| + j\,|x_h(t)| \tag{E-6}$$

where

$$x(t) = a(t)e^{j\alpha(t)}, \qquad b(t) = a(t), \qquad \beta(t) = \alpha(t) + \zeta(t),$$

and $\zeta(t)$ is a step function which increases by $\pi$ whenever x(t) = 0. Using (II-D-2), the mean frequency $\bar{\omega}$ is given by

$$\bar{\omega} = \frac{\int_0^\infty b^2(t)\,\dot{\beta}(t)\,dt}{\int_0^\infty b^2(t)\,dt} = \frac{\int_0^\infty a^2(t)\,\dot{\alpha}(t)\,dt}{\int_0^\infty a^2(t)\,dt} + \frac{\int_\Gamma a^2(t)\,d\zeta(t)}{\int_0^\infty a^2(t)\,dt}$$

where $\Gamma = \{t \mid x(t) = 0\}$. Since the discontinuities at $t \in \Gamma$ are steps (first order), the last integral is zero. Consequently, $\bar{\omega}$ becomes

$$\bar{\omega} = \frac{\int_0^\infty a^2(t)\,\dot{\alpha}(t)\,dt}{\int_0^\infty a^2(t)\,dt} \tag{E-7}$$
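Discrete forms of the estimators above, sketched in numpy: (E-1b) with K chosen so that a pure sinusoid reads back its own frequency, (E-2b) with a bias correction, and the Hilbert envelope built by the standard one-sided-spectrum FFT construction (a stand-in for the Crystal-Ehrman complex digital filter mentioned in the text). The scalings are our choices, not the thesis's.

    import numpy as np

    def sliding_mean(v, k):
        c = np.cumsum(np.concatenate(([0.0], v)))
        return (c[k:] - c[:-k]) / k              # k-point sliding mean

    def freq_estimate(x, k, fs):
        a, b = x[:-1], x[1:]                     # z_j of (E-1)
        z = (((a <= 0) & (b > 0)) | ((a >= 0) & (b < 0))).astype(float)
        # a sinusoid at f crosses zero 2f times per second, so K = pi * fs
        return np.pi * fs * sliding_mean(z, k)   # rad/s, per (E-1b)

    def abs_envelope(x, k):
        # (E-2b); for a sinusoid E|x| = (2/pi) a, so scale by pi/2 to unbias
        return (np.pi / 2.0) * sliding_mean(np.abs(x), k)

    def hilbert_envelope(x):
        n = len(x)
        H = np.zeros(n)
        H[0] = 1.0
        if n % 2 == 0:
            H[n // 2] = 1.0
            H[1:n // 2] = 2.0
        else:
            H[1:(n + 1) // 2] = 2.0
        return np.abs(np.fft.ifft(np.fft.fft(x) * H))   # sqrt(x^2 + x_h^2)

    fs, k = 8000, 40                              # 5 ms sliding average
    t = np.arange(1600) / fs
    x = (1.0 + 5.0 * t) * np.cos(2.0 * np.pi * 1000.0 * t)
    print(freq_estimate(x, k, fs)[800] / (2 * np.pi))   # close to 1000 Hz
    print(abs_envelope(x, k)[800], hilbert_envelope(x)[800])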
For signals generated by time-varying differential operators, the mean square bandwidth is not an effective criterion. Rather, the instantaneous fluctuations of the bandwidth must be considered. We can fix an upper bound by using a Chebyshev inequality for stochastic processes (Parzen) over the time interval $[t_1, t_2]$:

$$P\Big[\sup_{t_1 \le t \le t_2} |b_{x_s}(t)| > \Delta\omega\Big] \le \frac{1}{(\Delta\omega)^2}\,E\Big\{\sup_{t_1 \le t \le t_2} |b_{x_s}(t)|^2\Big\} \tag{E-8a}$$

$$E\Big\{\sup_{t_1 \le t \le t_2} |b_{x_s}(t)|^2\Big\} \le \frac{1}{2}\Big[E\big\{|b_{x_s}(t_1)|^2\big\} + E\big\{|b_{x_s}(t_2)|^2\big\}\Big] + \int_{t_1}^{t_2} \Big[E\big\{\dot{b}_{x_s}(t)^2\big\}\,E\big\{b_{x_s}(t)^2\big\}\Big]^{1/2} dt \tag{E-8b}$$

Here P[A] is the probability of the event A, and E[·] is the statistical expectation with respect to P. If $|b_{x_s}(t)|$ is constant (a time-invariant generating equation with damping coefficient $b_r$), we have

$$P\Big[\sup_{t_1 \le t \le t_2} |b_{x_s}(t)| > \Delta\omega\Big] \le b_r^2/(\Delta\omega)^2$$

As we have seen, the magnitude of $b_{x_s}$ is related both to the effective bandwidth for deterministic system impulse responses and to the effective bandwidth for time-varying differential operators with white-noise driving functions. We may set $\Delta\omega$ to the bandwidth of our bandpass pre-filters and then use this relationship to determine a length of average $[t_1, t_2]$ (which is related to the frequency). The bound, then, depends not only on the instantaneous values of our state variables, but also on the instantaneous values of their derivatives. We can use these relationships to investigate the properties of specific estimators.

The estimation of the derivatives of the state variables is, in general, risky for noisy observations. A more stable estimator is derived by algebraic manipulation of the stochastic derivative (Parzen) of a time series $\{y_n\}$ with finite second moment ($y_n$ may be the discrete envelope samples $\tilde{a}_n$ or the zero-crossing samples $z_n$):

$$\dot{y}_n = \underset{\Delta \to 0}{\mathrm{l.i.m.}}\ \frac{y_n - y_{n-1}}{\Delta}$$

where l.i.m. is the usual limit-in-the-mean definition and $\Delta$ is the sample time between discrete samples. For computer applications $\Delta$ is fixed, and the sliding average of the square of $\Delta y_n$ is more appropriate (for locally ergodic processes):

$$\frac{1}{k} \sum_k^n (y_j - y_{j-1})^2$$

Expanding the square and collecting terms shows that the sliding variance,

$$\tilde{\sigma}_n^2 = \frac{1}{k}\sum_k^n (y_j - \tilde{m}_n)^2, \qquad \tilde{m}_n = \frac{1}{k}\sum_k^n y_j,$$

is a factor in the mean square sliding stochastic derivative (and has a shorter name). Reliable estimation of a significant derivative requires a small value of k (the number of points averaged), while reduction of stochastic variation requires a large value. By using the sliding variance, these requirements are partially reconciled by eliminating terms primarily due to stochastic noise. Also, the sliding variance is more stable than a simple difference of the sliding mean, which reduces to $\frac{1}{k}(y_n - y_{n-k})$.
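A numerical illustration of this point, under assumed signal and noise levels: on a noisy ramp, the mean-square sliding derivative is dominated by sample-to-sample noise, while the sliding variance still reflects the slow, significant variation.

    import numpy as np

    rng = np.random.default_rng(0)
    k, n = 20, 400
    ramp = np.linspace(0.0, 1.0, n)
    y = ramp + 0.05 * rng.standard_normal(n)

    def slide(v, k):
        c = np.cumsum(np.concatenate(([0.0], v)))
        return (c[k:] - c[:-k]) / k

    def msd(y, k):                     # mean square sliding derivative
        return slide(np.diff(y) ** 2, k)

    def svar(y, k):                    # sliding variance
        return slide(y ** 2, k) - slide(y, k) ** 2

    # Ratio of the noisy estimate to the noise-free one: the difference-based
    # quantity is inflated by a factor in the hundreds, the sliding variance
    # by a factor near ten -- the noise terms it eliminates dominate msd.
    print(msd(y, k).mean() / msd(ramp, k).mean())
    print(svar(y, k).mean() / svar(ramp, k).mean())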
Let us summarize the alternatives for selection of a total estimation process:

i.) Sub-interval length -- This is the number of points to be summed, corresponding to the first low-pass filter in Figure E-1. In order to compute the proper sliding averages, these intervals are non-overlapping rather than sliding.

ii.) Sliding average length -- The value k in the formulas for the various estimators relates to both the standard deviation and the mean value.

iii.) Envelope estimator -- Either the absolute value of the input signal or the Hilbert envelope (the square root of the sum of the squares of the input signal and its Hilbert transform).

iv.) Derivative estimators -- For each of the envelope estimators we may define three derivative estimators:

One-Point Difference for Sliding Mean

$$\tilde{\dot{y}}_1(n) = \frac{1}{k}\sum_k^n y_j - \frac{1}{k}\sum_k^{n-1} y_j \tag{E-10a}$$

Sliding Standard Deviation

$$\tilde{\dot{y}}_2(n) = \Bigg[\frac{1}{k}\sum_k^n y_j^2 - \Big(\frac{1}{k}\sum_k^n y_j\Big)^2\Bigg]^{1/2} \tag{E-10b}$$

Mean Square Derivative

$$\tilde{\dot{y}}_3(n) = \Bigg[\frac{1}{k}\sum_k^n (y_j - y_{j-1})^2\Bigg]^{1/2} \tag{E-10c}$$

where $y_j$ is the j-th estimate of the envelope. Note that the last two estimators give only the magnitude of the derivative, not its sign.

In order to make choices between these different alternatives, we need a representation criterion to compare waveforms (which will be the estimates of the underlying signal properties). We note that the mean square error is inappropriate for the types of comparisons we wish to make; the reason is its insensitivity to very sharp derivative variations. An alternative criterion can be derived by use of the Chebyshev inequality discussed in Equation (E-8). By algebraic manipulation of that inequality, using the weighted difference between the two waveforms, we arrive at a criterion that gives a better comparison. For two waveforms $y_1(n)$ and $y_2(n)$, n = 1, 2, . . ., N, define the Chebyshev weighted error by

$$C = \Bigg[\frac{1}{2}\Big(\frac{e^2(1)}{N} + \frac{e^2(N)}{N}\Big) + \frac{1}{N}\sum_{n=2}^{N} \big|e(n) - e(n-1)\big|\,\big|e(n)\big|\Bigg]^{1/2} \tag{E-11}$$

where $e(n) = y_1(n) - y_2(n)$. The assumption of local ergodicity must be invoked to relate this measure to the probability of exceeding a bound, as in Eqn. (E-8) (much the same as the justification for mean square criteria). However, Eqn. (E-11) can be used to compare estimators. In Figure E-2, two estimates and a smooth envelope are shown. The rapid variations are averaged out by the mean square error computation, so that its value for the two estimators is approximately equal (0.098 and 0.101). Using the Chebyshev weighted measure, however, the difference between the two estimators is apparent, indicated by a calculated value of 0.2974 for the rapidly varying one and 0.1689 for the smooth one.

Envelope and frequency estimators must work in situations ranging from slow but large magnitude variation, possibly with a smooth frequency transition (as during vowel formant portions), to rapid, small amplitude and frequency changes (as occur during fricatives). We will first consider the typical vowel onset, which occurs in the order of 50 to 100 milliseconds (see Figure 3). In order to compare our different estimators, we will use the idealized vowel onset waveform defined below.

[Figure E-2: Two envelope derivative estimators compared with the ideal envelope derivative (absolute envelope; sub-interval and sliding-average settings as labeled)]
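The criterion can be transcribed directly; note the formula below follows the reconstruction of (E-11) above, since the printed equation is partly illegible. The demonstration mirrors the Figure E-2 comparison: two estimates with nearly equal mean square error, where the weighted measure penalizes the rapidly varying one.

    import numpy as np

    def chebyshev_weighted_error(y1, y2):
        e = y1 - y2
        n = len(e)
        return np.sqrt((e[0] ** 2 + e[-1] ** 2) / (2.0 * n)
                       + np.sum(np.abs(np.diff(e)) * np.abs(e[1:])) / n)

    t = np.linspace(0.0, 1.0, 200)
    truth = t ** 2
    smooth = truth + 0.05 * np.sin(2 * np.pi * 2 * t)    # slowly varying error
    rough = truth + 0.05 * np.sin(2 * np.pi * 40 * t)    # rapidly varying error
    for est in (smooth, rough):
        mse = np.mean((est - truth) ** 2)                # nearly equal for both
        print(round(mse, 5), round(chebyshev_weighted_error(est, truth), 4))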
Even for this idealized model, we have five parameters to change in order to investigate the properties of the three envelope derivative esti- mators: (l) absolute or Hilbert envelope, (2) variation of subinterval length, (3) variation of sliding average length, (4) amount of frequency deviation, and (5) envelope standard deviation. Typical values for varia- tions of number of these parameters are shown in Tables E—l, E—2, and E—3 and Figures E-2, E-3, and E-4. Table E—l and E—2 show variation of the sliding average length for Bo=0 and various values of envelope standard deviation for both absolute and Hilbert envelope derivative estimators. Figure E—3 shows a typical plot from Table E-2. Table E—3 indicates variation 220 induced in envelope derivative estimators by frequency changes. Figure E-4 shows the Chebyshev weighted error for the three envelope derivative esti— mators as the subinterval length is increased (evaluation of the stability of the derivative estimators is sufficient to tell us about the estimation of the enve10pe itself). Rather than discuss all these data in detail, we will state the choice of estimator subinterval length and sliding average length and give reasons for that choice. We note from Table E—l and E—Z that the Hilbert and absolute envelope estimators have almost exactly the same values of Chebyshev weighted error when there is no frequency deviation. Because of the additional complexity in computing the Hilbert envelope, this would recommend the absolute value envelope as an estimator. However, Table E—3 shows that for frequency deviations the absolute value estimator gives an order of magnitude higher Chebyshev weighted error than the Hilbert envelope estimator. Note that the behavior or the sliding standard deviation as an estimator of envelope derivative behaves much more stably and gives, in most cases, a lower Chebyshev weighted error. Table E—3 for absolute value envelope estimator shows this very dramatically. For this estimator, sub—interval lengths on the order of 0.5 to 2 milliseconds give approximately the same Chebyshev weighted error. This result can be anticipated from the form of the mathematical relationships between the three derivative esti— mators since both the l—point sliding difference and mean square derivative have unaveraged terms that vary as the random samples. For this reason, as shown in Figure E—tl they are very dependent on the variation of the sub— interval average values. 221 TABLE E-l ENVELOPE DERIVATE CHEBYSHEV WEIGHTED ERRORS USING HILBERT ENVELOPE ESTIMATOR mmdope Estimator Envelope Estimator ufl Dev. 1 2 3 Std. Dev. 1 2 3 .2 .0540 .0916 .0514 .2 .0615 .0470 .0616 .1 .1076 .1830 .1093 .1 .1230 .0939 .1227 .6 .1618 .2712 .1619 .6 .1815 .1409 .1832 .8 .2150 .3650 .2208 .8 .2161 .1879 .2132 1.0 .2687 ' .1553 ;.2769 1.0 .3076 , .2348 .3037 j 1 ms sliding average 6 ms sliding average 1 ms sub interval 1 ms sub interval flWflope Estimator Envelope Estimator rd Dev. 1 2 3 Std. Dev. 1 2 3 I I I 3 0256 .0396 .0280 .9 1 .0111 2 .0226 i .0119 -1 .0515 .0791 .0553 .1 f .0827 § .0156 1 .0830 f : I = “i -6 .0776 .1196 .0816 .6 g .1212 3 0688 § 1232 1 -8 . .1011 f .1591 § .1070 i .8 i .1658 1 .0922 g .1622 1-0 i .1308 ' .1995 E .1116 g 1.0 i .2071 I .1157 E .2002 8 ms sliding average 10 ms sliding average 1 ms sub interval 1 ms sub interval Envelope .td. Dev. .2 Envelope Std. Dev. 
TABLE E-2

ENVELOPE DERIVATIVE CHEBYSHEV WEIGHTED ERRORS USING ABSOLUTE VALUE ESTIMATOR
(1 ms sub-interval throughout; estimator columns as in Table E-1)

                 4 ms sliding average        6 ms sliding average
Envelope
Std. Dev.      1       2       3           1       2       3
   .2        .0543   .0916   .0518       .0614   .0470   .0616
   .4        .1085   .1830   .1102       .1228   .0941   .1226
   .6        .1628   .2741   .1662       .1843   .1412   .1831
   .8        .2170   .3649   .2226       .2458   .1883   .2433
  1.0        .2713   .4552   .2791       .3073   .2353   .3037

                 8 ms sliding average        10 ms sliding average
Envelope
Std. Dev.      1       2       3           1       2       3
   .2        .0256   .0396   .0280       .0413   .0227   .0419
   .4        .0514   .0793   .0552       .0827   .0456   .0830
   .6        .0776   .1192   .0817       .1211   .0687   .1231
   .8        .1040   .1593   .1072       .1657   .0921   .1622
  1.0        .1307   .1995   .1318       .2072   .1157   .2002

TABLE E-3

EFFECTS OF FREQUENCY CHANGE ON ENVELOPE DERIVATIVE ESTIMATORS (1 MS SUBINTERVAL)

Absolute envelope derivative           Hilbert envelope derivative
vs. sliding average                    vs. sliding average
(freq. dev. = 0.25)                    (freq. dev. = 0.25)

Sliding                                Sliding
Average     1       2       3          Average     1       2       3
    4     1.805   1.831    .031            4      .2066   .2599   .2057
    6     1.405   1.172    .644            6      .1077   .1369   .1122
    8     1.155   .8593    .432            8      .0990   .0693   .0993
   10     .8641   .6001   1.199           10      .0836   .0383   .0826

Absolute envelope derivative           Hilbert envelope derivative
vs. freq. dev.                         vs. freq. dev.
(sliding average = 10 ms)              (sliding average = 10 ms)

Freq.                                  Freq.
Dev.        1       2       3          Dev.        1       2       3
  .05     1.75    .2745   1.286          .05      .0129    ---    .0136
  .10     .6717   .1400   .9978          .10      .0750   .0255   .0852
  .15     .6482   .2533   1.295          .15      .1685   .0115   .2001
  .20     1.035   .3862   2.512          .20      .1351   .0288   .1172
  .25     .8641   .6001   1.199          .25      .0836   .0383   .0826

[Figure E-3: Chebyshev weighted error for envelope derivative estimators as a function of sliding average length (1 ms sub-interval, absolute envelope, envelope std. deviation = 0.6, frequency deviation = 0)]

[Figure E-4: Chebyshev weighted error for envelope derivative estimators as a function of sub-interval length]

[Figure E-5: Chebyshev weighted error for envelope derivative estimators as a function of sliding average (1 ms sub-interval, absolute envelope, frequency deviation = 0.25, no envelope noise)]

Figures E-4 and E-5 show an increase in the estimation error for larger values of subinterval length and sliding average. Reference to Figure 13 and the corresponding discussion of distortion in linear filters induced by large frequency changes explains this increase; the intuitive notion that a longer averaging time always helps can be misleading in this complex situation. The error rates in Figures E-4 and E-5 were derived for a(t) = 100. There appear to be two sources of variance in envelope estimation: the first is induced by the small number of samples, which would require longer averaging times, and the second is the distortion caused by the time-varying parameters, which would require shorter averaging times. We must select a compromise value, which appears to be approximately 1 ms subinterval length and 6-10 ms sliding average length.

We may conclude that, for this idealized speech acoustical signal, the most stable estimator is the Hilbert envelope with the sliding standard deviation as a derivative estimator. The sliding mean of the absolute value of x(t) gives some envelope estimation distortion, primarily during epochs with changing frequency, but requires much less computation.
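As a closing illustration of the comparison procedure behind these tables (not an attempt to reproduce their entries), the sketch below generates a noisy onset, estimates its envelope both ways, and scores each against the smoothed true envelope with the reconstructed criterion of (E-11); the onset shape and noise level are assumptions.

    import numpy as np

    fs = 20000
    t = np.arange(int(0.050 * fs)) / fs
    rng = np.random.default_rng(3)
    env = 10.0 * np.exp(20.0 * t)                       # smooth stand-in onset
    x = env * (1.0 + 0.2 * rng.standard_normal(len(t))) * np.cos(2 * np.pi * 2000.0 * t)

    def slide(v, k):
        c = np.cumsum(np.concatenate(([0.0], v)))
        return (c[k:] - c[:-k]) / k

    def cwe(y1, y2):                                    # reconstructed (E-11)
        e = y1 - y2
        n = len(e)
        return np.sqrt((e[0] ** 2 + e[-1] ** 2) / (2.0 * n)
                       + np.sum(np.abs(np.diff(e)) * np.abs(e[1:])) / n)

    k = int(0.008 * fs)                                 # 8 ms sliding average
    abs_env = (np.pi / 2.0) * slide(np.abs(x), k)       # (E-2b), bias-corrected
    H = np.zeros(len(x)); H[0] = 1.0
    H[1:len(x) // 2] = 2.0; H[len(x) // 2] = 1.0        # analytic signal (even n)
    hil_env = slide(np.abs(np.fft.ifft(np.fft.fft(x) * H)), k)
    truth = slide(env, k)
    for name, est in (("absolute", abs_env), ("Hilbert", hil_env)):
        print(name, round(float(cwe(est / est.max(), truth / truth.max())), 4))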