This is to certify that the thesis entitled

COMPUTER VOICE IDENTIFICATION METHOD BY USING INTENSITY DEVIATION SPECTRA AND FUNDAMENTAL FREQUENCY CONTOUR

presented by Hirotaka Nakasone has been accepted towards fulfillment of the requirements for the Ph.D. degree in Audiology and Speech Sciences.

Major professor
Date: November 8, 1983

COMPUTER VOICE IDENTIFICATION METHOD BY USING INTENSITY DEVIATION SPECTRA AND FUNDAMENTAL FREQUENCY CONTOUR

By
Hirotaka Nakasone

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Audiology and Speech Sciences

1984

Copyright by Hirotaka Nakasone 1984

ABSTRACT

The major purpose of this study was to investigate the effectiveness of several speech parameters in eliminating the influence of transmission and recording channels of unknown response characteristics. These speech parameters were tested for their effectiveness in text-independent voice identification by computer.

Text-independent speech samples were recorded from 10 male speakers randomly selected from a population of native speakers of the Midwest American-English dialect, each serving as both an unknown speaker and a known speaker. Recordings were made simultaneously by three different transmission and recording devices.
From these speech data, the following parameters were extracted to represent the unknown and the known speakers: 1) intensity deviation spectrum (IDS), 2) a set of fundamental frequency related measurements (FFC), 3) long-term averaged spectrum (LTAS), and 4) choral spectrum (SPT). The principal algorithm utilized for the measurement of parameters 1), 3), and 4) was the Fast Fourier Transform (FFT); for parameter 2), the interactive peak picking technique was employed. All speech parameters were subjected to pre-processing in order to select optimum features from each parameter by using hierarchical clustering, the F-ratio, and data standardization. Distance between the speakers was measured by Euclidean distance. The decision rules employed were the nearest-neighbor rule and the minimum set distance rule. The rate of correct identification served as the criterion to determine the effectiveness of each parameter in eliminating the influence of the response characteristics.

From the results of this study, the following general conclusions were drawn:

1. Both IDS and FFC were found to be effective in eliminating the influence of the transmission and/or recording channels, but their correct identification rates were only moderate (50-60%).

2. The composite parameter of FFC and IDS was found to be effective in eliminating the influence of the response characteristics, although the correct identification rate was not improved, i.e., it was as good as each component of that composite parameter.

3. The composite of FFC and LTAS was found to be the most effective parameter in eliminating the influence of the transmission system, achieving the highest possible correct identification rate (100%).

Dedicated to my mother, Haru Nakasone

ACKNOWLEDGMENTS

I am deeply indebted to Dr.
Oscar Tosi, my professor and dissertation director, whose guidance, encouragement, efforts, and most of all patience, made the completion of this study possible.

I would like to extend my sincere appreciation to Committee members, Dr. Leo V. Deal and Dr. Paul A. Cooke, for their helpful suggestions, corrections, and constructive criticisms toward the completion of this study. I also would like to thank Dr. Jose-Luis Menaldi of the Department of Mathematics, Wayne State University, who also served as a Committee member, for his mathematical assistance and critical review of the computer programs developed for this study.

Special acknowledgments are due Dr. Richard C. Dubes, Department of Computer Science, who kindly made the graphic plotter available, and Dr. Ernest J. Moore, Chairman of the Department of Audiology and Speech Sciences, for his interest and valuable suggestions during the final oral defense meeting.

My warmest gratitude is expressed to Ms. Nancy Brown of the State Bar of Michigan, who proofread my draft; Dr. Christina Jackson-Menaldi, for her companionship and her enthusiasm about this study; and Dr. Paul N. Deputy of Idaho State University, who first introduced me to this exciting field of Speech Science during the early stages of my graduate program.

Special recognition is noted to a number of individuals who provided their voice samples for this study. I am very grateful for their cooperation.

Finally, but certainly not least, I thank my wife, Aiko, who gave me her unconditional support and most qualified assistance in typing and retyping the numerous rough drafts, as well as the final version.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER
I. INTRODUCTION
    Statement of the Problem
    Purpose of the Study
    Significance of the Study
    Limitation of the Study
    Background in Selecting Speech Parameters
    Literature Review
    Definition of the Terminology
    Organization of the Study
II. EXPERIMENTAL PROCEDURE
    List of Equipment
    Recording of Phonetic Materials
        Speakers and Phonetic Materials
        Recording Setting
        Arrangement of Speech Data
    Pre-processing of Speech Data
        Digitization
        Pause Elimination
        Extraction of Speech Parameters
            IDS (intensity deviation spectrum)
            LTAS (long-term averaged spectrum)
            FFC (fundamental frequency contour)
            SPT (choral spectrum)
        Optimization of Features
    Voice Identification Experiment
        Organization of Experiment
        Distance Measurement and Standardization
        Decision Rules
            The nearest-neighbor decision rule
            The minimum set distance decision rule
III. RESULTS
IV. DISCUSSIONS AND CONCLUSIONS
    Discussions
        On Speaker Population
        On Effects of Feature Optimization
        On Composite of Parameters
        On Influence of Pause Elimination
        On Interaction by the Experimenter
    Conclusions
    Implications for Further Research
REFERENCES
APPENDIX A  A SAMPLE TEXT EXCERPT
APPENDIX B  RESPONSE CHARACTERISTICS OF TEAC TAPE RECORDER AND BRUEL & KJAER MICROPHONE
APPENDIX C  SUMMARY OF THE RESULTS FROM PAUSE ELIMINATION
APPENDIX D  COMPLETE-LINK DENDROGRAMS OF IDS AND LTAS FEATURES
APPENDIX E  RAW DATA OF FFC FEATURES
APPENDIX F  A SAMPLE OUTPUT: PARTIAL COMPUTER PRINTOUT OF VOICE IDENTIFICATION
APPENDIX G  VOICE IDENTIFICATION RESULTS: OPERATIONS 1 THROUGH 24
APPENDIX H  LIST OF FORTRAN SOFTWARE

LIST OF TABLES

1. Results of the feature clusterings and corresponding F-ratios of (a) IDS and (b) LTAS
2. Optimum features selected for IDS parameter (as denoted by circles in Table 1(a))
3. Optimum features for LTAS parameters (as denoted by circles in Table 1(b))
4. Nine features and F-ratios of the FFC
5. Summary of the results of the cross-transmission voice identification operations
6. Summary of the results of 24 voice identification operations

LIST OF FIGURES

1. Response curve of a transmitting and recording system including a commercial telephone line and a magnetic pick-up attached at the receiver end of the line
2. A diagram showing equipment used for three simultaneous transmission and recording systems
3. Sample arrangements of text-independent phonetic materials from two speakers
4. A graphic illustration of pause elimination procedure
5. Two sound spectrograms showing examples of an input speech and a resulting output speech after pause elimination
6. Computer plottings of three IDS's generated from the text-independent speech data of a speaker
7. Computer plottings of three LTAS's generated from the text-independent speech data of a speaker
8. Photographs of the CRT displaying the interactive peak detecting procedures
9. Illustration of F0, F0, and A0 by using three consecutive peaks in a simplified short speech segment
10. A plotting of a sample choral spectrum based on an 11-second-long speech taken from a speaker
11. A diagram illustrating the nearest-neighbor decision rule
12. A diagram showing an example of the minimum set distance rule
13. Sammon's projections of 5 known speakers and 5 unknown speakers, all represented by telephone IDS parameter
14. Sammon's projections of 5 known speakers and 5 unknown speakers, all represented by telephone LTAS parameter
15. Sammon's projections of 5 known speakers and 5 unknown speakers: knowns represented by the composite parameter of FFC and LTAS by normal transmission system and unknowns represented by the same composite parameter by telephone transmission system

CHAPTER I

INTRODUCTION

Methods of voice identification can be classified into two general groups: subjective and objective. The subjective methods include aural and spectrographic examination; the objective methods are usually performed by using a computer.
Aural methods (short term and long term memory) are performed by listening to the recorded voices of an unknown and a known speaker or by remembering the speaker-dependent features of a voice. These are primarily based upon perceptual extraction of the speaker-dependent speech characteristics. The final decision regarding the identity of voices is made by the human examiner based upon his subjective judgment. With the spectrographic methods, the sound spectrograms produced from speech samples under study are examined. The examiner compares the acoustical characteristics of the known and the unknown voices displayed on the three dimensional plottings of the spectrograms (frequency, intensity, and time). In spite of the objective means of displaying the speech parameters, the final decision still belongs to the subjective judgment of the human examiner. Hence, both aural and spectrographic methods are called subjective methods (Tosi, 1979). When a computer is properly programmed with a set of algorithms, the results concerning identification of the voices are reproducible: when similar types of data are submitted to the same procedure, the same output is expected. Hence, the computer method is considered to be objective, and is commonly referred to as automatic or semi-automatic speaker recognition.

The term 'voice identification' has been applied as a generic name which encompasses various aspects in the process of determining the identity of an unknown speaker, given his/her voice samples and voice exemplars collected from one or more known speakers. To be more specific, Tosi (1979) classifies tasks of voice identification as follows:

    According to the composition of unknown and known voice samples, tests of voice identification or elimination can be classified into three groups: discrimination tests, open tests, and closed tests.
    In the discrimination tests the examiner is provided with one unknown voice sample and one known voice sample. He has to decide whether or not both samples belong to the same talker.... In the open tests the examiner is given one unknown voice sample and several known samples. He is told that the unknown sample may or may not be found among the known samples.... In the closed tests of voice identification the examiner is also given one unknown voice sample and several known voice samples but he is told that the unknown voice sample is also included in the known voice samples....(pp. 4-5).

In this study, since in all tests the unknown voice was always included within the known voices, the task is considered to be equivalent to the 'closed test' quoted above, except that the 'examiner' is replaced by a 'computer'. Hence, the term 'voice identification' as used in this study covers only this specific task, which is described as follows: Given voice samples of an unknown and a group of knowns, the task is to select a speaker whose voice sample is the closest to that of the unknown.

The standard procedure of voice identification by computer can be broadly divided into three major stages: data collection, measurement, and identification processing. In the first stage, speech samples from a given group of speakers are recorded and stored. In the second stage, speech parameters (characteristics) are measured. This stage includes a series of pre-processings, such as filtering, deletion of pauses and gaps, extraction of features, and statistical processing for feature optimization. In the third stage, the identification operation is performed by applying appropriate decision rules or criteria.

Speech samples collected can be either 'contemporary' or 'noncontemporary' (Tosi, 1979).
Contemporary speech samples are obtained from each speaker during the same recording session, whereas noncontemporary speech samples are recorded over some time intervals depending upon the scope of the researcher's interest (days, weeks, months, or years). It has been noted that the task of voice identification is easier with the contemporary samples than with the noncontemporary samples.

Another aspect involved during the data collection stage is concerned with the type of the phonetic content spoken by the speakers: 'text-dependence' vs. 'text-independence'. When all the speech samples of the speakers are the same in context, it is commonly referred to as 'text-dependence'. In this case the speech samples of the speakers under an identification process are compared word by word, phrase by phrase, or sentence by sentence. The major advantage of text-dependent speech samples is that they can be utilized in every method of voice identification. Because of this advantage, this type of text has been rigorously studied by many researchers, but mainly directed toward industrial or commercial applications.

In 'text-independence', all the speech samples spoken by the speakers are different in phonetic content. The duration of the text-independent materials must be relatively long, and the minimum duration appears to vary in length depending upon the researchers, who use the term 'text-independence' somewhat differently. Atal (1974) generated the text-independent speech sample from a single sentence by cutting it into 40 equal segments and later recombining them in random order. He reported that a minimum of two seconds of text-independent speech sample resulted in a high correct identification rate. Bunge (1978) and Furui et al.
(1972) reported, in close agreement, that a minimum duration of 11 and 10 seconds, respectively, is required for a sufficiently high correct identification rate. In Bunge's study, 41 male and 9 female speakers produced 50 different texts, each text lasting 11 seconds. It was found that when the text length was decreased to below 11 seconds, the influence of the text became increasingly obvious and finally dominated the speaker's identity. He concluded that an 11-second duration was a limit for text-independence for a high correct identification rate. Markel and Davis (1979) defined the term 'text-independent' a little more stringently. Their speech data consisted of the extemporaneous speech material from 17 male speakers, each speaker recorded in five interview sessions at one week intervals. They attempted to make the speech samples free from linguistic constraint and further free from the manner of speech production. With this type of text-independent materials, a 39-second text length (containing only voiced frames) resulted in high correct identification. Tosi et al. (1979) extended the scope of text-independence to different languages. In their study speech samples were obtained from 20 speakers who could speak three languages (Piamontes, Italian, and French) with equal fluency. Each speaker was recorded while reading a 10-minute long passage from books and newspapers in three sessions, each session separated by one week. They concluded that automatic voice identification with text-independent speech materials of different languages is promising.

Speech samples so collected are then processed for extracting acoustic speech parameters to represent the speakers. Acoustic speech parameters are the measurements derived from speech signals, which are considered to consist of three major elements. The first element is the energy source coming out of the lungs as a stream of air pressure.
The second element is a modulation of this air stream into vibratory motions (for voicing) set by the vocal cords or turbulence of air in a constriction of the vocal tract (friction and plosive). The third element involves resonance phenomena taking place as the modulated air pressure traverses through the vocal tract (pharynx, oral and nasal cavities). Each element contains information on the linguistic contents as well as the speaker characteristics.

The first element carries variation of overall speech intensity as a function of time. The second element determines the fundamental frequency and its harmonics of voiced phonemes, giving rise to the perceptual pitch of the speaker. The third element is considered to be the most important because resonance gives a particular shape or envelope to the spectrum of the speech sound output, which includes both a phonetic content and the individual characteristics of each speaker (Tosi, 1979).

Speech characteristics (parameters) used in voice identification by a computer are generally extracted from one or a combination of these acoustic elements. Often used parameters are short-term spectra, long-term spectra, formants, fundamental frequency, and other variations statistically derived from these basic parameters. Usually, a certain number of features are further selected from the parameter and these features represent the speaker in a multidimensional space. In general, the identification process is based on the distance measured between a set of features of the unknown speaker and that of the known speakers.

The basic implicit assumption for voice identification is that a speaker can be distinguished by his/her speech signals and that the variation of speech characteristics within an individual speaker differs from that of the other speakers. The former variation is commonly referred to as the 'intra-speaker variability' and the latter is referred to as the 'inter-speaker variability'.
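The distance-based identification described above can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the Fortran software of this study: the feature values and speaker labels are made up, the function names are hypothetical, and the decision rule shown is a plain nearest-neighbor rule with Euclidean distance applied to a closed set of known speakers.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify_closed_set(unknown, known_speakers):
    """Return (speaker_label, distance) of the known sample nearest to `unknown`.

    `known_speakers` maps a hypothetical speaker label to a list of
    feature vectors (one per voice sample of that speaker).
    """
    best_label, best_distance = None, math.inf
    for label, samples in known_speakers.items():
        for vector in samples:
            d = euclidean_distance(unknown, vector)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label, best_distance


# Made-up two-feature data for three hypothetical known speakers.
known = {
    "A": [[1.0, 2.0], [1.2, 1.9]],
    "B": [[4.0, 0.5]],
    "C": [[0.0, 5.0]],
}
unknown = [1.1, 2.2]

label, distance = identify_closed_set(unknown, known)
print(label, round(distance, 3))
```

Because the task is a closed test, the sketch always returns some known speaker; it has no threshold for rejecting an unknown who is absent from the known set.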
The sources of the intra-speaker variability can be attributed to different emotional states, various manners of speech production demanded by different circumstances, and small physiological changes in the articulatory apparatus of the same person over an interval of time. On the other hand, the sources of the inter-speaker variability are the different vocal tract configurations, physiological characteristics of the vocal cords, and idiosyncratic speaking habits of different speakers, etc.

Statement of the Problem

There are several sources which could interact with the process of voice identification by a computer or by any other method. These are distortion of speech characteristics of an individual due to the unknown response curve of the transmission and/or recording devices, various types of noise which deteriorate the intelligibility of speech samples, and intentional self alteration of voice either to disguise the identity or to impersonate another person. In addition, differences in phonetic content and duration of the speech utterances of the speaker under study interact with the procedures of voice identification. On frequent occasions, the phonetic content spoken by the unknown speaker can be different from the one spoken by the known speaker. This condition calls for so-called 'text-independent' voice identification, which usually requires speech samples of relatively long duration obtained from each speaker.

This study focuses on the problem caused by the influence of the transmission and recording devices by using text-independent and contemporary phonetic materials.

Telephone transmission contains in itself several sources of variables such as the carbon microphone, number of connecting junctions, line distance, and carrier systems. Under this
Under this Ition, the response characteristics of the telephone line, in a life setting, cannot be determined. This is largely due to uncertainty of the telephone line involved in each telephone which is random, according to the existing traffic at the It of the call. Ordinarily the speech signals must be stored ater processing in a recording medium which also has its own nse characteristics. An example of this type of combined rtion in response characteristics is given in Figure l. ———————-—1-<-I-————-v- \—T——-a—_I .-_____.-...._7J__-_-_I ..... .-- .-..__-.-. ..----____--.----J y .l/ .05 0.1 0.2 0.5 1 2 3 5 1O FREQUENCY (KHz) -> e 1. Response curve of a transmitting and recording system, :luding a commercial telephone line and a magnetic pick-up :ached at the receiver end of the line (taken from Tosi, '9). trans charae sample conver respor of the and I only c to pr distor resear Propos samples of the unknown speaker In many cases, speech nsmitted and recorded through the system which has the racteristics as shown in Figure I are compared with speech ples of the known speakers transmitted and recorded through a ventional microphone-to-tape recorder which has relatively flat ponse characteristics (linear). In such a case, the magnitude the distortion present in two entirely different transmission recording systems may be even more serious than the case where one type of the transmission system is being used. It seems preclude any reasonable effort to eliminate this type of tortion made up of the unknown sources of variables. One earcher (Tosi, 1979) being pessimistic about this phenomenon, Yosed a very interesting idea as a possible solution: ....elimination of perturbing telephone influence could consist of including a ‘standard' burst sound at the beginning of every telephone communication. Because the real spectrum of such a 'standard' sound would be known, ‘ the transfer function (response characteristics) of the telephone line could then be easily computed. (p. 
55)

To date, unfortunately, this idea has not yet been realized.

Other alternative methods to eliminate the influence of the frequency distortion are 'normalization' procedures on the speech parameters extracted (Atal, 1978; Furui, 1981a; Bunge, 1978; Tosi and Nakasone, 1980) and selection of the appropriate speech parameters which are considered to be inherently resistive to the frequency response characteristics (Atal, 1972; Markel et al.; Hunt et al., 1977). Since these alternative methods are implemented in this study, details will be discussed in a later chapter.

Purpose of the Study

The major purpose of this study was to investigate the effectiveness of several speech parameters in eliminating the influence of transmission and recording devices of undefined response characteristics. These speech parameters were tested in text-independent voice identification by a computer.

It is known that a speaker can be recognized by his/her voice even when the content of the text spoken is different (text-independent). Among many conditions necessary for a high rate of success, two factors are essential: 1) the speakers are cooperative (no disguise/mimicry, small variation in voice characteristics from one recording session to another, etc.) and 2) the same transmission recording device is used for acquisition of all the unknown and known voices.

In the present study, one of these two factors was under control, i.e., all the speakers were cooperative, rendering clean and relatively uniform texts. Each speaker read an excerpt prescribed for him and was recorded in a single recording session, thus minimizing possible variation in his speech due to a time interval between sessions. The other factor was intentionally varied, i.e., voice samples were collected through different transmission and recording devices. An annoying outcome of the latter procedure is
An annoying outcome of the latter procedure is 0 identical speech samples of one speaker (one sample being ed by one transmission and recording device, another sample bein resu iden tran reco prob tran this char aPpl spec fund Spee aPPl succ Proc two III. unkr Char com; Idel rev] fune ing collected by another transmission and recording device) may ult in different spectral shapes and consequently yield a false ntification. This type of problem can be easily solved if the nsfer functions of the transmission systems involved in the ording are well—defined. But the prime obstacle in solving this blem'is that in most cases the true transfer function of the nsmission recording channels is not known. The present study focused on alternative approaches to solve 5 problem of eliminating the influence of the undefined response racteristics upon the speech samples. These approaches were U lication of the speech parameter IDS (intensity deviation ctra), 2) application of the speech parameter FFC (several damental frequency related measurements), 3) application of the ech parameter LTAS (long-term averaged spectrum), and A) lication of the composite parameter of l) and 2) and of 2) and The IDS is a spectrum statistically derived from a set of essive short-term spectra. The prototype of computational edure for IDS prameter was introduced by Bunge (I978). IDS has properties: By definition (details to be discussed in Chapter it cancels the influence of the transmission systems of the own response characteristics, and it represents dynamically ging spectral structures of the speech signals. The FFC is sed of a set of fundamental frequency related measurements ined in this study) derived from a pitch contour. Literature wed in the field of voice identification indicates that the mental frequency is a sufficiently effective speech parameter in di respon record I parame pre-pr optimi. 
statis proced PDP ll Scienc T; Parame Io distinguishing speakers and is relatively insensitive to the nse characteristics of the transmission line used for ding the speech samples. In order to enhance the 'elimination' effects, all speech eters (except choral spectra) were subjected to a series of rocessing procedures, such as pause elimination, feature ization by the hierarchical clustering technique and F-ratio stics, and standardization of features. Most of the dures were carried out by FOrtran softwares implemented on a l/AO minicomputer at the Department of Audiology and Speech ces, Michigan State University. he following assumptions on the performance of each Speech eter were set up: The IDS is sufficiently effective in eliminating the influence of the frequency distortion from the voice samples of the unknown and the known speaker recorded through different transmission and recording systems, and it is equally effective for the voice samples recorded through the same transmission and recording system. The FFC is sufficiently effective in eliminating the influence of the frequency distortion from the voice samples of the unknown and the known speaker recorded through different transmission and recording systems, and it is equally effective for the voice samples all recorded through the same transmission and recording system. The LTAS is highly effective in eliminating the influence of 13 the frequency distortion from the voice samples of the unknown and the known all recorded through the same transmission and recording system, but the effectiveness is decreased when the voice samples of the unknown is recorded through one transmission and recording system and that of the known through another system. Choral spectrum is assumed to be as effective as the LTAS for the same conditions described in 3. 
5. The composite parameter of FFC and IDS increases the effectiveness level in eliminating the influence of the frequency distortion from the voice samples of the unknown and the known recorded through different transmission and recording systems.

6. The composite parameter of FFC and LTAS increases the effectiveness level in eliminating the influence of the frequency distortion from the voice samples of the unknown and the known recorded through different transmission and recording systems.

To test the above assumptions, a total of 24 voice identification operations were conducted in different designs according to various combinations of the speech parameters and the transmission systems used for recording the unknown and the known speakers. Each operation yielded its results in terms of the rate of correct identification, which served as the measure of the effectiveness of each parameter tested.

Significance of the Study

It is known that a person can be recognized by his/her voice alone when heard live or over the telephone line, provided that the person is somebody the listener is familiar with. Despite some change in the perceptual quality of the voice, the judgment on the identity of a speaker does not seem to be critically influenced by the transmission line. This is the underlying fact that the objective of this study rests on. This study was designed to search for and study several speech characteristics which are not disturbed by transmission and/or recording devices. It is hoped, therefore, that the results from this study will contribute directly or indirectly to understanding more about human speech sound other than the linguistic message it carries.

Another justification for this study is the scarcity of research reports dealing with the problem of the influence of the transmission and/or recording devices coupled with text-independent speech materials for voice identification.
A large body of research reports is available, though most of these reports are based on text-dependent speech data recorded under optimally controlled conditions. To date, only a few researchers have been concerned with these two problems together. One such study, conducted by Hunt et al. (1977), consisted of text-independent materials spoken by 13 speakers and transmitted over FM radio broadcast. However, they used only one transmitting and receiving system of high quality for recording all the speakers. Therefore, even though excellent identification rates were reported in their study, the adverse influence of the transmission and recording media does not seem to have been taken into account. The problem addressed in the present study involved text-independent voice identification tasks using the voices of the unknown speakers distorted by the telephone transmission system and the relatively clean voices of the known speakers recorded through a conventional microphone-to-tape recorder system. Admittedly, this problem is very difficult to solve fully, but there is a legitimate need for investigation.

Limitations

This study was exploratory in nature. Therefore, the results are applicable only within the following limitations.

First, a nominal size of 10 speakers was employed. This size is nontrivial considering that the actual speaker size was tripled by three simultaneous transmission systems and that only Fortran software was available (no real-time hardware processor was utilized). Because of this small speaker size, no statistical inference was attempted for generalization of the results.

Second, all the speakers were recorded only once (contemporary text) while reading in more or less the same style. These two conditions obviously contributed to producing the unusually small intraspeaker variability. Thus, the results from this study can be generalized only to these types of speech data.
Third, though a minicomputer was applied as the major tool for data processing including decision algorithms, at a few critical points arbitrary intervention by the experimenter was demanded. For this reason, the design of the voice identification in this study was not meant to be a completely 'objective' method.

Finally, the term 'voice identification' as operationally defined for the purpose of this study was of the 'closed' type, i.e., in each identification process, the speech sample of the unknown speaker was always included among those of the known speakers. Presumably this type of identification procedure can be considered exploratory, or lenient, in terms of its credibility; hence, the model tested in this study, as it is, has little immediate applicability to a real environment.

Selecting Speech Parameters

Speech samples in this study were not only under the influence of unknown response characteristics but also text-independent. When text-independent voice samples are involved, a relatively long duration of speech is required in order to homogenize the phonetic content of the different texts from all speakers. Within this constraint, several different types of speech parameters were reviewed from the literature in this field. Speech parameters considered prima facie were cepstral coefficients, linear predictive coefficients (LPC), long-term averaged spectrum (LTAS), choral spectrum, variance spectrum, intensity deviation spectrum (IDS), and fundamental frequency contour (FFC).

Of these seven parameters, the cepstrum was discarded despite its well recognized useful property: it is relatively resistant to the influence of the frequency response characteristics of the transmission system. It was found to be impractical to apply the cepstral analysis to the amount of 'text-independent' speech data to be processed by using only Fortran software on a minicomputer.
The LPC was also discarded because of its susceptibility to the response characteristics of the transmission system used. The remaining five parameters, LTAS, choral spectrum, variance spectrum, IDS and FFC, were tested in the pilot study. Speech samples were collected from five male speakers through two transmission systems, a telephone line and a conventional microphone-to-tape recorder. The identification process was performed by the hierarchical clustering technique with the complete-link method, without a feature optimization procedure.

From the results of the pilot study, it became apparent that LTAS, variance spectrum, and choral spectrum were very sensitive to the frequency response characteristics of the transmission system and that IDS and FFC were relatively insensitive to the influence. In spite of their susceptibility to the influence, LTAS and choral spectra were found to be very effective speech parameters for voice identification provided all voice samples were recorded through the same transmission and recording channels. Consequently, IDS, FFC, LTAS, and choral spectrum were adopted as speech parameters for this study.

Literature Review

A review of the related literature revealed that research reports on voice identification by a computer contain vast diversification in the methodology employed, depending upon the type and number of speakers, phonetic materials, and the tasks involved. In contrast, there are very few reports primarily dealing with the problem of the transmission and/or recording media. It was, therefore, felt that an organized presentation of these diversified studies in methodology was neither practical nor essential for the purpose and the scope of this study. Hence, this section presents several studies dealing with automatic voice identification, arranged by the type of speech parameters employed.

A short-term spectrum is generated from a very brief portion of a speech signal. It can be produced by the use of a bank of filters simulated on a digital computer (Pruzansky, 1963; Bricker et al., 1971) or by using the Fast Fourier Transform.
In either case, the principal expression used is the Fourier transform:

$$F(\omega) = \int_{0}^{T} f(t)\, e^{-j\omega t}\, dt$$

[...] procedure, resulting in a higher recognition rate of 97%. The studies cited were performed on speech data recorded through the same transmission and recording equipment, though it was noted that the spectra were strongly influenced by the frequency characteristics of the recording devices. The short-term spectrum has often been applied to text-dependent data, for which tedious and complex procedures for time alignment become extremely important during the recognition procedure.

Later, the long-term averaged spectrum was considered as an alternative to compensate for the tedious procedures of the temporal alignment of the set of text-dependent short-term spectra. Typically, a long-term averaged spectrum is obtained from a speech signal of a relatively long duration (1-2 minutes in Tosi; 70 seconds in Markel et al., 1977; 10 seconds in Furui et al., 1972; 11 seconds in Bunge, 1978). When a computer is used, this spectrum can be easily produced either by using a bank of filters or by the FFT algorithm. A unique property of this spectrum, when taken from a sufficiently long speech, is that the phonetic contents of different utterances can be balanced, thus enabling text-independent voice identification. Its potential use for text-independent voice identification has been recognized by many researchers (Tosi et al., 1979; Bunge, 1978; Majewski and Hollien), provided there is no influence of the transmission and/or recording media.

One variation of the long-term averaged spectrum is called the 'choral spectrum', developed by Tosi (1979). He defines the spectra as long-term Fourier transforms of temporal choral speech (Tarnoczy, 1958), which is produced from a temporal rearrangement of a speaker's normal speech. The major difference between choral spectrum and long-term averaged spectrum is seen in the computational economy: the former is said to be generated much faster than the other, by a factor of about 20.
Tosi et al. (1979) conducted a text-independent voice identification by applying the choral spectra. Speech samples of different languages were recorded from 20 speakers. By using the hierarchical clustering technique, they reported an identification error rate of 5 to 30% depending upon the method used and suggested the promising utility of the choral spectra for text-independent voice identification. However, inasmuch as the spectrum bears the same property as the long-term averaged spectrum, the choral spectrum is also known to be susceptible to the influence of the transmission media.

Linear predictive coefficients (LPC) have been studied by many speech researchers for automatic speaker recognition (Atal, 1974, 1978; Markel et al., 1977; Markel and Davis, 1979; He and Dubes, 1982). The LPC is usually derived by using the autocorrelation method from speech signals, revealing the spectral properties of the speech as a function of time. LPC can represent the fundamental frequency and its harmonics when the order of the predictor coefficients is relatively high (40 coefficients) and can represent formants when the order of the predictor is low (12 coefficients), but it is susceptible to the frequency response of the recording apparatus and the transmission systems. Atal (1978) studied the effectiveness of LPC (12 coefficients) for automatic speaker identification by using 10 female speakers. All speakers recorded six repetitions of the same short sentence, using a high quality microphone on two occasions at 27-day intervals. Each utterance was divided into 40 segments; and from each segment, 12 predictor coefficients were extracted to form a vector of 40 dimensions. The identification decision was based on the non-Euclidean distance measure defined by Schafer and Rabiner (1975). The correct rate of identification was found to be 63.8% (of 60 total judgments).

He and Dubes (1982) presented a paper on speaker identification by using LPC and pitch contour.
Speech samples were provided in a sound booth by eight Chinese male speakers, each furnishing 15 repetitions of a short sentence in the Chinese language into a microphone attached to a tape recorder. Each utterance was divided into b-second epochs, resulting in five speech data per speaker (each datum thus containing either two complete spoken sentences or one complete and one partially complete sentence). Each datum was then partitioned into 40 segments. From each segment, 12 predictor coefficients were computed. A pitch contour was also prepared from each datum by two different methods: the cepstrum method and the peak detecting technique. The five features measured on a pitch contour were the maximum, minimum, and average pitch period, the maximal slope, and the larger one of the two pitch period values determining the largest slope. Subsequently, the data were subjected to feature optimization procedures consisting of the hierarchical clustering technique, discriminant analysis, and F-ratio, as discussed in Chapter II. For the identification decision operation, the Euclidean distances were computed between the test pattern and the reference patterns, and the decision criterion was based upon the nearest-neighbor decision rule.

The results from their study were as follows: 81.9% by pitch contour when all speech data were included, but the rate rose to 96.4% when the data containing partially complete sentences were discarded; 75.6% by LPC for the entire data, but increased to [..].4% when the data containing partially complete sentences were discarded. A combination of features from the pitch contour and the LPC was also tested with all speech data included (in an attempt to compensate for varying text contents). This resulted in [..].8% correct identification. Although the authors of this study did not specify it, the speech data were rather text-dependent even if the proper alignment of each phonetic unit was not attempted.

The cepstrum of a speech signal is defined as the power spectrum of the logarithm power spectrum of the signal (Noll, 1967).
This method was introduced as a means to separate the fundamental frequency from the speech signal in the frequency domain. Much attention has been given to the cepstral method in the field of automatic speaker recognition (Atal, 1978; Luck, 1969; Bunge, 1978; Tosi, 1979; Furui, 1981b; He and Dubes, 1982). The reason for its popular application appears to be twofold: the cepstrum is mathematically well defined, i.e., it lends itself to reliable algorithmic implementation on a computer, and it is relatively resistant to the frequency response characteristics of the transmission system as well as the recording devices.

Furui (1981a) published a comprehensive study on the techniques for automatic speaker verification based on the cepstrum coefficients computed on a fixed, sentence-long utterance (text-dependent) recorded over the conventional telephone connection. A total of 50 utterances from each of 20 speakers (10 males, 10 females) were recorded over a period of two months. Several kinds of utterance sets were prepared, all band limited from 100 Hz to 3.0 kHz. Cepstral coefficients were derived from the predictor coefficients (LPC) obtained from the same speech segment. After several pre-processing steps (such as pause elimination, time registration, normalization, and optimum feature selection) were applied, the verification error rate was less than 1% even if the test utterance and the reference utterance were subjected to different transmission conditions (but, presumably, within the same telephone connection). The key factor, Furui commented, would be the normalization procedure of the cepstrum coefficients to remove the distortions of the response characteristics introduced by the transmission system. The coefficients were averaged over the duration of the entire utterance, and the average values were subtracted from the cepstrum coefficients of every frame.

Pitch contour is a plotting of the time-varying pitch of the speech signal.
Or, synonymously expressed, pitch contour is "the glottal frequency - characteristics melody curve of the speaker" (Tosi, 1979). There are two properties of pitch contour: first, it is sufficiently speaker-dependent; second, it is resistant to the distortion introduced by the frequency response characteristics of the transmission and recording devices. Many researchers (Atal, 1972; Markel, Oshika, and Gray, 1977; Hunt et al., 1977; He and Dubes, 1982) explored pitch contour as one of the speech parameters for speaker recognition, and confirmed it to be a highly (or, at least, sufficiently) reliable speaker-dependent characteristic.

Atal (1972) studied the average pitch and the measurements of temporal variation of pitch for automatic speaker identification. Speech data were text-dependent and collected from male speakers, each producing 6 repetitions of the short sentence, and each sentence lasting 1.8 to 2.8 seconds depending on the different rate of utterance by individual speakers. He reported that the measurement of the average pitch was far better ([..]% correct identification) than that of the temporal variation of pitch.

Markel et al. (1977) also investigated pitch contour but applied it to text-independent speech materials obtained from a small group of speakers with homogeneous pitch distribution. They used F-ratio (analysis of variance) as the measure of the effectiveness of the average pitch and the standard deviation (computed on pitch contour) as a function of the number of frames (from Lv = 10 to Lv = 1000, which corresponds to about 70 seconds). It was concluded that the average pitch is significantly more effective than the standard deviation of pitch and that the estimated standard deviation about the average pitch was reduced from about 18 Hz (for Lv = 10) to about 6 Hz (for Lv = 1000). They applied two other parameters, the spectral-related feature obtained by LPC and gain variation, to the same speech data for the purpose of comparing the three speech parameters.
Ranking of the effectiveness in discriminating speakers was in the order of spectral feature, pitch contour, and gain variation.

Hunt et al. (1977) conducted text-independent voice identification by using pitch contour (and including other speech parameters). They used a group of 12 professional meteorologists, each reading two sets of different text transmitted over the communication channel. Seven different kinds of fundamental frequency related measures were derived from a pitch contour, viz., mean fundamental frequency, group mean of the mean fundamental frequency and its rate of change, and proportion of time that fundamental frequency is rising or falling. The pitch contour was prepared by the use of a hardware implemented real time cepstral processor. Identification performance was tested in two ways: first, texts of the unknown and the known speakers were arranged in a noncontemporary manner, resulting in 89% (133 out of 149 samples) correct identification; and second, texts were arranged in a contemporary manner, resulting in 100% correct identification. In addition to pitch contour, they included two other parameters, spectral related and gain related. When all three parameters were compared in terms of identification performance, the ranking was in order of spectral parameters, followed by pitch contour, then finally, gain related parameters. The study by Hunt et al. showed that voice identification can be done even if speech data of different content were transmitted over the communication channel; however, the degree and nature of the distortion of the transmitting system were not specified clearly. It appears that all speech data in their study were transmitted and recorded by the same system.

Definition of the Terminology

Key terminologies used throughout this study are operationally defined as follows.

Category: The term refers to a set of patterns (or samples) and is used synonymously with a speaker in this study.
Cross-transmission: This term refers to a voice identification procedure in which the speech data from the unknown speaker and the known speaker are prepared through two different transmission systems.

Feature: A feature refers to an individual measurement component within a speech parameter. The number of features determines the dimensionality of a parameter. For instance, the first frequency component in the IDS, or the average fundamental frequency in the FFC, is called a feature.

FFC (fundamental frequency contour): The FFC is a speech parameter which consists of a set of fundamental frequency related measurements computed on a pitch contour of a running speech sample.

IDS (intensity deviation spectrum): The IDS is a speech parameter derived from a set of successive short-term spectra. The IDS reflects, by definition, temporal variations of the spectra of speech sound.

Linear system: This system refers to a path in which the speech sound is transmitted by a microphone and recorded onto an audio tape by using a tape recorder. In this system the microphone and the tape recorder are characterized as having a relatively flat response curve (linear) covering the speech frequency range. The term 'Linear speech' or 'Linear -' in this study refers to the speech sound, or processed speech data, made available by using this system.

LTAS (long-term averaged spectrum): The LTAS is a speech parameter computed by superposing and averaging n short-term spectra. Each one of these spectra originates from successive segments of the speech samples of about 11 seconds utilized in this study. The LTAS reflects the static spectral features of speech sound.

Normal system: It refers to a path in which the speech is transmitted by a microphone and recorded onto a magnetic tape by a tape recorder. The system is assumed to have undefined response characteristics. The term 'normal speech' or 'normal -' in this study denotes the speech sound, or processed speech data, made available by this system.
Pattern: It is composed of the set of features chosen from one or more of the speech parameters. A pattern is equivalent to a speech sample and is the basic data set representing the voice characteristics of the speaker.

Short-term spectrum: This spectrum is generated from a short segment (in this study 25.6 msec) of each processed speech sample by using the Fast Fourier Transform (FFT).

Speech parameter: The term refers to the measurement(s) derived from the acoustic speech signal. The individual feature is extracted from a speech parameter.

SPT (choral spectrum): The SPT is a speech parameter produced from choral speech by processing it through the FFT. An elaborated definition and algorithm to generate this spectrum is presented by Tosi (1979). In this study, choral speech is obtained by superimposing 0.4096 second long segments of on-going speech.

Telephone system: It refers to a path in which the speech is transmitted via the telephone transmitter, received at the remote end of the local line telephone set, and recorded onto an audio tape recorder. The term 'telephone speech' or 'telephone -' in the text refers to the speech sound, or processed speech data, made available by this system.

Text-independence: This term refers to the type of phonetic materials used as the speech data for voice identification. Text-independent voice identification uses different texts from the unknown and the known speakers. The counterpart of this term is 'text-dependence'.

Transmission system: Restricted to this study, 'transmission system' refers to a system of the devices used for transmitting the speech sound, such as a microphone, telephone transmitter and its attachment, and so forth. The term 'system' is used to refer to a path of the speech signal from the speaker's mouth to the sound storage device, an audio tape recorder.

Voice identification: It is defined as a process of selecting a speaker (from a group of the known speakers) whose voice sample is the closest to that of the unknown.
Within-transmission: This term refers to a voice identification procedure in which all speech data from the unknown and the known speakers are prepared by the same transmission system.

Organization of the Study

This study is divided into four chapters. Chapter I presented a general introduction to voice identification, the statement of the problem, the purpose, significance and limitations of this study, a review of the literature, and a list of operational definitions of the terminologies. Chapter II is devoted to a detailed description of the experimental procedures: recording of speakers, digitization, pause elimination, generation of speech parameters, feature optimization and standardization, and identification operations. Chapter III presents the results from the identification operations. Chapter IV concludes this study by presenting discussions, conclusions, and implications for further research.

CHAPTER II

EXPERIMENTAL PROCEDURE

This chapter is organized into three sections: 1) recording of phonetic materials, 2) pre-processing of these phonetic materials, and 3) experimental procedure for voice identification operations. In the first section, recording and arrangement of the recorded speech data are discussed. In the second section, procedures and algorithms for pause elimination, extraction of the speech parameters, optimization of the features, and standardization of the features are discussed. In the third section, distance measurements, decision processes, and designs of voice identification operations are covered. The following is a list of equipment and software used throughout this study.
List of Equipment

For recording speech data:

    Condenser microphone, Bruel & Kjaer, type 4132
    Cathode follower, Bruel & Kjaer, type 2619
    Microphone amplifier, Bruel & Kjaer, type 2603
    Dynamic microphone, Ampex, model 2001
    Local line telephone sets
    Open-reel tape recorder, Teac, model A-7010
    Open-reel tape recorder, Sony, model TC-106A
    Cassette tape recorder, Marantz, Superscope, model C-202-LP
    Open-reel tapes, Scotch, low noise, 1.5 mil, 1200 ft
    Cassette tapes, 3M, low noise, 30-minute (one side)

For processing speech data:

    PDP 11/40 minicomputer, 64 k(byte) memory, with 2 disk drives
    16-bit A/D and D/A converters, 3 Rivers Computer Corp.
    RK05 disks, 2.4 megabytes
    CRT monitor
    Light pen connected to the CRT monitor
    Decwriter II, Digital Equipment Corp.
    Open-reel tape recorder, Ampex, model 4000G
    Fortran software (see Appendix H)

RECORDING OF PHONETIC MATERIALS

Speakers and Phonetic Materials

The phonetic materials used in this study consisted of 6-minute long speech samples. The subjects were 10 male speakers randomly selected from a population of native speakers of the Midwestern American-English dialect, ages ranging from 20 to 35, free from defective or pathological voice conditions. All speakers read different excerpts from a nontechnical book (Appendix A shows a sample excerpt) at a 'normal' reading speed. Each speaker was asked to rehearse by reading aloud a brief paragraph (one which was not going to be included as speech data for him) while all the recording equipment and the telephone line were checked for proper function. During recording, each speaker was instructed to maintain approximately the same distance from his mouth to the transmitter of the telephone set (about 3-5 cm) and to the other microphones (about 15 cm). No additional instructions as to the manner in which the speaker should read the excerpt were given.
Recording Setting

Figure 2 illustrates the simultaneous recording of each speaker through three different transmission and recording systems: 1) through a telephone line with the remote end connected to a tape recorder by an inductive pick up ('telephone transmission'); 2) through a conventional microphone-to-tape recorder ('normal transmission'); and 3) through a microphone-to-tape recorder of almost linear frequency response characteristics. Hereafter, these three systems are referred to as the telephone transmission, normal transmission, and linear transmission systems, respectively.

For telephone transmission, the telephone set was placed inside a sound booth and dialed up to the other end of the local telephone system (campus line at Michigan State University). Speech signals were drawn by the use of an inductive coil directly attached around the receiver of the telephone set and connected to a microphone input of a cassette tape recorder (Marantz, Superscope model C-202LP). No care was taken to check the response characteristics of this telephone transmission system. For normal transmission, a dynamic microphone (Ampex, model 2001) was placed in a sound booth and connected to an open-reel tape recorder (Sony, model TC-106A) outside the booth. No care was taken to calibrate the frequency response characteristics of this microphone and tape recorder either. Thus, the normal transmission system was assumed to have undefined response characteristics. For linear transmission, a condenser microphone (Bruel & Kjaer, type 4132) was coupled with the cathode follower (Bruel & Kjaer, type 2619) connected to a microphone amplifier (Bruel & Kjaer, type 2603) outside the booth, then to an open-reel tape recorder (Teac, model A-7010). This condenser microphone and the Teac tape recorder were calibrated for the linearity of their response characteristics. Plottings of the response characteristics of these two devices are given in Appendix B.

Figure 2. A diagram showing equipment used for three simultaneous transmission and recording systems.

Although each speaker read only one 6-minute long text, because of the above described simultaneous recordings by three transmission systems, each speaker produced a total of 18 minutes of speech data.

Arrangement of Speech Data

In this study, all speakers served both as unknown and as known persons. This required that the speech sample of a speaker as an unknown differ in content from that of the same speaker as a known. Therefore, the speech data stored on audio tapes were arranged to enable 'text-independence' from three identical texts simultaneously produced by a speaker. Figure 3 illustrates this procedure. As shown in Figure 3, the 6-minute long speech of each speaker for each transmission was partitioned into three 2-minute long portions. The initial 2-minute portion was then segmented from telephone transmission (indicated by a circled 1), the medial portion was segmented from normal transmission (indicated by a circled 2), and the final portion was segmented from linear transmission (indicated by a circled 3). This arrangement was conducted to prepare input speech data for cross-transmission voice identification. In addition, another set of portions was necessary to prepare 'text-independent' speech data as used in within-transmission voice identification operations. The latter set of partitionings is indicated by the numbers encased in squares in Figure 3. Unmarked portions were not used as data. Consequently, two 2-minute portions of different phonetic content were segmented as raw speech data from each transmission system and represented a speaker. In total, 60 2-minute long speech data resulted from the above arrangement: 10 (speakers) x 3 (transmissions) x 2 (portions) = 60.

Figure 3. Sample arrangements of text-independent phonetic materials from two speakers. For the cross-transmission voice identification operation, the circled portion in telephone transmission was used to prepare the data base for the speaker as an unknown, while the circled portions in normal transmission and linear transmission were used to prepare the data base for the same speaker as the knowns. For the within-transmission voice identification, for each transmission system, the portion in a circle was used to prepare the data base as an unknown and the portion in a square, as a known speaker.

PRE-PROCESSING OF SPEECH DATA

Digitization

During the digitization process, each analog speech sample stored on the original magnetic tape was played back on the same recording equipment which was used to record it. A 16-bit analog-to-digital converter (ADC) interfaced with a minicomputer (PDP 11/40) digitized the 2-minute long speech samples one at a time, at a sampling rate of 10000/sec.
Then the digitized speech was stored on a disk (2.4 megabytes) for the subsequent pause elimination procedure. Each sampled point (digitized) was quantized by 2 bytes (= 1 word), resulting in a dynamic range of about 90 dB as specified by the ADC. The frequency transfer function of the ADC indicated some amount of nonlinearity. However, this nonlinearity (distortion) in the digitization process was not considered as a source of variables because it was constant for all input speech data.

Pause Elimination

The silent portions and pauses were detected and deleted automatically from the speech samples. The objective of pause elimination was to reduce the amount of speech data without altering speech characteristics. Also, this procedure was particularly important for properly computing the IDS parameter, whose extraction was based on a set of successive short-term spectra, each spectrum being transformed from a brief speech segment about 25 msec long. It was deemed essential that no short-term spectrum resulted from the 'silent' portions or 'pauses'.

In many studies on automatic speaker identification, elimination of the silent portions and pauses is included as a standard procedure, though no author has provided a specific definition of the pauses. In this study, pauses were clearly defined by applying the quantitative pausometric definition of pauses proposed by Tosi (1974). The entire process was implemented by Fortran software on the PDP 11/40.

This process was performed in the time domain in such a way that when the signal falls within two pre-set parameters (one to determine the amplitude threshold, 'Ap', and the other to determine the time threshold, 'Tp'), that portion of the speech wave is determined to be a pause.
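The two-threshold pause rule just described can be illustrated with a minimal sketch. It is not the Fortran program used in this study. It assumes that a pause is a run of consecutive samples whose absolute amplitude stays below an amplitude threshold for at least the time threshold Tp. How the thesis converts the Ap value into an absolute amplitude is not stated in this passage, so the sketch takes a ready-made threshold as the hypothetical argument `amp_threshold`; the function name and the other argument names are likewise illustrative only.

```python
def eliminate_pauses(samples, amp_threshold, tp_ms=20.0, sample_rate=10000):
    """Delete pauses from a sampled signal and concatenate what remains.

    Assumption (not taken from the thesis): a pause is a run of consecutive
    samples with |x| < amp_threshold lasting at least tp_ms milliseconds.
    Shorter low-amplitude runs (for example zero crossings inside voiced
    speech) are kept as part of the signal.
    """
    # Time threshold Tp expressed as a number of samples (at least one).
    min_run = max(1, int(round(tp_ms * sample_rate / 1000.0)))

    kept = []   # concatenated 'signal' samples
    run = []    # current run of low-amplitude samples

    for x in samples:
        if abs(x) < amp_threshold:
            run.append(x)
        else:
            if len(run) < min_run:
                kept.extend(run)   # too short to count as a pause
            run = []
            kept.append(x)

    # Handle a low-amplitude run that reaches the end of the input.
    if len(run) < min_run:
        kept.extend(run)
    return kept


if __name__ == "__main__":
    # Toy signal: 1000 samples/sec and Tp = 3 msec, so a pause is 3+ samples.
    toy = [5, 5, 5, 0, 0, 0, 0, 5, 5, 0, 5]
    print(eliminate_pauses(toy, amp_threshold=1, tp_ms=3.0, sample_rate=1000))
```

With the sampling rate of 10000/sec reported above, a Tp of 20 msec corresponds to 200 samples, which is the default used in the sketch.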
The result was a concatenated speech of only 'signals' from which all pauses were eliminated. Initially, the values for the Tp and Ap parameters were sought by listening to the residual pauses (deleted and concatenated for audio playback) so that most unvoiced and weak consonants, such as /f/, /θ/, and /h/, and brief portions preceding and following plosive phonemes such as /p/, /t/, and /k/ were detected and deleted as pauses. Consequently, typical values for Tp and Ap were found to be somewhere between 15 and 20 milliseconds, and between 0.90 and 0.99, respectively. Figure 4 illustrates how pauses were defined and deleted from the input speech wave. Figure 5 shows spectrographic representations of the original input speech containing all pauses and the resulting output speech without pauses.

The goal of pause elimination was to obtain a 'pause-free' speech sample of about 55 to 60 seconds long from each portion for a speaker per transmission. The lower boundary of 55 seconds was fixed so that five 11-second epochs were secured for each speaker, and the upper boundary was set to provide a margin of one second to each epoch. A summary of the results from the above pause elimination procedure is given in Appendix C. Subsequently, this sample was subdivided into five 11-second-long epochs, each epoch to be used for generating the speech parameters IDS and LTAS. All the processed speech segments, epoch by epoch, were also stored as analog speech (by playing back through the DAC) onto audio tapes by a tape recorder (Ampex, model 4000 G) for later use to generate FFC and choral spectra.

Figure 4. A graphic illustration of the pause elimination procedure. (a) Input speech signal about 2 seconds long is subjected to the pause elimination procedure with two different Ap values. A dotted horizontal line is the computed average peak amplitude. (b) Pauses are eliminated at Ap = 0.90 and Tp = 20 msec. (c) Signals are detected from (a) at Ap = 0.90 and Tp = 20 msec and concatenated. (d) Pauses are eliminated at Ap = 0.60 and Tp = 20 msec. (e) Signals are detected from (a) at Ap = 0.60 and Tp = 20 msec and concatenated.

Figure 5. Spectrographic representations of the original input speech containing all the pauses and of the resulting output speech with pauses eliminated at Ap = 0.90 and Tp = 20 msec.

Extraction of Speech Parameters

1. IDS (intensity deviation spectrum)

An IDS was generated from a set of short-term spectra. From an 11-second epoch of the processed speech, a set of 440 portions (one portion = 25.6 milliseconds, or a window of 256 sampled points) were transformed to the corresponding number of short-term spectra by using the FFT (Fast Fourier Transform). Each short-term spectrum was then represented by the intensity at 128 discrete frequencies, covering the frequency range from 0 to 5000 Hz, with an interval of about 39 Hz. The IDS was computed by the following expression:

\[ P_{ik} = \frac{1}{\bar{S}_{ik}} \sum_{j=1}^{J} \left| S_{ijk} - \bar{S}_{ik} \right| \]

where:

P_{ik} = intensity of the ith frequency of the kth IDS,
\bar{S}_{ik} = average intensity of the ith frequency over J short-term spectra from the kth segment,
S_{ijk} = intensity of the ith frequency of the jth short-term spectrum from the kth segment.
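The expression above can be sketched in Python as follows. This is a minimal illustration rather than the original program: the function name `intensity_deviation_spectrum` is hypothetical, the short-term spectra are assumed to be already computed (for example by an FFT) and passed in as plain lists, and every average intensity is assumed to be non-zero.

```python
def intensity_deviation_spectrum(spectra):
    """Compute one IDS from J short-term spectra of one segment.

    spectra: list of J spectra, each a list of I intensities S_ij.
    Returns a list of I values P_i = sum_j |S_ij - mean_i| / mean_i.
    Assumption: every mean intensity is non-zero.
    """
    n_spectra = len(spectra)
    n_freqs = len(spectra[0])
    ids = []
    for i in range(n_freqs):
        column = [spectrum[i] for spectrum in spectra]
        mean = sum(column) / n_spectra                 # average intensity
        deviation = sum(abs(v - mean) for v in column) # sum of |S - mean|
        ids.append(deviation / mean)
    return ids
```

One property of this ratio is worth noting, under the assumption that intensities are in linear units and that a channel acts as a fixed positive gain on each frequency component: multiplying every spectrum by that gain scales both the deviations and the average by the same factor, so each P value is unchanged.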
Figure 6. Computer plottings of three IDS's generated from the text-independent speech data of speaker 1: (a) by telephone, (b) by normal, and (c) by linear system. (Axes: intensity versus frequency in kHz, 0 to 5.0.)

Essentially, the above expression to compute the IDS can be expressed by: 1) adding the intensities within each frequency of these 440 short-term spectra and dividing the sum by the total number of spectra, thus obtaining the average intensity for that frequency; 2) subtracting the average intensity from each intensity at that particular frequency, taking the sum of all absolute differences, and then dividing this sum by the same average intensity; 3) repeating the above steps for all ordinates over the 440 short-term spectra.

Consequently, an IDS was represented by 128 intensities, one for each frequency available in the spectrum. Five IDS's were assigned to each speaker per transmission. In total, 300 IDS's were generated:

5 (IDS) x 10 (speakers) x 3 (types of transm.) x 2 (cross- or within-transm.) = 300

Figure 6 shows three computer plottings of IDS's generated from speech samples of a speaker recorded via three different transmission systems.

2. LTAS (long-term averaged spectrum)

An LTAS was computed by averaging the ordinate values across a set of 440 successive short-term spectra, the same set of spectra which generated the IDS. It was computed by:

\[ L_{ik} = \frac{1}{J} \sum_{j=1}^{J} S_{ijk} \]

where:

L_{ik} = average intensity of the ith frequency for the kth LTAS,
S_{ijk} = intensity of the ith frequency of the jth short-term spectrum for the kth segment.

The total number of LTAS's generated was also 300. Figure 7 shows three computer plottings of LTAS generated from speech samples of a speaker recorded via the three different transmission systems.
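The LTAS averaging can be sketched in the same style. Again this is only an illustration under assumptions: the function name `long_term_averaged_spectrum` is hypothetical, and the short-term spectra are assumed to be supplied as plain lists of intensities of equal length.

```python
def long_term_averaged_spectrum(spectra):
    """Average J short-term spectra frequency by frequency.

    spectra: list of J spectra, each a list of I intensities S_ij.
    Returns a list of I values L_i = (1/J) * sum_j S_ij.
    """
    n_spectra = len(spectra)
    n_freqs = len(spectra[0])
    return [
        sum(spectrum[i] for spectrum in spectra) / n_spectra
        for i in range(n_freqs)
    ]
```

Unlike a ratio of deviation to average, this plain average scales directly with any fixed gain applied to a frequency component, so two recordings of the same speech through channels with different gains would give different values at that frequency.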
Figure 7. Computer plottings of three LTAS's generated from the text-independent speech data of speaker 1: (a) by telephone, (b) by normal, and (c) by linear system. (Axes: intensity versus frequency in kHz, 0 to 5.0.)

3. FFC (fundamental frequency related measurements)

An FFC was prepared from the first five seconds of each segment of the pause-deleted speech sample, which had previously been processed and stored on audio tape. Once again, this speech segment was digitized by the ADC at a sampling rate of 10000/second, one segment at a time. Then it was further processed for measuring fundamental frequencies (Fo) as described below.

A. Detection of Fo

Several techniques for the computer implementation of direct peak detection, which estimate Fo's from the digitized speech, were reviewed from the existing literature (Gold and Rabiner, 1969; He and Dubes, 1982). The review indicated that this type of technique requires frequent heuristic adjustments when incorrect Fo's occur, tending to result in a grossly smoothed Fo contour. The one developed by He and Dubes was tested for computing the average Fo from a half-second-long speech signal, and the results were compared to the ones estimated by laboratory equipment (sound spectrograph, visipitch, and oscilloscope). The match was excellent. Nevertheless, this preliminary experimentation indicated that a direct peak picking technique may lose significant information on Fo variations. Markel (1977) also attests to this problem.

For the reasons listed above, and inasmuch as the aim of applying the FFC was to represent fine glottal dynamic variations, a new peak detecting technique was devised specifically for this study.
This technique demands interactive participation of the experimenter; hence, it is called the 'interactive peak detecting technique.' Discussion of this technique and the measurement procedures for the FFC follows.

Cycle-to-cycle Fo's were measured directly from the time domain speech by the use of an interactive peak picking method. Figure 8 shows a photograph of the actually displayed speech segment on the CRT (display screen operated via the PDP 11/40) while Fo detection was in progress. Major steps involved in this method were as follows: 1) displaying the digitized speech wave of about 100 milliseconds on the CRT, 2) visually inspecting the wave to determine recurrent wave patterns, 3) drawing a flexible line by a light pen capturing those recurrent peaks associated with fundamental frequencies, and 4) repeating the above steps until an entire 5-second segment was exhausted.

Figure 8. Photographs of the CRT displaying the interactive peak detecting procedures. Each speech segment shown is about 102.4 msec (1024 sampled points) long. A flexible horizontal line traversing the recurrent peaks of a speech wave is entered by a light pen operated by the experimenter. The number appearing on the upper right corner is the current fundamental frequency measured from the right-most peak interval (period). (a) A segment with no discontinuity within the displayed frame. (b) A segment with a discontinuity, which is skipped by light pen maneuver.

Inevitably, because of deleted pauses, there were discontinuity points within the speech segment under process. These discontinuities in the displayed wave were carefully avoided so that they would not be falsely considered as pitch periods. This manipulation was carried out by the light pen maneuver.
However, the continuity from one frame of the displayed signal to the subsequent frame was maintained automatically by software.

An output of this interactive peak detecting technique was a pitch contour containing successive pitch periods and the amplitudes of the two ends of the pitch periods. Although it required careful participation of the experimenter to correctly target pitch periods, this method was found to be quite simple, quick, and reliable.

B. Extraction of FFC

Nine features (measurements) were computed for the FFC on each Fo contour created earlier. Computational procedures are described below. For all procedures, Fo = fundamental frequency, N = total number of Fo's (in each Fo contour), and Ao = relative amplitude of the peak (of Fo).

i. F̄o : The average Fo computed on an Fo contour.

\[ \bar{F}_o = \frac{1}{N} \sum_{n=1}^{N} F_{o,n} \]

ii. σFo : The standard deviation of Fo.

\[ \sigma_{F_o} = \sqrt{ \frac{1}{N} \sum_{n=1}^{N} \left( F_{o,n} - \bar{F}_o \right)^2 } \]

iii. ΔFo : The average temporal variation of Fo in successive cycles.

\[ \Delta F_o = \frac{1}{N-1} \sum_{n=1}^{N-1} \left| F_{o,n+1} - F_{o,n} \right| \]

iv. ΔFo / F̄o : The ratio of the temporal variation of Fo to the average Fo.

v. σAo : The standard deviation of cycle-to-cycle peak amplitudes.

\[ \sigma_{A_o} = \sqrt{ \frac{1}{N} \sum_{n=1}^{N} \left( A_{o,n} - \bar{A}_o \right)^2 } \]

vi. ΔAo : The average temporal variation of peak amplitudes.

\[ \Delta A_o = \frac{1}{N-1} \sum_{n=1}^{N-1} \left| A_{o,n+1} - A_{o,n} \right| \]

vii. Fo(max) : The maximum Fo in a pitch contour.

\[ F_o(\max) = \mathrm{Max}\left[ F_{o,1}, F_{o,2}, \ldots, F_{o,N} \right] \]

viii. Fo(min) : The minimum Fo in a pitch contour.

\[ F_o(\min) = \mathrm{Min}\left[ F_{o,1}, F_{o,2}, \ldots, F_{o,N} \right] \]

ix. Fo(rng) : The range of the Fo. The Fo(rng) is simply computed by

\[ F_o(\mathrm{rng}) = F_o(\max) - F_o(\min) \]

Figure 9 illustrates the three basic measurements (periods, fundamental frequency, and amplitude) required for the computations described above.

4. SPT (choral spectrum)

Since the detailed procedure to create an SPT is given elsewhere (Tosi, 1979), only a brief description of the spectrum is presented here. A choral spectrum is a Fourier transform of a choral speech.
A choral speech is generated by segmenting a speech wave of a certain duration into n portions of the same duration t (in this case 0.4096 second, or 4096 sampled points) and stacking the entire n portions, one on top of another, resulting in one speech segment of length t. The choral spectrum used in this study was generated by using the FFT, resulting in 1-byte integer intensity values (0-120 dB) at 2048 discrete frequency components with an interval of about 2.44 Hz. For this study, all choral spectra were generated from the previously processed speech segments which were used for the other parameters. A total of 300 choral speech samples resulted. Figure 10 shows a computer plotting of a sample choral spectrum.

Figure 9. Illustration of Fo, ΔFo, and ΔAo by using three consecutive peaks in a simplified short speech segment. In this figure, Fo = fundamental frequency, Ti = ith pitch period, and Aoi = amplitude of the ith peak, so that Fo1 = 1/T1 and Fo2 = 1/T2.

Figure 10. Computer plotting of a sample choral spectrum based on an 11-second-long speech sample. The choral speech from which this choral spectrum was produced by FFT had a duration of 0.4096 second (4096 sampled points). (Horizontal axis: frequency in kHz, 0 to 5.0.)

Optimization of Features

The objective of feature optimization was to reduce the number of features for computational simplicity by selecting only those features which were determined to be effective in discriminating between the speakers.
The importance of the optimization procedure has been recognized in many studies on automatic voice identification, especially when the number of speakers is very small relative to the number of features utilized. Hughes (1968) showed in a general statistical model that, for a fixed number of samples, the mean identification accuracy increased when the dimensionality increased until an optimum value was reached. Beyond the optimum number, the accuracy decreased linearly. He suggested the optimum number of features to be 5, 10, 20, and 100 or greater for sample sizes of 20, 100, 500, and larger, respectively. Jain and Dubes (1978) suggested as a rule of thumb that the number of features should be at least five for each sample.

Taking the notions suggested above into consideration, the optimization procedure was applied to the original set of features contained in each speech parameter used in this study. The number of features was 128 for both IDS and LTAS, nine for FFC, and 2048 for the choral spectrum. Optimization was carried out by using the hierarchical clustering technique with the complete-link method, and/or F-ratio statistics. For the parameters IDS and LTAS, both the hierarchical technique and the F-ratio statistics were applied in a two-step sequence, whereas for the parameter FFC, only the F-ratio was used.

The clustering technique has been employed as an effective tool in many scientific endeavors, including the field of general speech research. This technique has been applied to the study of word recognition (Rabiner et al., 1977) and also to voice identification by computer (Tosi et al., 1979). An elaborate review and theory of the clustering technique is provided by Anderberg (1973). The use of this technique in this study was in only one particular way to illustrate the diverse applicability of the clustering technique.
This particular usage was suggested by Jain and Dubes (1978) as a means to reduce the number of features in a pattern by finding those features which are highly correlated among themselves. This technique was later coupled with F-ratio statistics, which have also often been implemented by some researchers (Paul et al., 1975; Markel and Davis, 1979; Atal, 1978) as a part of the procedure for selecting effective features for voice identification.

Feature optimization procedures applied to the speech parameters used in this study are discussed next.

1. IDS and LTAS

First, the original number of 128 features for IDS and LTAS was reduced to 100. This reduction was imposed by the limitation of computer memory capacity available on the PDP to carry on further computational procedures. Sacrificing the lowest three frequency components (78 Hz and lower) and the highest 25 components (4017 Hz and higher, up to 5000 Hz), the new IDS and LTAS were then represented by 100 features of frequency ranging from 117 to 3978 Hz, with an interval of 39 Hz. Optimization procedures were sequentially carried out by the hierarchical clustering technique with the complete-link method, and then by F-ratio statistics.

Hierarchical Clustering: The objective of using the hierarchical clustering technique was to determine subsets of features which were highly correlated among themselves. The Spearman-product moment correlation coefficient was calculated for all pairs of features over a set of 50 patterns (5 for each speaker). Then, a similarity matrix was prepared from these correlation coefficients and submitted as input data to the hierarchical clustering technique.
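A minimal Python sketch of this kind of feature clustering is given here for illustration. It rests on several assumptions that are not taken from the study: the correlation is computed as the Pearson product-moment coefficient, the dissimilarity between two features is taken as 1 minus their correlation, merging simply stops when a requested number of clusters remains, and the names `pearson` and `complete_link_clusters` are hypothetical. No feature may be constant across patterns, since its correlation would then be undefined.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def complete_link_clusters(patterns, n_clusters):
    """Group features (columns of patterns) by complete-link clustering.

    patterns: list of patterns, each a list of feature values.
    Returns a list of clusters, each a list of feature indices.
    """
    n_features = len(patterns[0])
    columns = [[p[f] for p in patterns] for f in range(n_features)]
    # Assumed dissimilarity: highly correlated features are "close".
    dist = [
        [1.0 - pearson(columns[a], columns[b]) for b in range(n_features)]
        for a in range(n_features)
    ]
    clusters = [[f] for f in range(n_features)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete link: distance between the two farthest members.
                d = max(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With complete linkage, a cluster is only formed when every pair of its members is close, which suits the stated aim of finding subsets of features that are all highly correlated with one another.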
Six similarity matrices were prepared from six sets of 50 patterns (three sets for the IDS and three sets for the LTAS, both based upon three different transmission systems), resulting in six complete-link dendrograms. From each dendrogram, 10 clusters (groupings by frequency components) were systematically chosen by means of placing a horizontal line at the appropriate proximity level. The resulting six dendrograms are presented in Appendix D.

F-ratio statistics: The objective of applying F-ratio statistics was to pick the best feature from each cluster formed in the dendrogram. The F-ratio was computed for every feature within each cluster. This F-ratio was expressed as:

\[ F = \frac{\text{Between (inter-) speaker variance (considering each speaker as a group)}}{\text{Within (intra-) speaker variance (considering 5 samples per speaker)}} \]

Basically, the larger the F value of the feature is, the greater the discriminating power indicated by that feature. All F-ratios so computed within each cluster were then rank-ordered, and the feature which yielded the largest F-ratio was chosen as the best feature in that cluster.

Table 1(a) shows that the three different transmission data bases of the IDS parameter resulted in varying compositions of the features and their F-ratios within each cluster formed by the dendrogram. For example, in the case of telephone IDS, two features, viz., 117 Hz and 195 Hz, were grouped in one cluster, indicating a relatively high correlation between the two features, while the same two features were grouped in different clusters in the case of normal IDS and of linear IDS.

Since it was a necessary condition that each resulting pattern would be composed of the same set of features regardless of the type of transmission system, some degree of arbitrary interaction was introduced by the experimenter.
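The F-ratio defined above, and the choice of the best feature within a cluster, can be sketched in Python as follows. This is an illustration under assumptions: the ratio is computed in the usual one-way analysis-of-variance form, with the between-speaker sum of squares divided by (number of speakers - 1) and the within-speaker sum of squares divided by (number of samples - number of speakers); the names `f_ratio` and `best_feature_in_cluster` and the nested-list data layout are hypothetical; and the within-speaker variance is assumed to be non-zero.

```python
def f_ratio(groups):
    """One-way ANOVA F: between-group variance over within-group variance.

    groups: one list of feature values per speaker.
    Assumption: the within-group variance is non-zero.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    ) / (k - 1)
    within = sum(
        (v - m) ** 2 for g, m in zip(groups, means) for v in g
    ) / (n_total - k)
    return between / within


def best_feature_in_cluster(cluster, data):
    """Return the feature index in cluster with the largest F-ratio.

    cluster: list of feature indices.
    data: data[speaker][sample][feature] nested lists.
    """
    def score(feature):
        groups = [[sample[feature] for sample in speaker] for speaker in data]
        return f_ratio(groups)

    return max(cluster, key=score)
```

With 10 speakers and 5 samples per speaker, this form has 9 and 40 degrees of freedom in the numerator and denominator, respectively.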
By inspecting Table 1(a), the following strategic steps were taken to determine 10 optimum features for IDS.

a. The minimum F-ratio, denoted by Fm, was chosen as a cut-off criterion, which was arbitrarily set to Fm = 5.0.

b. From each cluster of features in the telephone column, a feature with the largest F value (and equal to or greater than Fm) was chosen.

c. F values of the same feature in the other two transmission data bases (normal and linear) were checked to see whether they exceeded or equaled Fm = 5.0. Then, if both F values in the two transmission data bases met the criterion of Fm = 5.0, that feature was selected as an optimum feature from the cluster considered. Otherwise, another feature of the next largest F value from the same cluster was processed in the same sequence until a feature was found which met the criterion in step c, or all features were exhausted.

The above steps were repeated for the remaining clusters of the telephone transmission data base. Optimum features so far determined are marked by the solid circles in Table 1(a). The same strategy was applied to further determine the remaining 5 optimum features, based on the other two transmission data bases. The results are also marked by full line circles entered in the corresponding columns of Table 1(a). Dotted line circles appearing in each transmission column of the table are also optimum features, which resulted from the interaction of the aforementioned steps.

Consequently, 10 mutually common optimum features of the IDS based on three transmission data bases were selected. These are summarized in Table 2.

Another set of 10 optimum features of the LTAS was also prepared by referring to Table 1(b), according to steps similar to those described above. The results of 10 optimum features determined for LTAS are summarized in Table 3.

Table 1(a). Results of the feature clusterings and the corresponding F-ratios of IDS in three transmission systems.

[Table body: for each of the telephone, normal, and linear transmission columns, the 100 features are listed by cluster (cl. 1 to cl. 10) with feature number, frequency in Hz, and F-ratio.]

* Only features (frequency components) of IDS considered in this study were the ones circled. Full line circles indicate those features selected as optimum from the corresponding transmission column. Dotted line circles also indicate optimum features, but ones which resulted from the interaction in the selection strategy discussed in the text. Note that in every transmission column there are 10 common features as indicated by full and dotted line circles combined. Clusterings of 100 features into 10 subsets (as indicated by cl. 1, cl. 2, etc.) in each column in the above table were produced by dendrograms in Appendix D.

Table 1(b). Results of the feature clusterings and the corresponding F-ratios of LTAS in three transmission systems.

[Table body: for each of the telephone, normal, and linear transmission columns, the 100 features are listed by cluster (cl. 1 to cl. 10) with feature number, frequency in Hz, and F-ratio.]
* Only features (frequency components) of LTAS considered in this study were the ones circled. Full line circles indicate those features selected as optimum from the corresponding transmission column. Dotted line circles also indicate optimum features, but ones which resulted from the interaction in the selection strategy discussed in the text. Note that in every transmission column there are 10 common features as indicated by full and dotted line circles combined. Clusterings of 100 features into 10 subsets (as indicated by cl. 1, cl. 2, etc.) in each column in the above table were produced by dendrograms in Appendix D.

Table 2. Optimum features selected for IDS parameter (as denoted by circles in Table 1(a)).

                            F-ratio* in
Feature   Frequency   Telephone      Normal         Linear
(#)       (Hz)        transmission   transmission   transmission
 1         117        11.02          23.75          31.99
 2         195        30.28          61.35          28.83
 3         430         9.02           6.96          10.26
 4         508         7.40          13.91          12.71
 5        1406         7.28           8.47          10.61
 6        3086         9.56           9.61          10.99
 7        3281        15.64          12.20           9.13
 8        3555        21.62          15.63          15.48
 9        3750        23.28          12.62          32.10
10        3867        22.92          14.55          21.68

* F-ratio is statistically significant (p = 0.05) for F > 2.12 with Df of numerator = 9 and Df of denominator = 40.

Table 3. Optimum features selected for LTAS parameter (as denoted by circles in Table 1(b)).

                            F-ratio* in
Feature   Frequency   Telephone      Normal         Linear
(#)       (Hz)        transmission   transmission   transmission
 1         117        131.10         147.55         191.46
 2         195         17.62          50.30          28.36
 3         547         12.85          21.09          18.84
 4         742         36.73          21.90          18.84
 5        1406         20.90          39.17          47.20
 6        2617         11.41          29.24          10.62
 7        2851         15.24           8.36          23.30
 8        3203          9.89          43.49          56.85
 9        3359         37.87          17.06          10.85
10        3867         21.44          14.59          33.64

* F-ratio is statistically significant (p = 0.05) for F > 2.12 with Df of numerator = 9 and Df of denominator = 40.

The general trend as shown in Table 2 was that the lowest two frequency components, 117 Hz and 195 Hz, and the highest four components, 3281 Hz, 3555 Hz, 3750 Hz, and 3867 Hz, had relatively greater F values than those of the frequencies between the two extremes. Unlike the case of IDS, most features of the LTAS resulted in somewhat similar F values across the three transmission systems, except for one, 117 Hz. This feature yielded the extremely high values of 131.10, 147.55, and 191.46 for the telephone, normal, and linear transmission LTAS parameter, respectively.

In all cases, the F-ratio was statistically significant (at p = 0.05) for F > 2.12. All features (for IDS and LTAS) had greater F values than this critical value of 2.12. One interesting outcome was the composition of the optimum features in the IDS and LTAS parameters in spite of the presumably different speech characteristics that they carried: four features (117, 195, 1406, and 3867 Hz) were shared by both parameters. The remaining six features were distributed somewhat differently along the frequency interval.

2. FFC

Since the number of features included in an FFC was not very large, feature optimization for the FFC speech parameter was attempted by the use of the F-ratio alone.
To study their relative effectiveness in discriminating between different speakers, all features in an FFC were subjected to F-ratio statistics. Each F value was computed over 50 patterns (5 for each speaker) and for the three different transmission systems.

Table 4 lists the F-ratios of the nine features of the FFC computed from the speech data in the three transmission systems. As shown in Table 4, F̄o (mean fundamental frequency) had the largest F-ratio in all transmission systems (F = 429.705, 147.915, and 313.346 for telephone, normal, and linear, respectively). σFo (standard deviation of Fo) resulted in a much smaller F-ratio than that of F̄o in all transmission systems (F = 55.378, 28.488, and 40.976 for telephone, normal, and linear, respectively). These results for F̄o and σFo agree with those reported by Markel et al. (1977), Hunt et al. (1977), and Atal (1972) with respect to the relative effectiveness of these two features in discriminating speakers. The least conspicuous features were σAo (standard deviation of the amplitudes of successive peaks of Fo) and ΔAo (temporal variation of the amplitudes of successive peaks of Fo), which had the smallest F-ratios in all transmission systems. In linear transmission in particular, these two features yielded F-ratios smaller than the critical value of F = 2.12.

3. SPT

No feature optimization procedure was applied to the SPT, i.e., all SPT's retained the original dimensionality of 2048 frequency components and were entered as they were into the voice identification operations.

Table 4. Nine features and F-ratios of the FFC.
                          F-ratio* in
  Feature     Telephone      Normal         Linear
  name**      transmission   transmission   transmission
  F̄o            429.705        147.915        313.346
  σFo            55.378         28.488         40.976
  ΔFo            34.616         37.447         27.443
  ΔFo/F̄o          5.274          5.404          6.499
  Fo(max)        41.384         51.388          9.240
  Fo(min)         7.289         11.570          4.440
  Fo(rng)        18.457         17.272          7.107
  ΔAo             5.274          5.320          1.152
  σAo             2.486          4.374          1.370

* The F-ratio of each feature for each transmission was computed on a data set of 50 samples. The between-speaker variance was based on 10 speakers and the within-speaker variance on 5 samples for each speaker. The F-ratio was statistically significant (p = 0.05) for F > 2.12, with Df of numerator = 9 and Df of denominator = 40.

** F̄o = mean fundamental frequency (Fo). σFo = standard deviation of Fo. ΔFo = average temporal variation of Fo. ΔFo/F̄o = ratio of the average variation of Fo to the mean of Fo. Fo(max) = maximum (highest) Fo. Fo(min) = minimum (lowest) Fo. Fo(rng) = range of Fo. ΔAo = average temporal variation of the peak amplitude of Fo. σAo = standard deviation of the peak amplitude of Fo.

VOICE IDENTIFICATION EXPERIMENTS

Organization of Experiment

A total of 24 voice identification experiments were conducted with different combinations of the speech parameters and types of transmission systems. All the parameters were tested in cross-transmission as well as within-transmission voice identification experiments. In the cross-transmission experiments, the voices of all unknown speakers were taken from the telephone system, while those of all known speakers were taken from either the normal or the linear transmission system. It was assumed that, in this condition, the representations of all speakers were biased by the response characteristics of the transmission systems. In contrast, in the within-transmission experiments, the voices of both the unknowns and the knowns were taken from the same transmission system. Hence, in the latter case, all the speakers were represented free from the influence of the transmission system.
The two major steps involved in each experiment were 1) measurement of distance and 2) application of the decision rules. A description of these two steps follows.

Distance Measurement

As a measure of proximity, or separation, between a pair of patterns, one belonging to the unknown and the other to the known, Euclidean distance was applied. Euclidean distance is a vectorial summation of the differences between the pairs of corresponding features in the two patterns. This implies that if the values assigned to the features are not distributed homogeneously across a pattern, the distance measure may introduce a highly biased result. For this reason, prior to the computation of Euclidean distance, all features in the parameters IDS, LTAS, and FFC were standardized by Z-transformation. Each feature in a pattern was transformed into a Z-score as follows:

$$ Z_{ij} = \frac{P_{ij} - \bar{P}_i}{\sigma_{P_i}}, \qquad i = 1, 2, \ldots, I \ \text{(number of features)}, \quad j = 1, 2, \ldots, J \ \text{(number of patterns)} $$

where $\bar{P}_i$ and $\sigma_{P_i}$ are the mean and the standard deviation of the ith feature computed over J (= 50 in this study) patterns, and $Z_{ij}$ is the transformed score for the ith feature of the jth pattern. These standardized Z values were then used in the subsequent Euclidean distance measurement. Euclidean distance was calculated by the following expression:

$$ D_{ij} = \sqrt{\sum_{k=1}^{K} \left( Z_{ik} - Z_{jk} \right)^2} $$

where
$D_{ij}$ = Euclidean distance between the ith pattern and the jth pattern,
$Z_{ik}$ = kth feature of the ith pattern,
$Z_{jk}$ = kth feature of the jth pattern,
$K$ = total number of features within a pattern.

Decision Rules

Two decision rules, the nearest-neighbor decision rule and the minimum set distance rule, were applied concurrently in all voice identification experiments.

1.
The nearest-neighbor decision rule:

This decision rule assigned one of the known speakers to the unknown by the following sequence, which is illustrated in simplified form in Figure 11.

a. One of the n patterns belonging to the unknown is designated as the test pattern, and all patterns belonging to the knowns as reference patterns.
b. The Euclidean distance between this test pattern and every reference pattern is computed.
c. The test pattern is assigned to the known speaker one of whose reference patterns is the closest (the nearest) to the test pattern. One decision has now been rendered.
d. Steps a through c are repeated until all patterns of the unknown have been processed as test patterns.

Up to this point, n identification decisions (n = 5 in this study) were reached, each based on the Euclidean distances from one of the n test patterns available for an unknown to all reference patterns of the known speakers (the number of reference patterns in this study was 50). The entire sequence was then repeated to process the remaining unknown speakers. A total of 50 decisions for each voice identification experiment was yielded by the nearest-neighbor decision rule.

[Figure 11: unknown speakers 1 to 10 (by telephone) with one test pattern marked, known speakers 1 to 10 (by normal/linear) with their reference patterns, and arrows from the test pattern to the reference patterns.]

Figure 11. A diagram illustrating the nearest-neighbor decision rule. All the speakers are treated as unknowns on the basis of the telephone transmission and as knowns on the basis of the linear/normal transmission system. An arrow indicates the Euclidean distance from the test pattern of the unknown to a reference pattern of the knowns. Note that Euclidean distance is not computed among the unknowns nor among the knowns, and that the length of an arrow is not proportional to the actual Euclidean distance.

2.
The minimum set distance rule:

Unlike the former decision rule, this set distance rule requires a priori category (speaker) information in determining the set distance between the two speakers being processed. A set consists of the n patterns assigned to each speaker. The major steps in computing a set distance in this study are discussed next. Figure 12 illustrates these steps using n = 3 for simplicity.

a. The Euclidean distances from each pattern in the set of the unknown to each pattern in a set of a known are computed.
b. Then, for each unknown pattern, the maximum distance to the known patterns within a category (known speaker) is chosen.
c. From this set of maximum Euclidean distances, the minimum is chosen to represent the set distance between the unknown speaker and that known speaker. The set distances to all known speakers are ranked from the shortest to the longest.
d. Finally, the known speaker whose set distance to the unknown is the shortest is assigned to the unknown.

The above procedures were then repeated for the remaining unknown speakers. A total of 10 decisions for each voice identification experiment was yielded by the minimum set distance decision rule.

The two decision rules described above were applied concurrently to the voice identification operations conducted under various combinations of the speech parameters and transmission systems.

[Figure 12: the three patterns of the unknown speaker (U11 to U13) connected by lines to the three patterns of known speaker 1 and of known speaker 2.]

Figure 12. A diagram showing an example of the minimum set distance rule. Three speakers are shown, one as an unknown and two as knowns, each represented by three patterns. In this diagram, the symbol U11 denotes the first pattern of unknown speaker 1, and the symbol K11 the first pattern of known speaker 1. A line drawn between a pattern of the unknown and a pattern of the known indicates the Euclidean distance. The length of the line is proportional to the Euclidean distance computed.
A '*' indicates the maximum distance among the Euclidean distances computed from a pattern of the unknown to all patterns of the known. A 'o' is the minimum distance of the maxima between the unknown and the known. In this example, the minimum set distance between the unknown and known speaker 1 is designated as D(1,1), and that between the unknown and known speaker 2 as D(1,2). Since D(1,1) < D(1,2) in this example, the unknown is identified with known speaker 1.

CHAPTER III

RESULTS

This chapter focuses on the results of the voice identification operations, which were conducted by using different speech parameters tested under various combinations of transmission systems (telephone, normal, and linear). The speech parameters tested were IDS (intensity deviation spectra), LTAS (long-term averaged spectra), FFC (fundamental frequency contour), SPT (choral spectra), and two composite parameters, one of IDS and FFC and one of LTAS and FFC. In each identification operation, 10 male speakers served as both unknown and known speakers. The speech data obtained from all the speakers were 'text-independent', as described in the previous chapter.

Table 5 summarizes the results of the cross-transmission voice identification (all the unknown speakers were recorded through a telephone system and all the known speakers through either the normal or the linear system). The relative effectiveness of the speech parameters is depicted in Table 5 in terms of the correct identification rates. It is clearly seen from this table that the highest correct identification rate of 100% was achieved by the composite parameter of LTAS and FFC, and the lowest rate of 20% by SPT. The identification rates of the remaining parameters, IDS, LTAS, and FFC (each tested as a single parameter) and the composite of IDS and FFC, fell in the intermediate range of 40 to 70%.

Table 5. Summary of the results of the cross-transmission voice identification operations.

  Type of parameter                   Rate of correct
  and transmission                    identification* (%)
  IDS
    Telephone vs. Normal                    70
    Telephone vs. Linear                    60
  LTAS
    Telephone vs. Normal                    70
    Telephone vs. Linear                    70
  FFC
    Telephone vs. Normal                    50
    Telephone vs. Linear                    40
  SPT
    Telephone vs. Normal                    20
    Telephone vs. Linear                    20
  IDS + FFC
    Telephone vs. Normal                    60
    Telephone vs. Linear                    60
  LTAS + FFC
    Telephone vs. Normal                   100
    Telephone vs. Linear                   100

* By the minimum set distance rule.

Table 6 provides more comprehensive results in order to enable interpretation of the effect of each parameter in eliminating the influence of the response curve of the transmission systems. Further detailed identification results for all the operations are presented in Appendix G. The two independently applied decision rules, as seen in Table 6, resulted in close conformity to one another. Therefore, the following discussion of the results is based on the identification rates yielded by the minimum set distance rule only.

First, IDS produced the elimination effect on the influence of the response curve. This interpretation is clearly supported by the fact that similar identification rates were obtained in both the cross-transmission and the within-transmission identification operations. This effect can also be seen by comparing the rates obtained by IDS (60 to 70%) with those obtained by SPT (20%).

Second, LTAS was found to be susceptible to the influence of the type of transmission system used. This susceptibility is shown by the decrease in the identification rates from 100% (when tested under the within-transmission operations) to 60 to 70% (when tested under the cross-transmission operations).
In spite of this susceptibility of the LTAS to the type of transmission system, this parameter yielded about the same cross-transmission identification rate as the IDS. The reason that both LTAS and IDS resulted in the same identification rate could lie in the way that the LTAS was extracted by overlapping the set of short-term spectra. Practically, the LTAS was the same as the denominator used in the expression to compute the IDS parameter.

Table 6. Summary of the results of 24 voice identification operations.

  Type of transmission and speech parameter used for         Identification rate (in %)
  unknown speakers           known speakers                   rule 1*    rule 2**

  Design 1
  Cross-transmission
    Telephone (IDS)       vs. Normal (IDS)                      52         70
    Telephone (IDS)       vs. Linear (IDS)                      56         60
    Telephone (LTAS)      vs. Normal (LTAS)                     70         70
    Telephone (LTAS)      vs. Linear (LTAS)                     66         70
  Within-transmission
    Telephone (IDS)       vs. Telephone (IDS)                   58         60
    Normal (IDS)          vs. Normal (IDS)                      64         70
    Linear (IDS)          vs. Linear (IDS)                      70         70
    Telephone (LTAS)      vs. Telephone (LTAS)                 100        100
    Normal (LTAS)         vs. Normal (LTAS)                     98        100
    Linear (LTAS)         vs. Linear (LTAS)                     98        100

  Design 2
  Cross-transmission
    Telephone (FFC)       vs. Normal (FFC)                      52         50
    Telephone (FFC)       vs. Linear (FFC)                      58         40
  Within-transmission
    Telephone (FFC)       vs. Telephone (FFC)                   48         40
    Normal (FFC)          vs. Normal (FFC)                      48         40
    Linear (FFC)          vs. Linear (FFC)                      56         50

  Design 3
  Cross-transmission
    Telephone (IDS+FFC)   vs. Normal (IDS+FFC)                  62         60
    Telephone (IDS+FFC)   vs. Linear (IDS+FFC)                  56         60
    Telephone (LTAS+FFC)  vs. Normal (LTAS+FFC)                 92        100
    Telephone (LTAS+FFC)  vs. Linear (LTAS+FFC)                 94        100

  Design 4
  Cross-transmission
    Telephone (SPT)       vs. Normal (SPT)                      10         20
    Telephone (SPT)       vs. Linear (SPT)                      20         20
  Within-transmission
    Telephone (SPT)       vs. Telephone (SPT)                   92         80
    Normal (SPT)          vs. Normal (SPT)                      94         90
    Linear (SPT)          vs. Linear (SPT)                      88         80

* The nearest-neighbor decision rule.
** The minimum set distance decision rule.

Third, FFC was also shown to be quite free from the influence of the frequency response curve: FFC resulted in very similar correct identification rates no matter which transmission systems were used for the unknown and the known speakers. However, the rates were only moderate at 40 to 50%. This implies that FFC, although free from the influence of the response curve, may not be a sufficiently effective speech parameter for voice identification.

With reference to the FFC features, the raw data presented in Appendix E were inspected. The inspection revealed that certain groups of speakers shared extremely similar F̄o's (average fundamental frequency) and other features. For example, speakers 1, 3, 5, and 9 had almost interchangeably close F̄o's among themselves. Speakers 6 and 10 formed another group with very close F̄o's. This homogeneity in the distribution of F̄o within certain groups of speakers appears to contradict the fact that this feature resulted in the highest F-ratio (indicating good discriminating power) in all the transmission data bases. Such a contradiction, however, may not be surprising, considering that the F-ratio (as applied in this study) only reflected the variation of the feature values among the speakers as a whole group, rather than the variation between all possible pairings of the individual speakers. Apparently, the face value of the F-ratio must be interpreted with caution as a measure of discriminating power among the speakers.

Fourth, SPT came out as predicted. High correct identification rates (80 to 90%) were produced when the voices were recorded through only one type of transmission system, but the rates decreased (to 20%) when the voices of the unknown and the known speakers were recorded through different transmission systems.
Fifth, the composite of IDS and FFC did not show any noticeable improvement in terms of the elimination effect or the correct identification rates. This was probably due to the fact that both parameters included the same type of speech characteristics. In view of the result that these two parameters (as tested separately) were relatively free from the influence of the transmission system, the identification rate of 60% obtained by combining the features from IDS and FFC was a rather disappointing outcome.

Finally, the composite of LTAS and FFC showed an unexpectedly high correct identification rate of 100% in the two cross-transmission operations (telephone vs. normal and telephone vs. linear). The probable reason for this high identification rate can be expressed as follows. LTAS and FFC carried different types of speech characteristics that worked in a complementary fashion, i.e., LTAS provided the static spectral features, thus reflecting more or less the average vocal tract shape during speech production, while FFC contained the fundamental frequency related features, thus reflecting information about the glottal dynamics in ongoing speech.

Figures 13(a-e), 14(a-e), and 15(a-e) show, for the sake of illustration, two-dimensional projections (nonlinear projection algorithm by Sammon, 1969), each projection consisting of 1 unknown and 5 known speakers. Briefly, Sammon's projection performs a point mapping of N L-dimensional vectors from the L-space to a lower-dimensional space so as to preserve the approximate data structure. In this study, N = 6 (1 unknown and 5 known speakers) and L = 5 (5 samples/speaker) were plotted into a two-dimensional space. In each projection, the five patterns (samples) of the unknown speaker are denoted by ui (i = unknown speaker index from 1 to 5), and the center of the dispersion of unknown speaker i is indicated by Uci.
Known speakers are simply denoted by the speaker index, and the center of the dispersion of known speaker i is indicated by Kci.

In Figure 13(a-e), all the speakers (unknown and known) are represented by the telephone IDS data base. It is shown that unknown speaker 1 is closest to known speaker 1 (correct identification, 13(a)) and unknown speaker 4 is closest to known speaker 4 (correct identification, 13(d)), but unknown speaker 5 is closest to known speaker 3 (incorrect identification, 13(c)). As clearly shown in these projections (Figure 13(a-e)), the relatively tight spatial distribution of the 5 speakers could account for the rather moderate correct identification rate achieved by the IDS as tested under all the within- and cross-transmission operations.

In Figure 14(a-e), each projection shows 1 unknown and 5 known speakers, all the speakers being represented by the telephone LTAS. In contrast to the spatial dispersion of the 5 known speakers by the IDS given in Figure 13(a-e), here all the speakers (both unknown and known) are more clearly separated.

In Figure 15(a-e), each projection shows 1 unknown and 5 known speakers, all speakers being represented by the composite of the LTAS and FFC parameters, with the known speakers recorded through the normal transmission system and the unknown speaker through the telephone transmission system. These projections (a-e) indicate that all 5 unknown speakers were correctly identified and that all the known speakers had relatively small intra-speaker variation.

[Figure 13(a) to 13(e): scatter plots, one panel per unknown speaker.]

Figure 13 (a-e). Sammon's projections of 5 known speakers and 5 unknown speakers, all represented by the telephone IDS parameter: (a) 5 knowns vs. unknown 1; (b) 5 knowns vs. unknown 2; (c) 5 knowns vs. unknown 3; (d) 5 knowns vs. unknown 4; and (e) 5 knowns vs. unknown 5. In each panel, the samples of known speakers 1 to 5 are plotted as 1 to 5 and the samples of unknown speaker i as ui. Kcj = the center of dispersion of the samples of the jth known speaker. Uci = the center of dispersion of the samples of unknown speaker i.

[Figure 14(a) to 14(e): scatter plots, one panel per unknown speaker.]

Figure 14 (a-e). Sammon's projections of 5 known speakers and 5 unknown speakers, all represented by the telephone LTAS parameter: (a) 5 knowns vs. unknown 1; (b) 5 knowns vs. unknown 2; (c) 5 knowns vs. unknown 3; (d) 5 knowns vs. unknown 4; (e) 5 knowns vs. unknown 5. Symbols as in Figure 13.

[Figure 15(a) to 15(e): scatter plots, one panel per unknown speaker.]

Figure 15 (a-e). Sammon's projections of 5 known speakers and 5 unknown speakers, the knowns represented by the composite parameter of FFC and LTAS by the normal transmission system and the unknowns by the same composite parameter by the telephone transmission system: 5 known speakers vs. (a) unknown speaker 1; (b) unknown speaker 2; (c) unknown speaker 3; (d) unknown speaker 4; (e) unknown speaker 5. Symbols as in Figure 13.

CHAPTER IV

DISCUSSIONS AND CONCLUSIONS

In the present study, the speech materials were produced by 10 male speakers, each speaker being recorded simultaneously by three different transmission and recording systems. Four types of speech parameters were extracted from the text-independent materials to represent the speakers as unknown and known persons. These parameters were then studied in voice identification operations for their effectiveness in eliminating the influence of the response characteristics of the transmission and recording devices.

DISCUSSIONS

Analytical comparisons of the results obtained in this study with those reported in the literature appear neither feasible nor meaningful, owing to the large variation in the types of phonetic materials, the size and type of the speaker population, the methodology and procedure employed, etc. Nonetheless, in order to facilitate a reasonable interpretation of the results of this study, several factors other than the distortion due to the transmission system that could have been critically involved in the process are discussed below.

On Speaker Population

Typically the number of speakers included in studies of voice identification by computer is small.
This is mainly because the huge amount of information present in even a brief segment of speech must be processed within the limited memory capacity and computational speed of most computers. Clearly, this constraint upon the size of the speaker population makes it difficult to generalize the results of many studies. Doddington (1974) presented a computer simulation of the expected error rate for speaker identification as a function of population size. It was shown that the overall probability of an incorrect decision is a monotonically increasing function of population size. Some examples of the expected error rate (for optimum identification) with population size N were 0.01 for N=2, 0.025 for N=5, 0.08 for N=10, 0.18 for N=20, and 0.5 for N=100. According to Doddington's study, then, an error rate of 0.08, or conversely a correct identification rate of 0.92 (92%), can be interpreted as optimum, or sufficient, for voice identification by computer with 10 speakers; but it seems to be far from the ultimate goal for practical use.

In addition to the number of speakers employed, the homogeneity of the speaker population is also known to affect the identification rate considerably. On the issue of 'homogeneity' of the speaker population, Bogner (1981) presented a good example in explaining the varying identification rates reported in much of the literature:

    One possible 'explanation' is that this talker's speech is exceptionally similar to the average of the population of talkers. Let us denote talkers of this type as 'type-X.' Assuming that a proportion 0.1 of the talker population is of type-X, we find that the expected number of such talkers in a sample of 10 is 1, with a standard deviation of 1. Thus, it would not be surprising to find some sets of 10 talkers, with no type-X talkers, and some with 2 or 3, causing the resultant error rates to differ greatly, by factors of more than 3.
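The figures in Bogner's example can be checked with a short calculation. The sketch below assumes that the number of type-X talkers in a sample of 10 follows a binomial distribution with p = 0.1 (independent draws from a large population); this model is an assumption made here for illustration and is not stated in the quotation. The helper name `binom_pmf` is hypothetical.

```python
from math import comb, sqrt


def binom_pmf(k, n, p):
    """Probability of exactly k type-X talkers among n (binomial model)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 10, 0.1                      # sample size and type-X proportion
mean = n * p                        # expected number of type-X talkers
sd = sqrt(n * p * (1 - p))          # standard deviation of that number
p_none = binom_pmf(0, n, p)         # chance that a sample has no type-X talker
p_two_or_three = binom_pmf(2, n, p) + binom_pmf(3, n, p)

print(mean, round(sd, 3), round(p_none, 3), round(p_two_or_three, 3))
```

Under this model the expected count is 1 with a standard deviation of about 0.95, roughly one sample in three contains no type-X talker, and about one in four contains two or three, which supports the point that small speaker samples can differ considerably in homogeneity.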
The raw data (from Appendix F) were inspected to check how the speaker population in this study can be explained in Bogner's view. The sample average of F̄o, the feature which yielded the largest F-ratio value, was calculated over five FFC's for each of the speakers. Based on this feature alone, speakers 1, 3, 5, and 9 were found to be extremely similar, as their F̄o's were 109.8, 110.1, 111.1, and 111.7 Hz, respectively. Visual inspection of Appendix G (11-15) also indicated that these four speakers were frequently misidentified among themselves. Speakers 2 and 10 were seldom misidentified with any other speakers. Speaker 2 had the lowest F̄o, whereas speaker 10 had the highest.

Another interesting interpretation can be drawn from the results obtained with the LTAS speech parameter. Judging from the 100% identification rate achieved under within-transmission voice identification, it appears that the speakers in this study were less homogeneous when represented by their long-term spectral characteristics than when represented by their glottal dynamic characteristics. In other words, for a given group of speakers, the spectral features have greater discriminability in distinguishing speakers than the glottal features do. In general, this agrees with the results of Markel et al. (1977) and Hunt et al. (1977).

On Effects of Feature Optimization

The major topic of this study was the 'elimination' of the distortion of the frequency characteristics existing in the speech samples of the unknown speakers and the known speakers who were recorded through different transmission systems. Two approaches were tried. The first approach applied the IDS and the LTAS parameters, whose features were optimized (reduction and selection). The second approach selected several time-varying fundamental frequency related features to form the FFC parameter, whose features were also subjected to the optimization procedure (selection).
Since this study was not designed to examine the effect of the optimization procedure per se, no substantiating grounds for interpreting this effect are given in this study. Despite this lack, an indirect interpretation of the optimization effect was attempted by referring to the resulting correct identification rates as the measure of effectiveness. It became obvious that the optimization procedure induced the 'elimination' effect to different degrees depending on the speech parameter in use. First, the effect of the optimization upon the IDS was indeterminable simply because there were no contrasting results with which to compare. Second, in view of the fact that all the original features of the FFC were retained for the voice identification operation, the optimization procedure applied to the FFC had no effect at all on the 'elimination'. On the contrary, when only five features with relatively high F-ratio values were used, the FFC yielded very poor identification rates. Third, in view of the fact that the LTAS resulted in a better identification rate than the choral spectrum did, the feature optimization procedure was probably effective in the 'elimination' of the distortion. An earlier pilot study carried out with the LTAS which included all the original 128 features (no optimization) also resulted in an identification rate of only the chance level of 10-20%. A reasonable speculation may be that the 10 features (frequency components in Table 3) chosen for the LTAS were those which represented speaker-dependent characteristics, leaving out other characteristics related to linguistic content and to the transmission and recording devices. This speculation, of course, is difficult to verify but appears to be worthy of further elaborate investigation.

Composite Parameter

Besides the 'elimination' of the influence of the response characteristics, another issue of concern in this study was the discriminating power of the speech parameter used.
The goal of this study was to investigate speech parameters which are relatively resistant to distortion and also highly effective in identifying the speaker. The results indicated that the composite of the LTAS and FFC was the best approach to such a goal. The probable effects of the composite are suspected to be twofold: either the number of features was simply increased by combining 10 from the LTAS and 9 from the FFC into a total of 19 features; or different types of speech characteristics were integrated into one parameter containing both static spectral information and dynamic glottal information.

With respect to the first notion, Hughes (1968) showed that the identification error rate tends to increase monotonically beyond a certain optimum number of features, and suggested that the optimum number of features was 5, 10, 20, and 100 or greater for sample sizes of 20, 100, 500, and larger, respectively. A similar result was also reported by Cheung and Eisenstein (1978). Using 32 features (pitch, log energy, 10 partial correlation coefficients, 10 cepstral coefficients, normalized absolute prediction error energy, and 9 normalized autocorrelation coefficients) extracted from text-independent speech data, they plotted the identification error rate as a function of the number of features. They concluded that the identification error rate gradually decreases as the number of features increases, but that it starts to level off at 6 features; no further improvement is gained by using more than 6. Many other researchers in automatic speaker recognition appear to select a set of a certain number of features according to prior knowledge about their effectiveness, without giving specific attention to the optimum number of features.
Since not only the number but also the type of speech parameter from which features are derived may interact with the results, it does not seem feasible, either theoretically or practically, to establish the optimum number of features. For these reasons, the possible effect of the increased number of features in the composite parameter upon the improvement of the identification rate in this study remains unanswered.

The latter notion perhaps provides a more reasonable interpretation of the possible effect in improving the identification rate. As discussed earlier, one particular subgroup of speakers was consistently misidentified among themselves (correct rate 50%) when represented by one parameter (FFC), but another subgroup of speakers was misidentified (correct rate 70%) when represented by the other parameter (LTAS). In one particular condition, the telephone vs. linear cross-transmission voice identification operation, a correct rate of 100% (by the minimum set distance rule) was achieved when the above two parameters were combined into a single composite parameter. Cheung and Eisenstein (1978) studied text-independent voice identification and showed that if the speakers were represented by a set of features from three different types of parameters, including pitch, energy, and spectral information, the identification performance was much better than if the speakers were represented by a set of features from only one of these parameters. Using text-dependent speech samples, He and Dubes (1982) also concluded that a combined feature set from the LPC and pitch contour showed a similar trend. Although these two studies were not concerned with the influence of the response characteristics, the results may indicate the general efficacy of a composite parameter of different types.
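The complementary behavior of the two parameters can be illustrated with a toy example. The sketch below is not the procedure of this study: it assumes z-score standardization over the known speakers, Euclidean distance, and the nearest-neighbor rule only (the minimum set distance rule is not reproduced). The speaker labels, the two-feature vectors, and the function names `vector` and `identify` are all hypothetical.

```python
from math import dist
from statistics import mean, stdev

# Hypothetical reference vectors for three known speakers (two features per parameter).
KNOWN = {
    "A": {"ffc": [110.0, 12.0], "ltas": [40.0, 20.0]},
    "B": {"ffc": [111.0, 12.5], "ltas": [55.0, 30.0]},
    "C": {"ffc": [130.0, 20.0], "ltas": [54.0, 31.0]},
}
# Hypothetical unknown sample, constructed here as a perturbed sample of speaker B.
UNKNOWN = {"ffc": [110.4, 12.2], "ltas": [54.4, 30.6]}

def vector(sample, parameters):
    """Concatenate the feature lists of the chosen parameters into one vector."""
    return [x for name in parameters for x in sample[name]]

def identify(unknown, known, parameters):
    """Nearest-neighbor decision on standardized features (Euclidean distance)."""
    names = list(known)
    refs = [vector(known[n], parameters) for n in names]
    columns = list(zip(*refs))
    mus = [mean(c) for c in columns]        # per-feature mean over known speakers
    sigmas = [stdev(c) for c in columns]    # per-feature standard deviation

    def standardize(v):
        return [(x - m) / s for x, m, s in zip(v, mus, sigmas)]

    target = standardize(vector(unknown, parameters))
    return min(names, key=lambda n: dist(standardize(vector(known[n], parameters)), target))

for parameters in (["ffc"], ["ltas"], ["ffc", "ltas"]):
    print("+".join(parameters), "->", identify(UNKNOWN, KNOWN, parameters))
```

In this constructed case the unknown sample is assigned to speaker A by the FFC features alone and to speaker C by the LTAS features alone, but to speaker B by the composite, because each parameter separates the pair of speakers that the other one confuses.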
A reason for the poor identification rate obtained by the composite of the IDS and the FFC parameters may largely be the similar type of speech characteristics contained in both parameters. The IDS was derived by a time-normalization formula reflecting the intensity variation of each feature (frequency component) as a function of time, and hence is partially dynamic in nature. The FFC consisted mainly of the variation of fundamental frequency related features extracted from time-domain speech, and is therefore also dynamic in nature. These two parameters can thus be considered to share partially common characteristics.

More convincing supportive evidence for the usage of a composite parameter of different types might be seen in the other methods of voice identification, namely the spectrographic method and the aural method. Typically, in the spectrographic method, multiple speech characteristics (parameters) are concurrently examined in a paired set of spectrograms, one for the unknown speaker and the other for the known speaker. These include mean frequencies and bandwidths of vowel formants, gaps and type of vertical striations, slopes and transitions of formants, duration of similar phonetic elements and plosive gaps, energy distortion of fricatives and plosives, and interformant acoustic density patterns (Tosi et al., 1972). Perceptual speech characteristics predominantly used by the human examiner in the aural method are known to be clarity, roughness, magnitude, and animation (Voiers, 1964); pitch, intensity, quality, and rate (Holmgren, 1967); or quality, rhythm, melody pattern, pitch, rate, and respiratory group (Tosi, 1979). To date there is no evidence pointing to the superiority of the perceptual method over the computer method of identification, or vice versa.
Nonetheless, in practice, it is a general notion that the computer method in its present state of the art would often fail to identify the speaker correctly when given text-independent speech materials containing other undefined characteristics in addition to speaker-dependent characteristics. Under the same circumstances, the human examiner can often recognize the speaker with relative ease. In this sense, the ultimate goal of computer voice identification seems to be attainable only by simulating the yet unknown perceptual mechanisms of the trained examiner, who can extract multiple speech parameters singly or in any combination, depending upon the type of speech data at hand.

To sum up, considering the results from this study and the reports in the literature cited above, the main reason for the improved identification rate by the composite parameter seems to be the inclusion of two different types of speech parameters, one type carrying static spectral features, the other carrying dynamic glottal features. Though much effort in the study of automatic voice identification has been focused on the search for the cardinal speech parameter(s), it is likely that an identification system demands several types of parameters to fully represent speakers whose speech is text-dependent or text-independent, and recorded with or without the influence of the transmission system.

Influence of Pause Elimination

During the recording session, each speaker was allowed to read the text material at his preferred reading rate. Consequently, the varying rate of each individual speaker was reflected in the resulting speech data in terms of the duration of the voiced frames or in terms of the overall articulatory patterns. For example, the compressed speech (processed speech data from which pauses are deleted) of a speaker who read at a relatively faster rate might have contained more voiced frames per unit of time than that of a speaker who read at a slower rate.
Inevitably, the rate of voiced frames included in the compressed speech appears to have interacted with each derived speech parameter to a different degree and in different aspects. One of the speech parameters, the LTAS (long-term averaged spectrum), is believed to have been affected to the minimum degree by the varying rate of voiced frames. Because the LTAS was a spectrum of the averaged intensities of frequency components, it reflected no dependency upon time variation. Also, the duration of the total voiced frames in the compressed speech of all the speakers was assumed to be long enough to counterbalance the different phonetic contents for each speaker. In contrast, because all the features in the IDS and in the FFC (except Fo) were computed as a function of time (the IDS from successive, temporally varying short-term spectra, and the FFC from the time-varying Fo contour), these two parameters are believed to have been under the influence of the rate of voiced frames. The influence seems to be double faceted: on the one hand, the rate of voiced frames successfully reflected speaker-dependent characteristics, as was intended; on the other, it acted as a confounding variable. The latter aspect of the influence certainly leaves some room for further investigation.

Interaction of the Experimenter

Certainly, any kind of interaction by the experimenter should be avoided, or at least minimized, if a completely objective or automatic method of voice identification is desired. In this study there were two points which demanded the intensive participation of the experimenter. One occurred during the feature optimization procedure, where some amount of experimenter's strategy was inevitable in choosing the 10 features for the IDS and LTAS parameters. This interaction during the feature selection appears to be a drawback in view of the repeatability of the procedure.
Since it was found that feature optimization was a crucial procedure in the 'elimination' of the influence of the response characteristics, the need for further study of interaction-free algorithms for this feature optimization scheme is obvious. The other point of interaction occurred during the interactive peak detecting method applied for the measurement of the 9 features of the FFC parameter. This method was implemented to ascertain accurate measurements of fine temporal variations of the fundamental frequency. Since the compressed speech data included many discontinuous points between successive voiced frames (signals), the application of the interactive method was crucial in order to prevent these discontinuities from resulting in erroneous measurements. Fortunately, it became clear that as long as the experimenter is familiar with the wave pattern of speech sound depicting the recurrent peaks, this interactive method results in stable measurements from experiment to experiment. For this reason, the interaction involved in the measurement procedure for the FFC parameter does not appear to have contributed to the resulting identification rate as a confounding factor.

CONCLUSIONS

This study was exploratory in nature regarding the methodology applied and the types of problems dealt with. Despite the application of a computer as the major computing resource, there were several stages where interactions by the author were required. Within these limitations, the following general conclusions were drawn from the results of this study.

1. Both IDS (intensity deviation spectra) and FFC (fundamental frequency contour) are effective in eliminating the influence of the response characteristics of the transmission and recording channels. But their correct identification rates were only moderate (50-60%).

2.
LTAS (long-term averaged spectra) is susceptible to the influence of the response characteristics, but even under that influence, the correct identification rate was 60-70%.

3. The composite parameter of IDS and FFC is effective in eliminating the influence of the response characteristics. However, the correct identification rate is not improved, i.e., it is only as good as each component.

4. The composite parameter of LTAS and FFC is the most effective speech parameter in eliminating the influence of the response characteristics of the transmission systems. It achieved the highest possible correct identification rate of 100%.

IMPLICATIONS FOR FURTHER RESEARCH

The major finding of this study was that the speech parameter composed of the optimized features of LTAS and the derived features of FFC can successfully eliminate the influence of the biased frequency response characteristics of the transmission and recording devices. In relation to this finding, immediate research topics focusing on the same problem investigated in this study are suggested below.

1. The methodology used in this study could be replicated by using the composite parameter of LTAS and FFC with an increased speaker population of 50 or more.

2. Choral spectra could be investigated for their feasibility for the feature optimization procedure. It is clear that choral spectra and the LTAS possess equally useful properties in distinguishing the speakers, provided all the voices are collected through the same transmission system. The major advantage of choral spectra over LTAS (long-term averaged spectra) is a drastic reduction in the amount of time required to generate them. If feature optimization can be made feasible with choral spectra, then they can form a composite parameter with FFC.

3. The feature optimization procedures applied in this study could be investigated for further improvement. Although it was found that this procedure played a nontrivial role in the elimination effects, it required some amount of arbitrary interaction by the experimenter. Ideally, there should be no interaction by the experimenter during the feature optimization process. This aspect appears to be worthy of serious investigation to make the computer method of voice identification more objective.

4. The measure of intra-speaker variability could explicitly be taken into account to establish the basis for assigning the probability of errors of identification.

5. Speech parameters other than IDS, LTAS, choral spectra, and FFC could be investigated, not as alternative parameters, but as possible candidates to be included in the composite parameter. One such candidate is the pause characteristics (considered to be independent of the frequency response characteristics) in the ongoing speech of the speakers.

REFERENCES

Anderberg, M.R. Cluster Analysis for Application, Academic Press, New York, 1973.
Atal, B.S. 'Automatic recognition of speakers from their voices', in Automatic Speech & Speaker Recognition, N. Rex Dixon and Thomas B. Martin (eds.), IEEE Press, New York, 1978, pp. 349-364.
Atal, B.S. 'Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification', J. Acoust. Soc. Amer., 1974, Vol. 55, pp. 1304-1312.
Atal, B.S. 'Automatic speaker recognition based on pitch contour', J. Acoust. Soc. Amer., 1972, Vol. 52, pp. 1687-1697.
Bogner, R.E. 'On talker verification via orthogonal parameters', IEEE Trans. Acoust., Speech, and Signal Processing, 1981, Vol. ASSP-29, No. 1, pp. 1-12.
Bricker, P.D. et al. 'Statistical techniques for talker identification', Bell System Technical Journal, 1971, Vol. 50, pp. 1427-1454.
Bunge, E. 'Automatic speaker recognition system AUROS for security systems and forensic voice identification', in Automatic Speech & Speaker Recognition, N. Rex Dixon and Thomas B.
Martin (eds.), IEEE Press, New York, 1978, pp. 414-420.
Cheung, R.S. and Eisenstein, B.A. 'Feature selection via dynamic programming for text-independent speaker identification', IEEE Trans. Acoust., Speech, and Signal Processing, Oct. 1978, Vol. ASSP-26, No. 5, pp. 397-403.
Das, S.K. and Mohn, W.S. 'A scheme for speech processing in automatic speaker verification', IEEE Trans. Audio Electroacoust., 1971, Vol. AU-19, pp. 32-43.
Doddington, G.R. 'A method of speaker verification', Paper presented at the Eightieth Meeting of the Acoust. Soc. Amer., 1970, Nov. 3-8, Houston, Texas.
Doddington, G.R. 'Speaker verification - Final report', Rome Air Development Center, Griffiss AFB, N.Y., Tech. Rep., April 1974, RADC 74-179.
Furui, S. 'Cepstrum analysis technique for automatic speaker verification', IEEE Trans. Acoust., Speech, and Signal Processing, April 1981a, Vol. ASSP-29, No. 2, pp. 254-272.
Furui, S. 'Comparison of speaker recognition methods using statistical features and dynamic features', IEEE Trans. Acoust., Speech, and Signal Processing, 1981b, Vol. ASSP-29, No. 3, pp. 342-350.
Furui, S., Itakura, F., and Saito, S. 'Talker recognition by longtime averaged speech spectrum', Electronics and Communications in Japan, 1972, 55-A, pp. 54-61.
Gold, B. and Rabiner, L. 'Parallel processing techniques for estimating pitch periods of speech in the time domain', J. Acoust. Soc. Amer., 1969, Vol. 46, pp. 442-448.
Hair, G.D. and Rekieta, T.W. 'Automatic speaker verification using phoneme spectra', J. Acoust. Soc. Amer., 1972, Vol. 51, p. 131(A).
Hair, G.D. and Rekieta, T.W. 'Mimic resistance of speaker verification using phoneme spectra', J. Acoust. Soc. Amer., 1972, Vol. 51, p. 131(A).
He, Q., and Dubes, R. 'An experiment in Chinese speaker identification', presented at 1982 IEEE Int'l Conf. Trans. Acoust., Speech, and Signal Processing.
Holmgren, G. 'Physical and psychological correlates of speaker recognition', Journal of Speech and Hearing Research, 1967, Vol. 10,
pp. 57-66.
Hughes, G.F. 'On the mean accuracy of statistical pattern recognizers', IEEE Trans. on Information Theory, 1968, Vol. IT-14, pp. 55-63.
Hunt, M.J., Yates, J.W., and Briddle, J.S. 'Automatic speaker recognition for use over communication channels', IEEE Int'l Conf. Record on Acoust., Speech, and Signal Process., May 9-11, 1977, pp. 764-767.
Jain, A.K. and Dubes, R. 'Feature definition in pattern recognition with small sample size', Pattern Recognition, 1978, Vol. 10, pp. 85-97.
Luck, J.E. 'Automatic speaker verification using cepstral measurements', J. Acoust. Soc. Amer., 1969, Vol. 46, pp. 1026-1032.
Majewski, Z.W., and Hollien, H. 'Cross correlation of long-term speech spectra as a speaker identification technique', Acustica, 1975, Vol. 34, pp. 20-24.
Markel, J.D. 'The SIFT algorithm for fundamental frequency estimation', IEEE Trans. Audio Electroacoust., 1977, Vol. AU-20, pp. 367-377.
Markel, J.D., and Davis, S.B. 'Text-independent speaker recognition from a large linguistically unconstrained time-spaced data base', IEEE Trans. Acoust., Speech, and Signal Processing, Feb. 1977, Vol. ASSP-27, No. 1, pp. 74-82.
Markel, J.D., Oshika, B.T., and Gray, A.H. 'Long-term feature averaging for speaker recognition', IEEE Trans. Acoust., Speech, and Signal Processing, 1977, Vol. ASSP-25, pp. 330-337.
The National Research Council, On the Theory and Practice of Voice Identification, National Academy of Sciences, Washington D.C., 1979.
Noll, A.M. 'Cepstrum pitch determination', J. Acoust. Soc. Amer., 1967, Vol. 41, pp. 293-309.
Paul, J.E., Rabinowitz, A.S., Riganati, J.P., Richardson, J.M. 'Development of analytical methods for a semi-automatic speaker identification system', 1975 Carnahan Conf. on Crime Countermeasures, 1975, pp. 52-64.
Pruzansky, S. 'Pattern-matching procedure for automatic talker recognition', J. Acoust. Soc. Amer., 1963, Vol. 35, pp. 354-358.
Rabiner, L.R., Levinson, S.E., Rosenberg, A.E., and Wilson, J.G.
'Speaker-independent recognition of isolated words using clustering techniques', IEEE Trans. Acoust., Speech, and Signal Processing, Aug. 1977, Vol. ASSP-27, No. 4, pp. 336-349.
Sammon, J.W. Jr. 'A nonlinear mapping for data structure analysis', IEEE Transactions on Computer, May 1969, pp. 401-409.
Shafer, R.L., and Rabiner, L. 'Digital representation of speech signals', Proceedings of IEEE, 1975, Vol. 63, pp. 662-667.
Tarnoczy, T. 'Determination du spectre de la parole avec une methode nouvelle', Acustica, 1958, Vol. 8, pp. 392-395.
Tosi, O. Voice Identification: Theory and Legal Applications, University Press, Baltimore, 1979.
Tosi, O. 'Pausometry: Measurement of a low level of acoustic energy', in World Papers in Phonetics, Phonetics Society of Japan, The Phonetic Society of Japan, Tokyo, 1974, pp. 129-144.
Tosi, O., and Nakasone, H. 'Cancellation of the telephone response curve', Paper presented at International Association for Identification, Aug. 1982, Rochester, New York.
Tosi, O., Pisani, R., Dubes, R., and Jain, A. 'An objective method of voice identification', Current Issues in the Phonetic Sciences, Harry & Patricia Hollien (eds.), in series Current Issues in Linguistic Theory, Vol. 9, in Amsterdam Studies in the Theory and History of Linguistic Science IV, Amsterdam: John Benjamins B.V., 1979, pp. 851-861.
Tosi, O., et al. 'Experiment on voice identification', J. Acoust. Soc. Amer., 1972, Vol. 51, pp. 2030-2043.
Voiers, W. 'Perceptual bases of speaker identity', J. Acoust. Soc. Amer., 1964, Vol. 36, pp. 1065-1073.

APPENDIX A

A SAMPLE TEXT EXCERPT

[Text excerpt not recoverable from the scan.]
APPENDIX B

RESPONSE CHARACTERISTICS OF TEAC TAPE RECORDER AND BRUEL & KJAER MICROPHONE

[Figure: response curve plotted against frequency in Hz; potentiometer range 50 dB. Plot not recoverable from the scan.]

APPENDIX C

SUMMARY OF THE RESULTS FROM PAUSE ELIMINATION

[Table of pause elimination results for the 10 speakers under the three transmission conditions, with mean and S.D.; values not recoverable from the scan.]

APPENDIX D

COMPLETE-LINK DENDROGRAMS OF IDS AND LTAS FEATURES

APPENDIX D(2) DENDROGRAM OF 100 FEATURES BASED ON NORMAL IDS.

[Dendrograms not recoverable from the scan.]
--- ...---------- ...-----------o .“.----o ------ ...-------o ----------------- ...------------------------ ...----------- ---------- ...------------------------ ...---------------o----------- ...----------0 .-- oo------ .-.-------------------- ...------------------ ...-------o -------- ...--------------- ..~--------------- ...-----------o-- .‘.-------- ----- ..‘------ ----- .-.--------- .~.-------- .-.------o------ .-.------------ ...--------o-------- .n...------.-. -- .‘.-----o ------------ ...----------------- ..3.---- r .“.----------------------.----------------- : ..fi-----------.o------------------ ‘ ...--o--------------o------------o . ...------------------------- ' .~.---------—----------- : .~~---------.---------- . ...------oo------- . .“.------------ I . - O O O h C a C C O s O a I I R 8 I: 8 3 t D. a. I. a. .- JIDUJ ubululllt J-b‘inb .035 .i S: Uri on an .0 a. a-h..lll-b 3 a C I 3 8 I A .mQH m CALL INIT(IBUFv500) CALL SCROL<191000) TYPE 1220! ITAB TYPE 1220’ ID CALL APNT(O.9730.vOy-3y0y4) CALL UECT(900.90.) CALL APNT<200.9860.909—8) CALL SUBP(1) CALL TEXT(’ PROGRAM AUTOCM’) CALL ESUB TYPE 1200' ID CALL OFF(1) CALL SCROL<20957078) FORMAT<1A1) TYPE 215 FORMAT(’ Enter file name ( 6 letters ) 3 ’r$) READ<59405) (FILNAM(M)1M=196) FORMAT<6A1) TYPE 220 FORMAT(’ How mans seconds ( in I3 FMT ) ? 3 ',$) ACCEPT 4109 NSEC FORMAT(13) IF) IF(NLET.NE.6) STOP ’Dad RAD50 conversion’ MBLKS = IENTER HAP FORMATvSFILE(4)vOFILE(4) INTEGER IOUT(2560)yFLAGyTMPKSvTPKTyTP INTEGER TOTPrTSyIN2(2) EGUIUALENCE (IN4vIN2) LOGICALXI TM(8)vYOPNvNAME(6)vIBELvDLANK DATA SFILE<1)/3RDKO/vSFILE(4)/3RSND/ DATA OFILE/3RDK193RCMP93RRSD93RSND/ DATA PCH/12RDK1POZCMPSAU/ DATA VCH/12RDK1POZUI SAU/ DATA IN2/25690/vISUICH/B/rIDEL/‘OO7/vBLANK/’ ’/ FORMAT(’r’71A1) CALL RCHAIN(IF?ISUICH734) NSIZE = 2560 UP = 000.1. DOWN ; 0.01 COUNT = 0.0 NH = .10 Check from which pregam this program is chained to. 
IF(JF.NE.-1) GO TO 4100 4100 4200 4400 1352 4110 7001 5001 5000 5010 5030 6030 160 IF(ISUICH.EO.2) GO TO 4105 IF(ISUICHoEQ.l) GO TO 4110 STOP ’ *XXINUALID ISUITCHXXX’ TYPE 4200 FORMAT(’ Enter input sound file name (6 letters) P’v$) READ<594400) (NAME(M)7M = 196) FORMAT(6A1) CALL IRAD50<69NAMErSFILE(2)) URITE<791350) READ(591220) NSEC ICIN = IGETC() IF(ICIN .LT. 0) STOP ’ NO INPUT CHANNEL.’ ILK = LOOKUP>IEND=’vI77’ IBEG=”I7!’ CALL C"PRES(IBUF’IBEG'IEND’ITDT) IF(IDONE .LT. 1) GO TO 8050 . ND = IURITU’9$) FORMAT(’ Enter the number of seconds < 1=€SEC=fi60 ) b’.$) FORMAT<1X98A19’ Search ended.’v/) FORMAT<1X98A17’ Compression in procee. Uait.’/) FORMAT<1A1) FORMAT(F8.1) FORMAT<1X9’Pre-detection outputs :’v//v1Xv’Computed maximum 8 amplitude =’vF10.49/91Xv’Computed minimum amplitude =’p 8F10.475(/)) CALL EXIT END 164 CtXXt***##*X***#****#****X**#*#****##*###*X**********###*# C i C INTEGER FUNCTION IAUTO(HAP9UP9DOUNrCOUNTyTHSEC) * C * C Called be Prosram HNAUTO. * p ad * C TMSEC = 0.5XFLOAT(NUBLK)#2S.6 ! Threshold. * p a * C COUNT = The number of sample points detected as i C pauses ( devide be 10 to obtain msec.) * C it C IAUTO = 999 IF COMPRESSION COMPLETED. * C t C IAUTO = -999 IF COMPRESSION NOT COMPLETED. t C t C Uritten be : Hirotaka Nakasone t C Date : 11 Mae, 1983 * CttttttXX*********¥*#****XX*XtttttXtttXttttt**¥*##******## FUNCTION IAUTO(HAPTUPTDOUNTCOUNTyTMSEC) TEST = COUNT/10. - TMSEC THRESH = TMSEC * 0.1 IF(TEST.GT.THRESH) GO TO 10 IF(TEST.LT.0.0) GO TO 20 C HERE, MISSION COMPLETED. IAUTO = 99 RETURN C HERE, COMPRESSED SIGNAL TOO LONG. 30 REDUCE C AP VALUE BY CURRENT DOUN VALUE. 10 HAP = HAP - DOUN UP = DOUN / 2. D TYPE 1009COUNT/10.yTEST.DOUNoHAP 100 FORMAT<1X9’ Compressed too Ions; ’yF10.1r’ msec.’s/p 8’ Difference = ’9F1001” msec. Ap is reduced be ’9F7.59/! 8’ Repeating compression be Ap = ’9F7.5!//) . IAUTO = -99 RETURN C HERE, COMPRESSED SIGNAL TOO SHORT. SO INCREASE THE C AP VALUE BY CURRENT UP VALUE 20 HAP = HAP + UP DOUN = UP / 2. 
D     TYPE 200,COUNT/10.,TEST,UP,HAP
200   FORMAT(' Compressed too short; ',F10.1,' msec.',/,
     &' Difference = ',F10.1,' msec. Ap is increased by',F7.5,/,
     &' Repeating compression by Ap = ',F7.5,//)
      IAUTO = -99
      RETURN
      END
CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
C
C     Program POZCMP (on RT-11FD, V02C-02D, PDP11/40)
C
C     POZCMP detects, measures, and stores pauses from running
C     speech. The detected pauses are concatenated so that they
C     can be reproduced through the DAC either to a loudspeaker
C     or to a tape recorder. An input file resides in Disk 0 with
C     a .SND extension, and an output file is also in Disk 0.
C
C     Subprograms called:
C       CMPRES: Used to fill a temporary buffer until it gets
C               fully 10 blocks (2560 words).
C       IDA:    DAC & ADC routine found in FUSS.LIB.
C
C     Program chaining (from/to): PAUSOM, POZV1, SIGCMP, DTOA
C     POZCMP can be run from the monitor directly by typing
C     .RUN POZCMP.
C
C     To compile: switches /U/S/N:m (m=14)
C     To link:    *POZCMP=POZCMP,CMPRES,FUSS,SYSLIB/F
C
C     Written by : Hirotaka Nakasone
C     Date       : 27 March, 1982
Department of Audiology and Speech Sciences Michigan State Universite East Lansing: Michigan HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH DOUBLE PRECISION VCHrSCH COMMON/PASS3/ ISUICHyNSECySFILEyTPeAPsSUMMAXySUMMINvDFILE COMMON/Bl/COUNTrIOUTsNSIZErIDONErLAST INTEGERX4 IN49JP4rLE4 ‘ INTEGER IHOLD<256O):IDUF(256O)'SFILE(4)TOFILE<4F INTEGER IN2(2)TIOUT(2560)rFLAGvTPKTTTP LOGICAL*1 TM(8)vYORNvNAME(6)rIBELsBLANK EOUIVALENCE (IN491N2) DATA SFILE<1)/3RDKO/ySFILE(4)/3RSND/ DATA OFILE/3RDK093RCMP93RRSDT3RSND/ DATA VCH/12RDK POZVI SAV/ySCH/12RDN SIGCMPSAV/ DATA ISUICH/2/rIBEL/‘007/TBLANK/’ ’/ nononnonmanna-000000110000000000 3011 FORMAT(’+’91A1) CALL RCHAIN(IFTISUICH934) COUNT.= 0.0 NSIZE=2560 NU = 10 TYPE 4200 4200 FORMAT(’ Enter input file name (6 letters) F’p$) READ(594205) (NAME(M)9M=176) 4205 FORMAT<6A1> CALL IRAD50(6!NAMETSFILE(2)) 1352 URITE(7:1350) READ(59122O) NSEC 4100 IF(ISUICH.EO.3) GO TO 6050 3010 7001 5001 5000 5010 5030 5500 6023 6030 O14oo 166 FORMAT(IAI) ICIN = IGETC() IF(ICIN .LT. 0) STOP ’ NO INPUT CHANNEL.’ ILK = LOOKUP (TM(M).M=1.8) CONTINUE NUD = IREADU’9$) FORMAT(’ Enter the number of seconds ( 1=€SEC=€60 ) >’9$) FORMAT(1X98A19’ Search ended.’9/) . . FORMAT(T398A19’ Pause compression begins. Uait. ’9$) FORMAT(FB.1) . FORMAT(1X9’Pre-detection outputs:’9//91X9'Computed maximum 7100 6000 6666 6695 6670 6690 6680 169 8 amplitude =’9F10.43/91X9’Computed minimum amplitude =’9 8F10.495(/)) FORMAT(/9T398A19’ PAUSE compression completed.’9//) COUNT=COUNT/10. ICHAN = IGETC() IN2(1)=256 IN2(2)=0 NB = LOOKUP TYPE 60009COUNT9NB ' FORMAT(’ Output file name: CMPRSD.SND’9/9’ Total compressed % pauses = ’9F10.19’ msec(’9I49’ blocks)’/) CALL JICVT(NB9JB4) CALL JMUL(IN49JB49LE4) IER = IDA(ICHAN96940091009LE499120960) TYPE 6710 FORMAT(’ Hit {Return} to plas asain.’/’ Otherwise tepe ans 8 kee9 then hit {Return} b’9$) ACCEPT 30109YORN IF(YORN .EG. 
BLANK) GO TO 6720 CALL CLOSEC(ICHAN) ' IF(IFREEC(ICHAN).NE.0) STOP ’ IFREEC failed’ TYPE 6666 FORMAT(’ Try with other TP and AP (Y/N) ?’9$) ACCEPT 30109YORN IF(YORN.EO.89) GO TO 6030 ISUICH = 2 TYPE 6670 FORMAT(’ Options (Type a letter of your choice below.)’9/// 89T59’ S --- to creat a compressed SIGNAL.’9/ 89T59’ H --- to Set PRINT OUTS of pauses and sianals.’9/ 89T59’ O --- to QUIT this program.’9//9TS9’P’9$) ACCEPT 30109YORN IF(YORN.EG.81) GO TO 6680 ISUICH = 2 IF(YORN.EO.72) CALL CHAIN TYPE 6690 FORMAT(’ Invalid choice. Try again !’/) GO TO 6695 CONTINUE CALL EXIT END 170 CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH C C C C C C C C C C C C D100 [I To 1‘) UI DUCT-b to O 0 Program CMPRES is called from AUTOCM9 POZCMP9 SIGCMP9 and POZV1. CMPRES fills output buffer be the (pauses/signals) one NSIZE (10 blocks = 2560 words) at a time. Uritten be 3 Hirotaka Nakasone Last modified: 27-March9 1982 Department of Audiology and Speech Sciences Michigan State Universite HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH SUBROUTINE CMPRES(INB91B9IE9IT) COMMON/Bl/COUNT9IOUT9NSIZE9IDONE9LAST INTEGER INB<2560)9IOUT(2560) TYPE 1009LAST9IDONEvIBvIE FORMAT(//91X9’ AT CMPRES ENTERANCE9 LAST=’9I79’ IDONE=’eI29/9 8T24!’ IBEG=’9I79’ IEND=’yI7) IF(IT .LE. 0) GO TO 4 IF((LAST+IT) .LE. NSIZE) GO TO 2 ITEMP = NSIZE - LAST - 1 T=00 DO 1 I=IBTIB+ITEMP LAST = LAST + 1 IOUT(LAST)=INB(I) IT = IT - 1 T=T+10 CONTINUE COUNT=COUNT+T LAST = 0 IDONE = 2 I3 = ID + ITEMP + 1 GO TO 4 CONTINUE T=00 DO 3 I IBvIE LAST LAST + 1 IOUT(LAST)=INB(I) T=T+1o CONTINUE COUNT=COUNT+T IF(LAST.LT.NSIZE) GO TO 5 LAST=0 IDONE=1 GO TO 4 CONTINUE IDONE=0 CONTINUE TYPE 2009LAST9IDONE9IB9IE FORMAT(1X9’ LEAVING CMPRSD UITH LAST =’9I79’ IDONE=’9I29Jr 8T239’ IBEG=’9I79’ IEND=’9I79///) RETURN END~ 171 CHHHH PROGRAM SHORTS: HHHHHHHHHHHH SHORTS is designed to creat a set of parent files of short-term spectra. A parent file can have as mane as 400 short-term spectra. 
2 spectra are stored in a l-block INTEGER*2. Input sound file resides in DK1 with .SND extension. Output short-term spectrum will be assiSned .STS extension name. I This output file is then stored in DKO. To compile: *SHORT8=SHORT8/U To link: *SHORT8=SHORT89FFT10H9SHUFFL9TABE109SYSLID/F Uritten be: Hirotaka Nakasone Date: Has 49 1983 Updated: Mae 69 1983 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH C C C C C C C C C C C C C C C C C C C C C C INTEGER FILEIN(596)9FILOUT(59596)9NSEC(5)9UCNT INTEGER IBUF(256)9NSPT(5)9DATAIN(4)9DATOUT(4)9LBUF<256> REAL TX(256) COMPLEX F(256) COMMON /FFTCOM/F LOGICALXI TM(8)9BUG9 YN DATA DATAIN<1)/3RDKO/9DATAIN(4)/3RSND/ DATA DATOUT(1)/3RDK0/9DATOUT(4)/3RSTS/ FUNIT = 5000./128. BUG = .FALSE. D TYPE 10 10 FORMAT(’ DEBUG (Y/N) ?’9$) D ACCEPT 119 YN 11 FORMAT(1A1) D IF(YNOE0089) BUG = OTRUEO TYPE 20 ' 20 FORMAT(’ Uhich disk has input file (1/0) ?’9$) ACCEPT 259 NDISK 25 FORMAT(II) IF(NDISK.EO.1) CALL IRAD50(39’DK1’9DATAIN(1)) TYPE 30 30 FORMAT(’ Uhich disk will store output file (1/0) ?’9$) ACCEPT 259 NDISK IF(NDISK.EO.1) CALL IRAD50(39’DK1’9DATOUT(1)) TYPE 100 . 100 FORMAT(’ Number of input sound files(max=o) >’,$) ACCEPT 2009NFILES 200 FORMAT(I3) DO 300 I=19NFILES TYPE 40091 172 400 FORMAT(1X9’Tepe input file f’9I29’ (6A1) P ’93) READ<59500) (FILEIN(I9J)9J=1.6) 500 FORMAT(6A1) TYPE 510 510 FORMAT(’ Number of seconds of this input file E ’.$) ACCEPT 2009NSEC(I) TYPE 600 600 FORMAT(’ Number of parent spectra from this file 8(max=5) b ’9$) ACCEPT 2009NSPT(I) TYPE 700 700 FORMAT(’ Tape all parent Spectra names (6A1) below.’/) DO 800 K=19NSPT(I) TYPE 9009K 900 FORMAT(T2O9’ For output parent file t’9I29’ P ’9$) READ(59500)(FILOUT(I9K9J)9J=196) 800 CONTINUE 300 CONTINUE C END OF INPUT PROCEDURES. NOU BEGIN LOOP FOR C ALL INPUT FILES. NSAMP = 256 D0 1100 II=19NFILES CALL TIME(TM) TYPE 11109(TM(M)9Mm198)T(FILEIN(II9KK)9KK=196) 1110 FORMAT(1X98A19’ BEGIN INPUT FILE DKO: ’96A19’.SND’/) CALL CVRADI: FILE VERIFICATION ROUTINE. 
g FUNCTION INnEx: MATRIX FORMATTING ROUTINE. C C C C C C C C C C C C PROGRAMMED AS A PART OF FEATURE EXTRACTION PROCEDURE FOR A' PH.D. DISSERTATION (COMPUTER VOICE IDENTIFICATION BY USING INTENSITY DEVIATION SPECTRA AND FUNDAMENTAL FREQUENCY CONTOUR) BY THE AUTHOR. URITTEN BY 3HIROTAKA NAKASONE9 22-MAY-1983 DEPARTMENT OF AUDIOLOGY AND SPEECH SCIENCES9 MSU. TO COMPILE3 DENCOR=DENCOR/UE/DJ TO LINK 3 DENCOR=DENCOR9SYSLIB/F HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH COMMON /NAME/FNAMES COMMON /NUMF/N DOUBLE PRECISION EXT9PROG REAL DMAT(4950)9DAT(309100) LOGICALXI FNAMES<159100)9YN9LNAME(15)9FMT(6O) LOGICALXI NEXT(3) - DATA LNAME(1)/’D’/9LNAME(2)/’K’/9LNAME(4)/’3’/ DATA LNAME(10)/’.’/ DATA FROG/12RDKIORDPR SAV/ LNAME(15) = 0 FORMAT(1A1) FORMAT(I3) FORMAT(6A1) FORMAT(14A1) FORMAT(15A1) FORMAT(12A1) FORMAT(1X9’ UAIT.’) 'FORMAT(1X9/) ‘OCOLfia-buf-JH N = 100 D0 10 I = 19 4950 DMAT(I) = O. 10 CONTINUE C READ DATA FILES FROM A MASTER FILE. 300 210 410 420 180 TYPE 210 FORMAT(’ ENTER MASTER FILE NAME (DEV3FILNAM.EXT) P’9$) READ(594) (LNAME(L)9L=1914) CALL ASSIGN(139LNAME9149’OLD’) READ(139410) (LNAME(L)9L=1914)9NUMF9(FMT(M)9M=196O) FORMAT(14A19I396OA1) DO 420 I = 19 NUMF READ(1394) (FNAMES(M9I)9M=1914) URITE(79425) (FNAMES(M9I)9M=1914) FORMAT(1X914A1) CALL ASSIGN(129FNAMES(19I)9149’OLD’) READ(129FMT) DUM9 DUM READ(129FMT) (DAT(I9J)9J=19100) CALL CLOSE(12) CONTINUE ' CALL CLOSE(13) C COMPUTE SUM(N) AND SSQ(N) FOR N FREQUENCIES OVER NUMF SPECTRA. 600 605 606 CD CD666 610 620 700 11 = 3 I2 = 102 RN = FLOAT(NUMF) ND = 0 DO 620 I=19N-1 DO 610 J=I+19N SUMX = 09 SUMY = 0. SUMXX = 0. SUMYY = 09 SUMXY = 0. 
no 600 R=1.NUHF SUMX = SUMX + DAT(K9I) SUMY = SUMY + DAT(R9J) = SUMXX+ DAT(K9I)*DAT(K.I) SUMYY= SUMYY+ DAT(K9J)#DAT(R9J) SUMXY= SUMXY+ DAT/GN SSUIT = SSTOT - SSBET BGMS = SSBET / DFB UGMS = SSUIT / DFU TSUG = 0 DO 4 J = 19 N6 TSUG = TSUG + ( SSQ(J) - (SUM(J)*SUM(J))/ SN ) CONTINUE URITE(7910) SSTOT9SSBET9SSUIT FORMAT(’ SSTOT=’9F14.39’ SSBET=’9F14.39’ SSUIT=’9F14.3) URITE(7912) BGMS9UGMS9TSUG FORMAT(’ BGMS=’9F14.39’ UGMS=’9F14.39’ TSUG=’9F14.3) F(K) = 9999.9091 IF(UGMS.GT.0.) F(K) = BGMS / UGMS TINUE URN 186 CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH FILE3 UPICK.FOR UPICK IS DESIGNED TO DETECT FUNDAMENTAL FREQUENCIES IN A RUNNING SPEECH BY INTERACTIVE METHOD. THIS PROGRAM MUST BE CHAINED TO PROGRAM HPICK9 THEN TO FFPICK TO COMPLETE MEASUREMENT OF 9 FEATURES OF FFC. URITTEN BY3 H. NAKASONE DATE3 26-JUN-83 AS A PART OF SERIES OF SOUND PROCESSING SOFTUARES USED IN THE PH.D. DISSERTATION BY THE AUTHOR. C C C C C C C C C C C C DEPARTMENT OF AUDIOLOGY AND SPEECH SCIENCES C MICHIGAN STATE UNIVERSITY C C HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH COMMON /SENSE/ MAXPIC9NBLOCK9NAME129NSR COMMON /ICSPEC/ ISPEC(39) COMMON /BUFF/ NBUF9XMAX9XMIN9YMAX9YMIN9YRANGE DOUBLE PRECISION HPROG9 EXT INTEGERX4 JLEN INTEGER NDEV9 NEXT9 NBUF(3600) LOGICALXl NAME6(6)9 NAME12(12)9 YN9 BEL9 ONE LOGICAL*1 ZERO9 CHA9 REPT9 VTAB9 DOLBY DATA EXT/3RSND/9 NSR/10000/9 IFILTR/3/9 BEL/'007/ DATA ONE/.TRUE./9 ZERO/.FALSE./9 REPT/.FALSE./9 VTAB/'Ol3/ DATA HPROG/12RDK1HPICK SAV/ KNIT = 3600 TYPE 1 1 FORMAT(’ PROGRAM UPICK3 ’///91X9’ Uant DOLBY?’9$) ' ACCEPT 779YN 77 FORMAT(Al) DOLBY = .FALSE. IF(YN.EQ.89) DOLBY = .TRUE. 
500 CALL INIT(NBUF.KNIT) CALL SCROL(191000) TYPE 20009 VTAB 2000 FORMAT(1X9A1) CALL SCROL(29100) CALL APNT(O.9200.919-59094) CALL APNT(0.9700.919-79091) CALL SUBP(1) CALL TEXT(’ READY ANALOG INPUT.’929’ UHEN READY9 S HIT RETURN KEY TO START.’) CALL ESUB CALL OFF(1) CALL APNT<0o9800.909—79191) N 40 100 O OD \104 187 CALL SUBP(2) CALL TEXT(’ CONVERSION IN PROCESS. ’) CALL ESUB CALL OFF(2) ICHAN = IGETC() IF(ICHAN.LT.O) STOP’ ICHAN NOT AVAILABLE’ TYPE 2 VFORMAT(’ NUMBER OF SECONDS (13) 7’9$) ACCEPT 39 NSEC LEN = NSR/256 RSEC = FLOAT(NSEC) SR = FLOAT(NSR) LENB = LEN # NSEC + 1 TYPE 10 FORMAT(1X9’ENTER OUTPUT FILE NAME fi’9$) IF(ICSI(ISPEC9EXT9990).NE.0) GO TO 20 IF(ISPEC(16).NE.0) GO TO 30 TYPE 22 FORMAT(’ THARDUARE ERROR?’//) CALL EXIT IF(IFETCH(ISPEC(16)).NE.0) STOP’FETCHING ERROR’ NBLOCK = IENTER(ICHAN9ISPEC(16)9-1) IF(NBLOCK.GE.LENB) GO TO 100 TYPE 409 LENB9 NBLOCK FORMAT(1X9’NOT ENOUGH BLOCK SIZE.’/’ 8 BLOCK SIZE REQUESTED = ’9I49’ AND ALLOCATED =’9I4) CALL EXIT TYPE 79 BEL CALL R50ASC(129ISPEC(16)9NAME12(1)) CALL JAFIX(SR#RSEC+.59 JLEN) CALL ON(l) CALL BIT12(ONE) CHA = ITTINR() IF(CHA.NE.13) GO TO 210 CALL BIT12(ZERO) CALL OFF(1) CALL ON(2) TYPE 79 BEL I=IDA(ICHAN979INT(0.04*SR)9INT(0.01*SR)9JLEN996090) IF(I.NE.0) TYPE 2259 I FORMAT(’ ERROR IDA I = ’9I4) CALL CLOSEC(ICHAN) CALL IFREEC(ICHAN) CALL OFF(2) TYPE 250 , FORMAT(1X9/l9’ SOUND INPUT PROCEDURE COMPLETED. UAIT...’//) CALL INIT(NBUF9KNIT) NUORD = 9 --- CREATE SUBPICTURES OF SPEECH UAVES -—- CALL SUBPIC(DOLBY) --' CHAINING TO PROGRAM HPICK -6- CALL CHAIN(HPROG9MAXPIC9NUORD) FORMAT(I3) FORMAT(’+’9A1) TO COMPILE9 UPICK=UPICK/U TO LINK9 UPICK=UPICK9FUSS9BIT129SUBPIC9GTLIB9SYSLIB/F END C 188 0 FILE SUBROUTINE3 SUBPIC.FOR CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH C C C C C C C C Routine SUBPIC create a set of displae subpictures of speech wave and command Characters reouired be the succeeding program HPICK (main part of the interactive peak picking technioue). 
SUBPIC is called from program UPICK. H. NAKASONE JUN 279 1983 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH SUBROUTINE SUBPIC(DOLBY) COMMON /SENSE/ MAXPIC9 NBLOCK9 NAME129 NSR COMMON /BUFF/ NBUF9 XMAX9 XMIN9 YMAX9 YMIN9 YRANGE DOUBLE PRECISION DUMMY INTEGER IBUF(1024)9 NBUF(3600) LOGICALXI BEL9 YN9 DONE9 NAME12(12)9 NAMOUT(10) LOGICAL*1 VTAB9NAME15(15)9NAMER(6)9DOLBY DATA VTAB/'013/9 NSR/IOOOO/ DATA NAME12(10)/’S’/9NAME12(11)/’N’/9NAME12(12)/’D’/ DATA NAME12(1)/’D’/9NAME12(2)/’K’/9NAME12(3)/’O’/ DATA NAME15(l)/’D’/9NAME15(2)/’K’/9NAME15(3)/’0’/ DATA NAME15(4)/’3’/9NAME15(5)/’T’/9NAME15(6)/’M’/ DATA NAME15(7)/’P’/9NAME15(11)/’.’/9NAME15(12)/’D’/ DATA NAME15(13)/’P’/9NAME15(14)/’Y’/ XMIN = 0. YMIN = ~573. XMAX = 1023. YMAX = 450. YRANGE = 400. XEND = 1022. NOISE = 400 KSIG = 6000 .LION = 8000 IPRO = 7001 'IRPT. = 7002 ICOR = 7003 IJUMP = 7004 ISTART = 7005 LUAIT = 9000 IDOT = 9500 IDOTC = 9600 1816 = 8500 KNIT = 3600 NPTS = 1024 CALL SCROL(191000) TYPE 8409VTAB 840 FORMAT(1X9A1) CALL SCROL(1912) CALL SCAL(XMIN9YMIN9XMAX9YMAX) CALL APNT(XMIN9YMAX919-89O91) CALL SUBP(LION) CALL LVECT(XMAX90.91) CALL LVECT(O.9-2.*YMAX91) CALL LVECT(-XMAX90.91) CALL CALL CALL CALL CALL CALL CALL CALL CALL YBOT CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL CALL 189 LVECT(0.92.*YMAX91) APNT(0.90.909‘19091) LVECT(1023.90.) ESUB ERAS(LION) CMPRS SAVE(’DKO:FRAMER.DPY’) INIT(NBUF9KNIT) SCAL(XMIN9YMIN9XMAX9YMAX) = ~(YMAX+65.) 
APNT(150.9YBOT919-39091) SUBP(ICOR) TEXT(’?Correction?’) ESUB ERAS(ICOR) CMPRS SAVE(’DKO3CORECT.DPY’) INIT(NBUF9KNIT) SCAL(XMIN9YMIN9XMAX9YMAX) APNT<800.9YBOT919-89091) SUBP(IRPT) TEXT(’?REPeat?’) ESUB ERAS(IRPT) CMPRS SAVE(’DK03REPEAT.DPY’) INIT(NBUF9KNIT) SCALCXMIN9YMIN9XMAX9YMAX) APNT<550.9YBOT919-89091) SUBP(IPRO) TEXT(’?Proceed?’) ESUB ' ERAS(IPRO) CMPRS SAVE(’DK03PROCED.DPY’) INIT(NBUF9KNIT) SCAL(XMIN9YMIN9XMAX9YMAX) APNT(0.9YBOT-26.9-19-89091) SUBP(ISTART) TEXT(’ Hit frame’) ESUB ERAS(ISTART) CMPRS SAVE(’DK03STARTR.DPY’) INIT(NBUF9KNIT) SCAL(XMIN9YMIN&XMAX9YMAX) APNT(350.9YBOT919-S9091) SUBP(IJUMP) TEXT(’?Jump?’) ESUB ERAS C CALL CMPRS CALL SAVE(’DKO3UAITER.DPY’) CALL INIT(NBUF9KNIT) CALL SCAL(XMIN9YMIN9XMAX9YMAX) 50 SR = FLOAT(NSR) ICHAN = IGETC() CALL IRAD50(129NAME12(1)9DUMMY) NBLOCK = LOOKUP(ICHAN9DUMMY) IF(NBLOCK.LT.-1) STOP ’ BAD LOOKUP’. IF(NBLOCK.EQ.-1) STOP ’ FILE NOT FOUND’ RSEC = 256.1 FLOAT(NBLOCK)/SR D TYPE 609 NBLOCK9 RSEC D60 FORMAT(1X9I49’ BLOCKS(= ’9F8.49’ SEC) FOR THIS FILE.’//) MAXBLK = NBLOCK - (NPTS/256) NBLK = 0 KSIG = 6000 NAMSIG = 0 1000 NU = IREADU(NPTS9IBUF9NBLK9ICHAN) NAMSIG = NAMSIG + 1 MAX = 0 CHECK IF NOISE REDUCTION BY DOLBY DESIRED. IF(DOLBY) GO TO 53 COME HERE IF NO DOLBY REQUESTED. DO 1550 I = 19 1024 IF(IABS(IBUF(I)).GT.MAX) MAX = IABS(IBUF(I)) 1550 CONTINUE GO TO 55 COME HERE TO DO DOLBY. 53 DO 1553 I = 19 1024 ITEMP = O IBUF(I) — NOISE IBUF(I) + NOISE IF(IBUF(I).GT.NOISE) ITEMP IF(IBUF(I).LT.~NOISE)ITEMP IBUF(I) = ITEMP IF(IABS(ITEMP).GT.MAX) MAX 1553 CONTINUE IABS(ITEMP) 55 FAC = 1.0 IF(MAX.GT.400) FAC = 400. 
/ MAX DO 560 I = 19 NPTS IBUF(I) = IBUF(I) * FAC 560 CONTINUE CALL APNT(0.90.999-89091) CALL SUBP(KSIG) CALL LVECT(0.9FLOAT(IBUF(1))) DO 565 I = 29NPTS DY 3 FLOAT(IBUF(I) - IBUF(I‘1)) CALL LVECT(1.9DY) 565 CONTINUE CALL ESUB ND 3 NAMSIG / 100 NAME15(8) = ND + '60 ND = MOD(NAMSIG9100) NAME15(9) = ND / 10 + '60 ND = MOD(ND91O) NAME15(10) = ND + '60 NAME15(15) = 0 D700 300 191 CALL ERAS(KSIG) CALL CMPRS CALL SAVE(NAME15) URITE(79700) NAME159 KSIG9 NAMSIG FORMAT(1X9’DONE FOR ’915A193X9’ KSIG=’9IS9 8’ AND NAMSIG=’9159/) CALL INIT(NBUF9KNIT) CALL SCAL(XMIN9YMIN9XMAX9YMAX) NBLK = NBLK + 4 IF(NBLK.LE.MAXBLK) GO TO 1000 MAXPIC = NAMSIG CALL CLOSEC(ICHAN) CALL IFREEC(ICHAN) RETURN END 192 TT3=DK03BIT12.FOR FILE3 BIT12.FOR HHHHHHHHHHHHHHHHHHHH SUBROUTINE TO ACTIVATE/DEACTIVATE A BIT 12 FOR SPECIAL KEY MODE. URITEN BY H. NAKASONE 08-JUN-1983 HHHHHHHHHHHHHHHHHHHHH 0000000 SUBROUTINE BIT12(KON) LOGICALXI KON C DEACTIVATE SPECIAL KEY MODE FOR KON = .FALSE.. IF(.NOT.KON) CALL IPOKE('449IPEEK('44).AND.'167777) C ACTIVATE SPECIAL KEY MODE FOR KON = .TRUE.. IF(KON) CALL IPOKE('449IPEEK('44).OR.'10000) RETURN END 193 C FILE HPICK.FOR CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH C HPICK is designed to measure Fo’s (fundamental freeuence) C in speech signal directle from the time domain be the use of ’Interactive peak detecting technieue’. Input3 File of displae subpictures created be UPICK. Output3 Data file which contains amplitudes9100. TYPE 7222 FORMAT(1X9/) . URITE(797200) (FILNAM(M)9M=1910)9K FORMAT(//91X9' z—— "/ 8’ Fi1e3 ’910A19/9’ Number of periods detected 8 = ’9I49/9’ “'7' 8" '4/> 201 URITE(797300) AVFF9SDFF9DLFF9PERJIT 7300 FORMAT(’ Summare of fundamental freouence (F.F.)3’9//9 8’ Mean F.F. =’9F8.29’ (Hz)’/9 8’ Standard deviation of F.F. 
=’9F8.29’ (Hz)’/9 8’ Average variation of F.F.(DF) =’9F8.29’ (Hz)’/9 8’ Ratio of BF to mean F.F =’9F8.29’ (%)’///) URITE(797400) AVEAMP9DBAVAM9SDAMP9DBSDAM9DAMP9DAMPDB9PERSIM 7400 FORMAT(’ Summare of amplitude of F.F. 3’9//9 8’ Mean amplitude =’9F9.29’ (’9F5.19’ dB)’/9 8’ Standard‘deviation of ampl. =’9F9.29’ (’9F5.19’ dB)’/9 8’ Average variation of amp1.(DA) =’9F9.29’ (’9F5.19’ dB)’/9 t’ Ratio of DA to mean ampl. =’9F9.29’ (Z)’ 8/9’ ~ ' 8 ’9/) URITE(797450) FFMX9FFMN9FFRG 7450 FORMAT(1X9’Maximum F.F.=’9F8.29’ Minimum F.F.=’9F8.29 3’ Rania FOF9=’7F702) CONVERT EXTENSION .FFC TO .HAN. CHARACTER H = 789 A = 659 N = 72 BY ASCII. FILNAM(10) a 78 FILNAM(9) = 65 FILNAM(B) = 72 CALL ASSIGN(139FILNAM9109’NEU’) URITE(139400) (FILNAM(M)9M=1910) 400 FORMAT(10A19’ (8F6.29F10.2)’/) URITE(139410)AVFF9SDFF9DLFF9PERJIT9PERSIM9FFMX9FFMN9FFRG9DAMP 410 FORMAT(8F6.29F10.2) CALL CLOSE(13) TYPE 4259FILNAM 425 FORMAT(//////91X9’Computed HAN file name3 ’910A1//) CALL EXIT END 202 C FILE: FFFRAT.FOR CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH FFFRAT IS DESIGNED TO COMPUTE F-RATIO’S OF 9 FFC FEATURES INPUT: DATA FILE WITH AN EXT. NAME ’HAN’. OUTPUT3 DATA FILE UITH AN EXT. NAME ’TIM’. 
C 11-JULY-83 H.NAKASONE
CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
      COMMON /SENSE/ X, F
      COMMON /PASS/ FMT
      DOUBLE PRECISION VN(9)
      REAL X(5,10,20), F(20)
      LOGICAL*1 FNAMES(14,50),MNAME(14),LNAME(14),FMT(60)
      LOGICAL*1 OUTEXT(3),OUTDEV(3),EXT(3),DEV(3)
      DATA OUTEXT(1)/'T'/,OUTEXT(2)/'I'/,OUTEXT(3)/'M'/
      DATA OUTDEV(1)/'D'/,OUTDEV(2)/'K'/,OUTDEV(3)/'1'/
      DATA VN(1)/' AVFF'/,VN(2)/' SDFF'/,VN(3)/' DLFF'/
      DATA VN(4)/' PERJIT'/,VN(5)/' PERSIM'/,VN(6)/' FFMAX'/
      DATA VN(7)/' FFMIN'/,VN(8)/' FFRNG'/,VN(9)/' DAMP'/
      NP = 5
      NG = 10
      ND = 9
      TYPE 10
10    FORMAT(' ENTER MASTER FILE NAME (14A1) >',$)
      READ(5,20) (MNAME(N),N=1,14)
20    FORMAT(14A1)
      TYPE 30
30    FORMAT(' ENTER EXT NAME OF INPUT DATA FILE >',$)
      READ(5,35) (EXT(M),M=1,3)
      TYPE 40
40    FORMAT(' ENTER DEV NAME OF INPUT DATA FILE >',$)
      READ(5,35) (DEV(M),M=1,3)
      CALL RDFILE(MNAME,FNAMES,EXT,DEV)
C TRANSFER DATA TO ARRAY X.
D     TYPE 90
D90   FORMAT(1X,'AVFF',T8,'SDFF',T16,'DLFF',T24,'PERJIT',
D    &T32,'PERSIM',T40,'FFMAX',T48,'FFMIN',T56,'FFRNG',
D    &T64,'DAMP'//)
      K = 0
      DO 100 I = 1, NG
      DO 200 J = 1, NP
      K = K + 1
      CALL ASSIGN(12,FNAMES(1,K),14,'OLD')
      READ(12,210) (LNAME(M),M=1,10),FMT
210   FORMAT(10A1,60A1,/)
      READ(12,FMT) (X(J,I,KD),KD=1,ND)
D     WRITE(7,220) (X(J,I,KD),KD=1,ND)
D220  FORMAT(1X,8F7.2,F10.2)
      CALL CLOSE(12)
200   CONTINUE
      TYPE 9
9     FORMAT(1X,/)
100   CONTINUE
D     TYPE 90
      CALL SUBP(NP,NG,ND)
      PAUSE ' ADJUST TO THE TOP OF NEW PAGE.'
      WRITE(7,300) (MNAME(M),M=1,14)
300   FORMAT(10X,'SUMMARY OF F-RATIO''S: ',14A1,//,
     &10X,'FEATURE',15X,'F-RATIO',/,10X,'-------',15X,
     &'-------',/)
      DO 310 I = 1, ND
      WRITE(7,320) I,VN(I),F(I)
320   FORMAT(11X,'X(',I1,')',3X,A8,F12.3,/)
310   CONTINUE
      TYPE 330
330   FORMAT(10X,' ------------------------------- ')
      TYPE 9
      TYPE 9
C NORMALIZE
      CALL ZNORM(X,NP,NG,ND)
C CHANGE EXTENSION NAME.
      DO 350 I = 1, 50
      DO 360 J = 1, 3
      FNAMES(J,I) = OUTDEV(J)
      FNAMES(J+11,I) = OUTEXT(J)
360   CONTINUE
350   CONTINUE
C WRITE THE NORMALIZED DATA FILE WITH NEW EXT NAME.
0 1 O 400 I = 19 50 CALL ASSIGN<129FNAMES<191)9149’NEU’) WRITE<12720) (FNAMES(M91):M=1714) J = J + 1 IF(J.LE.NP) GO TO 410 J = 1 K = K + 1 410 CONTINUE DO 420 ND = 17 ND wRITE(12r425) X(JyKyKD) H H J K D 425 FORMAT(F14.6) 420 CONTINUE CALL CLOSE(12) 400 CONTINUE CALL EXIT END 'J ._-"-F‘:' 204 CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH PROGRAM NAKR03.FOR NAKR03 IS DESIGNED TO EXECUTE VOICE IDENTIFICATION OPERATION FOR DgglGNSS (CROSS-TRANSMISSION DY COMPOSITE PARAMETER OF FFC AND IDS L ). - ' NAKR03 REQUIRES 4 MASTER FILES, EACH CONTAINING SO FILE NAMES OF PATTERNS. 13-JULY-83 H. NAKASONE DEPARTMENT OF AUDIOLOGY AND SPEECH.SCIENCES MICHIGAN STATE UNIVERSITY C C C C C C C C C E SUBPROGRAMS CALLED: SBUEITvRDFILE92NORMyAND SUBSET. C C C C C C EAST LANSING, MICHIGAN C C HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH REAL TXU(5910920)vTXK(5910920) REAL XU(5v10920)9 XK(5910T20)9FUEIT(10)vTUEIT(9) LOGICALXI MFILUF(14)vMFILUT<14)vMFILKF(14)rMFILKT<14) LOGICAL*1 UFILEF(14950)vUFILET(14950)rKFILEF<14vSO) LOGICALXI KFILET(14950)vFINEXT(3)vTINEXT(3)rINDEV(3) LOGICALXI LFILE(14)9DEDUG9YNvDEL DATA TINEXT<1)/’T’/vTINEXT(2)/’I’/vTINEXT(3)/’M’/ DATA INDEV(1)/’D’/rINDEV(2)/’K’/rINDEV(3)/’1’/ DATA BEL/'007/yDEBUG/.FALSE./ FORMAT(’+’7A1) FORMAT(Al) FORMAT(1X9/) 4 FORMAT(3A1) H‘OWV NDF = 10 NDT = 9 ND NDF + NDT NP 5 N6 10 DEBUG = .FALSE. TYPE 4500 - 4500 FORMAT(’ DEBUG ?’9$) ACCEPT 45101YN 4510 FORMAT(AIL TYPE 1 FORMAT(’ PROGRAM NAKRO3 - Experiment III (Cross-Transmission)’/ 8’ A Pattern contains features from Freeuencs and Time Domain.’/ /) TYPE 10 10 FORMAT(’ UNKNOWN MASTER FILE FROM FREO.(14A1) >’y$) READ(SvZO) MFILUF 20 FORMAT(14A1) TYPE 30 30 FORMAT(’ KNOUN MASTER FILE FROM FREO.(14A1) >’r$) READ(592O) MFILKF 205 TYPE 40 40 FORMAT(’ EXT NAME OF FREQ. 
FILE (FRI 0R FR a, READ(SySO) FINEXT L) , '$> 50 FORMAT(ZAI) CALL RDFILE CALL ZNORM CALL ZNORM(TXU9NP:NGyNDF) CALL ZNCRN TYPE 9 TYPE 9 THE MINIMUM SET DISTANCE CALL SUBSET(XU9XK1NP7NG’ND) TYPE 9 TYPE 9 IFfl.NOT.DEBUG) GO TO 9999 TYPE 6000 FORMAT(’ WANT DIFFERENT FEATURES COMBINATION ?’r$) ACCEPT 60089YN FORMAT(Al) . IF(YN.NE.89) GO TO 9999 TYPE 6050 FORMAT(’ FOR FREQUENCY DOMAIN:’/) CALL SBWEIT TYPE 6060 FORMAT(’ FOR TIME DOMAIN:’/) CALL SBWEITSTWEITINDT) CALL EXIT DO 6100 J = 19 N6 DO 6200 I = 1r NP DO 6300 K = 1: NDF XU LOGICALXl MFILU(14) ‘ LOGICALII UFILE(14950)7 INEXT(3)!INDEU(3) LOGICALXI OUTEXT<3>90UTDEV(3)9 YN9 DEBUG DATA INDEV(1)/’D’/vINDEV(2)/’K’/9INDEV(3)/’1’/ DATA DEBUG/.FALSE./ 8 FORMAT(AI) 9 FORMAT(1X9/) TYPE 10 10 FORMAT(’ PROGRAM NAKROA: VOICE I.D. WITHIN TRANSMISSION.’//) TYPE 15 15 FORMAT(’ ENTER MASTER FILE NAME (14A1) 3’95) . READ(5v20) MFILU 20 FORMAT(14A1) TYPE 25 25 FORMAT(’ ENTER EXT NAME (FRI, FRL, OR TIM) fi’r$) C FRI = IDS C FRL = LTS C TIM = FFC READ(5r22) INEXT 22 FORMAT(3A1) TYPE 26 26 FORMAT(’ ENTER NUMBER OF DIMENSIONS >’s$) READ(5927) ND 27 FORMAT(I3) CALL RDFILE9 MFILE(14)vINEXT(3)9INDEV(3) LOGICAL¥1 0UTEXT(3)90UTDEV(3)rBEL!FMT(60) DATA OUTDEV(1)/’D’/vOUTDEV(2)/’K’/yOUTDEV(3)/’1’/ DATA DEL/'007/ 8 FORMAT(Al) 9 FORMAT(’+’74X:’*’,$) NUMF = so 1000 TYPE 10 10 FORMATC’ MASTER FILE NAME (14A1) >'.$) READ(5.20) MFILE ‘20 FORMAT(14A1) TYPE 30 30 FORMAT(’ EXT NAME OF INPUT DATA FILE (3A1) D'.s> READ(5.40) INEXT 40 F0RMAT<3A1) TYPE 45 4s FORMAT(’ EXT NAME OF OUTPUT DATA FILE (3A1) P'.s) READ(5.40) OUTEXT TYPE so so FORMAT(’ DEV NAME OF INPUT DATA FILE (3A1) 5’95) READ(Sv40) INDEV TYPE 6 6 F0RMAT<1X./) CALL RDFILE(MFILE9FNAMESvINEXTrINDEVsNUMF) TYPE 9 DO 100 I = 1' NUMF CALL ASSIGN(127FNAMES(19I)914v’OLD’) READ(127FMT) DUMv DUM9 (X(I'J)'J=1!100) CALL CLOSE(12) 100 CONTINUE ,-I . 3"... 212 TYPE 9 DO 110 I = 19 NUMF DO 120 J = 19 3 FNAMES’9$) READ(5920) (NAMPRX(M)9M=1914) DEBUG = .FALSE. 
TYPE 4500 00 FORMAT(’ DEBUG ?’9$) ACCEPT 45109YN IF(YN.EQ.89) DEBUG = .TRUE. 10 FORMAT(Al) - CALL SBWEIT READ(12920) LFILE DO 120 K = 1. ND READ(129150) XU(I9J9K) FORMAT(F14.6) XU(I9J9K) = XU(I9J9K) * WEIGHT(K) CONTINUE CALL CLOSE(12) CALL ASSIGN<139KFILE<19KT)9149’OLD’) READ(13920) LFILE DO 130 K = 19 ND READ(139150) XK(I9J9K) XK(I9J9K) = XK(I9J9K) X WEIGHT(K) CONTINUE CALL CLOSE(13) CONTINUE CONTINUE IF(INEXT(1).EQ.84) GO TO 200 -—- STANDARDIzE DATA --- CALL ZNORM(XU9NP9NG9ND) CALL ZNORM(XK9NP9NG9ND) IF(DEBUG) TYPE 4210 FORMAT(’ NORMALIZATION DONE.’/) CALL EUCMAT(XU9XK9NAMPRX9ND9NP9NG) IF(DEBUG) TYPE 4215 FORMAT(’ EUCMAT RETURNED. NOW CHAINING TO DRVSAV.’//) CALL CHAIN(PROGS990) CALL EXIT END CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH FILE: FILFIL.FOR FILFIL IS TO CREATE A MASTER FILE OF FILE NAMES. MAXIMUM NUMBER OF FILES CAN BE STORED IN A MASTER FILE IS CURRENTLY 100. HIROTAKA NAKASONE9 22-MAY-1983 DEPARTMENT OF AUDIOLOGY AND SPEECH SCIENCES9 MSU. C C C C C C C C C C TO COMPILE: FILFIL=FILFIL/U C TO LINK : FILFIL=FILFIL9LCHECK9SYSLIB/F CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH COMMON /NAME/FNAMES COMMON /NUMF/NUMF DOUBLE PRECISION EXT LOGICALXl FNAMES(159100)9LETR(9)9YN9LNAME<15>9FMT(BO) LOGICAL*1 NEXT(3) DATA LNAME(l)/’D’/9LNAME(2)/’K’/9LNAME(4)/’:’/ DATA LNAME(10)/’.’/ LNAME(15) = 0 FORMAT(1A1) FORMAT(I3) FORMAT(6A1) FORMAT(14A1) FORMAT(12A1) FORMAT(15A1) FORMAT(3A1) FORMAT(1X9’ WAIT.’) FORMAT(1X9/) ('1 omnomauww TO HERE TO ENTER ALL FILES. TYPE 30 . 
30    FORMAT(' ENTER NUMBER OF FILES (I3) >',$)
      READ(5,40) NUMF
40    FORMAT(I3)
      TYPE 50
50    FORMAT(' SPECIFY EXT NAME (3A1) >',$)
      READ(5,7) (NEXT(N),N=1,3)
      CALL IRAD50(3,NEXT(1),EXT)
      TYPE 60
60    FORMAT(' ENTER ALL FILES BELOW (6A1).',//)
      DO 70 I = 1,NUMF
75    TYPE 80,I
80    FORMAT(' FILE #',I3,' >',$)
      IF(LCHECK(EXT,ICHAN,I).GT.0) GO TO 79
      TYPE 90
90    FORMAT(1X,'*FATAL ERROR*')
      STOP
79    CALL CLOSEC(ICHAN)
      CALL IFREEC(ICHAN)
70    CONTINUE
      IE = 5
      WRITE(7,100)
100   FORMAT(//,1X,' TABLE OF FILE NAMES '//)
      DO 120 J = 1, NUMF
      WRITE(7,130) (FNAMES(M,J),M=1,15),J
130   FORMAT(1X,15A1,'[',I3,']')
120   CONTINUE
145   TYPE 9
      TYPE 140
140   FORMAT(' OK ?',$)
      ACCEPT 1,YN
      IF(YN.EQ.89) GO TO 200
C     THEN DO SOME CORRECTION BUSINESS HERE.
      TYPE 150
150   FORMAT(' SPECIFY THE FILE NUMBER TO BE CORRECTED >',$)
155   ACCEPT 2,I
      TYPE 160,I
160   FORMAT(' ENTER THE CORRECT FILE NAME FOR #',I3,' >',$)
      IF(LCHECK(EXT,ICHAN,I).LT.0) STOP 'FATAL ERROR'
      CALL CLOSEC(ICHAN)
      CALL IFREEC(ICHAN)
      GO TO 145
200   TYPE 210
210   FORMAT(' ENTER MASTER FILE NAME(DEV:FILNAM.EXT) >',$)
      READ(5,4) (LNAME(L),L=1,14)
      CALL ASSIGN(12,LNAME,14,'NEW')
      WRITE(12,220) (LNAME(L),L=1,14),NUMF
220   FORMAT(14A1,I3,' (5E15.7)')
      DO 230 I = 1,NUMF
      WRITE(12,4) (FNAMES(M,I),M=1,14)
230   CONTINUE
      CALL CLOSE(12)
      CALL EXIT
      END

C FILE: ZNORM.FOR
CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
C SUBROUTINE ZNORM: TO STANDARDIZE FEATURE VALUES OF SPEECH
C PARAMETERS, IDS, FFC, AND LTS.
C Written by: Hirotaka Nakasone
C Date: 11 JULY, 1983
C Dept. of Audiology and Speech Sciences, MSU, East Lansing, MI.
CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
      SUBROUTINE ZNORM(X,NP,NG,ND)
      REAL X(NP,NG,ND)
C NP = NUMBER OF PATTERNS
C NG = NUMBER OF GROUPS, OR SPEAKERS
C ND = NUMBER OF DIMENSIONS, OR SAMPLES / SPEAKER
      NPAT = NP * NG
2     FORMAT(1X,'NORMALIZATION BY Z-TRANSFORM.'/)
      DO 10 J=1,ND
      RM = 0.0
      SD = 0.0
      DO 20 I = 1, NG
      DO 25 K = 1, NP
      RM = RM + X(K,I,J)
      SD = SD + X(K,I,J)*X(K,I,J)
25    CONTINUE
20    CONTINUE
      RM = RM/FLOAT(NPAT)
      SD = (SD/FLOAT(NPAT) - RM*RM)**0.5
      IF(SD.EQ.0.) GO TO 40
      DO 30 I = 1, NG
      DO 35 K = 1, NP
      X(K,I,J) = (X(K,I,J)-RM)/SD
35    CONTINUE
30    CONTINUE
      GO TO 10
40    CONTINUE
      DO 50 I = 1, NG
      DO 45 K = 1, NP
      X(K,I,J) = 0.0
45    CONTINUE
50    CONTINUE
10    CONTINUE
      RETURN
      END

CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
C SUBROUTINE RDFILE.FOR.
C RDFILE IS TO READ ALL THE FILES IN A MASTER FILE.
C 11 JULY 1983
C H. NAKASONE
CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
      SUBROUTINE RDFILE