ROBUST SIGNAL PROCESSING METHODS FOR MINIATURE ACOUSTIC SENSING, SEPARATION, AND RECOGNITION

By

Amin Fazel

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Electrical Engineering

2012

ABSTRACT

ROBUST SIGNAL PROCESSING METHODS FOR MINIATURE ACOUSTIC SENSING, SEPARATION, AND RECOGNITION

By Amin Fazel

One of several emerging areas where micro-scale integration promises significant breakthroughs is the field of acoustic sensing. However, separation, localization, and recognition of acoustic sources using micro-scale microphone arrays pose a significant challenge due to fundamental limitations imposed by the physics of sound propagation. The smaller the distance between the recording elements, the more difficult it is to measure localization and separation cues, and hence the more difficult it is to recognize the acoustic sources of interest. The objective of this research is to investigate signal processing and machine learning techniques that can be used for noise-robust acoustic target recognition using miniature microphone arrays.

The first part of this research focuses on designing "smart" analog-to-digital conversion (ADC) algorithms that can enhance acoustic cues in sub-wavelength microphone arrays. Many source separation algorithms fail to deliver robust performance when applied to signals recorded using high-density sensor arrays where the distance between sensor elements is much less than the wavelength of the signals. This can be attributed to the limited dynamic range (determined by analog-to-digital conversion) of the sensor, which is insufficient to overcome the artifacts due to large cross-channel redundancy, non-homogeneous mixing, and the high dimensionality of the signal space. We propose a novel framework that overcomes these limitations by integrating statistical learning directly with the signal measurement (analog-to-digital) process, which enables high-fidelity separation of linear instantaneous mixtures. At the core of the proposed ADC approach is a min-max optimization of a regularized objective function that yields a sequence of quantized parameters which asymptotically tracks the statistics of the input signal. Experiments with synthetic and real recordings demonstrate consistent performance improvements when the proposed approach is used as the analog-to-digital front-end to conventional source separation algorithms.

The second part of this research focuses on investigating a novel speech feature extraction algorithm that can recognize auditory targets (keywords and speakers) using noisy recordings. The features, known as Sparse Auditory Reproducing Kernel (SPARK) coefficients, are extracted under the hypothesis that the noise-robust information in the speech signal is embedded in a subspace spanned by sparse, regularized, over-complete, non-linear, and phase-shifted gammatone basis functions. The feature extraction algorithm involves computing kernel functions between the speech data and a pre-computed set of phase-shifted gammatone functions, followed by a simple pooling technique (a "MAX" operation). In this work, we present experimental results for a hidden Markov model (HMM) based speech recognition system whose performance has been evaluated on the standard AURORA 2 dataset. The results demonstrate that the SPARK features deliver significant and consistent improvements in recognition accuracy over the standard ETSI STQ WI007 DSR benchmark features.
We have also verified the noise-robustness of the SPARK features for the task of speaker verification. Experimental results based on the NIST SRE 2003 dataset show significant improvements when compared to a standard Mel-frequency cepstral coefficient (MFCC) based benchmark.

I dedicate this dissertation to my parents, for their love and support.

ACKNOWLEDGMENT

I would like to take this opportunity to acknowledge several people who have been particularly inspiring and helpful to my research work. First I would like to thank my advisor, Dr. Shantanu Chakrabartty, for the tremendous time, energy, and wisdom he invested in my Ph.D. education. I would like to thank the other members of my Ph.D. thesis committee, Prof. Hayder Radha, Prof. Lalita Udpa, and Dr. Rong Jin, for their valuable feedback, suggestions, and time that helped me improve the quality of this work. I would like to thank my past and current colleagues at the AIM lab, who shared their friendship with me: Amit Gore, Yang Liu, Chenling Huang, and Ming Gu. Finally, there is no way I would be where I am today without the immeasurable love, support, and encouragement of my parents, Manouchehr Fazel and Behjat Kazemi, and my lovely fiancee Soodabeh.

TABLE OF CONTENTS

List of Tables
List of Figures

1 Introduction
1.1 Motivations and applications
1.2 Miniature acoustic recognition system
1.3 Scientific contributions
1.4 Dissertation organization

2 Smart Audio Signal Acquisition Devices
2.1 Motivation for smart audio signal acquisition devices
2.2 Signal acquisition in miniature microphone array

3 Sigma-Delta Learning
3.1 Stochastic gradient descent and Σ∆ modulators
3.1.1 Σ∆ Learning
3.1.2 Resolution Enhancement
3.2 Acoustic source separation
3.3 Experimental results
3.3.1 Numerical evaluation
3.3.2 Experiments with far-field model
3.3.3 Experiments with real microphone recordings

4 Robust Acoustic Recognition
4.1 Fundamentals of speech
4.2 Architecture of an acoustic recognition system
4.3 Speech acquisition and feature extraction module
4.4 Speech and speaker modeling
4.4.1 Generative Models
4.4.2 Discriminative Models
4.5 Robust acoustic recognition
4.5.1 Robust Feature Extraction
4.6 Robust speaker modeling
4.7 Score normalization
5 Hierarchical Kernel Auditory Features
5.1 Motivation for hierarchical kernel auditory features
5.2 Hierarchical architecture
5.2.1 Regularized kernel optimization
5.2.2 Pooling mechanism
5.3 Sparse auditory reproducing kernel coefficients
5.4 Experiments and performance evaluation
5.4.1 Speech recognition setup
5.4.2 Speaker verification setup

6 Concluding Remarks and Future Directions
6.1 Summary and concluding remarks
6.2 Future directions

Bibliography

LIST OF TABLES

3.1 Performance (SDR (dB)) of the proposed Σ∆ for the real data at different oversampling ratios.
5.1 AURORA 2 clean training word accuracy results when ETSI FE is used.
5.2 AURORA 2 word recognition results when the conventional Gammatone filter-bank (GT) is used.
5.3 AURORA 2 clean training word accuracy results when ETSI AFE is used.
5.4 The effect of different time-shifts on the SPARK features.
5.5 The effect of different kernel functions on the SPARK features.
5.6 The effect of different pooling mechanisms (different ζ) when Ψ = max ζ(|b|) and K(x, y) = tanh(0.01xy^T − 0.01).
5.7 The effect of different pooling mechanisms (different ζ) when Ψ = ζ(∑|b|) and K(x, y) = tanh(0.01xy^T − 0.01).
5.8 The effect of different pooling mechanisms (different ζ) when Ψ = ζ(∑|b|) and K(x, y) = (xy^T)^4.
5.9 The effect of λ on extracting the SPARK features.
5.10 AURORA 2 word recognition results when SPARK and PBS were used together.
5.11 AURORA 2 clean training word accuracy results.

LIST OF FIGURES

1.1 Motivation: Offline data collection and acoustic recognition (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation).
1.2 Acoustic recognition system composed of four main sub-systems.
1.3 Architecture of the "smart" signal acquisition device.
1.4 Hierarchical model of auditory feature extraction.
2.1 System architecture where the source separation algorithm is applied (a) after quantization, (b) after analog projection and quantization.
2.2 Far-field recording on a miniature microphone array.
3.1 Illustration of the two-dimensional signal distribution for: (a) the input signals; (b) signals obtained after transformation B; and (c) signals obtained after resolution enhancement.
3.2 Architecture of the proposed sigma-delta learning applied to a source separation problem.
3.3 (a) System architecture of a first-order Σ∆ modulator, (b) input-output response of a single-bit quantizer, and (c) illustration of "limit-cycle" oscillations about the minima of the cost function C(v).
3.4 One-dimensional piece-wise linear regularization functions and the multi-bit quantization function as its gradient.
3.5 Limit cycle behavior using bounded gradients.
3.6 Reconstruction of the sources using conventional and proposed Σ∆ with OSR=1024.
3.7 Evaluating the reconstruction of the sources for classical (without), learning (with), and learning with resolution enhancement (with+) Σ∆ at different OSR for log2(condition number) of (a) 10 and (b) 12.
3.8 Evaluating the reconstruction of the sources for classical (without), learning (with), and learning with resolution enhancement (with+) Σ∆ at different condition numbers for OSR of (a) 256 and (b) 512.
3.9 Evaluating the reconstruction of sources at different dimensions for the learning Σ∆ at different condition numbers for OSR of 128.
3.10 SDR corresponding to with/without Σ∆ learning for the near-far recording conditions using (a) SOBI and (b) EFICA algorithms.
3.11 Σ∆ performance with and without learning for three speech signals corresponding to the far-field model.
3.12 Spectrogram of the recorded signals (top row) and recovered signals using Σ∆ without learning (middle row) and with learning (bottom row).
4.1 Fundamentals of speech: (a) Magnetic resonance image showing the anatomy of the speech production apparatus. The properties of the speech signal are determined by the shape of the vocal tract and the orientation of the mouth, teeth, and nasal passages. (b) Spectrograms corresponding to a sample utterance "fifty-six thirty-five seventy-two" for a male and a female speaker.
4.2 Functional architecture of a speaker verification system as an example of acoustic recognition, which consists of two main phases: (a) a training/enrollment phase where the parameters of a speaker-specific statistical model are determined and (b) a recognition/verification phase where an unknown speaker is authenticated using the models trained during the training phase.
4.3 Example of generative models that have been used for speech/speaker recognition: (a) HMMs where each state has a GMM which captures the statistics of a stationary segment of speech. (b) HMMs are trained by aligning the states to the utterance using a trellis diagram. Each path through the trellis (from start to end) specifies a possible sequence of HMM states that generated the utterance.
4.4 Discriminative models: (a) General structure of an SVM with radial basis functions as the kernel. (b) Structure of a multi-layer ANN consisting of two hidden layers. (c) An example of a kernel function K(x, y) = (x·y)^2, which maps a non-linearly separable classification problem (left) into a linearly separable problem (right) using a non-linear mapping Φ(·).
4.5 Functional architecture of an SVM-based speaker verification system: (left) the extracted features are first aligned, reduced, and normalized. The speaker-specific and speaker non-specific features are combined to create a dataset used for SVM training. (right) The soft-margin SVM determines the parameters of a hyperplane that separates the target and non-target datasets with the maximum margin.
4.6 An example of fusion of low-level and high-level features for the speaker verification system.
4.7 (a) Equivalent model of additive and channel noise in an acoustic recognition system. (b) Different techniques used for designing robust acoustic recognition systems.
5.1 A set of 25 gammatone kernel basis functions with center frequencies spanning 100 Hz to 4 kHz in the ERB space.
5.2 Acyclic convolution matrix Φi for gammatone basis function ϕi.
5.3 Signal flow of the SPARK feature extraction.
5.4 Colormaps depicting b vectors (left column) and IDCT of SPARK feature vectors (right column) obtained for utterances of digits "1" and "9", respectively.
5.5 Signal flow of the MFCC feature extraction.
5.6 Signal flow for the conventional Gammatone filterbank features; note that this figure shows each frame of speech after the two steps of pre-emphasis and windowing.
5.7 Speech recognition accuracy obtained in additive noisy (subway and babble) environments on the AURORA 2 database.
5.8 Speech recognition accuracy obtained in additive noisy (car and exhibition) environments on the AURORA 2 database.
5.9 Speech recognition accuracy obtained in additive noisy (restaurant and street) environments on the AURORA 2 database.
5.10 Speech recognition accuracy obtained in additive noisy (airport and station) environments on the AURORA 2 database.
5.11 Speech recognition accuracy obtained in different convolutive noisy environments on the AURORA 2 database.
5.12 An example of a DET curve, which plots the FRR with respect to the FAR.
5.13 DET curve comparing MFCC-CMN and SPARK features.

Chapter 1

Introduction

One of several emerging areas where micro/nano-scale integration promises significant breakthroughs is the field of acoustic sensing, separation, and recognition. For example, it is envisioned that the next generation of intelligent hearing devices will integrate hundreds of micro/nano-scale microphones, separate speech from noise, track conversations in cluttered environments, and thus provide significant improvements in speech intelligibility for individuals with hearing impairments.
Sensing, separation, and recognition of acoustic sources using micro/nano-scale microphone arrays, however, pose significant challenges in the area of robust signal processing. The objective of this research is to develop theory and algorithms for robust acoustic recognition systems using miniature microphone arrays and to investigate the use of these devices in real-world applications.

1.1 Motivations and applications

Acoustic sensing and recognition have been widely used in applications ranging from bioacoustics to military devices. In bioacoustics [1, 2, 3, 4], acoustic sensing and recognition systems have been used by ornithologists to study how bird species interact with their environment. Acoustic technology is particularly important in bioacoustics in places where visibility is limited, such as rain forest environments. In military applications [5, 6, 7, 8, 9, 10], acoustic sensing and detection systems detect an acoustic event, such as a sniper's weapon firing or a door slamming, and then use that information for further action. These acoustic sensing/detection devices are usually mounted on robotic vehicles, providing commanders with overall situational awareness. Target detection and tracking systems also partially benefit from acoustic sensing technology [11, 12]. Acoustic sensing has also been used in electronic textiles (e-textiles), with applications mostly in military equipment [13, 14]. Acoustic sensing and recognition systems have also been utilized in intelligent transportation technology for purposes such as speed monitoring, traffic counting, and vehicle detection and classification [15, 16]. Railroad systems have also benefited from acoustic devices for bearing health monitoring [17, 18].

Micro/nano-scale acoustic recognition is one of the emerging areas in miniaturization, with applications across disciplines. In some applications, the physical constraints themselves motivate the use of miniature acoustic recognition systems. Some of these applications, like hearing aids, require the source of interest to be far away from the recording device. In such cases, it is very beneficial to use an array of microphones so that spatial information can be exploited for environmental noise compensation. However, acoustic sensing with miniature microphone arrays introduces a key challenge: "high fidelity" imaging of the acoustic events in the surrounding environment. Another challenge comes from the recognition system itself, where the objective is to robustly recognize acoustic events in real environments, a problem that has attracted a large body of research in the signal processing community. This research addresses these challenges and is particularly interested in micro/nano-scale acoustic source separation and recognition systems where the multiple recording elements are placed very close to each other, so that the recording condition can be viewed as far-field.

Figure 1.1: Motivation: Offline data collection and acoustic recognition (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation)

Fig. 1.1 shows the big-picture motivation behind this research. In such systems, an offline data collection provides the training data for the recognition system.
This data is collected using the miniature microphone array in an environment that has as little mismatch as possible with the online recognition situation. This data is then used to build an acoustic model. In online acoustic recognition, the acoustic event is first captured using the miniature microphone array, and appropriate features are then extracted from the recorded data. Using the acoustic model generated offline, an acoustic target is recognized from the extracted features. Usually the acoustic model in the offline training procedure is built with the same features as the online recognition system. To further motivate this research, two specific applications of miniature acoustic sensing and recognition are presented below.

Biometric systems: One of the technologies where miniature microphone arrays promise a breakthrough is speech-based biometric systems. Speech-based biometrics, such as speaker verification/identification, is a popular identification technique used for authenticating and monitoring human subjects using their speech signal [19, 20]. The method is attractive because it does not require direct contact with the individual, thus avoiding the hurdle of "perceived invasiveness" inherent in most biometric systems like iris and fingerprint recognition. To date, most commercial implementations of speaker verification systems have been designed for enterprise applications that utilize large-scale computing resources and infrastructure [21, 22, 24, 23]. However, with the proliferation of portable devices there has been an increasing demand for small-scale speaker verification systems, and these systems therefore demand recognition performance that is robust to variable background noise and to channel (microphone) conditions. Homeland security and surveillance applications might require identification/verification of target speakers in a given environment, in which case the speaker may not be in proximity to the microphone. The enrollment data corresponding to the target speaker could be limited and could have been acquired from unconventional sources (video/audio tapes, over the network, or from archives). Unfortunately, even though most existing speech-based recognition systems deliver acceptable accuracy under controlled/clean conditions, their performance degrades significantly when subjected to the noise present in practical environments, especially in applications where the speaker is not in the proximity of the recording elements [25, 26, 27]. Thus, acquisition and recognition of the speech signal for a small-scale speech-based biometric system requires both a high-resolution signal and a robust recognition system; this research addresses these problems by employing techniques for both super-resolution recording and robust recognition.

Hearing aids: Hearing aid users, who made up only 20% of all hearing-impaired individuals in 2004 [28], typically have difficulty listening to a speaker in a noisy environment. This difficulty comes from the fact that conventional hearing aids amplify all received sounds without discriminating between the speaker of interest and background noise. The ideal hearing aid would provide the functionality of the binaural hearing system, delivering signals that allow the patient to focus on a single speaker.
The ability to record high-quality speech signals using microphone arrays, together with the limited success of single-microphone hearing aids, motivates microphone-array hearing aids [29, 30, 31]. Using microphone arrays for hearing aids provides spatial information about the received signals, which can be used to focus on a specific speaker in the surrounding environment. A substantial body of research on speech array processing already exists [34]. Although the basic principles of speech array processing can be applied to hearing aid applications, several other problems need to be addressed. One of them is the cosmetic consideration that limits hearing aid designs to miniature/compact microphone arrays. Using miniature/compact microphone arrays makes the spatial aperture much smaller relative to the wavelength of the speech signal. Another problem is the noisy environment in which the hearing aid must operate. Although this work does not directly address the specific needs of hearing aid applications, the techniques suggested here can be used in the next generation of "smart" hearing aids.

System miniaturization is not limited to acoustic sensing applications; it has found uses in other areas as well. For example, in the rapidly emerging field of brain-machine interfaces, it is very common to record from thousands of neurons using micro-electrode sensor arrays [79, 80, 81]. The recorded signals are then processed to extract useful information for controlling the movement of a prosthetic device. Another example comes from the area of wireless communication, where a 16-element transmitter array is used with sub-wavelength inter-element spacing between sensors [35].

Figure 1.2: Acoustic recognition system composed of four main sub-systems

1.2 Miniature acoustic recognition system

In a micro/nano-scale setting, this research proposes an acoustic recognition system composed of four main components: (a) "smart" signal acquisition, (b) source separation, (c) feature extraction, and (d) acoustic recognition. Fig. 1.2 shows a block diagram of the proposed system.

The "smart" signal acquisition unit is used to record the acoustic signals with high fidelity by using as much dynamic range as possible in the analog-to-digital conversion module. This unit uses a miniature microphone array in order to record the signal of interest with high fidelity. Note that in many applications where a miniature acoustic recognition system can be used, the signal of interest is far away from the recording elements, hence the need for a microphone array. The proposed signal acquisition device performs spatial sensing of the signal along with the analog-to-digital process: we have formulated the analog-to-digital modulation within the framework of statistical learning such that the algorithm retrieves the spatial manifold which contains the information for decorrelating the signal wavefront. In the current research, a min-max optimization approach is used to model the signal de-correlation and analog-to-digital conversion with a single cost function. In order to optimize the cost function, stochastic gradient descent and ascent algorithms are employed. The stochastic gradient descent is used to minimize the cost function with respect to the internal state of the system, which yields the analog-to-digital conversion module.
The stochastic gradient ascent maximizes the cost function with respect to the signal transformation that minimizes the input correlation, hence decorrelating the input signals. The architecture of the "smart" signal acquisition is shown in Fig. 1.3, where the input is a time-varying analog signal x. This system consists of a matrix-vector multiplier which transforms the input signal x into Bx, where B denotes a linear transformation matrix. This transformed signal is then processed by an array of analog-to-digital converters to produce a binary data stream d along with the spatial information Λ and B. An adaptation unit uses the binary output d to update B and Λ.

Figure 1.3: Architecture of the "smart" signal acquisition device

The source separation system is used to separate the source of interest from the rest of the acoustic events and to provide a high-quality speech signal for the recognition system. The assumption here is that the input speech signals are statistically independent from each other; therefore, conventional independent component analysis (ICA) algorithms are applied to separate the sources. After separating the source of interest from the mixture of different signals, the next step is to extract robust features for recognition.

The feature extraction unit provides speech features that are robust to environmental noise. A hierarchical model is used to extract robust auditory features from the speech signal. This model is based on recent findings in auditory neuroscience indicating that there is hierarchical processing in the human auditory cortex, where the received signal is first broken down into basic features that are later integrated into more complex stimuli. Inspired by this biological data, the proposed hierarchical model consists of two layers of processing, as shown in Fig. 1.4. In the first level of this computational model, the similarity of the sensory auditory world is measured through a kernel-based approach with a set of gammatone basis functions. These simple basis functions represent the so-called spectro-temporal receptive fields (STRFs) in the auditory cortex. In order to implement this, we apply the kernel-based approach to a reproducing kernel Hilbert space (RKHS) spanned by gammatone basis functions. The result of incorporating this a priori information is that these signatures can be extracted in real time using pre-computed projection matrices. All the outputs are then sent to the higher level, where they are integrated in order to generate more complex outputs. In the current research, we explored two different idealized pooling mechanisms, summation ("SUM") and maximization ("MAX"), both with nonlinear weights to integrate the outputs of the previous level. This weighting function also emulates the psychoacoustic nonlinear relation between the intensity of a sound and its perceived loudness. The proposed computational model is very close to the HMAX approach introduced in [63], which models the visual cortex in a hierarchical fashion for the object recognition task. In order to feed these features into a back-end acoustic recognition system, a discrete cosine transform (DCT) is used to decorrelate the features.

Figure 1.4: Hierarchical model of auditory feature extraction.
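To make the two-layer idea concrete, the following is a minimal numerical sketch of the processing chain just described, not the exact SPARK implementation developed in Chapter 5: a bank of gammatone basis functions acts as the first layer, a nonlinearly weighted "MAX" operation pools the first-layer outputs, and a DCT decorrelates the result. The gammatone parameterization, log-spaced center frequencies, frame length, and logarithmic loudness weighting are illustrative assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def gammatone(fc, fs, duration=0.02, order=4):
    """Illustrative 4th-order gammatone basis function at center frequency fc (Hz)."""
    t = np.arange(0, duration, 1.0 / fs)
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)        # equivalent rectangular bandwidth
    g = t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb * t) * np.cos(2 * np.pi * fc * t)
    return g / np.linalg.norm(g)

def hierarchical_features(frame, fs, n_bands=25, n_coeffs=13):
    """Two-layer feature sketch: projections onto gammatone bases (layer 1),
    MAX pooling with a compressive log weighting (layer 2), then DCT decorrelation."""
    fcs = np.geomspace(100, 4000, n_bands)          # center frequencies, 100 Hz - 4 kHz
    pooled = np.empty(n_bands)
    for i, fc in enumerate(fcs):
        g = gammatone(fc, fs)
        # layer 1: similarity of the frame with time-shifted copies of the basis function
        proj = np.convolve(frame, g, mode="same")
        # layer 2: "MAX" pooling with a psychoacoustically inspired log weighting
        pooled[i] = np.log1p(np.max(np.abs(proj)))
    return dct(pooled, norm="ortho")[:n_coeffs]     # decorrelate, keep leading coefficients

# usage: a 25 ms frame of a synthetic tone sampled at 8 kHz
fs = 8000
frame = np.sin(2 * np.pi * 440 * np.arange(0, 0.025, 1 / fs))
print(hierarchical_features(frame, fs).shape)       # -> (13,)
```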
The acoustic recognition unit is used to recognize the acoustic events. Once the feature vectors corresponding to the acoustic events have been extracted, the associated data, also known as training data, is used to build models for the recognition system in an offline process. During the test phase, the trained models are used to recognize a sequence of feature vectors extracted from unknown acoustic events. Although the proposed miniature acoustic recognition system is general and can be used for recognition of any type of acoustic event, in this research we use it for the two tasks of speech recognition and speaker verification.

1.3 Scientific contributions

The conducted research has the following two main scientific contributions:

"Super-resolution" high-density acoustic signal acquisition: In this research a far-field recording condition on a miniature microphone array has been investigated, a research area that has not received much attention. First, a mathematical model has been developed for the miniature microphone array. This model shows that speech signals received at the miniature microphone array can be considered to be in the far-field condition with instantaneous mixing. Then a "smart" signal acquisition system is introduced in order to remove the correlation of the received signals at the analog-to-digital conversion level and to increase the dynamic range of the acquisition system. This method is based on a min-max optimization of a regularized objective function which integrates analog-to-digital conversion with statistical machine learning.

Hierarchical auditory features: This research also proposes a novel, hierarchically organized speech feature extraction method that improves the robustness of the acoustic recognition system by exploiting properties of a functional regression procedure in a reproducing kernel Hilbert space (RKHS) [64, 67, 65, 66]. This method is based on the hypothesis that robustness in the speech signal is encoded in temporal and spectral manifolds (represented here by kernels) which remain intact even in the presence of ambient noise. Under clean recording conditions (laboratory setting), most learning algorithms like hidden Markov models (HMMs) [104] and support vector machines (SVMs) [68] exploit only linear dominant features, which unfortunately can easily be corrupted by ambient noise. RKHS regression endows the proposed features with the following robustness properties:

• The algorithm does not make any prior assumption on noise statistics.
• The algorithm uses kernel methods to extract features that are nonlinear and robust to corruption by noise.
• Robust parameter estimation is ensured by imposing smoothness constraints based on regularization principles.
• The proposed signatures can be extracted in real-time using pre-computed projection matrices.

1.4 Dissertation organization

The dissertation is organized as follows: Chapter 2 motivates "smart" audio signal acquisition systems as substitutes for conventional Nyquist ADCs for miniature microphone arrays. Then a mathematical model for signal acquisition in a miniature/compact microphone array is presented. The model shows that the signal recorded from a miniature array is near singular, and conventional methods of signal acquisition fail to deliver robust performance due to the limited dynamic range of the microphone, which is determined by the analog-to-digital conversion. This limitation stems from the fact that large cross-channel redundancy and non-homogeneous mixing are present in the recorded signal space of a miniature microphone array.
The proposed "smart" signal acquisition device is then presented in chapter 3. The core of this technique is a min-max optimization framework that can efficiently and adaptively quantize non-redundant analog signal sub-spaces, which leads to significant performance improvement for any DSP-based source separation algorithm. The performance of the proposed signal acquisition device is evaluated using synthetic and real recordings. The experiments demonstrate significant and consistent performance improvements when the proposed approach is used as the analog-to-digital front-end to conventional source separation algorithms. A detailed overview of the acoustic signal recognition system is presented in chapter 4. That chapter provides a brief introduction to the statistical pattern recognition techniques that are commonly used in acoustic recognition, including an overview of basic functional units such as speech feature extraction, acoustic modeling, and classification. The chapter then discusses some of the commonly used techniques that make real-world acoustic recognition systems more robust in noisy environments. Chapter 5 introduces a novel hierarchical model to extract auditory speech features. This model uses a regression technique in a reproducing kernel Hilbert space (RKHS) in order to measure the similarity of the sensory auditory world. In this chapter, the theory behind these features, known as Sparse Auditory Reproducing Kernel (SPARK) features, is first described. They are extracted under the hypothesis that the noise-robust information in the speech signal is embedded in a subspace spanned by overcomplete, regularized, and normalized gammatone basis functions. In the last part of this chapter, two benchmarks are presented for acoustic recognition systems: the first is an HMM-based speech recognition system and the second is an SVM-based speaker verification system. Using these benchmarks, the performance of the proposed system is evaluated and compared to conventional acoustic recognition systems. Concluding remarks and future directions for the presented work are discussed in chapter 6.

Chapter 2

Smart Audio Signal Acquisition Devices

2.1 Motivation for smart audio signal acquisition devices

Miniature microphone arrays for sensing acoustic events are becoming more common in different applications. One such application is an acoustic recognition system, where the objective is to identify a person (as in a speaker recognition system) or to convert speech into text (as in a speech recognition system). However, using micro/nano-scale microphone arrays in an acoustic recognition system poses a key challenge: imaging the acoustic events occurring in the environment with high fidelity (spatial and temporal). Due to the dispersive nature of the surrounding media, each element of the sensor array records a mixture of signals generated by the source of interest and other events (noise) in its environment. In order to recognize the source of interest, the proposed recognition system takes advantage of a source separator. However, several factors limit the performance of traditional source separation techniques, and hence the performance of the acoustic recognition system, when the acoustic signal is acquired from miniature/compact microphone arrays:
• Far-field effects: For miniature arrays, sources are usually located at distances much larger than the distance between recording elements. As a result, the mixing of signals at the recording elements is near singular. Recognition of the acoustic source of interest requires separation of that source from nearly ill-conditioned mixtures, which in turn requires super-resolution signal processing to reliably identify the parameters of the separation manifold.

• Near-far effects: For miniature sensor arrays, a stronger source that is closer to the array can mask the signal produced by background sources. Separating the background sources in the presence of the strong masker would again require super-resolution processing of the input signals.

DSP-based source separation algorithms are typically implemented subsequent to a quantization operation (analog-to-digital conversion) and hence do not consider the detrimental effects of the finite resolution of the quantizer. In particular, for a high-density sensor array, a naive implementation of a quantizer that uniformly partitions each dimension of the input signal space (pulse-code modulation) could lead to a significant loss of information. To understand the effect of this degradation, consider the framework of a conventional source separation algorithm as shown in Fig. 2.1(a). The "analog" signal $\mathbf{x} \in \mathbb{R}^M$ recorded at the sensor array is given by $\mathbf{x} = \mathbf{A}\mathbf{s}$, with $\mathbf{A} \in \mathbb{R}^{M \times M}$ being the mixing matrix and $\mathbf{s} \in \mathbb{R}^M$ being the independent sources of interest. This simplified linear model is applicable to both the instantaneous mixing and the convolutive mixing formulations [37]. The recorded signals are first digitized and then processed by a digital signal processor (DSP) which implements the source separation algorithm. For the sake of simplicity, assume that the algorithm is able to identify the correct unmixing matrix $\mathbf{W} = \mathbf{A}^{-1}$, which is then used to recover the source signals $\tilde{\mathbf{s}} \in \mathbb{R}^M$. The effect of quantization in this approach can be captured using a simple additive model as shown in Fig. 2.1(a):

$$\tilde{\mathbf{s}}_d = \mathbf{W}(\mathbf{x} + \mathbf{q}) = \mathbf{s} + \mathbf{A}^{-1}\mathbf{q} \quad (2.1)$$

where $\mathbf{q}$ denotes the additive quantization error introduced during the digitization process. The reconstruction error between the recovered signal $\tilde{\mathbf{s}}_d$ and the source signal $\mathbf{s}$ can then be expressed as

$$\|\tilde{\mathbf{s}}_d - \mathbf{s}\| = \|\mathbf{A}^{-1}\mathbf{q}\| \leq \|\mathbf{A}^{-1}\| \cdot \|\mathbf{q}\| \quad (2.2)$$

where $\|\cdot\|$ denotes a matrix or vector norm [38]. Equation (2.2) indicates that, under ideal reconstruction conditions, the performance of a conventional source separation algorithm is limited by (a) the quantization error (accuracy of analog-to-digital conversion) and (b) the nature of the mixing operation determined by $\mathbf{A}$. For high-density sensors, the mixing typically tends to be ill-conditioned ($\|\mathbf{A}^{-1}\| \gg 1$); as a result, the reconstruction error bounded by equation (2.2) could be large.

Figure 2.1: System architecture where the source separation algorithm is applied (a) after quantization (b) after analog projection and quantization

Now consider the framework shown in Fig. 2.1(b), which is at the core of the proposed resolution enhancement approach. The signals recorded by the sensor array are first transformed by $\mathbf{P}$ (in the analog domain) before being quantized. In this case, the reconstructed signal $\tilde{\mathbf{s}}_m \in \mathbb{R}^M$ can be expressed as

$$\tilde{\mathbf{s}}_m = \mathbf{D}(\mathbf{P}\mathbf{x} + \mathbf{q}). \quad (2.3)$$

For source separation $\mathbf{D}\mathbf{P} = \mathbf{A}^{-1}$, for which the reconstructed signal can now be expressed as

$$\tilde{\mathbf{s}}_m = \mathbf{s} + \mathbf{A}^{-1}\mathbf{P}^{-1}\mathbf{q} \quad (2.4)$$

which leads to the reconstruction error

$$\|\tilde{\mathbf{s}}_m - \mathbf{s}\| = \|\mathbf{A}^{-1}\mathbf{P}^{-1}\mathbf{q}\| \leq \|\mathbf{A}^{-1}\mathbf{P}^{-1}\| \cdot \|\mathbf{q}\|. \quad (2.5)$$

Thus, the reconstruction error can now be controlled by the choice of the transform $\mathbf{P}$ and is not completely determined by the mixing transform $\mathbf{A}$.
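As a quick numerical illustration of the bounds in equations (2.2) and (2.5), the short sketch below quantizes a two-channel, ill-conditioned mixture with and without an analog projection P. Here P is simply a whitening-style transform estimated offline from the mixture; the mixing matrix, quantizer resolution, and source statistics are illustrative assumptions and not the adaptive procedure developed in Chapter 3.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20000, 2
s = rng.laplace(size=(M, N))                      # independent "speech-like" sources
s /= np.abs(s).max(axis=1, keepdims=True)

A = np.array([[1.0, 0.99],                        # nearly singular mixing (far-field regime)
              [0.99, 1.0]])
x = A @ s

def quantize(z, bits=8):
    """Idealized uniform per-channel quantizer (a naive ADC front-end)."""
    scale = np.abs(z).max(axis=1, keepdims=True)
    step = 2.0 / (2 ** bits)
    return scale * (np.round((z / scale) / step) * step)

# (a) conventional path: quantize x, then unmix with W = A^-1 -> error scales with ||A^-1||
s_d = np.linalg.inv(A) @ quantize(x)

# (b) projection path: decorrelate with P before quantization, then apply D = A^-1 P^-1
P = np.linalg.inv(np.linalg.cholesky(np.cov(x)))  # illustrative decorrelating projection
s_m = np.linalg.inv(A) @ np.linalg.inv(P) @ quantize(P @ x)

for name, est in [("without projection", s_d), ("with projection", s_m)]:
    sdr = 10 * np.log10(np.sum(s ** 2) / np.sum((est - s) ** 2))
    print(f"{name}: reconstruction SDR = {sdr:.1f} dB")   # markedly higher SDR with P
```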
An interesting choice of the matrix $\mathbf{P}$ is one that satisfies

$$\|\mathbf{A}^{-1}\mathbf{P}^{-1}\| = 1 \quad (2.6)$$

which ensures that the input signals are normalized before being processed by the DSP-based source separation algorithm. Equation (2.5) then reduces to

$$\|\tilde{\mathbf{s}}_m - \mathbf{s}\| \leq \|\mathbf{q}\| \quad (2.7)$$

and the expected performance improvement when employing the framework in Fig. 2.1(b) over the framework in Fig. 2.1(a) is given by

$$PI = -20 \log \|\mathbf{A}^{-1}\|. \quad (2.8)$$

Equation (2.8) thus shows that for near-singular mixing, $\|\mathbf{A}^{-1}\| \gg 1$, the performance improvement based on the resolution enhancement technique shown in Fig. 2.1(b) could be significant. However, the performance improvement is valid only if the analog projection $\mathbf{P}$ can be precisely and adaptively determined during the process of quantization ("analog-to-digital" conversion). This procedure is unlike traditional multi-channel "analog-to-digital" conversion, where each signal channel is uniformly quantized without taking into consideration the spatial statistics of the input signal. Since the projection $\mathbf{P}$ is also quantized, the precision to which the condition (2.6) is satisfied is also important. In this regard, oversampling "analog-to-digital" converters like Σ∆ modulators are attractive since the topology is robust to analog imperfections and can easily achieve dynamic ranges greater than 120 dB (more than 16 bits of accuracy) [39]. In this research we show that the learning algorithm can efficiently and adaptively quantize non-redundant analog signal sub-spaces, which leads to significant performance improvements for any DSP-based source separation algorithm and hence for the proposed acoustic recognition system. This innovative approach, which is called "super-resolution Sigma-Delta", will be discussed in chapter 3.

2.2 Signal acquisition in miniature microphone array

In this section a mathematical model for a miniature microphone array is presented. The model shows that the signals recorded from the array can be near singular. This model will later be used to show the superior performance of the proposed smart signal acquisition process over a standard ADC. The modeling resorts to the far-field wave propagation models that have been extensively studied within the context of array processing and plenacoustic models [82, 83, 84]. Knowing the plenacoustic function, the actual sound at a desired position in a sound field, especially in a room, can be modeled via the convolution of this function with the source signal.

For modeling purposes, consider the microphone array shown in Fig. 2.2, which consists of two recording elements. If the inter-element distance is much less than the wavelength of the microphone signal of interest, the signals recorded at each of the sensor elements can be approximated using far-field models. For example, for audio signals (100-20,000 Hz), a far-field model can be assumed for microphone arrays with inter-element distances less than 3.4 cm (coherence length). Also, for a miniature microphone array the distance from the sources to the center of the array can be assumed to be larger than the inter-element distance. We express the signal $x_j(\mathbf{p}_j, t)$ recorded at the $j$th microphone as a superposition of $D$ independent sources $s_i(t)$ ($i \in 1, .., D$), each of which is referenced with respect to the center of the array [83].
This can be written as

$$x_j(\mathbf{p}_j, t) = \sum_i c_i(\mathbf{p}_j)\, s_i(t - \tau_i(\mathbf{p}_j)) \quad (2.9)$$

where $c_i(\mathbf{p}_j)$ and $\tau_i(\mathbf{p}_j)$ denote the attenuation and delay of source $s_i(t)$ at position $\mathbf{p}_j$, measured relative to the center of the sensor array; $\mathbf{p}_j$ in equation (2.9) denotes the position vector of the $j$th microphone. Equation (2.9) can be approximated using a Taylor series expansion as

$$x_j(\mathbf{p}_j, t) = \sum_i c_i(\mathbf{p}_j) \sum_{k=0}^{\infty} \frac{(-\tau_i(\mathbf{p}_j))^k}{k!}\, s_i^{(k)}(t) \quad (2.10)$$

Figure 2.2: Far-field recording on a miniature microphone array.

Under far-field conditions it can be assumed that $c_i(\mathbf{p}_j) \approx c_i$ is constant across all the sensor elements. Also, the higher-order terms in the series expansion (2.10) can be ignored, so that it can be expressed as

$$x_j(\mathbf{p}_j, t) \approx \sum_i c_i s_i(t) - \sum_i c_i \tau_i(\mathbf{p}_j)\, \dot{s}_i(t). \quad (2.11)$$

The component $x_c(t) = \sum_i c_i s_i(t)$ signifies the common-mode signal shared by all the recording elements, and the second part of the right-hand side signifies an instantaneous mixture of the derivatives of the source signals. The common-mode component can be canceled using a differential measurement [85], under which equation (2.11) becomes

$$\Delta x_j(\mathbf{p}_j, t) = x_j(\mathbf{p}_j, t) - x_c(t) \approx -\sum_i c_i \tau_i(\mathbf{p}_j)\, \dot{s}_i(t) \quad (2.12)$$

and can be expressed in matrix-vector form as

$$\Delta \mathbf{x}(t) \approx -\mathbf{A}\dot{\mathbf{s}}(t) \quad (2.13)$$

where $\mathbf{A} = \{c_i \tau_i(\mathbf{p}_j)\}$ denotes the instantaneous mixing matrix. Under the far-field approximation, the time delays can be expressed as

$$\tau_i(\mathbf{p}_j) = \mathbf{u}_i^T \mathbf{p}_j / v \quad (2.14)$$

where $\mathbf{u}_i$ is the unit normal vector of the wavefront of source $i$ and $v$ is the speed of sound ($v = 340$ m/s in air). Thus equations (2.13) and (2.14) show that for a miniature recording array, recovery of the desired sources $\mathbf{s}$ or $\dot{\mathbf{s}}$ entails solving a linear source separation problem [86]. However, equation (2.14) reveals that sources located closer to the sensor array can completely mask the sources located farther from the sensor array, resulting in a near-singular mixing $\mathbf{A}$. As shown in the following description, under near-singular mixing conditions conventional methods of signal acquisition and source separation fail to deliver robust performance.

Many source separation and speech feature extraction algorithms fail to deliver robust performance when applied to signals recorded using miniature microphone arrays. This can be a result of the limited dynamic range (determined by analog-to-digital conversion) of the microphone, which is insufficient to overcome the artifacts due to large cross-channel correlation, non-homogeneous mixing, and the high dimensionality of the signal space. In the next chapter a novel framework will be proposed that overcomes these limitations by integrating statistical learning directly with the signal measurement (analog-to-digital) process, which enables high-fidelity separation of the linear instantaneous mixtures derived in this section for the miniature microphone array.
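Before moving to the proposed framework, the short simulation below makes the near-singularity implied by equation (2.9) concrete. The geometry, sample rate, and source signals are illustrative assumptions: it synthesizes the two-element recording of a 5 mm array and shows that the two channels are almost perfectly correlated, i.e., the effective mixing is nearly singular, which is exactly the regime where conventional acquisition loses information.

```python
import numpy as np

fs = 16000                        # sample rate (Hz)
v = 340.0                         # speed of sound (m/s)
d = 0.005                         # inter-element spacing: 5 mm (far sub-wavelength for speech)
t = np.arange(0, 1.0, 1 / fs)

# two far-field sources arriving from +45 and -30 degrees (plane-wave approximation)
rng = np.random.default_rng(1)
sources = [np.sin(2 * np.pi * 300 * t), 0.3 * rng.standard_normal(t.size)]
angles = np.deg2rad([45.0, -30.0])

def record(px):
    """Microphone signal as a superposition of delayed sources, cf. equation (2.9);
    equal attenuation c_i across elements is assumed (far-field condition)."""
    x = np.zeros_like(t)
    for s, th in zip(sources, angles):
        tau = px * np.cos(th) / v                 # far-field delay relative to the array center
        x += np.interp(t - tau, t, s)             # fractional delay via linear interpolation
    return x

x = np.vstack([record(-d / 2), record(+d / 2)])   # two-element recording
corr = np.corrcoef(x)[0, 1]
cond = np.linalg.cond(np.cov(x))
print(f"cross-channel correlation = {corr:.6f}")  # ~1: the channels are nearly identical
print(f"condition number of the channel covariance = {cond:.1e}")  # large: near-singular mixing
```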
Chapter 3

Sigma-Delta Learning

The underlying principle behind the proposed technique is illustrated using Fig. 3.1, which shows a two-dimensional signal distribution along with the respective signal quantization levels (depicted using rectangular tick marks). In this example, the signal distribution has been chosen to cover only a small region of the quantization space, which would be the case for near-singular mixing. Thus, in a traditional implementation where each dimension is quantized independently of the others, there would be a significant information loss due to quantization.

The approach towards estimating $\mathbf{P}$ (which was previously introduced in chapter 1) while performing signal quantization will be to decompose $\mathbf{P} \in \mathbb{R}^{M \times M}$ as a product of two simple matrices $\mathbf{\Lambda} \in \mathbb{R}^{M \times M}$ and $\mathbf{B} \in \mathbb{R}^{M \times M}$ such that $\mathbf{P} = \mathbf{\Lambda}\mathbf{B}$. The transformation matrix $\mathbf{B}$ will first "approximately" align the data distribution along the orthogonal axes, each axis representing an independent (orthogonal) component (shown in Fig. 3.1(b)). Based on this alignment, the signal distribution will be scaled according to a diagonal matrix $\mathbf{\Lambda}$ such that the quantization levels now span a significant region of the signal space (Fig. 3.1(c)). Our objective will be to compute these transforms $\mathbf{B}$ and $\mathbf{\Lambda}$ recursively while performing signal quantization.

Figure 3.1: Illustration of the two-dimensional signal distribution for: (a) the input signals; (b) signals obtained after transformation B; and (c) signals obtained after resolution enhancement

Even though the proposed procedure bears similarity to recursive techniques reported in many online "whitening" algorithms [87, 88, 89], the key difference of the proposed approach is that the adaptation and estimation of the projection matrix $\mathbf{P}$ is coupled with the quantization process. Thus, unlike traditional online "whitening" techniques, in the proposed approach any imperfections or errors in the quantization process can be corrected through the adaptation of $\mathbf{P}$. This approach can therefore be visualized as a "smart" analog-to-digital converter, as shown in Fig. 3.2, which produces not only a quantized (digitized) representation of the input signal d but also quantized (digitized) representations of the transforms $\mathbf{B}$ and $\mathbf{\Lambda}$. In our formulation, the estimation algorithm for $\mathbf{P}$ ($\mathbf{B}$ and $\mathbf{\Lambda}$) has been integrated within a Σ∆ modulation algorithm, hence the name "Σ∆ learning". The choice of Σ∆ modulation is due to its robustness to hardware-level artifacts (mismatch and non-linearity), which makes the modulation amenable to implementing high-resolution analog-to-digital converters [94]. Before presenting a generalized formulation for Σ∆ learning, an optimization framework will first be presented that can model the dynamics of first-order Σ∆ modulation.

3.1 Stochastic gradient descent and Σ∆ modulators

First, a one-dimensional example of Σ∆ will be presented to illustrate how a Σ∆ modulator can be modeled as an equivalent stochastic gradient descent based optimization problem. Consider the architecture of a well-known first-order Σ∆ modulator [94] as shown in Fig. 3.3(a), which consists of a single discrete-time integrator in a feedback loop. The loop also contains a quantizer Q which produces a sequence of digitized representations $d[n]$, where $n = 1, 2, ..$ denotes a discrete time-index. Let $x[n] \in \mathbb{R}$ be the sampled analog input to the modulator and, without any loss of generality, let $d[n] \in \{+1, -1\}$ be the output of a single-bit quantizer given by $d[n] = \mathrm{sgn}(v[n-1])$ (Fig. 3.3(b)), where $v[n] \in \mathbb{R}$ is the internal state variable, i.e., the output of the integrator shown in Fig. 3.3(a). Then, the Σ∆ modulator in Fig. 3.3(a) implements the following recursion:

$$v[n] = v[n-1] + x[n] - d[n] \quad (3.1)$$

It can be seen from equation (3.1) that if $v[n]$ is bounded for all $n$, then

$$\frac{1}{N}\sum_{n=1}^{N} d[n] \;\xrightarrow{N \to \infty}\; \frac{1}{N}\sum_{n=1}^{N} x[n]. \quad (3.2)$$

This implies that the Σ∆ algorithm given by equation (3.1) produces a binary sequence $d[n]$ whose temporal average asymptotically converges to the temporal average of the input analog signal. This statistical dynamics is at the core of most Σ∆ modulators.
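A minimal simulation of the recursion (3.1), shown below, makes the averaging property (3.2) concrete; the input waveform and the number of samples are arbitrary choices for illustration.

```python
import numpy as np

def first_order_sigma_delta(x):
    """First-order sigma-delta modulator, cf. equation (3.1):
    v[n] = v[n-1] + x[n] - d[n], with d[n] = sgn(v[n-1])."""
    v, d = 0.0, np.empty_like(x)
    for n, xn in enumerate(x):
        d[n] = 1.0 if v >= 0 else -1.0     # single-bit quantizer d[n] = sgn(v[n-1])
        v = v + xn - d[n]                  # discrete-time integrator update
    return d

# a bounded (|x| < 1) slowly varying input
n = np.arange(20000)
x = 0.4 * np.sin(2 * np.pi * n / 5000.0) + 0.1
d = first_order_sigma_delta(x)

# the average of the 1-bit stream tracks the average of the analog input, as in (3.2)
print(np.mean(d), np.mean(x))              # the two averages agree closely
```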
However, from the perspective of statistical learning, the Σ∆ recursion in equation (3.1) can be viewed as a stochastic gradient step of the following optimization problem:

$$\min_v C(v) = \min_v \left[\, |v| - v\, E_x(x)\, \right] \quad (3.3)$$

where $E_x(\cdot)$ is the ensemble expectation of the random variable $x$.

Figure 3.2: Architecture of the proposed sigma-delta learning applied to a source separation problem

The optimization function $C(v)$ is shown in Fig. 3.3(c) for the case $|E_x(x)| < 1$. The minimum under this condition is $\min_v C(v) = 0$, which is achieved for $v = 0$ and thus does not contain any information about the statistical properties of $x$. The recursion (3.1) ensures that $v[n]$ approaches the minimum and then exhibits oscillations about it (shown in Fig. 3.3(c)). Note that, unlike conventional stochastic gradient based optimization techniques [95], recursion (3.1) does not require any learning rate parameters. This is unique to the proposed optimization framework, where stochastic gradient descent is used to generate limit-cycles (oscillations) about the minimum of the cost function $C(v)$. The only requirement in such a framework is the assumption that the input signal is bounded, which ensures that the limit-cycles are bounded. Under this condition, the statistics of the limit-cycles can asymptotically encode the statistics of the input signal with infinite precision, as shown by equation (3.2). In the later sections, we will exploit this asymptotic property to precisely estimate the transform $\mathbf{P}$, which can then be used for resolving the acute spatial cues in miniature microphone arrays.

Figure 3.3: (a) System architecture of a first order Σ∆ modulator, (b) input-output response of single bit quantizer, and (c) illustration of "limit-cycle" oscillations about the minima of the cost function C(v)

Another unique aspect of the proposed optimization framework for modeling Σ∆ modulators is that the cost function $C(v)$ links "analog-to-digital" conversion through the regularizer $|v|$, whose derivative leads to a single-bit quantizer (the sgn function). The second term in $C(v)$ ensures that the statistics of the quantized stream $d[n]$ match the statistics of the input analog signal $x[n]$.

We now extend this optimization framework to a multi-dimensional Σ∆ modulator which uses a multi-bit quantizer and incorporates the transformations $\mathbf{B}$ and $\mathbf{\Lambda}$. Consider the following minimization problem

$$\min_{\mathbf{v}} C(\mathbf{v}) \quad (3.4)$$

where the cost function $C(\mathbf{v})$ is given by

$$C(\mathbf{v}) = \Omega(\lambda^{-1}\mathbf{v}) - \mathbf{v}^T E_x\{\mathbf{B}\mathbf{x}\}. \quad (3.5)$$

$\mathbf{x} \in \mathbb{R}^M$ is now an $M$-dimensional analog input vector and $\mathbf{v} \in \mathbb{R}^M$ is an internal state vector. For the first part of this formulation, $\lambda$ will be assumed to be a constant scalar and the transform $\mathbf{B}$ will be assumed to be a constant matrix. $\Omega(\cdot)$ denotes a piece-wise linear regularization function that is used for implementing quantization operators. An example of a regularization function $\Omega(\cdot)$ is shown in Fig. 3.4 for a 1-dimensional input vector $\mathbf{v}$. Due to the piece-wise nature of the function $\Omega(\cdot)$, its gradient $\mathbf{d} = \nabla\Omega$ (shown in Fig. 3.4) is equivalent to a scalar quantization operator. Without loss of generality, it will be assumed that the range of the quantization operator is limited to $[-1, 1]$. Therefore, for a $2K$-step quantization function the corresponding regularization function $\Omega(\cdot)$ is given by

$$\Omega(\mathbf{v}) = \sum_{j=1}^{M} \frac{i}{2K}\,|v_j|; \quad |v_j| \in [i-1, i] \quad (3.6)$$

for $i = 1, .., 2K$.

Figure 3.4: One dimensional piece-wise linear regularization functions and the multi-bit quantization function as its gradient
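To make the multi-bit quantizer implied by equation (3.6) concrete, the following sketch (with an arbitrary choice of K) evaluates a piece-wise linear regularizer of this form and its gradient, which behaves as a uniform 2K-step quantizer saturating at ±1. This is only an illustration of the relationship between the regularizer and the quantizer, not a circuit-level model.

```python
import numpy as np

def quantizer_from_regularizer(v, K=4):
    """Gradient of the piece-wise linear regularizer in equation (3.6):
    on the segment |v_j| in [i-1, i] the slope is i/(2K), so the gradient acts as a
    uniform 2K-step quantizer whose output is limited to [-1, 1]."""
    i = np.clip(np.ceil(np.abs(v)), 1, 2 * K)      # active segment index, 1..2K
    return np.sign(v) * i / (2.0 * K)

def regularizer(v, K=4):
    """Piece-wise linear Omega(v), evaluated element-wise as in equation (3.6)."""
    total = 0.0
    for vj in np.atleast_1d(v):
        i = int(np.clip(np.ceil(abs(vj)), 1, 2 * K))
        total += i / (2.0 * K) * abs(vj)
    return total

v = np.linspace(-10, 10, 9)
print(quantizer_from_regularizer(v))               # staircase output confined to [-1, 1]
print(regularizer(np.array([0.5, 3.2])))           # corresponding regularizer value
```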
Figure 3.5: Limit cycle behavior using bounded gradients

To reiterate, the uniqueness of the proposed approach, compared to other optimization techniques for solving (3.4), is the use of bounded gradients to generate Σ∆ limit-cycles. This is illustrated in Fig. 3.5, which shows the proposed optimization procedure on a two-dimensional contour. Provided the input $\mathbf{x}$ and the norm of the linear transformation $\|\mathbf{B}\|_\infty$ are bounded and the regularization function $\Omega$ satisfies the Lipschitz condition, the optimal solution to (3.4) is well defined and bounded from below, as shown in the next lemma:

Lemma 3.1.1. For a bounded matrix $\|\mathbf{B}\|_\infty \leq \lambda^{-1}$ and a bounded vector $\|\mathbf{x}\|_\infty \leq 1$, $C$ as defined in equation (3.5) is convex and is bounded from below according to $C^* = \min_{\mathbf{v}} C(\mathbf{v}) > \frac{1}{2}\left(\frac{1}{K}-1\right)$.

Proof. A topological property of norms [96] will be used in this proof, which states that for two integers $p, q$ satisfying $\frac{1}{p} + \frac{1}{q} = 1$, the following relationship is valid for vectors $\mathbf{v}$ and $\mathbf{u}$:

$$|\mathbf{v}^T\mathbf{u}| \leq \|\mathbf{v}\|_p \|\mathbf{u}\|_q \quad (3.7)$$

Setting $\mathbf{u} = E_x\{\mathbf{B}\mathbf{x}\}$ and applying equation (3.7), the following inequality is obtained:

$$\|\mathbf{v}\|_1 \|\mathbf{u}\|_\infty \geq |\mathbf{v}^T\mathbf{u}| \geq \mathbf{v}^T\mathbf{u} \geq \mathbf{v}^T E_x\{\mathbf{B}\mathbf{x}\} \quad (3.8)$$

It can be easily verified that $\Omega(\mathbf{v}) \geq \|\mathbf{v}\|_1 - \frac{1}{2}\left(\frac{1}{K}-1\right)$, which is shown graphically in Fig. 3.4 for the one-dimensional case and hence can be extended element-wise to the multi-dimensional case. Using the definition of the matrix norm and the given constraints, it can easily be seen that $\|\mathbf{B}\|_\infty \geq \|\mathbf{B}\mathbf{x}\|_\infty \geq \|\mathbf{u}\|_\infty$. Thus, $\|\mathbf{u}\|_\infty \leq \lambda^{-1}$. Therefore, the inequality (3.8) leads to

$$\Omega(\lambda^{-1}\mathbf{v}) - \mathbf{v}^T E_x\{\mathbf{B}\mathbf{x}\} \geq \lambda^{-1}\|\mathbf{v}\|_1 - \frac{1}{2}\left(\frac{1}{K}-1\right) - \mathbf{v}^T E_x\{\mathbf{B}\mathbf{x}\} \geq 0 \quad (3.9)$$

which proves that the cost function $C(\mathbf{v})$ is bounded from below by $C^*$.

However, for Σ∆ learning the trajectory toward the minimum of the cost function (3.5) is of importance. A stochastic gradient minimization corresponding to the optimization problem (3.5) leads to

$$\mathbf{v}[n] = \mathbf{v}[n-1] + \mathbf{B}\mathbf{x}[n] - \lambda^{-1}\mathbf{d}[n] \quad (3.10)$$

with $n$ signifying the time step and $\mathbf{d}[n] = \nabla\Omega(\mathbf{v}[n-1])$ being the quantized representation according to the functions shown in Fig. 3.4. Note also that formulation (3.10) does not require any learning rate parameters. As the recursion (3.10) progresses, bounded limit cycles are produced about the solution $\mathbf{v}^*$ (see Fig. 3.5). The following two lemmas exploit the property of the first-order modulator to show that the auxiliary state variable $\mathbf{v}[n]$ defined by (3.10) is uniformly bounded if the input random vector $\mathbf{x}$ and the matrix $\mathbf{B}$ are uniformly bounded.

Lemma 3.1.2. For any bounded input vector sequence satisfying $\|\mathbf{x}[n]\|_\infty \leq 1$ and a transformation matrix $\mathbf{B}$ satisfying $\|\mathbf{B}\|_\infty \leq \lambda^{-1}$, the internal state vector $\mathbf{v}[n]$ defined by equation (3.10) is always bounded, i.e., $\|\mathbf{v}[n]\|_\infty \leq 2\lambda^{-1}$ for $n = 1, 2, ...$

Proof. Mathematical induction will be applied to prove this lemma. Without any loss of generality one can assume $\|\mathbf{v}[0]\|_\infty \leq 2\lambda^{-1}$. Suppose $\|\mathbf{v}[n-1]\|_\infty \leq 2\lambda^{-1}$; it then follows that $\|\mathbf{v}[n-1] - \nabla\Omega(\mathbf{v}[n-1])\|_\infty = \|\mathbf{v}[n-1] - \mathbf{d}[n]\|_\infty \leq \lambda^{-1}$. Because $\mathbf{x}$ and $\mathbf{B}$ are bounded, and using equation (3.10), the following relationship holds:

$$\|\mathbf{v}[n]\|_\infty = \|\mathbf{v}[n-1] - \lambda^{-1}\mathbf{d}[n] + \mathbf{B}\mathbf{x}[n]\|_\infty \leq \|\mathbf{v}[n-1] - \lambda^{-1}\mathbf{d}[n]\|_\infty + \|\mathbf{B}\|_\infty \leq 2\lambda^{-1} \quad (3.11)$$

Lemma 3.1.3. For any bounded input vector $\|\mathbf{x}\|_\infty \leq 1$ and bounded transformation matrix $\mathbf{B}$, $\mathbf{d}[n]$ asymptotically satisfies $E_n\{\mathbf{d}[n]\} \xrightarrow{n \to \infty} \lambda E_n\{\mathbf{B}\mathbf{x}[n]\}$.

Proof.
Following N update steps, the recursion given by equation (3.10) yields

E_n{Bx[n]} − λ^{-1} E_n{d[n]} = (1/N)(v[N] − v[0])    (3.12)

which, using the boundedness of the random vector v, asymptotically leads to

E_n{d[n]} → λ E_n{Bx[n]} as n → ∞.    (3.13)

Thus, according to Lemma 3.1.3, recursion (3.10) produces a quantized sequence whose mean asymptotically encodes the scaled, transformed input at infinite resolution. It can also be shown that a finite number I of iterations of (3.10) yields a quantized representation that is log_2(I) bits accurate.

3.1.1 Σ∆ Learning

In this section, the optimization framework will be extended to include on-line estimation of the transform B. Here again λ is assumed to be constant. Given an M-dimensional random input vector x ∈ R^M and an internal state vector v, the Σ∆ learning algorithm estimates the parameters of a linear transformation matrix B ∈ R^{M×M} according to the following optimization function

max_{B ∈ C} ( min_v C(v, B) )    (3.14)

where the cost function is

C(v, B) = Ω(λ^{-1}v) − v^T E_x{Bx}    (3.15)

and C denotes a constraint space on the transformation matrix B. The minimization step in equation (3.14) ensures that the state vector v is correlated with the transformed input signal Bx (tracking step), and the maximization step in (3.14) adapts the matrix B such that it minimizes this correlation (de-correlation step). The stochastic gradient descent step corresponding to the minimization yields the recursion

v[n] = v[n−1] + B[n]x[n] − λ^{-1}d[n]    (3.16)

where B[n] denotes the transform matrix obtained at time instant n. The transform B is then updated according to a stochastic gradient ascent step given by

B[n] = B[n−1] − 2^{-P} v[n−1] x[n]^T;  B[n] ∈ C    (3.17)

where P is a parameter which determines the resolution of the updates of the parameter matrix B. If we assume that locally the matrix B* behaves as a positive definite matrix, equation (3.17) can be rewritten as

B[n] = B[n−1] − 2^{-P} v[n−1](B[n]x[n])^T = B[n−1] − 2^{-P} d[n]d[n]^T    (3.18)

where we have replaced the transformed input B[n]x[n] by its asymptotic quantized representation d[n]; similarly, v[n−1] is replaced by its quantized representation d[n]. The update can be generalized further by incorporating a non-linear quantization function φ(.) as

B[n] = B[n−1] − 2^{-P} φ(d[n]) d[n]^T    (3.19)

where φ: R^M → R^M is a function that may depend on the transformation B. Here, φ(.) = tanh(.) is assumed, and the constraint space C has been chosen to restrict B to be a lower triangular matrix with all diagonal elements equal to unity. One way to ensure that B[n] ∈ C for all n is to apply the updates only to the lower triangular elements b_ij, i > j. The choice of this constraint guarantees convergence of the Σ∆ learning by ensuring that B is bounded.

Lemma 3.1.4. If the transform matrix B is bounded, then the quantized sequences d_i[n] and d_j[n] with i ≠ j are uncorrelated with respect to each other.

Proof. Using equation (3.19) the following relationships are obtained:

−2^{-P} φ(d[n]) d[n]^T = B[n] − B[n−1]
−2^{-P} E_n{φ(d[n]) d[n]^T} = lim_{N→∞} (1/N) B[N]
E_n{φ(d_i[n]) d_j[n]} = 0  ∀ i ≠ j    (3.20)

Since this relationship holds for a generic form of φ(.), the sequences d_i[n] are (non-linearly) uncorrelated with respect to each other. Equation (3.20) also provides a mechanism for reconstructing the input signal using the transformed output d[n] and the converged estimate B_∞ of the transformation matrix (B[n] → B_∞ as n → ∞). The input signal can be reconstructed using

x̂ = B_∞^{-1} λ^{-1} E_n{d[n]}.    (3.21)
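The following Python sketch puts the tracking step (3.16), the de-correlation update (3.19) restricted to the lower-triangular element, and a simple multi-bit quantizer together for a two-dimensional mixture. It is an illustration under assumed values (mixing matrix, quantizer discretization, λ and 2^{-P}), not the dissertation's implementation; windowed averages of d[n] together with (3.21) would then give the reconstructed input.

    import numpy as np

    rng = np.random.default_rng(1)

    M, n_steps, K, P = 2, 50000, 8, 10    # dimension, iterations, quantizer half-steps, update resolution
    lam_inv = 2.0                          # lambda^{-1}; chosen to upper-bound ||B||_inf

    def quantize(v, K, lam_inv):
        """Multi-bit quantizer d = grad(Omega): uniform levels clipped to [-1, 1]."""
        return np.clip(np.round(v * K / lam_inv) / K, -1.0, 1.0)

    # Two bounded, correlated "analog" inputs (instantaneous mixture of independent sources).
    s = 0.5 * np.vstack([np.sign(np.sin(2 * np.pi * 7e-4 * np.arange(n_steps))),
                         rng.uniform(-1, 1, n_steps)])
    A = np.array([[1.0, 0.9], [0.85, 1.0]])            # hypothetical ill-conditioned mixing
    x = np.clip(A @ s, -0.99, 0.99)

    v = np.zeros(M)
    B = np.eye(M)                                      # lower triangular, unit diagonal
    d_hist = np.empty((M, n_steps))
    for n in range(n_steps):
        d = quantize(v, K, lam_inv)                    # d[n] = grad Omega(v[n-1])
        v = v + B @ x[:, n] - lam_inv * d              # tracking step, eq. (3.16)
        B[1, 0] -= 2.0 ** (-P) * np.tanh(d[1]) * d[0]  # de-correlation step, eq. (3.19), i > j only
        d_hist[:, n] = d

    # The cross-correlation of the quantized outputs should shrink as B[n] converges (Lemma 3.1.4).
    print("output correlation:", np.corrcoef(d_hist)[0, 1])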
The use of lower-triangular transforms for B greatly simplifies the computation of the inverse B_∞^{-1} through the use of back-substitution techniques. Also, due to its lower-triangular form with unit diagonal, the inverse of B_∞ always exists and is well defined.

3.1.2 Resolution Enhancement

Once the transform B has been determined such that the output of the Σ∆ learner is "de-correlated", we can apply resolution enhancement by "zooming" into the part of the transformed signal space that does not cover the quantization regions (see Fig. 3.2(b)). This can be achieved using an additional diagonal matrix Λ^{-1} which scales each axis, as shown in Fig. 3.2(c). The Σ∆ cost function can be appropriately transformed to include the diagonal matrix Λ ∈ R^{M×M} as

C(v, B, Λ) = Ω(Λ^{-1}v) − v^T E_x{Bx}    (3.22)

where the optimization (3.15) is also performed with respect to the parameter matrix Λ such that the constraint ||B||_∞ < ||Λ^{-1}||_∞ is satisfied. This constraint ensures that C(v, B, Λ) is always bounded from below. The stochastic gradient step equivalent to recursion (3.16) is given by

v[n] = v[n−1] + (B[n−1]x[n] − Λ^{-1}[n]d[n])    (3.23)

The asymptotic behavior of update (3.23) for the cost function (3.22) can be expressed as E_n{d[n]} → ΛE_n{B[n]x[n]} as n → ∞. Thus, reducing the magnitude of the diagonal elements of Λ^{-1} results in an equivalent amplification of the transformed signal. To satisfy the constraint on the transforms B and Λ, a suitable update for the diagonal matrix Λ^{-1}[n] and its elements λ_i is

λ_i = max |(B[n]x[n])_i|;  N_1 > n > N_0    (3.24)

where N_0 is the number of iterations required for the matrix B to stabilize and N_1 is the maximum observation period used to determine the parameters λ_i.

3.2 Acoustic source separation

Advances in acoustic miniaturization are enabling the integration of an ever-increasing number of microphones within a single sensor device, which makes the integration of miniature/compact microphone arrays possible. With the introduction of these devices, there have been several attempts to overcome the fundamental problems they introduce. A key challenge is to be able to image acoustic events occurring in the environment with high (spatial and temporal) fidelity. However, due to the dispersive nature of the surrounding media, each element of the sensor array records a mixture of signals and noises generated by independent events in its environment. In order to improve the accuracy of acoustic recognition systems, the signals of interest have to be separated from the noise. The recovery of the signals of interest from the recorded mixtures lies within the domain of blind source separation. Blind source separation (BSS) is based on a general class of unsupervised learning which has applications in many areas of technology. The BSS task has a connection to human perception, where the human auditory system has the ability to focus on acoustic sources of interest even in a very noisy environment. Environmental assumptions about the surroundings of the microphone array directly influence the complexity of the BSS problem. Blind separation of acoustic signals is sometimes referred to as the Cocktail Party Problem [90, 91], where the problem is defined as the separation of voices from a mixture of sources in an uncontrolled environment such as a cocktail party. In a real-world scenario, each microphone observation is a mixture of all the acoustic sources in the natural environment, each of which is affected by signal reverberation.
In order to make the problem more tractable, BSS techniques usually make some assumptions about the environment. The simplest scenario is termed instantaneous mixing, where the acoustic sources arrive instantaneously at the microphones and only the intensities of the sources are considered. An extension of this assumption, in which the arrival delays between microphones are also considered, is known as the anechoic case. More realistic assumptions lead to convolutive mixing, which considers multiple paths between each acoustic source and each microphone in order to model signal reverberation. In modeling the BSS problem, assumptions can also be made about the number and the statistical properties of the acoustic sources. It is very common to assume that the sources are independent, or at least decorrelated, in which case the solution can be based on higher-order statistics (HOS) or second-order statistics (SOS). This class of approaches is commonly called independent component analysis (ICA). A series of techniques are motivated by insights from the auditory system and make strong assumptions about the acoustic sources, such as common onset, harmonic structure, etc. These techniques are commonly referred to as computational auditory scene analysis (CASA); they first detect and classify acoustic sources and then perform a supervised decomposition of the auditory scene. One increasingly popular and powerful assumption is that the acoustic sources have a sparse representation in some basis. These methods have come to be known as sparse methods. The advantage of a sparse signal assumption is that the probability of two or more sources being active simultaneously is low. This results in good separability, because most of the energy of the observed signal at any time instant belongs to a single source. It has also been shown that sparse representations exist in the auditory cortex of the brain, where the firing pattern of neurons is characterized by long periods of inactivity [92, 93]. Usually it is assumed that there are at least as many sensors as sources for separation, but under a strong sparsity assumption it is sometimes possible to relax the conditions on the number of sensors. Some speech signal properties that can provide assumptions for BSS systems are as follows:

• Speech signals originating from different speakers at different spatial locations in an acoustic environment can be considered to be statistically independent.

• Speech signals are inherently non-stationary over long periods, but can be considered quasi-stationary for small time durations (around 25 ms).

Most linear BSS models can be expressed in matrix form as:

X = AS + V    (3.25)

where X ∈ R^{m×N} is the observation matrix, with m and N being the number of observations and the number of samples in each observation, A ∈ R^{m×n} is the mixing matrix, S ∈ R^{n×N} is the source matrix, and V ∈ R^{m×N} is the noise matrix. Often BSS is performed by finding an n×m, full-rank separation matrix W = A†, where A† is some well-defined pseudo-inverse of A, such that the output signal Y = WX contains components that have special properties of interest based on the assumptions, which can be measured by the Kullback-Leibler divergence or other criteria such as sparseness or smoothness.
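As a minimal illustration of the matrix formulation X = AS + V and the separation Y = WX, the sketch below mixes two synthetic sources and recovers them once with the oracle pseudo-inverse W = A† and once blindly with scikit-learn's FastICA. The sources, mixing matrix and noise level are arbitrary assumptions, and FastICA is used here only for illustration; the experiments later in this chapter use SOBI and EFICA.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(2)
    N = 8000
    t = np.arange(N) / 8000.0

    # Two synthetic, roughly independent sources (rows of S) plus a small noise term V.
    S = np.vstack([np.sin(2 * np.pi * 440 * t),
                   np.sign(np.sin(2 * np.pi * 97 * t))])
    A = np.array([[1.0, 0.6], [0.7, 1.0]])        # hypothetical instantaneous mixing matrix
    V = 0.01 * rng.standard_normal((2, N))
    X = A @ S + V                                 # observation model X = AS + V

    # Oracle separation (A known): W = A^+, Y = WX.
    Y_oracle = np.linalg.pinv(A) @ X

    # Blind separation (A unknown): estimate the un-mixing transform from X alone.
    Y_blind = FastICA(n_components=2, random_state=0).fit_transform(X.T).T

    print("oracle residual:", np.mean((Y_oracle - S) ** 2))
    print("blind output shape:", Y_blind.shape)   # recovered only up to permutation and scaling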
3.3 Experimental results

Figure 3.6: Reconstruction of the sources using the conventional and the proposed Σ∆ with OSR = 1024.

3.3.1 Numerical evaluation

The achievable improvement predicted by equation (2.8) for Σ∆ learning was first verified using numerical evaluation. For this controlled experiment two synthetic signals were chosen:

s1(t) = 480t − ⌊480t⌋
s2(t) = sin(800t + 6 cos(90t))    (3.26)

Each of these source signals was mixed using a random ill-conditioned matrix A to obtain the two-dimensional signals, which were then processed by the Σ∆ learner. The outputs of the Σ∆ learner were then used to reconstruct the source signals according to

s̃_d = A^{-1} x    (3.27)
s̃_m = A^{-1} B^{-1} x    (3.28)
s̃′_m = A^{-1} B^{-1} Λ^{-1} x    (3.29)

assuming that the un-mixing matrix A^{-1} can be perfectly determined.

Figure 3.7: Evaluating the reconstruction of the sources for classical (without), learning (with), and learning with resolution enhancement (with+) Σ∆ at different OSR for log2(condition number) of (a) 10 and (b) 12.

The equations (3.27)-(3.29) represent the following three cases: (a) s̃_d, which is the signal reconstructed using a Σ∆ modulator without any learning (denoted by "without"); (b) s̃_m, which is the signal reconstructed using Σ∆ learning without resolution enhancement (denoted by "with"); and (c) s̃′_m, which is the signal reconstructed using Σ∆ learning with resolution enhancement (denoted by "with+").

Figure 3.8: Evaluating the reconstruction of the sources for classical (without), learning (with), and learning with resolution enhancement (with+) Σ∆ at different condition numbers for OSR of (a) 256 and (b) 512.

For this experiment, the condition number of the mixing matrix was chosen to be 1000 and the over-sampling ratio (OSR), defined as the ratio of the sampling frequency to the Nyquist frequency, was chosen to be 1024. For the signals in (3.26) the Nyquist frequency was chosen to be 10 kHz. Figure 3.6 shows the reconstructed signals obtained with and without the application of Σ∆ learning. The quantization artifacts can clearly be seen in Fig. 3.6(b), which is the signal recovered using the Σ∆ modulator without learning. However, the signals obtained when Σ∆ learning is applied do not show any such artifacts, indicating an improvement in resolution.

Figure 3.9: Evaluating the reconstruction of sources at different dimensions for the learning Σ∆ at different condition numbers for OSR of 128.

To quantify this improvement, we compared the signal-to-error ratio (SER) for the separated signals. SER is defined as

SER = log2( ||s||² / ||s − s̃||² )    (3.30)

where s and s̃ follow the definitions in (3.27)-(3.29). To compute the mean SER and its variance, 10 different mixing matrices with a fixed condition number were chosen and the means/variances were calculated across the different experimental runs.
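A small helper for the SER metric is sketched below, written in the logarithmic form of equation (3.30) and exercised on the synthetic sources of (3.26); the averaging over ten random mixing matrices described in the text is omitted, and the noisy "reconstructions" are stand-ins.

    import numpy as np

    def signal_to_error_ratio(s, s_hat):
        """SER between a reference source s and its reconstruction s_hat, per equation (3.30)."""
        s, s_hat = np.asarray(s, dtype=float), np.asarray(s_hat, dtype=float)
        return np.log2(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))

    # Toy usage with the synthetic sources of equation (3.26) on an assumed 10 kHz grid.
    t = np.arange(0, 0.2, 1.0 / 10000.0)
    s1 = 480 * t - np.floor(480 * t)
    s2 = np.sin(800 * t + 6 * np.cos(90 * t))
    print(signal_to_error_ratio(s1, s1 + 0.01 * np.random.default_rng(0).standard_normal(t.size)))
    print(signal_to_error_ratio(s2, s2 + 0.01 * np.random.default_rng(1).standard_normal(t.size)))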
Figure 3.7(a) compares the SER obtained when a mixing matrix with condition number 2^10 was chosen, for different values of OSR, and Figure 3.7(b) compares the SER obtained when a mixing matrix with condition number 2^12 was chosen. It can be seen in Fig. 3.7(a) and (b) that as the OSR of the Σ∆ modulator increases, the SER increases. This is consistent with results reported for Σ∆ modulators, where the OSR is directly related to the resolution of the "analog-to-digital" conversion. However, it can be seen that for all OSR conditions, Σ∆ learning with resolution enhancement outperforms the other two approaches. Figures 3.8(a) and (b) compare the performance of the Σ∆ learner when the condition number of the mixing matrix is varied for fixed over-sampling ratios of 256 and 512. The results again show that the Σ∆ learner (with and without resolution enhancement) demonstrates a consistent performance improvement over the traditional Σ∆ modulator. Also, as expected, the SER performance for all three cases deteriorates with increasing condition number, which indicates that the mixing becomes more singular. Figure 3.9 evaluates the SER achieved by Σ∆ learning (with resolution enhancement) when the dimensionality of the mixing matrix is increased. For this experiment, the number of source signals was increased by randomly selecting signals that were mutually independent with respect to each other. It can be seen from Fig. 3.9 that the response of Σ∆ learning is consistent across different signal dimensions, with larger SER when the dimension is lower.

In the next set of experiments, the performance of Σ∆ learning is evaluated for the task of source separation when the un-mixing matrix is estimated using an ICA algorithm. Speech samples were obtained from the TIMIT database and were synthetically mixed using ill-conditioned matrices with different condition numbers. The instantaneous mixing parameters simulate the "near-far" scenario, where one of the speech sources is assumed to be much closer to the microphone array than the other. This scenario was emulated by scaling one of the signals by −50 dB with respect to the other. The speech mixture is then presented to the Σ∆ learner, and its output is then processed by the second-order blind identification (SOBI) [97] and efficient FastICA (EFICA) [98] algorithms. The performance metric chosen for this experiment is based on the source-to-distortion ratio (SDR) [99], where the estimated signal ŝ_j(n) is decomposed into

ŝ_j(n) = s_target(n) + e_interf(n) + e_artif(n)    (3.31)

with s_target(n) being the original signal, and e_interf(n) and e_artif(n) denoting the interference and artifact errors, respectively. The SDR is a global performance metric which measures both the source-to-interference ratio (SIR), i.e., the amount of interference from unwanted sources, and other artifacts such as quantization and musical noise. The SDR is defined as:

SDR = 10 log( ||s_target||² / ||e_interf + e_artif||² )    (3.32)

Figure 3.10: SDR corresponding to with/without Σ∆ learning for the near-far recording conditions using (a) SOBI and (b) EFICA algorithms.

The speech sources s1 and s2 in this experiment consist of 44200 samples at a sampling rate of 16 kHz, which is also the Nyquist rate.
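The sketch below implements the SDR of equations (3.31)-(3.32) under a simplifying assumption: the reference source is taken to be known, so the combined error e_interf + e_artif is simply the residual between the estimate and the reference. This is a reduced form of the full BSS_EVAL decomposition cited as [99], and the toy signals are arbitrary.

    import numpy as np

    def sdr_db(s_target, s_estimate):
        """Source-to-distortion ratio in dB, with e_interf + e_artif taken as the residual."""
        s_target = np.asarray(s_target, dtype=float)
        error = np.asarray(s_estimate, dtype=float) - s_target
        return 10.0 * np.log10(np.sum(s_target ** 2) / np.sum(error ** 2))

    # Toy usage: a clean tone versus a noisy, slightly attenuated estimate of it.
    t = np.arange(16000) / 16000.0
    clean = np.sin(2 * np.pi * 220 * t)
    estimate = 0.9 * clean + 0.05 * np.random.default_rng(3).standard_normal(t.size)
    print(f"SDR = {sdr_db(clean, estimate):.1f} dB")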
In this setup, after mixing, one of the sources is completely masked by the other, which is consistent with the "near-far" effect. Figure 3.10(a) and (b) show the SDR obtained using Σ∆ learning when the OSR is varied from 128 to 4096. Also shown in Fig. 3.10(a) and (b) are the SDR metrics obtained when a conventional Σ∆ algorithm is used. It can be seen from Fig. 3.10 that the SDR corresponding to the stronger source is similar for both cases (with and without Σ∆ learning), whereas for the masked source the SDR obtained using Σ∆ learning is superior. This is consistent with the results published in [100]. However, the approach in [100] is applied after quantization and hence, according to the formulation presented earlier, is limited by the condition number of the mixing matrix. It should also be noted that Σ∆ learning only enhances the resolution of the measured signals. The ability to successfully recover the weak source under "near-far" conditions, however, is mainly determined by the choice of the ICA algorithm.

3.3.2 Experiments with far-field model

In this section, the mathematical model presented in Chapter 2 for the miniature microphone array will be used. This model, together with the Σ∆ learner, will be used to compare the performance of the algorithm with a traditional source separation technique. Traditional DSP-based source separation algorithms are typically implemented after an ADC and hence do not consider the detrimental effects of finite resolution due to the quantizer. The Σ∆ learner smartly quantizes the array signals with respect to each other and uncorrelates them in order to use as much information as possible when digitizing the signals. In this setup, the recording conditions consisted of four closely spaced microphones. Three of the microphones were placed along a triangle, whereas the fourth microphone was placed at the centroid and acted as the reference sensor which records the common signal. The setup is similar to the conditions reported in [86], where the simulations have been shown to be consistent with recordings in real-life conditions.

Figure 3.11: Σ∆ performance with and without learning for three speech signals corresponding to the far-field model.

The outputs of each microphone along the triangle were subtracted from the reference microphone to produce three differential outputs. In these experiments three independent speech signals were used as far-field sources. The differential outputs of the microphone array were first presented to the proposed Σ∆ learner, and the outputs of the Σ∆ learner array were then used as inputs to the SOBI algorithm. The benchmark used for the comparative study consisted of Σ∆ converters that directly quantized the differential outputs of the microphones. Figure 3.11(b)-(c) summarizes the performance of source separation (with and without Σ∆ learning) for different orientations of the acoustic sources. For the three experiments, only the bearing of the sources was varied, while their respective distances to the center of the microphone array were kept constant.

Figure 3.12: Spectrograms of the recorded signals (top row) and of the recovered signals using Σ∆ without learning (middle row) and with learning (bottom row).
It can be seen that for each of these orientations, the source separation algorithm that uses Σ∆ learning as a front-end "analog-to-digital" converter delivers superior performance compared to the algorithm that does not use Σ∆ learning. Also, from Fig. 3.11(b)-(c) it can be seen that the improvement in SDR performance increases significantly when the sampling frequency (resolution) decreases, showing that Σ∆ learning efficiently utilizes the available resolution (under coarse quantization) to capture the information necessary for source separation.

3.3.3 Experiments with real microphone recordings

In this set of experiments, Σ∆ learning has been applied to speech data that was recorded using a prototype miniature microphone array, similar to the setup described in [85]. Four omnidirectional Knowles FG3629 microphones were mounted on a circular array with a radius of 0.5 cm. The differential microphone signals were pre-processed using second-order bandpass filters with a low-frequency cutoff at 130 Hz and a high-frequency cutoff at 4.3 kHz. The signals were also amplified by 26 dB. The speech signals were presented through loudspeakers positioned at a distance of 1.5 m from the array, and the sampling frequency of the National Instruments data acquisition card was set to 32 kHz. Male and female speakers from the TIMIT database were chosen as sound sources and were replayed through standard computer speakers. The data was recorded from each of the microphones, archived, and then presented as input to the Σ∆ learning and the SOBI algorithm. Figure 3.12 (top row) shows the spectrograms of the speech signals recorded from the microphone array. The two spectrograms look similar, thus emulating a "near-far" recording scenario where a dominant source masks a weak background source. It can also be seen from the spectrogram in Fig. 3.12 (top row, right) that one of the recordings is noisier than the other (due to microphone mismatch). Figure 3.12 (middle row) shows the spectrograms of the separated speech signals obtained without Σ∆ learning, and Fig. 3.12 (bottom row) shows the spectrograms of the separated speech signals obtained with Σ∆ learning. A visual comparison of the spectrograms shows that the speech signals separated without Σ∆ learning contain more quantization artifacts, which can be seen as broadband noise in Fig. 3.12 (middle row). Table 3.1 summarizes the SDR performance (for different OSR) for each of the sources in these two cases (with and without Σ∆ learning), showing that Σ∆ learning indeed improves the performance of the source separation algorithm. Also, from Table 3.1, it can be seen that as the OSR increases, the performance difference between the two cases becomes insignificant. This artifact is due to the limitations of the SOBI algorithm in separating sources with high fidelity, noise in the microphones, and the ambient recording conditions.

Table 3.1: Performance (SDR, in dB) of the proposed Σ∆ approach on the real recordings for different over-sampling ratios.

          OSR=4            OSR=8            OSR=16           OSR=32           OSR=64
          with    without  with    without  with    without  with    without  with    without
    S1    1.03    -0.72    1.33    0.41     1.34    0.94     1.30    1.15     1.28    1.21
    S2    -12.52  -13.15   -9.77   -10.17   -8.69   -8.88    -8.29   -8.32    -8.12   -8.10

To summarize, it has been argued that the classical approach of signal quantization followed by DSP-based source separation fails to deliver robust performance when processing signals recorded using miniature microphone/sensor arrays.
We proposed a framework that combines statistical learning with Σ∆ modulation and that can be used for designing "smart" multi-dimensional analog-to-digital converters capable of exploiting spatial correlations to resolve acute differences between the signals recorded by a miniature microphone array.

Chapter 4
Robust Acoustic Recognition

4.1 Fundamental of speech

Speech is produced when air from the lungs passes through the throat, the vocal cords, the mouth and the nasal tract (see Fig. 4.1(a)). Different positions of the lips, tongue and palate (also known as the articulators) then create different sound patterns and give rise to the physiological and spectral properties of the speech signal, such as pitch, tone and volume. These properties are speaker related and can be used as signatures for speaker recognition systems, as they are modulated by the size and shape of the mouth, vocal and nasal tract along with the size, shape and tension of the vocal cords of each speaker. It has been shown that even for twins, the chances that all of these properties are similar are very low [32, 33]. One of the most commonly used methods for visualizing the spectral and dynamical content of a speech signal is the spectrogram, which displays the frequency of vibration of the vocal cords (pitch) and the amplitude (volume) with respect to time. Examples of spectrograms for a male and a female speaker are shown in Fig. 4.1(b), where the horizontal axis represents time and the vertical axis represents frequency.

Figure 4.1: Fundamentals of speech: (a) Magnetic resonance image showing the anatomy of the speech production apparatus. The properties of the speech signal are determined by the shape of the vocal tract and the orientation of the mouth, teeth and nasal passages. (b) Spectrograms corresponding to a sample utterance "fifty-six thirty-five seventy-two" for a male and a female speaker.

The pitch of the utterance manifests itself as horizontal striations in the spectrogram, as shown in Fig. 4.1(b). For instance, it can be seen from Fig. 4.1(b) that the pitch of the female speaker is greater than the pitch of the male speaker. Other important spectral parameters of the speech signal are the formants, which are defined as the resonant frequencies (denoted by F1, F2, F3, ...) of the vocal tract, in particular when vowels are pronounced. They are produced by restricting the air flow through the mouth, tongue, and jaw. The relative frequency locations of the formants can vary widely from person to person (due to the shape of the vocal tract) and hence can be used as a biometric feature. Even though multiple resonant frequencies exist in the speech signal, only three of the formants (typically labeled F1, F2, F3 as shown in Fig. 4.1(b)) are used for speech and speaker recognition applications. However, reliable estimation of the spectral parameters requires segments of speech signal that are stationary, and hence most verification systems use 20-30 millisecond segments.

Figure 4.2: Functional architecture of a speaker verification system as an example of acoustic recognition, which consists of two main phases: (a) a training/enrollment phase where the parameters of a speaker-specific statistical model are determined, and (b) a recognition/verification phase where an unknown speaker is authenticated using the models trained during the training phase.

Another biometric signature embedded in the speech signal is the stress pattern, also known as prosody, which manifests as the spectral trajectory and the distribution of energy in the spectrogram.
This signature is typically considered one of the "high-level" features, which can be estimated by observing the dynamics across multiple segments of the speech signal. In the next section, we discuss some of the popular approaches for extracting these biometric features and some of the statistical models that are used to recognize speaker-specific features.

4.2 Architecture of an acoustic recognition system

Speech-based recognition systems, whether for speaker or speech recognition, typically consist of two distinct phases: (a) a training phase where the parameters of statistical models are determined using annotated (pre-labeled) speech data; and (b) a testing phase where an unknown speech sample is recognized using the trained statistical models. Fig. 4.2 presents a speaker verification system where these two phases are shown as the enrollment and verification phases. As the figure shows, in such a system the speech signal is first sampled, digitized, and filtered before a feature extraction algorithm computes salient acoustic features from the speech signal. The next step in the training phase uses the extracted features to train a statistical model. During the recognition phase (as shown in Fig. 4.2), an unknown utterance is authenticated against the trained statistical model for a specific task. In the following sections, the standard modules used during each of these phases are reviewed.

4.3 Speech acquisition and feature extraction module

The speech acquisition module typically consists of a transducer coupled to an amplifier and filtering circuitry. Depending on the specifications (size, power and recognition performance) imposed on the recognition system, the transducer could be a standard microphone (omni-directional or directional) or a noise-canceling microphone array, where the speech signal is enhanced by suppressing background noise using a spatial filter [34]. The amplifier and the filtering circuitry are used to maintain a reasonable signal-to-noise ratio (SNR) at the input of an analog-to-digital converter (ADC), which is used to digitize the speech signal. Depending on the topology of the ADC, the speech signal could be sampled at the Nyquist rate (8 kHz) or oversampled using a sigma-delta modulator. Typically, a high-order sigma-delta modulator is the audio ADC of choice because of its ability to achieve resolution greater than 16 bits. Once the speech signal is digitized, a feature extraction module (typically implemented on a digital signal processor) extracts speech information from the raw waveform. Depending on the application of the recognition system, different features can be extracted; e.g., in a speaker recognition system the feature extraction module extracts speaker-specific features, whereas in speech recognition systems this module extracts features that are more speaker-independent. In speaker recognition systems the "high-level" characteristics which convey behavioral information, such as prosody, phonetics and conversational patterns, appear to be more promising than the "low-level" information which conveys the physical structure of the vocal tract [117, 103]. The "low-level" features have mostly been used in speech recognition systems; however, it has been shown that good performance can also be achieved with these features in speaker recognition systems. The difference between these two classes of features is the relative time-scale required for extracting and processing the features.
While "low-level" features can be effectively computed using short frames of speech (<30 ms), the "high-level" features may require time-scales greater than a few seconds [103]. In the following, we present a short overview of two popular classes of "low-level" features: linear predictive cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC).

Linear Predictive Cepstral Coefficients (LPCC): The basic assumption underlying Linear Prediction Coding (LPC) [101, 104], which is at the heart of LPCC, is that the speech signal can be modeled by a linear source-filter model. This model has two sources of human vocal sounds: the glottal pulse generator and the random noise generator. The glottal pulse generator creates voiced sounds. This source generates one of the measurable attributes used in voice analysis: the pitch period. The random noise generator produces the unvoiced sounds, and the vocal tract serves as the filter of the model that produces intensification at specific formants. In LPC feature extraction, the filter is typically chosen to be an all-pole filter. The parameters of the all-pole filter are estimated using an auto-regressive procedure where the signal at each time instant is determined using a certain number of preceding samples. Mathematically, the process can be expressed as

s(t) = − Σ_{i=1}^{P} a_i s(t − i) + e(t)

where the speech signal s(t) at time instant t is determined by the P past samples s(t − i), with i representing the discrete time delay. e(t) is known as the excitation term (random noise or glottal pulse generator), which also signifies the estimation error of the linear prediction process, and a_i denotes the LPC coefficients. During LPC analysis, a quasi-stationary window of speech (about 20-30 ms) is used to determine the parameters a_i, and the process is repeated over the entire duration of the utterance. In most implementations, an overlapping window or a spectral shaping window [104] is chosen to compensate for spectral degradation due to the finite window size. The estimation of the prediction coefficients is done by minimizing the prediction error e(t), and several efficient algorithms, such as the Yule-Walker or Levinson-Durbin algorithms, exist to compute the features in real time. The prediction coefficients are further transformed into Linear Predictive Cepstral Coefficients (LPCC) using a recursive algorithm [104]. A variant of LPC analysis is the Perceptual Linear Prediction (PLP) [60] method. The main idea of this technique is to take advantage of characteristics derived from the psychoacoustic properties of the human ear, and these characteristics are modeled by a filter-bank.

Mel Frequency Cepstral Coefficients (MFCC): These features have been used extensively in speech-based recognition systems [105, 104]. MFCCs were introduced in the early 1980s for speech recognition applications and have since also been adopted for speaker recognition applications. A sample of the speech signal is first extracted using a window. Typically two parameters are important for the windowing procedure: the duration of the window (20-30 ms) and the shift between two consecutive windows (10-15 ms). These values correspond to the average duration for which the speech signal can be assumed to be stationary, i.e., its statistical and spectral information does not change significantly. The speech samples are then weighted by a suitable windowing function; for example, Hamming or Hanning windows are extensively used in acoustic recognition.
The weighting reduces the artifacts (side lobes and signal leakage) caused by choosing a finite-duration window for analysis. The magnitude spectrum of the speech sample is then computed using a fast Fourier transform (FFT) and is then processed by a bank of band-pass filters. The filters generally used in MFCC computation are triangular filters, and their center frequencies are chosen according to a logarithmic frequency scale, also known as the Mel-frequency scale. The filter bank is then used to transform the frequency bins to Mel-scale bins by the following equation:

m_y[b] = Σ_f w_b[f] |Y[f]|²

where w_b[f] is the b-th Mel-scale filter's weight for frequency f and Y[f] is the FFT of the windowed speech signal. The rationale for choosing a logarithmic frequency scale conforms to the response observed in human auditory systems, which has been validated through several biophysical experiments [104]. The Mel-frequency weighted magnitude spectrum is processed by a compressive non-linearity (typically a logarithmic function), which also models the observed response of the human auditory system. The last step in MFCC computation is a discrete cosine transform (DCT), which is used to de-correlate the Mel-scale filter outputs. A subset of the DCT coefficients is chosen (typically the first and the last few coefficients are ignored); these represent the MFCC features used in the training and test phases.

Dynamic and Energy Features: Even though each feature set (LPC or MFCC) is computed over a short frame of speech (about 20-30 ms), it is well known that information embedded in the temporal dynamics of the features is also useful for recognition [106]. Typically two kinds of dynamics have been found useful in speech processing: (a) the velocity of the features (known as ∆ features), determined by the average first-order temporal derivative; and (b) the acceleration of the features (known as ∆∆ features), determined by the average second-order temporal derivative. Other transforms of the features which have also been found useful in recognition include the logarithm of the total energy of the feature (L2 norm) and its first-order temporal derivative [104].
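As a concrete (and deliberately simplified) illustration of the MFCC pipeline described above, the following NumPy sketch frames the signal, applies a Hamming window, computes the FFT power spectrum, applies a triangular Mel filterbank, log-compresses the filter outputs, and de-correlates them with a DCT. The 25 ms / 10 ms framing follows the convention mentioned in the text, while the sampling rate, FFT size, filter count and number of retained coefficients are illustrative assumptions rather than the thesis configuration.

    import numpy as np
    from scipy.fftpack import dct

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc(signal, fs=8000, frame_ms=25, shift_ms=10, n_filters=26, n_ceps=13):
        """Framing, Hamming window, FFT, triangular Mel filterbank, log, and DCT."""
        signal = np.asarray(signal, dtype=float)
        frame_len, shift, n_fft = int(fs * frame_ms / 1000), int(fs * shift_ms / 1000), 512
        # framing and windowing
        n_frames = 1 + (len(signal) - frame_len) // shift
        idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
        frames = signal[idx] * np.hamming(frame_len)
        # power spectrum
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
        # triangular Mel filterbank
        mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for b in range(1, n_filters + 1):
            l, c, r = bins[b - 1], bins[b], bins[b + 1]
            fbank[b - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[b - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        # log compression and DCT de-correlation
        mel_energies = np.log(power @ fbank.T + 1e-10)
        return dct(mel_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

    # Toy usage on a synthetic one-second "utterance".
    fs = 8000
    x = np.sin(2 * np.pi * 300 * np.arange(fs) / fs) + 0.05 * np.random.default_rng(4).standard_normal(fs)
    print(mfcc(x, fs).shape)    # (number of frames, number of cepstral coefficients)

The ∆ and ∆∆ dynamic features discussed above would be obtained by differencing consecutive rows of this output.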
Auxiliary Features: Even though cepstral features have been widely used in speaker recognition systems, it can be argued that they contain phonemic information that is unrelated to the speaker recognition task, as it conveys little speaker-specific information. Recently, new techniques have been reported that can extract speaker-related information from LPCCs and MFCCs and thereby improve the system's recognition performance. One group of these features is sometimes referred to as voice source features. For example, in [107] an inverse filtering technique has been used to separate the spectra of the glottal source and the vocal tract. In another approach, the residual signal obtained from LP analysis has been used to estimate the glottal flow waveform [108, 109, 110, 111]. An alternative approach to estimating the glottal flow (derivative) waveform was presented in [112, 113, 114], where a closed-phase covariance analysis technique was used during the intervals when the vocal folds are closed. Another group of these features comprises the prosodic features. Prosody, which involves variation in syllable length, intonation, formant frequencies, pitch, and the rate and rhythm of speech, can vary from speaker to speaker and relies on long-term information of the speech signal. One of the predominant prosodic features is the fundamental frequency (F0). Other features include pitch, the energy distribution over a longer frame, speaking rate, and phone duration [115, 116, 117]. The auxiliary features have usually been used in addition to the "low-level" features through fusion techniques.

Voice Activity Detector (VAD): Before the features can be used in a recognition system, it is important to determine whether they correspond to the "speech" portion of the signal or to the silence or background part of the signal. Most speech-based recognition systems use a voice activity detector (VAD) whose function is to locate the speech segments in an audio signal. For example, a simple VAD could compute the instantaneous signal-to-noise ratio (SNR) and pick segments only when the SNR exceeds a predetermined threshold. However, it is important to note that the design of a robust VAD can prove challenging, since the module is expected to work consistently across different environments and noise conditions.

4.4 Speech and speaker modeling

Once the feature vectors corresponding to the speech frames have been extracted, the associated speech data, also known as training data, is used to build models for speech or speaker recognition. For speech recognition systems, models are generated for the speech components such as phonemes, words, etc., and for speaker recognition systems, speaker models are generated. During the test phase, the trained model is used to recognize a sequence of feature vectors extracted from unknown utterances. The focus of this section is on the statistical approaches for constructing the relevant models. The methods can be divided into two distinct categories: generative and discriminative. Training of generative models typically involves data specific to the target speech component or speaker, with the objective that the model faithfully captures the statistics of that component. Training of discriminative models, which have been used more in speaker recognition systems, involves data corresponding to both the target and imposter speakers, and the objective is to faithfully capture the manifold which distinguishes the features of the target speakers from the features of the imposter speakers.

Figure 4.3: Examples of generative models that have been used for speech/speaker recognition: (a) HMMs where each state has a GMM which captures the statistics of a stationary segment of speech. (b) HMMs are trained by aligning the states to the utterance using a trellis diagram. Each path through the trellis (from start to end) specifies a possible sequence of HMM states that generated the utterance.

An example of a popular generative model used in speaker verification is the Gaussian Mixture Model (GMM), and an example of a popular discriminative model is the Support Vector Machine (SVM). The Hidden Markov Model (HMM) is also a generative model, which has been used extensively in speech recognition systems. In the following sections these classical techniques are briefly described, and the reader is referred to appropriate references [104] for details.

4.4.1 Generative Models

Generative models, mainly Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs), capture the empirical probability density function corresponding to the acoustic feature vectors.
GMMs represent a particular case of HMMs and can be viewed as a single-state HMM where the probability density is defined by a mixture of Gaussians.

GMM-based modeling. GMMs have unique advantages compared to other modeling approaches because their training is relatively fast and the models can be scaled and updated to add new speakers with relative ease. A GMM model λ is composed of a finite mixture of multivariate Gaussian components and estimates a general probability density function p_λ(x) according to:

p_λ(x) = Σ_{i=1}^{M} w_i p_i(x)

where M is the number of Gaussian components and w_i is the prior probability (mixing weight) of the i-th D-variate Gaussian density function p_i(x), given by

p_i(x) = (1 / ((2π)^{D/2} |Σ_i|^{1/2})) exp( −(1/2)(x − µ_i)^T Σ_i^{-1} (x − µ_i) )

The parameters µ_i and Σ_i represent the mean vector and covariance matrix of the multi-dimensional Gaussian distribution, and the mixing weights w_i are constrained according to Σ_{i=1}^{M} w_i = 1. GMMs have been used extensively in speaker recognition systems. Usually in these systems, a speaker-independent world model, also known as a universal background model (UBM), is first trained using speech data gathered from a large number of imposter speakers [118]. The training procedure typically uses an iterative expectation-maximization (EM) algorithm [119] which estimates the parameters µ_i and Σ_i using a maximum likelihood criterion [120]. More details on the EM training procedure can be found in numerous references [119, 120, 121]. The background model obtained after training thus represents a speaker-independent distribution of the feature vectors. When enrolling a new speaker into the system, the parameters of the background model are adapted to the feature vector distribution of the new speaker using maximum a posteriori (MAP) update rules. In this way, the model parameters do not need to be estimated from scratch; instead, the previously estimated priors are used for re-training. There are alternative adaptation methods to MAP, and the selection of the method usually depends on the amount of available training data [122]. For very short enrollment utterances (a few seconds), other methods, such as Maximum Likelihood Linear Regression (MLLR) [52], have been shown to be more effective.

Hidden Markov Models (HMMs). By construction, GMMs are static models that do not take into account the dynamics inherent in the speech vectors. In this regard, HMMs [104] are statistical models that capture the temporal dynamics of speech production as an equivalent first-order Markov process. Fig. 4.3 shows an example of a simple HMM which comprises a sequence of states with a GMM associated with each state. In this example, each state represents a stationary unit of the speech signal, also known as a "tri-phone". The training procedure for HMMs involves an EM algorithm, where the feature vectors are first temporally aligned to the states using a dynamic programming procedure and the aligned feature vectors are used to update the parameters of the state GMMs. During the recognition procedure, the most probable sequence of states/phones is estimated (again using a dynamic programming procedure) for a given utterance. The scores generated by each state in the most probable sequence are accumulated to obtain the utterance- and speaker-specific likelihood. Because HMMs rely on the phonetic content of the speech signal, they have been used predominantly in speech recognition systems as well as in text-dependent speaker verification systems [123].
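The toy sketch below illustrates the GMM-UBM recipe described above using scikit-learn. It is only an outline under simplifying assumptions: the random vectors stand in for real MFCC frames, and proper MAP adaptation of the UBM means is replaced by re-fitting a speaker model initialized at the UBM parameters.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(5)

    # Stand-ins for MFCC frames: pooled "imposter" data for the UBM, plus a target speaker.
    ubm_frames = rng.normal(0.0, 1.0, size=(5000, 13))
    target_frames = rng.normal(0.6, 1.0, size=(400, 13))
    test_frames = rng.normal(0.6, 1.0, size=(200, 13))

    # Universal background model trained on the imposter pool.
    ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0).fit(ubm_frames)

    # Simplified "enrollment": re-fit a speaker model initialised at the UBM parameters
    # (a stand-in for proper MAP adaptation of the UBM means).
    speaker = GaussianMixture(n_components=8, covariance_type='diag', random_state=0,
                              means_init=ubm.means_, weights_init=ubm.weights_)
    speaker.fit(target_frames)

    # Verification score: average frame-level log-likelihood ratio against the UBM.
    llr = np.mean(speaker.score_samples(test_frames) - ubm.score_samples(test_frames))
    print(f"average log-likelihood ratio: {llr:.3f}")  # compared against a decision threshold

The log-likelihood ratio computed here is the same quantity that the authentication module described later in this chapter compares against a threshold.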
4.4.2 Discriminative Models

Discriminative models are optimized to minimize the error on a set of genuine and impostor training samples. They include, among many other approaches, Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs).

Support Vector Machines. SVMs are an attractive choice for implementing discriminative models: they provide good performance in speaker recognition systems even with relatively few data points in the training set, and a bound on the performance error can be estimated directly from the training data [68]. This is important because only a limited amount of data is usually available for the target speaker. The learning ability of the classifier is controlled by a regularizer in the SVM training, which determines the trade-off between its complexity and its generalization performance. In addition, the SVM training algorithm finds, under general conditions, a unique classifier topology that provides the best out-of-sample performance [68]. The key concept behind an SVM-based approach is the use of kernel functions which map the feature vectors to a higher-dimensional feature space using a non-linear transformation Φ(.). Fig. 4.4(c) illustrates an example of this mapping operation from a two-dimensional feature space to a three-dimensional space.

Figure 4.4: Discriminative models: (a) General structure of an SVM with radial basis functions as kernel. (b) Structure of a multi-layer ANN consisting of two hidden layers. (c) An example of a kernel function K(x, y) = (x·y)², which maps a non-linearly separable classification problem (left) into a linearly separable problem (right) using a non-linear mapping Φ(.).

In the original feature space the data points corresponding to the binary classes (denoted by "circles" and "squares") are not linearly separable. In the higher-dimensional space the data points are linearly separable and can be classified correctly by a linear hyper-plane. A binary (two-class) SVM comprises a linear hyper-plane constructed in the higher-dimensional space and is given by

f(z) = <w, Φ(z)> + b

where <., .> defines an inner-product in the higher-dimensional space and w, b are the parameters of the hyper-plane. The hyper-plane parameters w are obtained as a linear expansion over the training features Φ(x_n), n = 1, ..., N, as w = Σ_{n=1}^{N} a_n Φ(x_n), where a_n are the expansion coefficients. Accordingly, the inner-products in the expression for f(z) convert into kernel expansions over the training data x_n, n = 1, ..., N, according to

f(z) = <w, Φ(z)> + b    (4.1)
     = Σ_{n=1}^{N} a_n <Φ(x_n), Φ(z)> + b    (4.2)
     = Σ_{n=1}^{N} a_n K<x_n, z> + b    (4.3)

where K<., .> denotes any symmetric positive-definite kernel that satisfies the Mercer condition and is given by K<x, z> = <Φ(x), Φ(z)>, which is an inner-product in the higher-dimensional feature space. For example, in Fig. 4.4(c) the kernel function corresponding to Φ(.) is given by K(x, z) = (<x, z>)². The use of a kernel function avoids the curse of dimensionality by avoiding direct inner-product computation in the higher-dimensional feature space. Some other examples of valid kernel functions are radial basis functions K(x_i, x_j) = exp(−σ||x_i − x_j||²) or polynomial functions K(x_i, x_j) = [1 + (x_i·x_j)]^p. Training of the SVM involves estimating the parameters a_i and b by optimizing a quadratic objective function.
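A minimal SVM scoring sketch is given below, using scikit-learn's soft-margin SVC with an RBF kernel (the logistic SVM and GiniSVM variants mentioned next are not used here). The fixed-length "utterance vectors" are synthetic stand-ins for the reduced and normalized features described in the following steps, and all sizes are arbitrary assumptions.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(6)

    # Fixed-length utterance representations: target-speaker vectors vs. imposter vectors.
    target = rng.normal(0.8, 1.0, size=(60, 20))
    imposter = rng.normal(0.0, 1.0, size=(300, 20))
    X = np.vstack([target, imposter])
    y = np.concatenate([np.ones(len(target)), np.zeros(len(imposter))])

    # Soft-margin SVM with an RBF kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
    clf = SVC(kernel='rbf', C=1.0, gamma='scale', probability=True).fit(X, y)

    # Score an unknown utterance vector; the calibrated probability acts as the decision score.
    test_vector = rng.normal(0.7, 1.0, size=(1, 20))
    print("P(target) =", clf.predict_proba(test_vector)[0, 1])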
The exact form of the objective function depends on the topology of the SVM (soft-margin SVM [124], logistic SVM [125] or GiniSVM [166]), and there exist open-source software packages implementing these different algorithms. The following two key steps are the basis for SVM-based recognition:

• Feature reduction and normalization: Due to variability in the duration of utterances, the objective of this step is to reduce/equalize the size of the feature vectors to a fixed-length vector. One possible approach is to use clustering or random selection to determine a pre-determined number of representative vectors. Another approach uses the scores obtained from a generative model (GMM or HMM) as the fixed-dimensional input vector. The features are then scaled and normalized before being processed by an SVM.

• Kernel modeling: The reduced and normalized feature vectors are used to model each speaker using different types of kernel functions, such as linear, quadratic, or exponential kernels. For each frame of the feature vector corresponding to the "non-silence" segment of the speech signal, the SVM generates a score, and the scores are integrated over the entire utterance to obtain the final decision score. Since the scores are integrated, it is important that the SVM outputs are properly calibrated. In this regard, logistic SVMs and GiniSVMs are useful and have been shown to deliver more robust verification performance compared to traditional soft-margin SVMs.

Fig. 4.5 shows an example of an SVM-based speaker verification system.

Figure 4.5: Functional architecture of an SVM-based speaker verification system: (left) the extracted features are first aligned, reduced and normalized. The speaker-specific and speaker non-specific features are combined to create a dataset used for SVM training. (right) The soft-margin SVM determines the parameters of a hyperplane that separates the target and non-target datasets with the maximum margin.

Artificial neural networks (ANNs). Artificial neural networks [126] have also been used in acoustic recognition systems and are based on discriminative learning. One such example of an ANN is the Multilayer Perceptron (MLP), which is a feed-forward neural network comprising multiple layers, each layer comprising multiple nodes (as shown in Fig. 4.4(b)). Each node computes a linear weighted sum over its input connections, where the weights of the summation are the adjustable parameters. A non-linear transfer function is applied to the result to compute the output of that node. The weights of the network are estimated by gradient descent based on the backpropagation algorithm. An MLP for speaker verification classifies speaker and impostor accesses by scoring each frame of the test utterance. The final utterance score is the mean of the MLP's output over all the frames in the utterance. Despite their discriminative power, MLPs present some disadvantages. The main disadvantage is that their optimal configuration is not easy to select, and a lot of data is needed for the training and cross-validation steps.

Fusion. Fusion refers to the process of combining information from multiple sources of evidence to improve the performance of the system. The technique has also been applied in acoustic recognition, where a number of different feature sets are extracted from the speech signal and a different classifier is trained on each feature set.

Figure 4.6: An example of fusion of low-level and high-level features for a speaker verification system.
The scores produced by each of the classifiers are then combined to arrive at a decision. Ideally, the information contained in the different features should be independent, so that each classifier focuses on a different region of the discrimination boundary. Fig. 4.6 shows an example of a fusion technique for a speaker verification system that combines "low-level" features like cepstrum or pitch with "high-level" features like prosody or conversational patterns. However, performance gains can also be obtained by fusing different low-level spectral features (e.g., MFCCs and LPCCs), as they contain some independent spectral information.

Authentication. The authentication module in speaker verification/recognition systems uses the integrated likelihood scores to determine whether the utterance belongs to the target speaker or to an imposter. Mathematically, the task is equivalent to hypothesis testing: given a speech segment X and a claimed identity S, the speaker verification system chooses one of the following hypotheses:

H_s: X is pronounced by S
H̄_s: X is not pronounced by S

The decision between the two hypotheses is usually based on a likelihood ratio given by

Λ(X) = p(X|H_s) / p(X|H̄_s)

where p(X|H_s) and p(X|H̄_s) are the integrated likelihood scores (probability density functions) generated by the classifier, and H_s is accepted when Λ(X) exceeds a threshold Θ and rejected otherwise. Setting the threshold Θ appropriately for a specific speaker verification application is a challenging task, since it depends on environmental conditions such as the SNR. The threshold is usually chosen during the development phase and is speaker-independent. However, to be more accurate, the threshold parameter should be chosen to reflect speaker peculiarities and inter-speaker variability. Furthermore, if there is a mismatch between the development and test data, the optimal operating point could differ from the pre-determined threshold.

4.5 Robust acoustic recognition

The area of acoustic recognition has existed for the last couple of decades, but there still exists a large number of challenges that need to be addressed. For example, in the area of speaker recognition/verification the amount of speech data available during enrollment is important in order to obtain good speaker-specific models, especially for generative models like GMMs and HMMs. However, for forensic applications only limited data may be available due to limited access to the target speaker. This was confirmed during the NIST-SRE evaluations [127], where it was shown that increasing the duration of the utterance improves the recognition performance. Another challenge in speaker recognition systems is intra-speaker variability. This challenge arises because the speaker's voice can change due to aging, illness, emotions, tiredness and potentially other cosmetic factors, so that the model trained during the training phase might not represent all possible states of the speaker. One of the proposed solutions to this problem is an incremental technique which captures both the short- and long-term evolution of a speaker's voice [128]. Among the open problems in the area of acoustic recognition, the mismatch between the training and recognition phases is of particular importance, as it can limit the application of such systems in real-world scenarios.
Mismatch in recording conditions between the training and the test/recognition phases poses the main challenge for acoustic recognition systems. Differences in the telephone handset, in the transmission channel and in the recording devices can all introduce variability into the recordings and decrease the accuracy of the system. This decrease in accuracy is mainly due to the statistical models capturing not only the speaker characteristics but also the environmental ones. Hence, the system decision may be biased if the recognition environment differs from the training environment. A generic framework that models artifacts in an acoustic recognition system is shown in Fig. 4.7, where the sources of interference can arise either from additive channel noise or from convolutive channel effects. To make speech-based recognition systems more robust to channel variations, state-of-the-art systems either use a noise-robust feature extraction algorithm or suitably adapt the models. Fig. 4.7(b) summarizes the approaches that have been used. These approaches in general consist of robust feature extraction techniques and robust modeling. In the following, a review of robust feature extraction techniques is presented, as the technique proposed in this research falls into this category.

Figure 4.7: (a) Equivalent model of additive and channel noise in an acoustic recognition system. (b) Different techniques used for designing robust acoustic recognition systems.

4.5.1 Robust Feature Extraction

Different feature-based approaches have been proposed to compensate for cross-channel effects. These include well-known and widely used techniques such as cepstral mean subtraction (CMS) [59], RASTA filtering [44], and variance normalization [104], as well as more recently developed techniques for speaker recognition systems such as feature warping [130], stochastic matching [129], and feature mapping [131]. A brief overview of these techniques is presented here:

Cepstral mean subtraction. In the Cepstral Mean Subtraction (CMS) method, the mean of the cepstral coefficients (MFCC or LPCC) computed over a frame of speech is removed from each of the coefficients. The rationale behind CMS is based on "homomorphic" filtering principles, where it can be shown that slow variations in channel conditions are reflected as offsets in the MFCC coefficients. However, CMS is not suitable for an additive white noise channel. In addition to the mean subtraction, the variance of the coefficients is also sometimes normalized to improve the noise robustness of the cepstral features.

RASTA filtering. RASTA (RelAtive SpecTrA) is a generalization of the CMS method to compensate for cross-channel mismatch. The method was first introduced to enhance the robustness of speech recognition systems and has since been used for speaker recognition systems as well. In RASTA filtering, the low- and high-frequency components of the cepstral coefficient trajectories are removed using cepstral band-pass filters.

Feature warping. Feature warping was designed for speaker recognition systems with the aim of constructing a more robust cepstral feature distribution by whitening, hence generating an equivalent normal distribution over each frame of speech. This method delivers more robust performance than the mean and variance normalization technique; however, the approach is more computationally intensive.
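A small sketch of the cepstral mean subtraction and variance normalization described above is given below. Here the statistics are computed over the whole feature sequence, which is one common convention and an assumption on this sketch's part rather than the exact configuration discussed in the text.

    import numpy as np

    def cepstral_mean_variance_norm(features, eps=1e-10):
        """Cepstral mean subtraction with optional variance normalization.

        `features` is a (num_frames, num_coefficients) array of MFCC/LPCC vectors;
        the mean (and standard deviation) over the time axis is removed from every
        frame, suppressing slowly varying channel offsets.
        """
        features = np.asarray(features, dtype=float)
        mean = features.mean(axis=0, keepdims=True)
        std = features.std(axis=0, keepdims=True)
        return (features - mean) / (std + eps)

    # Toy usage: a constant cepstral offset (stationary channel) disappears after CMS/CMVN.
    rng = np.random.default_rng(7)
    clean = rng.standard_normal((300, 13))
    channel_offset = rng.uniform(-2, 2, size=(1, 13))
    normalized = cepstral_mean_variance_norm(clean + channel_offset)
    print(np.abs(normalized.mean(axis=0)).max())   # close to zero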
The approach is a supervised normalization technique which transforms the channel specific features to a channel independent feature space such that the channel variability is reduced. This is achieved with a set of channel dependent GMMs which are adapted from a channel-independent root model. During the recognition phase, the most likely channel (highest GMM likelihood) is detected, and the relationship between the root model and the channel-dependent model is used for mapping the vectors into channel-independent space. While the spectral features (MFCC and LPC) accurately extract linear information of speech signals, by construction they do not capture information about nonlinear or higher-order statistical characteristics of the signals, which have been shown to be not insignificant [132, 133]. One of the hypotheses is that many of the non-linear features in speech remain in tact even when the speech signal is corrupted by channel noise. Previous studies in this area have approximated auditory time-series by a low-dimensional non-linear dynamical model. In [133], it was demonstrated 71 that sustained vowels from different speakers exhibit a nonlinear, non-chaotic behavior that can be embedded in a low dimension manifold of order less than four. Other non-linear speech feature extraction approaches include non-linear transformation/mapping [134, 135], non-linear Maximum Likelihood Feature Transformation [136], kernel based time-series features [137, 138, 139], non-linear discriminant techniques [140], neural predictive coding [141] and other auxiliary methods [142, 143]. we will propose a novel feature extraction technique in which it can extract robust non-linear manifolds embedded in speech signal. The method uses non-linear filtering properties of a functional regression procedure in a reproducing kernel Hilbert space (RKHS). The procedure is semi-parametric and does not make any assumptions on the channel statistics. The hypothesis is that robustness in speech signal is encoded in high-dimensional temporal and spectral manifold which remains intact even in presence of ambient noise. In the following section we will introduce a benchmark setup in order to evaluate our features. 4.6 Robust speaker modeling Several session compensation techniques have been recently developed for both GMM and SVMbased speaker models. Factor analysis (FA) techniques [144] were designed for the GMM-based recognizer and take explicit use of the stochastic properties of the GMM, whereas the methods developed for SVM-based models are often based on linear transformations. One such linear transform based approach uses Maximum Likelihood Linear Regression (MLLR) approach to transform the input parameter of the SVM. MLLR transforms the mean vectors of a speaker-independent model as µk = Aµk + b, where µk is the adapted mean vector, µk is the world model mean ´ ´ vector and the parameters A and b are parameters of the linear transform. A and b are estimated by maximizing the likelihood of the training data with a modified EM algorithm. Other normal72 ization techniques for SVMs have also been reported which include nuisance attribute project (NAP) [145, 146] which uses the concept of eigenchannels and withinclass covariance normalization (WCCN) [147, 148] that reweighs each dimension based on different techniques like principal component analysis (PCA). 
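As a small illustration of the MLLR adaptation just described, the sketch below applies a single affine transform to the world-model mean vectors; the estimation of A and b with the modified EM algorithm is assumed to have been performed elsewhere, and the array shapes are assumptions of this sketch.

import numpy as np

def mllr_adapt_means(means, A, b):
    """Apply the MLLR transform mu'_k = A mu_k + b to every GMM mean vector.

    means: (num_components, dim) speaker-independent (world-model) means
    A:     (dim, dim) transform matrix, b: (dim,) bias vector
    """
    return means @ A.T + b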
The Nuisance attribute project (NAP) uses an appropriate projection matrix, P in the feature space to remove subspaces that contain unwanted channel or session variability from the GMM supervectors. The projection matrix filters out the nuisance attributes (e.g. session/channel variability) in the feature space by P = I − UUT , where U is the eigenchannel matrix. NAP requires a corpus labeled with speaker and/or session information. The underlying principle behind factor analysis (FA) when applied to GMMs is the following: When speech samples are recorded from different handsets, the super-vectors or the means of the GMMs could vary and hence require some sort of channel compensation and calibration before they can be compared. For channel compensation to be possible, the channel variability has to be modeled explicitly and the technique that has been used is called joint factor analysis (JFA) [144, 149]. The JFA model considers the variability of a Gaussian supervector as a linear combination of the speaker and channel components. Given a training sample, the speakerdependent and channel (session) dependent supervector M is decomposed into two statistically independent components as M = s + c, where s and c are referred to as the speaker and channel (session) supervectors, respectively. The channel variability is explicitly modeled by the channel model of the form c = Ux where U and x are the channel factors estimated from a given speech utterance and the columns of the matrix U are the eigen-channels estimated for a given dataset. During enrollment, the channel factors x are to be estimated jointly with the speaker factors y of the speaker model of the form s = m + Vy + Dz , where m is the UBM supervector, V is a rectangular matrix with each of its columns referred to as the eigenvoices and D is a parameter 73 matrix of JFA and z is a latent variable vector for JFA. In this formulation, JFA can be viewed as a two-step generative model which models different speakers under different sessions. The core JFA algorithm comprises the first level and the second or the output level is the GMM generated using the first level. If we consider all the parameters that affect the mean of each component in output GMM, the mean of the session dependent GMM can be expressed as Mki = mk + Uk xi + Vk ys(i) + Dk zks(i) with the indices k correspond to different GMM components, i corresponds to session, and s(i) for the speaker in session i. The system parameters are mk , Uk , Vk , and Dk where xi , ys(i) , and zks(i) are hidden speaker and session variables. In other approaches the GMM and the SVM principles can be combined to achieve robustness. In [150], the generative GMM-UBM model was used for creating “feature vectors” for the discriminative SVM speaker modeling. For example the mean and the variance of the GMM-UBM states could be used as feature vector for SVM training. When the means of the GMMs are normalized by their variance, the resulting feature vectors are known as supervectors, which have been used in SVM training. The SVM kernel function could be also appropriately chosen that reflects the distance between the pdfs generated by the GMMs. One such measure is the Kullback-Leibler (KL) divergence measure between GMMs. Another extension is the GMM-UBM mean interval (GUMI) kernel which uses a bounded Bhattacharyya distance [151]. The GUMI kernel exploits the speakers information conveyed by the mean of GMM as well as those by the covariance matrices in an effective manner. 
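A minimal sketch of the NAP projection described at the start of this passage, assuming the eigenchannel matrix U (with orthonormal columns) has already been estimated from a corpus labeled with session information:

import numpy as np

def nap_project(supervectors, U):
    """Apply P = I - U U^T to GMM supervectors without forming the full
    D x D projector: v <- v - U (U^T v) removes the nuisance subspace.

    supervectors: (num_utterances, D) stacked GMM supervectors
    U:            (D, R) eigenchannel matrix with orthonormal columns
    """
    return supervectors - (supervectors @ U) @ U.T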
Another alternative kernel known as probabilistic sequence kernel (PSK) [152] uses output values of the Gaussian functions rather than the Gaussian means to create supervectors. Other SVM approach based on Fisher kernels [125] and probabilistic distance ker74 nels [153] have also been introduced where they use generative sequence models for SVM speaker modeling. Similar hybrid methods have been used for HMMs and SVMs but for applications in speech recognition. 4.7 Score normalization As the name suggests, the score normalization techniques aim to reduce the score variabilities across different channel conditions. The process is equivalent to adapting the speaker-dependent threshold which was briefly discussed in Section 2.3. Most of the normalization techniques used in speaker verification are based on the assumption that the impostors scores follow a Gaussian distribution where the mean and the standard deviation depend on the speaker model and/or test utterance. Different score based normalization techniques have been proposed which includes Znorm [154], Hnorm [155], Tnorm [156], and Dnorm [157]. We describe some of these scores in this section. ZNorm. In zero normalization (ZNorm) technique, a speaker model is first tested against a set of speech signals produced by an imposter, resulting in an imposter similarity score distribution. Speaker-dependent mean and variance normalization parameters are estimated from this distribution. One of the advantages of Znorm is that the estimation of the normalization parameters can be performed offline during the enrollment phase. TNorm. TNorm The test normalization (TNorm) is another score normalization technique in which the mean and the standard deviation parameters are estimated using a test utterance. The TNorm is known to improve the performances particularly in the region of low false alarm. However, TNorm has to be performed online while the system is being evaluated. There are several variants of the ZNorm and TNorm that aim to reduce the microphone and transmission channels effects. Among the variants of ZNorm, are the Handset 75 Normalization (HNorm and the Channel Normalization (CNorm). In the last approach, handset or channel-dependent normalization parameters are estimated by testing each speaker model against a handset or channel-dependent set of imposters. During testing, the type of handset or channel related to the test utterance is first detected and then the corresponding sets of parameters are used for score normalization. The HTNorm, a variant of TNorm, uses basically the same idea as the HNorm. Here, handset-dependent normalization parameters are estimated by testing each test utterance against handset-dependent imposter models. DNorm. Both TNorm and ZNorm procedure rely on availability of imposter data. However, when the imposter data is not available an alternate normalization called DNorm can be applied [157] where the pseudo-imposter data are generated from the trained background model using Monte-Carlo techniques. In the next chapter a novel robust speech feature extraction method will be presented. Both speaker verification and speech recognition results will also be shown in order to present the consistant performance improvement of this new features compared to the conventional methods. 76 Chapter 5 Hierarchical Kernel Auditory Features This chapter introduces a novel speech feature extraction algorithm using a hierarchical model. 
This hierarchical model consists of two levels: in the first level, the similarity of the auditory sensory world is measured with a regularized kernel regression technique in a reproducing kernel Hilbert space (RKHS); in the second level, the noise-robust features are chosen using a pooling function. The features, known as Sparse Auditory Reproducing Kernel (SPARK) coefficients, are extracted under the hypothesis that the noise-robust information in the speech signal is embedded in a subspace spanned by an overcomplete and regularized set of gammatone basis functions. Computing the SPARK features involves correlating the speech signal with a pre-computed matrix, thus making the algorithm amenable to DSP-based implementation.

5.1 Motivation for hierarchical kernel auditory features

Unlike human audition, the performance of speech-based recognition systems degrades significantly in the presence of noise and background interference [25, 40]. This can be attributed to the inherent mismatch between training and deployment conditions, especially when the characteristics of all possible noise sources are not known in advance. Several strategies have therefore been presented in the literature to mitigate this mismatch; they can be broadly categorized into three main groups.

The first strategy is to improve speech recognition robustness by enhancing the speech signal before feature extraction. Speech enhancement techniques have been designed to improve the perception of speech by listeners in noisy conditions or to improve the performance of speech recognition systems. Spectral subtraction (SS), which suppresses the additive noise in the spectral domain, is widely used due to its simplicity [41].

The second strategy is to make the front-end feature extraction more robust across different conditions. Most of the methods in this group modify well-established techniques in order to make them more robust. Cepstral mean normalization (CMN) [42] and cepstral variance normalization [43] improve speech recognition performance by adjusting the feature mean and variance in the cepstral domain in order to reduce the convolutive channel distortion. Relative spectra (RASTA) [44] suppresses the acoustic noise through high-pass (or band-pass) filtering applied to a log-spectral representation of speech. In recent years, new methods have been proposed to make existing features robust by using more advanced signal processing techniques. Examples include feature-space non-linear transformation techniques [45], the ETSI advanced front end (AFE) [46, 162], stereo-based piecewise linear compensation for environments (SPLICE) [47], and power-normalized cepstral coefficients (PNCC) [49]. AFE, for example, integrates several methods to remove both additive and convolutive noise: a two-stage Mel-warped Wiener filter combined with SNR-dependent waveform processing is used to reduce the additive noise, and blind equalization is used to mitigate the channel effects. Some methods in this group are designed to be inherently robust to mismatched conditions, drawing inspiration from human hearing [48].
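Before turning to the third strategy, the spectral subtraction step mentioned above can be illustrated with a minimal single-frame sketch; the way the noise magnitude is estimated and the spectral-floor value are assumptions of this sketch, not part of the original formulation in [41].

import numpy as np

def spectral_subtraction(noisy_fft, noise_mag, floor=0.02):
    """Magnitude-domain spectral subtraction for one windowed frame.

    noisy_fft: complex spectrum of the noisy frame
    noise_mag: estimated noise magnitude spectrum, e.g. averaged over
               leading non-speech frames (assumed available here)
    """
    mag = np.abs(noisy_fft)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # apply a spectral floor
    return clean_mag * np.exp(1j * np.angle(noisy_fft))   # keep the noisy phase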
The third strategy aims at making the classifier more robust by adjusting or adapting the parameters of the speech models; examples include stochastic pattern matching methods [50], the maximum likelihood estimation (MLE) based signal bias removal method [51], the maximum likelihood linear regression (MLLR) method [52], the parallel model combination (PMC) method [53, 54], and joint compensation of additive and convolutive distortions (JAC) based methods [55, 56, 57]. Even though significant improvements in recognition performance can be expected from the third approach, the overall system performance is still limited by the quality of the speech features extracted by the second approach. Therefore, in this research we focus on the extraction of speech features that are robust to the mismatch between training and testing conditions.

Traditionally, the speech features used in most state-of-the-art speech recognition systems have relied on spectral-based techniques, including Mel-frequency cepstral coefficients (MFCCs) [104], linear predictive coefficients (LPCs) [104, 58, 106], and perceptual linear prediction (PLP) [60]. Noise-robustness is then achieved by modifying these well-established techniques to compensate for channel variability. For example, cepstral mean normalization (CMN) [42] and cepstral variance normalization [43] adjust the mean and variance of the speech features in the cepstral domain and in the process reduce the effect of convolutive channel distortion. Another example is the relative spectra (RASTA) [44] technique, which suppresses the acoustic noise by high-pass (or band-pass) filtering of the log-spectral representation of speech. More recently, advanced signal processing techniques have been proposed to improve noise-robustness. These include feature-space non-linear transformation techniques [45], the ETSI advanced front end (AFE) [46, 162], stereo-based piecewise linear compensation for environments (SPLICE) [47], and power-normalized cepstral coefficients (PNCC) [163]. The AFE approach, for example, integrates several methods to remove the effects of both additive and convolutive noise: a two-stage Mel-warped Wiener filter, combined with SNR-dependent waveform processing, is used to reduce the effect of additive noise, and a blind equalization technique is used to mitigate the channel effects. Other efforts in this group have used auditory models to extract features for speech recognition systems. For example, the ensemble interval histogram (EIH), an auditory model proposed in [48], has been used as a front-end for speech recognition systems. The EIH is composed of a cochlear filterbank where the output of each filter is attached to a level-crossing detector.

An alternate and more promising approach towards extracting noise-robust speech features is to use data-driven statistical learning techniques that do not make strict assumptions about the spectral properties of the speech signal. Examples include kernel-based techniques [166, 167], which operate under the premise that robustness in the speech signal is encoded in high-dimensional temporal and spectral manifolds that remain intact even in the presence of ambient noise. In [167], we presented a reproducing kernel Hilbert space (RKHS) based regression to extract high-dimensional and noise-robust speech features. The procedure required solving a quadratic optimization problem for each frame of speech, thus making the data-driven approach highly computationally intensive.
Also, due to their semi-parametric nature, the methods proposed in [166, 167] did not incorporate any a-priori information available from neurobiological or psychoacoustical studies. However, it has recently been demonstrated that cortical neurons use highly efficient and sparse encoding of visual and auditory signals [168, 61, 62]. The study [62] showed that auditory signals can be represented by a group of basis functions which are functionally similar to gammatone functions. Gammatone functions are equivalent to time-domain representations of human cochlear filters and have also been used in psychoacoustical studies [169, 170]. Other studies by auditory neurophysiologists [174, 175, 176] indicate that there is hierarchical processing in the human auditory cortex, where the received signal is first broken down into basic features that are later integrated into more complex stimuli. These studies [177, 178] also indicate that the so-called spectro-temporal receptive fields (STRFs) in the auditory cortex (AI) can capture different frequencies, spectral scales, and temporal rates. Several researchers have begun to apply these recent developments in neuroscience to speech recognition systems [179, 180, 181, 182, 183]. For example, the authors in [181] filtered spectrograms of speech signals with spectro-temporal kernels derived from recordings in the primary auditory cortex of the ferret. The study [182] presents a model for extracting patch-based features for a word-spotting system, in which a set of patches is randomly extracted from the spectrum of the training data and, in the testing phase, a fixed amount of time-frequency flexibility is given to the extracted patches in order to match them with those from a potential target.

The motivation in this research is to apply the kernel-based approach [166, 167] to a reproducing kernel Hilbert space (RKHS) spanned by gammatone basis functions and extract sparse, noise-robust, discriminative speech features. The result of incorporating this a-priori information is that SPARK features can be extracted in real-time using pre-computed projection matrices while at the same time demonstrating superior noise-robustness compared to state-of-the-art features.

5.2 Hierarchical architecture

In this section we present the two main units of the proposed hierarchical architecture used to generate the speech features. For the analysis presented in this section, we assume that a window of the speech signal is first extracted and that the following regression technique is applied to the overlapping windows.

Figure 5.1: A set of 25 gammatone kernel basis functions with center frequencies spanning 100 Hz to 4 kHz in the ERB space.

5.2.1 Regularized kernel optimization

Given a stationary, discrete-time speech signal x[n] \in R, where n = 1, ..., N denotes the time indices, the objective of the regression is to estimate the parameters of a manifold f: R^P \to R that captures the information embedded in x[n]. The function f is assumed to be formed by a linear superposition of time-shiftable basis functions \phi_m, m = 1, ..., M, according to

f[n] = \sum_{m=1}^{M} \sum_{i=1}^{K} b_{i,m} \phi_m[n - \tau_{i,m}]  (5.1)

where \tau_{i,m} indicates the temporal position of the i-th instance of the m-th gammatone function, and b_{i,m} denotes a scaling factor. In this model, all the basis functions and their time-shifted variants are zero-padded to have a length of K.

Figure 5.2: Acyclic convolution matrix \Phi_i for gammatone basis function \phi_i.
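A sketch of how one block \Phi_i of the dictionary implied by equation (5.1) and Fig. 5.2 can be assembled is given below; the spacing of the time shifts \tau (the shift argument) is an assumption of this sketch and corresponds to the quantity varied later in Table 5.4.

import numpy as np

def convolution_block(phi, N, shift):
    """Acyclic convolution block for one basis function: each column holds a
    copy of phi delayed by a multiple of `shift` samples and zero-padded or
    truncated to the frame length N."""
    taus = np.arange(0, N, shift)
    Phi = np.zeros((N, len(taus)))
    for j, tau in enumerate(taus):
        seg = phi[:N - tau]                # truncate near the end of the frame
        Phi[tau:tau + len(seg), j] = seg
    return Phi

The full dictionary is then obtained by stacking such blocks side by side, as formalized below in equations (5.4)-(5.5).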
Note that it has been shown [61, 62] that this shift-invariant mathematical representation can derive efficient auditory codes through unsupervised sparse learning. For the purpose of this research we chose the gammatone basis with fixed parameters, as explained later in this section. The reason for choosing these basis functions is based on physiological data showing that the cochlea exhibits the following characteristics: (a) non-uniform filter bandwidths, where the frequency resolution is higher at lower frequencies than at higher frequencies; (b) the peak gain of the filter centered at f_c decreases as the level of the input increases; and (c) the cochlear filters are spaced more closely at lower frequencies than at higher frequencies. Another reason for choosing the gammatone basis functions is that the authors of [61, 62] showed that unsupervised sparse learning over a dataset of natural sounds converges to gammatone-shaped basis functions when the shift-invariant model of (5.1) is used to represent the audio signal. Based on the above, we chose the gammatone basis functions over the uniform short-time Fourier transform (STFT). The gammatone basis functions can be mathematically expressed as cochlear filters given by

\phi_m[n] = a_m n^{k-1} \cos(2\pi f_m n) e^{-2\pi b \, \mathrm{ERB}(f_m) n}  (5.2)

where f_m is the center frequency parameter (note that gammatone functions are time-domain representations of band-pass cochlear filters), a_m is the amplitude normalization parameter, and k is the order of the gammatone basis. The center frequencies are uniformly spaced according to an equivalent rectangular bandwidth (ERB) scale [171]. The parameter b controls the damping ratio of the gammatone basis and is proportional to the ERB of the center frequency. In this research, we have chosen the order k and the parameter b to be 4 and 1.019, respectively [172], and the following equation for ERB(f_m) (suggested by Glasberg and Moore [173]):

\mathrm{ERB}(f_m) = 0.108 f_m + 24.7.  (5.3)

A set of 25 gammatone kernel basis functions with the above parameters is shown in Fig. 5.1. Note that in equation (5.1) a single gammatone basis can occur multiple times during the speech signal, and hence the model is sufficiently rich to capture different time-frequency variations.
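A minimal sketch for generating the gammatone kernels of equation (5.2) with the parameter choices above is given below. Treating the discrete index n as time in seconds (n/fs), normalizing each kernel to unit norm for a_m, and using the Glasberg-Moore ERB-rate formula for the ERB-scale spacing are assumptions of this sketch.

import numpy as np

def gammatone_kernel(fm, fs, duration_ms=25.0, k=4, b=1.019):
    """phi_m[t] = a_m * t^(k-1) * cos(2*pi*fm*t) * exp(-2*pi*b*ERB(fm)*t),
    with ERB(fm) = 0.108*fm + 24.7 (equations 5.2 and 5.3)."""
    t = np.arange(1, int(fs * duration_ms / 1000.0) + 1) / fs
    erb = 0.108 * fm + 24.7
    phi = t ** (k - 1) * np.cos(2 * np.pi * fm * t) * np.exp(-2 * np.pi * b * erb * t)
    return phi / np.linalg.norm(phi)   # a_m chosen so that ||phi_m|| = 1

def erb_spaced_centers(num=25, f_lo=100.0, f_hi=4000.0):
    """Center frequencies uniformly spaced on the ERB-rate scale,
    ERBrate(f) = 21.4 * log10(0.00437*f + 1)."""
    e = np.linspace(21.4 * np.log10(0.00437 * f_lo + 1),
                    21.4 * np.log10(0.00437 * f_hi + 1), num)
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

# Example: the 25 kernels of Fig. 5.1 at an assumed 8 kHz sampling rate.
# kernels = [gammatone_kernel(fc, fs=8000.0) for fc in erb_spaced_centers(25)]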
Equation (5.1) can be expressed in matrix form as

f = \Phi b  (5.4)

where the basis matrix \Phi is defined as

\Phi = [\Phi_1 \ \Phi_2 \ \cdots \ \Phi_M]  (5.5)

and \Phi_i is an N \times K acyclic convolution matrix, one for each basis function (see Fig. 5.2). Hence, the matrix \Phi has dimensions N \times MK, and the regression procedure presented next estimates the parameters b_{i,m}.

To solve the function regression problem, we first assume that f is an element of a Hilbert space, f \in H, where the inner product between two functional elements f, g \in H is denoted \langle f, g \rangle_H. It can be proved [72] that to every RKHS H there corresponds a unique positive definite function K, called the reproducing kernel of H, with the reproducing property f[n] = \langle f[l], K[l, n] \rangle. The function K behaves in H as the Kronecker delta function does in L_2 [72, 73]. For the purpose of this work, we take the Hilbert space to be the set of functions of the form defined in equation (5.1) and define the scalar product in this space to be

\left\langle \sum_{m=1}^{M} \sum_{i=1}^{K} b_{i,m} \phi_m[n - \tau_{i,m}], \ \sum_{m=1}^{M} \sum_{i=1}^{K} c_{i,m} \phi_m[n - \tau_{i,m}] \right\rangle_H \equiv \sum_{m=1}^{M} \sum_{i=1}^{K} b_{i,m} c_{i,m}.  (5.6)

Equation (5.6) shows that the norm of the RKHS has the form

\|f\|_{H_K}^2 = \sum_{m=1}^{M} \sum_{i=1}^{K} b_{i,m}^2.

The regression involves minimizing the following cost function with respect to f:

\min_{f \in H} C(f) = \sum_{n=1}^{N} L(x[n], f[n])  (5.7)

where L(\cdot, \cdot) is a loss function. While numerous choices of loss function are possible, such as the L_1 loss or Vapnik's \epsilon-insensitive (L_\epsilon) loss [68], in this work we have chosen the L_2 loss given by

L(x[n], f[n]) = \|x[n] - f[n]\|_2^2.  (5.8)

To the cost function (5.7) we add a stabilizer, or regularizer, \Omega(f) to ensure that the solution is more robust. The regularized cost function is given by

\min_{f \in H} H(f) = \sum_{n=1}^{N} L(x[n], f[n]) + \lambda \Omega(f)  (5.9)

where \Omega(f) is chosen here as

\Omega(f) = \|f\|_{H_K}^2,  (5.10)

which is the norm in the Hilbert space H defined by the positive definite function K. \|f\|_{H_K}^2 determines the smoothness of f based on the regularization parameter \lambda (see the seminal work of [72]). In fact, it has been shown in [66, 73] that when H is an RKHS defined by specific types of kernels, the use of this regularizer is equivalent to low-pass filtering with a cut-off determined by the hyper-parameter \lambda.

The cost function (5.9) can be written as

J(b_{i,m}) = \sum_{n=1}^{N} \left( x[n] - \sum_{m=1}^{M} \sum_{i=1}^{K} b_{i,m} \phi_m[n - \tau_{i,m}] \right)^2 + \lambda \sum_{m=1}^{M} \sum_{i=1}^{K} b_{i,m}^2.  (5.11)

Taking the derivative of (5.11) with respect to the parameter b_{i,m} and equating it to zero yields

\frac{\partial J}{\partial b_{i,m}} = 2\lambda b_{i,m} - 2 \sum_{n=1}^{N} \alpha[n] \phi_m[n - \tau_{i,m}] = 0  (5.12)

where \alpha[n] is the reconstruction error for the speech sample at time instant n = 1, ..., N, given by

\alpha[n] = x[n] - \sum_{m=1}^{M} \sum_{i=1}^{K} b_{i,m} \phi_m[n - \tau_{i,m}] = x[n] - f[n].  (5.13)

This leads to the minimizer of (5.11),

b_{i,m} = \frac{1}{\lambda} \sum_{n=1}^{N} \alpha[n] \phi_m[n - \tau_{i,m}],  (5.14)

or, in matrix form,

b^* = \frac{1}{\lambda} \Phi^T \alpha.  (5.15)

Now f can be written in terms of \alpha as

f = \frac{1}{\lambda} \Phi \Phi^T \alpha.  (5.16)

Using equations (5.13) and (5.16), (5.11) reduces to

\min_{\alpha} \ \left( x - \frac{1}{\lambda} \Phi \Phi^T \alpha \right)^T \left( x - \frac{1}{\lambda} \Phi \Phi^T \alpha \right) + \frac{1}{\lambda} \alpha^T \Phi \Phi^T \alpha  (5.17)

with the optimum solution written in matrix form as

\alpha^* = \lambda (\Phi \Phi^T + \lambda I)^{-1} x.  (5.18)

Using equation (5.18), the optimal b^* can be calculated as

b^* = \Phi^T (\Phi \Phi^T + \lambda I)^{-1} x = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T x.  (5.19)

By applying the kernel trick, equation (5.19) can be written as

b^* = (K(\Phi, \Phi) + \lambda I)^{-1} K(\Phi, x).  (5.20)

The optimal vector b^* measures the similarity of the input speech signal with each of the basis functions, and these parameters are passed to the second level of the computational model for more complex operations.
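The first-level regression of equations (5.19)-(5.20) amounts to a regularized least-squares solve against the pre-computed dictionary; a minimal sketch (using the linear kernel, for which K(\Phi, \Phi) = \Phi^T \Phi) is shown below. Because \Phi and \lambda are fixed, the matrix (\Phi^T \Phi + \lambda I)^{-1} \Phi^T can be pre-computed once and applied to every frame with a single matrix-vector product, which is what makes the feature extraction amenable to real-time DSP implementation.

import numpy as np

def spark_projection_matrix(Phi, lam=0.01):
    """Pre-compute (Phi^T Phi + lambda*I)^{-1} Phi^T from equation (5.19)."""
    G = Phi.T @ Phi
    return np.linalg.solve(G + lam * np.eye(G.shape[0]), Phi.T)

def spark_coefficients(x, P):
    """b* for one windowed speech frame x, given the pre-computed matrix P."""
    return P @ x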
5.2.2 Pooling mechanism

An important consequence of projecting the speech signal onto a normalized gammatone function space (representing the STRFs), as shown in equation (5.20), is that the high-energy elements (in the \|\cdot\|_2 sense) of the parameter vector b capture the salient and noise-robust aspects of the speech signal in terms of spectral scales, frequencies, and temporal rates. The low-energy components of b, on the other hand, are more susceptible to corruption by noise and can be eliminated. We therefore apply a weighting function \zeta(\cdot) to the elements of |b| to obtain a more noise-robust representation of the speech signal. This weighting function also emulates the psychoacoustical non-linear relation between the intensity of a sound and its perceived loudness. After the non-linear weighting, a pooling function chooses the winner based on either a summation ("SUM") or a maximum ("MAX") operation. Note that this pooling mechanism is local and chooses the winner from a set of gammatone functions that all share the same center frequency. The pooling mechanisms we used are:

\Psi_m(b) = \zeta\left( \sum_{i=1}^{K} |b_{i,m}| \right),  (5.21)

\Psi_m(b) = \sum_{i=1}^{K} \zeta(|b_{i,m}|),  (5.22)

and

\Psi_m(b) = \max_{i=1,...,K} \zeta(|b_{i,m}|).  (5.23)

Figure 5.4: Colormaps depicting b vectors (left column) and the IDCT of SPARK feature vectors (right column) obtained for utterances of the digits "1" and "9", respectively.

5.3 Sparse auditory reproducing kernel coefficients

The flow-chart describing the SPARK feature extraction procedure is summarized in Fig. 5.3.

Figure 5.3: Signal flow of the SPARK feature extraction.

The input speech signal is processed by a pre-emphasis filter of the form x_pre(t) = x(t) - 0.97 x(t-1), after which a 25 ms speech segment is extracted using a Hamming window. The parameter vector b^* is obtained using the kernel regression procedure described in Section 5.2.1. A pooling mechanism then chooses the parameters that are more robust to noisy conditions. Note that this pooling stage also uses a non-linear weighting function to emulate the psychoacoustical non-linear relation between the intensity of sound and its perceived loudness. Then, a discrete cosine transform (DCT) is applied to de-correlate the features. As in MFCC-based feature extraction, only the first 13 coefficients are used as features. We further apply mean normalization (MN) to the feature vectors and append the velocity (∆) and acceleration (∆∆) parameters to obtain a 39-dimensional feature vector for each speech frame. Fig. 5.4 (left column) shows the regression vectors b for utterances of the digits "1" and "9", and the right column shows the inverse DCT transformation of the corresponding SPARK feature vectors. The figures visually depict the discriminatory nature of the SPARK features for these two utterances.

5.4 Experiments and performance evaluation

In this section, two setups are presented for the evaluation of the proposed SPARK features: one for a speech recognition system and the other for a speaker verification system.

5.4.1 Speech recognition setup

In order to compare the speech recognition results with state-of-the-art systems reported in the literature, we set up a benchmark system based on the standard AURORA 2 speech recognition task [158]. The setup includes a hidden Markov model (HMM) based speech recognition architecture, implemented using the hidden Markov model toolkit (HTK) package [159]. By the end of the training phase, we have a whole-word HMM for each digit, with 16 states per HMM and three diagonal-covariance Gaussian mixture components per state, in addition to the "sil" and "sp" models. The AURORA 2 database [158] consists of recognizing English digits in the presence of additive noise and linear convolutional distortion. All the speech data in this database are derived from the TIDigits database at a sampling rate of 8 kHz. The original TIDigits database contains digit sequences and was originally designed and collected at Texas Instruments Inc. (TI) in 1982. There are 326 speakers in this database (111 men, 114 women, 50 boys, and 51 girls), each pronouncing 77 digit sequences, and each speaker group is split into test and training subsets.
The corpus was collected in a quit acoustic environment using an Electro-Voice RE-16 Dynamic Cardiod microphone, digitized at 20 kHz. In the AURORA 2 database, there are two training mods: training on clean data and multiconditional training on noisy data. The “clean training” corresponds to TIDigits training data downsampled to 8 kHz and filtered with a G712 characteristics. The “multiconditional training” corresponds to TIDigits training data downsampled to 8 kHz and filtered with a G712 characteristics with four different noises added artificially to the data at several SNRs (20 dB, 15 dB, 10 dB, 5 dB, and clean where no noise added), therefore 20 different conditions are taken as input for this 92 mode. Three testing sets are provided for the evaluation of the Aurora-2 task. The first testing set (set A) contains 4 subsets of 1001 utterances corrupted by subway, babble, car, and exhibition hall noises, respectively, at different SNR levels (20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB, and clean where no noise added). The second set (set B) contains 4 subsets of 1001 utterances corrupted by restaurant, street, airport, and train station noises at different SNR levels. These distortions have been synthetically introduced to clean (TIDigits) data. The test set C contains 2 subsets of 1001 sentences, corrupted by subway and street noises. The data set C was filtered with the MIRS filter [161] before the addition of noise in order to evaluate the robustness of the speech recognition systems under convolutional distortion mismatch. The above back-end HMM-based speech recognition system is used with three different feature extraction algorithms for the comparison purpose. (a) Figure 5.5: Signal flow of the MFCC feature extraction The basic ETSI front-end [158, 160] is based on Mel Frequency Cepstral Coefficients (MFCCs) which has been widely used in speech based recognition/identification systems [104]. The signal flow of MFCC based feature extraction is shown in Fig. 5.5. The ETSI basic front-end generates the MFCCs with the following parameters. Speech, sampled at 8 kHz, is windowed into frames of 93 size 200 samples with 80 samples between frames. A logarithmic frame energy measure is calculated for each frame before any processing takes place. Then each frame undergoes pre-emphasis using a filter coefficient equal to 0.97. A Hamming window is then used prior to taking an FFT. Then a magnitude spectrum estimate is used before the filter bank. The basic front-end generates a feature vector consisting of 13 coefficients made up of the frame log-energy measure and cepstral coefficients C1 to C12. In the recognition experiments, velocity and acceleration coefficients are appended to the 13 static features above, to give a total of 39 elements in each feature vector. Table 5.1: AURORA 2 clean training word accuracy results when ETSI FE is used. 
Set A Sub Clean 20dB 15dB 10dB 5dB 0dB -5dB Avg 99.11 97.11 91.34 74.64 45.47 17.44 8.50 61.94 Bab Car 98.85 98.81 93.65 97.55 79.26 93.08 54.63 74.20 26.96 35.55 8.92 11.69 2.78 8.65 52.15 59.93 Set B Set C Exh Res Str Air Stat Sub Str 99.20 97.04 91.05 73.09 40.64 14.69 8.89 60.66 99.11 92.45 80.44 59.29 32.05 12.28 4.88 54.36 98.85 96.37 89.96 67.26 37.03 16.63 8.95 59.29 98.81 96.00 87.92 66.57 34.92 14.97 8.35 58.22 99.20 97.35 91.45 70.81 32.09 10.64 7.96 58.50 99.23 95.58 87.96 70.62 41.76 15.54 8.75 59.92 98.97 96.13 90.45 72.70 44.32 19.62 9.95 61.73 Table 5.1 shows the accuracy results of the benchmark speech recognition system on AURORA 2 dataset when ETSI basic front-end (FE) is used. Using this front-end, the avarage word accuracies are %58.67, %57.59, and %60.83 for set a, set b, and set c respectively. Conventional gammatone filterbank uses the auditory gammatone filterbank in order to extract more robust features as shown in Fig. 5.6. In this settings, first a preemphasis of the form xpre (t) = x(t) − 0.97x(t − 1) is applied. Then the short-time Fourier transform is performed using Hamming windows of duration 25 ms, with 10 ms between frames, and we used 26 gammatone filters (with the exact gammatone parameters we used in extracting SPARK features). After that, log compression is performed and each speech signal is parameterized with a DCT transformation 94 Figure 5.6: Signal flow for the conventional Gammatone filterbank features, note that this figure shows each frame of speech after two steps of pre-emphasis and windowing. of order 13. The parameters were normalized to have zero mean complemented by their first and second derivatives for a total of 39 coefficients. The results are shown in Fig. 5.6 with the avarage word accuracy of %64.53, %67.49, and %65.46 on set a, set b, and set c respectively. Table 5.2: AURORA 2 word recognition results when conventional Gammatone filter-bank (GT) is used. Set A Set B Set C Sub Clean 20dB 15dB 10dB 5dB 0dB -5dB Avg Bab Car Exh Res Str Air Stat Sub Str 99.23 96.62 92.60 79.61 50.26 23.55 14.86 65.25 99.33 97.94 94.71 83.40 53.08 22.43 12.76 66.24 98.96 96.99 93.47 78.20 41.57 19.83 12.47 63.07 99.26 96.67 91.89 77.75 46.37 20.67 12.22 63.55 99.23 97.97 95.33 85.69 60.12 27.30 12.96 68.37 99.33 97.58 93.65 82.44 53.02 23.61 12.85 66.07 98.96 97.67 95.23 86.85 59.23 28.93 15.00 68.84 99.26 97.81 94.97 83.52 52.92 24.28 13.92 66.67 99.14 96.90 92.88 80.04 50.60 23.55 14.55 65.38 99.37 97.49 93.38 81.08 52.09 22.70 12.73 65.55 95 ETSI advanced front-end is the most recent ETSI standard front-end feature extraction [46, 162]. ETSI advanced front-end (AFE) integrates several methods to remove both additive and convolutive noises. A two-stage Mel-warped Wiener filtering combined with a SNR-dependent waveform processing is used to reduce the additive noise and a blind equalization is used to mitigate the channel effects. The word accuracy results using AFE on the AURORA 2 dataset is shown in Table 5.3. Table 5.3: AURORA 2 clean training word accuracy results when ETSI AFE is used. 
Set A Set B Set C Sub Clean 20dB 15dB 10dB 5dB 0dB -5dB Avg Bab Car Exh Res Str Air Stat Sub Str 99.08 97.91 96.41 92.23 83.82 61.93 30.86 80.32 99.00 98.31 96.89 92.35 81.08 51.90 19.71 77.03 99.05 98.48 97.58 95.29 88.49 66.42 30.84 82.31 99.23 97.90 96.82 92.78 84.05 63.28 32.86 80.99 99.08 97.97 95.33 90.08 76.27 51.09 18.67 75.50 99.00 97.64 96.74 92.78 83.28 60.07 29.87 79.91 99.05 98.39 97.11 93.47 84.07 60.99 28.54 80.23 99.23 98.36 96.73 93.77 84.57 62.57 29.96 80.74 99.08 97.36 95.33 90.24 79.03 51.73 24.62 76.77 99.03 97.70 95.77 90.69 78.17 52.09 25.57 77.00 SPARK front-end We extracted the SPARK features for speech recognition experiments using the procedure described in section 5.3. A 25-ms window with a 10-ms shift has been used and the vector b has been extracted using 26 kernel gammatone basis functions. In the following experiments the effect of changing different parameters of SPARK features on the performance of the speech recognition system is demonstrated and then a full comparison with the benchmark described above is presented. In order to reduce the computational complexity of the algorithm, we reduced the size of matrix Φ by taking to account different of time-shifts of gammatone basis function. In order to see the effect of using different kernel functions, we changed the K in equation (5.20). The results presented in Table 5.5 where we used different kernel functions of linear 96 Table 5.4: The effect of different time-shifts on the SPARK features. Set A SPARK, Shift=3.125 ms SPARK, Shift=4.375 ms SPARK, Shift=7.375 ms SPARK, Shift=11.875 ms Set B Set C 72.33 73.02 71.57 71.79 72.48 70.97 70.60 70.63 69.74 64.58 64.37 63.28 K(x, y) = xyT , exponential K(x, y) = exp(cxyT ), sigmoid K(x, y) = tanh(axyT + c), and polynomial K(x, y) = (xyT )d . Table 5.5: The effect of different kernel functions on the SPARK features. Set A SPARK, Exponential kernel, c = 0.01 SPARK, Exponential kernel, c = 1.0 SPARK, Sigmoid kernel, a = 0.01, c = 0 SPARK, Sigmoid kernel, a = 0.01, c = −0.01 SPARK, Linear kernel SPARK, Polynomial kernel, d = 2 SPARK, Polynomial kernel, d = 4 Set B Set C 69.83 69.22 68.35 69.84 67.80 70.77 67.89 71.45 71.16 70.60 71.48 69.65 71.14 68.24 69.52 68.24 68.89 69.54 68.30 71.07 68.05 In order to investigate the effect of different pooling mechanisms, we compared the proposed features with different pooling mechanism. First we fixed the pooling function to be Ψm (bi.m ) = (∑ ) K |b ζ | , and changed the nonlinear weighting function ζ(.) where the results are prei=1 i.m sented in Table 5.6 for the polynomial kernel of degree 4 and λ = 0.01. The results presented in this table clearly show that the non-linear weighting function has a huge effect on the performance of the recognition system. We ran the same experiments with different pooling function and different kernel function where the results are presented in Tables 5.7 and 5.8. These experiments also show the importance of the non-linear weighting function in extracting SPARK features. Parameter λ controls the smoothness of the regularized regression network presented in sec97 Table 5.6: The effect of different pooling mechanisms (different ζ) when Ψ = max ζ(|b|) and K(x, y) = tanh(0.01xyT − 0.01). Set A SPARK, ζ(.) = (.)1/3 SPARK, ζ(.) = (.)1/11 SPARK, ζ(.) = (.)1/13 SPARK, ζ(.) = (.)1/15 SPARK, ζ(.) = (.)1/17 SPARK, ζ(.) 
= (.)1/19 Set B Set C 64.91 70.91 70.27 69.83 68.83 68.35 65.60 72.32 71.96 71.24 70.75 70.36 62.60 70.19 69.68 68.88 68.44 68.10 ∑ Table 5.7: The effect of different pooling mechanisms (different ζ) when Ψ = ζ( |b|) and K(x, y) = tanh(0.01xyT − 0.01). Set A SPARK, ζ(.) = (.)1/3 SPARK, ζ(.) = (.)1/11 SPARK, ζ(.) = (.)1/13 SPARK, ζ(.) = (.)1/15 Set B Set C 66.39 71.26 70.62 69.84 66.50 72.32 72.00 71.48 65.59 70.90 70.25 69.54 ∑ Table 5.8: The effect of different pooling mechanisms (different ζ) when Ψ = ζ( |b|) and K(x, y) = (xyT )4 . Set A SPARK, ζ(.) = (.)1/13 SPARK, ζ(.) = (.)1/15 SPARK, ζ(.) = (.)1/17 SPARK, ζ(.) = (.)1/19 SPARK, ζ(.) = (.)1/25 SPARK, ζ(.) = (.)1/35 Set B Set C 67.11 67.89 69.06 69.59 70.80 70.90 67.71 68.24 69.01 69.44 70.79 71.87 66.90 68.05 69.34 69.74 70.96 70.50 tion 5.2.1. In order to see the effect of this parameter on the speech recohnition performance we ran experiments where we fixed all the parameters except λ. The results are presented in Table 5.9. As the results show the regularization made the features more robust to the noise in general. We also compared the SPARK features with the ETSI basic fron-end. Fig. 5.7, 5.8, 5.9, 5.10, 98 Table 5.9: The effect of λ on extracting the SPARK features. Set A Set C 72.33 71.41 69.18 64.12 SPARK, λ = 0.01 SPARK, λ = 0.0001 SPARK, λ = 0.00001 SPARK, λ = 0.000001 Set B 73.02 72.35 69.73 64.79 71.57 70.25 67.99 62.68 and 5.11 compare the word error-rate obtained by SPARK (with λ = 0.001) and basic ETSI frondend based recognizers. The experimental results demonstrate a reduction in the word-error-rate (WER) by 31%, 36%, and 27% for set A, set B, and set C. Subway Noise Babble Noise 80 Accuracy 100 80 Accuracy 100 60 40 MFCC 20 0 0 10 SNR (dB) (a) 15 40 MFCC 20 SPARK 5 60 20 0 0 SPARK 5 10 SNR (dB) (b) 15 20 Figure 5.7: Speech recognition accuracy obtained in additive noisy (subway and bable) environments on AURORA 2 database. We ran another set of experiments to compare the SPARK features to the state-of-the-art ETSI AFE front-end. ETSI AFE uses noise estimation, two-pass Wiener filter-based noise suppression, and blind feature equalization techniques. To incorporate an equivalent noise-compensation to the SPARK features, we used the power bias subtraction (PBS) [163] method. PBS method resembles in some ways to the conventional spectral subtraction (SS), but instead of estimating noise from non-speech parts which usually needs a very accurate voice activity detector (VAD), PBS simply subtracts a bias where the bias is adaptively computed based on the level of the background noise. 99 Car Noise Exhibition Noise 80 Accuracy 100 80 Accuracy 100 60 40 MFCC 20 0 0 10 SNR (dB) (a) 15 40 MFCC 20 SPARK 5 60 0 0 20 SPARK 5 10 SNR (dB) (b) 15 20 Figure 5.8: Speech recognition accuracy obtained in additive noisy (car and exhibition) environments on AURORA 2 database. Restaurant Noise Street Noise 80 Accuracy 100 80 Accuracy 100 60 40 MFCC 20 0 0 10 SNR (dB) (a) 15 40 MFCC 20 SPARK 5 60 20 0 0 SPARK 5 10 SNR (dB) (b) 15 20 Figure 5.9: Speech recognition accuracy obtained in additive noisy (restaurant and street) environments on AURORA 2 database. Table 5.10 shows the performance of SPARK+PBS recognition system under different types of noise. These results can be compared to Table 5.3 where they show that SPARK+PBS system consistently performs better than the ETSI AFE all noise types except subway and exhibition noise at low SNR. In fact, SPARK shows an overall relative improvements of 4.69% with respect to the ETSI AFE. 
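For reference, a minimal sketch of the second-level stage that Tables 5.6-5.8 sweep (a power-law weighting \zeta(u) = u^{1/p} followed by MAX pooling over the time shifts of each basis, then the DCT truncation of Section 5.3) is given below; the reshape ordering of b and the absence of any further scaling are assumptions of this sketch.

import numpy as np
from scipy.fftpack import dct

def spark_pool(b, num_bases, p=15, num_ceps=13):
    """MAX pooling of equation (5.23) with zeta(u) = u^(1/p), followed by a
    DCT keeping the first num_ceps coefficients.

    b is assumed to be ordered in num_bases contiguous blocks, one block of
    time-shifted instances per gammatone basis (row m <-> basis m).
    """
    B = np.abs(b).reshape(num_bases, -1)
    pooled = (B ** (1.0 / p)).max(axis=1)          # local MAX over time shifts
    return dct(pooled, norm='ortho')[:num_ceps]    # de-correlate and truncate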
Table 5.11 shows a comparative performance of SPARK+PBS features against other baseline 100 Airport Noise Station Noise 80 Accuracy 100 80 Accuracy 100 60 40 MFCC 20 0 0 10 SNR (dB) (a) 15 40 MFCC 20 SPARK 5 60 0 0 20 SPARK 5 10 SNR (dB) (b) 15 20 Figure 5.10: Speech recognition accuracy obtained in additive noisy (airport and station) environments on AURORA 2 database. Subway (MIRS) Noise Street (MIRS) Noise 80 Accuracy 100 80 Accuracy 100 60 40 MFCC 20 0 0 10 SNR (dB) (a) 15 40 MFCC 20 SPARK 5 60 20 0 0 SPARK 5 10 SNR (dB) (b) 15 20 Figure 5.11: Speech recognition accuracy obtained in different convolutive noisy environments on AURORA 2 database. systems. The results clearly show that the SPARK+PBS demonstrates improvement over the baseline systems even in clesn condition but the advantage of SPARK+PBS features become more apparent under noisy conditions. 101 Table 5.10: AURORA 2 word recognition results when SPARK and PBS were used together. Set A Set B Set C Sub Clean 20dB 15dB 10dB 5dB 0dB -5dB Avg Bab Car Exh Res Str Air Stat Sub Str 99.36 98.10 96.41 92.94 82.87 59.26 27.97 79.56 99.12 98.70 97.64 95.37 86.61 58.19 21.58 79.60 99.19 98.69 98.03 95.47 88.76 71.28 34.54 83.71 99.38 98.15 96.64 92.69 81.67 56.77 25.24 78.65 99.36 98.83 97.51 94.32 82.99 56.77 21.95 78.82 99.12 98.37 97.58 94.04 84.22 60.85 27.48 80.24 99.19 98.90 98.30 96.60 89.41 69.52 32.03 83.42 99.38 98.58 97.59 95.06 86.76 66.52 33.35 82.46 99.32 97.82 96.41 92.05 80.60 54.81 25.02 78.00 99.09 98.04 96.80 93.59 82.98 57.13 25.57 79.03 Table 5.11: AURORA 2 clean training word accuracy results. Set A Set B Set C ETSI FE WI007 58.67 57.59 60.83 ETSI AFE WI008 80.16 79.10 76.89 Conventional GT 64.53 67.49 65.46 SPARK + PBS 80.38 81.24 78.52 5.4.2 Speaker verification setup For this setup we used a support vector machine (SVM) based speaker verification system in order to discriminate the target speaker from the imposters. This system is based on an open source machine learning software library (Torch [164, 165]). In this system, we used the GMM supervector linear kernel (GSLK) prposed by [150] to measure the dissimilarity between two GMMs, where each GMM obtained by adapting the world model. We used 200 Gussian mixtures for the world model. NIST database Since 1996, the speech group of the National Institute of Standards and Technology (NIST) has been organizing evaluations of text-independent speaker recognition/verification technologies. During the evaluation, a unique data-set and an evaluation protocol are provided to each of the participating research group. The objective is to provide a fair comparison between dif102 ferent speaker verification systems even though the identity of the systems is not publicly revealed. The effectiveness of the proposed features is evaluated on the NIST 2003 Speaker Recognition Evaluations (SRE) corpus. For the purpose of this work we used the one speaker cellular detection task in NIS SRE 2003 as the evaluation set while for training of the world model (Universal Background Model (UBM)) we used 457 examples from NIST SRE 2000. For evaluation speakers, there were about 2 minutes of speech for training the target speaker models and each test attempt was 15 to 45 seconds long. We only used the male speakers. The evaluation set consists of 149 target speakers. The total number of attempts in the evaluation 17,772 with 10% of true target attempts. Evaluation metric Typically the performance of a speaker verification system is determined by the errors generated by the recognition. 
There are two types of errors that can occur during a verification task: (a) false acceptance when the system accepts an imposter speaker; and (b) false rejection when the system rejects a valid speaker. Both types of errors are a function of the decision threshold. Choosing a high threshold of acceptance will result in a secure system that will accept only a few trusted speakers, however, at the expense of high false rejection rate (FRR). Similarly choosing a low threshold would make the system more user friendly by reducing false rejection rate but at the expense of high false acceptance rate (FAR). This trade-off is typically depicted using a decision-error trade-off (DET) curve whose example is shown in Fig. 5.12. The FAR and FRR of a verification system defines different operating points on the DET curve. These operating points (shown in Fig. 5.12) vary according to their definition and are considered different performance metrics of the speaker verification system. We describe the commonly used ones below: Detection Cost Function (DCF): The DCF is a weighted sum of the two error rates and com103 Figure 5.12: An example of DET curve which plots the FRR with respect to FAR. puted as follows: DCF = (CF RR × F RR × PT arg ) + (CF AR × F AR × (1 − PT arg )) (5.24) where CF AR and CF RR denote the cost of false acceptance and cost of false rejection; and 104 PT arg denotes the prior probability that the utterance belongs to the target speaker. For instance, in evaluations conducted by National Institute of Standards and Technology (NIST), CF AR , CF RR ,and PT arg are assumed to be 10, 1, 0.01. Minimum DCF (min. DCF ) which is the performance metric of the verification system is defined as the smallest value of (5.24) computed over the cross-validation set when the decision threshold is varied. Another related metric is the actual DCF which is the minimum value of (5.24) computed over the test set for the entire range of the decision threshold. An example of the min DCF and actual DCF metric is shown on the DET curve in Fig. 5.12. Equal Error Rate (EER): An alternative performance measure for speaker verification is the EER which is defined as the FAR which is equal to FRR (see 5.12). Thus, smaller the EER of the system, the superior is the verification system. As a benchmark system, the verification system described above was developed using MFCC features. MFCCs were extracted for each window of 20ms with 10ms overlap between the adjacent windows. To extract the benchmark features, we used 24 band-pass filters between 300 and 3400HZ. Then each speech signal is parameterized with a DCT transformation of order 16, complemented by the log-energy and their first and second derivatives for a total of 51 coefficients, then all the frames were normalized in order to have a zero mean. For speaker verification task, we extracted the SPARK features using the procedure described in section 5.3. A 25-ms window with a 10-ms shift has been used and the vector b has been extracted using 26 kernel gammatone basis functions. Here we kept the first 16 coefficients after the DCT complemented by the first and second derivatives of SPARK features to create a feature vector of 51 coefficients. Fig. 5.13 shows the DET curve comparing the MFCC-CMN features with SPARK features where it clearly demonstrate the effectiveness of the proposed features. 105 SRE NIST 2003 40 SPARK MFCC-CMS False Rejection Rate 20 10 5 2 11 2 5 10 20 False Acceptance Rate Figure 5.13: DET curve comparing MFCC-CMN and SPARK features. 
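A minimal sketch of how the minimum DCF of equation (5.24) and the EER can be computed from sets of target and impostor scores by sweeping the decision threshold; the default costs follow the values quoted above, and the simple threshold sweep is an assumption of this sketch rather than the official NIST scoring procedure.

import numpy as np

def min_dcf_and_eer(target_scores, impostor_scores,
                    c_frr=1.0, c_far=10.0, p_target=0.01):
    """Sweep the threshold over all observed scores; a trial is rejected
    whenever its score falls below the threshold."""
    target_scores = np.asarray(target_scores)
    impostor_scores = np.asarray(impostor_scores)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    dcf = c_frr * frr * p_target + c_far * far * (1.0 - p_target)
    i = np.argmin(np.abs(frr - far))       # operating point closest to the EER
    return dcf.min(), 0.5 * (frr[i] + far[i])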
Chapter 6 Concluding Remarks and Future Directions

6.1 Summary and concluding remarks

In this work, a miniature acoustic recognition system was introduced in which the recording elements are placed at micro/nano-scale distances from each other. The mathematical model presented in Chapter 2 shows that recording on a miniature microphone array can be approximated with an instantaneous linear mixing model. In Chapter 3, a "smart" acquisition system was introduced. At the core of the proposed acquisition system is a min-max optimization of a regularized objective function that yields a sequence of quantized parameters which asymptotically tracks the statistics of the input signals and at the same time removes the cross-correlation of the input space. The proposed acquisition system therefore achieves signal de-correlation along with data conversion at a lower digital data bandwidth, unlike the conventional data acquisition approach of analog-to-digital conversion followed by a separate de-correlation process. The performance of this acquisition system was evaluated using synthetic and real recordings, and the experiments with miniature/compact microphone arrays showed a consistent improvement over a standard analog-to-digital converter for DSP-based source separation algorithms.

One of the limitations that prevents a miniature acoustic recognition system from being used in real-world applications is its robustness to noisy conditions. It was argued that this issue can be addressed with two general approaches: robust feature extraction techniques and robust modeling. This work also proposed a hierarchical model for robust feature extraction in order to make the miniature acoustic recognition system robust to noisy conditions. The proposed auditory features are extracted in two levels: in the first level of this computational model, the similarity of the sensory auditory world is measured through a kernel-based approach with a set of gammatone basis functions. The result of incorporating this a-priori information is that these signatures can be extracted in real-time using pre-computed projection matrices. In the second level of this model, the features are extracted using a pooling mechanism and fed into the acoustic recognition unit. The appeal of this approach is its robustness to different noisy conditions and its simplicity: it can be implemented in real-time using pre-computed matrices, and it is therefore suitable for the proposed miniature acoustic recognition system.

6.2 Future directions

Future work on enhancing the proposed Σ∆ learning for the "smart" acquisition system includes:

• Exploring higher-order noise-shaping Σ∆ modulators to improve the performance of resolution enhancement.

• Extending Σ∆ learning to non-linear signal transforms by embedding kernels into the optimization framework. Incorporating kernels in the signal transformation can capture interesting non-linear information from higher-order statistics of the signal.

• Extending Σ∆ learning to integrate a source separation technique with signal quantization, in which the ADC module not only provides the digital representation of the signal but also separates the signal of interest from other interferences. In this research framework, Σ∆ learning has been demonstrated to improve the performance of speech-based source separation algorithms, but the proposed technique is general and can be applied to any sensor array.
The potential applications include microphone array hearing aids, microelectrode array in neuroprosthetic devices, miniature radio-frequency antenna arrays and for radar applications. The future work in extending the proposed hierarchical kernel coefficients includes: • Learning the basis functions from a speech dataset or updating the gammatone parameters in order to be able to extract more information from the speech signal. • Exploring other type of basis functions like gammachirp instead of gammatone basis functions. • Even we introduced a hierarchical just model for the feature extraction module, but this work can be extended to have a hierarchical recognition module as well. • Exploring the use of hierarchical model for speech and audio coding. For the speaker verification system, in addition to the features used by the proposed system, there are many other sources of speaker information in the speech signal that can be used. These include idiolect (word usage), prosodic measures and other long-term signal measures. This work will be aided by the increasing use of reliable speech recognition systems for speaker verification research. These high-level features not only offer the potential to improve accuracy, they may also 109 help improve robustness since they should be less susceptible to channel effects and recent research in this regards show very promising results. 110 BIBLIOGRAPHY 111 BIBLIOGRAPHY [1] E. Vilches, I. Escobar, E. Vallejo, and C. Taylor, “Data Mining Applied to Acoustic Bird Species Recognition,” in Proc. of International Conference on Pattern Recognition, 2006. [2] D. J. Mennill, J. M. Burt, K. M. Fristrup, and S. L. Vehrencamp, “Accuracy of an Acoustic Location System for Monitoring the Position of Duetting Songbirds in Tropical Forest,” J. Acoustic Soc. America, vol. 119, pp. 2832-2839, 2006. [3] V. M. Trifa, A. N. G. Kirschel, and C. E. Taylor, “Automated species recognition of antbirds in a Mexican rainforest using hidden Markov models.,” Journal of Acoustical Society of America, vol. 123, 2008. [4] A. N. G. Kirschel, D. A. Earl, Y. Yao, I. A. Escobar, E. Vilches, E. E. Vallejo, and C. E. Taylor, “Using songs to identify individual Mexican antthrush Formicarius moniliger: Comparison of four classification methods,” Bioacoustics, vol. 19, 2009. [5] S. Young and M. Scanlon, “Robotic Vehicle uses Acoustic Sensors for Voice Detection and Diagnostic”, Proc. of SPIE, vol. 4024, pp. 72-83, 2000. [6] S. H. Young and M. V. Scanlon, “Detection and Localization with an Acoustic Array on a Small Robotic Platform in Urban Environments,” Progress Report, Army Research Lab , Adelphi, MD, 2003. [7] C. Clavel, T. Ehrette, and G. Richard, “Events Detection for an Audio-Based Surveillance System,” in Proc. of International Conference on Multimedia and Expo, pp. 13061309, 2005. [8] J. Rouas, J. Louradour, and S. Ambellouis, “Audio Events Detection in Public Transport Vehicle,” in Proc. of International Conference on Intelligent Transportation Systems, 2006. [9] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti “Scream and gunshot detection and localization for audio-surveillance systems,” in Proc. of IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 21-26, 2007. 112 [10] A. Pikrakis, T. Giannakopoulos and S. Theodoridis, “Gunshot detection in audio streams from movies by means of dynamic programming and bayesian networks,” in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2008. [11] V. Cevher, A. C. 