MULTICHANNEL SIGNAL DECOMPOSITION AND SEPARATION IN THE TIME-FREQUENCY DOMAIN

This is to certify that the dissertation entitled MULTICHANNEL SIGNAL DECOMPOSITION AND SEPARATION IN THE TIME-FREQUENCY DOMAIN, presented by Zeyong Shan, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering at Michigan State University.

2.1.1 The Spectrogram

The short-time Fourier transform (STFT) of a signal s(t) computed with an analysis window h(t) is

S_h(t, \omega) = \int e^{-j\omega\tau} s(\tau)\, h(\tau - t)\, d\tau, \qquad (2.1)

and the spectrogram is its squared magnitude:

P_{SP}(t, \omega) = |S_h(t, \omega)|^2. \qquad (2.2)

To obtain a good time resolution, a narrow window, h(t), in the time domain has to be picked. Similarly, to get a good frequency resolution, a narrow window, H(\omega), in the frequency domain has to be picked. Since h(t) and H(\omega) cannot both be made arbitrarily narrow, there is an inherent trade-off between time and frequency resolution in the spectrogram for a particular window. This is the reason why the spectrogram is not preferred for high resolution time-frequency analysis.

2.1.2 The Wigner Distribution

The Wigner distribution is defined mathematically in terms of the signal, s(t), as [20-22]

W(t, \omega) = \int s\!\left(t + \frac{\tau}{2}\right) s^*\!\left(t - \frac{\tau}{2}\right) e^{-j\tau\omega}\, d\tau. \qquad (2.3)

The Wigner distribution is said to be bilinear in the signal since the signal enters twice in its calculation. The Wigner distribution has many desired properties. For example, it satisfies the marginals requirement, and therefore preserves the energy. It is always real, even if the signal is complex. In addition, it is time and frequency shift invariant, and satisfies the finite support property in time and frequency. One of the major shortcomings of the Wigner distribution is the existence of negative energy terms. The Wigner distribution of multicomponent signals also exhibits the disturbing tendency of generating interference or cross-terms.

Despite these shortcomings, the Wigner distribution still shows some remarkable advantages over the spectrogram: its conditional averages are exactly the instantaneous frequency and the group delay, whereas the spectrogram fails to achieve this result no matter what window is chosen; moreover, the spectrogram often cannot provide the resolution required to distinguish the components in multicomponent signals as the Wigner distribution can. Thus, there is a need to develop more general distributions which preserve the advantages of the Wigner distribution and address most of its drawbacks. This leads to Cohen's class of generalized distributions.

2.1.3 Cohen's General Class of Time-Frequency Distributions

There is a considerable advantage to having a simple method to generate different time-frequency distributions. This allows one to pick and choose those with desirable properties. The most direct way is to generate the distributions from [23]:

C(t, \omega) = \iiint \phi(\theta, \tau)\, s\!\left(u + \frac{\tau}{2}\right) s^*\!\left(u - \frac{\tau}{2}\right) e^{j(\theta u - \theta t - \tau\omega)}\, du\, d\theta\, d\tau, \qquad (2.4)

where \phi(\theta, \tau) is a two-dimensional function called the kernel function, a term coined by Claasen and Mecklenbrauker [24], who, with Janssen [25], made many important contributions to the general understanding of the general class, particularly in the signal analysis context. The kernel function determines the distribution and its properties. For the Wigner distribution, the kernel function is one.
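To make the kernel formulation computationally concrete, the following Python sketch (illustrative, not code from the dissertation) evaluates the discrete pseudo-Wigner distribution, i.e. the member of Cohen's class (2.4) with kernel equal to one; the lag truncation, frequency grid, and test signal are assumptions made only for the example.

```python
import numpy as np

def wigner_distribution(s, n_freq=64):
    """Discrete pseudo-Wigner distribution of a complex signal s.

    Evaluates W[n, k] = sum_l s[n + l] conj(s[n - l]) e^{-j 4 pi (k / K) l}
    over the symmetric lags available at each time index n; this is the
    kernel-equals-one member of Cohen's class, cf. (2.3) and (2.4).
    """
    s = np.asarray(s, dtype=complex)
    N, K = len(s), n_freq
    freqs = np.arange(K) / K                    # normalized frequency grid
    W = np.zeros((N, K))
    for n in range(N):
        lmax = min(n, N - 1 - n)                # largest symmetric lag at time n
        lags = np.arange(-lmax, lmax + 1)
        prod = s[n + lags] * np.conj(s[n - lags])
        # DFT over the lag variable; the factor 2 accounts for the
        # effective lag of 2l between the two signal samples
        W[n] = np.real(np.exp(-2j * np.pi * 2.0 * np.outer(freqs, lags)) @ prod)
    return W

# Example: a Gaussian-windowed complex sinusoid (a Gabor logon)
t = np.arange(128)
logon = np.exp(-((t - 64) ** 2) / 72.0) * np.exp(1j * 0.4 * np.pi * t)
W = wigner_distribution(logon)
print(W.shape)   # the energy concentrates around the logon's time-frequency center
```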
There are three main reasons why the kernel idea is particularly useful for the study of time-frequency distributions. First of all, it is easy to generate distributions: one just chooses a kernel function. The second reason is that one can design distributions with certain properties by putting constraints on the kernel function. For example, for a distribution to satisfy the marginals

\int C(t, \omega)\, d\omega = |s(t)|^2, \qquad \int C(t, \omega)\, dt = |S(\omega)|^2, \qquad (2.5)

it has been shown that the kernel function must have the property

\phi(0, \tau) = \phi(\theta, 0) = 1. \qquad (2.6)

An extensive discussion of the properties of a distribution and the corresponding constraints on the kernel function can be found in [26-29]. The third reason is that when a new distribution is considered, its properties can readily be ascertained by examining its kernel. For example, if the kernel does not satisfy equation (2.6), then we know the distribution cannot satisfy the marginals.

2.1.4 Reduced Interference Distributions

It is known that both the spectrogram and the Wigner distribution are members of Cohen's class of distributions. Although the spectrogram has many useful properties, it often presents serious difficulties when used to analyze rapidly varying signals. If the analysis window is made short enough to capture rapid changes in the signal, it becomes impossible to resolve frequency components of the signal which are close in frequency. The Wigner distribution has been employed as an alternative to overcome this shortcoming. It provides a high resolution representation in time and frequency for a non-stationary signal such as a chirp. However, its energy distribution is non-positive and it often suffers from severe cross-terms between components in different time-frequency regions, potentially leading to confusion and misinterpretation. An excellent discussion on the geometry of interferences has been provided in [30-32].

Since the Wigner distribution sometimes gives artificial and undesirable values in the time-frequency domain, particularly when the signal is multicomponent, conditions on the kernel that minimize these spurious values in some sense are developed in [33-36]. These conditions require that the kernel \phi(\theta, \tau) decay as one moves away from the \theta and \tau axes. A way to describe this region is to observe that the product \theta\tau is large when we are away from either axis. Therefore, it is concluded that for cross-term minimization, \phi(\theta, \tau) should satisfy

\phi(\theta, \tau) \ll 1 \quad \text{for} \quad \theta\tau \gg 0. \qquad (2.7)

These kernels produce reduced interference distributions.
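As a concrete instance (the particular kernel is our illustrative choice; the text above does not single one out), the exponential kernel of Choi and Williams satisfies both the marginal constraint (2.6) and the decay condition (2.7):

```python
import numpy as np

def choi_williams_kernel(theta, tau, sigma=1.0):
    """Exponential kernel phi(theta, tau) = exp(-theta^2 tau^2 / sigma):
    equal to 1 on the theta and tau axes (so the marginals (2.5) survive)
    and small where |theta * tau| is large, which is condition (2.7).
    """
    return np.exp(-(theta ** 2) * (tau ** 2) / sigma)

theta, tau = np.meshgrid(np.linspace(-4, 4, 9), np.linspace(-4, 4, 9))
phi = choi_williams_kernel(theta, tau)
print(phi[4, :])    # tau = 0 row, i.e. along the theta axis: all ones
print(phi[0, 0])    # far from both axes: essentially zero
```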
2.2 Introduction to Information-Theoretic Measures

Using entropy based distance functionals is a well-known discrimination method in signal processing. These functionals are known as divergence measures and are applied directly on statistical models describing the signals. Measures of divergence between two probability distributions are used to associate, cluster, classify, compress, and restore signals, images and patterns in many applications [37,38]. Many different measures of divergence have been constructed and characterized [39,40].

Recent research in the application of information and entropy functionals on time-frequency distributions (TFDs) has proven the usefulness of distance measures for non-stationary signal analysis [41,42]. Entropy, when applied to a TFD, measures the number of components in a given signal, i.e. the complexity. Similarly, divergence measures computed between two time-frequency distributions can indicate the difference in complexity between the two signals. These measures could prove useful as time-frequency detection statistics in applications comparing reference and data distributions. In this section, we review some well-known information-theoretic distance measures for time-frequency distributions.

2.2.1 Entropy

Before introducing divergence measures, we first give a brief review of entropy, a basic concept in information theory. Entropy H, also called Shannon entropy, is defined for a discrete-valued random variable X as

H(X) = -\sum_i P(X = a_i) \log(P(X = a_i)), \qquad (2.8)

where P(\cdot) is the probability mass function of X, and the a_i are the possible values of X. Depending on the base of the logarithm, different units of entropy are obtained. Usually the logarithm with base 2 is used, in which case the unit is called a bit. According to the definition, the entropy of a random variable can be interpreted as the degree of information that the observation of the variable gives. The more "random", i.e., unpredictable and unstructured the variable is, the larger its entropy. Assume that the probabilities are all close to 0, except for one that is close to 1 (the probabilities must sum up to one). In that case, there is little randomness in the variable, since it almost always takes the same value, and this is reflected by a small entropy. On the other hand, if all the probabilities are equal, then they are relatively far from 0 and 1. This means that the entropy is large, which reflects the fact that the variable is really random; we cannot predict which value it takes.

The definition of entropy for a discrete-valued random variable can be generalized for a continuous-valued random variable, in which case it is often called differential entropy. The differential entropy H of a random variable x with density p_x(\cdot) is defined as

H(x) = -\int p_x(\xi) \log(p_x(\xi))\, d\xi. \qquad (2.9)

2.2.2 Rényi Entropy

Rényi entropy generalizes the Shannon entropy through an order parameter \alpha > 0. For a discrete-valued random variable X it is defined as

H_\alpha(X) = \frac{1}{1 - \alpha} \log\left( \sum_i P(X = a_i)^\alpha \right). \qquad (2.10)

As \alpha \to 1, H_\alpha converges to the Shannon entropy (2.8). For \alpha = 2, one obtains the collision entropy

H_2(X) = -\log\left( \sum_i P(X = a_i)^2 \right). \qquad (2.11)

As \alpha \to \infty, the limit exists as

H_\infty(X) = -\log\left( \sup_i P(X = a_i) \right). \qquad (2.14)

This is called min-entropy, because it is the smallest value of H_\alpha. The two latter cases are related by H_\infty \le H_2 \le 2 H_\infty, while on the other hand Shannon entropy can be arbitrarily high for a random variable X with fixed min-entropy.
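A quick numerical illustration of these definitions and of the stated relation H_\infty \le H_2 \le 2 H_\infty (the pmf below is an arbitrary example, not from the text):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (2.8) in bits of a discrete pmf p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def renyi_entropy(p, alpha):
    """Renyi entropy (2.10) of order alpha > 0, alpha != 1, in bits."""
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.125, 0.125])
print(shannon_entropy(p))          # 1.75 bits
print(renyi_entropy(p, 1.0001))    # ~1.75: Renyi -> Shannon as alpha -> 1
print(renyi_entropy(p, 2))         # collision entropy (2.11)
print(-np.log2(p.max()))           # min-entropy H_inf, eq. (2.14)
```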
2.2.3 Divergence Measures for Time-Frequency Distributions

The most general class of distance measures is known as Csiszar's f-divergence, which includes some well-known measures like the Hellinger distance, Kullback-Leibler divergence and Rényi divergence [40]. The divergence between two probability density functions p_1 and p_2 for this class of distance measures can be expressed as:

d(p_1, p_2) = g\left( E_1\left[ f\!\left( \frac{p_2}{p_1} \right) \right] \right), \qquad (2.15)

where f is a continuous convex function, g is an increasing function, and E_1 is the expectation operator with respect to p_1. The distance measures and their properties for time-frequency distributions are given below.

2.2.3.1 Kullback-Leibler Divergence

The most common distance measure used for probability distributions is the Kullback-Leibler divergence measure. This measure can be adapted to time-frequency distributions as follows:

K(C_1, C_2) = \iint C_1(t, f) \log\frac{C_1(t, f)}{C_2(t, f)}\, dt\, df, \qquad (2.16)

where C_1, C_2 represent two different normalized time-frequency distributions defined in equation (2.4). This measure belongs to the class of Csiszar's f-divergence with f(x) = -\log x and g(x) = x. We have 0 \le K(C_1, C_2) \le \infty; the first equality holds if and only if C_1 = C_2, and the second equality holds if and only if \mathrm{Supp}\, C_1 \cap \mathrm{Supp}\, C_2 = \emptyset. This is not a symmetric distance measure, but it can easily be symmetrized by taking the average of K(C_1, C_2) and K(C_2, C_1). The main disadvantage of this measure is that it can only be applied to positive TFDs.

2.2.3.2 Rényi Divergence

Rényi divergence is a generalized formulation of Kullback-Leibler divergence and can be expressed as:

D_\alpha(C_1, C_2) = \frac{1}{\alpha - 1} \log \iint C_1^\alpha(t, f)\, C_2^{1-\alpha}(t, f)\, dt\, df, \qquad (2.17)

where \alpha \in [0, 1] is the order of the Rényi divergence. This measure converges to the Kullback-Leibler distance as \alpha \to 1. It is also a member of Csiszar's f-divergence with f(x) = x^{1-\alpha} and g(x) = \frac{1}{\alpha - 1}\log(x). We have 0 \le D_\alpha(C_1, C_2) \le \infty; the first equality holds if and only if C_1 = C_2, and the second equality holds if and only if \mathrm{Supp}\, C_1 \cap \mathrm{Supp}\, C_2 = \emptyset.

2.2.3.3 Jensen-Shannon Divergence

One common approach for constructing divergence measures is to apply the Jensen inequality to the entropy functional. For time-frequency distributions, the Jensen-Shannon divergence can be defined as:

J(C_1, C_2) = H\left( \frac{C_1 + C_2}{2} \right) - \frac{H(C_1) + H(C_2)}{2}. \qquad (2.18)

This distance measure is always positive since

H\left( \frac{C_1 + C_2}{2} \right) \ge \frac{H(C_1) + H(C_2)}{2} \qquad (2.19)

by the concavity of H. It is equal to zero when C_1 = C_2 and is a symmetric divergence measure. Unlike the Kullback-Leibler divergence, the Jensen-Shannon distance does not diverge when the two distributions are disjoint.

2.2.3.4 Jensen-Rényi Divergence

The Rényi entropy is derived from the same set of axioms as the Shannon entropy, the only difference being the employment of a more general exponential mean instead of the arithmetic mean in the derivation. This realization inspires the modification of the Jensen-Shannon divergence from an arithmetic to a geometric mean, and the following quantity is obtained for two positive TFDs C_1 and C_2:

JR_\alpha(C_1, C_2) = H_\alpha\!\left( \sqrt{C_1 C_2} \right) - \frac{H_\alpha(C_1) + H_\alpha(C_2)}{2}, \qquad (2.20)

where \sqrt{C_1 C_2}(t, f) = \sqrt{C_1(t, f)\, C_2(t, f)}. This quantity is obviously null when C_1 = C_2. Its positivity can be proven using the Cauchy-Schwartz inequality.
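The following sketch evaluates discrete analogues of (2.16)-(2.18) on two small pmfs, illustrating the asymmetry of Kullback-Leibler and the symmetry of Jensen-Shannon (the distributions are illustrative values, not from the text):

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence, discrete analogue of (2.16), in bits."""
    m = (p > 0) & (q > 0)
    return np.sum(p[m] * np.log2(p[m] / q[m]))

def renyi_divergence(p, q, alpha):
    """Renyi divergence (2.17) of order alpha in (0, 1)."""
    return np.log2(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (2.18); bounded, symmetric, zero iff p = q."""
    def h(r):
        r = r[r > 0]
        return -np.sum(r * np.log2(r))
    return h((p + q) / 2) - (h(p) + h(q)) / 2

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
print(kl_divergence(p, q), kl_divergence(q, p))    # asymmetric
print(renyi_divergence(p, q, 0.5))                 # nonnegative
print(jensen_shannon(p, q), jensen_shannon(q, p))  # symmetric
```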
CHAPTER 3
REVIEW OF SIGNAL DECOMPOSITION AND SOURCE SEPARATION METHODS

Blind signal processing is one of the important topics in the fields of neural computation, advanced statistics, and signal processing, with solid theoretical foundations and many potential applications. In this chapter, we review the basic approaches and techniques for signal decomposition and source separation, especially principal component analysis, independent component analysis, and several time-frequency based methods.

3.1 Principal Component Analysis

Principal component analysis (PCA) is a classic technique in statistical data analysis, feature extraction, and data compression, stemming from the early work of Pearson [44]. Given a set of multivariate measurements, the purpose of PCA is to find a smaller set of variables with less redundancy that would give as good a representation as possible. The redundancy is measured by correlations between data elements. Using the correlations as in PCA has the advantage that the analysis can be based on second-order statistics only.

3.1.1 Principal Components

The starting point of PCA is an n-dimensional random vector x. There is an available sample x(1), ..., x(T) from this random vector. No explicit assumptions on the probability density of the vectors are made in PCA, as long as the first- and second-order statistics are known or can be estimated from the sample. Also, no generative model is assumed for the vector x. Typically the elements of x are measurements like pixel gray levels or values of a signal at different time instants. It is essential in PCA that the elements are mutually correlated, so that there is some redundancy in x, making compression possible. If the elements are independent, the resulting components are exactly the same as the original signal measurements.

In the PCA transform, the vector x is first centered by subtracting its mean E\{x\}. The mean is in practice estimated from the available sample. Let us assume in the following that the centering has been done and thus E\{x\} = 0. Next, x is linearly transformed to another vector y with m elements, m < n, so that the redundancy induced by the correlations is removed. This is done by finding a rotated orthogonal coordinate system such that the elements of x in the new coordinates become uncorrelated. At the same time, the variances of the projections of x on the new coordinate axes are maximized so that the first axis corresponds to the maximal variance, the second axis corresponds to the maximal variance in the direction orthogonal to the first axis, and so on.

3.1.2 PCA By Variance Maximization

In mathematical terms, consider a linear combination

y_1 = \sum_{k=1}^{n} w_{k1} x_k = \mathbf{w}_1^T \mathbf{x} \qquad (3.1)

of the elements x_1, ..., x_n of the vector x. The w_{11}, ..., w_{n1} are scalar coefficients or weights, elements of an n-dimensional vector \mathbf{w}_1, and \mathbf{w}_1^T denotes the transpose of \mathbf{w}_1. The factor y_1 is called the first principal component of x if the variance of y_1 is maximally large. Because the variance depends on both the norm and orientation of the weight vector \mathbf{w}_1 and grows without limit as the norm grows, we impose the constraint that the norm of \mathbf{w}_1 is constant, in practice equal to 1. Thus we look for a weight vector \mathbf{w}_1 maximizing the PCA criterion

J_1^{PCA}(\mathbf{w}_1) = E\{y_1^2\} = E\{(\mathbf{w}_1^T \mathbf{x})^2\} = \mathbf{w}_1^T E\{\mathbf{x}\mathbf{x}^T\} \mathbf{w}_1 = \mathbf{w}_1^T \mathbf{C_x} \mathbf{w}_1 \quad \text{such that} \quad \|\mathbf{w}_1\| = 1, \qquad (3.2)

where the norm of \mathbf{w}_1 is the usual Euclidean norm defined as

\|\mathbf{w}_1\| = (\mathbf{w}_1^T \mathbf{w}_1)^{1/2} = \left( \sum_{k=1}^{n} w_{k1}^2 \right)^{1/2}, \qquad (3.3)

and the matrix \mathbf{C_x} = E\{\mathbf{x}\mathbf{x}^T\} is the n x n covariance matrix of the zero-mean vector x.

It is well known from basic linear algebra [45,46] that the solution to the PCA problem is given in terms of the unit-length eigenvectors e_1, ..., e_n of the matrix \mathbf{C_x}. The ordering of the eigenvectors is such that the corresponding eigenvalues d_1, ..., d_n satisfy d_1 \ge d_2 \ge \cdots \ge d_n. The solution maximizing equation (3.2) is given by \mathbf{w}_1 = \mathbf{e}_1. Thus the first principal component of x is

y_1 = \mathbf{e}_1^T \mathbf{x}. \qquad (3.4)

The criterion J_1^{PCA} in equation (3.2) can be generalized to m principal components, with m any number between 1 and n. Denoting the m-th (1 \le m \le n) principal component by y_m = \mathbf{w}_m^T \mathbf{x}, with \mathbf{w}_m the corresponding unit norm weight vector, the variance of y_m is now maximized under the constraint that y_m is uncorrelated with all the previously found principal components:

E\{y_m y_k\} = 0, \quad k < m. \qquad (3.5)

Note that the principal components y_m have zero means because

E\{y_m\} = \mathbf{w}_m^T E\{\mathbf{x}\} = 0. \qquad (3.6)

The condition (3.5) yields:

E\{y_m y_k\} = E\{(\mathbf{w}_m^T \mathbf{x})(\mathbf{w}_k^T \mathbf{x})\} = \mathbf{w}_m^T \mathbf{C_x} \mathbf{w}_k = 0. \qquad (3.7)

For the second principal component, we have the condition that

\mathbf{w}_2^T \mathbf{C_x} \mathbf{w}_1 = d_1 \mathbf{w}_2^T \mathbf{e}_1 = 0, \qquad (3.8)

because we already know that \mathbf{w}_1 = \mathbf{e}_1. We are thus looking for maximal variance E\{y_2^2\} = E\{(\mathbf{w}_2^T \mathbf{x})^2\} in the subspace orthogonal to the first eigenvector of \mathbf{C_x}. The solution is given by \mathbf{w}_2 = \mathbf{e}_2. Likewise, it recursively follows that \mathbf{w}_k = \mathbf{e}_k. Thus, the k-th principal component is

y_k = \mathbf{e}_k^T \mathbf{x}. \qquad (3.9)

From the above result, it follows that

E\{y_m^2\} = E\{\mathbf{e}_m^T \mathbf{x}\mathbf{x}^T \mathbf{e}_m\} = \mathbf{e}_m^T \mathbf{C_x} \mathbf{e}_m = d_m, \qquad (3.10)

which shows that the variances of the principal components are directly given by the eigenvalues of \mathbf{C_x}. The vectors x in the original data set can be approximated by the truncated PCA expansion

\hat{\mathbf{x}} = \sum_{i=1}^{m} y_i \mathbf{e}_i. \qquad (3.11)

Then the mean-squared error E\{\| \mathbf{x} - \hat{\mathbf{x}} \|^2\} is equal to \sum_{i=m+1}^{n} d_i. As the eigenvalues are all positive, the error decreases as more and more terms are included in equation (3.11), until the error becomes zero when m = n, i.e., all the principal components are included.
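A minimal numerical check of this development: eigendecomposition of the sample covariance, component variances equal to the eigenvalues per (3.10), and truncation error of (3.11) equal to the sum of the discarded eigenvalues. The synthetic data and dimensions are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data: T samples of an n-dimensional vector x
T, n = 1000, 5
mixing = rng.normal(size=(n, 3))
x = rng.normal(size=(T, 3)) @ mixing.T + 0.1 * rng.normal(size=(T, n))

x = x - x.mean(axis=0)                    # center so that E{x} = 0
Cx = x.T @ x / T                          # sample covariance matrix
d, E = np.linalg.eigh(Cx)                 # eigenvalues in ascending order
d, E = d[::-1], E[:, ::-1]                # reorder so d1 >= d2 >= ...

y = x @ E                                 # principal components y_k = e_k^T x
print(np.allclose(np.var(y, axis=0), d, rtol=1e-2))   # variances = eigenvalues

# Truncated expansion (3.11): reconstruction from the first m components
m = 3
x_hat = y[:, :m] @ E[:, :m].T
mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
print(np.isclose(mse, d[m:].sum(), rtol=1e-2))   # MSE = sum of discarded eigenvalues
```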
A very important practical problem is how to choose m in equation (3.11); this is a trade-off between error and the amount of data needed for the expansion. Sometimes a rather small number of principal components is sufficient. Disciplined approaches to this problem are given in [47,48].

3.2 Independent Component Analysis

Independent component analysis (ICA), introduced by J. Hérault, C. Jutten, and B. Ans [49] in the early 1980s, is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals. ICA defines a generative model for the observed multivariate data, which is typically given as a large database of samples. In the model, the data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed nongaussian and mutually independent, and they are called the independent components of the observed data. These independent components, also called sources or factors, can be found by ICA.

ICA is a much more powerful technique, capable of finding the underlying factors or sources when classic methods like PCA fail completely. The data analyzed by ICA could originate from many different kinds of application fields, including digital images and document databases, as well as economic indicators and psychometric measurements. In many cases, the measurements are given as a set of parallel signals or time series; the term blind source separation is used to characterize this problem. Typical examples are mixtures of simultaneous speech signals that have been picked up by several microphones [50], brain waves recorded by multiple sensors [51], interfering radio signals arriving at a mobile phone [52], or parallel time series obtained from some industrial process [53].

3.2.1 Definition of ICA

There are n observed random variables z_1, ..., z_n, which are modeled as linear combinations of n random variables s_1, ..., s_n:

z_i = a_{i1} s_1 + a_{i2} s_2 + \cdots + a_{in} s_n, \quad \text{for all } i = 1, \ldots, n, \qquad (3.12)

where the s_i are unknown and statistically mutually independent, and the a_{ij}, j = 1, \ldots, n, are some unknown real coefficients. This is the basic ICA model. All that is observed are the random variables z_i, and both the mixing coefficients a_{ij} and the independent components s_i must be estimated using the z_i.

It is usually more convenient to use vector-matrix notation instead of the sums as in the previous equation. Let us denote by z the random vector whose elements are the mixtures z_1, ..., z_n, and likewise by s the random vector whose elements are the source signals s_1, ..., s_n. Let us denote by A the matrix with elements a_{ij}. All vectors are assumed to be column vectors. Using this vector-matrix notation, the mixing model is written as

\mathbf{z} = \mathbf{A}\mathbf{s}. \qquad (3.13)

Sometimes the columns of matrix A, denoted by \mathbf{a}_i, are needed, and the model can also be written as

\mathbf{z} = \sum_{i=1}^{n} \mathbf{a}_i s_i. \qquad (3.14)
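A toy instantiation of the model (3.13), also previewing the scaling indeterminacy discussed next (the sources, mixing matrix, and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent, non-Gaussian (uniform) sources s and a mixing matrix A
T = 5000
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))   # unit-variance sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
z = A @ s                                               # observed mixtures, eq. (3.13)

# The mixtures are correlated even though the sources are not
print(np.corrcoef(s)[0, 1])   # ~0
print(np.corrcoef(z)[0, 1])   # clearly nonzero

# Scaling indeterminacy: rescaling a source and dividing the matching
# column of A by the same factor leaves the observations unchanged
alpha = 2.0
A2 = A.copy(); A2[:, 0] /= alpha
s2 = s.copy(); s2[0] *= alpha
print(np.allclose(z, A2 @ s2))   # True
```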
Compared with PCA, it is easy to see that in the ICA model the following ambiguities or indeterminacies hold:

1. The variances of the independent components cannot be determined. The reason is that, both s and A being unknown, any scalar multiplier in one of the sources s_i could always be cancelled by dividing the corresponding column \mathbf{a}_i of A by the same scalar, say \alpha_i:

\mathbf{z} = \sum_i \left( \frac{1}{\alpha_i} \mathbf{a}_i \right)(\alpha_i s_i). \qquad (3.15)

As a consequence, we may as well fix the magnitudes of the independent components. Since they are random variables, the most natural way to do this is to assume that each source has unit variance, E\{s_i^2\} = 1. Then the matrix A will be adapted in the ICA solution methods to take this restriction into account.

2. The order of the independent components cannot be determined. The reason is that, again both s and A being unknown, the order of the terms in the sum in equation (3.14) can be freely changed, and any of the independent components can be called the first one. Formally, a permutation matrix P and its inverse can be substituted in the model to give \mathbf{z} = \mathbf{A}\mathbf{P}^{-1}\mathbf{P}\mathbf{s}. The elements of Ps are the original independent variables s_i, but in another order. The matrix \mathbf{A}\mathbf{P}^{-1} is just a new unknown mixing matrix, to be solved by the ICA algorithms.

3.2.2 ICA by Maximum Likelihood Estimation

A very popular approach for estimating the ICA model is maximum likelihood (ML) estimation. Maximum likelihood estimation is a fundamental method of statistical estimation. One interpretation of ML estimation is that those parameter values which give the highest probability for the observations are taken as estimates. According to the properties of the density of a linear transform, the density p_z of the mixture vector \mathbf{z} = \mathbf{A}\mathbf{s} can be formulated as

p_z(\mathbf{z}) = |\det \mathbf{B}|\, p_s(\mathbf{s}) = |\det \mathbf{B}| \prod_i p_i(s_i), \qquad (3.16)

where \mathbf{B} = \mathbf{A}^{-1}, and the p_i denote the densities of the independent components. This can be expressed as a function of \mathbf{B} = (\mathbf{b}_1, \ldots, \mathbf{b}_n)^T and z, giving

p_z(\mathbf{z}) = |\det \mathbf{B}| \prod_i p_i(\mathbf{b}_i^T \mathbf{z}). \qquad (3.17)

Assume that we have K observations of z, denoted by z(1), ..., z(K). Then the likelihood can be obtained as the product of this density evaluated at the K points. This is denoted by L and considered as a function of B:

L(\mathbf{B}) = \prod_{t=1}^{K} \prod_{i=1}^{n} p_i(\mathbf{b}_i^T \mathbf{z}(t))\, |\det \mathbf{B}|. \qquad (3.18)

Very often it is more practical to use the logarithm of the likelihood, since it is algebraically simpler. This makes no difference here, since the maximum of the logarithm is obtained at the same point as the maximum of the likelihood. The log-likelihood is given by

\log L(\mathbf{B}) = \sum_{t=1}^{K} \sum_{i=1}^{n} \log p_i(\mathbf{b}_i^T \mathbf{z}(t)) + K \log|\det \mathbf{B}|. \qquad (3.19)

Dividing the log-likelihood by K, we obtain

\frac{1}{K} \log L(\mathbf{B}) = E\left\{ \sum_{i=1}^{n} \log p_i(\mathbf{b}_i^T \mathbf{z}) \right\} + \log|\det \mathbf{B}|. \qquad (3.20)

To perform ML estimation in practice, an algorithm is needed to carry out the numerical maximization of the likelihood. In fact, there are many different methods, among which the simplest algorithms for maximizing the likelihood are obtained by gradient methods [13].
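The normalized log-likelihood (3.20) is straightforward to evaluate once a source density is fixed. The sketch below assumes the super-Gaussian density p(s) = 1/(\pi \cosh s), an illustrative choice not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(2)

def ica_log_likelihood(B, z):
    """Normalized ICA log-likelihood (3.20) for a demixing matrix B.

    Assumes the super-Gaussian source density p(s) = 1/(pi cosh(s)),
    so log p(s) = -log(pi) - log(cosh(s)).
    """
    K = z.shape[1]
    s = B @ z
    log_ps = -np.log(np.pi) - np.log(np.cosh(s))
    return log_ps.sum() / K + np.log(abs(np.linalg.det(B)))

# Super-Gaussian (Laplacian) sources mixed by A
T = 10000
s = rng.laplace(size=(2, T))
A = np.array([[1.0, 0.5], [0.3, 1.0]])
z = A @ s

print(ica_log_likelihood(np.linalg.inv(A), z))  # near the true demixing matrix
print(ica_log_likelihood(np.eye(2), z))         # typically lower for a mismatched B
```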
3.2.3 ICA by Minimization of Mutual Information

An important approach for ICA estimation, inspired by information theory, is minimization of mutual information. The motivation of this approach is that in many cases it may not be very realistic to assume that the data follows the ICA model. Therefore, an approach that does not assume anything about the data needs to be developed. The goal is to have a general-purpose measure of the dependence of the components of a random vector. With such a measure, ICA could be defined as a linear decomposition that minimizes that dependence measure. Such an approach can be developed using mutual information, which is an information-theoretic measure of statistical dependence. One of the main utilities of mutual information is that it serves as a unifying framework for many estimation principles, in particular ML estimation.

Mutual information I between n random variables y_i, i = 1, \ldots, n, is defined as follows:

I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(\mathbf{y}), \qquad (3.21)

where H(y_i) and H(\mathbf{y}) are the entropy of y_i and the joint entropy, respectively. Mutual information is a natural measure of the dependence between random variables. It is always nonnegative, and zero if and only if the variables are statistically independent. Mutual information takes into account the whole dependence structure of the variables, and not just the covariance, like PCA and related methods. Therefore, mutual information can be used as the criterion for finding the ICA representation. This approach is an alternative to the model estimation approach. The ICA of a random vector z is defined as an invertible transformation

\mathbf{s} = \mathbf{B}\mathbf{z}, \qquad (3.22)

where the matrix B is determined so that the mutual information of the transformed components s_i is minimized. If the data follows the ICA model, this allows estimation of the data model. On the other hand, in this definition, it is not necessary to assume that the data follows the model. In any case, minimization of mutual information can be interpreted as giving the maximally independent components.

Mutual information and likelihood are intimately connected. A detailed analysis of the connection between mutual information and maximum likelihood can be found in [10]. The same gradient algorithm can be used to optimize mutual information due to its connection with likelihood. In addition, a nonparametric algorithm for minimization of mutual information is proposed in [54], and an approach based on order statistics is proposed in [55].
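For discrete variables, (3.21) can be computed directly from a joint pmf; a minimal sketch illustrating that mutual information vanishes exactly under independence (the pmfs are arbitrary examples):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (3.21): I(y1, y2) = H(y1) + H(y2) - H(y1, y2).

    `joint` is a 2-D array of joint probabilities P(y1 = i, y2 = j).
    """
    def h(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    return h(joint.sum(axis=1)) + h(joint.sum(axis=0)) - h(joint.ravel())

# Independent variables: the joint pmf is the outer product of the marginals
p1, p2 = np.array([0.5, 0.5]), np.array([0.25, 0.75])
print(mutual_information(np.outer(p1, p2)))   # 0: statistical independence

# Fully dependent variables (y2 = y1)
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))   # 1 bit
```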
3.3 Review of Time-Frequency Signal Decomposition and Separation Approaches

The most common methods for component extraction, including PCA and ICA, are effective at extracting orthogonal or independent components and assume the stationarity of the underlying signals. Since most real life signals are not stationary and thus do not obey this underlying assumption, recent research has focused on source/component extraction in the joint time-frequency domain. In this section, we review signal decomposition and source separation approaches based on time-frequency distributions.

3.3.1 Matching Pursuit with Time-Frequency Dictionaries

Matching pursuit [5] is a method to decompose a signal into a linear expansion of waveforms which belong to a redundant dictionary of functions, and whose time-frequency properties are adapted to the local structures of the signal. These waveforms are called time-frequency atoms. This algorithm offers a decomposition particularly important for representing signal components whose localizations in time and frequency vary widely.

A general family of time-frequency atoms can be generated by scaling, translating and modulating a single window function g(t) \in L^2(\mathbb{R}), and is defined as

g_\gamma(t) = \frac{1}{\sqrt{s}}\, g\!\left( \frac{t - u}{s} \right) e^{j\xi t}, \qquad (3.23)

where s > 0, u, and \xi are the scale, translation, and frequency modulation parameters, respectively, and \gamma = (s, u, \xi) \in \Gamma = \mathbb{R}^+ \times \mathbb{R}^2. The factor 1/\sqrt{s} normalizes the norm of g_\gamma(t) to 1. The family D = (g_\gamma(t))_{\gamma \in \Gamma} is extremely redundant, and its properties have been studied in [56].

A linear expansion of a signal f(t) over a set of vectors selected from D can be done by successive approximations of f(t) with orthogonal projections on elements of D, in order to best match its inner structures. Let g_{\gamma_0} \in D. The signal f is decomposed into

f = \langle f, g_{\gamma_0} \rangle g_{\gamma_0} + R^1 f, \qquad (3.24)

where \langle \cdot, \cdot \rangle represents the inner product of two functions, and R^1 f is the residue after approximating f in the direction of g_{\gamma_0}. Clearly, g_{\gamma_0} is orthogonal to R^1 f, hence

\|f\|^2 = |\langle f, g_{\gamma_0} \rangle|^2 + \|R^1 f\|^2. \qquad (3.25)

To minimize \|R^1 f\|, g_{\gamma_0} is chosen from D such that |\langle f, g_{\gamma_0} \rangle| is maximum. After m iterations, the signal f is decomposed into

f = \sum_{n=0}^{m-1} \langle R^n f, g_{\gamma_n} \rangle g_{\gamma_n} + R^m f, \qquad (3.26)

and an energy conservation equation follows:

\|f\|^2 = \sum_{n=0}^{m-1} |\langle R^n f, g_{\gamma_n} \rangle|^2 + \|R^m f\|^2. \qquad (3.27)

It is proven that the matching pursuit algorithm converges with respect to the iteration number m, so as m \to \infty,

f = \sum_{n=0}^{\infty} \langle R^n f, g_{\gamma_n} \rangle g_{\gamma_n} \qquad (3.28)

and

\|f\|^2 = \sum_{n=0}^{\infty} |\langle R^n f, g_{\gamma_n} \rangle|^2. \qquad (3.29)

It is thus shown that any signal f(t) \in L^2(\mathbb{R}) can be decomposed by matching pursuit into a sum of complex time-frequency atoms that best match its residues.

Although matching pursuit gives a flexible signal decomposition, a problem with this method is the restricted number of waveforms in the dictionary. While dictionaries containing a wide variety of elements can be employed at the expense of high computational cost, the representations are not satisfactory unless all signal components are at least reasonably well approximated by dictionary elements.
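A compact sketch of the pursuit (3.24)-(3.26) over a small discrete Gabor dictionary (the dictionary grid, atom parameters, and test signal are assumptions made only for illustration):

```python
import numpy as np

def gabor_atom(N, u, xi, s):
    """Unit-norm Gaussian atom g((t - u)/s) e^{j xi t} on N samples, cf. (3.23)."""
    t = np.arange(N)
    g = np.exp(-0.5 * ((t - u) / s) ** 2) * np.exp(1j * xi * t)
    return g / np.linalg.norm(g)

def matching_pursuit(f, atoms, n_iter):
    """Greedy decomposition (3.26): pick the atom maximizing |<R^n f, g>|."""
    residue, expansion = f.astype(complex), []
    for _ in range(n_iter):
        corr = atoms.conj() @ residue          # inner products with all atoms
        best = int(np.argmax(np.abs(corr)))
        expansion.append((best, corr[best]))
        residue = residue - corr[best] * atoms[best]
    return expansion, residue

# A small discrete dictionary over a coarse (s, u, xi) grid
N = 256
params = [(u, xi, s) for u in range(0, N, 32)
                     for xi in np.linspace(0.2, 2.8, 8)
                     for s in (4.0, 8.0, 16.0)]
atoms = np.array([gabor_atom(N, u, xi, s) for (u, xi, s) in params])

f = gabor_atom(N, 96, 1.0, 8.0) + 0.5 * gabor_atom(N, 160, 2.0, 16.0)
expansion, residue = matching_pursuit(f, atoms, n_iter=10)
print(np.linalg.norm(residue) / np.linalg.norm(f))   # residual energy decays, cf. (3.27)
```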
(3.33) l=—oo m=—oo The two equations given above are used to define the spatial time-frequency distri- bution (STF D) matrix as follows 00 OO =2 2 11(ml(111+m+z)*(1+m—l)e—j2‘*”, (3.39) l——oo 771"- where [Dxx(t,w)],j = Dairy-(LOU), for i,j = 1, - -- ,n The blind identification method is presented based on a two—step process: the first step consists of whitening the data in order to transform the mixing matrix A into a unitary matrix U; the second step consists of retrieving this unitary matrix U by 30 jointly diagonalizing a set of whitened data STFD matrices. Under the assumption that the source signals si(t),1 S i S n are mutually uncorrelated, the whitening matrix W can be determined from the array output autocorrelation R W(R — 021)WT = WAATWT = I, (3.40) where R = 71-12,:1 x(t)x*(t) as T —-> 00, I is the n x 11 identity matrix, and 02 is the noise variance. In the second step, the whitened STFD matrix is obtained as Dzz(1,w) = WDxx(t,w)WT = UDss(t,w)UT, (3.41) where z(t) = Wx(t) is the whitened data vector, and U = WA is a unitary matrix. Since Dss(t,w) is diagonal, U may be obtained as a unitary diagonalizing matrix of the whitened STFD matrices Dzz(t,w) for time-frequency points corresponding to signal autoterms. In the end, the source signals are estimated as s(t) = UTWx(t). In contrast to blind source separation methods using second-order and/or high- order statistics, the proposed approach allows the separation of Gaussian sources with identical spectral shapes but with different time-frequency localization proper- ties. However, due to the joint diagonalization of the STFD matrix, it has a higher computational complexity. In [59], an underdetermined separation algorithm for nondisjoint sources is pro- posed based on the ST FD method. Source separation is achieved by combining the STFD matrix with a clustering approach. 3.3.3 Blind Separation via Time-Fquuency Masking In [60], binary time-frequency masks are created to achieve demixing provided the time-frequency representations of the sources do not overlap. Without loss of generality, suppose x1(t) and 11:2(t) are two mixtures of source 31 signals 31(t), - -- ,sN(t) such that N 3310) = Z 810). 2:1 (3.42) 172(1) = 01810? ~ 51'), 1:1 where a,- and 6,- are the attenuation coeflicients and the time delays. Using the shift—invariance of the short-time Fourier transform (STFT), the time-frequency rep- resentation of the mixing model (3.42) is 31W») X1(t,w) 1 1 = . . 5 . (3.43) X2(t,w) ale—J‘U‘Sl aNe-JMSN SN(t,w) It is assumed that the STFTs, Si(t,w) and Sk(t,w), of any two source signals s,(t) and sk(t) are disjoint S,(t,w)Sk(t,w) = 0, Vt,w Vi 7f 11'. (3.44) To demix, one creates the time-frequency mask corresponding to each source and applies each mask to the mixture to produce the original source time-frequency rep- resentations. For example, defining 1, Si(t,w) 74 O Afi(t,w) = , (3.45) 0, otherwise one obtains the time—frequency representation of 3,-(t) from the mixture X1(t,w) via S,(t,w) = Afi(t,w)X1(t,w), Vt,w. (3.46) 32 Let Q, = {(t,w) : M,(t,w) = 1} for any i E (1, - -~ ,N) so that M,- = 192.. Consider R(t,w) — X2(t’°") (3.47) 1(t.w Clearly, on Q7; R(t,w) = die—3.675“). (3.48) In this case, | R(t,w) [2 a,- and —%£R(t,w) = 15,-, where [2 denotes the phase of the complex number 2 taken between —7r and 7r. Hence, one simply labels each time- lAl‘t(t,cu)). Since the sources are frequency point (15,112) with the pair (I RUM) I, “w disjoint, there will be N distinct labels. 
By grouping the time-frequency points (t, (.12) with the same label, the sets Q,- are constructed, and then the masks M,- = 192.. Therefore, from equation (3.46), the time-frequency representations Si(t,w) of the original sources 3,-(t) can be obtained. Although this approach may separate any number of sources from their two mixtures, the problem with more than two mixtures is not addressed. In addition, it is hard to separate the source signals which have the same parameters a,- and 6,- in their mixtures. 33 CHAPTER 4 ADAPTIVE SIGNAL DECOMPOSITION ON THE TIME-FREQUEN CY PLANE 4. 1 Introduction Signal decomposition aims to extract the components comprising the observed signals. The majority of methods for performing linear signal decomposition involve over- complete waveform dictionaries. By selecting the optimum set of available waveforms from the dictionary based on some criterion, a sparse model of the signal can be obtained. Such decomposition schemes include matching pursuit [5], basis pursuit [6], and the chirplet decomposition [7]. A problem with these decomposition methods is the restricted number of waveforms in the dictionary. While dictionaries containing a wide variety of elements can be employed at the expense of high computational cost, the representations are not satisfactory unless all signal components are at least reasonably well approximated by dictionary elements. For this reason, in this chapter we introduce an adaptive component extraction approach on the time-frequency plane. This approach relies on extracting compo- nents that are well—concentrated on the time—frequency plane. The concentration of the components are quantified through an entrOpy measure on the time-frequency plane. Since it has been shown in the literature that signals that achieve a small en- tropy value on the time-frequency plane are Gabor logons, our component extraction algorithm reduces to extracting the Gabor logons that best describe the given data set in a minimum mean square sense. Unlike the traditional Gabor decomposition [61], where the signal is expressed as an infinite sum of the time and frequency shifted Gabor logons, we do not have to create a dictionary beforehand, and the components extracted by the proposed method have time and frequency centers determined by 34 the signal. Moreover, these extracted components have chirp rates and local spread adapted to the given set of signals. The goal is to represent the given data set with a few number of the chirped Gabor logons. 4.2 Background on Gabor Decomposition and Information Measures 4.2.1 Gabor Signal Expansion In 1946, Gabor presented an approach to characterize a time function in time and frequency simultaneously, which later became known as the Gabor signal expansion [62]. He showed that any signal in L2 could be represented as the weighted sum of modulated and shifted Gaussian functions (logons) centered on a rectangular lattice in time and frequency under the constraint that TS) S 27r where T is the time sampling interval and Q is the frequency sampling interval. That is, for signal s(t), the Gabor expansion is defined as 3(1) = Z amngmn(t) (4.1) with 9(1) = {921—307le (4.2) gmn = 90: — mean”? (4.3) The Gabor expansion coefficients amn are computed by the usual inner product rule for projecting 3(t) onto an auxiliary function 7(t), i.e., 4.... = [:0 8(t)7}"nn(t)dt (44) 7mm“) 2 “/(t _ mT)eant, (4-5) 35 where * denotes the complex conjugate operation. 
Equations (4.4) and (4.5) are in fact a sampled version of the windowed Fourier transform of the signal 3(t) with the analysis window 7(t), which is known as the Gabor transform. The analysis window and the synthesis window satisfy the following biorthogonality relationship [63]: T090 foo 9037*“ _ mT0)e_anOtdt = 6(m)6(n), (4-6) 27r -—oo where T0 = 27r/Q, and $20 = 27r/T. 4.2.2 Chirplet Transform The Gabor transform essentially provides expansions of signals as linear combinations of time-frequency atoms with fixed time and frequency “concentration” properties. However, it fails to represent the chirp-like components in a compact and precise way. In other words, more atoms are needed to approximate the chirp-like components with frequency modulation, which results in the reduction of the effectiveness and compactness of the time-frequency representation. For these reasons, the chirplet transform, a generalized Gabor transform, is devel- oped [64]. The time—frequency atoms for a Gaussian chirplet transform, the so—called Gaussian chirplets, are derived from a single Gaussian function through the oper- ations of scaling, chirping, time- and frequency-shifting, which leads to a family of wave packets with four adjustable parameters: gk(t) = [Va—WE exp {_92_k_(t — tk)2 +j [wk + pig-(t — tk)] (t — tk)}, (4.7) where the parameters (tk, wk) determine the time and frequency centers of the linear Gaussian chirplets; the variance ak(> 0) controls the time duration of the chirplet; 6k is the frequency modulation rate (chirp rate) that characterizes the “quickness” of frequency changes. Compared with the Gabor logon used for the Gabor expansion, 36 the Gaussian chirplet has more freedom and thereby can better match the signal under consideration. 4.2.3 Entropy Measure on the Time-Frequency Plane Since a TFD, C (t, w), from Cohen’s general class has many desired preperties such as the energy preservation and the marginals, it is analogous to the probability density function (pdf) of a two-dimensional random variable. This analogy has inspired the adaptation of information-theoretic measures such as entropy and mutual information to the time-frequency plane. The adaptation of classical Shannon entropy to the time- frequency plane yields H(C) = —// C(t,w)log2 C(t,w)dtdw. (4.8) This measure is only defined when C (t, w) > 0, Vt, 112. Therefore, it is valid for positive distributions such as the spectrogram, but yet invalid for non-positive distributions. For this reason, a more generalized class of entropy measures known as Rényi entropy has been adapted to the time—frequency plane. In [42], Rényi entrOpy was introduced as an alternative way of measuring the complexity of TFDs and the properties of this measure were proved extensively in [65]: (1 110(0) 11111.). (4.9) 1 t —a ff C(u,v)dudv where a > 0. This measure is well-defined as long as f f C(t,w)dtdw > 0 and has been shown to be finite for a large class of signals and distributions [65]. It is important to note that as a —> 1, Rényi entropy becomes Shannon entropy. It has been shown that the minimum value of entropy on the time-frequency plane is achieved for a Gabor logon [65]. This is also consistent with the fact that 37 the Gabor logon is the signal that achieves the lower bound on the uncertainty on the time-frequency plane [23]. For this reason, our signal decomposition algorithm is based on extracting a set of well-concentrated components, that best describe the given data set. 4.3 Component Extraction Method 4.3. 
4.2.3 Entropy Measure on the Time-Frequency Plane

Since a TFD, C(t, \omega), from Cohen's general class has many desired properties such as energy preservation and the marginals, it is analogous to the probability density function (pdf) of a two-dimensional random variable. This analogy has inspired the adaptation of information-theoretic measures such as entropy and mutual information to the time-frequency plane. The adaptation of the classical Shannon entropy to the time-frequency plane yields

H(C) = -\iint C(t, \omega)\, \log_2 C(t, \omega)\, dt\, d\omega. \qquad (4.8)

This measure is only defined when C(t, \omega) > 0, \forall t, \omega. Therefore, it is valid for positive distributions such as the spectrogram, but invalid for non-positive distributions. For this reason, a more general class of entropy measures known as Rényi entropy has been adapted to the time-frequency plane. In [42], Rényi entropy was introduced as an alternative way of measuring the complexity of TFDs, and the properties of this measure were proved extensively in [65]:

H_\alpha(C) = \frac{1}{1 - \alpha} \log_2 \iint \left( \frac{C(t, \omega)}{\iint C(u, v)\, du\, dv} \right)^\alpha dt\, d\omega, \qquad (4.9)

where \alpha > 0. This measure is well-defined as long as \iint C(t, \omega)\, dt\, d\omega > 0, and it has been shown to be finite for a large class of signals and distributions [65]. It is important to note that as \alpha \to 1, Rényi entropy becomes Shannon entropy.

It has been shown that the minimum value of entropy on the time-frequency plane is achieved for a Gabor logon [65]. This is also consistent with the fact that the Gabor logon is the signal that achieves the lower bound on the uncertainty on the time-frequency plane [23]. For this reason, our signal decomposition algorithm is based on extracting a set of well-concentrated components that best describe the given data set.

4.3 Component Extraction Method

4.3.1 Problem Statement

Given M measurements of a signal, \{x_1, x_2, \ldots, x_M\}, we want to extract the first L components, L < M, that minimize entropy on the time-frequency plane. Each measurement, x_i, is transformed to the time-frequency plane as:

C_i(n, k) = \sum_{l} \sum_{m} \psi(n - l, m)\, x_i\!\left(l + \frac{m}{2}\right) x_i^*\!\left(l - \frac{m}{2}\right) e^{-j\omega_k m}. \qquad (4.10)

The time-frequency distribution corresponding to each trial is vectorized and a matrix of time-frequency distributions is formed:

\mathbf{C} = \begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_M \end{bmatrix}, \qquad (4.11)

where C_i is a vector of length N x K points, N and K being the number of time and frequency points, respectively. The components on the time-frequency plane are found based on this time-frequency data matrix. Each component S_i is a linear combination of the rows of this matrix, i.e.

S_i(n, k) = \sum_{j=1}^{M} a_j C_j(n, k), \qquad (4.12)

where \sum_{j=1}^{M} a_j^2 = 1 and the a_j's are chosen such that H_\alpha(S_i) is minimized on the time-frequency plane.

4.3.2 The Proposed Approach

Since Gabor logon signals have minimum entropy in the time-frequency domain, the cost function is chosen as \epsilon = H_\alpha(S_i) - H_\alpha^G, where H_\alpha(S_i) is the Rényi entropy of the i-th component with order \alpha, and H_\alpha^G represents the Rényi entropy of the corresponding desired Gabor logon signal. The weight vector \mathbf{a} = [a_1, a_2, \ldots, a_M]^T is updated using the method of steepest descent [66]:

\mathbf{a} \leftarrow \mathbf{a} - \mu \frac{\partial \epsilon}{\partial \mathbf{a}}, \qquad (4.13)

where \mu is the step size parameter. In the discrete case, the Rényi entropy of the component S_i is

H_\alpha(S_i) = H_\alpha(\mathbf{a}^T \mathbf{C}) = \frac{1}{1 - \alpha} \log_2 \sum_{n} \sum_{k} \left( \sum_{j=1}^{M} a_j C_j(n, k) \right)^\alpha, \qquad (4.14)

where S_i and the C_j are normalized. The gradient of the cost function \epsilon with respect to the l-th weight coefficient a_l is derived as:

\frac{\partial \epsilon}{\partial a_l} = \frac{\alpha}{1 - \alpha} \cdot \frac{\sum_n \sum_k (S_i(n, k))^{\alpha - 1}\, C_l(n, k)}{\sum_n \sum_k (S_i(n, k))^\alpha}, \qquad (4.15)

where l = 1, \ldots, M. For the special case of \alpha = 2,

\frac{\partial \epsilon}{\partial a_l} = -\frac{2 \sum_n \sum_k S_i(n, k)\, C_l(n, k)}{\sum_n \sum_k (S_i(n, k))^2}. \qquad (4.16)

Substituting the result of equation (4.16) into equation (4.13) yields the update equation for a:

a_l \leftarrow a_l + 2\mu\, \frac{\sum_n \sum_k S_i(n, k)\, C_l(n, k)}{\sum_n \sum_k (S_i(n, k))^2}. \qquad (4.17)

The algorithm can be summarized as follows:

1. Find the Gabor logon that best describes the average of all trials, C_{av} = \frac{1}{M} \sum_j C_j. This first Gabor logon is found by estimating the average time duration, average frequency, spread, and chirp rate of C_{av}. A logon with these estimated parameters is constructed and chosen as the first desired signal, G(n, k; n_0, k_0, \sigma, \beta).

2. Set the initial value a_j = 1/\sqrt{M}, and use the adaptive filtering algorithm to update the weights until the error converges. Here, when the absolute value of the difference between two successive weight iterates is less than a given error value, the update is stopped. The first component is then determined as S_1 = \mathbf{a}_*^T \mathbf{C}, where \mathbf{a}_* is the optimal weighting vector.

3. Project all the trials on S_1 and compute the residue:

C_i = C_i - \langle S_1, C_i \rangle S_1, \quad i = 1, 2, \ldots, M. \qquad (4.18)

4. Repeat the same algorithm on this residue matrix C and extract the next component.

5. Stop when the average energy of the residues drops below a pre-determined threshold value.
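A hedged sketch of the weight adaptation (4.17) for \alpha = 2. The stopping rule follows step 2; the renormalization enforcing \sum_j a_j^2 = 1 and the stand-in data are our assumptions, and the windowing and logon-fitting steps of the full algorithm are omitted:

```python
import numpy as np

def renyi2_entropy(S):
    """Order-2 Renyi entropy (4.19) of a vectorized component S = a^T C."""
    return -np.log2(np.sum(S ** 2))

def entropy_descent(C, mu=0.05, n_iter=500, tol=1e-8):
    """Steepest-descent weight adaptation (4.17) for alpha = 2.

    C is the M x (N*K) matrix of vectorized trial TFDs. The desired-logon
    entropy only shifts the cost eps = H_2(S) - H_G by a constant and drops
    out of the gradient (4.16), so it does not appear in the update itself.
    """
    M = C.shape[0]
    a = np.full(M, 1.0 / np.sqrt(M))              # initial weights, step 2
    for _ in range(n_iter):
        S = a @ C
        grad = -2.0 * (C @ S) / np.sum(S ** 2)    # eq. (4.16)
        a_new = a - mu * grad                     # eq. (4.17)
        a_new /= np.linalg.norm(a_new)            # keep sum a_j^2 = 1
        converged = np.max(np.abs(a_new - a)) < tol
        a = a_new
        if converged:                             # stopping rule from step 2
            break
    return a

# Stand-in data: 8 trials on a 64-point vectorized time-frequency grid
rng = np.random.default_rng(5)
C = np.abs(rng.normal(size=(8, 64)))
C /= np.linalg.norm(C, axis=1, keepdims=True)
a = entropy_descent(C)
print(renyi2_entropy(a @ C))   # typically lower than the entropy of any single trial
```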
= w = _ 27€CC )ak , (4.21) 33}: Bak ak (CCT)ak Assume that the desired Gabor logon is G = afC with the optimal weight vector a*. Consider the weight error vector 8k 2 ak —— a*. Subtracting a... from both sides of equation (4.20), we get ek+1 = 5k — ,uAak. (4.22) Multiplying both sides of above equation with its transpose to get the norm of the weight error yields 2 2 T 2 2 || 51.41 || =|| 51- || 43/15); Aak +# II A31; || - (4-23) In order for the weights to converge to the true weights, we require ll €1.41 n2 < 5k, Sk > , where <, > represents the inner product of two vectors, S k is the extracted component at step 1:, and G’ is the corresponding desired Gabor logon, the upper bound on the positive step size becomes 0 ) . (4.27) 4 __ 1 _ __ || Aak ||2( < 314,514 > Notice that since < G, 5'); > is less than < Sk, S'k > for the normalized TFDs Sk and G, the upper bound on the step size is positive and valid. It can be concluded that the proposed adaptation algorithm is convergent. 4.4 Experimental Results and Analysis In order to evaluate the effectiveness of our method, we consider the following exam- ple. The set of observed signals are linear combinations of two Gabor logons and a chirp signal, i.e. 1r,- 2 1122-151 + 1112-232 + 102-353, where wi1,w,§2,wi3 are the weights for each signal and are distributed as N (O, 1). The first Gabor logon is centered at 42 the time sample point 50 and normalized frequency of 0.7, the second Gabor logon is centered at time sample point 150 and normalized frequency of -0.7. The linear chirp signal has an initial normalized frequency of -0.2 and its instantaneous frequency in— creases to a normalized frequency of 0.2. Rényi entropy with a = 2 is used as the cost function to ensure that entropy is well-defined. The data set consists of M = 128 linear combinations of these three signals. Each signal is transformed to the time- frequency domain with N = 50 time samples and K = 64 frequency samples. Each TF D is then vectorized to form a TFD matrix of size 128 x 3200. First, the average of M TFDs corresponding to each trial is computed. Then, the time-frequency location of the peak energy on the time-frequency plane is found as no and k0. A window centered at (740,190) is constructed to determine a local region around this peak. The size of the window is determined based on the energy distri- bution of the signal, i.e. the window is expanded until the energy value drops below 10% of the peak value. This windowing approach around the peak helps us extract local features. The same window is applied to all trials to extract the corresponding regions in each trial. The standard deviation 0 and gradient (the chirp rate), 6, of this local TFD are estimated. Based on the parameters (120, 190,0,6), a Gabor logon is constructed and chosen as the first desired signal. Using the steepest descent al- gorithm, the weight coefficients aj’s are updated to minimize the difference of Rényi entropy between the linear combination of the M local TFDS and the TFD of the first desired logon to obtain the first time-frequency component, 31. This first component is projected onto all of the M trials and the residue is found. This same algorithm is repeated for the residue on the time-frequency plane, i.e. pick the peak, construct a window, determine the desired Gabor logon, and adaptively filter the signals to get close to the desired Gabor logon. This process is repeated until the energy of the residue is below a certain threshold. 
In this example, 11 components were enough to represent 90% of the total energy of the signal. 43 Table 4.1 gives the entropy values for the first 3 of the 11 extracted components, the corresponding desired logon signals, and the first 3 components obtained using PCA. It is shown in Table 4.1 that the entrOpy of the extracted components are closer to the entropy of the Gabor logons. Since the entropy differences between these extracted components and the desired logon signals are small, we can infer that the extracted signals are quite close to the actual logons. It is also seen that the entropy of components extracted by our method is less than the entrOpy of PCA components. This indicates that we obtain components that are more compact than the ones obtained by PCA. The time-frequency surfaces in Figure 4.1 indicate that the 5 extracted compo- nents include both the logon signals and the first three chirped logons that represent the linear chirp signal. The topographical plots of the extracted components make it clear that each component was appropriately isolated in terms of the topographical region of origin. The results of this example show that the decomposition of time-frequency energy using our approach can extract meaningful time-frequency components for analysis of large sets of data. This decomposition algorithm achieves several goals. First, time- frequency data reduction is accomplished by producing a few meaningful components on the time-frequency plane that explain most of the signal’s energy. A second ben- efit of this time-frequency domain decomposition is that it can extract activity that overlaps in time, but not in frequency, which is not possible using time domain decom- position approaches. Finally, another benefit of our method is the ability to separate and extract parts of chirped signals, which cannot be achieved using the conventional Gabor expansion. Next, the performance of the proposed approach is compared with that of Orthog- onal Matching Pursuit (OMP) introduced in Chapter 3. In this example, in order to represent the same 90% of the total energy of the signal as in the proposed method, 44 Table 4.1. Entropy Comparison Entropy Decmp Comps PCA Comps Desired Logons 1 2.8319 4.5011 2.7809 2 2.7825 3.0461 2.7413 3 2.7517 2.9724 2.7252 24 dictionary elements for OMP are needed, among which the six ones are shown in Figure 4.2. It is indicated that the number of components required by OMP is much larger than that of the proposed approach which is only eleven. Moreover, the com- putation of OMP is much more complex, about four times of the proposed method. One reason why the performance of matching pursuit is not well is that the decom— position is not satisfactory with matching pursuit unless all signal components are well approximated by dictionary elements; on the other hand, although the dictionary contains a wide variety of elements, this kind of redundancy leads to the elements to be employed at the expense of high computational cost. 4.5 Summary In this chapter, a new signal decomposition method on the time-frequency plane is proposed based on the minimum entropy criterion. The major difference of the proposed approach from conventional component extraction or decomposition meth- ods is the cost function. The cost function which is minimized is entropy on the time-frequency plane, thus producing compact components that are similar to Gabor logons. 
Using entropy as the cost function and adopting an adaptive filtering method to update the weights corresponding to each trial, we extract “minimum" entropy components orthogonal to each other. Experimental results show that the presented approach is effective in determining a few number of components that can be used to represent a large set of data. 45 Figure 4.1. The average time-frequency distribution of 128 trials and the 5 extracted components of the proposed method 46 Figure 4.2. The average time-frequency distribution of 128 trials and the 6 elements of OMP 47 CHAPTER 5 OVERDETERMINED BLIND SOURCE SEPARATION IN THE TIME-FREQUENCY DOMAIN 5. 1 Introduction Blind source separation (BSS) is an important and fundamental problem in signal processing with a broad range of applications. Several unobservable source signals first pass through an intermediate media, and then arrive at an array of sensors. The observed output of each sensor is a mixture of all the source signals. The goal in BSS is to recover the original source signals from the observed mixtures. Typ- ical BSS applications include communications [1], speech signal processing [2], and biomedical signal processing applications [3]. A number of BSS algorithms have been proposed based on the instantaneous mixture model, in which the observed signals are linear combinations of the source signals and no time delays are involved in the mix- tures. Among these methods, the most common ones are second order statistics based methods [67], and information-theoretic approaches which utilize cost functions such as mutual information or divergence measures, e.g. independent component analysis (ICA) [9,10,68—70], sparse component analysis (SCA) [71], and nonnegative matrix factorization (NMF) [72]. These methods in general assume a certain structure for the underlying source signals. Some examples include higher-order statistics based methods which assume non-Gaussian and i.i.d source signals, and ICA which assumes the independence of the source signals. Most real life signals are non-stationary, and thus do not obey the underlying assumption of stationarity that is embedded in the current methods. For this rea- son, recently various methods have been introduced to exploit the non-stationarity property of the source signals to solve the separation problem, including frequency 48 domain [73], [74] and time domain [75], [76] approaches. In general the frequency- domain estimation algorithms have a simpler implementation, less computational time, and better convergence properties over the time-domain ones. However, the disadvantages of using frequency-domain methods are the arbitrary permutation and scaling ambiguities of the estimated frequency response of the un-mixing system at each frequency bin. Motivated by these problems, researchers have resorted to the powerful tool of time-frequency signal representations. For non-stationary signals, a blind separation approach using a spatial time-frequency distribution is proposed in [58] to achieve the separation by joint diagonalization of the auto-terms in the spatial time-frequency distributions. This approach has been modified and improved as discussed in [77,78]. Another time-frequency based method described in [60] uses binary time-frequency masks to separate more than two speech sources from two mixtures using the sparsity of the time-frequency representations of speech signals. 
In this chapter, we introduce a new approach to the source separation problem combining time—frequency representations with information-theoretic measures. An information-theoretic criterion, Jensen-Rényi divergence as adapted to the time-fre- quency distributions, is used as the objective function to separate the sources. The underlying sources are assumed to be disjoint on the time-frequency plane and it is shown that this new cost function achieves its maximum when the signals are disjoint. With the assumption that the source signals are disjoint on the timeLfrequency plane, signal separation is performed through a rotation transformation using a steepest descent algorithm. 5.2 Information Measures in the Time-Frequency Domain The information-theoretic measures such as entropy have been successfully applied to the time—frequency plane [42,65], due to an analogy between a TFD and the probabil- 49 ity density function (pdf) of a two-dimensional random variable. Although entropy measures have proven to be useful in quantifying the complexity of individual sig- nals, they cannot be used directly to quantify the difference between signals. For this reason, well-known divergence measures from information theory have been adapted to the time-frequency plane [79,80]. The most common divergence measures used for probability distributions belong to Csiszar’s f-divergence such as Kullback-Leibler divergence based on Shannon entropy and a-divergence based on Rényi entropy [39]. Another common class of divergence measures is based on the Jensen difference such as the Jensen-Shannon divergence constructed by applying Jensen inequality to the entropy functional. Jensen—Rényi divergence is the modification of Jensen-Shannon divergence from an arithmetic to a geometric mean introduced by Michel [79]. For time-frequency distributions, Jensen—Rényi divergence can be defined as: jf2(Cl,C2) = Ha(v0102) - Ha(C'1) g Ha(C'2), (5-1) where Cl and Cg are the general TFDs of two different signals defined in equation (2.4) respectively, and Ha represents Rényi entropy defined in equation (4.9). Jensen- Rényi divergence is equal to zero when 01 = C2, and its positivity can be proven using the Cauchy-Schwartz inequality. This measure has some desired properties such as being symmetric and monotonically increasing as the overlap between the two distributions decreases, i.e. Cl(t,w)02(t,w) —> 0. Therefore, maximizing this measure corresponds to obtaining disjoint time-frequency representations. When Jensen-Rényi divergence is compared to other symmetric and monotoni- cally increasing information-theoretic measures on the time-frequency plane, several advantages emerge. First of all, J ensen—Rényi divergence is defined based on the Rényi entrOpy which is well-defined for a larger class of time-frequency distributions com- pared to Shannon entropy which is defined for only positive distributions. Therefore, 50 Shannon entropy based divergence measures are limited in their applicability. Sec- ond, recent work in the analysis of sensitivity or robustness of divergence measures on the time-frequency plane reveals that Jensen-Rényi divergence is more robust against perturbations and noise, which makes it more suitable for source detection and separa- tion applications [79,80]. The following simulation example compares the robustness of two different distance measures under an additive signal perturbation model. 
The following simulation example compares the robustness of two different distance measures under an additive signal perturbation model. The original signal is a Gabor logon, s_1(t) = \exp(-(t - t_0)^2)\exp(-j\omega_0 t), centered at time t_0 = 32 and normalized frequency \omega_0 = 0.2, and the perturbation signal is another Gabor logon, s_2(t) = \exp(-(t - t_1)^2)\exp(-j\omega_0 t), centered at t_1 = 64, \omega_0 = 0.2. The perturbed signal is z(t) = (1 - \epsilon) s_1(t) + \epsilon s_2(t), where \epsilon \in [0, 1]. The distance between the time-frequency distributions of the perturbed signal and the original one is computed as \epsilon goes from 0 to 1. Figure 5.1 shows the comparison between the symmetric Kullback-Leibler and the Jensen-Rényi divergences for different values of \epsilon. When \epsilon is small, the Jensen-Rényi divergence is smaller than the Kullback-Leibler divergence, showing robustness against small perturbations. However, as \epsilon increases, the Jensen-Rényi divergence reacts faster to the change and detects the second signal component.

Figure 5.1. Comparison of Kullback-Leibler and Jensen-Rényi divergence measures under an additive signal perturbation model

5.3 Problem Formulation and Method

5.3.1 Signal Model and Assumptions

In this chapter, we consider the problem of determining the source signals when the number of observed mixtures is equal to or greater than the number of source signals. Assume that the N-dimensional vector s(t) = [s_1(t), s_2(t), \ldots, s_N(t)]^T corresponds to the N non-stationary complex source signals. The source signals are transmitted through a medium, and the M sensors pick up a set of mixed signals represented by z(t) = [z_1(t), z_2(t), \ldots, z_M(t)]^T, where M \geq N. Given the M observations or mixtures,

z(t) = A(t) s(t),   (5.2)

where A(t) is the mixing matrix, we want to extract the underlying sources s(t). In this chapter, we assume an instantaneous mixture of the sources, i.e., A(t) = A, where A is an M \times N matrix. The following assumption is made about the underlying sources: the sources are assumed to have different structures and localization properties on the time-frequency plane, i.e. the sources are disjoint on the time-frequency plane. This implies that C_{s_i}(t,\omega) C_{s_j}(t,\omega) = 0 for all t, \omega and i \neq j. In practice, this condition is never satisfied exactly. However, as long as the inner product is small, source separation can be achieved. This condition on the disjointness of the sources on the time-frequency plane has already been used by several authors for the separation of speech and music signals [60], [81].

Before proceeding further, it is important to specify the notion of blind identification. In the blind context, a full identification of the mixing matrix A is impossible, since the exchange of a fixed scalar factor between a given source signal and the corresponding column of A does not affect the observations, as shown by the following relation:

z(t) = A s(t) = \sum_{i=1}^{N} \frac{a_i}{\beta_i} \, \beta_i s_i(t),   (5.3)

where \beta_i is an arbitrary complex factor, and a_i denotes the ith column of A. Advantage can be taken of this indeterminacy by assuming that the source signals have unit variance, so that the dynamic range of the sources is accounted for by the magnitude of the corresponding columns of A. This normalization convention turns out to be convenient in the sequel; it does not affect the performance results. For the proposed algorithm, the sources are extracted on the time-frequency plane up to a scalar factor and a permutation.
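As a small illustration of the instantaneous model (5.2) and the unit-variance convention above, consider the following sketch; the dimensions and the random mixing matrix are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 3, 4, 256                 # sources, sensors, time samples (illustrative)

# Unit-variance sources, per the normalization convention above.
s = rng.standard_normal((N, T))
s /= s.std(axis=1, keepdims=True)

# Instantaneous mixing z(t) = A s(t): A is a fixed M x N matrix.
A = rng.standard_normal((M, N))
z = A @ s                           # the M observed mixtures
```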
5.3.2 Problem Statement in the Time-Frequency Domain

This section briefly outlines the overall structure of the source separation. The different components of the algorithm, such as the cost function and the optimization method, are discussed subsequently.

Each observation, z_i(t), is first transformed to the time-frequency plane as:

C_{z_i}(n,\omega) = \sum_{l} \sum_{m} \phi(n-l, m)\, z_i\!\left(l + \frac{m}{2}\right) z_i^*\!\left(l - \frac{m}{2}\right) e^{-j\omega m}
= \sum_{k=1}^{N} |a_{ik}|^2 \sum_{l} \sum_{m} \phi(n-l, m)\, s_k\!\left(l + \frac{m}{2}\right) s_k^*\!\left(l - \frac{m}{2}\right) e^{-j\omega m} + \sum_{k=1}^{N} \sum_{r \neq k} a_{ik} a_{ir}^* C_{s_k s_r}(n,\omega)
= \sum_{k=1}^{N} |a_{ik}|^2 C_{s_k}(n,\omega) + \sum_{k=1}^{N} \sum_{r \neq k} a_{ik} a_{ir}^* C_{s_k s_r}(n,\omega),   (5.4)

where the first term contains the auto-TFDs of the sources weighted by |a_{ik}|^2, and the second term contains the cross-TFDs, C_{s_k s_r}, between different sources. With a reduced interference kernel the cross-terms are negligible, as illustrated through an example in Section 5.4. Each C_{z_i} is evaluated on a discrete time-frequency grid and vectorized into a row vector x_i of length P, the number of time-frequency points; stacking the rows gives the TFD observation matrix

X = [x_1^T, x_2^T, \ldots, x_M^T]^T.   (5.7)

5.3.3 Cost Function

The cost function used in this chapter is the total pairwise Jensen-Rényi divergence, defined as:

J_\alpha \triangleq \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} JR_\alpha(Y_i, Y_j),   (5.10)

where Y_i denotes the ith extracted component, i.e. the ith row of the transformed observation matrix Y introduced in Section 5.3.4. Maximizing this cost function ensures that the extracted components do not overlap with each other on the time-frequency plane. The pairwise Jensen-Rényi divergence between two time-frequency distributions is defined as:

JR_\alpha(Y_i, Y_j) = H_\alpha\left(\sqrt{Y_i Y_j}\right) - \frac{H_\alpha(Y_i) + H_\alpha(Y_j)}{2}.   (5.11)

This expression can be further simplified as:

JR_\alpha(Y_i, Y_j) = \frac{1}{1-\alpha} \log_2 \frac{\left( \sum_{k=1}^{P} \left( Y_i(k) Y_j(k) \right)^{\alpha/2} \right)^2}{\left( \sum_{k=1}^{P} Y_i^\alpha(k) \right) \left( \sum_{k=1}^{P} Y_j^\alpha(k) \right)},   (5.12)

which represents the ratio of the energy of the overlap between the two TFDs to the product of the energies of the individual TFDs. Let

\tilde{J}_{ij} = \frac{\left( \sum_{k=1}^{P} \left( Y_i(k) Y_j(k) \right)^{\alpha/2} \right)^2}{\left( \sum_{k=1}^{P} Y_i^\alpha(k) \right) \left( \sum_{k=1}^{P} Y_j^\alpha(k) \right)}

and

\tilde{J}_\alpha = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \tilde{J}_{ij}.   (5.13)

Since \log(\cdot) is a monotonic function, maximizing J_\alpha is equivalent to minimizing \tilde{J}_\alpha for \alpha > 1, or maximizing \tilde{J}_\alpha for \alpha < 1. This means that we can equivalently use \tilde{J}_\alpha as our cost function. In this chapter, we consider orders \alpha > 1; the results are similar for \alpha < 1. One special case of \alpha > 1 is the quadratic one, \alpha = 2, for which the cost function simplifies to:

J_2 = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{\sum_{k=1}^{P} Y_i(k) Y_j(k)}{\sqrt{\left( \sum_{k=1}^{P} Y_i^2(k) \right) \left( \sum_{k=1}^{P} Y_j^2(k) \right)}}.   (5.14)

In this chapter, we use \alpha = 2, since the Rényi entropy is well-defined for this order even when the distributions are non-positive. Minimizing J_2 is equivalent to minimizing the sum of pairwise normalized inner products between the extracted sources, and ensures disjoint source extraction.
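A direct transcription of the quadratic cost (5.14) is sketched below; Y is assumed to hold one vectorized candidate TFD per row, and the function name is ours.

```python
import numpy as np

def cost_j2(Y):
    # Equation (5.14): sum over all pairs (i, j), i < j, of the
    # normalized inner products between rows of Y.
    N = Y.shape[0]
    total = 0.0
    for i in range(N - 1):
        for j in range(i + 1, N):
            num = np.dot(Y[i], Y[j])
            den = np.sqrt(np.dot(Y[i], Y[i]) * np.dot(Y[j], Y[j]))
            total += num / den
    return total
```

Minimizing this quantity drives the pairwise overlaps toward zero, which is exactly the disjointness requirement.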
5.3.4 Rotation

In our source separation problem, the observed time-frequency distributions, X, can be written as a linear combination of the original sources' TFDs, C, assuming negligible cross-terms between the sources:

X = A^2 C = BC,   (5.15)

where A^2 is the element-by-element square of the mixing matrix in the time domain, and B = A^2. The goal of source separation in this chapter is to find a linear transform Q of the observed signals, X, such that the extracted signals, Y, are as disjoint as possible from each other. The cost function in Section 5.3.3 quantifies the disjointness of the extracted sources using divergence. In this section, we show how to obtain this linear transform Q.

Q should be chosen such that the elements of Y = QX = QBC are disjoint. If the elements of Y are exactly disjoint, then YY^T will be a diagonal matrix, which means that QBCC^TB^TQ^T will also be diagonal. Since the original sources' TFDs, C, are disjoint and normalized, CC^T = I. Therefore, finding a linear transform Q for unmixing the observations reduces to finding a unitary matrix that diagonalizes BB^T. Since B is not known a priori, we estimate the unitary transform Q iteratively. Any unitary matrix Q can be written as a product of Givens rotation matrices, and this formulation allows us to parameterize the estimation of Q in terms of the rotation angles \theta.

It is well-known that any unitary matrix Q can be written as the product of N(N-1)/2 Givens rotation matrices, Q = G_1 G_2 \cdots G_{N(N-1)/2}. In N-dimensional space, the simplest rotation is in a two-dimensional plane. If the rotation is through an angle \theta_{ab} in the a-b plane, then the Givens rotation matrix G_{ab}(\theta_{ab}) is:

G_{ab}(\theta_{ab}) = \begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & & \vdots & & \vdots & & \vdots \\
0 & \cdots & \cos(\theta_{ab}) & \cdots & \sin(\theta_{ab}) & \cdots & 0 \\
\vdots & & \vdots & & \vdots & & \vdots \\
0 & \cdots & -\sin(\theta_{ab}) & \cdots & \cos(\theta_{ab}) & \cdots & 0 \\
\vdots & & \vdots & & \vdots & & \vdots \\
0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{bmatrix},   (5.16)

that is, G_{ab}(\theta_{ab}) equals the N \times N identity matrix I_N except that the elements I_N(a,a), I_N(a,b), I_N(b,a), and I_N(b,b) are replaced by \cos(\theta_{ab}), \sin(\theta_{ab}), -\sin(\theta_{ab}), and \cos(\theta_{ab}), respectively, where I_N(a,b) is the element of I_N located at the ath row and bth column. From [82], we know that any N-dimensional rotation matrix can be written as the product of N(N-1)/2 such two-dimensional-plane rotations:

G(\theta) = G_{12}(\theta_{12}) \cdots G_{ab}(\theta_{ab}) \cdots G_{(N-1)N}(\theta_{(N-1)N}),   (5.17)

where \theta = [\theta_{12}, \ldots, \theta_{ab}, \ldots, \theta_{(N-1)N}]^T and a < b.

In order to have exact source separation in this formulation, there should exist a unitary matrix Q that diagonalizes the mixing matrix B = A^2 on the time-frequency plane. Since B has all positive entries, there is no such Q unless B is already diagonal, which corresponds to observations that are scalar multiples of the sources.

5.3.5 Proposed Algorithm

The objective of the proposed algorithm is to determine the optimal rotation transform such that the total pairwise divergence measure is maximized to achieve signal separation. We use the gradient adaptation algorithm, also known as steepest descent [66], to update the rotation angles. Gradient adaptation is not the only choice, but it is preferred in many practical paradigms due to its simplicity and efficient convergence [83]. The overall update equation for stochastic gradient descent is:

\theta(n+1) = \theta(n) - \mu \frac{\partial J_2}{\partial \theta},   (5.18)

where \mu is the step size parameter. The gradient of the cost function J_2 with respect to the rotation angle \theta_{ab} is derived as:

\frac{\partial J_2}{\partial \theta_{ab}} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{\partial J_{ij}}{\partial \theta_{ab}},   (5.19)

where J_{ij} denotes the (i, j) summand in equation (5.14), and

\frac{\partial J_{ij}}{\partial \theta_{ab}} = \frac{\sum_{k=1}^{P} \left( \frac{\partial Y_i(k)}{\partial \theta_{ab}} Y_j(k) + Y_i(k) \frac{\partial Y_j(k)}{\partial \theta_{ab}} \right)}{\sqrt{\left( \sum_{k=1}^{P} Y_i^2(k) \right) \left( \sum_{k=1}^{P} Y_j^2(k) \right)}} - \frac{\sum_{k=1}^{P} Y_i(k) Y_j(k)}{\left[ \left( \sum_{k=1}^{P} Y_i^2(k) \right) \left( \sum_{k=1}^{P} Y_j^2(k) \right) \right]^{3/2}} \left[ \left( \sum_{k=1}^{P} Y_i(k) \frac{\partial Y_i(k)}{\partial \theta_{ab}} \right) \left( \sum_{k=1}^{P} Y_j^2(k) \right) + \left( \sum_{k=1}^{P} Y_i^2(k) \right) \left( \sum_{k=1}^{P} Y_j(k) \frac{\partial Y_j(k)}{\partial \theta_{ab}} \right) \right],   (5.20)

with Y_i(k) = G_i X(k) and \partial Y_i(k) / \partial \theta_{ab} = (\partial G_i / \partial \theta_{ab}) X(k), where G_i is the ith row of G(\theta), and X(k) is the kth column of X. The explicit gradient expression for N = 3 is given in the Appendix.
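The pieces of Sections 5.3.4 and 5.3.5 can be put together in a short sketch; for brevity the gradient in the update (5.18) is approximated here by central finite differences rather than the analytic expression (5.20), and all names and step-size values are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def givens(N, a, b, theta):
    # N x N Givens rotation in the (a, b) plane, equation (5.16).
    G = np.eye(N)
    G[a, a] = G[b, b] = np.cos(theta)
    G[a, b], G[b, a] = np.sin(theta), -np.sin(theta)
    return G

def rotation(thetas, N):
    # Product of the N(N-1)/2 plane rotations, equation (5.17).
    Q = np.eye(N)
    for (a, b), th in zip(combinations(range(N), 2), thetas):
        Q = Q @ givens(N, a, b, th)
    return Q

def cost_j2(Y):
    # Pairwise normalized inner products, equation (5.14).
    N = Y.shape[0]
    return sum(np.dot(Y[i], Y[j])
               / np.sqrt(np.dot(Y[i], Y[i]) * np.dot(Y[j], Y[j]))
               for i in range(N - 1) for j in range(i + 1, N))

def separate(X, mu=0.05, iters=200, eps=1e-5):
    # Steepest descent on the rotation angles, equation (5.18);
    # J2 is minimized for alpha = 2.
    N = X.shape[0]
    thetas = np.zeros(N * (N - 1) // 2)
    J = lambda th: cost_j2(rotation(th, N) @ X)
    for _ in range(iters):
        grad = np.array([(J(thetas + eps * e) - J(thetas - eps * e)) / (2 * eps)
                         for e in np.eye(thetas.size)])
        thetas -= mu * grad
    return rotation(thetas, N) @ X
```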
5.4 Experimental Results and Analysis

In order to evaluate the effectiveness of the proposed method, we consider various source separation examples. In all of the examples with synthesized signals, the sources are assumed to be approximately disjoint on the time-frequency plane. Each observation is transformed to the time-frequency domain with K = 50 time samples and L = 64 frequency samples. Each TFD is vectorized to form a TFD observation matrix of size M x 3200, as in equation (5.7), where M is the number of observations. The Jensen-Rényi divergence with order \alpha = 2 is used as the cost function to ensure that the divergence is well-defined. The binomial kernel [23] is used for computing the TFD, since it belongs to the class of reduced interference distributions (RIDs) and thus has negligible cross-terms. This property of the distributions improves the performance of our source separation algorithm. The performance of the proposed method is quantified in terms of the accuracy of the extracted sources, convergence rate, and robustness to noise.

All of the experimental results are evaluated using the signal to interference ratio (SIR), defined as:

SIR = \frac{1}{N} \sum_{i=1}^{N} SIR_i, \quad SIR_i = SIRO_i - SIRI_i,

SIRO_i = 10 \log \frac{\sum_{k=1}^{P} Y_{i s_i}^2(k)}{\sum_{k=1}^{P} \left( \sum_{j \neq i} Y_{i s_j}(k) \right)^2}, \quad SIRI_i = 10 \log \frac{\sum_{k=1}^{P} X_{i s_i}^2(k)}{\sum_{k=1}^{P} \left( \sum_{j \neq i} X_{i s_j}(k) \right)^2},   (5.21)

where Y_{i s_j} and X_{i s_j} are the outputs and inputs of the system when only the signal s_j is active, respectively, and N is the number of sources.

Example 1: Separation of a chirp signal and two Gabor logons

In this example, the set of observed signals consists of three linear combinations of a chirp signal and two Gabor logons, i.e. z_i(t) = a_{i1} s_1(t) + a_{i2} s_2(t) + a_{i3} s_3(t), where a_{i1}, a_{i2}, a_{i3} are the weights for each signal, distributed as N(0, 1), i = 1, 2, 3, and s_1(t), s_2(t), s_3(t) correspond to the two Gabor logons and the chirp signal, respectively. A Gabor logon is a modulated Gaussian, expressed as:

s_i(t) = \left( \frac{1}{\pi \sigma^2} \right)^{1/4} e^{-(t - t_{i0})^2 / (2\sigma^2)} e^{j \omega_{i0} t}, \quad i = 1, 2.   (5.22)

The Gabor logon has virtually compact support in both time and frequency, centered in time at t = t_{i0} and in frequency at \omega = \omega_{i0}. In this example, the first Gabor logon is centered at the time sample point t_{10} = 50 and normalized frequency \omega_{10} = 0.7, and the second Gabor logon is centered at the time sample point t_{20} = 150 and normalized frequency \omega_{20} = -0.7. A linear Gaussian chirp is expressed as:

s_3(t) = \left( \frac{1}{\pi \sigma^2} \right)^{1/4} e^{-(t - t_0)^2 / (2\sigma^2) + j \left[ \omega_0 + \frac{\beta}{2}(t - t_0) \right] (t - t_0)},   (5.23)

where t_0, \omega_0 are the time and frequency centers, and \sigma, \beta are the time spread and frequency modulation rate of the chirp, respectively. In this example, the linear chirp signal has an initial normalized frequency of -0.2, and its instantaneous frequency increases to a normalized frequency of 0.2, with t_0 = \omega_0 = 0. The chirp signal overlaps with the two Gabor logons in the time domain, so it is not possible to separate them using time domain decomposition approaches. However, Figure 5.2 illustrates that these three signals can be effectively extracted on the time-frequency plane using the proposed method through an optimal rotation under the divergence criterion, with an average SIR of 37.5169 dB. Moreover, the convergence rate is high, as shown in Figure 5.3.

Figure 5.2. The mixture and the separation of a chirp and two Gabor logons: (a) the mixture; (b) and (d) the two extracted Gabor logons; (c) the extracted chirp

Figure 5.3. The cost function versus the number of iterations for Example 1

Example 2: Separation of two crossing chirp signals

In this example, we consider the separation of two signals overlapping in the time-frequency domain. A mixture of two linear chirp signals is used for source separation. One of the chirp signals has an initial normalized frequency of -0.8, and its instantaneous frequency increases to a normalized frequency of 0.8; the other has an initial normalized frequency of 0.8, and its instantaneous frequency decreases to a normalized frequency of -0.8. Obviously, these two chirp signals overlap with each other in both the time and frequency domains, so typical time domain or frequency domain separation methods cannot perfectly recover them. Figure 5.4 shows that using the proposed approach, we can successfully separate these two chirp signals from their mixtures with an average SIR of 32.1164 dB. It can be seen from the figures that most of the error occurs in the time-frequency region where the two signals overlap. This is due to the fact that the crossing chirps do not exactly satisfy our underlying assumption of disjoint sources.

Figure 5.4. The mixture and the separation of two crossing chirp signals: (a) the mixture; (b) and (c) the separated signals
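A sketch of how the synthetic test signals of Examples 1 and 2 might be generated is shown below; the spread parameters, the time center of the chirp, and the convention that a normalized frequency of 1.0 corresponds to pi radians per sample are our assumptions, and amplitude normalization is omitted.

```python
import numpy as np

def gabor_logon(t, t0, w0, sigma):
    # Modulated Gaussian as in equation (5.22), up to normalization.
    return np.exp(-(t - t0) ** 2 / (2 * sigma ** 2)) * np.exp(1j * w0 * t)

def linear_chirp(t, t0, w0, beta, sigma):
    # Linear Gaussian chirp as in equation (5.23): instantaneous
    # frequency w0 + beta * (t - t0) around the time center t0.
    phase = (w0 + 0.5 * beta * (t - t0)) * (t - t0)
    return np.exp(-(t - t0) ** 2 / (2 * sigma ** 2)) * np.exp(1j * phase)

t = np.arange(200)
s1 = gabor_logon(t, t0=50, w0=0.7 * np.pi, sigma=8.0)
s2 = gabor_logon(t, t0=150, w0=-0.7 * np.pi, sigma=8.0)
# Chirp sweeping from normalized frequency -0.2 to 0.2 over the window;
# the time center is placed mid-window so the sweep spans the record.
s3 = linear_chirp(t, t0=100, w0=0.0, beta=0.4 * np.pi / 200, sigma=40.0)
```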
Example 3: Separation of two speech sources

In this example, we consider mixtures of two speech signals from two speakers, one female and one male. The two speakers' voices are recorded by two microphones 3 m directly in front of the speakers in an anechoic chamber. Due to time delay, these two signals partly overlap with each other in the time domain. The TFDs of the original speech signals and their mixtures are shown in Figure 5.5 and Figure 5.6, respectively. Figure 5.7 shows the TFDs of the speech signals extracted by the proposed method. The SIR is 26.0482 dB in this case, since the two speech signals have partial overlap on the time-frequency plane.

Figure 5.5. TFDs of the two individual speech signals: (a) TFD of the female speaker; (b) TFD of the male speaker

In order to further investigate the robustness of the proposed algorithm for real-life signals, we add white Gaussian noise to the two speech signals over an SNR range of [-8, 8] dB. It is shown in Figure 5.8 that the proposed method is robust against noise and achieves separation of the speech signals even at low input SNRs.

Example 4: Performance comparison with the STFD and FastICA methods

In order to further evaluate the performance of the proposed approach, we compare it with two different methods: a time-frequency based source separation method, the spatial time-frequency distribution (STFD) method [58], and an information-theoretic method, FastICA [84], adapted to the time-frequency domain. The STFD method is based on the joint diagonalization of a combined set of spatial time-frequency distribution matrices. STFD matrices are made up of the auto- and cross-TFDs of the data snapshots across the multisensor array, and they are expressed in terms of the TFD matrices of the sources. The diagonal structure of the TFD matrix of the sources is essential for the STFD method and is enforced by using only the information in the time-frequency points corresponding to the signal auto-terms. The benefit of using STFDs in a non-stationary signal environment is the direct exploitation of the information brought by the non-stationarity of the signals.

FastICA combines Comon's information-theoretic approach [9] and the projection pursuit approach [85]. A family of contrast (objective) functions for ICA is introduced using maximum entropy approximations of differential entropy. These contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and the estimation of individual independent components as projection pursuit directions. The FastICA algorithm has a fast convergence rate and is robust under noise. In this example, the TFD observation matrix, X in equation (5.7), is used as the input to the FastICA algorithm to achieve source separation. SIR is used as the performance criterion for the two crossing chirp signals discussed in Example 2, with white Gaussian noise added over an SNR range of [-8, 8] dB.
We use 100 Monte Carlo simulations for each noise level. Figure 5.9 compares the robustness of the proposed approach in noise to the STFD and FastICA methods. It is evident that both the proposed approach and the STFD method are superior to FastICA under noise. This is due to the fact that the assumption that the underlying source signals are independent does not necessarily apply to time-frequency distributions. We can also see that the proposed approach performs better than the STFD method as the noise level increases. The reason is that as the noise level increases, the energy of the cross-terms increases, making it harder to differentiate between the auto- and the cross-terms in the STFD method. This results in errors in the estimation of the sources, consequently reducing the SIR. The proposed method, on the other hand, assumes that the cross-terms are negligible. When the signal is very noisy, the cross-terms between the signal and the noise become significant, and neglecting these cross-terms amounts to denoising the observed TFDs, resulting in higher SIRs. The STFD method also has a higher computational complexity than the proposed method, since it computes both the auto- and cross-terms of the observed signals, whereas our method uses only the auto-terms by using a RID kernel.

Figure 5.6. TFDs of the observed signals: (a) TFD of the first mixture; (b) TFD of the second mixture

Figure 5.7. TFDs of the extracted signals: (a) TFD of the estimate of the female speaker; (b) TFD of the estimate of the male speaker

Figure 5.8. Output SIR versus input SNR for speech signals

Figure 5.9. Comparison of output SIR versus input SNR for three different source separation methods

Example 5: Performance comparison with PCA

In this example, we compare the proposed source separation method with PCA for the mixture of the two Gabor logons in Example 1. PCA is an orthogonal decomposition of the observed data matrix, just like the proposed method. However, the cost functions used are different, which results in the difference seen in the extracted components in Figure 5.10. With PCA, the variance explained by each component is maximized, whereas the proposed method maximizes the divergence between the components, resulting in better separated sources.

Example 6: Number of mixtures greater than the number of sources

In this example, we consider a more general situation where the number of mixtures is larger than the number of sources. For M mixtures and N sources (M > N), we construct a new N x M rotation matrix as follows:

G_{NM}(\theta) = I_{NM} G_M(\theta),   (5.24)

where G_M(\theta) is an M x M rotation matrix given by equation (5.17), and I_{NM} is an N x M matrix whose elements equal 1 if i = j and 0 otherwise, with i, j the row and column indices, respectively. The source signals are the chirp signal and one of the two Gabor logons in Example 1. We use the proposed approach with this new rotation matrix to extract these two signals from their three mixtures. It is shown in Figure 5.11 that the source signals can be effectively extracted when the number of mixtures is greater than the number of sources, with an average SIR of 39.8827 dB.
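A sketch of the construction in equation (5.24), the first N rows of an M-dimensional rotation, might look as follows; the function name is ours.

```python
import numpy as np
from itertools import combinations

def rectangular_rotation(thetas, N, M):
    # G_NM(theta) = I_NM @ G_M(theta): an N x M transform obtained by
    # keeping the first N rows of an M x M rotation built from
    # M(M-1)/2 Givens rotations, equation (5.24).
    G = np.eye(M)
    for (a, b), th in zip(combinations(range(M), 2), thetas):
        R = np.eye(M)
        R[a, a] = R[b, b] = np.cos(th)
        R[a, b], R[b, a] = np.sin(th), -np.sin(th)
        G = G @ R
    return np.eye(N, M) @ G        # I_NM has ones where i == j
```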
Figure 5.10. Comparison with PCA for two Gabor logon extraction: (a) the mixture; (b) and (c) the components extracted by the proposed method; (d) and (e) the components extracted by PCA

5.5 Summary

In this chapter, a new approach is presented for the separation of non-stationary signals on the time-frequency plane using an information-theoretic cost function. The proposed algorithm assumes the disjointness of the underlying signals on the time-frequency plane. This assumption allows us to extract the sources through an N-dimensional Givens rotation. Using the Jensen-Rényi divergence as the cost function, a steepest descent algorithm is implemented to update the rotation angles. Several examples are given to illustrate the performance of the proposed algorithm for synthesized and real life signals. Issues regarding convergence rate and robustness under noise are investigated. The performance of the algorithm is illustrated under noise and is compared to PCA and ICA as adapted to the time-frequency plane, and to STFD. The results illustrate that maximizing the divergence on the time-frequency plane can separate sources that are disjoint in the time-frequency domain, and that this approach is better than the mutual information cost function used in ICA in terms of fidelity to the original sources. The proposed method also outperforms STFD at high noise levels, since it assumes the cross-terms between sources are negligible, which effectively denoises the observed time-frequency matrix, and it is clearly superior to PCA.

Figure 5.11. Separation of a chirp and a Gabor logon from their three mixtures: (a) the mixture; (b) the extracted Gabor logon; (c) the extracted chirp

CHAPTER 6

UNDERDETERMINED BLIND SOURCE SEPARATION IN THE TIME-FREQUENCY DOMAIN

6.1 Introduction

Underdetermined blind source separation (UBSS) is a more challenging problem than the (over)determined case because, contrary to the (over)determined case, estimating the mixing system is not sufficient for reconstruction of the sources, since the mixing matrix is not invertible. Therefore, we need additional a priori information about the sources to allow for reconstruction. One increasingly popular and powerful assumption is the sparsity of the sources on a given basis [5,86,87]. A signal is said to be sparse when it is zero or nearly zero more often than might be expected from its variance. Such a signal has a probability density function or distribution of values with a sharp peak at zero and fat tails. The advantage of a sparse signal representation is that the probability of two or more sources being simultaneously active is low. Thus, sparse representations lend themselves to good separability, because most of the energy in a basis coefficient at any time instant belongs to a single source. This statistical property of the sources results in a well-defined structure being imposed by the mixing process on the resultant mixtures, which can be exploited to make estimating the mixing process much easier. Sparse representation of the signals, which is modeled by matrix factorization, has received a great deal of interest and has been applied to blind source separation in recent years.
In several references, the mixing matrix and sources are estimated using the maximum posterior approach, the maximum likelihood approach, and the expectation maximization algorithm [50,88-92]. However, these algorithms may get stuck in local minima and have poor convergence properties. In [93], a blind source separation method is developed via multi-node sparse representation. Based on several subsets of wavelet packet coefficients, the mixing matrix is estimated using the Fuzzy C-means clustering algorithm, and the sources are recovered using the inverse of the estimated mixing matrix. However, the case of fewer sensors than sources is not discussed.

In this chapter, we introduce a sparse factorization approach to the UBSS problem in the time-frequency domain, in which the mixing matrix is estimated using the K-means clustering method, while the sources are estimated using a linear programming method. In [94], equivalence results for the l0-norm solution and the l1-norm solution are obtained using a probabilistic approach. These results show that if the sources are sufficiently sparse in the analyzed domain, they are more likely to be equal to the l1-norm solution, which can be obtained using a linear programming method.

6.2 Sparse Factorization Approach for UBSS in the Time-Frequency Domain

In this section, a two-stage sparse factorization approach for the UBSS problem in the time-frequency domain is presented: the first stage determines the mixing matrix, and the second stage estimates the source signals.

6.2.1 Linear Mixture Model and Assumptions

We first present the system model and assumptions. The observed M mixtures, z(t) = [z_1(t), z_2(t), \ldots, z_M(t)]^T, of the N non-stationary complex source signals, s(t) = [s_1(t), s_2(t), \ldots, s_N(t)]^T, may be modeled linearly in the time domain as

z(t) = B s(t),   (6.1)

where B is the M x N instantaneous mixing matrix.

Each mixture, z_i(t), is transformed to the time-frequency plane, and the corresponding time-frequency distribution is vectorized to form a matrix of time-frequency distributions, X. From Chapter 5, it is known that the time-frequency distributions of the mixtures, X, can be written approximately as a linear combination of the original sources' TFDs, S, assuming the cross-terms between the sources are made negligible by using a RID:

X \approx B^2 S = AS,   (6.2)

where X = [x_1, \ldots, x_P] \in R^{M \times P}, S = [s_1, \ldots, s_P] \in R^{N \times P}, P = I \times L, I and L are the numbers of time and frequency points respectively, A = B^2 = [a_1, \ldots, a_N] \in R^{M \times N} with normalized columns, i.e., ||a_1|| = \cdots = ||a_N|| = 1, and B^2 is the element-by-element square of the mixing matrix in the time domain. The task of BSS is to recover the sources S using only the mixture matrix X. Here, we assume M < N, which makes the BSS problem underdetermined, and that the source signals are sparse in the time-frequency domain. The sparsity of the sources plays a key role in this chapter.

It is well known that in general there exist many possible solutions to the model (6.2). For a given mixing matrix, under the l1-norm sparsity measure, a uniqueness result for the sparse solution is obtained, and the number of nonzero entries of the sparse solution cannot be reduced. It is also found that the matrix whose column vectors are the cluster centers of the mixtures X is a sub-optimal mixing matrix, which can be obtained using the K-means clustering algorithm [94].
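The construction of the mixture matrix X in (6.2) amounts to flattening each channel's TFD into a row; a minimal sketch, with hypothetical shapes and random data standing in for real TFDs, is given below.

```python
import numpy as np

rng = np.random.default_rng(1)
M, I, L = 2, 50, 64                     # channels, time points, frequency points

# One TFD per observed mixture; random non-negative data as a stand-in.
tfds = np.abs(rng.standard_normal((M, I, L)))

# Vectorize each TFD into a row of X (equation (6.2)), so X is M x P
# with P = I * L.
X = tfds.reshape(M, I * L)

# Column-normalization convention for the mixing matrix: ||a_k|| = 1.
A = rng.standard_normal((M, 4))         # M x N with N = 4 sources (illustrative)
A /= np.linalg.norm(A, axis=0, keepdims=True)
```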
6.2.2 K-means Clustering

K-means clustering is an iterative algorithm that seeks to minimize a squared-error criterion function in order to separate a completely unknown set of data into K different groupings [95]. Suppose that x_1, x_2, \ldots, x_n are the vector observations in a data set and are realizations of K different distributions of random variables. Then \mu_1, \mu_2, \ldots, \mu_K are the mean vectors of these distributions, and K-means clustering seeks to categorize the observations, x_i, into one of the K distributions such that the squared Euclidean distance, ||x_i - \mu_j||^2, is minimized. However, since the properties of the data set are unknown, \mu_1, \mu_2, \ldots, \mu_K must be estimated first. As a starting point, K random samples of the data are chosen as the initial mean estimates, \hat{\mu}_j. The distributions are then estimated by classifying each point, x_i, into the group whose estimated mean it is closest to in the squared Euclidean sense, so that x_i belongs to group j when j satisfies

j = \arg\min_{j} ||x_i - \hat{\mu}_j||^2.   (6.3)

Once all data points are classified, the mean of each group is recalculated. Suppose m_j is the number of data points in the jth distribution, and x_{1j}, x_{2j}, \ldots, x_{m_j j} are these data points. The new mean is then calculated as

\hat{\mu}_j = \frac{1}{m_j} \sum_{i=1}^{m_j} x_{ij}.   (6.4)

This process is repeated until convergence, when the estimated means no longer change upon further iterations.

6.2.3 Determination of the Mixing Matrix

Due to the sparsity of the source signals in the time-frequency domain, there exist many columns of S with only one nonzero entry. For instance, suppose that s_{i_1}, \ldots, s_{i_K} are K columns of S in which only the first entry is nonzero. Then we have

A s_{i_j} = a_1 s_{1 i_j}, \quad j = 1, \ldots, K,   (6.5)

and

[x_{i_1}, \ldots, x_{i_K}] = A [s_{i_1}, \ldots, s_{i_K}] = [a_1 s_{1 i_1}, \ldots, a_1 s_{1 i_K}],   (6.6)

where x_{i_j} is the i_jth column of X, a_1 is the first column of A, and s_{1 i_j} is the first entry of s_{i_j}. From equation (6.6), we see that each x_{i_j} equals a_1 multiplied by the scalar s_{1 i_j}, which means that these K column vectors of X, x_{i_1}, \ldots, x_{i_K}, are distributed along the direction of a_1. Thus, ideally, after normalization, x_{i_1}, \ldots, x_{i_K} are mapped to a unique vector on the multidimensional unit circle which is equal to a_1. In practice, however, the sources are unlikely to be completely sparse in the time-frequency domain. That is, s_{i_1}, \ldots, s_{i_K} have dominant first entries, so that s_{1 i_j} >> s_{r i_j} for r \neq 1 and j = 1, \ldots, K. When more than one source is nonzero, x_{i_1}, \ldots, x_{i_K} are not exactly in the same direction as a_1, but rather in the neighborhood of a_1. This means that a_1 lies at the center of x_{i_1}, \ldots, x_{i_K}. Therefore, we use the K-means clustering method to cluster the column vectors of the mixture matrix X into different clusters, where the center of each cluster corresponds to one column vector of the mixing matrix A. By doing so, we obtain an estimate of the mixing matrix A. The algorithm is summarized as follows:

Algorithm 1: Determining the mixing matrix

1. Normalize the column vectors of the mixture matrix X.

2. Take a sufficiently large positive integer K as the number of clusters. Choose the initial points of the iteration and the distance measure criterion; here, we choose the squared Euclidean distance as the criterion.

3. Perform the K-means clustering iteration followed by normalization to estimate the sub-optimal mixing matrix. Note that if two column vectors have opposite directions, only one is taken. A sketch of this procedure is given below.
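The sketch below is one way Algorithm 1 might be realized in Python; the significance threshold, the sign-folding heuristic, and the function name are our assumptions rather than part of the algorithm as stated.

```python
import numpy as np

def estimate_mixing_matrix(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: keep significant columns and normalize them to unit length.
    keep = np.linalg.norm(X, axis=0) > 1e-6 * np.abs(X).max()
    cols = X[:, keep] / np.linalg.norm(X[:, keep], axis=0, keepdims=True)
    # Fold opposite directions together (simple sign heuristic).
    cols[:, cols[0] < 0] *= -1
    # Steps 2-3: K-means with the squared Euclidean distance criterion.
    centers = cols[:, rng.choice(cols.shape[1], K, replace=False)]
    for _ in range(iters):
        d = ((cols[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)
        labels = d.argmin(axis=0)       # nearest center per column
        for k in range(K):
            if np.any(labels == k):
                centers[:, k] = cols[:, labels == k].mean(axis=1)
    # Normalized cluster centers estimate the columns of A.
    return centers / np.linalg.norm(centers, axis=0, keepdims=True)
```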
6.2.4 Estimation of the Source Signals for a Given Mixing Matrix

After obtaining the estimated mixing matrix, the next stage is to estimate the source signals. For a given mixing matrix A in model (6.2), the source signals can be estimated by maximizing the posterior distribution P(S|X, A) of S [96]. Under the assumption that the prior is Laplacian, maximizing the posterior distribution can be implemented by solving the following optimization problem [6]:

\min \sum_{i=1}^{N} \sum_{j=1}^{P} |s_{ij}|, \quad \text{subject to } AS = X.   (6.7)

Hence, the l1-norm

J_1(S) = \sum_{i=1}^{N} \sum_{j=1}^{P} |s_{ij}|   (6.8)

can be used as the sparsity measure. It is not difficult to prove that the optimization problem (6.7) is equivalent to the following set of P smaller scale linear programming (LP) problems:

\min \sum_{i=1}^{N} |s_{ij}|, \quad \text{subject to } A s_j = x_j \text{ for } j = 1, \ldots, P.   (6.9)

By setting S = U - V, where U = [u_{ij}] \in R^{N \times P} \geq 0 and V = [v_{ij}] \in R^{N \times P} \geq 0, equation (6.9) can be converted to the following standard LP problems with nonnegative constraints:

\min \sum_{i=1}^{N} (u_{ij} + v_{ij}), \quad \text{subject to } [A, -A][u_j^T, v_j^T]^T = x_j, \; u_j \geq 0, \; v_j \geq 0 \text{ for } j = 1, \ldots, P.   (6.10)

Combining the discussion of this section and the previous sections, we have the algorithm for estimating the source signals:

Algorithm 2: Blindly separating the sparse source signals

1. Prethreshold the mixture matrix X to obtain a sparser data matrix.

2. Estimate the mixing matrix A using Algorithm 1 and the prethresholded data.

3. Using the estimated mixing matrix A and the mixtures X, estimate the time-frequency representations S by solving the set of LP problems (6.10). A sketch of this LP stage is given after Example 1 below.

6.3 Experimental Results and Analysis

In this section, several examples are used to illustrate the effectiveness of the proposed approach in separating sparse source signals from a smaller number of mixtures in the time-frequency domain. The binomial kernel [23] is used for computing the TFD, since it belongs to the class of reduced interference distributions (RIDs).

Example 1: The set of observed signals consists of two linear combinations of four Gabor logons. The four Gabor logons are centered at the time sample point and normalized frequency pairs (30, 0.7), (160, -0.7), (70, -0.4), and (120, 0.1), respectively. Each observed signal is first transformed to the time-frequency domain with I = 50 time samples and L = 64 frequency samples. Each TFD is then vectorized to form a TFD mixture matrix X = [X_1; X_2] of size 2 x 3200.

Figure 6.1 presents a scatter plot of the mixtures X (X_2 vs. X_1) in the time-frequency domain. It can be seen from this plot that almost all significant data points are distributed along four different directions, thus providing very good separability. The separation results using the proposed approach are illustrated in Figure 6.2. Figure 6.2 (a) and (b) show the two mixtures; the four extracted Gabor logons are shown in Figure 6.2 (c), (d), (e), and (f). The results indicate that these four Gabor logons can be successfully separated from their two mixtures using the proposed approach based on their sparsity, with an average signal to interference ratio (SIR) of 36.1251 dB.

Figure 6.1. Scatter plot of two mixtures of four Gabor logons in the time-frequency domain
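Step 3 of Algorithm 2, the l1-minimal recovery of equation (6.10), can be sketched with an off-the-shelf LP solver; the use of scipy's linprog and the function name are our choices, not part of the original formulation.

```python
import numpy as np
from scipy.optimize import linprog

def recover_sources(A, X):
    # For each column x_j of X, solve
    #   min sum(u + v)  s.t.  [A, -A][u; v] = x_j,  u >= 0, v >= 0,
    # and recover s_j = u - v (equation (6.10)).
    M, N = A.shape
    S = np.zeros((N, X.shape[1]))
    c = np.ones(2 * N)
    A_eq = np.hstack([A, -A])
    for j in range(X.shape[1]):
        res = linprog(c, A_eq=A_eq, b_eq=X[:, j], bounds=(0, None))
        if res.success:
            S[:, j] = res.x[:N] - res.x[N:]
    return S
```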
Example 2: Two mixtures of a chirp signal and two Gabor logons are given. The chirp signal has a linear frequency increasing from an initial normalized frequency of -0.2 to a normalized frequency of 0.2. The Gabor logons are the first two Gabor logons given in Example 1. A scatter plot of the two mixtures in Figure 6.3 shows that, as in the first example, the data points belonging to different sources are distributed along different directions, here three of them. Since the chirp signal overlaps with the two Gabor logons in the time domain, typical time domain separation methods cannot perfectly recover them. However, Figure 6.4 illustrates that these three signals can be effectively extracted in the time-frequency domain using the proposed method, with an average SIR of 32.7634 dB.

Example 3: In this example, the same two mixtures of four Gabor logons given in Example 1 are used. The effectiveness of the presented approach is compared for TFDs and wavelet packets (WP) in the presence of noise. The Haar wavelet with five levels is used for the wavelet packet decomposition. To show the effect of the increased sparsity of TFDs, the mixtures are considered at different levels of white Gaussian noise, with 100 Monte Carlo simulations for each noise level. The average mean squared error (MSE) between the extracted sources and the original sources is calculated for both the TFD and WP. The TFD and WP provide similar results when there is no noise. However, as the noise level increases, the performance of the WP degrades rapidly compared to that of the TFD. The MSE versus the signal-to-noise ratio (SNR) is shown in Figure 6.5 for both the TFD and WP. This result shows that the RID yields a sparser time-frequency surface than the WP, which improves the robustness of BSS under noise.

6.4 Summary

In this chapter, a two-stage approach is introduced for the underdetermined blind separation of sparse and non-stationary sources using TFDs. The mixing matrix is estimated using the K-means clustering algorithm based on the sparsity of the sources; for a given mixing matrix, the sources are extracted using a linear programming method. The performance of the proposed approach is compared with wavelet packets under different noise levels. The results show that the presented method is simple and effective at separating the sources from their mixtures, and is more robust than wavelet packets in noisy environments.

Figure 6.2. The mixtures and the separation of four Gabor logons: (a) and (b) two mixtures; (c), (d), (e), and (f) four extracted Gabor logons

Figure 6.3. Scatter plot of two mixtures of a chirp and two Gabor logons in the time-frequency domain

Figure 6.4. The mixtures and the separation of a chirp and two Gabor logons: (a) and (b) two mixtures; (c) extracted chirp; (d) and (e) two extracted Gabor logons

Figure 6.5. Comparison of MSE versus SNR for extracted sources with TFD and WP

CHAPTER 7

APPLICATIONS OF THE UBSS ALGORITHM TO ELECTROENCEPHALOGRAM SIGNALS

It is well-known that there is a broad range of applications for blind source separation. A classical example is the cocktail party problem. A more practical application is noise reduction. Another application area is economic time series [97]. Recently, telecommunications applications have also been published [1].
Besides these applications, one popular application of source separation is biomedical signal processing [3], such as the separation of electroencephalogram (EEG) signals, which consist of recordings of brain activity obtained using electrodes attached to the scalp. Decomposition of evoked field potentials measured by EEG [98] is an application of considerable interest in the neurosciences. In this chapter, we apply the UBSS approach proposed in Chapter 6 to EEG signals in order to evaluate its effectiveness on real life signals.

7.1 Introduction to Electroencephalogram and Event-Related Potential

7.1.1 Electroencephalogram

Electroencephalography is the neurophysiologic measurement of the electrical activity of the brain, recorded by electrodes placed on the scalp or, in special cases, subdurally or in the cerebral cortex. The resulting traces are known as an electroencephalogram (EEG) and are reflections of the temporal and spatial summation of synchronized post-synaptic cortical potentials [99]. Specifically, EEG data represents the synchronous activity of large cortical groups of neurons, measured as integrated electrical signals on the scalp.

In conventional scalp EEG, the recording is obtained by placing the electrodes on the scalp in special positions with a conductive gel. Some EEG systems use a plastic cap into which the electrodes are embedded. The electrode positions on the scalp are identified by the recordist, who measures the head using the International 10-20 System. This relies on taking measurements between certain fixed points on the head; the electrodes are then placed at points that are 10% and 20% of these distances. Each electrode site is labeled with a letter and a number. The letter refers to the area of the brain underlying the electrode, e.g. F for the frontal lobe and T for the temporal lobe. Even numbers denote the right side of the head and odd numbers the left side.

EEG activity can be subdivided into various types of frequency rhythms or bands. Research has indicated that different EEG frequency bands are associated with different mental states [100]. In general, EEG activity is broken down into 4 distinct frequency bands:

1. Beta activity, 13 Hz-30 Hz. Beta activity is a normal activity present when the eyes are open or closed. It tends to be seen in the channels recorded from the center or front of the head.

2. Alpha activity, 8 Hz-13 Hz. Alpha activity is also a normal activity when present in waking adults. It is mainly seen in the channels recorded from the back of the head. It is fairly symmetrical and has an amplitude of 40 uV to 100 uV. It is only seen when the eyes are closed and should disappear or reduce in amplitude when the eyes are open.

3. Theta activity, 4 Hz-7 Hz. Theta activity can be classed as both a normal and an abnormal activity depending on the age and state of the patient. In adults it is normal if the patient is drowsy; however, it can also indicate brain dysfunction in a patient who is alert and awake. In younger patients, theta activity may be the main activity seen in channels recorded from the back and central areas of the head.

4. Delta activity, < 4 Hz. Delta activity is only normal in an adult patient if they are in a moderate to deep sleep. Seen at any other time, it would indicate brain dysfunction.

EEG is preferred in many applications for exploring brain activity thanks to its high time resolution.
Other methods for studying brain activity have time resolutions between seconds and minutes, while EEG has a temporal resolution down to the sub-millisecond range [101]. As the brain is thought to work through its electrical activity, EEG is the only method to measure it directly. Other methods for exploring brain function rely on blood flow or metabolism, which may be decoupled from the brain's electrical activity.

7.1.2 Event-Related Potential

An event-related potential (ERP) is an electrophysiological response to an internal or external stimulus. More simply, it is a measured brain response that is directly the result of a thought or perception. ERPs can be reliably measured using EEG. As the EEG reflects thousands of simultaneously ongoing brain processes, the brain response to a certain stimulus or event of interest is usually not visible in the raw EEG. One of the most robust features of the ERP response is the response to unpredictable stimuli. This response, known as the P300 (or simply "P3"), manifests as a positive deflection in voltage approximately 300 milliseconds after the stimulus is presented.

In actual recording situations, it is difficult to see an ERP after the presentation of a single stimulus. Rather, the most robust ERPs are seen after many dozens or hundreds of individual presentations are averaged together. This technique cancels out noise in the data, allowing only the voltage response to the stimulus to stand out clearly. While evoked potentials reflect the processing of the physical stimulus, ERPs are caused by "higher" processes that might involve memory, expectation, attention, or changes in mental state, among others.

ERPs have found numerous applications in clinical neurophysiology and psychiatry [102,103]. This is because their recording is noninvasive and accurate, and they are consistently shown to be an indicator of brain function and its abnormalities. The clinical applications of ERPs could be significantly extended if they could be interpreted more accurately and effectively. This requires the development of novel signal processing methods. In recent years, there has been tremendous growth in applying signal processing techniques such as independent component analysis, wavelets, and time-frequency methods to separating the source signals and extracting useful information about the underlying brain activity [13,104].

Event-related potentials, like most other real life signals, are non-stationary, and thus can best be tackled using non-stationary signal analysis techniques such as time-frequency distributions (TFDs) and wavelet analysis. In the next section, we apply the UBSS algorithm presented in Chapter 6 to an ERP data set in the time-frequency domain and compare its performance with ICA, which is one of the main methods used for the extraction of EEG/ERP sources in both research and practice.

7.2 Experimental Results and Performance Analysis

7.2.1 EEG/ERP Data Set

The EEG/ERP data analyzed in this chapter was released by the Swartz Center for Computational Neuroscience at the University of California, San Diego [105]. The study consisted of one subject and 32 electrodes. In the selective visual attention experiment, there were two types of events, "square" and "rt", with "square" events corresponding to the appearance of a green colored square in the display and "rt" to the reaction time of the subject. The square could be presented at five locations on the screen distributed along the horizontal axis.
Here we only consider presentation on the left, i.e. positions 1 and 2 as indicated by the "position" field (at about 3 degrees and 1.5 degrees of visual angle, respectively). In this experiment, the subject covertly attended to the selected location on the computer screen and responded with a quick thumb button press only when a square was presented at this location. The subject was to ignore circles presented either at the attended location or at an unattended location. To reduce the amount of data to process, the data set used in our analysis contains only target (i.e., "square") stimuli presented at the two left-visual-field attended locations. Furthermore, 6 electrodes are chosen from the 32: F3, F4, Cz, P3, P4, and Oz in the International 10-20 System. The stimulus is repeated 40 times, resulting in a total of 80 trials per electrode.

7.2.2 Single-Trial EEG

The goal in single-trial EEG analysis is to extract the individual underlying sources in the brain which are generated in a localized area. With successful source extraction, analysis of individual responses of the brain can be performed, and the dynamic variability of the EEG responses can be compared on a trial to trial basis. In this way, observations can be made on all factors affecting the subject's performance.

A comparison is made between the algorithm outlined in Chapter 6 and ICA applied to the same data. Both of these BSS techniques are applied to all 80 trials of data available. In the application of the proposed UBSS approach of Chapter 6, the number of sources to extract, k, must first be chosen. This value is empirically chosen such that it is greater than the number of electrodes, 6. In our analysis, the experiment is done using 32 frequency bins, for which k is 8. This number is chosen since a higher number of sources resulted in sources that were too sparse and did not correspond to actual neuronal activity. ICA is then applied to the same data. Since only 6 mixtures are used, ICA can only extract 6 components per trial. The results for ICA are in the time domain, so they are converted to the time-frequency plane at the same frequency resolution of 32 frequency bins for the purposes of comparison.

Figure 7.1 and Figure 7.2 illustrate the results for one trial in the time-frequency domain. Similar results are obtained over all 80 trials. The sources from the proposed technique show in general less activity, i.e. more sparsity, on the time-frequency plane than the sources from ICA. It is, however, difficult to compare results on the single-trial level, since the underlying source generators are not actually known, and since a different number of components is extracted by each technique. It is also difficult because there are 80 individual trials. An attempt must be made to generalize the results.

7.2.3 Data Reduction

In order to evaluate the performance of ICA and the proposed UBSS method, the single-trial results are put together in their respective groups depending on stimulus type. K-means clustering is carried out over all extracted components from the subject, and the extracted cluster centers represent similar components across all trials. These components are then representative of the most prevalent sources extracted throughout the trials for each stimulus. Evaluation of these cluster centers is then carried out in an attempt to compare the general results of ICA to those of the proposed method.

The results of one trial are represented by the matrix S_v, which is of size k x P.
Each component in the time-frequency domain is first vectorized to form a vector of length P, which in our case is equal to 2112. These vectors are then put into a matrix, representing the k extracted components from trial v, each over P time-frequency points. For the data reduction of all results for a particular stimulus, the extracted matrices over all trials are appended to form a new matrix, S_u, such that

S_u = \begin{bmatrix} S^1 \\ S^2 \\ \vdots \\ S^{40} \end{bmatrix} = \begin{bmatrix} s_1^1(1) & \cdots & s_1^1(P) \\ \vdots & & \vdots \\ s_k^1(1) & \cdots & s_k^1(P) \\ \vdots & & \vdots \\ s_1^{40}(1) & \cdots & s_1^{40}(P) \\ \vdots & & \vdots \\ s_k^{40}(1) & \cdots & s_k^{40}(P) \end{bmatrix},   (7.1)

where u = \{1, 2\} represents the stimulus position number, and S_u is of size 40k x P. Each element, s_i^v(j), is one time-frequency point of source i from trial v.

K-means clustering is then carried out on each S_u, where each of its rows is grouped into one of K clusters based on its squared Euclidean distance to the cluster center, as described in Section 6.2.2. The clustering algorithm is run 10 times to avoid randomness in the final cluster results. We run K at 8, 12, and 16 to get an idea of how different numbers of components may affect the outcome.

The rows of S_u are then grouped by a hierarchical clustering method based on the results of the 10 K-means runs. A matrix, R, of zeros of size 40k x 40k is created, and each entry is updated iteratively. The entry r_{ij} represents how many times out of 10 row i of S_u was grouped into the same cluster as row j of S_u. This matrix then serves as a similarity measure: the more times two sources were grouped together by K-means, the more similar they are. The diagonal entries, r_{ii}, represent how many times each source was grouped with itself; these entries are ignored because they are all 10 and are meaningless.

A hierarchical clustering is then carried out using the similarity matrix, R, as its distance metric. In the first step, each row of S_u is in its own cluster. The second step groups together all rows with a similarity value of 10 in the matrix R. Next, all rows with a similarity of 9 are grouped. If a group already exists, then the average similarity between a row and all rows already in the cluster is used. The next step will then group together a cluster with another cluster or an individual row if it has the highest similarity value; if not, all rows with a similarity value of 8 are grouped together. This is repeated until the number of clusters is reduced to K. Cluster centers are then calculated as the mean of the time-frequency components in each cluster, and these are the components that categorize all single-trial EEG results. For example, a set of extracted components for K = 8 is shown in Figure 7.3 and Figure 7.4 for ICA and the proposed UBSS method, respectively. A sketch of the similarity-matrix construction is given below.
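The co-occurrence similarity matrix R described above is straightforward to build; the following sketch assumes the cluster labels of each of the 10 runs are available as integer arrays, and the function name is ours.

```python
import numpy as np

def cooccurrence_similarity(labels_per_run):
    # r_ij counts how many runs placed rows i and j of S_u in the
    # same cluster; the diagonal is zeroed since self-similarity
    # carries no information.
    labels_per_run = np.asarray(labels_per_run)   # shape (runs, n_rows)
    n = labels_per_run.shape[1]
    R = np.zeros((n, n), dtype=int)
    for labels in labels_per_run:
        R += (labels[:, None] == labels[None, :]).astype(int)
    np.fill_diagonal(R, 0)
    return R
```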
7.2.4 Performance Evaluation

The levels to which the extracted components are sparse, disjoint, and localized in the time-frequency domain all speak to how close they may be to an actual underlying source in the brain. The components obtained from the clustering method described in the previous section are evaluated based on these factors. Sparsity is measured using the l1-norm, disjointness using the total inner product between the components, and localization using a measure of entropy on the time-frequency plane.

Since a sparse component must have most of its values close to zero, the l1-norm is a good measure of how sparse a component is, and a smaller l1-norm means a sparser component. The extracted clusters are represented by the K x P matrix C_u, u = \{1, 2\}, each row of which represents one extracted component. Thus each component's sparsity is measured with

\sum_{m=1}^{P} |c_i^u(m)|,   (7.2)

where u refers to the stimulus position, i represents the component number between 1 and K, and P is the number of time-frequency points.

Disjointness between two components is measured using the inner product. A summation of all the pairwise inner products between the K components represents the total level of disjointness over all extracted components. This is computed as

\sum_{i \neq j} \sum_{m=1}^{P} |c_i^u(m) c_j^u(m)|.   (7.3)

Time-frequency localization of each component is computed using a measure of entropy, calculated as

-\sum_{m=1}^{P} |c_i^u(m)| \log_2 |c_i^u(m)|.   (7.4)

A smaller entropy corresponds to a more localized component.

The results calculated for these parameters are shown in Table 7.1, Table 7.2, and Table 7.3. They show that under the proposed UBSS approach, the extracted components are typically more sparse, localized, and disjoint than the extracted components under ICA. This means that under the proposed approach, the components are more likely to be a closer representation of a true source.

Finally, the extracted components from both methods are projected back to the electrodes to show the amount of variance of the original data explained by the sources. From Table 7.4, it is seen that in general slightly more of the variance is explained by the components extracted by ICA than by the presented UBSS method. This is because the presented UBSS method seeks to find the sparsest sources, while ICA seeks to find maximally independent sources. The components with less sparse representations (from ICA) project better back to the original measurements, but it is likely that they are linear sums of further reducible sources.

Table 7.1. Mean measure of l1-norm to show sparsity

Running      Position 1 (u=1)     Position 2 (u=2)
Conditions   UBSS      ICA        UBSS      ICA
K=8          23.03     29.06      21.92     27.63
K=12         22.36     28.31      22.54     28.15
K=16         23.29     28.18      22.43     27.27

Table 7.2. Mean measure of entropy to show time-frequency localization

Running      Position 1 (u=1)     Position 2 (u=2)
Conditions   UBSS      ICA        UBSS      ICA
K=8          9.80      10.29      9.73      10.25
K=12         9.79      10.25      9.80      10.26
K=16         9.85      10.24      9.81      10.20

Table 7.3. Measure of disjointness by correlation between components

Running      Position 1 (u=1)     Position 2 (u=2)
Conditions   UBSS      ICA        UBSS      ICA
K=8          3.12      3.63       3.84      4.35
K=12         8.92      9.38       8.30      8.90
K=16         14.05     14.51      15.87     16.35

Table 7.4. Average component projection to electrodes

Running      Position 1 (u=1)     Position 2 (u=2)
Conditions   UBSS      ICA        UBSS      ICA
K=8          0.026     0.039      0.029     0.045
K=12         0.058     0.091      0.065     0.103
K=16         0.129     0.217      0.147     0.246
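The three evaluation measures (7.2)-(7.4) translate directly into code; the small epsilon guarding the logarithm and the function names are our additions.

```python
import numpy as np

def sparsity_l1(c):
    # Equation (7.2): l1-norm of one component; smaller is sparser.
    return np.sum(np.abs(c))

def disjointness(C):
    # Equation (7.3): sum of pairwise absolute inner products between
    # the rows of the K x P component matrix C; smaller is more disjoint.
    K = C.shape[0]
    return sum(np.sum(np.abs(C[i] * C[j]))
               for i in range(K) for j in range(K) if i != j)

def tf_entropy(c, eps=1e-12):
    # Equation (7.4): entropy of |c|; smaller is more localized.
    a = np.abs(c)
    return -np.sum(a * np.log2(a + eps))
```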
7.3 Summary

This chapter discusses the application of the proposed UBSS approach of Chapter 6 to the study of ERPs using EEG measurements to help understand mental processes. Since EEG signals have been shown to be non-stationary, the proposed method is applied to ERP data using TFDs and is compared to the popular ICA algorithm applied to the same multiple trial ERP data set. Data reduction by clustering is performed over all single-trial results to extract components that represent the results. The components were consistently more sparse using the proposed UBSS technique than with ICA, suggesting that ICA tends to extract components that are sums of sources, which can help explain its higher correlation value to the original data. The UBSS technique provided components that are more localized in the time-frequency domain and more distinct from each other than those of ICA.

Figure 7.1. Single-trial results using 32 frequency bins: 6 extracted sources from ICA

Figure 7.2. Single-trial results using 32 frequency bins: 8 extracted sources from the proposed UBSS method

Figure 7.3. Results of component clustering over all single-trial results for stimulus position u = 1 using ICA

Figure 7.4. Results of component clustering over all single-trial results for stimulus position u = 1 using the proposed UBSS method

CHAPTER 8

CONCLUSIONS AND FUTURE WORK

In this dissertation, several problems regarding multichannel signal decomposition and source separation in the time-frequency domain are addressed. A new signal decomposition approach in the time-frequency domain is proposed based on the minimum entropy criterion. The major difference between the proposed approach and conventional component extraction or decomposition methods is the cost function: the cost function that is minimized is entropy on the time-frequency plane, producing compact components that are similar to Gabor logons. Using entropy as the cost function and adopting an adaptive filtering method to update the weights corresponding to each trial, we extract "minimum" entropy components orthogonal to each other. Experimental results show that the proposed approach is effective in determining a small number of components that can be used to represent a large set of data.

This method is further developed for the separation of non-stationary signals on the time-frequency plane. For the overdetermined case, the proposed algorithm assumes the disjointness of the underlying signals on the time-frequency plane. This assumption allows us to extract the sources through an N-dimensional Givens rotation. Using the Jensen-Rényi divergence as the cost function, a steepest descent algorithm is implemented to update the rotation angles. Several examples are given to illustrate the performance of the proposed algorithm for synthesized and real life signals. Issues regarding convergence rate and robustness under noise are investigated. The performance of the algorithm is illustrated under noise and is compared to PCA and ICA as adapted to the time-frequency plane, and to STFD.
The results illustrate that maximizing the divergence on the time-frequency plane can separate sources that are disjoint in the time-frequency domain, and that it outperforms the mutual information cost function used in ICA in terms of fidelity to the original sources. The proposed method also outperforms STFD at high noise levels, since it assumes the cross-terms between sources are negligible, which effectively denoises the observed time-frequency matrix; it is clearly superior to PCA.

In the next part of this dissertation, the BSS problem is extended to the underdetermined case, for which most BSS algorithms are not suitable. A two-stage sparse factorization approach is proposed for UBSS. The first stage is to determine the mixing matrix, which can be estimated using the K-means clustering algorithm: the column vectors of the mixing matrix are the cluster centers of the normalized mixture vectors. The second stage is to estimate the sources. For a given mixing matrix, although there exist an infinite number of solutions in general, the sparse solution with minimum l1-norm is proven to be unique, and it can be obtained using linear programming methods. The results show that if the sources are sufficiently sparse in the time-frequency domain, the proposed approach can separate them effectively from their mixtures. Compared to PCA and ICA, the proposed method does not require the assumption that the sources are orthogonal or mutually independent.

The final part of the work focuses on the application of the proposed UBSS algorithm to multichannel EEG/ERP recordings. Under the assumption that the sources are sparse in the time-frequency domain, all single-trial components are extracted, with the number of sources selected in advance. Then data reduction by clustering is performed over all single-trial results to extract components that represent the results. The performance of the proposed approach is compared with that of ICA. It is concluded that the proposed method is more effective than ICA at extracting neuronal sources that are well localized in time and frequency. These sources are shown to be sparser and more distinct from each other.

Future work includes determining the convergence rates of the proposed algorithms and investigating the effect of the order α of the information-theoretic cost functions on the performance of the proposed algorithms. One problem for K-means clustering is the arbitrary selection of how many sources to extract; this is still an open question. If the number of extracted sources is less than the number of actual sources, some of the actual sources cannot be obtained; on the other hand, if the number of separated components is more than the number of actual sources, some components belonging to the same source are treated as different sources. Thus, it would be more efficient to have the algorithm automatically select the number of sources. In addition, the requirement that the sources be approximately disjoint limits the algorithms; if this assumption could be relaxed, the results would be more reliable, since real sources may not be disjoint. Another area of future work is using signal synthesis methods to transform the extracted sources from the time-frequency domain back to the time domain.

APPENDICES

In this appendix, the derivation of the steepest descent algorithm for N = M = 3 in Chapter 5 will be given explicitly.
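Before the closed-form derivation, a minimal numerical sketch of the resulting optimization loop may help fix ideas. It is an illustration under stated assumptions, not the dissertation's implementation: it parameterizes the 3-D rotation by the three angles of equations (3)-(6) below, evaluates the total pairwise cost of equations (13)-(14) on the rotated time-frequency vectors, and replaces the analytic gradients of equations (10)-(18) with central finite differences; the step size mu, iteration count, and difference width h are illustrative choices.

```python
# Minimal sketch of the steepest descent loop derived in this appendix,
# with finite-difference gradients standing in for equations (10)-(18).
# X is assumed to be a 3 x P array of time-frequency vectors.
import numpy as np

def rotation(theta):
    # R(theta) = R1(theta1) R2(theta2) R3(theta3), equations (3)-(6).
    t1, t2, t3 = theta
    R1 = np.array([[ np.cos(t1), np.sin(t1), 0.0],
                   [-np.sin(t1), np.cos(t1), 0.0],
                   [0.0, 0.0, 1.0]])
    R2 = np.array([[ np.cos(t2), 0.0, np.sin(t2)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(t2), 0.0, np.cos(t2)]])
    R3 = np.array([[1.0, 0.0, 0.0],
                   [0.0,  np.cos(t3), np.sin(t3)],
                   [0.0, -np.sin(t3), np.cos(t3)]])
    return R1 @ R2 @ R3

def cost(theta, X):
    # J = J12 + J13 + J23, each J_ij a normalized correlation of the
    # rotated rows, equations (13)-(14); small J means nearly disjoint rows.
    Y = rotation(theta) @ X
    J = 0.0
    for i in range(3):
        for j in range(i + 1, 3):
            num = np.sum(Y[i] * Y[j])
            den = np.sqrt(np.sum(Y[i] ** 2) * np.sum(Y[j] ** 2))
            J += num / den
    return J

def steepest_descent(X, mu=0.05, iters=500, h=1e-6):
    theta = np.zeros(3)
    for _ in range(iters):
        grad = np.zeros(3)
        for l in range(3):
            # Central finite difference in place of equations (10)-(18).
            e = np.zeros(3)
            e[l] = h
            grad[l] = (cost(theta + e, X) - cost(theta - e, X)) / (2.0 * h)
        theta -= mu * grad    # descend on J to reduce pairwise overlap
    return theta, rotation(theta) @ X
```

Swapping the finite differences for the closed-form gradients derived below changes only the inner loop; the update rule itself is unchanged.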
In this case, the matrix of mixtures is

$$ \mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \mathbf{X}_3 \end{bmatrix} = \begin{bmatrix} X_1(1) & \cdots & X_1(P) \\ X_2(1) & \cdots & X_2(P) \\ X_3(1) & \cdots & X_3(P) \end{bmatrix}, \qquad (1) $$

and the matrix of rotated signals is

$$ \mathbf{Y} = \begin{bmatrix} \mathbf{Y}_1 \\ \mathbf{Y}_2 \\ \mathbf{Y}_3 \end{bmatrix} = \begin{bmatrix} Y_1(1) & \cdots & Y_1(P) \\ Y_2(1) & \cdots & Y_2(P) \\ Y_3(1) & \cdots & Y_3(P) \end{bmatrix}. \qquad (2) $$

We aim to find the optimal rotation transform R(θ) under the Jensen-Rényi divergence criterion to obtain the signals Y = R(θ)X that are disjoint on the time-frequency plane. From [82], we know that any 3-D rotation matrix can be written in the following form:

$$ R(\theta) = R_1(\theta_1)\, R_2(\theta_2)\, R_3(\theta_3), \qquad (3) $$

where

$$ R_1(\theta_1) = \begin{bmatrix} \cos\theta_1 & \sin\theta_1 & 0 \\ -\sin\theta_1 & \cos\theta_1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad (4) $$

$$ R_2(\theta_2) = \begin{bmatrix} \cos\theta_2 & 0 & \sin\theta_2 \\ 0 & 1 & 0 \\ -\sin\theta_2 & 0 & \cos\theta_2 \end{bmatrix}, \qquad (5) $$

$$ R_3(\theta_3) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_3 & \sin\theta_3 \\ 0 & -\sin\theta_3 & \cos\theta_3 \end{bmatrix}, \qquad (6) $$

and θ = [θ1, θ2, θ3]^T. Hence the entries of Y are

$$ Y_1(i) = (\sin\theta_1 \cos\theta_3 - \cos\theta_1 \sin\theta_2 \sin\theta_3)\, X_2(i) + (\sin\theta_1 \sin\theta_3 + \cos\theta_1 \sin\theta_2 \cos\theta_3)\, X_3(i) + \cos\theta_1 \cos\theta_2\, X_1(i), \qquad (7) $$

$$ Y_2(i) = (\cos\theta_1 \cos\theta_3 + \sin\theta_1 \sin\theta_2 \sin\theta_3)\, X_2(i) + (\cos\theta_1 \sin\theta_3 - \sin\theta_1 \sin\theta_2 \cos\theta_3)\, X_3(i) - \sin\theta_1 \cos\theta_2\, X_1(i), \qquad (8) $$

$$ Y_3(i) = -\cos\theta_2 \sin\theta_3\, X_2(i) + \cos\theta_2 \cos\theta_3\, X_3(i) - \sin\theta_2\, X_1(i), \qquad (9) $$

where i = 1, 2, ..., P, and P is the length of the time-frequency vector. The gradients of Y with respect to θ1 are derived as follows:

$$ \frac{\partial Y_1(i)}{\partial \theta_1} = (\cos\theta_1 \cos\theta_3 + \sin\theta_1 \sin\theta_2 \sin\theta_3)\, X_2(i) + (\cos\theta_1 \sin\theta_3 - \sin\theta_1 \sin\theta_2 \cos\theta_3)\, X_3(i) - \sin\theta_1 \cos\theta_2\, X_1(i), \qquad (10) $$

$$ \frac{\partial Y_2(i)}{\partial \theta_1} = (-\sin\theta_1 \cos\theta_3 + \cos\theta_1 \sin\theta_2 \sin\theta_3)\, X_2(i) - (\sin\theta_1 \sin\theta_3 + \cos\theta_1 \sin\theta_2 \cos\theta_3)\, X_3(i) - \cos\theta_1 \cos\theta_2\, X_1(i), \qquad (11) $$

$$ \frac{\partial Y_3(i)}{\partial \theta_1} = 0. \qquad (12) $$

Similarly, we can derive the gradients of Y with respect to θ2 and θ3, respectively. The cost function with the order α = 2 is

$$ J = \sum_{i=1}^{2} \sum_{j=i+1}^{3} J_{ij} = J_{12} + J_{13} + J_{23}, \qquad (13) $$

where

$$ J_{ij} = \frac{\sum_{k=1}^{P} Y_i(k)\, Y_j(k)}{\sqrt{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)}}, \qquad (i < j). \qquad (14) $$

The derivatives of the numerator and denominator of J_ij with respect to θ_l are given as

$$ \frac{\partial \left(\sum_{k=1}^{P} Y_i(k)\, Y_j(k)\right)}{\partial \theta_l} = \sum_{k=1}^{P} \left[ \frac{\partial Y_i(k)}{\partial \theta_l}\, Y_j(k) + Y_i(k)\, \frac{\partial Y_j(k)}{\partial \theta_l} \right] \qquad (15) $$

and

$$ \frac{\partial \sqrt{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)}}{\partial \theta_l} = \frac{1}{\sqrt{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)}} \left[ \left(\sum_{k=1}^{P} Y_i(k)\, \frac{\partial Y_i(k)}{\partial \theta_l}\right) \left(\sum_{k=1}^{P} Y_j^2(k)\right) + \left(\sum_{k=1}^{P} Y_i^2(k)\right) \left(\sum_{k=1}^{P} Y_j(k)\, \frac{\partial Y_j(k)}{\partial \theta_l}\right) \right], \qquad (16) $$

respectively, where l = 1, 2, 3. With equations (15) and (16), the quotient rule gives the gradient of the pairwise cost function J_ij with respect to θ_l as

$$ \frac{\partial J_{ij}}{\partial \theta_l} = \frac{\partial\left(\sum_{k=1}^{P} Y_i(k)\, Y_j(k)\right)/\partial \theta_l}{\sqrt{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)}} - \frac{\sum_{k=1}^{P} Y_i(k)\, Y_j(k)}{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)} \, \frac{\partial}{\partial \theta_l} \sqrt{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)}. \qquad (17) $$

By summing the pairwise gradients of J_ij, we obtain the gradient of the total cost function J with respect to any rotation angle θ_l:

$$ \frac{\partial J}{\partial \theta_l} = \sum_{i=1}^{2} \sum_{j=i+1}^{3} \frac{\partial J_{ij}}{\partial \theta_l} = \frac{\partial J_{12}}{\partial \theta_l} + \frac{\partial J_{13}}{\partial \theta_l} + \frac{\partial J_{23}}{\partial \theta_l}. \qquad (18) $$

This expression is used in updating the rotation angle in the steepest descent algorithm.

BIBLIOGRAPHY

[1] A.-J. van der Veen, S. Talwar, and A. Paulraj, "A subspace approach to blind space-time signal processing for wireless communication systems," IEEE Trans. on Signal Processing, vol. 45, pp. 173-190, 1997.

[2] C. B. Papadias and A. Paulraj, "A constant modulus algorithm for multiuser signal separation in presence of delay spread using antenna arrays," IEEE Signal Processing Letters, vol. 4, pp. 178-181, 1997.

[3] A. Rouxel, D. Le Guennec, and O. Macchi, "Unsupervised adaptive separation of impulse signals applied to EEG analysis," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2000, vol. 1, pp. 420-423.

[4] D. A. Peterson, J. N. Knight, M. J. Kirby, C. W. Anderson, and M. H. Thaut, "Feature selection and blind source separation in an EEG-based brain-computer interface," EURASIP Journal on Applied Signal Processing, vol. 19, pp.
3128-3140, 2005.

[5] S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. on Signal Processing, vol. 41, pp. 3397-3415, 1993.

[6] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Scientific Comp., vol. 20, pp. 33-61, 1999.

[7] R. G. Baraniuk and D. L. Jones, "Wigner-based formulation of the chirplet transform," IEEE Trans. on Signal Processing, vol. 44, pp. 3129-3135, 1996.

[8] B. D. V. Veen and K. M. Buckley, "Beamforming: A versatile approach to spatial filtering," IEEE ASSP Magazine, 1988.

[9] P. Comon, "Independent component analysis: A new concept?," Signal Processing, vol. 36, pp. 287-314, 1994.

[10] H. Yang and S. Amari, "Adaptive online learning algorithms for blind separation: Maximum entropy and minimum mutual information," Neural Computation, vol. 9, pp. 1457-1482, 1997.

[11] A. Cichocki and R. Unbehauen, "Robust neural networks with on-line learning for blind identification and blind separation of sources," IEEE Trans. on Circuits and Systems, vol. 43, no. 11, pp. 894-906, 1996.

[12] M. Girolami and C. Fyfe, "Negentropy and kurtosis as projection pursuit indices provide generalised ICA algorithms," in Advances in Neural Information Processing Systems Workshop, 1996.

[13] A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, pp. 1129-1159, 1995.

[14] B. A. Pearlmutter and L. C. Parra, "A context-sensitive generalization of ICA," in Int. Conf. Neural Information Processing, 1996.

[15] I. Magrin-Chagnolleau, G. Durou, and F. Bimbot, "Application of time-frequency principal component analysis to text-independent speaker identification," IEEE Trans. Speech and Audio Processing, vol. 10, no. 6, pp. 371-378, 2002.

[16] M. Gandetto, M. Guainazzo, and C. S. Regazzoni, "Use of time-frequency analysis and neural networks for mode identification in a wireless software-defined radio approach," EURASIP Journal on Applied Signal Processing, vol. 12, pp. 1778-1790, 2004.

[17] J. R. Berriman, D. A. Hutchins, A. Neild, T. H. Gan, and P. Purnell, "The application of time-frequency analysis to the air-coupled ultrasonic testing of concrete," IEEE Trans. Ultrasonics, Ferroelectrics, and Frequency Control, vol. 53, no. 4, pp. 768-776, 2006.

[18] I. J. Chant and L. M. Hastie, "Time-frequency analysis of magnetotelluric data," Geophysical Journal International, vol. 111, no. 2, pp. 399-413, 1992.

[19] M. Pieraccini, G. Luzi, L. Noferini, D. Mecatti, and C. Atzeni, "Joint time-frequency analysis for investigation of layered masonry structures using penetrating radar," IEEE Trans. Geoscience and Remote Sensing, vol. 42, no. 2, pp. 309-317, 2004.

[20] E. P. Wigner, "On the quantum correction for thermodynamic equilibrium," Physical Review, vol. 40, pp. 749-759, 1932.

[21] J. Ville, "Théorie et applications de la notion de signal analytique," Câbles et Transmission, vol. 2A, pp. 61-74, 1948.

[22] J. E. Moyal, "Quantum mechanics as a statistical theory," in Proc. Camb. Phil. Soc., 1949, vol. 45, pp. 99-124.

[23] L. Cohen, Time-Frequency Analysis, Prentice Hall, New Jersey, 1995.

[24] T. A. C. M. Claasen and W. F. G. Mecklenbrauker, "The Wigner distribution - a tool for time-frequency signal analysis - Part III: Relations with other time-frequency signal transformations," Philips Journal of Research, vol. 35, pp. 372-389, 1980.

[25] A. J. E. M.
Janssen, "On the locus and spread of pseudo-density functions in the time-frequency plane," Philips Journal of Research, vol. 37, pp. 79-110, 1982.

[26] J. Jeong and W. J. Williams, "Kernel design for reduced interference distributions," IEEE Trans. on Signal Processing, vol. 40, pp. 402-412, 1992.

[27] P. Loughlin, J. Pitton, and L. E. Atlas, "Bilinear time-frequency representations: new insights and properties," IEEE Trans. on Signal Processing, vol. 41, pp. 750-767, 1993.

[28] S. Oh and R. J. Marks II, "Some properties of the generalized time-frequency representation with cone shaped kernel," IEEE Trans. on Signal Processing, vol. 40, pp. 1735-1745, 1992.

[29] S. Oh, R. J. Marks II, L. E. Atlas, and J. A. Pitton, "Kernel synthesis for generalized time-frequency distributions using the method of projection onto convex sets," in SPIE-Advanced Signal Processing Algorithms, 1990, vol. 1348, pp. 197-207.

[30] P. Flandrin, "Some features of time-frequency representations of multicomponent signals," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 1984, pp. 41B.4.1-4.4.

[31] P. Flandrin and F. Hlawatsch, "Signal representations, geometry and catastrophes in the time-frequency plane," Mathematics in Signal Processing, pp. 3-14, 1987.

[32] F. Hlawatsch, "Interference terms in the Wigner distribution," Digital Signal Processing-84, pp. 363-367, 1984.

[33] H.-I. Choi and W. J. Williams, "Improved time-frequency representation of multicomponent signals using exponential kernels," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 37, no. 6, pp. 862-871, 1989.

[34] Y. Zhao, L. E. Atlas, and R. J. Marks, "The use of cone-shaped kernels for generalized time-frequency representations of nonstationary signals," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 38, pp. 1084-1091, 1990.

[35] J. Jeong and W. J. Williams, "A new formulation of generalized discrete-time time-frequency distributions," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 1991, pp. 3189-3192.

[36] G. S. Cunningham and W. J. Williams, "High-resolution signal synthesis for time-frequency distributions," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 1993, vol. 4, pp. 400-403.

[37] E. Gokcay and J. C. Principe, "New clustering evaluation function using Rényi's information potential," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2000, pp. 3490-3493.

[38] M. Ben-Bassat and J. Raviv, "Rényi entropy and the probability of error," IEEE Trans. on Info. Theory, vol. 24, no. 3, pp. 324-330, May 1978.

[39] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349-369, Dec. 1989.

[40] I. Csiszar, "Information-type distance measures and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299-318, 1967.

[41] R. G. Baraniuk, P. Flandrin, A. J. E. M. Janssen, and O. Michel, "Measuring time-frequency information content using the Rényi entropies," IEEE Trans. on Info. Theory, vol. 47, no. 4, pp. 1391-1409, May 2001.

[42] W. J. Williams, M. Brown, and A. Hero, "Uncertainty, information and time-frequency distributions," in SPIE-Advanced Signal Processing Algorithms, 1991, vol. 1556, pp. 144-156.

[43] A. Rényi, "On measures of entropy and information," in Proceedings 4th Berkeley Symp. Math. Stat. and Prob., 1961, vol. 1, pp. 547-561.

[44] K. Pearson, "On lines and planes of closest fit to systems of points in space," Philosophical Magazine, vol. 2, pp. 559-572, 1901.

[45] E.
Oja, Subspace Methods of Pattern Recognition, Research Studies Press, England, and Wiley, USA, 1983.

[46] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications, Wiley, 1996.

[47] M. Wax and T. Kailath, "Detection of signals by information-theoretic criteria," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 33, pp. 387-392, 1985.

[48] J. Karhunen, A. Cichocki, W. Kasprzak, and P. Pajunen, "On neural blind separation with noise suppression and redundancy reduction," Int. Journal of Neural Systems, vol. 8, no. 2, pp. 219-237, 1997.

[49] B. Ans, J. Herault, and C. Jutten, "Adaptive neural architectures: detection of primitives," in Proc. COGNITIVA '85, 1985, pp. 593-597.

[50] T. W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski, "Blind source separation of more sources than mixtures using overcomplete representations," IEEE Signal Processing Letters, vol. 6, no. 4, pp. 87-90, 1999.

[51] L. Zhukov, D. Weinstein, and C. Johnson, "Independent component analysis for EEG source localization," IEEE Eng. Med. Biol. Mag., vol. 19, pp. 87-96, 2000.

[52] E. Visser and T.-W. Lee, "Blind source separation in mobile environments using a priori knowledge," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2004, vol. 3, pp. 893-896.

[53] G. Gelle, M. Colas, and C. Serviere, "Blind source separation: A new pre-processing tool for rotating machines monitoring," IEEE Trans. Instrumentation and Measurement, vol. 52, no. 3, pp. 790-795, 2003.

[54] Z. He, L. Yang, J. Liu, Z. Lu, C. He, and Y. Shi, "Blind source separation using clustering-based multivariate density estimation algorithm," IEEE Trans. on Signal Processing, vol. 48, pp. 575-579, 2000.

[55] D.-T. Pham, "Blind separation of instantaneous mixture of sources based on order statistics," IEEE Trans. on Signal Processing, vol. 48, pp. 363-375, 2000.

[56] B. Torresani, "Wavelets associated with representations of the affine Weyl-Heisenberg group," Journal Math. Physics, vol. 32, pp. 1273-1279, 1991.

[57] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Proc. of the 27th Annual Asilomar Conference on Signals, Systems and Computers, 1993.

[58] A. Belouchrani and M. G. Amin, "Blind source separation based on time-frequency signal representations," IEEE Trans. on Signal Processing, vol. 46, pp. 2888-2897, 1998.

[59] A. Aissa-El-Bey, N. Linh-Trung, K. Abed-Meraim, A. Belouchrani, and Y. Grenier, "Underdetermined blind separation of nondisjoint sources in the time-frequency domain," IEEE Trans. on Signal Processing, vol. 55, pp. 897-907, 2007.

[60] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. on Signal Processing, vol. 52, pp. 1830-1847, 2004.

[61] S. Qian and D. Chen, "Discrete Gabor transform," IEEE Trans. on Signal Processing, vol. 41, pp. 2429-2438, 1993.

[62] D. Gabor, "Theory of communication," J. of IEE, vol. 93, no. 26, pp. 429-457, 1946.

[63] J. Wexler and S. Raz, "Discrete Gabor expansions," Signal Processing, vol. 21, no. 3, pp. 207-221, 1990.

[64] S. Mann and S. Haykin, "The chirplet transform: physical considerations," IEEE Trans. on Signal Processing, vol. 43, no. 11, pp. 2745-2761, 1995.

[65] R. G. Baraniuk, P. Flandrin, A. J. E. M. Janssen, and O. Michel, "Measuring time-frequency information content using the Rényi entropies," IEEE Trans. on Info.
Theory, vol. 47, no. 4, pp. 1391-1409, May 2001.

[66] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice Hall, 1985.

[67] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Trans. on Signal Processing, vol. 45, no. 2, pp. 434-444, 1997.

[68] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley and Sons, 2001.

[69] J. F. Cardoso, "Infomax and maximum likelihood for blind source separation," IEEE Signal Processing Letters, vol. 4, pp. 112-114, 1997.

[70] K. E. Hild II, D. Erdogmus, and J. C. Principe, "Blind source separation using Rényi's mutual information," IEEE Signal Processing Letters, vol. 8, pp. 174-176, 2001.

[71] Y. Q. Li, A. Cichocki, and S. Amari, "Sparse component analysis for blind source separation with less sensors than sources," in Proc. 4th Int. Symposium on Independent Component Analysis and Blind Signal Separation, 2003, pp. 89-94.

[72] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788-791, 1999.

[73] E. Weinstein, M. Feder, and A. Oppenheim, "Multi-channel signal separation by decorrelation," IEEE Trans. on Speech and Audio Processing, vol. 1, pp. 405-413, 1993.

[74] L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Trans. on Speech and Audio Processing, vol. 8, pp. 320-327, 2000.

[75] H. Sahlin and H. Broman, "MIMO signal separation for FIR channels: a criterion and performance analysis," IEEE Trans. on Signal Processing, vol. 48, pp. 642-649, 2000.

[76] C. T. Ma, Z. Ding, and S. F. Yau, "A two-stage algorithm for MIMO blind deconvolution of nonstationary colored signals," IEEE Trans. on Signal Processing, vol. 48, pp. 1187-1192, 2000.

[77] A. Belouchrani, K. Abed-Meraim, M. G. Amin, and A. M. Zoubir, "Blind separation of nonstationary sources," IEEE Signal Processing Letters, vol. 11, no. 7, 2004.

[78] M. G. Amin and Y. Zhang, "Signal averaging of time-frequency distributions for signal recovery in uniform linear arrays," IEEE Trans. on Signal Processing, vol. 48, no. 10, pp. 2892-2902, 2000.

[79] O. Michel, R. G. Baraniuk, and P. Flandrin, "Time-frequency based distance and divergence measure," in Proc. IEEE Int. Symp. Time-Frequency and Time-Scale Analysis, 1994, pp. 64-67.

[80] S. Aviyente, "Information processing on the time-frequency plane," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2004, vol. 2, pp. 617-620.

[81] L.-T. Nguyen, A. Belouchrani, K. Abed-Meraim, and B. Boashash, "Separating more sources than sensors using time-frequency distributions," in Proc. Int. Symposium on Signal Processing and its Applications, 2001, pp. 583-586.

[82] F. D. Murnaghan, The Unitary and Rotation Groups, Spartan Books, Washington DC, 1962.

[83] S. Haykin, Introduction to Adaptive Filters, MacMillan, New York, 1984.

[84] A. Hyvarinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Trans. on Neural Networks, vol. 10, no. 3, pp. 626-634, May 1999.

[85] P. J. Huber, "Projection pursuit," Ann. Statist., vol. 13, no. 2, pp. 435-475, 1985.

[86] I. F. Gorodnitsky and B. D. Rao, "Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm," IEEE Trans. on Signal Processing, vol. 45, pp. 600-616, 1997.

[87] D. L. Donoho and M. Elad, "Maximal sparsity representation via l1 minimization," in Proc. Nat. Acad. Sci., 2003, pp. 2197-2202.

[88] M.
Zibulevsky and B. A. Pearlmutter, "Blind source separation by sparse decomposition," Neural Computation, vol. 13, no. 4, pp. 863-882, 2001.

[89] W. Lu and J. C. Rajapakse, "Approach and applications of constrained ICA," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 203-212, 2005.

[90] L. Zhang, A. Cichocki, and S. Amari, "Self-adaptive blind source separation based on activation functions adaptation," IEEE Trans. Neural Netw., vol. 15, no. 2, pp. 233-244, 2004.

[91] S. A. Cruces-Alvarez, A. Cichocki, and S. Amari, "From blind signal extraction to blind instantaneous signal separation: Criteria, algorithms, and stability," IEEE Trans. Neural Netw., vol. 15, no. 4, pp. 859-873, 2004.

[92] M. Girolami, "A variational method for learning sparse and overcomplete representations," Neural Computation, vol. 13, no. 11, pp. 2517-2532, 2001.

[93] M. Zibulevsky, P. Kisilev, Y. Y. Zeevi, and B. A. Pearlmutter, "Blind source separation via multinode sparse representation," in Advances in Neural Information Processing Systems (NIPS), 2001.

[94] Y. Q. Li, A. Cichocki, and S. Amari, "Sparse representation and blind source separation," Neural Computation, vol. 16, no. 6, pp. 1193-1234, 2004.

[95] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley, New York, 2001.

[96] M. S. Lewicki and T. J. Sejnowski, "Learning overcomplete representations," Neural Computation, vol. 12, no. 2, pp. 337-365, 2000.

[97] K. Kiviluoto and E. Oja, "Independent component analysis for parallel financial time series," in Proc. ICONIP'98, 1998, vol. 2, pp. 895-898.

[98] R. Vigario, J. Sarela, V. Jousmaki, and E. Oja, "Independent component analysis in decomposition of auditory and somatosensory evoked fields," in Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'99), 1999, pp. 167-172.

[99] S. J. Luck, An Introduction to the Event-Related Potential Technique, The MIT Press, 2005.

[100] J. N. Demos, Getting Started with Neurofeedback, W. W. Norton & Company, 2005.

[101] O. Faugeras, F. Clement, R. Deriche, R. Keriven, T. Papadopoulo, and J. Roberts, "The inverse EEG and MEG problems: The adjoint state approach I: The continuous case," INRIA Research Report 3673, 1999.

[102] H. Shevrin, J. A. Bond, L. A. Brakel, R. K. Hertel, and W. J. Williams, Conscious and Unconscious Processes: Psychodynamic, Cognitive and Neurophysiological Convergences, New York: Guilford, 1996.

[103] H. Shevrin, W. J. Williams, R. E. Marshall, R. K. Hertel, J. A. Bond, and L. A. Brakel, "Event-related potential indicators of the dynamic unconscious," Consciousness Cogn., vol. 1, pp. 340-366, 1992.

[104] W. J. Williams, H. P. Zaveri, and J. C. Sackellares, "Time-frequency analysis of electrophysiology signals in epilepsy," IEEE Trans. Eng. Med. Biol., pp. 133-143, 1995.

[105] http://www.sccn.ucsd.edu/eeglab.