MULTICHANNEL SIGNAL DECOMPOSITION AND SEPARATION IN THE TIME-FREQUENCY DOMAIN

This is to certify that the dissertation entitled MULTICHANNEL SIGNAL DECOMPOSITION AND SEPARATION IN THE TIME-FREQUENCY DOMAIN, presented by Zeyong Shan, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering at Michigan State University.

2.1.1 The Spectrogram

The short-time Fourier transform (STFT) of a signal s(t) computed with an analysis window h(t) is

S_h(t, \omega) = \int e^{-j\omega\tau} s(\tau)\, h(\tau - t)\, d\tau, \qquad (2.1)

and the spectrogram is its squared magnitude:

P_{SP}(t, \omega) = |S_h(t, \omega)|^2. \qquad (2.2)

To obtain a good time resolution, a narrow window, h(t), in the time domain has to be picked. Similarly, to get a good frequency resolution, a narrow window, H(\omega), in the frequency domain has to be picked. Since h(t) and H(\omega) cannot both be made arbitrarily narrow, there is an inherent trade-off between time and frequency resolution in the spectrogram for a particular window. This is the reason why the spectrogram is not preferred for high resolution time-frequency analysis.

2.1.2 The Wigner Distribution

The Wigner distribution is defined mathematically in terms of the signal, s(t), as [20-22]

W(t, \omega) = \int s\!\left(t + \frac{\tau}{2}\right) s^*\!\left(t - \frac{\tau}{2}\right) e^{-j\tau\omega}\, d\tau. \qquad (2.3)

The Wigner distribution is said to be bilinear in the signal since the signal enters twice in its calculation. The Wigner distribution has many desired properties. For example, it satisfies the marginals requirement, and therefore preserves the energy. It is always real, even if the signal is complex. In addition, it is time and frequency shift invariant, and satisfies the finite support property in time and frequency. One of the major shortcomings of the Wigner distribution is the existence of negative energy terms. The Wigner distribution of multicomponent signals also exhibits the disturbing tendency of generating interference or cross-terms.

Despite these shortcomings, the Wigner distribution still shows some remarkable advantages over the spectrogram: its conditional averages are exactly the instantaneous frequency and the group delay, whereas the spectrogram fails to achieve this result no matter what window is chosen; moreover, the spectrogram often cannot provide the resolution required to distinguish the components in multicomponent signals as the Wigner distribution can. Thus, there is a need to develop more general distributions which preserve the advantages of the Wigner distribution and address most of its drawbacks. This leads to Cohen's class of generalized distributions.

2.1.3 Cohen's General Class of Time-Frequency Distributions

There is a considerable advantage to having a simple method to generate different time-frequency distributions. This allows one to pick and choose those with desirable properties. The most direct way is to generate the distributions from [23]:

C(t, \omega) = \iiint \phi(\theta, \tau)\, s\!\left(u + \frac{\tau}{2}\right) s^*\!\left(u - \frac{\tau}{2}\right) e^{j(\theta u - \theta t - \tau\omega)}\, du\, d\theta\, d\tau, \qquad (2.4)

where \phi(\theta, \tau) is a two-dimensional function called the kernel function, a term coined by Claasen and Mecklenbrauker [24], who, with Janssen [25], made many important contributions to the general understanding of the general class, particularly in the signal analysis context. The kernel function determines the distribution and its properties. For the Wigner distribution, the kernel function is one.
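To make the kernel formulation computationally concrete, the following Python sketch (illustrative, not code from the dissertation) evaluates the discrete pseudo-Wigner distribution, i.e. the member of Cohen's class (2.4) with kernel equal to one; the lag truncation, frequency grid, and test signal are assumptions made only for the example.

```python
import numpy as np

def wigner_distribution(s, n_freq=64):
    """Discrete pseudo-Wigner distribution of a complex signal s.

    Evaluates W[n, k] = sum_l s[n + l] conj(s[n - l]) e^{-j 4 pi (k / K) l}
    over the symmetric lags available at each time index n; this is the
    kernel-equals-one member of Cohen's class, cf. (2.3) and (2.4).
    """
    s = np.asarray(s, dtype=complex)
    N, K = len(s), n_freq
    freqs = np.arange(K) / K                    # normalized frequency grid
    W = np.zeros((N, K))
    for n in range(N):
        lmax = min(n, N - 1 - n)                # largest symmetric lag at time n
        lags = np.arange(-lmax, lmax + 1)
        prod = s[n + lags] * np.conj(s[n - lags])
        # DFT over the lag variable; the factor 2 accounts for the
        # effective lag of 2l between the two signal samples
        W[n] = np.real(np.exp(-2j * np.pi * 2.0 * np.outer(freqs, lags)) @ prod)
    return W

# Example: a Gaussian-windowed complex sinusoid (a Gabor logon)
t = np.arange(128)
logon = np.exp(-((t - 64) ** 2) / 72.0) * np.exp(1j * 0.4 * np.pi * t)
W = wigner_distribution(logon)
print(W.shape)   # the energy concentrates around the logon's time-frequency center
```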
There are three main reasons why the kernel idea is particularly useful for the study of time-frequency distributions. First of all, it is easy to generate distributions: one just chooses a kernel function. The second reason is that one can design distributions with certain properties by putting constraints on the kernel function. For example, for a distribution to satisfy the marginals

\int C(t, \omega)\, d\omega = |s(t)|^2, \qquad \int C(t, \omega)\, dt = |S(\omega)|^2, \qquad (2.5)

it has been shown that the kernel function must have the property

\phi(0, \tau) = \phi(\theta, 0) = 1. \qquad (2.6)

An extensive discussion of the properties of a distribution and the corresponding constraints on the kernel function can be found in [26-29]. The third reason is that when a new distribution is considered, its properties can readily be ascertained by examining its kernel. For example, if the kernel does not satisfy equation (2.6), then we know the distribution cannot satisfy the marginals.

2.1.4 Reduced Interference Distributions

It is known that both the spectrogram and the Wigner distribution are members of Cohen's class of distributions. Although the spectrogram has many useful properties, it often presents serious difficulties when used to analyze rapidly varying signals. If the analysis window is made short enough to capture rapid changes in the signal, it becomes impossible to resolve frequency components of the signal which are close in frequency. The Wigner distribution has been employed as an alternative to overcome this shortcoming. It provides a high resolution representation in time and frequency for a non-stationary signal such as a chirp. However, its energy distribution is non-positive and it often suffers from severe cross-terms between components in different time-frequency regions, potentially leading to confusion and misinterpretation. An excellent discussion on the geometry of interferences has been provided in [30-32].

Since the Wigner distribution sometimes gives artificial and undesirable values in the time-frequency domain, particularly when the signal is multicomponent, conditions on the kernel that minimize these spurious values in some sense are developed in [33-36]. These conditions require that the kernel \phi(\theta, \tau) decay as one moves away from the \theta and \tau axes. A way to describe this region is to observe that the product \theta\tau is large when we are away from either axis. Therefore, it is concluded that for cross-term minimization, \phi(\theta, \tau) should satisfy

\phi(\theta, \tau) \ll 1 \quad \text{for} \quad \theta\tau \gg 0. \qquad (2.7)

These kernels produce reduced interference distributions.
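As a concrete instance (the particular kernel is our illustrative choice; the text above does not single one out), the exponential kernel of Choi and Williams satisfies both the marginal constraint (2.6) and the decay condition (2.7):

```python
import numpy as np

def choi_williams_kernel(theta, tau, sigma=1.0):
    """Exponential kernel phi(theta, tau) = exp(-theta^2 tau^2 / sigma):
    equal to 1 on the theta and tau axes (so the marginals (2.5) survive)
    and small where |theta * tau| is large, which is condition (2.7).
    """
    return np.exp(-(theta ** 2) * (tau ** 2) / sigma)

theta, tau = np.meshgrid(np.linspace(-4, 4, 9), np.linspace(-4, 4, 9))
phi = choi_williams_kernel(theta, tau)
print(phi[4, :])    # tau = 0 row, i.e. along the theta axis: all ones
print(phi[0, 0])    # far from both axes: essentially zero
```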
2.2 Introduction to Information-Theoretic Measures

Using entropy based distance functionals is a well-known discrimination method in signal processing. These functionals are known as divergence measures and are applied directly on statistical models describing the signals. Measures of divergence between two probability distributions are used to associate, cluster, classify, compress, and restore signals, images and patterns in many applications [37,38]. Many different measures of divergence have been constructed and characterized [39,40].

Recent research in the application of information and entropy functionals on time-frequency distributions (TFDs) has proven the usefulness of distance measures for non-stationary signal analysis [41,42]. Entropy, when applied to a TFD, measures the number of components in a given signal, i.e. the complexity. Similarly, divergence measures computed between two time-frequency distributions can indicate the difference in complexity between the two signals. These measures could prove useful as time-frequency detection statistics in applications comparing reference and data distributions. In this section, we review some well-known information-theoretic distance measures for time-frequency distributions.

2.2.1 Entropy

Before introducing divergence measures, we first give a brief review of entropy, a basic concept in information theory. Entropy H, also called Shannon entropy, is defined for a discrete-valued random variable X as

H(X) = -\sum_i P(X = a_i) \log(P(X = a_i)), \qquad (2.8)

where P(\cdot) is the probability mass function of X, and the a_i are the possible values of X. Depending on the base of the logarithm, different units of entropy are obtained. Usually the logarithm with base 2 is used, in which case the unit is called a bit. According to the definition, the entropy of a random variable can be interpreted as the degree of information that the observation of the variable gives. The more "random", i.e., unpredictable and unstructured the variable is, the larger its entropy. Assume that the probabilities are all close to 0, except for one that is close to 1 (the probabilities must sum up to one). In that case, there is little randomness in the variable, since it almost always takes the same value, and this is reflected by a small entropy. On the other hand, if all the probabilities are equal, then they are relatively far from 0 and 1. This means that the entropy is large, which reflects the fact that the variable is really random; we cannot predict which value it takes.

The definition of entropy for a discrete-valued random variable can be generalized for a continuous-valued random variable, in which case it is often called differential entropy. The differential entropy H of a random variable x with density p_x(\cdot) is defined as

H(x) = -\int p_x(\xi) \log(p_x(\xi))\, d\xi. \qquad (2.9)

2.2.2 Rényi Entropy

Rényi entropy generalizes the Shannon entropy through an order parameter \alpha > 0. For a discrete-valued random variable X it is defined as

H_\alpha(X) = \frac{1}{1 - \alpha} \log\left( \sum_i P(X = a_i)^\alpha \right). \qquad (2.10)

As \alpha \to 1, H_\alpha converges to the Shannon entropy (2.8). For \alpha = 2, one obtains the collision entropy

H_2(X) = -\log\left( \sum_i P(X = a_i)^2 \right). \qquad (2.11)

As \alpha \to \infty, the limit exists as

H_\infty(X) = -\log\left( \sup_i P(X = a_i) \right). \qquad (2.14)

This is called min-entropy, because it is the smallest value of H_\alpha. The two latter cases are related by H_\infty \le H_2 \le 2 H_\infty, while on the other hand Shannon entropy can be arbitrarily high for a random variable X with fixed min-entropy.
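A quick numerical illustration of these definitions and of the stated relation H_\infty \le H_2 \le 2 H_\infty (the pmf below is an arbitrary example, not from the text):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (2.8) in bits of a discrete pmf p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def renyi_entropy(p, alpha):
    """Renyi entropy (2.10) of order alpha > 0, alpha != 1, in bits."""
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.125, 0.125])
print(shannon_entropy(p))          # 1.75 bits
print(renyi_entropy(p, 1.0001))    # ~1.75: Renyi -> Shannon as alpha -> 1
print(renyi_entropy(p, 2))         # collision entropy (2.11)
print(-np.log2(p.max()))           # min-entropy H_inf, eq. (2.14)
```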
2.2.3 Divergence Measures for Time-Frequency Distributions

The most general class of distance measures is known as Csiszar's f-divergence, which includes some well-known measures like the Hellinger distance, Kullback-Leibler divergence and Rényi divergence [40]. The divergence between two probability density functions p_1 and p_2 for this class of distance measures can be expressed as:

d(p_1, p_2) = g\left( E_1\left[ f\!\left( \frac{p_2}{p_1} \right) \right] \right), \qquad (2.15)

where f is a continuous convex function, g is an increasing function, and E_1 is the expectation operator with respect to p_1. The distance measures and their properties for time-frequency distributions are given below.

2.2.3.1 Kullback-Leibler Divergence

The most common distance measure used for probability distributions is the Kullback-Leibler divergence measure. This measure can be adapted to time-frequency distributions as follows:

K(C_1, C_2) = \iint C_1(t, f) \log\frac{C_1(t, f)}{C_2(t, f)}\, dt\, df, \qquad (2.16)

where C_1, C_2 represent two different normalized time-frequency distributions defined in equation (2.4). This measure belongs to the class of Csiszar's f-divergence with f(x) = -\log x and g(x) = x. We have 0 \le K(C_1, C_2) \le \infty; the first equality holds if and only if C_1 = C_2, and the second equality holds if and only if \mathrm{Supp}\, C_1 \cap \mathrm{Supp}\, C_2 = \emptyset. This is not a symmetric distance measure, but it can easily be symmetrized by taking the average of K(C_1, C_2) and K(C_2, C_1). The main disadvantage of this measure is that it can only be applied to positive TFDs.

2.2.3.2 Rényi Divergence

Rényi divergence is a generalized formulation of Kullback-Leibler divergence and can be expressed as:

D_\alpha(C_1, C_2) = \frac{1}{\alpha - 1} \log \iint C_1^\alpha(t, f)\, C_2^{1-\alpha}(t, f)\, dt\, df, \qquad (2.17)

where \alpha \in [0, 1] is the order of the Rényi divergence. This measure converges to the Kullback-Leibler distance as \alpha \to 1. It is also a member of Csiszar's f-divergence with f(x) = x^{1-\alpha} and g(x) = \frac{1}{\alpha - 1}\log(x). We have 0 \le D_\alpha(C_1, C_2) \le \infty; the first equality holds if and only if C_1 = C_2, and the second equality holds if and only if \mathrm{Supp}\, C_1 \cap \mathrm{Supp}\, C_2 = \emptyset.

2.2.3.3 Jensen-Shannon Divergence

One common approach for constructing divergence measures is to apply the Jensen inequality to the entropy functional. For time-frequency distributions, the Jensen-Shannon divergence can be defined as:

J(C_1, C_2) = H\left( \frac{C_1 + C_2}{2} \right) - \frac{H(C_1) + H(C_2)}{2}. \qquad (2.18)

This distance measure is always positive since

H\left( \frac{C_1 + C_2}{2} \right) \ge \frac{H(C_1) + H(C_2)}{2} \qquad (2.19)

by the concavity of H. It is equal to zero when C_1 = C_2 and is a symmetric divergence measure. Unlike the Kullback-Leibler divergence, the Jensen-Shannon distance does not diverge when the two distributions are disjoint.

2.2.3.4 Jensen-Rényi Divergence

The Rényi entropy is derived from the same set of axioms as the Shannon entropy, the only difference being the employment of a more general exponential mean instead of the arithmetic mean in the derivation. This realization inspires the modification of the Jensen-Shannon divergence from an arithmetic to a geometric mean, and the following quantity is obtained for two positive TFDs C_1 and C_2:

JR_\alpha(C_1, C_2) = H_\alpha\!\left( \sqrt{C_1 C_2} \right) - \frac{H_\alpha(C_1) + H_\alpha(C_2)}{2}, \qquad (2.20)

where \sqrt{C_1 C_2}(t, f) = \sqrt{C_1(t, f)\, C_2(t, f)}. This quantity is obviously null when C_1 = C_2. Its positivity can be proven using the Cauchy-Schwartz inequality.
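The following sketch evaluates discrete analogues of (2.16)-(2.18) on two small pmfs, illustrating the asymmetry of Kullback-Leibler and the symmetry of Jensen-Shannon (the distributions are illustrative values, not from the text):

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence, discrete analogue of (2.16), in bits."""
    m = (p > 0) & (q > 0)
    return np.sum(p[m] * np.log2(p[m] / q[m]))

def renyi_divergence(p, q, alpha):
    """Renyi divergence (2.17) of order alpha in (0, 1)."""
    return np.log2(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (2.18); bounded, symmetric, zero iff p = q."""
    def h(r):
        r = r[r > 0]
        return -np.sum(r * np.log2(r))
    return h((p + q) / 2) - (h(p) + h(q)) / 2

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
print(kl_divergence(p, q), kl_divergence(q, p))    # asymmetric
print(renyi_divergence(p, q, 0.5))                 # nonnegative
print(jensen_shannon(p, q), jensen_shannon(q, p))  # symmetric
```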
CHAPTER 3
REVIEW OF SIGNAL DECOMPOSITION AND SOURCE SEPARATION METHODS

Blind signal processing is one of the important topics in the fields of neural computation, advanced statistics, and signal processing, with solid theoretical foundations and many potential applications. In this chapter, we review the basic approaches and techniques for signal decomposition and source separation, especially principal component analysis, independent component analysis, and several time-frequency based methods.

3.1 Principal Component Analysis

Principal component analysis (PCA) is a classic technique in statistical data analysis, feature extraction, and data compression, stemming from the early work of Pearson [44]. Given a set of multivariate measurements, the purpose of PCA is to find a smaller set of variables with less redundancy that would give as good a representation as possible. The redundancy is measured by correlations between data elements. Using the correlations as in PCA has the advantage that the analysis can be based on second-order statistics only.

3.1.1 Principal Components

The starting point of PCA is an n-dimensional random vector x. There is an available sample x(1), ..., x(T) from this random vector. No explicit assumptions on the probability density of the vectors are made in PCA, as long as the first- and second-order statistics are known or can be estimated from the sample. Also, no generative model is assumed for the vector x. Typically the elements of x are measurements like pixel gray levels or values of a signal at different time instants. It is essential in PCA that the elements are mutually correlated, so that there is some redundancy in x, making compression possible. If the elements are independent, the resulting components are exactly the same as the original signal measurements.

In the PCA transform, the vector x is first centered by subtracting its mean E\{x\}. The mean is in practice estimated from the available sample. Let us assume in the following that the centering has been done and thus E\{x\} = 0. Next, x is linearly transformed to another vector y with m elements, m < n, so that the redundancy induced by the correlations is removed. This is done by finding a rotated orthogonal coordinate system such that the elements of x in the new coordinates become uncorrelated. At the same time, the variances of the projections of x on the new coordinate axes are maximized so that the first axis corresponds to the maximal variance, the second axis corresponds to the maximal variance in the direction orthogonal to the first axis, and so on.

3.1.2 PCA By Variance Maximization

In mathematical terms, consider a linear combination

y_1 = \sum_{k=1}^{n} w_{k1} x_k = \mathbf{w}_1^T \mathbf{x} \qquad (3.1)

of the elements x_1, ..., x_n of the vector x. The w_{11}, ..., w_{n1} are scalar coefficients or weights, elements of an n-dimensional vector \mathbf{w}_1, and \mathbf{w}_1^T denotes the transpose of \mathbf{w}_1. The factor y_1 is called the first principal component of x if the variance of y_1 is maximally large. Because the variance depends on both the norm and orientation of the weight vector \mathbf{w}_1 and grows without limit as the norm grows, we impose the constraint that the norm of \mathbf{w}_1 is constant, in practice equal to 1. Thus we look for a weight vector \mathbf{w}_1 maximizing the PCA criterion

J_1^{PCA}(\mathbf{w}_1) = E\{y_1^2\} = E\{(\mathbf{w}_1^T \mathbf{x})^2\} = \mathbf{w}_1^T E\{\mathbf{x}\mathbf{x}^T\} \mathbf{w}_1 = \mathbf{w}_1^T \mathbf{C_x} \mathbf{w}_1 \quad \text{such that} \quad \|\mathbf{w}_1\| = 1, \qquad (3.2)

where the norm of \mathbf{w}_1 is the usual Euclidean norm defined as

\|\mathbf{w}_1\| = (\mathbf{w}_1^T \mathbf{w}_1)^{1/2} = \left( \sum_{k=1}^{n} w_{k1}^2 \right)^{1/2}, \qquad (3.3)

and the matrix \mathbf{C_x} = E\{\mathbf{x}\mathbf{x}^T\} is the n x n covariance matrix of the zero-mean vector x.

It is well known from basic linear algebra [45,46] that the solution to the PCA problem is given in terms of the unit-length eigenvectors e_1, ..., e_n of the matrix \mathbf{C_x}. The ordering of the eigenvectors is such that the corresponding eigenvalues d_1, ..., d_n satisfy d_1 \ge d_2 \ge \cdots \ge d_n. The solution maximizing equation (3.2) is given by \mathbf{w}_1 = \mathbf{e}_1. Thus the first principal component of x is

y_1 = \mathbf{e}_1^T \mathbf{x}. \qquad (3.4)

The criterion J_1^{PCA} in equation (3.2) can be generalized to m principal components, with m any number between 1 and n. Denoting the m-th (1 \le m \le n) principal component by y_m = \mathbf{w}_m^T \mathbf{x}, with \mathbf{w}_m the corresponding unit norm weight vector, the variance of y_m is now maximized under the constraint that y_m is uncorrelated with all the previously found principal components:

E\{y_m y_k\} = 0, \quad k < m. \qquad (3.5)

Note that the principal components y_m have zero means because

E\{y_m\} = \mathbf{w}_m^T E\{\mathbf{x}\} = 0. \qquad (3.6)

The condition (3.5) yields:

E\{y_m y_k\} = E\{(\mathbf{w}_m^T \mathbf{x})(\mathbf{w}_k^T \mathbf{x})\} = \mathbf{w}_m^T \mathbf{C_x} \mathbf{w}_k = 0. \qquad (3.7)

For the second principal component, we have the condition that

\mathbf{w}_2^T \mathbf{C_x} \mathbf{w}_1 = d_1 \mathbf{w}_2^T \mathbf{e}_1 = 0, \qquad (3.8)

because we already know that \mathbf{w}_1 = \mathbf{e}_1. We are thus looking for maximal variance E\{y_2^2\} = E\{(\mathbf{w}_2^T \mathbf{x})^2\} in the subspace orthogonal to the first eigenvector of \mathbf{C_x}. The solution is given by \mathbf{w}_2 = \mathbf{e}_2. Likewise, it recursively follows that \mathbf{w}_k = \mathbf{e}_k. Thus, the k-th principal component is

y_k = \mathbf{e}_k^T \mathbf{x}. \qquad (3.9)

From the above result, it follows that

E\{y_m^2\} = E\{\mathbf{e}_m^T \mathbf{x}\mathbf{x}^T \mathbf{e}_m\} = \mathbf{e}_m^T \mathbf{C_x} \mathbf{e}_m = d_m, \qquad (3.10)

which shows that the variances of the principal components are directly given by the eigenvalues of \mathbf{C_x}. The vectors x in the original data set can be approximated by the truncated PCA expansion

\hat{\mathbf{x}} = \sum_{i=1}^{m} y_i \mathbf{e}_i. \qquad (3.11)

Then the mean-squared error E\{\| \mathbf{x} - \hat{\mathbf{x}} \|^2\} is equal to \sum_{i=m+1}^{n} d_i. As the eigenvalues are all positive, the error decreases as more and more terms are included in equation (3.11), until the error becomes zero when m = n, i.e., all the principal components are included.
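A minimal numerical check of this development: eigendecomposition of the sample covariance, component variances equal to the eigenvalues per (3.10), and truncation error of (3.11) equal to the sum of the discarded eigenvalues. The synthetic data and dimensions are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data: T samples of an n-dimensional vector x
T, n = 1000, 5
mixing = rng.normal(size=(n, 3))
x = rng.normal(size=(T, 3)) @ mixing.T + 0.1 * rng.normal(size=(T, n))

x = x - x.mean(axis=0)                    # center so that E{x} = 0
Cx = x.T @ x / T                          # sample covariance matrix
d, E = np.linalg.eigh(Cx)                 # eigenvalues in ascending order
d, E = d[::-1], E[:, ::-1]                # reorder so d1 >= d2 >= ...

y = x @ E                                 # principal components y_k = e_k^T x
print(np.allclose(np.var(y, axis=0), d, rtol=1e-2))   # variances = eigenvalues

# Truncated expansion (3.11): reconstruction from the first m components
m = 3
x_hat = y[:, :m] @ E[:, :m].T
mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
print(np.isclose(mse, d[m:].sum(), rtol=1e-2))   # MSE = sum of discarded eigenvalues
```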
A very important practical problem is how to choose m in equation (3.11); this is a trade-off between error and the amount of data needed for the expansion. Sometimes a rather small number of principal components is sufficient. Disciplined approaches to this problem are given in [47,48].

3.2 Independent Component Analysis

Independent component analysis (ICA), introduced by J. Hérault, C. Jutten, and B. Ans [49] in the early 1980s, is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals. ICA defines a generative model for the observed multivariate data, which is typically given as a large database of samples. In the model, the data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed nongaussian and mutually independent, and they are called the independent components of the observed data. These independent components, also called sources or factors, can be found by ICA.

ICA is a much more powerful technique, capable of finding the underlying factors or sources when classic methods like PCA fail completely. The data analyzed by ICA could originate from many different kinds of application fields, including digital images and document databases, as well as economic indicators and psychometric measurements. In many cases, the measurements are given as a set of parallel signals or time series; the term blind source separation is used to characterize this problem. Typical examples are mixtures of simultaneous speech signals that have been picked up by several microphones [50], brain waves recorded by multiple sensors [51], interfering radio signals arriving at a mobile phone [52], or parallel time series obtained from some industrial process [53].

3.2.1 Definition of ICA

There are n observed random variables z_1, ..., z_n, which are modeled as linear combinations of n random variables s_1, ..., s_n:

z_i = a_{i1} s_1 + a_{i2} s_2 + \cdots + a_{in} s_n, \quad \text{for all } i = 1, \ldots, n, \qquad (3.12)

where the s_i are unknown and statistically mutually independent, and the a_{ij}, j = 1, \ldots, n, are some unknown real coefficients. This is the basic ICA model. All that is observed are the random variables z_i, and both the mixing coefficients a_{ij} and the independent components s_i must be estimated using the z_i.

It is usually more convenient to use vector-matrix notation instead of the sums as in the previous equation. Let us denote by z the random vector whose elements are the mixtures z_1, ..., z_n, and likewise by s the random vector whose elements are the source signals s_1, ..., s_n. Let us denote by A the matrix with elements a_{ij}. All vectors are assumed to be column vectors. Using this vector-matrix notation, the mixing model is written as

\mathbf{z} = \mathbf{A}\mathbf{s}. \qquad (3.13)

Sometimes the columns of matrix A, denoted by \mathbf{a}_i, are needed, and the model can also be written as

\mathbf{z} = \sum_{i=1}^{n} \mathbf{a}_i s_i. \qquad (3.14)
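A toy instantiation of the model (3.13), also previewing the scaling indeterminacy discussed next (the sources, mixing matrix, and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent, non-Gaussian (uniform) sources s and a mixing matrix A
T = 5000
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))   # unit-variance sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
z = A @ s                                               # observed mixtures, eq. (3.13)

# The mixtures are correlated even though the sources are not
print(np.corrcoef(s)[0, 1])   # ~0
print(np.corrcoef(z)[0, 1])   # clearly nonzero

# Scaling indeterminacy: rescaling a source and dividing the matching
# column of A by the same factor leaves the observations unchanged
alpha = 2.0
A2 = A.copy(); A2[:, 0] /= alpha
s2 = s.copy(); s2[0] *= alpha
print(np.allclose(z, A2 @ s2))   # True
```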
Compared with PCA, it is easy to see that in the ICA model the following ambiguities or indeterminacies hold:

1. The variances of the independent components cannot be determined. The reason is that, both s and A being unknown, any scalar multiplier in one of the sources s_i could always be cancelled by dividing the corresponding column \mathbf{a}_i of A by the same scalar, say \alpha_i:

\mathbf{z} = \sum_i \left( \frac{1}{\alpha_i} \mathbf{a}_i \right)(\alpha_i s_i). \qquad (3.15)

As a consequence, we may as well fix the magnitudes of the independent components. Since they are random variables, the most natural way to do this is to assume that each source has unit variance, E\{s_i^2\} = 1. Then the matrix A will be adapted in the ICA solution methods to take this restriction into account.

2. The order of the independent components cannot be determined. The reason is that, again both s and A being unknown, the order of the terms in the sum in equation (3.14) can be freely changed, and any of the independent components can be called the first one. Formally, a permutation matrix P and its inverse can be substituted in the model to give \mathbf{z} = \mathbf{A}\mathbf{P}^{-1}\mathbf{P}\mathbf{s}. The elements of Ps are the original independent variables s_i, but in another order. The matrix \mathbf{A}\mathbf{P}^{-1} is just a new unknown mixing matrix, to be solved by the ICA algorithms.

3.2.2 ICA by Maximum Likelihood Estimation

A very popular approach for estimating the ICA model is maximum likelihood (ML) estimation. Maximum likelihood estimation is a fundamental method of statistical estimation. One interpretation of ML estimation is that those parameter values which give the highest probability for the observations are taken as estimates. According to the properties of the density of a linear transform, the density p_z of the mixture vector \mathbf{z} = \mathbf{A}\mathbf{s} can be formulated as

p_z(\mathbf{z}) = |\det \mathbf{B}|\, p_s(\mathbf{s}) = |\det \mathbf{B}| \prod_i p_i(s_i), \qquad (3.16)

where \mathbf{B} = \mathbf{A}^{-1}, and the p_i denote the densities of the independent components. This can be expressed as a function of \mathbf{B} = (\mathbf{b}_1, \ldots, \mathbf{b}_n)^T and z, giving

p_z(\mathbf{z}) = |\det \mathbf{B}| \prod_i p_i(\mathbf{b}_i^T \mathbf{z}). \qquad (3.17)

Assume that we have K observations of z, denoted by z(1), ..., z(K). Then the likelihood can be obtained as the product of this density evaluated at the K points. This is denoted by L and considered as a function of B:

L(\mathbf{B}) = \prod_{t=1}^{K} \prod_{i=1}^{n} p_i(\mathbf{b}_i^T \mathbf{z}(t))\, |\det \mathbf{B}|. \qquad (3.18)

Very often it is more practical to use the logarithm of the likelihood, since it is algebraically simpler. This makes no difference here, since the maximum of the logarithm is obtained at the same point as the maximum of the likelihood. The log-likelihood is given by

\log L(\mathbf{B}) = \sum_{t=1}^{K} \sum_{i=1}^{n} \log p_i(\mathbf{b}_i^T \mathbf{z}(t)) + K \log|\det \mathbf{B}|. \qquad (3.19)

Dividing the log-likelihood by K, we obtain

\frac{1}{K} \log L(\mathbf{B}) = E\left\{ \sum_{i=1}^{n} \log p_i(\mathbf{b}_i^T \mathbf{z}) \right\} + \log|\det \mathbf{B}|. \qquad (3.20)

To perform ML estimation in practice, an algorithm is needed to carry out the numerical maximization of the likelihood. In fact, there are many different methods, among which the simplest algorithms for maximizing the likelihood are obtained by gradient methods [13].
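The normalized log-likelihood (3.20) is straightforward to evaluate once a source density is fixed. The sketch below assumes the super-Gaussian density p(s) = 1/(\pi \cosh s), an illustrative choice not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(2)

def ica_log_likelihood(B, z):
    """Normalized ICA log-likelihood (3.20) for a demixing matrix B.

    Assumes the super-Gaussian source density p(s) = 1/(pi cosh(s)),
    so log p(s) = -log(pi) - log(cosh(s)).
    """
    K = z.shape[1]
    s = B @ z
    log_ps = -np.log(np.pi) - np.log(np.cosh(s))
    return log_ps.sum() / K + np.log(abs(np.linalg.det(B)))

# Super-Gaussian (Laplacian) sources mixed by A
T = 10000
s = rng.laplace(size=(2, T))
A = np.array([[1.0, 0.5], [0.3, 1.0]])
z = A @ s

print(ica_log_likelihood(np.linalg.inv(A), z))  # near the true demixing matrix
print(ica_log_likelihood(np.eye(2), z))         # typically lower for a mismatched B
```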
3.2.3 ICA by Minimization of Mutual Information

An important approach for ICA estimation, inspired by information theory, is minimization of mutual information. The motivation of this approach is that in many cases it may not be very realistic to assume that the data follows the ICA model. Therefore, an approach that does not assume anything about the data needs to be developed. The goal is to have a general-purpose measure of the dependence of the components of a random vector. With such a measure, ICA could be defined as a linear decomposition that minimizes that dependence measure. Such an approach can be developed using mutual information, which is an information-theoretic measure of statistical dependence. One of the main utilities of mutual information is that it serves as a unifying framework for many estimation principles, in particular ML estimation.

Mutual information I between n random variables y_i, i = 1, \ldots, n, is defined as follows:

I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(\mathbf{y}), \qquad (3.21)

where H(y_i) and H(\mathbf{y}) are the entropy of y_i and the joint entropy, respectively. Mutual information is a natural measure of the dependence between random variables. It is always nonnegative, and zero if and only if the variables are statistically independent. Mutual information takes into account the whole dependence structure of the variables, and not just the covariance, like PCA and related methods. Therefore, mutual information can be used as the criterion for finding the ICA representation. This approach is an alternative to the model estimation approach. The ICA of a random vector z is defined as an invertible transformation

\mathbf{s} = \mathbf{B}\mathbf{z}, \qquad (3.22)

where the matrix B is determined so that the mutual information of the transformed components s_i is minimized. If the data follows the ICA model, this allows estimation of the data model. On the other hand, in this definition, it is not necessary to assume that the data follows the model. In any case, minimization of mutual information can be interpreted as giving the maximally independent components.

Mutual information and likelihood are intimately connected. A detailed analysis of the connection between mutual information and maximum likelihood can be found in [10]. The same gradient algorithm can be used to optimize mutual information due to its connection with likelihood. In addition, a nonparametric algorithm for minimization of mutual information is proposed in [54], and an approach based on order statistics is proposed in [55].
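For discrete variables, (3.21) can be computed directly from a joint pmf; a minimal sketch illustrating that mutual information vanishes exactly under independence (the pmfs are arbitrary examples):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (3.21): I(y1, y2) = H(y1) + H(y2) - H(y1, y2).

    `joint` is a 2-D array of joint probabilities P(y1 = i, y2 = j).
    """
    def h(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    return h(joint.sum(axis=1)) + h(joint.sum(axis=0)) - h(joint.ravel())

# Independent variables: the joint pmf is the outer product of the marginals
p1, p2 = np.array([0.5, 0.5]), np.array([0.25, 0.75])
print(mutual_information(np.outer(p1, p2)))   # 0: statistical independence

# Fully dependent variables (y2 = y1)
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))   # 1 bit
```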
3.3 Review of Time-Frequency Signal Decomposition and Separation Approaches

The most common methods for component extraction, including PCA and ICA, are effective at extracting orthogonal or independent components and assume the stationarity of the underlying signals. Since most real life signals are not stationary and thus do not obey this underlying assumption, recent research has focused on source/component extraction in the joint time-frequency domain. In this section, we review signal decomposition and source separation approaches based on time-frequency distributions.

3.3.1 Matching Pursuit with Time-Frequency Dictionaries

Matching pursuit [5] is a method to decompose a signal into a linear expansion of waveforms which belong to a redundant dictionary of functions, and whose time-frequency properties are adapted to the local structures of the signal. These waveforms are called time-frequency atoms. This algorithm offers a decomposition particularly important for representing signal components whose localizations in time and frequency vary widely.

A general family of time-frequency atoms can be generated by scaling, translating and modulating a single window function g(t) \in L^2(\mathbb{R}), and is defined as

g_\gamma(t) = \frac{1}{\sqrt{s}}\, g\!\left( \frac{t - u}{s} \right) e^{j\xi t}, \qquad (3.23)

where s > 0, u, and \xi are the scale, translation, and frequency modulation parameters, respectively, and \gamma = (s, u, \xi) \in \Gamma = \mathbb{R}^+ \times \mathbb{R}^2. The factor 1/\sqrt{s} normalizes the norm of g_\gamma(t) to 1. The family D = (g_\gamma(t))_{\gamma \in \Gamma} is extremely redundant, and its properties have been studied in [56].

A linear expansion of a signal f(t) over a set of vectors selected from D can be done by successive approximations of f(t) with orthogonal projections on elements of D, in order to best match its inner structures. Let g_{\gamma_0} \in D. The signal f is decomposed into

f = \langle f, g_{\gamma_0} \rangle g_{\gamma_0} + R^1 f, \qquad (3.24)

where \langle \cdot, \cdot \rangle represents the inner product of two functions, and R^1 f is the residue after approximating f in the direction of g_{\gamma_0}. Clearly, g_{\gamma_0} is orthogonal to R^1 f, hence

\|f\|^2 = |\langle f, g_{\gamma_0} \rangle|^2 + \|R^1 f\|^2. \qquad (3.25)

To minimize \|R^1 f\|, g_{\gamma_0} is chosen from D such that |\langle f, g_{\gamma_0} \rangle| is maximum. After m iterations, the signal f is decomposed into

f = \sum_{n=0}^{m-1} \langle R^n f, g_{\gamma_n} \rangle g_{\gamma_n} + R^m f, \qquad (3.26)

and an energy conservation equation follows:

\|f\|^2 = \sum_{n=0}^{m-1} |\langle R^n f, g_{\gamma_n} \rangle|^2 + \|R^m f\|^2. \qquad (3.27)

It is proven that the matching pursuit algorithm converges with respect to the iteration number m, so as m \to \infty,

f = \sum_{n=0}^{\infty} \langle R^n f, g_{\gamma_n} \rangle g_{\gamma_n} \qquad (3.28)

and

\|f\|^2 = \sum_{n=0}^{\infty} |\langle R^n f, g_{\gamma_n} \rangle|^2. \qquad (3.29)

It is thus shown that any signal f(t) \in L^2(\mathbb{R}) can be decomposed by matching pursuit into a sum of complex time-frequency atoms that best match its residues.

Although matching pursuit gives a flexible signal decomposition, a problem with this method is the restricted number of waveforms in the dictionary. While dictionaries containing a wide variety of elements can be employed at the expense of high computational cost, the representations are not satisfactory unless all signal components are at least reasonably well approximated by dictionary elements.
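A compact sketch of the pursuit (3.24)-(3.26) over a small discrete Gabor dictionary (the dictionary grid, atom parameters, and test signal are assumptions made only for illustration):

```python
import numpy as np

def gabor_atom(N, u, xi, s):
    """Unit-norm Gaussian atom g((t - u)/s) e^{j xi t} on N samples, cf. (3.23)."""
    t = np.arange(N)
    g = np.exp(-0.5 * ((t - u) / s) ** 2) * np.exp(1j * xi * t)
    return g / np.linalg.norm(g)

def matching_pursuit(f, atoms, n_iter):
    """Greedy decomposition (3.26): pick the atom maximizing |<R^n f, g>|."""
    residue, expansion = f.astype(complex), []
    for _ in range(n_iter):
        corr = atoms.conj() @ residue          # inner products with all atoms
        best = int(np.argmax(np.abs(corr)))
        expansion.append((best, corr[best]))
        residue = residue - corr[best] * atoms[best]
    return expansion, residue

# A small discrete dictionary over a coarse (s, u, xi) grid
N = 256
params = [(u, xi, s) for u in range(0, N, 32)
                     for xi in np.linspace(0.2, 2.8, 8)
                     for s in (4.0, 8.0, 16.0)]
atoms = np.array([gabor_atom(N, u, xi, s) for (u, xi, s) in params])

f = gabor_atom(N, 96, 1.0, 8.0) + 0.5 * gabor_atom(N, 160, 2.0, 16.0)
expansion, residue = matching_pursuit(f, atoms, n_iter=10)
print(np.linalg.norm(residue) / np.linalg.norm(f))   # residual energy decays, cf. (3.27)
```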
(3.33) l=—oo m=—oo The two equations given above are used to define the spatial time-frequency distri- bution (STF D) matrix as follows 00 OO =2 2 11(ml(111+m+z)*(1+m—l)e—j2‘*”, (3.39) l——oo 771"- where [Dxx(t,w)],j = Dairy-(LOU), for i,j = 1, - -- ,n The blind identification method is presented based on a two—step process: the first step consists of whitening the data in order to transform the mixing matrix A into a unitary matrix U; the second step consists of retrieving this unitary matrix U by 30 jointly diagonalizing a set of whitened data STFD matrices. Under the assumption that the source signals si(t),1 S i S n are mutually uncorrelated, the whitening matrix W can be determined from the array output autocorrelation R W(R — 021)WT = WAATWT = I, (3.40) where R = 71-12,:1 x(t)x*(t) as T —-> 00, I is the n x 11 identity matrix, and 02 is the noise variance. In the second step, the whitened STFD matrix is obtained as Dzz(1,w) = WDxx(t,w)WT = UDss(t,w)UT, (3.41) where z(t) = Wx(t) is the whitened data vector, and U = WA is a unitary matrix. Since Dss(t,w) is diagonal, U may be obtained as a unitary diagonalizing matrix of the whitened STFD matrices Dzz(t,w) for time-frequency points corresponding to signal autoterms. In the end, the source signals are estimated as s(t) = UTWx(t). In contrast to blind source separation methods using second-order and/or high- order statistics, the proposed approach allows the separation of Gaussian sources with identical spectral shapes but with different time-frequency localization proper- ties. However, due to the joint diagonalization of the STFD matrix, it has a higher computational complexity. In [59], an underdetermined separation algorithm for nondisjoint sources is pro- posed based on the ST FD method. Source separation is achieved by combining the STFD matrix with a clustering approach. 3.3.3 Blind Separation via Time-Fquuency Masking In [60], binary time-frequency masks are created to achieve demixing provided the time-frequency representations of the sources do not overlap. Without loss of generality, suppose x1(t) and 11:2(t) are two mixtures of source 31 signals 31(t), - -- ,sN(t) such that N 3310) = Z 810). 2:1 (3.42) 172(1) = 01810? ~ 51'), 1:1 where a,- and 6,- are the attenuation coeflicients and the time delays. Using the shift—invariance of the short-time Fourier transform (STFT), the time-frequency rep- resentation of the mixing model (3.42) is 31W») X1(t,w) 1 1 = . . 5 . (3.43) X2(t,w) ale—J‘U‘Sl aNe-JMSN SN(t,w) It is assumed that the STFTs, Si(t,w) and Sk(t,w), of any two source signals s,(t) and sk(t) are disjoint S,(t,w)Sk(t,w) = 0, Vt,w Vi 7f 11'. (3.44) To demix, one creates the time-frequency mask corresponding to each source and applies each mask to the mixture to produce the original source time-frequency rep- resentations. For example, defining 1, Si(t,w) 74 O Afi(t,w) = , (3.45) 0, otherwise one obtains the time—frequency representation of 3,-(t) from the mixture X1(t,w) via S,(t,w) = Afi(t,w)X1(t,w), Vt,w. (3.46) 32 Let Q, = {(t,w) : M,(t,w) = 1} for any i E (1, - -~ ,N) so that M,- = 192.. Consider R(t,w) — X2(t’°") (3.47) 1(t.w Clearly, on Q7; R(t,w) = die—3.675“). (3.48) In this case, | R(t,w) [2 a,- and —%£R(t,w) = 15,-, where [2 denotes the phase of the complex number 2 taken between —7r and 7r. Hence, one simply labels each time- lAl‘t(t,cu)). Since the sources are frequency point (15,112) with the pair (I RUM) I, “w disjoint, there will be N distinct labels. 
By grouping the time-frequency points (t, (.12) with the same label, the sets Q,- are constructed, and then the masks M,- = 192.. Therefore, from equation (3.46), the time-frequency representations Si(t,w) of the original sources 3,-(t) can be obtained. Although this approach may separate any number of sources from their two mixtures, the problem with more than two mixtures is not addressed. In addition, it is hard to separate the source signals which have the same parameters a,- and 6,- in their mixtures. 33 CHAPTER 4 ADAPTIVE SIGNAL DECOMPOSITION ON THE TIME-FREQUEN CY PLANE 4. 1 Introduction Signal decomposition aims to extract the components comprising the observed signals. The majority of methods for performing linear signal decomposition involve over- complete waveform dictionaries. By selecting the optimum set of available waveforms from the dictionary based on some criterion, a sparse model of the signal can be obtained. Such decomposition schemes include matching pursuit [5], basis pursuit [6], and the chirplet decomposition [7]. A problem with these decomposition methods is the restricted number of waveforms in the dictionary. While dictionaries containing a wide variety of elements can be employed at the expense of high computational cost, the representations are not satisfactory unless all signal components are at least reasonably well approximated by dictionary elements. For this reason, in this chapter we introduce an adaptive component extraction approach on the time-frequency plane. This approach relies on extracting compo- nents that are well—concentrated on the time—frequency plane. The concentration of the components are quantified through an entrOpy measure on the time-frequency plane. Since it has been shown in the literature that signals that achieve a small en- tropy value on the time-frequency plane are Gabor logons, our component extraction algorithm reduces to extracting the Gabor logons that best describe the given data set in a minimum mean square sense. Unlike the traditional Gabor decomposition [61], where the signal is expressed as an infinite sum of the time and frequency shifted Gabor logons, we do not have to create a dictionary beforehand, and the components extracted by the proposed method have time and frequency centers determined by 34 the signal. Moreover, these extracted components have chirp rates and local spread adapted to the given set of signals. The goal is to represent the given data set with a few number of the chirped Gabor logons. 4.2 Background on Gabor Decomposition and Information Measures 4.2.1 Gabor Signal Expansion In 1946, Gabor presented an approach to characterize a time function in time and frequency simultaneously, which later became known as the Gabor signal expansion [62]. He showed that any signal in L2 could be represented as the weighted sum of modulated and shifted Gaussian functions (logons) centered on a rectangular lattice in time and frequency under the constraint that TS) S 27r where T is the time sampling interval and Q is the frequency sampling interval. That is, for signal s(t), the Gabor expansion is defined as 3(1) = Z amngmn(t) (4.1) with 9(1) = {921—307le (4.2) gmn = 90: — mean”? (4.3) The Gabor expansion coefficients amn are computed by the usual inner product rule for projecting 3(t) onto an auxiliary function 7(t), i.e., 4.... = [:0 8(t)7}"nn(t)dt (44) 7mm“) 2 “/(t _ mT)eant, (4-5) 35 where * denotes the complex conjugate operation. 
Equations (4.4) and (4.5) are in fact a sampled version of the windowed Fourier transform of the signal 3(t) with the analysis window 7(t), which is known as the Gabor transform. The analysis window and the synthesis window satisfy the following biorthogonality relationship [63]: T090 foo 9037*“ _ mT0)e_anOtdt = 6(m)6(n), (4-6) 27r -—oo where T0 = 27r/Q, and $20 = 27r/T. 4.2.2 Chirplet Transform The Gabor transform essentially provides expansions of signals as linear combinations of time-frequency atoms with fixed time and frequency “concentration” properties. However, it fails to represent the chirp-like components in a compact and precise way. In other words, more atoms are needed to approximate the chirp-like components with frequency modulation, which results in the reduction of the effectiveness and compactness of the time-frequency representation. For these reasons, the chirplet transform, a generalized Gabor transform, is devel- oped [64]. The time—frequency atoms for a Gaussian chirplet transform, the so—called Gaussian chirplets, are derived from a single Gaussian function through the oper- ations of scaling, chirping, time- and frequency-shifting, which leads to a family of wave packets with four adjustable parameters: gk(t) = [Va—WE exp {_92_k_(t — tk)2 +j [wk + pig-(t — tk)] (t — tk)}, (4.7) where the parameters (tk, wk) determine the time and frequency centers of the linear Gaussian chirplets; the variance ak(> 0) controls the time duration of the chirplet; 6k is the frequency modulation rate (chirp rate) that characterizes the “quickness” of frequency changes. Compared with the Gabor logon used for the Gabor expansion, 36 the Gaussian chirplet has more freedom and thereby can better match the signal under consideration. 4.2.3 Entropy Measure on the Time-Frequency Plane Since a TFD, C (t, w), from Cohen’s general class has many desired preperties such as the energy preservation and the marginals, it is analogous to the probability density function (pdf) of a two-dimensional random variable. This analogy has inspired the adaptation of information-theoretic measures such as entropy and mutual information to the time-frequency plane. The adaptation of classical Shannon entropy to the time- frequency plane yields H(C) = —// C(t,w)log2 C(t,w)dtdw. (4.8) This measure is only defined when C (t, w) > 0, Vt, 112. Therefore, it is valid for positive distributions such as the spectrogram, but yet invalid for non-positive distributions. For this reason, a more generalized class of entropy measures known as Rényi entropy has been adapted to the time—frequency plane. In [42], Rényi entrOpy was introduced as an alternative way of measuring the complexity of TFDs and the properties of this measure were proved extensively in [65]: (1 110(0) 11111.). (4.9) 1 t —a ff C(u,v)dudv where a > 0. This measure is well-defined as long as f f C(t,w)dtdw > 0 and has been shown to be finite for a large class of signals and distributions [65]. It is important to note that as a —> 1, Rényi entropy becomes Shannon entropy. It has been shown that the minimum value of entropy on the time-frequency plane is achieved for a Gabor logon [65]. This is also consistent with the fact that 37 the Gabor logon is the signal that achieves the lower bound on the uncertainty on the time-frequency plane [23]. For this reason, our signal decomposition algorithm is based on extracting a set of well-concentrated components, that best describe the given data set. 4.3 Component Extraction Method 4.3. 
4.2.3 Entropy Measure on the Time-Frequency Plane

Since a TFD, C(t, \omega), from Cohen's general class has many desired properties such as energy preservation and the marginals, it is analogous to the probability density function (pdf) of a two-dimensional random variable. This analogy has inspired the adaptation of information-theoretic measures such as entropy and mutual information to the time-frequency plane. The adaptation of the classical Shannon entropy to the time-frequency plane yields

H(C) = -\iint C(t, \omega)\, \log_2 C(t, \omega)\, dt\, d\omega. \qquad (4.8)

This measure is only defined when C(t, \omega) > 0, \forall t, \omega. Therefore, it is valid for positive distributions such as the spectrogram, but invalid for non-positive distributions. For this reason, a more general class of entropy measures known as Rényi entropy has been adapted to the time-frequency plane. In [42], Rényi entropy was introduced as an alternative way of measuring the complexity of TFDs, and the properties of this measure were proved extensively in [65]:

H_\alpha(C) = \frac{1}{1 - \alpha} \log_2 \iint \left( \frac{C(t, \omega)}{\iint C(u, v)\, du\, dv} \right)^\alpha dt\, d\omega, \qquad (4.9)

where \alpha > 0. This measure is well-defined as long as \iint C(t, \omega)\, dt\, d\omega > 0, and it has been shown to be finite for a large class of signals and distributions [65]. It is important to note that as \alpha \to 1, Rényi entropy becomes Shannon entropy.

It has been shown that the minimum value of entropy on the time-frequency plane is achieved for a Gabor logon [65]. This is also consistent with the fact that the Gabor logon is the signal that achieves the lower bound on the uncertainty on the time-frequency plane [23]. For this reason, our signal decomposition algorithm is based on extracting a set of well-concentrated components that best describe the given data set.

4.3 Component Extraction Method

4.3.1 Problem Statement

Given M measurements of a signal, \{x_1, x_2, \ldots, x_M\}, we want to extract the first L components, L < M, that minimize entropy on the time-frequency plane. Each measurement, x_i, is transformed to the time-frequency plane as:

C_i(n, k) = \sum_{l} \sum_{m} \psi(n - l, m)\, x_i\!\left(l + \frac{m}{2}\right) x_i^*\!\left(l - \frac{m}{2}\right) e^{-j\omega_k m}. \qquad (4.10)

The time-frequency distribution corresponding to each trial is vectorized and a matrix of time-frequency distributions is formed:

\mathbf{C} = \begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_M \end{bmatrix}, \qquad (4.11)

where C_i is a vector of length N x K points, N and K being the number of time and frequency points, respectively. The components on the time-frequency plane are found based on this time-frequency data matrix. Each component S_i is a linear combination of the rows of this matrix, i.e.

S_i(n, k) = \sum_{j=1}^{M} a_j C_j(n, k), \qquad (4.12)

where \sum_{j=1}^{M} a_j^2 = 1 and the a_j's are chosen such that H_\alpha(S_i) is minimized on the time-frequency plane.

4.3.2 The Proposed Approach

Since Gabor logon signals have minimum entropy in the time-frequency domain, the cost function is chosen as \epsilon = H_\alpha(S_i) - H_\alpha^G, where H_\alpha(S_i) is the Rényi entropy of the i-th component with order \alpha, and H_\alpha^G represents the Rényi entropy of the corresponding desired Gabor logon signal. The weight vector \mathbf{a} = [a_1, a_2, \ldots, a_M]^T is updated using the method of steepest descent [66]:

\mathbf{a} \leftarrow \mathbf{a} - \mu \frac{\partial \epsilon}{\partial \mathbf{a}}, \qquad (4.13)

where \mu is the step size parameter. In the discrete case, the Rényi entropy of the component S_i is

H_\alpha(S_i) = H_\alpha(\mathbf{a}^T \mathbf{C}) = \frac{1}{1 - \alpha} \log_2 \sum_{n} \sum_{k} \left( \sum_{j=1}^{M} a_j C_j(n, k) \right)^\alpha, \qquad (4.14)

where S_i and the C_j are normalized. The gradient of the cost function \epsilon with respect to the l-th weight coefficient a_l is derived as:

\frac{\partial \epsilon}{\partial a_l} = \frac{\alpha}{1 - \alpha} \cdot \frac{\sum_n \sum_k (S_i(n, k))^{\alpha - 1}\, C_l(n, k)}{\sum_n \sum_k (S_i(n, k))^\alpha}, \qquad (4.15)

where l = 1, \ldots, M. For the special case of \alpha = 2,

\frac{\partial \epsilon}{\partial a_l} = -\frac{2 \sum_n \sum_k S_i(n, k)\, C_l(n, k)}{\sum_n \sum_k (S_i(n, k))^2}. \qquad (4.16)

Substituting the result of equation (4.16) into equation (4.13) yields the update equation for a:

a_l \leftarrow a_l + 2\mu\, \frac{\sum_n \sum_k S_i(n, k)\, C_l(n, k)}{\sum_n \sum_k (S_i(n, k))^2}. \qquad (4.17)

The algorithm can be summarized as follows:

1. Find the Gabor logon that best describes the average of all trials, C_{av} = \frac{1}{M} \sum_j C_j. This first Gabor logon is found by estimating the average time duration, average frequency, spread, and chirp rate of C_{av}. A logon with these estimated parameters is constructed and chosen as the first desired signal, G(n, k; n_0, k_0, \sigma, \beta).

2. Set the initial value a_j = 1/\sqrt{M}, and use the adaptive filtering algorithm to update the weights until the error converges. Here, when the absolute value of the difference between two successive weight iterates is less than a given error value, the update is stopped. The first component is then determined as S_1 = \mathbf{a}_*^T \mathbf{C}, where \mathbf{a}_* is the optimal weighting vector.

3. Project all the trials on S_1 and compute the residue:

C_i = C_i - \langle S_1, C_i \rangle S_1, \quad i = 1, 2, \ldots, M. \qquad (4.18)

4. Repeat the same algorithm on this residue matrix C and extract the next component.

5. Stop when the average energy of the residues drops below a pre-determined threshold value.
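A hedged sketch of the weight adaptation (4.17) for \alpha = 2. The stopping rule follows step 2; the renormalization enforcing \sum_j a_j^2 = 1 and the stand-in data are our assumptions, and the windowing and logon-fitting steps of the full algorithm are omitted:

```python
import numpy as np

def renyi2_entropy(S):
    """Order-2 Renyi entropy (4.19) of a vectorized component S = a^T C."""
    return -np.log2(np.sum(S ** 2))

def entropy_descent(C, mu=0.05, n_iter=500, tol=1e-8):
    """Steepest-descent weight adaptation (4.17) for alpha = 2.

    C is the M x (N*K) matrix of vectorized trial TFDs. The desired-logon
    entropy only shifts the cost eps = H_2(S) - H_G by a constant and drops
    out of the gradient (4.16), so it does not appear in the update itself.
    """
    M = C.shape[0]
    a = np.full(M, 1.0 / np.sqrt(M))              # initial weights, step 2
    for _ in range(n_iter):
        S = a @ C
        grad = -2.0 * (C @ S) / np.sum(S ** 2)    # eq. (4.16)
        a_new = a - mu * grad                     # eq. (4.17)
        a_new /= np.linalg.norm(a_new)            # keep sum a_j^2 = 1
        converged = np.max(np.abs(a_new - a)) < tol
        a = a_new
        if converged:                             # stopping rule from step 2
            break
    return a

# Stand-in data: 8 trials on a 64-point vectorized time-frequency grid
rng = np.random.default_rng(5)
C = np.abs(rng.normal(size=(8, 64)))
C /= np.linalg.norm(C, axis=1, keepdims=True)
a = entropy_descent(C)
print(renyi2_entropy(a @ C))   # typically lower than the entropy of any single trial
```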
= w = _ 27€CC )ak , (4.21) 33}: Bak ak (CCT)ak Assume that the desired Gabor logon is G = afC with the optimal weight vector a*. Consider the weight error vector 8k 2 ak —— a*. Subtracting a... from both sides of equation (4.20), we get ek+1 = 5k — ,uAak. (4.22) Multiplying both sides of above equation with its transpose to get the norm of the weight error yields 2 2 T 2 2 || 51.41 || =|| 51- || 43/15); Aak +# II A31; || - (4-23) In order for the weights to converge to the true weights, we require ll €1.41 n2 < 5k, Sk > , where <, > represents the inner product of two vectors, S k is the extracted component at step 1:, and G’ is the corresponding desired Gabor logon, the upper bound on the positive step size becomes 0 ) . (4.27) 4 __ 1 _ __ || Aak ||2( < 314,514 > Notice that since < G, 5'); > is less than < Sk, S'k > for the normalized TFDs Sk and G, the upper bound on the step size is positive and valid. It can be concluded that the proposed adaptation algorithm is convergent. 4.4 Experimental Results and Analysis In order to evaluate the effectiveness of our method, we consider the following exam- ple. The set of observed signals are linear combinations of two Gabor logons and a chirp signal, i.e. 1r,- 2 1122-151 + 1112-232 + 102-353, where wi1,w,§2,wi3 are the weights for each signal and are distributed as N (O, 1). The first Gabor logon is centered at 42 the time sample point 50 and normalized frequency of 0.7, the second Gabor logon is centered at time sample point 150 and normalized frequency of -0.7. The linear chirp signal has an initial normalized frequency of -0.2 and its instantaneous frequency in— creases to a normalized frequency of 0.2. Rényi entropy with a = 2 is used as the cost function to ensure that entropy is well-defined. The data set consists of M = 128 linear combinations of these three signals. Each signal is transformed to the time- frequency domain with N = 50 time samples and K = 64 frequency samples. Each TF D is then vectorized to form a TFD matrix of size 128 x 3200. First, the average of M TFDs corresponding to each trial is computed. Then, the time-frequency location of the peak energy on the time-frequency plane is found as no and k0. A window centered at (740,190) is constructed to determine a local region around this peak. The size of the window is determined based on the energy distri- bution of the signal, i.e. the window is expanded until the energy value drops below 10% of the peak value. This windowing approach around the peak helps us extract local features. The same window is applied to all trials to extract the corresponding regions in each trial. The standard deviation 0 and gradient (the chirp rate), 6, of this local TFD are estimated. Based on the parameters (120, 190,0,6), a Gabor logon is constructed and chosen as the first desired signal. Using the steepest descent al- gorithm, the weight coefficients aj’s are updated to minimize the difference of Rényi entropy between the linear combination of the M local TFDS and the TFD of the first desired logon to obtain the first time-frequency component, 31. This first component is projected onto all of the M trials and the residue is found. This same algorithm is repeated for the residue on the time-frequency plane, i.e. pick the peak, construct a window, determine the desired Gabor logon, and adaptively filter the signals to get close to the desired Gabor logon. This process is repeated until the energy of the residue is below a certain threshold. 
In this example, 11 components were enough to represent 90% of the total energy of the signal. 43 Table 4.1 gives the entropy values for the first 3 of the 11 extracted components, the corresponding desired logon signals, and the first 3 components obtained using PCA. It is shown in Table 4.1 that the entrOpy of the extracted components are closer to the entropy of the Gabor logons. Since the entropy differences between these extracted components and the desired logon signals are small, we can infer that the extracted signals are quite close to the actual logons. It is also seen that the entropy of components extracted by our method is less than the entrOpy of PCA components. This indicates that we obtain components that are more compact than the ones obtained by PCA. The time-frequency surfaces in Figure 4.1 indicate that the 5 extracted compo- nents include both the logon signals and the first three chirped logons that represent the linear chirp signal. The topographical plots of the extracted components make it clear that each component was appropriately isolated in terms of the topographical region of origin. The results of this example show that the decomposition of time-frequency energy using our approach can extract meaningful time-frequency components for analysis of large sets of data. This decomposition algorithm achieves several goals. First, time- frequency data reduction is accomplished by producing a few meaningful components on the time-frequency plane that explain most of the signal’s energy. A second ben- efit of this time-frequency domain decomposition is that it can extract activity that overlaps in time, but not in frequency, which is not possible using time domain decom- position approaches. Finally, another benefit of our method is the ability to separate and extract parts of chirped signals, which cannot be achieved using the conventional Gabor expansion. Next, the performance of the proposed approach is compared with that of Orthog- onal Matching Pursuit (OMP) introduced in Chapter 3. In this example, in order to represent the same 90% of the total energy of the signal as in the proposed method, 44 Table 4.1. Entropy Comparison Entropy Decmp Comps PCA Comps Desired Logons 1 2.8319 4.5011 2.7809 2 2.7825 3.0461 2.7413 3 2.7517 2.9724 2.7252 24 dictionary elements for OMP are needed, among which the six ones are shown in Figure 4.2. It is indicated that the number of components required by OMP is much larger than that of the proposed approach which is only eleven. Moreover, the com- putation of OMP is much more complex, about four times of the proposed method. One reason why the performance of matching pursuit is not well is that the decom— position is not satisfactory with matching pursuit unless all signal components are well approximated by dictionary elements; on the other hand, although the dictionary contains a wide variety of elements, this kind of redundancy leads to the elements to be employed at the expense of high computational cost. 4.5 Summary In this chapter, a new signal decomposition method on the time-frequency plane is proposed based on the minimum entropy criterion. The major difference of the proposed approach from conventional component extraction or decomposition meth- ods is the cost function. The cost function which is minimized is entropy on the time-frequency plane, thus producing compact components that are similar to Gabor logons. 
Using entropy as the cost function and adopting an adaptive filtering method to update the weights corresponding to each trial, we extract “minimum" entropy components orthogonal to each other. Experimental results show that the presented approach is effective in determining a few number of components that can be used to represent a large set of data. 45 Figure 4.1. The average time-frequency distribution of 128 trials and the 5 extracted components of the proposed method 46 Figure 4.2. The average time-frequency distribution of 128 trials and the 6 elements of OMP 47 CHAPTER 5 OVERDETERMINED BLIND SOURCE SEPARATION IN THE TIME-FREQUENCY DOMAIN 5. 1 Introduction Blind source separation (BSS) is an important and fundamental problem in signal processing with a broad range of applications. Several unobservable source signals first pass through an intermediate media, and then arrive at an array of sensors. The observed output of each sensor is a mixture of all the source signals. The goal in BSS is to recover the original source signals from the observed mixtures. Typ- ical BSS applications include communications [1], speech signal processing [2], and biomedical signal processing applications [3]. A number of BSS algorithms have been proposed based on the instantaneous mixture model, in which the observed signals are linear combinations of the source signals and no time delays are involved in the mix- tures. Among these methods, the most common ones are second order statistics based methods [67], and information-theoretic approaches which utilize cost functions such as mutual information or divergence measures, e.g. independent component analysis (ICA) [9,10,68—70], sparse component analysis (SCA) [71], and nonnegative matrix factorization (NMF) [72]. These methods in general assume a certain structure for the underlying source signals. Some examples include higher-order statistics based methods which assume non-Gaussian and i.i.d source signals, and ICA which assumes the independence of the source signals. Most real life signals are non-stationary, and thus do not obey the underlying assumption of stationarity that is embedded in the current methods. For this rea- son, recently various methods have been introduced to exploit the non-stationarity property of the source signals to solve the separation problem, including frequency 48 domain [73], [74] and time domain [75], [76] approaches. In general the frequency- domain estimation algorithms have a simpler implementation, less computational time, and better convergence properties over the time-domain ones. However, the disadvantages of using frequency-domain methods are the arbitrary permutation and scaling ambiguities of the estimated frequency response of the un-mixing system at each frequency bin. Motivated by these problems, researchers have resorted to the powerful tool of time-frequency signal representations. For non-stationary signals, a blind separation approach using a spatial time-frequency distribution is proposed in [58] to achieve the separation by joint diagonalization of the auto-terms in the spatial time-frequency distributions. This approach has been modified and improved as discussed in [77,78]. Another time-frequency based method described in [60] uses binary time-frequency masks to separate more than two speech sources from two mixtures using the sparsity of the time-frequency representations of speech signals. 
In this chapter, we introduce a new approach to the source separation problem combining time—frequency representations with information-theoretic measures. An information-theoretic criterion, Jensen-Rényi divergence as adapted to the time-fre- quency distributions, is used as the objective function to separate the sources. The underlying sources are assumed to be disjoint on the time-frequency plane and it is shown that this new cost function achieves its maximum when the signals are disjoint. With the assumption that the source signals are disjoint on the timeLfrequency plane, signal separation is performed through a rotation transformation using a steepest descent algorithm. 5.2 Information Measures in the Time-Frequency Domain The information-theoretic measures such as entropy have been successfully applied to the time—frequency plane [42,65], due to an analogy between a TFD and the probabil- 49 ity density function (pdf) of a two-dimensional random variable. Although entropy measures have proven to be useful in quantifying the complexity of individual sig- nals, they cannot be used directly to quantify the difference between signals. For this reason, well-known divergence measures from information theory have been adapted to the time-frequency plane [79,80]. The most common divergence measures used for probability distributions belong to Csiszar’s f-divergence such as Kullback-Leibler divergence based on Shannon entropy and a-divergence based on Rényi entropy [39]. Another common class of divergence measures is based on the Jensen difference such as the Jensen-Shannon divergence constructed by applying Jensen inequality to the entropy functional. Jensen—Rényi divergence is the modification of Jensen-Shannon divergence from an arithmetic to a geometric mean introduced by Michel [79]. For time-frequency distributions, Jensen—Rényi divergence can be defined as: jf2(Cl,C2) = Ha(v0102) - Ha(C'1) g Ha(C'2), (5-1) where Cl and Cg are the general TFDs of two different signals defined in equation (2.4) respectively, and Ha represents Rényi entropy defined in equation (4.9). Jensen- Rényi divergence is equal to zero when 01 = C2, and its positivity can be proven using the Cauchy-Schwartz inequality. This measure has some desired properties such as being symmetric and monotonically increasing as the overlap between the two distributions decreases, i.e. Cl(t,w)02(t,w) —> 0. Therefore, maximizing this measure corresponds to obtaining disjoint time-frequency representations. When Jensen-Rényi divergence is compared to other symmetric and monotoni- cally increasing information-theoretic measures on the time-frequency plane, several advantages emerge. First of all, J ensen—Rényi divergence is defined based on the Rényi entrOpy which is well-defined for a larger class of time-frequency distributions com- pared to Shannon entropy which is defined for only positive distributions. Therefore, 50 Shannon entropy based divergence measures are limited in their applicability. Sec- ond, recent work in the analysis of sensitivity or robustness of divergence measures on the time-frequency plane reveals that Jensen-Rényi divergence is more robust against perturbations and noise, which makes it more suitable for source detection and separa- tion applications [79,80]. The following simulation example compares the robustness of two different distance measures under an additive signal perturbation model. 
The following simulation example compares the robustness of two different distance measures under an additive signal perturbation model. The original signal is a Gabor logon, s_1(t) = \exp(-(t - t_0)^2)\exp(-j\omega_0 t), centered at time t_0 = 32 and normalized frequency \omega_0 = 0.2, and the perturbation signal is another Gabor logon, s_2(t) = \exp(-(t - t_1)^2)\exp(-j\omega_0 t), centered at t_1 = 64, \omega_0 = 0.2. The perturbed signal is z(t) = (1 - \epsilon) s_1(t) + \epsilon s_2(t), where \epsilon \in [0, 1]. The distance between the time-frequency distributions of the perturbed signal and the original one is computed as \epsilon goes from 0 to 1. Figure 5.1 shows the comparison between the symmetric Kullback-Leibler and the Jensen-Rényi divergences for different values of \epsilon. When \epsilon is small, the Jensen-Rényi divergence is smaller than the Kullback-Leibler divergence, showing robustness against small perturbations. However, as \epsilon increases, the Jensen-Rényi divergence reacts faster to the change and detects the second signal component.

Figure 5.1. Comparison of Kullback-Leibler and Jensen-Rényi divergence measures under an additive signal perturbation model

5.3 Problem Formulation and Method

5.3.1 Signal Model and Assumptions

In this chapter, we consider the problem of determining the source signals when the number of observed mixtures is equal to or greater than the number of source signals. Assume that the N-dimensional vector s(t) = [s_1(t), s_2(t), \ldots, s_N(t)]^T corresponds to the N non-stationary complex source signals. The source signals are transmitted through a medium, and the M sensors pick up a set of mixed signals represented by z(t) = [z_1(t), z_2(t), \ldots, z_M(t)]^T, where M \geq N. Given the M observations or mixtures,

z(t) = A(t) s(t),   (5.2)

where A(t) is the mixing matrix, we want to extract the underlying sources s(t). In this chapter, we assume an instantaneous mixture of the sources, i.e., A(t) = A, where A is an M \times N matrix. The following assumption is made about the underlying sources: the sources are assumed to have different structures and localization properties on the time-frequency plane, i.e. the sources are disjoint on the time-frequency plane. This implies that C_{s_i}(t,\omega) C_{s_j}(t,\omega) = 0 for all t, \omega and i \neq j. In practice, this condition is never satisfied exactly. However, as long as the inner product is small, source separation can be achieved. This condition on the disjointness of the sources on the time-frequency plane has already been used by several authors for the separation of speech and music signals [60], [81].

Before proceeding further, it is important to specify the notion of blind identification. In the blind context, a full identification of the mixing matrix A is impossible, since the exchange of a fixed scalar factor between a given source signal and the corresponding column of A does not affect the observations, as shown by the following relation:

z(t) = A s(t) = \sum_{i=1}^{N} \frac{a_i}{\beta_i} \, \beta_i s_i(t),   (5.3)

where \beta_i is an arbitrary complex factor, and a_i denotes the ith column of A. Advantage can be taken of this indeterminacy by assuming that the source signals have unit variance, so that the dynamic range of the sources is accounted for by the magnitude of the corresponding columns of A. This normalization convention turns out to be convenient in the sequel; it does not affect the performance results. For the proposed algorithm, the sources are extracted on the time-frequency plane up to a scalar factor and a permutation.
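As a small illustration of the instantaneous model (5.2) and the unit-variance convention above, consider the following sketch; the dimensions and the random mixing matrix are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 3, 4, 256                 # sources, sensors, time samples (illustrative)

# Unit-variance sources, per the normalization convention above.
s = rng.standard_normal((N, T))
s /= s.std(axis=1, keepdims=True)

# Instantaneous mixing z(t) = A s(t): A is a fixed M x N matrix.
A = rng.standard_normal((M, N))
z = A @ s                           # the M observed mixtures
```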
5.3.2 Problem Statement in the Time-Frequency Domain

This section briefly outlines the overall structure of the source separation. The different components of the algorithm, such as the cost function and the optimization method, are discussed subsequently.

Each observation, z_i(t), is first transformed to the time-frequency plane as:

C_{z_i}(n,\omega) = \sum_{l} \sum_{m} \phi(n-l, m)\, z_i\!\left(l + \frac{m}{2}\right) z_i^*\!\left(l - \frac{m}{2}\right) e^{-j\omega m}
= \sum_{k=1}^{N} |a_{ik}|^2 \sum_{l} \sum_{m} \phi(n-l, m)\, s_k\!\left(l + \frac{m}{2}\right) s_k^*\!\left(l - \frac{m}{2}\right) e^{-j\omega m} + \sum_{k=1}^{N} \sum_{r \neq k} a_{ik} a_{ir}^* C_{s_k s_r}(n,\omega)
= \sum_{k=1}^{N} |a_{ik}|^2 C_{s_k}(n,\omega) + \sum_{k=1}^{N} \sum_{r \neq k} a_{ik} a_{ir}^* C_{s_k s_r}(n,\omega),   (5.4)

where the first term contains the auto-TFDs of the sources weighted by |a_{ik}|^2, and the second term contains the cross-TFDs, C_{s_k s_r}, between different sources. With a reduced interference kernel the cross-terms are negligible, as illustrated through an example in Section 5.4. Each C_{z_i} is evaluated on a discrete time-frequency grid and vectorized into a row vector x_i of length P, the number of time-frequency points; stacking the rows gives the TFD observation matrix

X = [x_1^T, x_2^T, \ldots, x_M^T]^T.   (5.7)

5.3.3 Cost Function

The cost function used in this chapter is the total pairwise Jensen-Rényi divergence, defined as:

J_\alpha \triangleq \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} JR_\alpha(Y_i, Y_j),   (5.10)

where Y_i denotes the ith extracted component, i.e. the ith row of the transformed observation matrix Y introduced in Section 5.3.4. Maximizing this cost function ensures that the extracted components do not overlap with each other on the time-frequency plane. The pairwise Jensen-Rényi divergence between two time-frequency distributions is defined as:

JR_\alpha(Y_i, Y_j) = H_\alpha\left(\sqrt{Y_i Y_j}\right) - \frac{H_\alpha(Y_i) + H_\alpha(Y_j)}{2}.   (5.11)

This expression can be further simplified as:

JR_\alpha(Y_i, Y_j) = \frac{1}{1-\alpha} \log_2 \frac{\left( \sum_{k=1}^{P} \left( Y_i(k) Y_j(k) \right)^{\alpha/2} \right)^2}{\left( \sum_{k=1}^{P} Y_i^\alpha(k) \right) \left( \sum_{k=1}^{P} Y_j^\alpha(k) \right)},   (5.12)

which represents the ratio of the energy of the overlap between the two TFDs to the product of the energies of the individual TFDs. Let

\tilde{J}_{ij} = \frac{\left( \sum_{k=1}^{P} \left( Y_i(k) Y_j(k) \right)^{\alpha/2} \right)^2}{\left( \sum_{k=1}^{P} Y_i^\alpha(k) \right) \left( \sum_{k=1}^{P} Y_j^\alpha(k) \right)}

and

\tilde{J}_\alpha = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \tilde{J}_{ij}.   (5.13)

Since \log(\cdot) is a monotonic function, maximizing J_\alpha is equivalent to minimizing \tilde{J}_\alpha for \alpha > 1, or maximizing \tilde{J}_\alpha for \alpha < 1. This means that we can equivalently use \tilde{J}_\alpha as our cost function. In this chapter, we consider orders \alpha > 1; the results are similar for \alpha < 1. One special case of \alpha > 1 is the quadratic one, \alpha = 2, for which the cost function simplifies to:

J_2 = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{\sum_{k=1}^{P} Y_i(k) Y_j(k)}{\sqrt{\left( \sum_{k=1}^{P} Y_i^2(k) \right) \left( \sum_{k=1}^{P} Y_j^2(k) \right)}}.   (5.14)

In this chapter, we use \alpha = 2, since the Rényi entropy is well-defined for this order even when the distributions are non-positive. Minimizing J_2 is equivalent to minimizing the sum of pairwise normalized inner products between the extracted sources, and ensures disjoint source extraction.
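A direct transcription of the quadratic cost (5.14) is sketched below; Y is assumed to hold one vectorized candidate TFD per row, and the function name is ours.

```python
import numpy as np

def cost_j2(Y):
    # Equation (5.14): sum over all pairs (i, j), i < j, of the
    # normalized inner products between rows of Y.
    N = Y.shape[0]
    total = 0.0
    for i in range(N - 1):
        for j in range(i + 1, N):
            num = np.dot(Y[i], Y[j])
            den = np.sqrt(np.dot(Y[i], Y[i]) * np.dot(Y[j], Y[j]))
            total += num / den
    return total
```

Minimizing this quantity drives the pairwise overlaps toward zero, which is exactly the disjointness requirement.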
5.3.4 Rotation

In our source separation problem, the observed time-frequency distributions, X, can be written as a linear combination of the original sources' TFDs, C, assuming negligible cross-terms between the sources:

X = A^2 C = BC,   (5.15)

where A^2 is the element-by-element square of the mixing matrix in the time domain, and B = A^2. The goal of source separation in this chapter is to find a linear transform Q of the observed signals, X, such that the extracted signals, Y, are as disjoint as possible from each other. The cost function in Section 5.3.3 quantifies the disjointness of the extracted sources using divergence. In this section, we show how to obtain this linear transform Q.

Q should be chosen such that the elements of Y = QX = QBC are disjoint. If the elements of Y are exactly disjoint, then YY^T will be a diagonal matrix, which means that QBCC^TB^TQ^T will also be diagonal. Since the original sources' TFDs, C, are disjoint and normalized, CC^T = I. Therefore, finding a linear transform Q for unmixing the observations reduces to finding a unitary matrix that diagonalizes BB^T. Since B is not known a priori, we estimate the unitary transform Q iteratively. Any unitary matrix Q can be written as a product of Givens rotation matrices, and this formulation allows us to parameterize the estimation of Q in terms of the rotation angles \theta.

It is well-known that any unitary matrix Q can be written as the product of N(N-1)/2 Givens rotation matrices, Q = G_1 G_2 \cdots G_{N(N-1)/2}. In N-dimensional space, the simplest rotation is in a two-dimensional plane. If the rotation is through an angle \theta_{ab} in the a-b plane, then the Givens rotation matrix G_{ab}(\theta_{ab}) is:

G_{ab}(\theta_{ab}) = \begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & & \vdots & & \vdots & & \vdots \\
0 & \cdots & \cos(\theta_{ab}) & \cdots & \sin(\theta_{ab}) & \cdots & 0 \\
\vdots & & \vdots & & \vdots & & \vdots \\
0 & \cdots & -\sin(\theta_{ab}) & \cdots & \cos(\theta_{ab}) & \cdots & 0 \\
\vdots & & \vdots & & \vdots & & \vdots \\
0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{bmatrix},   (5.16)

that is, G_{ab}(\theta_{ab}) equals the N \times N identity matrix I_N except that the elements I_N(a,a), I_N(a,b), I_N(b,a), and I_N(b,b) are replaced by \cos(\theta_{ab}), \sin(\theta_{ab}), -\sin(\theta_{ab}), and \cos(\theta_{ab}), respectively, where I_N(a,b) is the element of I_N located at the ath row and bth column. From [82], we know that any N-dimensional rotation matrix can be written as the product of N(N-1)/2 such two-dimensional-plane rotations:

G(\theta) = G_{12}(\theta_{12}) \cdots G_{ab}(\theta_{ab}) \cdots G_{(N-1)N}(\theta_{(N-1)N}),   (5.17)

where \theta = [\theta_{12}, \ldots, \theta_{ab}, \ldots, \theta_{(N-1)N}]^T and a < b.

In order to have exact source separation in this formulation, there should exist a unitary matrix Q that diagonalizes the mixing matrix B = A^2 on the time-frequency plane. Since B has all positive entries, there is no such Q unless B is already diagonal, which corresponds to observations that are scalar multiples of the sources.

5.3.5 Proposed Algorithm

The objective of the proposed algorithm is to determine the optimal rotation transform such that the total pairwise divergence measure is maximized to achieve signal separation. We use the gradient adaptation algorithm, also known as steepest descent [66], to update the rotation angles. Gradient adaptation is not the only choice, but it is preferred in many practical paradigms due to its simplicity and efficient convergence [83]. The overall update equation for stochastic gradient descent is:

\theta(n+1) = \theta(n) - \mu \frac{\partial J_2}{\partial \theta},   (5.18)

where \mu is the step size parameter. The gradient of the cost function J_2 with respect to the rotation angle \theta_{ab} is derived as:

\frac{\partial J_2}{\partial \theta_{ab}} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{\partial J_{ij}}{\partial \theta_{ab}},   (5.19)

where J_{ij} denotes the (i, j) summand in equation (5.14), and

\frac{\partial J_{ij}}{\partial \theta_{ab}} = \frac{\sum_{k=1}^{P} \left( \frac{\partial Y_i(k)}{\partial \theta_{ab}} Y_j(k) + Y_i(k) \frac{\partial Y_j(k)}{\partial \theta_{ab}} \right)}{\sqrt{\left( \sum_{k=1}^{P} Y_i^2(k) \right) \left( \sum_{k=1}^{P} Y_j^2(k) \right)}} - \frac{\sum_{k=1}^{P} Y_i(k) Y_j(k)}{\left[ \left( \sum_{k=1}^{P} Y_i^2(k) \right) \left( \sum_{k=1}^{P} Y_j^2(k) \right) \right]^{3/2}} \left[ \left( \sum_{k=1}^{P} Y_i(k) \frac{\partial Y_i(k)}{\partial \theta_{ab}} \right) \left( \sum_{k=1}^{P} Y_j^2(k) \right) + \left( \sum_{k=1}^{P} Y_i^2(k) \right) \left( \sum_{k=1}^{P} Y_j(k) \frac{\partial Y_j(k)}{\partial \theta_{ab}} \right) \right],   (5.20)

with Y_i(k) = G_i X(k) and \partial Y_i(k) / \partial \theta_{ab} = (\partial G_i / \partial \theta_{ab}) X(k), where G_i is the ith row of G(\theta), and X(k) is the kth column of X. The explicit gradient expression for N = 3 is given in the Appendix.
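The pieces of Sections 5.3.4 and 5.3.5 can be put together in a short sketch; for brevity the gradient in the update (5.18) is approximated here by central finite differences rather than the analytic expression (5.20), and all names and step-size values are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def givens(N, a, b, theta):
    # N x N Givens rotation in the (a, b) plane, equation (5.16).
    G = np.eye(N)
    G[a, a] = G[b, b] = np.cos(theta)
    G[a, b], G[b, a] = np.sin(theta), -np.sin(theta)
    return G

def rotation(thetas, N):
    # Product of the N(N-1)/2 plane rotations, equation (5.17).
    Q = np.eye(N)
    for (a, b), th in zip(combinations(range(N), 2), thetas):
        Q = Q @ givens(N, a, b, th)
    return Q

def cost_j2(Y):
    # Pairwise normalized inner products, equation (5.14).
    N = Y.shape[0]
    return sum(np.dot(Y[i], Y[j])
               / np.sqrt(np.dot(Y[i], Y[i]) * np.dot(Y[j], Y[j]))
               for i in range(N - 1) for j in range(i + 1, N))

def separate(X, mu=0.05, iters=200, eps=1e-5):
    # Steepest descent on the rotation angles, equation (5.18);
    # J2 is minimized for alpha = 2.
    N = X.shape[0]
    thetas = np.zeros(N * (N - 1) // 2)
    J = lambda th: cost_j2(rotation(th, N) @ X)
    for _ in range(iters):
        grad = np.array([(J(thetas + eps * e) - J(thetas - eps * e)) / (2 * eps)
                         for e in np.eye(thetas.size)])
        thetas -= mu * grad
    return rotation(thetas, N) @ X
```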
5.4 Experimental Results and Analysis

In order to evaluate the effectiveness of the proposed method, we consider various source separation examples. In all of the examples with synthesized signals, the sources are assumed to be approximately disjoint on the time-frequency plane. Each observation is transformed to the time-frequency domain with K = 50 time samples and L = 64 frequency samples. Each TFD is vectorized to form a TFD observation matrix of size M x 3200, as in equation (5.7), where M is the number of observations. The Jensen-Rényi divergence with order \alpha = 2 is used as the cost function to ensure that the divergence is well-defined. The binomial kernel [23] is used for computing the TFD, since it belongs to the class of reduced interference distributions (RIDs) and thus has negligible cross-terms. This property of the distributions improves the performance of our source separation algorithm. The performance of the proposed method is quantified in terms of the accuracy of the extracted sources, convergence rate, and robustness to noise.

All of the experimental results are evaluated using the signal to interference ratio (SIR), defined as:

SIR = \frac{1}{N} \sum_{i=1}^{N} SIR_i, \quad SIR_i = SIRO_i - SIRI_i,

SIRO_i = 10 \log \frac{\sum_{k=1}^{P} Y_{i s_i}^2(k)}{\sum_{k=1}^{P} \left( \sum_{j \neq i} Y_{i s_j}(k) \right)^2}, \quad SIRI_i = 10 \log \frac{\sum_{k=1}^{P} X_{i s_i}^2(k)}{\sum_{k=1}^{P} \left( \sum_{j \neq i} X_{i s_j}(k) \right)^2},   (5.21)

where Y_{i s_j} and X_{i s_j} are the outputs and inputs of the system when only the signal s_j is active, respectively, and N is the number of sources.

Example 1: Separation of a chirp signal and two Gabor logons

In this example, the set of observed signals consists of three linear combinations of a chirp signal and two Gabor logons, i.e. z_i(t) = a_{i1} s_1(t) + a_{i2} s_2(t) + a_{i3} s_3(t), where a_{i1}, a_{i2}, a_{i3} are the weights for each signal, distributed as N(0, 1), i = 1, 2, 3, and s_1(t), s_2(t), s_3(t) correspond to the two Gabor logons and the chirp signal, respectively. A Gabor logon is a modulated Gaussian, expressed as:

s_i(t) = \left( \frac{1}{\pi \sigma^2} \right)^{1/4} e^{-(t - t_{i0})^2 / (2\sigma^2)} e^{j \omega_{i0} t}, \quad i = 1, 2.   (5.22)

The Gabor logon has virtually compact support in both time and frequency, centered in time at t = t_{i0} and in frequency at \omega = \omega_{i0}. In this example, the first Gabor logon is centered at the time sample point t_{10} = 50 and normalized frequency \omega_{10} = 0.7, and the second Gabor logon is centered at the time sample point t_{20} = 150 and normalized frequency \omega_{20} = -0.7. A linear Gaussian chirp is expressed as:

s_3(t) = \left( \frac{1}{\pi \sigma^2} \right)^{1/4} e^{-(t - t_0)^2 / (2\sigma^2) + j \left[ \omega_0 + \frac{\beta}{2}(t - t_0) \right] (t - t_0)},   (5.23)

where t_0, \omega_0 are the time and frequency centers, and \sigma, \beta are the time spread and frequency modulation rate of the chirp, respectively. In this example, the linear chirp signal has an initial normalized frequency of -0.2, and its instantaneous frequency increases to a normalized frequency of 0.2, with t_0 = \omega_0 = 0. The chirp signal overlaps with the two Gabor logons in the time domain, so it is not possible to separate them using time domain decomposition approaches. However, Figure 5.2 illustrates that these three signals can be effectively extracted on the time-frequency plane using the proposed method through an optimal rotation under the divergence criterion, with an average SIR of 37.5169 dB. Moreover, the convergence rate is high, as shown in Figure 5.3.

Figure 5.2. The mixture and the separation of a chirp and two Gabor logons: (a) the mixture; (b) and (d) the two extracted Gabor logons; (c) the extracted chirp

Figure 5.3. The cost function versus the number of iterations for Example 1

Example 2: Separation of two crossing chirp signals

In this example, we consider the separation of two signals overlapping in the time-frequency domain. A mixture of two linear chirp signals is used for source separation. One of the chirp signals has an initial normalized frequency of -0.8, and its instantaneous frequency increases to a normalized frequency of 0.8; the other has an initial normalized frequency of 0.8, and its instantaneous frequency decreases to a normalized frequency of -0.8. Obviously, these two chirp signals overlap with each other in both the time and frequency domains, so typical time domain or frequency domain separation methods cannot perfectly recover them. Figure 5.4 shows that using the proposed approach, we can successfully separate these two chirp signals from their mixtures with an average SIR of 32.1164 dB. It can be seen from the figures that most of the error occurs in the time-frequency region where the two signals overlap. This is due to the fact that the crossing chirps do not exactly satisfy our underlying assumption of disjoint sources.

Figure 5.4. The mixture and the separation of two crossing chirp signals: (a) the mixture; (b) and (c) the separated signals
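A sketch of how the synthetic test signals of Examples 1 and 2 might be generated is shown below; the spread parameters, the time center of the chirp, and the convention that a normalized frequency of 1.0 corresponds to pi radians per sample are our assumptions, and amplitude normalization is omitted.

```python
import numpy as np

def gabor_logon(t, t0, w0, sigma):
    # Modulated Gaussian as in equation (5.22), up to normalization.
    return np.exp(-(t - t0) ** 2 / (2 * sigma ** 2)) * np.exp(1j * w0 * t)

def linear_chirp(t, t0, w0, beta, sigma):
    # Linear Gaussian chirp as in equation (5.23): instantaneous
    # frequency w0 + beta * (t - t0) around the time center t0.
    phase = (w0 + 0.5 * beta * (t - t0)) * (t - t0)
    return np.exp(-(t - t0) ** 2 / (2 * sigma ** 2)) * np.exp(1j * phase)

t = np.arange(200)
s1 = gabor_logon(t, t0=50, w0=0.7 * np.pi, sigma=8.0)
s2 = gabor_logon(t, t0=150, w0=-0.7 * np.pi, sigma=8.0)
# Chirp sweeping from normalized frequency -0.2 to 0.2 over the window;
# the time center is placed mid-window so the sweep spans the record.
s3 = linear_chirp(t, t0=100, w0=0.0, beta=0.4 * np.pi / 200, sigma=40.0)
```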
Example 3: Separation of two speech sources

In this example, we consider mixtures of two speech signals from two speakers, one female and one male. The two speakers' voices are recorded by two microphones 3 m directly in front of the speakers in an anechoic chamber. Due to time delay, these two signals partly overlap with each other in the time domain. The TFDs of the original speech signals and their mixtures are shown in Figure 5.5 and Figure 5.6, respectively. Figure 5.7 shows the TFDs of the speech signals extracted by the proposed method. The SIR is 26.0482 dB in this case, since the two speech signals have partial overlap on the time-frequency plane.

Figure 5.5. TFDs of the two individual speech signals: (a) TFD of the female speaker; (b) TFD of the male speaker

In order to further investigate the robustness of the proposed algorithm for real-life signals, we add white Gaussian noise to the two speech signals over an SNR range of [-8, 8] dB. It is shown in Figure 5.8 that the proposed method is robust against noise and achieves separation of the speech signals even at low input SNRs.

Example 4: Performance comparison with the STFD and FastICA methods

In order to further evaluate the performance of the proposed approach, we compare it with two different methods: a time-frequency based source separation method, the spatial time-frequency distribution (STFD) method [58], and an information-theoretic method, FastICA [84], adapted to the time-frequency domain. The STFD method is based on the joint diagonalization of a combined set of spatial time-frequency distribution matrices. STFD matrices are made up of the auto- and cross-TFDs of the data snapshots across the multisensor array, and they are expressed in terms of the TFD matrices of the sources. The diagonal structure of the TFD matrix of the sources is essential for the STFD method and is enforced by using only the information in the time-frequency points corresponding to the signal auto-terms. The benefit of using STFDs in a non-stationary signal environment is the direct exploitation of the information brought by the non-stationarity of the signals.

FastICA combines Comon's information-theoretic approach [9] and the projection pursuit approach [85]. A family of contrast (objective) functions for ICA is introduced using maximum entropy approximations of differential entropy. These contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and the estimation of individual independent components as projection pursuit directions. The FastICA algorithm has a fast convergence rate and is robust under noise. In this example, the TFD observation matrix, X in equation (5.7), is used as the input to the FastICA algorithm to achieve source separation. SIR is used as the performance criterion for the two crossing chirp signals discussed in Example 2, with white Gaussian noise added over an SNR range of [-8, 8] dB.
We use 100 Monte Carlo simulations for each noise level. Figure 5.9 compares the robustness of the proposed approach in noise to the STFD and FastICA methods. It is evident that both the proposed approach and the STFD method are superior to FastICA under noise. This is due to the fact that the assumption that the underlying source signals are independent does not necessarily apply to time-frequency distributions. We can also see that the proposed approach performs better than the STFD method as the noise level increases. The reason is that as the noise level increases, the energy of the cross-terms increases, making it harder to differentiate between the auto- and the cross-terms in the STFD method. This results in errors in the estimation of the sources, consequently reducing the SIR. The proposed method, on the other hand, assumes that the cross-terms are negligible. When the signal is very noisy, the cross-terms between the signal and the noise become significant, and neglecting these cross-terms amounts to denoising the observed TFDs, resulting in higher SIRs. The STFD method also has a higher computational complexity than the proposed method, since it computes both the auto- and cross-terms of the observed signals, whereas our method uses only the auto-terms by using a RID kernel.

Figure 5.6. TFDs of the observed signals: (a) TFD of the first mixture; (b) TFD of the second mixture

Figure 5.7. TFDs of the extracted signals: (a) TFD of the estimate of the female speaker; (b) TFD of the estimate of the male speaker

Figure 5.8. Output SIR versus input SNR for speech signals

Figure 5.9. Comparison of output SIR versus input SNR for three different source separation methods

Example 5: Performance comparison with PCA

In this example, we compare the proposed source separation method with PCA for the mixture of the two Gabor logons in Example 1. PCA is an orthogonal decomposition of the observed data matrix, just like the proposed method. However, the cost functions used are different, which results in the difference seen in the extracted components in Figure 5.10. With PCA, the variance explained by each component is maximized, whereas the proposed method maximizes the divergence between the components, resulting in better separated sources.

Example 6: Number of mixtures greater than the number of sources

In this example, we consider a more general situation where the number of mixtures is larger than the number of sources. For M mixtures and N sources (M > N), we construct a new N x M rotation matrix as follows:

G_{NM}(\theta) = I_{NM} G_M(\theta),   (5.24)

where G_M(\theta) is an M x M rotation matrix given by equation (5.17), and I_{NM} is an N x M matrix whose elements equal 1 if i = j and 0 otherwise, with i, j the row and column indices, respectively. The source signals are the chirp signal and one of the two Gabor logons in Example 1. We use the proposed approach with this new rotation matrix to extract these two signals from their three mixtures. It is shown in Figure 5.11 that the source signals can be effectively extracted when the number of mixtures is greater than the number of sources, with an average SIR of 39.8827 dB.
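A sketch of the construction in equation (5.24), the first N rows of an M-dimensional rotation, might look as follows; the function name is ours.

```python
import numpy as np
from itertools import combinations

def rectangular_rotation(thetas, N, M):
    # G_NM(theta) = I_NM @ G_M(theta): an N x M transform obtained by
    # keeping the first N rows of an M x M rotation built from
    # M(M-1)/2 Givens rotations, equation (5.24).
    G = np.eye(M)
    for (a, b), th in zip(combinations(range(M), 2), thetas):
        R = np.eye(M)
        R[a, a] = R[b, b] = np.cos(th)
        R[a, b], R[b, a] = np.sin(th), -np.sin(th)
        G = G @ R
    return np.eye(N, M) @ G        # I_NM has ones where i == j
```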
Figure 5.10. Comparison with PCA for two Gabor logon extraction: (a) the mixture; (b) and (c) the components extracted by the proposed method; (d) and (e) the components extracted by PCA

5.5 Summary

In this chapter, a new approach is presented for the separation of non-stationary signals on the time-frequency plane using an information-theoretic cost function. The proposed algorithm assumes the disjointness of the underlying signals on the time-frequency plane. This assumption allows us to extract the sources through an N-dimensional Givens rotation. Using the Jensen-Rényi divergence as the cost function, a steepest descent algorithm is implemented to update the rotation angles. Several examples are given to illustrate the performance of the proposed algorithm for synthesized and real life signals. Issues regarding convergence rate and robustness under noise are investigated. The performance of the algorithm is illustrated under noise and is compared to PCA and ICA as adapted to the time-frequency plane, and to STFD. The results illustrate that maximizing the divergence on the time-frequency plane can separate sources that are disjoint in the time-frequency domain, and that this approach is better than the mutual information cost function used in ICA in terms of fidelity to the original sources. The proposed method also outperforms STFD at high noise levels, since it assumes the cross-terms between sources are negligible, which effectively denoises the observed time-frequency matrix, and it is clearly superior to PCA.

Figure 5.11. Separation of a chirp and a Gabor logon from their three mixtures: (a) the mixture; (b) the extracted Gabor logon; (c) the extracted chirp

CHAPTER 6

UNDERDETERMINED BLIND SOURCE SEPARATION IN THE TIME-FREQUENCY DOMAIN

6.1 Introduction

Underdetermined blind source separation (UBSS) is a more challenging problem than the (over)determined case because, contrary to the (over)determined case, estimating the mixing system is not sufficient for reconstruction of the sources, since the mixing matrix is not invertible. Therefore, we need additional a priori information about the sources to allow for reconstruction. One increasingly popular and powerful assumption is the sparsity of the sources on a given basis [5,86,87]. A signal is said to be sparse when it is zero or nearly zero more often than might be expected from its variance. Such a signal has a probability density function or distribution of values with a sharp peak at zero and fat tails. The advantage of a sparse signal representation is that the probability of two or more sources being simultaneously active is low. Thus, sparse representations lend themselves to good separability, because most of the energy in a basis coefficient at any time instant belongs to a single source. This statistical property of the sources results in a well-defined structure being imposed by the mixing process on the resultant mixtures, which can be exploited to make estimating the mixing process much easier. Sparse representation of the signals, which is modeled by matrix factorization, has received a great deal of interest and has been applied to blind source separation in recent years.
In several references, the mixing matrix and sources are estimated using the maximum posterior approach, the maximum likelihood approach, and the expectation maximization algorithm [50,88-92]. However, these algorithms may get stuck in local minima and have poor convergence properties. In [93], a blind source separation method is developed via multi-node sparse representation. Based on several subsets of wavelet packet coefficients, the mixing matrix is estimated using the Fuzzy C-means clustering algorithm, and the sources are recovered using the inverse of the estimated mixing matrix. However, the case of fewer sensors than sources is not discussed.

In this chapter, we introduce a sparse factorization approach to the UBSS problem in the time-frequency domain, in which the mixing matrix is estimated using the K-means clustering method, while the sources are estimated using a linear programming method. In [94], equivalence results for the l0-norm solution and the l1-norm solution are obtained using a probabilistic approach. These results show that if the sources are sufficiently sparse in the analyzed domain, they are more likely to be equal to the l1-norm solution, which can be obtained using a linear programming method.

6.2 Sparse Factorization Approach for UBSS in the Time-Frequency Domain

In this section, a two-stage sparse factorization approach for the UBSS problem in the time-frequency domain is presented: the first stage determines the mixing matrix, and the second stage estimates the source signals.

6.2.1 Linear Mixture Model and Assumptions

We first present the system model and assumptions. The observed M mixtures, z(t) = [z_1(t), z_2(t), \ldots, z_M(t)]^T, of the N non-stationary complex source signals, s(t) = [s_1(t), s_2(t), \ldots, s_N(t)]^T, may be modeled linearly in the time domain as

z(t) = B s(t),   (6.1)

where B is the M x N instantaneous mixing matrix.

Each mixture, z_i(t), is transformed to the time-frequency plane, and the corresponding time-frequency distribution is vectorized to form a matrix of time-frequency distributions, X. From Chapter 5, it is known that the time-frequency distributions of the mixtures, X, can be written approximately as a linear combination of the original sources' TFDs, S, assuming the cross-terms between the sources are made negligible by using a RID:

X \approx B^2 S = AS,   (6.2)

where X = [x_1, \ldots, x_P] \in R^{M \times P}, S = [s_1, \ldots, s_P] \in R^{N \times P}, P = I \times L, I and L are the numbers of time and frequency points respectively, A = B^2 = [a_1, \ldots, a_N] \in R^{M \times N} with normalized columns, i.e., ||a_1|| = \cdots = ||a_N|| = 1, and B^2 is the element-by-element square of the mixing matrix in the time domain. The task of BSS is to recover the sources S using only the mixture matrix X. Here, we assume M < N, which makes the BSS problem underdetermined, and that the source signals are sparse in the time-frequency domain. The sparsity of the sources plays a key role in this chapter.

It is well known that in general there exist many possible solutions to the model (6.2). For a given mixing matrix, under the l1-norm sparsity measure, a uniqueness result for the sparse solution is obtained, and the number of nonzero entries of the sparse solution cannot be reduced. It is also found that the matrix whose column vectors are the cluster centers of the mixtures X is a sub-optimal mixing matrix, which can be obtained using the K-means clustering algorithm [94].
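The construction of the mixture matrix X in (6.2) amounts to flattening each channel's TFD into a row; a minimal sketch, with hypothetical shapes and random data standing in for real TFDs, is given below.

```python
import numpy as np

rng = np.random.default_rng(1)
M, I, L = 2, 50, 64                     # channels, time points, frequency points

# One TFD per observed mixture; random non-negative data as a stand-in.
tfds = np.abs(rng.standard_normal((M, I, L)))

# Vectorize each TFD into a row of X (equation (6.2)), so X is M x P
# with P = I * L.
X = tfds.reshape(M, I * L)

# Column-normalization convention for the mixing matrix: ||a_k|| = 1.
A = rng.standard_normal((M, 4))         # M x N with N = 4 sources (illustrative)
A /= np.linalg.norm(A, axis=0, keepdims=True)
```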
6.2.2 K-means Clustering

K-means clustering is an iterative algorithm that seeks to minimize a squared-error criterion function in order to separate a completely unknown set of data into K different groupings [95]. Suppose that x_1, x_2, \ldots, x_n are the vector observations in a data set and are realizations of K different distributions of random variables. Then \mu_1, \mu_2, \ldots, \mu_K are the mean vectors of these distributions, and K-means clustering seeks to categorize the observations, x_i, into one of the K distributions such that the squared Euclidean distance, ||x_i - \mu_j||^2, is minimized. However, since the properties of the data set are unknown, \mu_1, \mu_2, \ldots, \mu_K must be estimated first. As a starting point, K random samples of the data are chosen as the initial mean estimates, \hat{\mu}_j. The distributions are then estimated by classifying each point, x_i, into the group whose estimated mean it is closest to in the squared Euclidean sense, so that x_i belongs to group j when j satisfies

j = \arg\min_{j} ||x_i - \hat{\mu}_j||^2.   (6.3)

Once all data points are classified, the mean of each group is recalculated. Suppose m_j is the number of data points in the jth distribution, and x_{1j}, x_{2j}, \ldots, x_{m_j j} are these data points. The new mean is then calculated as

\hat{\mu}_j = \frac{1}{m_j} \sum_{i=1}^{m_j} x_{ij}.   (6.4)

This process is repeated until convergence, when the estimated means no longer change upon further iterations.

6.2.3 Determination of the Mixing Matrix

Due to the sparsity of the source signals in the time-frequency domain, there exist many columns of S with only one nonzero entry. For instance, suppose that s_{i_1}, \ldots, s_{i_K} are K columns of S in which only the first entry is nonzero. Then we have

A s_{i_j} = a_1 s_{1 i_j}, \quad j = 1, \ldots, K,   (6.5)

and

[x_{i_1}, \ldots, x_{i_K}] = A [s_{i_1}, \ldots, s_{i_K}] = [a_1 s_{1 i_1}, \ldots, a_1 s_{1 i_K}],   (6.6)

where x_{i_j} is the i_jth column of X, a_1 is the first column of A, and s_{1 i_j} is the first entry of s_{i_j}. From equation (6.6), we see that each x_{i_j} equals a_1 multiplied by the scalar s_{1 i_j}, which means that these K column vectors of X, x_{i_1}, \ldots, x_{i_K}, are distributed along the direction of a_1. Thus, ideally, after normalization, x_{i_1}, \ldots, x_{i_K} are mapped to a unique vector on the multidimensional unit circle which is equal to a_1. In practice, however, the sources are unlikely to be completely sparse in the time-frequency domain. That is, s_{i_1}, \ldots, s_{i_K} have dominant first entries, so that s_{1 i_j} >> s_{r i_j} for r \neq 1 and j = 1, \ldots, K. When more than one source is nonzero, x_{i_1}, \ldots, x_{i_K} are not exactly in the same direction as a_1, but rather in the neighborhood of a_1. This means that a_1 lies at the center of x_{i_1}, \ldots, x_{i_K}. Therefore, we use the K-means clustering method to cluster the column vectors of the mixture matrix X into different clusters, where the center of each cluster corresponds to one column vector of the mixing matrix A. By doing so, we obtain an estimate of the mixing matrix A. The algorithm is summarized as follows:

Algorithm 1: Determining the mixing matrix

1. Normalize the column vectors of the mixture matrix X.

2. Take a sufficiently large positive integer K as the number of clusters. Choose the initial points of the iteration and the distance measure criterion; here, we choose the squared Euclidean distance as the criterion.

3. Perform the K-means clustering iteration followed by normalization to estimate the sub-optimal mixing matrix. Note that if two column vectors have opposite directions, only one is taken. A sketch of this procedure is given below.
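The sketch below is one way Algorithm 1 might be realized in Python; the significance threshold, the sign-folding heuristic, and the function name are our assumptions rather than part of the algorithm as stated.

```python
import numpy as np

def estimate_mixing_matrix(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: keep significant columns and normalize them to unit length.
    keep = np.linalg.norm(X, axis=0) > 1e-6 * np.abs(X).max()
    cols = X[:, keep] / np.linalg.norm(X[:, keep], axis=0, keepdims=True)
    # Fold opposite directions together (simple sign heuristic).
    cols[:, cols[0] < 0] *= -1
    # Steps 2-3: K-means with the squared Euclidean distance criterion.
    centers = cols[:, rng.choice(cols.shape[1], K, replace=False)]
    for _ in range(iters):
        d = ((cols[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)
        labels = d.argmin(axis=0)       # nearest center per column
        for k in range(K):
            if np.any(labels == k):
                centers[:, k] = cols[:, labels == k].mean(axis=1)
    # Normalized cluster centers estimate the columns of A.
    return centers / np.linalg.norm(centers, axis=0, keepdims=True)
```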
6.2.4 Estimation of the Source Signals for a Given Mixing Matrix

After obtaining the estimated mixing matrix, the next stage is to estimate the source signals. For a given mixing matrix A in model (6.2), the source signals can be estimated by maximizing the posterior distribution P(S|X, A) of S [96]. Under the assumption that the prior is Laplacian, maximizing the posterior distribution can be implemented by solving the following optimization problem [6]:

\min \sum_{i=1}^{N} \sum_{j=1}^{P} |s_{ij}|, \quad \text{subject to } AS = X.   (6.7)

Hence, the l1-norm

J_1(S) = \sum_{i=1}^{N} \sum_{j=1}^{P} |s_{ij}|   (6.8)

can be used as the sparsity measure. It is not difficult to prove that the optimization problem (6.7) is equivalent to the following set of P smaller scale linear programming (LP) problems:

\min \sum_{i=1}^{N} |s_{ij}|, \quad \text{subject to } A s_j = x_j \text{ for } j = 1, \ldots, P.   (6.9)

By setting S = U - V, where U = [u_{ij}] \in R^{N \times P} \geq 0 and V = [v_{ij}] \in R^{N \times P} \geq 0, equation (6.9) can be converted to the following standard LP problems with nonnegative constraints:

\min \sum_{i=1}^{N} (u_{ij} + v_{ij}), \quad \text{subject to } [A, -A][u_j^T, v_j^T]^T = x_j, \; u_j \geq 0, \; v_j \geq 0 \text{ for } j = 1, \ldots, P.   (6.10)

Combining the discussion of this section and the previous sections, we have the algorithm for estimating the source signals:

Algorithm 2: Blindly separating the sparse source signals

1. Prethreshold the mixture matrix X to obtain a sparser data matrix.

2. Estimate the mixing matrix A using Algorithm 1 and the prethresholded data.

3. Using the estimated mixing matrix A and the mixtures X, estimate the time-frequency representations S by solving the set of LP problems (6.10). A sketch of this LP stage is given after Example 1 below.

6.3 Experimental Results and Analysis

In this section, several examples are used to illustrate the effectiveness of the proposed approach in separating sparse source signals from a smaller number of mixtures in the time-frequency domain. The binomial kernel [23] is used for computing the TFD, since it belongs to the class of reduced interference distributions (RIDs).

Example 1: The set of observed signals consists of two linear combinations of four Gabor logons. The four Gabor logons are centered at the time sample point and normalized frequency pairs (30, 0.7), (160, -0.7), (70, -0.4), and (120, 0.1), respectively. Each observed signal is first transformed to the time-frequency domain with I = 50 time samples and L = 64 frequency samples. Each TFD is then vectorized to form a TFD mixture matrix X = [X_1; X_2] of size 2 x 3200.

Figure 6.1 presents a scatter plot of the mixtures X (X_2 vs. X_1) in the time-frequency domain. It can be seen from this plot that almost all significant data points are distributed along four different directions, thus providing very good separability. The separation results using the proposed approach are illustrated in Figure 6.2. Figure 6.2 (a) and (b) show the two mixtures; the four extracted Gabor logons are shown in Figure 6.2 (c), (d), (e), and (f). The results indicate that these four Gabor logons can be successfully separated from their two mixtures using the proposed approach based on their sparsity, with an average signal to interference ratio (SIR) of 36.1251 dB.

Figure 6.1. Scatter plot of two mixtures of four Gabor logons in the time-frequency domain
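Step 3 of Algorithm 2, the l1-minimal recovery of equation (6.10), can be sketched with an off-the-shelf LP solver; the use of scipy's linprog and the function name are our choices, not part of the original formulation.

```python
import numpy as np
from scipy.optimize import linprog

def recover_sources(A, X):
    # For each column x_j of X, solve
    #   min sum(u + v)  s.t.  [A, -A][u; v] = x_j,  u >= 0, v >= 0,
    # and recover s_j = u - v (equation (6.10)).
    M, N = A.shape
    S = np.zeros((N, X.shape[1]))
    c = np.ones(2 * N)
    A_eq = np.hstack([A, -A])
    for j in range(X.shape[1]):
        res = linprog(c, A_eq=A_eq, b_eq=X[:, j], bounds=(0, None))
        if res.success:
            S[:, j] = res.x[:N] - res.x[N:]
    return S
```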
Example 2: Two mixtures of a chirp signal and two Gabor logons are given. The chirp signal has a linear frequency increasing from an initial normalized frequency of -0.2 to a normalized frequency of 0.2. The Gabor logons are the first two Gabor logons given in Example 1. A scatter plot of the two mixtures in Figure 6.3 shows that, as in the first example, the data points belonging to different sources are distributed along different directions, here three of them. Since the chirp signal overlaps with the two Gabor logons in the time domain, typical time domain separation methods cannot perfectly recover them. However, Figure 6.4 illustrates that these three signals can be effectively extracted in the time-frequency domain using the proposed method, with an average SIR of 32.7634 dB.

Example 3: In this example, the same two mixtures of four Gabor logons given in Example 1 are used. The effectiveness of the presented approach is compared for TFDs and wavelet packets (WP) in the presence of noise. The Haar wavelet with five levels is used for the wavelet packet decomposition. To show the effect of the increased sparsity of TFDs, the mixtures are considered at different levels of white Gaussian noise, with 100 Monte Carlo simulations for each noise level. The average mean squared error (MSE) between the extracted sources and the original sources is calculated for both the TFD and WP. The TFD and WP provide similar results when there is no noise. However, as the noise level increases, the performance of the WP degrades rapidly compared to that of the TFD. The MSE versus the signal-to-noise ratio (SNR) is shown in Figure 6.5 for both the TFD and WP. This result shows that the RID yields a sparser time-frequency surface than the WP, which improves the robustness of BSS under noise.

6.4 Summary

In this chapter, a two-stage approach is introduced for the underdetermined blind separation of sparse and non-stationary sources using TFDs. The mixing matrix is estimated using the K-means clustering algorithm based on the sparsity of the sources; for a given mixing matrix, the sources are extracted using a linear programming method. The performance of the proposed approach is compared with wavelet packets under different noise levels. The results show that the presented method is simple and effective at separating the sources from their mixtures, and is more robust than wavelet packets in noisy environments.

Figure 6.2. The mixtures and the separation of four Gabor logons: (a) and (b) two mixtures; (c), (d), (e), and (f) four extracted Gabor logons

Figure 6.3. Scatter plot of two mixtures of a chirp and two Gabor logons in the time-frequency domain

Figure 6.4. The mixtures and the separation of a chirp and two Gabor logons: (a) and (b) two mixtures; (c) extracted chirp; (d) and (e) two extracted Gabor logons

Figure 6.5. Comparison of MSE versus SNR for extracted sources with TFD and WP

CHAPTER 7

APPLICATIONS OF THE UBSS ALGORITHM TO ELECTROENCEPHALOGRAM SIGNALS

It is well-known that there is a broad range of applications for blind source separation. A classical example is the cocktail party problem. A more practical application is noise reduction. Another application area is economic time series [97]. Recently, telecommunications applications have also been published [1].
Besides these applications, one popular application of source separation is biomedical signal processing [3], such as the separation of electroencephalogram (EEG) signals, which consist of recordings of brain activity obtained using electrodes attached to the scalp. Decomposition of evoked field potentials measured by EEG [98] is an application of considerable interest in the neurosciences. In this chapter, we apply the UBSS approach proposed in Chapter 6 to EEG signals in order to evaluate its effectiveness on real life signals.

7.1 Introduction to Electroencephalogram and Event-Related Potential

7.1.1 Electroencephalogram

Electroencephalography is the neurophysiologic measurement of the electrical activity of the brain, recorded by electrodes placed on the scalp or, in special cases, subdurally or in the cerebral cortex. The resulting traces are known as an electroencephalogram (EEG) and are reflections of the temporal and spatial summation of synchronized post-synaptic cortical potentials [99]. Specifically, EEG data represents the synchronous activity of large cortical groups of neurons, measured as integrated electrical signals on the scalp.

In conventional scalp EEG, the recording is obtained by placing the electrodes on the scalp in special positions with a conductive gel. Some EEG systems use a plastic cap into which the electrodes are embedded. The electrode positions on the scalp are identified by the recordist, who measures the head using the International 10-20 System. This relies on taking measurements between certain fixed points on the head; the electrodes are then placed at points that are 10% and 20% of these distances. Each electrode site is labeled with a letter and a number. The letter refers to the area of the brain underlying the electrode, e.g. F for the frontal lobe and T for the temporal lobe. Even numbers denote the right side of the head and odd numbers the left side.

EEG activity can be subdivided into various types of frequency rhythms or bands. Research has indicated that different EEG frequency bands are associated with different mental states [100]. In general, EEG activity is broken down into 4 distinct frequency bands:

1. Beta activity, 13 Hz-30 Hz. Beta activity is a normal activity present when the eyes are open or closed. It tends to be seen in the channels recorded from the center or front of the head.

2. Alpha activity, 8 Hz-13 Hz. Alpha activity is also a normal activity when present in waking adults. It is mainly seen in the channels recorded from the back of the head. It is fairly symmetrical and has an amplitude of 40 uV to 100 uV. It is only seen when the eyes are closed and should disappear or reduce in amplitude when the eyes are open.

3. Theta activity, 4 Hz-7 Hz. Theta activity can be classed as both a normal and an abnormal activity depending on the age and state of the patient. In adults it is normal if the patient is drowsy; however, it can also indicate brain dysfunction in a patient who is alert and awake. In younger patients, theta activity may be the main activity seen in channels recorded from the back and central areas of the head.

4. Delta activity, < 4 Hz. Delta activity is only normal in an adult patient if they are in a moderate to deep sleep. Seen at any other time, it would indicate brain dysfunction.

EEG is preferred in many applications for exploring brain activity thanks to its high time resolution.
Other methods for studying brain activity have time resolutions between seconds and minutes, while EEG has a temporal resolution down to the sub-millisecond range [101]. As the brain is thought to work through its electrical activity, EEG is the only method to measure it directly. Other methods for exploring brain function rely on blood flow or metabolism, which may be decoupled from the brain's electrical activity.

7.1.2 Event-Related Potential

An event-related potential (ERP) is an electrophysiological response to an internal or external stimulus. More simply, it is a measured brain response that is directly the result of a thought or perception. ERPs can be reliably measured using EEG. As the EEG reflects thousands of simultaneously ongoing brain processes, the brain response to a certain stimulus or event of interest is usually not visible in the raw EEG. One of the most robust features of the ERP response is the response to unpredictable stimuli. This response, known as the P300 (or simply "P3"), manifests as a positive deflection in voltage approximately 300 milliseconds after the stimulus is presented.

In actual recording situations, it is difficult to see an ERP after the presentation of a single stimulus. Rather, the most robust ERPs are seen after many dozens or hundreds of individual presentations are averaged together. This technique cancels out noise in the data, allowing only the voltage response to the stimulus to stand out clearly. While evoked potentials reflect the processing of the physical stimulus, ERPs are caused by "higher" processes that might involve memory, expectation, attention, or changes in mental state, among others.

ERPs have found numerous applications in clinical neurophysiology and psychiatry [102,103]. This is because their recording is noninvasive and accurate, and they are consistently shown to be an indicator of brain function and its abnormalities. The clinical applications of ERPs could be significantly extended if they could be interpreted more accurately and effectively. This requires the development of novel signal processing methods. In recent years, there has been tremendous growth in applying signal processing techniques such as independent component analysis, wavelets, and time-frequency methods to separating the source signals and extracting useful information about the underlying brain activity [13,104].

Event-related potentials, like most other real life signals, are non-stationary, and thus can best be tackled using non-stationary signal analysis techniques such as time-frequency distributions (TFDs) and wavelet analysis. In the next section, we apply the UBSS algorithm presented in Chapter 6 to an ERP data set in the time-frequency domain and compare its performance with ICA, which is one of the main methods used for the extraction of EEG/ERP sources in both research and practice.

7.2 Experimental Results and Performance Analysis

7.2.1 EEG/ERP Data Set

The EEG/ERP data analyzed in this chapter was released by the Swartz Center for Computational Neuroscience at the University of California, San Diego [105]. The study consisted of one subject and 32 electrodes. In the selective visual attention experiment, there were two types of events, "square" and "rt", with "square" events corresponding to the appearance of a green colored square in the display and "rt" to the reaction time of the subject. The square could be presented at five locations on the screen distributed along the horizontal axis.
Here we only consider presentation on the left, i.e. positions 1 and 2 as indicated by the "position" field (at about 3 degrees and 1.5 degrees of visual angle, respectively). In this experiment, the subject covertly attended to the selected location on the computer screen and responded with a quick thumb button press only when a square was presented at this location. The subject was to ignore circles presented either at the attended location or at an unattended location. To reduce the amount of data to process, the data set used in our analysis contains only target (i.e., "square") stimuli presented at the two left-visual-field attended locations. Furthermore, 6 electrodes are chosen from the 32: F3, F4, Cz, P3, P4, and Oz in the International 10-20 System. The stimulus is repeated 40 times, resulting in a total of 80 trials per electrode.

7.2.2 Single-Trial EEG

The goal in single-trial EEG analysis is to extract the individual underlying sources in the brain which are generated in a localized area. With successful source extraction, analysis of individual responses of the brain can be performed, and the dynamic variability of the EEG responses can be compared on a trial to trial basis. In this way, observations can be made on all factors affecting the subject's performance.

A comparison is made between the algorithm outlined in Chapter 6 and ICA applied to the same data. Both of these BSS techniques are applied to all 80 trials of data available. In the application of the proposed UBSS approach of Chapter 6, the number of sources to extract, k, must first be chosen. This value is empirically chosen such that it is greater than the number of electrodes, 6. In our analysis, the experiment is done using 32 frequency bins, for which k is 8. This number is chosen since a higher number of sources resulted in sources that were too sparse and did not correspond to actual neuronal activity. ICA is then applied to the same data. Since only 6 mixtures are used, ICA can only extract 6 components per trial. The results for ICA are in the time domain, so they are converted to the time-frequency plane at the same frequency resolution of 32 frequency bins for the purposes of comparison.

Figure 7.1 and Figure 7.2 illustrate the results for one trial in the time-frequency domain. Similar results are obtained over all 80 trials. The sources from the proposed technique show in general less activity, i.e. more sparsity, on the time-frequency plane than the sources from ICA. It is, however, difficult to compare results on the single-trial level, since the underlying source generators are not actually known, and since a different number of components is extracted by each technique. It is also difficult because there are 80 individual trials. An attempt must be made to generalize the results.

7.2.3 Data Reduction

In order to evaluate the performance of ICA and the proposed UBSS method, the single-trial results are put together in their respective groups depending on stimulus type. K-means clustering is carried out over all extracted components from the subject, and the extracted cluster centers represent similar components across all trials. These components are then representative of the most prevalent sources extracted throughout the trials for each stimulus. Evaluation of these cluster centers is then carried out in an attempt to compare the general results of ICA to those of the proposed method.

The results of one trial are represented by the matrix S_v, which is of size k x P.
Each component in the time-frequency domain is first vectorized to form a vector of length P, which in our case is equal to 2112. These vectors are then put into a matrix, representing the k extracted components from trial v, each over P time-frequency points. For the data reduction of all results for a particular stimulus, the extracted matrices over all trials are appended to form a new matrix, S_u, such that

S_u = \begin{bmatrix} S^1 \\ S^2 \\ \vdots \\ S^{40} \end{bmatrix} = \begin{bmatrix} s_1^1(1) & \cdots & s_1^1(P) \\ \vdots & & \vdots \\ s_k^1(1) & \cdots & s_k^1(P) \\ \vdots & & \vdots \\ s_1^{40}(1) & \cdots & s_1^{40}(P) \\ \vdots & & \vdots \\ s_k^{40}(1) & \cdots & s_k^{40}(P) \end{bmatrix},   (7.1)

where u = \{1, 2\} represents the stimulus position number, and S_u is of size 40k x P. Each element, s_i^v(j), is one time-frequency point of source i from trial v.

K-means clustering is then carried out on each S_u, where each of its rows is grouped into one of K clusters based on its squared Euclidean distance to the cluster center, as described in Section 6.2.2. The clustering algorithm is run 10 times to avoid randomness in the final cluster results. We run K at 8, 12, and 16 to get an idea of how different numbers of components may affect the outcome.

The rows of S_u are then grouped by a hierarchical clustering method based on the results of the 10 K-means runs. A matrix, R, of zeros of size 40k x 40k is created, and each entry is updated iteratively. The entry r_{ij} represents how many times out of 10 row i of S_u was grouped into the same cluster as row j of S_u. This matrix then serves as a similarity measure: the more times two sources were grouped together by K-means, the more similar they are. The diagonal entries, r_{ii}, represent how many times each source was grouped with itself; these entries are ignored because they are all 10 and are meaningless.

A hierarchical clustering is then carried out using the similarity matrix, R, as its distance metric. In the first step, each row of S_u is in its own cluster. The second step groups together all rows with a similarity value of 10 in the matrix R. Next, all rows with a similarity of 9 are grouped. If a group already exists, then the average similarity between a row and all rows already in the cluster is used. The next step will then group together a cluster with another cluster or an individual row if it has the highest similarity value; if not, all rows with a similarity value of 8 are grouped together. This is repeated until the number of clusters is reduced to K. Cluster centers are then calculated as the mean of the time-frequency components in each cluster, and these are the components that categorize all single-trial EEG results. For example, a set of extracted components for K = 8 is shown in Figure 7.3 and Figure 7.4 for ICA and the proposed UBSS method, respectively. A sketch of the similarity-matrix construction is given below.
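The co-occurrence similarity matrix R described above is straightforward to build; the following sketch assumes the cluster labels of each of the 10 runs are available as integer arrays, and the function name is ours.

```python
import numpy as np

def cooccurrence_similarity(labels_per_run):
    # r_ij counts how many runs placed rows i and j of S_u in the
    # same cluster; the diagonal is zeroed since self-similarity
    # carries no information.
    labels_per_run = np.asarray(labels_per_run)   # shape (runs, n_rows)
    n = labels_per_run.shape[1]
    R = np.zeros((n, n), dtype=int)
    for labels in labels_per_run:
        R += (labels[:, None] == labels[None, :]).astype(int)
    np.fill_diagonal(R, 0)
    return R
```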
7.2.4 Performance Evaluation

The levels to which the extracted components are sparse, disjoint, and localized in the time-frequency domain all speak to how close they may be to an actual underlying source in the brain. The components obtained from the clustering method described in the previous section are evaluated based on these factors. Sparsity is measured using the l1-norm, disjointness using the total inner product between the components, and localization using a measure of entropy on the time-frequency plane.

Since a sparse component must have most of its values close to zero, the l1-norm is a good measure of how sparse a component is, and a smaller l1-norm means a sparser component. The extracted clusters are represented by the K x P matrix C_u, u = \{1, 2\}, each row of which represents one extracted component. Thus each component's sparsity is measured with

\sum_{m=1}^{P} |c_i^u(m)|,   (7.2)

where u refers to the stimulus position, i represents the component number between 1 and K, and P is the number of time-frequency points.

Disjointness between two components is measured using the inner product. A summation of all the pairwise inner products between the K components represents the total level of disjointness over all extracted components. This is computed as

\sum_{i \neq j} \sum_{m=1}^{P} |c_i^u(m) c_j^u(m)|.   (7.3)

Time-frequency localization of each component is computed using a measure of entropy, calculated as

-\sum_{m=1}^{P} |c_i^u(m)| \log_2 |c_i^u(m)|.   (7.4)

A smaller entropy corresponds to a more localized component.

The results calculated for these parameters are shown in Table 7.1, Table 7.2, and Table 7.3. They show that under the proposed UBSS approach, the extracted components are typically more sparse, localized, and disjoint than the extracted components under ICA. This means that under the proposed approach, the components are more likely to be a closer representation of a true source.

Finally, the extracted components from both methods are projected back to the electrodes to show the amount of variance of the original data explained by the sources. From Table 7.4, it is seen that in general slightly more of the variance is explained by the components extracted by ICA than by the presented UBSS method. This is because the presented UBSS method seeks to find the sparsest sources, while ICA seeks to find maximally independent sources. The components with less sparse representations (from ICA) project better back to the original measurements, but it is likely that they are linear sums of further reducible sources.

Table 7.1. Mean measure of l1-norm to show sparsity

Running      Position 1 (u=1)     Position 2 (u=2)
Conditions   UBSS      ICA        UBSS      ICA
K=8          23.03     29.06      21.92     27.63
K=12         22.36     28.31      22.54     28.15
K=16         23.29     28.18      22.43     27.27

Table 7.2. Mean measure of entropy to show time-frequency localization

Running      Position 1 (u=1)     Position 2 (u=2)
Conditions   UBSS      ICA        UBSS      ICA
K=8          9.80      10.29      9.73      10.25
K=12         9.79      10.25      9.80      10.26
K=16         9.85      10.24      9.81      10.20

Table 7.3. Measure of disjointness by correlation between components

Running      Position 1 (u=1)     Position 2 (u=2)
Conditions   UBSS      ICA        UBSS      ICA
K=8          3.12      3.63       3.84      4.35
K=12         8.92      9.38       8.30      8.90
K=16         14.05     14.51      15.87     16.35

Table 7.4. Average component projection to electrodes

Running      Position 1 (u=1)     Position 2 (u=2)
Conditions   UBSS      ICA        UBSS      ICA
K=8          0.026     0.039      0.029     0.045
K=12         0.058     0.091      0.065     0.103
K=16         0.129     0.217      0.147     0.246
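The three evaluation measures (7.2)-(7.4) translate directly into code; the small epsilon guarding the logarithm and the function names are our additions.

```python
import numpy as np

def sparsity_l1(c):
    # Equation (7.2): l1-norm of one component; smaller is sparser.
    return np.sum(np.abs(c))

def disjointness(C):
    # Equation (7.3): sum of pairwise absolute inner products between
    # the rows of the K x P component matrix C; smaller is more disjoint.
    K = C.shape[0]
    return sum(np.sum(np.abs(C[i] * C[j]))
               for i in range(K) for j in range(K) if i != j)

def tf_entropy(c, eps=1e-12):
    # Equation (7.4): entropy of |c|; smaller is more localized.
    a = np.abs(c)
    return -np.sum(a * np.log2(a + eps))
```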
7.3 Summary

This chapter discusses the application of the proposed UBSS approach of Chapter 6 to the study of ERPs using EEG measurements to help understand mental processes. Since EEG signals have been shown to be non-stationary, the proposed method is applied to ERP data using TFDs and is compared to the popular ICA algorithm applied to the same multiple trial ERP data set. Data reduction by clustering is performed over all single-trial results to extract components that represent the results. The components were consistently more sparse using the proposed UBSS technique than with ICA, suggesting that ICA tends to extract components that are sums of sources, which can help explain its higher correlation value to the original data. The UBSS technique provided components that are more localized in the time-frequency domain and more distinct from each other than those of ICA.

Figure 7.1. Single-trial results using 32 frequency bins: 6 extracted sources from ICA

Figure 7.2. Single-trial results using 32 frequency bins: 8 extracted sources from the proposed UBSS method

Figure 7.3. Results of component clustering over all single-trial results for stimulus position u = 1 using ICA

Figure 7.4. Results of component clustering over all single-trial results for stimulus position u = 1 using the proposed UBSS method

CHAPTER 8

CONCLUSIONS AND FUTURE WORK

In this dissertation, several problems regarding multichannel signal decomposition and source separation in the time-frequency domain are addressed. A new signal decomposition approach in the time-frequency domain is proposed based on the minimum entropy criterion. The major difference between the proposed approach and conventional component extraction or decomposition methods is the cost function: the cost function that is minimized is entropy on the time-frequency plane, producing compact components that are similar to Gabor logons. Using entropy as the cost function and adopting an adaptive filtering method to update the weights corresponding to each trial, we extract "minimum" entropy components orthogonal to each other. Experimental results show that the proposed approach is effective in determining a small number of components that can be used to represent a large set of data.

This method is further developed for the separation of non-stationary signals on the time-frequency plane. For the overdetermined case, the proposed algorithm assumes the disjointness of the underlying signals on the time-frequency plane. This assumption allows us to extract the sources through an N-dimensional Givens rotation. Using the Jensen-Rényi divergence as the cost function, a steepest descent algorithm is implemented to update the rotation angles. Several examples are given to illustrate the performance of the proposed algorithm for synthesized and real life signals. Issues regarding convergence rate and robustness under noise are investigated. The performance of the algorithm is illustrated under noise and is compared to PCA and ICA as adapted to the time-frequency plane, and to STFD.
The results illustrate that maximizing the divergence on the time-frequency plane can separate sources that are disjoint in the time-frequency domain, and that it outperforms the mutual information cost function used in ICA in terms of fidelity to the original sources. The proposed method also outperforms STFD at high noise levels, since it assumes the cross-terms between sources are negligible, which effectively denoises the observed time-frequency matrix; it is clearly superior to PCA.

In the next part of this dissertation, the BSS problem is extended to the underdetermined case, for which most BSS algorithms are not suitable. A two-stage sparse factorization approach is proposed for UBSS. The first stage is to determine the mixing matrix, which can be estimated using the K-means clustering algorithm: the column vectors of the mixing matrix are the cluster centers of the normalized mixture vectors. The second stage is to estimate the sources. For a given mixing matrix, although there exist an infinite number of solutions in general, the sparse solution with minimum l1-norm is proven to be unique, and it can be obtained using linear programming methods. The results show that if the sources are sufficiently sparse in the time-frequency domain, the proposed approach can separate them effectively from their mixtures. Compared to PCA and ICA, the proposed method does not require the assumption that the sources are orthogonal or mutually independent.

The final part of the work focuses on the application of the proposed UBSS algorithm to multichannel EEG/ERP recordings. Under the assumption that the sources are sparse in the time-frequency domain, all single-trial components are extracted, with the number of sources selected in advance. Then data reduction by clustering is performed over all single-trial results to extract components that represent the results. The performance of the proposed approach is compared with that of ICA. It is concluded that the proposed method is more effective than ICA at extracting neuronal sources that are well localized in time and frequency. These sources are shown to be sparser and more distinct from each other.

Future work includes determining the convergence rates of the proposed algorithms and investigating the effect of the order α of the information-theoretic cost functions on the performance of the proposed algorithms. One problem for K-means clustering is the arbitrary selection of how many sources to extract; this is still an open question. If the number of extracted sources is less than the number of actual sources, some of the actual sources cannot be obtained; on the other hand, if the number of separated components is more than the number of actual sources, some components belonging to the same source are treated as different sources. Thus, it would be more efficient to have the algorithm automatically select the number of sources. In addition, the requirement that the sources be approximately disjoint limits the algorithms; if this assumption could be relaxed, the results would be more reliable, since real sources may not be disjoint. Another area of future work is using signal synthesis methods to transform the extracted sources from the time-frequency domain back to the time domain.

APPENDICES

In this appendix, the derivation of the steepest descent algorithm for N = M = 3 in Chapter 5 will be given explicitly.
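Before the closed-form derivation, a minimal numerical sketch of the resulting optimization loop may help fix ideas. It is an illustration under stated assumptions, not the dissertation's implementation: it parameterizes the 3-D rotation by the three angles of equations (3)-(6) below, evaluates the total pairwise cost of equations (13)-(14) on the rotated time-frequency vectors, and replaces the analytic gradients of equations (10)-(18) with central finite differences; the step size mu, iteration count, and difference width h are illustrative choices.

```python
# Minimal sketch of the steepest descent loop derived in this appendix,
# with finite-difference gradients standing in for equations (10)-(18).
# X is assumed to be a 3 x P array of time-frequency vectors.
import numpy as np

def rotation(theta):
    # R(theta) = R1(theta1) R2(theta2) R3(theta3), equations (3)-(6).
    t1, t2, t3 = theta
    R1 = np.array([[ np.cos(t1), np.sin(t1), 0.0],
                   [-np.sin(t1), np.cos(t1), 0.0],
                   [0.0, 0.0, 1.0]])
    R2 = np.array([[ np.cos(t2), 0.0, np.sin(t2)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(t2), 0.0, np.cos(t2)]])
    R3 = np.array([[1.0, 0.0, 0.0],
                   [0.0,  np.cos(t3), np.sin(t3)],
                   [0.0, -np.sin(t3), np.cos(t3)]])
    return R1 @ R2 @ R3

def cost(theta, X):
    # J = J12 + J13 + J23, each J_ij a normalized correlation of the
    # rotated rows, equations (13)-(14); small J means nearly disjoint rows.
    Y = rotation(theta) @ X
    J = 0.0
    for i in range(3):
        for j in range(i + 1, 3):
            num = np.sum(Y[i] * Y[j])
            den = np.sqrt(np.sum(Y[i] ** 2) * np.sum(Y[j] ** 2))
            J += num / den
    return J

def steepest_descent(X, mu=0.05, iters=500, h=1e-6):
    theta = np.zeros(3)
    for _ in range(iters):
        grad = np.zeros(3)
        for l in range(3):
            # Central finite difference in place of equations (10)-(18).
            e = np.zeros(3)
            e[l] = h
            grad[l] = (cost(theta + e, X) - cost(theta - e, X)) / (2.0 * h)
        theta -= mu * grad    # descend on J to reduce pairwise overlap
    return theta, rotation(theta) @ X
```

Swapping the finite differences for the closed-form gradients derived below changes only the inner loop; the update rule itself is unchanged.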
In this case, the matrix of mixtures is

$$ \mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \mathbf{X}_3 \end{bmatrix} = \begin{bmatrix} X_1(1) & \cdots & X_1(P) \\ X_2(1) & \cdots & X_2(P) \\ X_3(1) & \cdots & X_3(P) \end{bmatrix}, \qquad (1) $$

and the matrix of rotated signals is

$$ \mathbf{Y} = \begin{bmatrix} \mathbf{Y}_1 \\ \mathbf{Y}_2 \\ \mathbf{Y}_3 \end{bmatrix} = \begin{bmatrix} Y_1(1) & \cdots & Y_1(P) \\ Y_2(1) & \cdots & Y_2(P) \\ Y_3(1) & \cdots & Y_3(P) \end{bmatrix}. \qquad (2) $$

We aim to find the optimal rotation transform R(θ) under the Jensen-Rényi divergence criterion to obtain the signals Y = R(θ)X that are disjoint on the time-frequency plane. From [82], we know that any 3-D rotation matrix can be written in the following form:

$$ R(\theta) = R_1(\theta_1)\, R_2(\theta_2)\, R_3(\theta_3), \qquad (3) $$

where

$$ R_1(\theta_1) = \begin{bmatrix} \cos\theta_1 & \sin\theta_1 & 0 \\ -\sin\theta_1 & \cos\theta_1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad (4) $$

$$ R_2(\theta_2) = \begin{bmatrix} \cos\theta_2 & 0 & \sin\theta_2 \\ 0 & 1 & 0 \\ -\sin\theta_2 & 0 & \cos\theta_2 \end{bmatrix}, \qquad (5) $$

$$ R_3(\theta_3) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_3 & \sin\theta_3 \\ 0 & -\sin\theta_3 & \cos\theta_3 \end{bmatrix}, \qquad (6) $$

and θ = [θ1, θ2, θ3]^T. Hence the entries of Y are

$$ Y_1(i) = (\sin\theta_1 \cos\theta_3 - \cos\theta_1 \sin\theta_2 \sin\theta_3)\, X_2(i) + (\sin\theta_1 \sin\theta_3 + \cos\theta_1 \sin\theta_2 \cos\theta_3)\, X_3(i) + \cos\theta_1 \cos\theta_2\, X_1(i), \qquad (7) $$

$$ Y_2(i) = (\cos\theta_1 \cos\theta_3 + \sin\theta_1 \sin\theta_2 \sin\theta_3)\, X_2(i) + (\cos\theta_1 \sin\theta_3 - \sin\theta_1 \sin\theta_2 \cos\theta_3)\, X_3(i) - \sin\theta_1 \cos\theta_2\, X_1(i), \qquad (8) $$

$$ Y_3(i) = -\cos\theta_2 \sin\theta_3\, X_2(i) + \cos\theta_2 \cos\theta_3\, X_3(i) - \sin\theta_2\, X_1(i), \qquad (9) $$

where i = 1, 2, ..., P, and P is the length of the time-frequency vector. The gradients of Y with respect to θ1 are derived as follows:

$$ \frac{\partial Y_1(i)}{\partial \theta_1} = (\cos\theta_1 \cos\theta_3 + \sin\theta_1 \sin\theta_2 \sin\theta_3)\, X_2(i) + (\cos\theta_1 \sin\theta_3 - \sin\theta_1 \sin\theta_2 \cos\theta_3)\, X_3(i) - \sin\theta_1 \cos\theta_2\, X_1(i), \qquad (10) $$

$$ \frac{\partial Y_2(i)}{\partial \theta_1} = (-\sin\theta_1 \cos\theta_3 + \cos\theta_1 \sin\theta_2 \sin\theta_3)\, X_2(i) - (\sin\theta_1 \sin\theta_3 + \cos\theta_1 \sin\theta_2 \cos\theta_3)\, X_3(i) - \cos\theta_1 \cos\theta_2\, X_1(i), \qquad (11) $$

$$ \frac{\partial Y_3(i)}{\partial \theta_1} = 0. \qquad (12) $$

Similarly, we can derive the gradients of Y with respect to θ2 and θ3, respectively. The cost function with the order α = 2 is

$$ J = \sum_{i=1}^{2} \sum_{j=i+1}^{3} J_{ij} = J_{12} + J_{13} + J_{23}, \qquad (13) $$

where

$$ J_{ij} = \frac{\sum_{k=1}^{P} Y_i(k)\, Y_j(k)}{\sqrt{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)}}, \qquad (i < j). \qquad (14) $$

The derivatives of the numerator and denominator of J_ij with respect to θ_l are given as

$$ \frac{\partial \left(\sum_{k=1}^{P} Y_i(k)\, Y_j(k)\right)}{\partial \theta_l} = \sum_{k=1}^{P} \left[ \frac{\partial Y_i(k)}{\partial \theta_l}\, Y_j(k) + Y_i(k)\, \frac{\partial Y_j(k)}{\partial \theta_l} \right] \qquad (15) $$

and

$$ \frac{\partial \sqrt{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)}}{\partial \theta_l} = \frac{1}{\sqrt{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)}} \left[ \left(\sum_{k=1}^{P} Y_i(k)\, \frac{\partial Y_i(k)}{\partial \theta_l}\right) \left(\sum_{k=1}^{P} Y_j^2(k)\right) + \left(\sum_{k=1}^{P} Y_i^2(k)\right) \left(\sum_{k=1}^{P} Y_j(k)\, \frac{\partial Y_j(k)}{\partial \theta_l}\right) \right], \qquad (16) $$

respectively, where l = 1, 2, 3. With equations (15) and (16), the quotient rule gives the gradient of the pairwise cost function J_ij with respect to θ_l as

$$ \frac{\partial J_{ij}}{\partial \theta_l} = \frac{\partial\left(\sum_{k=1}^{P} Y_i(k)\, Y_j(k)\right)/\partial \theta_l}{\sqrt{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)}} - \frac{\sum_{k=1}^{P} Y_i(k)\, Y_j(k)}{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)} \, \frac{\partial}{\partial \theta_l} \sqrt{\left(\sum_{k=1}^{P} Y_i^2(k)\right)\left(\sum_{k=1}^{P} Y_j^2(k)\right)}. \qquad (17) $$

By summing the pairwise gradients of J_ij, we obtain the gradient of the total cost function J with respect to any rotation angle θ_l:

$$ \frac{\partial J}{\partial \theta_l} = \sum_{i=1}^{2} \sum_{j=i+1}^{3} \frac{\partial J_{ij}}{\partial \theta_l} = \frac{\partial J_{12}}{\partial \theta_l} + \frac{\partial J_{13}}{\partial \theta_l} + \frac{\partial J_{23}}{\partial \theta_l}. \qquad (18) $$

This expression is used in updating the rotation angle in the steepest descent algorithm.

BIBLIOGRAPHY

[1] A.-J. van der Veen, S. Talwar, and A. Paulraj, "A subspace approach to blind space-time signal processing for wireless communication systems," IEEE Trans. on Signal Processing, vol. 45, pp. 173-190, 1997.

[2] C. B. Papadias and A. Paulraj, "A constant modulus algorithm for multiuser signal separation in presence of delay spread using antenna arrays," IEEE Signal Processing Letters, vol. 4, pp. 178-181, 1997.

[3] A. Rouxel, D. Le Guennec, and O. Macchi, "Unsupervised adaptive separation of impulse signals applied to EEG analysis," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2000, vol. 1, pp. 420-423.

[4] D. A. Peterson, J. N. Knight, M. J. Kirby, C. W. Anderson, and M. H. Thaut, "Feature selection and blind source separation in an EEG-based brain-computer interface," EURASIP Journal on Applied Signal Processing, vol. 19, pp.
3128-3140, 2005.

[5] S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. on Signal Processing, vol. 41, pp. 3397-3415, 1993.

[6] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Scientific Comp., vol. 20, pp. 33-61, 1999.

[7] R. G. Baraniuk and D. L. Jones, "Wigner-based formulation of the chirplet transform," IEEE Trans. on Signal Processing, vol. 44, pp. 3129-3135, 1996.

[8] B. D. V. Veen and K. M. Buckley, "Beamforming: A versatile approach to spatial filtering," IEEE ASSP Magazine, 1988.

[9] P. Comon, "Independent component analysis: A new concept?," Signal Processing, vol. 36, pp. 287-314, 1994.

[10] H. Yang and S. Amari, "Adaptive online learning algorithms for blind separation: Maximum entropy and minimum mutual information," Neural Computation, vol. 9, pp. 1457-1482, 1997.

[11] A. Cichocki and R. Unbehauen, "Robust neural networks with on-line learning for blind identification and blind separation of sources," IEEE Trans. on Circuits and Systems, vol. 43, no. 11, pp. 894-906, 1996.

[12] M. Girolami and C. Fyfe, "Negentropy and kurtosis as projection pursuit indices provide generalised ICA algorithms," in Advances in Neural Information Processing Systems Workshop, 1996.

[13] A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, pp. 1129-1159, 1995.

[14] B. A. Pearlmutter and L. C. Parra, "A context-sensitive generalization of ICA," in Int. Conf. Neural Information Processing, 1996.

[15] I. Magrin-Chagnolleau, G. Durou, and F. Bimbot, "Application of time-frequency principal component analysis to text-independent speaker identification," IEEE Trans. Speech and Audio Processing, vol. 10, no. 6, pp. 371-378, 2002.

[16] M. Gandetto, M. Guainazzo, and C. S. Regazzoni, "Use of time-frequency analysis and neural networks for mode identification in a wireless software-defined radio approach," EURASIP Journal on Applied Signal Processing, vol. 12, pp. 1778-1790, 2004.

[17] J. R. Berriman, D. A. Hutchins, A. Neild, T. H. Gan, and P. Purnell, "The application of time-frequency analysis to the air-coupled ultrasonic testing of concrete," IEEE Trans. Ultrasonics, Ferroelectrics, and Frequency Control, vol. 53, no. 4, pp. 768-776, 2006.

[18] I. J. Chant and L. M. Hastie, "Time-frequency analysis of magnetotelluric data," Geophysical Journal International, vol. 111, no. 2, pp. 399-413, 1992.

[19] M. Pieraccini, G. Luzi, L. Noferini, D. Mecatti, and C. Atzeni, "Joint time-frequency analysis for investigation of layered masonry structures using penetrating radar," IEEE Trans. Geoscience and Remote Sensing, vol. 42, no. 2, pp. 309-317, 2004.

[20] E. P. Wigner, "On the quantum correction for thermodynamic equilibrium," Physical Review, vol. 40, pp. 749-759, 1932.

[21] J. Ville, "Théorie et applications de la notion de signal analytique," Câbles et Transmission, vol. 2A, pp. 61-74, 1948.

[22] J. E. Moyal, "Quantum mechanics as a statistical theory," in Proc. Camb. Phil. Soc., 1949, vol. 45, pp. 99-124.

[23] L. Cohen, Time-Frequency Analysis, Prentice Hall, New Jersey, 1995.

[24] T. A. C. M. Claasen and W. F. G. Mecklenbrauker, "The Wigner distribution - a tool for time-frequency signal analysis - Part III: Relations with other time-frequency signal transformations," Philips Journal of Research, vol. 35, pp. 372-389, 1980.

[25] A. J. E. M.
Janssen, "On the locus and spread of pseudo-density functions in the time-frequency plane," Philips Journal of Research, vol. 37, pp. 79-110, 1982.

[26] J. Jeong and W. J. Williams, "Kernel design for reduced interference distributions," IEEE Trans. on Signal Processing, vol. 40, pp. 402-412, 1992.

[27] P. Loughlin, J. Pitton, and L. E. Atlas, "Bilinear time-frequency representations: new insights and properties," IEEE Trans. on Signal Processing, vol. 41, pp. 750-767, 1993.

[28] S. Oh and R. J. Marks II, "Some properties of the generalized time-frequency representation with cone shaped kernel," IEEE Trans. on Signal Processing, vol. 40, pp. 1735-1745, 1992.

[29] S. Oh, R. J. Marks II, L. E. Atlas, and J. A. Pitton, "Kernel synthesis for generalized time-frequency distributions using the method of projection onto convex sets," in SPIE-Advanced Signal Processing Algorithms, 1990, vol. 1348, pp. 197-207.

[30] P. Flandrin, "Some features of time-frequency representations of multicomponent signals," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 1984, pp. 41B.4.1-4.4.

[31] P. Flandrin and F. Hlawatsch, "Signal representations, geometry and catastrophes in the time-frequency plane," Mathematics in Signal Processing, pp. 3-14, 1987.

[32] F. Hlawatsch, "Interference terms in the Wigner distribution," Digital Signal Processing-84, pp. 363-367, 1984.

[33] H.-I. Choi and W. J. Williams, "Improved time-frequency representation of multicomponent signals using exponential kernels," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 37, no. 6, pp. 862-871, 1989.

[34] Y. Zhao, L. E. Atlas, and R. J. Marks, "The use of cone-shaped kernels for generalized time-frequency representations of nonstationary signals," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 38, pp. 1084-1091, 1990.

[35] J. Jeong and W. J. Williams, "A new formulation of generalized discrete-time time-frequency distributions," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 1991, pp. 3189-3192.

[36] G. S. Cunningham and W. J. Williams, "High-resolution signal synthesis for time-frequency distributions," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 1993, vol. 4, pp. 400-403.

[37] E. Gokcay and J. C. Principe, "New clustering evaluation function using Rényi's information potential," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2000, pp. 3490-3493.

[38] M. Ben-Bassat and J. Raviv, "Rényi entropy and the probability of error," IEEE Trans. on Info. Theory, vol. 24, no. 3, pp. 324-330, May 1978.

[39] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, no. 4, pp. 349-369, Dec. 1989.

[40] I. Csiszar, "Information-type distance measures and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299-318, 1967.

[41] R. G. Baraniuk, P. Flandrin, A. J. E. M. Janssen, and O. Michel, "Measuring time-frequency information content using the Rényi entropies," IEEE Trans. on Info. Theory, vol. 47, no. 4, pp. 1391-1409, May 2001.

[42] W. J. Williams, M. Brown, and A. Hero, "Uncertainty, information and time-frequency distributions," in SPIE-Advanced Signal Processing Algorithms, 1991, vol. 1556, pp. 144-156.

[43] A. Rényi, "On measures of entropy and information," in Proceedings 4th Berkeley Symp. Math. Stat. and Prob., 1961, vol. 1, pp. 547-561.

[44] K. Pearson, "On lines and planes of closest fit to systems of points in space," Philosophical Magazine, vol. 2, pp. 559-572, 1901.

[45] E.
Oja, Subspace Methods of Pattern Recognition, Research Studies Press, England, and Wiley, USA, 1983.

[46] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications, Wiley, 1996.

[47] M. Wax and T. Kailath, "Detection of signals by information-theoretic criteria," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 33, pp. 387-392, 1985.

[48] J. Karhunen, A. Cichocki, W. Kasprzak, and P. Pajunen, "On neural blind separation with noise suppression and redundancy reduction," Int. Journal of Neural Systems, vol. 8, no. 2, pp. 219-237, 1997.

[49] B. Ans, J. Herault, and C. Jutten, "Adaptive neural architectures: detection of primitives," in Proc. COGNITIVA '85, 1985, pp. 593-597.

[50] T. W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski, "Blind source separation of more sources than mixtures using overcomplete representations," IEEE Signal Processing Letters, vol. 6, no. 4, pp. 87-90, 1999.

[51] L. Zhukov, D. Weinstein, and C. Johnson, "Independent component analysis for EEG source localization," IEEE Eng. Med. Biol. Mag., vol. 19, pp. 87-96, 2000.

[52] E. Visser and T.-W. Lee, "Blind source separation in mobile environments using a priori knowledge," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2004, vol. 3, pp. 893-896.

[53] G. Gelle, M. Colas, and C. Serviere, "Blind source separation: A new pre-processing tool for rotating machines monitoring," IEEE Trans. Instrumentation and Measurement, vol. 52, no. 3, pp. 790-795, 2003.

[54] Z. He, L. Yang, J. Liu, Z. Lu, C. He, and Y. Shi, "Blind source separation using clustering-based multivariate density estimation algorithm," IEEE Trans. on Signal Processing, vol. 48, pp. 575-579, 2000.

[55] D.-T. Pham, "Blind separation of instantaneous mixture of sources based on order statistics," IEEE Trans. on Signal Processing, vol. 48, pp. 363-375, 2000.

[56] B. Torresani, "Wavelets associated with representations of the affine Weyl-Heisenberg group," Journal Math. Physics, vol. 32, pp. 1273-1279, 1991.

[57] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Proc. of the 27th Annual Asilomar Conference on Signals, Systems and Computers, 1993.

[58] A. Belouchrani and M. G. Amin, "Blind source separation based on time-frequency signal representations," IEEE Trans. on Signal Processing, vol. 46, pp. 2888-2897, 1998.

[59] A. Aissa-El-Bey, N. Linh-Trung, K. Abed-Meraim, A. Belouchrani, and Y. Grenier, "Underdetermined blind separation of nondisjoint sources in the time-frequency domain," IEEE Trans. on Signal Processing, vol. 55, pp. 897-907, 2007.

[60] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. on Signal Processing, vol. 52, pp. 1830-1847, 2004.

[61] S. Qian and D. Chen, "Discrete Gabor transform," IEEE Trans. on Signal Processing, vol. 41, pp. 2429-2438, 1993.

[62] D. Gabor, "Theory of communication," J. of IEE, vol. 93, no. 26, pp. 429-457, 1946.

[63] J. Wexler and S. Raz, "Discrete Gabor expansions," Signal Processing, vol. 21, no. 3, pp. 207-221, 1990.

[64] S. Mann and S. Haykin, "The chirplet transform: physical considerations," IEEE Trans. on Signal Processing, vol. 43, no. 11, pp. 2745-2761, 1995.

[65] R. G. Baraniuk, P. Flandrin, A. J. E. M. Janssen, and O. Michel, "Measuring time-frequency information content using the Rényi entropies," IEEE Trans. on Info.
Theory, vol. 47, no. 4, pp. 1391-1409, May 2001.

[66] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice Hall, 1985.

[67] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Trans. on Signal Processing, vol. 45, no. 2, pp. 434-444, 1997.

[68] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley and Sons, 2001.

[69] J. F. Cardoso, "Infomax and maximum likelihood for blind source separation," IEEE Signal Processing Letters, vol. 4, pp. 112-114, 1997.

[70] K. E. Hild II, D. Erdogmus, and J. C. Principe, "Blind source separation using Rényi's mutual information," IEEE Signal Processing Letters, vol. 8, pp. 174-176, 2001.

[71] Y. Q. Li, A. Cichocki, and S. Amari, "Sparse component analysis for blind source separation with less sensors than sources," in Proc. 4th Int. Symposium on Independent Component Analysis and Blind Signal Separation, 2003, pp. 89-94.

[72] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788-791, 1999.

[73] E. Weinstein, M. Feder, and A. Oppenheim, "Multi-channel signal separation by decorrelation," IEEE Trans. on Speech and Audio Processing, vol. 1, pp. 405-413, 1993.

[74] L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Trans. on Speech and Audio Processing, vol. 8, pp. 320-327, 2000.

[75] H. Sahlin and H. Broman, "MIMO signal separation for FIR channels: a criterion and performance analysis," IEEE Trans. on Signal Processing, vol. 48, pp. 642-649, 2000.

[76] C. T. Ma, Z. Ding, and S. F. Yau, "A two-stage algorithm for MIMO blind deconvolution of nonstationary colored signals," IEEE Trans. on Signal Processing, vol. 48, pp. 1187-1192, 2000.

[77] A. Belouchrani, K. Abed-Meraim, M. G. Amin, and A. M. Zoubir, "Blind separation of nonstationary sources," IEEE Signal Processing Letters, vol. 11, no. 7, 2004.

[78] M. G. Amin and Y. Zhang, "Signal averaging of time-frequency distributions for signal recovery in uniform linear arrays," IEEE Trans. on Signal Processing, vol. 48, no. 10, pp. 2892-2902, 2000.

[79] O. Michel, R. G. Baraniuk, and P. Flandrin, "Time-frequency based distance and divergence measure," in Proc. IEEE Int. Symp. Time-Frequency and Time-Scale Analysis, 1994, pp. 64-67.

[80] S. Aviyente, "Information processing on the time-frequency plane," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2004, vol. 2, pp. 617-620.

[81] L.-T. Nguyen, A. Belouchrani, K. Abed-Meraim, and B. Boashash, "Separating more sources than sensors using time-frequency distributions," in Proc. Int. Symposium on Signal Processing and its Applications, 2001, pp. 583-586.

[82] F. D. Murnaghan, The Unitary and Rotation Groups, Spartan Books, Washington DC, 1962.

[83] S. Haykin, Introduction to Adaptive Filters, MacMillan, New York, 1984.

[84] A. Hyvarinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Trans. on Neural Networks, vol. 10, no. 3, pp. 626-634, May 1999.

[85] P. J. Huber, "Projection pursuit," Ann. Statist., vol. 13, no. 2, pp. 435-475, 1985.

[86] I. F. Gorodnitsky and B. D. Rao, "Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm," IEEE Trans. on Signal Processing, vol. 45, pp. 600-616, 1997.

[87] D. L. Donoho and M. Elad, "Maximal sparsity representation via l1 minimization," in Proc. Nat. Acad. Sci., 2003, pp. 2197-2202.

[88] M.
Zibulevsky and B. A. Pearlmutter, "Blind source separation by sparse decomposition," Neural Computation, vol. 13, no. 4, pp. 863-882, 2001.

[89] W. Lu and J. C. Rajapakse, "Approach and applications of constrained ICA," IEEE Trans. Neural Netw., vol. 16, no. 1, pp. 203-212, 2005.

[90] L. Zhang, A. Cichocki, and S. Amari, "Self-adaptive blind source separation based on activation functions adaptation," IEEE Trans. Neural Netw., vol. 15, no. 2, pp. 233-244, 2004.

[91] S. A. Cruces-Alvarez, A. Cichocki, and S. Amari, "From blind signal extraction to blind instantaneous signal separation: Criteria, algorithms, and stability," IEEE Trans. Neural Netw., vol. 15, no. 4, pp. 859-873, 2004.

[92] M. Girolami, "A variational method for learning sparse and overcomplete representations," Neural Computation, vol. 13, no. 11, pp. 2517-2532, 2001.

[93] M. Zibulevsky, P. Kisilev, Y. Y. Zeevi, and B. A. Pearlmutter, "Blind source separation via multinode sparse representation," in Advances in Neural Information Processing Systems (NIPS), 2001.

[94] Y. Q. Li, A. Cichocki, and S. Amari, "Sparse representation and blind source separation," Neural Computation, vol. 16, no. 6, pp. 1193-1234, 2004.

[95] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley, New York, 2001.

[96] M. S. Lewicki and T. J. Sejnowski, "Learning overcomplete representations," Neural Computation, vol. 12, no. 2, pp. 337-365, 2000.

[97] K. Kiviluoto and E. Oja, "Independent component analysis for parallel financial time series," in Proc. ICONIP'98, 1998, vol. 2, pp. 895-898.

[98] R. Vigario, J. Sarela, V. Jousmaki, and E. Oja, "Independent component analysis in decomposition of auditory and somatosensory evoked fields," in Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'99), 1999, pp. 167-172.

[99] S. J. Luck, An Introduction to the Event-Related Potential Technique, The MIT Press, 2005.

[100] J. N. Demos, Getting Started with Neurofeedback, W. W. Norton & Company, 2005.

[101] O. Faugeras, F. Clement, R. Deriche, R. Keriven, T. Papadopoulo, and J. Roberts, "The inverse EEG and MEG problems: The adjoint state approach I: The continuous case," INRIA Research Report 3673, 1999.

[102] H. Shevrin, J. A. Bond, L. A. Brakel, R. K. Hertel, and W. J. Williams, Conscious and Unconscious Processes: Psychodynamic, Cognitive and Neurophysiological Convergences, New York: Guilford, 1996.

[103] H. Shevrin, W. J. Williams, R. E. Marshall, R. K. Hertel, J. A. Bond, and L. A. Brakel, "Event-related potential indicators of the dynamic unconscious," Consciousness Cogn., vol. 1, pp. 340-366, 1992.

[104] W. J. Williams, H. P. Zaveri, and J. C. Sackellares, "Time-frequency analysis of electrophysiology signals in epilepsy," IEEE Trans. Eng. Med. Biol., pp. 133-143, 1995.

[105] http://www.sccn.ucsd.edu/eeglab.