This is to certify that the thesis entitled

Evaluation and Comparison of Data Reduction and Source Separation Techniques for Event Related Potentials

presented by Jacob Swary has been accepted towards fulfillment of the requirements for the M.S. degree in Electrical & Computer Engineering.

Major Professor's Signature

Date

Evaluation and Comparison of Data Reduction and Source Separation Techniques for Event Related Potentials

By

Jacob Swary

A Thesis Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering

2007

ABSTRACT

Evaluation and Comparison of Data Reduction and Source Separation Techniques for Event Related Potentials

By Jacob Swary

The focus of this thesis is the study of event-related potentials (ERPs), which measure the brain's response to specific presented stimuli with electroencephalography (EEG). In the past, averaging over multiple trials has been used to evaluate ERPs. Averaging ignores the trial-to-trial variability of the brain's response, and has yielded only the knowledge of certain response peaks and how they are broadly related to particular tasks. Recently, attempts have been made to extract the actual underlying sources generated by the brain in order to evaluate its response more effectively. A common assumption is that the underlying sources are statistically independent, and independent component analysis (ICA) is used for this blind source separation (BSS). To avoid the assumption of independent sources, we propose to solve the BSS problem using quadratic time-frequency distributions of the data. In this framework, the assumption is instead that the sources are sparse in the time-frequency plane, i.e. most data points are close to zero. Owing to this sparsity, methods have been developed to estimate first a mixing matrix, which determines the weighting of each source at each electrode, and then the sources themselves. This two-stage approach can recover a number of sources greater than the number of electrodes used in the EEG measurement. Both the two-stage approach and ICA are applied to a set of measured ERPs, and the results are compared in this thesis. It is shown that the proposed method is more effective than ICA at extracting components that are well localized in time and frequency, and that these components are comparable to those of ICA at representing the variance of the original ERP data.

TABLE OF CONTENTS

LIST OF TABLES ................................. v
LIST OF FIGURES ................................ vi

CHAPTER 1
Introduction ..................................... 1
1.1 Electroencephalography ......................... 1
1.2 Event Related Potentials ......................... 3
1.3 Spectral Analysis of EEG ........................ 4
1.4 Analysis of ERPs: Multiple Trial Average ............... 7
1.4.1 Woody Average .......................... 8
1.4.2 Latency Corrected Average ................... 9
1.5 ERP Time Components ......................... 10
1.6 Organization of Thesis .......................... 13
CHAPTER 2
Signal Processing Methods for ERP Component Extraction .......... 15
2.1 Problem Statement ............................ 15
2.2 Time-Frequency Distributions ...................... 17
2.3 Data Reduction .............................. 19
2.3.1 Principal Component Analysis .................. 19
2.3.2 Matching Pursuit ......................... 21
2.3.3 Principal Component Analysis and Matching Pursuit on Time-Frequency ... 23
2.4 Blind Source Separation ......................... 24
2.4.1 Independent Component Analysis ................ 24
2.4.2 Blind Source Separation on the Time-Frequency Plane .... 26

CHAPTER 3
Underdetermined Source Separation in Time-Frequency Domain ........ 32
3.1 Underdetermined Blind Source Separation ............... 32
3.2 Determination of the Mixing Matrix ................... 33
3.3 K-means Clustering ........................... 34
3.4 Estimation of the Source Signals for a Given Mixing Matrix ..... 35
3.5 Comparison Between Wavelet Packets and Time-Frequency Distributions 37

CHAPTER 4
Source Separation Results for ERP Signals .................... 44
4.1 Data .................................... 44
4.2 Single-Trial ERP ............................. 45
4.3 Measures of Evaluation .......................... 46
4.3.1 Data Reduction .......................... 46
4.3.2 Data Variance Explained ..................... 51
4.3.3 Measurement of Sparsity ..................... 54

CHAPTER 5
Conclusions and Future Work ........................... 57

BIBLIOGRAPHY ................................. 60

LIST OF TABLES

Table 4.1 Mean measure of l1 norm to show sparsity .............. 53
Table 4.2 Mean measure of l1 norm to show sparsity .............. 56
Table 4.3 Mean measure of entropy to show time-frequency localization ... 56
Table 4.4 Measure of disjointness by correlation between components .... 56

LIST OF FIGURES

Figure 1.1 (a) An electrode array connected to a person. (b) A graph of typical EEG readings, where each row in the graph represents the measurements from one electrode. (c) The locations of electrodes in the 10-20 system. ..........
Figure 1.2 An experimental ERP setup with a visual stimulus presented on a computer screen and a response required by the push of a button. ..........
Figure 1.3 The P300 and N200 components can be seen in the averaged response to the rare stimuli from this experiment. .......... 12
Figure 2.1 The diagram of the mixing model with N sources and M electrodes (sensors). The transfer functions are simply scale factors since this is an instantaneous mixture. .......... 17
Figure 3.1 Scatter plot of two mixtures of four Gabor logons in the time-frequency domain .......... 39
Figure 3.2 The mixtures and the separation of four Gabor logons: (a) and (b) two mixtures; (c), (d), (e), and (f) four extracted Gabor logons .......... 40
Figure 3.3 Scatter plot of two mixtures of a chirp and two Gabor logons in the time-frequency domain .......... 41
Figure 3.4 The mixtures and the separation of a chirp and two Gabor logons: (a) and (b) two mixtures; (c) extracted chirp; (d) and (e) two extracted Gabor logons .......... 42
Figure 3.5 Comparison of MSE versus SNR for extracted sources with TFD and WP .......... 43
Figure 4.1 Single trial results using 32 frequency bins. (a) 6 extracted sources from ICA. (b) 8 extracted sources from the proposed method. .......... 47
Figure 4.2 Single trial results using 128 frequency bins. (a) 6 extracted sources from ICA. (b) 14 extracted sources from the proposed method. .......... 48
Figure 4.3 Single trial results using 128 frequency bins. (a) 6 extracted sources from ICA. (b) 16 extracted sources from the proposed method. .......... 49
Figure 4.4 Results of component clustering over all single-trial results for stimulus group u = 1. (a) Components extracted using ICA. (b) Components extracted using the proposed method. .......... 52

CHAPTER 1

INTRODUCTION

1.1 Electroencephalography

Electroencephalography (EEG) is the non-invasive process of measuring electrical potentials arising from activity in the brain. Electrodes with a conducting medium are placed at multiple fixed locations around the scalp. Electrode placements follow the international 10-20 system as defined in [1]. Potentials measured at each electrode are not necessarily due to activity in the immediate proximity of that electrode, because of volume conduction in the brain and across the scalp [2]. The measured potentials are amplified, filtered, and stored for processing. An example electrode setup, a typical EEG reading, and the electrode locations of the 10-20 standard are shown in Figure 1.1.

[Figure 1.1. (a) An electrode array connected to a person. (b) A graph of typical EEG readings, where each row represents the measurements from one electrode. (c) The locations of electrodes in the 10-20 system.]

[...]

1.4 Analysis of ERPs: Multiple Trial Average

[...] The weighted average of the N trial recordings x_i(t) is

\bar{x}(t) = \frac{1}{D} \sum_{i=1}^{N} w_i x_i(t),   (1.7)

where the w_i are per-trial weights and D = \sum_{i=1}^{N} w_i. A whitening filter is applied to the recordings to give \tilde{x}_i(t), and covariance coefficients \gamma_i are calculated between each filtered recording and the average of all the other filtered recordings. The weights are then

w_i = \begin{cases} 0, & \gamma_i \le 0 \\ \gamma_i, & 0 < \gamma_i \le C \\ C, & \gamma_i > C \end{cases}   (1.8)

and C is chosen empirically, usually about 0.8 [5].

The problem with the averaging technique is that ERPs are not deterministic; exact repetitions of the experiment will not lead to the same responses at the electrodes. From trial to trial there will be latency deviations of peaks, different amplitudes, and different wave shapes. These deviations occur due to many factors of the subject's performance, such as attention, expectation, arousal, strategy, and others [14]. Because of these variations across trials, the average does not truly represent an ERP and cannot reflect changes in the subject's state.

1.4.1 Woody Average

Considering the fact that latencies of ERPs vary from trial to trial, the SNR of the averaged ERP could be improved if the shifts in time could be detected and corrected. The Woody average is a process that attempts this latency correction [5]. The process described in section 1.4 is first used, and the averaged ERP serves as a starting template. Each individual ERP response is then cross-correlated with the template, and each waveform is shifted by the lag at which its maximum cross-correlation value occurred. The total average is then recalculated, and this becomes the new template. This process is repeated until the cross-correlation does not change significantly between the current iteration and the previous one. This provides improved waveforms in cases where the components of each ERP signal all shift by the same amount. Separate components, however, have latency changes that occur independently of each other, so the Woody average will still blur components of the ERP response. Further, it is possible that the Woody average will align certain background processes that were independent in the individual trials.
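The iterative template-matching scheme just described is compact enough to sketch in code. The following is a minimal illustration, assuming single-channel trials stored row-wise in a NumPy array; the circular shifting via np.roll and the convergence test on the template (rather than on the cross-correlation values) are simplifications of this sketch, not details from [5].

```python
import numpy as np

def woody_average(trials, max_shift=50, tol=1e-6, max_iter=20):
    """Iterative latency-corrected (Woody) average.

    trials: (n_trials, n_samples) array, one ERP recording per row.
    Each trial is aligned to the current template at the lag of its
    maximum cross-correlation, and the template is recomputed until
    it stops changing.
    """
    template = trials.mean(axis=0)          # plain average as starting template
    lags = np.arange(-max_shift, max_shift + 1)
    shifts = np.zeros(len(trials), dtype=int)
    for _ in range(max_iter):
        # lag of maximum cross-correlation between each trial and the template
        shifts = np.array([lags[np.argmax([np.dot(np.roll(x, -k), template)
                                           for k in lags])] for x in trials])
        aligned = np.array([np.roll(x, -k) for x, k in zip(trials, shifts)])
        new_template = aligned.mean(axis=0)  # recalculated total average
        if np.max(np.abs(new_template - template)) < tol:
            break
        template = new_template
    return template, shifts
```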
1.4.2 Latency Corrected Average

An attempt to improve on the Woody average is made with the latency corrected average (LCA) [5]. Peaks of different components in an ERP are aligned from trial to trial so that the SNR of these components in the average is increased. The mean is removed from each trial and the variance is normalized. Peaks are located in each recording such that the magnitude of the slope on either side of the peak exceeds a set minimum. The points of the sample mean are compared to the detected peaks. If the points of the mean and the peak are of the same sign, and a statistical test at a specified confidence level rejects the hypothesis that both are from a zero-mean population, then the detected peak is considered a component; otherwise it is not. Peaks over all trials that are of the same sign as the sample mean at that point, and that lie between the same zero crossings of the sample mean, are considered the same component. The segments around the peaks are then used in calculating a new mean. This mean is not a complete waveform, since it is computed from discrete components of waveforms, so interpolation must fill the empty time slots.

This is an improvement over the straight sample mean of the responses, as it improves the SNR of different ERP components and can reveal components not shown by the plain mean. It is still not ideal: it only compensates for variations in peak latency, nothing else. In addition, variations in the individual responses are still all classified as one response in the end. All changes in subjects' states are still being ignored, and an opportunity to gain more insight into mental processes is lost.

1.5 ERP Time Components

Through averaging in ERP experiments, multiple components related to certain tasks have been identified. The results from averaging trials under the same conditions are examined, with peaks analyzed with respect to scalp distribution, polarity, and latency. Components are usually identified by their polarity and latency; for example, P300 stands for a positive peak around 300 ms. The components presented here are described in [9].

The earliest components that have been found occur within the first 100 ms of stimulus presentation. These components change as a function of stimulus intensity and frequency, and can also be affected by attention. This activity is automatic and is related to the signal picked up at the sensory site being transmitted to the brain's processing systems.

One common component is called the mismatch negativity (MMN). This is a negative peak occurring around 100-200 ms after the stimulus, and is believed to originate in the auditory cortex. This type of response is found when two different classes of auditory stimuli are presented while the subject's attention is on a separate task, with one class presented more often than the other. The average is taken over all trials of each stimulus type; the average of the frequent stimulus trials is subtracted from the average of the rare stimulus trials, and the resulting component is the MMN. The MMN represents a sort of mismatch detector. It is an automatic, preattentive processing of deviant features. Because it can be recorded with delays of usually only up to 10 seconds between stimuli, the MMN reflects a transient type of memory. When the two classes of stimuli differ in more than one factor, the MMN has a larger amplitude, and so may reflect parallel processing of multiple factors.

The N200 component is a negative peak at about 200 ms, found in either the auditory or visual cortex, depending on the stimulus.
This component represents a comparison that is actively generated by the subject. It follows from a mismatch between two stimuli, or between the stimulus and a mentally formed template. It differs from the MMN in that the subject is paying attention to the stimulus, and the N200 does not necessarily require two stimuli to be presented. It appears when the stimulus mismatches the subject's expectancy, whether from a previous stimulus or from a memory template. Its latency covaries with response time, so the N200 may reflect the feature discrimination taking place, which influences the response time. The N200 can be seen in Figure 1.3.

A very commonly studied component is the P300, a positive peak around 300 ms that is strong at the posterior scalp locations. Despite more than 30 years of research, there is no definitive account of the underlying sources contributing to the P300. It is thought to be a summation of activity over multiple generators. Its amplitude is affected by perceived stimulus probability, but only when the stimulus is relevant to the task at hand. The amplitude is also directly proportional to the processing demand of the task. The peak's latency and the reaction time increase when the task is accuracy oriented instead of speed oriented. The latency is also longer when a categorization task is more difficult. The P300 may reflect the updating of mental environment models or of the context in working memory. The P300 is shown in Figure 1.3.

A similar component is the Frontal P300, which instead of being strong at the posterior electrodes is strong at the frontal ones. It is caused by deviant stimuli that are very rare and unexpected, such that there is no memory template beforehand. In young adults, repeated presentations shift the P300 response to the posterior; in older adults, the Frontal P300 stays frontal with repeated presentations.

[Figure 1.3. The P300 and N200 components can be seen in the averaged response to the rare stimuli from this experiment. Solid curve: response to the frequent stimulus; dashed curve: response to the infrequent stimulus. A: baseline; B: N200; C: P300. Horizontal axis: time relative to stimulus presentation (ms); vertical scale bar: 0.02 mV.]

Another examined component is the N400, a negative peak around 400 ms that is related to reading tasks. The N400 response is strong when a string of words is presented one by one to form a sentence and the last word does not make sense. The worse a word fits the sentence, the stronger the response. It is related only to semantic, not grammatical, errors. The response is weaker when the word is semantically close to what makes sense, e.g. "drink" instead of "eat". It has also been shown to be elicited by metaphors, even when the subject grasps the metaphor.

The error-related negativity (ERN) is a negative component that appears when a subject makes an error in a response task to a stimulus. For example, the subject may have to respond with the right or left hand to two types of visual or auditory stimuli; this component shows up when the subject responds with the right hand when he/she should respond with the left, or vice versa. The ERN typically peaks around 150 ms after the subject's response. The amplitude of the ERN is larger with more emphasis on accuracy over speed. It also increases when an incorrect response differs from the correct response by more movement parameters.
The ERN can be seen in tasks where there is no error correction, as in a go/no-go task, meaning it is involved in error detection as well as, probably, error correction. It is not certain to what degree the ERN is involved in each, though.

These ERP components are the peaks and troughs in the waveforms that covary in response to experimental manipulations. Each component is viewed as indexing some aspect of cognitive processing. The traditional view relating components to the peaks and troughs is one way to define components. Another is as the aspects of the ERP that covary across subjects, conditions, or locations. A third definition treats components as terms that directly correspond to the neural generating structures. The traditional representation of components is somewhat arbitrary, as these components may be summations of the components found in the other ways. Finding the actual underlying sources is how to reduce the components to their most basic level, and it is in this way that the most insight can be gained about the processes of the brain. The relationship between cognitive processes and ERP activity was studied with no reference to the underlying sources until the 1990s. The problem of solving for what will be measured at the electrodes, given the sources in the brain, is the forward problem; solving for the sources given only the readings of the electrodes is the inverse problem that must be solved here.

1.6 Organization of Thesis

In Chapter 2, the background on common data reduction and blind source separation (BSS) techniques is presented and discussed. Chapter 3 introduces the problem of underdetermined blind source separation (UBSS), and a UBSS technique on the time-frequency plane is presented. In Chapter 4, the proposed UBSS method is applied to a multiple-trial ERP data set to extract neuronal sources. The extracted neuronal sources are compared to those of ICA using measures of localization, sparsity, disjointness, and variance. Chapter 5 concludes the thesis with a summary of contributions and a discussion of future work.

CHAPTER 2

SIGNAL PROCESSING METHODS FOR ERP COMPONENT EXTRACTION

2.1 Problem Statement

EEG/ERP signals are often assumed to be produced by distinct neuronal sources at distinct locations within the brain. These sources are conducted through the brain, skull, and scalp and create the potentials measured by the electrodes. The readings at the electrodes are assumed to be instantaneous linear mixtures of the sources in noise, following the model

X = AS + V,   (2.1)

with X representing the observation matrix, A the mixing matrix, S the source matrix, and V the noise matrix. The observation matrix, X, is an M x p matrix with each row representing the reading at one electrode over p time points, and each column, x(t), the array of M sensor readings at time t. The source matrix, S, is of size N x p, where each row is one underlying source over p time points, and each column, s(t), is the source array at time t. If the noise matrix is assumed negligible, the representation becomes:

X = AS.   (2.2)

If a linear transform is applied to the data, X, the nature of the model stays the same,

\tilde{X} = A\tilde{S},   (2.3)

where \tilde{X} and \tilde{S} are the linearly transformed representations of X and S, respectively.
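Before moving to the transformed model, the plain mixing model of equation (2.1) can be made concrete with a few lines of code. This is an illustrative sketch only: the dimensions and the Laplacian source model are assumptions chosen here to match the sparsity discussion later in the thesis, not values from the recordings.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, p = 6, 8, 1000              # electrodes, sources, time points (illustrative)
A = rng.standard_normal((M, N))   # mixing matrix: a_ij = strength of source j at sensor i
S = rng.laplace(size=(N, p))      # sparse (Laplacian) model sources
V = 0.01 * rng.standard_normal((M, p))   # additive sensor noise

X = A @ S + V                     # instantaneous linear mixture, equation (2.1)
print(X.shape)                    # (6, 1000): one row per electrode
```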
When the quadratic time-frequency transform used in this thesis is applied to the data, the mixing model changes to

\hat{X} \approx A^{(2)} \hat{S} = B \hat{S}.   (2.4)

Here, \hat{X} and \hat{S} represent the quadratically transformed data and source matrices, respectively, and B = A^{(2)} is the element-by-element square of A. For simplicity, equation (2.2) will always be considered the mixing equation, and it will be understood that when the quadratic transform is used, X, A, and S actually represent \hat{X}, B, and \hat{S}, respectively. A diagram of the mixing model is shown in Figure 2.1.

[Figure 2.1. The diagram of the mixing model with N sources and M electrodes (sensors). The transfer functions are simply scale factors, since this is an instantaneous mixture.]

The model could not be written in this matrix form if the mixing process were assumed to be non-linear. Also, if the mixing matrix, A, were assumed to be time-varying, a different matrix, A(t), would have to be specified for each x(t) and s(t). It is assumed that the mixing process is linear and time-invariant; therefore A is assumed to be constant, of size M x N. Each mixing matrix entry, a_ij, represents the relative strength of source j at sensor i. This also assumes negligible delay between source activation and sensor reading; otherwise a convolutive mixture would be necessary, in which the readings, x(t), would be a sum of different mixing matrices, A_0, A_1, ..., multiplied by s(t), s(t-1), ..., etc. The matrix V is additive noise of the same size as X.

With only the data collected from the electrodes available, X, the goal is to extract the underlying sources by solving for S. In general, a method must assume a certain structure for the underlying source signals and define a corresponding cost function for extracting the sources. In this chapter, some existing methods used in addressing this problem, i.e. data reduction and source separation, are presented.

2.2 Time-Frequency Distributions

Some of the methods introduced in this chapter depend on representing a signal in the joint time-frequency plane, so time-frequency distributions (TFDs) are introduced here. A TFD, C(t, \omega), from Cohen's class can be expressed as [16]:

C(t, \omega) = \iiint \phi(\theta, \tau)\, s(u + \tau/2)\, s^{*}(u - \tau/2)\, e^{j(\theta u - \theta t - \tau \omega)}\, du\, d\theta\, d\tau,   (2.5)

where \phi(\theta, \tau) is the kernel function and s(t) is the signal. (All integrals are from -\infty to \infty unless otherwise stated.) The Short-Time Fourier Transform (STFT) is a simple instance of a TFD from Cohen's class, and can be represented as:

S_h(t, \omega) = \int s(\tau)\, h(\tau - t)\, e^{-j\omega\tau}\, d\tau.   (2.6)

The STFT involves performing Fourier transforms on windowed data, where h(t) is the window function and t represents its translation. A shorter window, h(t), provides higher time resolution, but at the expense of frequency resolution; likewise, time resolution suffers for increased frequency resolution. Thus the representation of the data with the STFT is generally smeared. Also, the STFT representation does not necessarily maintain the time and frequency marginals, meaning the signal energy is misrepresented [17].

The quadratic Wigner distribution is also of Cohen's class and is defined as follows:

W(t, \omega) = \int s(t + \tau/2)\, s^{*}(t - \tau/2)\, e^{-j\tau\omega}\, d\tau.   (2.7)

For quadratic TFDs, cross-terms, or interference, occur when the signal is multicomponent. The cross-terms correspond to the interaction between the different sources and do not contribute directly to the energy distribution of the individual sources, and thus are undesirable. For this reason, reduced interference distributions (RIDs) are used, designed using |\phi(\theta, \tau)| << 1 for |\theta\tau| >> 0, that satisfy energy preservation and the marginals [18].
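A compact reference implementation of the Wigner distribution of equation (2.7), the kernel-equal-to-one case of Cohen's class, is given below as a sketch for real or analytic inputs. Note that the FFT here runs over the lag variable, so the frequency axis carries the usual factor-of-two scaling of the discrete Wigner distribution, and no interference reduction is applied.

```python
import numpy as np

def wigner(x):
    """Discrete (pseudo-)Wigner distribution W[t, k] of a 1-D signal.

    Builds the instantaneous autocorrelation x(t+l) x*(t-l) at each
    time t and Fourier-transforms it over the lag l, as in (2.7).
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    W = np.zeros((n, n))
    for t in range(n):
        lmax = min(t, n - 1 - t)          # largest admissible lag at this time
        r = np.zeros(n, dtype=complex)    # instantaneous autocorrelation over l
        for l in range(-lmax, lmax + 1):
            r[l % n] = x[t + l] * np.conj(x[t - l])
        W[t] = np.fft.fft(r).real         # spectrum over the lag axis
    return W
```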
Since the distributions will be implemented in discrete-time, the TF D can be 18 expressed as: N N TFD( (n UJ‘ 11)): Z Z :1:(n+n1):1: *(n+n2)i,/1 (_:’ii_"_2,n1 _ n2)e—jw(n1—712), 2 nlz—Nn2z— —N (2.8) where w is the discrete—time kernel in the time and time—lag domain. Cohen’s class of distributions offer several advantages to other time-frequency analysis methods such as uniform time and frequency resolution, energy preservation and the marginals [16]. 2.3 Data Reduction Experimental ERP data is recorded over p time points per trial, with data available from multiple electrodes per trial, performed over multiple trials for multiple subjects. Each electrode reading per trial is represented as a p-dimensional observation if p is the number of time points measured and analyzed in time, or a P-dimensional observation if P is the number of total time-frequency points when analyzed in the time-frequency domain. If J is the number of subjects, M the number of electrodes, and T the number of trials, then the number of p— or P-dimensional observations to be analyzed is J XMXT and quickly becomes a large amount of data to process. Methods to reduce the data to a smaller number of meaningful components becomes an issue to effectively analyze the data. 2.3.1 Principal Component Analysis Principal Component Analysis (PCA) is widely used in signal processing and pattern recognition applications. It is importantly used for data reduction with respect to ERP analysis. Like trial averaging, it is run over multiple trials, but PCA will itself extract multiple components, where averaging results in one waveform in which peaks are analyzed from. PCA reduces the ERP measurements into multiple orthogonal 19 components which covary over the trials as a result of experimental manipulations. The principal components (PCs) may be such that a summation over different PCs may correspond to a commonly known component, such as P300, but the PCs most likely do not represent the underlying sources. With observations x(t) = [1121(t), 332(t), . . . ,xM(t)]T, PCA is performed such that M 0.0) = Z win-scat) = wfxm (2.9) k=1 h principal component and w,- is a weight vector. The first principal where y,- is the 2'” component, yl, is chosen so that its variance is maximized by choice of weight vector WI. The variance of y1 is dependent on the orientation and norm of W1 and increases with the norm, so W1 is constrained to Euclidean l2 norm of the weight vector, “le2 = 1. The following yi’s are found to be orthogonal to all previous yj, j < 2', such that the variance is maximized in the subspace orthogonal to the space spanned by yj, j = 1,. . . ,z’ -— 1. The solution to this problem is equivalent to eigenvalue decomposition of the data covariance matrix R. The data covariance matrix is defined as: 1 k T R = 7; Z x(t)x(t) . (2.10) t=1 The eigenvectors and eigenvalues of R are then found according to: R = VAvT (2.11) where A is the M x M diagonal matrix with entries representing the eigenvalues in decreasing order of magnitude A1 > A2 > > AM, and V = [v1,v2, . . . ,vM] with columns representing the corresponding eigenvectors. The first m eigenvectors are 20 kept and used such that W1 = v1,w2 = v2, . . . ,wm = vm, and the variances of the principal components y,- are given by the respective eigenvalues )‘i- The resulting approximation to the original data set is: 51(1) = Z A,v,-(t). (2.12) A threshold is chosen such that the number of principal components kept, m, is determined. 
The error between x(t) and x(t) goes to zero as m approaches M, and the eigenvalues, ZiM=m+1 A,- represent the error. A common decision rule is to set a minimum value for each eigenvalue such that eigenvector v,- is kept if A,- > T, where T is the threshold. Another common decision rule is to keep the eigenvectors that 2m together explain a minimum amount of variance. That is, such that fig > T, k=1Ak where T is the threshold. To be viewed as source extraction, the sources would have to be assumed to be orthogonal in time. This is not an assumption that is often made, so this method is not used to determine the underlying sources. Its value comes in data reduction, so that further analysis techniques can more efficiently be applied to the data by using the extracted components. 2.3.2 Matching Pursuit Matching Pursuit (MP) is an algorithm that decomposes a signal linearly into a set of functions, called time-frequency atoms, from an overcomplete dictionary. It can be used in simplifying the representation of a particular signal via approximation. The dictionary is composed of functions, g(t), scaled, translated, and modulated as follows: 1 t-u 97“) = fl“ fit 3 )e , (2.13) where s is scale, 11. translation, 5 modulation, and ”y = (s,u,£) E F. The % term 21 normalizes 97(13). The dictionary, D = [when is overcomplete in that it provides more than a sufficient basis for the signals it is to represent. The first step in MP is choosing a function, 970, from the given dictionary, D. This 970 is projected on the original signal, f, and R1 f is defined as the residue such that: f = (f. 970227,, + RU (214) and because the residue is orthogonal to the dictionary function, 10112 =I12 + ”le112. (2.15) The function, 970, must be chosen to maximize [( f, 970)] so that the residual, R1 f, is minimized. The second step is then to choose 971 from D. This is found by maximizing |(R1f,970)| so that the new residue is sz and le = (le,970)970 + R2f. These iterations are repeated likewise so at the mth iteration, the signal approxi- mation is: m—l f = Z (Wigwam + WI. (2.16) n=0 where R0 = f. As the iteration number, m —> oo, “Rm f [I2 —+ 0. The algorithm should be run until [IRme2 < T, for a predetermined threshold, T. The MP algorithm can offer compact representations of signals under the right conditions, however this is dependent on the dictionary chosen. A larger library of atoms can be chosen to better represent the dictionary at the expense of com- putational complexity, but signal components must be reasonably approximated by dictionary atoms. Frequency-modulated Gaussian functions, called gabor logons, give Optimal joint time-frequency localization [20], and therefore are well suited in repre- sentation of ERPs. MP is a greedy algorithm, so it does not necessarily provide an 22 optimal solution. 2.3.3 Principal Component Analysis and Matching Pursuit on Time- Frequency Using a frequency transform like the Fourier Transform obscures the time localization of the signal, while analysis in time domain loses information about the frequencies that constitute the signal. It has been shown that using a TFD for EEG data can produce results not available with conventional analysis techniques [21]. Using a TFD of Cohen’s class has the advantage over a wavelet transform (WT) that it has constant time-frequency atoms whereas the WT suffers from low time resolution at low frequencies and low frequency resolution at higher frequencies. 
2.3.3 Principal Component Analysis and Matching Pursuit on Time-Frequency

Using a frequency transform like the Fourier transform obscures the time localization of the signal, while analysis in the time domain loses information about the frequencies that constitute the signal. It has been shown that using a TFD for EEG data can produce results not available with conventional analysis techniques [21]. A TFD of Cohen's class has the advantage over a wavelet transform (WT) that it has constant time-frequency atoms, whereas the WT suffers from low time resolution at low frequencies and low frequency resolution at high frequencies. Under noisy conditions, the RID was shown to provide better separation of signal components than the WT [17]. The RID is chosen for signal representation because of these advantages.

Because of the two-dimensional nature of a signal under a TFD, there is much more data to be processed than when analyzing signals in time or frequency alone. It was shown that data reduction using PCA on TFD representations of EEG data can provide meaningful components [17]. This approach assumes time-frequency stationarity of the sources: if a single source shifts in the time-frequency plane over multiple trials, either multiple components explaining the same source will be extracted, or the component representing the source will be spread to cover the area of the source shift. As in the time-domain case, to be viewed as source extraction, the sources would have to be assumed orthogonal, here in the time-frequency plane. So again, the value of PCA comes in data reduction, so that further analysis techniques can be applied more efficiently to the extracted components.

Given that PCA can successfully be used in data reduction and that the extracted components are localized in the time-frequency plane, it has been shown possible to meaningfully reduce the amount of data further [26]. Using a dictionary of Gabor logons, a Matching Pursuit algorithm is applied to the components extracted with PCA on TFDs. The resulting components were shown to represent the meaningful variance of the original data well. Because the components are reduced to a few time and frequency parameters, the data is reduced further than when using the time-frequency surfaces of the principal components.

2.4 Blind Source Separation

2.4.1 Independent Component Analysis

In recent years, multivariate data analysis that decomposes the measurements into several independent time series has become a popular way to describe the 'source' signals in the brain. Independent Component Analysis (ICA) solves for the unmixing matrix W = A^{-1} such that the number of electrodes is at least as large as the number of underlying sources (M >= N), W is full rank, and the estimated sources, Y = WX, are as independent as possible [22], with A as described in section 2.1.

The Infomax algorithm introduced by Bell and Sejnowski [23] maximizes the mutual information that the system output contains about its input. In its application to ERPs, the estimated brain sources represent the output and the electrode readings the input. Consider x to be the input vector and y the output vector, where y = g(Wx + w_0), with W an unmixing matrix and w_0 a bias vector. The multivariate pdf of y is:

f_{\mathbf{y}}(\mathbf{y}) = \frac{f_{\mathbf{x}}(\mathbf{x})}{|J|},   (2.17)

where J is the Jacobian determinant

J = \det\left[ \frac{\partial y_i}{\partial x_j} \right]_{i,j=1}^{M}.   (2.18)

The output entropy is H(y) = -E[\ln f_{\mathbf{y}}(\mathbf{y})]. Substituting equation (2.17) into this expression, the entropy becomes H(y) = E[\ln |J|] - E[\ln f_{\mathbf{x}}(\mathbf{x})]. The term on the right, the entropy of x, is unaffected by W, so that maximizing E[\ln |J|] maximizes H(y). Gradient ascent learning rules are derived as:

\Delta W \propto [W^T]^{-1} + (1 - 2\mathbf{y})\mathbf{x}^T,   (2.19)

\Delta \mathbf{w}_0 \propto 1 - 2\mathbf{y},   (2.20)

and the rule for individual weights is therefore:

\Delta w_{ij} \propto \frac{\mathrm{cof}\, w_{ij}}{\det W} + x_j (1 - 2y_i),   (2.21)

where cof is the cofactor and det is the determinant. This gradient ascent algorithm is efficient if the pdfs of the inputs are super-Gaussian, with no more than one source being Gaussian, and when the mixing matrix is not almost singular.
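One Infomax update, exactly as in equations (2.19) and (2.20) with the logistic nonlinearity, looks as follows. In practice the update is iterated over many data vectors until W converges; that outer loop and the learning-rate value are left as assumptions of this sketch.

```python
import numpy as np

def infomax_step(W, w0, x, lr=0.01):
    """One Infomax gradient-ascent update (equations 2.19 and 2.20).

    W: current unmixing matrix; w0: bias vector; x: one observation
    column.  Uses the logistic nonlinearity g(u) = 1 / (1 + exp(-u)).
    """
    y = 1.0 / (1.0 + np.exp(-(W @ x + w0)))           # network output
    dW = np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, x)
    dw0 = 1.0 - 2.0 * y
    return W + lr * dW, w0 + lr * dw0
```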
The process of Principal Component Analysis (PCA) can be compared with that of ICA. They are similar; however, PCA seeks sources which are orthogonal to each other, relying only on second-order statistics, while ICA uses the less restrictive assumption of independence and uses higher-order statistics [24]. In PCA, each successive component explains as much of the data variance as possible, while in ICA the variance is spread more evenly across the components [27].

In using ICA, sufficient data points are required to represent independence; this calls for a few times more data points than there are electrodes. It is also required to have as many electrodes as there are underlying sources in order to successfully extract the independent sources. Finally, the sources will only be successfully separated if the assumption of independence holds, and independence between multiple sources in the brain is not necessarily true.

2.4.2 Blind Source Separation on the Time-Frequency Plane

Much work is being done in BSS based on the assumptions of disjointness and sparsity of the sources. These techniques do not assume independence of the sources, and they are designed for the underdetermined case, that is, when there are more sources than sensors, as described in section 3.1. Time-only or frequency-only representations do not provide sufficient sparsity, so a time-frequency transform is necessary for the sources to be assumed sparse.

A method presented in [28] uses the time-frequency representation of the short-time Fourier transform (STFT) for BSS. This method was proposed for speech signals and uses masking to separate the sources. The sources are assumed to be approximately disjoint, that is, approximately non-overlapping, in the time-frequency domain. Masks are created using magnitude and phase information of the mixtures. Since the assumption is made that time lag is negligible for EEG data, the phase information can be ignored in application to EEG data, and in this way the method can be extended to other time-frequency representations. This algorithm can extract more sources than sensors, but it only considers the case in which two mixtures are provided.

Sources s_1(t) through s_n(t) are represented in the two mixtures, x_1(t) and x_2(t), at different scales because of their different locations. The scales in the first mixture are taken to be one, so that the mixtures can be represented as follows:

x_1(t) = s_1(t) + s_2(t) + ... + s_n(t),   (2.22)

x_2(t) = a_1 s_1(t) + a_2 s_2(t) + ... + a_n s_n(t).   (2.23)

The STFTs of the mixtures are then:

X_1(t, \omega) = S_1(t, \omega) + S_2(t, \omega) + ... + S_n(t, \omega),   (2.24)

X_2(t, \omega) = a_1 S_1(t, \omega) + a_2 S_2(t, \omega) + ... + a_n S_n(t, \omega).   (2.25)

The ratio R_{21} is defined as

R_{21}(t, \omega) = \frac{X_2(t, \omega)}{X_1(t, \omega)}.   (2.26)

In this way, if the sources are completely time and frequency disjoint, only one source is active at any point of R_{21}, and its value is the scale of the active source. All points at which a given source is active result in the same ratio, giving that source's scale. If source m is active, then R_{21} = a_m, and by creating a mask for each different value in R_{21}, the sources can be separated from the original mixtures. If the sources are only approximately time and frequency disjoint, then one source is dominant at any point. If the values of R_{21} are plotted in a histogram, there will be a peak for each source, located approximately at that source's scale factor, and all values near a peak should belong to that source.
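A toy version of this histogram-and-mask procedure is sketched below for two real mixtures. The SciPy STFT, the histogram range, and the nearest-peak mask assignment are implementation choices of this sketch; a careful implementation would use a proper peak detector on the histogram rather than simply taking the largest bins.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_by_masking(x1, x2, n_sources):
    """Mask-based separation of two mixtures (equations 2.24-2.26).

    Assumes positive scale factors and approximately disjoint sources
    in the time-frequency plane, so phase information is ignored.
    """
    _, _, X1 = stft(x1)
    _, _, X2 = stft(x2)
    R21 = np.abs(X2) / (np.abs(X1) + 1e-12)       # the ratio of equation (2.26)
    # histogram of the ratios: one peak per source near its scale a_m
    hist, edges = np.histogram(R21.ravel(), bins=200, range=(0.0, 4.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    peaks = np.sort(centers[np.argsort(hist)[-n_sources:]])
    # assign each time-frequency point to the source with the closest scale
    assign = np.abs(R21[None] - peaks[:, None, None]).argmin(axis=0)
    return [istft(X1 * (assign == m))[1] for m in range(n_sources)]
```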
A second approach to BSS uses the nonlinear (quadratic) time-frequency distributions. Blind source separation based on spatial time-frequency distributions achieves separation by joint diagonalization of the auto-terms in the spatial time-frequency distributions [29, 30, 31, 32]. The discrete-time implementation of a Cohen's class TFD for a signal x_1(t) is [16]:

D_{x_1 x_1}(t, \omega) = \sum_{l=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} \psi(m, l)\, x_1(t+m+l)\, x_1^{*}(t+m-l)\, e^{-j2\omega l},   (2.27)

with t the time index and \omega the frequency index. The cross-TFD of two signals x_1(t) and x_2(t) is then defined by:

D_{x_1 x_2}(t, \omega) = \sum_{l=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} \psi(m, l)\, x_1(t+m+l)\, x_2^{*}(t+m-l)\, e^{-j2\omega l}.   (2.28)

Equations (2.27) and (2.28) are used to define the spatial time-frequency distribution (STFD) matrix as:

D_{\mathbf{xx}}(t, \omega) = \sum_{l=-\infty}^{\infty} \sum_{m=-\infty}^{\infty} \psi(m, l)\, \mathbf{x}(t+m+l)\, \mathbf{x}^{H}(t+m-l)\, e^{-j2\omega l}.   (2.29)

In [29], the method is a two-step process: first, the mixing matrix A is transformed into a unitary matrix U via whitening; second, U is retrieved through joint diagonalization of a set of whitened data STFD matrices. Given that all source signals s_i(t) are mutually uncorrelated, the whitening matrix W can be determined from R, as defined in (2.10), through:

W (R - \sigma^2 I) W^T = W A A^T W^T = I,   (2.30)

where \sigma^2 is the noise variance. The whitened STFD matrix then satisfies:

D_{\mathbf{zz}}(t, \omega) = W D_{\mathbf{xx}}(t, \omega) W^T = U D_{\mathbf{ss}}(t, \omega) U^T,   (2.31)

with z(t) = Wx(t) the whitened data vector and U = WA unitary. The source signals are estimated as \hat{s}(t) = U^T W x(t). This approach facilitates separation of Gaussian sources with identical spectral shapes but different time-frequency localization properties. However, the method requires extensive computation and a priori knowledge about the structure of the signals.

Bofill and Zibulevsky developed a two-stage approach to this problem in [33] under the model of equation (2.1), specifically for the case of two mixtures, M = 2. The data is represented in the STFT domain to improve sparsity over the time-domain representation. Let x_1(t) and x_2(t) represent the measurements at the two sensors at time t. When the sources are sparse, most sources contribute almost nothing to the measured mixtures at any measured point. When one source, s_j(t), is strong, it is likely that the remaining sources are close to zero, and the measured vector, x(t), lines up with the direction of the corresponding column of the mixing matrix, a_j. In the two-mixture case, a scatter plot of x_1(t) vs. x_2(t) shows a distinct direction for each source if the sources are sufficiently sparse and are mixed in independent directions. Using polar coordinates, each measured data point is given a radius, l_t, and an angle, \theta_t. They define a global potential function:

\Theta(\theta) = \sum_t l_t\, \phi(\lambda(\theta - \theta_t)),   (2.32)

with

\phi(a) = \begin{cases} 1 - \frac{4|a|}{\pi}, & |a| < \pi/4 \\ 0, & \text{elsewhere.} \end{cases}   (2.33)

Evaluating this function over all values of \theta gives peaks along the angles of the columns a_j. The radius, l_t, acts as a weight so that more reliable data points count for more, and \lambda adjusts the angular width of the peaks. The number of peaks found serves as the estimate of the number of sources and of the columns of the mixing matrix. The mixing matrix is then used to estimate the underlying sources, following the same procedure outlined in section 3.4.
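The potential function of equations (2.32) and (2.33) is direct to evaluate; the sketch below scans the angle over a grid for a 2 x P matrix of transformed data. The grid size and the default value of lambda are illustrative choices of this sketch, not parameters reported in [33].

```python
import numpy as np

def global_potential(X, lam=30.0, n_theta=720):
    """Bofill-Zibulevsky potential over angle for two mixtures (eq. 2.32).

    X: (2, P) matrix of transformed sensor data.  Peaks of the returned
    potential mark the directions (angles) of the mixing-matrix columns.
    """
    radius = np.hypot(X[0], X[1])            # weight l_t of each data point
    angle = np.arctan2(X[1], X[0])           # angle theta_t of each data point

    def phi(a):                              # triangular basis function (eq. 2.33)
        return np.where(np.abs(a) < np.pi / 4, 1.0 - 4.0 * np.abs(a) / np.pi, 0.0)

    thetas = np.linspace(-np.pi, np.pi, n_theta)
    potential = np.array([np.sum(radius * phi(lam * (th - angle))) for th in thetas])
    return thetas, potential
```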
Proposed in [34] is another two-stage approach to BSS: an iterative algorithm that first estimates the mixing matrix and from this estimates the original sources. It is two stages because it assumes an underdetermined case, that is, there are more sources than sensors. The mixtures are examined in a wavelet packet domain in the hope of ensuring sparsity. Similar to the masking technique, this algorithm relies on the sources being approximately disjoint.

First, the wavelet packet coefficients of the observation matrix X are calculated as \tilde{X}, so that there are n observations, and each row of \tilde{X} is the wavelet packet representation (of dimension N) of one observation. Second, a ratio matrix is created from the wavelet packet coefficients by dividing each row of \tilde{X} element-wise by row q:

\tilde{X}^{(q)} = \left[ \frac{\tilde{x}_i(j)}{\tilde{x}_q(j)} \right]_{i=1,\dots,n;\; j=1,\dots,N},   (2.34)

with q \in \{1, \dots, n\}. A submatrix of \tilde{X}^{(q)}, formed from K columns j_1, \dots, j_K, is then found whose columns are (approximately) identical:

\left[ \frac{\tilde{x}_i(j_1)}{\tilde{x}_q(j_1)} \;\cdots\; \frac{\tilde{x}_i(j_K)}{\tilde{x}_q(j_K)} \right] \approx \left[ \frac{a_{il}}{a_{ql}} \;\cdots\; \frac{a_{il}}{a_{ql}} \right],   (2.35)

so that the common column, [a_{1l}/a_{ql}, \dots, a_{nl}/a_{ql}]^T, represents the lth column of the mixing matrix up to scaling. Ideally, without noise, this submatrix would have exactly identical columns. In real conditions noise is present, so the submatrix has only approximately identical columns, and the mean of these columns is taken as the estimate of a column of the mixing matrix. Another such submatrix of \tilde{X}^{(q)} is found with a different set of approximately identical columns and is used to estimate another column of the mixing matrix. This process is repeated, each time estimating a column of the mixing matrix, until there are no more unique submatrices. Next, q is incremented to form a new ratio matrix, and the process repeats until q = n. All estimated columns of the mixing matrix are then collected; redundant columns are eliminated and a final estimate of the mixing matrix is obtained. After this, linear programming can be used to estimate the sources; an approach to this is outlined in section 3.4.

Another two-stage approach of this kind is proposed in [35]. It likewise assumes the underdetermined case as well as sparsity, and it also works with wavelet packets. To estimate the mixing matrix, K-means clustering is used; its implementation is much simpler than the iterations used in [34]. This algorithm is described in Chapter 3.

The method proposed in this thesis is an extension of the algorithm from [35]. Instead of working in the wavelet packet domain, it is proposed to work with the higher-resolution TFDs of Cohen's class. It is assumed that the added sparsity of the TFD relative to the wavelet packets will provide better results and make a more complex algorithm unnecessary.

CHAPTER 3

UNDERDETERMINED SOURCE SEPARATION IN TIME-FREQUENCY DOMAIN

An extension of the underdetermined blind source separation method proposed in [35] is presented here. An overview of underdetermined blind source separation is given first. Next, the two-stage approach used to solve the BSS problem in the underdetermined case is presented: the first stage estimates the mixing matrix, and the second estimates the sources.

3.1 Underdetermined Blind Source Separation

Underdetermined Blind Source Separation (UBSS) is a special case of the mixing model presented in section 2.1 in which the number of sources exceeds the number of sensors, N > M. In this case, estimating the mixing matrix alone is not enough to estimate the underlying sources, since the mixing matrix is not invertible as in the determined case. It is necessary to make assumptions about the data to overcome this, and it is increasingly popular to assume that the sources are sparse in a particular representation [19, 36, 37]. For a source to be sparse, most data points must be close to zero, so the pdf of its values must peak at zero and have long tails. Because of this, the Laplacian distribution is often used to model the pdf of a sparse source [33].
Representing the sensor readings in the time or frequency domain alone does not often induce sparsity, so wavelet packets and TFDs are often employed. It was shown in [38] that TFDs can provide more robust separation under noisy conditions, so TFDs are used in this application. If the sources are sparse, they are also more likely to be non-overlapping, or approximately disjoint, which makes estimation of the mixing process easier.

3.2 Determination of the Mixing Matrix

The mixing matrix to be solved for is A in

X = AS,   (3.1)

where X is the M x P observation matrix with each row being the vectorized TFD of one electrode, S is the N x P source matrix with each row being the vectorized TFD of one source, and each column of both X and S represents all values at one joint time-frequency point. The mixing matrix here actually represents the element-by-element square of the mixing matrix of the corresponding time-domain mixing problem. The number of sources is assumed greater than the number of electrodes, N > M.

The estimation of the mixing matrix relies on a representation of the data that renders the sources sufficiently sparse. In [35], it was proposed to work with the wavelet packet (WP) representation to provide sparsity of the sources. We propose to work with a TFD representation because of the increased time and frequency resolution of the data, ensuring a sparser representation. The use of WP and TFD representations was compared in [38], and these results are shown in section 3.5.

Due to the sparsity of the source signals in the time-frequency domain, it is likely that there exist many columns of S with only one nonzero entry. For instance, suppose that s_{1_j}, ..., s_{K_j} are K columns of S where only the jth entry of each of these columns is nonzero, and take j to index the first entry of each column. Then it follows that

A s_{i_j} = a_1 s_{1 i_j},   i = 1, ..., K,   (3.2)

with s_{1 i_j} representing the first element of column s_{i_j}, since it is the only active element in that column, and

[x_{1_j}, ..., x_{K_j}] = A [s_{1_j}, ..., s_{K_j}] = [a_1 s_{1 1_j}, ..., a_1 s_{1 K_j}],   (3.3)

where x_{i_j} is the column of X corresponding to s_{i_j}, a_1 is the first column of A, and s_{1 i_j} is the first entry of s_{i_j}. From equation (3.3), we see that each x_{i_j} equals a_1 multiplied by a scalar, s_{1 i_j}, which means that these K column vectors of X, x_{1_j}, ..., x_{K_j}, are distributed along the direction of a_1. Thus, ideally, after normalization, x_{1_j}, ..., x_{K_j} are mapped to a unique vector on the multidimensional unit circle, equal to a_1.

However, in practice, the sources are likely only approximately disjoint. That is, s_{1_j}, ..., s_{K_j} are K columns of S with the jth entry dominant, so that s_{j i_j} >> s_{p i_j}, p != j, where j is constant. When more than one source is nonzero, x_{1_j}, ..., x_{K_j} are not exactly in the same direction as a_1, but rather in the neighborhood of a_1. This means that a_1 lies at the center of x_{1_j}, ..., x_{K_j}. Therefore, the K-means clustering method presented in the following section can be used to cluster the column vectors of the mixture matrix X into multiple clusters, where the center of each cluster corresponds to one column vector of the mixing matrix A. By doing so, an estimate of the mixing matrix A is obtained, where each column, a_i, is estimated by one of the resulting cluster centers.
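The clustering stage just described reduces to a few lines with an off-the-shelf K-means routine. The energy thresholding that discards near-zero columns before clustering is an assumption of this sketch (near-zero columns carry no directional information); the text simply speaks of the significant data points.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def estimate_mixing_matrix(X, k, energy_quantile=0.9):
    """Estimate A by clustering the normalized columns of the TFD mixture X.

    X: (M, P) matrix, one vectorized TFD per row; k: assumed number of
    sources.  Returns an M x k matrix of normalized column estimates.
    """
    energy = np.linalg.norm(X, axis=0)
    cols = X[:, energy > np.quantile(energy, energy_quantile)]
    cols = cols / np.linalg.norm(cols, axis=0)     # map points onto the unit sphere
    centers, _ = kmeans2(cols.T, k, minit='++', seed=0)
    A_hat = centers.T
    return A_hat / np.linalg.norm(A_hat, axis=0)   # cluster centers as columns of A
```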
3.3 K-means Clustering

K-means clustering is an iterative algorithm that seeks to minimize a squared-error criterion function in order to separate a completely unknown set of data into k different groupings [39]. Suppose that x_1, x_2, ..., x_n are vector observations in a data set and are realizations of k different distributions of random variables. Then \mu_1, \mu_2, ..., \mu_k are the mean vectors of these distributions, and K-means seeks to categorize the observations, x_i, into one of the k distributions such that the squared Euclidean distance, ||x_i - \mu_j||^2, is minimized. However, since the properties of the data set are unknown, \mu_1, \mu_2, ..., \mu_k must be estimated first, as \hat{\mu}_1, \hat{\mu}_2, ..., \hat{\mu}_k. As a starting point, k random samples of the data are chosen as the initial mean estimates, \hat{\mu}_j. The distributions are then estimated by classifying each point, x_i, into the group whose estimated mean it is closest to in the squared Euclidean sense, so that x_i belongs to the group of \hat{\mu}_j when j is subject to

\min_j ||x_i - \hat{\mu}_j||^2.   (3.4)

Once all data points are classified, the mean of each group is recalculated. Suppose m_j is the number of data points in the jth distribution and x_{1_j}, x_{2_j}, ..., x_{m_j} are those points; the new mean is then calculated as \hat{\mu}_j = \frac{1}{m_j} \sum_{i=1}^{m_j} x_{i_j}. This process is repeated until convergence, when the estimated means do not change upon further iterations.

3.4 Estimation of the Source Signals for a Given Mixing Matrix

After obtaining the estimated mixing matrix, the next stage is to estimate the source signals. For a given mixing matrix A in equation (2.2), the source signals can be estimated by maximizing the posterior distribution P(S|X, A) of S. In general, there is no unique solution to this problem, so the posterior is maximized such that a maximally sparse set of sources is found. Under the assumption that the prior is Laplacian, maximizing the posterior distribution can be implemented by solving the following optimization problem [40]:

\min \sum_{i=1}^{N} \sum_{j=1}^{P} |s_{ij}|, \quad \text{subject to } AS = X,   (3.5)

with N the number of sources and P the number of time-frequency points. Hence, the l1 norm

J_1(S) = \sum_{i=1}^{N} \sum_{j=1}^{P} |s_{ij}|   (3.6)

can be used as the sparsity measure. The l1 norm is preferred to the l0 norm, which is the actual level of sparsity, because the optimization is NP-hard for the l0 norm but can be solved easily for the l1 norm using linear programming. It is not difficult to prove that the optimization problem (3.5) is equivalent to the following set of P smaller-scale linear programming (LP) problems:

\min \sum_{i=1}^{N} |s_{ij}|, \quad \text{subject to } A\mathbf{s}_j = \mathbf{x}_j, \quad \text{for } j = 1, \dots, P.   (3.7)

In this way, the contribution to a column of X at one time-frequency point should be dominated by one source in order to minimize the l1 norm, giving the sparse solution.
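Each subproblem in (3.7) is a standard l1 linear program, solvable by the usual splitting s = u - v with u, v >= 0. The following is a minimal sketch using SciPy's LP solver; the function names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def l1_sources_column(A, x):
    """Solve min ||s||_1 subject to A s = x (one LP of equation 3.7).

    Split s = u - v with u, v >= 0 and minimize sum(u) + sum(v).
    """
    M, N = A.shape
    c = np.ones(2 * N)                      # objective: sum of u and v
    A_eq = np.hstack([A, -A])               # equality constraint A(u - v) = x
    res = linprog(c, A_eq=A_eq, b_eq=x,
                  bounds=[(0, None)] * (2 * N), method='highs')
    u, v = res.x[:N], res.x[N:]
    return u - v

def estimate_sources(A, X):
    """Column-by-column l1 estimation of S for a given mixing matrix A."""
    return np.column_stack([l1_sources_column(A, X[:, j])
                            for j in range(X.shape[1])])
```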
Finally, we propose the following algorithm for estimating the source signals:

Algorithm:

1. Using the collected data [x_1(t), x_2(t), ..., x_M(t)]^T, obtain M TFDs and vectorize each to obtain the TFD mixture X.

2. Normalize the column vectors of the TFD mixture X to obtain \bar{X}.

3. Take a sufficiently large positive integer k as the number of clusters, which is also the number of sources to estimate. Choose the initial points of the iteration and the distance measure criterion. In this thesis, the squared Euclidean distance is chosen as the criterion.

4. Do K-means clustering on \bar{X}, followed by normalization, to estimate the suboptimal mixing matrix \hat{A}.

5. Using the estimated mixing matrix \hat{A} and the mixtures X, estimate the time-frequency representations \hat{S} by solving the set of LP problems in equation (3.7).

The result is a matrix, \hat{S}, in which each row represents a vectorized TFD of an extracted sparse component. The number of sources to be extracted is defined by the user, so a sufficiently large number must be chosen. It is shown in [35] that if k is chosen larger than the actual number of sources, the "extra" extracted sources appear as spurious noise sources and can be ignored. The l1 norm is used as a sparseness measure to ensure a unique solution, and the solution will be reliable if the sources are approximately disjoint.

3.5 Comparison Between Wavelet Packets and Time-Frequency Distributions

In this section, several examples illustrate the effectiveness of the proposed approach at separating sparse source signals from fewer mixtures in the time-frequency domain. The binomial kernel [16] is used for computing the TFD, since it belongs to the class of reduced interference distributions (RIDs).

Example 1: The set of observed signals consists of two linear combinations of four Gabor logons, centered at the time sample point and normalized frequency (30, 0.7), (160, -0.7), (70, -0.4), and (120, 0.1), respectively. Each observed signal is first transformed to the time-frequency domain with I = 50 time samples and L = 64 frequency samples. Each TFD is then vectorized to form a TFD mixture matrix X = [X1; X2] of size 2 x 3200.
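The Example 1 setup is easy to reproduce; a sketch generating the four logons and the two mixtures is below. The Gaussian scale, the complex modulation (so that negative normalized frequencies are distinguishable), and the mixing coefficients are illustrative assumptions of this sketch; the thesis does not list the mixing matrix used.

```python
import numpy as np

def logon(n, t0, f0, scale=10.0):
    """Gabor logon at time t0 and normalized frequency f0 in [-1, 1]."""
    t = np.arange(n)
    return np.exp(-0.5 * ((t - t0) / scale) ** 2) * np.exp(1j * np.pi * f0 * t)

n = 200
centers = [(30, 0.7), (160, -0.7), (70, -0.4), (120, 0.1)]  # (time, norm. frequency)
sources = np.array([logon(n, t0, f0) for t0, f0 in centers])

A = np.array([[1.0, 0.6, 0.3, 0.9],
              [0.4, 1.0, 0.8, 0.2]])       # illustrative 2 x 4 mixing matrix
x1, x2 = A @ sources                       # the two observed mixtures
```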
The average mean squared error (MSE) between the extracted sources and the original sources is calculated for both the TFD and WP. The TFD and WP provide similar results when there is no noise. However, as the noise level increases, the performance of the WP rapidly degrades compared to that of the TFD. The MSE versus the signal-to-noise ratio (SN R) is shown in Figure 3.5 for both the TFD and WP. This result shows that the RID results in a more sparse time-frequency surface compared to the WP, which improves the robustness of BSS under noise. 38 0.3 . . r . . r 025- ° 0.2- 0.15- .. 4 0.1r 9: 0.05 i - l . 0: ..A' « -0.05» -01:- . q -0.15- « '02 -01 -o.05 o x1005 01.1 ois 0.2 Figure 3.1. Scatter plot of two mixtures of four Gabor logons in the time-frequency domain 39 0 20 40 60 80 100 120 140 160 180 200 120 140 160 180 200 Normalized Frequency O N O .5 o O) o co <3 0 20 40 60 80 100 120 140 160 180 200 0 20 4O 60 80 100 120 140 160 180 200 o 20 4o 60 so 100 120 140 160 180 (0 Time § Figure 3.2. The mixtures and the separation of four Gabor logons: (a) and (b) two mixtures; (c), (d), (e), and (f) four extracted Gabor logons 40 1.2 - ‘ .’ l f: " 0.8 . '. A t' I 0.6 ' 1 f :' 4 0.4 ’ . .0. :1 : x 0.2 ~ 1 . . O 0 .. . o ’0 ' ° ° .' . ' . o. .( I. -o.2 - . ] -o.4 - . -0.6 -0'8 1 l l l L l I -0.06 -0.04 -0.02 0 0.02 0.04 0.06 0.08 0.1 X1 Figure 3.3. Scatter plot of two mixtures of a chirp and two Gabor logons in the time-frequency domain 41 0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200 Normalized Frequency o 20 40 60 80 100 120 140 160 180 200 (d) o 20 40 60 80 100 120 140 160 180 200 (6) Time Figure 3.4. The mixtures and the separation of a chirp and two Gabor logons: (a) and (b) two mixtures; (c) extracted chirp; (d) and (e) two extracted Gabor logons 42 MSE Figure 3.5. Comparison of MSE versus SNR for extracted sources with TFD and WP 1.8 - -o- ~TFD _ Wavelet - 1.6 - Packet \ \ 1.4 ~ N \ \ 1.2 ~ \ a \ 1 ~ . \ EL \ 0.8 - \ \ 1 .‘Q' \ o . ‘ 1x 0.6 '\o~ ‘ ‘ \ 0.4 — f ' F 5 3:0 ‘El 0.2 l l l l l l -2 2 6 8 10 12 14 16 SNR (dB) 43 CHAPTER 4 SOURCE SEPARATION RESULTS FOR ERP SIGNALS 4.1 Data The ERP data analyzed in this thesis was recorded at Ormond and Hazel Hunt ERP Lab at the University of Michigan. The study consisted of 10 Subjects and 6 electrodes. The electrodes that were used in the recording are F3, F4, 02, P3, P4, and Oz. The subjects are shown visual stimuli in the form of words related to their particular psychopathologoy. Two sets of words are presented to each subject, prime and target. The prime stimulus is always presented subliminally with stimulus duration less than 5ms. One second after the prime stimulus, the target stimulus is. presented supraliminally, i.e. the subject is aware of the stimulus. Three groups of words are uniquely developed for each subject. One category, unpleasant word, is the same for all patients, where each word is a generally unpleasant word, such as ’cheating’, ’cancer’, or ’lying’. The other two categories are unique to the subject’s condition. One category is the conscious conflict category, which the words were used by the patient to describe his/ her condition. Examples could be from a subject who suffers from a public eating phobia, such as ’swallowing’, ’cafeteria’, or ’headache’. The final category is unconscious conflict words. Experts develop a list of words related to the condition. Examples for the same subject are ’massaging muscle’, ’ripped apart’, or ’on my back’. 
In this study there are two word categories each for the prime and the target stimuli: the prime stimuli belong to either the unconscious conflict or the conscious conflict category, whereas the target stimuli belong to the conscious conflict and unpleasant categories. Each of the four resulting stimulus pairs is repeated 49 times, for a total of 196 trials per electrode per subject. The data used in the analysis is 1 s in duration, beginning at the presentation of the supraliminal stimulus, and only the data from one subject was used.

4.2 Single-Trial ERP

The goal of single-trial ERP analysis is to extract the individual underlying sources in the brain, each generated in a localized area. With successful source extraction, the individual responses of the brain can be analyzed, and the dynamic variability of the ERP responses can be compared on a trial-to-trial basis. In this way, observations can be made on all factors affecting a subject's performance.

A comparison is made between the algorithm outlined in section 3.4 and ICA applied to the same data. Both BSS techniques are applied to all 196 available trials. In applying the two-stage approach, the number of sources to extract, k, must first be chosen. This value was chosen empirically, subject to being greater than the number of electrodes, 6. Multiple trials were run for each candidate k: if none of the extracted sources appeared spurious, k was incremented, since it was shown in [35] that choosing k larger than the actual number of sources still results in successful extraction. Once sources showing only spurious activity began to appear as k increased, incrementation was stopped and that k was chosen. Experiments were done using 32 frequency bins, for which k was 8, and using 128 frequency bins, for which k was 14 and 16. ICA was then applied to the same data; since only 6 mixtures are available, ICA can extract only 6 components per trial. The ICA results are in the time domain, so for comparison they are converted to the time-frequency plane at the corresponding frequency resolution, using 32 or 128 frequency bins.

Examples of single-trial results are shown in Figure 4.1, Figure 4.2, and Figure 4.3. Figure 4.1 shows the results of one trial when 32 frequency bins were used in the TFD, while Figure 4.2 and Figure 4.3 each show results for one trial with 128 frequency bins. Similar results are obtained over all 196 trials. Using more frequency bins provides more data points, which allows a more robust separation so that more sources can be extracted; this comes at the cost of more computation time. The sources from the proposed technique generally show less activity, i.e., more sparsity, in the time-frequency plane than the sources from ICA, and as more sources are extracted with the proposed technique, the source representation becomes sparser. It is, however, difficult to compare results at the single-trial level, since the true underlying source generators are not known and a different number of components is extracted by each technique. It is also difficult because there are 196 individual trials to quantify, so the results must be generalized.
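The empirical choice of k described above amounts to a simple search loop. The sketch below is a schematic Python rendering: extract_sources is a hypothetical handle standing in for the two-stage method of Chapter 3, and the energy threshold is an assumed proxy for the visual judgment of "spurious" activity used in practice.

```python
import numpy as np

def select_k(extract_sources, X, k_min=7, k_max=20, spurious_frac=0.01):
    """Empirical selection of the number of sources k (section 4.2).

    extract_sources(X, k) is a hypothetical handle to the two-stage
    UBSS method returning a k x P matrix of component TFDs. A component
    is flagged as 'spurious' when it carries less than spurious_frac of
    the total extracted energy; this threshold is an assumed stand-in
    for the visual inspection actually used in the thesis.
    """
    chosen = k_min
    for k in range(k_min, k_max + 1):
        S = extract_sources(X, k)           # k x P component TFDs
        energy = np.sum(S ** 2, axis=1)     # energy of each component
        if np.any(energy < spurious_frac * energy.sum()):
            break                           # spurious components appeared
        chosen = k                          # largest k with none spurious
    return chosen
```

Starting at k_min = 7 reflects the requirement that k exceed the 6 electrodes; the upper bound is only a guard against an unbounded search.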
4.3 Measures of Evaluation

To evaluate the performance of ICA and the proposed UBSS method, the single-trial results are first gathered into groups according to stimulus type. K-means clustering is carried out over all components extracted from the subject, and the resulting cluster centers represent components that recur across trials. These cluster centers are thus representative of the most prevalent sources extracted throughout the trials for each stimulus, and they are evaluated in an attempt to compare the general results of ICA with those of the proposed method.

Figure 4.1. Single trial results using 32 frequency bins. (a) 6 extracted sources from ICA. (b) 8 extracted sources from the proposed method.

Figure 4.2. Single trial results using 128 frequency bins. (a) 6 extracted sources from ICA. (b) 14 extracted sources from the proposed method.

Figure 4.3. Single trial results using 128 frequency bins. (a) 6 extracted sources from ICA. (b) 16 extracted sources from the proposed method.

4.3.1 Data Reduction

The results of one trial are represented by a matrix $S_v$ of size $N \times P$: each time-frequency component is vectorized to form a vector of length P, equal in our case to either 2112 or 8256, and these vectors are stacked as the rows of $S_v$. The matrix thus holds the N components extracted from trial v, each over P time-frequency points. To reduce all results for a particular stimulus, the matrices from all trials are appended to form a new matrix $\tilde{S}_u$ such that
50"T‘T"':;- '27:; _: :- - - "‘ O I L‘L’ 4* «ELL 1" 200 400 600 800 1000 ICA. (b) 14 extracted sources from the proposed method. 48 100 100 I “=3.I=..———==:* '- i: #:J—g. .:--J - T':—- 1:« '—..:.. : so, an» _. _ 1 50 T. .- . ‘ : g: "1 fit; g... " i o 0 Le """" ' 200 400 600 800 1 000 200 400 600 800 1 000 = ‘r _._._.. ~;—— 5 10°F ...___. ‘00 'F '“ _____=" " *‘a f 6‘ _~—_ .— _d "' l - .- - 3 P’ F——--7 . 60 ‘ — .. 50 0- — - . ‘- _._.._.._-4-_ , '2, - 511.-., _ - l u. 0 _ ML 0 _— 200 400 600 800 1 000 20 400 600 800 1 000 i ‘ r ._—_. -' .._. 1 .._. -—-:_.__..."' — " I ‘°°t —-'--~~—~ ,. °° .. _ _. r.» ___—__,~___.;=_______7_:__.______~ - - 04—.- 50 it - ___..._ _q-r- -_—‘ fl .. 50 , — -—- ...- __.~_. ‘. ":i -' ‘J ' ’ _ O ‘ " O '— 200 400 600 800 1 000 200 400 600 800 1 OOO Tim. (me) (a) —L c388 088 c388 088 100' - ._ - 1:: "if. n. 200 400 600 800 1000 1 200 400 600 800 1 000 100I I -"~___" ' ‘:J. ‘3 9‘ 58 i «F: ‘°' ‘14:... Tjt. -—-l—-'--.- . J 200 400 600 800 1 000 100 ' “‘— ._-—-;'-.:fi4. r ' i __ 4 d g "5 _‘——- .33 . 200 400 600 800 1000 —L ..—-‘ 4 “fl ‘_ *- .u_gfi— - _ 50 i ; .— ' - ‘3’“- _—'- — - - ~— .— -:. - .__. __ .— _- .. - -_ '- - 200 400 600 800 1000 8 O f 1 1 l .— —-v- ~ _ _ _— - — a ‘_f__ —,.—_ ‘ .A _ -— _- --*. ---— .- ._ . . i. - _—,_,-.— .— 1 “W, T E‘ . Pr" . . Frequency bin 01 O 200 400 600 800 1000 200 400 600 800 1000 100 r- H-..“ ... 100 ...- “"2 “A l 53 - - 7;- 53 -i '1'" T“ 300 400 600 800 1000 200 400 600 800 1000 - ,_ - , 100! . 11-..: .- _L’ I- 100 '-‘ ' z.._ '-“ _. " ' d' ’7 ~ , _7_.t.__.~-—V‘—- : ‘ .- .. - , *, - 2.: .._:_;_ i—‘z: _— 50’ - - _- * - 50 . h... ~_-.-«-—~ «~21, T - - I ~- H g - . . “_i’ . O n 200 400 600 800 1000 200 400 600 100 Tum—”F : 2.; r. _T ‘ 1;- ; 100 ' ; ' L: i; : “:5 v.1}; 50 .5 2...: :1. :,— 3":‘1‘ a; 50 P . ‘ __ #9.“... - _ _v—fi. ,- ‘il- -’ :- ~ -- . g; as ,_ _. - » , r ra-v J!- 4 200 400 600 800 1000 200 400 600 800 1000 100 I - _ ..‘ L“_ 100 =_._ t ‘ L 50' ‘-4'-“I“‘€:¢ . 50» :2 3‘6”“ . - *- ) -. , J- .1 + _L.“_‘ .m— - O r . . .. 200 400 600 800 1000 200 400 600 800 1000 Time (ms) (b) Figure 4.3. Single trial results using 128 frequency bins. (a) 6 extracted sources from ICA. (b) 16 extracted sources from the proposed method. 49 that ' sin) si

$$
\tilde{S}_u = \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_{49} \end{bmatrix}
= \begin{bmatrix} s_1^1(1) & \cdots & s_1^1(P) \\ \vdots & & \vdots \\ s_N^{49}(1) & \cdots & s_N^{49}(P) \end{bmatrix}, \qquad (4.1)
$$

where $u \in \{1, 2, 3, 4\}$ indexes the stimulus pair and $\tilde{S}_u$ is of size $49N \times P$. Each element $s_n^v(j)$ is one time-frequency point of source n from trial v.

K-means clustering is then carried out on each $\tilde{S}_u$: each of its rows is assigned to one of K clusters based on its squared Euclidean distance to the cluster center, as described in section 3.3. The clustering algorithm is run 10 times to reduce the influence of random initialization on the final cluster results, and K is run at 8, 10, and 12 to get an idea of how a different number of components may affect the outcome.

The rows of $\tilde{S}_u$ are then grouped by a hierarchical clustering method based on the results of the 10 K-means runs. A $49N \times 49N$ matrix R of zeros is created and updated iteratively: the entry $r_{ij}$ counts how many times out of 10 row i of $\tilde{S}_u$ was grouped into the same cluster as row j. This matrix serves as a similarity measure; the more often two sources were grouped together by K-means, the more similar they are. The diagonal entries $r_{ii}$, which count how often each source was grouped with itself, are always 10 and are therefore ignored.

A hierarchical clustering is then carried out using the similarity matrix R as its distance metric. In the first step, each row of $\tilde{S}_u$ is its own cluster. The second step groups together all rows with a similarity value of 10 in R. Next, all rows with a similarity of 9 are grouped; if a group already exists, the average similarity between a row and all rows already in the cluster is used. Each subsequent step merges a cluster with another cluster or with an individual row when that merge has the highest remaining similarity value; otherwise, all rows at the next lower similarity value (8, then 7, and so on) are grouped. This is repeated until the number of clusters is reduced to K. Cluster centers are then calculated as the mean of the time-frequency components in each cluster; these centers are the components that summarize all single-trial ERP results. An example of a set of extracted components is shown in Figure 4.4.

4.3.2 Data Variance Explained

After the extracted single-trial components are reduced by clustering, a comparison of component quality is sought: the K components obtained from all trials of the two-stage approach must be compared with the K components obtained from all trials of ICA. A better representation of the underlying activity should be able to explain more of the variance of the original data. The extracted clusters are represented by the $K \times P$ matrix $C_u$, $u \in \{1, 2, 3, 4\}$. An orthogonal basis for the space spanned by these components is found by Gram-Schmidt orthogonalization, producing the $K \times P$ orthogonal matrix $\tilde{C}_u$. These orthogonal components are projected back onto the original electrode measurements $X_u^l$, $u \in \{1, 2, 3, 4\}$, $l \in \{1, \ldots, 6\}$. This is calculated by

$$
D_u^l = \left( \tilde{C}_u (X_u^l)^T \right)^2
= \begin{bmatrix}
\bigl(\sum_{j=1}^{P} \tilde{c}_u^1(j)\, x_u^{l,1}(j)\bigr)^2 & \cdots & \bigl(\sum_{j=1}^{P} \tilde{c}_u^1(j)\, x_u^{l,49}(j)\bigr)^2 \\
\vdots & & \vdots \\
\bigl(\sum_{j=1}^{P} \tilde{c}_u^K(j)\, x_u^{l,1}(j)\bigr)^2 & \cdots & \bigl(\sum_{j=1}^{P} \tilde{c}_u^K(j)\, x_u^{l,49}(j)\bigr)^2
\end{bmatrix},
$$

where the square is applied element-wise, $\tilde{c}_u^k(j)$ is the j-th element of the k-th orthogonal component, and $x_u^{l,v}(j)$ is the j-th time-frequency point of trial v at electrode l.
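As an illustration of this measure, the following Python sketch orthonormalizes the cluster centers (a QR factorization playing the role of Gram-Schmidt) and computes the squared projections of each trial onto them. The thesis does not specify an implementation or a normalization, so the variance-explained ratio returned here is an assumption.

```python
import numpy as np

def variance_explained(C, X):
    """Squared projections of single-trial data onto cluster centers
    (section 4.3.2).

    C : K x P matrix of cluster centers (vectorized TFD components).
    X : 49 x P matrix of vectorized trial TFDs for one electrode and
        one stimulus pair.

    Returns D, the K x 49 matrix of squared inner products between the
    orthonormalized components and each trial, and the fraction of the
    trials' total energy captured by the component subspace. The
    normalization is an assumption; the thesis does not state one.
    """
    # QR on the transpose orthonormalizes the rows of C, playing the
    # role of Gram-Schmidt (assumes the K rows are linearly independent).
    Q, _ = np.linalg.qr(C.T)            # P x K, orthonormal columns
    D = (Q.T @ X.T) ** 2                # K x 49 squared inner products
    explained = D.sum() / np.sum(X ** 2)
    return D, explained
```

With such a measure, the K cluster centers from the two-stage approach and the K cluster centers from ICA can be compared on equal footing for each electrode and stimulus pair.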