MACHINE LEARNING APPROACHES FOR PROCESSING AND DECODING ATTENTION MODULATION OF SENSORY REPRESENTATIONS FROM EEG

By

Sari Saba-Sadiya

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science—Doctor of Philosophy
Psychology—Dual Major

2023

ABSTRACT

This thesis presents novel machine learning algorithms that achieve state-of-the-art performance on a variety of electroencephalography (EEG) tasks, including decoding, classification, and unsupervised / semi-supervised artifact detection and correction. These algorithms are then used within the scope of an EEG experiment that explores how attention to multiple items modulates sensory representations. Using a signal detection paradigm, we demonstrate that attending to multiple items impacts the sensitivity of our participants, causing a sharp increase in false-alarm rates while only slightly decreasing hit rates. We conclude that our behavioral and EEG decoding results contradict simultaneous attention guidance by multiple items (the multiple-item template hypothesis).

ACKNOWLEDGEMENTS

First and foremost, I am eternally grateful to my advisors, Dr Mohammad Ghassemi and Dr Taosheng Liu, for their tutelage, support, and dedication throughout my graduate school journey. Their vast knowledge and creative thinking have been a source of inspiration both inside and outside the lab. I would also like to thank my committee members and other academic mentors: Dr Susan Ravizza, Dr Pang-Ning Tan, Dr Tuka Alhanai, and Dr Joyce Chai, all of whom provided excellent feedback; our collaborations were instrumental to my growth as a researcher.

I acknowledge that many of my better ideas came from conversations with friends and colleagues, especially Dr Burgoyne, Dr Reynolds, Dr Ming, Dr Masrour, Dr Gao, Dr Mills, Dr Sherry, and Dr Bergan. I would also like to thank current and previous lab mates Eric Chantland, Brendan Valentine, Reza Khan Mohammadi, Shaohua Yang, Allen Williams, Niloufar Eghbali, and Sanaz Hasanzadeh. They all made this journey much more joyful, and I learned how to be a better researcher and human from each of them.

Last but not least, I would like to thank my family members. Those back home - Ahmad, Sylvia, Yara, Abed-Alaziz, and Joud - for their support from the very beginning of my academic journey. And those that became my family during my studies - Emily and Barsha - for more than I can put in words.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION . . . 1
1.1 Introduction . . . 1
1.2 Contributions . . . 4
1.3 Thesis Outline . . . 4
CHAPTER 2 EEG PREPROCESSING AND DECODING LITERATURE REVIEW . . . 6
2.1 Artifact Correction and Rejection . . . 6
2.2 Decoding EEG Signals . . . 15
CHAPTER 3 DEEP LEARNING METHODS FOR PREPROCESSING EEG . . . 25
3.1 EEG Channel Interpolation Using Deep Encoder-decoder Networks . . . 25
3.2 Unsupervised EEG Artifact Detection and Correction . . . 41
3.3 Feature Imitating Networks . . . 59
3.4 Conclusion and Future Work . . . 70
CHAPTER 4 PILOT: DECODING COLOR FROM PASSIVE VIEWING . . . 71
4.1 Pilot Experiment . . . 71
4.2 Pilot Results . . . 73
4.3 Discussion and Conclusion . . . 73
CHAPTER 5 THE NUMBER OF ATTENTIONAL TEMPLATES MODULATES SENSORY REPRESENTATIONS . . . 75
5.1 Introduction . . . 76
5.2 Literature Review . . . 79
5.3 Main Experiment . . . 97
5.4 Unsupervised Artifact Rejection With Deep Encoder-Decoders . . . 105
5.5 Results . . . 106
5.6 Discussion and Conclusion . . . 116
CHAPTER 6 GENERAL CONCLUSIONS AND REFLECTIONS . . . 120
6.1 What Was Accomplished . . . 120
6.2 Future Directions . . . 120
6.3 Reflections on Deep Learning in EEG . . . 121
6.4 Reflections on Accessibility . . . 121
BIBLIOGRAPHY . . . 122
APPENDIX A DECODING TARGET-ABSENT AND FALSE ALARM TRIALS . . . 138
APPENDIX B DECODING SIMULATED NOISE . . . 140
APPENDIX C MAHALANOBIS DISTANCE DECODING RESULTS . . . 141
APPENDIX D BAYESIAN ANALYSIS . . . 143

CHAPTER 1
INTRODUCTION

1.1 Introduction
This dissertation focuses on the two research areas I pursued during my graduate studies at Michigan State University: cognitive attention and deep representation learning. My cognitive neuroscience research utilized electroencephalography (EEG) to explore how attention modulates sensory perception. EEG decoding is challenging for many practical engineering reasons, as the signal is a consequence of an ensemble of neural activations from various processes. The difficulties are further exacerbated in the case of attention modulation, which is expressed as a latent variable embedded within the already noisy EEG signal. Successful EEG research requires developing better machine learning techniques to eliminate sources of noise and, more generally, algorithms capable of successfully deriving insights about latent variables from noisy EEG signals.

1.1.1 Electroencephalography in Attention Research
Common sense and scientific studies alike attest to the importance of attention for task performance: drivers who attend to their phones are twice as likely to experience a car accident as those who attend to the roadway exclusively [168], and eliminating distractions during study can significantly improve academic performance [33]. Controlled behavioral experiments in laboratory settings have repeatedly demonstrated that attention can improve performance in search, detection, and memory retrieval tasks [173, 186]. However, as will be discussed in section 5.2, understanding the mechanisms responsible for the behavioral effects of attention requires direct access to the latent cognitive processes behind perception.
Prior work has demonstrated that EEG signals reflect latent cognitive processes with a high degree of temporal fidelity, and are thus well suited for the study of dynamic cognitive processes such as attention. Within the last two decades, EEG experiments have corroborated many theories proposed on the basis of behavioral results; for instance, attending to an event causes us to process it faster (known as the law of prior entry) [178], and attending to a feature enhances its sensory representation [25]. Despite an ever-growing body of literature, much about cognitive attention remains unknown, and many open debates in attention research stand to benefit from the increasing ubiquity of EEG as a research modality. This dissertation utilized EEG to explore attention modulation of performance in multi-template tasks; if attention improves performance on single-target tasks, what happens when multiple items are attended to at once? In less colloquial terms: attention to a feature value enhances its sensory representation, but how does attending to multiple feature values modulate sensory representations, if at all? To successfully decode attention modulation in the EEG experiment, multiple machine learning algorithms were developed. These tools enabled more effective data preprocessing, facilitating the study of the latent attention phenomenon with greater fidelity.

1.1.2 Electroencephalography in Machine Learning Research
Electroencephalography devices are unique in being a cheap, portable, and non-invasive neuroimaging technology. Moreover, EEG has a variety of applications such as brain-computer interfaces [170, 41, 192], emotion recognition [94, 31], and medical diagnostics [56]. Considering the above, the exponential growth of "machine learning for EEG" literature in recent decades is unsurprising (see Figure 1.1).

Figure 1.1 Number of publications involving machine learning and EEG by year of publication. The figure was produced using data from the dimensions.ai analytics tool.

However, successful application of machine learning to EEG tasks remains difficult, as it requires overcoming a number of challenges inherent to EEG data:
• Low Signal-to-Noise Ratio: EEG data is extremely noisy, and any EEG classification or decoding task requires heavy preprocessing and artifact removal [146].
• Few Rows, Many Columns: Traditionally, machine learning has focused on image or text datasets that contain hundreds of thousands of data points, each represented by relatively short vectors. In contrast, EEG datasets often contain only a few dozen subjects, each having only a few hundred trials; at the same time, EEG data is extremely dense, containing thousands of measurements per second of recording.
• Data Scarcity: One consequence of EEG's low signal-to-noise ratio is that data collected from subjects is often unusable (for instance, [119] and [171] both discarded 10% of their subjects due to noise). Moreover, data collection, annotation, and maintenance all require considerable effort, and data scarcity affects EEG research reliability. This is reflected by the low sample sizes in Table 2.2.
• Temporal and Spatial Covariability: The EEG signal is global and continuous across space (all signal components appear in multiple electrodes) and time. This high level of covariance between dimensions is relatively unique.
• Inter-Subject Variability: Research shows consistent individual differences in EEG activity even when performing the same task, under the same circumstances.
These differences are so distinct that they have enabled researchers to design EEG-based user authentication methods [64]. However, this is an issue when designing any kind of application that seeks to generalize to new subjects; for instance, BCI applications.
• Inter-Task Variability: EEG classification models are often trained to decode a narrow set of labels. This inhibits the reusability of developed models. As we will discuss in Chapter 2, this weakness is inherent to discriminative, as opposed to generative, models.
• Interpretability: While high accuracy might be a priority when designing a BCI, this is not the case when the underlying goal is the study of specific cognitive phenomena; for a decoding methodology to be widely adopted it must also be interpretable.
My machine learning work focused on leveraging state-of-the-art representation learning methods to address these weaknesses. Sections 3.1 and 3.2 present algorithms that utilize data properties to learn signal representations that enable unsupervised detection and correction of corrupted data. Section 3.3 presents a novel approach for integrating expert knowledge into neural-network-based classification approaches via modular pre-trained sub-networks; that section demonstrates how this approach can mitigate scarcity and inter-task variability concerns while achieving state-of-the-art results on a variety of EEG classification tasks. Finally, it is repeatedly demonstrated that our algorithms can generalize to out-of-training subject data after simple fine-tuning.

1.2 Contributions
This thesis has three main contributions:
• The development of novel unsupervised learning methods for artifact detection and correction [145, 147, 146], facilitating easier preprocessing and alleviating data scarcity concerns.
• A novel framework for integrating expert knowledge, and insights from the cognitive neuroscience literature, into neural networks via weight initialization [143].
• A direct evaluation of differences in attention modulation of sensory representations across different attentional load conditions, together with an evaluation of how attentional load impacts performance using a signal detection theory framework.

1.3 Thesis Outline
Here we provide a brief outline of the upcoming chapters in this thesis. In Chapter 2, we provide a comprehensive review of current machine learning approaches for artifact detection and correction in EEG data, as well as popular EEG decoding algorithms. In Chapter 3, we present novel algorithms that achieve state-of-the-art performance on channel interpolation and on unsupervised artifact detection and correction. Section 3.3 also discusses a novel framework for integrating expert knowledge into neural networks that achieves state-of-the-art performance on multiple EEG classification tasks. All algorithms are based on representation learning in neural networks, and are designed to generalize to unseen data. Chapter 4 briefly describes a pilot study we conducted as a precursor to our main experiment. Finally, Chapter 5 focuses on cognitive neuroscience themes; the chapter begins with an introduction to the current debate surrounding the capacity of attention (our ability to attend to multiple items simultaneously), followed by an in-depth literature review. Sections 5.3 to 5.6 present and discuss the main experiment and how our results fit within the previous literature. Section 5.4 also demonstrates how the algorithms presented in Chapter 3 can be useful in a cognitive neuroscience setting.
Finally, the last chapter is reserved for general conclusions and reflections.

CHAPTER 2
EEG PREPROCESSING AND DECODING LITERATURE REVIEW

Parts of this chapter are adapted from a published manuscript titled "Artifact Detection and Correction in EEG data: A Review" that appeared in the 2021 proceedings of the 10th International IEEE/EMBS Conference on Neural Engineering (NER) [147]. Other sections were adapted from an unpublished manuscript that was reviewed by Dr Ghassemi and Dr Liu.

2.1 Artifact Correction and Rejection
Electroencephalography (EEG) is a non-invasive, inexpensive, and portable neuroimaging technology, but the low signal-to-noise ratio of EEG limits its ease of adoption and use for the research and commercial communities alike. The low signal-to-noise ratio of EEG is due, in part, to a variety of artifacts, including ocular artifacts from blinks and eye movements, and muscle artifacts from movements. While EEG data is affordable to collect, it is challenging to use in practice because artifact correction is a necessary prerequisite for meaningful use.
To reduce the human labor associated with EEG experimentation (and the requisite data cleansing), researchers have developed several methods for automated artifact detection. Once an artifact has been detected, the corrupted segment may be discarded, but discarding segments introduces discontinuities to the signal that may limit its applications. To circumvent discontinuities, artifact correction techniques may be utilized to "correct" the signal. Implementing effective strategies for artifact detection and correction requires careful review of approaches scattered across the scientific literature. In this review, we highlight the key research contributions in the EEG artifact detection and correction domain over the last 7 years, and identify promising directions for further research and development efforts.

Paper | Artifact | Type | Datasets | Method | Requirements | Performance
[154] | Blinks | D | 4256 trials† | ICA | labeled ICA scalp-maps | > 0.80 AUC
[113] | Blinks, Muscle | D | 47752 trials† (1955 blinks, 4203 muscle mov.) | Supervised learning algorithms | labeled trials | 0.98 F1
[43] | Muscle | R | ‡ | Hand-crafted EEMD | uses expert knowledge | 0.83 F1
[58] | All | R | ‡ | LDA, SVM, KNN, ICA | labeled trials | < 0.50 F1
[114] | All | D | ‡ | CNN classifier | labeled trials | 0.92 F1
[162] | All | R | 2 new datasets, real and simulated† | MWF | labeled trials, assumes stationarity | 6.20 SNR
[2] | Blinks | D | 4 new datasets†, 2350 blinks | Hand-crafted | assumes artifact frequency | > 0.94 F1
[57] | Blinks | R | 2000 trials†, 1000 blinks | SVM, Autoencoder | labeled trials | > 0.98 F1, 0.024 RMSE
[134] | Blinks, Muscle, Heart, Channel | D | 6352 subjects† | ICA, CNN classifier (multi-class) | labeled ICA components | 0.80 F1
[17] | Blinks | R | 2 new datasets‡, simulated and real | ICA with ASR | labeled ICA components | downstream tasks
[146] | All | R | 2 new datasets†, 4578 and 4569 trials with 628 and 570 artifacts | Classical classifiers and Autoencoder | assumes artifacts are uncommon | downstream tasks, 0.54 F1
[133] | Blinks | R | EEGLAB dataset with simulated blinks | ICA, SVM, and Autoencoder | uncorrelated signal and noise | 0.97 F1, 0.04 NMSE
[194] | Blinks, Muscle | C | simulated artifacts | Autoencoder | simulates only specific artifacts | 0.56 RRMSE

Table 2.1 Artifact rejection / correction papers being reviewed, in chronological order. Type: (D)etection, (C)orrection, or (R)emoval. See 2.1.2 for a breakdown of the different metrics. † marks a new dataset. ‡ Data characteristics were not reported by the authors.
2.1.1 Definition of Artifact
For the EEG community, an "artifact" refers to a diverse set of signal distortions that span spatial, frequency, and temporal scales [100]. While different taxonomies of artifacts have been proposed [100], the exact distinction between signal and artifact is often dependent on the specific purposes of those collecting the data. For instance, muscle artifacts are unwanted in a motor-imagery Brain Computer Interface (BCI) application, but are useful for tasks such as sleep stage identification [54]. Given the variety of phenomena that could be classified as an artifact for any given EEG use-case, it is not surprising that artifact detection algorithms are narrowly focused on correcting the intruding artifact in a specific context [154]. An alternative approach argues that a distortion to an EEG segment is an artifact if and only if the distortion negatively impacts the performance of a downstream task [146].

2.1.2 Scope of Review
This review includes algorithms for artifact detection and correction using EEG data alone. That is, we do not discuss algorithms that rely on external signals (e.g. electrooculography). Furthermore, we exclude research focused on electrode 'pops' or other spatially localized artifacts, as their unique characteristics enable ease of detection by simple unsupervised and self-supervised techniques [145]. Finally, for the sake of brevity, when a group of papers constitutes a sequence of incremental improvements, we select only the work which presents the accumulation of that line of research [27, 17]. Table 2.1 provides an overview of the literature surveyed in this review.

2.1.2.1 Removal vs. Correction
This review distinguishes between two approaches: artifact removal and artifact correction. For an algorithm to perform correction (rather than removal) it must have access to an artifact-free version of the EEG waveform, to be used as ground truth for correcting an artifact-ridden version of that same waveform. Note that this necessitates that artifact correction algorithms are trained on datasets with simulated artifacts (for instance, see the dataset proposed by [194]).

2.1.2.2 Metrics
The performance of artifact detection algorithms is often measured using manually annotated EEG signals. Common metrics to evaluate artifact detection methods include the F1 score, accuracy, sensitivity, specificity, Area Under the Receiver Operating Characteristic curve (AUC), and Cohen's Kappa (inter-rater reliability). For the purpose of comparing performance in this review, we standardized these metrics when possible. For instance, if an author did not report the F1 score, we attempted to derive it from the other metrics [57]. For artifact detection, we compare algorithms using several common performance metrics. We note that not all metrics are equally valid for evaluating EEG artifact detection algorithms. The F1 score and accuracy are appropriate for the assessment of tasks with balanced outcome class labels, which is not common in artifact annotation settings; a classifier graded on an unbalanced dataset may achieve a high accuracy but suffer from a high false negative rate. Artifact correction algorithms are more challenging to assess compared to detection algorithms, as (barring simulated data) the ground truth is unknown. When artifacts are simulated, and access to the artifact-free waveform is available, metrics such as the normalized mean square error (NMSE) and root mean square error (RMSE) are used [133, 194].
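To make these comparisons concrete, the sketch below shows how the reconstruction metrics above can be computed from a ground-truth clean segment and its corrected counterpart, along with the F1 derivation we used to standardize detection results. This is a minimal NumPy illustration; the function names are ours, not any paper's API.

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    """Derive the F1 score from confusion-matrix counts when it is not
    reported directly: precision = tp/(tp+fp), recall = tp/(tp+fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def rmse(clean, corrected):
    """Root mean square error between a clean EEG segment and its
    reconstruction (both channels x samples arrays)."""
    return np.sqrt(np.mean((clean - corrected) ** 2))

def nmse(clean, corrected):
    """Mean square error normalized by the power of the clean signal."""
    return np.mean((clean - corrected) ** 2) / np.mean(clean ** 2)

def snr_db(clean, corrected):
    """Signal-to-noise ratio (in dB) of the reconstruction residual."""
    residual = clean - corrected
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))
```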
When the data is not simulated, these same metrics are calculated using artifact-free EEG data collected under similar circumstances (i.e. stimuli and task) [57]. The signal-to-noise ratio (SNR) between clean and noisy EEG post artifact removal is another popular metric [162]. Finally, some researchers use the improvement in downstream task performance as a measure of reconstruction fidelity; for instance, artifact removal was demonstrated to improve stimulus decoding and visual-evoked potential recognition [146, 17].

2.1.2.3 Datasets
Table 2.1 lists a summary of investigations conducted for the purpose of developing algorithms for artifact detection and correction. We note that investigators typically evaluate their approaches on data they have collected themselves, as opposed to a standard community benchmark dataset; this highlights a larger issue in the EEG research community around data sharing practices. When data is shared, it is often shared to study a particular downstream task; to facilitate this end, artifacts are often removed, which renders the dataset irrelevant for the purpose of artifact detection research. Of the papers surveyed in this review, only a few made their datasets publicly available [114, 2, 134, 146].

2.1.3 Artifact Detection Methods
Various machine learning and statistical approaches have been applied to the domain of artifact detection. We elaborate on these methods below.

2.1.3.1 Hand Crafted Methods
The BLINK algorithm was tailor-made to detect the specific signal characteristics of artifacts caused by eye blinks. Like many hand-crafted methods, this approach performs well for the specific task it was engineered to accomplish, but cannot be easily extended, tuned, or adapted to detect other types of artifacts [2].

2.1.3.2 Signal Decomposition Methods
Blind source separation methods, most prominently Independent Component Analysis (ICA), treat EEG as a composite signal; ICA decomposes EEG signals into their constituent signal components, from which an expert may identify and remove artifact components. While there are rules-of-thumb to distinguish artifact from signal components (for instance, higher power aggregates in frontal areas of scalp maps for blinks), expert annotation is still often required. One notable exception to this is the work of Shamlo et al., who side-stepped the need for an expert annotator by collecting thousands of scalp maps of blink artifacts to contrast new EEG segments against [154].

2.1.3.3 Supervised Approaches
Supervised classification approaches including Support Vector Machines (SVM), Decision Trees, and K-nearest neighbors (KNN) have been applied to a variety of EEG artifact detection problems. Deep learning and neural network methods are a relatively recent development in the field of EEG artifact detection. Multiple recent efforts have applied Convolutional Neural Networks (CNN) to EEG by representing data as an 𝑛 × 𝑡 image of 𝑛 channels and 𝑡 samples. Nejedly et al. used a CNN in conjunction with fully automated image processing procedures to automatically detect artifacts in intracerebral EEG data [114]. Transfer learning has also been used to improve the performance of network models previously trained on different datasets [114]. Ultimately, supervised classifiers have been shown to effectively discriminate artifact from signal segments [58, 57], but they require annotated artifact data to do so, which is not commonly available for many EEG datasets.

2.1.3.4 Unsupervised Approaches
Sadiya et al. proposed a general-purpose artifact detection algorithm [146]; their method extracted 58 different EEG features that are commonly used in EEG research and prognostication, and made the assumption that the frequency of artifacts in the datasets was relatively low. While this assumption may not always hold (for instance, in seizure detection), it is usually valid. The authors benchmarked multiple unsupervised methods. For instance, an autoencoder was trained to reconstruct EEG waveform segments. Assuming artifacts are infrequent, the autoencoder minimizes the reconstruction error for artifact-free trials; hence, a high reconstruction error is taken as indicative of an outlier EEG segment likely to be an artifact. Their results showed artifact detection rates comparable to the inter-annotator agreement reported in the literature but, as expected, unsupervised algorithms are outperformed by methods tailor-made to detect a given artifact type (Table 2.1).
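The reconstruction-error heuristic at the core of this approach can be sketched in a few lines of PyTorch. The tiny architecture, the 58-feature placeholder input, and the fixed 95th-percentile threshold below are our own illustrative choices, not the published implementation.

```python
import torch
import torch.nn as nn

class TrialAutoencoder(nn.Module):
    """Toy fully-connected autoencoder over per-trial feature vectors."""
    def __init__(self, n_features, n_hidden=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train(model, trials, epochs=200, lr=1e-3):
    """Fit the autoencoder to reconstruct every trial's feature vector."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(trials), trials)
        loss.backward()
        opt.step()

def flag_artifacts(model, trials, quantile=0.95):
    """Because artifacts are assumed rare, the autoencoder fits clean trials
    best; trials in the top tail of reconstruction error are flagged."""
    with torch.no_grad():
        errors = ((model(trials) - trials) ** 2).mean(dim=1)
    return errors > torch.quantile(errors, quantile)

# usage: trials is an (n_trials, n_features) tensor of extracted features
trials = torch.randn(500, 58)            # placeholder for the 58 features
model = TrialAutoencoder(n_features=58)
train(model, trials)
mask = flag_artifacts(model, trials)     # True = suspected artifact
```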
2.1.3.5 Hybrid Approaches
Hybrid methods that use deep learning classifiers in conjunction with other methods have shown great promise. ICLabel is a recently available artifact rejection plugin for EEGLAB (https://github.com/sccn/ICLabel) that uses a CNN to label the components of the ICA-decomposed waveform [134]. The classifier distinguishes between seven different artifact types with a binary accuracy (artifact vs signal) of 0.83. Like other ICA based algorithms, ICLabel is capable of online artifact rejection.

2.1.4 Artifact Removal and Correction Methods
Detecting and excluding artifact-ridden trials allows researchers access to clean data. However, these trials can constitute a non-trivial portion of the collected data, and rejecting them may introduce discontinuities into data that is fundamentally temporal in nature. Recent research efforts have therefore focused on approximating an artifact-free version of the affected segment, instead of discarding it altogether. It is important to note that all artifact removal methods discussed below are supervised, even when constituting a component of a larger unsupervised pipeline.

2.1.4.1 Signal Decomposition Methods
As previously stated, ICA decomposes EEG signals into their constituent components, from which noise components may be identified. A natural extension of the detection algorithms discussed above is to reconstruct the EEG signal from all but the identified noise components. Gilbert et al. trained several classifiers (LDA, SVM, KNN) to distinguish between signal and noise independent components [58], and as previously mentioned, [134] trained a CNN classifier to distinguish between noise and signal components. Notably, these methods involve some global loss of information when the signal is reconstructed [133]. Another approach to blind source separation is Artifact Subspace Reconstruction (ASR), which learns the statistical characteristics of the components resulting from Principal Component Analysis (PCA). While the performance of ASR and ICA based methods is comparable, the former is faster and less computationally demanding, and is therefore more suitable for online artifact correction [17]. Ensemble Empirical Mode Decomposition (EEMD) has also been applied to EEG artifact removal [43]. Empirical mode decomposition methods can be used as filters, but are not strictly in the same category: EMDs decompose signals into a special class of generating functions that maximizes the signal-to-noise ratio of the reconstruction.
While EMDs might appear reminiscent of ICA, the nature of the decomposition is different: ICA decomposes the data for all EEG channels simultaneously, while EMD and the other filtering methods decompose the signal at each channel separately.

2.1.4.2 Filter-based Methods
In signal processing, filters are basic sequence-to-sequence elements that suppress unwanted temporal phenomena. The Multi-Channel Wiener Filter (MWF) has been used to great effect in audio and speech processing; Wiener filters use labeled examples to estimate parameters of the signal and noise waveforms such that the noise waveform may be filtered out while the NMSE between a clean signal and the output is minimized. The amount of labeling required to use MWF is minimal, and an EEGLAB plugin is publicly available [162] (https://github.com/exporl/mwf-artifact-removal). MWF assumes stationarity of the EEG and noise profiles, but to be fair, many simple classifiers make a similar assumption. With sufficient depth, neural encoder-decoder models can learn to correct multiple artifacts drawn from different distributions.

2.1.4.3 Supervised Approaches
Artifact removal with neural networks is a recent development that was made possible by breakthroughs in sequence-to-sequence modeling using encoder-decoder neural network architectures. Since the ground truth is not usually available, researchers use noisy trials as the input sequence to the encoder-decoder model and artifact-free trials as the target sequence [57]. To facilitate work in artifact correction, EEGdenoiseNet was recently published as a benchmarked dataset of simulated ocular and muscle artifacts [194]. The package provided by the authors allows for the simulation of various artifacts at various signal-to-noise ratios. The authors implemented fully-connected, convolutional, and recurrent neural networks to benchmark the dataset.

2.1.4.4 Unsupervised Approaches
As discussed, section 3.2 proposes an unsupervised approach for artifact detection. Assuming a low false positive rate, trials marked as artifact-free are used to train a CNN to reconstruct EEG segments using surrounding samples. The trained network is then used to reconstruct artifact-ridden segments. By training with artifact-free trials, the method ensures that the reconstructed signal approximates an artifact-free signal. While the artifact removal component itself is supervised, the pipeline as a whole does not require any labeling (since the artifact detection is unsupervised). Note that this same approach could be used with any other supervised artifact removal component, such as [57, 162]. This approach remains highly limited by the low accuracy of unsupervised artifact detection (Table 2.1).
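The overall shape of such a label-free pipeline can be sketched as follows. Both callables are hypothetical placeholders standing in for a concrete detector and reconstructor; the code is a simplification rather than the implementation from section 3.2.

```python
import numpy as np

def correct_dataset(trials, detect_artifacts, train_reconstructor):
    """Two-stage correction sketch: an unsupervised detector supplies
    pseudo-labels, then a reconstructor is trained on clean trials only.

    trials:              (n_trials, n_channels, n_samples) array
    detect_artifacts:    hypothetical callable -> boolean artifact mask
    train_reconstructor: hypothetical callable -> model with .predict()
    """
    is_artifact = detect_artifacts(trials)              # stage 1: unsupervised
    model = train_reconstructor(trials[~is_artifact])   # stage 2: trained on
                                                        # presumed-clean data
    corrected = trials.copy()
    corrected[is_artifact] = model.predict(trials[is_artifact])
    return corrected
```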
As evident from Table 2.1, the research community is in dire need for a standardized metric, database,and terminology surrounding the EEG artifact detection task, especially if the goal is to produce usable application that will generalize to multiple datasets, and heterogeneous tasks. The more recent entries in Table 2.1 imply a growing popularity of deep learning techniques comes at the expense of traditional approaches and expert knowledge. However, we note that recent papers successfully drew on the rich history and knowledge developed within the EEG preprocessing community to build hybrid approaches that synthesize deep learning, ICA frameworks [133], or features borrowed from EEG prognostication [146]. We believe that hybrid frameworks are an interesting future direction of work in this domain and uniquely situated to combine the strengths of multiple approaches that will advance the current state-of-the-art. 14 2.2 Decoding EEG Signals EEG applications span a wide spectrum: from healthcare [36] and accessibility [88, 29] to entertainment [20] and user authentication [64].EEG decoding in particular, namely the decoding of internal cognitive representations has been an of extensive research. Researchers working in cognitive science use EEG decoding to investigate how stimuli is represented and stored in working memory [119, 7, 171, 66]. In contrast, many biomedical applications leverage advanced machine learning techniques to decode user intentions, such as motor-imagery [92, 29, 192], envisioned speech [88], and environmental control via brain-computer interfaces (BCI) [36]. Here we compare the most common EEG decoding approaches, highlighting the particular circumstances that led to the adoption of different approaches across disciplines, and the strengths and weaknesses of each of them. See Table 2.2 for a quick overview of the literature discussed in this section. 2.2.1 Challenges to EEG Decoding Despite the plethora and variety of research, there remain a number of challenges that limit EEG decoding applications. While some of these challenges are universal, the impact of others is unevenly felt across disciplines. The following are the most relevant challenges for the sake of this review: • Inter-Subject Variability: Research shows consistent individual differences in EEG. This has enabled researchers to design EEG based user authentication methods [64]. However, this is an issue when designing any kind of BCI application as pre-trained models might face difficulties when used by out-of-training subjects. • Inter-Task Variability: Models are often trained to decode a narrow set of labels. As we will discuss in subsection 2.2.2, this weakness is inherent to discriminative, as opposed to generative, models. • Data Scarcity: EEG data may suffer from an extremely low signal-to-noise ratio that fre- quently renders data collected from subjects unusable (for instance [119] and [171] both discarded 10% of their subjects due to noise). Moreover, data collection, annotation, and maintenance all require considerable effort and data scarcity effects EEG research reliability. 15 This is reflected by the low sample sizes in Table 2.2. • Interpretability: While high accuracy might be a priority when designing BCI, this is not the case when the underlying goal is the study of specific cognitive phenomena; for a decoding methodology to be widely adopted it must also be interpretable. 
Note that these challenges are interrelated; relieving Inter-Task Variability, for instance, can alleviate the Data Scarcity issues facing the researcher. Moreover, Interpretability might provide insights that can improve the transfer learning methods used to combat Inter-Subject and Inter-Task Variability.

Paper | Feature | Discipline | Subjects | Method
[88] | SR | BCI | 23 | RF
[119] | Cat | P | 20 | ECOC-SVM
[7] | Ori, Loc | P | 16 | ECOC-SVM
[171] | Loc | P | 8 | IEM
[66] | Clr, Ori | P | 30 | LDA
[49] | Ori | P | 16 | IEM
[192] | MI | BCI | 9 | CNN
[170] | MI | BCI | 5,5 | DNN
[149] | MI | BCI | 10,14 | CNN
[94] | EM | AS | 24 | CNN
[41] | MI | BCI | 109 | CNN
[199] | MI | BCI | 25,9 | CNN
[31] | EM | AS | 58 | CNN
[190] | Ori | P | 24 | MHL
[184] | Clr | P | 34 | MHL

Table 2.2 A breakdown of the papers reviewed in this section. Feature: the signal being decoded; orientation (Ori), location (Loc), color (Clr), category (Cat, for instance faces, scenes, tools), envisioned speech (SR), Motor-Imagery (MI), Emotion Labeling (EM). Discipline: Perception (P), Brain Computer Interfaces (BCI), Affective Science (AS). Subjects: number of subjects. Algorithms: Mahalanobis distance (MHL), Support Vector Machine (SVM; ECOC-SVM denotes error-correcting output codes), Inverted Encoding Models (IEM), Linear Discriminant Analysis (LDA), Random Forest (RF), Convolutional / Deep Neural Networks (CNN / DNN).

2.2.2 Review
Our review includes decoding studies using various features (motor intentions as well as item location, color, and category) and various decoding methodologies. As shown in Table 2.2, the literature is dominated by three families of methods. First, classic methods such as Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), and the Mahalanobis distance (MHL) are still widely used. Second, Inverted Encoding Models (IEMs) have recently become popular among cognitive scientists. Finally, advanced machine learning techniques, and particularly neural networks, are now the state-of-the-art for most BCI based applications. In the following subsections we explore each of these methods and examine their strengths and weaknesses.

2.2.2.1 Classical Classification Algorithms
Support Vector Machines These classifiers were developed by Vladimir Vapnik and his colleagues in a series of papers during the mid-90s. The method quickly rose in popularity, and by 2001 it had been applied to multiple EEG classification problems [109]. While the engineering community has come to favor neural network approaches for most classification problems, SVMs remain the predominant method in other, non-engineering focused disciplines [7, 8]. This is not surprising; SVMs use a relatively low number of parameters, are simple to train without costly computational resources (i.e. GPUs), and can find globally optimal solutions for most problems given the correct kernel selection.
Linear Discriminant Analysis Another classification algorithm that is commonly used with EEG data is Linear Discriminant Analysis (LDA). Given two or more (normal) distributions, LDA finds the projection that maximizes the separation of the clusters. LDA is reminiscent of decomposition algorithms such as PCA. However, while principal component analysis (PCA) solves for a projection that captures the direction of maximum variation in the dataset without requiring any labels (thereby projecting the data into a lower dimension in a way that allows for low reconstruction error), LDA maximizes separability between clusters. Mathematically, consider two clusters $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ with means $\mu_x, \mu_y$ and covariance matrices $\Sigma_x, \Sigma_y$ respectively.
LDA finds a transformation $W$ that 1) maximizes the difference between the means of the transformed clusters, $(W\mu_x - W\mu_y)^2$, and 2) minimizes the within-class scatter of both clusters. Since $\Sigma_x = E[XX^t] - \mu_x \mu_x^t$, after the transformation we obtain a new scatter matrix $W \Sigma_x W^t$. Combining the two terms gives the objective function

$$\arg\max_W \; \frac{(W\mu_x - W\mu_y)^2}{W\Sigma_x W^t + W\Sigma_y W^t}.$$

Empirical studies on multiple EEG datasets have demonstrated that LDA consistently achieves decoding accuracy on par with other classification methods [61]. For this reason LDA was chosen as the default classifier in the ADAM toolbox [44]. Despite implementing a Mahalanobis distance classifier, for reasons that will be discussed in later sections, decoding of the data collected during this dissertation was completed using the ADAM LDA classifier.
Multidimensional Scaling Visualization Due to the similarities between LDA and SVM, many visualization techniques are applicable to both methods. One such technique is Multidimensional Scaling (MDS): a distance-preserving dimensionality reduction technique that can be used to visualize complex high-dimensional data. MDS originated in psychometrics and is still a common tool in cognitive science. For instance, using the distances that SVM and LDA methods provide, it is possible to make deductions regarding the similarity of EEG measurements for different items. By making the reasonable assumption that, for a given subject, EEG pattern similarity correlates with neural representation similarity, Hajonides et al. demonstrated that color representations follow the color circle in representational space (orange is between green and red) [66]. That is, they showed that the representations being decoded were sensory in nature, not merely the result of a verbal label (the decoded activation wasn't simply the result of subjects repeating the color names in their head). This is an extremely valuable insight for cognitive scientists, as it speaks to the organization of neural representations in the brain. Note that such techniques cannot be used with deep-learning black-box models that do not preserve stimulus properties. This type of visualization highlights the interpretability of results obtained using classical methods, which contributes to their persistent popularity amongst cognitive scientists.
Mahalanobis Distance Finally, many EEG decoding papers have used the Mahalanobis distance for EEG labeling [190, 184]. The Mahalanobis distance is calculated between a data point and a distribution. Let $X_k$ be the set of all data points labeled $k$ in the training set, and let $\Sigma_k$ and $\mu_k$ be the covariance matrix and mean of $X_k$. Given a data point $y$ from the testing set, the Mahalanobis distance of the point from the distribution of each label is

$$MHL_k = \sqrt{(y - \mu_k)^t \, \Sigma_k^{-1} \, (y - \mu_k)}.$$

The smaller the Mahalanobis distance from a distribution, the more likely it is that the data point belongs to the same label as the cluster used to generate the distribution. Therefore, each data point is assigned the label that satisfies $\arg\min_k (MHL_k)$. The popularity of the Mahalanobis distance in EEG research stems from EEG signals being global, and hence producing highly correlated electrode readings. By multiplying by the inverse of the covariance matrix we essentially uncorrelate and z-score the data. This becomes apparent when decomposing the covariance matrix into its eigenvalues and eigenvectors, $\Sigma = v \lambda v^t$, as can be seen in Figure 2.1.

Figure 2.1 The two red data points have the same Euclidean distance from the center of the cluster. By scaling the distance in the direction of the eigenvector $v_k$ by $\frac{1}{\lambda_k}$ we correct for covariance.
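As a concrete illustration of this classifier, a minimal NumPy sketch might look as follows; the small ridge term added to each covariance matrix is our own numerical safeguard and not part of the formulation above.

```python
import numpy as np

def fit_mahalanobis(X_train, y_train):
    """Estimate per-class mean and inverse covariance from training trials.
    X_train: (n_trials, n_features) array of EEG features."""
    stats = {}
    for k in np.unique(y_train):
        Xk = X_train[y_train == k]
        mu = Xk.mean(axis=0)
        # small ridge keeps the covariance invertible (our safeguard)
        cov = np.cov(Xk, rowvar=False) + 1e-6 * np.eye(X_train.shape[1])
        stats[k] = (mu, np.linalg.inv(cov))
    return stats

def predict_mahalanobis(stats, x):
    """Assign the label whose class distribution is closest to x,
    i.e. argmin_k MHL_k."""
    def dist(mu, inv_cov):
        d = x - mu
        return np.sqrt(d @ inv_cov @ d)
    return min(stats, key=lambda k: dist(*stats[k]))
```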
2.2.2.2 Neural Networks
Neural networks have facilitated massive advances in all areas related to signal processing; this "deep learning tsunami" [103] did not spare the field of bioinformatics. With respect to the decoding problem, most deep learning applications are seen in BCI settings. This is not surprising, as BCI engineers are primarily interested in high algorithm performance; statistical significance alone is typically insufficient. For instance, the decoding accuracy in the neuroscience paper [119] peaked at 37.5% against a chance level of 33.3%, while the BCI study [41] achieved 79.25% accuracy for a comparable chance level. Convolutional Neural Networks (CNN) in particular have proven to be extremely capable for EEG decoding. Schirrmeister et al. tested multiple CNN architectures on multiple datasets against traditional decoding methods and found that CNNs dramatically outperformed all baselines [149]. The authors suggest that the temporal nature of EEG signals might be especially suited for CNNs, as these neural networks 'can capture the temporal hierarchies of local and global [temporal] modulations in the deeper architectures'.
Transfer Learning A well-known weakness of deep learning methods is the need for large amounts of data. In our case this is further exacerbated by the data scarcity issues discussed above. One potential way to combat data scarcity is to address the Inter-Task Variability problem by using pretrained models with transfer learning. For instance, [192] used VGG-16, a CNN classifier trained for image classification [195], to improve performance on an EEG motor imagery dataset. Xu et al. froze the first 11 VGG layers and allowed tuning of the remaining five, the intuition being that the initial layers have 'extracted low-level universal features ... appropriate for general image processing tasks'. Transfer learning can also be used to alleviate Inter-Subject Variability problems. Emotion detection EEG data is particularly noisy, and decoding accuracy is often extremely low; using a model pre-trained on a subset of subjects can improve the decoding for specific subjects with low accuracy [94]. This was demonstrated to be possible even across different datasets that were collected using different paradigms [31]. A different approach to inter-subject transfer learning is to train a model using a few trials from all available subjects before fine-tuning the model for a specific subject. This approach was used by [41] to improve the performance of the models proposed by [149]. Finally, [199] used a CNN with hand-crafted features to implement a 'training-free' classifier that can outperform traditional methods such as SVM and LDA for subjects it did not previously encounter.
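As an illustration of the layer-freezing recipe used by [192], the PyTorch sketch below loads a pretrained VGG-16 and leaves only its later layers trainable. The exact split (11 frozen layers, five tuned) depends on how VGG-16's layers are counted, so the slice index here is illustrative rather than a reproduction of that paper's setup.

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

# load an ImageNet-pretrained VGG-16 and freeze the early convolutional
# blocks, so that only later layers adapt to the EEG-derived inputs
model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
for param in model.features[:24].parameters():   # early conv blocks (illustrative cut)
    param.requires_grad = False

# only the still-trainable parameters are handed to the optimizer
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```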
Interpretability Machine learning researchers have recently drawn a distinction between Interpretability, the ability to associate cause and effect, and the more general attribute of Explainability, which relates to justifying this association. For example, in image classification, interpretability techniques such as Shapley values reflect the contribution of each input pixel to the final model decision. However, they do not explain why the presence of a tail increases the probability of an image being classified as a dog [26]. In the specific context of EEG decoding, interpretability is sufficient, as explaining the chain of cause and effect often falls in the realm of cognitive science. Many examples of such "attribution" techniques can be found in the cognitive neuroscience literature [149, 170]. For instance, "Layer-Wise Relevance Propagation" was used in a motor-imagery classification from EEG task [170]. As could be expected, results indicated that activity in the contralateral sensorimotor areas was crucial for the accurate classification of the motor-imagery action. The same technique was also used for Alzheimer's disease classification from fMRI data. Not only did the technique attribute relevancy to areas known to be implicated in the progression of Alzheimer's disease, but the attribution also had high inter-patient variability, enabling the researchers to identify Alzheimer's disease "subtypes" [18].

2.2.2.3 Encoding Models
In contrast to BCI researchers, cognitive scientists prioritize understanding the underlying neural representation over decoder performance. This has led to the development of Encoding Models (EMs) that aim to predict the brain response $r$ given the stimulus $s$. In other words, EMs model cognitive functions as conditional distributions $P(r|s)$. For instance, by characterizing how every fMRI voxel (cluster of neurons) in the early visual areas responds to different spatial frequencies and orientations, researchers were able to identify images by comparing voxel responses with model-predicted activations with over 80% precision [112]. Moreover, by inverting the encoding model and calculating $P(s|r)$, the authors were able to reconstruct images from brain activation. Later work expanded on this by reconstructing entire video segments from fMRI data [118]. Another development is the so-called Inverted Encoding Model (IEM), a specific type of EM that uses "channel responses" instead of stimuli as input. The IEM assumes that the activation measured by each of the $m$ EEG electrodes reflects a weighted sum of $n$ response channels. Each response channel is selective towards a specific stimulus value (Figure 2.2a). Computationally, the problem is equivalent to solving the system of linear equations $Wc = e$, where $c \in \mathbb{R}^n$ is the vector of channel activations for a given stimulus, $e \in \mathbb{R}^m$ contains the electrode measurements, and the weight matrix $W_{m \times n}$ describes the contribution of each channel to each electrode. Solving for $W$ on the training data and then applying the inverse operation allows us to infer the channel responses, and thereby the stimulus, from EEG data [171, 49]. See Figure 2.2 for a visualization of IEM training.

Figure 2.2 The Inverted Encoding Model. a) The channel tuning functions; each channel is sensitive to a different orientation. b) Predicted activation for a specific stimulus. c) The electrode values for the stimulus. We can calculate $W$ and use the inverse transformation to predict channel responses from electrode activations, thereby inferring stimulus properties from raw EEG.
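A minimal sketch of this train-and-invert procedure is given below, assuming NumPy. The matrix C would be built from hypothesized tuning functions like those in Figure 2.2a; the function names are ours, not a standard toolbox API.

```python
import numpy as np

def train_iem(C, E):
    """Solve for W in E ≈ C @ W.T by least squares.
    C: (n_trials, n_channels_model) hypothesized channel responses
    E: (n_trials, n_electrodes) measured EEG activations."""
    # lstsq solves C @ W_T = E, so W_T has shape
    # (n_channels_model, n_electrodes)
    W_T, *_ = np.linalg.lstsq(C, E, rcond=None)
    return W_T.T  # W: (n_electrodes, n_channels_model), i.e. W c = e

def invert_iem(W, e):
    """Infer channel responses from a new electrode measurement e
    via the pseudo-inverse of W."""
    return np.linalg.pinv(W) @ e
```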
Flexibility Encoding models can be used to reconstruct any stimulus, even one that was not encountered during training. This means that, unlike the methods previously discussed in this review, EMs are generative rather than discriminative. For instance, while the EM in Figure 2.2 is trained on orientations that are 30° apart, we can still predict the channel response for a 45° orientation. As noted by Brouwer et al., training an EM is equivalent to creating "a lookup table of channel outputs for an arbitrarily large number of different [stimuli]" [23].
Limitations EMs model the neural representations that underlie human perception. The neural mechanisms underlying perception can be investigated by examining the performance of EMs that incorporate them. For example, EMs that account for the semantic categories present in the stimuli accurately predict voxel responses in some brain regions but not others, indicating which regions represent such concepts [118]. However, as two recent papers have demonstrated, the reconstructed channel response functions of IEMs are highly contingent on model assumptions [98, 50]. For instance, the decoding would work just as well if, instead of assuming that the channel functions are shaped like a normal distribution (Figure 2.2a), the researcher used bimodal (or arbitrary) distributions [50]. Proponents of IEMs responded by suggesting that as long as "sensible" models that follow the current consensus in the research community are used, IEM methods are still useful for intuition regarding the inner workings of cognitive-neural systems [165]. This demonstrates how dependent EMs are on expert knowledge. Another related limitation, more specific to IEMs, is that they require handcrafted channel responses, and it is not apparent how these can be extended to more complex stimulus categories. For instance, [119] used an SVM to decode stimulus category (faces, natural scenes, and tools); it is not clear how one could model a set of channels for such high-dimensional stimuli, or for non-perceptual domains such as motor imagery or semantic categories.

2.2.3 Conclusion
In this review we examined the most widely used EEG decoding modalities and the contexts in which they are utilized. While a unified EEG decoding methodology could be beneficial, by carefully choosing the modalities most appropriate for their use cases researchers seem to be able to side-step many of their inherent weaknesses. As discussed in subsection 2.2.2, each of the currently popular methods has different strengths that counteract a specific subset of the challenges discussed in 2.2.1. This can be summarized as follows:
• Classical Methods require a relatively small amount of data, and their results can provide insights into the cognitive processes underlying the EEG representations being decoded. These methods suit researchers concerned with Interpretability and Data Scarcity, and are therefore most popular amongst non-engineering disciplines focused on cognitive science research as opposed to building practical EEG applications.
• Inverted Encoding Models A newer method that has gained popularity in cognitive science. This is a generative rather than discriminative EEG decoding method, as it can decode stimulus values that were not encountered during training. Moreover, as discussed earlier, it has been demonstrated that it can be used to decode stimuli with multiple features even if only single-feature stimuli were available during training (and vice versa). This method is therefore uniquely flexible as far as Inter-Task Variability is concerned.
• Deep Learning Methods Deep learning methods achieve the highest accuracy, and transfer learning is a promising approach that can alleviate Inter-Subject and Inter-Task Variability related issues. However, even with transfer learning these methods remain relatively data-intensive.
Considering the above, these are the most popular methods for BCI applications.
Considering the cross-disciplinary challenges of working with EEG data, it is more than likely that practices from different fields will eventually find their way into other disciplines. We hope that this review will help the reader consider decoding approaches from new angles.

CHAPTER 3
DEEP LEARNING METHODS FOR PREPROCESSING EEG

3.1 EEG Channel Interpolation Using Deep Encoder-decoder Networks
This section was published as a manuscript titled "EEG Channel Interpolation Using Deep Encoder-decoder Networks" in the proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) [145].

3.1.1 Introduction
Electroencephalography (EEG) devices have become increasingly popular in recent years and are used in a wide range of applications. Naturally, the medical applications of EEG are centered on neurological diagnosis, but EEG has proven useful for other problems in the healthcare domain [153, 138]. Moreover, the use of EEG devices extends far beyond the medical domain; novel applications of EEG may be found in a wide variety of fields including advertising [177], education [161], entertainment [20], and security [82].
A fundamental challenge of EEG data is the low signal-to-noise ratio. Different sources contribute to this noisiness but, in general, they can be categorized as either movement artifacts or electrode artifacts. The most common, and particularly persistent, electrode artifact is the electrode "pop" [180, 100]. These artifacts result from abrupt changes in impedance, usually due to a loose electrode or bad conductivity. Furthermore, these artifacts are difficult to avoid because, even if the greatest care is taken when applying electrodes, the most minor subject movement or change in perspiration can cause the electrode to "pop".
A common solution to EEG "pops" is to interpolate the missing segments using recordings from nearby electrodes [48]. In practice, this interpolation is most commonly performed using EEGLAB, which contains a tool for spherical interpolation [131, 45]. Within the last few years, alternative interpolation methods reporting improved performance have been proposed: Petrichella et al. proposed a Euclidean inverse-distance method [132], while Courellis et al. demonstrated an interpolation approach that (while also based on an inverse-distance calculation) used geodesic lengths and electrode localization to extract more exact channel locations, thereby performing more accurate interpolation [35].
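To make the general idea concrete, the sketch below implements plain inverse-distance weighting for a single missing channel. It is a simplification under our own assumptions: straight-line distances between electrode coordinates, rather than the spherical-spline or geodesic weights used by the methods cited above.

```python
import numpy as np

def idw_interpolate(signals, positions, bad_idx, p=2):
    """Inverse-distance-weighted estimate of one missing channel.
    signals:   (n_channels, n_samples) EEG array
    positions: (n_channels, 3) electrode coordinates
    bad_idx:   index of the channel to reconstruct
    p:         distance exponent (p=2 weights nearby electrodes heavily)."""
    good = np.arange(len(signals)) != bad_idx
    dists = np.linalg.norm(positions[good] - positions[bad_idx], axis=1)
    weights = 1.0 / dists ** p
    weights /= weights.sum()
    # weighted sum of the remaining channels approximates the missing one
    return weights @ signals[good]
```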
Effective interpolation is a necessary preliminary step to any subsequent preprocessing or formal analysis of the EEG, including Independent Component Analysis (ICA). As noted by Ullsperger et al.: "activity from bad channels should be removed before ICA decomposition, as it can massively deteriorate otherwise good decomposition results." [175]
One shortcoming of these previous solutions is their dependence on knowledge of the precise locations of the electrodes (i.e. electrode localization / registration), which are not collected in most practical settings. Furthermore, the existing methodologies assume that the incidence and specific characteristics of the "pops" are similar across both subjects and tasks (i.e. "one-size-fits-all"). Moreover, as far as the authors are aware, none of the studies surveyed for the purposes of this work provided publicly available software repositories to enable practical use, reproduction of their methodologies, or ease of extension.
To address the aforementioned challenges, we propose a novel electrode interpolation framework using representation learning. Our method autonomously identifies the spatio-temporal properties of EEG data measured at a set of electrodes that predict the values of a given neighbor of those electrodes. Our model, which has been made publicly available (at github.com/sari-saba-sadiya/EEG-Channel-Interpolation-Using-Deep-Encoder-Decoder-Networks), can be used "out of the box" to more effectively interpolate EEG for any missing channel, at any time. One important advantage of our model over existing approaches is its amenability to transfer learning: the ability to easily fine-tune it using clean data from a novel subject or EEG experiment. This property of our model allows for interpolation that is tailor-made to the specific task and subject at hand, enabling the model to learn even idiosyncratic relations in new data. To determine the usefulness of our method, we evaluate the model on unseen tasks and subjects with and without further tuning. To summarize, our main contributions are:
• We propose and implement a new framework for EEG channel interpolation using encoder-decoder deep representation learning.
For instance, [34] used generative adversarial networks (GANs) to spatially up-sample EEG data by interpolating non-existing channels. While not interchangeable, this is a similar task to channel interpolation. However, no previous work was validated on either subjects or tasks that did not appear during training. Moreover, to the best of the authors’ knowledge, no previous work tested how transfer learning could facilitate further fine-tuning on specific data-sets, or even made the trained model available. We hope that our rigorously tested, ready-to-use model will allow for wider access to state-of-the-art machine learning techniques.

3.1.3 Data

3.1.3.1 Data Collection

The data in this study was drawn from the recently published EEG During Mental Arithmetic Data Set [201]. The data consists of 24 subjects performing two tasks: a resting state task and a mental arithmetic task. EEG data was collected as subjects performed the tasks using the 10-20 international system, with the linked ears serving as a reference electrode (see Figure 3.2, left). The sampling rate was 500 Hz. Each resting state task lasted 180 seconds while each mental arithmetic task lasted 60 seconds; hence, the total number of samples was 90,000 and 30,000 for each resting state and mental arithmetic task respectively. We segmented the data into 16 ms (8-sample) epochs; each subject thus ended up with 11,250 and 3,750 epochs for the resting state and mental arithmetic tasks respectively.

3.1.3.2 Data Partitioning

In Figure 3.1, we illustrate how the data was partitioned for the training and evaluation of our method. Training data from the resting state task was used for model development while data from the mental arithmetic task was held out for intra-task evaluation. We further partitioned data from both tasks into training subjects (67% of subjects) and evaluation subjects (33%). This resulted in the following four partitions of the data, which we will refer to later when discussing our results: “Seen Task, Seen Subjects” (n=16 subjects), “Seen Task, Unseen Subjects” (n=8), “Unseen Task, Seen Subjects” (n=16), and “Unseen Task, Unseen Subjects” (n=8). The “unseen” data sets contain some deviation from the data the model was trained on, and are thus a good test of the generalizability of the method. Because we utilize transfer learning, 10% of the data in each unseen partition was held out for tuning, to test the impact of this additional context on the model’s performance.

Figure 3.1 The four-way split of the data into partitions. The dotted area (top left) was used to train our main algorithm. The areas with the diagonal shading were used for transfer learning. The areas with white background were used to test our algorithm. Each shaded area was used separately to tune the model for testing on the remaining 90% of the data in its own partition.

In general, the fundamental difference between the two tasks is crucial to our evaluation. Researchers will often not have enough data to train neural networks “from scratch”; moreover, training a neural network often requires exploring a hyper-parameter space, which may consume significant temporal (and financial) resources. This is especially true for deep learning frameworks, which have a strong tendency to overfit and produce remarkable results on a specific data set while failing to generalize to other contexts.
With this in mind, it is important that our models achieve good results on both unseen tasks and subjects to enable their continued development and utility within the greater research community. We therefore structured our data to assess the method's ability to generalize across tasks and subjects.

3.1.4 Methods

3.1.4.1 Pre-processing

To begin, all EEG data were Z-scored at the subject level (i.e., converted to zero-mean, unit-variance representations). Deep networks require a large volume of training data; hence, we supplemented our training data by transforming each subject’s EEG data into 10 distinct pseudo-subjects. Each pseudo-subject’s data was an element-wise addition of the Z-scored subject’s EEG data and random draws from a Gaussian distribution with (𝜇 = 0, 𝜎 = 0.05). The pseudo-subject’s (already normalized) EEG data was then Z-scored again following the introduction of the noise. The utility and validity of this data augmentation approach for EEG research has been demonstrated in prior work [56].

Figure 3.2 A diagram of our framework. The 500 Hz EEG data is first segmented into 16 ms epochs; each segment is then mapped to a 5 × 5 matrix that roughly reflects the spatial locations of the EEG electrodes (e.g. F7 is located in position 1,1 of the matrix). The electrodes at the sagittal and median planes were duplicated and the tensor was padded with the linked-ear channel data to create an 8 × 8 × 8 tensor per epoch. This data serves as the input to an encoder-decoder model. Finally, the output is transformed back into a signal.

Next, a simple transformation was applied to project each sample of EEG data from a spherical channel representation onto a quantized two-dimensional surface represented by a 5 × 5 matrix (Figure 3.2, panel 2). Finally, the EEG data was epoched into 16 ms segments, with no overlap across segments (the average pop artifact duration exceeds 1 second, hence 16 ms is more than sufficient for reconstruction). This resulted in a 5 channels × 5 channels × 8 samples tensor. The electrodes at the sagittal and median planes (the central electrodes) were then duplicated and the tensor was padded with the linked-ear channel data to create an 8 × 8 × 8 tensor. This manipulation of input size is commonplace in deep learning and is mainly the result of networks being optimized to work with input sizes that are powers of two [95]. This 8 × 8 × 8 tensor formed a single sample of input data from the perspective of the network when training. To create training data, we iteratively occluded each of the 19 non-reference electrodes (all electrodes except 𝐴1 and 𝐴2, see Figure 3.2). Thus each 8 × 8 × 8 tensor became the prediction target for 19 input tensors, each with one distinct occluded channel.

3.1.4.2 Proposed Approach

An Encoder-Decoder model for EEG Interpolation Inspired by research on image inpainting, we deployed an encoder-decoder model for EEG interpolation. Image inpainting is a classical problem in computer vision: given a corrupted image, the aim is to complete or “fill in” missing pixels. This is a similar problem to electrode interpolation. Encoder-decoder models are combinations of two networks that are trained simultaneously: the encoder first learns a lower-dimensional embedding of the data, and the decoder attempts to recover the original data from the embedding.
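To make this idea concrete, the sketch below implements a minimal convolutional encoder-decoder in PyTorch, sized for the 8 × 8 × 8 input tensors described above. This is an illustrative toy, not the tuned topology reported later in Figure 3.3: the layer counts and filter widths are placeholder assumptions, and PyTorch is simply our choice of framework for the example.

```python
import torch
import torch.nn as nn

class InterpolationAutoencoder(nn.Module):
    """Toy encoder-decoder for 8x8x8 EEG tensors (one conv input channel)."""
    def __init__(self):
        super().__init__()
        # Encoder: compress the occluded tensor into a small embedding.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                                   # 8x8x8 -> 4x4x4
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Decoder: recover the full tensor, occluded channel included.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(64, 32, kernel_size=2, stride=2), nn.ReLU(),  # 4 -> 8
            nn.Conv3d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = InterpolationAutoencoder()
occluded = torch.randn(16, 1, 8, 8, 8)   # batch of inputs with one channel zeroed
clean = torch.randn(16, 1, 8, 8, 8)      # corresponding un-occluded targets
loss = nn.functional.mse_loss(model(occluded), clean)
```

In training, the input would be a tensor with one occluded channel and the target the corresponding clean tensor, mirroring the self-supervised setup described above.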
Encoder-decoder networks are a popular tool in image inpainting [129, 193, 181], so much so that this technique is now leveraged for image compression, as selective removal of pixels might greatly enhance compression ratios [11]. We determined the optimal topological configuration of our encoder-decoder network via a random search of the network hyper-parameter space [15]. The tested topologies varied in the number of convolution layers; the existence of max-pooling, dropout, and batch normalization layers after each convolution layer; and whether the decoder was based on transposed convolution layers or simple up-sampling with convolution. More specifically, we trained 300 distinct architectural configurations of the encoder-decoder networks, and retained the configuration that best generalized within a held-out subset of the training data itself. The best network was then used after training for our transfer learning evaluation. The code to run this search in the topological space, as well as the trained winning algorithm before and after the transfer learning tuning, is available online. The optimal topology is shown in Figure 3.3 and discussed in Subsubsection 3.1.5.1.

Subject+Task Enhancement via Transfer Learning Transfer learning experiments were carried out by taking the model trained on the original data set and tuning it on a small subset (10%) of the testing data. Realistically, it is highly likely that a small sample of clean data will be available for a researcher to use when tuning our model. To assess performance enhancements associated with transfer learning, we held out 10% of each data partition (see Figure 3.1) and tuned our network for 100 epochs. These numbers were intentionally small so as to showcase how even minimal tuning, easily completed on non-specialized hardware using very little data, can lead to significant improvements. By tuning the model for specific subjects and tasks, we assessed the flexibility and practical extensibility of our proposed approach.

Figure 3.3 Our network; the dashed black arrow denotes the 4 × 4 × 128 embedded data tensor. This embedded representation is passed forward to the decoder, which reconstructs the original input sans the occlusion.

3.1.4.3 Methodological Baselines

For our baselines we implemented the three methods described in the Related Work subsection [131, 132, 35].

The Euclidean Baseline Both [132, 35] suggest methods that employ an inverse-distance metric, where the interpolated channel $\hat{s}_i$ is calculated using the following equation:

$$\hat{s}_i = \frac{\sum_{j \neq i} w_{ij}\, s_j}{\sum_{j \neq i} w_{ij}}, \qquad w_{ij} = \frac{1}{d_{ij}^{\,p}}$$

where $p$ is the power parameter, $d_{ij}$ is the distance between electrodes $i$ and $j$, and $s_j$ is the signal at the original channel $j$. The power parameter is an integer (usually between 2 and 5) that is set using a small amount of data; while it is usually set to the same value across a given data-set, we optimized the power parameter separately for each baseline and data-set to maximize the performance of the baselines. The calculation of the distance $d_{ij}$ is the main difference between the two baselines. The first Euclidean baseline (EUD) uses the simple Euclidean distance formula. This distance calculation is done using the generic electrode positions in space that are always available for every cap.

The Geodesic Baseline The geodesic length baseline (EGL) is also based on the inverse-distance equation.
However, instead of using Euclidean distances for $d_{ij}$, the geodesic length is calculated. The geodesic distance is calculated using the Vincenty algorithm, which was originally used in geodesy to calculate the distance between points on the surface of a spheroid. The method is iterative and not theoretically guaranteed to converge. Previous work has demonstrated that interpolation calculated using this method outperforms simple Euclidean interpolation [35]. However, as previously discussed, this was tested by the original baseline works when specific electrode locations were available [35]. It should therefore be expected that results obtained for the EGL may be lower than those reported in previous studies.

The Spherical Splines Baseline Finally, we also followed the eeglab MATLAB implementation of the spherical splines method (SS) [131]. According to this implementation, at each point in time the value of the interpolated channel $\hat{s}_i$ can be approximated using the equation:

$$\hat{s}_i = c_0 + \sum_{j \neq i} c_j\, g(\cos(\theta_{i,j}))$$

where $\theta_{i,j}$ is the angle between the electrode locations $i$ and $j$. Instead of calculating the angle itself, given the positions of the electrodes in space $p_i = (x_i, y_i, z_i)$, it is possible to directly calculate the cosine value: $\cos(\theta_{i,j}) = \frac{p_i \cdot p_j}{\|p_i\|\,\|p_j\|}$. The function $g(x)$ is defined as the sum of the series:

$$g(x) = \frac{1}{4\pi} \sum_{n=1}^{\infty} \frac{2n+1}{n^m (n+1)^m}\, P_n(x)$$

where $P_n$ is the Legendre polynomial and, following [131], we set $m = 4$. Additionally, following the eeglab implementation, we limited the infinite sum to the first seven terms. Finally, the coefficients $C = (c_1, c_2, \ldots, c_n)$ are set to be the solution of the system of equations $GC + Tc_0 = S$ under the constraint $T'C = 0$, where $G_{k,l} = g(\cos(\theta_{k,l}))$, $S = (s_1, s_2, \ldots, s_n)$, and $T$ is a vector of ones. The interpolated channel is excluded from the calculations of $G$ and $S$.

3.1.4.4 Model Evaluation Approach

The selected baseline approaches [35] used the averaged normalized mean square error (ANMSE) as the main evaluation measure. The normalization of the mean square error is used to prevent a specific channel’s performance from skewing the results in case of a bad reconstruction. Having Z-scored the data, however, all channels are guaranteed to have the same mean amplitude; therefore we do not normalize our mean square error results. Hence our final evaluation measure is:

$$AMSE = \frac{1}{M} \sum_{j=1}^{M} \left( \frac{\sum_{i=1}^{N} (s_i - \hat{s}_i)^2}{N} \right)_j$$

where $N$ is the number of channels (19 in our specific case) and $M$ is the number of samples; the expression for the mean reconstruction error (the inner average) changes for every sample $j$, and $s_i$ and $\hat{s}_i$ are as previously defined. This measure was used both for optimizing the power parameter for the different baselines (see Subsubsection 3.1.4.3) and for calculating the final results presented momentarily.

Interpolation Method | Seen Task, Seen Subjects | Seen Task, Unseen Subjects | Unseen Task, Seen Subjects | Unseen Task, Unseen Subjects
SS Baseline | 0.728 | 0.694 | 0.8238 | 0.779
EUD Baseline | 0.5215 | 0.561 | 0.665 | 0.566
EGL Baseline | 0.585 | 0.501 | 0.566 | 0.622
Our Encoder-Decoder Model | 0.446 (14.47%) | 0.478 | 0.552 | 0.465
Our Encoder-Decoder Model + Transfer Learning | — | 0.392 (21.75%) | 0.439 (21%) | 0.446 (19.78%)

Table 3.1 Comparison between the encoder-decoder model and the baselines using averaged mean square error (AMSE); lower is better. Note that for transfer learning (last row) the training data set differed in each column.
There was no transfer learning for the Seen Task, Seen Subjects partition, as this is the original data used to train the model. SS: spherical splines baseline; EGL: geodesic length baseline; EUD: Euclidean baseline. The best result is bolded and the percentage of improvement over the most competitive baseline is given.

3.1.5 Results

3.1.5.1 Model Hyper-parameter Optimization

After exploring the topological space by testing different network architectures (using the Seen Task, Seen Subjects data, see Figure 3.1), the best-performing network is visualized in Figure 3.3. This network consisted of a simple encoder with three convolution layers and one max-pooling layer, as well as four transposed convolutions in the decoder. Additionally, there was a dropout and a batch normalization layer after each convolution in the encoder. All results described in this subsection were achieved using this particular architecture.

3.1.5.2 Baseline Power-parameter Optimization

To make our evaluation especially rigorous, we optimized the power parameter separately for each baseline and data-partition configuration. In Figure 3.4, we illustrate the results of our power parameter optimization for the baseline methods. As seen in the figure, the optimal power parameters were comparable with those reported in previous literature (between 2 and 5) [35]. All results reported in this subsection were obtained with the baseline optimized for the specific data set being discussed. The spherical splines baseline has no analogous parameter we can optimize.

Figure 3.4 Power parameter optimization to maximize the performance of the baseline approaches, for each of the four data partitions and for the two methods, EUD and EGL (solid and dashed lines respectively). EGL: geodesic length baseline; EUD: Euclidean baseline; AMSE: averaged mean square error.

Figure 3.5 An exemplary 48 ms reconstruction of the EEG data for Subject 0, resting state task, channel P4. The original channel data was removed and interpolated using the best-performing baseline (geodesic length calculation, in blue) and our method (in red).

3.1.5.3 Main Result

In Table 3.1, we compare the results of our proposed approach against the baselines for the EEG interpolation task on the test sets. The baseline methods are highly unstable, exhibiting high variability in performance relative to our approach. The encoder-decoder model consistently outperformed the baselines by at least 10%. Moreover, by utilizing transfer learning, the network was able to improve its accuracy even with minimal additional data and training time.

Interestingly, in contrast to results reported in the literature, the EGL method did not clearly outperform the EUD baseline [35]. This might be due to our data not having the precise electrode locations, in contrast to previous research (see Subsubsection 3.1.6.2 in the discussion). Another interesting result is the pattern of improvements after transfer learning. As can be seen in the last row of Table 3.1, the biggest improvement was for the Unseen Task, Seen Subjects data. This hints that there was more variability between EEG data from different tasks than between data from different subjects. Additional testing will be needed to verify this hypothesis. However, this can be seen as a compelling argument in favor of flexible models that can be tuned for the specific data the researcher is working with.
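For reference, the inverse-distance baselines and the per-partition power sweep described in Subsubsection 3.1.5.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the array and function names are hypothetical, and the distance vector would come from either the Euclidean (EUD) or geodesic (EGL) calculation.

```python
import numpy as np

def inverse_distance_interpolate(signals, distances, missing, p=3):
    """Estimate a missing channel as the inverse-distance weighted average
    of the remaining channels: w_ij = 1 / d_ij^p (cf. the EUD/EGL baselines).

    signals:   (n_channels, n_samples) array; row `missing` is held out
    distances: (n_channels,) distances d_ij from the missing electrode to each j
    """
    others = [j for j in range(signals.shape[0]) if j != missing]
    w = 1.0 / distances[others] ** p
    return (w[:, None] * signals[others]).sum(axis=0) / w.sum()

def optimize_power(signals, distances, missing, powers=range(2, 6)):
    """Sweep the integer power parameter and keep the value that minimizes
    the mean square error against the held-out ground-truth channel."""
    mse = {p: np.mean((inverse_distance_interpolate(signals, distances, missing, p)
                       - signals[missing]) ** 2) for p in powers}
    return min(mse, key=mse.get)
```
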
Finally, we also extracted the delta (0.5-4 Hz), theta (4-8 Hz), alpha (8-12 Hz), beta (12-30 Hz), and gamma (30-100 Hz) bands and tested the model's performance for each band separately. Our method significantly improved over the baseline methods in all bands. This is crucial, as different bands have different functions (for instance, the theta band is especially responsive during observation and memorization tasks [177]); hence, for an interpolation method to be useful, the reconstruction fidelity must be consistent across all frequency bands. In the interest of brevity we will not present the results for all these bands separately. The code to extract the sub-bands is also available online.

3.1.5.4 Performance on Exemplary Data

In Figure 3.5, we present an example of our method's interpolation on an exemplary portion of the data, compared against the baselines. As shown in the figure, the best baseline reconstruction contains voltage fluctuations that do not appear in the original signal or in the one reconstructed using our method. These fluctuations were quite common in baseline reconstructions. We speculate that our method might have learned not only the optimal weights for approximating the occluded channel, but also a more complicated relationship that enables it to suppress potential artifacts that are localized to one electrode and therefore do not affect the electrode being reconstructed. All things being equal, this is evidence that our framework was able to learn nuanced relationships between electrode measurements that are not captured by the baseline approaches.

3.1.6 Discussion

Our work used a deep encoder-decoder model to tackle the problem of EEG channel interpolation. While discriminative frameworks are only able to detect and label bad data segments, our results demonstrate that a generative approach can reconstruct the missing channel with high fidelity to the original signal. The success of our method suggests that deep learning can capture complex relationships between electrodes that are not sufficiently expressed by the relatively simple inverse-distance calculations predominant in contemporary solutions.

3.1.6.1 On Self-supervised Learning

Data labeling is often a tedious and resource-consuming process. Unfortunately, training deep learning models often requires extensive data collection and labeling efforts. Therefore, deep learning researchers have recently begun to focus on finding ways to mitigate the need for labeled data. As we showed in this study, one approach is to frame problems as self-supervised learning tasks. Specifically, our work is a special case of a popular self-supervised learning task: the prediction of occluded parts of data from visible ones. By using this framing we were able to circumvent a common hurdle faced by deep learning approaches.

3.1.6.2 On the Challenges of Electrode Localization

As discussed previously, prior research that compared different interpolation methods used electrode localization to extract exact channel locations for each specific subject. While generic and imprecise locations are always available, electrode localization methods attempt to alleviate the noisiness inherent to EEG by providing exact electrode locations. This localization can be done in many ways; one expensive option is to equip EEG caps with spatial or motion-capture sensors [35, 139].
Other methods require less specialized hardware, including a simple DSLR camera [32] or a Kinect paired with a neural network [53]. However, despite these recent advances, electrode localization remains uncommon. For instance, no EEG data set on physionet (https://physionet.org/about/database/#ecg) or gigadb (https://gigadb.org/search/new?keyword=eeg) contains electrode localization. Therefore, and to ensure our method is applicable to the vast majority of databases, the data we used also did not include electrode localization [201]. Possible future work could incorporate location data into the deep learning framework.

3.1.6.3 On Baseline Approaches

It is worth noting that there are multiple other interpolation methods, such as the nearest-neighbors method and the planar-spline technique [163]. We selected the baseline methods described in Subsubsection 3.1.2.1 as they were the most contemporary approaches on the topic. Furthermore, the performance improvements of our model are especially impressive considering that the EUD and EGL baselines were optimized to maximize their performance on each and every separate partition of the data.

The SS method requires a system of equations to be solved for each and every time point. This is not a trivial requirement, as it necessitates complex calculations. This demand renders the SS method ill-suited for online interpolation, and by extension many BCI applications [20, 93, 138]. In contrast to the taxing nature of the training procedure, passing data forward through a neural network is computationally cheap. Therefore our approach could potentially satisfy a growing need for accurate interpolation of online data.

3.1.6.4 On Transfer Learning

Transfer learning involves training a model on a problem similar to the one being solved. This is especially useful when only scarce data is available for the target problem, hindering the training of the model. While transfer learning is possible for many machine learning algorithms, such as Bayesian networks and Markov chains, this technique has become essential to deep learning in particular due to its reliance on huge amounts of training data, and is considered central to the success and ubiquity of neural networks [117]. Our work, for instance, would be considerably less useful if it required every researcher to train the neural network from scratch, or if the results on data-sets that the model was not trained on were considerably worse.

3.1.7 Conclusion and Future Work

With the increasing prevalence of EEG devices, there is a need for methodologies that better address common EEG artifacts. In this work, we developed a deep encoder-decoder based method to interpolate EEG segments impacted by the most common EEG artifact: the electrode “pop”. We demonstrated that our method improved EEG reconstruction performance compared to existing approaches, and that it generalized well to unseen tasks and subjects. Future work will extend this method to tackle other kinds of electrode artifacts. Moreover, an end-to-end system that automatically detects artifacts and replaces the corrupted data with an interpolated reconstruction of the original might be of particular interest to the community.

3.2 Unsupervised EEG Artifact Detection and Correction

This section was published in January 2021 as a manuscript titled "Unsupervised EEG Artifact Detection and Correction" in Frontiers in Digital Health [146].
3.2.1 Introduction

Electroencephalography (EEG) devices are pervasive tools used for clinical research, education, entertainment, and a variety of other domains [161]. However, most EEG applications remain limited by the low signal-to-noise ratio inherent to data collected by EEG devices. EEG noise sources include: movement artifacts, physiological artifacts (e.g. from perspiration), and instrument artifacts (resulting from the EEG device itself). While researchers have developed a number of methods to identify specific instances of these artifacts in EEG data [176], most methods require manual labeling of exemplary artifact segments (which may be used as “templates” by statistical or rule-based methods for the identification, and potential rejection, of noisy data epochs), special hardware such as electrooculography electrodes placed around the eyes, or large data-sets of templates such as independent component scalp maps [154].

Manual annotation of artifacts in EEG data is problematic because it is time-consuming and may even be untenable if the specific profiles of artifacts in the EEG data vary as a function of the task, the subject, or the experimental trial within a given task for a given subject - as they so often do. These realities quickly scale the complexity of the artifact annotation problem, and make the use of a one-size-fits-all artifact detection method infeasible for many practical use cases.

Even if artifacts could be identified with perfect fidelity, their simple removal (e.g., by deletion of the corrupted segment) may introduce secondary analytic complications that confound the performance of downstream methods that leverage these data. For instance, methods that rely on the stationarity of EEG segments will be confounded by simple removal of artifact segments. Even the simplest approaches, such as averaging many EEG trials before extracting features [30], may be less effective if artifact occurrence is correlated with the trial type or experimental condition, thereby increasing the likelihood of a type II error and the consequent reduction in experimental power.

An essential challenge of artifact detection in EEG processing is that the definition of “artifact” depends on the specific task at hand. That is, a given EEG segment is an artifact if and only if it impacts the performance of downstream methods by manifesting as uncorrelated noise in a feature space that is relevant to those methods. For instance, muscle movement signatures confound coma-prognostic classification but are useful features for sleep stage identification [54]. The task-specific nature of artifacts makes their detection especially suitable for data-driven unsupervised approaches, as the only requirement for the identification of artifacts using such methods is that the artifacts are relatively infrequent. That is, when mapping our data into feature spaces that are relevant to the specific EEG task, artifacts should stand out as rare anomalies. Indeed, many state-of-the-art approaches use unsupervised methods for the detection of specific artifact types under specific circumstances. For instance, the Blink algorithm described by Agarwal et al. is a fully unsupervised EEG artifact detection algorithm [2] that is effective for the detection of eye-blinks.
While existing methods provide excellent performance for specific artifact types, there is a need for additional progress toward generalized artifact detection approaches that make no assumptions about the task, the subject, or the recording circumstances.

It is also possible to go beyond artifact detection and correct the EEG trial by removing the artifact signal. EEG artifact removal is one instance of a more general class of noise reduction problems. The removal of noise from signal data has been a topic of scientific inquiry since Shannon laid the foundations of information theory in the 1940s [155], and over the years multiple signal processing approaches to this problem have found their way into EEG research. One such technique for artifact removal that is ubiquitous in EEG processing is Independent Component Analysis (ICA). This method and its modern derivatives remain popular among the research community for unsupervised artifact correction. However, ICA still requires EEG experts to review the decomposed signals and manually classify them as either signal or noise. Furthermore, while ICA is undeniably an invaluable tool for many EEG applications, it also has limitations that are particularly pronounced when the number of channels is low; ICA can only extract as many independent components as there are channels, and will therefore be unable to isolate all independent noise components if the total number of independent noise components and signal sources exceeds the number of EEG electrodes [39].

Artifact removal is an especially common practice for a particular artifact type: the electrode “pop”. These artifacts result from abrupt changes in impedance, often due to loose electrode placement or bad conductivity [180, 100]. Unlike muscle and movement artifacts, electrode pop is extremely localized, often affecting only one electrode channel. Channel interpolation is the process of replacing the signal of a corrupted channel with one that is interpolated from surrounding clean channels. Petrichella et al. demonstrated that knowing specific electrode locations (namely, the exact electrode locations for each subject) and the distances between them can improve interpolation results [132, 35]. However, this type of additional information is rarely available and often requires dedicated hardware. Recently, Sadiya et al. proposed a deep learning convolutional auto-encoder based approach to learn task- and subject-specific interpolation [146]. By iteratively occluding channels in the input and using the original data as the ground truth, the model learned how to interpolate channels in a self-supervised manner with no human annotation. Moreover, not only was the model able to learn idiosyncratic information such as subject-specific electrode locations, outperforming state-of-the-art models, it was also possible to use transfer learning to improve performance on previously unseen tasks and subjects.

In this paper, we extend the aforementioned state-of-the-art approaches in artifact detection and rejection by building an end-to-end pipeline that solves both the detection and rejection problems together, without making any assumptions concerning the task or artifact type. Our artifact detection approach uses a collection of quantitative EEG features that are relevant to a wide variety of tasks including coma prognostics [172], diagnosing mental illness [174], decoding mental representations [70], decoding attention deployment [200], and brain computer interface design [6].
Unsupervised outlier detection algorithms utilize these extracted features to identify artifacts in the EEG data. These unsupervised algorithms only require an estimate of the frequency of artifacts in the data, and can detect any artifact type, irrespective of the task. To guarantee that our results accurately represent the capabilities of these unsupervised outlier detectors, we carefully selected algorithms that are qualitatively different from each other (for instance, relying on local vs. global characteristics of the data distributions) and explored hundreds of different possible configurations. Subsubsection 3.2.2.2 provides a comprehensive review of the feature extraction process and details our experimentation with the different outlier detection algorithms.

Our artifact correction approach uses a deep encoder-decoder network to correct artifacts that are not restricted to only one channel. Specifically, we frame our learning objective as a modified “frame-interpolation” task; frame interpolation is the filling in of missing frames in a video [79]. To the best of our knowledge this is the first work that takes this approach to EEG artifact correction. The proposed approach is also unique in that it does not require the maintenance of any large data-set of templates or annotated data, as other state-of-the-art artifact removal methods do [2]. The model architecture, as well as the exact objective formulation, is discussed in detail in Subsubsection 3.2.2.3. The data-sets used in this work are discussed in detail in Subsubsection 3.2.2.1, the results of the different experiments we conducted can be found in Subsection 3.2.3, and finally we discuss our findings, their broad implications, and the limitations of our approach in Subsection 3.2.4.

3.2.2 Methods

In this paper we propose an end-to-end pre-processing pipeline for the automated identification, rejection, and removal / correction of EEG artifacts using a combination of feature-based and deep-learning models, which is intended for use as a general-purpose EEG pre-processing tool. To begin, we provide a brief overview of the data and methodological pipeline, calling out the specific subsubsections where the full details of each component of the pipeline are discussed. In Figure 3.6 we provide a visualization of our proposed pre-processing pipeline; our method begins by performing unsupervised detection of artifact-ridden epoched EEG segments in a 58-dimensional feature space (Subsubsection 3.2.2.2). The trials that were not rejected in this initial stage are used to train a deep encoder-decoder network designed to correct artifact segments (Subsubsection 3.2.2.3).

Figure 3.6 Our methodological approach. The EEG data is first segmented into epochs (see 𝐴1, 𝐴2, 𝐴3). Next, 58 features are extracted and an ensemble of unsupervised outlier detection methods is used (see 𝐵1, 𝐵2, 𝐵3) to identify EEG epochs that are artifact-ridden and require interpolation (see 𝐴2 and 𝐵2). The artifact-ridden epochs are then interpolated by an ensemble of deep encoder-decoder networks (see red line in 𝐶).

While we demonstrate this method on a particular data set (described below), it is applicable (with no modifications) to any EEG pre-processing work. The methods are presented in the order of their processing within our proposed pipeline.
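Schematically, the pipeline reduces to a few lines of glue code. The sketch below is a simplified illustration under stated assumptions: the function and variable names are hypothetical, and the detector is assumed to follow the sklearn-style convention in which fit_predict returns -1 for outliers.

```python
import numpy as np

def clean_eeg(epochs, extract_features, detector, corrector):
    """Two-stage pipeline sketch: flag artifactual epochs in feature space,
    then hand only the flagged epochs to the deep correction model.

    epochs:           (n_trials, n_channels, n_samples) array of epoched EEG
    extract_features: maps one epoch to a feature vector (58-dim in our case)
    detector:         sklearn-style outlier model (fit_predict: -1 = outlier)
    corrector:        maps a flagged epoch to a corrected epoch
    """
    feats = np.stack([extract_features(e) for e in epochs])
    flags = detector.fit_predict(feats)
    cleaned = epochs.copy()
    for i in np.where(flags == -1)[0]:
        cleaned[i] = corrector(epochs[i])
    return cleaned, flags
```

For instance, detector could be sklearn.ensemble.IsolationForest(contamination=0.15), with the contamination argument supplying the required estimate of artifact frequency.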
3.2.2.1 Data-sets

Data acquisition Our aim is to demonstrate that unsupervised anomaly detection can be successfully used to identify artifacts in EEG data, and that these artifacts can be corrected via representation learning methods (see Subsubsection 3.2.2.3). To demonstrate the feasibility of our approach, it is necessary to have not only ground truth artifact annotations, but also the ground truth labels for all trials, including those that were annotated as artifacts. While the artifact annotations allow us to test the unsupervised outlier detection methods, the trial labels allow us to verify that corrected EEG data can indeed be used in conjunction with the regular data for downstream analytic tasks (e.g. training a classification model). Unfortunately, available data sets usually do not contain rejected trials, and even when these annotations are available the original trial label is not included (for instance, in the BCI competitions data: http://bbci.de/competition/). Therefore, our work is validated on two data-sets, hereinafter referred to as the orientation and color data-sets, that were previously collected by Sadiya et al. [144]. We briefly describe these data-sets here; additional information is provided in the Supplementary materials.

Both experiments were passive viewing tasks. The orientation task stimulus consisted of 6 oriented gratings; the color task stimulus consisted of random dot fields in 6 different colors. The stimuli were generated using MGL, a library running in Matlab (Mathworks). The data was collected using a 32-electrode actiCHamp cap at 1000 Hz. For each task we collected data from 7 subjects (4 male), for a total of ∼10,000 EEG trials. All subjects reported normal or corrected-to-normal vision. The data was examined for noisy trials by expert annotators. Fully annotated and anonymized data-sets will be made available online. Participants gave informed consent and were compensated at a rate of $15 per hour. The experimental procedures were approved by the Michigan State University Institutional Review Board and adhered to the tenets of the Declaration of Helsinki.

3.2.2.2 Unsupervised artifact detection

To benchmark the different outlier detection methods, we collected a list of common features used in EEG research in different domains and applied various unsupervised outlier detection algorithms. Our main objective was to thoroughly investigate the feasibility of unsupervised artifact rejection for EEG.

Feature extraction Building on the previous work of Ghassemi et al. [56], we reviewed the EEG literature and constructed a permissive list of features that are commonly used for EEG classification tasks. In total we identified and extracted 58 features. The code that extracts these features was written to allow for parallelization of the calculations, and is accessible as a downloadable python 3.5 package (code available at: https://github.com/sari-saba-sadiya/EEGExtract). See Table 3.2 for a breakdown of, and references for, all 58 features. These features can be grouped into three categories, measuring the complexity, continuity, and connectivity of EEG activity. Before continuing to discuss our pipeline we will provide a high-level intuition behind the inclusion of each category; we encourage the interested reader to refer to the previous work of Ghassemi et al. for a more detailed discussion of the specific features [56].
Complexity Features - degree of randomness or irregularity:
  Shannon Entropy [156]: additive measure of signal stochasticity
  Tsallis Entropy (n=10) [52]: non-additive measure of signal stochasticity
  Information Quantity (𝛿, 𝛼, 𝜃, 𝛽, 𝛾) [158]: entropy of a wavelet-decomposed signal
  Cepstrum Coefficients (n=2) [125]: rate of change in signal spectral band power
  Lyapunov Exponent [185]: separation between signals with similar trajectories
  Fractal Embedding Dimension [1]: how signal properties change with scale
  Hjorth Mobility [120]: mean signal frequency
  Hjorth Complexity [120]: rate of change in mean signal frequency
  False Nearest Neighbor [69]: signal continuity and smoothness
  ARMA Coefficients (n=2) [21]: autoregressive coefficients of the signal at (t-1) and (t-2)

Continuity Features - clinically grounded signal characteristics:
  Median Frequency: the median spectral power
  𝛿 band Power: spectral power in the 0-3 Hz range
  𝜃 band Power: spectral power in the 4-7 Hz range
  𝛼 band Power: spectral power in the 8-15 Hz range
  𝛽 band Power: spectral power in the 16-31 Hz range
  𝛾 band Power: spectral power above 32 Hz
  Standard Deviation [142]: average difference between the signal value and its mean
  𝛼/𝛿 Ratio [172]: ratio of the power spectral density in the 𝛼 and 𝛿 bands
  Regularity (burst-suppression) [172]: measure of signal stationarity / spectral consistency
  Voltage < (5𝜇, 10𝜇, 20𝜇): low signal amplitude
  Diffuse Slowing [167]: indicator of peak power spectral density less than 8 Hz
  Spikes [167]: signal amplitude exceeds 𝜇 by 3𝜎 for 70 ms or less
  Delta Burst after Spike [167]: increased 𝛿 after spike, relative to 𝛿 before spike
  Sharp Spike [167]: spikes lasting less than 70 ms
  Number of Bursts: number of amplitude bursts
  Burst Length 𝜇 and 𝜎: statistical properties of bursts
  Burst Band Powers (𝛿, 𝛼, 𝜃, 𝛽, 𝛾): spectral power of bursts
  Number of Suppressions: segments with contiguous amplitude suppression
  Suppression Length 𝜇 and 𝜎: statistical properties of suppressions

Connectivity Features - interactions between EEG electrode pairs:
  Coherence - 𝛿 [172]: correlation in 0-4 Hz power between signals
  Mutual Information [6]: measure of dependence
  Granger Causality - All [16]: measure of causality
  Phase Lag Index [166]: association between the instantaneous phases of signals
  Cross-correlation Magnitude [81]: maximum correlation between two signals
  Cross-correlation Lag [81]: time-delay that maximizes correlation between signals

Table 3.2 The 58 EEG features fell into three EEG signal property domains: complexity features (25 in total), continuity features (27 in total), and connectivity features (6 in total).

Complexity features (n = 25): These features measure the complexity of the EEG signal from an information-theoretic perspective, and are known to correlate with impaired cognitive functions and the presence of degenerative illnesses. Therefore our first set of features is a collection of information-theoretic complexity measures. Of special interest are the first three features shown in Table 3.2, as they are particularly prominent in EEG research: Shannon's entropy has been associated with neurological outcomes in post-anoxic coma patients [172]; the entropy of the decomposed EEG wavelet signals (known as the Subband Information Quantity) has similarly been used in cardiac arrest studies [157, 78].
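As a concrete illustration of the simplest of these measures, the snippet below estimates the Shannon entropy of a single-channel epoch from a histogram of its amplitudes. This is a minimal sketch for intuition only; the released EEGExtract package implements the full feature set, and the bin count here is an arbitrary assumption.

```python
import numpy as np

def shannon_entropy(signal, n_bins=64):
    """Histogram-based Shannon entropy (in bits) of one EEG channel epoch."""
    counts, _ = np.histogram(signal, bins=n_bins)
    p = counts / counts.sum()        # empirical amplitude distribution
    p = p[p > 0]                     # drop empty bins: 0 * log(0) := 0
    return -np.sum(p * np.log2(p))
```
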
Tsallis entropy is a generalization of Shannon's entropy that does not make assumptions about the independence of data channels (as Shannon's entropy does) and has been shown to be particularly useful for the characterization of complexity in EEG data [52].

Continuity features (n = 27): These features capture the regularity and volatility of EEG activity. Bursts, spikes, and unusual changes in the mean and standard deviation in the frequency and power domains are examples of continuity features that are relevant to a variety of clinical tasks. See Hirsh et al. for an in-depth review of continuity and its relevance to clinical care [71].

Connectivity features (n = 6): These features reflect the statistical dependence of EEG signal activity across two or more channels. Functional connectivity networks are an established feature of normal brain functioning. We draw on the rich literature on measuring connectivity from EEG signals [150], extracting features that have previously been used for designing brain computer interfaces [6] as well as in mental-illness, perception, and attention research (see [174], [70], and [200] respectively).

Outlier detection methods We explored a set of ten algorithms for unsupervised artifact detection; the explored algorithms were inspired by the work of Zhao et al. [196]. The algorithms can be divided into two general groups, statistical methods and representation learning methods; they are described in more detail in the “Statistical methods” and “Representation learning based methods” paragraphs below. The hyper-parameters of each method were determined by randomly exploring the hyper-parameter space and choosing the settings that yielded the best performance on the data according to our artifact annotations.

Statistical methods: Statistical methods identify anomalies based on statistical measures extracted from the data, thereby producing an “anomaly score” for each trial. The Histogram Based Outlier detection (HBOS) method uses histograms with dynamic bin widths to detect clusters and anomalies in different feature dimensions; despite the simplicity of the approach, it has been shown to work well on a variety of data types [60]. The Local Outlier Factor (LOF) method similarly calculates an “outlier score”; however, instead of global measures, it relies on the local density of the data as its main indicator [22]. Another popular local algorithm, the Angle-Based Outlier Detector (ABOD), calculates the cosine similarity of data points with their neighbors and uses the variance of these scores to generate anomaly scores [85]. Finally, we also trained a One Class SVM Detector (OCSVM), a classic algorithm for outlier detection [151]. In this algorithm an SVM is trained on the entire data-set and afterwards every instance is scored based on its distance from the class boundary; the intuition is that the infrequent outliers will contribute less to the decision boundary calculation and will be more likely to lie on the margin of the learned boundary. As previously mentioned, we selected these detectors to differ in the type of statistical measurements they use; it therefore makes sense to also train ensemble classifiers to further improve the outlier detection accuracy. Specifically, we trained five hundred Locally Selective Combination in Parallel (LSCP) outlier ensembles [197] with different combinations of the algorithms mentioned above.
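To make this concrete: assuming the PyOD library of Zhao et al. [196] (the work that inspired our exploration) provides the implementations, a single LSCP ensemble over the detectors named above might be configured as follows. The bin counts, tolerances, and contamination estimate are illustrative assumptions, loosely echoing the best-performing configuration reported in the Results.

```python
import numpy as np
from pyod.models.hbos import HBOS
from pyod.models.ocsvm import OCSVM
from pyod.models.lscp import LSCP

# (n_trials, 58) matrix of extracted features; the file name is hypothetical.
features = np.load("epoch_features.npy")

ensemble = LSCP(
    detector_list=[
        HBOS(n_bins=50, tol=0.1),  # many bins, rigid outlier scoring policy
        HBOS(n_bins=10, tol=0.5),  # fewer bins, more relaxed policy
        OCSVM(),
    ],
    contamination=0.15,            # rough estimate of the artifact frequency
)
ensemble.fit(features)
is_artifact = ensemble.labels_ == 1  # PyOD convention: 1 marks outliers
```
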
Representation learning based methods: Unlike statistical methods, representation learning based outlier detectors do not simply calculate statistical properties of the featurized data. The most basic classifier uses auto-encoder (AUTO) based deep learning architectures to learn a lower-dimensional representation of the data that enables the best possible reconstruction of the original signal; the embedding is optimized for the common, regular data points, thereby producing distinctly noisy reconstructions for the outlier trials [3]. This classifier can be viewed as a modern update of classic outlier detection methods that use PCA reconstruction instead of training a deep auto-encoder (PCA) [159]. A more sophisticated approach uses Variational Auto-Encoders (VAE); this class of algorithms tries to ensure that the learned embedding captures the structure of the original data by penalizing the classifier if the embedding does not follow a standard normal distribution [4]. Finally, we also examined a Generative Adversarial Active Learning (GAAL) outlier detector [99], which uses generative adversarial networks to generate outliers; this method can be used to improve any of the statistical methods described in Subsubsection 3.2.2.2. We also used an extension of the original method that learns multiple generators (MGAAL).

3.2.2.3 Artifact correction

As previously mentioned, encoder-decoder based deep learning methods have proven useful for channel interpolation [146]. In this subsection we discuss an extension of this approach that utilizes the same framework for artifact correction. Namely, given an EEG data segment with an isolated artifact, we remove the corrupted segment and use the data samples preceding and following it to fill in the resulting void. This problem is equivalent to the “frame-interpolation” task of filling in missing frames in a video [79].

The model Input representation: The channel interpolation model proposed in [146] represented the EEG as a time series of 2D topologically organized arrays. This reflects the spatial nature of the EEG channel interpolation problem; the interpolated values at different time points are treated as independent. To the best of the authors' knowledge this is a standard assumption for EEG interpolation algorithms; for instance, Petrichella et al. and Courellis et al. calculate the interpolated values of the missing data at each time point separately [132, 35]. However, research on convolutional neural networks for EEG decoding and visualization has shown performance benefits from presenting the input as a column of electrodes unfolding in time, as this facilitates the learning of temporal modulations [149]. Since artifact correction is first and foremost a process of completing gaps across time, we decided to depart from [146] and use a 2D array representation with the number of time steps as the width of the array.

Architecture: The best frame interpolation models involve calculating object trajectories and accounting for possible occlusion (e.g. if one object moves behind another). With these "flow computations" and a stack of the frames before and after the missing image, a convolutional encoder-decoder can generate realistic intermediate images [79]. Unlike video, EEG data has only one spatial dimension (see Subsubsection 3.2.2.3) and no analogues to local phenomena such as occlusion or object movement can occur, as EEG modulations are often thought of as mostly global in nature [149].
Therefore we only concern ourselves with a stacked convolutional auto-encoder. This architecture is shared by the previously discussed state-of-the-art algorithms for both frame interpolation and channel interpolation [149, 146]. The interpolation of each frame is done separately; thus, to predict $n$ frames it is necessary to train $n$ networks. Technically this is equivalent to training one ensemble model; however, by separating the networks we allow for easier parallelization of the training process. Specifically, given a series of EEG frames $x_1, x_2, \ldots, x_n$, where $x_t$ is a vector of all the channel values at time $t$, and assuming that the series is missing all frames between time points $t_b$ and $t_e$, our network learns to predict $x_{t_q}$ from the two stacks $x_{t_b-h}, x_{t_b-h+1}, \ldots, x_{t_b}$ and $x_{t_e}, x_{t_e+1}, \ldots, x_{t_e+h}$, where $t_q \in (t_b, t_e)$ and $h$ is some small positive integer representing how many frames before and after the missing segment can be perceived. Every network is trained to predict the value at one specific value of $q$, and every network takes the same $2h$ frames (half preceding the missing segment and half following it) to calculate the value at its given frame.

3.2.2.4 Model validation approach

Artifact detection method: The performance of the artifact detection methods was assessed by inspecting the agreement between the artifact detection approach and the expert annotations from the two data sets (color and orientation). More specifically, the agreement was measured using the f-score and Cohen's Kappa (first and second values in each cell, respectively). We compared the performance of our model against the expected performance of a classifier with knowledge of the exact number of artifacts; this random classifier is expected to have an f-score of 0.172 and a Kappa of 0.029. We ran the detection algorithms in two configurations: for each subject separately, and for the entire aggregated data. We hypothesize that the performance will drop in the aggregated configuration, as each individual setup for an EEG recording is likely to introduce unique artifacts (due to loose connections, or subject-specific circumstances such as perspiration).

Artifact correction method: To optimize the parameters of the artifact correction model, we produced training data from trials that were marked as artifact-free by our unsupervised artifact detection method (Subsubsection 3.2.2.2) and randomly removed a segment from the middle of each trial. The $h$ samples preceding the removed segment and the $h$ samples following it were used as input for the model, while the removed segment was the ground truth ($h$ was a hyper-parameter optimized on the training set). For the purposes of validating the artifact correction model, all EEG data was re-sampled to 200 Hz. The reconstructed segments were 200 ms each.

End-to-end assessment approach: We ran a number of tests to examine whether the trials reconstructed by our artifact correction method could be used to enhance the performance of downstream EEG tasks. More specifically, we trained two SVM models to predict the label of the trial from the color data-set: one SVM was trained using the raw data, and the other was trained using the raw data after artifact correction. Both models were validated using 5-fold cross-validation, and the performance of the models on the test set (𝜇 and 𝜎) was reported.
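As a sketch of the training-pair construction described in the artifact correction paragraph above (under stated assumptions: the numbers follow the text — 200 Hz data, h = 32 context samples, a 40-sample / 200 ms gap — but the function and its names are illustrative, not the released implementation):

```python
import numpy as np

def make_training_pair(trial, h=32, gap=40, q=20, rng=None):
    """Build one (input, target) pair for the correction network.

    trial: (n_channels, n_samples) artifact-free epoch at 200 Hz.
    A `gap`-sample segment (200 ms) is removed; the `h` samples before it and
    the `h` samples after it are stacked as input, and the frame at offset
    `q` inside the gap is the ground truth (one network per value of q).
    """
    rng = rng or np.random.default_rng()
    start = int(rng.integers(h, trial.shape[1] - gap - h))  # gap onset
    before = trial[:, start - h:start]
    after = trial[:, start + gap:start + gap + h]
    x = np.concatenate([before, after], axis=1)  # (n_channels, 2h) input
    y = trial[:, start + q]                      # target frame x_{t_q}
    return x, y
```
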
We also evaluated the impact of our artifact correction method on downstream EEG tasks when applied exclusively to clean trials; this evaluation allowed us to test for inadvertent degradation in the signal quality of clean segments when processed by our method. More specifically, we applied our artifact correction method to 20% of the clean trials and used the resulting data to train an additional SVM model.

3.2.3 Results

This subsection presents the results of the two main components of our pipeline, the artifact detection method and the artifact correction method, on the data described in Subsubsection 3.2.2.1.

3.2.3.1 Artifact detection results

In Table 3.3, we compare the average performance of the outlier detection methods described in Subsubsection 3.2.2.2 when applied to each subject separately; each value is thus the mean of the algorithm's performance across subjects. As previously mentioned, the expected performance of a baseline random classifier with knowledge of the exact number of artifacts is an f-score of 0.172 and a Kappa of 0.029. All models other than the ABOD classifier performed significantly better than this baseline (one-tailed t-test at the 𝑝 = 0.05 significance level).

Statistical methods          HBOS    LOF     ABOD    OCSVM   LSCP
Orientation (f-score)        0.564   0.218   0.11    0.41    0.577
            (Kappa)          0.473   0.065   0.06    0.29    0.489
Color       (f-score)        0.5     0.241   0.1     0.36    0.51
            (Kappa)          0.4     0.091   −0.08   0.23    0.411

Representation learning      AUTO    PCA     VAE     GAAL    MGAAL
Orientation (f-score)        0.53    0.527   0.477   0.429   0.428
            (Kappa)          0.44    0.426   0.368   0.311   0.309
Color       (f-score)        0.51    0.477   0.478   0.241   0.389
            (Kappa)          0.42    0.367   0.368   0.086   0.263

Table 3.3 Comparison of the different unsupervised outlier detection methods when applied to each subject separately. We report the mean f-score and Cohen's Kappa (first and second row in every cell) across all subjects. HBOS: histogram based outlier detection; LOF: local outlier factor; ABOD: angle-based outlier detector; OCSVM: one class support vector machine; LSCP: locally selective combination of parallel outlier ensembles; AUTO: auto-encoder based method; PCA: principal component analysis based method; VAE: variational auto-encoder based method; GAAL: generative adversarial active learning; MGAAL: multiple-objective generative adversarial active learning.

Unsurprisingly, the best outlier detector was an LSCP ensemble classifier, which performed 16.86× better than the baseline method and 1.03× better than the next best approach; the best-performing configuration of the classifier consisted of two HBOS classifiers and one OCSVM. While it is difficult to interpret ensemble classifiers, it is worth noting that the two histogram based classifiers diverged quite substantially: one used a high number of histogram bins and a rigid outlier scoring policy (tol = 0.1), while the other used a smaller number of bins and a more relaxed policy (tol = 0.5). A simple auto-encoder was the best representation learning algorithm, closely followed by the PCA algorithm. We speculate that the auto-encoder could possibly have performed better had more data been available for each subject; see our supplementary material for a breakdown of trial and artifact numbers for each subject.

In Table 3.4, we compare the performance of the outlier detection methods described in Subsubsection 3.2.2.2 when applied to the subjects' aggregated data; that is, subjects were not considered separately as they were in the results from Table 3.3. When compared to the results shown in Table 3.3, the performance decreased for most models.
This is not surprising, as the fundamental assumption of unsupervised methods is that the data is homogeneous with the exception of the outliers. Here again the LSCP method performed the best of the tested approaches. A comparison of the results in Tables 3.3 and 3.4 provides motivation for the development of subject-specific anomaly detection approaches. Moreover, the comparison also highlights that the unsupervised algorithms and the features we extracted can successfully capture both common EEG artifacts and subject-specific idiosyncrasies.

Statistical methods          HBOS    LOF     ABOD    OCSVM   LSCP
Orientation (f-score)        0.502   0.246   0.07    0.362   0.537
            (Kappa)          0.4     0.095   −0.11   0.234   0.441
Color       (f-score)        0.476   0.305   0.09    0.377   0.463
            (Kappa)          0.35    0.15    −0.108  0.238   0.332

Representation learning      AUTO    PCA     VAE     GAAL    MGAAL
Orientation (f-score)        0.488   0.448   0.447   0.383   0.393
            (Kappa)          0.338   0.338   0.336   0.246   0.258
Color       (f-score)        0.414   0.437   0.436   0.185   0.393
            (Kappa)          0.283   0.312   0.31    0.022   0.258

Table 3.4 The performance of the models trained on data aggregated from all the subjects. The f-score and Cohen's Kappa are presented in the first and second row of every cell.

3.2.3.2 Artifact correction results

Network optimization: Our first step was to optimize the network hyper-parameter configurations. This included testing different sizes of both the layers and the convolution filters, as well as exploring different hyper-parameters such as optimization algorithms, dropout rates, and activation functions. To train the network we followed the method discussed in Subsubsection 3.2.2.4: we randomly extracted 104 samples from the data; the first and last 32 samples were stacked and used as the input to the model, and the sample at position 𝑖 within the remaining 40 samples was used as the ground truth. Essentially, we are training a network to predict the values of a removed 40-sample (200 ms) segment using the 32 samples immediately before and after it. The best-performing network (lowest loss) was different for different target positions. The optimal topology for reconstructing sample 20 is available in the supplementary material as a reference for the type of convolutional U-net architecture used.

End-to-end assessment In Table 3.5 we compare the classification accuracy of a 5-fold SVM model trained to perform a downstream classification of trial type using down-sampled EEG data in three different configurations: (1) the raw EEG data, (2) the data after correction of artifact segments, and (3) the data following “correction” of a random 40-sample segment in 20% of the non-artifact segments. Note that, while simple, this type of analysis is used in actual EEG research [30]. The performance remained comparable after applying the artifact correction to trials that did not contain any artifacts. This is a strong indication that the model is indeed able to learn how to reconstruct the original EEG signal. When using the corrected trials with EEG artifacts, the classification accuracy improved by 10% overall and by over 20% for trials that were marked as containing artifacts. These results demonstrate that our unsupervised end-to-end artifact correction pipeline improves downstream analysis.

                   Original EEG   EEG with Random Correction   EEG with Artifact Correction
All trials         0.3            0.31                         0.33
Rejected trials    0.23           0.23                         0.29

Table 3.5 Mean accuracies of simple SVM classifiers. A simple t-test confirmed that all accuracies were significantly above chance level (1/6 for 6 different colors) at the 𝑝 = 0.05 level. Original EEG: the original EEG data.
End-to-end assessment: In Table 3.5 we compare the classification accuracy of a 5-fold SVM model trained to perform a downstream classification of trial type using down-sampled EEG data with three different configurations of the data: (1) the raw EEG data, (2) the data after correction of artifact segments, and (3) the data following “correction” of a random 40 samples in 20% of the non-artifact segments. Note that, while simple, this type of analysis is used in actual EEG research [30]. The performance remained comparable after using the artifact correction on trials that did not contain any artifacts. This is a strong indication that the model is indeed able to learn how to reconstruct the original EEG signal. When using the corrected trials with EEG artifacts, the classification accuracy improved by 10% overall and by over 20% for trials that were marked as containing artifacts. These results successfully demonstrate that our unsupervised end-to-end artifact correction pipeline improves downstream analysis.

                  Original EEG   EEG with Random Correction   EEG with Artifact Correction
All trials        0.3            0.31                         0.33
Rejected trials   0.23           0.23                         0.29

Table 3.5 Mean accuracies of simple SVM classifiers. A simple t-test confirmed that all accuracies were significantly above chance level (1/6 for 6 different colors) at a 𝑝 = 0.05 level. Original EEG: the original EEG data. EEG with Random Correction: the EEG data after random artifact-free trials were “corrected”. EEG with Artifact Correction: the data after we applied the EEG artifact correction to the trials that were marked as artifact-ridden.

3.2.4 Discussion

Significance of our results: In this paper we presented an end-to-end pipeline that is capable of unsupervised artifact detection and correction. Our results demonstrate that data-driven approaches for unsupervised outlier detection can be extremely useful when applied to the problem of EEG artifact detection. Interestingly, the classifiers with the best performance (HBOS, OCSVM, and the best performing LSCP) are global classifiers; this might indicate that EEG artifacts are better discriminated by global characteristics. This supports our previous observation that artifacts are task-specific and infrequent occurrences of uncorrelated noise. It is worth noting that, as demonstrated in Table 3.4, the classifiers we trained were able to learn subject-specific idiosyncrasies. While the accuracy and agreement between the annotators and the detectors was far from perfect, the Cohen’s Kappa of the best performing algorithm was comparable to the inter-rater agreement levels of expert annotators reported in the literature; for instance, when asked to annotate “periodic discharges” (a specific type of artifact) and “electrographic seizure”, annotators had a Cohen’s Kappa of 0.38 and 0.58 respectively [67]. Our results indicate that unsupervised outlier detection is a feasible approach for generalized EEG artifact detection.

The data-sets: We validated our framework on two novel data-sets. To test the impact of artifact correction algorithms on downstream analysis, it is necessary to have ground truth artifact annotations as well as knowledge of the labels of all trials, including those that are artifact-ridden. Unfortunately, public data-sets often exclude trials that contain artifacts. Even on the rare occasions when these trials are made available, the labels are often replaced with a special identifier for rejected trials (for an example of standard EEG publishing practices, see the BCI Competition data-sets). We hope our data-sets inspire other researchers to adopt more thorough data publishing practices, as data availability is perhaps the primary limiting factor in artifact correction research.

The strength of unsupervised end-to-end methods: The accuracy of simple classifiers improved modestly after artifact removal. It is possible that replacing our deep learning based artifact removal components with an ICA artifact removal algorithm [80] could yield better results. However, two important distinctions should be made. First, the proposed method does sidestep many weaknesses inherent to ICA [39] (such as the number of independent components being limited by the number of channels, which is particularly problematic for lightweight commercial EEG setups). Second, while the independent component decomposition itself is data-driven and unsupervised, the ICA method still requires visual inspection and analysis of the decomposed signal by human experts. In contrast, our method can be put into effect without any human intervention, making it suitable for online EEG applications or as a no-cost first step before more thorough analysis. In general, supervised methods unquestionably outperform unsupervised ones, and we fully acknowledge that the pipeline proposed in this work is no different. It is therefore useful to consider unsupervised methods not as replacements for currently existing algorithms, but as complementary additions to the toolbox of the EEG researcher. With this in mind, we intentionally designed our end-to-end pipeline to be highly modular: an experienced researcher can easily substitute our last component with an ICA artifact removal algorithm; conversely, researchers that have access to artifact annotations (for instance by virtue of employing specialized hardware during data acquisition) will be able to use their own detection method in conjunction with ours, or skip the detection step completely and apply only the artifact correction component before carrying on with the analysis process.
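To illustrate the modularity argument, the sketch below shows the kind of two-stage interface we have in mind; the function and component names are hypothetical and do not reflect the released package’s actual API.

```python
# A sketch of the modular two-stage design discussed above. Each stage is an
# interchangeable callable: `detect` maps epochs to flagged indices, and
# `correct` reconstructs a flagged epoch. All names here are illustrative.
def preprocess(epochs, detect, correct):
    flagged = detect(epochs)                # e.g., the LSCP detector above
    for idx in flagged:
        epochs[idx] = correct(epochs[idx])  # e.g., the U-net corrector, or ICA
    return epochs

# Researchers with hardware-derived artifact annotations can skip detection:
# clean = preprocess(epochs, detect=lambda _: known_artifact_indices,
#                    correct=unet_corrector)
```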
Limitations: We did not formally evaluate the reconstruction performance of the model because (1) there is no authoritative baseline in the literature, and (2) insofar as the reconstruction enhances the ability of the downstream classification models to perform their intended classification tasks, the reconstruction is valid and valuable. There are a few limitations that we hope to address in future work. First and foremost, this artifact detection method can only be used if the frequency of the artifacts is low enough for them to be considered outliers. While this is indeed the case for the vast majority of EEG use cases, tasks such as seizure detection often involve long periods of unusually low signal-to-noise ratio. Additionally, the performance of our artifact correction network would likely benefit from introducing more complex components into the architecture. For instance, introducing temporal dependencies via an LSTM component would guarantee that the corrected frame at time 𝑡 influences the frame at time 𝑡 + 1. Finally, our method is in dire need of validation on additional tasks and data-sets. Despite the challenges described above, we believe that our work demonstrates the feasibility of an EEG pre-processing pipeline which, if adopted, could facilitate and expedite the often tedious process of artifact annotation and removal, and could therefore be extremely beneficial for the general EEG research community.

3.2.5 Conclusion and Future Work

The applications of EEG are numerous and diverse, and while this impacts the particularities of which components are classified as part of the signal versus artifacts, data homogeneity is a common concern in this area of research. Building on this data science perspective, in this work we appropriated state-of-the-art data-driven methods to construct an end-to-end unsupervised pipeline for general artifact detection and correction. We introduced two new data-sets and demonstrated that the inter-rater reliability of our artifact detection component against expert annotators is comparable to reported inter-human levels. Furthermore, we demonstrated how applying the complete pipeline to a data-set can improve the performance of a common downstream analysis. The pipeline makes use of a wide range of handcrafted clinically relevant features, and we believe the released Python package will be of use to many in the EEG research community.

3.3 Feature Imitating Networks

This section was published as a manuscript titled "Feature Imitating Networks" in the proceedings of the 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) [143].
3.3.1 Introduction

The successful application of deep learning to new problem domains depends on three conditions: (1) access to large data-sets, (2) access to sufficient computing resources for hyper-parameter optimization, and (3) modest expectations about model interpretability. Deep learning models require large data-sets to learn representations that generalize to future unseen data. Additionally, extensive exploration of the model topological space is often necessary to identify a network architecture with sufficient representational power for a given task. Lastly, despite ample recent work on the interpretability of deep learning models, the community remains without a normative standard for how deep networks should be interpreted; this is problematic for many problem domains (e.g. healthcare) where the importance of interpretability may supersede performance [141, 19].

Contributions: In this section we introduce Feature-Imitating-Networks (FINs): a FIN is a neural network with weights that are initialized to approximate one or more closed-form statistical features. We demonstrate how this property of FINs improves their interpretability while also reducing data and hyper-parameter tuning requirements compared to other networks with similar or greater representational power. More specifically, we demonstrate how, when combined with a careful application of transfer learning and by taking into account expert knowledge, FINs can be used to quickly build and deploy robust, better performing models using fewer training epochs. Our validation of FINs focused on tasks involving biomedical signals; the data-sets in this domain are often smaller, and therefore stand to benefit the most from the introduction of our framework.

Section organization: The remainder of the section is organized as follows. First we review relevant literature regarding transfer learning. The Related Work subsection is followed by the Methodology subsection, where we discuss how to build and design different FINs. The Experiments subsection contains three experiments, including a brief discussion of the data and results for each. Finally, the Discussion subsection examines all the results in aggregate and discusses how our framework might be expanded.

3.3.2 Related Work

Transfer learning is the application of a pre-trained model to tasks it was not originally intended to perform [117]. Transfer learning has enabled researchers to make significant progress on various tasks in machine vision [73], speech [89], and natural language processing [75]. Most applications of transfer learning are within-domain; these involve fine-tuning a pre-trained model for new tasks. For instance, AlexNet [87], VGG [160], and ResNet [68] are computer vision models trained to classify the ImageNet data-set. The features learned by these models (in later layers), and their more fundamental image components (in earlier layers), can be re-purposed to solve other tasks using only a fraction of the training data required by the original models. Models that are trained on large heterogeneous data-sets are good candidates for “transfer”. However, for biomedical signal processing problems there is no sufficiently large data-set to train such a model. Indeed, the largest publicly available biomedical signal archives contain only a few thousand subjects, which is too small by most data standards in other domains [56]. Consequently, transfer learning for biomedical signals is often performed across domains.
For instance, computer vision models such as VGG have been adapted to emotion recognition from speech [89], motor-imagery classification [192], and mental task classification [124], but these cross-domain transfers are not as effective as those performed within-domain. Finally, interpretability is greatly hindered when models are trained on broad data-sets with objective functions that differ from the final application [19]. The performance of transfer learning is proportional to the proximity of the domains across which knowledge is being transferred. Feature imitating networks were designed to address this limitation of current transfer learning paradigms; they provide the power and flexibility of transfer learning without the “Big Data” and heavy computational requirements.

3.3.3 Methodology

A FIN is a neural network with weights that are initialized to approximate one or more closed-form statistical features. In this section, we train FINs that approximate features commonly used in biomedical signal processing: Shannon’s entropy, kurtosis, skewness, fundamental frequency, Mel-frequency cepstral coefficients (MFCC), and regularity [146]. We evaluate the utility of the FINs on three biomedical signal processing experiments, which we describe in subsection 3.3.4 below. The pre-trained FINs, and code to reproduce all experimental results, are available online (https://github.com/sari-saba-sadiya/Feature-Imitating-Networks).

Network construction: For each feature, we used a simple gradient descent optimizer with mean square error (MSE) loss to train a dense network to approximate that feature on synthetic signals. The topological space explored for all the FINs was between 2 and 10 layers, with the number of parameters in the 3−15 million range. All best performing FINs used simple 𝑟𝑒𝑙𝑢 and 𝑡𝑎𝑛ℎ activation functions. See Figure 3.7 for density functions of the errors of the FIN reconstructions.

Input: The input data for the FINs consisted of synthetic signals generated randomly (zero mean, unit variance) and converted to the time-frequency domain using the wavelet transform [89, 192].

Outcome: The outcome data for the FINs consisted of closed-form feature values calculated on the synthetic signals using the SciPy and EEGExtract packages [146].

Transfer: When applying the models to new classification tasks (i.e., subsection 3.3.4), the very last layer was discarded in favor of a randomly initialized softmax layer with dimensions suitable to the task.

Baseline model: The baseline in each experiment was the best performing neural network with similar (or greater) representational power to the corresponding FIN, trained using the same training data and schedule, but with weights that were randomly initialized; in total, one hundred different topologies were explored for each baseline. The baseline with the best average performance on the validation data was retained for comparison against the FINs.

Figure 3.7 Density plots of the errors for the entropy (left) and regularity (right) FIN reconstructions. The feature values were scaled and normalized, making the biggest possible error 1; as can be observed, the FINs faithfully recreate the closed-form equations.

Training: All models (both FIN and baseline) were trained using early stopping and a simple gradient descent optimizer. Non-topological hyper-parameters such as learning rate and momentum had minor effects in comparison and are therefore omitted from further discussion. Training was conducted using CUDA on a Tesla K80 GPU with 25GB of RAM.
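To make the construction concrete, below is a minimal Keras sketch of a kurtosis-imitating FIN trained on synthetic signals. For brevity the sketch regresses on the raw signals and omits the wavelet transform used in the actual pipeline; the layer widths and training schedule are likewise illustrative.

```python
# A minimal sketch of feature imitation: a small dense network is trained to
# reproduce closed-form (excess) kurtosis on random synthetic signals.
import numpy as np
import tensorflow as tf
from scipy.stats import kurtosis

n, length = 100_000, 256
signals = np.random.randn(n, length).astype("float32")  # zero mean, unit variance
targets = kurtosis(signals, axis=1).astype("float32")   # closed-form feature values

fin = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(length,)),
    tf.keras.layers.Dense(512, activation="tanh"),
    tf.keras.layers.Dense(1),                            # imitated feature value
])
fin.compile(optimizer=tf.keras.optimizers.SGD(), loss="mse")
fin.fit(signals, targets, epochs=20, validation_split=0.15)
```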
3.3.4 Experiments

To evaluate our framework we ran three experiments on three different biomedical data-sets and tasks. The first experiment was an electrocardiography (ECG) classification task; our goal was only to demonstrate that our FINs framework can successfully improve performance on small, low-accuracy data-sets. The second experiment was an electroencephalography (EEG) artifact detection task; our goal was to demonstrate the modular nature of FINs and the potential benefits of using FIN ensembles. The third experiment was a drowsiness detection task using EEG; our goal was to compare both the performance and speed of FINs against state-of-the-art transfer-learning techniques under conditions of varying data scarcity.

For all three experiments, the data was regularized and transformed to the time-frequency domain as discussed in the Methodology subsection. The data was partitioned into training, validation, and testing sets. In the first two experiments this was achieved by randomly partitioning the data 15%−85% for testing and training respectively, before repeating the same split on the training data to extract a validation subset. This was repeated for a 50-fold cross-validation. In the third experiment, where subject data is balanced, the partitioning was achieved by iteratively leaving two of the twelve subjects out for validation and testing. To compare training time we used similar instances of nodes with Tesla K80 GPUs and 25GB of RAM; all training times are reported in seconds.

Model          Baseline        SVM              kNN              Fine-tuned FIN
Mean (± std)   .443 (±0.174)   .5233 (±0.016)   .525 (±0.018)    .543 (±.0245)

Table 3.6 Mean and standard deviation of the accuracy for the Experiment I classification task. As demonstrated, the fine-tuned FIN outperforms both randomly initialized neural networks and classical statistical approaches.

3.3.4.1 Experiment I

In this experiment, we explored the potential of FINs for the detection of artifact-ridden ECG signals [116, 59].

Data and preprocessing: We used data made available by the Brno University of Technology ECG Quality Database [116]. ECG segments of variable lengths from 18 subjects were classified by experts into three categories according to signal quality. After standardizing the lengths we ended up with 2544 trials. The preprocessed data will be made available.

Models: We hypothesized that a FIN trained to imitate kurtosis might be useful in the context of this task [198]. The kurtosis FIN was adapted for our classification task by replacing the very last layer with a softmax classification layer. In addition to the baseline neural network, we also compare against several non-deep-learning classification algorithms.

Results: As can be seen in Table 3.6, the kurtosis FIN consistently outperformed the baseline models. Moreover, the standard deviation in the performance of the FINs was an order of magnitude smaller than that of the deep network baseline models. A Levene’s test indeed indicates a statistically significant (𝑝 < .05) difference in variance between the performance of the two methods across the iterations. This highlights the fact that our framework improves model robustness.
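Concretely, the transfer step described in the Methodology amounts to truncating the pre-trained FIN and attaching a task-sized softmax head; a sketch, assuming the Keras FIN from the previous sketch, follows.

```python
# A sketch of the transfer step: drop the FIN's output layer and add a
# randomly initialized softmax layer for the three signal-quality classes.
import tensorflow as tf

base = tf.keras.Model(fin.input, fin.layers[-2].output)  # truncated FIN trunk
clf = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),      # 3 quality classes
])
clf.compile(optimizer=tf.keras.optimizers.SGD(),
            loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# clf.fit(ecg_train, labels_train, ...)  # fine-tune end to end on the ECG task
```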
3.3.4.2 Experiment II

In this experiment, we investigate how different FINs can be used in conjunction to build complex networks suited for EEG artifact detection. Moreover, we demonstrate how theoretical knowledge regarding the features and their relevance to the task is helpful when using the FIN framework.

FIN            Regularity        Baseline          Fundamental Frequency   Entropy+Regularity   Kurtosis+Regularity
Mean (± std)   .6527 (±.1066)    .6825 (±.0591)    .6991 (±.0662)          .7134 (±.0807)       .7142 (±.0587)

FIN            MFCC              Entropy           Kurtosis                Entropy+Kurtosis+Regularity
Mean (± std)   .7167 (±.01411)   .7195 (±.03783)   .7214 (±.02397)         .724 (±.0451)

Table 3.7 Mean and standard deviation of the accuracy for Experiment II. Corrected one-tailed t-tests demonstrated that models imitating features known to be useful for EEG artifact detection (Kurtosis, Entropy, and the Entropy+Kurtosis+Regularity ensemble) significantly outperformed models imitating ill-suited features (Regularity and Fundamental Frequency).

Data and preprocessing: The data used in this experiment is from an EEG artifact detection data-set [146]. The data contains EEG segments from a 1𝑘𝐻𝑧 recording made using 32 electrodes during a passive viewing task. Each segment is one second long and was labeled as artifact-ridden or clean by expert annotators. We re-sampled the data at 500𝐻𝑧 and converted the EEG setup to the international 10−20 system, which contains only 19 electrodes.

Models: We evaluated individual FINs trained to imitate kurtosis, Shannon’s entropy, regularity, fundamental frequency, and MFCCs, as well as ensemble combinations thereof. We expected some of these FINs to outperform others based on the task-relevance of the feature being imitated. For instance, the fundamental frequency of the signal, defined as the lowest periodic frequency of the waveform, should be irrelevant, while kurtosis is highly relevant to the task [37, 77]. Similarly, we expected ‘complexity features’ such as the cepstrum coefficients and Shannon’s entropy to outperform clinically grounded ‘continuity features’ such as the fundamental frequency or EEG regularity (burst suppression) [146]. Following these theoretical considerations, we hypothesized that the MFCC, entropy, and kurtosis FINs, as well as a combination of these FINs, would outperform the regularity FINs. As we had multiple hypotheses, appropriate Bonferroni correction for multiple comparisons was used. To have enough statistical power after the correction we increased the sample size by repeating each experiment 50 times. Each FIN was applied to each electrode signal in parallel; the outputs were then concatenated and passed forward to a binary softmax classification layer. We compared the FINs against the best performing baseline dense neural network.

Results: As summarized in Table 3.7, our experiment demonstrates how (when appropriately selected) FIN ensembles may be used in combination to further enhance task performance. We note here that deliberate consideration when combining FINs can improve task performance while keeping the size of the ensemble small. A corrected one-tailed t-test showed that the kurtosis, entropy, and ensemble networks performed significantly better than the fundamental frequency or regularity FINs.
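The per-electrode ensemble construction described above can be sketched as follows; the trunk names and shapes are illustrative assumptions (19 electrodes, 500 samples per one-second segment after re-sampling), and the trunks stand for pre-trained FINs with their output layers removed.

```python
# A sketch of the Experiment II architecture: each FIN trunk is applied to
# every electrode in parallel, the outputs are concatenated, and a binary
# softmax layer is trained on top.
import tensorflow as tf

n_electrodes, seg_len = 19, 500
inputs = tf.keras.Input(shape=(n_electrodes, seg_len))
per_electrode = []
for trunk in [entropy_trunk, kurtosis_trunk, regularity_trunk]:  # hypothetical names
    # TimeDistributed applies the same FIN to each electrode's signal.
    per_electrode.append(tf.keras.layers.TimeDistributed(trunk)(inputs))
features = tf.keras.layers.Flatten()(tf.keras.layers.Concatenate()(per_electrode))
outputs = tf.keras.layers.Dense(2, activation="softmax")(features)  # artifact vs clean
ensemble = tf.keras.Model(inputs, outputs)
```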
3.3.4.3 Experiment III

In this experiment, we compare the performance of FINs against state-of-the-art approaches on an EEG-based fatigue and drowsiness detection task using a recently published data-set.

Data and preprocessing: We used data made available by [108]. That study identified multiple subsets of electrodes as especially predictive, and we tested each subset separately. The data was then partitioned for six-fold inter-subject cross-validation. In other words, at each iteration ten of the twelve available subjects were used for training, one was used for validation, and the testing accuracy was calculated on the remaining subject. Finally, each cross-validation step was repeated 5 times. When only a fraction of the training data was being used, different trials were picked at each of these iterations.

Models: Prior work indicates that entropy is a useful feature for the prediction of drowsiness and fatigue from EEG data [108]. Thus, we compared a FIN trained to imitate Shannon’s entropy against the baseline models (described in the Methodology), as well as a fine-tuned VGG network pre-trained on the ImageNet data-set [160]. The VGG model was similar to the 19-layer convolutional neural network introduced in [160], sans the last 3 dense layers and with the addition of a final softmax classification layer. This is the standard way in which the VGG model is used in biomedical tasks [124, 192]. Additionally, to test the performance of our pre-trained FIN when only very limited data is available, we ran the same models with varying fractions of the data being made available during training.

Figure 3.8 Experiment III results. (Top): training and validation loss for the baseline and the pre-tuned entropy FIN. (Bottom): difference in accuracy between our pre-trained FIN and the best performing baseline as a function of the percentage of data available.

Results: The pre-tuned Shannon entropy FIN outperformed the baseline and the reinitialized FIN on each of the four subsets and in over 83% of the iterations of the cross-validation. Additionally, as can be seen in Figure 3.8 (top), the pre-tuned FIN had lower loss at every epoch compared to the baseline. It is important to note that the FIN also beat the performance of classical classifiers with different entropy measures reported in the literature [108]. The performance improvements were even greater under limited data availability conditions. As small training data-sets are known to increase performance noise, we also repeated this process 10 times and report the average accuracy; the difference in the average accuracy between our method and the baseline for each data percentage is plotted in Figure 3.8 (bottom). The VGG transfer learning network under-performed both the FINs and the other baseline models and was particularly sensitive to small data sizes. When only a fraction of the data was available, VGG performed at close to chance.

                 Baseline          VGG                FIN
20% of data      .746 (±0.019)     .615 (±0.122)      .771 (±0.016)
  training (s)   37.2 (±8.9)       21911 (±520)       50.8 (±5.7)
40% of data      .895 (±0.04)      .644 (±0.18)       .922 (±0.013)
  training (s)   37.2 (±0.2)       10881 (±1441)      106.0 (±23.3)
60% of data      .939 (±.07)       .69 (±.197)        .962 (±.006)
  training (s)   189.8 (±79.9)     18434 (±1058)      322.4 (±201.2)
80% of data      .983 (±0.0203)    .802 (±.103)       .996 (±.0013)
  training (s)   574.8 (±132.3)    17141 (±1838)      441.6 (±92.2)
100% of data     .993 (±0.022)     .846 (±.088)       .998 (±0.001)
  training (s)   645.7 (±187.1)    19267.3 (±2014)    552.6 (±129.7)

Table 3.8 Experiment III results. The models were trained using varying subsets of the data. For each data fraction, the first row reports the mean accuracy (± std) and the second row the training duration in seconds, measured on a node with a Tesla K80 GPU and 25GB of RAM running CUDA.
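For reference, the VGG baseline described in the Models paragraph can be assembled along these lines; the pooling layer and input size are assumptions about how the time-frequency maps were fed to the network, not details confirmed in the text.

```python
# A sketch of the VGG-19 transfer baseline: ImageNet weights, the top dense
# layers removed (include_top=False), and a softmax classification head added.
import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet",
                                  input_shape=(224, 224, 3))
model = tf.keras.Sequential([
    vgg,
    tf.keras.layers.GlobalAveragePooling2D(),        # assumed pooling choice
    tf.keras.layers.Dense(2, activation="softmax"),  # drowsy vs alert
])
```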
3.3.5 Discussion

The feature imitating networks framework proposed in this section is an innovative way to use transfer learning. Traditional transfer learning requires large, slow-to-train, black-box networks such as VGG and AlexNet, tuned on hundreds of thousands of labeled data points. In contrast, FINs require no human labeling, are small and fast to train, and can be combined to create ensemble FIN networks in accordance with insights from the literature surrounding the task being performed. Our framework therefore facilitates the integration of domain-specific knowledge into modern data-driven machine learning practices. This is also beneficial for alleviating some interpretability concerns: while the final FIN sub-models likely do not reproduce the exact feature they were trained to imitate after the end-to-end tuning, the fact that FINs perform better than networks with the same architecture whose weights were initialized at random suggests that the tuned FINs are most likely computing a slightly modified version of the original statistical feature. Moreover, the fact that FINs that are ill-suited for a task continue to underperform even after fine-tuning (see Experiment II) also suggests that the measures computed by the FINs do not drastically change after fine-tuning. Beyond these considerations, there are several practical benefits to using our framework:

• Robustness: Our experiments indicated that FINs are more robust than other networks and techniques with similar representational power. This is evident in the statistically significant differences in accuracy variation when performing leave-subject-out and cross-validation. Deep learning in general is sensitive to weight initialization randomness and data idiosyncrasies. Transfer learning of weights tuned to calculate task-relevant features seems to guarantee that we start in a ‘neighborhood’ of a good solution. Moreover, FINs expedite the hyper-parameter optimization step, which remains resource-consuming despite recent research [191, 107].

• Performance: Data scarcity still plagues many domains. In the case of biomedical research, data collection is especially costly and can prohibit researchers from applying deep learning to their tasks altogether [147]. Our experiments indicate that FINs are especially useful when only limited data is available. The intuition behind this is straightforward: pre-trained weights already extract useful task-relevant information, resulting in better performance and lower loss from the very first epochs of the training procedure, as can be seen in Figure 3.8.

• Flexibility: By tuning on task-specific data-sets, our framework also outperforms methods that pass the calculated features as input to the classifiers. This is not surprising, as our FINs are allowed to tune the extracted features to better suit the task (for instance by focusing on specific parts of the signal). Additionally, the modular nature of the FINs lends itself to easily building and testing ensemble networks.

• Speed: VGG and AlexNet are powerful networks that have been successfully applied in various domains. However, these architectures are extremely large. The VGG-based network consists of at least 17 layers and contains over 20 million parameters. In contrast, FINs are simple shallow networks consisting of up to 4 layers and a quarter of that number of parameters. The shallowness of the models in particular guarantees that, even when using an ensemble of FINs, gradient descent calculations, and therefore training and inference times, remain simple and fast. This can be observed in the results of the third experiment presented in Table 3.8: training the VGG network was in some cases over 60 times slower than training the FIN network, despite lower performance.

3.3.5.1 Future Directions

Designing the dense implementation of the FINs can be streamlined by considering the closed-form equation of the feature and training layers to imitate each operator separately. For instance, the mathematical expression for Shannon’s entropy requires discretization of the signal to create a histogram before aggregating over the bins. Partitioning the operations allows us to reuse pre-trained operation-specific layers to quickly construct FINs that are then fine-tuned to mimic specific features.
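As a sketch of this direction, Shannon’s entropy can be decomposed into a differentiable “soft histogram” operator followed by the entropy aggregation; the binning range and sharpness below are illustrative assumptions, not the configuration of any released FIN.

```python
# A sketch of operator-wise FIN construction: a reusable soft-histogram stage
# followed by the -sum(p * log p) entropy head. The histogram stage could be
# shared by any feature whose closed form begins with discretization.
import tensorflow as tf

def soft_histogram(x, n_bins=32, sharpness=50.0):
    # Gaussian-bump bin membership approximates the hard binning operator.
    centers = tf.linspace(-3.0, 3.0, n_bins)
    weights = tf.exp(-sharpness * (x[..., None] - centers) ** 2)
    p = tf.reduce_sum(weights, axis=-2)                       # per-bin mass
    return p / tf.reduce_sum(p, axis=-1, keepdims=True)       # normalize to probabilities

def entropy_head(p, eps=1e-8):
    return -tf.reduce_sum(p * tf.math.log(p + eps), axis=-1)  # Shannon's entropy
```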
3.3.6 Conclusion

In recent years, some have critiqued the current state of the machine learning community. These critiques often focus on the disregard of traditional techniques in favor of data-driven approaches [103] and the different ways deep learning has struggled to live up to its promise, especially when it comes to real-world applications [141]. In this section, we presented Feature-Imitating-Networks, a variation on traditional transfer learning that uses networks trained to imitate simple closed-form statistical features, which we believe alleviates these concerns. We demonstrated that our framework is superior in both speed and accuracy to deep and transfer learning techniques with similar (or greater) representational ability, especially when only very limited data is available. The experiments were conducted on a variety of tasks and domains. Future work will extend this initial exploratory work. An extensive library of Feature Imitating Networks benchmarked on many data-sets and other signal processing domains might be of particular use and interest to the research community.

3.4 Conclusion and Future Work

This chapter presented multiple novel methods for unsupervised artifact detection and correction [145, 146], as well as a framework for integrating expert knowledge into deep learning models via weight initialization [143]. All presented methods were validated using unseen-subject (and, when possible, unseen-task) EEG data. The models were designed to streamline EEG preprocessing and decoding, and can help mitigate some of the challenges facing the research community: unsupervised artifact detection eliminates the need for manual data examination by an expert, and artifact correction reduces the number of EEG trials rejected. Finally, the modular nature of our feature imitating networks framework enables researchers to rapidly test different intuitions when decoding EEG data. While the methods achieve state-of-the-art performance on EEG tasks, each could have many potential uses in various other fields of research which share some of the difficulties inherent to working with EEG signals. For instance, the feature imitating network framework has already been utilized in natural language processing, computer vision, and even predicting athletic performance [83]. The models developed in these works are being added to a library of pre-tuned models which we hope will be useful for other researchers interested in building on our framework.

CHAPTER 4
PILOT: DECODING COLOR FROM PASSIVE VIEWING

Before focusing on attentional capacity, a small pilot study was first conducted. The goal was to determine the feasibility of decoding color from EEG signals. The experiment design was a simple passive viewing paradigm that used stimuli similar to those later used in the main experiment. This pilot study also laid important foundations for the work that follows by allowing the identification of electrodes and frequency bands that optimize decoding performance.
Most importantly, the results of this pilot study addressed concerns regarding the classifier decoding capturing verbal labels rather than feature information. Since the completion of this pilot study, a number of papers that decoded color from EEG have been published [66], confirming its main conclusions.

4.1 Pilot Experiment

The pilot was conducted with the eventual main experiment in mind. Therefore, this pilot experiment utilized dot-field stimuli reminiscent of those used in [96].

4.1.1 Methodology

4.1.1.1 Participants

Eight participants were recruited from the Michigan State University student body. The protocol was approved by the MSU institutional review board and written informed consent was obtained from every subject.

4.1.1.2 Stimuli and Apparatus

The experiment was programmed in MATLAB using the MGL extension [51]. This experiment was a passive viewing task involving six different stimuli. Participants were directed to focus on a fixation cross in the middle of the screen. Each trial consisted of a 1000𝑚𝑠 stimulus presentation, followed by a 1000−1500𝑚𝑠 inter-trial interval. The stimulus consisted of random dot fields (240 dots each) in six different colors. The dots were drawn in an annulus (inner radius = 1.5◦, outer radius = 6◦). In total there were 648 trials per experiment per subject (108 trials per condition). We collected additional trials for some subjects but kept the balance between the different trial types. See Figure 4.1 for an example of the stimuli.

Isoluminance procedure: To eliminate potential confounds, each subject adjusted the brightness of every hue separately to achieve isoluminance between all colors. The isoluminance procedure employed heterochromatic flicker photometry [90]. Observers viewed gray and chromatic square tiles (size: 1.8◦ × 1.8◦, luminance: 6.3 cd/𝑚²) arranged in a checkerboard pattern and constrained within an annulus (inner radius = 1.5◦, outer radius = 6◦) centered around a white fixation cross. The gray and chromatic tiles flickered at 8𝐻𝑧 in a counterphase fashion. Subjects were instructed to adjust the brightness of the chromatic tiles to minimize the flicker, resulting in isoluminance between the color and the constant gray. We ran three separate blocks for each of the six colors. The three "isoluminant values" obtained were averaged to get the final luminance value for each color. These values were calculated for each participant and used during the main task in this pilot study.

Figure 4.1 Examples of pilot experiment stimuli. The background color was changed for visibility (RGB value 240, 240, 240).

4.1.1.3 Data Acquisition and Preprocessing

Continuous EEG activity was recorded using the actiCHamp system with BrainVision Recorder software. The participants were fitted with a 64-channel actiCap with active electrodes. The screen refresh rate was set to 120𝐻𝑧 and data sampling was at 1000𝐻𝑧. Additionally, electrooculogram (EOG) activity was recorded from horizontal and vertical electrode pairs, and used to detect and reject eyeblinks and horizontal and vertical eye movements. Electrode impedance was maintained below 50𝑘Ω. The data from the inter-trial interval was discarded. We used EEGLAB and ERPLAB to process the data. First, we resampled the data to 500Hz, removed the AC line noise (using the cleanline plugin), applied a bandpass filter between 1 and 100Hz, and used ICA decomposition to separate and remove components originating from blinks and other artifacts. Finally, waveforms were manually examined by experimenters and noisy epochs were rejected.
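The preprocessing above was performed in EEGLAB/ERPLAB; for readers working in Python, a rough MNE-Python equivalent of the automated steps would look as follows (the file name and excluded ICA components are placeholders).

```python
# A rough MNE-Python sketch of the automated preprocessing steps; manual
# epoch inspection is not shown. "subject.vhdr" and the excluded ICA
# component indices are placeholders.
import mne

raw = mne.io.read_raw_brainvision("subject.vhdr", preload=True)
raw.resample(500)                          # down-sample to 500 Hz
raw.notch_filter(60)                       # AC line noise (cleanline analogue)
raw.filter(l_freq=1.0, h_freq=100.0)       # 1-100 Hz band-pass
ica = mne.preprocessing.ICA(n_components=32, random_state=0)
ica.fit(raw)
ica.exclude = [0, 1]                       # blink/artifact components, chosen by inspection
ica.apply(raw)
```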
4.1.2 EEG Data Analysis

We used the ADAM decoding toolbox’s default LDA algorithm with 10-fold cross-validation and 2000 iterations of cluster-based significance testing [44]. The data was not down-sampled or filtered beyond what was previously described. Different subsets of electrodes were examined. For a detailed description of the LDA decoding algorithm see Section 2.2.2.1. The decoding and significance testing will also be discussed more thoroughly in the Methodology section of the main experiment. The results plotted in this chapter were achieved using the following electrode subset: Pz, P3, P7, O1, Oz, O2, P4, P8, TP10, T8, P1, P5, PO7, PO3, POz, PO4, PO8, P6, P2, CP4, CP2, CP1, CP3.
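The decoding itself was run with the MATLAB ADAM toolbox; the scikit-learn sketch below illustrates the underlying time-resolved LDA procedure with 10-fold cross-validation (the cluster-based significance testing step is omitted, and the array names are hypothetical).

```python
# A sketch of time-resolved LDA decoding: one classifier per time point,
# scored with 10-fold cross-validation. `epochs` is a hypothetical
# (trials, electrodes, timepoints) array restricted to the electrode subset
# above, and `labels` holds the six color conditions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def time_resolved_decoding(epochs, labels):
    scores = []
    for t in range(epochs.shape[2]):
        lda = LinearDiscriminantAnalysis()
        scores.append(cross_val_score(lda, epochs[:, :, t], labels, cv=10).mean())
    return np.array(scores)   # chance level is 1/6 for six colors
```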
4.2 Pilot Results

We were able to successfully decode color from EEG data. Performance peaked when limiting decoding to sub-alpha frequencies (< 10𝐻𝑧). Follow-up decoding after time-frequency decomposition revealed significant decoding clusters only in this sub-alpha band.

Figure 4.2 Right: decoding performance using sub-alpha frequencies and visual-cortex electrodes; horizontal black lines indicate significant decoding clusters. Left: decoding after time-frequency decomposition; significant decoding clusters are highlighted in red.

4.3 Discussion and Conclusion

Having identified the setup necessary to achieve robust decoding of color, our main experiment will use the same setup to explore how attentional load impacts this decoding. For the sake of the following chapters, it is important to emphasize two conclusions of this pilot experiment. First, feature information is available mostly in the sub-alpha frequency band (below 10𝐻𝑧). While this was demonstrated for other features such as orientation and motion direction [7, 8], to the best of our knowledge, this is the first experiment that demonstrates this to be the case for color as well. Note that previous EEG studies that decoded color demonstrate that the color decoding reflects sensory qualities (such as a circular order with obvious color categories) and not, for instance, verbal labeling [66]. With all the above in mind, we assume that decoding color from the sub-alpha frequency band will reflect feature information; we therefore expect differences in attentional modulation of sensory representations to manifest in the classifier decoding accuracy. Finally, having tested different combinations, we conclude that the color feature information is carried mostly by the occipital and parietal electrodes. This is not surprising, considering that similar "visual-cortex" electrode subsets have been used in other EEG papers that decoded feature values [7, 8, 128]. Following these conclusions from the pilot study, the main experiment will use a similar electrode subset, and a 10𝐻𝑧 low-pass filter will be applied before the feature decoding.

CHAPTER 5
THE NUMBER OF ATTENTIONAL TEMPLATES MODULATES SENSORY REPRESENTATIONS

Attention to a specific feature value enhances its sensory representation [25]. This everyday phenomenon is essential for the completion of search tasks and has been thoroughly researched [186, 173]. However, modulation of sensory representations when attending to multiple feature values remains the topic of much debate. Traditionally there have been two opposing views: many researchers argue that there is a hard limit on attentional capacity (only one “attentional template” can be maintained at a time). On the other hand, experiments have demonstrated that multiple working memory items can influence behavior simultaneously. Recent theories reconcile these two views by separating the search task into different processes with different capacity limits [128], or by arguing that task characteristics might influence attentional capacity limits [135]. Either way, the exact mechanism responsible for the difference in performance under varying “attentional load” conditions remains a mystery. Inspired by recent studies, we conducted behavioral and EEG experiments to investigate the capacity of early attentional processes in tasks that demand active guidance by attentional templates (rather than the suppression of irrelevant distractors). Our study utilized a detection, rather than search, paradigm, which enabled us to explore the specifics of how attentional load impacts sensory representations using both behavioral modulations and their neural correlates. Results indicate that maintaining multiple attentional templates increases false-alarm rates while only slightly diminishing hit rates, and are overall incompatible with many versions of multiple-item template theories. Generally, both the signal detection theory and EEG decoding analyses indicate that sensitivity deteriorated when maintaining multiple templates. Finally, analysis of behavioral performance and EEG decoding in target-present trials conforms with some current theories of attentional load [128, 136]. However, considering that cue-condition effects are more pronounced in target-absent trials, further evaluation of current attentional load theory is necessary.

5.1 Introduction

A friend asks for help locating a misplaced notebook. One natural response could be “what is the notebook’s color?”. When color alone is not enough to go by, one might ask for additional characteristics (shape, size, etc.) and conjure up a mental image to attend to and guide the search. This is a real-life example of the single-item search paradigms that have been studied extensively in attention research. Attention to a target feature value enhances its sensory representation, enabling efficient completion of search tasks [148, 25]. This phenomenon of sensory representation modulation has driven the development of many influential theories of attention (such as Treisman’s Feature Integration Theory [173] and Wolfe’s guided search model [186]). The general consensus is that after being cued to attend to a specific feature value, an “attentional template” forms in working memory. This “attentional template” then enhances the saliency of targets matching the represented feature value [186, 38]. However, despite the phenomenon of guidance by a single attentional template being a cornerstone of modern attention research, the debate surrounding the modulation of attention by multiple working memory representations remains unresolved. Moreover, while the capacity of working memory has been studied for decades, research focusing on the capacity for attentional templates - the number of working memory items that can guide attention simultaneously - is mainly limited to the past decade.

Broadly speaking, there have been two conflicting theories. Proponents of the Single-Item-Template (SIT) hypothesis [123, 74, 110] argue that only one attentional template can exist at a time. On the other hand, a growing body of literature seems to favor an opposing Multiple-Item-Template (MIT) hypothesis [72, 28, 9].
Behavioral, eye-tracking, and neuroimaging studies have found evidence in favor of both theories. Moreover, despite utilizing very similar paradigms, researchers repeatedly arrived at conflicting results that withstood scrutiny via large-scale replications [47]. Recently there have been attempts to reconcile this conflicting body of literature. In a recent EEG study, the original authors of the SIT hypothesis suggest that participants can in fact maintain multiple attentional templates simultaneously, but only one can be effectively deployed at a time [128]. The conclusions of this study remain limited by the fact that the authors decoded target location rather than the attended feature value, thus possibly missing modulations in early attention processes. Other attempts to reconcile the literature focus on the fact that studies that support the SIT hypothesis often require active guidance by the working memory representations, while studies that support the MIT hypothesis focus on attention capture by distractors instead [84, 47]. Very recently, researchers explored this possibility by slightly modifying a search task to either require distractor suppression or active guidance [135]. The authors concluded that, while multiple working memory representations can be behind distractor costs, only a single “active control set” that improves search performance can exist at a time.

Finally, diminished performance during guidance by two representations (as opposed to one) has often been interpreted as evidence in support of the SIT hypothesis [128]. However, the exact mechanism behind this diminished performance remains a mystery. Does attending to multiple feature values, rather than a single one, fail to enhance the sensory representations to the same degree? Or are participants more prone to errors when the target feature value is ambiguous? Previous attention capacity studies focused on search - rather than detection - paradigms. And while search paradigms are very common in general attention research, they are ill-suited for shining a light on the exact nature of the performance differences observed between single- vs multiple-cue conditions (see section 5.2.2). In one exception, researchers investigated modulations of the psychometric function under different attentional template load conditions [96]. The authors demonstrated that the target signal in multiple-template trials must be stronger than in single-template trials to achieve the same level of performance (as measured by hit rate minus false-alarm rate). However, the exact way in which the number of attentional templates modulates sensory representations is yet to be thoroughly investigated.

Considering the recent studies discussed above, it becomes imperative to refine and reformulate the original question. First, following [135], we differentiate between distractor interference and active guidance by working memory items. To directly investigate how the number of templates actively guiding attention impacts sensory modulation, we modified experiment 2 from [96]. Additionally, to assess the cueing effect modulation directly, we decode the cued feature value. We hope that by decoding the stimulus feature matching the working memory content - as opposed to stimulus location [128] - we will be able to capture early attention effects. The authors of [128] argued for different capacity limits during the template maintenance and deployment processes.
They observed delayed and suppressed decoding in two-cue-one-target trials compared to single-cue trials, an effect they attributed to a bottleneck in the template maintenance stage. However, there is no guarantee that deployment when multiple templates are available is indeed comparable to single attentional template deployment. Therefore, the conclusion that their decoding results reflect limitations in template maintenance capacity is contestable. In simpler terms, the differences in location decoding observed by the authors of [128] could reflect attentional load effects on later processes and not necessarily a template maintenance bottleneck. In contrast, our paradigm enables us to investigate early differences in perception via modulations in the representations of the attended feature value (instead of a subsidiary target location). Moreover, the detection paradigm we employ enables us to examine how the number of templates impacts the decision making process.

Finally, very few studies had conditions with more than two attentional templates (one notable example is [84]). Attentional guidance might diminish gradually with the number of templates (following a 1/(number of cues) decay rate, as observed in [96]), or there might be a sudden drop when the number of templates exceeds two. To explore this further, the behavioral session of our study includes one-, two-, three-, and no-cue conditions.

5.2 Literature Review

5.2.1 Behavioral Studies

As previously mentioned, many paradigms in attentional capacity research are cued search tasks. More specifically, these paradigms are often some version of a nested Memory-Search task (see Figure 5.1). In this paradigm, subjects are asked to remember a feature value for a memory task while first performing a simple cued search task (for which the remembered feature value is irrelevant). Under some conditions, the search array contains items with feature values matching the remembered item (the memory cue in Figure 5.1). In addition to these “matching distractors” trials, “mismatched distractors” and “no distractor” trials are used to measure baseline performance. Longer response times for trials containing “matching distractors” (in comparison to “mismatching distractors”) indicate that the remembered feature value interferes with the search task. Since the search cue is also guiding attention, this would indicate that multiple items can indeed drive attention simultaneously. Trials in which the subject answered the search task or the memory task incorrectly are usually excluded from the analysis, as this could indicate failure to memorize the cue or find the search target. One variation of the Memory-Search task is the nested Memory-NoCue-Search paradigm. Instead of the search being guided by a cued feature value, the search target is a consistent item across the experiment. The paradigm in Figure 5.1 can be modified into a Memory-NoCue-Search paradigm by removing the “search cue” and asking subjects to always find the triangle in the array. While the difference between the two paradigms might seem minute, this difference has become central to recent debates that we will discuss in later sections [135].

Downing and Dodds were among the first to use a nested Memory-Search paradigm [42]. Trials started with two cues, one for a search task that immediately followed and another for a later memory task.
The researchers found that even when the distractors in the search task were identical to the memory cue there was no increase in response time, indicating that the remembered item did not bias attention. This conclusion was in direct conflict with other contemporary studies [164, 122]. Both of these studies used a Memory-NoCue-Search paradigm. For instance, in [122], participants remembered a color while performing a search task in which the target value was constant throughout the experiment. The search task had different distractor conditions (see Figure 5.1). Analysis showed that “matching distractors” captured attention significantly more than “mismatched distractors”. Additionally, when the memory test was conducted before the search task, the difference between matching and mismatching distractors disappeared. Together, these results indicate, first, that only task-relevant working memory items bias attention, and second, that the behavioral modulation is not simply due to some sensory after-effect, but is driven by a working memory item that is actively maintained (i.e., relevant to a future task).

Figure 5.1 Example of a nested Memory-Search task, stimulus order from left to right. The subjects are directed to remember two feature values (blue and triangle). One of the features is relevant to an immediate search task (the shape) and the other to a later memory task (the color). The search array task is to report the orientation of the line in the target item. The memory task is to report the color matching the memory sample.

The discrepancy between these results was replicated by other studies in this time period. Olivers later reviewed these three papers as well as two others and discussed design similarities and differences [121]. Olivers argued that, other than search task difficulty, the only difference in experimental design that correlates with finding distractor interference from working memory items was a lack of “varied mapping”. In other words, experiments with a constant search target (the Memory-NoCue-Search paradigm) found distractor interference, while experiments with a search cue (the Memory-Search paradigm) did not. Olivers did not elaborate further in [121] on the mechanisms that might be behind the importance of a block-consistent search target. However, this review is an important precedent to the attentional capacity debate. The conclusion suggested that maintaining multiple items (as required by the Memory-Search paradigm) impacts working memory modulation of attention.

The first paper to explicitly frame the discussion in terms of “limits of attentional capacity” was another review from Olivers’ lab [123]. The authors of this work coined multiple definitions that became fundamental to the attentional capacity debate. Most importantly, they established the main components of what is now known as the single-item template (SIT) hypothesis:

• Working memory representations guide attention via “templates”. This statement is widely accepted. Moreover, the concept of an “attentional template” is essential to many theories of visual search that preceded the attentional capacity debate (consider Wolfe’s guided search [186] or the biased competition theory [12]).

• When there is only one item in working memory, such as in [122], this item becomes “active” and will become an attentional template that influences even irrelevant tasks. If the item is not relevant to future tasks it will no longer be active and will not influence attention (again, see [122]).
• When there are two items in working memory, such as in [42] or the other papers discussed in [121], only the representation that is relevant to the immediate task is made active. The irrelevant representation exists in a different, passive working memory state that does not impact the deployment of attention or influence behavior.

Beyond simply suggesting that some ‘accessory’ representations in visual working memory do not drive attention, this review commits to the notion of a “hard limit” on the capacity of attention. Moreover, the authors explicitly argued that only a single memory representation can become a search template, and this representation will then “block” attentional guidance by all other memory representations.

The same authors later went on to make an addition to this SIT hypothesis. Using a Memory-NoCue-Search paradigm, they demonstrated that when the memory task involves multiple items no “attention capture” is observed [110]. Crucially, attentional capture was observed in a condition with a single memory item. In other words, the results indicated that in conditions with more than one memory cue, all relevant to the memory task, there was no reaction time difference between “matched” and “mismatched” distractors (even when distractors matched all the memory cues). This result does not simply follow from the SIT hypothesis: if only one item out of three was active, one would expect attentional capture during one-third of the trials. The effect sizes would only be reduced, not disappear completely. The complete lack of evidence of attentional capture prompted the researchers to conclude that when multiple items are held in working memory they automatically compete with each other, preventing any item from becoming an attentional template and eliminating any attentional capture driven by memory contents. In other words, multiple representations compete to be the active trace in a mutually detrimental fashion.

Despite the success of the SIT hypothesis in making sense of previously conflicting results, subsequent research quickly challenged the core notions of this theory. The authors of [72] showed, using a very similar paradigm to the one used in [110], that even when subjects had to remember multiple colors for a subsequent memory test, there was evidence for attentional capture by matching color distractors during the search task. Moreover, not only was attentional capture observed for either color in two-cue trials, it also appeared to be stronger than attentional capture during one-cue trials. Specifically, there seemed to be a compound effect when distractors matching both cues were present. Despite using a nearly identical paradigm, the results here are in direct contrast to the results of [110]. Furthermore, taking the original SIT hypothesis (before the addition of this competition component) at face value, one would expect reduced memory capture. Since [72] found the exact opposite, a compounded effect, even the original, more conservative version of the SIT hypothesis does not accommodate this result. The discrepancy surprised the research community and large-scale replications of both papers followed [47]. However, both studies were successfully replicated.
The authors of the replication hypothesized that this outcome might be due to a difference in paradigm: [110] had only a single distractor in the search array and the delay period between the cue and the search task was 1250 ms; in contrast, the paradigm in [72] contained multiple distractors and the delay period lasted only 700 ms. The number of distractors is an unlikely culprit, as [72] found an effect even when only a single distractor matched the memory cue. However, the delay period duration could be relevant, as there is evidence that the strength of memory representations diminishes over time [40]. Still, based on the growing body of literature, delay period duration also fails to provide an adequate explanation. Two similar papers, [28] and more recently [84], showed that in Memory-NoCue-Search tasks (virtually identical to the one used in [110]) when two and even three items are maintained in visual working memory, all representations bias search results. This further supports the idea that the number of distractors in an array is not the culprit behind the conflicting results reported by [47]. Moreover, in the three experiments reported in [28] the delay period varied between 300 and 2000 ms, but the same behavioral patterns emerged: response times were longer for memory-cue matching distractors. These results demonstrated that subjects can maintain multiple working memory representations simultaneously and that all of these items in fact guide attention to some extent. Finally, even previous experiments such as [122] that reported attention capture by working memory items had delay periods as long as 3,000 ms. It is unlikely that storing two templates instead of one would degrade the effect to the point that it becomes undetectable within 1250 ms.

Other studies completed by Beck and Hollingworth over the years arrived at similar conclusions [13, 9]. In their most recent paper [9], the authors conducted a series of experiments in which the feature dimensions of the memory and search cues were different (unlike, for instance, in [42]; color and shape, see Figure 5.1 for an example). A response time increase for matching distractors indicated robust attention capture during the search task by the memory cue in the task-irrelevant feature dimension. These results prompted the authors to suggest the Multiple-Item-Template (MIT) hypothesis, which states that the limit of attentional capacity is flexible and that at the very least two attentional templates can be active simultaneously.

Finally, instead of focusing on single-item search, the authors of [86] used a foraging task. Subjects had to tap target-color circles on an iPad while avoiding distractor-color circles. This foraging paradigm is designed to mimic animal foraging settings. The SIT hypothesis would predict longer reaction times due to template switching costs when the number of target colors increases from one to two, but no such increase should occur for further increases. The pattern of results in [86] shows an almost linear reaction time increase as the number of target colors increased, despite the total number of targets remaining the same and regardless of the number of distractors. The authors of [86] argue that this indicates that some evidence taken to support the SIT hypothesis might in fact just be the result of increased cognitive load.
It is important to note that one can argue that an increase in reaction times between two and three, as well as three and four, target types does in fact support the SIT hypothesis, as long as the subjects focus on each target type separately before moving to the next. However, this behavior of focusing on different target types sequentially was not observed in previous foraging studies conducted by the authors.

Finally, a very recent study by the authors of the MIT hypothesis used a paradigm in which participants would benefit from using multiple templates [10]. Unlike in previous work, here a performance increase (rather than a decrease due to distractor interference) would support the MIT hypothesis. The task was a search paradigm with two search cues (color and shape). Targets could match either one or both cues. Response time was faster when the target matched both cues (the target had both the cued shape and the cued color) as opposed to only one (shape or color). Moreover, the response time distribution in two-cue-match-target trials violated the "race model inequality", which indicates "coactive" guidance by both cues' attentional templates, as opposed to parallel-but-separate or sequential activation of two mechanisms [111]. Of course, proponents of the SIT hypothesis might argue that this pattern of results would be possible even if participants were always using only a single template: if participants attended to the feature value that does not match the target on half of the trials, their response times would increase. Since the attended cue always matches the target when the target has both the cued shape and color, this condition will always have faster response times. To reiterate, the race model inequality uses the sum of the probabilities observed in the single-cue-match-target conditions to calculate an upper bound on the response-time probabilities that should be observed in the two-cue-match-target condition if no coactive mechanisms are involved: for every time t, P(RT ≤ t | both cues match) ≤ P(RT ≤ t | cue A matches) + P(RT ≤ t | cue B matches). However, if half the single-cue-match-target trials are "corrupted" by a switching cost, the calculated upper bound might not be accurate.

At this point, one might be under the impression that an overwhelming number of studies draw conclusions that are in stark contrast to the core components of the SIT hypothesis. However, there are a few things to consider. As the authors of [84] point out, in the paradigms used in these studies no cue is 'actively' guiding visual attention during the search. This raises the question of whether one can indeed maintain multiple attentional templates, or whether, in the absence of such a template, all contents of working memory (which Olivers et al. referred to as accessory memory items) are able to drive attention. Interestingly, the authors of [84] seem to believe the latter: the 'capacity to actively maintain irrelevant items during visual search is higher than our capacity to actively maintain relevant items for a visual search.' In other words, while, as noted by [47], there are still contradictory results even for very similar paradigms, there might be a way for a relaxed or modified version of the SIT hypothesis to explain these results as well. This idea was expanded in a few studies we will discuss in the following section.

5.2.2 Challenges for Nested Search-Memory Paradigms

As the reader has no doubt gathered, nested Memory-Search paradigms (with and without a search cue) seem to dominate attentional capacity research. However, these paradigms have several clear limitations.
Firstly, many influential theories of attention, as well as recent "attentional capacity theories", postulate the existence of a fast parallel stage followed by a separate slow stage in which potential target locations are sequentially evaluated [173, 186, 128]. As previously discussed, attentional load could affect both stages, making it difficult to attribute target-location decoding or behavioral differences specifically to template maintenance. In other words, search paradigms do not provide direct evidence of early differences in sensory modulation under different attentional capacity conditions. The issue is further exacerbated when participants are required to complete an additional task (such as reporting the orientation of a bar or letter [122, 121, 28, 9, 84]; also see Figure 5.1), as previous research demonstrates that working memory load can impact processing speed even in unrelated tasks [65].

Secondly, while it is well known that attentional templates improve performance by enhancing the sensory representation of the attended feature value [25], the exact impact of load on how attention modulates sensory representations remains missing when focusing only on search-based paradigms. Analysis of search-based paradigms focuses on reaction time and accuracy. In contrast, detection tasks enable, for instance, detailed analysis of the decision-making process via signal detection theory models.

One interesting example of such modeling can be observed in [74]. Participants in this study were cued to look for one or more items in a subsequent rapid serial visual presentation (an RSVP paradigm). The authors modeled the ROC curves expected under the SIT and MIT hypotheses and estimated the number of attentional templates for each subject. Behavioral results were most consistent with having only a single attentional template. Furthermore, these results persisted when participants attended to objects instead of colors, as well as to a mix of these two cue types, supporting the theory that working memory representations might be object-based [46]. One possible explanation for this surprising result might be found in individual differences. As noted by the authors, almost half the subjects had faster reaction times in the cue (in comparison to the non-cue) condition despite the cue being irrelevant to the search task [46]. This result cannot be explained by either the single- or multiple-item-template hypotheses, and it might indicate that attentional capacity modulation simply has a small effect size that could be masked in nested search-memory paradigms by noise or individual differences.

Another relevant signal detection experiment can be found in Liu et al. (2017) [96]. The authors attempted to test whether multiple attentional templates are maintained when their guidance directly benefits the task, as opposed to negatively impacting behavior via distractor interference. The task was to indicate whether one of the six colors in a cloud of colored dots was more coherent (had more dots) than the others. In some conditions, cues indicated which colors might be coherent. For instance, when green and red were cued, the subjects knew that if there was a coherent color it would be one of those two. There were both one-cue and two-cue conditions. If multiple templates can indeed be maintained, then the increase in performance should be similar for both conditions. However, the analysis indicated that only one of the two cues guided attention.
This study builds on the results of similar earlier work by the authors, which showed decreased performance when attending to two motion directions relative to only one [97]. However, since the lower accuracy could have been due to splitting resources between two templates instead of attending only to one, [97] did not fully commit to interpretations supporting the SIT hypothesis. Crucially, [96] found that the two-cue condition was equivalent to a condition with a 50% valid single cue, suggesting that participants successfully attended to only a single cue. The authors therefore argue that their results (and possibly also those of [97]) support a strong limit on attentional capacity.

Finally, recent research uncovered another disadvantage particular to the Memory-NoCue-Search paradigms common in the attentional capacity literature. Throughout the literature, two types of results were interpreted as supporting the MIT hypothesis: interference from multiple representations, and performance enhancement when multiple attentional templates are maintained. However, it is possible that this conflates two separate phenomena. Indeed, experiments in which guidance by working memory items would enhance performance (relative to a no-cue baseline) seem more supportive of harder attentional capacity limits [96, 74], in comparison to experiments that test for distractor interference [72, 28, 84]. This pattern of results was systematically tested in a recent study [135]. The authors ran experiments using both the classic Memory-NoCue-Search paradigm and a slight variation that required participants to "actively search" for items matching working memory content. Their analysis demonstrated a significant distractor effect in the classic-paradigm experiments, even for distractors that did not match WM content (not to mention multiple WM representations). However, analysis of results from the experiments with the varied paradigm indicated that participants were able to adopt a WM "active state" for only a single item. The authors conclude that while (possibly multiple) working memory representations can bias distractor costs, an "active control set" that improves search performance is created for only a single target stimulus, and only when participants are actively using a WM representation.

Considering the above, the next few sections will focus on eye-tracking and neuroimaging papers, as nested memory-search paradigms are significantly less common in these modalities. Finally, we will discuss the challenges inherent to the nested search-memory task in the context of the detection paradigm used in our experiment.

5.2.3 Eye-Tracking Studies

Shortly after proposing the SIT hypothesis, the authors of [123] presented eye-tracking results supporting their theory. A straightforward conclusion of SIT is that when the search criterion is not defined by a single feature value (for instance, when targets can be in either of two different colors), participants first employ one attentional template and eventually switch to a second template. Following this logic, the SIT hypothesis predicts some evidence of a switching cost. To explore this matter, [40] introduced a novel paradigm: a search array with two targets (at the left and right sides of the visual field) was used. The color of the two targets was either the same (one-color condition) or different (two-color condition). In the two-color condition, targets on each particular side had a consistent color (for instance, the left target is always red).
Additionally, distractors could take the color of the target on the opposite side of the visual field (for instance, if the left target is always red and the right target green, then distractors on the left could be green). Analysis of subject eye movements showed that saccades to the second target were slower and less accurate in two-color trials. Moreover, initial saccades often landed on distractors matching the other target color, despite that color never being task-relevant in that half of the visual field (throughout the entire experiment). Time course analysis revealed that the proportion of saccades ending on the distractors trended down after 250-300 ms. The authors concluded that a template-switching cost does exist and is responsible for this pattern of results. Specifically, they concluded that fully switching attentional templates requires at least 250 ms, and that until then attentional capture by distractors matching the feature value of the initial template can occur.

It is interesting to note that eye-tracking studies have previously demonstrated that multiple items in long-term memory can guide search simultaneously [169]. The authors of the MIT hypothesis modified the paradigm used by this earlier study [14]. A search array containing 32 circles in 4 different colors was presented; two target colors were cued before the search array appeared, and the participants had to fixate on all target-color circles as quickly as they could. The subjects were instructed to search for targets either sequentially (moving on to targets matching the second cue only after finding all targets matching the first) or simultaneously. Researchers coded segments of sequential fixations on targets of the same color. The switching cost was analyzed by comparing the duration of fixations within these segments to the fixations at the beginning and end of the segments. Analysis revealed that there was a significant switching cost when subjects were instructed to search for targets sequentially, but no such cost was observed in the simultaneous search condition. There was no significant difference in mean fixation time between the sequential and simultaneous search conditions, indicating that higher rates of switching did not incur any additional costs. Hence, it is unlikely that only a single, constantly switched template is maintained. The authors concluded that there is likely no single-item bottleneck between working memory and attention, and that the failure of previous studies to observe multiple-item guidance is simply due to participants approaching search tasks serially instead of looking for multiple target cues simultaneously.

Recent technical developments in the field of eye-tracking enable eye-movement measurements beyond duration and location. New "gaze-contingent" studies enable researchers to tailor the experiment trial by trial to test hypothesized behaviors. These experiments usually require participants to fixate on one of multiple items in a search display, and subsequent trials are generated in real time based on previous participant behavior. An interesting application of this technology is a more recent study conducted by Beck et al. [13]. In this paradigm, two target colors were consistent throughout the block. Each trial began with two circles, at least one of which was in a cued color; the participant had to fixate on a cued-color circle to continue.
Every subsequent trial belonged to one of three types: 1) trials that forced the participant to pick the same cue color multiple times by presenting only one cue color among non-cue-color targets; 2) trials that forced the participant to switch by not presenting the cue color fixated on in the previous trial; and 3) trials that presented both cued colors and let the participant choose. Analysis of participant behavior showed that, when possible, participants switched away from the color of the previous trial an average of 46% of the time. Bahle et al. argued that if only a single attentional template were active at a time, the template matching the color used in the preceding trial would drive subsequent selection even when targets matching the other template are available. This would result in very few switches between colors. The authors argue that the high switch rate implies that attentional templates matching both cues simultaneously drive attention.

In response, the authors of the SIT hypothesis conducted a similar gaze-contingent experiment [126]. The only substantive difference was that this experiment contained multiple distractors instead of only two circles on each trial; as before, there were three possibilities when generating each new trial (only a match to cue color A, only a match to cue color B, or targets matching both colors). The authors again found a high (37%) switching rate when both cues were presented. However, further analysis showed longer pre-eye-movement fixation periods in trials that forced participants to switch away from the most recently fixated cue color. The authors argue that this reflects a higher "switching cost" when subjects were forced to change the state of the working memory items. According to the MIT hypothesis there should be no bias toward either of the two cues, in contradiction with these results. The authors argue that the lack of a switching cost when both targets are available, together with the high switching rate, is the consequence of participants spontaneously switching targets between trials.

While the contradictory results above might seem confusing, there is room to critique both of these papers. The authors of [126] correctly point out that template switching can occur between trials, so a high switching rate does not unequivocally support the MIT view. On the other hand, the switching cost at the heart of their argument in favor of the SIT hypothesis could be driven by selection history rather than by a bias towards the most recently active template. Recently selected items tend to be favored in subsequent trials, even in non-search experiments and when targets are fixed across blocks [188]. This selection history effect was not controlled for in any of the aforementioned experiments, and there is seemingly no way to differentiate between this effect and attentional template guidance in these paradigms.

5.2.4 Neuroimaging Studies

In their original paper presenting the SIT hypothesis [123], the authors postulated different neuronal mechanisms that could account for their theory. Mainly, they focused on previous rhesus monkey single-cell recording studies by Warden and Miller [183, 182]. The results indicated that working memory item representations change when a newer item is introduced. Olivers et al. argued that this change to an "orthogonal representation" might reflect an "active" item undergoing a transition into a "passive" working memory state. Subsequent research found further support for multiple working memory states.
For instance, Lewis-Peacock et al. [91] found that the identity of a single task-relevant item could be successfully decoded from fMRI activity. Other items, however, were initially decodable but became "deactivated" after a newer, immediately task-relevant item was introduced. Crucially, these same items became reactivated and decodable once they were task-relevant again. Most recently, Olivers' lab published an EEG study demonstrating that it is possible to decode the status (active vs. passive) of the contents of working memory [179]. The authors showed subjects a cue followed by two search tasks. The cue was relevant for the first task in half the trials and for the second task in the other half, and participants knew the type of trial in advance. This paradigm encouraged the participants to switch the status of the working memory contents (the cued feature value) from passive during the irrelevant search task to active (and vice versa). Analysis showed that the status of the working memory contents could be decoded from a burst of power in the delta band (2-4 Hz) and from longer-lasting non-lateralized alpha (8-14 Hz) power. The delta decoding was brief, and the authors concluded that it reflects the transition processes involved in changing the status of WM contents.

However, this experiment is not without its issues. Participants knew which of the two tasks was upcoming, making it possible that the decoded working memory status is in fact related to processes involved in task-specific preparations. To put it simply, the tasks were to search for a cued memory item or for a duplicate color. Since the subjects knew which task was coming up, decoding of the delay period before the presentation of the search array might be "corrupted" by task-specific preparations unrelated to the status of working memory contents. More specifically, the irrelevant search task always involved looking for a duplicate color, instead of a specific target, and did not involve any attentional template. The delta band activity could therefore relate to the existence of any attentional template (even one from long-term memory), or even to a preparation specific to duplicate-search, rather than to the status of the working memory contents.

Notwithstanding, while these papers make a strong case for the existence of different working memory states, as pointed out by many researchers [86, 47], this by itself does not necessarily imply that only a single attentional template is active at a time. (The authors of [86, 47] also argued that a single-item attentional bottleneck would probably require a centralized, visual-working-memory-specific neural mechanism, which is unlikely considering the apparently distributed nature of working memory; however, a discussion regarding the nature of WM representations is somewhat beyond our scope.)

To measure attentional capacity more directly, several EEG studies have been conducted. These studies often borrowed from previous EEG research. Two specific EEG components that were repeatedly used to explore attentional capacity are the Contralateral Delay Activity (CDA) and the N2pc. The CDA component is understood to reflect the number of items maintained in visual working memory [101], while the N2pc component is a transient contralateral signal that tracks the spatial deployment of attention in the visual field. The N2pc is particularly useful for measuring attentional capture by distractors (as long as the distractor and target appear on opposite sides of the visual field): attentional capture by distractors produces reliable N2pc components that do not appear in distractor-free trials. Both the CDA and N2pc are examples of Event-Related Potentials (ERPs) obtained from EEG recordings. Both components are measured by subtracting ipsilateral from contralateral activity at a lateral electrode (usually PO8 or PO7) with respect to the cue location. For instance, N2pc contralateral activity is recorded from either 1) right electrodes (such as PO8) on trials with distractors on the left side of the visual field, or 2) left electrodes on trials with distractors that appear on the right. One crucial difference between the two components is that, while the N2pc peaks around 200 ms after stimulus onset, the CDA is sustained during delay periods in working memory tasks.
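To make this contralateral-minus-ipsilateral computation concrete, the following is a minimal sketch of how such a lateralized difference wave might be derived from epoched data; the array shapes, channel indices, and side labels are illustrative assumptions, not part of the pipeline used later in this chapter:

```python
import numpy as np

def lateralized_difference(epochs, side, left_ch, right_ch):
    """Average contralateral-minus-ipsilateral difference wave (e.g., N2pc, CDA).

    epochs   : array of shape (n_trials, n_channels, n_times)
    side     : array of 'L'/'R' labels giving the hemifield of the critical item
    left_ch  : index of a left posterior electrode (e.g., PO7)
    right_ch : index of its right homologue (e.g., PO8)
    """
    is_left = np.asarray(side) == 'L'
    # Contralateral = electrode on the hemisphere opposite the item;
    # ipsilateral   = electrode on the same side as the item.
    contra = np.where(is_left[:, None], epochs[:, right_ch, :], epochs[:, left_ch, :])
    ipsi = np.where(is_left[:, None], epochs[:, left_ch, :], epochs[:, right_ch, :])
    # Averaging across trials cancels activity common to both hemispheres,
    # isolating the lateralized component (transient for N2pc, sustained for CDA).
    return (contra - ipsi).mean(axis=0)
```

Because non-lateralized activity appears equally at both electrodes, the subtraction isolates only activity tied to the side of the item, which is what makes these components such clean markers of spatial attention and lateralized storage.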
One example of the use of ERPs to investigate attention is a study by Grubert et al.: CDA components in trials with a changing cue were significantly larger than in trials with a fixed cue [62]. This was expected, as processing fixed cues can be delegated to non-working-memory processes such as long-term memory [24]. More interestingly, CDA components were larger in two-cue (in comparison to one-cue) trials, and larger still in three-cue trials. The authors argued that this indicates that all cues are represented simultaneously in working memory. However, N2pc components were attenuated in the multiple-cue conditions, becoming smaller and delayed, indicating impairments in the deployment of spatial attention and that multiple cue representations were less effective in guiding search. Further analysis revealed a significant correlation between behavioral measures such as response time and the characteristics of the N2pc, but not the CDA, component. The authors conclude that decreased performance in multiple-cue trials (as measured by longer RTs) is driven by worse attentional guidance and deployment, and not by the cognitive load of having to maintain multiple templates. The authors of [62] concluded that these findings support the SIT hypothesis.

Perhaps inspired by the notion of separating deployment from maintenance, Olivers' lab conducted their own neuroimaging study [128]. The authors decoded the location of the target from EEG activity during a guided search paradigm with one-cue one-target, two-cue one-target, and two-cue two-target conditions. The authors found that classification accuracy (as well as behavioral measures) was similar for trials from the first two conditions (though slightly worse for the second condition) but significantly worse in two-cue two-target trials. They argued that this demonstrates that deploying two templates, as opposed to preparing them, is the true bottleneck. This conclusion would explain the difference between the CDA and N2pc ERP modulations in multiple-cue conditions observed by Grubert et al. [62]. This research offers a possible resolution of the SIT vs. MIT debate by postulating that multiple item templates can co-exist in a mutually suppressive manner: when a stimulus that matches one of the two templates is presented, the matching template strengthens at the cost of the non-matching one. However, when stimuli match both attentional templates (cues) there is strong mutual suppression, 'resulting in a substantially weakened and delayed selection'.
In other words, multiple templates can be maintained but not selected simultaneously. This theory has similarities to popular models of attention. For instance, Treisman's Feature Integration Theory [173] and Wolfe's original guided search model [186] both have a quick parallel stage followed by a slow serial process with a strong bottleneck. However, these theories are all based on single-item search paradigms. Moreover, in more recent iterations of the guided search model, Wolfe proposed two completely separate attentional template mechanisms [187]. This conceptualization holds that, while guiding templates from working memory items are used to select objects based on features, items in an "activated long-term memory state" might capture attention nonetheless.

While Olivers' new theory can account for some of the previously mentioned results, it fails to do so fully. For instance, this theory would predict that in [96], which had conditions similar only to the first two in this work, no strong difference between conditions should have been found. Moreover, as Ort et al. decoded only the target location during the search task [128], there is no evidence that the two templates were actually maintained in an active state before stimulus presentation. In fact, the small difference between the first two conditions could be accounted for by the original SIT hypothesis as simply reflecting a template becoming active in working memory (a switching cost).

Recently, an EEG study attempted to decode attentional templates directly [184]. Participants were cued to suppress a specific color during an upcoming search array. Maintaining such a negative attentional template is beneficial for search performance. The authors had fixed-cue and varied-cue blocks. Decoding of the to-be-suppressed color was sustained in the fixed-cue condition, and decoding strength correlated with search performance. However, in the varied-cue condition, decoding was possible only at the onset of the delay period and was negatively correlated with performance. The authors argued that this indicates that negative templates do not form under the varied-cue condition. Another possible interpretation is that attentional templates during delay periods are simply difficult to decode. This is a clear hurdle for anyone hoping to verify the simultaneous maintenance of multiple attentional templates [128], as decoding seems difficult even in single-template trials.

One other recent EEG experiment found similar evidence of multiple coactive attentional templates [63]. In this work, the authors alternated the search task cue in an ABAB fashion across trials (red or green). Before the search array was presented, seven task-irrelevant arrays with distractors matching the cue colors appeared. Knowing that one of the cues was irrelevant to the upcoming task, the researchers expected evidence of attentional suppression. Specifically, if only the task-relevant color template is active, then distractors matching the other color should not capture attention. However, analysis showed that distractors of both colors captured attention (evoked a significant N2pc component) in all but the last of the seven arrays. The N2pc activity evoked by irrelevant-color distractors was heavily suppressed in the array immediately preceding the search task. The authors argued that this indicates that two attentional templates were maintained until shortly before the search task began, at which point top-down processes suppressed the task-irrelevant template.
Other relevant studies attempted to explore working memory representations more directly using steady-state visually evoked potentials (SSVEPs). When a stimulus flickers at a specific frequency, activity in early-visual-area neurons representing the stimulus oscillates at that frequency. Moreover, attending to a stimulus has been shown to increase the amplitude of the respective SSVEP oscillations. By having subjects attend to objects in two colors flickering at different frequencies, researchers have been able to confirm that, at the very least, attending to two colors increases the corresponding SSVEPs for both colors simultaneously [5, 105]. While this seemingly supports a multiple-attentional-template theory, as noted in [127], a direct comparison between SSVEP modulation in one-cue and two-cue conditions is needed before any strong conclusions are drawn.

5.2.5 Conclusion

As the reader might have gathered, despite attentional capacity research being mostly limited to the last decade, there is already a substantial body of literature. Before delving into the details of our experiment, it is worth highlighting some key takeaways that can help contextualize subsequent chapters. First and foremost, as evident from this literature review, and as experimentally established in [135], there seems to be a fundamental difference between distractor interference and guidance by working memory templates. The overwhelming majority of experiments supporting the MIT hypothesis did not require participants to actively search for multiple cues, focusing instead on interference by distractors matching working memory contents. Secondly, as discussed, search tasks provide only indirect evidence of load during template maintenance, as confounds such as working memory load effects on processing speed [65] and attentional-deployment-related bottlenecks [128] could impact results. Previous neuroimaging research theorized that modulation by attentional templates occurs shortly before and early during stimulus onset [128, 63]. With this in mind, we will try to decode attentional modulations in early sensory representations.

To summarize, given the literature, we decided to use a detection paradigm that has no distractors and that encourages active guidance by attentional templates. Moreover, we decode sensory modulations during stimulus presentation, which are a direct effect of attention, enabling direct observation of attentional load effects no matter how early or transient. Finally, to understand exactly how attentional load impacts the decision-making process, we separately evaluated modulations in hit rate, false-alarm rate, and signal detection theory measures.

5.3 Main Experiment

5.3.1 Motivation

This chapter presents our main experiment. Motivated by our readings of the literature, our goal was to decode attentional modulation of sensory representations directly and to analyze how - if at all - attentional load impacts these modulations. To our knowledge, all other neuroimaging work on attentional capacity has focused on other, less direct, measures of attention [128, 63]. Following previous neuroimaging work, we expect differences in these modulations to be most pronounced at stimulus onset [128, 63]. These differences are not confounded by working memory load effects on processing speed [65], unlike search-task accuracy and response time measurements [122, 121, 28, 84, 9].
Moreover, these differences can only be explained by template maintenance capacity limits, rather than by other bottlenecks in the later stages of perception [128]. In simple terms, modulation of sensory representations by attention is immediate and begins at stimulus perception, and early differences in these modulations can only be explained by attentional load. Another conclusion from our literature review was that guidance by attentional templates and interference by distractors matching working memory contents are two separate phenomena [135]. With the above in mind, we elected to use a detection paradigm with no distractors during stimulus presentation.

Using this paradigm has multiple additional benefits in bridging a few gaps that became apparent during the literature review. Firstly, in previous EEG experiments that used nested memory-search paradigms, the target was always present, and analysis focused on correct trials only [128]. This left a gap in the literature; for instance, the attentional load theory proposed in [128] does not necessarily predict any difference between attentional load conditions when the target is absent. However, previous results are consistent with an increase in false alarms [97, 96]. Since target-absent trials are impossible in the ubiquitous nested memory-search paradigm, these consequences of increased attentional load are yet to be thoroughly explored. Secondly, behavioral results of the nested memory-search paradigms can only be quantified in terms of accuracy and response time. Only a few researchers have employed a detection paradigm [74, 97, 96], and to the best of our knowledge only one - behavioral - experiment focused on signal detection modeling [74]. Diversity of analytical methods and paradigms is essential for a comprehensive understanding of the underlying mechanisms responsible for the effects of different attentional load conditions. Our behavioral results could therefore be of particular interest, as they provide a different perspective from the vast majority of attentional capacity work published in the last decade, especially if neural correlates of behavioral differences are observed in target-absent trials.

5.3.2 The Experiment

We employed a modified version of experiment 2 from [96]. Our two-alternative forced-choice experiment presented participants with an array of dots in different colors. One of the colors was over-represented in half of the trials, while in the other half the dots were equally distributed across five colors (for a total of six colors per trial). Participants were required to distinguish between trials with an over-represented color (target-present) and trials with an equal distribution of colors (target-absent). The experiment had no-cue, one-cue, and two-cue conditions. The over-represented color in all one-cue and two-cue target-present trials was always one of the cued colors.

The [96] experiment varied the color coherence (the amount of over-representation) between trials. Performance at different coherence levels was then used to fit a psychometric function, and the analysis used the parameters of the best-fit function to study the impact of WM on attention guidance. EEG decoding is sensitive to such stimulus variability. Therefore, we separately calculated a threshold for each color per subject to be used in the actual experiment.
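To make this threshold estimation concrete, below is a minimal sketch of the adaptive maximum-likelihood logic behind such a procedure (detailed in Section 5.3.4.1), pairing a Weibull psychometric function with a Best-PEST-style update; the slope, lapse rate, and candidate-threshold grid are placeholder assumptions rather than the values used in our implementation:

```python
import numpy as np

GAMMA, LAPSE, BETA, TARGET = 0.5, 0.02, 3.5, 0.82  # chance, lapse, slope, target accuracy (assumed)

def weibull(x, alpha):
    """Weibull psychometric function: P(correct) at coherence x, threshold alpha."""
    return GAMMA + (1 - GAMMA - LAPSE) * (1 - np.exp(-(x / alpha) ** BETA))

def best_pest_next(levels, correct, alphas=np.linspace(0.05, 0.95, 91)):
    """One adaptive step: fit the threshold by maximum likelihood over all
    trials so far, then return the coherence predicted to yield TARGET accuracy."""
    levels = np.asarray(levels, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    p = weibull(levels[None, :], alphas[:, None])           # one candidate curve per row
    loglik = np.where(correct, np.log(p), np.log(1 - p)).sum(axis=1)
    alpha_hat = alphas[np.argmax(loglik)]                   # ML threshold estimate
    q = (TARGET - GAMMA) / (1 - GAMMA - LAPSE)              # invert the Weibull at TARGET
    return alpha_hat * (-np.log(1 - q)) ** (1 / BETA)
```

In use, each thresholding trial appends its coherence level and response correctness to the history, and the next trial is presented at the returned level; once the staircase converges, the final estimate defines the per-color coherence used in the main experiment.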
The target was present in 75% of the trials; this was done to ensure we had enough EEG data to decode and compare the signal in one-cue and two-cue target-present trials, which are the most analogous to the types of trials analyzed in previous works [128]. The locations of the cues were randomized, but the cues were always 180° and 120° apart in the two-cue and three-cue trials respectively. The six colors were randomly selected from a pool of seven (red, green, blue, yellow, purple, orange, and cyan) in each trial.

Our experiment is divided into a behavioral and an EEG session. The behavioral session has no-cue, one-cue, two-cue, and three-cue conditions. Following [96], we expect the Hit-FA rate to drop when less information is available to the participants and more uncertainty is present. In other words, we expect the Hit-FA to be highest in the one-cue condition and to drop consistently across the two-cue, three-cue, and no-cue conditions. This can be explained by a drop in the hit rate, an increase in the false-alarm rate, or a combination of the two. With this in mind, we test the hypothesis that the Hit-FA monotonically decreases across conditions as less information becomes available (the order being one-cue, two-cue, three-cue, and no-cue). Additionally, we hypothesize that the hit rate will similarly decrease, while the false-alarm rate will follow the opposite trend and monotonically increase. Significant differences between memory load conditions would eliminate any possibility of a strong version of the multiple-item-template hypothesis. Moreover, a complete failure in three-cue trials could indicate that, even if two attentional templates can guide attention simultaneously, there is still a "hard limit". Furthermore, signal detection theory analysis can be used to explore how attentional load impacts the decision-making process. Differences in sensory modulation are likely to manifest as differences in the discriminability index (the d-prime) rather than, for instance, the criterion. Decreased performance when less information is available would correspond to decreased discriminability; hence we hypothesize that, similarly to the Hit-FA, the d-prime will also monotonically decrease across conditions.

EEG sessions focused on the one-cue and two-cue conditions. Our paradigm is designed so that trials of the same type are identical during the stimulus period, regardless of block condition. In other words, a one-cue and a two-cue target-present green trial will be identical after cue offset. Therefore, differences in EEG decoding during stimulus presentation must be the direct result of differing attentional loads. Similarly to previous research, we expect to find a difference in EEG decoding early in the stimulus presentation period [128, 63]. More specifically, since we directly decode the sensory representations modulated by attention, we should be able to observe any differences between the attentional load conditions no matter how transient or early in the perception process. We hypothesize that attention in the single-template condition will result in stronger sensory modulations, and thus better decoding, compared to the two-cue condition. This difference in modulation can be considered the neural correlate of a decrease in the discriminability index between the one-cue and two-cue conditions. We also decode false-alarm trials to confirm that false alarms are driven by attentional capture by one of the cued colors, as opposed to a general performance decrease due to higher attentional load.
Generally, any degradation of attentional modulation between different memory load conditions could be interpreted as a cost of maintaining multiple templates. Finally, we also completed an exploratory analysis of the generalization of different classifiers to preparatory period decoding.

5.3.3 Methodology

5.3.3.1 Participants

We surveyed the literature and examined the number of participants used in similar EEG decoding studies. The sample sizes of the most relevant EEG decoding papers ranged from 16 to 34 [8, 128, 119, 184]. Following [128], we collected data until we had 24 participants in total. In addition to the 24 subjects whose results were included in this study, 2 were rejected due to noisy EEG, 3 were rejected due to low performance during the behavioral session (accuracy was at threshold performance in all conditions), and 1 was rejected after their thresholding session failed to converge. The participants were recruited from the Michigan State University student body. The protocol was approved by the MSU institutional review board, and written informed consent was obtained from every subject.

5.3.4 Stimuli and Apparatus

The experiment was programmed in MATLAB using the MGL extension [51]. The stimulus consisted of random dot fields. For each trial, five out of the seven colors were selected and 240 dots were drawn in an annulus (inner radius = 1.5°, outer radius = 6°) centered on a central fixation disc (white; size: 0.1°; luminance: 14.8 cd/m²). A random color out of the five was selected to be the target color. During target-present trials, the target color was over-represented, and the remaining dots were divided equally between the four remaining colors. In target-absent trials, all five colors were equally represented. To eliminate potential confounds, each subject adjusted the brightness of every hue separately to achieve isoluminance between all colors. This was done because differences in luminance between colors could cause the overall brightness during stimulus presentation to differ - especially in target-present trials with disproportionately represented colors - thereby inflating the EEG decoding accuracy. The procedure for obtaining isoluminance between all colors was identical to the one used for the pilot experiment (see Section 4.1.1.2).

During cue trials, the stimulus was preceded by a cue (size: 0.5°). One of the cues contained the target color. When more than one cue was present, the other cue colors were randomly drawn from the two remaining colors not present in the stimulus dot field on that particular trial. The location of the cue was on a circle around the fixation (radius = 1.5°). The angle of the first cue was randomly drawn from (0°, 10°, 20°, ..., 360°); in two-cue trials, the second cue was 180° away from the target-color cue. Finally, in three-cue trials the cues were 120° away from each other (see Figure 5.2).

5.3.4.1 Procedure

Participants first performed a simple isoluminance task to prevent potential brightness confounds. The actual task trials began with a cue that lasted 500 ms, followed by an 800 ms preparatory period and a 100 ms stimulus segment. Finally, participants were required to make a target-absent or target-present judgment by pressing either 1 or 2 on a keypad with their right index or middle finger (see Figure 5.2). Short auditory feedback was given after every incorrect answer.
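As an aside, the dot-color assignment described above for a single trial can be sketched as follows; this is an illustrative reading in which coherence is treated simply as the fraction of dots given the target color (the exact definition of "relative over-representation" used by the MATLAB implementation may differ):

```python
import numpy as np

rng = np.random.default_rng()
PALETTE = ['red', 'green', 'blue', 'yellow', 'purple', 'orange', 'cyan']

def draw_trial_colors(coherence, target_present, n_dots=240):
    """Assign the 240 dot colors for one trial (sketch)."""
    colors = rng.choice(PALETTE, size=5, replace=False).tolist()  # five colors per trial
    target = colors[0]                                            # random target color
    if target_present:
        n_target = int(round(coherence * n_dots))                 # over-represent the target
        rest = np.array_split(np.arange(n_dots - n_target), 4)    # remainder over 4 colors
        counts = [n_target] + [len(r) for r in rest]
    else:
        counts = [len(c) for c in np.array_split(np.arange(n_dots), 5)]  # equal split
    return target, dict(zip(colors, counts))
```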
Before the behavioral and EEG sessions, we ran a separate thresholding task. All thresholding trials were similar to the no-cue condition. The target color coherence was manipulated to find the coherence (relative over-representation) that produces an intermediate level of performance (approximately 82%) for every subject. This thresholding was done separately for each color and was achieved by utilizing the Best PEST adaptive method with a Weibull psychometric function [130, 137]. In a classical paper, Quick proved that, if a few reasonable assumptions (such as Gaussian noise) hold, any psychometric function can be approximated using a Weibull probability distribution [140]. Given the previous trials, the Best PEST adaptive procedure uses maximum likelihood estimation to select the parameter values that are most likely to induce the desired performance level. During the behavioral session, subjects ran 2 blocks of 84 trials in every condition. The order of the blocks was randomized across subjects.

The EEG session procedure differed from what was described above in multiple key aspects. First, the subjects completed 5 blocks of the one-cue and two-cue conditions only (in an ABAB fashion, with the order balanced across subjects). Additionally, to prevent alpha-band phase alignment issues, the delay segment duration varied between 800 ms and 1100 ms. Finally, as we are mostly interested in decoding target-present trials, the proportion of target-present trials was 75% of the total number of trials. Since behavioral sessions were used to exclude subjects with abnormal performance, behavioral and EEG blocks shared the same task characteristics (such as the proportion of target-present trials).

Figure 5.2 Example of behavioral session no-cue, one-cue, two-cue, and three-cue trials. The background color was changed for visibility (RGB value 240, 240, 240).

5.3.4.2 Data Acquisition and Preprocessing

Continuous EEG activity was recorded using the actiCHamp system with BrainVision Recorder software. The participants were fitted with a 64-channel actiCap with active electrodes. The screen refresh rate was set to 120 Hz, and data were sampled at 1000 Hz. Additionally, electrooculogram (EOG) activity was recorded from horizontal and vertical electrode pairs and used to detect and reject horizontal eye movements, eyeblinks, and vertical eye movements. Electrode impedances were maintained below 50 kΩ. The data from the inter-trial interval was discarded.

We used EEGLAB and ERPLAB to process the data. First, we resampled the data to 500 Hz, removed the AC line noise (using the cleanline plugin), applied a bandpass filter between 1 and 100 Hz, and used ICA decomposition to separate and remove components originating from blinks and other artifacts. Finally, waveforms were manually examined by the experimenters and noisy epochs were rejected. A simple peak-to-peak blink detection algorithm (available in ERPLAB) was also used to detect blinks using the EOG channels. The threshold differed for every participant, and potential blinks were marked before an examiner manually checked the data and marked any additional noisy epochs for rejection. On average, we rejected 120 of the 840 recorded trials. This is comparable to the rejection rates reported in other EEG studies [7, 8].

5.3.5 EEG Data Analysis

We used the ADAM decoding toolbox [44], with a few modifications such as adding Gaussian smoothing of the raw data and smoothing of the classifier accuracies. First, we resampled the data at a rate of 100 Hz.
Additionally, following the conclusions of our pilot experiment, we expected the color feature information to be mostly contained in the sub-alpha frequency band and in visual-cortex electrodes. Therefore, we employed a 10 Hz lowpass filter and used the subset of mostly parietal and occipital electrodes Pz, P3, P7, O1, Oz, O2, P4, P8, TP10, TP9, T7, T8, P1, P2, P5, PO7, PO3, POz, PO4, PO8, P6, P2, CP4, CP2, CP1, CP3, C1, C2, C3, and C4. The pilot decoding results for this subset were robust, and similar electrode subsets were used in the literature [7, 8, 128]. Following previous studies, we Gaussian smoothed the data along the time dimension (window size 20 ms) and smoothed the classifier outputs using a 40 ms moving average [7, 8, 184]. To verify that our pipeline does not inflate accuracy, we simulated and tried to decode random noise (see Appendix B). After decoding trials from the one-cue and two-cue conditions, we subtracted the two decoding results and used similar cluster-based permutation testing to identify significant differences.

We used the default ADAM backward decoding classifier with 10-fold cross-validation. For an in-depth discussion of LDA decoders such as the ADAM backward decoding model, see the EEG decoding section of the literature review (Section 2.2.2.1). At each fold, 90% of the data (balanced across the seven colors) was used to train an LDA classifier, and accuracy was computed on the withheld 10% of the data. An exception to this procedure is when different training and testing sets are used (Figure 5.10), as in these cases no splitting of the data is necessary and there is only a single iteration. Given that we always decode seven classes representing the seven different colors, the chance level of the classifier was 1/7 ≈ 14.28%. Finally, for evaluating statistical reliability, we used the default ADAM Monte Carlo sampling with 2000 iterations of cluster-based significance testing [128]. Since the classifier accuracy is compared against chance, the t-tests used to generate the significance values before the cluster-based permutation testing are always one-tailed [44].

To investigate temporal generalization, we employed a similar pipeline. First, a classifier was trained using samples at a specific time point; then decoding accuracy was calculated on the testing data at every time point (instead of only the same time point). An analogous significance cluster analysis is performed on the resulting two-dimensional (training time × testing time) accuracy matrix. Generalization can also be calculated across completely different EEG segments, for instance by training classifiers for each stimulus-period time point and classifying preparatory-period data. Here, however, no cross-validation is needed, as the trials available for training and testing are independent.

Finally, we also implemented a Mahalanobis distance classifier. The implementation was directly integrated into ADAM. While the classification accuracy was comparable to the LDA performance, the Mahalanobis classifier failed to reproduce some transient effects that were present in the default LDA decoding. We believe that this is due to the Mahalanobis classifier requiring additional trial averaging and binning [184]; see Appendix C for the Mahalanobis classification results. Both the LDA and Mahalanobis classifier algorithms were described previously in Section 2.2.2.1. All results in the following sections were achieved using the LDA classifier.
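For illustration, a compact sketch of this per-timepoint decoding scheme - substituting scikit-learn's LDA and stratified 10-fold cross-validation for the ADAM implementation, with smoothing parameters expressed in samples at 100 Hz (2 samples ≈ 20 ms Gaussian window, 4 samples ≈ 40 ms moving average; all placeholder choices) - could look as follows:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, uniform_filter1d
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold

def time_resolved_decoding(X, y, n_folds=10, smooth_sigma=2, avg_win=4):
    """Per-timepoint LDA decoding accuracy with k-fold cross-validation.

    X : (n_trials, n_channels, n_times) epoched EEG; y : (n_trials,) color labels
    """
    X = gaussian_filter1d(X, sigma=smooth_sigma, axis=-1)  # smooth raw data over time
    n_times = X.shape[-1]
    acc = np.zeros(n_times)
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train, test in cv.split(X[:, :, 0], y):            # folds stratified by color
        for t in range(n_times):
            clf = LinearDiscriminantAnalysis().fit(X[train, :, t], y[train])
            acc[t] += clf.score(X[test, :, t], y[test]) / n_folds
    return uniform_filter1d(acc, size=avg_win)             # moving-average smoothing
```

Temporal generalization follows the same logic, except that each classifier trained at time t is also scored at every other time point t', yielding the (training time × testing time) accuracy matrix; accuracies are then compared against the 1/7 chance level with cluster-based permutation tests.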
5.4 Unsupervised Artifact Rejection With Deep Encoder-Decoders

More often than not, electroencephalography preprocessing sections in cognitive neuroscience papers allude to manual trial rejection by visual inspection [7, 8, 128]. This taxing process constitutes a bottleneck, especially at early research stages when making decisions regarding the design and procedure of the experiment. To alleviate this bottleneck, one of the unsupervised artifact detection algorithms presented in Section 3.2 was used. Code for converting EEGLAB files to and from formats that can be processed by our feature extraction and artifact detection packages is available online (github.com/sari-saba-sadiya/EEGLAB-Artifact-Detection). As can be observed in Figure 5.3, using unsupervised outlier detection improved the decoding performance without requiring hours of commitment in the early stages of the experiment.

Figure 5.3 Decoding performance on target-present correct trials after and before using unsupervised outlier detection (red and green lines respectively). Horizontal lines indicate a significant decoding cluster.

5.5 Results

5.5.1 Behavioral Results

Following [96], we hypothesized that the Hit-FA rate would drop when less information is available to the participants (in other words, we expected a downward trend across the one-cue, two-cue, three-cue, and no-cue conditions). Moreover, we hypothesized that the hit rate would decrease following the same trend, and that the false-alarm rate would do the opposite and increase.

Figure 5.4 Results of our behavioral session. ★) p < .05; ★★) p < 0.01; ★★★) p < 0.005.

A repeated measures ANOVA revealed a significant main effect of cue condition on Hit-FA (F(3, 69) = 28.812, MSE = 0.008167, p < .005, η²G = 0.556). We followed up the ANOVA with a series of paired two-tailed t-tests (see Figure 5.4). Results showed significant differences between one-cue trials and two-cue, three-cue, and no-cue trials (one vs. two: t(23) = 4.652, p < 0.005; one vs. three: t(23) = 7.95, p < 0.005; one vs. no-cue: t(23) = 8.24, p < 0.005), as well as between two-cue and three-cue and no-cue trials (two vs. three: t(23) = 2.88, p < 0.01; two vs. no-cue: t(23) = 4.22, p < 0.005).

A repeated measures ANOVA revealed a significant main effect of cue condition on hit rate (F(3, 69) = 9.609, MSE = 0.001646, p < .001, η²G = 0.294). We followed up the ANOVA with a series of paired two-tailed t-tests. Results showed significant differences between one-cue trials and two-cue, three-cue, and no-cue trials (one vs. two: t(23) = 2.80, p < 0.01; one vs. three: t(23) = 4.86, p < 0.005; one vs. no-cue: t(23) = 4.73, p < 0.005), as well as between the two-cue and three-cue conditions (t(23) = 2.12, p < 0.05).

Finally, a repeated measures ANOVA revealed a significant main effect of cue condition on false-alarm rate (F(3, 69) = 14.015, MSE = 0.009272, p < .001, η²G = 0.378). We followed up the ANOVA with a series of paired two-tailed t-tests. Results showed significant differences between one-cue trials and two-cue, three-cue, and no-cue trials (one vs. two: t(23) = 3.3025, p < 0.01; one vs. three: t(23) = 4.94, p < 0.005; one vs. no-cue: t(23) = 6.38, p < 0.005), as well as a significant difference between the two-cue and no-cue conditions (t(23) = 2.84, p < 0.01).

Figure 5.5 Discriminability index and criterion results of our behavioral session. ★) p < .05; ★★) p < 0.01; ★★★) p < 0.005.

Further signal detection theory analysis was also conducted by calculating the d-prime and criterion for each condition.
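For reference, under the conventional equal-variance Gaussian signal detection model, these two quantities are computed from the hit rate H and the false-alarm rate F as

\[
d' = \Phi^{-1}(H) - \Phi^{-1}(F), \qquad c = -\frac{1}{2}\left[\Phi^{-1}(H) + \Phi^{-1}(F)\right],
\]

where \(\Phi^{-1}\) is the inverse of the standard normal cumulative distribution function. A larger d' indicates that signal (target-present) and noise (target-absent) trials are easier to discriminate, while c indexes the overall response bias.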
A repeated measures ANOVA revealed a significant main effect of cue condition on d-prime (F(3, 69) = 5.175, MSE = 1.064882, p < .01, η²G = 0.10106). Follow-up two-tailed t-tests showed significant differences between the discriminability indices in the one-cue and the three-cue as well as the no-cue conditions (one vs. three-cue: t(23) = 4.67, p < 0.005; one vs. no-cue: t(23) = 4.98, p < 0.005). No significant main effect of cue condition on the criterion was found (F(3, 69) = 0.454, p = .715).

Additionally, we found a main effect of condition on response time (F(3, 69) = 3.851, MSE = 0.0118, p < .05, η²G = 0.143). Response time generally followed the same trend as the other behavioral measures (the mean response time was 0.6038, 0.6399, 0.7019, and 0.6803 seconds for one-cue, two-cue, three-cue, and no-cue trials respectively). The effect of cue condition on response time for correct trials was also significant (F(3, 69) = 3.9273, MSE = 0.00937, p < .05, η²G = 0.145) and followed the same overall pattern. In general, the response time, the response time on correct-only trials, and the response time on target-present correct-only trials (not plotted) followed a trend inverse to the accuracy, hit rate, Hit-FA, and discriminability index (see Figure 5.8 and Figure 5.5). Therefore, there is no evidence of any speed-accuracy trade-off in this experiment.

We ran the same analysis on performance during the EEG trials (see Figure 5.6). A repeated measures ANOVA revealed a significant main effect of cue condition on Hit-FA rate (F(1, 23) = 26.883, MSE = 0.0028, p < .001, η²G = 0.538). Similarly, repeated measures ANOVAs also showed a main effect of cue condition on hit rate (F(1, 23) = 4.293, MSE = 0.0003, p < .001, η²G = 0.157) and false-alarm rate (F(1, 23) = 19.674, MSE = 0.0028, p < .001, η²G = 0.461). Signal detection theory analysis of the EEG session behavioral data showed a significant main effect of cue condition on the discriminability index (F(1, 23) = 25.793, MSE = 0.0369, p < .001, η²G = 0.0606), but no significant main effect was found for the criterion (F(1, 23) = 3.264, p = .083).

Figure 5.6 Results of our EEG session. ★) p < .05; ★★) p < 0.01; ★★★) p < 0.005.

Figure 5.7 Discriminability index and criterion results of our EEG session. ★) p < .05; ★★) p < 0.01; ★★★) p < 0.005.

Additionally, we found a main effect of condition on response time (F(1, 23) = 3.851, MSE = 0.0114, p < .05, η²G = 0.222). The mean response time was 0.5417 seconds for one-cue trials and 0.6210 seconds for two-cue trials (Figure 5.8). There was no main effect of condition on response time when analyzing correct-only trials (F(1, 23) = 3.876, MSE = 0.023, p = .061); however, the mean for one-cue trials was still lower than for two-cue trials (0.4678 and 0.557 seconds respectively). Following the pattern observed in the behavioral session, conditions with higher accuracy and discriminability index also had lower mean response times; thus we conclude that there was no speed-accuracy trade-off.

Figure 5.8 Response time (in seconds) for all trials and correct-only trials, for the different conditions in the behavioral and EEG sessions. ★) p < .05; ★★) p < 0.01; ★★★) p < 0.005.

5.5.2 Bayesian Analysis of Behavioral Results

To further investigate whether there is any evidence of behavior differing across three-cue and no-cue trials, we followed up the previous analysis with Bayesian modeling using JASP [76]. Specifically, we tested the hypotheses that the hit rate, false-alarm rate, and d-prime values for the no-cue and three-cue trials were the same.
The default JASP prior of 0.707 on the Cauchy scale with a 95% credibility interval was used for all Bayesian paired t-tests. For follow-up robustness analyses, see Appendix D. A Bayesian paired t-test was used to explore whether the hit rate in three-cue and no-cue trials differed. The analysis showed a Bayes factor of BF01 = 4.628 (0.024% error) in support of the null hypothesis. The median effect size was -0.022 and the credible interval was [-0.398, 0.353]. A Bayesian paired t-test was also used to explore whether the false-alarm rates in three-cue and no-cue trials differed. The analysis showed a Bayes factor of BF01 = 2.879 (0.026% error) in support of the null hypothesis. The median effect size was 0.187 and the credible interval was [-0.191, 0.574]. Finally, a Bayesian paired t-test was used to explore whether the d-prime in three-cue and no-cue trials differed. The analysis showed a Bayes factor of BF01 = 2.122 (0.026% error) in support of the null hypothesis. The median effect size was -0.242 and the credible interval was [-0.634, 0.14]. The Bayes factors suggest weak to moderate evidence in favor of the null hypotheses, indicating no meaningful differences in behavior across three-cue and no-cue trials.

5.5.3 EEG Results

We first compared target-present trials across the one-cue and two-cue conditions. Generally, the target feature value could be decoded from the very beginning of the one-cue trials, while decoding in the two-cue condition was possible only from 150 ms after stimulus offset (Figure 5.9).

Figure 5.9 Decoding target-present EEG trials. Bold red and green horizontal lines indicate significant decoding in the one-cue and two-cue trials respectively. Blue horizontal lines indicate clusters of significant differences between conditions. The dashed and solid lines indicate stimulus onset and offset respectively.

Additionally, to investigate the impact of attentional load on "false alarm" trials, we decoded the cued color that was present (but not over-represented) in target-absent trials. Training the classifier using two-cue trials was generally unsuccessful and yielded no significant decoding. However, classifiers trained on target-present one-cue trials generalized to two-cue trials. The present (but not over-represented) color is likely to be the cause of the false alarm in two-cue target-absent trials (as the other color is completely absent from the array). Indeed, we were able to decode the present color in false-alarm but not correct-reject trials. Moreover, there was a significant difference in decoding between the two. Considering that the two trial types are identical stimulus-wise, this indicates that the decoding in false-alarm trials is indeed driven by attentional capture by the cued color. Surprisingly, there was no significant decoding in one-cue target-absent false-alarm trials (see Appendix A). However, this is likely due to the low number of one-cue false-alarm trials rather than to any attentional load modulation of perception. Finally, while decoding was not significant for all target-absent trials in either condition, it is worth noting that decoding accuracy was higher for two-cue, as opposed to one-cue, target-absent trials (Appendix A). The difference was not statistically significant, but the decoding results seem to correspond to the behavioral performance trends observed in the EEG and behavioral sessions (namely, the false-alarm rate was higher in the two-cue condition).
5.5.4 Generalization for Stimulus and Preparatory Period Decoding

Classifiers trained on one-cue trials generalized well to two-cue trial stimulus period classification (Figure 5.11). Preparatory period decoding was significant only in one-cue trials, shortly after cue offset. Generalization across the preparatory period yielded no significant clusters for either one-cue or two-cue trials (Figure 5.12). However, classifiers trained on the stimulus period of one-cue target-present correct trials successfully generalized to the one-cue preparatory period (Figure 5.13).

Figure 5.10 Decoding target-absent EEG trials. Bold red horizontal lines indicate significant decoding in the two-cue target-absent false-alarm trials. Classifiers were trained on target-present one-cue trials. Blue horizontal lines indicate clusters of significant differences between false-alarm and correct-reject trials. The dashed and solid lines indicate stimulus onset and offset, respectively.

Mental representations are expected to be most robust during stimulus presentation in one-cue trials (when a coherent target color is present and sensory perception has been enhanced by attention). The generalization of stimulus period one-cue classifiers to both the one-cue preparatory period and the two-cue stimulus period could indicate a similarity in the representations being used. This suggests that when the target is coherent, representations in two-cue target-present correct trials become similar to those of one-cue target-present correct trials. Moreover, when only a single attentional template is maintained, representations in the preparatory period are similar to representations after perceiving the attended feature value.

Following the same logic, the lack of generalization for two-cue trials during the preparatory period could indicate that a qualitatively different representation is being used when two attentional templates are maintained simultaneously.

Figure 5.11 Decoding generalization for one-cue and two-cue target-present correct trials (left and right, respectively). The classifiers were trained using one-cue target-present correct trials. The dashed lines indicate stimulus onset.

Figure 5.12 Decoding generalization for one-cue and two-cue target-present correct trials during the preparatory period (left and right, respectively). The classifiers were trained using stimulus period EEG for one-cue and two-cue target-present correct trials (left and right, respectively). The dashed lines on the y-axis and x-axis indicate cue onset.

Figure 5.13 Decoding generalization for one-cue and two-cue target-present correct trials during the preparatory period (left and right, respectively). The classifiers were trained using one-cue target-present trials after stimulus onset. The y-axis dashed lines indicate stimulus onset, and the x-axis dashed line indicates cue onset.

There are a number of possible explanations for the above. One possibility is that, as suggested by [128], the two attentional templates are mutually suppressive until feedback from a sensory stimulus causes the corresponding template to win the competition. Another possibility is that only one of the items is being maintained in an active state. Researchers have demonstrated that passive and active working memory states are encoded using orthogonal representations in which only the active template can be decoded [183, 182, 91].
If this is indeed the case, we would expect decoding to be impossible during the preparatory period in half of the trials, limiting our statistical power. Finally, empirical limitations of machine learning complicate the interpretation of these results even further. Classifiers perform better with under-representative training noise (low training noise and high test noise) than with over-representative training noise (high training noise and low test noise) [104]. Representations are expected to be most robust during stimulus presentation in one-cue trials (when a coherent target color is present and sensory perception is being enhanced by attention). Therefore, training on stimulus presentation data from one-cue trials is an example of under-representative training noise. Moreover, stimulus presentation could involve a variety of representations (visual, semantic labels, etc.). This is less likely during the preparatory period. Hence, preparatory period data might simply lack aspects of the representation needed to decode stimulus presentation data.

5.6 Discussion and Conclusion

EEG decoding analysis of the target feature values during and after stimulus presentation showed a significant difference between one-cue and two-cue trials. This indicates the existence of a cost for maintaining multiple attentional templates. Having designed our paradigm so that one-cue and two-cue trials differ only during cue presentation, it is safe to assume that decoding differences during stimulus presentation can be attributed to differences in attentional modulation of sensory representations. Specifically, under single attentional template conditions, modulation of sensory representations occurs earlier. These neural differences are reflected behaviorally in the d-prime statistic, which decreases when multiple templates are maintained. Therefore, we conclude that the enhanced attentional modulation in the single template condition (demonstrated by the EEG analysis) enables easier discrimination between noise and signal trials (as evidenced by differences in d-prime).

Moreover, there was a clear downward trend in hit rate minus false-alarm rate with decreased certainty regarding target color. Analysis revealed that maintaining multiple templates decreases performance mainly by increasing false-alarm rates. EEG analysis of two-cue target-absent trials demonstrated that we can decode the non-coherent signal in false-alarm trials but not correct-reject ones. This indicates that false alarms occur because participants mistake the non-coherent signal for a target, rather than due to increased task difficulty or other confounds of increased attentional or working memory load. This further supports our conclusion that behavioral differences between the one-cue and two-cue conditions originate from a decrease in sensitivity when multiple attentional templates are maintained.

These differences in EEG decoding of target-absent trials also correspond to the results of the signal detection theory analysis. The d-prime significantly decreased between the one-cue and two-cue conditions, indicating that it becomes increasingly harder to discriminate between signal and noise trials when maintaining multiple attentional templates, resulting in a higher false-alarm rate. There was no significant main effect of attentional load on the criterion, suggesting that the differences are not due to variation in response strategy across conditions.
This held true in both the behavioral and EEG sessions, indicating that the observed attentional load effects are likely due to low-level mechanisms responsible for a change in sensitivity rather than any top-down change in response strategy. Additionally, response time and accuracy were inversely correlated; hence, there is no evidence to suggest a speed-accuracy trade-off occurred.

Generally, three-cue trials did not differ significantly from no-cue trials. Moreover, a follow-up Bayesian analysis found weak evidence in favor of the null hypothesis (that behaviorally these trials are identical). Therefore, the analysis indicates that attentional guidance might be limited to two items at most, and that maintaining three or more templates results in a complete failure of attentional guidance.

Considering the EEG results for the target-present and target-absent trials, as well as the overall pattern of the behavioral results, we conclude that maintaining multiple attentional templates comes at a significant cost. Moreover, we demonstrate that while the cost exists in target-present trials, impaired attentional modulation due to maintaining multiple templates also significantly increases false alarms due to a decreased ability to discriminate between signal and noise trials. This result is especially interesting as previous research utilized paradigms that make signal detection analysis impossible [123, 179, 9] and limited EEG decoding analysis to correct target-present trials only [128]. Overall, we conclude that attentional modulation of sensory representations is significantly weaker when maintaining multiple templates, resulting in impaired signal perception in target-present trials, and more false alarms in target-absent trials due to difficulties discriminating between signal and noise. Following this conclusion, we reject many versions of the multiple item template (MIT) hypothesis, especially those that argue for a lack of cost (or even an additive effect) of maintaining multiple templates [14, 10, 28, 84]. However, significant performance differences between the two-cue and three/no-cue conditions suggest that participants are still able to make limited use of multiple cues.

Previous literature has discussed three different scenarios that could potentially account for our results. First, participants could be maintaining only a single template while storing the other cues in working (or long-term) memory. Switching occurs when a participant fails to detect a signal matching the attentional template. Theoretically, both our results and the data presented in [128] could be explained by template-switching. Many theories of attention postulate the existence of a quick, parallel pre-attentive perception stage and a slower, sequential focused attention stage [173, 186]. Operating under such a framework, the difference in decoding between our one-cue and two-cue trials (as well as the one-cue-one-target and two-cue-one-target conditions in [128]) could be attributed to template switching after quick pre-attentive processing revealed no stimulus matching the initial attentional template. The authors of [128] do acknowledge that sequential processing is indeed a possibility, but ultimately reject this interpretation after finding no trial-based correlation between classification confidence scores and target position, nor any pattern of target location switching in individual subject data.
We are unable to run an equivalent analysis using our data, as we did not systematically manipulate the spatial locations of either cue or target. Moreover, while many switching cost estimates reported in the literature are around 250 ms [40, 189], more recent estimates go as low as 50 ms [126]. The significant difference in decoding between one-cue and two-cue trials in both our experiment and [128] lasted only 50 ms. Therefore, the literature is inconclusive as to whether the decoding differences we observed can be attributed to switching costs. Generally, while the sequential template-switching interpretation remains a possibility, considering previous research we believe it is the less likely alternative.

Another possibility is that multiple templates can exist simultaneously, but at a cost. According to this view, while attentional capacity is limited, it can be flexibly divided to accommodate multiple items at the expense of representation quality. Similar conceptualizations of working memory have been suggested [102]. Moreover, the authors of [128] proposed an attentional load theory with simultaneous maintenance of mutually suppressive templates, which could be interpreted as a variation on the flexible attentional capacity theme.

Finally, it is also possible that the templates are rhythmically oscillating. One recent study demonstrated oscillatory patterns in behavioral performance when attending to two cues [136]. While interesting, the analysis in [136] focused only on hit rate, ignoring the effects of attentional load on false-alarm rates, which, according to our results, are more substantial. Unfortunately, the literature mostly consists of paradigms with stimulus presentation until response, making post-hoc analysis of oscillatory behavior difficult.

Overall, it is important to note that neither [128] nor [136] fully explains our results. Both paradigms consist of only target-present trials, and it remains unclear how either theory could be extended to account for the increased frequency and decoding accuracy of false-alarm trials in the multiple templates condition. Further experiments that center on target-absent trials and deliberately manipulate cue presentation onset (similarly to [136]) are required to thoroughly test rhythmic template fluctuations. Similarly, a version of our experiment with multiple targets (similar to [128]) could have interesting implications.

To summarize, we used a signal detection paradigm that requires active guidance by attentional templates. Our results seem sufficient to reject most versions of the multiple-item-template hypothesis. Several substantial differences in both behavior and EEG decoding indicate that multiple templates are not able to guide attention as well as a single template. One interesting result of our experiment was that maintenance of multiple templates increased the likelihood of false alarms while only slightly decreasing hit rates. This could have practical implications, as many tasks (such as aviation security or radiology screening) prioritize low false-negative rates, even at the expense of a slightly inflated false-positive rate [152]. While not conclusive, some of our results are consistent with recent theories of attentional load, such as the competition model suggested by [128] or the fluctuating templates suggested by [136]. However, further research is required into attentional load effects during target-absent trials.
CHAPTER 6
GENERAL CONCLUSIONS AND REFLECTIONS

6.1 What Was Accomplished

This thesis contains several relatively distinct components. Before concluding this document, it is worth highlighting a few of its more interesting elements:

• The Feature Imitating Network (FIN) framework enables the integration of expert knowledge into deep learning models. This middle ground between rigid hand-crafted feature engineering and unpredictable, data-hungry deep learning models has already sparked some interest. So far, this framework has been utilized in various domains such as natural language processing, computer vision, and predicting athletic performance [83].

• EEG decoding results demonstrate the direct impact of attentional load on the modulation of sensory representations. Observing this latent variable provides direct evidence of the cost of divided attention. Moreover, behavior in target-absent trials (and the neural correlates of this behavior) reveals a manifestation of this cost that was not explored in previous literature.

• The EEG feature extraction pipeline, originally a sub-component of a larger project, achieved modest popularity among EEG researchers and has become a collaboratively maintained standalone library1.

1 https://github.com/sari-saba-sadiya/EEGExtract

6.2 Future Directions

There are multiple limitations to the current Feature Imitating Networks framework. First and foremost, FINs do not currently accept variable-length inputs. Remedying this is not as simple as it may initially seem, as feature (for instance, entropy) calculation often requires access to the full input vector. Furthermore, a thorough examination of how the task-specific tuning changes the embedding in each FIN sub-model is required to support our intuition that the network is adapting the hand-crafted features to task requirements, rather than, for instance, learning a completely new embedding.

The presented cognitive neuroscience research also warrants follow-up experiments. Behavioral differences between the three-cue and no-cue conditions were inconclusive. While it is likely that there is some attentional guidance in three-cue trials, future experiments with larger sample sizes are required to properly validate this intuition. Additionally, analysis indicated that attentional load has significant effects on behavior in target-absent trials. Further experiments focusing on target-absent trials are needed to identify the changes to the decision making process responsible for the behavioral differences. For instance, it is unclear whether the changes are driven by top-down changes in task parameters (such as a reduced decision threshold) or bottom-up changes in how the sensory evidence is processed.

6.3 Reflections on Deep Learning in EEG

The real-world applications of machine learning for EEG data are numerous. However, whether the application is a medical-diagnostics tool [56] or a thought-controlled prosthetic [192], a number of conditions need to be met for the developed algorithms to have any real-world impact. Perhaps most importantly, performance needs to be consistent even when testing on data from unseen (out of training sample) subjects, and despite some variability in data acquisition circumstances (for instance, an unseen EEG task). While preparing the literature review, I was dismayed to discover that the vast majority of researchers do not report any out-of-sample testing (one notable exception being [56]).
As can be observed in Section 3.1, the difference in performance between “seen task, seen subject” and out-of-sample data can be significant. Generally, out-of-sample testing was reported for the algorithms proposed throughout this work. Hopefully, this practice will become more common as the field of EEG-focused machine learning matures and the demand for algorithms that perform well in the real world increases.

6.4 Reflections on Accessibility

Code and data for reproducing the experiments presented in this thesis will be made available. All developed machine learning algorithms are already available on several GitHub repositories. The interest these repositories generated is a testament to the much-discussed importance of accessibility in science. However, an unexpected hidden benefit I had the pleasure of experiencing is the friendship and collegiality that can blossom from an email inquiring about a runtime error.

BIBLIOGRAPHY

[1] A. Accardo et al. “Use of the fractal dimension for the analysis of electroencephalographic time series”. In: Biological cybernetics 77.5 (1997), pp. 339–350.
[2] M. Agarwal and R. Sivakumar. “Blink: A Fully Automated Unsupervised Algorithm for Eye-Blink Detection in EEG Signals”. In: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). 2019, pp. 1113–1121.
[3] Charu C. Aggarwal. “Outlier analysis”. In: Data Mining: The Textbook. Springer Publishing Company, Incorporated, 2015, pp. 75–79.
[4] Jinwon An and Sungzoon Cho. Variational Autoencoder based Anomaly Detection using Reconstruction Probability. 2015.
[5] S. Andersen, S. Hillyard, and M Muller. “Global facilitation of attended features is obligatory and restricts divided attention”. en. In: J. Neurosci. 33.46 (Nov. 2013), pp. 18200–18207.
[6] Kai Keng Ang et al. “Mutual information-based selection of optimal spatial–temporal patterns for single-trial EEG-based BCIs”. In: Pattern Recognition 45.6 (2012), pp. 2137–2144.
[7] Giyeul Bae and Steven Luck. “Dissociable Decoding of Spatial Attention and Working Memory from EEG Oscillations and Sustained Potentials”. In: The Journal of Neuroscience 38 (Nov. 2017), pp. 2860–17.
[8] Gi-Yeul Bae and Steven J. Luck. “Decoding motion direction using the topography of sustained ERPs and alpha oscillations”. In: NeuroImage 184 (2019), pp. 242–255.
[9] Brett Bahle, Valerie M Beck, and Andrew Hollingworth. “The architecture of interaction between visual working memory and visual attention”. en. In: J. Exp. Psychol. Hum. Percept. Perform. 44.7 (July 2018), pp. 992–1011.
[10] Brett Bahle et al. “The architecture of working memory: Features from multiple remembered objects produce parallel, coactive guidance of attention in visual search”. en. In: J. Exp. Psychol. Gen. 149.5 (May 2020), pp. 967–983.
[11] Mohammad Haris Baig, Vladlen Koltun, and Lorenzo Torresani. “Learning to Inpaint for Image Compression”. In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 1246–1255.
[12] Diane M Beck and Sabine Kastner. “Top-down and bottom-up mechanisms in biasing competition in the human brain”. en. In: Vision Res. 49.10 (June 2009), pp. 1154–1165.
[13] Valerie M Beck and Andrew Hollingworth. “Competition in saccade target selection reveals attentional guidance by simultaneously active working memory representations”. en. In: J. Exp. Psychol. Hum. Percept. Perform. 43.2 (Feb. 2017), pp. 225–230.
[14] Valerie M Beck, Andrew Hollingworth, and Steven J Luck.
“Simultaneous control of attention by multiple working memory representations”. en. In: Psychol. Sci. 23.8 (Aug. 2012), pp. 887–898.
[15] James Bergstra and Yoshua Bengio. “Random search for hyper-parameter optimization”. In: Journal of machine learning research 13.Feb (2012), pp. 281–305.
[16] Katarzyna J Blinowska, Rafał Kuś, and Maciej Kamiński. “Granger causality and information flow in multivariate processes”. In: Physical Review E 70.5 (2004), p. 050902.
[17] Sarah Blum et al. “A Riemannian Modification of Artifact Subspace Reconstruction for EEG Artifact Handling”. In: Frontiers in Human Neuroscience 13 (2019). issn: 1662-5161.
[18] Moritz Böhle et al. “Layer-Wise Relevance Propagation for Explaining Deep Neural Network Decisions in MRI-Based Alzheimer’s Disease Classification”. In: Frontiers in Aging Neuroscience 11 (2019), p. 194.
[19] Rishi Bommasani et al. “On the opportunities and risks of foundation models”. In: arXiv preprint arXiv:2108.07258 (2021).
[20] L. Bonnet, F. Lotte, and A. Lécuyer. “Two Brains, One Game: Design and Evaluation of a Multiuser BCI Video Game Based on Motor Imagery”. In: IEEE Transactions on Computational Intelligence and AI in Games 5.2 (2013), pp. 185–198.
[21] George EP Box et al. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[22] Markus M. Breunig et al. “LOF: Identifying Density-Based Local Outliers”. In: SIGMOD Rec. 29.2 (May 2000), pp. 93–104.
[23] Gijs Joost Brouwer and David J. Heeger. “Decoding and Reconstructing Color from Responses in Human Visual Cortex”. In: Journal of Neuroscience 29.44 (2009), pp. 13992–14003.
[24] Nancy B. Carlisle et al. “Attentional Templates in Visual Working Memory”. In: Journal of Neuroscience 31.25 (2011), pp. 9315–9322.
[25] Marisa Carrasco. “Visual attention: the past 25 years”. en. In: Vision Res. 51.13 (July 2011), pp. 1484–1525.
[26] Diogo V. Carvalho, Eduardo M. Pereira, and Jaime S. Cardoso. “Machine Learning Interpretability: A Survey on Methods and Metrics”. In: Electronics 8.8 (2019).
[27] C.-Y. Chang et al. “Evaluation of Artifact Subspace Reconstruction for Automatic Artifact Components Removal in Multi-Channel EEG Recordings”. In: IEEE Transactions on Biomedical Engineering 67.4 (2020), pp. 1114–1121.
[28] Yanan Chen and Feng Du. “Two visual working memory representations simultaneously control attention”. en. In: Sci. Rep. 7.1 (July 2017), p. 6107.
[29] Yaqi Chu et al. “A Decoding Scheme for Incomplete Motor Imagery EEG With Deep Belief Network”. In: Frontiers in Neuroscience 12 (2018).
[30] Radoslaw Martin Cichy, Fernando Mario Ramirez, and Dimitrios Pantazis. “Can visual information encoded in cortical columns be decoded from magnetoencephalography data in humans?” In: NeuroImage 121 (2015), pp. 193–204.
[31] Yücel Çimtay and E. Ekmekcioglu. “Investigating the Use of Pretrained Convolutional Neural Network on Cross-Subject and Cross-Dataset EEG Emotion Recognition”. In: Sensors 20 (Apr. 2020).
[32] Tommy Clausner, Sarang S. Dalal, and Maité Crespo-García. “Photogrammetry-Based Head Digitization for Rapid and Accurate Localization of EEG Electrodes and MEG Fiducial Markers Using a Single Digital SLR Camera”. In: Frontiers in Neuroscience 11 (2017), p. 264.
[33] Harris Cooper, James J. Lindsay, and Barbara Nye. “Homework in the Home: How Student, Family, and Parenting-Style Differences Relate to the Homework Process”. In: Contemporary Educational Psychology 25.4 (2000), pp. 464–487.
[34] I. A. Corley and Y. Huang.
“Deep EEG super-resolution: Upsampling EEG spatial resolution with Generative Adversarial Networks”. In: 2018 IEEE EMBS International Conference on Biomedical Health Informatics (BHI). 2018, pp. 100–103.
[35] H. S. Courellis et al. “EEG channel interpolation using ellipsoid geodesic length”. In: 2016 IEEE Biomedical Circuits and Systems Conference (BioCAS). Oct. 2016, pp. 540–543.
[36] Janis J Daly and Jonathan R Wolpaw. “Brain–computer interfaces in neurological rehabilitation”. In: The Lancet Neurology 7.11 (2008), pp. 1032–1043.
[37] Arnaud Delorme, Scott Makeig, and Terrence Sejnowski. “Automatic Artifact Rejection for EEG Data Using High-Order Statistics and Independent Component Analysis”. In: Proceedings of the 3rd International Independent Component Analysis and Blind Source Decomposition Conference. Jan. 2001.
[38] Robert Desimone and John Duncan. “Neural Mechanisms of Selective Visual Attention”. In: Annual Review of Neuroscience 18.1 (1995), pp. 193–222.
[39] Djuwari Djuwari, Dinesh Kumar, and Marimuthu Palaniswami. “Limitations of ICA for Artefact Removal”. In: Conference proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference 5 (Feb. 2005), pp. 4685–8.
[40] Isabel Dombrowe, Mieke Donk, and Christian N L Olivers. “The costs of switching attentional sets”. en. In: Atten. Percept. Psychophys. 73.8 (Nov. 2011), pp. 2481–2488.
[41] Hauke Dose et al. “An end-to-end deep learning approach to MI-EEG signal classification for BCIs”. In: Expert Systems with Applications 114 (2018), pp. 532–542.
[42] Paul Downing and Chris Dodds. “Competition in visual working memory for control of search”. In: Vis. cogn. 11.6 (Aug. 2004), pp. 689–703.
[43] K. K. Dutta, K. Venugopal, and S. A. Swamy. “Removal of muscle artifacts from EEG based on ensemble empirical mode decomposition and classification of seizure using machine learning techniques”. In: 2017 International Conference on Inventive Computing and Informatics (ICICI). 2017, pp. 861–866.
[44] Johannes J. Fahrenfort et al. “From ERPs to MVPA Using the Amsterdam Decoding and Modeling Toolbox (ADAM)”. In: Frontiers in Neuroscience 12 (2018).
[45] Thomas C. Ferree. “Spherical Splines and Average Referencing in Scalp Electroencephalography”. In: Brain Topography 19.1 (Dec. 2006), pp. 43–52.
[46] Rebecca M. Foerster and Werner X. Schneider. “Involuntary top-down control by search-irrelevant features: Visual working memory biases attention in an object-based manner”. In: Cognition 172 (2018), pp. 37–45.
[47] Marcella Fratescu, Dirk Van Moorselaar, and Sebastiaan Mathôt. “Can you have multiple attentional templates? Large-scale replications of Van Moorselaar, Theeuwes, and Olivers (2014) and Hollingworth and Beck (2016)”. en. In: Atten. Percept. Psychophys. 82.3 (June 2020), p. 1536.
[48] Laurel J. Gabard-Durnam et al. “The Harvard Automated Processing Pipeline for Electroencephalography (HAPPE): Standardized Processing Software for Developmental and High-Artifact Data”. In: Frontiers in Neuroscience 12 (2018), p. 97.
[49] J. O. Garcia, R. Srinivasan, and J. Serences. “Near-Real-Time Feature-Selective Modulations in Human Cortex”. In: Current Biology 23 (2013), pp. 515–522.
[50] Justin L. Gardner and Taosheng Liu. “Inverted Encoding Models Reconstruct an Arbitrary Model Response, Not the Stimulus”. In: eNeuro 6.2 (2019).
[51] Justin L. Gardner et al. MGL: Visual psychophysics stimuli and experimental design package.
Version 2.0. June 2018.
[52] RG Geocadin et al. “Early Electrophysiological and Histologic Changes After Global Cerebral Ischemia In Rats”. In: Movement Disorders 15.S1 (2000), pp. 14–21.
[53] Nils Gessert et al. “Towards Deep Learning-Based EEG Electrode Detection Using Automatically Generated Labels”. In: Computer Vision and Pattern Recognition (2019).
[54] M. M. Ghassemi et al. “You Snooze, You Win: the PhysioNet/Computing in Cardiology Challenge 2018”. In: 2018 Computing in Cardiology Conference (CinC). vol. 45. 2018, pp. 1–4.
[55] MM. Ghassemi et al. “Quantitative EEG Trends Predict Recovery in Hypoxic-Ischemic Encephalopathy”. In: Critical Care Medicine 47.10 (2019), pp. 1416–1423.
[56] Mohammad M. Ghassemi. “Life After Death: Techniques for the Prognostication of Coma Outcomes after Cardiac Arrest”. PhD thesis. Massachusetts Institute of Technology, 2018.
[57] R. Ghosh, N. Sinha, and S. K. Biswas. “Automated eye blink artefact removal from EEG using support vector machine and autoencoder”. In: IET Signal Processing 13.2 (2019), pp. 141–148.
[58] R. C. M. P. Gilberet et al. “Automated artifact rejection using ICA and image processing algorithms”. In: 2017 International Conference on Signal Processing and Communication (ICSPC). 2017, pp. 354–358.
[59] A. Goldberger et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 2000.
[60] Markus Goldstein and Andreas Dengel. Histogram-based Outlier Score (HBOS): A fast Unsupervised Anomaly Detection Algorithm. 2012.
[61] Tijl Grootswagers, Susan G Wardle, and Thomas A Carlson. “Decoding dynamic brain patterns from evoked responses: A tutorial on multivariate pattern analysis applied to time series neuroimaging data”. en. In: J. Cogn. Neurosci. 29.4 (Apr. 2017), pp. 677–697.
[62] Anna Grubert, Nancy B Carlisle, and Martin Eimer. “The control of single-color and multiple-color visual search by attentional templates in working memory and in long-term memory”. en. In: J. Cogn. Neurosci. 28.12 (Dec. 2016), pp. 1947–1963.
[63] Anna Grubert and Martin Eimer. “Preparatory Template Activation during Search for Alternating Targets”. In: Journal of Cognitive Neuroscience 32.8 (Aug. 2020), pp. 1525–1535.
[64] Qiong Gui et al. “A Survey on Brain Biometrics”. In: ACM Computing Surveys (CSUR) 51.6 (2019).
[65] Britt Hadar et al. “Working Memory Load Affects Processing Time in Spoken Word Recognition: Evidence from Eye-Movements”. In: Frontiers in Neuroscience 10 (2016).
[66] Jasper E. Hajonides et al. “Decoding visual colour from scalp electroencephalography measurements”. In: NeuroImage 237 (2021), p. 118030. issn: 1053-8119.
[67] J.J. Halford et al. “Inter-rater agreement on identification of electrographic seizures and periodic discharges in ICU EEG recordings”. In: Clinical Neurophysiology 126.9 (2015), pp. 1661–1669.
[68] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 770–778.
[69] Rainer Hegger and Holger Kantz. “Improved false nearest neighbor method to detect determinism in time series data”. In: Physical Review E 60.4 (1999), p. 4970.
[70] JF Hipp, AK Engel, and M Siegel. “Oscillatory Synchronization In Large-Scale Cortical Networks Predicts Perception”. In: Neuron 69.2 (2011), pp. 387–396.
[71] L. J. Hirsch et al. “American Clinical Neurophysiology Society’s Standardized Critical Care EEG Terminology: 2012 version”.
In: Journal of Clinical Neurophysiology (2013), pp. 1–27.
[72] Andrew Hollingworth and Valerie M Beck. “Memory-based attention capture when multiple items are maintained in visual working memory”. In: J. Exp. Psychol. Hum. Percept. Perform. 42.7 (July 2016), pp. 911–917.
[73] Marcia Hon and Naimul Mefraz Khan. “Towards Alzheimer’s disease classification through transfer learning”. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2017, pp. 1166–1169.
[74] Roos Houtkamp and Pieter R Roelfsema. “Matching of visual input to only one item at any one time”. en. In: Psychol. Res. 73.3 (May 2009), pp. 317–326.
[75] Jeremy Howard and Sebastian Ruder. “Universal Language Model Fine-tuning for Text Classification”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 328–339.
[76] JASP Team. JASP (Version 0.16.4) [Computer software]. 2022.
[77] Soroush Javidi et al. “Kurtosis based blind source extraction of complex noncircular signals with application in EEG artifact removal in real-time”. In: Frontiers in Neuroscience 5 (2011), p. 105.
[78] X Jia et al. “Early Electrophysiologic Markers Predict Functional Outcome Associated With Temperature Manipulation After Cardiac Arrest In Rats”. In: Critical Care Medicine 36.6 (2008), p. 1909.
[79] Huaizu Jiang et al. “Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation”. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (June 2018).
[80] Tzyy-Ping Jung et al. “Removing electroencephalographic artifacts by blind source separation”. In: Psychophysiology 37 (Mar. 2000), pp. 163–178.
[81] Steven M Kay. Fundamentals of statistical signal processing. Prentice Hall PTR, 1993.
[82] W. Khalifa et al. “A survey of EEG based user authentication schemes”. In: 2012 8th International Conference on Informatics and Systems (INFOS). 2012.
[83] Reza Khanmohammadi et al. MambaNet: A Hybrid Neural Network for Predicting the NBA Playoffs. 2022.
[84] Michael King and Brooke Macnamara. “Three visual working memory representations simultaneously control attention”. In: Scientific Reports 10 (June 2020).
[85] Hans-Peter Kriegel, Matthias Schubert, and Arthur Zimek. “Angle-Based Outlier Detection in High-Dimensional Data”. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, pp. 444–452.
[86] Tómas Kristjánsson and Árni Kristjánsson. “Foraging through multiple target categories reveals the flexibility of visual working memory”. In: Acta Psychol. (Amst.) 183 (Feb. 2018), pp. 108–115.
[87] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Commun. ACM 60.6 (May 2017), pp. 84–90.
[88] Pradeep Kumar et al. “Envisioned speech recognition using EEG sensors”. In: Personal and Ubiquitous Computing (Sept. 2017), pp. 1–15.
[89] Margaret Lech et al. “Real-Time Speech Emotion Recognition Using a Pre-trained Image Classification Network: Effects of Bandwidth Reduction and Companding”. In: Frontiers in Computer Science 2 (2020), p. 14.
[90] B B Lee, P R Martin, and A Valberg. “The physiological basis of heterochromatic flicker photometry demonstrated in the ganglion cells of the macaque retina”. In: The Journal of Physiology 404.1 (Oct. 1988), pp. 323–347.
[91] Jarrod A Lewis-Peacock et al.
“Neural evidence for a distinction between short-term memory and the focus of attention”. en. In: J. Cogn. Neurosci. 24.1 (Jan. 2012), pp. 61–79.
[92] Yurong Li et al. “EEG-based intention recognition with deep recurrent-convolution neural network: Performance and channel selection by Grad-CAM”. In: Neurocomputing 415 (2020), pp. 225–233.
[93] C. Lin, S. Tsai, and L. Ko. “EEG-Based Learning System for Online Motion Sickness Level Estimation in a Dynamic Vehicle Environment”. In: IEEE Transactions on Neural Networks and Learning Systems 24.10 (2013), pp. 1689–1700.
[94] Yuan-Pin Lin and Tzyy-Ping Jung. “Improving EEG-Based Emotion Classification Using Conditional Transfer Learning”. In: Frontiers in Human Neuroscience 11 (2017), p. 334.
[95] Zhouhan Lin et al. “Neural Networks with Few Multiplications”. In: CoRR (Oct. 2015).
[96] T Liu and M Jigo. “Limits in feature-based attention to multiple colors”. In: Attention, perception, and psychophysics 79 (2017), pp. 2327–2337.
[97] Taosheng Liu, Mark W Becker, and Michael Jigo. “Limited featured-based attention to multiple features”. en. In: Vision Res. 85 (June 2013), pp. 36–44.
[98] Taosheng Liu, Dylan Cable, and Justin L. Gardner. “Inverted Encoding Models of Human Population Response Conflate Noise and Neural Tuning Width”. In: Journal of Neuroscience 38.2 (2018), pp. 398–408.
[99] Yezheng Liu et al. “Generative Adversarial Active Learning for Unsupervised Outlier Detection”. In: Proceedings of the IEEE Transactions on Knowledge and Data Engineering. 2019, pp. 1–1.
[100] Erik K. St. Louis et al. Electroencephalography (EEG): An Introductory Text and Atlas of Normal and Abnormal Findings in Adults, Children, and Infants. American Epilepsy Society, 2016.
[101] Roy Luria et al. “The contralateral delay activity as a neural measure of visual working memory”. In: Neurosci. Biobehav. Rev. 62 (Mar. 2016), pp. 100–108.
[102] Wei Ma, Masud Husain, and Paul Bays. “Changing concepts of working memory”. In: Nature neuroscience 17 (Mar. 2014), pp. 347–56.
[103] Christopher D. Manning. “Computational Linguistics and Deep Learning”. In: Computational Linguistics 41.4 (2015), pp. 701–707.
[104] Michael Mannino, Yanjuan Yang, and Young Ryu. “Classification algorithm sensitivity to training data with non representative attribute noise”. en. In: Decis. Support Syst. 46.3 (Feb. 2009), pp. 743–751.
[105] Jasna Martinovic et al. “Neural mechanisms of divided feature-selective attention to colour”. en. In: Neuroimage 181 (Nov. 2018), pp. 670–682.
[106] Ben McCartney et al. “A zero-shot learning approach to the development of brain-computer interfaces for image retrieval”. In: PLOS ONE 14.9 (Sept. 2019), pp. 1–21.
[107] Risto Miikkulainen et al. “Chapter 15 - Evolving Deep Neural Networks”. In: Artificial Intelligence in the Age of Neural Networks and Brain Computing. Ed. by Robert Kozma et al. Academic Press, 2019, pp. 293–312.
[108] Jianliang Min, Ping Wang, and Jianfeng Hu. “Driver fatigue detection through multiple entropy fusion analysis in an EEG-based system”. In: PLOS ONE 12 (Dec. 2017), e0188756.
[109] Fumikazu Miwakeichi et al. “A comparison of non-linear non-parametric models for epilepsy data”. In: Computers in Biology and Medicine 31.1 (2001), pp. 41–57.
[110] Dirk van Moorselaar, Jan Theeuwes, and Christian N L Olivers. “In competition for the attentional template: can multiple items within visual working memory guide attention?” en. In: J. Exp. Psychol. Hum. Percept. Perform. 40.4 (Aug. 2014), pp. 1450–1464.
[111] J T Mordkoff and S Yantis. “An interactive race model of divided attention”. en. In: J. Exp. Psychol. Hum. Percept. Perform. 17.2 (May 1991), pp. 520–538.
[112] Thomas Naselaris et al. “Bayesian Reconstruction of Natural Images from Human Brain Activity”. In: Neuron 63.6 (2009), pp. 902–915.
[113] E. Nedelcu et al. “Artifact detection in EEG using machine learning”. In: 2017 13th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP). 2017, pp. 77–83.
[114] Petr Nejedly et al. “Intracerebral EEG Artifact Identification Using Convolutional Neural Networks”. In: Neuroinformatics 17 (Aug. 2018).
[115] Petr Nejedly et al. “Intracerebral EEG Artifact Identification Using Convolutional Neural Networks”. In: Neuroinformatics 17 (Aug. 2018).
[116] Andrea Nemcova et al. Brno University of Technology ECG Quality Database (BUT QDB). PhysioNet. 2021.
[117] Andrew Ng. Nuts and bolts of building AI applications using deep learning. NIPS Tutorial. 2016.
[118] Shinji Nishimoto et al. “Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies”. In: Current Biology 21.19 (2011), pp. 1641–1646.
[119] Sean Noah et al. “Neural Mechanisms of Attentional Control for Objects: Decoding EEG Alpha When Anticipating Faces, Scenes, and Tools”. In: Journal of Neuroscience 40.25 (2020), pp. 4913–4924.
[120] Seung-Hyeon Oh, Yu-Ri Lee, and Hyoung-Nam Kim. “A novel EEG feature extraction method using Hjorth parameter”. In: International Journal of Electronics and Electrical Engineering 2.2 (2014), pp. 106–110.
[121] Christian N L Olivers. “What drives memory-driven attentional capture? The effects of memory type, display type, and search type”. en. In: J. Exp. Psychol. Hum. Percept. Perform. 35.5 (Oct. 2009), pp. 1275–1291.
[122] Christian N L Olivers, Frank Meijer, and Jan Theeuwes. “Feature-based memory-driven attentional capture: visual working memory content affects visual attention”. en. In: J. Exp. Psychol. Hum. Percept. Perform. 32.5 (Oct. 2006), pp. 1243–1265.
[123] Christian N L Olivers et al. “Different states in visual working memory: when it guides attention and when it does not”. en. In: Trends Cogn. Sci. 15.7 (July 2011), pp. 327–334.
[124] Sławomir Opałka et al. “Multi-Channel Convolutional Neural Networks Architecture Feeding for Effective EEG Mental Tasks Classification”. In: Sensors 18.10 (2018).
[125] Alan V Oppenheim and Ronald W Schafer. “From frequency to quefrency: A history of the cepstrum”. In: IEEE signal processing Magazine 21.5 (2004), pp. 95–106.
[126] Eduard Ort, Johannes J Fahrenfort, and Christian N L Olivers. “Lack of free choice reveals the cost of having to search for more than one object”. en. In: Psychol. Sci. 28.8 (Aug. 2017), pp. 1137–1147.
[127] Eduard Ort and Christian N L Olivers. “The capacity of multiple-target search”. en. In: Vis. cogn. 28.5-8 (Sept. 2020), pp. 330–355.
[128] Eduard Ort et al. “Humans can efficiently look for but not select multiple visual objects”. en. In: Elife 8 (Aug. 2019).
[129] Deepak Pathak et al. “Context Encoders: Feature Learning by Inpainting”. In: CVPR. 2016.
[130] Alex Pentland. “Maximum likelihood estimation: The best PEST”. In: Perception & Psychophysics 28 (1980), pp. 377–379.
[131] F. Perrin et al. “Spherical splines for scalp potential and current density mapping”. In: Electroencephalography and Clinical Neurophysiology 72.2 (1989), pp. 184–187.
[132] S. Petrichella et al.
“Channel interpolation in TMS-EEG: A quantitative study towards an accurate topographical representation”. In: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Aug. 2016, pp. 989–992.
[133] S. Phadikar, N. Sinha, and R. Ghosh. “Automatic EEG eyeblink artefact identification and removal technique using independent component analysis in combination with support vector machines and denoising autoencoder”. In: IET Signal Processing 14.6 (2020), pp. 396–405.
[134] Luca Pion-Tonachini, Ken Kreutz-Delgado, and Scott Makeig. “ICLabel: An automated electroencephalographic independent component classifier, dataset, and website”. English. In: NeuroImage 198 (Sept. 2019), pp. 181–197.
[135] Lindsay Plater et al. “Revisiting the role of visual working memory in attentional control settings”. en. In: Vis. cogn. 30.5 (May 2022), pp. 318–338.
[136] U Pomper and U Ansorge. “Theta-Rhythmic Oscillation of Working Memory Performance”. In: Psychological Science 32.11 (2021), pp. 1801–1810.
[137] Nicolaas Prins and Frederick A. A. Kingdom. “Applying the Model-Comparison Approach to Test Specific Research Hypotheses in Psychophysical Research Using the Palamedes Toolbox”. In: Frontiers in Psychology 9 (2018).
[138] Xing Qian et al. “Brain-computer-interface-based intervention re-normalizes brain functional network topology in children with attention deficit/hyperactivity disorder”. In: Translational Psychiatry 8 (Aug. 2018), p. 149.
[139] Pedro Reis and Matthias Lochmann. “Using a Motion Capture System for Spatial Localization of EEG Electrodes.” In: Frontiers in Neuroscience 9 (Apr. 2015).
[140] R. F. Quick. “A vector-magnitude model of contrast detection.” In: Kybernetik 16 (1974), pp. 65–67.
[141] Michael Roberts et al. “Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans”. In: Nature Machine Intelligence 3 (Mar. 2021).
[142] Sheldon M Ross. Introductory statistics. Academic Press, 2017.
[143] Sari Saba-Sadiya, Tuka Alhanai, and Mohammad M Ghassemi. “Feature Imitating Networks”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022, pp. 4128–4132.
[144] Sari Saba-Sadiya, Eric Chantland, and Taosheng Liu. Decoding EEG from Passive Viewing. github.com/sari-saba-sadiya/DEPV. 2020.
[145] Sari Saba-Sadiya et al. “EEG Channel Interpolation Using Deep Encoder-decoder Networks”. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2020.
[146] Sari Saba-Sadiya et al. “Unsupervised EEG Artifact Detection and Correction”. In: Frontiers in Digital Health 2 (2020), p. 57.
[147] Sari Sadiya, Tuka Alhanai, and Mohammad Ghassemi. “Artifact Detection and Correction in EEG data: A Review”. In: Proceedings of the 10th International IEEE/EMBS Conference on Neural Engineering (NER). May 2021, pp. 495–498.
[148] Melissa Saenz, Giedrius T Buracas, and Geoffrey M Boynton. “Global effects of feature-based attention in human visual cortex”. en. In: Nat. Neurosci. 5.7 (July 2002), pp. 631–632.
[149] Robin Tibor Schirrmeister et al. “Deep learning with convolutional neural networks for EEG decoding and visualization”. In: Human Brain Mapping 38.11 (Aug. 2017), pp. 5391–5420.
[150] J Schoffelen and J Gross. “Source Connectivity Analysis With MEG and EEG”. In: Human Brain Mapping 30.6 (2009), pp. 1857–1865.
[151] B. Schölkopf et al. “Estimating the Support of a High-Dimensional Distribution”.
In: Neural Computation 13.7 (2001), pp. 1443–1471.
[152] Jeremy Schwark et al. “False feedback increases detection of low-prevalence targets in visual search”. en. In: Atten. Percept. Psychophys. 74.8 (Nov. 2012), pp. 1583–1589.
[153] V. S. Selvam and S. Shenbagadevi. “Brain tumor detection using scalp EEG with modified Wavelet-ICA and multi layer feed forward neural network”. In: 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. 2011, pp. 6104–6109.
[154] Nima Bigdely Shamlo et al. “EyeCatch: Data-mining over half a million EEG independent components to construct a fully-automated eye-component detector”. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (2013), pp. 5845–5848.
[155] C. E. Shannon. “Communication in the Presence of Noise”. In: Proceedings of the IRE 37.1 (1949), pp. 10–21.
[156] Claude E Shannon and Warren Weaver. The mathematical theory of communication. University of Illinois press, 1998.
[157] H Shin et al. “Quantitative EEG And Effect Of Hypothermia On Brain Recovery After Cardiac Arrest”. In: IEEE Transactions on Biomedical Engineering 53.6 (2006), pp. 1016–1023.
[158] Hyun-Chool Shin et al. “A subband-based information measure of EEG during brain injury and recovery after cardiac arrest”. In: IEEE Transactions on Biomedical Engineering 55.8 (2008), pp. 1985–1990.
[159] M.L. Shyu et al. A Novel Anomaly Detection Scheme Based on Principal Component Classifier. AD-a465 712. Miami Univ Coral Gables FL, Department of Electrical and Computer Engineering, 2003.
[160] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: International Conference on Learning Representations. 2015.
[161] Vorasith Siripornpanich et al. “Enhancing Brain Maturation Through a Mindfulness-Based Education in Elementary School Children: a Quantitative EEG Study”. In: Mindfulness 9 (Mar. 2018).
[162] B. Somers, T. Francart, and A. Bertrand. “A generic EEG artifact removal algorithm based on the multi-channel Wiener filter.” In: Journal of neural engineering 15.3 (2018), p. 036007.
[163] Anthony C.K. Soong et al. “Systematic comparisons of interpolation techniques in topographic brain mapping”. In: Electroencephalography and Clinical Neurophysiology 87.4 (1993), pp. 185–195.
[164] David Soto et al. “Early, involuntary top-down guidance of attention from working memory”. en. In: J. Exp. Psychol. Hum. Percept. Perform. 31.2 (Apr. 2005), pp. 248–261.
[165] Thomas C. Sprague, Geoffrey M. Boynton, and John T. Serences. “Inverted encoding models estimate sensible channel responses for sensible models”. In: bioRxiv (2019).
[166] CJ Stam, G Nolte, and A Daffertshofer. “Phase Lag Index: Assessment of Functional Connectivity From Multi Channel EEG and MEG With Diminished Bias From Common Sources”. In: Human Brain Mapping 28.11 (2007), pp. 1178–1193.
[167] John M Stern. Atlas of EEG patterns. Lippincott Williams & Wilkins, 2005.
[168] David Strayer, Frank Drews, and William Johnston. “Cell Phone-Induced Failures of Visual Attention During Simulated Driving”. In: Journal of experimental psychology. Applied 9 (Apr. 2003), pp. 23–32.
[169] Michael J Stroud et al. “Using the dual-target cost to explore the nature of search target representations”. en. In: J. Exp. Psychol. Hum. Percept. Perform. 38.1 (Feb. 2012), pp. 113–122.
[170] I. Sturm et al. “Interpretable deep neural networks for single-trial EEG classification”.
In: Journal of Neuroscience Methods 274 (2016), pp. 141–145.
[171] David W. Sutterer et al. “Item-specific delay activity demonstrates concurrent storage of multiple active neural representations in working memory”. In: PLOS Biology 17 (Apr. 2019), pp. 1–25.
[172] MC Tjepkema-Cloostermans et al. “A Cerebral Recovery Index (CRI) For Early Prognosis In Patients After Cardiac Arrest”. In: Critical Care 17 (2013), R252.
[173] Anne M. Treisman and Garry Gelade. “A feature-integration theory of attention”. In: Cognitive Psychology 12.1 (1980), pp. 97–136.
[174] PJ Uhlhaas and W Singer. “Abnormal Neural Oscillations and Synchrony in Schizophrenia”. In: Nature Reviews Neuroscience 11.2 (2010), pp. 100–113.
[175] Markus Ullsperger and Stefan Debener. Simultaneous EEG and fMRI: recording, analysis, and application. Oxford University Press, 2010, p. 132.
[176] Jose Antonio Urigüen and Begoña Garcia-Zapirain. “EEG artifact removal—state-of-the-art and guidelines”. In: Journal of Neural Engineering 12.3 (Apr. 2015), p. 031001.
[177] G. Vecchiato et al. “Enhance of theta EEG spectral activity related to the memorization of commercial advertisings in Chinese and Italian subjects”. In: 2011 4th International Conference on Biomedical Engineering and Informatics (BMEI). vol. 3. 2011, pp. 1491–1494.
[178] J Vibell et al. “Temporal order is coded temporally in the brain: early event-related potential latency shifts underlying prior entry in a cross-modal temporal order judgment task”. en. In: Journal of Cognitive Neuroscience 19.1 (Jan. 2007), pp. 109–120.
[179] I E J de Vries, J van Driel, and C N L Olivers. “Decoding the status of working memory representations in preparation of visual selection”. In: NeuroImage 191 (2019), pp. 549–559.
[180] Thaddeus S. Walczak and Sudhansu Chokroverty. “Electroencephalography, Electromyography, and Electro-Oculography. General Principles and Basic Technology”. In: Sleep Disorders Medicine. Elsevier Inc., Dec. 2009, pp. 157–181.
[181] Yi Wang et al. “Wide-Context Semantic Image Extrapolation”. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2019, pp. 1399–1408.
[182] Melissa R Warden and Earl K Miller. “Task-dependent changes in short-term memory in the prefrontal cortex”. en. In: J. Neurosci. 30.47 (Nov. 2010), pp. 15801–15810.
[183] Melissa R Warden and Earl K Miller. “The representation of multiple objects in prefrontal neuronal delay activity”. en. In: Cereb. Cortex 17 Suppl. 1 (Sept. 2007), pp. i41–50.
[184] Wen Wen et al. “Tracking Neural Markers of Template Formation and Implementation in Attentional Inhibition under Different Distractor Consistency”. In: Journal of Neuroscience 42.24 (2022), pp. 4927–4936.
[185] Alan Wolf et al. “Determining Lyapunov exponents from a time series”. In: Physica D: Nonlinear Phenomena 16.3 (1985), pp. 285–317.
[186] Jeremy Wolfe. “Guided Search 2.0: A revised model of visual search”. In: Psychonomic Bulletin and Review 1 (June 1994), pp. 202–238.
[187] Jeremy M Wolfe. “Guided Search 6.0: An updated model of visual search”. en. In: Psychon. Bull. Rev. 28.4 (Aug. 2021), pp. 1060–1092.
[188] Jeremy M Wolfe. “Visual attention: The multiple ways in which history shapes selection”. en. In: Curr. Biol. 29.5 (Mar. 2019), R155–R156.
[189] Jeremy M. Wolfe et al. “How fast can you change your mind? The speed of top-down guidance in visual search”. In: Vision Research 44.12 (2004). Visual Attention, pp. 1411–1426.
[190] Michael Wolff et al.
“Revealing hidden states in visual working memory using electroencephalography”. In: Frontiers in Systems Neuroscience 9 (2015).
[191] Jia Wu et al. “Hyperparameter optimization for machine learning models based on Bayesian optimization”. In: Journal of Electronic Science and Technology 17.1 (2019), pp. 26–40.
[192] G. Xu et al. “A Deep Transfer Convolutional Neural Network Framework for EEG Signal Classification”. In: IEEE Access 7 (2019), pp. 112767–112776.
[193] Yang Yang et al. “LaFIn: Generative Landmark Guided Face Inpainting”. In: arXiv preprint. 2019.
[194] Haoming Zhang et al. EEGdenoiseNet: A benchmark dataset for deep learning solutions of EEG denoising. 2020.
[195] X. Zhang et al. “Accelerating Very Deep Convolutional Networks for Classification and Detection”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 38.10 (2016), pp. 1943–1955.
[196] Yue Zhao, Zain Nasrullah, and Zheng Li. “PyOD: A Python Toolbox for Scalable Outlier Detection”. In: Journal of Machine Learning Research 20.96 (2019), pp. 1–7.
[197] Yue Zhao et al. “LSCP: Locally Selective Combination in Parallel Outlier Ensembles”. In: SDM. 2019.
[198] Zhidong Zhao and Yefei Zhang. “SQI Quality Evaluation Mechanism of Single-Lead ECG Signal Based on Simple Heuristic Fusion and Fuzzy Comprehensive Evaluation”. In: Frontiers in Physiology 9 (2018), p. 727.
[199] Xuyang Zhu et al. “Separated channel convolutional neural network to realize the training free motor imagery BCI systems”. In: Biomedical Signal Processing and Control 49 (2019), pp. 396–403.
[200] Elana Zion Golumbic et al. “Visual Input Enhances Selective Speech Envelope Tracking in Auditory Cortex at a “Cocktail Party””. In: Journal of Neuroscience 33.4 (2013), pp. 1417–1426.
[201] Igor Zyma et al. “Electroencephalograms during Mental Arithmetic Task Performance”. In: Data 4 (Jan. 2019).

APPENDIX A
DECODING TARGET-ABSENT AND FALSE ALARM TRIALS

There was no significant decoding for either one-cue or two-cue target-absent trials. The difference between the two conditions was similarly not significant. However, decoding accuracy in two-cue target-absent trials was higher than in one-cue target-absent trials. This corresponds to the behavioral results, namely the higher false-alarm rate in the two-cue condition.

Figure A.1 Decoding target-absent EEG trials. There were no significant decoding clusters for either the one-cue or two-cue target-absent trials.

Figure A.2 Decoding one-cue target-absent EEG trials. There were no significant decoding clusters for either the one-cue target-absent false-alarm or correct-reject trials.

APPENDIX B
DECODING SIMULATED NOISE

To ensure that our preprocessing does not inflate classifier accuracy, we simulated noise by producing a noise trial n_i for every real EEG trial x_i. The label of each noise trial corresponded to the label of the real trial, and for every electrode j we simulated the noise n_ij by sampling from a Gaussian distribution with the same mean and standard deviation as x_ij and smoothing the resulting signal using a moving mean (4-sample window). This is a stringent test: in non-EEG data the standard deviation and mean could potentially be meaningful for signal classification; however, this should not be the case in our EEG data (especially considering the colors are isoluminant). As can be seen in Figure B.1, there were no significant classification clusters for noise data generated using target-present correct one-cue trials.
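The simulation just described is straightforward to implement. The following is a minimal sketch, assuming epoched data in a (trials, electrodes, samples) array; the function name and the use of a same-length convolution for the moving mean are assumptions of this sketch rather than the exact code used for the analysis.

```python
import numpy as np

def simulate_noise_trials(X, window=4, rng=None):
    """X: array of shape (trials, electrodes, samples). Returns matched
    Gaussian noise trials; labels are inherited from the real trials."""
    rng = np.random.default_rng() if rng is None else rng
    # Per trial and electrode: sample noise with the matching mean and
    # standard deviation of the real signal.
    noise = rng.normal(loc=X.mean(axis=2, keepdims=True),
                       scale=X.std(axis=2, keepdims=True),
                       size=X.shape)
    # Smooth each simulated signal with a moving mean (4-sample window).
    kernel = np.ones(window) / window
    return np.apply_along_axis(lambda s: np.convolve(s, kernel, mode='same'),
                               axis=2, arr=noise)
```

The smoothing step matters: without it, the simulated trials would be white noise with no temporal autocorrelation, making the null test less representative of real EEG.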
Figure B.1 Classification of noise data generated using target-present correct one-cue trials.

APPENDIX C
MAHALANOBIS DISTANCE DECODING RESULTS

In the literature, EEG trials are always averaged before classification using Mahalanobis distance methods [184, 190]. For instance, researchers might divide the data into five equal parts, each with an equal number of trials from each label. For every label, the trials in each part are averaged to produce five averaged trials, and five-fold (leave-one-out) classification is then used to obtain decoding accuracy. While k-fold iteration is also the default for the ADAM LDA classifier, there the number of trials in each fold is greater than one, resulting in less noisy classification performance (Figure C.1, left). We produced a new division of trials before every five-fold loop, adding another layer of random selection and increasing the number of classifiers we train to 50 (10 different data divisions, each producing 5 folds).

Overall performance was comparable between the default LDA classifier and the Mahalanobis distance classifier we implemented, and the overall trend in the data remained the same. For instance, target-present correct decoding was significant earlier (and lasted longer) in the one-cue condition in comparison to the two-cue condition (Figure C.1, right). However, the added layer of averaging seems to mask the transient effects, and there was no longer a significant cluster of difference between the two cues.

Figure C.1 Left: Three iterations of Mahalanobis classification, each the result of a five-fold classification; in each iteration the trials were randomly assigned to different folds, and trials for each label were averaged, producing five trials per label in total. Right: Target-present correct results using 10 iterations of the 5-fold Mahalanobis classifiers.

APPENDIX D
BAYESIAN ANALYSIS

The default JASP prior of 0.707 on the Cauchy scale with a 95% credible interval was used for all Bayesian paired t-tests. The null hypotheses were that the hit rate, false-alarm rate, and d-prime did not differ across three-cue and no-cue trials. Bayes factors, error percentages, median effect sizes, and effect size credible intervals are provided in the table below. A follow-up robustness check demonstrates that the null hypothesis remains more likely than the alternative for a wide range of priors.

Measure being tested    BF01     Error %   Median effect size   Effect size credible interval
Hit rate                4.628    0.024     −0.022               [−0.398, 0.353]
False-alarm rate        2.879    0.026     0.187                [−0.191, 0.574]
d-prime                 2.122    0.026     −0.242               [−0.634, 0.14]

Table D.1 Results of Bayesian paired t-tests for hit rate, false-alarm rate, and d-prime equality in three-cue and no-cue trials.

Figure D.1 Bayes factor robustness check for the hit rate Bayesian paired t-test analysis.

Figure D.2 Bayes factor robustness check for the false-alarm rate Bayesian paired t-test analysis.

Figure D.3 Bayes factor robustness check for the d-prime Bayesian paired t-test analysis.
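For readers who wish to reproduce robustness checks like those above outside of JASP, the following is a minimal sketch of the Jeffreys–Zellner–Siow (JZS) Bayes factor that JASP reports for a paired t-test, evaluated across a range of Cauchy prior scales. The function name, the quadrature-based integration, and the placeholder t and n values are assumptions of this sketch, not a description of JASP's internals or of this study's statistics.

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t, n, r=0.707):
    """JZS Bayes factor (BF10) for a one-sample / paired t-test with a
    Cauchy(0, r) prior on the standardized effect size."""
    nu = n - 1
    # Marginal likelihood under H1: integrate over the g-prior that
    # corresponds to the Cauchy prior on effect size.
    def integrand(g):
        return ((1 + n * g) ** -0.5
                * (1 + t ** 2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi) ** -0.5 * r * g ** -1.5
                * np.exp(-r ** 2 / (2 * g)))
    m1, _ = integrate.quad(integrand, 0, np.inf)
    m0 = (1 + t ** 2 / nu) ** (-(nu + 1) / 2)  # marginal likelihood under H0
    return m1 / m0

# Robustness check: evaluate BF01 (the reciprocal of BF10) over a range
# of prior widths, as in Figures D.1-D.3. The t statistic and n here
# are placeholders, not the values from this experiment.
for r in np.linspace(0.1, 1.5, 8):
    print(f"r = {r:.2f}, BF01 = {1 / jzs_bf10(t=0.11, n=24, r=r):.3f}")
```

Sweeping the prior scale r in this way reproduces the shape of a Bayes factor robustness plot: if BF01 stays above 1 across the sweep, the conclusion in favor of the null does not hinge on the particular default prior.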