VOCAL STYLE FACTORIZATION FOR EFFECTIVE SPEAKER RECOGNITION IN AFFECTIVE SCENARIOS By Morgan Lee Sandler A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Master of Science 2023 ABSTRACT The accuracy of automated speaker recognition is negatively impacted by change in emotions in a person’s speech. In this thesis, we hypothesize that speaker identity is composed of various vocal style factors that may be learned from unlabeled speech data and re-combined using a neural network architecture to generate holistic speaker identity representations for affective scenarios. In this regard we propose the E-Vector neural network architecture, composed of a 1-D CNN for learning speaker identity features and a vocal style factorization technique for determining vocal styles. Experiments conducted on the MSP-Podcast dataset demonstrate that the proposed architecture improves state-of-the-art speaker recognition accuracy in the affective domain over baseline ECAPA-TDNN speaker recognition models. For instance, the true match rate at a false match rate of 1% improves from 27.6% to 46.2%. Additionally, we provide an analysis between speaker recognition match scores and emotions to identify challenging affective scenarios. To my family, for their support of my ambitions and their everlasting love. iii ACKNOWLEDGMENTS I would like to express my sincere gratitude to the following individuals and organizations who have provided support and assistance throughout the course of this research: My academic advisor, Dr. Arun Ross, for his unwavering guidance, advice, and encouragement throughout the project. His expertise and insights have been invaluable to the success of this research. The faculty and staff of the Department of Computer Science & Engineering at Michigan State University (MSU), for their support, resources, and collaborative environment that fostered my growth as a researcher. The National Association of Broadcasters (NAB) and the National Science Foundation Center for Identification Technology Research (NSF CITeR), for their financial support that made this research possible. My iPRoBe colleagues and friends, Dr. Anurag Chowdhury, Sushanta Pani, Dr. Renu Sharma, Melissa Dale, Pegah Varghaei, Parisa Farmanifard, Ryan Ashbaugh, Debasmita Pal, Katie Albus, Protichi Basak, Dr. Raul Quispe-Abad, Shivangi Yadav, Dr. Thomas Swearingen, Sai Ramesh, and the countless others whom my interactions and discussions challenged my ideas and inspired me to grow. Dr. Parisa Kordjamshidi and Dr. Qiben Yan for their commitment to aiding in the preparation of this thesis. Mom, Dad, Samantha, Brandon, and Danielle, for their unwavering love and support that kept me motivated and grounded throughout this journey. In all, I could not have begun this journey alone, and my community has played a crucial role in shaping my research and shaping me into the person I have become. Thank you all. Morgan iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1: Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2: Recognition in Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3: Automatic Speaker Recognition . . . . . . . . . . . . . . . . . . . . . . 
. . . 2 1.4: Affective Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5: Our Primary Research Question . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.6: Research Contributions and Thesis Organization . . . . . . . . . . . . . . . . . 5 CHAPTER 2: VOCAL STYLE FACTORIZATION FOR EFFECTIVE SPEAKER RECOGNITION IN AFFECTIVE SCENARIOS . . . . . . . . . . . . . 7 2.1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2: Learning Emotion Independent Speaker Identity Features . . . . . . . . . . . . 8 2.3: Extracting Vocal Style Factors via Global Style Tokens (GSTs) . . . . . . . . . 10 2.4: Proposed E-Vector Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.5: Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6: MSP-Podcast Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.7: Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.8: Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.9: Impact of Vocal Style Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.10: Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.11: Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 CHAPTER 3: ANALYSIS THROUGH EMOTION CATEGORIES . . . . . . . . . . . 22 3.1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2: Visualizations of E-Vector Speaker Verification Match Scores in the Context of Emotion Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3: Quantifying Emotion in E-Vector and ECAPA-TDNN Embeddings . . . . . . . 23 3.4: Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 CHAPTER 4: THESIS CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . 33 4.1: Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2: Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 v LIST OF TABLES Table 2.1 An overview of speech-based feature representations in speaker recognition and affective computing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Table 2.2 E-Vector comparison to baseline models. E4 (E-Vector) performs best across all metrics in our study across three separate test sets. . . . . . . . . . . . . . . . 17 Table 2.3 Comparison across three trained E-Vector models with 5, 10, 20 VSFs. Each model is trained to 105,000 steps with the MSP-Podcast train set data. . . . . . . 20 Table 3.1 ER denotes 8-class emotion recognition. Metric is f-score. In this scenario, we would like a lower f-score value. This may indicate that the speaker identity models not do encode the emotion in the identity representation itself. The confusion matrices corresponding to this analysis may be found in the Appendix 28 Table 3.2 E-Vector/Vox+Librispeech 4-class emotion recognition accuracy. Comparison of Hierarchical Classifier to baseline SVM model. Metric is standard accuracy– higher is better. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 vi LIST OF FIGURES Figure 1.1 Identity may be decomposed into vocal style factors. . . . . . . 
. . . . . . . . . 5 Figure 2.1 The proposed E-Vector architecture. The 1-D CNN takes speech frames (de- tailed in Section 2.1) as input and outputs a vector of 40 features. These fea- tures are then passed to a reference encoder, which extracts a fixed-dimension embedding representing the speaker identity features. The vocal style fac- tor layer decomposes the identity embedding into a fixed number of factors (e.g., 10 factors) via multi-head attention (8 heads). Finally, the weighted combination of these factors yields a final speaker identity embedding. . . . . . 11 Figure 2.2 E-Vector Training Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Figure 2.3 E-Vector Testing Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Figure 2.4 Match score distributions for E-Vector and ECAPA-TDNN baselines on three test sets of MSP-Podcast: test set 1, test set 2, validation set. Note that the validation set was not used in the training stage. . . . . . . . . . . . . . . . . . 18 Figure 2.5 Detection Error Trade-off (DET) curves comparing E-Vector with the baseline experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Figure 2.6 Detection Error Trade-off (DET) curves comparing E-Vector with varying factor sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Figure 3.1 Inter/Intra-Emotion Speaker Recognition Match Scores Visualized in the Valence-Arousal Emotion Space. . . . . . . . . . . . . . . . . . . . . . . . . . 24 Figure 3.2 E-Vector evaluated on MSP-Podcast Test 1 Set, Intra-Inter Emotion Speaker Verification Experiment Match Scores . . . . . . . . . . . . . . . . . . . . . . 25 Figure 3.3 E-Vector evaluated on MSP-Podcast Test 2 Set, Intra-Inter Emotion Speaker Verification Experiment Match Scores . . . . . . . . . . . . . . . . . . . . . . 26 Figure 3.4 E-Vector evaluated on MSP-Podcast Validation Set, Intra-Inter Emotion Speaker Verification Experiment Match Scores . . . . . . . . . . . . . . . . . . . . . . 27 Figure 3.5 Our E-Vector Speech Emotion Recognition Experiment Setup . . . . . . . . . . 31 Figure 3.6 The hierarchical classifier setup. First classifier is a binary classifier that discriminates a given emotion class from the rest. Those which do not belong to the first class are classified into the remaining three by the second layer. . . . 31 vii Figure 3.7 E-Vector/Vox+Librispeech Speech Emotion Recognition Experiments. Top Row all use the SVM classifier, Bottom row all use the Hierarchical Classifier. High recognition accuracy is not achieved in the emotion recognition domain inferring that E-Vector does not significantly encode emotion information. All cells represent percentages. . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Figure A.1 E-Vector/MSP-Podcast SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross val- idation. Mean f-score = 0.18 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Figure A.2 ECAPA-TDNN/VoxCeleb1+2 SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.25 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . 39 Figure A.3 ECAPA-TDNN/MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.20 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Figure A.4 ECAPA-TDNN/Vox+MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.31 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Figure A.5 E-Vector/MSP-Podcast SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross val- idation. Mean f-score = 0.18 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Figure A.6 ECAPA-TDNN/VoxCeleb1+2 SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.25 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 viii Figure A.7 ECAPA-TDNN/MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.20 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure A.8 ECAPA-TDNN/Vox+MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.31 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 ix CHAPTER 1 INTRODUCTION 1.1 Biometrics Throughout recorded human history, identifying people for various purposes has been com- monplace [19]. This can involve listening to someone’s style of speech to identify them, or using sophisticated cameras to recognize individuals at a distance based on their walking patterns. Bio- metrics is the science concerned with recognizing individuals based on their physical or behavioral attributes [22]. With the invention of modern-day computers and advances in the field of artificial intelligence (AI), systems have been proposed for automatically recognizing individuals based on their biometric traits. Some examples of these traits include, but are certainly not limited to, voice, fingerprint, face, iris, DNA, and gait. Each trait (also referred to as a modality) has its advantages and disadvantages. For example, fingerprints have many discriminable attributes, are easy to collect, and rarely change over long periods of time [22]. 
Voice, on the other hand, is a combination of behavioral and physiological attributes, is cost-effective and easy to implement, but its performance may be significantly affected by various factors such as age, style of speaking, emotion, and medical conditions [20]. A biometric trait is typically chosen based on the following criteria [22]:
• Universality: Does everyone have it?
• Uniqueness: Is the trait unique to an individual?
• Permanence: Will this trait change over time?
• Measurability: Can it be measured with sensors?
• Performance: Is the recognition and computational performance sufficient for practical use?
• Acceptability: Will the target population accept common use of this trait?
• Circumvention: Can bad actors easily alter, obfuscate, spoof, or mimic this trait?
The voice is a desirable biometric due to its wide range of applications and relatively low implementation cost, often only requiring a microphone sensor and accompanying computer. Its applications are diverse, ranging from forensics and security [33] to personal assistants [3], medical diagnostics [30], and even imitative voice synthesis technology [48].
1.2 Recognition in Biometrics
Biometric recognition can be divided into two separate functionalities: verification and identification. In verification, the user claims an identity, and the system verifies if the claim is correct (genuine, match) or not (impostor, non-match). In the verification process, a similarity match score is computed from a pair of biometric features. If the score is equal to or above a predetermined threshold, it is labeled as genuine or a match; otherwise, it is labeled as an impostor or a non-match. In other words, the verification scenario involves a 1-to-1 test between two biometric samples (e.g., two speech audio samples) to determine if the identity is the same or not. In contrast to the verification scenario, identification involves iterating over a database of enrolled users and identifying the user based on the presented biometric trait; there is no claim to an identity. Instead of resulting in a match or non-match, the result will either be a ranked list of matching identities or an empty set suggesting there were no matches in the database. In other words, it is a 1-to-N test over identity-labeled and registered biometric samples in the system. This thesis will primarily focus on the verification scenario.
1.3 Automatic Speaker Recognition
Automatic Speaker Recognition involves the use of software to automatically identify or verify the identity of a speaking individual based on their voice characteristics. There are two primary modes of speaker recognition: text-dependent and text-independent [22]. Text-dependent speaker recognition systems typically require the use of a "passcode" or a known phrase, and require textual content to make an identity decision. In contrast, text-independent speaker recognition does not have this requirement and relies on paralinguistic aspects of speech. Both systems typically use five primary modules [22]:
1. An enrollment system to add identities to the database.
2. A sensor module to collect and record the biometric data (e.g., microphone).
3. A feature extraction module to extract salient numerical features from the input biometric data.
4. A matching module to compare two biometric samples (or multiple pairs) based on the extracted features and generate match score(s).
5. A database module to store all biometric information.
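The verification functionality described in Section 1.2 ultimately reduces to the matching and decision steps carried out by modules 3-5 above. The following is a minimal sketch of that final step, assuming fixed-length voice embeddings have already been extracted; the extractor is left abstract, and the threshold value is only a placeholder, not a recommended operating point.

```python
import numpy as np

def cosine_match_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Similarity match score between two fixed-length voice embeddings."""
    denom = np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-12
    return float(np.dot(emb_a, emb_b) / denom)

def verify(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.5) -> bool:
    """1-to-1 verification: declare a genuine match if the score meets the threshold."""
    return cosine_match_score(emb_a, emb_b) >= threshold

def identify(probe: np.ndarray, gallery: dict) -> list:
    """1-to-N identification: rank enrolled identities by match score."""
    scores = {name: cosine_match_score(probe, emb) for name, emb in gallery.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage with stand-in embeddings.
rng = np.random.default_rng(0)
probe = rng.standard_normal(256)
gallery = {"speaker_a": rng.standard_normal(256), "speaker_b": rng.standard_normal(256)}
print(verify(probe, gallery["speaker_a"]), identify(probe, gallery))
```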
There are many factors which may affect the performance of an automatic speaker recognition system. These include, but are not limited to, background noise, sample quality, audio distortion, and emotion [6]. In this thesis, we will study the effects of emotion on text-independent speaker recognition by deliberately selecting affective (emotion) datasets to explore the dependencies between speaker identity and emotion. Since the datasets already contain 16 kHz audio waveforms and the corresponding identities of the speakers, it is appropriate to skip the enrollment and sensor modules in our experimentation.
1.4 Affective Computing
In order to propose a method which addresses the problem of emotion in speaker recognition, we study the recent literature in affective computing. Affective computing seeks to develop the capabilities of recognizing, responding to, or imitating human emotions [38]. Speech contains a wealth of affective information, including tone, pitch, and intonation, which can be analyzed to determine the speaker's affective state [38]. Literature in speech-based affective computing typically involves a mixture of psychological domain knowledge and computational methods to extract relevant features [16]. Within affective computing there are many areas of study, such as emotion recognition [45] or emotional voice conversion [51]. In this work, we study automatic speaker recognition in the context of varying affective scenarios.
Recent research in speech-based affective computing has aimed to improve the accuracy and robustness of emotion recognition and voice synthesis systems. Voice synthesis is concerned with modeling the vocal properties of a target speaker. The latest deep learning-based research in this domain indicates that identity is encoded within the embeddings generated by neural networks [49]. With the availability of new large affective speech datasets such as MSP-Podcast [28], researchers have been able to study whether deep learning methods [23] can automatically extract high-level affective independent features from speech. Traditional methods like Gaussian mixture modeling (GMM), support vector machines (SVM), and hidden Markov models (HMM) [38] have also been used. Modeling methods with hidden/latent mechanisms are a natural fit for representing emotion. Emotion is typically understood as a latent state which changes according to outside stimuli [38].
Choosing which emotions to use in experimentation is an active area of research in psychology and cognitive science [26]. The principle of cathexis in psychology [38] theorizes that if more mental energy is used to portray an emotion, then it will be easier to perceive it in a voice. For example, an actor portraying sadness with a microphone may be perceived as being more sad than the average person speaking to their parents on the phone. The mental biasing of emotion affects the perceptibility of the emotion involved. Acting has been the primary method for capturing emotion in recent literature and datasets [2, 4], but there are now more efforts to obtain natural emotion corpora to study the effects of cathexis in speech, and to train computational models which reflect typical human affective scenarios. By studying these natural emotion scenarios, we may be able to identify typical emotion-related issues that surface in speaker recognition.
There are two primary computational models of emotion [18, 39]: discrete and continuous.
Discrete emotions typically encapsulate a broad range of vocal styles such as anger, happiness, sadness, and surprise, while continuous emotion models attempt to quantify an exact point in a multi-dimensional emotion space using axes such as valence, arousal, and dominance. Continuous models acknowledge that there are many variations within an emotion, such as cold anger versus hot anger, which should be modeled implicitly. In contrast, discrete models of emotion are typically employed by practitioners who are less concerned about the specific nature of the emotion and more interested in whether it falls into a certain category, such as happy or not. In this work, we evaluate our speaker recognition system through the lens of both emotion models.
[Figure 1.1 diagram: speaker identity depicted as a combination of Style 1, Style 2, ..., Style N.]
Figure 1.1 Identity may be decomposed into vocal style factors.
1.5 Our Primary Research Question
Can we improve speaker recognition accuracy under varying affective scenarios? To answer this question, we hypothesize that speaker identity is composed of various vocal style factors. Vocal style factors may include emotion, identity features, or a composition of both (viz., Figure 1.1). We propose a neural network architecture that uses a "vocal style factor" layer which learns, in an unsupervised manner, the factors which compose speaker identity. We hypothesize that a combination of these factors provides a holistic representation of speaker identity in varying affective scenarios.
1.6 Research Contributions and Thesis Organization
The thesis contributions can be summarized as follows:
1. Viewing speaker identity as a composition of many vocal style factors.
2. A proposed solution called E-Vector (Emotion Scenario Vector), an automatic speaker recognition model based on neural networks, which improves recognition accuracy compared to a state-of-the-art ECAPA-TDNN model [13].
3. An extensive set of experiments to establish the advantages of E-Vector over baseline models.
4. Analysis to determine if E-Vector embeddings capture emotion or not.
The rest of the thesis is organized as follows. Chapter 2 introduces our E-Vector architecture and the speaker recognition experiments conducted in various affective scenarios. Chapter 3 provides further insight and analysis of the E-Vector embeddings by conducting emotion recognition experiments to determine whether the embeddings capture emotion or not as a byproduct of the proposed method. Chapter 4 concludes the thesis and offers future directions for further research on this topic.
CHAPTER 2
VOCAL STYLE FACTORIZATION FOR EFFECTIVE SPEAKER RECOGNITION IN AFFECTIVE SCENARIOS
2.1 Introduction
Human speech is composed of various speaking styles that convey conversational semantics through emotions, tone, and other paralinguistic cues. Paralinguistics refer to the non-lexical aspects of speech, such as intonation, pitch, and rhythm, that convey additional meaning beyond the words spoken [37]. For instance, the tone of voice can make a significant difference in conveying a speaker's conviction or lack thereof, even when the words are the same. In addition, different argumentative styles of speaking may vary based on the speaker's origin, language, culture, and other factors. These variations in human speech can challenge state-of-the-art speaker recognition systems [7, 35]. Such systems generally analyze two speech samples and determine whether they correspond to the same individual or not.
In this work, we hypothesize that human speech is a superimposition of multiple vocal style factors and that these factors contribute to both the speaker identity as well as the speaker emotions. By decomposing an input speech sample into individual vocal style factors and then learning how to combine them in order to model speaker identity, we alleviate the negative impact of emotions on speaker recognition.
Previous state-of-the-art techniques in speaker recognition (see also Table 2.1) have focused mainly on neutral tones, failing to capture the affective scenario, where emotions modulate the speaking style. While state-of-the-art neural network models can be fine-tuned and adapted to affective scenarios using affective datasets, their accuracy remains underwhelming for many affective scenarios. In certain situations, such as those related to national security, it is imperative for an automated speaker recognition system to confidently and accurately identify individuals, especially in the affective scenario. This serves as the motivation for our work. In this work, we leverage a voice synthesis technology referred to as Global Style Tokens (GST) [48] to learn vocal style factors and a 1-D CNN to learn identity features directly from raw audio data. The learned vocal style factors are analogous to basis vectors of a style space. We learn these factors from training speech samples in an unsupervised manner while training the speaker recognition system with identity labels. This core method, combined with affective training data and a triplet loss function [47], ensures that we learn a comprehensive speaker identity representation in the context of affective scenarios.
2.2 Learning Emotion Independent Speaker Identity Features
Previous studies have examined the relationship between emotion and speaker identity [1, 31, 35]. These studies found that the Equal Error Rate (EER) of speaker recognition is affected by variations in the speaker's emotional state. The authors of [36] proposed a method for estimating the reliability of speaker recognition models based on their emotion content. Emotion recognition models that use adversarial training to separate emotion and speaker identity to achieve speaker-invariance have been proposed [27]. Several methods have been proposed to mitigate affective content and achieve emotion-invariance [25, 32, 43]. In contrast to these works, our proposed method uses unsupervised learning of vocal styles, which form the basis of the style space in which we theorize identity belongs. Instead of learning speaker identity directly from speech samples, we first learn vocal styles and then learn speaker identity as a superimposition of those styles. In the following sections, we will provide more details about our approach to achieving emotion-invariance in speaker recognition.
2.2.1 Speech Preprocessing
A Voice Activity Detector [41] is used to remove non-speech parts of an input audio sample. If a sample is longer than 2 seconds, then we split it into many 2-second non-overlapping audio samples. The 2-second speech audio samples are then framed and windowed into multiple smaller audio segments which we refer to as speech units. Using a Hamming window of length 20 ms and stride 10 ms, each speech unit of 20 ms sampled at 16000 Hz is represented by an audio vector of length 320. The sliding window extracts a speech unit every 10 ms from the 2-second audio sample, which yields approximately 200 speech units per 2-second audio sample.
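The windowing recipe above (and the stacking into 320 × 200 speech frames described in the next paragraph) can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: it assumes a 16 kHz mono waveform that has already passed voice activity detection, and it simply drops any trailing audio shorter than 2 seconds, which the text does not specify.

```python
import numpy as np

SR = 16_000               # sampling rate (Hz)
WIN = int(0.020 * SR)     # 20 ms Hamming window -> 320 samples per speech unit
HOP = int(0.010 * SR)     # 10 ms stride between speech units
SEG = 2 * SR              # 2-second non-overlapping segments

def speech_frame(segment: np.ndarray) -> np.ndarray:
    """Slice one 2-second segment into overlapping, windowed speech units and
    stack them column-wise into a 320 x ~200 'speech frame' (Section 2.2.1)."""
    window = np.hamming(WIN)
    units = [segment[s:s + WIN] * window
             for s in range(0, len(segment) - WIN + 1, HOP)]
    return np.stack(units, axis=1)          # shape: (320, ~200)

def frames_from_audio(audio: np.ndarray) -> list:
    """Split a voice-activity-filtered waveform into 2-second segments and
    convert each to a speech frame; any trailing remainder is dropped here."""
    return [speech_frame(audio[i:i + SEG])
            for i in range(0, len(audio) - SEG + 1, SEG)]

frames = frames_from_audio(np.random.randn(5 * SR))   # stand-in 5-second waveform
print(len(frames), frames[0].shape)                    # 2 frames of shape (320, 199)
```

With a 320-sample window and 160-sample hop, each 2-second segment yields 199 units, i.e., the "approximately 200" speech units noted above.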
The speech units are then stacked horizontally to form a two-dimensional speech representation that we refer to as a speech frame. The dimension of this frame is 320 × 200. The extracted speech frames are then organized into triplets which are input to the E-Vector model. A single triplet consists of a positive sample, a negative sample, and an anchor sample. The positive and anchor samples share the same speaker identity label, whereas the negative and anchor samples do not.
2.2.2 Speech Feature Extraction through a 1-D CNN
We use a 1-D CNN to learn domain-relevant features directly from speech frames rather than from explicitly extracted features. In contrast, many popular methods use handcrafted features, such as the mel-frequency cepstrum (MFC) or linear predictive coding (LPC). However, handcrafted features may favor certain domains over others by their inherent design. For example, MFC is used to model the spectral shape of sound; therefore, it may not capture enough information to disambiguate between two emotions with similar sounds. Affective feature sets such as the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) have been proposed to address this problem. Recent speaker recognition literature has demonstrated the efficacy of learning features directly from audio, but none of these methods have targeted affective speech directly. A table of popular feature representations and their intended use is shown in Table 2.1. In this work, we learn a 40-dimensional feature set directly from speech frames. This approach has previously been demonstrated [9] to retrieve relevant short- and long-term behavioral features that are pertinent to speaker identity. A 1-D CNN is also beneficial for our deep learning architecture because it can be trained end-to-end along with the GST network (see Section 2.3) to maximize model learning. The exact architecture setup used is depicted in Figure 2.1.
Paper | Feature Category | Feature Name | Intended Use
Davis and Mermelstein (1980) [10] | Short-term spectral | Mel-Frequency Cepstral Coefficients (MFCC) | Voice Perception
Hermansky (1990) [21] | Short-term spectral | Perceptual Linear Predictive (PLP) Analysis | Voice Perception
Mammone et al. (1996) [29] | Short-term spectral | Linear Predictive Coding (LPC) | Voice Production
Eyben et al. (2010) [17] | Affective | OpenSMILE Toolkit | Affective Speech
Dehak et al. (2010) [11] | Speaker Identity | i-vector | Speaker Identity
Eyben et al. (2015) [15] | Affective | Geneva Minimalistic Acoustic Parameter Set (GeMAPS) | Affective Speech
Ravanelli et al. (2018) [40] | Short-term spectral | SincNet | NN-based Speaker Identity
Snyder et al. (2018) [44] | Short-term spectral | X-Vector | NN-based Speaker Identity
Desplanques et al. (2020) [13] | Short-term spectral | ECAPA-TDNN | NN-based Speaker Identity
Chowdhury et al. (2020) [9] | Short-term spectral | DeepVOX | NN-based Speaker Identity in Noisy Environments
Chowdhury et al. (2021) [5] | Short-term & Long-term spectral | DeepTalk | NN-based Vocal Style Transfer
This work | Short-term & Long-term Affective spectral | E-Vector | NN-based Affective Speaker Identity
Table 2.1 An overview of speech-based feature representations in speaker recognition and affective computing.
The architecture consists of six 1-D dilated convolution layers [8], separated by scaled exponential linear unit (SELU) activation functions [24].
The one-dimensional filters are designed to extract features from individual speech units within a speech frame, rather than across them, based on the assumption that the speaker-dependent characteristics within each unit are independent from those in other units within the same frame. To this end, each 320-dimensional speech unit is processed by a series of 1-D dilated convolutional layers, resulting in 40 filter responses that form the short-term spectral representation for that unit. This combination of dilation and scaled activation ensures a larger receptive field, allowing the model to learn sparse relationships between the feature values within a speech unit, which leads to significant performance benefits. Moreover, this method has been shown to be effective in bypassing noise and other small perturbations present in the audio waveform [9]. The larger receptive field facilitates learning of relevant high-dimensional affective information at the feature-extraction level.
2.3 Extracting Vocal Style Factors via Global Style Tokens (GSTs)
Text-to-speech is a challenging inverse problem of speech recognition. The goal is to generate human-like speech from a provided text script. There are many complications that arise when deciding what that voice sounds like. For instance, the system must produce a natural voice with vocal style, gender, a particular tone, emotions, and other desirable attributes. Many people convey semantic meaning through their conversations using tone, vocal style, and emotion. Therefore, in text-to-speech, many techniques aim to learn speaker characteristics. With recent advances in deep learning, voice synthesis methods have demonstrated success in learning speaker characteristics in order to generate natural-sounding voices. For this reason, we use Global Style Tokens (GST), a voice synthesis technique, as our vocal style factor decomposition method.
The GST method was introduced in [48]. It was proposed for use in a text-to-speech system to encode a vocal style embedding from speech samples. In this method, a voice synthesizer is trained using a concatenated text and vocal style embedding. The vocal style embedding is learned through a multi-head attention module that calculates the similarities between a speaker identity embedding learned from the speech sample and a bank of 10 embeddings, which we will refer to as "vocal style factors" or VSFs. A weighted combination of VSFs produces the style embedding, which is joined with a text embedding to provide a precise representation of speaker characteristics to a vocoding network. GSTs have yielded promising results in the emotional voice conversion (EVC) literature [51] and have demonstrated the ability to encode pitch contours that are vital to vocal style in human speech [5]. Therefore, we incorporate a similar approach into our E-Vector architecture. In our E-Vector model, we swap the reconstruction loss with a generalized end-to-end loss [47]. This ensures that we will learn speaker identity in the context of distinguishing genuine and impostor pairs for speaker verification.
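To make the mechanism concrete, the following is a simplified PyTorch sketch of such a vocal style factor layer: a 128-dimensional reference embedding (Section 2.4) attends over a small bank of learned style tokens (10 factors and 8 attention heads in the proposed model), and their weighted combination is returned. The token dimensionality, the tanh squashing, and the final concatenation into a 256-dimensional representation are my assumptions based on the GST formulation and the dimensions quoted in Section 2.4; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class VocalStyleFactorLayer(nn.Module):
    """Simplified GST-style layer: a bank of learnable vocal style factors (VSFs)
    is attended over, with the reference embedding acting as the query."""
    def __init__(self, ref_dim=128, num_factors=10, num_heads=8, factor_dim=128):
        super().__init__()
        # Learnable style tokens (the "vocal style factors"); factor_dim is assumed.
        self.factors = nn.Parameter(torch.randn(num_factors, factor_dim))
        self.query_proj = nn.Linear(ref_dim, factor_dim)
        self.attn = nn.MultiheadAttention(embed_dim=factor_dim, num_heads=num_heads,
                                          kdim=factor_dim, vdim=factor_dim,
                                          batch_first=True)

    def forward(self, ref_emb):                          # ref_emb: (batch, 128)
        q = self.query_proj(ref_emb).unsqueeze(1)        # (batch, 1, factor_dim)
        tokens = torch.tanh(self.factors)                 # squash token values
        kv = tokens.unsqueeze(0).expand(ref_emb.size(0), -1, -1)
        style, _ = self.attn(q, kv, kv)                   # weighted sum of factors
        return style.squeeze(1)                           # (batch, factor_dim)

layer = VocalStyleFactorLayer()
ref = torch.randn(4, 128)                                 # stand-in reference embeddings
style = layer(ref)
evector = torch.cat([ref, style], dim=-1)                 # assumed 256-dim representation
print(evector.shape)                                      # torch.Size([4, 256])
```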
2.4 Proposed E-Vector Architecture
[Figure 2.1 diagram: the 1-D CNN consists of six 1-D dilated convolution layers with SELU activations (output channels 2, 4, 8, 16, 32, 40; kernel sizes 5, 5, 7, 9, 11, 11; dilations 2, 2, 3, 4, 5, 5). The reference encoder is a 6-layer 2-D CNN with filter sizes 32, 32, 64, 64, 128, 128 followed by a GRU with 128 units, producing a 128-dimensional reference embedding. The vocal style factor layer applies multi-head attention over the vocal style factors, and a weighted sum yields the speaker identity embedding.]
Figure 2.1 The proposed E-Vector architecture. The 1-D CNN takes speech frames (detailed in Section 2.1) as input and outputs a vector of 40 features. These features are then passed to a reference encoder, which extracts a fixed-dimension embedding representing the speaker identity features. The vocal style factor layer decomposes the identity embedding into a fixed number of factors (e.g., 10 factors) via multi-head attention (8 heads). Finally, the weighted combination of these factors yields a final speaker identity embedding.
[Figure 2.2 diagram: positive, anchor, and negative speech frames are each passed through E-Vector to obtain identity embeddings, which feed a triplet loss.]
Figure 2.2 E-Vector Training Setup.
[Figure 2.3 diagram: speech frames from sample A and sample B are each passed through E-Vector; a cosine similarity matcher compares the identity embeddings to produce a match score.]
Figure 2.3 E-Vector Testing Setup.
In this work, we propose E-Vector as a supervised learning architecture (supervised with identity labels, but not with emotion or other affective labels) designed for speaker recognition in varying affective scenarios. E-Vector consists of two networks trained end-to-end. The first network is a 1-D CNN which learns a 40-dimensional feature set directly from speech frames, while a GST network, which consists of a 2-dimensional CNN and a Gated Recurrent Unit, produces 128-dimensional reference embeddings. It does so by learning 10 vocal style factor embeddings that serve as basis vectors for a learned vocal style space that encapsulates the affective styles inherent in the data. The final 256-dimensional speaker representation learned is referred to as an E-Vector. The E-Vector architecture is illustrated in Figure 2.1. The structure of the training and testing of the architecture are illustrated in Figures 2.2 and 2.3, respectively.
2.5 Loss Functions
2.5.1 Additive Angular Margin Loss (AAM)
Additive Angular Margin (AAM) loss, often referred to as ArcFace loss [12], is a popular loss function proposed to learn geodesic distances between identity embeddings on a hypersphere. The margin hyperparameter enforces compactness between genuine pairs and more distance between impostor pairs to a higher degree than traditional softmax loss. The formulation is as follows:
L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(\cos(\theta_{y_i} + m))}}{e^{s(\cos(\theta_{y_i} + m))} + \sum_{j=1,\, j \neq y_i}^{n} e^{s \cos\theta_j}}    (2.5.1)
Here, N is the size of the batch, n denotes the number of identities, s is the radius of the hypersphere, m is the additive angular margin penalty between the feature vector and the weight vector, and \theta is the angle between the weight and feature vector.
2.5.2 Generalized End-to-End Loss (GE2E)
Generalized End-to-End (GE2E) loss [47] is specifically derived for speaker verification and is designed to contribute more loss for impostor pairs that may frequently false-match. It does this by identifying the most similar-to-genuine false-match speaker as its basis for maximizing the distance between the genuine embedding and the impostor embedding.
The genuine term in the GE2E loss corresponds to a genuine match between the input embedding vector and its true speaker label. Conversely, the impostor term corresponds to an impostor match between the input embedding vector and the speaker label with the highest similarity to genuine among all false speakers. In prior literature, GE2E loss has demonstrated effectiveness and does not require an initial example selection stage that loss functions such as Tuple End-To-End Loss may use [47]. Its effectiveness at identifying hard speaker verification samples, which are common in affective speaker recognition, is why we use this loss in our work.
L = \sum_{i=1}^{N} \Big[ \alpha_i \, d(x_i, y_i)^2 + (1 - \alpha_i) \max_{j \neq i} d(x_i, y_j)^2 \Big]    (2.5.2)
Here, x_i is the genuine embedding for utterance i, y_i is the ground-truth embedding for utterance i, \alpha_i is the weight assigned to each utterance, d(x_i, y_i) is the distance between a genuine pair, and d(x_i, y_j) is the distance between the genuine embedding and the impostor embedding y_j.
2.6 MSP-Podcast Dataset
MSP-Podcast [28] is a natural emotion dataset derived from podcasts and labeled through crowd-sourced consensus voting. Each sample is labeled with a discrete emotion (e.g., 'Happy') as well as continuous emotion dimensions (valence, arousal, dominance) using a 5-point Likert scale. There are 10 different labels for the discrete emotion categories, and the continuous emotion dimension values range between 0 and 5. The MSP-Podcast dataset contains a total of 73,042 speaking turns, equivalent to 110 hours of speech. The dataset is divided into several sets for training and evaluation purposes. Test set 1 contains segments from 60 speakers (30 female and 30 male), totaling 15,326 segments. Test set 2 consists of 5,037 segments randomly selected from 100 podcasts. These segments are not included in any other partition. The Validation set includes segments from 44 speakers (22 female and 22 male), totaling 7,800 segments. Note that we use the validation set as another testing set, and its data is not seen during the training process. The Train set contains the remaining speech samples, totaling 38,179 segments. In this work, we use the speaker identity labels for our training process. For evaluation purposes, we use the Test 1, Test 2, and Validation sets—each with its own unique composition—to evaluate our proposed E-Vector model.
2.7 Experiment Setup
2.7.1 Models and Baselines
Four models are trained for the speaker verification task: ECAPA-TDNN/Vox (denoted as E1), ECAPA-TDNN/MSP (denoted as E2), ECAPA-TDNN/Vox+MSP (denoted as E3), and E-Vector (denoted as E4). E1, E2, and E3 serve as the baseline models to evaluate different training configurations of state-of-the-art speaker recognition systems and to determine their efficacy in the affective scenario. E1 is a popular state-of-the-art ECAPA-TDNN [13] speaker recognition model pre-trained on the VoxCeleb 1 & 2 datasets. E2 also uses the ECAPA-TDNN architecture, but we substitute the AAM loss with the GE2E loss; further, the network is trained from scratch using the MSP-Podcast train set. Our third baseline, E3, uses the pre-trained weights from E1, but we fine-tune the last layer (1.5M parameters) using the MSP-Podcast train set. E4 is the proposed E-Vector approach and is trained with the MSP-Podcast train set. The specifics of training the four models are detailed in the next section.
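As described in Section 2.2.1 and Figure 2.2, training consumes triplets of speech frames built from the identity labels: the anchor and positive share a speaker, while the negative does not. The following is a minimal sketch of that sampling step with hypothetical names; it is not the authors' data loader, and the real pipeline may select triplets differently (e.g., with hard-negative mining).

```python
import random
from collections import defaultdict

def make_triplets(samples, num_triplets, seed=0):
    """Draw (anchor, positive, negative) index triplets from identity labels.
    `samples` is a list of (frame_id, speaker_id) pairs; anchor and positive
    share a speaker, the negative comes from a different speaker."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for idx, (_, speaker) in enumerate(samples):
        by_speaker[speaker].append(idx)
    # Only speakers with at least two frames can supply anchor/positive pairs.
    eligible = [s for s, idxs in by_speaker.items() if len(idxs) >= 2]
    triplets = []
    for _ in range(num_triplets):
        speaker = rng.choice(eligible)
        anchor, positive = rng.sample(by_speaker[speaker], 2)
        other = rng.choice([s for s in by_speaker if s != speaker])
        negative = rng.choice(by_speaker[other])
        triplets.append((anchor, positive, negative))
    return triplets

toy = [("f0", "spkA"), ("f1", "spkA"), ("f2", "spkB"), ("f3", "spkB"), ("f4", "spkC")]
print(make_triplets(toy, 3))
```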
2.7.2 Training Details
All four models are trained for the speaker verification task, and training is stopped if there are no empirically significant decreases in the training loss over a window of 20,000 steps. All copies of the trained models and their respective training/evaluation code are available via email request or on our GitHub website (viz., Section 2.11). All models were trained over 5 days with an Nvidia RTX 2080 Ti GPU.
2.7.2.1 ECAPA-TDNN/Vox (E1)
The E1 pre-trained weights are accessed from HuggingFace [50], while the training data is sourced from the VoxCeleb1 and VoxCeleb2 datasets. The authors trained this model for 12 epochs and fit 22.2M trainable parameters using an AAM loss function. No MSP-Podcast train data is used in the E1 training process. This model serves as an off-the-shelf speaker verification model for comparison to E-Vector.
2.7.2.2 ECAPA-TDNN/MSP (E2)
The E2 model uses the ECAPA-TDNN architecture, but the training process uses the GE2E loss function (same as E-Vector) rather than the AAM loss. Using the data from the MSP-Podcast train set, 22.2M parameters are learned over 7,500 steps (where overfitting is observed). Overfitting is to be expected, as the number of model parameters is large in relation to the quantity of data available. Even with this limitation, we find that this model still achieves competitive results in our affective scenario.
2.7.2.3 ECAPA-TDNN/Vox+MSP (E3)
The E3 model uses the same pre-trained weights as E1. To adapt the model to the target domain, we fine-tuned the last layer (1.5M trainable parameters) using the MSP-Podcast training set. E3 employs the AAM loss function. This model represents a popular transfer-learning method, used here to determine whether we can solve the affective scenario by adapting pre-existing models.
2.7.2.4 E-Vector (E4)
E4 is our proposed method. Using 10 vocal style factors, it is trained for 105,000 steps with 959.8K trainable parameters using the GE2E loss function. Data from the MSP-Podcast train set is used in the training process.
2.7.3 Model Evaluation
To evaluate our three baseline models and E-Vector, we compute the following metrics: Equal Error Rate (EER), Minimum Detection Cost Function (minDCF), D-Prime (d'), Area Under Curve (AUC), True Match Rate at a specified False Match Rate (TMR @ FMR 1% and 10%), Detection Error Tradeoff (DET) curves, and match score distributions.
2.7.3.1 minDCF Metric Formulation
The minDCF metric allows us to study the costs associated with the speaker recognition system. The formulation is as follows:
DCF(\tau) = C_{miss} P_{miss}(\tau) P_{target} + C_{FM} P_{FM}(\tau) (1 - P_{target})    (2.7.1)
where \tau is a match decision threshold, C_{miss} is the cost of a missed verification error, C_{FM} is the cost of a false match error, P_{target} is the prior probability of the target speaker occurring in the data, P_{miss}(\tau) is the probability of a miss verification at a given threshold, and P_{FM}(\tau) is the probability of a false match at a given threshold. minDCF is the minimization of this cost function. We evaluate minDCF at C_{miss} = 10, C_{FM} = 1. This is because, in forensic/security scenarios, it is preferable to flag a candidate match than to miss a potential match entirely.
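A small illustrative sketch of Eq. (2.7.1) and its minimization over thresholds follows. The cost values follow the text above, while the target prior P_target is not specified in this section and is only a placeholder; this is not the authors' evaluation code.

```python
import numpy as np

def min_dcf(genuine_scores, impostor_scores, c_miss=10.0, c_fm=1.0, p_target=0.01):
    """Sweep decision thresholds over the observed scores and return the minimum
    detection cost per Eq. (2.7.1). c_miss=10 and c_fm=1 follow Section 2.7.3.1;
    p_target is a placeholder assumption."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    best = np.inf
    for t in np.unique(np.concatenate([genuine, impostor])):
        p_miss = np.mean(genuine < t)     # genuine pairs rejected at threshold t
        p_fm = np.mean(impostor >= t)     # impostor pairs accepted at threshold t
        dcf = c_miss * p_miss * p_target + c_fm * p_fm * (1.0 - p_target)
        best = min(best, dcf)
    return best

rng = np.random.default_rng(1)
print(min_dcf(rng.normal(0.7, 0.1, 1000), rng.normal(0.3, 0.1, 1000)))
```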
Model | Trainable Params | Test Set | EER | minDCF | TMR@FMR = {1%, 10%} | D' | AUC
E1 | 22.2M | Val | 0.34 | 0.094 | 14.7%, 40.7% | 0.83 | 0.72
E1 | | Test 1 | 0.33 | 0.096 | 12.0%, 39.9% | 0.90 | 0.74
E1 | | Test 2 | 0.29 | 0.094 | 14.0%, 44.2% | 1.12 | 0.78
E2 | 1.5M | Val | 0.37 | 0.099 | 7.15%, 30.0% | 0.68 | 0.68
E2 | | Test 1 | 0.35 | 0.098 | 8.92%, 33.7% | 0.73 | 0.70
E2 | | Test 2 | 0.27 | 0.089 | 20.7%, 52.5% | 1.18 | 0.80
E3 | 22.2M | Val | 0.25 | 0.090 | 19.4%, 55.1% | 1.35 | 0.83
E3 | | Test 1 | 0.22 | 0.080 | 27.6%, 65.0% | 1.53 | 0.86
E3 | | Test 2 | 0.19 | 0.078 | 30.5%, 68.7% | 1.68 | 0.88
E4 | 959.8K | Val | 0.13 | 0.054 | 55.5%, 84.0% | 2.14 | 0.94
E4 | | Test 1 | 0.20 | 0.061 | 46.2%, 72.0% | 1.58 | 0.87
E4 | | Test 2 | 0.15 | 0.068 | 38.7%, 77.1% | 1.95 | 0.91
Table 2.2 E-Vector comparison to baseline models. E4 (E-Vector) performs best across all metrics in our study across three separate test sets.
2.7.3.2 Speaker Verification Testing Details
In the MSP-Podcast dataset, there are 109.8M speaker verification pairs in test set 1, 10.7M pairs in test set 2, and 28.8M pairs in the validation set. Due to the large quantity of pairs available, we store all results and evaluate metrics on a smaller, uniformly sampled subset of the results. Computing the full speaker verification tests takes 27 hours on test set 1, 6 hours on test set 2, and 16 hours on the validation set. All experiments were conducted on an Nvidia GeForce 2080 Ti GPU. The approximate size of the reduced evaluation sets is 548K pairs for test set 1 (approx. 0.5% of computed match scores), 107K pairs for test set 2 (approx. 1% of computed match scores), and 287K pairs for the validation set (approx. 1% of computed match scores). All speaker verification results are publicly available by email request or on our GitHub. Attributes available in the results table include the emotion of sample A, the emotion of sample B, the identity of sample A, the identity of sample B, the speaker verification match score, the continuous emotion dimensions of samples A and B, and the gender of samples A and B.
2.8 Results
We note that E-Vector learns certain categories of affective states. This is reflected in the bimodal genuine distribution which can be observed in the E-Vector test set 1 plot in Figure 2.4. The bimodal distribution may be a product of overcoming challenges caused by emotion modulation. That may be why we observe a similar, but less pronounced, bimodal distribution in the corresponding test set 1 genuine distribution in the ECAPA-TDNN/MSP experiment.
[Figure 2.4: twelve match-score distribution panels, one per model (E-Vector, ECAPA-TDNN/Vox, ECAPA-TDNN/MSP, ECAPA-TDNN/Vox+MSP) and evaluation set (Test Set 1, Test Set 2, Validation Set).]
Figure 2.4 Match score distributions for E-Vector and ECAPA-TDNN baselines on three test sets of MSP-Podcast: test set 1, test set 2, validation set. Note that the validation set was not used in the training stage.
We also conclude that fine-tuning ECAPA-TDNN with MSP-Podcast train data does not work as intended, as it finds a niche local optimum, perhaps due to the insufficient quantity of training data available. Also, E-Vector improves the recognition EER from 0.22 in E3 to 0.20, and the TMR@FMR 1% accuracy from 27.6% in E3 to 46.2%.
[Figure 2.5: DET curve panels for Test Set 1, Test Set 2, and the Validation Set.]
Figure 2.5 Detection Error Trade-off (DET) curves comparing E-Vector with the baseline experiments.
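The headline numbers in Table 2.2 (EER and TMR at a fixed FMR) can be computed directly from the genuine and impostor match score sets, and the DET curve is the set of (FMR, FNMR) points swept over thresholds. The following is a small illustrative sketch on synthetic scores, not the authors' evaluation code; the quantile-based threshold is an approximation.

```python
import numpy as np

def tmr_at_fmr(genuine, impostor, fmr=0.01):
    """True match rate at the decision threshold whose false match rate is ~fmr."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    threshold = np.quantile(impostor, 1.0 - fmr)   # top `fmr` fraction of impostors still match
    return float(np.mean(genuine >= threshold))

def eer(genuine, impostor, num_thresholds=1000):
    """Approximate equal error rate: the point where FMR and FNMR cross."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    ts = np.linspace(min(genuine.min(), impostor.min()),
                     max(genuine.max(), impostor.max()), num_thresholds)
    fmr = np.array([(impostor >= t).mean() for t in ts])   # false matches
    fnmr = np.array([(genuine < t).mean() for t in ts])    # false non-matches (misses)
    i = int(np.argmin(np.abs(fmr - fnmr)))
    return float((fmr[i] + fnmr[i]) / 2.0)

rng = np.random.default_rng(2)
gen, imp = rng.normal(0.7, 0.15, 5000), rng.normal(0.2, 0.15, 5000)
print(tmr_at_fmr(gen, imp, 0.01), eer(gen, imp))
```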
2.9 Impact of Vocal Style Factors
In this analysis, we perform three experiments on the E-Vector architecture with varying numbers of vocal style factors to determine the impact of the number of factors on speaker recognition. We chose 5, 10, and 20 factors and trained each model for 105,000 steps with the MSP-Podcast train set data. The 5-factor model is denoted E5, the 10-factor model E6, and the 20-factor model E7. Each model takes 47 hours to train with an Nvidia RTX 2080 Ti. The results are shown in Table 2.3. In addition, the DET curves are shown in Figure 2.6. Generally, we find that the TMR at FMR 1% improves across all test sets with the addition of more factors. Since speaker identity in our model is composed of many discrete vocal style factors and there may be potentially an infinite number of speaking styles, it is not practical to have one factor for every possible style. Even with the limited range of vocal style factors tested, we observe a significant increase in performance compared to the baseline ECAPA-TDNN models. That said, it is not clear that more factors are always better, and variation may exist across cultures, dialects, etc. Therefore, additional study into this hyperparameter is necessary.
Model | Number of VSFs | Test Set | EER | minDCF | TMR@FMR = {1%, 10%} | D' | AUC
E5 | 5 | Val | 0.13 | 0.049 | 60.0%, 84.9% | 2.19 | 0.94
E5 | | Test 1 | 0.22 | 0.062 | 46.4%, 70.5% | 1.61 | 0.87
E5 | | Test 2 | 0.16 | 0.068 | 39.8%, 76.7% | 1.93 | 0.91
E6 | 10 | Val | 0.13 | 0.054 | 55.5%, 84.0% | 2.14 | 0.94
E6 | | Test 1 | 0.20 | 0.061 | 46.2%, 72.0% | 1.58 | 0.87
E6 | | Test 2 | 0.15 | 0.068 | 38.7%, 77.1% | 1.95 | 0.91
E7 | 20 | Val | 0.12 | 0.048 | 60.8%, 86.2% | 2.21 | 0.94
E7 | | Test 1 | 0.21 | 0.062 | 46.9%, 70.0% | 1.62 | 0.87
E7 | | Test 2 | 0.17 | 0.068 | 40.8%, 76.1% | 1.89 | 0.91
Table 2.3 Comparison across three trained E-Vector models with 5, 10, and 20 VSFs. Each model is trained to 105,000 steps with the MSP-Podcast train set data.
[Figure 2.6: DET curve panels for Test Set 1, Test Set 2, and the Validation Set.]
Figure 2.6 Detection Error Trade-off (DET) curves comparing E-Vector with varying factor sizes.
2.10 Summary
Learning speaker identity in voice has traditionally been a 1-step process—training a model to learn speaker identity from speech samples. This has worked effectively in most neutral speech scenarios, but in the dynamic, affective scenario, the performance sharply degrades. In this work, we propose a 2-step process: first, learning global styles of speech patterns based on thousands of speaker identities, and second, representing speaker identity as combinations of those learned vocal style patterns. The E-Vector model architecture incorporates this 2-step view of speaker identity. The first step learns similarities (via multi-head attention) between thousands of people's speaking patterns and creates embeddings of those vocal styles. The advantage of the architecture is that it is not required to label those speaking patterns, although the patterns should be salient in the training data. This enables E-Vector to learn vocal patterns that we perhaps have not yet discovered or considered. Then, in the second step, the E-Vector model learns speaker identity, but only as a weighted combination of the aforementioned vocal styles. By modeling identity as a composition of learned vocal style factors, we find that our proposed method, E-Vector, outperforms state-of-the-art ECAPA-TDNN baseline models on the task of speaker recognition in affective scenarios where emotions have an impact on the speaking style. In addition, we explore the relationship between the number of vocal style factors used in training and the eventual performance.
With 20 VSFs in the E-Vector architecture, we are able to obtain a TMR of 46.9% at an FMR of 1% on MSP-Podcast test set 1, which is 19.3 percentage points higher than the best ECAPA-TDNN baseline. For future research, we propose a cross-dataset E-Vector speaker recognition experiment to incorporate acted emotion datasets such as IEMOCAP [2] and CREMA-D [4]. By doing so, we can gain a deeper understanding of speaker recognition in acted affective contexts through neural network models trained on natural emotion data (MSP-Podcast). Furthermore, conducting a comparative analysis between natural and acted emotions, as represented by these distinct datasets, could provide insightful observations about the range and variability of emotion expression.
2.11 Reproducibility
All training, evaluation, and analysis code and experiment results will be made available on our GitHub repository (https://github.com/morganlee123/evector). Trained models and embeddings have large file sizes and will be made available upon request (Email: sandle20@msu.edu or msandler8@gmail.com).
CHAPTER 3
ANALYSIS THROUGH EMOTION CATEGORIES
3.1 Introduction
E-Vector has demonstrated its efficacy in capturing speaker identity characteristics in affective speech samples. This is based on our hypothesis that speaker identity can be decomposed into vocal style factors (VSFs) and that those factors can be approximated through an unsupervised learning method. To better understand the E-Vector speaker identity embeddings, we visualize the speaker verification match scores computed in the previous chapter in the context of the underlying emotion categories for both discrete and continuous emotion models. We also conduct speech emotion recognition experiments to quantify any affective information encoded in the embeddings. We split this experiment into two tasks. In the first task, we re-use the E-Vector and ECAPA-TDNN models trained on the affective data from the MSP-Podcast train set to extract speaker identity embeddings. After extracting embeddings, we train an SVM classifier to assign each embedding to one of 8 discrete emotion categories. Our second task explores the effects of larger training corpora on the E-Vector architecture. We train E-Vector with primarily neutral emotion speech samples from the Librispeech, VoxCeleb 1, and VoxCeleb 2 datasets. We then perform a Speech Emotion Recognition (SER) experiment to map speaker identity embeddings to 4 emotion categories using affective data from the CREMA-D, IEMOCAP, and MSP-Podcast datasets. The following sections will detail the experimental methods, data partitions, results, and conclusions from these analyses.
3.2 Visualizations of E-Vector Speaker Verification Match Scores in the Context of Emotion Categories
In this analysis, we plot the speaker verification match scores in Figure 3.1 from the models trained in the previous chapter (E1-E4) in the valence-arousal space to explore speaker verification along the continuous emotion axes. In the figure, each symbol represents an emotion-emotion pairing (e.g., happy-neutral, happy-happy, etc.), with the color of the circle indicating the match score. The color scale is on the right-hand side of the figure. There are a total of 110 circles derived from 10 emotion categories (happy, sad, etc.), thereby creating 45 inter-emotion genuine scores, 45 inter-emotion impostor scores, 10 intra-emotion genuine scores, and 10 intra-emotion impostor scores. In general, inter-emotion speaker verification tends to be a more challenging scenario in which to verify identity.
Therefore, we focus on the improvement in inter-emotion speaker verification match scores. We found that the E-Vector model has more inter-emotion categories scoring higher values in genuine pairs (which is good), with inter-emotion impostor scores being lower than those of the other models (which is also good). This was previously noted as a bimodal genuine distribution apparent in the match score distributions (see Figure 2.4). In general, there are no specific value ranges in the arousal-valence space indicating that any emotion category provides higher or lower speaker verification performance. In the discrete emotion analysis in Figures 3.2, 3.3, and 3.4, we find that, across all emotion categories, there are no specific emotions responsible for the increase in recognition accuracy found in the previous chapter. However, it could be said that the emotions which depend on textual semantic information (sad, disgust, contempt) perform worse than ones that do not typically require it (anger, happy). In general, in this analysis we find that all affective scenarios in E-Vector have better separability between the genuine and impostor distributions (viz., Figure 2.4).
[Figure 3.1: twelve valence-arousal panels, one per model (E-Vector, ECAPA-TDNN/Vox, ECAPA-TDNN/MSP, ECAPA-TDNN/Vox+MSP) and evaluation set (Test Set 1, Test Set 2, Validation Set).]
Figure 3.1 Inter/Intra-Emotion Speaker Recognition Match Scores Visualized in the Valence-Arousal Emotion Space.
3.3 Quantifying Emotion in E-Vector and ECAPA-TDNN Embeddings
3.3.1 Task 1: E-Vector and ECAPA-TDNN MSP-Podcast Speech Emotion Recognition
To analyze the affective content in the speaker recognition model embeddings, we take the pre-trained ECAPA-TDNN and E-Vector models from the previous chapter (pre-trained on MSP-Podcast or VoxCeleb 1+2) and perform 5-fold cross-validated (CV) emotion recognition, using speaker identity embeddings extracted from the validation set, together with their labeled emotion categories, to train the speech emotion recognition (SER) model. Emotion recognition (ER) here refers to an eight-class recognition problem: given an input embedding encoding identity and potentially emotion, assign it to a discrete emotion category (in this case, sad, happy, neutral, anger, surprise, disgust, fear, and contempt). To train the models, a multi-layer perceptron (MLP) is used with 3 layers of 100 nodes each and ReLU activation functions for 300 epochs. The MLP uses the Adam optimizer. We chose this classifier empirically. For evaluating the emotion recognition models, we use test set 1 from MSP-Podcast. In the validation set used for training, there exists a significant imbalance of emotion categories; therefore, we randomly under-sample based on the emotion with the least number of available samples (approximately 130 per class) to balance the distribution during the training process. The evaluation sets are not balanced, and we employ the f-score metric.
3.3.2 Experiment Results
Previously, we hypothesized that E-Vector encodes speaker identity as a combination of vocal style factors. To investigate whether emotion information is also encoded in the representation, we conducted the aforementioned speech emotion recognition (SER) experiments and evaluated the results on the MSP-Podcast test 1 set. The results imply that E-Vector may not significantly encode emotion in the identity representation; emotion may be partially factored out through the weighted combination of VSFs, resulting in a lower f-score. In this scenario, a lower f-score is desirable, under the assumption that reduced emotion content trades off in favor of identity information. In contrast, ECAPA-TDNN, while still producing low f-scores, tends to achieve higher emotion recognition f-scores than the E-Vector model. Table 3.1 summarizes the metrics from this experiment.

Model                 ER (Chance is 0.12)   ER w/ Undersampling (Chance is 0.12)
E-Vector              0.18 +/- 0.03         0.18 +/- 0.03
ECAPA-TDNN/Vox        0.25 +/- 0.03         0.25 +/- 0.03
ECAPA-TDNN/MSP        0.20 +/- 0.03         0.20 +/- 0.03
ECAPA-TDNN/Vox+MSP    0.31 +/- 0.03         0.31 +/- 0.03
Table 3.1 ER denotes 8-class emotion recognition. Metric is f-score. In this scenario, a lower f-score is preferable; it may indicate that the speaker identity models do not encode emotion in the identity representation itself. The confusion matrices corresponding to this analysis may be found in the Appendix.

3.3.3 Task 2: E-Vector/Vox+Librispeech Speech Emotion Recognition Experiments
For the second task, we use an E-Vector model trained on the LibriSpeech, VoxCeleb 1, and VoxCeleb 2 datasets. This model extracts fixed 256-dimensional speaker embeddings from two datasets: CREMA-D and MSP-Podcast. A simple model (we found a multi-class Support Vector Machine (SVM) to be sufficient for this purpose, selected via Auto-Tuned Models [46]) is trained with these embeddings as input and the corresponding sample emotions as the output labels; there is one emotion per audio sample. The four emotion categories for this work are Angry, Sad, Happy, and Neutral. We choose these four emotions because they are considered "basic" emotions that cover the most frequent human interactions [14]. We design two classifiers for this purpose. First, we implement a single 4-class SVM as our baseline experiment. Second, we employ a hierarchical SVM consisting of two SVMs in sequence: the first distinguishes the Sad emotion from the other categories, while the second distinguishes between the Angry, Happy, and Neutral categories. Our motivation for hierarchical classification is to determine whether there is a traversable emotion embedding space encoded in the speaker identity embedding. We use this hierarchical approach to first differentiate the most "challenging" emotion from the rest; for example, Sad is difficult to differentiate from the Neutral state using style features [34]. Therefore, in Experiment 3, we first distinguish Sad and then classify the remaining emotions, as sketched below.
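The following is a minimal sketch of this two-stage, Sad-first classifier, assuming 256-dimensional embeddings and integer emotion labels are already available. The array names, label encoding, and the reuse of the hyper-parameters reported later (Section 3.3.3.5) are illustrative assumptions rather than the thesis implementation.

```python
# Minimal sketch of the Sad-first hierarchical SVM. The label encoding below
# (0=Angry, 1=Happy, 2=Neutral, 3=Sad) and array names are placeholders.
import numpy as np
from sklearn.svm import SVC

SAD = 3

def fit_hierarchical(X_train, y_train, C=1000, gamma=0.1):
    # Stage 1: binary classifier, Sad vs. everything else.
    stage1 = SVC(kernel="rbf", C=C, gamma=gamma)
    stage1.fit(X_train, (y_train == SAD).astype(int))

    # Stage 2: 3-class classifier (Angry/Happy/Neutral), trained on non-Sad samples only.
    non_sad = y_train != SAD
    stage2 = SVC(kernel="rbf", C=C, gamma=gamma)
    stage2.fit(X_train[non_sad], y_train[non_sad])
    return stage1, stage2

def predict_hierarchical(stage1, stage2, X):
    preds = np.full(len(X), SAD)            # default to Sad
    is_other = stage1.predict(X) == 0       # stage 1 says "not Sad"
    if is_other.any():
        preds[is_other] = stage2.predict(X[is_other])
    return preds
```

Stage 1 routes only the samples it labels as non-Sad to stage 2, mirroring the layout of Figure 3.6.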
These speech emotion classifiers are illustrated in Figures 3.5 and 3.6.

3.3.3.1 Experiment 1 — CREMA-D (Acted Data)
CREMA-D [4] is an audio-visual dataset consisting of 7,442 original samples from 91 actors. These clips are from 48 male and 43 female actors between the ages of 20 and 74, with a diversity of actor race and ethnicity (African American, Asian, Caucasian, Hispanic, and Unspecified). Each actor performs 12 sentences, each spoken with one of six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad). The emotion annotations are crowd-sourced from a total of 2,443 participants. For this experiment, we use 2,800 training samples and 1,280 testing samples from four emotion classes: Anger, Happy, Neutral, and Sad. The emotion classes are balanced, and the train and test sets contain disjoint speakers.

3.3.3.2 Experiment 2 — MSP-Podcast (Natural Data)
MSP-Podcast is an audio-only dataset of naturalistic emotion. The dataset is generated from spontaneous recordings obtained from audio-sharing websites that permit such use, and the emotion annotation is done via crowd-sourcing. In this experiment, we use 6,400 samples for training and 1,800 samples for testing, chosen only from four emotion classes: Anger, Happy, Neutral, and Sad. These classes are balanced, and the train and test sets contain disjoint speakers.

3.3.3.3 Experiment 3 — CREMA-D Hierarchical Classifier
In this experiment, we build a hierarchical classifier from two SVMs. We train the first SVM on 1,440 samples from two emotion classes: Sad and Other. The second SVM is trained on 2,160 samples from three emotion classes: Anger, Happy, and Neutral. The combined two-stage classifier is trained on 2,880 samples and classifies 1,280 test samples into four emotion categories: Anger, Happy, Neutral, and Sad. All emotion classes are balanced, and speakers are disjoint across all train and test sets.

3.3.3.4 Feature Extraction
The E-Vector network trained with VoxCeleb 1, VoxCeleb 2, and Librispeech is used to extract 256-dimensional speaker embeddings for each utterance. Our implementation uses a frame length of 22,000 samples and a hop length of 220 samples which, at the implementation's sampling rate of approximately 22 kHz, corresponds to 1-second frames with a 10 ms hop. Any audio sample shorter than 1 second is discarded. After extracting the embeddings for each utterance, we compute the average speaker embedding $e_{avg} = \frac{1}{n}\sum_{i=1}^{n} e_i$ from the $n$ utterance embeddings $e_i$.

3.3.3.5 Classifier
In this work, we use SVMs with radial basis function (RBF) kernels. The hyper-parameters are tuned using GridSearchCV, and in all SVMs the optimal parameters were found to be C = 1000 and γ = 0.1. Although the cost value may appear high, it leads to better generalization in the testing stage of the experiment; cost values between 10 and 1000 yield approximately the same result (within +/- 0.04 overall weighted f-score).

3.3.3.6 Architecture
Our general architecture is illustrated in Figure 3.5. In Experiments 1 and 2, we use a single SVM layer. In Experiment 3, we use a two-stage hierarchical SVM classifier as shown in Figure 3.6.
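As an illustration of the extraction and tuning steps above, the sketch below frames an utterance into 1-second windows with a 220-sample hop, averages the per-frame embeddings, and tunes an RBF SVM with GridSearchCV. The `evector_embed` stub, file handling, sampling rate, and the parameter grid are assumptions for illustration, not the thesis code.

```python
# Minimal sketch of utterance-level embedding extraction and SVM tuning.
import numpy as np
import librosa
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

FRAME_LEN, HOP_LEN, SR = 22_000, 220, 22_000   # ~1 s frames, ~10 ms hop

def evector_embed(frame):
    # Placeholder: replace with a forward pass of the trained E-Vector network.
    return np.zeros(256, dtype=np.float32)

def utterance_embedding(path):
    audio, _ = librosa.load(path, sr=SR)
    if len(audio) < FRAME_LEN:
        return None                                  # utterances shorter than 1 s are discarded
    frames = librosa.util.frame(audio, frame_length=FRAME_LEN, hop_length=HOP_LEN)
    embeddings = np.stack([evector_embed(f) for f in frames.T])   # (n, 256)
    return embeddings.mean(axis=0)                   # e_avg = (1/n) * sum_i e_i

# RBF SVM hyper-parameter search; the grid values are illustrative.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [10, 100, 1000], "gamma": [0.01, 0.1, 1.0]},
                    cv=5)
# grid.fit(X_train, y_train)   # X_train: stacked utterance embeddings, y_train: emotion labels
```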
Figure 3.5 Our E-Vector Speech Emotion Recognition Experiment Setup. (Block diagram: input audio is passed through the E-Vector model, a 1-D triplet CNN with vocal style factorization, to produce frame-level speaker embeddings that are averaged; during training the averaged embedding and its emotion label train the SVM, and during testing the averaged embedding of an input audio sample is passed to the emotion classification model.)

Figure 3.6 The hierarchical classifier setup. The first classifier is a binary SVM that separates a given emotion class (Sad) from the rest; samples that do not belong to that class are assigned to one of the remaining three categories (Happy, Angry, Neutral) by the second, multiclass SVM.

3.3.3.7 Results
From the findings in all experiments (Figure 3.7 and Table 3.2), we conclude that E-Vector speaker embeddings do not significantly encode emotion as a byproduct of the training algorithm. Therefore, we maintain that E-Vector speaker embeddings primarily encode speaker identity.

Figure 3.7 E-Vector/Vox+Librispeech Speech Emotion Recognition Experiments. The top row of confusion matrices uses the SVM classifier and the bottom row uses the hierarchical classifier. High recognition accuracy is not achieved in the emotion recognition domain, indicating that E-Vector does not significantly encode emotion information. All cells represent percentages.

Algorithm                           CREMA-D   MSP-Podcast   IEMOCAP
SVM                                 66.5      43.3          57.5
Sad-First Hierarchical Classifier   68.7      36.8          58.3
Table 3.2 E-Vector/Vox+Librispeech 4-class emotion recognition accuracy. Comparison of the hierarchical classifier to the baseline SVM model. Metric is standard accuracy; higher is better.

3.4 Summary
Through the extensive speech emotion recognition experiments conducted, we find that E-Vector embeddings may not contain significant emotion information, especially when trained only on affective data (MSP-Podcast). In addition, we evaluated the emotion content of embeddings from a model trained on large corpora of speech data (VoxCeleb 1 & 2 and Librispeech). We still find that E-Vector primarily encodes speaker identity, and perhaps not emotion, as a byproduct of its training algorithm. This demonstrates the robustness of the E-Vector model and the effectiveness of the vocal style factor technique.

CHAPTER 4
THESIS CONCLUSIONS AND FUTURE WORK

4.1 Research Contributions
1. This thesis explores the hypothesis that speaker identity is composed of various vocal style factors that may be decomposed and recombined. In this regard, a neural network model named E-Vector is developed and implemented.
2. We find that E-Vector decomposes input speech into multiple vocal style factors and can combine these factors to provide a more holistic representation of speaker identity in various affective scenarios.

4.2 Future Work
In recent literature, there has been a focus on improving speaker recognition models by learning the model weights from multiple modalities, such as text and audio, and then using only audio during the inference stage for identity assessment [42]. Therefore, in the affective domain, further work is necessary to obtain larger speech datasets with associated text transcripts to harness these capabilities. In addition to utilizing more modalities in the training process, a semi-supervised training scheme may be employed. Currently, the E-Vector architecture does not use any direct supervision via emotion labels to tune its weights; in other words, it is tuned solely by the GE2E loss [47]. Future work could add an "emotion" loss alongside the GE2E loss: the model would predict an emotion label, and the discrepancy with the true label would adjust the loss and the weights of the vocal style factors so as to further disentangle emotion and identity.
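Purely as an illustration of this future-work idea (not part of the thesis implementation), a combined objective could take the form L = L_GE2E + λ · L_emotion. The sketch below is hypothetical: the auxiliary head, the weighting factor λ, and whether the emotion term is applied as direct supervision or adversarially (e.g., via gradient reversal) to remove emotion information are all open design choices.

```python
# Hypothetical sketch of a joint objective: a GE2E-style identity loss plus an
# auxiliary emotion term. `ge2e_loss` stands in for an existing GE2E implementation.
import torch.nn as nn
import torch.nn.functional as F

class EmotionAuxHead(nn.Module):
    """Small auxiliary classifier that predicts an emotion label from a speaker embedding."""
    def __init__(self, embed_dim=256, num_emotions=8):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_emotions)

    def forward(self, embeddings):
        return self.fc(embeddings)

def joint_loss(embeddings, emotion_logits, emotion_labels, ge2e_loss, lam=0.1):
    # embeddings: (speakers, utterances, embed_dim), as expected by a GE2E-style loss.
    # emotion_logits / emotion_labels: per-utterance predictions and integer targets.
    l_id = ge2e_loss(embeddings)                           # identity objective
    l_emo = F.cross_entropy(emotion_logits, emotion_labels)
    # The sign/weighting of the emotion term is a design choice (supervised vs. adversarial).
    return l_id + lam * l_emo
```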
BIBLIOGRAPHY

[1] Zakaria Aldeneh and Emily Mower Provost. You’re not you when you’re angry: Robust emotion features emerge by recognizing speakers. IEEE Transactions on Affective Computing, 2021.
[2] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, et al. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359, 2008.
[3] C-Nedelcu. Talk to ChatGPT. https://github.com/C-Nedelcu/talk-to-chatgpt.
[4] Houwei Cao, David G. Cooper, Michael K. Keutmann, et al. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4):377–390, 2014.
[5] A. Chowdhury, A. Ross, and P. David. DeepTalk: Vocal style encoding for speaker recognition and speech synthesis. In ICASSP, 2021.
[6] Anurag Chowdhury. Automated Speaker Recognition in Non-ideal Audio Signals Using Deep Neural Networks. Michigan State University, 2021.
[7] Anurag Chowdhury, Austin Cozzo, and Arun Ross. Domain adaptation for speaker recognition in singing and spoken voice. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7192–7196. IEEE, 2022.
[8] Anurag Chowdhury and Arun Ross. Fusing MFCC and LPC features using 1D triplet CNN for speaker recognition in severely degraded audio signals. IEEE Transactions on Information Forensics and Security, 15:1616–1629, 2019.
[9] Anurag Chowdhury and Arun Ross. DeepVox: Discovering features from raw audio for speaker recognition in degraded audio signals, 2020.
[10] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366, 1980.
[11] Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19:788–798, 2011.
[12] Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):5962–5979, 2022.
[13] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In Helen Meng, Bo Xu, and Thomas Fang Zheng, editors, Interspeech, pages 3830–3834. ISCA, 2020.
[14] Paul Ekman. Basic emotions. Handbook of Cognition and Emotion, 98(45-60):16, 1999.
[15] Florian Eyben, Klaus R. Scherer, Björn W. Schuller, et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190–202, 2015.
[16] Florian Eyben, Klaus R. Scherer, Björn W. Schuller, et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190–202, 2016.
[17] Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, 2010.
[18] Johnny R. J. Fontaine, Klaus R. Scherer, Etienne B. Roesch, and Phoebe C. Ellsworth. The world of emotions is not two-dimensional. Psychological Science, 18(12):1050–1057, 2007. PMID: 18031411.
[19] Alex Hadden. The identification of criminals by the Bertillon system. W. Res. LJ, 3:165, 1897.
[20] John H. L. Hansen and Taufiq Hasan. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Processing Magazine, 32(6):74–99, 2015.
[21] Hynek Hermansky. Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America, 87(4):1738–1752, 1990.
[22] Anil K. Jain, Arun Ross, and Salil Prabhakar. An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14(1):4–20, 2004.
[23] Mimansa Jaiswal and Emily Mower Provost. Privacy enhanced multimodal neural representations for emotion recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7985–7993, 2020.
[24] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. Advances in Neural Information Processing Systems, 30, 2017.
[25] Shashidhar G. Koolagudi, Kritika Sharma, and K. Sreenivasa Rao. Speaker recognition in emotional environment. In Eco-friendly Computing and Communication Systems: International Conference, ICECCS 2012, Kochi, India, August 9-11, 2012. Proceedings, pages 117–124. Springer, 2012.
[26] Richard S. Lazarus. Emotion and Adaptation. Oxford University Press, 1991.
[27] Haoqi Li, Ming Tu, Jing Huang, Shrikanth Narayanan, and Panayiotis Georgiou. Speaker-invariant affective representation learning via adversarial training. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7144–7148, 2020.
[28] R. Lotfian and C. Busso. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, 10(4):471–483, October-December 2019.
[29] Richard J. Mammone, Xiaoyu Zhang, and Ravi P. Ramachandran. Robust speaker recognition: A feature-based approach. IEEE Signal Processing Magazine, 13(5):58, 1996.
[30] Simon Angelo Meier. Medical diagnosis using voice samples: What the voice reveals about health. https://www.archyde.com/medical-diagnosis-using-voice-samples-what-the-voice-reveals-about-health/.
[31] Rushab Munot and Ani Nenkova. Emotion impacts speech recognition performance. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 16–21, 2019.
[32] Ali Bou Nassif, Ismail Shahin, Ashraf Elnagar, Divya Velayudhan, Adi Alhudhaif, and Kemal Polat. Emotional speaker identification using a novel capsule nets model. Expert Systems with Applications, 193:116469, 2022.
[33] Department of Homeland Security. Biometrics. https://www.dhs.gov/biometrics.
[34] Astrid Paeschke, Miriam Kienast, and Walter F. Sendlmeier. F0-contours in emotional speech. 1999.
[35] Raghavendra Pappagari, Tianzi Wang, Jesus Villalba, et al. x-vectors meet emotions: A study on dependencies between emotion and speaker recognition. In ICASSP, pages 7169–7173, 2020.
[36] Srinivas Parthasarathy and Carlos Busso. Predicting speaker recognition reliability by considering emotional content. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pages 434–439, 2017.
[37] Marc D. Pell, Abhishek Jaywant, Laura Monetta, and Sonja A. Kotz. Emotional speech processing: Disentangling the effects of prosody and semantic cues. Cognition & Emotion, 25(5):834–853, 2011.
[38] Rosalind W. Picard. Affective Computing. 2000.
[39] Robert Plutchik and Henry Kellerman, editors. Emotion: Theory, Research, and Experience, volume 1–5. Academic Press, 1980–1990.
[40] Mirco Ravanelli and Yoshua Bengio. Speaker recognition from raw waveform with SincNet, 2019.
[41] Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, et al. SpeechBrain: A general-purpose speech toolkit, 2021. arXiv:2106.04624.
[42] Seyed Omid Sadjadi, Craig Greenberg, Elliot Singer, Lisa Mason, and Douglas Reynolds. The 2021 NIST speaker recognition evaluation. arXiv preprint arXiv:2204.10242, 2022.
[43] Nikola Simić, Siniša Suzić, Tijana Nosek, Mia Vujović, Zoran Perić, Milan Savić, and Vlado Delić. Speaker recognition using constrained convolutional neural networks in emotional speech. Entropy, 24(3):414, 2022.
[44] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329–5333, 2018.
[45] Tengfei Song, Wenming Zheng, Peng Song, and Zhen Cui. EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Transactions on Affective Computing, 11(3):532–541, 2020.
[46] Thomas Swearingen, Will Drevo, Bennett Cyphers, et al. ATM: A distributed, collaborative, scalable system for automated machine learning. In IEEE International Conference on Big Data, Boston, MA, USA, pages 151–162, 2017.
[47] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883. IEEE, 2018.
[48] Yuxuan Wang, Daisy Stanton, Yu Zhang, et al. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis, 2018.
[49] Jennifer Williams and Simon King. Disentangling style factors from speaker representations. In Interspeech, pages 3945–3949, 2019.
[50] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, 2020.
[51] Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller, and Haizhou Li. Speech synthesis with mixed emotions. IEEE Transactions on Affective Computing, 2022.

APPENDIX
In addition to reporting the f-scores of each experiment from Chapter 3, Task 1, we provide the confusion matrices from the SER experiments conducted.

Figure A.1 E-Vector/MSP-Podcast SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.18 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.2 ECAPA-TDNN/VoxCeleb1+2 SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.25 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.3 ECAPA-TDNN/MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.20 +/- 0.03. Values shown in each cell denote a percentage.
Figure A.4 ECAPA-TDNN/Vox+MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.31 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.5 E-Vector/MSP-Podcast SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.18 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.6 ECAPA-TDNN/VoxCeleb1+2 SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.25 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.7 ECAPA-TDNN/MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.20 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.8 ECAPA-TDNN/Vox+MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.31 +/- 0.03. Values shown in each cell denote a percentage.