VOCAL STYLE FACTORIZATION FOR EFFECTIVE SPEAKER RECOGNITION IN AFFECTIVE SCENARIOS By Morgan Lee Sandler A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Master of Science 2023 ABSTRACT The accuracy of automated speaker recognition is negatively impacted by change in emotions in a person’s speech. In this thesis, we hypothesize that speaker identity is composed of various vocal style factors that may be learned from unlabeled speech data and re-combined using a neural network architecture to generate holistic speaker identity representations for affective scenarios. In this regard we propose the E-Vector neural network architecture, composed of a 1-D CNN for learning speaker identity features and a vocal style factorization technique for determining vocal styles. Experiments conducted on the MSP-Podcast dataset demonstrate that the proposed architecture improves state-of-the-art speaker recognition accuracy in the affective domain over baseline ECAPA-TDNN speaker recognition models. For instance, the true match rate at a false match rate of 1% improves from 27.6% to 46.2%. Additionally, we provide an analysis between speaker recognition match scores and emotions to identify challenging affective scenarios. To my family, for their support of my ambitions and their everlasting love. iii ACKNOWLEDGMENTS I would like to express my sincere gratitude to the following individuals and organizations who have provided support and assistance throughout the course of this research: My academic advisor, Dr. Arun Ross, for his unwavering guidance, advice, and encouragement throughout the project. His expertise and insights have been invaluable to the success of this research. The faculty and staff of the Department of Computer Science & Engineering at Michigan State University (MSU), for their support, resources, and collaborative environment that fostered my growth as a researcher. The National Association of Broadcasters (NAB) and the National Science Foundation Center for Identification Technology Research (NSF CITeR), for their financial support that made this research possible. My iPRoBe colleagues and friends, Dr. Anurag Chowdhury, Sushanta Pani, Dr. Renu Sharma, Melissa Dale, Pegah Varghaei, Parisa Farmanifard, Ryan Ashbaugh, Debasmita Pal, Katie Albus, Protichi Basak, Dr. Raul Quispe-Abad, Shivangi Yadav, Dr. Thomas Swearingen, Sai Ramesh, and the countless others whom my interactions and discussions challenged my ideas and inspired me to grow. Dr. Parisa Kordjamshidi and Dr. Qiben Yan for their commitment to aiding in the preparation of this thesis. Mom, Dad, Samantha, Brandon, and Danielle, for their unwavering love and support that kept me motivated and grounded throughout this journey. In all, I could not have begun this journey alone, and my community has played a crucial role in shaping my research and shaping me into the person I have become. Thank you all. Morgan iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1: Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2: Recognition in Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3: Automatic Speaker Recognition . . . . . . . . . . . . . . . . . . . . . . 
. . . 2 1.4: Affective Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5: Our Primary Research Question . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.6: Research Contributions and Thesis Organization . . . . . . . . . . . . . . . . . 5 CHAPTER 2: VOCAL STYLE FACTORIZATION FOR EFFECTIVE SPEAKER RECOGNITION IN AFFECTIVE SCENARIOS . . . . . . . . . . . . . 7 2.1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2: Learning Emotion Independent Speaker Identity Features . . . . . . . . . . . . 8 2.3: Extracting Vocal Style Factors via Global Style Tokens (GSTs) . . . . . . . . . 10 2.4: Proposed E-Vector Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.5: Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.6: MSP-Podcast Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.7: Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.8: Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.9: Impact of Vocal Style Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.10: Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.11: Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 CHAPTER 3: ANALYSIS THROUGH EMOTION CATEGORIES . . . . . . . . . . . 22 3.1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2: Visualizations of E-Vector Speaker Verification Match Scores in the Context of Emotion Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3: Quantifying Emotion in E-Vector and ECAPA-TDNN Embeddings . . . . . . . 23 3.4: Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 CHAPTER 4: THESIS CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . 33 4.1: Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2: Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 v LIST OF TABLES Table 2.1 An overview of speech-based feature representations in speaker recognition and affective computing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Table 2.2 E-Vector comparison to baseline models. E4 (E-Vector) performs best across all metrics in our study across three separate test sets. . . . . . . . . . . . . . . . 17 Table 2.3 Comparison across three trained E-Vector models with 5, 10, 20 VSFs. Each model is trained to 105,000 steps with the MSP-Podcast train set data. . . . . . . 20 Table 3.1 ER denotes 8-class emotion recognition. Metric is f-score. In this scenario, we would like a lower f-score value. This may indicate that the speaker identity models not do encode the emotion in the identity representation itself. The confusion matrices corresponding to this analysis may be found in the Appendix 28 Table 3.2 E-Vector/Vox+Librispeech 4-class emotion recognition accuracy. Comparison of Hierarchical Classifier to baseline SVM model. Metric is standard accuracy– higher is better. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 vi LIST OF FIGURES Figure 1.1 Identity may be decomposed into vocal style factors. . . . . . . 
. . . . . . . . . 5 Figure 2.1 The proposed E-Vector architecture. The 1-D CNN takes speech frames (de- tailed in Section 2.1) as input and outputs a vector of 40 features. These fea- tures are then passed to a reference encoder, which extracts a fixed-dimension embedding representing the speaker identity features. The vocal style fac- tor layer decomposes the identity embedding into a fixed number of factors (e.g., 10 factors) via multi-head attention (8 heads). Finally, the weighted combination of these factors yields a final speaker identity embedding. . . . . . 11 Figure 2.2 E-Vector Training Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Figure 2.3 E-Vector Testing Setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Figure 2.4 Match score distributions for E-Vector and ECAPA-TDNN baselines on three test sets of MSP-Podcast: test set 1, test set 2, validation set. Note that the validation set was not used in the training stage. . . . . . . . . . . . . . . . . . 18 Figure 2.5 Detection Error Trade-off (DET) curves comparing E-Vector with the baseline experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Figure 2.6 Detection Error Trade-off (DET) curves comparing E-Vector with varying factor sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Figure 3.1 Inter/Intra-Emotion Speaker Recognition Match Scores Visualized in the Valence-Arousal Emotion Space. . . . . . . . . . . . . . . . . . . . . . . . . . 24 Figure 3.2 E-Vector evaluated on MSP-Podcast Test 1 Set, Intra-Inter Emotion Speaker Verification Experiment Match Scores . . . . . . . . . . . . . . . . . . . . . . 25 Figure 3.3 E-Vector evaluated on MSP-Podcast Test 2 Set, Intra-Inter Emotion Speaker Verification Experiment Match Scores . . . . . . . . . . . . . . . . . . . . . . 26 Figure 3.4 E-Vector evaluated on MSP-Podcast Validation Set, Intra-Inter Emotion Speaker Verification Experiment Match Scores . . . . . . . . . . . . . . . . . . . . . . 27 Figure 3.5 Our E-Vector Speech Emotion Recognition Experiment Setup . . . . . . . . . . 31 Figure 3.6 The hierarchical classifier setup. First classifier is a binary classifier that discriminates a given emotion class from the rest. Those which do not belong to the first class are classified into the remaining three by the second layer. . . . 31 vii Figure 3.7 E-Vector/Vox+Librispeech Speech Emotion Recognition Experiments. Top Row all use the SVM classifier, Bottom row all use the Hierarchical Classifier. High recognition accuracy is not achieved in the emotion recognition domain inferring that E-Vector does not significantly encode emotion information. All cells represent percentages. . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Figure A.1 E-Vector/MSP-Podcast SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross val- idation. Mean f-score = 0.18 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Figure A.2 ECAPA-TDNN/VoxCeleb1+2 SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.25 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . 39 Figure A.3 ECAPA-TDNN/MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.20 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Figure A.4 ECAPA-TDNN/Vox+MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.31 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Figure A.5 E-Vector/MSP-Podcast SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross val- idation. Mean f-score = 0.18 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Figure A.6 ECAPA-TDNN/VoxCeleb1+2 SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.25 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 viii Figure A.7 ECAPA-TDNN/MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.20 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure A.8 ECAPA-TDNN/Vox+MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.31 +/- 0.03. Values shown in each cell denote a percentage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 ix CHAPTER 1 INTRODUCTION 1.1 Biometrics Throughout recorded human history, identifying people for various purposes has been com- monplace [19]. This can involve listening to someone’s style of speech to identify them, or using sophisticated cameras to recognize individuals at a distance based on their walking patterns. Bio- metrics is the science concerned with recognizing individuals based on their physical or behavioral attributes [22]. With the invention of modern-day computers and advances in the field of artificial intelligence (AI), systems have been proposed for automatically recognizing individuals based on their biometric traits. Some examples of these traits include, but are certainly not limited to, voice, fingerprint, face, iris, DNA, and gait. Each trait (also referred to as a modality) has its advantages and disadvantages. For example, fingerprints have many discriminable attributes, are easy to collect, and rarely change over long periods of time [22]. 
Voice, on the other hand, is a combination of behavioral and physiological attributes, is cost-effective and easy to implement, but its performance may be significantly affected by various factors such as age, style of speaking, emotion, and medical conditions [20]. A biometric trait is typically chosen based on the following criteria [22]:
• Universality: Does everyone have it?
• Uniqueness: Is the trait unique to an individual?
• Permanence: Will this trait change over time?
• Measurability: Can it be measured with sensors?
• Performance: Is the recognition and computational performance sufficient for practical use?
• Acceptability: Will the target population accept common use of this trait?
• Circumvention: Can bad actors easily alter, obfuscate, spoof, or mimic this trait?
The voice is a desirable biometric due to its wide range of applications and relatively low implementation cost, often only requiring a microphone sensor and accompanying computer. Its applications are diverse, ranging from forensics and security [33] to personal assistants [3], medical diagnostics [30], and even imitative voice synthesis technology [48].
1.2 Recognition in Biometrics
Biometric recognition can be divided into two separate functionalities: verification and identification. In verification, the user claims an identity, and the system verifies if the claim is correct (genuine, match) or not (impostor, non-match). In the verification process, a similarity match score is computed from a pair of biometric features. If the score is equal to or above a predetermined threshold, it is labeled as genuine or a match; otherwise, it is labeled as an impostor or a non-match. In other words, the verification scenario involves a 1-to-1 test between two biometric samples (e.g., two speech audio samples) to determine if the identity is the same or not. In contrast to the verification scenario, identification involves iterating over a database of enrolled users and identifying the user based on the presented biometric trait; there is no claim to an identity. Instead of resulting in a match or non-match, the result will either be a ranked list of matching identities or an empty set suggesting there were no matches in the database. In other words, it is a 1-to-N test over identity-labeled and registered biometric samples in the system. This thesis will primarily focus on the verification scenario.
1.3 Automatic Speaker Recognition
Automatic Speaker Recognition involves the use of software to automatically identify or verify the identity of a speaking individual based on their voice characteristics. There are two primary modes of speaker recognition: text-dependent and text-independent [22]. Text-dependent speaker recognition systems typically require the use of a "passcode" or a known phrase, and require textual content to make an identity decision. In contrast, text-independent speaker recognition does not have this requirement and relies on paralinguistic aspects of speech. Both systems typically use five primary modules [22]:
1. An enrollment system to add identities to the database.
2. A sensor module to collect and record the biometric data (e.g., microphone).
3. A feature extraction module to extract salient numerical features from the input biometric data.
4. A matching module to compare two biometric samples (or multiple pairs) based on the extracted features and generate match score(s).
5. A database module to store all biometric information.
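The verification functionality described in Section 1.2 ultimately reduces to the matching and decision steps carried out by modules 3-5 above. The following is a minimal sketch of that final step, assuming fixed-length voice embeddings have already been extracted; the extractor is left abstract, and the threshold value is only a placeholder, not a recommended operating point.

```python
import numpy as np

def cosine_match_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Similarity match score between two fixed-length voice embeddings."""
    denom = np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-12
    return float(np.dot(emb_a, emb_b) / denom)

def verify(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.5) -> bool:
    """1-to-1 verification: declare a genuine match if the score meets the threshold."""
    return cosine_match_score(emb_a, emb_b) >= threshold

def identify(probe: np.ndarray, gallery: dict) -> list:
    """1-to-N identification: rank enrolled identities by match score."""
    scores = {name: cosine_match_score(probe, emb) for name, emb in gallery.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage with stand-in embeddings.
rng = np.random.default_rng(0)
probe = rng.standard_normal(256)
gallery = {"speaker_a": rng.standard_normal(256), "speaker_b": rng.standard_normal(256)}
print(verify(probe, gallery["speaker_a"]), identify(probe, gallery))
```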
There are many factors which may affect the performance of an automatic speaker recognition system. These include, but are not limited to, background noise, sample quality, audio distortion, and emotion [6]. In this thesis, we will study the effects of emotion on text-independent speaker recognition by deliberately selecting affective (emotion) datasets to explore the dependencies between speaker identity and emotion. Since the datasets already contain 16 kHz audio waveforms and the corresponding identities of the speakers, it is appropriate to skip the enrollment and sensor modules in our experimentation.
1.4 Affective Computing
In order to propose a method which addresses the problem of emotion in speaker recognition, we study the recent literature in affective computing. Affective computing seeks to develop the capabilities of recognizing, responding to, or imitating human emotions [38]. Speech contains a wealth of affective information, including tone, pitch, and intonation, which can be analyzed to determine the speaker's affective state [38]. Literature in speech-based affective computing typically involves a mixture of psychological domain knowledge and computational methods to extract relevant features [16]. Within affective computing there are many areas of study, such as emotion recognition [45] or emotional voice conversion [51]. In this work, we study automatic speaker recognition in the context of varying affective scenarios.
Recent research in speech-based affective computing has aimed to improve the accuracy and robustness of emotion recognition and voice synthesis systems. Voice synthesis is concerned with modeling the vocal properties of a target speaker. The latest deep learning-based research in this domain indicates that identity is encoded within the embeddings generated by neural networks [49]. With the availability of new large affective speech datasets such as MSP-Podcast [28], researchers have been able to study whether deep learning methods [23] can automatically extract high-level affective independent features from speech. Traditional methods like Gaussian mixture modeling (GMM), support vector machines (SVM), and hidden Markov models (HMM) [38] have also been used. Modeling methods with hidden/latent mechanisms are a natural fit for representing emotion. Emotion is typically understood as a latent state which changes according to outside stimuli [38].
Choosing which emotions to use in experimentation is an active area of research in psychology and cognitive science [26]. The principle of cathexis in psychology [38] theorizes that if more mental energy is used to portray an emotion, then it will be easier to perceive it in a voice. For example, an actor portraying sadness with a microphone may be perceived as being more sad than the average person speaking to their parents on the phone. The mental biasing of emotion affects the perceptibility of the emotion involved. Acting has been the primary method for capturing emotion in recent literature and datasets [2, 4], but there are now more efforts to obtain natural emotion corpora to study the effects of cathexis in speech, and to train computational models which reflect typical human affective scenarios. By studying these natural emotion scenarios, we may be able to identify typical emotion-related issues that surface in speaker recognition.
There are two primary computational models of emotion [18, 39]: discrete and continuous.
Discrete emotions typically encapsulate a broad range of vocal styles such as anger, happiness, sadness, and surprise, while continuous emotion models attempt to quantify an exact point in a multi-dimensional emotion space using axes such as valence, arousal, and dominance. Continuous models acknowledge that there are many variations within an emotion, such as cold anger versus hot anger, which should be modeled implicitly. In contrast, discrete models of emotion are typically employed by practitioners who are less concerned about the specific nature of the emotion and more interested in whether it falls into a certain category, such as happy or not. In this work, we evaluate our speaker recognition system through the lens of both emotion models.
[Figure 1.1 diagram: speaker identity depicted as a combination of Style 1, Style 2, ..., Style N.]
Figure 1.1 Identity may be decomposed into vocal style factors.
1.5 Our Primary Research Question
Can we improve speaker recognition accuracy under varying affective scenarios? To answer this question, we hypothesize that speaker identity is composed of various vocal style factors. Vocal style factors may include emotion, identity features, or a composition of both (viz., Figure 1.1). We propose a neural network architecture that uses a "vocal style factor" layer which learns, in an unsupervised manner, the factors which compose speaker identity. We hypothesize that a combination of these factors provides a holistic representation of speaker identity in varying affective scenarios.
1.6 Research Contributions and Thesis Organization
The thesis contributions can be summarized as follows:
1. Viewing speaker identity as a composition of many vocal style factors.
2. A proposed solution called E-Vector (Emotion Scenario Vector), an automatic speaker recognition model based on neural networks, which improves recognition accuracy compared to a state-of-the-art ECAPA-TDNN model [13].
3. An extensive set of experiments to establish the advantages of E-Vector over baseline models.
4. Analysis to determine if E-Vector embeddings capture emotion or not.
The rest of the thesis is organized as follows. Chapter 2 introduces our E-Vector architecture and the speaker recognition experiments conducted in various affective scenarios. Chapter 3 provides further insight and analysis of the E-Vector embeddings by conducting emotion recognition experiments to determine whether the embeddings capture emotion or not as a byproduct of the proposed method. Chapter 4 concludes the thesis and offers future directions for further research on this topic.
CHAPTER 2
VOCAL STYLE FACTORIZATION FOR EFFECTIVE SPEAKER RECOGNITION IN AFFECTIVE SCENARIOS
2.1 Introduction
Human speech is composed of various speaking styles that convey conversational semantics through emotions, tone, and other paralinguistic cues. Paralinguistics refer to the non-lexical aspects of speech, such as intonation, pitch, and rhythm, that convey additional meaning beyond the words spoken [37]. For instance, the tone of voice can make a significant difference in conveying a speaker's conviction or lack thereof, even when the words are the same. In addition, different argumentative styles of speaking may vary based on the speaker's origin, language, culture, and other factors. These variations in human speech can challenge state-of-the-art speaker recognition systems [7, 35]. Such systems generally analyze two speech samples and determine whether they correspond to the same individual or not.
In this work, we hypothesize that human speech is a superimposition of multiple vocal style factors and that these factors contribute to both the speaker identity as well as the speaker emotions. By decomposing an input speech sample into individual vocal style factors and then learning how to combine them in order to model speaker identity, we alleviate the negative impact of emotions on speaker recognition.
Previous state-of-the-art techniques in speaker recognition (see also Table 2.1) have focused mainly on neutral tones, failing to capture the affective scenario, where emotions modulate the speaking style. While state-of-the-art neural network models can be fine-tuned and adapted to affective scenarios using affective datasets, their accuracy remains underwhelming for many affective scenarios. In certain situations, such as those related to national security, it is imperative for an automated speaker recognition system to confidently and accurately identify individuals, especially in the affective scenario. This serves as the motivation for our work. In this work, we leverage a voice synthesis technology referred to as Global Style Tokens (GST) [48] to learn vocal style factors and a 1-D CNN to learn identity features directly from raw audio data. The learned vocal style factors are analogous to basis vectors of a style space. We learn these factors from training speech samples in an unsupervised manner while training the speaker recognition system with identity labels. This core method, combined with affective training data and a triplet loss function [47], ensures that we learn a comprehensive speaker identity representation in the context of affective scenarios.
2.2 Learning Emotion Independent Speaker Identity Features
Previous studies have examined the relationship between emotion and speaker identity [1, 31, 35]. These studies found that the Equal Error Rate (EER) of speaker recognition is affected by variations in the speaker's emotional state. The authors of [36] proposed a method for estimating the reliability of speaker recognition models based on their emotion content. Emotion recognition models that use adversarial training to separate emotion and speaker identity to achieve speaker-invariance have been proposed [27]. Several methods have been proposed to mitigate affective content and achieve emotion-invariance [25, 32, 43]. In contrast to these works, our proposed method uses unsupervised learning of vocal styles, which form the basis of the style space in which we theorize identity belongs. Instead of learning speaker identity directly from speech samples, we first learn vocal styles and then learn speaker identity as a superimposition of those styles. In the following sections, we will provide more details about our approach to achieving emotion-invariance in speaker recognition.
2.2.1 Speech Preprocessing
A Voice Activity Detector [41] is used to remove non-speech parts of an input audio sample. If a sample is longer than 2 seconds, then we split it into many 2-second non-overlapping audio samples. The 2-second speech audio samples are then framed and windowed into multiple smaller audio segments which we refer to as speech units. Using a Hamming window of length 20 ms and stride 10 ms, each speech unit of 20 ms sampled at 16000 Hz is represented by an audio vector of length 320. The sliding window extracts a speech unit every 10 ms from the 2-second audio sample, which yields approximately 200 speech units per 2-second audio sample.
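The windowing recipe above (and the stacking into 320 × 200 speech frames described in the next paragraph) can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: it assumes a 16 kHz mono waveform that has already passed voice activity detection, and it simply drops any trailing audio shorter than 2 seconds, which the text does not specify.

```python
import numpy as np

SR = 16_000               # sampling rate (Hz)
WIN = int(0.020 * SR)     # 20 ms Hamming window -> 320 samples per speech unit
HOP = int(0.010 * SR)     # 10 ms stride between speech units
SEG = 2 * SR              # 2-second non-overlapping segments

def speech_frame(segment: np.ndarray) -> np.ndarray:
    """Slice one 2-second segment into overlapping, windowed speech units and
    stack them column-wise into a 320 x ~200 'speech frame' (Section 2.2.1)."""
    window = np.hamming(WIN)
    units = [segment[s:s + WIN] * window
             for s in range(0, len(segment) - WIN + 1, HOP)]
    return np.stack(units, axis=1)          # shape: (320, ~200)

def frames_from_audio(audio: np.ndarray) -> list:
    """Split a voice-activity-filtered waveform into 2-second segments and
    convert each to a speech frame; any trailing remainder is dropped here."""
    return [speech_frame(audio[i:i + SEG])
            for i in range(0, len(audio) - SEG + 1, SEG)]

frames = frames_from_audio(np.random.randn(5 * SR))   # stand-in 5-second waveform
print(len(frames), frames[0].shape)                    # 2 frames of shape (320, 199)
```

With a 320-sample window and 160-sample hop, each 2-second segment yields 199 units, i.e., the "approximately 200" speech units noted above.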
The speech units are then stacked horizontally to form a two-dimensional speech representation that we refer to as a speech frame. The dimension of this frame is 320 × 200. The extracted speech frames are then organized into triplets which are input to the E-Vector model. A single triplet consists of a positive sample, a negative sample, and an anchor sample. The positive and anchor samples share the same speaker identity label, whereas the negative and anchor samples do not.
2.2.2 Speech Feature Extraction through a 1-D CNN
We use a 1-D CNN to learn domain-relevant features directly from speech frames rather than from explicitly extracted features. In contrast, many popular methods use handcrafted features, such as the mel-frequency cepstrum (MFC) or linear predictive coding (LPC). However, handcrafted features may favor certain domains over others by their inherent design. For example, MFC is used to model the spectral shape of sound; therefore, it may not capture enough information to disambiguate between two emotions with similar sounds. Affective feature sets such as the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) have been proposed to address this problem. Recent speaker recognition literature has demonstrated the efficacy of learning features directly from audio, but none of these methods have targeted affective speech directly. A table of popular feature representations and their intended use is shown in Table 2.1. In this work, we learn a 40-dimensional feature set directly from speech frames. This approach has previously been demonstrated [9] to retrieve relevant short- and long-term behavioral features that are pertinent to speaker identity. A 1-D CNN is also beneficial for our deep learning architecture because it can be trained end-to-end along with the GST network (see Section 2.3) to maximize model learning. The exact architecture setup used is depicted in Figure 2.1.
Paper | Feature Category | Feature Name | Intended Use
Davis and Mermelstein (1980) [10] | Short-term spectral | Mel-Frequency Cepstral Coefficients (MFCC) | Voice Perception
Hermansky (1990) [21] | Short-term spectral | Perceptual Linear Predictive (PLP) Analysis | Voice Perception
Mammone et al. (1996) [29] | Short-term spectral | Linear Predictive Coding (LPC) | Voice Production
Eyben et al. (2010) [17] | Affective | OpenSMILE Toolkit | Affective Speech
Dehak et al. (2010) [11] | Speaker Identity | i-vector | Speaker Identity
Eyben et al. (2015) [15] | Affective | Geneva Minimalistic Acoustic Parameter Set (GeMAPS) | Affective Speech
Ravanelli et al. (2018) [40] | Short-term spectral | SincNet | NN-based Speaker Identity
Snyder et al. (2018) [44] | Short-term spectral | X-Vector | NN-based Speaker Identity
Desplanques et al. (2020) [13] | Short-term spectral | ECAPA-TDNN | NN-based Speaker Identity
Chowdhury et al. (2020) [9] | Short-term spectral | DeepVOX | NN-based Speaker Identity in Noisy Environments
Chowdhury et al. (2021) [5] | Short-term & Long-term spectral | DeepTalk | NN-based Vocal Style Transfer
This work | Short-term & Long-term Affective spectral | E-Vector | NN-based Affective Speaker Identity
Table 2.1 An overview of speech-based feature representations in speaker recognition and affective computing.
The architecture consists of six 1-D dilated convolution layers [8], separated by scaled exponential linear unit (SELU) activation functions [24].
The one-dimensional filters are designed to extract features from individual speech units within a speech frame, rather than across them, based on the assumption that the speaker-dependent characteristics within each unit are independent from those in other units within the same frame. To this end, each 320-dimensional speech unit is processed by a series of 1-D dilated convolutional layers, resulting in 40 filter responses that form the short-term spectral representation for that unit. This combination of dilation and scaled activation ensures a larger receptive field, allowing the model to learn sparse relationships between the feature values within a speech unit, which leads to significant performance benefits. Moreover, this method has been shown to be effective in bypassing noise and other small perturbations present in the audio waveform [9]. The larger receptive field facilitates learning of relevant high-dimensional affective information at the feature-extraction level.
2.3 Extracting Vocal Style Factors via Global Style Tokens (GSTs)
Text-to-speech is a challenging inverse problem of speech recognition. The goal is to generate human-like speech from a provided text script. There are many complications that arise when deciding what that voice sounds like. For instance, the system must produce a natural voice with vocal style, gender, a particular tone, emotions, and other desirable attributes. Many people convey semantic meaning through their conversations using tone, vocal style, and emotion. Therefore, in text-to-speech, many techniques aim to learn speaker characteristics. With recent advances in deep learning, voice synthesis methods have demonstrated success in learning speaker characteristics in order to generate natural-sounding voices. For this reason, we use Global Style Tokens (GST), a voice synthesis technique, as our vocal style factor decomposition method.
The GST method was introduced in [48]. It was proposed for use in a text-to-speech system to encode a vocal style embedding from speech samples. In this method, a voice synthesizer is trained using a concatenated text and vocal style embedding. The vocal style embedding is learned through a multi-head attention module that calculates the similarities between a speaker identity embedding learned from the speech sample and a bank of 10 embeddings, which we will refer to as "vocal style factors" or VSFs. A weighted combination of VSFs produces the style embedding, which is joined with a text embedding to provide a precise representation of speaker characteristics to a vocoding network. GSTs have yielded promising results in the emotional voice conversion (EVC) literature [51] and have demonstrated the ability to encode pitch contours that are vital to vocal style in human speech [5]. Therefore, we incorporate a similar approach into our E-Vector architecture. In our E-Vector model, we swap the reconstruction loss with a generalized end-to-end loss [47]. This ensures that we will learn speaker identity in the context of distinguishing genuine and impostor pairs for speaker verification.
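To make the mechanism concrete, the following is a simplified PyTorch sketch of such a vocal style factor layer: a 128-dimensional reference embedding (Section 2.4) attends over a small bank of learned style tokens (10 factors and 8 attention heads in the proposed model), and their weighted combination is returned. The token dimensionality, the tanh squashing, and the final concatenation into a 256-dimensional representation are my assumptions based on the GST formulation and the dimensions quoted in Section 2.4; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class VocalStyleFactorLayer(nn.Module):
    """Simplified GST-style layer: a bank of learnable vocal style factors (VSFs)
    is attended over, with the reference embedding acting as the query."""
    def __init__(self, ref_dim=128, num_factors=10, num_heads=8, factor_dim=128):
        super().__init__()
        # Learnable style tokens (the "vocal style factors"); factor_dim is assumed.
        self.factors = nn.Parameter(torch.randn(num_factors, factor_dim))
        self.query_proj = nn.Linear(ref_dim, factor_dim)
        self.attn = nn.MultiheadAttention(embed_dim=factor_dim, num_heads=num_heads,
                                          kdim=factor_dim, vdim=factor_dim,
                                          batch_first=True)

    def forward(self, ref_emb):                          # ref_emb: (batch, 128)
        q = self.query_proj(ref_emb).unsqueeze(1)        # (batch, 1, factor_dim)
        tokens = torch.tanh(self.factors)                 # squash token values
        kv = tokens.unsqueeze(0).expand(ref_emb.size(0), -1, -1)
        style, _ = self.attn(q, kv, kv)                   # weighted sum of factors
        return style.squeeze(1)                           # (batch, factor_dim)

layer = VocalStyleFactorLayer()
ref = torch.randn(4, 128)                                 # stand-in reference embeddings
style = layer(ref)
evector = torch.cat([ref, style], dim=-1)                 # assumed 256-dim representation
print(evector.shape)                                      # torch.Size([4, 256])
```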
2.4 Proposed E-Vector Architecture
[Figure 2.1 diagram: the 1-D CNN consists of six 1-D dilated convolution layers with SELU activations (output channels 2, 4, 8, 16, 32, 40; kernel sizes 5, 5, 7, 9, 11, 11; dilations 2, 2, 3, 4, 5, 5). The reference encoder is a 6-layer 2-D CNN with filter sizes 32, 32, 64, 64, 128, 128 followed by a GRU with 128 units, producing a 128-dimensional reference embedding. The vocal style factor layer applies multi-head attention over the vocal style factors, and a weighted sum yields the speaker identity embedding.]
Figure 2.1 The proposed E-Vector architecture. The 1-D CNN takes speech frames (detailed in Section 2.1) as input and outputs a vector of 40 features. These features are then passed to a reference encoder, which extracts a fixed-dimension embedding representing the speaker identity features. The vocal style factor layer decomposes the identity embedding into a fixed number of factors (e.g., 10 factors) via multi-head attention (8 heads). Finally, the weighted combination of these factors yields a final speaker identity embedding.
[Figure 2.2 diagram: positive, anchor, and negative speech frames are each passed through E-Vector to obtain identity embeddings, which feed a triplet loss.]
Figure 2.2 E-Vector Training Setup.
[Figure 2.3 diagram: speech frames from sample A and sample B are each passed through E-Vector; a cosine similarity matcher compares the identity embeddings to produce a match score.]
Figure 2.3 E-Vector Testing Setup.
In this work, we propose E-Vector as a supervised learning architecture (supervised with identity labels, but not with emotion or other affective labels) designed for speaker recognition in varying affective scenarios. E-Vector consists of two networks trained end-to-end. The first network is a 1-D CNN which learns a 40-dimensional feature set directly from speech frames, while a GST network, which consists of a 2-dimensional CNN and a Gated Recurrent Unit, produces 128-dimensional reference embeddings. It does so by learning 10 vocal style factor embeddings that serve as basis vectors for a learned vocal style space that encapsulates the affective styles inherent in the data. The final 256-dimensional speaker representation learned is referred to as an E-Vector. The E-Vector architecture is illustrated in Figure 2.1. The structure of the training and testing of the architecture are illustrated in Figures 2.2 and 2.3, respectively.
2.5 Loss Functions
2.5.1 Additive Angular Margin Loss (AAM)
Additive Angular Margin (AAM) loss, often referred to as ArcFace loss [12], is a popular loss function proposed to learn geodesic distances between identity embeddings on a hypersphere. The margin hyperparameter enforces compactness between genuine pairs and more distance between impostor pairs to a higher degree than traditional softmax loss. The formulation is as follows:
L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(\cos(\theta_{y_i} + m))}}{e^{s(\cos(\theta_{y_i} + m))} + \sum_{j=1,\, j \neq y_i}^{n} e^{s \cos\theta_j}}    (2.5.1)
Here, N is the size of the batch, n denotes the number of identities, s is the radius of the hypersphere, m is the additive angular margin penalty between the feature vector and the weight vector, and \theta is the angle between the weight and feature vector.
2.5.2 Generalized End-to-End Loss (GE2E)
Generalized End-to-End (GE2E) loss [47] is specifically derived for speaker verification and is designed to contribute more loss for impostor pairs that may frequently false-match. It does this by identifying the most similar-to-genuine false-match speaker as its basis for maximizing the distance between the genuine embedding and the impostor embedding.
The genuine term in the GE2E loss corresponds to a genuine match between the input embedding vector and its true speaker label. Conversely, the impostor term corresponds to an impostor match between the input embedding vector and the speaker label with the highest similarity to genuine among all false speakers. In prior literature, GE2E loss has demonstrated effectiveness and does not require an initial example selection stage that loss functions such as Tuple End-To-End Loss may use [47]. Its effectiveness at identifying hard speaker verification samples, which are common in affective speaker recognition, is why we use this loss in our work.
L = \sum_{i=1}^{N} \Big[ \alpha_i \, d(x_i, y_i)^2 + (1 - \alpha_i) \max_{j \neq i} d(x_i, y_j)^2 \Big]    (2.5.2)
Here, x_i is the genuine embedding for utterance i, y_i is the ground-truth embedding for utterance i, \alpha_i is the weight assigned to each utterance, d(x_i, y_i) is the distance between a genuine pair, and d(x_i, y_j) is the distance between the genuine embedding and the impostor embedding y_j.
2.6 MSP-Podcast Dataset
MSP-Podcast [28] is a natural emotion dataset derived from podcasts and labeled through crowd-sourced consensus voting. Each sample is labeled with a discrete emotion (e.g., 'Happy') as well as continuous emotion dimensions (valence, arousal, dominance) using a 5-point Likert scale. There are 10 different labels for the discrete emotion categories, and the continuous emotion dimension values range between 0 and 5. The MSP-Podcast dataset contains a total of 73,042 speaking turns, equivalent to 110 hours of speech. The dataset is divided into several sets for training and evaluation purposes. Test set 1 contains segments from 60 speakers (30 female and 30 male), totaling 15,326 segments. Test set 2 consists of 5,037 segments randomly selected from 100 podcasts. These segments are not included in any other partition. The Validation set includes segments from 44 speakers (22 female and 22 male), totaling 7,800 segments. Note that we use the validation set as another testing set, and its data is not seen during the training process. The Train set contains the remaining speech samples, totaling 38,179 segments. In this work, we use the speaker identity labels for our training process. For evaluation purposes, we use the Test 1, Test 2, and Validation sets—each with its own unique composition—to evaluate our proposed E-Vector model.
2.7 Experiment Setup
2.7.1 Models and Baselines
Four models are trained for the speaker verification task: ECAPA-TDNN/Vox (denoted as E1), ECAPA-TDNN/MSP (denoted as E2), ECAPA-TDNN/Vox+MSP (denoted as E3), and E-Vector (denoted as E4). E1, E2, and E3 serve as the baseline models to evaluate different training configurations of state-of-the-art speaker recognition systems and to determine their efficacy in the affective scenario. E1 is a popular state-of-the-art ECAPA-TDNN [13] speaker recognition model pre-trained on the VoxCeleb 1 & 2 datasets. E2 also uses the ECAPA-TDNN architecture, but we substitute the AAM loss with the GE2E loss; further, the network is trained from scratch using the MSP-Podcast train set. Our third baseline, E3, uses the pre-trained weights from E1, but we fine-tune the last layer (1.5M parameters) using the MSP-Podcast train set. E4 is the proposed E-Vector approach and is trained with the MSP-Podcast train set. The specifics of training the four models are detailed in the next section.
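As described in Section 2.2.1 and Figure 2.2, training consumes triplets of speech frames built from the identity labels: the anchor and positive share a speaker, while the negative does not. The following is a minimal sketch of that sampling step with hypothetical names; it is not the authors' data loader, and the real pipeline may select triplets differently (e.g., with hard-negative mining).

```python
import random
from collections import defaultdict

def make_triplets(samples, num_triplets, seed=0):
    """Draw (anchor, positive, negative) index triplets from identity labels.
    `samples` is a list of (frame_id, speaker_id) pairs; anchor and positive
    share a speaker, the negative comes from a different speaker."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for idx, (_, speaker) in enumerate(samples):
        by_speaker[speaker].append(idx)
    # Only speakers with at least two frames can supply anchor/positive pairs.
    eligible = [s for s, idxs in by_speaker.items() if len(idxs) >= 2]
    triplets = []
    for _ in range(num_triplets):
        speaker = rng.choice(eligible)
        anchor, positive = rng.sample(by_speaker[speaker], 2)
        other = rng.choice([s for s in by_speaker if s != speaker])
        negative = rng.choice(by_speaker[other])
        triplets.append((anchor, positive, negative))
    return triplets

toy = [("f0", "spkA"), ("f1", "spkA"), ("f2", "spkB"), ("f3", "spkB"), ("f4", "spkC")]
print(make_triplets(toy, 3))
```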
2.7.2 Training Details
All four models are trained for the speaker verification task, and training is stopped if there are no empirically significant decreases in the training loss over a window of 20,000 steps. All copies of the trained models and their respective training/evaluation code are available via email request or on our GitHub website (viz., Section 2.11). All models were trained over 5 days with an Nvidia RTX 2080 Ti GPU.
2.7.2.1 ECAPA-TDNN/Vox (E1)
The E1 pre-trained weights are accessed from HuggingFace [50], while the training data is sourced from the VoxCeleb1 and VoxCeleb2 datasets. The authors trained this model for 12 epochs and fit 22.2M trainable parameters using an AAM loss function. No MSP-Podcast train data is used in the E1 training process. This model serves as an off-the-shelf speaker verification model for comparison to E-Vector.
2.7.2.2 ECAPA-TDNN/MSP (E2)
The E2 model uses the ECAPA-TDNN architecture, but the training process uses the GE2E loss function (same as E-Vector) rather than the AAM loss. Using the data from the MSP-Podcast train set, 22.2M parameters are learned over 7,500 steps (where overfitting is observed). Overfitting is to be expected, as the number of model parameters is large in relation to the quantity of data available. Even with this limitation, we find that this model still achieves competitive results in our affective scenario.
2.7.2.3 ECAPA-TDNN/Vox+MSP (E3)
The E3 model uses the same pre-trained weights as E1. To adapt the model to the target domain, we fine-tuned the last layer (1.5M trainable parameters) using the MSP-Podcast training set. E3 employs the AAM loss function. This model represents a popular transfer-learning method, used here to determine whether we can solve the affective scenario by adapting pre-existing models.
2.7.2.4 E-Vector (E4)
E4 is our proposed method. Using 10 vocal style factors, it is trained for 105,000 steps with 959.8K trainable parameters using the GE2E loss function. Data from the MSP-Podcast train set is used in the training process.
2.7.3 Model Evaluation
To evaluate our three baseline models and E-Vector, we compute the following metrics: Equal Error Rate (EER), Minimum Detection Cost Function (minDCF), D-Prime (d'), Area Under Curve (AUC), True Match Rate at a specified False Match Rate (TMR @ FMR 1% and 10%), Detection Error Tradeoff (DET) curves, and match score distributions.
2.7.3.1 minDCF Metric Formulation
The minDCF metric allows us to study the costs associated with the speaker recognition system. The formulation is as follows:
DCF(\tau) = C_{miss} P_{miss}(\tau) P_{target} + C_{FM} P_{FM}(\tau) (1 - P_{target})    (2.7.1)
where \tau is a match decision threshold, C_{miss} is the cost of a missed verification error, C_{FM} is the cost of a false match error, P_{target} is the prior probability of the target speaker occurring in the data, P_{miss}(\tau) is the probability of a miss verification at a given threshold, and P_{FM}(\tau) is the probability of a false match at a given threshold. minDCF is the minimization of this cost function. We evaluate minDCF at C_{miss} = 10, C_{FM} = 1. This is because, in forensic/security scenarios, it is preferable to flag a candidate match than to miss a potential match entirely.
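A small illustrative sketch of Eq. (2.7.1) and its minimization over thresholds follows. The cost values follow the text above, while the target prior P_target is not specified in this section and is only a placeholder; this is not the authors' evaluation code.

```python
import numpy as np

def min_dcf(genuine_scores, impostor_scores, c_miss=10.0, c_fm=1.0, p_target=0.01):
    """Sweep decision thresholds over the observed scores and return the minimum
    detection cost per Eq. (2.7.1). c_miss=10 and c_fm=1 follow Section 2.7.3.1;
    p_target is a placeholder assumption."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    best = np.inf
    for t in np.unique(np.concatenate([genuine, impostor])):
        p_miss = np.mean(genuine < t)     # genuine pairs rejected at threshold t
        p_fm = np.mean(impostor >= t)     # impostor pairs accepted at threshold t
        dcf = c_miss * p_miss * p_target + c_fm * p_fm * (1.0 - p_target)
        best = min(best, dcf)
    return best

rng = np.random.default_rng(1)
print(min_dcf(rng.normal(0.7, 0.1, 1000), rng.normal(0.3, 0.1, 1000)))
```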
Model | Trainable Params | Test Set | EER | minDCF | TMR@FMR = {1%, 10%} | D' | AUC
E1 | 22.2M | Val | 0.34 | 0.094 | 14.7%, 40.7% | 0.83 | 0.72
E1 | | Test 1 | 0.33 | 0.096 | 12.0%, 39.9% | 0.90 | 0.74
E1 | | Test 2 | 0.29 | 0.094 | 14.0%, 44.2% | 1.12 | 0.78
E2 | 1.5M | Val | 0.37 | 0.099 | 7.15%, 30.0% | 0.68 | 0.68
E2 | | Test 1 | 0.35 | 0.098 | 8.92%, 33.7% | 0.73 | 0.70
E2 | | Test 2 | 0.27 | 0.089 | 20.7%, 52.5% | 1.18 | 0.80
E3 | 22.2M | Val | 0.25 | 0.090 | 19.4%, 55.1% | 1.35 | 0.83
E3 | | Test 1 | 0.22 | 0.080 | 27.6%, 65.0% | 1.53 | 0.86
E3 | | Test 2 | 0.19 | 0.078 | 30.5%, 68.7% | 1.68 | 0.88
E4 | 959.8K | Val | 0.13 | 0.054 | 55.5%, 84.0% | 2.14 | 0.94
E4 | | Test 1 | 0.20 | 0.061 | 46.2%, 72.0% | 1.58 | 0.87
E4 | | Test 2 | 0.15 | 0.068 | 38.7%, 77.1% | 1.95 | 0.91
Table 2.2 E-Vector comparison to baseline models. E4 (E-Vector) performs best across all metrics in our study across three separate test sets.
2.7.3.2 Speaker Verification Testing Details
In the MSP-Podcast dataset, there are 109.8M speaker verification pairs in test set 1, 10.7M pairs in test set 2, and 28.8M pairs in the validation set. Due to the large quantity of pairs available, we store all results and evaluate metrics on a smaller, uniformly sampled subset of the results. Computing the full speaker verification tests takes 27 hours on test set 1, 6 hours on test set 2, and 16 hours on the validation set. All experiments were conducted on an Nvidia GeForce 2080 Ti GPU. The approximate size of the reduced evaluation sets is 548K pairs for test set 1 (approx. 0.5% of computed match scores), 107K pairs for test set 2 (approx. 1% of computed match scores), and 287K pairs for the validation set (approx. 1% of computed match scores). All speaker verification results are publicly available by email request or on our GitHub. Attributes available in the results table include the emotion of sample A, the emotion of sample B, the identity of sample A, the identity of sample B, the speaker verification match score, the continuous emotion dimensions of samples A and B, and the gender of samples A and B.
2.8 Results
We note that E-Vector learns certain categories of affective states. This is reflected in the bimodal genuine distribution which can be observed in the E-Vector test set 1 plot in Figure 2.4. The bimodal distribution may be a product of overcoming challenges caused by emotion modulation. That may be why we observe a similar, but less pronounced, bimodal distribution in the corresponding test set 1 genuine distribution in the ECAPA-TDNN/MSP experiment.
[Figure 2.4: twelve match-score distribution panels, one per model (E-Vector, ECAPA-TDNN/Vox, ECAPA-TDNN/MSP, ECAPA-TDNN/Vox+MSP) and evaluation set (Test Set 1, Test Set 2, Validation Set).]
Figure 2.4 Match score distributions for E-Vector and ECAPA-TDNN baselines on three test sets of MSP-Podcast: test set 1, test set 2, validation set. Note that the validation set was not used in the training stage.
We also conclude that fine-tuning ECAPA-TDNN with MSP-Podcast train data does not work as intended, as it finds a niche local optimum, perhaps due to the insufficient quantity of training data available. Also, E-Vector improves the recognition EER from 0.22 in E3 to 0.20, and the TMR@FMR 1% accuracy from 27.6% in E3 to 46.2%.
[Figure 2.5: DET curve panels for Test Set 1, Test Set 2, and the Validation Set.]
Figure 2.5 Detection Error Trade-off (DET) curves comparing E-Vector with the baseline experiments.
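The headline numbers in Table 2.2 (EER and TMR at a fixed FMR) can be computed directly from the genuine and impostor match score sets, and the DET curve is the set of (FMR, FNMR) points swept over thresholds. The following is a small illustrative sketch on synthetic scores, not the authors' evaluation code; the quantile-based threshold is an approximation.

```python
import numpy as np

def tmr_at_fmr(genuine, impostor, fmr=0.01):
    """True match rate at the decision threshold whose false match rate is ~fmr."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    threshold = np.quantile(impostor, 1.0 - fmr)   # top `fmr` fraction of impostors still match
    return float(np.mean(genuine >= threshold))

def eer(genuine, impostor, num_thresholds=1000):
    """Approximate equal error rate: the point where FMR and FNMR cross."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    ts = np.linspace(min(genuine.min(), impostor.min()),
                     max(genuine.max(), impostor.max()), num_thresholds)
    fmr = np.array([(impostor >= t).mean() for t in ts])   # false matches
    fnmr = np.array([(genuine < t).mean() for t in ts])    # false non-matches (misses)
    i = int(np.argmin(np.abs(fmr - fnmr)))
    return float((fmr[i] + fnmr[i]) / 2.0)

rng = np.random.default_rng(2)
gen, imp = rng.normal(0.7, 0.15, 5000), rng.normal(0.2, 0.15, 5000)
print(tmr_at_fmr(gen, imp, 0.01), eer(gen, imp))
```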
2.9 Impact of Vocal Style Factors
In this analysis, we perform three experiments on the E-Vector architecture with varying numbers of vocal style factors to determine the impact of the number of factors on speaker recognition. We chose 5, 10, and 20 factors and trained each model for 105,000 steps with the MSP-Podcast train set data. The 5-factor model is denoted E5, the 10-factor model E6, and the 20-factor model E7. Each model takes 47 hours to train with an Nvidia RTX 2080 Ti. The results are shown in Table 2.3. In addition, the DET curves are shown in Figure 2.6. Generally, we find that the TMR at FMR 1% improves across all test sets with the addition of more factors. Since speaker identity in our model is composed of many discrete vocal style factors and there may be potentially an infinite number of speaking styles, it is not practical to have one factor for every possible style. Even with the limited range of vocal style factors tested, we observe a significant increase in performance compared to the baseline ECAPA-TDNN models. That said, it is not clear that more factors are always better, and variation may exist across cultures, dialects, etc. Therefore, additional study into this hyperparameter is necessary.
Model | Number of VSFs | Test Set | EER | minDCF | TMR@FMR = {1%, 10%} | D' | AUC
E5 | 5 | Val | 0.13 | 0.049 | 60.0%, 84.9% | 2.19 | 0.94
E5 | | Test 1 | 0.22 | 0.062 | 46.4%, 70.5% | 1.61 | 0.87
E5 | | Test 2 | 0.16 | 0.068 | 39.8%, 76.7% | 1.93 | 0.91
E6 | 10 | Val | 0.13 | 0.054 | 55.5%, 84.0% | 2.14 | 0.94
E6 | | Test 1 | 0.20 | 0.061 | 46.2%, 72.0% | 1.58 | 0.87
E6 | | Test 2 | 0.15 | 0.068 | 38.7%, 77.1% | 1.95 | 0.91
E7 | 20 | Val | 0.12 | 0.048 | 60.8%, 86.2% | 2.21 | 0.94
E7 | | Test 1 | 0.21 | 0.062 | 46.9%, 70.0% | 1.62 | 0.87
E7 | | Test 2 | 0.17 | 0.068 | 40.8%, 76.1% | 1.89 | 0.91
Table 2.3 Comparison across three trained E-Vector models with 5, 10, and 20 VSFs. Each model is trained to 105,000 steps with the MSP-Podcast train set data.
[Figure 2.6: DET curve panels for Test Set 1, Test Set 2, and the Validation Set.]
Figure 2.6 Detection Error Trade-off (DET) curves comparing E-Vector with varying factor sizes.
2.10 Summary
Learning speaker identity in voice has traditionally been a 1-step process—training a model to learn speaker identity from speech samples. This has worked effectively in most neutral speech scenarios, but in the dynamic, affective scenario, the performance sharply degrades. In this work, we propose a 2-step process: first, learning global styles of speech patterns based on thousands of speaker identities, and second, representing speaker identity as combinations of those learned vocal style patterns. The E-Vector model architecture incorporates this 2-step view of speaker identity. The first step learns similarities (via multi-head attention) between thousands of people's speaking patterns and creates embeddings of those vocal styles. The advantage of the architecture is that it is not required to label those speaking patterns, although the patterns should be salient in the training data. This enables E-Vector to learn vocal patterns that we perhaps have not yet discovered or considered. Then, in the second step, the E-Vector model learns speaker identity, but only as a weighted combination of the aforementioned vocal styles. By modeling identity as a composition of learned vocal style factors, we find that our proposed method, E-Vector, outperforms state-of-the-art ECAPA-TDNN baseline models on the task of speaker recognition in affective scenarios where emotions have an impact on the speaking style. In addition, we explore the relationship between the number of vocal style factors used in training and the eventual performance.
With 20 VSFs in the E-Vector architecture, we are able to obtain a TMR of 46.9% at an FMR of 1% on MSP-Podcast test set 1, which is 19.3 percentage points higher than the best ECAPA-TDNN baseline. For future research, we propose a cross-dataset E-Vector speaker recognition experiment to incorporate acted emotion datasets such as IEMOCAP [2] and CREMA-D [4]. By doing so, we can gain a deeper understanding of speaker recognition in acted affective contexts through neural network models trained on natural emotion data (MSP-Podcast). Furthermore, conducting a comparative analysis between natural and acted emotions, as represented by these distinct datasets, could provide insightful observations about the range and variability of emotion expression.
2.11 Reproducibility
All training, evaluation, and analysis code and experiment results will be made available on our GitHub repository (https://github.com/morganlee123/evector). Trained models and embeddings have large file sizes and will be made available upon request (Email: sandle20@msu.edu or msandler8@gmail.com).
CHAPTER 3
ANALYSIS THROUGH EMOTION CATEGORIES
3.1 Introduction
E-Vector has demonstrated its efficacy in capturing speaker identity characteristics in affective speech samples. This is based on our hypothesis that speaker identity can be decomposed into vocal style factors (VSFs) and that those factors can be approximated through an unsupervised learning method. To better understand the E-Vector speaker identity embeddings, we visualize the speaker verification match scores computed in the previous chapter in the context of the underlying emotion categories for both discrete and continuous emotion models. We also conduct speech emotion recognition experiments to quantify any affective information encoded in the embeddings. We split this experiment into two tasks. In the first task, we re-use the E-Vector and ECAPA-TDNN models trained on the affective data from the MSP-Podcast train set to extract speaker identity embeddings. After extracting embeddings, we train an SVM classifier to assign each embedding to one of 8 discrete emotion categories. Our second task explores the effects of larger training corpora on the E-Vector architecture. We train E-Vector with primarily neutral emotion speech samples from the Librispeech, VoxCeleb 1, and VoxCeleb 2 datasets. We then perform a Speech Emotion Recognition (SER) experiment to map speaker identity embeddings to 4 emotion categories using affective data from the CREMA-D, IEMOCAP, and MSP-Podcast datasets. The following sections will detail the experimental methods, data partitions, results, and conclusions from these analyses.
3.2 Visualizations of E-Vector Speaker Verification Match Scores in the Context of Emotion Categories
In this analysis, we plot the speaker verification match scores in Figure 3.1 from the models trained in the previous chapter (E1-E4) in the valence-arousal space to explore speaker verification along the continuous emotion axes. In the figure, each symbol represents an emotion-emotion pairing (e.g., happy-neutral, happy-happy, etc.), with the color of the circle indicating the match score. The color scale is on the right-hand side of the figure. There are a total of 110 circles derived from 10 emotion categories (happy, sad, etc.), thereby creating 45 inter-emotion genuine scores, 45 inter-emotion impostor scores, 10 intra-emotion genuine scores, and 10 intra-emotion impostor scores. In general, inter-emotion speaker verification tends to be a more challenging scenario in which to verify identity.
Therefore, we focus on the improvement in inter-emotion speaker verification match scores. We found that the E-Vector model has more inter-emotion categories scoring higher values in genuine pairs (which is good), with inter-emotion impostor scores being lower than those of the other models (which is also good). This was previously noted as a bimodal genuine distribution apparent in the match score distributions (see Figure 2.4). In general, there are no specific value ranges in the arousal-valence space indicating that any emotion category provides higher or lower speaker verification performance. In the discrete emotion analysis in Figures 3.2, 3.3, and 3.4, we find that, across all emotion categories, there are no specific emotions responsible for the increase in recognition accuracy found in the previous chapter. However, it could be said that the emotions which depend on textual semantic information (sad, disgust, contempt) perform worse than ones that do not typically require it (anger, happy). In general, in this analysis we find that all affective scenarios in E-Vector have better separability between the genuine and impostor distributions (viz., Figure 2.4).
[Figure 3.1: twelve valence-arousal panels, one per model (E-Vector, ECAPA-TDNN/Vox, ECAPA-TDNN/MSP, ECAPA-TDNN/Vox+MSP) and evaluation set (Test Set 1, Test Set 2, Validation Set).]
Figure 3.1 Inter/Intra-Emotion Speaker Recognition Match Scores Visualized in the Valence-Arousal Emotion Space.
3.3 Quantifying Emotion in E-Vector and ECAPA-TDNN Embeddings
3.3.1 Task 1: E-Vector and ECAPA-TDNN MSP-Podcast Speech Emotion Recognition
To analyze the affective content in the speaker recognition model embeddings, we take the pre-trained ECAPA-TDNN and E-Vector models from the previous chapter (pre-trained on MSP-Podcast or VoxCeleb 1+2) and perform 5-fold cross-validated (CV) emotion recognition, using speaker identity embeddings extracted from the validation set, together with their labeled emotion categories, to train the speech emotion recognition (SER) model. Emotion recognition (ER) here refers to an eight-class recognition problem: given an input embedding encoding identity and potentially emotion, assign it to a discrete emotion category (in this case, sad, happy, neutral, anger, surprise, disgust, fear, and contempt). To train the models, a multi-layer perceptron (MLP) is used with 3 layers of 100 nodes each and ReLU activation functions for 300 epochs. The MLP uses the Adam optimizer. We chose this classifier empirically. For evaluating the emotion recognition models, we use test set 1 from MSP-Podcast. In the validation set used for training, there exists a significant imbalance of emotion categories; therefore, we randomly under-sample based on the emotion with the least number of available samples (approximately 130 per class) to balance the distribution during the training process. The evaluation sets are not balanced, and we employ the f-score metric.
3.3.2 Experiment Results
Previously, we hypothesized that E-Vector encodes speaker identity as a combination of vocal style factors. To investigate whether emotion information is also encoded in the representation, we conducted the aforementioned speech emotion recognition (SER) experiments and evaluated the results on the MSP-Podcast test 1 set. The results imply that E-Vector may not significantly encode emotion in the identity representation; emotion may be partially factored out through the weighted combination of VSFs, resulting in a lower f-score. In this scenario, a lower f-score is desirable, under the assumption that reduced emotion content trades off in favor of identity information. In contrast, ECAPA-TDNN, while still producing low f-scores, tends to achieve higher emotion recognition f-scores than the E-Vector model. Table 3.1 summarizes the metrics from this experiment.

Model                 ER (Chance is 0.12)   ER w/ Undersampling (Chance is 0.12)
E-Vector              0.18 +/- 0.03         0.18 +/- 0.03
ECAPA-TDNN/Vox        0.25 +/- 0.03         0.25 +/- 0.03
ECAPA-TDNN/MSP        0.20 +/- 0.03         0.20 +/- 0.03
ECAPA-TDNN/Vox+MSP    0.31 +/- 0.03         0.31 +/- 0.03
Table 3.1 ER denotes 8-class emotion recognition. Metric is f-score. In this scenario, a lower f-score is preferable; it may indicate that the speaker identity models do not encode emotion in the identity representation itself. The confusion matrices corresponding to this analysis may be found in the Appendix.

3.3.3 Task 2: E-Vector/Vox+Librispeech Speech Emotion Recognition Experiments
For the second task, we use an E-Vector model trained on the LibriSpeech, VoxCeleb 1, and VoxCeleb 2 datasets. This model extracts fixed 256-dimensional speaker embeddings from two datasets: CREMA-D and MSP-Podcast. A simple model (we found a multi-class Support Vector Machine (SVM) to be sufficient for this purpose, selected via Auto-Tuned Models [46]) is trained with these embeddings as input and the corresponding sample emotions as the output labels; there is one emotion per audio sample. The four emotion categories for this work are Angry, Sad, Happy, and Neutral. We choose these four emotions because they are considered "basic" emotions that cover the most frequent human interactions [14]. We design two classifiers for this purpose. First, we implement a single 4-class SVM as our baseline experiment. Second, we employ a hierarchical SVM consisting of two SVMs in sequence: the first distinguishes the Sad emotion from the other categories, while the second distinguishes between the Angry, Happy, and Neutral categories. Our motivation for hierarchical classification is to determine whether there is a traversable emotion embedding space encoded in the speaker identity embedding. We use this hierarchical approach to first differentiate the most "challenging" emotion from the rest; for example, Sad is difficult to differentiate from the Neutral state using style features [34]. Therefore, in Experiment 3, we first distinguish Sad and then classify the remaining emotions, as sketched below.
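The following is a minimal sketch of this two-stage, Sad-first classifier, assuming 256-dimensional embeddings and integer emotion labels are already available. The array names, label encoding, and the reuse of the hyper-parameters reported later (Section 3.3.3.5) are illustrative assumptions rather than the thesis implementation.

```python
# Minimal sketch of the Sad-first hierarchical SVM. The label encoding below
# (0=Angry, 1=Happy, 2=Neutral, 3=Sad) and array names are placeholders.
import numpy as np
from sklearn.svm import SVC

SAD = 3

def fit_hierarchical(X_train, y_train, C=1000, gamma=0.1):
    # Stage 1: binary classifier, Sad vs. everything else.
    stage1 = SVC(kernel="rbf", C=C, gamma=gamma)
    stage1.fit(X_train, (y_train == SAD).astype(int))

    # Stage 2: 3-class classifier (Angry/Happy/Neutral), trained on non-Sad samples only.
    non_sad = y_train != SAD
    stage2 = SVC(kernel="rbf", C=C, gamma=gamma)
    stage2.fit(X_train[non_sad], y_train[non_sad])
    return stage1, stage2

def predict_hierarchical(stage1, stage2, X):
    preds = np.full(len(X), SAD)            # default to Sad
    is_other = stage1.predict(X) == 0       # stage 1 says "not Sad"
    if is_other.any():
        preds[is_other] = stage2.predict(X[is_other])
    return preds
```

Stage 1 routes only the samples it labels as non-Sad to stage 2, mirroring the layout of Figure 3.6.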
These speech emotion classifiers are illustrated in Figures 3.5 and 3.6.

3.3.3.1 Experiment 1 — CREMA-D (Acted Data)
CREMA-D [4] is an audio-visual dataset consisting of 7,442 original samples from 91 actors. These clips are from 48 male and 43 female actors between the ages of 20 and 74, with a diversity of actor race and ethnicity (African American, Asian, Caucasian, Hispanic, and Unspecified). Each actor performs 12 sentences, each spoken with one of six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad). The emotion annotations are crowd-sourced from a total of 2,443 participants. For this experiment, we use 2,800 training samples and 1,280 testing samples from four emotion classes: Anger, Happy, Neutral, and Sad. The emotion classes are balanced, and the train and test sets contain disjoint speakers.

3.3.3.2 Experiment 2 — MSP-Podcast (Natural Data)
MSP-Podcast is an audio-only dataset of naturalistic emotion. The dataset is generated from spontaneous recordings obtained from audio-sharing websites that permit such use, and the emotion annotation is done via crowd-sourcing. In this experiment, we use 6,400 samples for training and 1,800 samples for testing, chosen only from four emotion classes: Anger, Happy, Neutral, and Sad. These classes are balanced, and the train and test sets contain disjoint speakers.

3.3.3.3 Experiment 3 — CREMA-D Hierarchical Classifier
In this experiment, we build a hierarchical classifier from two SVMs. We train the first SVM on 1,440 samples from two emotion classes: Sad and Other. The second SVM is trained on 2,160 samples from three emotion classes: Anger, Happy, and Neutral. The combined two-stage classifier is trained on 2,880 samples and classifies 1,280 test samples into four emotion categories: Anger, Happy, Neutral, and Sad. All emotion classes are balanced, and speakers are disjoint across all train and test sets.

3.3.3.4 Feature Extraction
The E-Vector network trained with VoxCeleb 1, VoxCeleb 2, and Librispeech is used to extract 256-dimensional speaker embeddings for each utterance. Our implementation uses a frame length of 22,000 samples and a hop length of 220 samples which, at the implementation's sampling rate of approximately 22 kHz, corresponds to 1-second frames with a 10 ms hop. Any audio sample shorter than 1 second is discarded. After extracting the embeddings for each utterance, we compute the average speaker embedding $e_{avg} = \frac{1}{n}\sum_{i=1}^{n} e_i$ from the $n$ utterance embeddings $e_i$.

3.3.3.5 Classifier
In this work, we use SVMs with radial basis function (RBF) kernels. The hyper-parameters are tuned using GridSearchCV, and in all SVMs the optimal parameters were found to be C = 1000 and γ = 0.1. Although the cost value may appear high, it leads to better generalization in the testing stage of the experiment; cost values between 10 and 1000 yield approximately the same result (within +/- 0.04 overall weighted f-score).

3.3.3.6 Architecture
Our general architecture is illustrated in Figure 3.5. In Experiments 1 and 2, we use a single SVM layer. In Experiment 3, we use a two-stage hierarchical SVM classifier as shown in Figure 3.6.
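As an illustration of the extraction and tuning steps above, the sketch below frames an utterance into 1-second windows with a 220-sample hop, averages the per-frame embeddings, and tunes an RBF SVM with GridSearchCV. The `evector_embed` stub, file handling, sampling rate, and the parameter grid are assumptions for illustration, not the thesis code.

```python
# Minimal sketch of utterance-level embedding extraction and SVM tuning.
import numpy as np
import librosa
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

FRAME_LEN, HOP_LEN, SR = 22_000, 220, 22_000   # ~1 s frames, ~10 ms hop

def evector_embed(frame):
    # Placeholder: replace with a forward pass of the trained E-Vector network.
    return np.zeros(256, dtype=np.float32)

def utterance_embedding(path):
    audio, _ = librosa.load(path, sr=SR)
    if len(audio) < FRAME_LEN:
        return None                                  # utterances shorter than 1 s are discarded
    frames = librosa.util.frame(audio, frame_length=FRAME_LEN, hop_length=HOP_LEN)
    embeddings = np.stack([evector_embed(f) for f in frames.T])   # (n, 256)
    return embeddings.mean(axis=0)                   # e_avg = (1/n) * sum_i e_i

# RBF SVM hyper-parameter search; the grid values are illustrative.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [10, 100, 1000], "gamma": [0.01, 0.1, 1.0]},
                    cv=5)
# grid.fit(X_train, y_train)   # X_train: stacked utterance embeddings, y_train: emotion labels
```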
Figure 3.5 Our E-Vector Speech Emotion Recognition Experiment Setup. (Block diagram: input audio is passed through the E-Vector model, a 1-D triplet CNN with vocal style factorization, to produce frame-level speaker embeddings that are averaged; during training the averaged embedding and its emotion label train the SVM, and during testing the averaged embedding of an input audio sample is passed to the emotion classification model.)

Figure 3.6 The hierarchical classifier setup. The first classifier is a binary SVM that separates a given emotion class (Sad) from the rest; samples that do not belong to that class are assigned to one of the remaining three categories (Happy, Angry, Neutral) by the second, multiclass SVM.

3.3.3.7 Results
From the findings in all experiments (Figure 3.7 and Table 3.2), we conclude that E-Vector speaker embeddings do not significantly encode emotion as a byproduct of the training algorithm. Therefore, we maintain that E-Vector speaker embeddings primarily encode speaker identity.

Figure 3.7 E-Vector/Vox+Librispeech Speech Emotion Recognition Experiments. The top row of confusion matrices uses the SVM classifier and the bottom row uses the hierarchical classifier. High recognition accuracy is not achieved in the emotion recognition domain, indicating that E-Vector does not significantly encode emotion information. All cells represent percentages.

Algorithm                           CREMA-D   MSP-Podcast   IEMOCAP
SVM                                 66.5      43.3          57.5
Sad-First Hierarchical Classifier   68.7      36.8          58.3
Table 3.2 E-Vector/Vox+Librispeech 4-class emotion recognition accuracy. Comparison of the hierarchical classifier to the baseline SVM model. Metric is standard accuracy; higher is better.

3.4 Summary
Through the extensive speech emotion recognition experiments conducted, we find that E-Vector embeddings may not contain significant emotion information, especially when trained only on affective data (MSP-Podcast). In addition, we evaluated the emotion content of embeddings from a model trained on large corpora of speech data (VoxCeleb 1 & 2 and Librispeech). We still find that E-Vector primarily encodes speaker identity, and perhaps not emotion, as a byproduct of its training algorithm. This demonstrates the robustness of the E-Vector model and the effectiveness of the vocal style factor technique.

CHAPTER 4
THESIS CONCLUSIONS AND FUTURE WORK

4.1 Research Contributions
1. This thesis explores the hypothesis that speaker identity is composed of various vocal style factors that may be decomposed and recombined. In this regard, a neural network model named E-Vector is developed and implemented.
2. We find that E-Vector decomposes input speech into multiple vocal style factors and can combine these factors to provide a more holistic representation of speaker identity in various affective scenarios.

4.2 Future Work
In recent literature, there has been a focus on improving speaker recognition models by learning the model weights from multiple modalities, such as text and audio, and then using only audio during the inference stage for identity assessment [42]. Therefore, in the affective domain, further work is necessary to obtain larger speech datasets with associated text transcripts to harness these capabilities. In addition to utilizing more modalities in the training process, a semi-supervised training scheme may be employed. Currently, the E-Vector architecture does not use any direct supervision via emotion labels to tune its weights; in other words, it is tuned solely by the GE2E loss [47]. Future work could add an "emotion" loss alongside the GE2E loss: the model would predict an emotion label, and the discrepancy with the true label would adjust the loss and the weights of the vocal style factors so as to further disentangle emotion and identity.
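Purely as an illustration of this future-work idea (not part of the thesis implementation), a combined objective could take the form L = L_GE2E + λ · L_emotion. The sketch below is hypothetical: the auxiliary head, the weighting factor λ, and whether the emotion term is applied as direct supervision or adversarially (e.g., via gradient reversal) to remove emotion information are all open design choices.

```python
# Hypothetical sketch of a joint objective: a GE2E-style identity loss plus an
# auxiliary emotion term. `ge2e_loss` stands in for an existing GE2E implementation.
import torch.nn as nn
import torch.nn.functional as F

class EmotionAuxHead(nn.Module):
    """Small auxiliary classifier that predicts an emotion label from a speaker embedding."""
    def __init__(self, embed_dim=256, num_emotions=8):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_emotions)

    def forward(self, embeddings):
        return self.fc(embeddings)

def joint_loss(embeddings, emotion_logits, emotion_labels, ge2e_loss, lam=0.1):
    # embeddings: (speakers, utterances, embed_dim), as expected by a GE2E-style loss.
    # emotion_logits / emotion_labels: per-utterance predictions and integer targets.
    l_id = ge2e_loss(embeddings)                           # identity objective
    l_emo = F.cross_entropy(emotion_logits, emotion_labels)
    # The sign/weighting of the emotion term is a design choice (supervised vs. adversarial).
    return l_id + lam * l_emo
```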
BIBLIOGRAPHY

[1] Zakaria Aldeneh and Emily Mower Provost. You’re not you when you’re angry: Robust emotion features emerge by recognizing speakers. IEEE Transactions on Affective Computing, 2021.
[2] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, et al. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359, 2008.
[3] C-Nedelcu. Talk to ChatGPT. https://github.com/C-Nedelcu/talk-to-chatgpt.
[4] Houwei Cao, David G. Cooper, Michael K. Keutmann, et al. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4):377–390, 2014.
[5] A. Chowdhury, A. Ross, and P. David. DeepTalk: Vocal style encoding for speaker recognition and speech synthesis. In ICASSP, 2021.
[6] Anurag Chowdhury. Automated Speaker Recognition in Non-ideal Audio Signals Using Deep Neural Networks. Michigan State University, 2021.
[7] Anurag Chowdhury, Austin Cozzo, and Arun Ross. Domain adaptation for speaker recognition in singing and spoken voice. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7192–7196. IEEE, 2022.
[8] Anurag Chowdhury and Arun Ross. Fusing MFCC and LPC features using 1D triplet CNN for speaker recognition in severely degraded audio signals. IEEE Transactions on Information Forensics and Security, 15:1616–1629, 2019.
[9] Anurag Chowdhury and Arun Ross. DeepVox: Discovering features from raw audio for speaker recognition in degraded audio signals, 2020.
[10] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366, 1980.
[11] Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19:788–798, 2011.
[12] Jiankang Deng, Jia Guo, Jing Yang, Niannan Xue, Irene Kotsia, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):5962–5979, 2022.
[13] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In Helen Meng, Bo Xu, and Thomas Fang Zheng, editors, Interspeech, pages 3830–3834. ISCA, 2020.
[14] Paul Ekman. Basic emotions. Handbook of Cognition and Emotion, 98(45-60):16, 1999.
[15] Florian Eyben, Klaus R. Scherer, Björn W. Schuller, et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190–202, 2015.
[16] Florian Eyben, Klaus R. Scherer, Björn W. Schuller, et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190–202, 2016.
[17] Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, 2010.
[18] Johnny R. J. Fontaine, Klaus R. Scherer, Etienne B. Roesch, and Phoebe C. Ellsworth. The world of emotions is not two-dimensional. Psychological Science, 18(12):1050–1057, 2007. PMID: 18031411.
[19] Alex Hadden. The identification of criminals by the Bertillon system. W. Res. LJ, 3:165, 1897.
[20] John H. L. Hansen and Taufiq Hasan. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Processing Magazine, 32(6):74–99, 2015.
[21] Hynek Hermansky. Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America, 87(4):1738–1752, 1990.
[22] Anil K. Jain, Arun Ross, and Salil Prabhakar. An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14(1):4–20, 2004.
[23] Mimansa Jaiswal and Emily Mower Provost. Privacy enhanced multimodal neural representations for emotion recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7985–7993, 2020.
[24] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. Advances in Neural Information Processing Systems, 30, 2017.
[25] Shashidhar G. Koolagudi, Kritika Sharma, and K. Sreenivasa Rao. Speaker recognition in emotional environment. In Eco-friendly Computing and Communication Systems: International Conference, ICECCS 2012, Kochi, India, August 9-11, 2012. Proceedings, pages 117–124. Springer, 2012.
[26] Richard S. Lazarus. Emotion and Adaptation. Oxford University Press, 1991.
[27] Haoqi Li, Ming Tu, Jing Huang, Shrikanth Narayanan, and Panayiotis Georgiou. Speaker-invariant affective representation learning via adversarial training. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7144–7148, 2020.
[28] R. Lotfian and C. Busso. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, 10(4):471–483, October-December 2019.
[29] Richard J. Mammone, Xiaoyu Zhang, and Ravi P. Ramachandran. Robust speaker recognition: A feature-based approach. IEEE Signal Processing Magazine, 13(5):58, 1996.
[30] Simon Angelo Meier. Medical diagnosis using voice samples: What the voice reveals about health. https://www.archyde.com/medical-diagnosis-using-voice-samples-what-the-voice-reveals-about-health/.
[31] Rushab Munot and Ani Nenkova. Emotion impacts speech recognition performance. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 16–21, 2019.
[32] Ali Bou Nassif, Ismail Shahin, Ashraf Elnagar, Divya Velayudhan, Adi Alhudhaif, and Kemal Polat. Emotional speaker identification using a novel capsule nets model. Expert Systems with Applications, 193:116469, 2022.
[33] Department of Homeland Security. Biometrics. https://www.dhs.gov/biometrics.
[34] Astrid Paeschke, Miriam Kienast, and Walter F. Sendlmeier. F0-contours in emotional speech. 1999.
[35] Raghavendra Pappagari, Tianzi Wang, Jesus Villalba, et al. x-vectors meet emotions: A study on dependencies between emotion and speaker recognition. In ICASSP, pages 7169–7173, 2020.
[36] Srinivas Parthasarathy and Carlos Busso. Predicting speaker recognition reliability by considering emotional content. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pages 434–439, 2017.
[37] Marc D. Pell, Abhishek Jaywant, Laura Monetta, and Sonja A. Kotz. Emotional speech processing: Disentangling the effects of prosody and semantic cues. Cognition & Emotion, 25(5):834–853, 2011.
[38] Rosalind W. Picard. Affective Computing. 2000.
[39] Robert Plutchik and Henry Kellerman, editors. Emotion: Theory, Research, and Experience, volume 1–5. Academic Press, 1980–1990.
[40] Mirco Ravanelli and Yoshua Bengio. Speaker recognition from raw waveform with SincNet, 2019.
[41] Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, et al. SpeechBrain: A general-purpose speech toolkit, 2021. arXiv:2106.04624.
[42] Seyed Omid Sadjadi, Craig Greenberg, Elliot Singer, Lisa Mason, and Douglas Reynolds. The 2021 NIST speaker recognition evaluation. arXiv preprint arXiv:2204.10242, 2022.
[43] Nikola Simić, Siniša Suzić, Tijana Nosek, Mia Vujović, Zoran Perić, Milan Savić, and Vlado Delić. Speaker recognition using constrained convolutional neural networks in emotional speech. Entropy, 24(3):414, 2022.
[44] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329–5333, 2018.
[45] Tengfei Song, Wenming Zheng, Peng Song, and Zhen Cui. EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Transactions on Affective Computing, 11(3):532–541, 2020.
[46] Thomas Swearingen, Will Drevo, Bennett Cyphers, et al. ATM: A distributed, collaborative, scalable system for automated machine learning. In IEEE International Conference on Big Data, Boston, MA, USA, pages 151–162, 2017.
[47] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883. IEEE, 2018.
[48] Yuxuan Wang, Daisy Stanton, Yu Zhang, et al. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis, 2018.
[49] Jennifer Williams and Simon King. Disentangling style factors from speaker representations. In Interspeech, pages 3945–3949, 2019.
[50] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, 2020.
[51] Kun Zhou, Berrak Sisman, Rajib Rana, Björn W. Schuller, and Haizhou Li. Speech synthesis with mixed emotions. IEEE Transactions on Affective Computing, 2022.

APPENDIX
In addition to reporting the f-scores of each experiment from Chapter 3, Task 1, we provide the confusion matrices from the SER experiments conducted.

Figure A.1 E-Vector/MSP-Podcast SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.18 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.2 ECAPA-TDNN/VoxCeleb1+2 SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.25 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.3 ECAPA-TDNN/MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.20 +/- 0.03. Values shown in each cell denote a percentage.
Figure A.4 ECAPA-TDNN/Vox+MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.31 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.5 E-Vector/MSP-Podcast SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.18 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.6 ECAPA-TDNN/VoxCeleb1+2 SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.25 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.7 ECAPA-TDNN/MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.20 +/- 0.03. Values shown in each cell denote a percentage.

Figure A.8 ECAPA-TDNN/Vox+MSP SER Experiment Confusion Matrix. 8 emotion classes are used for classification. MSP-Podcast Validation Set is used for training, Test Set 1 is used for evaluation. Results are obtained via 5-fold cross validation. Mean f-score = 0.31 +/- 0.03. Values shown in each cell denote a percentage.