REAL-TIME HUMAN/GROUP INTERACTION MONITORING PLATFORM INTEGRATING SENSOR FUSION AND MACHINE LEARNING APPROACHES By Sylmarie Dávila-Montero A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical and Computer Engineering - Doctor of Philosophy 2022 ABSTRACT A person’s social intelligence impacts their physical and mental health, and the productivity levels of the individuals involved, for example, in workplace interactions. To promote successful social interactions, this dissertation explores the use of sensor technology and machine learning algorithms to monitor and quantify nonverbal behavior indicators in real time. This dissertation conducts extensive convergence research between psychology, communication science, and engineering and establishes a new real-time human/group interaction monitoring platform. From sensor selection to data collection and algorithm design, existing human behavior monitoring systems vary widely in the type of methods employed for their design. Many of these systems were trained with data collected in controlled environments, making them not practical for real-life scenarios. Moreover, existing systems lack the capabilities needed to recognize behaviors in a manner that could support machine-augmented social intelligence. To address these issues, the developed human/group interaction monitoring platform combines a real-time enabled multi- sensor system with a machine learning framework that establishes training and algorithm design methods for behavior recognition. Methods for the execution of human studies, collection of natural human behavior data, and data annotation procedures were also established to train machines to recognize human behaviors impacting the quality of social interactions. The contributions of this dissertation, which can be universally applied to other behavior studies, will advance the design of human behavior monitoring systems for group interactions and facilitate future real-time feedback to increase self-awareness and promote successful social interactions. To my dearest family, both on earth and in heaven… iii ACKNOWLEDGEMENTS I would like to thank Michigan State University Graduate School, the National Science Foundation, the GEM Consortium, and the SLOAN program for their financial support. I also thank my supervisory committee: Dr. Selin Aviyente, Dr. Erin Purcell, Dr. Angela Hall, Dr. Gary Bente, and Dr. Andrew J. Mason for their support, guidance, and ideas throughout this process. Special thanks to Dr. Bente for his continuous advice on the psychophysiological aspects of this work and the social aspects of experimental design; and Dr. Hall for her economic support to perform human studies and provide assistance with experimental design and human resources to help me with data labeling. My immense gratitude goes to my supervisory committee chair and advisor, Dr. Andrew J. Mason, for giving me the opportunity to work with him, first as an undergraduate summer research intern and then as part of his research group as I worked towards my doctoral degree. It has been an honor to learn from him and grow as a professional under his guidance. I am grateful for his support, advice, mentorship, trust, and patience. I would also like to thank the MSU SLOAN and the MSU AGEP communities. More specifically, I would like to thank Dr. Percy Pierre, Dr. Nelson Sepulveda, and Steven Thomas for their academic guidance and financial support during this process. To Dr. 
Nelson Sepulveda, I will be forever grateful for his constant support and advice. Thanks to my lab mates, past and current, for their constant feedback, collaboration, and support during this process. Special thanks to my friends, here in Michigan and far away. Keilyn Vale, Yeilyn Vale, Alex Román, Dr. Adrian Ildefonso, Dr. Lisaura Maldonado, and my writing group (Mara Cuebas, Lizbeth Dávila, Nichole Montero, and Gabriela Ortiz), thanks for all the love and non-stop support. Very special thanks to Dr. Keisha Castillo-Torres for your unconditional iv support, love, feedback, and motivation every step of the way, even when miles away. I could not have asked for a better friend and colleague for this journey. We did it! Thanks to my family. To my parents, Marilyn Montero-Caro and José L. Dávila-Estrada, from which I learned the value of education and hard work, thanks for your unconditional love, support, visits to help me cope with stressful moments, and words of wisdom. Thanks to my sisters, Bianca, Genna, and Josnaly, and my cousin Eidy Montero for their constant support and admiration, which has motivated me to be the best version of myself. I would also like to thank my grandparents whose support has meant the world. To Miriam Montero, thanks for taking the time to learn about my research, your interest and feedback in my work have helped me become a better communicator. To my life partner, Ajay Delarosa, thanks for supporting me in all the ways that you did during my writing process. Your words of admiration, motivation, and respect were fuel to finish my work. I am blessed to have you by my side. Thanks to my previous mentors for motivating me to pursue this adventure: Baldomero Llorens, Dr. Domingo Rodriguez, Dr. Nestor Rodriguez, and Dr. Nayda Santiago. Special thanks to Dr. Santiago for always believing in me and being one of the best role models any women engineer could ask for. Thanks to everyone that has not been mentioned but that has walked a piece of this adventure with me by providing me feedback, editing my essays and papers, and/or supporting me in any possible way. But most importantly, my biggest gratitude goes to God because his presence in my life has kept me focused on my purpose: to use my talents to contribute positively to the life of those I get to cross paths with. v TABLE OF CONTENTS 1. INTRODUCTION ................................................................................................................... 1 1.1. Social Awareness: Challenges and Opportunities ............................................................ 1 1.2. Social Behavior Monitoring Technologies: Challenges and Opportunities ..................... 2 1.3. Requirements and Challenges in the Design of Real-time Monitoring Technologies ..... 4 1.4. Goals................................................................................................................................. 6 1.5. Outline .............................................................................................................................. 6 2. BACKGROUND ..................................................................................................................... 7 2.1. Human Behavior and its Effectors ................................................................................... 7 2.2. Theories and Concepts of Human Behavior .................................................................... 9 2.3. 
Methods for Monitoring Human Behaviors ................................................................... 19 2.4. Biometric Technologies and its Components................................................................. 20 2.5. Summary ........................................................................................................................ 27 3. DESIGN OF A MULTI-SENSOR SYSTEM WITH A MACHINE LEARNING FRAMEWORK TO MONITOR GROUP INTERACTIONS IN REAL TIME .......................... 28 3.1. Measuring the Quality of Group Interactions ................................................................ 29 3.2. Deep Analysis of Technology for Behavior Monitoring ............................................... 36 3.3. Design of a Multi-Sensor System with a Machine Learning Framework to Monitor Group Consonance using the Rapport Theoretical Model ........................................................ 78 3.4. Summary ........................................................................................................................ 93 4. SIGNAL PROCESSING FOR THE RECOGNITION OF LOCAL TRANSFORMED FEATURES: FROM DATA COLLECTION TO ALGORITHM DESIGN ................................ 95 4.1. Real-Time Detection of Head Actions using IMUs ....................................................... 95 4.2. Real-Time User-Independent Speech Intonation Recognizer ...................................... 106 4.3. Overall Discussion ....................................................................................................... 134 4.4. Summary ...................................................................................................................... 136 5. SOCIAL INTERACTION STUDY: METHODS, DESCRIPTION OF DATA COLLECTION, AND ANALYSIS ............................................................................................ 138 5.1. Study Methods.............................................................................................................. 138 5.2. Data Collection and Description .................................................................................. 144 5.3. Summary and Analysis of Questionnaires’ Data ......................................................... 145 5.4. Data Labeling ............................................................................................................... 155 5.5. Processing IMU Data using the Designed HAD Unit and Preliminary Establishment of Rapport Relationship ............................................................................................................... 161 5.6. Overall Discussion ....................................................................................................... 169 5.7. Summary ...................................................................................................................... 171 6. SUMMARY AND FUTURE WORK ................................................................................. 172 6.1. Summary ...................................................................................................................... 172 6.2. Contributions ................................................................................................................ 174 6.3. Other Achievements ..................................................................................................... 178 vi 6.4. Applications and Social Implications........................................................................... 179 6.5. 
Future Work ................................................................................................................. 181 BIBLIOGRAPHY ....................................................................................................................... 183 APPENDIX A: TOPIC QUESTIONNAIRE .............................................................................. 205 APPENDIX B: EMOTIONAL STATE QUESTIONNAIRE..................................................... 209 APPENDIX C: RAPPORT QUESTIONNAIRE ........................................................................ 212 APPENDIX D: AVAILABLE RESOURCES GENERATED BY THIS WORK ..................... 215 vii 1. INTRODUCTION 1.1.Social Awareness: Challenges and Opportunities Teamwork and social interactions are at the core of new discoveries, product developments, and general successful organizational outcomes. In the U.S. alone, 55M team meetings are estimated to be carried out per day, costing organizations ~$1.4T/year [1]. However, of those, ~$250B/year is wasted on team meetings that have poor outcomes [1]. Research has shown that the quality of social interactions within a group can either foster team effectiveness where individuals work well together or can encourage teams to fall apart [2]. Therefore, ineffective social interactions are one of the principal causes of poor team productivity. Social interactions are constructed and influenced by the human behaviors of two or more individuals. An aspect of human behaviors that can degrade the quality of social interactions is unconscious biases. Throughout a team meeting, unconscious biases can cause behavioral events such as interruptions or ostracization (exclusion or the act of ignoring) towards another team member. To a certain extent, we have grown accustomed to these types of social behaviors and the biases that cause them. However, they can have a negative impact on the individuals experiencing these behavioral events eventually affecting, not just the outcomes of an organization but also, the individuals’ health. In fact, the quantity and quality of social interactions influence a range of health conditions including cardiovascular diseases, compromised immunity, and depression [3]– [5]. Unconscious biases, also known as implicit biases, are defined as social stereotypes or attitudes held subconsciously about certain groups of people that affect the way individuals behave around them. Unconscious biases are more prevalent than conscious prejudice, which is bias people know they have and intentionally act upon. The actions resulting from unconscious biases 1 can lead to microinequity or microaggressions [6]. Microinequity refers to demeaning or marginalizing someone, whereas microaggression is the act of expressing prejudice against a marginalized group or person. Many of our behaviors, including unconscious bias behaviors, are motivated by unconscious triggers and emotions [7]. Hence, research has suggested that unconscious biases can be prevented by increasing our social awareness, which includes being self-aware of our emotions, intentions, and ways in which we communicate with each other. The simplest way of assuring effective social interactions is by individuals becoming more self-aware of their behaviors and environmental stimuli, i.e., by improving their situational awareness. Our perception and awareness of our behaviors and the behaviors of others play an important role in our daily lives and the quality of social interactions [7], [8]. 
However, it is known that as humans, our awareness and perception of our environment and the behaviors of others and ourselves can be limited by a variety of factors. For instance, various studies have shown that even when we are able to perceive intentions or environmental stimuli, we may not always be processing them in our conscious mind, making us unaware of the event [9]. In fact, psychologists and social and behavioral scientists agree in that much of what we do on a daily basis is unconscious [10], [11]. Thus, a step toward improving an individual’s social awareness is to apply technology to study and monitor in real time their human behaviors and the behaviors of those around them. 1.2.Social Behavior Monitoring Technologies: Challenges and Opportunities Modern sensor technologies have permitted the objective assessment of behaviors that influence the well-being of humans, such as physical activity [12], sleep patterns [13], stress levels [14], food intake patterns [15], and social interactions [16]–[20]. As the understanding increases of how social interactions influence human well-being and productivity, a variety of sensor technologies and computational methods have been applied for the study and recognition of human 2 behaviors that influence the quality of group interactions, both for in-person and virtual interaction environments. In general, rudimentary technologies exist for the monitoring of individual-level behaviors and aspects of group-level behaviors [21]. Individual-level monitoring technologies focus on the recognition of emotions; their purpose is to increase emotional awareness to enhance aspects of inter-personal attraction, physical presence, and social presence [22]–[24]. On the other hand, group-level monitoring technologies focus on the recognition of conversation dynamics and attention with the end-goal of increasing balance participation and improving collaboration and group performance [25], [26]. Still, the real-time monitoring of complex social interaction dynamics, such as unconscious biases, that have a major impact on the well-being of humans, requires the integration of both individual-level and group-level behaviors. In addition, most of the existing technologies lack real-time capabilities and system features, such as a feedback framework or mechanism, that will permit the enhancement of social interactions in real time. Furthermore, many such systems are not configurable for the monitoring of complex human behaviors that could lead to identifying complex group dynamics and lack of awareness. In order to achieve a combination of individual-level and group-level behavior recognition, and create a system that will allow real-time feedback, a platform capable of collecting data from natural human interactions and processing in real time behavioral cues from multiple individuals is necessary. Figure 1 illustrates the concept of real-time group behavior monitoring technologies improving awareness during social interactions. Here, the technology captures human behavior data during interactions and provides informative real-time feedback to everyone regarding the social ecosystem to improve each user’s awareness of individual, dyadic, and group behaviors. Feedback messages could relate to conversation dynamics (e.g., who is dominating the 3 Figure 1. Technology to monitor individual and group behaviors during a social interaction can measure aspects of conversation dynamics, levels of attention, and levels of emotional arousal. 
Being aware of our behaviors and the behaviors of others has been shown to help improve social interactions, positively impacting individuals’ wellbeing, organizational outcomes, etc. © 2021, IEEE. conversation and number of interruptions), levels of attention, or even the levels of emotional arousal affecting group dynamics or that can be related to implicit bias behaviors. Still, the information that could be fed back to the individuals involved in the interaction greatly depends on the capabilities of the social behavior monitoring technologies. 1.3.Requirements and Challenges in the Design of Real-time Monitoring Technologies The design and implementation of a group behavior monitoring system to improve social interactions face theoretical and technical challenges. Humans communicate consciously and unconsciously using multiple channels, i.e., gestures, movements, tone of voice, etc. Therefore, the effective monitoring of complex group interaction dynamics requires the recognition of human behaviors through various sensing modalities that allow the integration of communication patterns’ information and emotional states. Furthermore, data from natural human interactions should be utilized to design computational models embedded in behavior monitoring systems. However, existing behavior monitoring systems lack sensing modalities, frameworks for data collection, or 4 computational capabilities needed to recognize in real time complex social dynamics, potentially because the design of these systems presents numerous challenges. The challenges that need to be overcome to achieve a functional group behavior monitoring system to increase self-awareness and enhance social interactions can be defined as follows: • Group interactions are complex and include a combination of individual behaviors and dyadic behaviors. Thus, monitoring group interactions using sensor technology requires the understanding of psychological and communication theories at the individual, dyadic, and group levels and their application to the engineering design of a system. Methods that could quantify the quality of a group interaction using technology are still an area to investigate and no well-established methods exist. • Real-time monitoring of individual behaviors and group interactions requires coordination of hardware and software components to perform automatic synchronization and processing of data collected across individuals in the interaction. Challenges in this area include the combination of sensing modalities, data synchronization across modalities and sensor nodes, and the selection of optimal signal processing parameters and computational models for the recognition of individual human behaviors and group interactions. • Effective recognition of behaviors requires computational models trained with data from natural human behavior interactions. Challenges in this area include the design of data processing modules with variations in data sources, lack of guidelines and/or infrastructure for data collection that informs methods for performing human studies, and management of data preparation/annotation. 5 1.4.Goals The goal of this project is to establish a platform through which the challenges described in Section 1.3 can be solved to bridge the gap between psychology, communication science, and engineering. Such a platform will be bringing individual, dyadic, and group-level information to the design of group behavior monitoring systems. 
The achievement of this goal will allow the implementation of a multimodal system to monitor, in real time, non-verbal and physiological behavior indicators; it will also facilitate real-time feedback to promote successful social interactions by bringing awareness to our unconscious behaviors. 1.5.Outline This thesis is organized as follows: Chapter 2 presents the psychological theories and concepts that underpin the analysis of social interactions and methods to monitor them; Chapter 3 presents the design of a framework for the data collection of nonverbal indicators of individual behavior and group interaction using sensor technology and a framework for the processing of collected data; Chapter 4 presents initial data collection studies, processes to prepare the collected data for future processing, and base signal characteristics and algorithms for the recognition of nonverbal indicators of human behavior found in speech and body motion signals; Chapter 5 presents the design and execution of a social interaction study and relationship of base nonverbal behaviors with reported rapport experienced, and Chapter 6 presents contributions of this dissertation and future work. 6 2. BACKGROUND The multidisciplinary nature of the goals of this dissertation work spans from social sciences to engineering technology. Therefore, this chapter is designed to give a sense of the social science theories and concepts that have guided the psychological study of human behavior and social interactions. In addition, this chapter also describes the methods that have been employed to monitor human behaviors and the elements involved in using technology for such end. 2.1.Human Behavior and its Effectors Humans are a highly social species expressing individual behaviors that, when accumulated, create the social environment in which society operates. In general, human behavior is driven by personal factors such as thoughts and emotions that are influenced by our environment and social interactions. Social interactions, constructed by the behaviors of two or more individuals, are highly complex and play an important role in our health and survival [3]. Behavior is generally defined as the “observable consequences of the choices a living entity makes in response to external or internal stimuli” [27]. Internal stimuli could be a person’s thoughts, memories, perceptions, or attitudes, while external stimuli come from the environment the person interacts with, including social interactions. In humans, depending on the level of situational and personal awareness that they possess, responses to external and internal stimuli (effectors of human behaviors) can be voluntary or involuntary. Figure 2 shows the dynamics of the effectors of human behavior, which can include personal factors and components of social interactions. As illustrated in Figure 2, personal factors are inside the person. They can come from a person’s biology or psychology. Personal factors that come from a person’s psychology are in the mind and are not externally observable; however, the behaviors a person expresses because of the influences of their psychological personal factors are directly observable. Social behaviors, a 7 Figure 2. Diagram describing the effectors of human behavior and their dynamics. In short, given an environment, personal factors influence human behaviors, which influence our social behaviors affecting how we communicate during social interactions. 
In a reciprocal loop, the elements involved in social interactions influence back our personal factors, which influence our human behaviors and so on. © 2021, IEEE. subset of human behaviors that are specifically directed at other people or that involve social action, are directly observable. Communication, both verbal and nonverbal, is a vital aspect of social interactions and is directly observable. As illustrated in Figure 2, social behaviors strongly influence communication dynamics during a social interaction while, in a reciprocal loop, our social behaviors get influenced by our social interactions. Part of this idea is captured by the well- established Social Cognitive Theory (SCT), which contends that individuals’ perceptions of their environment can influence their emotional, physiological, and behavioral reactions [28], [29], subsequently influencing future behaviors in a reciprocal loop. To properly understand the technology developed to monitor human behavior, one must first understand the personal factors that underpin human behaviors and the theories and concepts that have guided the psychological study of social interactions. These two topics are briefly summarized below to provide a scholarly foundation for the research described in this work. 8 2.2.Theories and Concepts of Human Behavior 2.2.1. Personal Factors The psychological factors that have been commonly studied that contribute to human behavior are affect and dispositions. Affect generally means anything related to a person’s emotions or moods, and it can be divided into two categories: states and traits. State affect is an emotion or mood that is experienced in a certain moment, whereas trait affect is a more enduring part of one’s personality. Emotions are mental and physiological experiences of feeling that are acutely experienced (intense) and discrete in that they have a beginning and an end point, while moods refer to the positive or negative feelings that are in the background of our everyday experiences; these are diffuse (not acutely experienced) and longer-lasting states than emotions; however, they are not as enduring as trait affect. Trait affect is part of one’s personality – it is a tendency to experience certain emotions and moods in general. For example, someone might have a negative affectivity trait, which is the tendency to experience negative moods and emotions more often than others. Together, these states of emotional experiences and traits constitute affect. A disposition in the social sciences is thought of as a natural proclivity (biological or psychological) to respond to situations in a particular way. Because dispositions are “natural” and inherent in the person, they are thought to be the most stable and enduring phenomenon studied in the psychological sciences that are discussed in this chapter (i.e., more enduring across time than state affect, attitudes, and behaviors). However, despite their stability across time, dispositions do not relate to behavior with perfect consistency because there are environmental factors that also influence behavior. For example, a person might have a biological disposition to develop a psychological disorder, but through certain training environments like therapy, they are able to override their disposition to develop the disorder. 
For another example, someone may be 9 genetically predisposed to have a reserved personality, but they are put in social environments that constantly require them to talk to others, so they override their genetic predisposition. Dispositions influence human behavior more when the situation is weak, like when a person is in a casual social interaction. On the other hand, dispositions influence human behavior less when the social situation is strong and enforces certain norms, such as a professional environment in which all individuals are expected to behave in a certain way regardless of their personalities [30]. This Section will focus on personality traits, which are influenced by dispositions as well as by the environment [31]. Lastly, another important personal factor that impacts behavior is attitudes. An attitude is a psychological tendency to evaluate a particular target with some degree of favor or disfavor [32]. The “target” could be another person or a non-living thing such as a food, brand, or idea. An attitude, at its core, is an evaluation. Thus, it differs from affect and dispositions. Whereas affect could include an emotion that arises in response to a target, an attitude is a feeling towards the target and a set of judgments about the target. Attitudes are more enduring than a state but less enduring than a trait or disposition. Figure 3 shows the relationship between these personal factors and time. The duration of these factors and the interactions between them play an important role in understanding how technology can be used to understand human behaviors. It was contended that there is a lack of research using behavior monitoring technologies to study the role of attitudes in human behavior during social interactions. Thus, next, a review of psychological theories that delve specifically into the explanation of state affect and personality traits is presented. 10 Figure 3. Diagram describing the personal factors that influence human behavior and how they manifest through time. Personal factors include affect, attitudes, and dispositions, of which, affect and dispositions are the most studied. Affect is divided into states and traits. State affect is related to acute emotions and mood, in contrast to trait affect which is related to a human’s disposition to experience positive or negative emotions and is a more enduring part of human personality. © 2021, IEEE. 2.2.1.1.State-Affect Related Theories Discrete, acute emotions often provoke a person to mentally narrow in on a specific action or set of actions. For example, the experience of fear leads to the activation of thoughts in the mind about defending oneself or running away (also known as “fight-or-flight” response), and the experience of interest can activate a person’s thoughts aimed at exploring and taking in new information [33]. “Activations of thought” driven by emotions can occur subconsciously. Indeed, the body mobilizes physiological resources to complete these actions without the person’s conscious awareness. Based on the explored idea that emotions reflect responses of the sympathetic nervous system [34], the Polyvagal Theory explains how state affect alters brain processes and biological processes that occur in the rest of the body [35], [36]. In addition, this theory provides insights into the relationship between measurable physiological states, linked to the autonomic and central nervous systems, and the resulting human behavior, suggesting a bidirectional relationship between the brain and the body. 
It also suggests that the environment affects behaviors that consequently alter physiological states. Thus, monitoring changes in the physiological states of the human body, such as respiration rate, heart rate, and perspiration rate, among others, can provide insights into the affective state of an individual [37]. Likewise, monitoring environmental conditions can provide information on how the environment influences emotional states and other factors. Emotions can be understood to fall somewhere along two orthogonal dimensions: (1) how pleasurable the emotion is, and (2) how much arousal or activation the emotion involves. As shown in Figure 4, emotions are commonly arranged in a circumplex model of affect [38] according to where they fall on both dimensions.

Figure 4. A typical circumplex model of affect that describes affective states using two fundamental neurophysiological systems: valence and arousal. Valence describes the level of pleasure or displeasure of an emotion, while arousal describes its level of activation. Emotions in blue represent the four most commonly studied emotions in affective computing. © 2021, IEEE.

For example, excitement is an emotion that is pleasurable and high on arousal, whereas calmness is an emotion that is pleasurable and low on arousal. Anger and fear are unpleasant, high-arousal emotions close together on the circumplex, whereas boredom is a low-arousal unpleasant emotion. The circumplex model of affect is a mainstream and well-established theory. However, other dimensional models of emotions have also been used to study emotional states, such as the Pleasure, Arousal, and Dominance (PAD) emotional state model [39] that, in addition to modeling emotions on a valence-arousal scale,
Extraversion, in the sense of the FFM, is the “tendency to be social, talkative, energetic, and active” [43]. It was found that among the personality factors, extraversion has the strongest relationship with leadership (both being recognized as a leader by others and being 13 effective as a leader) [44]. On the other hand, agreeableness tends to be a catchall factor related to aspects of personality that are likable and harmonious with others, such as being trusting of others, polite, empathetic, and compliant [45]. The last one of the Big Five is Neuroticism, which is often labeled as its opposite instead, emotional stability. Those who are high on neuroticism are more likely to experience negative emotions like anxiety, anger, irritation, frustration, and jealousy. On the other hand, having low neuroticism or high emotional stability means that a person tends to be more even-keeled, calm, and unwavering (not necessarily positive or enthusiastic). There are many other taxonomies of personality, such as the HEXACO model [45] which breaks the agreeableness factor of the FFM into agreeableness and humility. The Dark Triad is another taxonomy that has only three undesirable personality traits: narcissism, Machiavellianism, and psychopathy [46]. Another trait is locus of control, which describes the extent to which individuals believe that they control their own outcomes as opposed to having their successes and failures determined by external forces [47]. So far, the traits covered by the FFM and the locus of control have been studied using human behavior monitoring technologies. In general, the activation of these traits during social interactions contributes to observable social behaviors that make up the social environment that people operate in. 2.2.2. Communication One of the most important factors influencing our human behavior is the social behavior of our interaction partners. We might think of it as a situational factor, but this would ignore the dynamic nature of mutual adaption within the communication process. By definition, “communication is a transactional process in which people generate meaning through the exchange of verbal and nonverbal messages in specific contexts, influences by individual and societal forces and embedded in culture” [48]. Verbal communication refers to the use of spoken and written 14 language (words). It is usually organized in distinct on-off patterns of messages or utterances with iterating sender and receiver (speaker, listener) roles. The use of words requires a shared explicit code usually to be found in a dictionary. Spoken language, however, can also carry implicit, so- called paraverbal, information, for instance, encoded in the floor possession and pausing or in prosodic features such as pitch, speed, and volume of the vocal output. Nonverbal communication comprises all aspects of communication that are not encoded into words. In stark contrast to verbal communication, nonverbal communication is continuous, i.e., always on, and largely implicit, i.e., it lacks a dictionary and is produced and processed widely automatically and unconsciously. Therefore, it is hard to control, and its effects impose on the observer with an irrefutable force. Even a lack of nonverbal expressions is interpreted by the observer, for instance as disinterest. In this sense, it has been said that “we cannot not communicate” [49]. 
It has been argued that our social perception and impression formation is much more dependent on nonverbal cues than on verbal behavior and that nonverbal communication can be conceived as meta-communicative [49], in the sense that it even largely defines how we understand and interpret the spoken words. Thus, even in the presence of verbal communication, successful communication largely depends on the efficient use of nonverbal communication channels [50]. As nonverbal communication largely withdraws deliberate manipulation, it is supposed to provide information about unobservable processes such as the individual’s emotional state, intentions, personality traits, etc. [51]. The view that nonverbal communication is a reliable source of true information, although still under debate [52], has made it the focus of study of many areas dedicated to understanding social behavior and the human mind. For example, the social signal processing [53], [54] and affective computing [55] literature, areas of engineering and computer 15 science that study social interactions and human emotions, respectively, focus on the messages produced by nonverbal communication channels. They are better known in those areas as “social signals”; a notion that was first introduced in the field of computational social science and organization engineering [56]. Thus, here the focus is on reviewing the most used nonverbal communication channels. Due to its unique level of complexity, the analysis of nonverbal communication poses considerable methodological challenges [57]. Nonverbal communication implies multiple channels and serves various functions. As illustrated in Figure 5, nonverbal communication includes gestures, body movements, and postures [58]–[61], facial expressions [62], [63], and eye gaze [60], among others. This work treats paraverbal communication, such as prosody, pitch, volume, and intonation [64], [65], under the broader construct of nonverbal communication. Nonverbal communication comprises attentional functions, interpretations, and most importantly the regulation of interpersonal relations. We distinguish three, distinct, yet interdependent functions of nonverbal communication [66]: (1) discourse functions, (2) dialog functions, and (3) socio-emotional functions that influence our social behaviors. Discourse functions are closely related to speech production and understanding. Emblems, pointing gestures, illustrative gestures, and beat gestures belong to this functional category [67]. But also, prosodic aspects, such as pausing and variations in voice pitch and volume. In general, they influence aspects of interpersonal communication and engagement that includes listener attention, interest, understanding, and interpretation. Dialogue functions include turn-taking signals (e.g., eye contact, raise of voice, pausing) and back-channel signals (e.g., head nods, ‘uh-huh’, etc.), which serve to smooth the flow of interaction when exchanging speaker and listener roles [68]. In addition, dialogue functions influence aspects 16 Figure 5. Elements of communication relevant to social behaviors during interactions modeled by an exchange of verbal and nonverbal messages. Verbal communication involves the use of words through written or spoken language. Nonverbal communication involves the use of gestures, facial expressions, paraverbal communication, eye movements, visual contact, body movements, posture, and interpersonal distance. © 2021, IEEE. 
of social interactions that include communication patterns, conversation dynamics, and the level of interaction between individuals. Dialogue functions also influence aspects of collaboration such as cooperation, in addition to aspects of dominance, and leadership roles. Socio-emotional functions of nonverbal behavior include the communication of emotions and interpersonal attitudes and their regulation, which are crucial for establishing rapport. Whether we harmonize in an interaction, take others’ perspectives or are capable of establishing a smooth flow of interaction very much depends on the exchange of those socio-emotional cues. Socioemotional functions are not independent from dialogue and discourse functions of nonverbal behavior. A smooth flow of the conversation will most likely influence positively the interaction climate. Power relations are evident in body postures, eye contact, voice amplitudes and more [69]. Harmony or interpersonal rapport shows in expressiveness or responsiveness [70] as well as in 17 mutual attentiveness (body orientation, eye contact), reciprocal positivity (smiles, interpersonal distance, body lean, and orientation), and behavioral coordination (motor mimicry, posture sharing and synchrony, and activity entrainment). Thus, monitoring nonverbal messages provides insights into the human and social behaviors being displayed given an environment [16], [71]. 2.2.2.1.In-Person and Virtual Communication Verbal and nonverbal communication are essential in both in-person and virtual interactions. By definition, in-person interactions are synchronistic, in that it occurs when two individuals are in the same place at the same time. This is the traditional and the richest media of all communication forms because it allows the individuals involved in the interaction to observe nonverbal cues such as facial expressions and body language [72]. On the other hand, virtual interactions come in various forms including email, telephone, instant messaging, and video calls, among others. This form of communication can be found to be asynchronous or synchronous. Of interest are video calls that, similar to in-person interactions, allow for real-time (synchronistic) communication, feedback, and transmission of important nonverbal cues [73]. Virtual interactions in the form of video calls can achieve the sense of “same space” inherent in in-person interactions. Virtual interactions are comprised of multiple nonverbal communication cues (embedded in audio, video, and text forms) that happen simultaneously and differently than they do when in- person [74]. Research has found that voice, including paraverbal, is the most important communicative cue of meetings [74]–[76]. Nevertheless, video is essential in terms of “social presence,” defined as the sense of intimacy and immediacy with others [74]. During the COVID- 19 pandemic, Microsoft surveyed its employees to collect their experiences in virtual meetings, finding that participants reported that small group meetings with video turned on can be engaging and interactive [74]. Additionally, it was also found that to maintain a similar experience to in- 18 person interactions, nonverbal cues (derived, for example, from facial expressions and body movements) are essential to provide information about engagement, attention, and focus [77]. 2.3.Methods for Monitoring Human Behaviors 2.3.1. 
Ethnographic Methods Traditionally, scientists interested in studying social interactions have made use of ethnographic research methods, such as observations (including experiments) and surveys. Expert observation is probably the most common method for studying social interactions. Data collected through expert observation is a method used in all sciences and is independent of people’s willingness to provide verbal information about their behaviors and feelings. One of the greatest advantages of employing expert observation is the depth of the collected data, which can be very detailed to “explain behavior and communication patterns in ways that a survey, interview, or experimental design cannot” [78]. On the other hand, self-reported data through methods such as surveys present the advantage that a wide range of information can be collected. Methods such as surveys make it possible to study very large populations, their attitudes, values, beliefs, and past behaviors [79]. However, even when human behaviors have been studied using expert observations and/or through surveys, these methods are not suitable for the interpretation of human and social behaviors in settings that would benefit from real-time feedback to improve on the observed behaviors. 2.3.2. Biometric Technology In addition to expert observations and surveys to monitor human behaviors, biometric methods have been employed to measure psychophysiological processes related to human behavior. Biometric methods for studying human behavior include the use of different technologies, including sensors and algorithms, to monitor the activation of personal factors 19 (emotions and personality traits) and aspects of social behaviors during social interactions that manifest in the form of physiological processes and nonverbal messages. Some of the advantages of using biometric methods include the potential for unbiased and consistent measurements. Moreover, using the right experimental setup and biometric technologies, such as wearable sensor platforms, real-time data collection and analysis of human behavior can be achieved. Once “in- the-wild” human behavior analysis is achieved, real-time feedback can be provided to create behavioral awareness in the individuals from which the behavior was detected. To this end, a wide range of biometric technologies have been developed, and a variety of sensor modalities and algorithms employed to facilitate the realization of studies in real-life scenarios. 2.4.Biometric Technologies and its Components Technologies for real-time monitoring of human behavior require the employment of a variety of components. The three main components of these types of technologies are sensors, signal features, and computational models. This section provides a glance at the literature available in this area, which is thoroughly revised and analyzed in Chapter 3. Note that this section, and this work in general, omits the review of works that employ cameras for the monitoring of human behaviors. The primary reasons to exclude cameras from this work are because most of the reported use of video cameras for the monitoring of human behavior has been for offline applications and because its use increases the computational load and power consumption of the system [80]. In addition, it has been a topic of debate that the use of cameras to monitor human behavior presents a concern for user privacy. Thus, as video and image modalities present limitations for real-time and wearable applications, we consider them out of scope. 
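Before reviewing each of these components in turn, a brief sketch may help fix the overall structure in mind: sensors produce time-stamped signal windows on a shared clock, signal features summarize each window, and a computational model maps the resulting feature vector to a behavior label. The class and function names below are hypothetical and purely illustrative of one way such a real-time pipeline could be organized; they are not part of the platform developed in this work.

```python
# Hypothetical sketch (illustrative names only, not from this work) of how
# sensors, signal features, and computational models can be composed into a
# real-time behavior recognition loop.
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np


@dataclass
class SensorFrame:
    """One time-stamped window of samples from a single sensor modality."""
    modality: str        # e.g., "audio", "imu", "eda"
    timestamp: float     # seconds on a clock shared across sensor nodes
    samples: np.ndarray  # raw samples for this window


class PlaceholderModel:
    """Stand-in for a trained classifier; always predicts class 0."""
    def predict(self, features: np.ndarray) -> np.ndarray:
        return np.zeros(features.shape[0], dtype=int)


def process_frame(frame: SensorFrame,
                  feature_fns: Sequence[Callable[[np.ndarray], float]],
                  model: PlaceholderModel) -> int:
    """Summarize one frame with signal features and map them to a label."""
    features = np.array([fn(frame.samples) for fn in feature_fns])
    return int(model.predict(features.reshape(1, -1))[0])


# Example: one 1.28 s window of inertial data sampled at 100 Hz.
frame = SensorFrame("imu", 0.0,
                    np.random.default_rng(1).standard_normal(128))
label = process_frame(frame, [np.mean, np.std, np.ptp], PlaceholderModel())
```

In a multi-person deployment, one such loop would run per participant, with the shared timestamps used to align frames across sensor nodes before group-level analysis.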
20 2.4.1. Wearable Sensors for Collecting Physiological Signals and Nonverbal Messages In the 1990s, the early development of wearables to study human behaviors was focused on identifying aspects of social interactions. These initial systems, still used nowadays, employed InfraRed (IR) and/or quasipassive radio frequency (RF) sensor modules to track position and proximity among individuals wearing these devices [81]–[83]. In an effort to create wearable systems with the ability to capture more informative data about human behaviors, research groups started working on the integration of multiple sensing modalities. One of the earliest initiatives was the MIThril project pioneered by A. Pentland [84]. The MIThril project focused on developing a “practical, modular system of hardware and software for research in wearable sensing and context-aware interaction” [84]. With the introduction in the early 2000s of the personal digital assistant (PDA) devices, the MIThril project first developed a modular wearable system comprised of a variety of sensors such as accelerometers, InfraRed (IR) active tag readers, GPS units, analog microphones, 2-channel electromyography (EMG) sensors, 2-channel electrodermal activity (EDA) sensors, and skin temperature monitors [85]. The sensors were wired to a PDA intended to perform real-time processing and communicate with other units of the same kind through Wi-Fi. However, there seem to be no reports of data collected using this system. In an effort to study communication patterns of groups of people during meetings in real time, Eagle and Pentland [86], designed a wearable system employing a headset microphone connected to individuals’ PDAs to allow streaming of high-quality audio signals over a network, also with the choice of storing the audio locally on the device for post-processing. Conversations were detected in all streamed audio signals and conversation features extracted, including inferring the proximity among participants. Later, the same research group made use of the UbER-Badge [87], a device with a microphone, a two-axis accelerometer, and a forward-oriented IR transceiver, to measure human interest levels 21 during interactions in a conference meeting, all among dyads [88], [89]. With a similar system called the Sociometer, people involved in an interaction were identified, and through audio signals, conversation dynamics studied [90]. This version of the Sociometer was later optimized. Modified versions of the Sociometer have been known as the Sociometric badge [91]–[93], Open badge [94], and Rhythm badge [95]; all these sensor platforms have been used in the study of social interactions. Mobile phones have also been used as a platform for monitoring social interactions. In [96], the Bluetooth and microphone units from a mobile phone were used to detect proximity and conversation dynamics in real time to infer levels of interest in a social interaction. Moreover, they have also been used to connect with badges to display feedback information useful to improve social interactions[93], [95]. A review of additional wearable sensors used for social interaction recognition can be found in [97]. Besides social interactions, wearables have also been used for real-time emotion recognition. In [98] and [99], accelerometer, gyroscope, ambient light, temperature, and humidity sensors were integrated into a watch-like device to monitor levels of anxiety in human subjects. 
In an improved version that includes a MEMS microphone and a skin temperature sensor, Jiang et al. [100] used this wearable system for health monitoring to study the relationship between mental health and physical health. In [22], Breeze, a wearable pendant placed around the neck, with an inertial measurement unit (IMU), was employed to measure breathing patterns as these are closely linked to emotions. The goal of the researchers was to improve the emotional states of the Breeze users by providing real-time feedback on the user’s breathing patterns. Also related to the regulation of emotional states, in [101], a wearable system in the form of a glove containing an EDA, a blood volume pulse (BVP), and a skin temperature sensor was designed to continuously monitor changes 22 in the physiological signals that could relate to emotional mental states. On the other hand, Girardi et al. [102] used commercially available wearable sensors to capture electroencephalography (EEG), EDA, and EMG signals to detect emotions in the arousal-valence dimensions. Also using commercially available sensors, McGinnis et al. [103] employed accelerometers and gyroscopes to diagnose anxiety and depression in young children. A comprehensive list of commercially available wearable physiological sensors, used especially for the monitoring of emotions, can be found in [104]. 2.4.2. Signal Features The processing of sensor signals plays a critical role in the design of accurate real-time human behavior monitoring systems. Methods applied for the treatment of sensor signals include digital signal processing and machine learning techniques. The goal of digital signal processing is to apply pre-processing techniques to enhance signal quality and to compute statistically identifiable signal characteristics or measurable signal properties, typically referred to as signal “features”, that are informative of human behaviors. Pre-processing techniques include signal filtering, normalization, and standardization which help eliminate signal artifacts and any other unwanted information from the collected signals [105]. On the other hand, extracting features from signals involves finding a variety of mathematical methods that could help identify patterns in the data [106]. Features are extracted/calculated using time-domain feature extraction techniques, frequency-domain feature extraction techniques, and time-frequency-domain feature extraction techniques. Time-domain features include zero crossing rate, slope sign changes, waveform length, statistical values, and Shannon entropy, among others; frequency-domain features include measurements derived from a Discrete Fourier transform and power spectral density analysis, among others; and time- frequency-domain features include short-time Fourier transform, Hilbert transform, Morlet 23 wavelet, and wavelet transform, among others [105]. After features are extracted, machine learning methods such as feature selection can be used to reduce redundancy in extracted signal characteristics and/or reduce the dimensionality of a given dataset. Selecting a final set of signal features has an important influence on the size of examples needed to create models capable of recognizing human behaviors, on the cost of computation, and on the time needed for recognizing such behaviors [105]. 2.4.3. Computational Models Based on the features extracted from sensor signals, computational models are trained and used to predict or classify human behavior. 
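To make the preceding description concrete, the minimal sketch below computes several of the time-domain features named in Section 2.4.2 for a single window of a one-dimensional sensor signal. It assumes NumPy, and the function and variable names are illustrative rather than taken from this work; the feature definitions follow common conventions in the signal processing literature.

```python
# Illustrative sketch: a few time-domain features for one window of a 1-D
# sensor signal (e.g., one accelerometer axis). Window length and histogram
# bin count are arbitrary placeholder choices.
import numpy as np


def time_domain_features(window: np.ndarray) -> dict:
    """Return a small dictionary of time-domain features for one window."""
    diffs = np.diff(window)

    # Zero crossing rate: fraction of consecutive samples that change sign.
    zcr = float(np.mean(np.signbit(window[:-1]) != np.signbit(window[1:])))

    # Slope sign changes: number of sign changes in the first difference.
    ssc = int(np.sum(np.signbit(diffs[:-1]) != np.signbit(diffs[1:])))

    # Waveform length: cumulative absolute change over the window.
    wl = float(np.sum(np.abs(diffs)))

    # Shannon entropy of an amplitude histogram (one common estimate).
    counts, _ = np.histogram(window, bins=16)
    p = counts / counts.sum()
    p = p[p > 0]
    entropy = float(-np.sum(p * np.log2(p)))

    return {
        "mean": float(np.mean(window)),
        "std": float(np.std(window)),
        "zero_crossing_rate": zcr,
        "slope_sign_changes": ssc,
        "waveform_length": wl,
        "shannon_entropy": entropy,
    }


# Example: a noisy 2 s window sampled at 100 Hz.
rng = np.random.default_rng(0)
t = np.linspace(0, 2, 200)
signal = np.sin(2 * np.pi * 3 * t) + 0.1 * rng.standard_normal(t.size)
print(time_domain_features(signal))
```

Frequency-domain and time-frequency-domain features could be appended to the same feature vector, and feature selection applied, before such vectors are used to train the models discussed in this subsection.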
Therefore, the performance of computational models, also referred to in this work as machine learning models, can depend on the provided set of features. Likewise, the effectiveness of signal features can also depend, in part, on the type of computational method used to evaluate each feature's contribution. The two principal types of machine learning models employed in the human behavior recognition literature are classification and regression models. Classification models focus on recognizing discrete or categorical classes, while regression models focus on predicting continuous numerical values. The choice of machine learning model is application specific. For example, the problem of emotion recognition can be treated as one with categorical values (e.g., happy, sad, neutral) or as one with continuous numerical values (i.e., reflecting levels of arousal and valence on a numerical scale). More details and analysis of signal features related to human behavior and computational models will be given throughout Chapter 3.

2.4.4. Training Frameworks
The design of technologies for real-time monitoring of human behavior is not complete without the collection of datasets to support the evaluation of signal features and the training of machine learning models. Datasets to train machine learning models for human behavior recognition are application specific and can be classified as acted/evoked datasets and natural datasets. Acted/evoked datasets refer to data collected by requesting a subject to behave in a certain way (e.g., actors) or by controlling the environment to elicit the behavior of interest (e.g., watching images or movies to elicit specific emotions). On the other hand, natural datasets refer to data collected during spontaneous/natural interactions where behaviors cannot be controlled. Natural datasets are the hardest to construct because they involve designing scenarios that encourage spontaneous interactions and preparing annotation schemes. According to Cognilytica, an analyst firm, 80% of the time spent in machine learning and artificial intelligence projects goes into data collection, organization, and annotation [107], [108]. The scientific community has worked collaboratively to create publicly available datasets to advance the design of algorithms. For the design of human behavior monitoring, a variety of datasets have been created [109]–[117]. The creation of datasets has mostly targeted applications in the area of emotion recognition; however, in the last 10 years, more databases reflecting aspects of social interactions have emerged. For emotion recognition applications, a combination of acted/evoked and natural datasets exists, in addition to datasets containing multimodal data. The HUMAINE database contains a collection of 48 audiovisual acted/evoked and naturalistic clips, some of them also containing physiological data, with labels describing affective responses. Clips were obtained mainly from TV shows/interviews and human-computer conversations, but clips from other sources were also included. This has been one of the most comprehensive databases, also providing a labeling scheme to identify emotional responses in audiovisual data [117]. DEAP is an evoked multimodal database containing frontal face video and physiological signals obtained while participants were watching music videos. The collected data were labeled using self-reported ratings for arousal, valence, and like/dislike [112].
The MAHNOB-HCI is also an evoked multimodal database containing audio, visual, eye gaze, and physiological data. Videos and images were used to evoke emotions in 27 participants [113]. Other databases, such as BioVid Emo DB [109], also contain multimodal data collected while individuals were watching videos. Most of the data in these databases was collected with sensors that restricted the natural movement of participants. In addition, because their focus was on emotion recognition, their experimental design and collected data do not reflect the reactions of natural social/group interactions. To provide more naturalistic data on person-to-person interaction environments, other databases have been made available. For example, the RECOLA database contains data from natural remote collaborative environments [110], [111]. Multimodal data, including audio, video, and physiological signals, were collected from interactions between dyads (two individuals). This dataset was labeled by external annotators using a continuous arousal and valence scale and social behavior dimensions. External annotators looked at the following social behaviors: agreement, dominance, engagement, performance, and rapport. A more recent database, SEWA DB, contains audiovisual data from individuals watching adverts and then dyads having a conversation about such adverts [115]. This dataset includes facial landmarks, facial action units, vocalizations, mirroring, affective state (valence, arousal), and social behavior (liking, agreement) annotations. However, none of these databases capture the dynamics of groups. There still exists a need for databases containing data from group interaction environments with their respective annotation schemes. For the design of human and group behavior monitoring systems, data capturing a combination of emotional reactions with social behaviors at the individual, dyadic, and group level are needed.

2.5. Summary

Reviewing the personal factors that underpin human behavior and the theories and concepts that have guided the psychological study of social interactions provides a scholarly foundation for understanding the methods that have been employed for monitoring human behavior. Methods employing wearable sensors for the monitoring of human behaviors have focused on recognizing emotions or aspects of group behaviors, separately. Moreover, the works that have focused on group behaviors have implemented only specific aspects of communication, which does not capture the full complexity of social interaction needed to identify disruptive behaviors and bring awareness. Even though many efforts have been made to study and design technologies that monitor individual emotions and aspects of group behaviors, very little has been done to design robust systems that could measure a larger number of elements influencing social interactions within groups of people. In addition, none of those works has focused on identifying aspects of the interaction that could be potentially useful for bringing awareness to members of a group. It is also important to note that there are currently no standard technologies, methods, and/or processes to study and design group behavior monitoring systems.

Disclaimer: A substantial portion of this chapter was published in [118] © 2021, IEEE.
3. DESIGN OF A MULTI-SENSOR SYSTEM WITH A MACHINE LEARNING FRAMEWORK TO MONITOR GROUP INTERACTIONS IN REAL TIME

The reviewed literature provided a unique social science perspective with a focus on identifying critical elements to consider for the design of social behavior monitoring systems. The literature surrounding technology for human behavior monitoring is vast and varied. Focusing only on technologies with the potential to advance automatic and/or real-time monitoring of human behaviors, as was chosen for this work, motivated the creation of a classification system that could synthesize reported technologies and enable an analytical perspective. This classification system or taxonomy would relate to behavioral elements that helped define the individual, dyadic, and group metrics involved in group interactions. Of particular interest was a behavioral element of group interactions called rapport, which helps define the quality of social interaction between dyads. We hypothesized that by improving real-time self-awareness, rapport levels between individuals (dyads) could be increased, possibly affecting the overall group interaction. Here, technologies used for the monitoring of human behavior and rapport are presented. With the goal of establishing a framework for the design of group interaction monitoring systems with the capability of providing feedback that can improve the quality of social interactions, this work leverages existing theories to monitor dyadic interactions and presents efforts in the design of a multi-sensor monitoring system for the real-time detection of group consonance. The term group consonance is introduced in this work to define the subset of rapport composed of monitorable behavioral components that contribute to establishing good rapport between dyads and its effect on the overall group interaction. As part of the design efforts, a comprehensive review and analysis of sensor technologies used for the study of human behaviors are presented and discussed.

Table 1. Taxonomy summarizing human behavior elements monitored using sensor technologies. © 2021, IEEE.

Emotions (effector class) – Elements: dimensional (valence, arousal, potency); categorical/basic emotions (happy, angry, sad, quiet, disgust, anxiety, surprise); others (curiosity, boredom, uncertainty, puzzlement). References: [22], [24], [98]–[103], [109]–[113], [132], [141]–[146], [182], [184]–[190], [193], [194], [237]–[257].

Personality factors (effector class) – Elements: personality traits (leadership emergence, openness, conscientiousness, extraversion, agreeableness, and neuroticism); person perception dimensions (valence, dominance, activity); others (empathy, honesty). References: [53], [86], [93], [111], [147], [150]–[154], [258]–[266].

Social interactions (effector class) – Elements: cooperation or collaboration, agreement and disagreements, attraction, interest, attention, emphasis, vigilance, group performance, cohesion, communication patterns and dynamics, level of interaction, rapport. References: [17], [24]–[26], [83], [86], [88]–[96], [111], [123]–[126], [155]–[158], [162]–[165], [183], [191], [267]–[278].

3.1. Measuring the Quality of Group Interactions

3.1.1. Taxonomy for Monitoring Elements of Human Behavior

An underlying goal of this work, and indeed of most of the previous efforts in the literature, is to enhance the potential for technologies that augment human capability, toward a future of increasingly effective human-machine interactions.
To further promote this human-centered approach, this work established a taxonomy for behavior-sensing technology that is based on the relevant psychological theory summarized in Chapter 2. Specifically, this taxonomy assigns technologies to the human behavior effectors that they target, and it defines three effector classes that encompass the reviewed literature, as shown in Table 1. The defined effector classes cover personal factors (i.e., emotions and personality traits) as well as social interaction factors observed through nonverbal communication channels, all of which influence human behavior. In brief, Table 1 assigns the emotions effector class to works that concentrated on recognizing categorical and dimensional emotional structures, most of which focus on understanding an individual's emotional state rather than the dynamics of emotional expression and exchange during a social interaction. The personality factors class was allocated to works related to personality traits and person perception dimensions as well as to works centered on the detection of empathy and honesty. Finally, the social interactions class was assigned to works covering aspects of interpersonal communication and engagement such as levels of interest, level of cohesion, communication dynamics, and rapport. This taxonomy will be maintained throughout this chapter. The taxonomy will be used to discern similarities and differences in the various sensors, signal features, and computational models employed for monitoring within these prescribed human behavior effector classes.

3.1.2. Individual, Dyadic, and Group Nonverbal Behaviors

As described in Chapter 2, our social environments are created by the sum of individual behaviors interacting with each other. This work defines individual elements of behavior as the basic unit of all interactions, especially dyadic interactions. A dyadic interaction describes the interaction between two people, which represents the smallest possible social group. Dyads represent the basic social interaction unit of groups of three or more people. Here, individual, dyadic, and group metrics of nonverbal behavior are explained to guide the design of our group interaction monitoring system. As highlighted in Chapter 2, the focus of this work is on the use of nonverbal behavior indicators because of their influence and importance in social interaction perception and the advantage of keeping privacy risks at their lowest, as will be further explained in Section 3.2.

3.1.2.1. Individual nonverbal metrics

As displayed in Table 1, emotions and personality factors have been studied and monitored using technology. Those two effector classes group literature that presents technological advancements focused on understanding behavioral elements that reside in a single individual, rather than the interaction between two or more individuals. Even though emotions and certain personality factors are influenced by the social environment, they have generally been monitored at the individual level. Generally, emotions and personality factors, such as personality traits and person perception dimensions, can be studied and monitored through physiological and paraverbal communication changes in an individual. Emotions and personality factors can also be studied through facial expressions, gestures, and posture. As discussed in Chapter 2, changes in physiological reactions can be driven by changes in emotional states.
Likewise, paraverbal communication, which includes intonation, tempo, voice quality, volume, speaking time, turns, and interruptions, can be used to determine levels of individual activity and dominance, in addition to emotional states. However, the works that focus on determining individual nonverbal metrics of behavior differ widely in the approach that they take and the technology they employ, in terms of sensors, features, and computational models.

3.1.2.2. Dyadic nonverbal metrics

This work will consider a dyad to be the simplest form of a group and the smallest unit of analysis of an interaction among three or more individuals. Most of the reviewed literature targets dyads to study the elements of social interaction listed in Table 1. Generally, aspects of nonverbal communication such as body movements, postures, eye movement, visual contact, facial expressions, gestures, and paraverbal cues are more widely used to determine dyadic metrics of interaction. However, a small number of works have used synchrony analysis between physiological signals collected from dyads to determine levels of collaboration, synchronicity, and coordination. This demonstrates how individual nonverbal metrics of behavior from different individuals can be combined to monitor social dynamics. Likewise, other individual nonverbal metrics, such as speaking time, turns, and interruptions, can be compared across participants of an interaction to determine levels of overall interaction, cooperation, and cohesion. Also, gestures and postures help determine levels of mimicry or coordination, which is essential to establish rapport. At the core, rapport has been the primary aspect of social interactions attributed to dyads. Rapport is a complex social behavior mostly correlated with nonverbal communication channels. In fact, research has demonstrated that nonverbal behavioral cues are more indicative of rapport than verbal communication channels [119]. Rapport is a harmonious relationship and connection with someone, where the feelings or ideas of others and ourselves are understood, and communication runs smoothly. People that experience/develop high levels of rapport have a higher quality of social interactions, better team dynamics, and, consequently, are more productive in the workplace. In 1990, Tickle-Degnen and Rosenthal [120] proposed a theoretical model for rapport that describes three essential components of this complex social behavior: mutual attention, shared positive feeling, and synchrony or coordination. They made various observations about how rapport manifests through time and how it varies depending on the context. They suggested that at the beginning of an interaction, strong feelings of rapport are dominated more by emotional positivity and attentiveness than by coordination. However, in more developed interactions, attentiveness and synchrony or coordination are more dominant. Figure 6 illustrates the idea of the relative importance of the three essential ingredients for rapport and their relationship through time, presented in [120].

Figure 6. Relative importance of the three essential ingredients for rapport and their relationship through time. Figure adapted from Tickle-Degnen and Rosenthal [120].

One of the most important observations of Tickle-Degnen and Rosenthal was that the three components defining rapport (i.e., coordination, mutual attentiveness, and positivity) are encoded in expressed behaviors [121].
The study of rapport and how it might be perceived by others differs from the study of personality perception because the former does not reside in a single individual; instead, it is constructed from the relationship between two individuals [121].

3.1.2.3. Group nonverbal metrics

Similar to the case of dyadic nonverbal metrics, although to a lesser extent, the elements of social interaction listed in Table 1 have been studied in groups. Likewise, many of the nonverbal metrics used in the study of dyadic interactions have been applied to group interactions. In most works, individual nonverbal metrics combined with dyadic nonverbal metrics have been used to determine low and high levels of rapport [122] and cohesion [123] in meetings. Other metrics such as overall group speaking length, speaking turns, and speaking interruptions throughout a meeting have been used to characterize groups as cooperative or competitive [124]. Although high rapport is considered an essential factor in the establishment of quality interactions, none of the works that focus on monitoring this phenomenon provide a framework suitable for behavioral feedback that could resolve complex dynamics. Thus, this lack of information motivated the work in this dissertation to provide new means of assessing human behaviors, which is further explored in the next section.

3.1.3. Technologies that Measure Rapport in Group Interactions: Challenges and Opportunities

Because rapport is considered essential to effective dyadic and group interactions and encompasses multiple components of human behavior, in this work it is considered a guiding factor in the design of the group interaction monitoring system. Many works in the literature have used the Tickle-Degnen and Rosenthal model, directly or indirectly, to guide their studies, observations, and automatic analysis of rapport using technology. For example, Hagad et al. [125] pointed out that posture mirroring behavior, which is related to coordination, has been linked to rapport. Thus, using a video camera, the authors extracted signal features describing the individuals' posture during a dyadic interaction, trained posture classification models for each individual in the interaction, and then used the results of these models to determine posture congruence in dyads, achieving a ~71% average classification accuracy when distinguishing between low, neutral, and high rapport. In another work by Cerekovic et al. [126], rapport was predicted using 1-minute segments of audio-visual data collected from an individual interacting with a virtual agent. The authors trained binary regression and classification models to predict/recognize between positive and negative rapport, achieving an 87% average accuracy. However, the features employed in the prediction/recognition task included verbal audio features, not just nonverbal ones. In general, virtual agents have been commonly employed in the recognition of rapport or its components [127]–[129]; however, this has limited the automatic recognition of rapport to just dyads. In an effort to recognize rapport in groups, Muller et al. [122] investigated the automatic prediction of low rapport during natural interactions within small groups. The authors were particularly interested in recognizing the overall degree to which an individual in an interaction is able to build rapport with others.
To do this, they analyzed audio-visual data and extracted features describing nonverbal messages such as facial expressions, hand motion, gaze, speaker turns, and speech. The data labels were obtained by averaging the rapport scores given to an individual by the other members of the group. The authors trained a classification model to recognize low versus medium/high overall group rapport and studied the correlation of the features with the overall level of rapport, achieving up to a 70% average classification precision. So far, the automatic recognition of rapport in dyads and groups, as defined by Muller et al., has been performed by measuring just a single component of rapport or by training a model that distinguishes between low and high rapport using all extracted features at once. Even though this has contributed to the monitoring of the quality of social interactions, current methods of monitoring group interactions may not provide the necessary information to deliver a feedback message in real time to help individuals improve the quality of their interactions. Real-time or near real-time processing is required in group interaction monitoring systems to provide information that can impact human behaviors as they happen. Among the works that focus on providing real-time feedback, efforts have been concentrated on improving paraverbal communication patterns or providing individual awareness of emotions, both important aspects of social interaction. Still, the real-time monitoring of complex behavioral dynamics, such as rapport, that have a major impact on the well-being of humans requires the integration of multiple nonverbal metrics of behavior and respective recognition capabilities. To the best of our knowledge, no work has focused on establishing a framework to monitor rapport at the level of its components to facilitate feedback in human-to-human interaction and contribute to improving the quality of the interaction. Monitoring the quality of social interactions by calculating an overall rapport score may not provide enough information to deliver effective user feedback that can enhance the quality of the interaction. This work hypothesizes that monitoring individual components of rapport, i.e., attentiveness, positivity, and coordination, will allow us to extract an overall measure of rapport and identify which component of rapport needs attention when low rapport is detected. This can also be combined with other general group nonverbal metrics. However, the use of rapport as a measure of group interaction quality requires a deep understanding of human behavior dynamics, the nonverbal metrics that contribute to each of the rapport components and their interactions over time, the dyadic attributes, and the effective employment of sensors and computational models. A system framework based on the rapport model needs to take into consideration the smallest unit of interaction, individuals, and the basic unit of group interaction, dyads, and build upon that a group model. Because the use of technology could limit the aspects of rapport that can be monitored, this work will refer to the monitoring of rapport using technology as monitoring group consonance.
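To make the hypothesis above concrete, the following minimal sketch aggregates per-dyad scores for the three rapport components into an overall group consonance estimate and flags the weakest component when the overall score is low. The component names mirror the rapport model, but the 0–1 score range, the equal weighting, and the 0.5 threshold are hypothetical illustrative choices, not values established in this work.

```python
from statistics import mean

# Hypothetical per-dyad component scores in [0, 1], e.g., produced by
# separate attentiveness, positivity, and coordination recognizers.
dyad_scores = {
    ("A", "B"): {"attentiveness": 0.8, "positivity": 0.7, "coordination": 0.4},
    ("A", "C"): {"attentiveness": 0.6, "positivity": 0.3, "coordination": 0.5},
    ("B", "C"): {"attentiveness": 0.7, "positivity": 0.6, "coordination": 0.6},
}

def group_consonance(scores, low_threshold=0.5):
    """Average each rapport component across dyads, combine them with equal
    weights, and report the weakest component when consonance is low."""
    components = ["attentiveness", "positivity", "coordination"]
    component_means = {c: mean(d[c] for d in scores.values()) for c in components}
    overall = mean(component_means.values())
    weakest = min(component_means, key=component_means.get)
    feedback = weakest if overall < low_threshold else None
    return overall, component_means, feedback

overall, per_component, needs_attention = group_consonance(dyad_scores)
print(overall, per_component, needs_attention)
```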
3.2. Deep Analysis of Technology for Behavior Monitoring

To better understand the design space for group interaction monitoring systems and establish a framework that monitors group consonance based on rapport components, a deep analysis of available technologies was performed. Here, the categorization and details of sensors, signal features, and computational models employed in the monitoring of human behaviors are presented.

3.2.1. Categorizing Behavior Monitoring Sensors

In general, machine monitoring of human behavior starts with the appropriate selection of sensors. Commonly used sensors in monitoring human behavior can be grouped as sensors that capture video and images, audio, physiological, movement, orientation, proximity, and environmental signals. The selection of a sensor or multiple sensors is driven by the type of behavior that is intended to be monitored and its associated nonverbal messages and physiological reactions. Based on the analysis of reviewed literature, 21 different sensors were found to have been used to monitor human behaviors. Table 2 lists the sensors used in the reviewed literature and the nonverbal messages, physiological reactions, and/or environmental conditions that can be captured by them. In addition, Table 2 summarizes information related to sensor placement and the level of superficial invasiveness to the user. Here, the definition of superficial invasiveness centers on the degree to which the sensor must be in contact with the body, not on whether the sensor needs to be implanted in the body. Thus, this work classifies the level of superficial invasiveness to the user into three categories: skin contact (sensor requires direct contact with the skin), body contact (sensor has to be placed on the body but does not require direct contact with the skin), and no contact, with skin contact being the most invasive and no contact completely non-invasive. For sensors that require skin contact, it is indicated whether they require a single point of contact or multiple points of contact. This information is useful when assessing the level of obtrusiveness of a given system or evaluating sensors for the design of wearable systems. From the sensors listed in Table 2, the top 11 most frequently used sensors in the literature were studied and their frequency of use was plotted with respect to the effector classes presented in Table 1. The relationship between the top 11 most frequently used sensors and the effector classes is summarized in Figure 7. In that figure, it can be observed that the monitoring of emotions has been one of the areas of most interest followed by the monitoring of social interactions, with microphones as one of the most common sensor modalities used for their study. It can be noticed that microphones, cameras, and EDA sensors are the only sensing modalities used in the monitoring of all three effector classes. In the cases of microphones and cameras, this is presumably because of the quality and quantity of the information that they provide, the numerous advances in the areas of speech and image processing, and advantages in terms of superficial invasiveness and placement.

Table 2. Categorization of sensor technologies used in the literature* to monitor human behavior, together with their informants, associated effector classes and sensor modality, and the level of invasiveness of the sensors relative to their placement. Abbreviations: Emotions (E), Personality Factors (PF), Social Interactions (SI), Unimodal (Uni), Multimodal (Multi). © 2021, IEEE.
Audio: Microphone – prosody, pitch, speech volume, intonation, turn-taking, pauses, speech duration. Superficial invasiveness: body contact or no contact. Placement: chest or in front of an individual (on a table).
Video and image: Camera – gestures, body movements, body lean and orientation, postures, facial expressions, eye gaze. Superficial invasiveness: no contact. Placement: in front of the individual or room view.
Movement, orientation, and proximity: Accelerometer, gyroscope, magnetometer – body movements, body lean and orientation, postures, gestures, breathing patterns. Superficial invasiveness: body contact. Placement: chest, left wrist, belt, necklace, in the right trouser pocket, shirt pocket, or bag.
Movement, orientation, and proximity: Infrared (IR) sensor – orientation (face-to-face time), proximity. Superficial invasiveness: body contact. Placement: chest, head.
Movement, orientation, and proximity: Ultrasonic sensor – proximity. Superficial invasiveness: body contact. Placement: chest.
Movement, orientation, and proximity: GPS – proximity. Superficial invasiveness: body contact. Placement: chest, belt, pocket, or bag.
Movement, orientation, and proximity: Radio frequency (RF) sensor, Bluetooth included – proximity, gestures, body movements. Superficial invasiveness: body contact or no contact. Placement: chest, belt, pocket, bag, or room.
Movement, orientation, and proximity: Eye tracker (optical) – eye gaze. Superficial invasiveness: body contact or no contact. Placement: face or in front of an individual (on a monitor).
Physiological: Blood volume pulse (BVP)/photoplethysmography (PPG) sensor – blood volume in arteries and capillaries, heart rate. Superficial invasiveness: skin contact, single point of contact. Placement: wherever there is easy access to a pulse; fingers or earlobes are commonly used.
Physiological: Respiration (RSP) sensor – respiration rate. Superficial invasiveness: skin contact, single point of contact. Placement: chest.
Physiological: Skin temperature monitor – skin temperature. Superficial invasiveness: skin contact, single point of contact. Placement: any site on the body, with preference for the axilla and forehead.
Physiological: Electrodermal activity (EDA)/galvanic skin response (GSR) sensor – skin conductivity. Superficial invasiveness: skin contact, two points of contact. Placement: fingers, palm of the hands, soles of the feet, or wrist.
Physiological: Electrocardiogram (ECG) – heart rate. Superficial invasiveness: skin contact, multiple points of contact. Placement: chest or limbs.
Physiological: Electroencephalography (EEG) – brain activity. Superficial invasiveness: skin contact, multiple points of contact. Placement: along the scalp.
Physiological: Electroglottography (EGG) – pitch, turn-taking, pauses, speech duration, utterances. Superficial invasiveness: skin contact, multiple points of contact. Placement: surface of the neck.
Physiological: Electromyography (EMG) – facial expressions. Superficial invasiveness: skin contact, multiple points of contact. Placement: facial muscles.
Physiological: Electrooculography (EOG) – eye gaze. Superficial invasiveness: skin contact, multiple points of contact. Placement: face, around the eyes.
Environment: Ambient temperature sensor, humidity sensor, ambient light sensor – environmental factors. Superficial invasiveness: body contact or no contact. Placement: wrist or in a room.

* Information presented in this table was obtained by analyzing collected information from the articles referenced in Table 1.

However, the use of cameras (to capture image and/or video) requires a large data bandwidth of communication compared to microphone data. In fact, most of the literature reporting the use of video cameras for the monitoring of human behavior has been for offline applications. It is important to mention that when video and image data are included to analyze human behavior, the computational load and power consumption of the system increase [80]. In addition, it has been a topic of debate that the use of cameras to monitor human behavior presents a concern for user privacy.

Figure 7. Graphic representation of where the work related to human behaviors has been concentrated relative to the 11 most used sensor modalities. The monitoring of emotions has been one of the areas of most interest followed by the monitoring of social interactions, with microphones as one of the most common sensor modalities used for their study. Microphones, cameras, and EDA sensors are the only sensing modalities used in the monitoring of all of the three effector classes. © 2021, IEEE.
Thus, as video and image modalities present limitations for real-time and wearable applications, we exclude them from further analysis in this work. However, information on the use of video and image sensor modalities for the monitoring of human behaviors, including the use of facial expressions for the recognition of the emotion effector, can be found in [130], [131]. On the other hand, although the privacy issue could be argued to also apply to microphone data, in the case of monitoring human behavior as presented in this work, speech recognition is not the goal. In this area, microphones are used mostly to perform speech detection and to extract acoustic features of speech (e.g., volume, signal energy, pitch), processing that can be performed on a local device before any data transmission and at reasonable computational and power consumption rates [100], [132]. Besides cameras and microphones, and in addition to EDA sensors, four physiological sensors that have been commonly used are ECG, EEG, skin temperature, and BVP sensors. The wearability of physiological sensors, which has been possible due to advancements in CMOS and circuit technologies [12], [133], has allowed the study of human behavior effectors in different scenarios. ECG, EEG, skin temperature, and BVP sensors have helped in understanding acute and long-term changes in the physiology of the human body that are often altered by internal and external stimuli, but that our conscious mind cannot control. All four of those physiological sensors have been used to monitor emotions and aspects of social interactions, as noted in Figure 7. On the other hand, accelerometers, gyroscopes, IR sensors, and RF sensors are among the most frequently used sensors from the movement, orientation, and proximity sensor types. While accelerometers, gyroscopes, and RF sensors have been used in the recognition of emotions, IR sensors have only been used in the monitoring of social interactions to measure the proximity between individuals. In addition to sensor modalities that are directly related to measurements of an individual, the contextual or environmental information in which signals of an individual are collected could help improve machine understanding of behavior. Although the use of environmental sensors in the human behavior monitoring literature is scarce, they are starting to be used to add to the contextual understanding of behavior. For example, in [98] and [100], environmental sensors such as temperature, humidity, and ambient light were used in a wearable sensing device to help determine moments of personal anxiety.

3.2.1.1. Analyzing unimodal versus multimodal sensor systems

While Table 2 and Figure 7 illuminate the breadth of sensors employed for behavior monitoring and their relative popularity in the literature, it is also important to consider the number of different sensor modes employed among these studies. To provide some insight into this, Figure 8 plots the distribution of sensor modalities with respect to the identified effector classes across the articles that were analyzed. This plot shows that, of the ~72 reviewed works, around 59% rely on unimodal sensing, including all works targeting personality factors. Moreover, these unimodal efforts utilize only five of the sensor types defined in Table 2, namely microphones, EDA, EEG, ECG, and RF sensors.
In contrast, the roughly 40% of works found to use two or more sensor modes, defined as multimodal in Figure 8, collectively utilize all sensor types listed in Table 2 (except for cameras, which were excluded from this analysis). One might expect that, as sensor technologies advance, a trend toward multimodal sensing would be evident, and the performed analysis supports this, showing that 66% of the multimodal works have been published since 2017, compared to only 14% of unimodal works. Multimodal sensing also makes practical sense considering that, as social individuals, humans often communicate using multimodal signals in a complementary and redundant manner. Thus, our own actions would suggest that multimodal sensor systems would be ideal for the recognition of human behaviors. In the area of human-computer interaction, specifically in the detection of emotions, it has been recognized that multimodal systems improve the recognition rate of human behaviors when compared to unimodal approaches [55], [134], [135]. Figure 9 presents the range of reported computational classification performance accuracies for both unimodal and multimodal sensor systems in the reviewed literature, where the central red mark indicates the median accuracy. Note that Figure 9 collects information only from works that performed a classification task and reported their results as a percentage of performance accuracy.

Figure 8. Distribution of the use of unimodal and multimodal (excluding video and images) sensor modalities to monitor human behaviors. Of ~74 reviewed works, around 59% of them rely on unimodal sensing, including all works targeting personality factors, and roughly 40% use two or more sensor modes. © 2021, IEEE.

Figure 9. Summary of reported performance accuracies of unimodal and multimodal (excluding video and images) sensor systems of reviewed literature. Unimodal groups: microphone for emotions [141]–[143], [146], [185]–[189], [251], personality factors [53], [147], [153], and social interactions [123], [124], [155], [157], [191]; RF for social interactions [83]; EEG for emotions [172]–[174]. Multimodal groups: emotions [92], [99], [103], [113], [132], [182], [190] and social interactions [88], [89], [92], [192]. The central mark indicates the median accuracy, and the left and right edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme accuracy values not considered outliers, while accuracy values considered outliers are plotted individually using the '+' symbol. © 2021, IEEE.

From Figure 8 and Figure 9, it can be observed that the effector classes that most utilize multimodal sensing are emotions and social interactions, a fact also noted in behavior monitoring review papers [16], [136]–[138]. On the other hand, the works reviewed in this chapter show a lack of multi-sensor modalities for monitoring personality factors. Although unimodal approaches have helped the scientific community in evaluating how information from a specific sensor contributes to understanding a certain behavior, studying the integration of multi-sensor modalities advances the development of more accurate and robust social sensing systems. Compared to unimodal sensing, multimodal sensing is still in its infancy and encounters new layers of complexity in defining and assessing accuracy. This may explain the lack of multimodal accuracy improvements observed in Figure 9.
However, multimodal systems do demonstrate less variability in accuracy, which could indicate advantages in precision and system robustness.

3.2.2. Signal Features Informative of Human Behaviors

The sensor modalities discussed in the previous section are just one of the components necessary to capture the physiological processes and nonverbal messages associated with human behaviors. The processing of sensor signals also plays a critical role in the design of accurate real-time human behavior monitoring systems. The goal of sensor signal processing is to compute statistically identifiable signal characteristics or measurable signal properties, typically referred to as signal "features", that are informative of human behaviors. To analyze the sensor signal processing reported in the reviewed behavior monitoring literature, works were first grouped based on their use of unimodal sensor signals and multimodal sensor signals. Then, the unimodal works were organized by their sensor modalities, and the four most used modes (excluding cameras, for reasons stated earlier), based on data in Figure 7, were selected for further analysis. Within each modality, sensor signal processing elements such as signal characteristics, pre-processing approaches, and features were studied and summarized to illuminate the design space employed in the literature. For the analysis of signal features, reported works were grouped by their behavior effector class defined by the taxonomy established in Table 1. Then, works in which the contribution of features to the recognition of a particular behavior was reported using correlation analysis or feature selection algorithms were summarized below. Feature selection has two advantages: it reduces computational costs, and it removes noisy data that otherwise could degrade system performance. The understanding gained from the analysis of unimodal sensor signal processing elements was then applied to make a qualitative assessment of their utility and design considerations in multimodal systems. Finally, we attempted to integrate this information with an analysis of the limited works presenting signal processing for multimodal systems. This effort allowed us to make the summary observations presented at the end of this section that may be helpful for the design of real-time human behavior monitoring systems.

3.2.2.1. Audio signals

Audio signals collected from microphones are sound waves converted into electrical energy that, when employed in human behavior recognition systems, are typically used to monitor paraverbal communication. Audio signals used to monitor paraverbal communication are usually collected using a minimum sampling rate of 8 kHz, but rates up to 44.1 kHz have also been reported. The use of higher sampling frequencies provides better signal resolution, but it is not necessary for the extraction of the acoustic features of interest. The processing of audio signals is mainly composed of four parts: speech detection, speech segmentation, signal pre-processing, and feature extraction. Thus, identifying levels of noise, periods of silence, and periods of speech becomes a key task to ultimately extract accurate features and associate them with behaviors of interest. In real-time processing, audio signals are processed in frames of ~30 ms to ~80 ms, often with overlaps between consecutive frames. These frames of data are used to detect speech. In general, after detecting speech in the audio signal, audio segmentation is performed.
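As an illustration of this frame-based processing, the sketch below splits an audio stream into overlapping ~30 ms frames and flags speech frames with a simple short-time energy threshold; the frame length, overlap, and threshold are illustrative assumptions, and practical systems typically use more robust voice activity detectors.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=30, hop_ms=15):
    """Split a 1-D audio signal into overlapping frames (illustrative sizes)."""
    frame_len = int(fs * frame_ms / 1000)
    hop_len = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    return np.stack([x[i * hop_len:i * hop_len + frame_len] for i in range(n_frames)])

def detect_speech_frames(x, fs, energy_factor=2.0):
    """Mark frames whose short-time energy exceeds a multiple of the median energy."""
    frames = frame_signal(np.asarray(x, dtype=float), fs)
    energy = np.mean(frames ** 2, axis=1)
    threshold = energy_factor * np.median(energy)  # crude noise-floor estimate
    return energy > threshold

# Example with 8 kHz synthetic audio: 1 s of noise followed by 1 s of a louder tone
fs = 8000
noise = 0.01 * np.random.randn(fs)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(fs) / fs)
speech_mask = detect_speech_frames(np.concatenate([noise, tone]), fs)
print(f"{speech_mask.sum()} of {speech_mask.size} frames flagged as speech")
```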
Audio segmentation refers to the task of dividing the audio signal into acoustic segments from which acoustic features will be extracted [139]. Typically, in the area of human behavior monitoring, audio segmentation has been done in two ways, through an utterance-based approach or a windowing-based approach. The utterance-based approach includes segments taken based on linguistic units such as vowels, phonemes, words, and phrases. However, when dealing with automatic and real-time processing, an automatic speech recognizer (ASR) is needed to make use of an utterance-based approach. Although the use of ASR typically does not degrade the performance of a system [140], [141], it does increase the computational complexity of the system and could represent a threat to users' privacy. On the other hand, a windowing-based approach makes use of a window of time (in milliseconds or seconds), windows of speech activity (defined by pauses or silence), and/or windows of voiced or unvoiced signals. Windowing-based approaches are preferred in real-time systems because they are very fast and computationally efficient. However, this efficiency could be compromised when high amounts of memory space are needed to extract features of interest. While very small windows of time may not provide enough information to determine a change in a behavioral state, longer windows of time provide information similar to that obtained from utterance-based approaches. This is because, in general, an utterance is comprised of pauses or breath segments and voiced-unvoiced speech segments [142]. Thus, accumulating data from audio frames creates a larger window of speech activity with speech and salient segments, similar to the information of utterance-based approaches. A good balance between performance and computational complexity can be found by evaluating different time window sizes, as done in [143]. Here, we discuss works that make use of both approaches with the goal of extracting general information about relevant features. Before extracting acoustic features, it is good practice to pre-process the audio signal using a pre-emphasis filter and a window function (e.g., a Hamming window) applied to each frame to reduce signal discontinuity in order to avoid spectral leakage. Table 3 describes all the identified acoustic features used in the reviewed literature. Acoustic features were grouped by several feature categories: prosodic features in speech, conversational characteristics, voice quality characteristics, cepstral coefficients, formant characteristics, frequency spectrum coefficients, and others.
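Before turning to Table 3, the pre-emphasis and windowing steps just mentioned can be sketched as follows; the 0.97 pre-emphasis coefficient is a commonly used illustrative value rather than one mandated by the reviewed works.

```python
import numpy as np

def preprocess_frame(frame, pre_emphasis=0.97):
    """Apply a pre-emphasis filter and a Hamming window to one audio frame
    before spectral feature extraction (illustrative parameter value)."""
    frame = np.asarray(frame, dtype=float)
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - a * x[n-1]
    emphasized = np.append(frame[0], frame[1:] - pre_emphasis * frame[:-1])
    # Hamming window tapers the frame edges to reduce spectral leakage
    return emphasized * np.hamming(len(emphasized))

# Example: pre-process a 30 ms frame sampled at 8 kHz before taking its spectrum
fs = 8000
frame = np.sin(2 * np.pi * 180 * np.arange(int(0.03 * fs)) / fs)
spectrum = np.abs(np.fft.rfft(preprocess_frame(frame)))
print(spectrum.argmax())  # index of the dominant frequency bin
```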
Table 3. Audio features found in the reviewed literature associated with human behavior effector classes. © 2021, IEEE.

Prosodic features – volume amplitude (statistics*), intensity (statistics*), energy (entropy, RMS, linear regression, statistics*), voice pitch (linear regression, statistics*), autocorrelation (maximum peaks, # of peaks), voiced time. E: [99], [100], [110], [132], [141]–[145], [182], [184]–[186], [189], [190], [193], [194], [237], [251]; PF: [53], [86], [148], [150], [151], [153], [154]; SI: [86], [88], [89], [92], [123], [124], [126], [155]–[157], [191], [270].

Conversational features – turn duration, # of turns, speaking duration (statistics*), speaking rate, overlapping speech duration, interruptions, pause duration (statistics*), # of pauses. E: [141], [145], [185]; PF: [86], [147], [148], [150]–[154]; SI: [86], [90], [123], [124], [126], [155], [156], [191], [270], [279].

Voice quality features – zero-crossing rate, harmonics-to-noise ratio (HNR), jitter, shimmer, glottal features (# of glottal pulses, relaxation coefficient (Rd), functions of phase-distortion (FPD)). E: [99], [110], [142]–[144], [184]–[186], [190], [193], [237]; PF: [150]; SI: –.

Cepstral features – shifted delta cepstrum (SDC), mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) cepstral coefficients, linear prediction-based cepstral coefficients (LPCC), plus their delta and acceleration values. E: [99], [110], [142]–[144], [182], [184]–[190], [193], [194], [237], [251]; PF: –; SI: –.

Formant features – formant frequencies (first and second), bandwidths (first and second), statistics*. E: [100], [110], [132], [141], [142], [190], [193], [251]; PF: [53], [150]; SI: [270].

Frequency spectrum coefficients – brightness, center of gravity, distance between the 10 and 90% frequency quantiles, slope between the strongest and the weakest frequency, linear regression, spectral energy (statistics*). E: [99], [100], [132], [143], [182], [185], [186], [193], [194], [237], [251]; PF: [150]; SI: [270].

Others – wavelet coefficients, air pressure distribution in the vocal tract. E: [146], [189]; PF: –; SI: –.

Note. * Statistics include mean, std, variance, skewness, kurtosis, slope, median, maximum, minimum, range.

The definition of some of the features varies depending on the applied segmentation approach; therefore, we do not define them here, but information can be found in the references listed in Table 3. Because different audio features have been reported to contribute in different ways to the recognition of human behaviors, it is valuable to look more deeply into the level of contribution that various audio features provide toward behavior recognition.

Emotions – Lee and Narayanan [141] evaluated, using a feature selection method, a set of prosodic (voice pitch, energy, speech duration, and their statistics) and formant features extracted at the utterance level to improve the recognition of two emotion classes: negative and non-negative emotions (valence dimension). The feature selection method consisted of evaluating classification accuracies using a k-nearest neighbor classifier with a leave-one-out cross-validation method. While the authors separated speech data by gender (female, male), the ratio of the duration of the voiced and unvoiced regions, the energy median, and the F0 (voice pitch) regression coefficient were included in the five best features for both genders.
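A wrapper-style selection of this kind can be sketched with scikit-learn as follows; the greedy forward search, the synthetic data, and the specific estimator settings are illustrative assumptions rather than a reproduction of the procedure in [141].

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def forward_select(X, y, n_keep=3):
    """Greedy forward feature selection scored by k-NN accuracy
    under leave-one-out cross-validation (illustrative wrapper method)."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        scores = {
            f: cross_val_score(KNeighborsClassifier(n_neighbors=3),
                               X[:, selected + [f]], y, cv=LeaveOneOut()).mean()
            for f in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Example with synthetic data: 40 samples, 6 candidate acoustic features
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # labels depend on features 0 and 2
print(forward_select(X, y))
```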
In the same line, Tahon et al. [144] employed an ANOVA test and a classifier to study the contribution of prosodic, cepstral, and voice quality features in the detection of positive and negative emotions (valence dimension). They concluded that the mean and std of the relaxation coefficient (a parameter associated with how relaxed the human voice is), the harmonics-to-noise ratio (HNR), and the unvoiced ratio are of interest for valence detection. A combination of features consisting of the functions of phase-distortion (FPD) (a distortion of the phase spectrum around its linear phase component), voice pitch, energy, and shimmer was also found to be of interest. Within the recognition of discrete emotions, works on the recognition of frustration and calmness can be found. Ang et al. [145] showed, through the use of "a brute-force iterative feature selection algorithm", how prosodic features extracted at the utterance level contributed to the recognition of frustration. They concluded that longer durations of vowels or phonemes in an utterance (a word in their case) and slower speaking rates (the number of vowels divided by the duration of the utterance) were associated with frustration. In addition, high values in voice pitch features such as the maximum pitch in the longest vowel, the maximum overall pitch, the times that the maximum and minimum pitch occurred, the maximum speaker-normalized pitch rise, and the distance of various pitch statistics from the speaker baseline were all associated with frustration, representing the highest percentage of the total information used by the classifier. Other features that were associated with frustration were speaker-normalized RMS energy features, the number of dialog exchanges between the user and the system, and a raised voice. In addition to the direct use of extracted features, some works have applied principal component analysis (PCA) to reduce the dimensionality of the feature vector used to perform classification. Sahoo and Routray [146] estimated the pressure distribution in the vocal tract, which often results in a minimum of 40 feature values that increase depending on the number of vowels present in a given utterance or window of time. Thus, the authors applied PCA and made use of the first 6 principal components to classify calm and aggressive speech segments.
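A dimensionality reduction step of this kind can be sketched with scikit-learn as follows; the synthetic 40-dimensional feature vectors and the choice of six components simply mirror the order of magnitude described above and are not taken from [146].

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: 200 speech segments, 40 vocal-tract-related features
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 40))

# Project onto the first 6 principal components before classification
pca = PCA(n_components=6)
reduced = pca.fit_transform(features)

print(reduced.shape)                           # (200, 6)
print(pca.explained_variance_ratio_.round(3))  # variance captured per component
```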
Personality factors – When monitoring elements of personality factors, and also social interactions, two types of features can be extracted: individual-level features and group-level features. Individual-level features are extracted based on the audio signals of a single individual and could include any of the acoustic features listed in Table 3. Group-level features are extracted from individual-level features; they describe the dynamics of a group of people. Thus, they are typically extracted using a window of time with a size on the order of minutes [147]. Related to personality traits, prosodic features have been found to be important for modeling observed extraversion, emotional stability, and openness to experience. Mairesse et al. [148] analyzed how those three aspects of personality were correlated to prosodic features. It was found that the maximum voice pitch and the mean, std, and maximum values of intensity in dB were highly correlated with extraversion. Emotional instability was highly correlated with the voiced time and the minimum and mean values of voice pitch, while openness was correlated with the maximum voice pitch values and voiced time. The authors also showed that prosodic features are very good predictors of extraversion in comparison to other types of non-acoustic features. In this sense, analytical studies have shown that extroverts speak more rapidly, with fewer pauses and hesitations than introverts [149]. Extraversion has also been associated with high values of voice pitch and higher variations in fundamental frequency, shorter periods of silence, and higher voice quality and intensity. This was confirmed by Vinciarelli et al. [150] when studying how acoustic features correlated to personality traits. The higher the voice pitch and speaking rate, the higher the perceived extraversion. A higher center of mass in the power spectrum and higher spectral tilt were correlated with perceptions of less agreeableness. Voices for which the power spectrum is peakier and tends to be skewed towards higher frequencies are perceived as more agreeable. These latter cues affect the perception of conscientiousness in the same way, together with the speaking rate (people that talk faster are perceived as more competent). In the case of neuroticism, the higher the voice pitch and first formant mean, the higher the perceived neuroticism. However, no evidence of correlation was found for openness. Related to the person perception dimensions, Tusing [151] studied how much the amplitude of the speech signals in decibels (dB), the voice pitch, and the speech rate in words per minute (wpm) contribute to the perception of dominance. Through regression models, it was concluded that the mean amplitude, the amplitude standard deviation, the average voice pitch, and speech rate were correlated with aspects of dominance. This is particularly interesting because it has been noted that dominant people tend to be verbally active while non-dominant individuals are less so. One of the greatest advantages of using speaking rate and features like speaking length [152], [153] to infer dominance revolves around their fast computation and easy use in real-time human behavior monitoring systems. This was employed by Eagle and Pentland [86], who made use of conversation features such as speaking rate, energy, duration of time holding the floor, interruptions, and turn-taking transition probabilities to build, over time, profiles of participants' typical social behavior. This allows the recognition of relationships and dominant behaviors. In a work by Jayagopi et al. [153], features such as the speaking turn duration histogram, total successful interruptions, total speaking turns, and total speaking energy also proved to be a good combination of features to identify the most and the least dominant individuals in an interaction. Similar features were shown in [154] to help identify emergent leaders.
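Conversation-level features of the kind used in these works can be derived from per-speaker voice activity; the sketch below computes speaking time, number of turns, and interruptions from hypothetical diarized speech segments, with the segment format and the interruption rule chosen for illustration rather than taken from any particular reference.

```python
# Hypothetical diarization output: speaker -> list of (start_s, end_s) speech segments
segments = {
    "speaker_1": [(0.0, 4.0), (7.5, 10.0), (14.0, 18.0)],
    "speaker_2": [(3.5, 7.0), (10.5, 13.5)],
}

def conversation_features(segments):
    """Per-speaker speaking time, turn count, and interruptions started while
    another speaker was still talking (illustrative definitions)."""
    feats = {}
    for spk, segs in segments.items():
        others = [s for o, ss in segments.items() if o != spk for s in ss]
        interruptions = sum(
            any(o_start < start < o_end for o_start, o_end in others)
            for start, _ in segs
        )
        feats[spk] = {
            "speaking_time_s": sum(end - start for start, end in segs),
            "num_turns": len(segs),
            "interruptions": interruptions,
        }
    return feats

for speaker, f in conversation_features(segments).items():
    print(speaker, f)
```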
Social interactions – Using individual-level features, Hillard et al. [155] studied the automatic detection of agreements and disagreements using prosodic and linguistic features. There it was found that prosodic features such as the average, maximum, and initial pause duration, the maximum and average voice pitch values, and the average and maximum duration of an utterance are almost as good as linguistic features in identifying segments of agreement. Investigating the automatic detection of the level of interest and involvement of individuals in an interaction, Gatica-Perez et al. [156] found through a feature selection method that speech energy, speaking rate, and voice pitch were the best audio features for the task. Moreover, voice pitch values have also been associated with the detection of emphasis during meetings [157]. On the other hand, Cerekovic et al. [126] studied the correlation of acoustic features with self-reported and judged evaluations of rapport between a subject and a virtual agent. It was found that interactions with fewer and shorter pauses, long speech segments, and louder speech were correlated with high rapport. Turn-taking patterns were also correlated with rapport. Using group-level features, conversation dynamics have been very well explored. Global features such as the group speaking interruption-to-turns ratio and a group speaking turns egalitarian measure have been found to discriminate with high accuracy between competitive and cooperative meetings [124]. Features such as turn-taking have also been used to identify conversations between two individuals, by calculating the mutual information between the turn-taking features of the individuals' audio streams [90], and to detect conflicts [158]. Other features, such as the sum of all the individuals' pause durations, the maximum speaking rate during overlapping speech among individuals, the minimum average turn length among individuals, the total time that at least two people are speaking at the same time (total overlap time), the average energy that is observed for any participant when they are speaking at the same time as at least one other person, and the speaking rate during overlapping speech, were reported to have high values in high-cohesion meetings [123].

3.2.2.2. Electrodermal activity (EDA) signals

Electrodermal activity (EDA) signals, also known as galvanic skin response, represent the flow of current between two points of skin contact at which an electrical potential is applied. EDA signals represent properties of the skin that are regulated by changes in sweat glands' secretion, which are controlled by the sympathetic nervous system; sweat secretion increases with increments in emotional arousal. As a result, EDA is considered a good indicator of emotional arousal [159]. EDA signals can be sampled at a rate as low as 4 Hz. The EDA signal is a time series signal with two activity components, called phasic and tonic, with frequency components of interest between 0.05 and 3 Hz. The tonic component is a slow-changing signal, on the scale of tens of seconds to minutes, which is also known as the skin conductance level (SCL). On the other hand, the phasic component, also known as the skin conductance response (SCR), is typically the component considered in human behavior recognition tasks. EDA signals are usually pre-processed to identify and remove movement and respiratory artifacts [160], [161]. Similar to audio signals, in the automatic processing of EDA signals, windows of time are used to extract features of interest. Because EDA signals change more slowly than audio signals, the window size used to extract EDA features can vary from 5 seconds to 1 minute. Table 4 describes all the identified EDA features used in the reviewed literature. We grouped the EDA features per category: raw EDA features, SCR features, SCL features, frequency features, and coupling indexes. General information on their definitions can be found in the references listed in Table 4. In unimodal systems specifically, EDA signals have been used in the recognition of personality factors and aspects of social interactions and have been consistently processed using coupling indexes.
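The coupling indexes listed in Table 4 below can be illustrated with a sliding-window Pearson correlation between the EDA signals of a dyad followed by a single session index; the 15-second window, 1-second step, and 4 Hz sampling rate echo values reported in the reviewed works, but the implementation details here are illustrative assumptions rather than the exact procedures of [162]–[165].

```python
import numpy as np

def sliding_pcc(eda_a, eda_b, fs=4, window_s=15, step_s=1):
    """Pearson correlation between two EDA signals over sliding windows
    (illustrative window and step sizes)."""
    win, step = int(window_s * fs), int(step_s * fs)
    pccs = []
    for start in range(0, min(len(eda_a), len(eda_b)) - win + 1, step):
        a = eda_a[start:start + win]
        b = eda_b[start:start + win]
        pccs.append(np.corrcoef(a, b)[0, 1])
    return np.array(pccs)

def single_session_index(pccs, eps=1e-6):
    """SSI: natural log of the ratio of summed positive to summed negative synchrony."""
    positive = pccs[pccs > 0].sum()
    negative = -pccs[pccs < 0].sum()
    return np.log((positive + eps) / (negative + eps))

# Example with two synthetic, partially correlated EDA traces sampled at 4 Hz
rng = np.random.default_rng(2)
base = np.cumsum(rng.normal(size=4 * 240))  # 4 minutes at 4 Hz
eda_1 = base + rng.normal(scale=2.0, size=base.size)
eda_2 = base + rng.normal(scale=2.0, size=base.size)
print(round(single_session_index(sliding_pcc(eda_1, eda_2)), 2))
```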
Table 4. EDA features found in the reviewed literature associated with human behavior effector classes. © 2021, IEEE.

Raw EDA features – # of local minima, # of local maxima, derivatives, non-stationary index & statistics*. E: [110], [113], [237]; PF: –; SI: [163].

SCR features – # of peaks, peak amplitude, rise time, recovery time, peak duration, zero-crossing rate of the slow response (0–2.4 Hz), & statistics*. E: [102], [110], [113], [182], [184], [194], [237], [254]; PF: –; SI: [165].

SCL features – zero-crossing of the very slow response (0–0.2 Hz) & statistics*. E: [110], [181], [194], [237], [254]; PF: –; SI: –.

Frequency features – spectral power coefficients & statistics*. E: [110], [113], [237]; PF: –; SI: –.

Coupling indexes – Pearson's correlation coefficient (PCC), signal matching, instantaneous derivative matching (IDM), directional agreement (DA), Fisher's z-transform of the PCC, single session index (SSI). E: –; PF: [162]; SI: [163]–[165], [183], [267].

Note. * Statistics include mean, std, variance, skewness, kurtosis, slope, median, maximum, minimum, range.

Personality factors – Empathy has been one of the personality factors monitored using EDA signals. Slovák et al. [162] studied the monitoring of empathy in dyads. Raw EDA signals were first smoothed using a rectangular smoothing algorithm and then uniformly scaled based on a running minimum and maximum value taken from each participant from whom data was collected. Using a 15-second window with a moving rate of 1 second, signals from pairs of individuals were combined using a Pearson correlation algorithm. In addition, the single session index (SSI), which "represents an index of synchrony over a longer period of time and is calculated as the natural logarithm of the ratio of the sum of positive synchrony divided by the sum of negative synchrony over the specified time," [162] was then computed for the entire recording session (4 minutes). It was concluded that high emotional engagement of individuals in the conversation was consistently associated with high EDA synchrony. On the other hand, low emotional engagement was associated with moments of inconsistency or fluctuating EDA synchrony.

Social interactions – In addition to Pearson's correlation coefficient (PCC), other physiological coupling indices that have been found in the literature are signal matching, instantaneous derivative matching (IDM), directional agreement (DA), and Fisher's z-transform of the PCC. In the area of collaboration, a regression analysis showed that, out of the five coupling indices, IDM and DA were good predictors of collaborative behavior [163]. Haataja et al. [164] presented an analysis of synchronicity that first calculates the average slope of an EDA signal in a 5-second window and then calculates the PCC between the EDA signals of two individuals using a moving 15-second window. Similar to the case of empathy, the SSI was calculated, but using a window of 2 minutes. Results indicated that physiological synchrony does occur during collaborative learning at a statistically significant level. Because the analysis was performed offline, resulting moments of synchrony could not be correlated to specific monitoring instances. However, results do suggest that physiological synchrony might be a relevant condition when joint understanding is better built within groups. In an effort to study the dynamics of collaboration related to the degree of physiological activation of triads, Pijeira-Díaz et al. [165] calculated the number of peaks per minute in SCR signals using a moving window with a window width of 1 minute and a moving step of 250 ms, and then calculated the arousal DA as a measure of the synchrony degree.
In [165], results showed that most of the time participants were at different arousal levels, but when they were in synchrony it was mostly at the low arousal level. Although the results were not correlated with specific instances, the authors showed the potential of using arousal DA to characterize collaborative behaviors.

3.2.2.3.Electroencephalography (EEG) signals
Electroencephalography (EEG) signals represent the electrical activity of the brain. Systems that record EEG signals can have from as few as one electrode channel to as many as 256 channels. The placement of EEG electrodes along the scalp is of great importance. Thus, their placement adheres to international standards such as the 10/20 system (also known as the International 10/20 system) [166] and the 10/10 or 10/5 systems [167], the last two also known as the Modified Combinatorial Nomenclature (MCN). These standards aim to standardize the exact position of each electrode and assign names to each of them to facilitate the identification of the brainwave location that may serve a specific brain function. For example, in the area of emotion recognition, specific electrode positions are of interest. T3 and T4, electrodes placed in the temporal lobe regions, are found to be near emotional processors. P3, P4, and Pz, electrodes placed in the parietal brain region, are located near sources that reflect activities of perception and differentiation. Frontal lobe electrodes (i.e., F3, F4, F7, F8), in turn, have proximity to sources of emotional impulses and have been used for emotion recognition [168], [169].

Table 5. EEG features found in the reviewed literature associated with human behavior effector classes. © 2021, IEEE.
Feature categories | E | SI
Time domain features: power, derivatives, Hjorth features (activity, mobility, complexity), non-stationary index, fractal dimension, higher order crossings (HOC), & statistics* | [102], [173], [280] | -
Frequency domain features (per band): energy spectrum (ES), power spectrum, power spectral density (PSD), differential entropy (DE), rational asymmetry (RASM) of DE features in a channel pair, differential asymmetry (DASM) of DE features in a channel pair, differential caudality (DCAU) between DE features, higher order spectra (HOS), & statistics* | [113], [172]–[174] | [195]
Time-frequency domain features: Hilbert-Huang spectrum (HHS), discrete wavelet coefficients (DWC) | [173] | -
Note. * Statistics include mean, std, variance, skewness, kurtosis, slope, median, maximum, minimum, range.

EEG signals are typically sampled at a rate of ~256Hz but can be sampled at a lower rate depending on the signal components of interest. As EEG signals have a low signal-to-noise ratio and are prone to muscle movement artifacts [170], pre-processing of these signals includes filtering and signal inspection for artifact removal. Before extracting features, a window function (e.g., a Hamming window) is typically applied to each window of time, or frame of data, to reduce signal discontinuities and avoid spectral leakage. Windows of time are at least 1 second in size. Table 5 describes all the identified EEG features used in the monitoring of human behavior. We grouped the EEG features per category: time-domain features, frequency-domain features, and time-frequency domain features. General information on their definitions can be found in the references listed in Table 5. Traditionally, EEG signals have been analyzed using event-related potential (ERP) features.
However, when EEG signals are analyzed based on identified ERPs, an event (or trigger) needs to be identified and then features describing the response to that event are extracted [171]. This approach is not suitable for real-time implementation since it is unknown when an “event” will happen. On the 56 other hand, when EEG signals are analyzed using either time, frequency, or time-frequency domain features, EEG signals are first divided into frequency bands containing slow, moderate, and fast brainwaves that are associated with specific brain states (i.e., sleep, relaxed, and alert, among many others). These frequency bands are delta band (1-4Hz), theta band (5-8Hz), alpha band (9-12Hz), beta band (13-25Hz), and gamma band (>25Hz). However, the exact frequency values used to extract the frequency band can vary across researchers by 1 or 2 units of Hz per band. Typically, features are extracted specifically per frequency band. In unimodal systems, EEG signals have been used mostly for the recognition of individual emotions. Duan et al. [172] extracted frequency domain features in five frequency bands from signals recorded from a 62-channel electrode cap to classify positive or negative emotional states of the individuals participating in their study. All features used were smoothed using a linear dynamic system (LDS) approach. They found that emotional states relate to EEG signals in the gamma band more closely than other frequency bands and that using differential entropy (DE) as a feature provides better results than using more traditional features such as energy spectrum (ES). Likewise, Jenke et al. [173] evaluated different time, frequency, and time-frequency feature sets from signals recorded from a 64-channel electrode cap. Using feature selection methods, it was concluded that features such as power spectrum, higher order spectra (HOS), Hilbert-Huang spectrum (HHS), and discrete wavelet coefficients (DWC) computed from beta and gamma bands were better at classifying emotions. Zheng et al. [174] investigated, not just the frequency domain features and critical frequency bands for the recognition of three emotions (positive, neutral, and negative), but also the performance of a combination of four, six, nine, and 12 channels in the recognition of the three emotions. They concluded that DE performed better as a feature when compared to power spectral density (PSD), differential asymmetry (DASM), rational asymmetry 57 (RASM), and differential caudality (DCAU). In addition, as noted in previously discussed works, they also confirmed that beta and gamma oscillation of brain activity are more related to emotion processing than other frequency bands. Using a weight distribution of a trained deep belief network (DBN), the 12 channels that collect the most emotional information are FT7, FT8, T7, T8, C5, C6, TP7, TP8, CP5, CP6, P7, and P8 (named based on the MCN system). If reduced to four channels, they found them to be FT7, FT8, T7, T8, wherein the 10/20 system, T7, and T8 are T3 and T4, respectively. 3.2.2.4.Electrocardiogram (ECG) signals Electrocardiogram (ECG) signals represent the electrical activity of the heart. Frequency components of interest in ECG signals are below 20Hz, although a commonly used sampling frequency is of 1kHz. A heartbeat (or cardiac cycle) is associated with ECG signal phases and specific signal characteristics. A complete cardiac cycle is made up of five waves that construct an ECG signal, namely P wave, Q wave, R wave, S wave, and T wave. 
From those five waves, five signal phases are identified: PR interval, PR segment, QRS complex, ST segment, and QT interval. Each of them is associated with how the electrical signal travels through the heart. For heart rate measurement (or the frequency of the cardiac cycle), the QRS complex is the most important signal phase because the instantaneous heart rate is calculated from the time between any two consecutive QRS complexes (the R-R interval).

Table 6. ECG features found in the reviewed literature associated with human behavior effector classes. © 2021, IEEE.
Feature categories | E | SI
Time domain features: heart rate (HR) (expressed in beats per minute (bpm)), inter-beat interval (IBI) (measured in ms), zero-crossing rate, non-stationary index, heart rate variability (HRV), & statistics* | [110], [113], [180], [184], [194], [237] | -
Frequency domain features: spectral power, power spectral density, spectral entropy, derivatives & statistics* | [110], [113], [237] | -
Coupling indexes: Pearson's correlation coefficient (PCC), Fisher's z-transform of the PCC, weighted coherence | - | [183], [267]
Note. * Statistics include mean, std, variance, skewness, kurtosis, slope, median, maximum, minimum, range.

Similar to other physiological signals, ECG signals are prone to noise and artifacts, which are typically tackled at the input of the signal acquisition system [175] or in the pre-processing stage. A review of this topic can be found in [176]. Noise and artifact removal of ECG signals is important before feature extraction. Table 6 describes all the identified ECG features used in the monitoring of human behavior. We grouped the ECG features per category: time-domain features, frequency-domain features, and coupling indexes. General information on their definitions can be found in the references listed in Table 6. In unimodal systems, ECG signals are used to monitor an individual's emotional arousal states through parameters such as heart rate (HR) (expressed in beats per minute (bpm)), inter-beat interval (IBI) (measured in ms), and heart rate variability (HRV) [177]–[179]. For example, Quintana et al. [180] used correlation analysis to study how different social conditions affect HRV and its relation to emotional states. They concluded that high levels of HRV during a resting state are associated with improved emotion perception, while reduced HRV is associated with impairments in social cognition.

3.2.2.5.Multi-signal modalities
Signals from multi-sensor modalities have been used to increase the robustness of human behavior monitoring systems. However, the integration of multiple sensors involves managing inconsistencies in the collected data before feature extraction. Different sensor signals are typically collected at different sampling frequencies, use different pre-processing methods, and require different windows of time to extract features. All of these factors contribute to inconsistencies in the data collected across sensor modalities and present a great challenge for data synchronization, which is important for achieving robustness in human behavior monitoring systems. Nonetheless, when signal features from two or more sensing modalities are used, the reviewed literature identifies two common methods to combine information: feature-level fusion and decision-level fusion. In feature-level fusion, the features extracted from individual sensors are consolidated into a single feature set.
A simple solution to synchronize extracted features at the feature-level fusion is to extract them using the largest window size among the selected sensor modalities and then build a single feature vector. Thus, statistics are commonly employed in the feature extraction process. In decision-level fusion, also called model-level, the decisions from multiple classifiers (usually one classifier per sensor modality) are combined into a common decision. More on the theory of fusion mechanisms can be found in [181]. As performed in the discussion of features from audio, EDA, EEG, and ECG signals, we focus on discussing works performing feature-level fusion and the correlated or best-performing set of features from combined sensor modalities. Table 7 lists additional sensing modalities used in the reviewed literature together with the type of features that are typically extracted from each of them. Emotions – In [182], a total of five sensors were used for the recognition of four emotions. Features from audio, EDA, EMG, PPG, skin temperature, and RSP signals were extracted. Through a sequential backward selection algorithm, features such as the sub-band spectral entropy from PPG, the number of peaks within 4 seconds in EDA and EMG, and the mean values of the MFCCs in the speech features stood out in the recognition of the four emotions. On the other hand, [99] and [100] made use of audio and movement (from accelerometers and gyroscopes) signals to recognize anxiety levels and other individuals’ well-being characteristics, respectively. Both made use of a Pearson product-moment correlation coefficient (PPMCC) analysis to investigate the most relevant features associated with anxiety and well-being. In [99], it was found that at least 60 Table 7. Sensor signal features used in multimodal systems. © 2021, IEEE. Features per sensor modality E SI RF sensor: Raw received signal strength indicator (RSSI) values, duration in [83], [256], [257] time of a RSSI value, mean of measurements from two RSSI RF [192] signals, difference between two RSSI RF signals IR sensor: [88], [89], Number of detected encounters with another IR sensor, sum of - [92] lengths of all encounters, and length of an encounter Accelerometer, gyroscope, and magnetometer: [22], [98]– [88], [89], Signal energy, energy-entropy, correlation coefficient between axis, [100], [103], [92], pitch, roll, peak value in frequency domain, statistics* [132] [192] Skin temperature: [113], [182], Derivatives, spectral power in low frequency bands, PCC, weighted [183] [184], [254] coherence, statistics* Respiration: [113], [182], Signal energy, derivatives, breathing rhythm, breathing rate, sub- [183] [184] band power spectral, PCC, weighted coherence, statistics* Blood volume pulse: Mean signal, variance, sub-band power spectral, power spectral [182], [184], - density, heart rate, heart rate variability, blood flow, pulse, [254] statistics* Electromyogram: [102], [182] - Statistics* Eye-tracker: Pupil diameter, gaze distance, eye blinking, gaze coordinates, [113] [183] statistics, coupling indexes EOG: - [195] Blink rate, blink amplitude, power of blink amplitude, statistics* Note. * Statistics include mean, std, variance, skewness, kurtosis, slope, median, maximum, minimum, range. brightness and MFCC5 from speech, and std of the axis of gyroscopes and their peak value in the frequency domain were highly correlated with the degree of anxiety of the individuals in the study. 
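To make the feature-level fusion scheme described above concrete before turning to related findings, the sketch below computes simple per-modality statistics on a shared (largest) analysis window and concatenates them into one feature vector. It is a minimal, hypothetical example: the modalities, window lengths, and statistics are illustrative, the signals are synthetic placeholders, and the skewness, kurtosis, and range calls assume the Statistics and Machine Learning Toolbox.

% Feature-level fusion sketch: per-modality statistics on a common window,
% concatenated into a single feature vector (all values are placeholders).
audioWin = randn(16000, 1);                 % e.g., 1 s of audio at 16 kHz
edaWin   = cumsum(randn(4, 1));             % e.g., 1 s of EDA at 4 Hz
accWin   = randn(128, 3);                   % e.g., 1 s of 3-axis IMU data at 128 Hz

stats  = @(x) [mean(x); std(x); skewness(x); kurtosis(x); range(x)];
fAudio = stats(audioWin(:));
fEda   = stats(edaWin(:));
fAcc   = stats(sqrt(sum(accWin.^2, 2)));    % statistics of acceleration magnitude
featureVector = [fAudio; fEda; fAcc];       % fused feature-level representation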
Likewise, [100] found that the formants, energy, entropy, and brightness features from audio signals and both time and frequency domain features from accelerometers and gyroscopes were strongly correlated with aspects of mental health. 61 Social interactions – Gips and Pentland [88] and Laibowitz et al. [89] used three sensors for the recognition of interest during a social encounter. Initially, a 15-dimensional feature vector was constructed per dyad encounter with features from accelerometers, microphones, and IR sensors. Based on a correlation analysis, the six highest ranked encounter features for the recognition of interest were: std of accelerometer measurements in the x-axis and y-axis, mean and std of average audio signal amplitude, mean average audio difference between averaged readings, and std of the difference between the average amplitude and the average difference. In the use of a combination of physiological signals, Pun et al. [183] used a total of five sensors for the recognition of collaborative behaviors. Coupling features from EDA, ECG, eye-tracker, skin temperature, and RSP signals were extracted. Through a fast-correlation-based filter with mean squared linear regression, the correlation between the extracted features and the degree of perceived collaboration was determined. Coupling features were calculated using the signals from dyads in an interaction. From the physiological signals, the coherence of the IBI in the very low frequencies (0.003Hz- 0.05Hz) and the low frequencies (0.05Hz-0.15Hz) were correlated with aspects of collaboration. While from eye-movement signals, the number of times participants looked at the same place at the same time and the number of times participants looked at the same place within a ±6 second window were correlated with collaborative behaviors. Related to group cohesion, Zhang et al. [92] made use of a wearable sociometer badge with accelerometer, microphone, and an IR sensor to measure cohesion at an individual level and a group level. Using Pearson correlation coefficients, it was found that at the individual level, the mean movement energy was positively correlated with cohesion task. At a dyadic level, the correlation of vocal activities was also positively correlated with cohesion task. 62 3.2.2.6.Analysis and Discussion To eliminate redundant information and optimize algorithms for real-time implementation, it is important to perform correlation analysis or feature selection to analyze the contribution of signal features in the recognition of a human behavior effector. As noted in Table 3- Table 7, a wide range of features from different sensor modalities have been employed for the recognition of human behavior effectors. Although not all works referenced in the tables performed correlation analysis or feature selection on extracted signal features, significant consistency exists among the best-found features to be used in recognizing emotions and those to be used in recognizing personality factors and social interactions. From the agglomeration of references in Table 3 - Table 7, one can observe that the most common features for recognizing emotions are: prosodic, cepstral, voice quality, and frequency spectrum coefficients from audio signals; SRC features from EDA signals; frequency domain features per frequency band from EEG signals; and time domain features from ECG. 
More specifically, from prosodic features of audio signals, features related to voice pitch appear to greatly contribute to the recognition of positive and negative emotions (i.e., emotional valence levels). From EEG signals, the DE feature extracted from the gamma frequency band has also proven to be effective in the recognition of positive and negative emotions. Moreover, from ECG signals, the HRV, which can also be determined from PPG signals, has been found to be a good indicator of emotional valence, emotional arousal, and emotion perception. On the other hand, features from sensor signals used in multi-signal modalities such as Std Dev of gyroscope’s axis values and their peak value in the frequency domain have been found to be correlated with anxiety levels. 63 In the case of personality factors and social interactions, from audio signals, prosodic and conversational features are the most commonly used. More specifically, from prosodic features, voice pitch has proven to greatly contribute to the recognition of extraversion, dominance, and emphasis during meetings. On the other hand, conversational features such as speaking rate and speaking length have proven to contribute to recognizing cooperative meetings, in addition to extraversion and dominance. In general, speaking length and speaking rate are also attractive for real-time use because of their low computational complexity and fast computation. For social interactions alone, other commonly used features found to be relevant in the recognition of social interaction elements, such as collaboration and cohesion, are coupling indexes from EDA signals; distance between individuals and duration of the encounter obtained from IR sensor signals; eye- movement related features from eye-tracker sensor signal; and Std Dev features of accelerometer measurements in the x-axis and y-axis. To date, analyses of features’ contribution to the recognition of human behavior effectors come from works on unimodal systems and less so from works in multimodal sensor systems. This could, arguably, be due to the large number of works in unimodal sensor systems. Still, from observation, the most common sensor signals’ combinations used in multi-sensor modalities include microphones with physiological sensors and/or movement and proximity sensors, and combinations of physiological sensors. However, further research is encouraged in the evaluation of the best feature or features to be used in multi-sensor modalities for the recognition of human behavior effectors. As sensor features are identified as contributing to the recognition of more than one human behavior effector, more optimized and robust systems could be designed. For example, voice pitch, from audio signals, has been observed to be a good contributor to the recognition of 64 all human behavior effectors. Thus, using voice pitch when designing a system to recognize multiple human behavior effectors could help increase system efficiency. 3.2.3. Computational Models for Human Behavior Recognition Based on the features extracted from sensor signals, computational models are trained and used to predict or classify human behavior. Therefore, the performance of computational models can depend on the set of features provided. Likewise, the effectiveness of signal features can also depend, in part, on the type of computational method used to evaluate the features’ contribution. 
The two principal types of computational models employed in the human behavior recognition literature are classification and regression models. Classification models focus on recognizing discrete or categorical classes, while regression models focus on predicting continuous numerical values. The choice of computational model is application dependent. For example, the problem of emotion recognition can be treated as one with categorical values (e.g., happy, sad, neutral) or as one with continuous numerical values (i.e., reflecting levels of arousal and valence based on a numerical scale). An analysis of the reported computational methods used in the monitoring of human behaviors was performed as follows. First, the reviewed literature was grouped based on its use of classification or regression models. Then, within each of the two model groups, the different types of models and the number of predicted or classified classes were summarized to illustrate the design space employed in the literature. This summary analysis allowed us to make observations regarding the most commonly used computational models, which are presented at the end of the section. Specifically related to classification models, we analyzed and compared their accuracy values to define the state-of-the-art system performances that may help drive the future design of real-time human behavior monitoring systems.

3.2.3.1.Classification models
In general, based on the reviewed literature, classification models have been widely used in emotion, personality factor, and social interaction recognition tasks. The reviewed literature presents variations in the number and type of classes that classification models are trained to recognize and variations in the classification models being employed. A summary of the classification models employed in the reviewed literature associated with human behavior effector classes can be found in Table 8.

Table 8. List of classification models used in the reviewed literature. © 2021, IEEE.
Models | E | PF | SI
Support Vector Machine (SVM): classic SVM, adaptive SVM, and incremental SVM | [102], [113], [142], [172], [174], [184], [186], [187], [190] | [53], [153] | [123], [124]
k-Nearest Neighbor (k-NN) | [141], [172], [174], [189], [190] | - | -
Naïve Bayes (NB) | [102], [173], [185], [190] | - | [123]
Log-likelihood ratio | - | - | [124]
Logistic regression | [174] | [53] | [92]
Linear regression | - | - | [89]
Linear Discriminant Analysis (LDA) | [141], [182] | - | -
Decision and Regression Tree | [102] | - | [155], [192]
Random Forest (RF) | [182] | - | -
Hidden Markov Models (HMMs) | [146] | - | [156], [191]
Gaussian Mixture Model (GMM) | [142], [189] | - | -
Neural networks: Convolutional NN, multilayer perceptron (MLP), self-organizing map, deep belief networks (DBNs) | [142], [143], [174], [188] | - | -
Partial Least Squares-Discriminatory Analysis (PLS-DA) | [190] | - | -
Latent Dirichlet Allocation model | - | [147] | -
Sets of rules: rule-based, rank-level fusion, collective classification approach | - | [154] | -
Clustering models: k-means | [99] | - | -

Emotions – Lee et al. [141] investigated the performance of a k-Nearest Neighbor (k-NN) and a Linear Discriminant Analysis (LDA) classifier to predict two emotion classes (negative and non-negative) when using audio data from males and females separately. While LDA consistently performed better than k-NN for female data, there were cases for male data in which k-NN performed better than LDA. Gu et al. [99] made use of a k-means classifier to recognize high anxiety and low anxiety using features from audio signals.
The authors obtained 72.73% of performance accuracy by using just two features: brightness and MFCC. In this line, Sahoo and Routray [146] trained Hidden Markov Models (HMMs) to detect aggression and calmness also using audio signals. By using pressure distribution features a performance accuracy of 93.5% was achieved. Later, by using the same features, the authors trained an HMM to recognize four emotion classes (anger, boredom, happy, and neutral) achieving an 80% overall recognition accuracy. On the other hand, using EEG signals, Duan et al. [172] evaluated two classifiers, a Support Vector Machine (SVM) and a k-NN to predict two emotion classes (positive and negative emotion). In general, SVM outperformed k-NN achieving a performance accuracy of up to 86.69%. Using a multimodal sensor system, Chanel et al. [184] investigated the performance of Random Forest (RF) and SVM classifiers in predicting emotional and non-emotional moments using audio and physiological (EDA, ECG, BVP, skin temperature, and respiration) signals during social interaction. The authors investigated the performance of decision-level fusion by combining the output scores of classifiers trained on signal features from each individual in the interaction. Regardless of the classifier type (RF or SVM), it was found that by adding emotional information from all individuals in the interaction, the emotional response of one individual can be predicted with higher accuracy than just using the classification model from the individual of interest. 67 Related to the recognition of three emotion classes, Zheng and Lu [174] investigated the performance of four classifiers in predicting positive, neutral, and negative emotion classes using EEG signals. The four classifiers were deep belief networks (DBNs), SVM, logistic regression, and k-NN with resulting average classification accuracies of 86.08%, 83.99%, 82.70%, and 72.60%. However, the highest reported accuracy of DBNs was by taking EEG features from 62 channels, whereas the highest reported accuracy of SVM was 86.65% when taking EEG features from 12 channels. Related to the recognition of four emotion classes, Kim [182] trained a LDA classifier in combination with a sequential backward selection to predict low and high arousal and high and low valence using audio and physiological (EDA, ECG, BVP, EMG, skin temperature, and respiration) signals. The author trained a model for each subject (three in total) and a subject- independent model achieving an average accuracy of 78.67% and 55%, respectively. Similarly, Vogt et al. [185] trained a Naïve Bayes (NB) classifier to predict four emotion classes (joy, satisfaction, anger, and frustration) but just using audio signals. The authors trained subject- dependent models for 29 subjects, achieving accuracy values that ranged from 24% to 74%, with an average of 55%. They also trained a subject-independent model using data from 10 subjects achieving a 41% recognition accuracy. Their use of NB was motivated by its fast computation and ability to take high-dimensional feature vectors. However, Vogt et al. suggested that a more accurate classifier would be an SVM and that with a vector size under 100 features, it could be suitable for real-time implementation. In this line, using EEG and eye gaze signals, Soleymani et al. [113] trained SVM subject-dependent models to predict four emotion classes (high and low arousal and high and low valence). Classification accuracies for arousal and valence were 67.7% and 76.1%, respectively. 
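For readers less familiar with the classifiers compared above, the following sketch contrasts an SVM and a k-NN on a band-power-style feature matrix using cross-validated accuracy. It is a minimal sketch on synthetic data, not the pipeline of any cited study, and it assumes the Statistics and Machine Learning Toolbox (fitcsvm, fitcknn).

% Contrast SVM and k-NN on synthetic band-power-style features.
rng(1);
X = [randn(60, 10) + 0.8; randn(60, 10) - 0.8];          % 120 trials, 10 features
y = [repmat({'positive'}, 60, 1); repmat({'negative'}, 60, 1)];

svmMdl = fitcsvm(X, y, 'KernelFunction', 'rbf', 'Standardize', true);
knnMdl = fitcknn(X, y, 'NumNeighbors', 5, 'Standardize', true);

svmAcc = 1 - kfoldLoss(crossval(svmMdl, 'KFold', 10));   % cross-validated accuracy
knnAcc = 1 - kfoldLoss(crossval(knnMdl, 'KFold', 10));
fprintf('SVM: %.1f%%   k-NN: %.1f%%\n', 100*svmAcc, 100*knnAcc);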
Using audio signals, Abdelwahab and Buso [186] investigated the use of 68 two modified versions of SVM to classify the same four emotion classes (high and low arousal and high and low valence). They trained an adaptive SVM model and an incremental SVM model, which aims at maintaining or improving their classification performance even under mismatched training and testing conditions. The authors concluded that both methods provide similar performance, but a precise accuracy value was not reported. On the other hand, Wu and Liang [142], also using audio signals, trained three types of models, Gaussian Mixture Model (GMM), SVM, and a multilayer perceptron (MLP) to predict four emotion classes (neutral, happy, angry, and sad). A Meta Decision Tree (MDT) was then used for classifier fusion, achieving an overall performance accuracy of 80%. However, the results from SVM alone were close to the results of MDT fusion classifier because the MDT is a classifier selection approach instead of a combination of all classifiers. Moreover, Cen et al. [187] trained a SVM model for offline and real-time recognition of the same four emotional states (neutral, happy, angry, and sad) also using just audio signals. Their results showed a 90% and 78.78% classification accuracy for automatic offline and real-time emotion recognition, respectively. In addition, Girardi et al. [102] investigated the performance of SVM, J48 (algorithm based on decision trees), and Naïve Bayes (NB) on predicting low and high arousal and high and low valence by using physiological signals such as EDA, EEG, and EMG. Results showed that SVM outperforms the other classifiers and that EEG signal features alone provided the best performance accuracy for valence classification, while EEG+EDA performed the best for arousal classification. On the other hand, using a Convolutional Neural Network (CNN) to predict the same four previously mentioned emotion classes, Rajak and Mall [188] using audio signals, specifically, MFCC features achieved a classification accuracy of 76.2%. 69 In the recognition of more than four emotion classes, Jenke at al. [173] trained NB subject- dependent models using EEG signals to predict five emotion classes (happy, curious, angry, sad, and quiet) achieving a performance accuracy of 36.80%. Later, Lanjewar et al. [189] made a comparison between the performance of a GMM and a k-NN to predict six emotion categories using audio signals. In general, their results showed that the GMM performed better than the k- NN model with 66% and 52% of classification accuracy, respectively. However, the speed of computation is faster for the k-NN classifier than for GMM, which makes it attractive when time constraints are critical to consider, like for real-time applications. The computational time of GMM increased when the number of features increased in the training phase. However, it was noted that GMM was better at predicting angry and sad emotion classes, while k-NN performed better at predicting happy as well as angry emotion classes. Also using audio signals, Balti and Elmaghraby [143] implemented a self-organizing map with a response integration approach to predict seven emotion classes (anger, boredom, disgust, anxiety/fear, happiness, sadness, and neutral), achieving a 70.86% performance accuracy. Likewise, Jing et al. 
[190] investigated the performance of SVM, k-NN, NB, and Partial Least Squares-Discriminatory Analysis (PLS-DA) in also predicting seven emotion classes (sad, joy, fear, surprise, neutral, anger, and disgust) by using audio and EGG signals. The authors evaluated the models using acoustic features only and combined feature sets independently for males and females. However, the results consistently showed that SVM got a higher average emotional recognition accuracy for both genders when compared to the other classification models, with a classification accuracy of ~72%. Personality factors – Using audio signals, Jayagopi et al. [153] trained an unsupervised classification model and an SVM model to predict the most-dominant person and the least- dominant person in a group conversation. The unsupervised model computed either the largest or 70 smallest accumulated value of each extracted feature, depending on whether the goal was to predict the most dominant or the least dominant person. In addition, two SVM models were trained. One to predict the most and the non-most dominant person in the group conversation, and another one to predict the least and the non-least dominant person in the same group conversation. Their results showed that SVM performed better than the unsupervised model in predicting the most-dominant person, being their best performance accuracies of 91.2% and 85.3%, respectively. On the other hand, both models performed the same when predicting the least-dominant person with an 83.9% accuracy. The same author, in [147], also using audio signals, trained a Latent Dirichlet Allocation model to predict three classic leadership styles: autocratic, participative, and free-rein, achieving a 79.20% classification accuracy. Likewise, Sanchez-Cortes et al. [154] evaluated four approaches using audio signals to infer an emergent leader in a group. The four approaches were a rule-based approach (search for the person with the highest feature value in a group and select that as the leader), a rank-level fusion (extension of rule-based that handles fusion of multiple features), SVM, and a collective classification approach. Results showed that the rank-level fusion provided the best performance with 72.5% of accuracy. It also performed the best in identifying perceived dominance with 65% of accuracy. Related to personality traits, Mohammadi and Vinciarelli [53], also using audio signals, evaluated the performance of a logistic regression and an SVM in predicting high and low extraversion, agreeableness, conscientiousness, neuroticism, and openness. Results suggest that logistic regression performs better than SVM in predicting conscientiousness and neuroticism with a 72.55% and 66.10% classification accuracy, respectively. On the other hand, SVM performed better than logistic regression in predicting extraversion, agreeableness, and openness with 73.45%, 63.10%, and 52.75% classification accuracy, respectively. 71 Social behaviors – Similar to the previous sub-sections, most of the literature reported here has trained their models with features from audio signals. Using prosodic and conversational features, Hillard et al. [155] trained a Decision tree (DT) classifier to predict moments of agreement and disagreement during meetings. They achieved an overall performance accuracy of 64%. Similarly, Jayagopi et al. [124] evaluated a log-likelihood ratio model and an SVM model in classifying conversational group dynamics into cooperative-type or competitive-type. 
Using an SVM with a quadratic kernel, 100% classification accuracy was obtained. Also in the context of meetings, McCowan et al. [191] trained an HMM to predict eight meeting actions (monologues from each of four individuals, note-taking, presentation, discussion, and white-board talk), achieving an 83.9% classification accuracy. Also using HMMs, Gatica-Perez et al. [156] predicted two levels of interest, high and low, during a meeting. By training an HMM with a feature vector constructed from the mean of the features from all the subjects in the interaction, 84% recall and 63% precision were achieved, while simply concatenating the features from all the subjects yielded 80% recall and 58% precision. Also investigating levels of interest, but during social encounters, Laibowitz et al. [89] trained a Linear Regression model using accelerometer signals in addition to audio signals. Their model achieved an 86.2% classification accuracy. Related to cohesion, Hung and Gatica-Perez [123] evaluated the classification performance of an NB model and an SVM model when predicting high and low cohesion using audio signals. Both classifiers showed similar classification performance, achieving up to 90% accuracy. Moreover, Zhang et al. [92] employed a logistic regression classifier to distinguish between task cohesion and social cohesion among dyads using audio, accelerometer, and IR signals. Their approach achieved 80.30% and 64.62% classification accuracy when predicting task cohesion and social cohesion, respectively. On the other hand, Katevas et al. [192] used an XGBoost regression tree classifier to detect interactive groups of various sizes (at the node and group level) using accelerometer, gyroscope, and RF signals, achieving a 94% performance accuracy.

3.2.3.2.Regression models
In general, works that have made use of regression models are focused on the prediction of emotions and social interactions. Regression models have been found to be particularly attractive when it is of interest to predict or recognize levels of emotional arousal, emotional valence, collaboration, and vigilance on a continuous numerical scale. A summary of the regression models employed in the reviewed literature associated with human behavior effector classes can be found in Table 9.

Table 9. Regression models found in the reviewed literature associated with human behavior effector classes. © 2021, IEEE.
Models | E | SI
Support Vector Regression (SVR) | [110], [193], [194] | -
Regression Trees | - | [183]
Least Squared regression | - | [183]
Neural networks: long short-term memory recurrent neural network (LSTM-RNN), feed-forward (FF), bilateral long short-term memory (BLSTM) | [110], [193], [194] | -
Structured regression models: continuous conditional neural field (CCNF), continuous conditional random field (CCRF) | - | [195]

Emotions – Wöllmer et al. [193] introduced a framework for continuous monitoring of arousal and valence levels using audio signals. The authors evaluated two regression models: Support Vector Regression (SVR) and a long short-term memory recurrent neural network (LSTM-RNN). Their results showed that the LSTM-RNN performed better than SVR at predicting arousal levels, with Mean Squared Error (MSE) values of 0.08 and 0.10, respectively. On the other hand, both regression models performed the same at predicting valence levels, with an MSE of 0.18. Ringeval et al.
[110] used a hybrid decision fusion based on SVR with a linear kernel and Neural Networks (NN) to recognize arousal and valence emotional levels based on data from audio, EDA, and ECG sensors obtained from the AV+EC 2015 database [110]. For the NN, they explored three types of architectures: feed-forward (FF), LSTM, and bilateral long short-term memory (BLSTM). The authors found that SVR performs best on the audio features for valence prediction, with a 0.069 Concordance Correlation Coefficient (CCC), and NN performs best on EDA features for arousal, with a 0.79 CCC. Moreover, FF provided the best performance for EDA features. Their hybrid decision-fusion method achieved its best arousal prediction with a 0.228 CCC and 0.173 RMSE using audio features, while achieving its second-best valence prediction performance with a 0.195 CCC and 0.119 RMSE using EDA features. However, when the authors employed decision fusion on their multi-modal data, their results improved, achieving 0.444 CCC and 0.164 RMSE for arousal prediction and 0.382 CCC and 0.113 RMSE for valence prediction, demonstrating the value of a multi-modal approach. Also using SVR and LSTM models, Brady et al. [194] used a decision-level approach to predict arousal and valence levels. The authors trained an SVR model for audio signals and an LSTM for physiological signals (EDA and ECG) and combined their decisions using a Kalman filter framework. They found that models for ECG and EDA provided significant performance improvements for valence prediction, obtaining 0.364 CCC and 0.117 RMSE for models trained with HR and HRV data and 0.177 CCC and 0.124 RMSE for EDA data. Social interactions – Contrary to emotion recognition, which mainly focuses on predicting arousal and valence levels, in the area of social interactions the target classes vary greatly from one work to another. Chanel et al. [183] used Bag of Regression Trees (BRT) and Least Squared regression with a fast-correlation-based filter (FCBF LS) to predict collaborative behaviors (i.e., degree of conflict, confrontation, emotional management, etc.) based on data from EDA, ECG, skin temperature, respiration, and eye-tracker sensors. Physiological and eye-tracker data were treated separately, and different regression models performed differently based on the sensor data modality and the targeted collaborative behavior. For example, the FCBF LS model provided the lowest RMSE, with a 0.44 RMSE performance value, when using eye-tracker data to predict the degree of convergence in a group of people. However, the BRT model performed better than the FCBF LS model at predicting confrontation using physiological signals. On the other hand, Zheng and Lu [195] employed an SVR with a radial basis function to estimate the level of vigilance based on data from EEG and EOG. The authors introduced a continuous conditional neural field (CCNF) and a continuous conditional random field (CCRF) into the design of their vigilance estimation model with the goal of incorporating the temporal dependency present in vigilance. It was demonstrated that the fusion of multimodal sensor features improves model performance, achieving a 0.09 RMSE performance value, compared to features from a single modality, which achieved 0.12 and 0.13 RMSE performance values for the EOG-based and EEG-based methods, respectively. In addition, the temporal dependency-based models were also shown to enhance vigilance estimation.
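Since the regression results above are reported with RMSE and CCC, the sketch below shows how a support vector regression model could be fit to continuous arousal-style labels and scored with both metrics. It is a minimal sketch on synthetic data, not the configuration of any cited work, and it assumes the Statistics and Machine Learning Toolbox (fitrsvm).

% SVR on synthetic data, scored with RMSE and the concordance correlation
% coefficient (CCC); in-sample predictions are used for brevity only.
rng(2);
X     = randn(200, 8);                                    % placeholder feature matrix
yTrue = X * randn(8, 1) + 0.3 * randn(200, 1);            % placeholder continuous labels

mdl  = fitrsvm(X, yTrue, 'KernelFunction', 'gaussian', 'Standardize', true);
yHat = resubPredict(mdl);
rmse = sqrt(mean((yHat - yTrue).^2));

mx  = mean(yTrue);   my = mean(yHat);
sx  = var(yTrue, 1); sy = var(yHat, 1);                   % population variances
sxy = mean((yTrue - mx) .* (yHat - my));
ccc = 2 * sxy / (sx + sy + (mx - my)^2);
fprintf('RMSE = %.3f, CCC = %.3f\n', rmse, ccc);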
3.2.3.3.Analysis and Discussion A wide range of computational models, as noted in Table 8 and Table 9, have been employed for the recognition of human behavior effectors. To deeply analyze the use of classification models, performance metrics related to the accuracy values reported per classification model are organized by effector class and presented in Figure 10. From the agglomeration of references in Table 8, it can be observed that SVM has been the most popular classification model used for the recognition of human behavior effectors followed by k-NN and NB. In addition, from Figure 10, it can be 75 Figure 10. Summary analysis of reported performance accuracies of classification models per human behavior effector groups. The central mark indicates the median accuracy, and the top and bottom edges of the box indicate the 75th and 25th percentiles, respective. The whiskers extend to the most extreme accuracy values not considered outliers These results were obtained by analyzing data from the references in Table 8. © 2021, IEEE. observed that SVM provides one of the highest levels of accuracy across all effector classes. On the other hand, k-NN and NB have been specifically used in emotion recognition, and although they follow SVM in popularity, their levels of accuracy are among the lowest across all other employed classification models. An important factor to consider when evaluating the performance of classification models is the number of classes that they are trained to predict. For example, in Figure 10 under emotions, HMM reports the highest accuracy but classifies just two classes, whereas the accuracies reported for SVM are for models trained to recognize from two to four classes. Moreover, the accuracy and complexity of these computational models vary depending on 1) the number of classes that they are trained to predict and 2) the quantity of information (number of features) that they take to accurately predict a class. Both of these factors are also critically important when considering real-time implementations. The four classification models that have 76 been trained to recognize human behaviors in real time are k-means [99], HMM [146], NB [185], and SVM [123], [187]. Our review has identified that classification models have been more widely employed than regression models. This indicates that the problem of identifying human behaviors using sensor technologies has generally been treated as a “discrete problem” rather than a continuous one. However, it has been argued that human behaviors change gradually, on a continuous scale rather than in discrete states [196]. Thus, the use of continuous numerical values for the recognition of such behaviors may be preferred. To date, the use of regression models to treat behavior recognition as a continuous case (i.e., using continuous numerical values to recognize or predict a behavior) has varied with the behavior effector class being monitored; SVR and NN regression models have been common for emotion recognition, while regression trees, least-squared regression, and structured regression models have been used in the prediction of aspects of social interactions. A general observation related to regression models is that, although these models have been employed in the automatic recognition of human behaviors, so far, they do not appear to have been used in real time. 
However, as regression models are attractive for the prediction of continuous classes, further study of these models for the real-time prediction of human behavior effector classes is highly encouraged. In addition, although there are a limited number of works employing regression models to predict a behavior effector class and different performance metrics (MSE, RMSE, CCC) have been used, hybrid decision-fusion appears to achieve the best prediction performances. In general, different computational models tend to fit feature sets from different sources in unique ways. Decision-level fusion methods, as described in Section 3.2.2.5, combine decisions from multiple computational models into a common decision, and their use should become more 77 popular as the number of sensor modalities within systems increases. Decision-level methods such as a set of rules and hybrid decision fusion have started to gain traction in conjunction with classification and regression models, respectively. Although the number of features is a highly important factor in the training of computational models, nearly half of the reviewed works did not report this value. However, from those works that did report it, the number of features ranges from 1 to ∼1000. On average, emotion recognition models tend to be trained with a higher number of features than models for the recognition of personality factors and social interactions, suggesting that emotion recognition systems are more computationally complex. Based on current studies, it is unclear if this computational complexity is linked to the complexity inherent to the personalization of human emotion. Emotion recognition models have also been more widely explored, and their complexity may be an artifact of the relative maturity of those models. 3.3.Design of a Multi-Sensor System with a Machine Learning Framework to Monitor Group Consonance using the Rapport Theoretical Model As mentioned in the introduction of this chapter and Section 3.1.3, this work hypothesizes that (1) in the absence of self-awareness, rapport levels among dyads may decrease, possibly affecting the overall group interaction, and (2) by establishing a framework to monitor components of rapport (i.e., attentiveness, positivity, and coordination), the system could determine both an overall value of group consonance and the component(s) affecting rapport needing attention. As rapport is established through multiple channels of communication, especially nonverbal, a multi-sensor system with a machine learning framework is needed. Existing social interaction monitoring systems lack accessibility, sensing modalities, or computational capabilities needed to recognize complex social dynamics in real time. Nevertheless, the design of these systems presents 78 numerous challenges that include sensors’ position and wearability, sensors’ networking, the integration of information from different sensor modalities, management of different sampling rates and pre-processing methods, use of optimal window length for real-time processing, time- alignment of the collected multimodal sensor signals, and variations in feature formats and extraction, effective feature selection, and computationally efficient but accurate human behavior and social interaction classification models. 3.3.1. 
System Requirements
To design a system that can be used for the study and real-time monitoring of group interactions in both in-person and virtual environments, it was considered that, in virtual environments, the head area conveys many of the nonverbal messages of interest. This constraint led to evaluating sensor modalities that can be worn on the head while providing human behavior and social interaction insights. In addition, to avoid inducing mobility constraints, the search was limited to wearable and non-invasive sensors. Because a group is composed of three or more individuals and behavioral information is communicated through various channels, the system is required to manage multi-sensor connectivity, data processing, and communication for at least three sensor nodes. Each sensor node will be dedicated to collecting behavioral information from a single individual in the interaction. Further, a framework to manage data synchronization and communication across sensor nodes is needed to identify complex social behavior. The multi-sensor system architecture should allow for data recording to permit the design of real-time signal processing and machine learning algorithms. Furthermore, the framework needs to allow the implementation and execution of real-time signal processing and machine learning methods to accomplish (near) real-time monitoring of group interactions.

3.3.2. Sensor Selection
To select sensors capable of collecting signals that reflect nonverbal messages of interest (including physiological reactions), information from the reviewed literature was extracted from Table 2, and a mapping of nonverbal messages of interest to sensors that can contribute to their detection was created (Table 10). Cameras were excluded from the mapping because of our interest in designing a real-time human behavior monitoring system that minimizes privacy issues and computational complexity. Based on the potential for wearability on the head and contributions to the detection of most nonverbal messages of interest, the sensors listed in Table 10 were further analyzed, and a smaller group of sensors was selected as part of the multi-sensor framework.

Table 10. Mapping of nonverbal messages of interest with sensors that can contribute to their detection. Highlighted in gray are selected sensor modalities for the multi-sensor system of this dissertation project. © 2022, IEEE. The table relates the sensors of interest (microphone, accelerometer, gyroscope, magnetometer, IR, RF, eye tracker, PPG, skin temperature monitor, EEG, EGG, EMG, and EOG) to the nonverbal messages each can help detect (body movement, body orientation, back-channel signals, posture, facial expressions, eye gaze, paraverbal communication, interpersonal distance, gestures, and physiological responses).

Per the previous review, nonverbal information of interest includes pitch and other prosodic features in speech signals, physiological reactions, and head/body activity. An accelerometer, gyroscope, and magnetometer were selected for the detection of body movements, orientation, posture, gestures, and back-channel signals (e.g., head nods and headshakes). Accelerometers measure the magnitude and direction of acceleration, gyroscopes measure the angular velocity of rotation, and magnetometers measure the direction and strength of the magnetic field in the local vicinity.
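As a small example of how the inertial signals just described could be turned into orientation cues such as head nods, the sketch below estimates pitch and roll angles from 3-axis accelerometer data. It is a minimal static-tilt approximation on synthetic data that ignores linear acceleration and sensor fusion with the gyroscope and magnetometer; the axis conventions, sampling rate, and threshold are assumptions rather than values used in this work.

% Static tilt (pitch/roll) from a 3-axis accelerometer and a crude nod cue.
fs  = 128;                                   % assumed IMU sampling rate (Hz)
t   = (0:10*fs-1)' / fs;
acc = [0.2*sin(2*pi*1.5*t), zeros(numel(t), 1), ones(numel(t), 1)];  % synthetic nodding, in g

ax = acc(:,1);  ay = acc(:,2);  az = acc(:,3);
pitch = atan2d(-ax, sqrt(ay.^2 + az.^2));    % rotation about the lateral axis (deg)
roll  = atan2d( ay, az);                     % rotation about the frontal axis (deg)

% Repeated pitch swings above a small threshold flag possible head nods.
pitchSwing = movmax(pitch, 2*fs) - movmin(pitch, 2*fs);   % pitch range over a 2-s window
isNodding  = pitchSwing > 10;                % threshold in degrees (assumption)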
Although infrared (IR) and radio frequency (RF) sensors can gather interpersonal distance information, in addition to body orientation, these would not be practical for virtual social environments and thus dropped from further consideration. For the collection of paraverbal communication messages, microphones were selected over electroglottography (EGG) because of their advantages in terms of placement and wearability. For the detection of physiological responses, photoplethysmography (PPG) and electroencephalography (EEG) were selected because of their information-rich signal content, especially for the recognition of changes in emotional states. EEG-sensor electrodes measure the electrical activity of the brain. The analysis presented in Section 3.2 revealed that electrodes placed in the temporal lobe and the frontal lobe regions of the brain have proximity to sources of emotional impulses. PPG sensors measure the volumetric variations of blood circulation at specific body locations such as the finger, wrist/forearm, forehead, and earlobe [197]. These variations reflect physiological parameters that are linked to the cardiovascular and respiratory systems affected by changes in emotional states. From all those locations, of particular interest are the forehead and the earlobe since a head-mounted sensor system is the aim of this work. Compared to the forehead position, the earlobe is the most frequently used measurement site because this location is not comprised of cartilage, thus they contain large blood supplies [197]. Similar to 81 ECG, PPG waves can be used to identify regular or irregular heart rate (HR). Using PPG sensors to monitor HR has several advantages when compared to traditional ECG-based systems. PPG sensor systems make use of a simpler hardware architecture, are cost-effective, and only require a single sensor to be in contact with the human body, which simplifies wearability [197]. A typical PPG sensor contains a light source (infrared light emitting diode (LED) or green LED) and a photodetector. PPG uses the photodetector to measure the intensity of reflected light from the tissue, which is then used to calculate blood volume changes. Other sensors listed in Table 10 with the capability to gather facial expression information and eye gaze were dropped from further consideration as they are mostly used for emotion recognition and attention monitoring, respectively; both human behavior factors could be captured by the six sensor modalities selected. Commercially available wearable sensor devices containing the selected sensor modalities were researched to be integrated into sensor nodes for the designed multi-sensor system. Wearable sensors needed to contain a long-lasting battery of at least two hours and provide access to an application programming interface (API) to manage sensors’ data as needed without proprietary permission from sensor manufacturers. From a variety of available sensor systems (a comprehensive list can be found in [104]), the Shimmer GSR+Unit was selected because of its ability to collect data from an accelerometer, gyroscope, magnetometer, and PPG sensors using an earlobe clip. The Shimmer GSR+Unit also has the capability to collect electrodermal activity (EDA) signals; however, because there was not an optimal way to place the EDA electrodes in the forehead (which is the recommended placement area of the head) [198], this sensor modality was not included in the system design. For EEG, a BrainBit EEG headband was chosen. 
BrainBit has four EEG dry electrodes, two in the occipital lobe region (O1 and O2) and two in the temporal lobe region (T3 and T4). Compared to other alternatives such as Emotiv [199] and Neuroelectrics 82 Figure 11. Head-mounted wearable sensors selected as part of the multi-sensor framework for the monitoring of social interactions. All selected sensors are shown except for microphone, which taken from the PC used for virtual meetings. © 2022, IEEE. Enobio [200] for recording EEG, BrainBit offered electrode positions of interest (Temporal lobe for emotion detection and Occipital lobe for visual attention recognition) and a less obtrusive design. Although Emotiv offers electrodes positioned in the Frontal lobe and integrates inertial movement sensors, combining BrainBit and Shimmer offered a more flexible design in terms of placement. On the other hand, comparing Shimmer with other systems such as Empatica E4 [201] for heart rate, Shimmer offers a higher integration of sensing modalities and flexibility for placement since it can collect PPG signals from the earlobe or the fingers if placed as a wristband. In addition, the Shimmer and BrainBit devices offer a compact design and API support for independent applications. They also use Bluetooth communication which allows the users to move their heads freely. Figure 11 shows the Shimmer mounted on the BrainBit headband. The multi- sensor framework involves the use of personal computers, which act as system nodes; therefore, it also makes use of the integrated computer microphones for the collection of audio signals. 83 3.3.3. Data Collection Interface and Sensor Data Synchronicity Data from multiple sensor modalities need to be collected simultaneously and synchronized before further processing. To help with networking and sensor data synchronization, the designed framework makes use of the Lab Streaming Layer (LSL). In addition to handling both the network and the time-synchronization of sensor signals, LSL also allows (near) real-time access to the measured time series as well as optional centralized collection and disk recording of the data. LSL also provides core libraries in various language interfaces and a suite of tools built on top of the libraries [202]. From the already available tools, a recording program and an audio acquisition application were integrated into the system architecture. To allow the collection of data from BrainBit and Shimmer, their respective API tools were combined with the LSL core libraries to build customized data collection applications. The BrainBit and Shimmer application interfaces were developed using MATLAB 2019b. BrainBit integration into our platform was facilitated by the BrainFlow libraries, designed to obtain, parse, and analyze physiological signals from biosensors such as EEG. The application obtains EEG data from BrainBit at a sampling rate of 250 Hz and a voltage range of ±0.4µV. On the other hand, Shimmer integration was facilitated by the MATLAB API provided by the Shimmer GSR+Unit manufacturer [203]. This custom application obtains Shimmer data at a sampling rate of 128 Hz, pre-filters IMU and PPG sensor data before transmission to the LSL managed network and estimates heart rate based on PPG sensor data. The functionality of the application requires the installation of Realterm Serial Terminal to access the computer terminal to which Shimmer connects. Figure 12 shows the user interfaces for the custom-built applications. 
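The sketch below illustrates, at a high level, how a sensor node could push samples onto the LSL network and how the central unit could pull them back using the liblsl-MATLAB bindings (assumed to be installed and on the MATLAB path). It is a generic illustration only; the actual applications wrap the BrainFlow and Shimmer APIs as described above, and the stream name, type, rate, and channel count here are placeholders.

% Sensor node: publish placeholder PPG samples through an LSL outlet.
lib    = lsl_loadlib();
info   = lsl_streaminfo(lib, 'NodeA_PPG', 'PPG', 1, 128, 'cf_float32', 'nodeA-ppg');
outlet = lsl_outlet(info);
ppgSamples = randn(1, 128);                 % one second of placeholder PPG data
for n = 1:numel(ppgSamples)
    outlet.push_sample(ppgSamples(n));      % LSL attaches a local timestamp
end

% Central unit: resolve the stream by type and pull time-stamped samples.
streams = lsl_resolve_byprop(lib, 'type', 'PPG');
inlet   = lsl_inlet(streams{1});
[sample, ts] = inlet.pull_sample();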
All the resources and MATLAB code needed to establish the sensor connections described in this section are available at https://gitlab.msu.edu/davilasy/sensor-connection-atlas.

Figure 12. Graphical user interfaces (GUIs) of the custom-built applications. (a) The BrainBit application takes the serial number of the device to be connected to the corresponding sensor node, uses it to establish the connection, and assigns the collected data to a sensor ID. (b) The Shimmer application takes the Shimmer serial number or ID and the COM port to which it is connected via Bluetooth on the node computer to establish the connection and assign an ID to the collected data. Both GUIs contain a status window to report connectivity problems or confirm that data is being streamed.

The overall architecture of the multi-sensor system is shown in Figure 13. It consists of three sensor nodes, each using the LSL tools and the custom applications to collect and synchronize data from all sensing modalities. Sensors are connected to their respective nodes using Bluetooth 5.0. A separate computer acts as a central unit that, when all nodes are connected to the same Wi-Fi network, allows the synchronized collection of data from all nodes. The multi-sensor architecture allows for data storage and the implementation of a machine learning framework, including the extraction of local signal features (related to individual human behaviors) and global signal features (related to social behaviors) for the classification of group interactions.

3.3.4. Proposed Real-time Machine Learning Framework for Sensor Data Processing

An essential part of group interaction monitoring systems is the processing of sensor data to recognize behavior indicators of interest. In this work, a machine learning framework motivated by the goal of monitoring group interactions and informed by the analysis of signal features and computational models presented in Section 3.2 was established.

Figure 13. Overall architecture of the multi-sensor framework for the real-time monitoring of social interactions. The framework consists of three sensor nodes and a central unit. Each sensor node is composed of six sensor modalities. Synchronization, networking, and storage of sensor signals are managed with LSL (https://github.com/sccn/labstreaminglayer). © 2022, IEEE.

Because rapport is considered essential to the quality of group interactions, the machine learning framework was designed with rapport modeled as a three-component paradigm based on the Tickle-Degnan and Rosenthal theoretical model [120]. The machine learning framework is primarily composed of three components: (1) signal pre-processing, (2) feature extraction, and (3) training of computational models or, more specifically, classification models.

3.3.4.1. Signal pre-processing

Signal pre-processing involves the establishment of adequate data buffer sizes, data window sizes, and filter types for the treatment of signals before and after feature extraction. In this work, data buffers and data windows differ in that data buffers are the chunks of raw sensor data from which low-level features are extracted, whereas data windows are typically bigger than data buffers and hold low-level features from which higher-level features are extracted, either at the sensor nodes or at the central unit.
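To make the buffer/window distinction concrete, the following minimal sketch (with illustrative sizes, not the values ultimately selected for the platform) extracts a few low-level features from a raw one-second buffer and then aggregates a window of such feature vectors into a higher-level summary.

```matlab
% Illustrative sketch of the buffer/window distinction (sizes are examples,
% not the values ultimately selected for the platform).
fs        = 128;                         % IMU/PPG sampling rate
bufferLen = 1 * fs;                      % data buffer: 1 s of raw samples -> low-level features
windowLen = 10;                          % data window: the last 10 low-level feature vectors

rawBuffer = randn(bufferLen, 1);         % placeholder chunk of raw sensor data
lowLevel  = [mean(rawBuffer), ...        % low-level features from the raw buffer
             sqrt(mean(rawBuffer.^2)), ...
             sum(rawBuffer.^2)];

featureWindow = repmat(lowLevel, windowLen, 1);  % window holding consecutive low-level vectors
higherLevel   = [mean(featureWindow, 1), ...     % higher-level features summarizing the
                 std(featureWindow, 0, 1)];      % slower-changing window of low-level features
```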
Sizes of data buffers and data windows vary per sensor modality and require analyzing different sizes for optimal recognition of behavioral cues and real-time processing.

3.3.4.2. Feature extraction

As shown in Figure 13, the feature extraction process is divided into two layers involving local and global feature extraction. Local feature extraction involves the extraction of features that describe the activity of a single individual and are calculated at the sensor node, whereas global feature extraction involves the extraction of features that describe dyadic or group dynamics and therefore requires data from more than one individual. Global features include all features calculated at the central unit. Local feature extraction comprises two types of features: (1) low-level features, or Type A features, and (2) transformed features, or Type B features. Type A features are extracted from the data buffers using statistical or signal processing methods. Type B features are extracted from a collection of Type A features held by a window of data, or are higher-level behavior features obtained from a classification model. Figure 14 illustrates the extraction of Type A and Type B features. Note that global features mainly consist of Type B features, which are derived from Type A features calculated for multiple individuals.

Figure 14. Proposed feature extraction process for the selected sensor modalities, which includes extracting local low-level features (Type A features) and transformed features (Type B features).

At the local level, features will be extracted from all sensor signals using their respective data buffer sizes. Global features will be extracted at a slower rate than local features. The exact time windows to be used for the extraction of features will be determined during the analysis of the collected data. However, due to the characteristic frequency components present in each of the sensor signals of interest, it is expected that local features from audio signals will be extracted more rapidly than features from IMU, EEG, or PPG signals. At the global level, because we are interested in identifying coordinated behaviors, the extraction of features may be driven by the rate of change of the accelerometer data, since audio is a much faster-changing signal. From the microphones, features describing back-channel signals, prosody, and conversation dynamics, among others, can provide insights into levels of synchrony between individuals and help recognize attentiveness, positivity, and coordination levels. Audio features such as power, energy, and pitch will be extracted locally to determine intonations, while detected frames of the speech signal will be used to extract global features such as talk-time, turn-taking, overlapping talk, and back-channel signals like “uh-huh”. The extraction of global features requires data from all subjects in the interaction. For IMU data, features describing back-channel signals such as head nodding and movement signals that can provide insights into levels of attention, positivity, and coordination will be extracted. The proposed machine learning framework considers that the recognition of intonations from speech signals and of head motion from IMU signals requires the transformation of low-level signals using classification models. The calculation of these Type B features requires extensive study and experimentation, which is presented in Chapter 4.
For PPG, features describing heart rate (HR) and heart rate variability (HRV) will be extracted locally to help identify levels of positivity. Finally, for EEG, features describing the signal power in different frequency bands, especially the alpha band (~8-12 Hz), will be explored using the occipital-lobe electrodes for attention measurement and the occipital- and temporal-lobe electrodes for emotional state. A list of features of interest is also presented in Figure 14.

3.3.4.3. Training of computational models to determine group consonance

The proposed machine learning framework suggests the use of a model fusion approach. Figure 15 illustrates a two-layer recognition framework that combines feature fusion and model fusion approaches. In the first layer, three models are suggested to be trained to predict/recognize the different behavioral components of rapport. Figure 16 shows a list of behavioral indicators associated with components of rapport that will guide the fusion of extracted features and the training of the Layer 1 computational models. Then, in a second layer, a model will combine the outputs of the Layer 1 models using weighting factors and provide an overall measurement of dyadic consonance. These machine-determined dyadic consonance values can then be combined to determine the level of group consonance. Before starting the implementation of the machine learning framework, sensor connection and synchronization were validated.

Figure 15. Model fusion approach to recognize/characterize components related to rapport that are affecting the overall interaction and, consequently, the overall calculated level of group consonance. In the first layer, three models are suggested to be trained to predict/recognize the different behavioral components of rapport. In the second layer, a model combines the output of the Layer 1 models and provides an overall measurement of dyadic consonance.

Figure 16. Nonverbal behavioral indicators associated with components of rapport that will be used to determine group consonance.

3.3.5. Results and Discussion

To test the data collection and LSL network connection of the implemented multi-sensor framework for the monitoring of social interactions, a short data collection study during a staged meeting was performed. The data collection was approved by the Michigan State University Institutional Review Board (IRB) and conducted under strict physical distancing and privacy protection protocol guidelines. A total of three subjects were recruited voluntarily. Subjects were in separate rooms on the same building floor. The wearable sensors (Shimmer and BrainBit) were attached as shown in Figure 11. The meeting was conducted through the Zoom video conferencing program and consisted of a series of questions that the participants answered. For each question and response, sensor data were collected simultaneously for all three individuals, representing a total of 3 × 16 synchronized data streams (16 data streams per sensor node). The meeting was recorded for data annotation purposes. Figure 17 shows all sensor signals collected and synchronized during a 25 s period from a single individual (sensor node). The EEG signals were post-processed using a 3rd-order band-pass IIR filter with cut-off frequencies of 1 Hz and 50 Hz for better visualization.

Figure 17. Synchronized signals from a single subject collected using the presented multi-sensor framework during a portion of a team meeting. © 2022, IEEE.
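For reference, a minimal sketch of how such a visualization filter could be applied off-line in MATLAB is shown below. It assumes the Signal Processing Toolbox, and because butter designs a band-pass of twice the requested order, it only approximates the 3rd-order filter described above; the placeholder EEG matrix stands in for the recorded channels.

```matlab
% Minimal sketch of the off-line visualization filter for the EEG channels
% (assumes the Signal Processing Toolbox). butter returns a band-pass of twice
% the requested order, so this only approximates the 3rd-order filter above.
fs = 250;                                  % BrainBit EEG sampling rate
[b, a] = butter(3, [1 50]/(fs/2), 'bandpass');
eegRaw = randn(25*fs, 4);                  % placeholder: 25 s of 4-channel EEG
eegFiltered = filtfilt(b, a, eegRaw);      % zero-phase filtering, for plotting only
```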
A section of Figure 17 is highlighted to indicate that during that period the subject was talking and head nodding. Head motion is reflected in the accelerometer and gyroscope signals. Signal synchronization makes it possible to observe how the PPG signal appears to be degraded by head nodding, which in turn appears to cause discontinuities in the heart rate estimation signal. The framework therefore allows causes of signal degradation to be studied, which will inform the design and implementation of real-time pre-processing methods at each sensor node. Figure 18, in turn, shows one set of sensor signals for each subject and the identified nonverbal messages. Synchronization of these signals was performed by the central unit of the multi-sensor framework.

Figure 18. Synchronized signals from three subjects collected using the presented multi-sensor framework during a portion of a team meeting. © 2022, IEEE.

Once again, signal synchronization, as shown in Figure 18, permits the identification of the dynamics of nonverbal messages. In this case, speech from Subject X is followed by head nods and head shakes from Subject Y and Subject Z, validating the data collection and network connectivity of the multi-sensor framework.

3.4. Summary

Reviewing the personal factors that underpin human behavior and the theories and concepts that have guided the psychological study of social interactions provided a scholarly foundation for understanding the methods that have been employed for monitoring human behavior. Methods employing wearable sensors for the monitoring of human behaviors have focused on recognizing emotions and social behaviors separately. To better understand the landscape of human behaviors that have been monitored using sensor technologies, the collected body of literature allowed this work to establish a taxonomy of human behavior monitoring technologies, grounded in the reviewed psychological theories, with the purpose of grouping the existing literature. This taxonomy showed that, even though many efforts have been made to study and design technologies that monitor the defined human behavior effectors, very little has been done to design robust systems that could measure complex social interactions. Towards the goal of overcoming existing design challenges for the real-time monitoring of complex social interactions, this chapter presented an analysis of theoretical and technical aspects of human behavior monitoring technologies, established rapport as a measure of the quality of dyadic interactions and group consonance, and introduced a new and accessible multi-sensor system that allows the study and real-time analysis of both in-person and virtual interactive environments. The system integrates six sensing modalities, selected based on a deep analysis of technologies for behavior monitoring, and leverages existing commercially available wearable sensors. The system allows the synchronized collection of sensor data from at least three sensor nodes. Details of sensor integration and of the networking protocols used to manage sensor data synchronicity were presented, and a real-time machine learning framework was introduced. The results validate sensor data collection by our system and the nonverbal messages that could be identified thanks to data synchronization. By this means, the physical infrastructure for monitoring individual, dyadic, and group-level behaviors is introduced.
The next chapters provide insights into the implementation of aspects of the machine learning framework.

Disclaimer: A substantial portion of this chapter was published in [118] (© 2021, IEEE) and [204] (© 2022, IEEE).

4. SIGNAL PROCESSING FOR THE RECOGNITION OF LOCAL TRANSFORMED FEATURES: FROM DATA COLLECTION TO ALGORITHM DESIGN

The machine learning framework presented in Chapter 3 proposes the use of local low-level features (Type A features) to extract local transformed features (Type B features). This chapter presents the first efforts in designing and implementing data processing blocks to recognize head activity and intonations using IMUs and audio signals, respectively. The design of these data processing blocks requires the collection of data that contain the behavioral targets of interest and approximate real-life scenarios. Currently, no publicly available datasets or processes exist for the collection and preparation of data as needed for the goals of this work. Therefore, this chapter also presents the study procedures employed for the collection and design of the datasets used to train the computational models. Evaluations of signal features and the training of computational models to identify head actions from IMU data and speech intonation from microphone data are also presented.

4.1. Real-Time Detection of Head Actions using IMUs

IMUs have been widely used for the identification of human activity, primarily focusing on physical activity detection, as seen in the sensors integrated into smartwatches. IMUs have also been used in wearable devices placed on the head. Head motion and position reveal a vast amount of information about the quality of a social interaction. In general, IMUs placed on the head have been employed to assist individuals with disabilities. For example, the recognition of head gestures has been used to issue commands that control video players [205], computer cursors [206], wheelchairs, and robot hands [207]. More recently, IMUs have been used for the recognition of head motions associated with human behaviors and social interactions [208]. In [209], a 6-axis IMU was placed on the forehead and used to recognize four movements: pitch, roll, yaw, and immobility. Classification models were trained with statistical signal features and with raw signal data, achieving 92% and 95% recognition accuracy, respectively. In [210], a 9-axis IMU mounted on the front side of a cap was used to recognize six types of head gestures (nod, shake, and facing up, down, left, and right), achieving an average classification accuracy of 95%. Generally, head motion recognition systems using IMUs have achieved classification accuracies ranging from 72% to 99% [211]. However, the classification performance of these systems is highly dependent on the position of the sensors, the head motions of interest, and the set of signal features used for classification. Currently, no gold standard exists for automatically detecting head activity from IMUs [211]. Therefore, this work represents the first effort in establishing methods for the design of head activity models using the sensors selected in Chapter 3 as part of the human behavior monitoring system. Data were collected from people wearing the sensor headset presented in Figure 11 while performing specific head activities of interest, which included positioning the head at given angles and nodding, shaking, and rolling the head in response to specific questions.
4.1.1. Designing a Real-time Head Position and Motion Detection Algorithm

4.1.1.1. Real-time model fusion architecture

To study the best signal feature sets, reduce the computational complexity of data processing, and reduce data transmission rates for the multi-sensor system presented in Chapter 3 [204], the model fusion architecture shown in Figure 19 was proposed. The architecture is composed of two classification blocks: one for the detection of head position (static stage) and another for head motion (dynamic stage). The static stage allows studying the signal feature set that best contributes to classifying basic head positions (center, tilted to the right, tilted to the left) versus general head motion. Likewise, the dynamic stage allows studying an optimal set of features to classify Δ-pitch, Δ-yaw, and Δ-roll head motions. The combination of the two stages with their respective parameters constitutes the head action detection (HAD) processing unit.

Figure 19. Two-stage model fusion architecture for the design and optimization of a head action detection (HAD) processing unit.

After feature extraction, the model fusion architecture performs a first classification that detects whether the head is at one of the three steady positions (neutral, right, left) or in motion. If the classifier detects motion, a second classification is performed in which Δ-yaw, Δ-roll, or Δ-pitch is identified. Because the classification models recognize two types of activity, static and dynamic, this model fusion approach provides the opportunity to retrain the models separately, reducing future re-training time for these classifiers.

4.1.1.2. Signal segmentation

The implementation of the fusion model architecture requires the study of optimal model parameters, including data buffer sizes for real-time signal segmentation and processing. Research has demonstrated that evaluating different buffer/window sizes contributes to finding a good balance between system performance and computational complexity [143]. To study the contribution that the buffer size can make to the extraction and processing of signal features, buffer sizes ranging from 1 to 4.5 seconds with 50% overlap were evaluated. A buffer size of 1 second was selected as the smallest buffer size because of interest in head motions that could carry frequency components around 1 Hz. To perform this evaluation, pre-recorded signals are first buffered to simulate the real-time acquisition of the collected sensor data. The buffer is applied per sensor signal type and axis, resulting in

IMU_E = { [x_1, x_2, ..., x_L]; [y_1, y_2, ..., y_L]; [z_1, z_2, ..., z_L] }    (1)

where E represents the sensor type (accelerometer, gyroscope, or magnetometer); x, y, and z the data in the x-, y-, and z-axes; and L the buffer size.

4.1.1.3. Pre-processing and feature extraction

For each IMU_E data buffer, as presented in (1), 70 features including time-domain, frequency-domain, and synchronization features were extracted. The list of features is shown in Table 11. Signal features were divided into two groups: those extracted before band-pass filtering and those extracted after filtering. Features extracted before filtering include the signal energy for all three axis components, the average value of the signal buffer for all three axis components, the mean magnitude of the three-dimensional vector, and the zero-crossing rate for all three axis components.
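A minimal sketch of this buffering and of the pre-filter features just listed is given below. The placeholder signal, the 3 s buffer, and the zero-crossing definition are illustrative choices, not the final implementation.

```matlab
% Sketch of buffering one tri-axial IMU signal as in (1) and computing the
% pre-filter features listed above (signal energy, average value, mean
% magnitude of the 3-D vector, zero-crossing rate).
fs  = 128;                                   % IMU sampling rate
L   = 3 * fs;                                % buffer length (one of the evaluated 1-4.5 s sizes)
hop = L / 2;                                 % 50% overlap

imu  = randn(10*fs, 3);                      % placeholder [x y z] signal, 10 s long
nBuf = floor((size(imu,1) - L)/hop) + 1;
featA = zeros(nBuf, 10);                     % per buffer: SE_i(3), Av_i(3), Mnorm_xyz(1), ZC_i(3)

for k = 1:nBuf
    seg   = imu((k-1)*hop + (1:L), :);       % L-by-3 buffer, as in (1)
    SE    = sum(seg.^2, 1);                  % signal energy per axis
    Av    = mean(seg, 1);                    % average value per axis
    Mnorm = mean(sqrt(sum(seg.^2, 2)));      % mean magnitude of the 3-D vector
    ZC    = sum(abs(diff(sign(seg - Av))) > 0, 1) / L;   % zero-crossing rate per axis
    featA(k, :) = [SE, Av, Mnorm, ZC];
end
```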
Signal features extracted after filtering include: the root mean square (RMS) value for all three axis components; three autocorrelation features for all three axis components (height of the main peak, and height and position of the second peak); correlation coefficients across axis components; cross-correlation features across all three axis components (height of the main peak, and height and position of the second peak); dynamic time warping coefficients across axis components; three spectral power features per axis component (in three adjacent pre-defined frequency bands ranging from 0.2 Hz to 10 Hz); and eight spectral peak features per axis component (height and position of the first four peaks).

Table 11. List of features extracted from IMU signals and evaluated to measure their contribution to the recognition of head motion. 70 features were extracted in total per sensor signal modality.

Feature type | Feature name | Abbreviation with axis subscript (i → x, y, z)
Time-domain | Signal energy | SEi
Time-domain | Average value | Avi
Time-domain | Mean magnitude of the three-dimensional vector | Mnormxyz
Time-domain | Zero-crossing rate | ZCi
Time-domain | Root mean square value | RMSi
Time-domain | Autocorrelation (height of main peak; height and position of second peak) | AC_1hi, AC_2hi, AC_2pi
Synchronicity | Correlation coefficient | Corr_coefxy, Corr_coefyz, Corr_coefxz
Synchronicity | Cross-correlation coefficients | CCorr_coef_1hi, CCorr_coef_2hi, CCorr_coef_2pi
Synchronicity | Dynamic time warping | DTWxy, DTWxz, DTWyz
Frequency-domain | Spectral power coefficients | SpecP_1bi, SpecP_2bi, SpecP_3bi
Frequency-domain | Spectral peak coefficients (height and position of first 4 peaks) | Speak_1pi, Speak_2pi, Speak_3pi, Speak_4pi, Speak_1hi, Speak_2hi, Speak_3hi, Speak_4hi

The band-pass filter was applied to remove gravitational contributions and unwanted fast movements. The filter was a digital infinite impulse response (IIR) Butterworth filter with a first stopband frequency of 0.05 Hz, a first passband frequency of 0.1 Hz, a second passband frequency of 14 Hz, and a second stopband frequency of 16 Hz. An IIR filter was used because of its speed in high-throughput applications, its filter resolution, and its low memory consumption.

4.1.1.4. Feature selection

A feature selection method was applied to reduce the dimensionality of the extracted feature vector by studying feature contributions, removing redundant or noisy data that could degrade the classification of head motions, and decreasing the computational cost of the real-time feature extraction process to be ultimately implemented. A decision tree (DT) classifier and a function that computes estimates of predictor importance for the classification tree were used for this evaluation. This function sums the changes in risk due to splits on each feature and divides the sum by the number of branch nodes. Because the two blocks of the fusion model architecture classify different types of events, feature selection was applied individually to both blocks.

4.1.1.5. Classification model

Subject-independent binary DT classifiers were trained for each stage of the fusion model architecture. For stage 1, based on the results of the predictor importance function, feature sets were further reduced and used to train classifiers for final evaluation and implementation. For stage 2, four types of feature sets derived from the feature importance analysis were used to the same end.
Feature sets for stage 2 included the complete set of extracted features, the most important features according to the predictor importance function, and two sets of features engineered based on the results of previously trained classifiers. DT classifiers were selected because of their computational efficiency when paired with optimized feature sets. Gini's diversity index was used as the split criterion, with a maximum of 20 splits. Of the total dataset, 70% was used for training and 30% for testing. Ten-fold cross-validation was used to estimate the performance of the classifiers on unseen data. Classification performance was measured using accuracy, defined as

Accuracy = (1 − C_e / N_b) × 100    (2)

where C_e is the number of misclassified signal buffers and N_b is the total number of buffers in the training/testing set. The complexity of the model was evaluated based on the number of features used by the model, the depth of the model, and the number of nodes in the tree.

4.1.2. Study and Data Collection Procedure

To design and optimize the HAD unit based on the described fusion model architecture, a validation study was performed and approved by the Michigan State University Institutional Review Board. A total of three subjects were recruited voluntarily. Subjects were located in separate rooms, and the wearable sensor headset shown in Figure 11 was attached to their heads. Sensor data were collected at a rate of 128 Hz. The participants were then given access to a computer and connected to the Zoom video conferencing program to interact virtually with a study administrator. Sensor data from all participants were collected simultaneously using the system infrastructure introduced in Chapter 3. In addition, the Zoom interaction was recorded for data annotation purposes. The study consisted of instructing participants to perform specific head movements at different motion rates and/or inclinations for 30 seconds. Head actions, inclinations, and motion rates are described in Table 12. Immobility refers to the absence of motion; Δ-pitch to a downward and upward head motion; Δ-yaw to a head rotation to the left and right; and Δ-roll to a head tilt motion from one shoulder to the other. Participants were instructed to adopt the degree of head inclination that felt most natural to them when inclining their heads to the right or the left. When participants moved their heads at different motion rates, they were asked to do so as naturally as possible; therefore, each subject had their own pace for slow, medium, and fast head motion.

Table 12. Summary of head actions performed during the validation study.

Head actions | Head inclinations | Head motion rates
Immobility | Center, tilted to the right, tilted to the left | –
Δ-Pitch, Δ-Yaw, Δ-Roll | – | Slow, medium, fast

Two trials of each head action type with its corresponding head inclination or motion rate were performed. This resulted in a total of 24 head motion recordings, each with a duration of 30 seconds, captured per participant. A total of six labels, corresponding to the three head motions and three head positions of interest, were assigned to the collected data.

4.1.3. Results and Discussion

The analysis, model design, and fusion model were coded in MATLAB. The DT classifiers were trained and validated with the collected dataset consisting of 30-second IMU signal segments, which were further segmented according to the buffer size being analyzed.
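The sketch below illustrates, under the assumption that the Statistics and Machine Learning Toolbox is available, how one stage of this design could be exercised: a decision tree with Gini's diversity index and a 20-split cap, predictor-importance estimates for feature selection, 10-fold cross-validation, and the accuracy measure in (2). The feature matrix and labels are placeholders, not the collected dataset.

```matlab
% Sketch of one stage of the model design (assumes the Statistics and Machine
% Learning Toolbox). X and y are placeholders for the extracted feature
% vectors and their head-action labels.
X = randn(600, 70);                          % one 70-feature vector per buffer
y = randi(4, 600, 1);                        % stage-1 labels: left/right/center/other

cv   = cvpartition(y, 'HoldOut', 0.30);      % 70% training, 30% testing
tree = fitctree(X(training(cv), :), y(training(cv)), ...
                'SplitCriterion', 'gdi', 'MaxNumSplits', 20);

imp       = predictorImportance(tree);       % risk-based feature-contribution estimates
[~, idx]  = sort(imp, 'descend');            % candidate reduced feature subsets

cvTree  = crossval(tree, 'KFold', 10);       % 10-fold cross-validation
cvError = kfoldLoss(cvTree);                 % estimated error on unseen data

yHat     = predict(tree, X(test(cv), :));
Ce       = sum(yHat ~= y(test(cv)));         % misclassified buffers
Nb       = sum(test(cv));                    % buffers in the testing set
accuracy = (1 - Ce/Nb) * 100;                % accuracy as defined in (2)
```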
4.1.3.1. Best feature sets and classification results per model stage

Static stage – To evaluate the best set of features for the detection of head position and to train a classification model for this task, the collected IMU signal segments containing motion data were labeled as “other”. Stage 1 therefore classifies data segments using four labels (left, right, center, and other). The feature importance analysis consistently revealed, across different data buffer sizes, that gyroscope features tend to be the most important signal features for recognizing head positions versus general motion, followed by accelerometer features and, lastly, magnetometer features. Based on the results of the feature importance analysis, ten different feature subsets containing from 3 to 12 features were selected and used to train DT classifiers. Results using training data at different buffer sizes and feature sets varied from 60.50% to 99.56% classification accuracy. Across all trained classifiers and buffer sizes, the results consistently show that the most important features are the accelerometer SEi and Avi in the x-axis and y-axis, the accelerometer RMS in the y-axis, the gyroscope SEi in all axes, and the gyroscope Mnormxyz.

Figure 20. Results of classification accuracy, number of features used by the DT model, number of nodes, and depth of the best performing models across data buffer sizes.

Dynamic stage – Classification was performed using three labels (Δ-pitch, Δ-yaw, Δ-roll). The feature importance analysis revealed that as the data buffer increases in size, frequency-domain features become more important, whereas for short buffer sizes time-domain and synchronicity features may be more important. Likewise, for the detection of head motion it was noted that as the buffer size increases, signal features from the magnetometer become less relevant. Classification results using training data at different buffer sizes and feature sets varied from 91.35% to 98.22%.

Figure 20 shows the classification accuracy and the number of features, number of nodes, and depth of the best-performing classifier for each buffer size, expressed in seconds. To determine an optimal buffer size based on DT classifier accuracy and complexity, a figure of merit (FoM) was established and defined as

FoM = Σ_{i=1}^{2} TeA_i / (TrA_i × Nf_i × Nn_i × Nd_i)    (3)

where i = 1 refers to the results of stage 1 and i = 2 to the results of stage 2, TeA is the testing accuracy, TrA is the training accuracy, Nf is the number of features used for classification, Nn is the number of nodes of the DT, and Nd is the depth of the DT. Training and testing accuracy were both included in the FoM to account for cases where the best performing classifier was an overfitted one, as is the case for stage 1 with a buffer size of 3.5 s and for stage 2 with buffer sizes of 2 s and 4 s. Figure 21 shows the results for the FoM, where a higher value indicates a more optimal classifier.

Figure 21. Results of the FoM, which was used to evaluate classifier performance versus complexity. The higher the FoM value, the more optimal the classifier.
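As a concrete reading of (3), the following sketch computes the FoM from per-stage results; the numbers are placeholders, not the reported values.

```matlab
% Sketch of the figure of merit in (3); the per-stage values below are
% placeholders, not the reported results.
TeA = [99.5 97.9];     % testing accuracy for stages 1 and 2 (%)
TrA = [99.8 98.4];     % training accuracy for stages 1 and 2 (%)
Nf  = [3 4];           % number of features used by each tree
Nn  = [9 15];          % number of nodes in each tree
Nd  = [4 4];           % depth of each tree

FoM = sum(TeA ./ (TrA .* Nf .* Nn .* Nd));   % higher FoM = better accuracy/complexity tradeoff
```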
However, because latency is an important factor in the real-time detection of head actions, a buffer size of 3 s provides the best tradeoff between DT model performance and complexity for both stages of the fusion model. Because each buffer has a 50% overlap with the previous data, the HAD unit updates at a rate of 1.5 s.

4.1.3.2. Head action detection processing unit

Table 13 shows a summary of the parameters of the final implemented DT classifiers at each stage of the fusion model architecture. Based on the results of the FoM, a buffer size of 3 s was selected for the final implementation. The three features used for the classification of head position (stage 1) were Avx and Avy from the accelerometer and Mnormxyz from the gyroscope. In turn, RMSz from the accelerometer and SEx, Corr_coefxz, and Speak_2hx from the gyroscope were used for the classification of head motion (stage 2). Magnetometer data were therefore excluded from the final design. The HAD unit, using the fusion model architecture, has an overall testing accuracy of 97.91% and an F1-score of 98.5%.

Table 13. Summary of model parameters for the HAD unit using a buffer size of 3 seconds.

Stage | Testing accuracy | Training time | Features | # Nodes | Depth
1 | 100% | 0.57488 s | 3 | 9 | 4
2 | 97.86% | 0.61673 s | 4 | 15 | 4
Overall: accuracy 97.91%; Macro F1-score 98.5%; Macro Recall 98.67%; Macro Precision 98.67%; Classification time 0.0034 s

The performance of the designed HAD unit is on par with previous works. However, the architecture of the HAD unit allows for easy re-training to add the recognition of additional head actions by having specialized head action classification models. This represents the first effort in establishing methods for the real-time detection of head activity using the sensor setup established for our behavior monitoring system. Because the data collected for the design of the HAD unit came from a controlled environment, no extensive data annotation procedure was required, accelerating the design process. Although the accuracy of the trained model is high, it is important to highlight that this performance could decrease when the unit is employed on data from the wild, as factors not accounted for in the lab will influence the results. However, by employing a fusion model approach, one or both of the stages could easily be retrained with data from the wild.

4.2. Real-Time User-Independent Speech Intonation Recognizer

Speech signals carry important social information that is expressed through verbal and nonverbal communication. Verbal communication includes the use and understanding of words, whereas nonverbal communication in speech refers to the way words are said, e.g., the tempo and the intonation used while communicating verbally. In many areas of research, such as natural speech processing and affective computing, the automatic identification of speech intonations has played an important role in creating effective human-machine interactions and in recognizing the emotional states of speakers. Other areas of research, such as those that focus on understanding and monitoring social interactions, have started to incorporate information related to voice tonality in the analysis of human behaviors. Speech intonation carries information about our social intentions and feelings. Moreover, the way people talk contributes to building rapport and establishing social likeability. Still, the automatic and real-time recognition of speech intonations and their emotional content is an active and challenging area of research, especially in natural environments [212]. Intonation, in general, refers to the rise and fall in voice inflection, which happens consistently throughout speech; however, experts in the area have not agreed on a universal definition of intonation.
From the reviewed literature, two types of intonation classes have been studied and targeted by automatic recognition systems: (1) intonations described by pitch contour and (2) intonations described by perceived affective state or intention. In general, works that focus on the study and recognition of pitch contour are intended for speech synthesizer systems. In the English language, there are four basic intonation classes that describe pitch contour: Glide-up, Glide-down, Dive, and Take-off. Glide-up refers to a rise in pitch values, which is associated with the production of question and encouragement statements. Glide-down refers to a fall in the pitch contour values and is attributed to the production of a general statement. Dive refers to a combination of falling and rising pitch contour and is associated with warning and commanding statements. Lastly, Take-off refers to a sustained pitch level that gradually increases, which is generally associated with negative affective states. While a variety of works have focused on designing pitch contour recognition systems [213]–[215], the mapping between paralinguistic functions and pitch contours varies within a language and differs cross-linguistically [216]. On the other hand, works that focus on studying and recognizing intonations as an affective function tend to use categorical intonation classes. A variety of works have focused on studying positive and negative intonations [141], [144], [145], but a wide range of other affective states has also been explored [140], [142], [189], [217]. For example, Wu and Liang [142] utilized speech-derived information for the recognition of four emotional states: neutral, happy, angry, and sad. Their dataset was collected from eight Chinese-speaking volunteers in a laboratory environment. The authors made use of acoustic-prosodic information and semantic labels as features of the utterances (sentences) that formed their dataset. Acoustic-prosodic features were extracted offline, and classifiers such as GMM, SVM, and MLP were trained to recognize the four classes, achieving accuracies ranging from 68.73% to 78.16%; SVM was the classifier with the highest accuracy. Because none of the classifiers was optimal for recognizing all emotional states [218], [219], a Meta Decision Tree (MDT) was used for classifier fusion, achieving a recognition accuracy of 80%. However, the designed algorithm is not suitable for real-time processing because of its use of semantic labels and the lack of real-time methods to automatically identify utterances. Lanjewar et al. [189] used the Berlin Emotion Speech Database (BES), an acted emotional-content database with around 500 utterances in German portraying happiness, anger, disgust, fear, sadness, surprise, and neutral emotions. The authors focused on all emotions but disgust and used spectral features such as MFCC, pitch, and wavelet coefficients to train a GMM and a K-NN classifier. Recognition accuracies were 66% and 52% for GMM and K-NN, respectively. The authors showed how GMM dominates the recognition of angry and sad emotions, whereas K-NN dominates the recognition of happy and angry emotions. However, these models were not designed for real-time operation and do not account for natural expressions of emotion during social interactions. A common approach in designing speech intonation recognizers is the use of supervised machine learning methods.
When supervised methods are employed, two principal areas require attention: (1) data collection and annotation and (2) machine learning model design. Traditionally, speech intonation recognizers (including speech emotion recognizers) have made use of acted datasets to train their recognition models. However, systems trained on acted data do not translate well to real-life situations [220], since “full-blown” emotions rarely appear in everyday interactions [221]. Only in the last ten years has the speech emotion recognition research area started to see a shift towards the use of natural datasets [212]. Even though naturally collected datasets exist, they tend to come from call centers, TV shows [217], or interactions with virtual agents, which do not capture the nature of group interaction environments. In general, data collection and annotation are key to supervised machine learning design pipelines, and well-developed annotation guidelines are critical for their success. However, there are no standard annotation guidelines for the design of speech intonation recognition models. In addition, a variety of factors impact the results of a data annotation project, including, for example, the annotation tools employed, the human annotators, and the specific application [116], [222]. Here, the collection of a natural dataset and the development of an annotation pipeline are presented. A natural dataset was collected from research group meetings, where ideas were being exchanged, providing an opportunity to capture reactions to disagreements in a workplace environment. Because of the lack of annotation guidelines for identifying speech intonation, two annotation modes were analyzed: annotation of the audio in the order it occurred and in a randomized order. We hypothesized that as annotators get used to people's way of talking, their sense of familiarity with, and liking of, the speakers will increase. This could affect how datasets are labeled and the overall results of an analysis of dyadic and group social interactions. For instance, annotating in sequential order may help annotators gain a sense of familiarization with the person or persons involved in the interaction, whereas annotating in random order could maintain a sense of distance from the individuals involved, since the context of the meeting is lost. Intonations of interest include a combination of affective states and interrogative expressions. Based on the annotation analysis, a dataset was constructed to train a model for the real-time recognition of intonations. Because of the goal of implementing algorithms for the real-time monitoring of human behaviors, the design of the model followed a resource-aware approach in which the effects of different sampling rates and of feature dimensionality reduction were evaluated through the classification accuracy of the trained models.

4.2.1. Designing a Real-time Speech Intonation Recognition Algorithm

4.2.1.1. Pre-processing

To study the effect that different sampling rates have on the recognition of intonations carrying affective state information, the collected signals were low-pass filtered and downsampled to 8 kHz, 4 kHz, 3.2 kHz, and 2 kHz. Real-time audio signal processing requires the data to be processed in small data frames. Typically, audio signals are processed in frames of ~30 ms to ~80 ms with overlaps between consecutive frames. Here, we make use of a 40 ms frame with 50% overlap.
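A minimal sketch of this pre-processing step, assuming the Signal Processing Toolbox for resample, is shown below; the audio signal is a placeholder.

```matlab
% Sketch of the audio pre-processing step: downsample the 32 kHz recordings
% and split them into 40 ms frames with 50% overlap. The audio signal is a
% placeholder (assumes the Signal Processing Toolbox).
fsIn  = 32000;                           % original recording rate
fsOut = 8000;                            % one of the evaluated target rates
audio = randn(5*fsIn, 1);                % placeholder 5 s recording

x = resample(audio, fsOut, fsIn);        % anti-alias low-pass filtering + downsampling

frameLen = round(0.040 * fsOut);         % 40 ms frame (320 samples at 8 kHz)
hop      = frameLen / 2;                 % 50% overlap
nFrames  = floor((numel(x) - frameLen) / hop) + 1;
frames   = zeros(frameLen, nFrames);
for k = 1:nFrames
    frames(:, k) = x((k-1)*hop + (1:frameLen));
end
```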
These frames are used to detect speech in the audio signal and later perform speech segmentation.

4.2.1.2. Speech detection and segmentation

Speech segmentation, also known as audio segmentation, refers to the task of dividing the audio signal into segments that will be used for feature extraction and classification. Speech segmentation can be performed in two ways: using an utterance-based approach or a windowing-based approach. Utterance-based approaches require the implementation of an automatic speech recognizer (ASR), which increases system complexity and may represent a threat to users' privacy because the goal is to recognize linguistic units such as vowels, phonemes, words, and phrases [140], [141]. Windowing-based approaches, on the other hand, make use of windows of data defined by time (milliseconds to seconds), windows of speech activity (defined by thresholds on pauses or silence periods), or windows of voiced/unvoiced signals. Windowing-based approaches tend to be fast and computationally efficient; however, efficiency is compromised when large amounts of memory are needed to extract the features of interest effectively. To obtain results comparable to speech segmentation performed by an ASR, the windows of speech need to be long enough to contain voiced-unvoiced segments and breath periods, which are the elements that comprise an utterance. To implement a speech segmentation method that can operate in real time, an energy-based voice activity detection (VAD) method was first designed. The design of the energy-based VAD involved the evaluation of an energy threshold: a pre-set threshold based on noise statistic studies [223] and a threshold calculated using a histogram and maxima estimation were evaluated. Then, silence periods were measured and their distribution was studied to determine a threshold to be used for the speech segmentation. Lastly, to complete the segmentation process, the distribution of the durations of the identified speech periods was studied to determine the minimum length of a speech period to be considered an utterance of interest.

4.2.1.3. Feature extraction and selection

Inspired by previous research and by the review of signal features for audio data presented in Section 3.2.2.1, a set of features that have proven to contribute the most to identifying changes in speech affective states was used in this work. This set includes a combination of prosodic features (e.g., energy and pitch), voice quality features (e.g., zero-crossing rate), frequency spectrum coefficients, and cepstral features. For each data frame, the energy was calculated and used to identify voice activity as described in the previous section. If voice activity was detected, the zero-crossing (ZC) count was calculated, and if its value lay below a pre-defined threshold of 35, obtained from [223], the data frame was classified as a voiced speech segment. Voiced segments are periods of speech generated by the vibration of the vocal cords. If the ZC value was over the threshold, the data frame was classified as an unvoiced speech segment, i.e., a period of speech generated by air passing through the vocal cords. For all identified voiced segments, the pitch was determined using autocorrelation. Before the calculation of pitch, the data frame was filtered using a band-pass filter with cutoff frequencies of 50 and 900 Hz.
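To make these frame-level decisions concrete, the sketch below applies an energy check, the ZC threshold of 35, and an autocorrelation pitch estimate after 50-900 Hz band-pass filtering. The energy threshold, filter order, and the 50-400 Hz pitch search range are illustrative assumptions, not the values tuned in this work.

```matlab
% Sketch of the frame-level decisions: energy-based voice activity check,
% zero-crossing test for the voiced/unvoiced split, and autocorrelation pitch
% estimation on a band-pass filtered voiced frame.
fs        = 8000;
frame     = randn(round(0.040*fs), 1);       % placeholder 40 ms frame
energyThr = 1e-3;                            % placeholder energy threshold
zcThr     = 35;                              % ZC threshold from [223]

energy = sum(frame.^2);
zc     = sum(abs(diff(sign(frame))) > 0);    % zero crossings in the frame

pitchHz = NaN;                               % default: silence or unvoiced frame
if energy > energyThr && zc < zcThr          % voice activity detected, voiced frame
    [b, a]    = butter(4, [50 900]/(fs/2), 'bandpass');
    v         = filtfilt(b, a, frame);
    [r, lags] = xcorr(v, 'coeff');           % normalized autocorrelation
    valid     = lags >= fs/400 & lags <= fs/50;   % 50-400 Hz pitch search range
    [~, iPk]  = max(r(valid));
    lagSet    = lags(valid);
    pitchHz   = fs / lagSet(iPk);            % fundamental frequency estimate
end
```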
Note that a wide range of pitch detection algorithms has been studied and that no available pitch detection scheme can be expected to give perfect pitch period estimates. To approximate the calculated pitch value to what is perceived by human hearing, the pitch value obtained through autocorrelation was transformed to the Mel scale [224]:

Mel pitch = 2595 × log10(1 + 0.0014 × pitch_autocorrelation)    (4)

and the Δ-Mel pitch and ΔΔ-Mel pitch values were also calculated. The Δ-Mel pitch value represents the difference between two consecutive Mel pitch values (from two consecutive data frames), and the ΔΔ-Mel pitch value represents the difference between two consecutive Δ-Mel pitch values. For all data frames, the average amplitude of the speech signal is also calculated. To calculate features in the frequency domain, a pre-emphasis filter and a Hamming window were applied, respectively, to emphasize high-frequency components in the speech signal that would otherwise be dominated by low-frequency ones and to reduce spectral leakage. Then, the magnitude of the Fourier transform of the signal in the data frame was calculated and used to determine the mean of the power spectral density. Lastly, for each data frame, 13 Mel-frequency cepstral coefficients (MFCC) were calculated. Table 14 shows the list of features extracted per data frame.

Table 14. List of features extracted per data frame where speech was detected.

Feature type | Feature name
Prosodic | Signal energy; Average value; Mel pitch (pitch value on Mel scale); Δ-Mel pitch; ΔΔ-Mel pitch; Voiced; Unvoiced
Voice quality | Zero-crossing rate
Frequency spectrum coefficients | Mean of power spectral density
Cepstral coefficients | Mel-frequency cepstral coefficients (MFCC)

As the previous set of features is calculated per data frame, each estimated utterance is represented by time series of the aforementioned features. To prepare the extracted data for classification, once an utterance is determined, statistics of its corresponding feature series are calculated. Conversation features such as speaking rate and pausing rate are also calculated for each utterance. The speaking rate is calculated by dividing the number of voiced frames by the number of unvoiced frames, whereas the pausing rate is calculated by dividing the number of silence frames by the total number of frames in the estimated utterance. Table 15 shows the list of features extracted for each utterance based on the time series constructed from the features presented in Table 14. To eliminate redundancy in the extracted features and reduce feature dimensionality, thereby minimizing computational complexity, a correlation analysis across the features presented in Table 15 was performed to eliminate highly correlated features. A high correlation was considered to be any value over 0.8 or under −0.8.

4.2.1.4. Classification of intonations

For comparison, classification models were trained using both the complete set of extracted features presented in Table 15 and the reduced set obtained through the correlation analysis. To evaluate how well different models fit different classes, a variety of models were trained to classify 4 intonation classes, 3 intonation classes, and a combination of 2 classes.

Table 15. List of features extracted for each utterance based on the feature time series.

Feature type | Feature name – time series | Final set of features | Qty
Prosodic | Signal energy | Mean, std, max, min, median, range | 6
Prosodic | Mel pitch (pitch value on Mel scale) | Min, max, range, mean, median, std, number of peaks, mean peak value, std of peak values, median of peak values, mode of peak values | 11
Prosodic | Δ-Mel pitch | Number of peaks of the absolute value, mean peak value, std of peak values, and mean, std, max, min, median, range of last 5 non-zero Δ-Mel pitch values | 9
Prosodic | ΔΔ-Mel pitch | Number of peaks of the absolute value, mean peak value, std of peak values | 3
Conversation | Speaking rate | Voiced/Unvoiced | 1
Conversation | Pausing rate | Silence frames/total frames | 1
Voice quality | Zero-crossing rate | Mean, std, max, min, median, range | 6
Frequency spectrum coefficients | Mean of power spectral density | Mean, std, max, min, median, range | 6
Cepstral coefficients | Mel-frequency cepstral coefficients (MFCC) | Mean, std, max, min, median, range of first 4 MFCC coefficients | 24
In addition, models were also trained with data at different sampling rates to evaluate how reduced data rates may influence the classification performance of speech intonation. Models used for evaluation included Support Vector Machine (SVM), K-Nearest Neighbor (KNN), linear discriminant, Naïve Bayes, Random Forest, and Gaussian Mixture Models (GMMs). For classifiers such as SVM, KNN, and Naïve Bayes, different kernels were also evaluated.

4.2.2. Study Procedure for Audio Collection

Given the lack of datasets with conversations in natural environments and with multiple individuals, we recorded audio and video data from virtual research group meetings at Michigan State University (MSU). The study procedure was approved by the MSU Institutional Review Board. Individuals participating in the virtual meetings were instructed to carry on their normal conversations. Research group meetings were of interest because of the number of ideas being exchanged, providing an opportunity to capture subtle reactions of agreement and disagreement in a workplace environment. The recordings were performed through the Zoom video conferencing program, which also allowed the recording of a separate audio file for each participant in the meeting. A total of five meetings were recorded over one month, with an average duration of 57 minutes and with 4 to 5 individuals per meeting, as shown in Table 16. Subjects used their own audio recording equipment and participated in the meetings from a variety of locations. This yields a dataset containing a variety of background acoustics, microphones, and other recording conditions, representative of these technologies in the wild. All the audio meeting recordings were obtained at a sampling frequency of 32 kHz.

Table 16. Summary of recorded meetings' information. Identification of speech segments/audio clips was performed by two annotators, A and B.

Meeting | Duration | # of participants | Annotator | # of audio clips
1 | 00:39:33 | 5 | A | 377
2 | 01:00:14 | 5 | A | 508
3 | 01:14:18 | 4 | A | 566
4 | 00:58:56 | 4 | B | 395
5 | 00:51:46 | 4 | B | 676

4.2.3. Procedure for Annotating Speech Intonations

The annotation procedure was divided into (1) partitioning the audio recordings into small speech segments and (2) human labeling of the intonation of those speech segments. A total of six annotators participated in the annotation procedure, two for the partitioning of the audio recordings and four for the labeling of intonations. The annotation scheme covers four general and 10 specific intonations.
4.2.3.1. Selection of labels for intonation

An audio dataset with labeled intonations is important for the development of algorithms capable of inferring, for example, emotional-state-related information from audio signals. Inferring speech intonations forms part of the identification of nonverbal communication and the design of human-machine interfaces. Because the interest of this work is in designing an intonation recognition model that can contribute to the understanding of human behaviors and the establishment of rapport during social interactions, we created an initial list of specific intonations that could impact the establishment and perception of rapport. The list consisted of the following intonations: neutral, surprise, excitement, disappointment, affirmative, laugh, commanding, encouraging, doubtful, mad, and question. This initial list was used as a guide for performing the manual audio segmentation. It also inspired the creation of a second specific list of intonations that substituted question and neutral with frustration and none, as well as a general list of intonations that included: neutral, positive, negative, and question.

4.2.3.2. Manual audio segmentation

Audacity, an open-source digital audio editor, was used to perform the manual audio segmentation. Annotators were instructed to identify speech segments, using the “Add Label at Selection” feature of Audacity, based on the intonations perceived from the initial specific list: neutral, surprise, excitement, disappointment, affirmative, laugh, commanding, encouraging, doubtful, mad, and question. Speech segments were identified in each of the separate audio files recorded for each participant during each of the meetings. The identified speech segments were then saved in a text file containing the initial time, the end time, and the perceived intonation of each segment. The annotation of the meetings was divided into two groups, wherein Annotator A identified speech segments in the first three meetings and Annotator B in the last two meetings. Table 16 shows the total number of identified speech segments per meeting. A MATLAB script and the generated text files containing the times of the identified speech segments were used to generate the corresponding individual audio clips. Figure 22 shows a summary of this first part of the annotation procedure.

Figure 22. Diagram summarizing the first part of the annotation procedure, which constitutes partitioning the audio recordings from the virtual meetings into audio clips containing specific speech intonations.
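A minimal sketch of such a clip-generation script is shown below. The file names are hypothetical, and the label file is assumed to follow Audacity's tab-separated format of start time, end time, and label.

```matlab
% Sketch of the clip-generation step: read an Audacity label-track text file
% (tab-separated start time, end time, perceived intonation) and cut the
% corresponding speech segments out of one participant's recording.
labelFile = 'meeting1_participant1_labels.txt';   % hypothetical file names
audioFile = 'meeting1_participant1.wav';
outDir    = 'audio_clips';

T = readtable(labelFile, 'FileType', 'text', 'Delimiter', '\t', ...
              'ReadVariableNames', false);        % columns: start, end, intonation
[x, fs] = audioread(audioFile);

if ~exist(outDir, 'dir'), mkdir(outDir); end
for k = 1:height(T)
    i1   = max(1, round(T{k, 1} * fs));           % first sample of the segment
    i2   = min(size(x, 1), round(T{k, 2} * fs));  % last sample of the segment
    clip = x(i1:i2, 1);
    name = sprintf('clip_%03d_%s.wav', k, char(string(T{k, 3})));
    audiowrite(fullfile(outDir, name), clip, fs); % one audio clip per labeled segment
end
```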
4.2.3.3. Labeling of intonations

The App Designer tool of MATLAB 2019b was used to develop a graphical user interface to facilitate the labeling process and ensure a consistent level of annotation. Figure 23 shows the designed interface for the labeling of intonation in the audio clips. This data labeling program is a customized interface that takes the path to the folder containing the audio clips to be labeled and displays them. The labeling program also takes the name of an annotation text file created beforehand, which is used to record the perceived intonation of each audio clip. Overall information about the audio clips in the folder and about the specific audio clip being displayed is also shown. The annotator can play or stop the displayed audio clip at any moment.

Figure 23. Designed interface for the labeling of intonation in audio clips. (a) The path to the folder containing the audio clips is given to the labeling program together with the name of the data annotation text file; the folder path is used to locate the audio clips that will be labeled by the annotator; (b) general information about the files contained in the given folder path; (c) general information about the audio clip being displayed; (d) plot of the audio clip and the buttons to play or stop the audio; and (e) labeling of the general and specific perceived intonation.

The labeling program supports two levels of annotation: a general intonation label and a specific intonation label. General intonations include neutral, positive, negative, and question. Specific intonations include surprise, excitement, disappointment, affirmative, laugh, commanding, encouraging, doubtful, mad, frustration, and none. Because this work was interested in understanding the impact of intonation perception when labeling data, two identical interfaces were designed: one that displays the audio clips in their order of occurrence in the recorded meetings and another that randomizes the order in which the audio clips are presented. All audio clips were labeled using both interfaces (i.e., the interface presenting the audio in sequential order and the interface presenting the audio in random order). A total of four annotators (C, D, E, and F) participated in the labeling of intonations: two of them labeled the audio segments in both sequential and random order, another labeled only segments presented sequentially, and the last labeled only segments presented randomly. This resulted in each audio clip being labeled three times, both when presented sequentially and when presented randomly. The interface outputs a text file containing the assigned general and specific intonation labels. These files were then used to perform an analysis of inter-annotator agreement (IAA) and to assign labels based on the majority of the annotators. Because of human and labeling-interface errors, not all audio clips were labeled by three annotators; those audio clips lacking a third annotator were dropped from further analysis. Figure 24 shows a summary of this second part of the annotation procedure. The code and executable files to run the data labeling program can be found at https://gitlab.msu.edu/davilasy/audio-data-labeling-tool.

Figure 24. Diagram summarizing the second part of the annotation procedure, which involves labeling audio clips in the order in which they were produced in the meeting and in a random order. Files with label information from all annotators were combined for analysis.

4.2.4. Analysis of Annotations

4.2.4.1. Inter-annotator agreement (IAA)

To determine whether there is a significant effect when labeling audio clips presented in sequential versus random order, two measurements of IAA were applied: the pair-wise Cohen's kappa and the Fleiss' kappa coefficients. IAA measures how consistently two or more annotators make the same annotation for a certain category. In this work, the categories constitute four general intonations and 11 specific intonations. The pair-wise Cohen's kappa coefficient (k_C) is a statistical measure of reliability between two annotators for categorical items [225].
The definition of k_C is:

k_C = \frac{p_o - p_e}{1 - p_e}    (4)

where p_o is the observed proportionate agreement between raters and p_e is the probability of random agreement. Cohen's kappa coefficient was calculated per pair of annotators, meeting, annotation order (random and sequential), and annotation mode (general and specific). On the other hand, Fleiss' kappa (k_F) is calculated over a group of multiple annotators assigning categorical ratings to a fixed number of items. The definition of k_F is:

k_F = \frac{\bar{P} - \bar{P}_E}{1 - \bar{P}_E}    (5)

where \bar{P} is the overall observed agreement chance per category divided by the number of categories and \bar{P}_E is the average chance agreement over all categories. Fleiss' kappa was calculated per meeting, annotation order, and annotation mode. A paired two-sample t-test was utilized to determine if the difference between the mean annotator agreement across all meetings for the different annotation orders and annotation modes was statistically significant. The difference between the means was considered statistically significant if the t-test resulted in a p-value of less than 0.05.
4.2.4.2. Selection of labels for speech segments
Three types of datasets were constructed, for both sequential and random annotation orders, based on the labels provided by the annotators and the IAA analysis. For simplicity, we enumerated the datasets as 1, 1.1, and 2. Dataset 1 contains all the audio segments labeled using general intonations, and Dataset 1.1 contains all the audio segments where two or more annotators agreed on a general intonation label, meaning that all audio segments where the annotators did not agree at all on an intonation were eliminated from this dataset. Audio segments where two annotators agreed on an intonation were given the respective label and the third annotator was ignored. Dataset 2 contains all the audio segments labeled using a specific intonation. Datasets 1.1 of both the sequential and random annotation order sets were used to construct the final dataset used to train the intonation recognition model.
4.2.4.3. Rate of change in perceived general intonations
To study whether there is an increase or decrease in how positive, negative, and question intonations are perceived when labeling audio segments presented in a sequential order versus a random order, the rate of change in labeled general intonations was evaluated. This was studied using the constructed Datasets 1.1, where at least two annotators agreed on a label. This analysis was performed by counting the number of neutral, positive, negative, and question intonation labels that were assigned per meeting for both datasets. Then, to determine if the difference between the means of these quantities was statistically significant, a t-test was performed.
4.2.5. Results and Discussion
4.2.5.1. Audio data annotation
A central question of this study was whether there is a significant effect on perception when labeling audio clips in the order they were generated versus labeling the audio clips in a randomized order.
Figure 25. (a) Cohen kappa IAA results. Each box in the plot summarizes the agreement between a pair of annotators obtained across specific sets of labeled audio clips for different meetings. P-values are shown per pair of annotation modes. (b) Fleiss kappa IAA results. Each box in the plot summarizes the overall agreement of annotators per meeting and for the specific sets of labeled audio clips. P-values are shown per pair of annotation modes.
Figure 25 shows the results of the IAA analysis.
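The agreement measures defined in Section 4.2.4.1 (Eqs. (4) and (5)) and the paired t-test can be sketched in a few lines of MATLAB. The label matrix below is a hypothetical example, not study data, and the final t-test line assumes per-meeting kappa vectors that are not shown here.

% Minimal sketch of Eqs. (4) and (5) from an items-by-annotators label matrix.
cats = {'neutral', 'positive', 'negative', 'question'};
L = {'neutral','neutral','positive';
     'question','question','question';
     'negative','neutral','negative';
     'positive','positive','positive'};                   % N items x 3 annotators (hypothetical)

% Pair-wise Cohen's kappa (Eq. 4) between annotators a and b
a = 1;  b = 2;
po = mean(strcmp(L(:,a), L(:,b)));                         % observed agreement p_o
pa = cellfun(@(c) mean(strcmp(L(:,a), c)), cats);          % label proportions, annotator a
pb = cellfun(@(c) mean(strcmp(L(:,b), c)), cats);          % label proportions, annotator b
pe = sum(pa .* pb);                                        % chance agreement p_e
kC = (po - pe) / (1 - pe);

% Fleiss' kappa (Eq. 5) over all annotators
n   = size(L, 2);                                          % annotators per item
nij = zeros(size(L, 1), numel(cats));                      % per-item category counts
for j = 1:numel(cats)
    nij(:, j) = sum(strcmp(L, cats{j}), 2);
end
Pi    = (sum(nij.^2, 2) - n) ./ (n * (n - 1));             % per-item agreement
Pbar  = mean(Pi);
PbarE = sum((sum(nij, 1) ./ sum(nij(:))).^2);              % chance agreement
kF = (Pbar - PbarE) / (1 - PbarE);

% Paired t-test (alpha = 0.05) between per-meeting kappas for the sequential-
% and random-order annotations (kappaSequential/kappaRandom are hypothetical):
% [~, p] = ttest(kappaSequential, kappaRandom);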
For both the Cohen kappa and the Fleiss kappa, it can be observed that, for all datasets, the IAA is slightly greater among audio segments labeled in sequential order than in random order. A t-test demonstrated that the difference in the degree of agreement measured by the Cohen kappa and the Fleiss kappa, across all datasets, is not statistically significant. Therefore, our data suggest that annotators' agreement level does not change significantly when labeling audio in sequential or random order. On the other hand, dropping audio segments from the initial dataset where none of the annotators agreed on a general intonation to create Dataset 1.1 resulted in a noticeable increase in IAA for the randomly labeled set. Table 17 shows the number of audio clips that were dropped from both the sequential and random order labeled sets in Dataset 1 to create Dataset 1.1 because none of the annotators agreed on a label. In total, 42 annotated items in the sequential-order labeled set and 149 in the random-order labeled set were dropped. This constitutes a drop of 1.68% and 6.44% of the total number of audio segments in the sequential-order and random-order labeled sets, respectively. Although the difference in IAA between the two groups in Dataset 1.1 is not statistically significant (as shown in Figure 25), the difference in the number of audio segments that were dropped was statistically significant. This shows that a higher level of IAA is achieved when speech intonation is labeled in sequential order.
Table 17. Summary of the total number of audio segments dropped from the final datasets because none of the annotators agreed on a label.
Meeting | # dropped (sequential) | # dropped (random)
1 | 10 | 36
2 | 13 | 30
3 | 10 | 33
4 | 8 | 21
5 | 1 | 29
Mean | 8.4 | 29.8
p-value | 0.001582
Note that, when looking across Dataset 1, Dataset 1.1, and Dataset 2, the Cohen kappa values fluctuate between -0.023 and 0.8338. Although perfect agreement is not expected, IAA is typically expected to be between 60% and 80% for a dataset to be useful in machine learning [222]. However, Uebersax [226] suggested that kappa values may be low even though there are high levels of agreement between annotators. In addition, most interpretations assume that two annotators and two categories were used to calculate kappa [227]. It was also noted in [228] that the number of categories and subjects affects the magnitude of the kappa value; for example, kappa is higher when there are fewer categories. However, as this is one of the best-established measurements of annotator agreement in the literature, it was employed in this study.
Although the level of agreement was higher for the datasets labeled in sequential order, the number of positive and negative general intonations was higher when audio clips were labeled in random order. This is illustrated using Dataset 1.1 in Figure 26, which shows the number of identified neutral, positive, negative, and question intonations for both sequential and random annotation order. A t-test revealed that the decrease in neutral-labeled audio clips and the increase in the number of positive-, negative-, and question-labeled audio clips are statistically significant. To gain a more accurate assessment of the increase in positive and negative intonations as a function of labeling in sequential or random order, Table 18 shows the percentage of positive, negative, and question intonations assigned within the non-neutral total number of labeled audio segments.
Overall, 21.29% of the sequentially labeled set was assigned a non-neutral label. In contrast, 37.14% of the randomly labeled set was assigned a non-neutral label. As shown in Table 18, the random-order labeled set contains 3% more positive labeled intonations and 8% more negative labeled intonations than the sequential-order labeled set. These results suggest that an annotator's perception of speech intonation varies depending on the presence of the conversation's context, whereas in its absence, annotators may be more receptive to the nonverbal cues of the speech than to the meaning of the words. Consequently, annotators may identify more non-neutral speech intonations when audio segments are presented in random order.
Figure 26. Number of identified neutral, positive, negative, and question intonations across annotators from Dataset 1.1 for both sequential and random order of annotation. P-values are shown per pair of labeled datasets for each of the different types of general intonations.
Table 18. Percentage of positive, negative, and question labeled audio segments present in the non-neutral labeled portion of Dataset 1.1.
Order of annotation | Total # of non-neutral labeled audio segments | Positive | Negative | Question
Sequential | 519 | 44.86% | 9.44% | 45.66%
Random | 794 | 47.61% | 17.38% | 35.01%
Dataset 2 was used to study how identifying and assigning specific intonations is affected by annotating sequentially or randomly. In general, 75.42% of the sequential-order labeled set and 77.95% of the random-order labeled set were assigned a specific intonation by at least one annotator. Figure 27 shows a summary of the percentage of specific intonations assigned by one, two, and three annotators. The plot shows how single annotators dominate the assignment of specific intonations. A comparison between annotations performed sequentially and randomly shows that intonations such as doubtful, commanding, and disappointment were more frequently identified when labeling in random order, while affirmative, encouraging, excitement, and frustration were more frequently identified when labeling in sequential order. On the other hand, intonations of surprise, laugh, or madness seem to have been identified at a similar rate by both orders of annotation. However, only 31.96% and 27.06% of the total number of audio segments were assigned a specific intonation by two or more annotators for sequential and random labeling, respectively. When compared to the percentage of non-neutral general intonations assigned, the total percentage of labeled specific intonations is greater for the sequential order of annotation. In contrast, the random order of annotation increases the assignment of non-neutral general intonations but decreases the assignment of specific intonations.
Figure 27. Summary of the percentage of specific intonations assigned by one, two, and three annotators. The plot shows how single annotators dominate the assignment of specific intonations, which may explain the low levels of IAA in Dataset 2.
Table 19. Summary of the total number of audio clips that in both sequential and random order of annotation were assigned the same label (intersection) and the total size of the dataset if annotations from both sets were combined (union).
General intonation | Sequential | Random | ∩ | ∪
Neutral | 1919 | 1344 | 1233 | 1538
Positive | 233 | 378 | 143 | 463
Negative | 49 | 138 | 30 | 152
Question | 237 | 278 | 185 | 325
Total | 2438 | 2138 | 1591 | 2478
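The majority rule behind Dataset 1.1 (Section 4.2.4.2), keeping a clip only when at least two of its three general-intonation labels agree, can be sketched as follows. The label rows are hypothetical examples.

% Minimal sketch of the majority-agreement rule used to build Dataset 1.1.
lab = {'neutral','neutral','positive';      % kept, assigned 'neutral'
       'positive','negative','question'};   % dropped (no two annotators agree)
keep   = false(size(lab, 1), 1);
winner = cell(size(lab, 1), 1);
for i = 1:size(lab, 1)
    [u, ~, idx] = unique(lab(i, :));        % distinct labels given to clip i
    counts = accumarray(idx(:), 1);         % votes per distinct label
    [cmax, imax] = max(counts);
    if cmax >= 2                            % two or three annotators agree
        keep(i)   = true;
        winner{i} = u{imax};                % majority label; dissenting label ignored
    end
end
dataset11Labels = winner(keep);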
Dataset 1.1 was selected, over Datasets 1 and 2, to construct the final dataset used to train the intonation recognition model because it contains the highest levels of IAA. To gain an understanding of how much overlap exists between the sequentially and randomly labeled sets in Dataset 1.1, Table 19 shows the intersection and the union of both sets for each of the labels across all meetings. It can be noted that 92% of the randomly labeled set overlaps with the sequentially labeled set for the neutral intonation, whereas for the positive, negative, and question intonations the overlap is 38%, 22%, and 65%, respectively. To increase the number of positive-, negative-, and question-labeled items in the final dataset, the union of both sets was taken as the final dataset for the intonation recognition model design. To assign labels to those audio segments outside of the intersection set, we looked at the group-level agreement calculated using Fleiss kappa. For Dataset 1.1, the average Fleiss kappa across the five meetings for the sequentially labeled set is 0.403, whereas for the randomly labeled set it is 0.35. Therefore, for the audio segments outside the intersection, the labels in the sequential-order set were given priority. However, if the label of an audio segment outside of the intersection was neutral in the sequential-order set and non-neutral in the random-order set, the non-neutral label was assigned to that particular audio segment. This resulted in a total of 1538 neutral, 463 positive, 152 negative, and 325 question-identified audio segments.
4.2.5.2. Real-time intonation recognizer framework
4.2.5.2.1. Speech segmentation
To identify speech segments in an automated and real-time fashion, a VAD was implemented with thresholding rules to determine what to consider speech and what to consider an estimated utterance. First, the recorded signals corresponding to the prepared dataset were downsampled from 32 kHz to 8 kHz. Because the audio obtained from Zoom has a high signal-to-noise ratio, a manual threshold of 0.01 was selected for the VAD. Then, thresholds for the minimum speech time and the maximum silence time used to estimate utterances were determined by evaluating the distribution of the minimum speech periods and maximum silence periods present in the manually identified audio segments. Figure 28 shows the distribution of the identified silent periods in the manually segmented audio. To determine a threshold for the maximum silent duration before considering a new estimated utterance, the 90-percentile of the distribution was calculated, setting the threshold to 0.58 s. Therefore, if a silent period exceeds the threshold of 0.58 s, the next detected speech period is considered a new utterance. Because there may be buffers of data that are detected as speech but that are actually noise, a minimum length of time with detected speech was determined. The dataset contains back-channel signals (i.e., laughs, "yes", "no", etc.); therefore, the length of such back-channels was considered to set the minimum speech threshold, which was set to 4 windows of data, or 0.12 s.
Figure 28. Distribution of the identified silent periods in the manually segmented audio and display of the 90-percentile of the distribution, which was set as the threshold for maximum silence duration before considering a new utterance.
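The thresholding rules above can be summarized in a short MATLAB sketch. The 30-ms window length (which makes 4 windows equal 0.12 s) and the input variable audioIn (a 32 kHz recording) are assumptions for illustration; the 0.01 energy threshold, the 0.58 s maximum silence, and the 4-window minimum speech come from the text.

% Minimal sketch of the VAD-based utterance estimation.
fs  = 8000;
x   = resample(audioIn, 1, 4);               % 32 kHz -> 8 kHz
win = round(0.03 * fs);                      % samples per analysis window (assumed 30 ms)
nWin   = floor(numel(x) / win);
frames = reshape(x(1:nWin*win), win, nWin);
energy = sqrt(mean(frames.^2, 1));           % per-window RMS energy
voiced = energy > 0.01;                      % manual VAD threshold

maxSilWin = ceil(0.58 * fs / win);           % silence windows allowed inside an utterance
minSpeech = 4;                               % minimum windows of detected speech (~0.12 s)
utterances = zeros(0, 2);                    % [startWindow, endWindow] per estimated utterance
startW = 0;  lastVoiced = 0;
for k = 1:nWin
    if voiced(k)
        if startW == 0, startW = k; end
        lastVoiced = k;
    elseif startW > 0 && (k - lastVoiced) > maxSilWin
        if (lastVoiced - startW + 1) >= minSpeech          % utterance long enough to keep
            utterances(end+1, :) = [startW, lastVoiced];   %#ok<AGROW>
        end
        startW = 0;                                        % silence too long: start a new utterance
    end
end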
Figure 29 shows the distribution of the audio segment lengths when the maximum silence duration and the minimum speech thresholds were applied to the dataset.
Figure 29. Distribution of audio segment lengths obtained by the speech segmentation block.
4.2.5.2.2. Sampling rate, signal feature, and classification model evaluation
Using the 8 kHz signals, a total of 67 time-series features were extracted and evaluated using correlation analysis. The correlation analysis revealed that 25 of the 67 evaluated features were highly correlated. To evaluate the effect that eliminating those 25 features may have on classification performance, a variety of classifiers were trained to recognize 4 classes (negative, positive, question, neutral), 3 classes (negative, positive, question), and 2 classes (combinations of pairs of the negative, positive, question, and neutral classes) using both sets of features (complete and reduced). Figure 30 shows the classification accuracy results for models trained using a different number of classes and trained using the initial 67 features and the reduced set of 42 features (shown in Table 20). It can be observed that for all cases of trained classifiers, the reduced set performs comparably to or better than the original set. Therefore, no loss in classification accuracy is incurred when reducing the feature set using correlation analysis.
Figure 30. Evaluation of different classification model accuracies using two sets of features: (1) all features extracted and (2) a reduced feature set obtained by eliminating highly correlated features. Abbreviations: Call – classifier for four classes, CPNQ – classifier for positive, negative, and question classes, CNO – classifier for negative and all other classes combined, CPN – classifier for positive and negative, CNeN – classifier for neutral and negative, CNQ – classifier for negative and question, CPQ – classifier for positive and question, CPNe – classifier for positive and neutral, CNeQ – classifier for neutral and question.
The type of classification model from which the accuracies displayed in Figure 30 were obtained varies across classifiers; however, the predominant model was an SVM with a Gaussian kernel. All models were trained using 70% of the data with a 10-fold cross-validation approach to minimize overfitting.
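The correlation-based pruning and the training/validation scheme can be sketched as follows. X (segments by 67 features) and the categorical label vector y are hypothetical variables, and the 0.9 correlation cutoff is an assumed value, since the text does not specify the one used.

% Minimal sketch of correlation-based feature pruning and SVM training.
R = corrcoef(X);                              % pairwise feature correlations
R(logical(eye(size(R)))) = 0;                 % ignore self-correlation
drop = false(1, size(X, 2));
for i = 1:size(X, 2)
    if ~drop(i)
        drop(abs(R(i, :)) > 0.9 & ~drop) = true;   % drop features highly correlated with i
    end
end
Xred = X(:, ~drop);                           % reduced feature set

cv    = cvpartition(y, 'HoldOut', 0.3);       % 70% training / 30% testing split
idxTr = training(cv);  idxTe = test(cv);
mdl = fitcecoc(Xred(idxTr, :), y(idxTr), 'Learners', ...
      templateSVM('KernelFunction', 'gaussian', 'Standardize', true));
valAcc  = 1 - kfoldLoss(crossval(mdl, 'KFold', 10));            % 10-fold cross-validation
testAcc = mean(predict(mdl, Xred(idxTe, :)) == y(idxTe));       % held-out accuracy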
To evaluate the effect that reducing the sampling rate may have on classification accuracy, an SVM with a Gaussian kernel was selected based on the results from the correlation analysis. Furthermore, features from the 42 listed in Table 20 that did not present a Gaussian distribution were eliminated from this part of the evaluation. These features included the number of peaks in Mel pitch, the frequency spectrum coefficients, signal energy, speaking rate, pausing rate, and the std of the last 5 non-zero values of Δ-Mel pitch. In total, the models were trained with 33 features.
Table 20. Final list of features used for classification of intonations. This final list was obtained after eliminating 25 highly correlated features out of 67.
Feature type | Feature name – time series | Final set of features | Qty
Prosodic | Signal energy | Mean, min | 2
Prosodic | Mel pitch (pitch value on Mel scale) | Min, max, mean, std, number of peaks, mean peak value, std of peak values, mode of peak values | 8
Prosodic | Δ-Mel pitch | Mean peak value, std of peak values, and mean, std, and median of last 5 non-zero values | 5
Prosodic | ΔΔ-Mel pitch | Std of peak values | 1
Conversation | Speaking rate | Voiced/unvoiced | 1
Conversation | Pausing rate | Silence frames/total frames | 1
Voice quality | Zero-crossing rate | Mean, std, min, median | 4
Frequency spectrum | Mean of power spectral density coefficients | Mean, min, median | 3
Cepstral coefficients | Mel-frequency cepstral coefficients (MFCC) | Mean, std, max, min of first 4 MFCC coefficients and range of the 4th MFCC coefficient | 17
The interest in exploring the effect of the sampling rate on classification accuracy comes from the goal of designing a computationally efficient, resource-aware, real-time intonation recognition unit. Table 21 shows the results of the sampling rate analysis. Note that even though the overall classification accuracy of the models across the evaluated sampling rates does not change significantly, the precision accuracies do change for positive and negative intonations. The precision accuracy for recognizing positive intonations decreases, although not consistently, as the sampling rate is decreased. On the other hand, the models seem to become more sensitive to negative intonations as the sampling rate is reduced. Because the accurate recognition of positive and negative intonations is important for evaluating the level of positivity contributing to the rapport between dyads and groups, 8 kHz was used in further analysis.
Table 21. Comparison of classification accuracy when the sampling rate of the input signal is varied. The overall accuracy of the classification model does not seem to suffer a significant reduction; however, the precision accuracy for positive audio segments/estimated utterances does decrease by at least 1/3.
Sampling rate | Model accuracy (validation training/testing) | Question | Positive | Negative | Neutral
8 kHz | 40.53%/41.76% | 46.81% | 45.74% | 51.06% | 23.40%
4 kHz | 40.18%/42.45% | 54.26% | 21.28% | 63.70% | 21.28%
3.2 kHz | 39.04%/39.23% | 46.81% | 24.47% | 54.41% | 24.47%
2 kHz | 38.13%/41.09% | 42.55% | 31.91% | 57.55% | 24.47%
To investigate how well specific models adapt to the recognition of specific intonations, classification models were trained for the recognition of four, three, and two classes. Table 22 presents a summary of the results. This summary suggests that models trained to classify two classes achieve a more balanced precision accuracy. In addition, models that focus on positive, negative, and question intonations also achieve more balanced accuracies across their classes.
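The sampling-rate sweep summarized in Table 21 can be sketched as a retraining loop. The cell array utterances (8 kHz clips), the label vector y, and the placeholder function extractIntonationFeatures are hypothetical stand-ins for the segmentation and feature-extraction steps described above.

% Minimal sketch of retraining the Gaussian SVM at each evaluated sampling rate.
rates  = [8000 4000 3200 2000];
valAcc = zeros(size(rates));
for r = 1:numel(rates)
    feats = cellfun(@(u) extractIntonationFeatures(resample(u, rates(r), 8000), rates(r)), ...
                    utterances, 'UniformOutput', false);    % re-extract features per rate
    Xr  = vertcat(feats{:});                                 % segments x features
    mdl = fitcecoc(Xr, y, 'Learners', ...
          templateSVM('KernelFunction', 'gaussian', 'Standardize', true));
    valAcc(r) = 1 - kfoldLoss(crossval(mdl, 'KFold', 10));   % validation accuracy per rate
end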
Table 22. Summary of the best performing classification models with their respective model and precision accuracies. Precision values are listed in the column order negative, positive, question, neutral, as applicable to each classifier.
Model type | Model accuracy | Precision accuracies
Medium Gaussian SVM | 41.76% | 47%, 46%, 51%, 23%
Medium Gaussian SVM | 57.40% | 60%, 50%, 62%
Medium Gaussian SVM | 65.10% | 64%, 66%
Medium Gaussian SVM | 69.60% | 71%, 68%
Linear SVM | 66.70% | 78%, 55%
Coarse Gaussian SVM | 63.50% | 65%, 62%
Medium Gaussian SVM | 73.50% | 71%, 76%
Medium Gaussian SVM | 69.20% | 68%, 70%
Linear SVM | 67.50% | 86%, 40%
A possible reason for the low classification accuracy of the neutral class is that, based on the analysis of annotations, audio segments that were marked as neutral often carried a specific intonation that could be grouped with the negative or positive types of intonation.
To better understand the advantages and disadvantages of this work, Table 23 compares this approach with others in the literature. Works are compared based on the type of data used for training, the real-time capability, the sampling rate, the linguistic unit used for feature extraction, the number and type of extracted features, the type of classifier, the type of classes, and the percentage of accuracy. The works in [142], [144], [189], [229] do not perform real-time processing, and most of those works made use of acted databases, not accounting for speakers with different cultural backgrounds, variations in the recording environment, or variations in microphone distance. On the other hand, the works that perform real-time processing [230], [231] have focused on the recognition of affective state classes, instead of combining them with other types of intonations. For example, the work by Alonso et al. [230] demonstrated the use of just 6 features for the classification of 5 affective classes, achieving from 41.06% to 52.43% classification accuracy when processing natural speech datasets. However, the sampling rate and the type of linguistic unit used for feature extraction suggest that the classification of affective states is performed at a high rate compared to the other works presented in Table 23. The application of human/group behavior monitoring does not require such a high recognition rate for the quantification of the positivity levels contributing to the rapport between people in an interaction. The work presented in this chapter uniquely focuses on combining affective classes with a question intonation. The interest in recognizing question intonations was to better understand the dynamics of a conversation and patterns of answering positively or negatively. In addition, the processing of a natural dataset at the low sampling frequency of 8 kHz and the estimation of sentence-level utterances, whereas other works used voiced frames or the acted speech segments from their respective datasets for classification, represent an advantage for real-time processing. In terms of classification performance, although difficult to compare because of the nature of the constructed dataset and the types of classes being classified, the results of this work are better than or comparable to those previously reported in the literature.
Table 23. Comparison of selected works in the research area of speech emotion recognition. Abbreviations for classifiers: Support Vector Machine (SVM), Meta Decision Tree (MDT), Gaussian Mixture Models (GMM), Auto-Associative Neural Networks (AANN), Sequential Minimal Optimization (SMO).
Reference | Real-time | Dataset | Sampling rate | Linguistic unit | # Features | Feature types | Classifier | Classes | Accuracy
[142] | No | Natural | 16 kHz | - | 22 | Prosodic, voice quality | SVM | Positive, negative | 52%
[144] | No | Acted | 16 kHz | Sentence level | 253 | Prosodic, semantic | SVM+MDT | Neutral, happy, angry, sad | 80%
[189] | No | Acted | - | Sentence level | - | Prosodic, cepstral, wavelet | GMM | Happiness, anger, fear, sadness, surprise, neutral | 66%
[229] | No | Acted | 8 kHz | Sentence level | 400 | Voice quality, spectral | SVM+AANN | Anger, disgust, fear, happy, neutral, sadness | 84%
[230] | Yes | Natural | 16 kHz | Voiced frame | 6 | Prosodic, spectral | SVM | Anger, boredom, happy, neutral, sadness | 41.06%-52.43%
[231] | Yes | Acted | - | Sentence level | - | Prosodic | SMO | Happy, sad, surprise, fear, disgust, anger, neutral | 67%
This work | Yes | Natural | 8 kHz | Estimation of sentence | 42 | Prosodic, voice quality, cepstral | SVM | Positive, negative (70%); Negative, question (74%); Positive, negative, question (57%); Positive, negative, question, neutral (42%)
However, improvements should be made in the number of features and the type of classifier used to improve computational efficiency.
4.3. Overall Discussion
In general, this work focused on two main technical points when designing sensor signal processing algorithms for the recognition of local transformed features. The first technical point relates to the collection of data to train models to recognize behavioral cues of interest, and the second to the evaluation of signal processing and machine learning model parameters to increase computational efficiency. In this chapter, the collection of data to train machine learning models can be divided into two types: acted/evoked data collection and natural data collection. Acted/evoked data was collected for the training of the head action detection model, while natural data was collected for the training of the speech intonation detection model. Acted/evoked datasets are good for fast prototyping because the onset of the events of interest, or "classes of interest," is known from the data collection process. For example, the dataset collected to train the HAD unit was gathered in a manner that evoked the actions of nodding (Δ-pitch), shaking (Δ-yaw), and rolling (Δ-roll) the head. Therefore, the onset of the action of interest was known and no data annotation process was needed to prepare the dataset for processing. However, models trained with acted/evoked data may not perform as expected when implemented in the wild because such data does not carry the level of noise or variation in events that may be encountered in a natural environment. On the other hand, natural datasets are better at representing the reality of day-to-day interaction. However, the preparation of natural datasets is subject to annotators' interpretation, creating a high level of variability in assigned labels among annotators, and is time-consuming. For example, the preparation of the dataset collected to train the speech intonation recognition model required the participation of at least six annotators: two to perform manual segmentation and four to perform annotation of speech intonations. Therefore, well-established data annotation procedures can help decrease variability in assigned data labels and help establish the minimum number of annotators required to obtain an optimal dataset.
This chapter shows how the level of inter-annotator agreement varies depending on the protocol used for data annotation. In the case presented here, segments of speech were labeled in the order in which they occurred, as well as in random order. This revealed that when speech segments are labeled in random order, more non-neutral intonations are identified than when the segments are labeled in order, possibly confirming that the lack of context in the speech segments influences the perception of intonations. Labeling intonations of speech segments presented in random order may be preferable when designing human behavior monitoring systems that are free of speech recognition units or any other methodology that provides information about the context of a conversation.
When evaluating different signal processing and machine learning parameters to decrease computational complexity while maintaining good accuracy, this work looked at data buffer sizes for real-time processing, the number of signal features used for classification, the type of features, and the complexity of the selected models. Optimal machine learning models make use of a combination of optimal values of these parameters. However, on occasion, optimized parameters can reduce the ability of the classification models to generalize, depending on the dataset used for training. For example, feature reduction techniques used to reduce the computational complexity of an overall machine learning pipeline can increase the classification accuracy on the training dataset, but they can also decrease the ability of the model to transfer to cases outside the ones on which the model was trained. This can be particularly true when using acted/evoked datasets. Therefore, it is recommended to use natural data to confirm the performance of optimized models designed with acted/evoked datasets. On the other hand, a combination or fusion of optimized classification models provides the opportunity to simplify re-training processes, if necessary or desired, and to reduce computational time and power consumption. This work implemented this methodology in the design of the real-time HAD unit.
4.4. Summary
This chapter presents the design and implementation of real-time data processing blocks to recognize head activity and intonations using IMUs and audio signals, respectively. The HAD unit was trained with data collected in a laboratory environment. The HAD unit recognizes three static positions and three dynamic motions (i.e., Δ-pitch, Δ-yaw, Δ-roll) with an accuracy of 97.91%. On the other hand, a real-time speech intonation recognizer was trained using natural data collected during research team meetings and labeled using affective states and an interrogative expression. The natural dataset was constructed by analyzing two methods of labeling intonations: in sequential order or in random order of occurrence. To the best of our knowledge, this is the first reported effort that studies the effects of labeling speech intonations using different orders of presentation, i.e., preserving the context of the interaction when labeling in sequential order or eliminating context when labeling in random order. Results revealed that labeling in sequential order leads to a higher level of inter-annotator agreement, whereas labeling in random order leads to a higher level of non-neutral intonations being recognized by two or more annotators.
As the use of nonverbal behaviors to train machines for the recognition of human behaviors excludes contextual information, this may suggest that, in preparing natural datasets for training such systems, labeling in random order may be preferred. Furthermore, the trained speech intonation recognizer achieved a 70% classification accuracy when classifying positive and negative classes and 57% when classifying between positive, negative, and question intonations. This also represents the first effort in combining affective classes with an interrogative intonation. The next chapter presents insights into the design and execution of a social interaction study to expand the available datasets for the complete design and implementation of the real-time machine learning framework.
5. SOCIAL INTERACTION STUDY: METHODS, DESCRIPTION OF DATA COLLECTION, AND ANALYSIS
The human studies and collected datasets presented in Chapter 4 served as the basis to start the design of the models, able to identify individual nonverbal behavioral cues of interest, that form part of the machine learning framework presented in Chapter 3. However, to explore and draw relationships between nonverbal cues from multiple individuals involved in an interaction and multiple channels of communication, a more comprehensive dataset needs to be utilized. Currently, no publicly available dataset exists that meets the needs of this work, that is, a dataset composed of audio, IMU, and physiological data from a head-mounted device, emotional state labels, and rapport labels. Therefore, in this chapter, the design and execution of a social interaction study are presented, together with a description of the sensor data and survey data collected.
5.1. Study Methods
The goals of this human study were (1) to collect audio, visual, and physiological sensor data while a group of individuals interacted for a given period of time, (2) to provide an environment where low and high levels of rapport could be evoked, and (3) to collect self-reported data about the liking between dyads in a group and their perceived dyadic and group rapport levels.
5.1.1. The Basis for Recruitment of Participants
For this study, dyads were considered the basic unit of interest to understand group consonance. Because our interest is in group interactions, groups were required to be composed of a minimum of 3 individuals, which contains 3 dyadic interactions. This work aimed at collecting data from at least 60 dyadic interactions, which led to the aim of forming groups of 4 individuals, each containing 6 dyadic interactions. However, due to human factors such as participants' availability or inability to complete the study, some groups were composed of 3 individuals. This resulted in the study collecting data from a total of 10 groups formed with 3 to 4 individuals, which required a sample size of ~40 individuals.
5.1.2. Study Overview
The study procedure was approved by the Michigan State University (MSU) Institutional Review Board (IRB) and conducted under strict physical distancing and privacy protocol guidelines. To form the 10 groups of 3 to 4 individuals, the study was divided into two parts: (1) consent to participate in the study and the administration of two questionnaires, and (2) the interaction between participants, where multi-sensor data was collected, and the additional administration of questionnaires.
Participants' interaction consisted of two periods of 20 minutes, where in each period participants discussed a topic statement given to them. Figure 31 describes the parts involved in the study and their respective approximate durations.
Figure 31. General description of the social interaction study timeline.
The study was advertised through email to various departments across MSU and in flyers posted around university buildings. Therefore, participants were recruited from the MSU campus; however, there were no requirements for subjects to be students or MSU-affiliated in any way. Interested participants were first asked to fill out a contact release form that briefly defined the goals of the study and the participant criteria. The contact release form also allowed potential participants to submit their contact information and confirm that they met the eligibility criteria. Individuals were eligible to participate in the study if they were 18 years of age or older and could be physically present on the MSU campus at the time of the second part of the study. Participants were individually contacted to schedule a 30-minute Zoom meeting to perform the first part of the study.
5.1.3. First Part of the Study
During the first part of the study, the consent form was discussed and signed by the participant. Then, the participant was provided with two questionnaires. The first one was a Demographic questionnaire that collected information about their gender, age, ethnicity, educational background, and current employment status. The second one was a Topic questionnaire that asked participants to provide their opinion (how much they agree or disagree), using an 11-point Likert scale, on a series of topic statements that included gun control, vegetarianism, animal testing, universal healthcare, the death penalty, religious freedom, professional sports, vaccines, college athletes, the environment, animal hunting, exercise, TV shows, travel, video games, food, outdoor activities, and social interactions (see APPENDIX A). After the questionnaires were completed, the participant was asked for their availability to perform the second part of the study. The responses to the Topic questionnaire, together with the availability of the participants, were used to form the 10 groups of 3 to 4 participants.
5.1.4. Topic Statement Selection and Group Formation
During the second part of the study, each group participated in two interactions, wherein two different topics were discussed. The first topic was intended to be one that not all individuals in the group agreed on, and the second one was intended to be one where all participants had a similar opinion. Therefore, groups were formed by matching individuals' responses from the Topic questionnaire to invoke the desired level of interaction in each discussion section. The goal was to invoke conflict during the first discussion section that could affect the establishment of rapport, but to invoke an increase in rapport during the second interaction. Table 24 shows a summary of the topics selected for discussion for each group and their level of agreement in opinion.
Table 24. Summary of topics selected for discussion and the average level of group agreement.
Group | Num. of individuals | Topic #1 | Disagreement average | Disagreement range | Topic #2 | Agreement average | Agreement range
1 | 4 | Death penalty | 4.25 | 8 | Environment | 1 | 2
2 | 4 | Death penalty | 3.5 | 7 | Vaccines | 1 | 2
3 | 4 | Animal hunting | 4.5 | 9 | Environment | 1 | 2
4 | 3 | Gun control | 6 | 7 | College athletes | 5 | 0
5 | 4 | Animal hunting | 3.75 | 10 | Universal healthcare | 8.5 | 5
6 | 3 | Death penalty | 4 | 7 | Vaccines | 8 | 6
7 | 3 | Animal testing | 3 | 3 | Environment | 2 | 5
8 | 4 | Vaccines | 5.25 | 6 | Death penalty | 7.25 | 4
9 | 3 | Death penalty | 5.7 | 10 | Gun control | 1 | 2
10 | 4 | Animal testing | 5 | 5 | Environment | 1.75 | 5
Average range | | | | 7.2 | | | 3.3
In general, the first topic was selected by looking for an average level of agreement of 5 (neutral opinion) but with a range in opinions of 5 or more, which indicates the presence of diverse opinions. The second topic was selected by looking for an average level of agreement close to 1 or 10 with a range of less than 5, or an average level of agreement of 5 with a range of less than 1. However, there were cases where the availability of the participants limited the groups that could be formed and the variety of opinions available, which was the case for Group 6 and Group 7 for the second and first topics of discussion, respectively.
5.1.5. Second Part of the Study and Main Procedure
The second part of the study, which constituted the main part of the study, took place in a large laboratory space with four separate rooms. Each participant was assigned to a room that was equipped with a computer, the Zoom meeting software, a microphone, a webcam, a BrainBit headband, a Shimmer device, and the infrastructure to collect data through LSL. A study team member helped the participants put on the wearable sensors (the BrainBit and Shimmer), as shown in Figure 11. Participants were then instructed to fill out an emotional state questionnaire (see APPENDIX B) containing a 9-point self-assessment manikin arousal, valence, and dominance (AVD) scale [232] and an 11-point rating tool based on the circumplex model of emotion [39], [233]. In the 9-point Likert arousal, valence, and dominance scale, participants were instructed to use arousal to describe how intense their current emotion is, using 1 as low and 9 as high; valence to describe how negative or positive their current emotion is, using 1 as negative and 9 as positive; and dominance to describe the degree to which their current emotion controls their thoughts and actions, using 1 as low and 9 as high. In the 11-point rating tool based on the circumplex model of emotion, participants were instructed to "rate how are you feeling at this moment using the following scale." Eight 11-point Likert scales were presented evaluating the following items: tense-calm, nervous-relaxed, stressed-serene, upset-contented, sad-happy, depressed-elated, lethargic-excited, and bored-alert, where the far left of the scale (score of 1) belongs to the negative feeling and the far right (score of 11) to the positive one.
Participants were given the first topic statement for discussion and instructed to write at least three reasons to back up their opinion on the issue. Participants were also instructed to discuss the topic statement among themselves, to share their opinion during the interaction, and to persuade those with a difference in opinion that their personal view was more reasonable.
The instructions were given as follows, where the topic statement for "death penalty" is used as an example: "Consider the following statement: "The death penalty should be used to deter heinous crimes." for which you expressed on a scale from 0 (very strongly disagree) – 10 (very strongly agree) that your opinion is better described by a _X_ (inkling to agree/disagree/neutral). Your first task is to make a note (below) of at least three reasons why you have this opinion. Then, during the virtual meeting, your task is to discuss these reasons and sway other attendees towards your point of view if differences in opinion are found. During the virtual meeting discussion, you should also try to learn the specific reasons other attendees express their opinions." where X represents the score given on how much they agree or disagree.
The group was then left to discuss the topic statement for ~20 minutes. At the end of the discussion, participants were asked to fill out the emotional state questionnaire and a rapport questionnaire. Rapport was measured using items derived from [120], as described in [121]. Some items were prefaced with the instruction to "rate yourself in the interaction on the following characteristics." The items were smooth, bored, cooperative, satisfied, comfortable, awkward, engrossed, involved, friendly, active, and positive. The remaining items were prefaced with the instruction to "rate the interaction between you and X on the following characteristics," where X represented one of the other two (for groups of 3) or three (for groups of 4) individuals in the interaction. The items included well-coordinated, boring, cooperative, harmonious, unsatisfying, uncomfortably paced, cold, awkward, engrossing, unfocused, involving, intense, unfriendly, active, positive, dull, worthwhile, and slow. Responses were recorded on five-point Likert scales.
The rapport questionnaire included the question "How much are you enjoying the discussion?," the answer to which was recorded on an 11-point Likert scale. Also, a liking score was obtained from each individual in the interaction in relation to everybody else. This score was obtained on a five-point Likert scale in response to the question "Do you like your interaction with subject X?," where X represented one of the other two (for groups of 3) or three (for groups of 4) individuals in the interaction (see APPENDIX C). After the questionnaires were filled out, a second topic statement for discussion was given and participants were asked to follow the previous discussion instructions. At the end, participants filled out the emotional state and rapport questionnaires.
5.2. Data Collection and Description
All collected data were managed by the study coordinator from a central computer and saved using the XDF data format [234]. The multi-sensor hardware and software infrastructure presented in Chapter 3 was used for the collection and management of sensor data. This process was transparent to the participants. Data collection through LSL was primarily performed during the two interaction periods; however, data was also collected after each interaction, while participants were filling out the questionnaires, for data quality assurance purposes. For each group, a total of four XDF files were generated: two corresponding to the interaction periods and two corresponding to the administration of the questionnaires.
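Reading one of these recordings back can be sketched as follows, assuming the open-source xdf-Matlab importer (load_xdf) is on the path; the file and stream names below are illustrative only.

% Minimal sketch of inspecting one of the study's XDF recordings.
streams = load_xdf('group01_interaction1.xdf');
for s = 1:numel(streams)
    fprintf('%s: %d channels, %d samples\n', streams{s}.info.name, ...
        size(streams{s}.time_series, 1), size(streams{s}.time_series, 2));
end

% Pull out, e.g., an audio stream and its LSL time stamps for later alignment
isAudio     = cellfun(@(st) contains(st.info.name, 'Audio'), streams);
audioStream = streams{find(isAudio, 1)};
audioData   = audioStream.time_series;       % channels x samples
audioTime   = audioStream.time_stamps;       % LSL clock time, in seconds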
In addition, the entire study was recorded through Zoom, from which the video of the meeting was obtained for annotation purposes, in addition to the audio for each participant. However, because the Zoom meeting and the four periods of sensor data were recorded separately, the video from Zoom needed to be synchronized with the periods of sensor data. This synchronization was performed manually by taking the audio recorded through LSL and aligning it with the audio obtained from the Zoom recording. The synchronization was performed for the periods of data recorded during the interactions. Therefore, for each group, two videos were produced after synchronization, each corresponding to one of the two interactions that they carried out.
The overall dataset consists of 20 group discussions in English, 2 per group, each lasting 21 minutes on average. This results in an average total of 420 minutes of audio, visual, and physiological data. However, due to technical problems, part of the data from groups one to three (summarized in Table 25) was corrupted. The data corruption and, on occasion, data loss problems seemed to be related to the order in which the sensors, especially the Shimmer, were prepared for connection to the multi-sensor system. It was determined that the PPG connector of the Shimmer device needed to be connected before turning it on and connecting it to its respective computer.
Table 25. Summary of corrupted data and lost data from the first three groups due to a technical issue. "-" indicates no loss and "x" indicates corrupted data or lost data.
Group | Participant | Interaction | Audio | PPG | Acc | Gyr | Mag | EEG
1 | 1 | 1st | - | x | x | x | x | -
1 | 1 | 2nd | - | x | x | x | x | -
1 | 2 | 1st | - | x | - | - | - | x
1 | 2 | 2nd | - | x | - | - | - | x
1 | 4 | 1st | - | - | x | - | - | -
1 | 4 | 2nd | - | - | x | - | - | -
2 | 2 | 1st | - | - | - | x | x | -
2 | 2 | 2nd | - | - | - | x | x | -
3 | 1 | 2nd | - | x | x | x | x | -
3 | 4 | 1st | - | - | - | x | x | -
3 | 4 | 2nd | - | - | - | x | x | -
5.3. Summary and Analysis of Questionnaires' Data
5.3.1. Demographics
The second part of the study had a total participation of 35 individuals (21 males, 12 females, and 2 who identified as other). One of the female participants formed part of two groups. Participants' ages ranged from 18 to 44 years, where 15 individuals were in the range of 18-24 years old, 17 were in the range of 25-34 years old, and 3 were in the range of 35-44 years old. Participants' ethnicities were predominantly White and Asian, with 15 and 10 participants, respectively. Other represented ethnicities included Black or African American, Native Hawaiian or Pacific Islander, American Indian or Alaska Native, and combinations of all of them. Participants' highest level of education ranged from having a high school diploma to having a doctorate or professional degree, where 10 participants indicated that they had some college credits, 10 had bachelor's degrees, and 10 had master's degrees. In terms of employment, 29 of the participants identified as students, 3 as having a part-time job, and 3 as having a full-time job.
5.3.2. Emotional State
The emotional state questionnaire employed two scales: a 9-point Likert AVD scale and an 11-point rating tool based on the circumplex model of emotion. To better display whether a participant was feeling a more negative or positive emotion, the responses to the 9-point Likert AVD scale were transformed and centered at 0, meaning that the scale was modified to go from -4 to 4 instead of 1 to 9. Likewise, the 11-point rating tool scale was modified to go from -5 to 5 instead of 1 to 11.
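The re-centering and the per-item comparisons reported in the next tables can be sketched as follows. The raw response matrices (participants by items, on the original 1-9 and 1-11 scales) are hypothetical variable names.

% Minimal sketch of the scale re-centering and item-wise t-tests.
avdBefore1  = avdBefore1Raw - 5;             % 1..9  -> -4..4 (centered at 0)
avdAfter1   = avdAfter1Raw  - 5;
circBefore1 = circBefore1Raw - 6;            % 1..11 -> -5..5 (centered at 0)
circAfter1  = circAfter1Raw  - 6;

% Mean and standard deviation per item (columns), as reported in the tables
avdMean = mean(avdAfter1, 1);
avdStd  = std(avdAfter1, 0, 1);

% Two-sample t-test for equal means, item by item (e.g., arousal = column 1)
[~, pArousal] = ttest2(avdBefore1(:, 1), avdAfter1(:, 1));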
Table 26 and Table 27 show a summary (average and standard deviation) of the responses provided by the participants for each instance in which the emotional state questionnaire was filled out. Table 26 and Table 27 also show the results of a two-sample t-test for equal means that was applied to two sets of data to find which emotional states were significantly affected throughout the interactions. The two sets of data were (1) the responses to the emotional state questionnaire before and after the first interaction and (2) the responses obtained after the first interaction and after the second interaction. The results show that, based on the responses to the arousal, valence, and dominance scale, there was a statistically significant change in arousal and valence levels from before to after the first interaction. For the rating tool based on the circumplex model of emotion, the results show a statistically significant change in the lethargic-excited and bored-alert levels from before to after the first interaction and in the tense-calm, nervous-relaxed, and stressed-serene levels from after the first interaction to after the second one.
Table 26. Summary of the responses provided by the participants for the 9-point Likert AVD scale. This table shows the average and standard deviation of the provided responses to the items in the scale for each instance in which the emotional state questionnaire was filled out. Also shown are the p-values resulting from the t-test applied between each of the instances in which the questionnaire was filled out. These results demonstrate that there is a significant change in arousal and valence before and after the 1st interaction.
Scale item | Before 1st interaction | After 1st interaction | After 2nd interaction | p-value (before vs. after 1st) | p-value (after 1st vs. after 2nd)
Arousal | -1.94±1.56 | -0.28±1.92 | -0.23±2.04 | 0.00013 | 0.8999
Valence | 0.53±1.42 | 1.06±1.47 | 1.63±1.55 | 0.0463 | 0.4170
Dominance | -0.79±1.86 | 0.17±1.54 | -0.22±1.76 | 0.2622 | 0.0874
Table 27. Summary of the responses provided by the participants for the 11-point rating tool based on the circumplex model of emotion. This table shows the average and standard deviation of the provided responses to the items in the scale for each instance in which the emotional state questionnaire was filled out. Also shown are the p-values resulting from the t-test applied between each of the instances in which the questionnaire was filled out. These results demonstrate that there is a significant change in two of the items before and after the 1st interaction and in three of the items between the 1st and 2nd interactions.
Scale item | Before 1st interaction | After 1st interaction | After 2nd interaction | p-value (before vs. after 1st) | p-value (after 1st vs. after 2nd)
Tense-Calm | 0.33±2.1 | -0.75±1.92 | 0.47±2.37 | 0.3821 | 0.0189
Nervous-Relaxed | -0.64±2.22 | -0.33±1.82 | 0.92±1.99 | 0.5251 | 0.0070
Stressed-Serene | -0.61±2.03 | -0.5±1.7 | 0.54±1.96 | 0.5623 | 0.0289
Upset-Contented | 0.14±2.09 | 0.19±1.51 | 0.92±2.01 | 0.8973 | 0.0886
Sad-Happy | -0.39±1.52 | 0.14±1.68 | 0.53±1.95 | 0.1657 | 0.3671
Depressed-Elated | -0.53±1.67 | -0.17±1.92 | 0.2±1.95 | 0.4239 | 0.5720
Lethargic-Excited | -1.06±1.87 | -0.19±1.41 | -0.03±1.68 | 0.0305 | 0.6501
Bored-Alert | -0.36±1.64 | 0.97±1.99 | 0.81±2.14 | 0.0028 | 0.7331
Figure 32. Summary of responses to the emotional state questionnaire. The bars represent the average score given by all the participants of the group and the error bars represent the standard deviation of those responses.
Figure 33. Overall scores with standard deviation bars for the AVD and circumplex model of emotion scales per individual per group, before and after the interaction sections. The overall scores were determined by calculating the average of the responses given by the participants for the items on each of the two scales.
Because this work is interested in looking at the group-level behavioral factors that influence rapport, the averages of individuals' responses to the emotional state questionnaire are presented per group in Figure 32. The information presented in Figure 32 provides insight into the level of positivity within the group. Generally, it can be observed that before and after the 1st interaction there is a variation between low and high affective states across the individuals of a group and across groups, while after the 2nd interaction there was a tendency to be at a high emotional state, except for Groups 7 and 10. To gain insight into the individual-level changes in emotional state across the study, Figure 33 shows the overall scores with standard deviation bars for the AVD and circumplex model of emotion scales per individual per group, before and after the interaction sections. The overall scores were determined by calculating the average of the responses given by the participants for the items on each of the two scales.
5.3.3. Rapport
The responses to the negative adjectives of the rapport questionnaire (i.e., boring, unsatisfying, uncomfortably paced, cold, awkward, unfocused, intense, unfriendly, dull, and slow) were first reverse scored. Then, the average of the responses to the items in the questionnaire was taken as the perceived score of rapport for the dyad under consideration. This yielded two rapport scores for each dyad inside a group. Because of the existence of the social-desirability bias [235], which is the tendency of survey/questionnaire participants to answer questions in a way that will be viewed favorably by others, the distribution of the reported dyadic rapport values was studied, and the lower 25-percentile was taken as the threshold to determine low rapport values. Social-desirability bias can be expressed by over-reporting a "good" behavior or under-reporting a "bad" behavior. In this case, the overall average value reported for dyadic rapport in the study was 3.97±0.62, with a median value of 4.03 on a 5-point Likert scale. Therefore, most of the reported values were on the higher side of the scale, which is the reason the lower 25-percentile was considered the threshold for low rapport values. Figure 34 shows the distribution of rapport scores for each of the interaction periods and for both periods together as a whole, which was used to calculate the 25-percentile.
Figure 34. Distribution of calculated dyadic rapport scores. (a) Distribution of the rapport scores corresponding to each of the two interaction periods; (b) overall distribution of calculated dyadic rapport scores across the study (first and second interactions' rapport scores) and the value of the 25-percentile, which is used as a threshold to group rapport scores into low and high values.
The 25-percentile threshold value was determined to be 3.61. Based on this threshold, it was determined that low reported rapport scores amount to 32 in the first interaction period and 19 in the second interaction period, from a total of 192 reported scores of dyadic interactions across the study, which includes two rapport values for each dyad in a group.
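The dyadic rapport scoring just described can be sketched as follows: negative adjectives are reverse scored on the 5-point scale, item responses are averaged into one score per rated dyadic interaction, and the 25-percentile of all scores becomes the low-rapport threshold. The response matrix and the abbreviated item list below are hypothetical (the full questionnaire contains 18 dyadic items).

% Minimal sketch of dyadic rapport scoring and low-rapport thresholding.
items      = {'wellCoordinated', 'boring', 'cooperative', 'harmonious', 'unsatisfying'};
isNegative = [false true false false true];          % adjectives to reverse score
R = dyadResponses;                                   % rated interactions x items, values 1..5
R(:, isNegative) = 6 - R(:, isNegative);             % reverse score negative adjectives
rapport = mean(R, 2);                                % one rapport score per rated interaction

lowThreshold = prctile(rapport, 25);                 % 3.61 over the 192 scores in this study
isLowRapport = rapport < lowThreshold;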
To characterize the level of rapport experienced by the groups, this work determined individual-experienced and dyadic-experienced rapport levels. The individual-experienced rapport levels were calculated per participant using the reported dyadic rapport values and were divided into active rapport values and passive rapport values. Active rapport values refer to the average rapport reported by an individual towards the other people in the group, whereas passive rapport values refer to the average rapport reported by the people in the group towards that individual. Similarly, liking per individual was calculated as passive and active values. This yielded four values per individual per interaction section: two describing rapport levels and two describing liking values. Figure 35 and Figure 36 show the active and passive rapport and liking scores corresponding to each individual, per group and interaction section, respectively. These active and passive values are used to compare the rapport and liking connections felt by one person with what others felt towards that person.
Figure 35. Summary of active and passive rapport scores corresponding to each individual in a group and for both of their interaction sections. Data used to construct these plots can be found at https://gitlab.msu.edu/davilasy/human-study-de-identified-data.
Figure 36. Summary of active and passive liking scores corresponding to each individual in a group and for both of their interaction sections. Data used to construct these plots can be found at https://gitlab.msu.edu/davilasy/human-study-de-identified-data.
On the other hand, the dyadic-experienced rapport levels refer to the dyadic values directly obtained from the rapport questionnaire. These values help determine how strongly the members of a dyad felt the rapport and help classify the dyadic interactions into positive dyads (both members rated the interaction high), negative dyads (both members rated the interaction low), and variant dyads (one member rated the interaction as high and the other rated it as low).
Table 28. Dyadic strength of rapport based on reported self-assessments during the first interaction, which was intended to be one where there was disagreement among members of a group.
Group | Interaction | Positive dyads | Negative dyads | Variant dyads
1 | 1 | B1C1 | A1C1, A1D1 | A1B1, B1D1, C1D1
2 | 1 | A2C2, A2D2, B2C2, C2D2 | A2B2 | B2D2
3 | 1 | A3B3, A3C3 | A3D3, A3C3, C3D3 | B3C3
4 | 1 | A4B4, A4C4 | - | B4C4
5 | 1 | B5D5 | A5B5, A5C5, A5D5 | B5C5, C5D5
6 | 1 | A6B6, B6C6 | - | A6C6
7 | 1 | A7B7, B7C7 | - | A7C7
8 | 1 | A8C8, A8D8, B8C8, C8D8 | A8B8, B8D8 | -
9 | 1 | A9B9, A9C9, B9C9 | - | -
10 | 1 | A10B10, A10D10, C10D10 | B10C10, B10D10 | A10C10
Total | | 24 | 13 | 11
Table 29. Dyadic strength of rapport based on reported self-assessments during the second interaction, which was intended to be one where there was a high level of agreement among members of a group.
Group | Interaction | Positive dyads | Negative dyads | Variant dyads
1 | 2 | A1D1, B1C1, B1D1, C1D1 | - | A1B1, A1C1
2 | 2 | A2B2, A2C2, A2D2, B2C2, B2D2, C2D2 | - | -
3 | 2 | A3C3, A3D3 | - | A3B3, B3C3, B3D3, C3D3
4 | 2 | A4B4, A4C4, B4C4 | - | -
5 | 2 | A5B5, A5D5 | B5D5 | A5C5, C5D5, B5C5
6 | 2 | A6B6, B6C6 | - | A6C6
7 | 2 | A7B7 | - | A7C7, B7C7
8 | 2 | A8B8, A8C8, A8D8, B8C8, B8D8, C8D8 | - | -
9 | 2 | A9B9, A9C9, B9C9 | - | -
10 | 2 | A10B10, A10C10, A10D10, B10C10, B10D10, C10D10 | - | -
Total | | 32 | 4 | 12
To identify the aforementioned groups of dyads, the average of the rapport values obtained from the dyads (each participant rated their interaction with the other members of the group) was calculated.
To identify these groups of dyads, the average of the two rapport values reported within each dyad (each participant rated their interaction with the other members of the group) was calculated. Likewise, the difference between the two rapport values within each dyad was also calculated. Then, the 25th percentile of the calculated average values was used as a threshold to identify positive and negative dyads, and the 75th percentile of the calculated differences was used to identify variant dyads. The 25th percentile of the dyadic average rapport values was 3.72, and the 75th percentile of the difference in rapport values was 0.833. Table 28 and Table 29 show a summary of the dyads classified as positive, negative, or variant during the first and second group interactions, respectively. In total, 56 dyadic interactions were classified as positive dyads, 17 as negative dyads, and 23 as variant dyads. Of the identified negative dyads, 13 manifested during the first group interaction and 4 during the second interaction. In general, the number of negative dyads is dominated by the first interaction, which supports that the second interaction succeeded in raising rapport levels, as it was designed to do. This analysis also shows that negative dyadic interactions did not develop in Group 4, Group 6, Group 7, and Group 9, although three of those groups contain at least one variant dyadic interaction. The following repository contains the raw data of individual dyadic scores, active and passive rapport scores, and the calculated dyadic strength: https://gitlab.msu.edu/davilasy/human-study-de-identified-data.

Table 28. Dyadic strength of rapport based on reported self-assessments during the first interaction, which was intended to be one where there was disagreement among members of a group.

Group | Interaction | Positive dyads | Negative dyads | Variant dyads
1 | 1 | B1C1 | A1C1, A1D1 | A1B1, B1D1, C1D1
2 | 1 | A2C2, A2D2, B2C2, C2D2 | A2B2 | B2D2
3 | 1 | A3B3, A3C3 | A3D3, A3C3, C3D3 | B3C3
4 | 1 | A4B4, A4C4 | - | B4C4
5 | 1 | B5D5 | A5B5, A5C5, A5D5 | B5C5, C5D5
6 | 1 | A6B6, B6C6 | - | A6C6
7 | 1 | A7B7, B7C7 | - | A7C7
8 | 1 | A8C8, A8D8, B8C8, C8D8 | A8B8, B8D8 | -
9 | 1 | A9B9, A9C9, B9C9 | - | -
10 | 1 | A10B10, A10D10, C10D10 | B10C10, B10D10 | A10C10
Total | | 24 | 13 | 11

Table 29. Dyadic strength of rapport based on reported self-assessments during the second interaction, which was intended to be one where there was a high level of agreement among members of a group.

Group | Interaction | Positive dyads | Negative dyads | Variant dyads
1 | 2 | A1D1, B1C1, B1D1, C1D1 | - | A1B1, A1C1
2 | 2 | A2B2, A2C2, A2D2, B2C2, B2D2, C2D2 | - | -
3 | 2 | A3C3, A3D3, B3C3, B3D3, C3D3 | - | A3B3
4 | 2 | A4B4, A4C4, B4C4 | - | -
5 | 2 | A5B5, A5D5, B5C5 | B5D5 | A5C5, C5D5
6 | 2 | A6B6, B6C6 | - | A6C6
7 | 2 | A7B7 | - | A7C7, B7C7
8 | 2 | A8B8, A8C8, A8D8, B8C8, B8D8, C8D8 | - | -
9 | 2 | A9B9, A9C9, B9C9 | - | -
10 | 2 | A10B10, A10C10, A10D10, B10C10, B10D10, C10D10 | - | -
Total | | 32 | 4 | 12

5.4. Data Labeling

Two major efforts were performed to annotate the data of interest. The first annotation effort consisted of using external observers to annotate the perceived rapport level between dyads and of the overall group interaction. The second annotation effort focused on annotating the head actions of the individuals involved in the interactions by using the collected videos.

5.4.1. Labeling of Rapport Values using External Observers

Two external observers (one female and one male) were recruited for this task. External observers were instructed to watch the videos of the group interactions collected during the study and, using a similar version of the rapport questionnaire given to the study participants, score the perceived rapport level of the overall group interaction and between dyads. Each external observer therefore watched a total of 20 videos, each lasting 21 minutes on average. In the rapport questionnaire used by the external observers, the item intended to capture the overall group rapport level was prefaced with the instruction to "rate the overall interaction on the following characteristics." The items were smooth, bored, cooperative, satisfied, comfortable, awkward, engrossed, involved, friendly, active, and positive. The remaining items, intended to capture the perceived dyadic rapport level, were prefaced with the instruction to "rate the interaction between subject X and subject Y on the following characteristics," where X and Y represented two of the individuals in the interaction. The items included well-coordinated, boring, cooperative, harmonious, unsatisfying, uncomfortably paced, cold, awkward, engrossing, unfocused, involving, intense, unfriendly, active, positive, dull, worthwhile, and slow. Responses were recorded on five-point Likert scales. Therefore, each external observer provided one perceived overall value of rapport and three (for groups of 3) or six (for groups of 4) perceived dyadic rapport levels per video of an interaction section.
The scores from the two annotators were combined by calculating the mean of the rated items for each group and observed dyadic interaction. Then, similar to the method employed in Section 5.3.3, the overall value of rapport and the perceived dyadic rapport levels were calculated by taking the mean of the values assigned to the items in the questionnaire. To find a threshold for grouping the overall rapport values and perceived dyadic rapport levels into high and low, the 25th percentile of each set was found. The 25th percentile of the overall rapport values was 3.52, whereas that of the perceived dyadic rapport levels was 3.36. Table 30 and Table 31 show a summary of the dyads classified as positive or negative during the first and second interactions, respectively. In total, 72 dyadic interactions were classified as positive and 24 as negative based on the perceived rapport scores. In addition, 5 of the 20 group interactions were grouped as having low overall group rapport. These values are intended to serve as objective scores of rapport within the groups. When the results from the external annotators are compared to the self-reported rapport values (shown in Table 28 and Table 29), both agree that during the first interaction 20 of the dyads are positive and 7 are negative, whereas during the second interaction both agree that 28 of the dyads are positive and 4 are negative.

5.4.2. Labeling of Head Actions

As shown in Chapter 4, because head actions contribute to rapport establishment, the head actions of a subset of groups were labeled. Two annotators were employed for this task, and each was assigned a different set of groups for labeling. Annotators were instructed to watch the recorded videos of the interactions and use an annotation template made in an Excel file to annotate the beginning and final time of a recognized movement or position. A recognized movement or position was annotated using a general label, a detailed label, and a direction label; therefore, each identified head action was assigned three labels. Head actions were labeled for all the individuals involved in the video interactions.

General labels included no movement, one-time movement, repeating motion, and other motion. Annotators were instructed to use no movement when the participant was in a steady position. One-time movement was used when the participant went from left to right, up to down, or vice versa, in a single movement. Repeating motion was assigned when a subject was head nodding, head shaking, or performing a cyclical motion. Finally, other motion was used for observed body position adjustments, chair motions, or any other inconsistent motion that could affect the position or motion of the head.

Detailed labels include steady, tilt shoulder, tilt yaw, bow, nod, shake, roll, body adjustment, inconsistent motion, and chair motion. Annotators were instructed to use steady when the participants were not showing movement. Tilt shoulder was used when the head was inclined towards one of the shoulders (ear close to shoulder), and tilt yaw when there was a one-time head movement to the left or the right. Bow was used when there was a one-time head movement up or down. Nod was used for repeating movements in the pitch axis, shake for repeating movements in the yaw axis, and roll for repeating movements in the roll axis.

Table 30. Perceived rapport scores of the 1st interaction of each group obtained from two external annotators.
Group | Interaction | Positive dyads | Negative dyads | Overall perceived group rapport level
1 | 1 | A1C1, A1D1, B1C1, B1D1, C1D1 | A1B1 | 3.41
2 | 1 | A2C2, A2D2, B2C2, B2D2, C2D2 | A2B2 | 4
3 | 1 | A3B3, A3C3, C3D3 | A3D3, B3C3, B3D3 | 2.91
4 | 1 | A4B4, A4C4, B4C4 | - | 3.91
5 | 1 | A5B5, A5C5, C5D5 | A5D5, B5C5, B5D5 | 2.55
6 | 1 | A6B6, A6C6, B6C6 | - | 3.68
7 | 1 | A7B7, A7C7, B7C7 | - | 3.86
8 | 1 | A8C8, A8D8, B8C8, B8D8, C8D8 | A8B8 | 4.36
9 | 1 | A9B9, A9C9, B9C9 | - | 3.77
10 | 1 | A10C10, A10D10, B10C10, B10D10, C10D10 | A10B10 | 4.05
Total | | 34 dyadic interactions | 14 dyadic interactions | 3 group interactions with perceived low rapport levels

Table 31. Perceived rapport scores of the 2nd interaction of each group obtained from two external annotators.

Group | Interaction | Positive dyads | Negative dyads | Overall perceived group rapport level
1 | 2 | A1C1, A1D1, B1C1, B1D1, C1D1 | A1B1 | 4
2 | 2 | A2C2, A2D2, B2C2, B2D2, C2D2 | A2B2 | 4.36
3 | 2 | A3C3, A3D3, B3C3, B3D3, C3D3 | A3B3 | 3.68
4 | 2 | A4B4, A4C4, B4C4 | - | 4.05
5 | 2 | A5B5, A5C5, C5D5 | A5D5, B5C5, B5D5 | 2.32
6 | 2 | A6B6, A6C6, B6C6 | - | 3.64
7 | 2 | A7B7, A7C7, B7C7 | - | 2.82
8 | 2 | A8C8, A8D8, B8C8, B8D8, C8D8 | A8B8 | 4.41
9 | 2 | A9B9, A9C9, B9C9 | - | 4.09
10 | 2 | A10C10, A10D10, B10C10, B10D10, C10D10 | A10B10 | 4.68
Total | | 38 dyadic interactions | 10 dyadic interactions | 2 group interactions with perceived low rapport levels

On the other hand, body adjustment was assigned to any movement originating from an individual adjusting their body position, chair motion was assigned to any head/body movement originating from moving or rotating the chair in which the participant was seated, and inconsistent motion was assigned to any set of motions that could not be clearly separated into a nod, shake, roll, etc. Figure 37 shows an example of detailed labels aligned with raw sensor data from the second interaction of Group 5.

Direction labels include left, right, up, down, front, back, and changing. These were assigned in combination with the general and detailed labels to identify the direction of the motion, especially for the one-time movements. From the seven videos that were annotated, over 8700 head actions were identified and assigned their respective labels.

Figure 37. Example of raw IMU signals from the second interaction of participants in Group 5 and the assigned 'detailed' labels.

Note that the assigned labels were not cross-validated due to a lack of human resources to contribute to this task. However, in the future, the cross-validation scheme presented in Chapter 4 for audio data could also be applied to this case.

5.5. Processing IMU Data using the Designed HAD Unit and Preliminary Establishment of Rapport Relationship

To advance the design of the machine learning framework for the group interaction monitoring system, the head-action detection (HAD) unit developed in Chapter 4 was evaluated using the natural data presented in this chapter. This model evaluation constitutes an initial effort in employing the collected dataset for the recognition of local transformed features. Data collected from Group 4 and Group 5 were selected for this analysis because they represent two different types of groups. While Group 4 appears to be a passive/collaborative group with high levels of shared rapport, Group 5 appears to be a confrontational one with variations in reported rapport levels.

5.5.1. Evaluation of HAD Unit

The HAD unit was evaluated using data from each individual in the selected groups; this includes data from 7 individuals interacting for ~20 minutes.
The output of the HAD unit was compared to the labels assigned by an annotator, as explained in Section 5.4.2. The HAD unit was evaluated by its accuracy in recognizing (1) static or dynamic movement, (2) static with position versus motion, (3) static versus three motions, and (4) all six classes for which it was trained. Therefore, the performance of the HAD unit on this natural dataset was evaluated for the recognition of 2 classes, 4 classes, and 6 classes, the last being the set for which it was ultimately trained. Per design, data from the groups were processed in a real-time fashion using a data processing frame of 3 seconds with a 50% overlap. Only data from the accelerometer and gyroscope sensors were processed, and only 7 features in total were extracted from each data frame.

5.5.2. Synchronicity of Dyadic Head Activity and Relationship to Rapport

Results from the best set of classes (2, 4, or 6 classes, as described in the previous section) were used to determine whether there exists a mathematical relationship between the synchronicity of the head activity of the dyads and the reported rapport values. The synchronicity of head activity between dyads is calculated by (1) measuring the dynamic time warping (DTW) between the signals, (2) using the obtained DTW results to correct for phase shifts and signal length on each of the head activity time series, and (3) calculating the correlation coefficient between the phase-shifted signals. DTW is an algorithm that measures the similarity between two time series and provides information about which data points from time series A match most closely with data points from time series B. DTW has been employed in research related to the recognition of human activity [236]. The final correlation coefficient values are considered to represent a degree of coordination between dyads. These values are then matched with the dyadic rapport strength values obtained from self-reported data, as explained in Section 5.3.3.

5.5.3. Results and Discussion

5.5.3.1. Validation of HAD unit

Results of the validation of the HAD unit, in terms of classification model accuracy, precision, and recall, for each of the different class sets are presented for both interactions of Group 4 (in Table 32 and Table 33) and for the second interaction of Group 5 (in Table 34). The first interaction of Group 5 was not evaluated because the labels of head activity were incomplete.

In order to evaluate the accuracy of the HAD unit results, the time stamps of the head activity annotations for Groups 4 and 5 needed to be aligned to the output of the HAD unit. The real-time classification of head activity provides an output every 1.5 seconds after the first 3 seconds of processing. On average, the duration of a labeled action was 4.1330 seconds. Therefore, the time stamps of the HAD unit were used to create a transformed set of annotations that aligns with the HAD unit output. In addition, because the labels assigned for head activity include other activities in addition to nod, shake, and roll motions, anything labeled otherwise was converted to a general motion label. The general motion labels were included in the assessment of the HAD unit's performance when evaluating the class sets that also include a general motion class, as was the case for the first two class sets evaluated: (1) static or dynamic movement and (2) static with position versus motion. For the other two class sets evaluated (i.e., (3) static versus three motions and (4) all six classes), which contained specific motion types, any instance with motions other than nod, shake, and roll was not used for model evaluation.
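The alignment and relabeling step described above can be sketched as follows. Assuming a hypothetical list of annotations with start/end times and a stream of HAD outputs produced every 1.5 s, the sketch maps each output time to the label active at that moment, collapses non-target labels into a general motion class, and scores the result. The function and variable names are illustrative and are not the dissertation's implementation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

TARGET = {"steady", "nod", "shake", "roll"}  # classes kept for the motion-specific evaluations

def label_at(t, annotations):
    """Return the annotated label covering time t (seconds), or 'general motion' otherwise."""
    for start, end, label in annotations:
        if start <= t < end:
            return label if label in TARGET else "general motion"
    return "general motion"

# Hypothetical annotations: (start_s, end_s, detailed label) taken from the Excel template.
annotations = [(0.0, 4.0, "steady"), (4.0, 7.5, "nod"), (7.5, 12.0, "body adjustment")]

# HAD outputs arrive every 1.5 s after an initial 3 s frame; predictions here are placeholders.
output_times = 3.0 + 1.5 * np.arange(6)
predicted = ["steady", "steady", "nod", "nod", "general motion", "general motion"]

reference = [label_at(t, annotations) for t in output_times]

print("accuracy:", accuracy_score(reference, predicted))
prec, rec, _, _ = precision_recall_fscore_support(
    reference, predicted, labels=sorted(set(reference) | set(predicted)), zero_division=0)
print("per-class precision:", prec, "per-class recall:", rec)
```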
Table 32. Accuracy, precision, and recall values obtained from evaluating the data from participants in Group 4, first interaction.

2 classes:
Accuracy: Host 1: 61.88%, Host 2: 79.94%, Host 3: 73.57%; average: 71.80%
Precision (Host 1 / Host 2 / Host 3): steady 37.7% / 23.8% / 19.1%; motion 71.9% / 90.8% / 78.7%
Recall (Host 1 / Host 2 / Host 3): steady 35.6% / 33.3% / 8.2%; motion 73.7% / 86.0% / 91.1%

4 classes (static with position versus motion):
Accuracy: Host 1: 60.82%, Host 2: 79.42%, Host 3: 72.38%; average: 70.87%
Precision (Host 1 / Host 2 / Host 3): tilt shoulder left - / 12.5% / -; steady neutral 39.1% / 22.9% / 7.8%; tilt shoulder right - / - / -; motion 71.9% / 90.8% / 78.7%
Recall (Host 1 / Host 2 / Host 3): tilt shoulder left - / 33.3% / -; steady neutral 33.3% / 28.6% / 11.1%; tilt shoulder right - / - / -; motion 73.7% / 86% / 91.1%

4 classes (static versus three motions):
Accuracy: Host 1: 37.34%, Host 2: 43.15%, Host 3: 46.57%; average: 42.35%
Precision (Host 1 / Host 2 / Host 3): steady 37.7% / 23.8% / 19.7%; nod 43.3% / 62.4% / 56.2%; shake 5.2% / 24.4% / 9.6%; roll 52.2% / 52.9% / 20%
Recall (Host 1 / Host 2 / Host 3): steady 44.6% / 48.3% / 11.1%; nod 43.7% / 60.9% / 80.8%; shake 30% / 52.4% / 12.8%; roll 11.2% / 7.4% / 2.1%

6 classes:
Accuracy: Host 1: 35.86%, Host 2: 42.22%, Host 3: 44.9%; average: 40.99%
Precision (Host 1 / Host 2 / Host 3): tilt shoulder left - / 12.5% / -; steady neutral 39.1% / 22.9% / 7.8%; tilt shoulder right - / - / -; nod 43.3% / 62.4% / 56.2%; shake 5.2% / 24.4% / 9.6%; roll 52.2% / 52.9% / 20%
Recall (Host 1 / Host 2 / Host 3): tilt shoulder left - / 50% / -; steady neutral 41.4% / 41.4% / 12.9%; tilt shoulder right - / - / -; nod 43.7% / 60.9% / 80.8%; shake 30% / 52.4% / 12.8%; roll 11.2% / 7.4% / 2.1%

The results shown in Table 32 - Table 34 reveal that, on average, for the detection of steady versus motion the HAD unit achieves a validation accuracy of 71.80%, 70.83%, and 56.64% for Group 4-1st interaction, Group 4-2nd interaction, and Group 5-2nd interaction, respectively. The Group 5-2nd interaction average accuracy falls to 56.64% because the model appears unable to recognize head motions from one of the participants' data. For the static with position versus motion classes, the average accuracy for Group 4-1st interaction, Group 4-2nd interaction, and Group 5-2nd interaction was 70.87%, 68.28%, and 48.71%, respectively. The last set of 4 classes tested obtained average classification accuracies of 42.35%, 41.72%, and 54.45% for Group 4-1st interaction, Group 4-2nd interaction, and Group 5-2nd interaction, respectively. Lastly, the set of 6 classes obtained an average accuracy of 40.99%, 37.57%, and 38.35% for Group 4-1st interaction, Group 4-2nd interaction, and Group 5-2nd interaction, respectively. It is expected that as more classes are added to the evaluation, the validation accuracy will drop, especially since the HAD unit was trained with data containing clearly identifiable motions, whereas the data used in this case were collected naturally.
Table 33. Accuracy, precision, and recall values obtained from evaluating the data from participants in Group 4, second interaction.

2 classes:
Accuracy: Host 1: 65.39%, Host 2: 78.93%, Host 3: 68.17%; average: 70.83%
Precision (Host 1 / Host 2 / Host 3): steady 62.3% / 19.1% / 12.5%; motion 67% / 90.1% / 70.7%
Recall (Host 1 / Host 2 / Host 3): steady 49.8% / 26.5% / 1.9%; motion 77.2% / 85.6% / 94.6%

4 classes (static with position versus motion):
Accuracy: Host 1: 59.1%, Host 2: 77.84%, Host 3: 67.90%; average: 68.28%
Precision (Host 1 / Host 2 / Host 3): tilt shoulder left - / - / -; steady neutral 4.5% / 48.5% / 14.7%; tilt shoulder right 33.3% / - / -; motion 67% / 90.1% / 70.7%
Recall (Host 1 / Host 2 / Host 3): tilt shoulder left - / - / -; steady neutral 7.7% / 23.3% / 43%; tilt shoulder right 0.5% / - / -; motion 77.2% / 85.6% / 94.6%

4 classes (static versus three motions):
Accuracy: Host 1: 47.54%, Host 2: 47.36%, Host 3: 30.25%; average: 41.72%
Precision (Host 1 / Host 2 / Host 3): steady 62.3% / 19.1% / 12.5%; nod 31.2% / 67.6% / 35.8%; shake - / 19.7% / 1.8%; roll 46.7% / 47.8% / 33.3%
Recall (Host 1 / Host 2 / Host 3): steady 66.2% / 37.3% / 2.9%; nod 40.8% / 67.6% / 79.8%; shake - / 35.1% / 10%; roll 8.2% / 10.2% / 3.2%

6 classes:
Accuracy: Host 1: 37.28%, Host 2: 45.59%, Host 3: 29.83%; average: 37.57%
Precision (Host 1 / Host 2 / Host 3): tilt shoulder left - / - / -; steady neutral 4.5% / 48.5% / 14.7%; tilt shoulder right 33.3% / - / -; nod 31.2% / 67.6% / 35.8%; shake - / 19.7% / 1.8%; roll 46.7% / 47.8% / 33.3%
Recall (Host 1 / Host 2 / Host 3): tilt shoulder left - / - / -; steady neutral 10% / 59.4% / 30.4%; tilt shoulder right 0.8% / - / -; nod 31.2% / 67.6% / 79.8%; shake - / 35.1% / 10%; roll 46.7% / 10.2% / 3.2%

Recall that the testing accuracy of the HAD unit during the design process was 97.91%. Naturally collected head activity data may contain micro-motions embedded in the specific motions of interest that act as artifacts or noise in the signals. The designed HAD unit does not account for such artifacts; thus, the accuracy drops when compared to the testing accuracies obtained during the design process. Another possible explanation for the drop in accuracy is related to the feature set used for classification. During the design process, the model was optimized to decrease computational complexity. Thus, it is possible that the selected feature set cannot generalize well enough to accommodate the characteristics of this natural dataset. Nevertheless, the results of the classification of 2 classes (steady versus motion) were used to investigate a preliminary relationship between head activity coordination and rapport strength between dyads.

5.5.3.2. Synchronicity of motion and rapport

Table 35 presents the results of the synchronization measurement (explained in Section 5.5.2) between dyadic head activity as detected by the HAD unit for just two classes (steady versus motion). For the values presented in Table 35, a correlation analysis was applied between the obtained synchronicity values and the corresponding rapport scores, resulting in a correlation coefficient of -0.2629. This indicates an inconclusive relationship between the synchronicity of motions and the reported rapport strength.
Table 34. Accuracy, precision, and recall values obtained from evaluating the data from participants in Group 5, second interaction.

2 classes:
Accuracy: Host 1: 63%, Host 2: 70.88%, Host 3: 30.44%, Host 4: 62.26%; average: 56.64%
Precision (Host 1 / Host 2 / Host 3 / Host 4): steady 76.4% / 55.7% / 81.9% / 87.2%; motion 56.1% / 80.3% / 26.2% / 37.2%
Recall (Host 1 / Host 2 / Host 3 / Host 4): steady 47.3% / 63.8% / 8.4% / 58.3%; motion 82.2% / 74.4% / 94.6% / 74.2%

4 classes (static with position versus motion):
Accuracy: Host 1: 62.16%, Host 2: 65.78%, Host 3: 25.90%, Host 4: 41.01%; average: 48.71%
Precision (Host 1 / Host 2 / Host 3 / Host 4): tilt shoulder left - / 18.2% / 16.7% / 40%; steady neutral 76% / 49.5% / 17.0% / 47.5%; tilt shoulder right - / - / 46.2% / 9.7%; motion 56.1% / 80.3% / 26.2% / 37.2%
Recall (Host 1 / Host 2 / Host 3 / Host 4): tilt shoulder left - / 19% / 8.3% / 3.3%; steady neutral 53.7% / 12% / 53.6% / 45.9%; tilt shoulder right - / 1% / 1.5% / -; motion 82.2% / 74.4% / 94.6% / 74.2%

4 classes (static versus three motions):
Accuracy: Host 1: 62.63%, Host 2: 50.55%, Host 3: 29.55%, Host 4: 75.09%; average: 54.45%
Precision (Host 1 / Host 2 / Host 3 / Host 4): steady 76.4% / 55.7% / 81.9% / 87.2%; nod 52% / 29.6% / 5.7% / 19.7%; shake - / 34.8% / 9.4% / -; roll - / - / 50% / -
Recall (Host 1 / Host 2 / Host 3 / Host 4): steady 62.6% / 92.6% / 31.7% / 87.5%; nod 68.7% / 25.6% / 26.3% / 21.2%; shake - / 24.2% / 44.4% / -; roll - / - / 4.2% / -

6 classes:
Accuracy: Host 1: 61.43%, Host 2: 40%, Host 3: 12.15%, Host 4: 39.82%; average: 38.35%
Precision (Host 1 / Host 2 / Host 3 / Host 4): tilt shoulder left - / 18.2% / 16.7% / 4%; steady neutral 76.0% / 49.5% / 17.0% / 47.5%; tilt shoulder right - / - / 46.2% / 9.7%; nod 52% / 29.6% / 5.7% / 19.7%; shake - / 34.8% / 9.4% / -; roll - / - / 50% / -
Recall (Host 1 / Host 2 / Host 3 / Host 4): tilt shoulder left - / 25% / 25% / 3.9%; steady neutral 60.6% / 79.2% / 40.9% / 81.4%; tilt shoulder right - / - / 3.8% / 2.5%; nod 68.7% / 25.6% / 26.3% / 21.2%; shake - / 24.2% / 44.4% / -; roll - / - / 4.2% / -

Further analysis of the synchronicity of specific types of motion, such as head nodding, and of its relationship to speech activity is recommended. Establishing a mathematical relationship between the measurable behavioral cues and the reported rapport values would allow for future real-time estimation of rapport values.
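The three-step synchronicity measure from Section 5.5.2 can be sketched as follows: a plain textbook DTW, the warping path used to align the two head-activity series, and a correlation on the aligned sequences. The binary steady/motion streams are hypothetical placeholders, and this sketch is not the dissertation's code.

```python
import numpy as np

def dtw_path(a, b):
    """Classic dynamic-programming DTW; returns the warping path aligning a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to (1, 1) to recover the alignment.
    path, i, j = [(n - 1, m - 1)], n, m
    while (i, j) != (1, 1):
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
        path.append((i - 1, j - 1))
    return path[::-1]

def synchronicity(a, b):
    """Warp both series onto the DTW path, then correlate the aligned sequences."""
    path = dtw_path(a, b)
    aligned_a = np.array([a[i] for i, _ in path], dtype=float)
    aligned_b = np.array([b[j] for _, j in path], dtype=float)
    return np.corrcoef(aligned_a, aligned_b)[0, 1]

# Hypothetical steady (0) / motion (1) streams for the two members of a dyad.
host_a = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 0])
host_b = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0])
print(f"synchronicity = {synchronicity(host_a, host_b):.3f}")
```

In the analysis above, one such coefficient is computed per dyad (Table 35), and the set of coefficients is then correlated against the self-reported dyadic rapport scores, which is the comparison that produced the -0.2629 coefficient reported in the text.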
Table 35. Summary of the obtained correlation coefficients of head activity (steady versus motion) detected from dyads and the corresponding dyadic strength of rapport based on reported self-assessment.

Group/Interaction | Dyad | Synchronicity | Rapport
4/1st | BC | 0.7592 | 3.5833
4/1st | AC | 0.9141 | 4.2222
4/1st | AB | 0.8958 | 4.3055
4/2nd | BC | 0.8407 | 4.25
4/2nd | AC | 0.9537 | 4.0065
4/2nd | AB | 0.8865 | 4.1666
5/2nd | AC | 0.9743 | 3.7075
5/2nd | CD | 0.9828 | 3.8333
5/2nd | CB | 0.9768 | 3.3888
5/2nd | AD | 0.9799 | 3.6111
5/2nd | AB | 0.9802 | 2.8055
5/2nd | BD | 0.9921 | 4.2042

5.6. Overall Discussion

The design of the social interaction study protocol for the collection of natural data was guided by two factors: (1) the need to evoke changes in rapport levels among members of a group and (2) the need to collect self-reported data that could be used to validate changes in rapport levels among members of a group and/or be used as data labels for machine learning algorithms. The analysis of self-reported data reflects changes in rapport levels among the individuals of a group between the first and the second interaction periods, as well as changes in individuals' emotional states. This serves as evidence of the effectiveness of the protocol in evoking changes in rapport levels and, more specifically, in demonstrating a trend in which the first interaction carries lower rapport levels than those reported during the second interaction. However, replicating this study with a larger number of groups is highly recommended to ensure the significance of the statistical analysis results. In addition, the design of a group interaction monitoring system would benefit from adding data from a larger number of groups to increase the number of sensor data streams usable for the training of machine learning models. Increasing the number of groups studied with this protocol could also open the opportunity to study how demographics influence the established levels of rapport and the overall liking within the group interaction. On the other hand, increasing the number of groups will also increase the time needed to prepare the collected data for processing.

Because this work uses a traditional supervised machine learning pipeline, data labeling becomes an essential task. This chapter established data labeling protocols to accompany the design of algorithms for group behavior monitoring systems. Labeling was focused on obtaining rapport values from external observers and labels of head actions. In addition, the data annotation protocol developed in Chapter 4 for the labeling of speech intonation can be applied to the dataset collected during the study. Moreover, the increase in collected data opens opportunities to employ unsupervised machine learning methods, such as neural networks, and to reduce the time required for annotating datasets.

Chapter 4 presented the design of the HAD unit, trained using acted/evoked data. In this chapter, the trained HAD unit was evaluated using the natural data collected in the social interaction study and employed to study the correlation between head activity and rapport scores. It was noted that (1) the performance of the HAD unit is lower than that obtained during training and testing with the acted/evoked dataset and (2) the performance of the HAD unit varies across the individuals in the interactions. It is recommended to employ a data normalization method across the collected dataset and re-train the HAD unit to increase the generality of the model. In addition, as mentioned in Chapter 4, the level of computational optimization of the HAD unit could also limit its ability to generalize when presented with unseen and noisy data. Furthermore, the HAD unit was trained with a limited number of acted/evoked head actions.
Therefore, this may serve as supportive evidence for the recommendation of using natural data to train classification models.

5.7. Summary

This chapter presents the design and execution of a social interaction study in which sensor data were collected using the sensor framework presented in Chapter 3. Video, audio, movement, and physiological data were collected, together with self-reported scores of emotional states and of rapport strength between dyads. A general analysis of changes in emotional states revealed that participants experienced significant changes in five emotions evaluated using the circumplex emotional state scale. Self-reported rapport scores were analyzed, and it was found that five out of the ten groups contain low and variant dyadic interactions. Overall, out of the 96 dyadic interactions captured by this study, 56 were positive, 17 were negative, and 23 were variant. Moreover, most of the reported negative dyadic interactions happened during the first interaction, which was intended to evoke low rapport values. In addition, labels for perceived rapport values were assigned by external individuals, as well as labels for head actions. The trained HAD unit was evaluated with this natural dataset, and the detected patterns of motion were used to calculate synchronicity between dyads. The calculated synchronicity values were correlated with the reported rapport scores between dyads; however, the results were inconclusive. This dataset serves to continue the design of a machine learning framework for the recognition of behavioral cues and to investigate relationships between recognized behavioral cues and components of rapport.

6. SUMMARY AND FUTURE WORK

6.1. Summary

This dissertation presents the design of a new human/group behavior monitoring platform to address existing challenges in the monitoring of group interactions for the improvement of social awareness and human health. The presented human/group behavior monitoring platform combines a multi-sensor system with a machine learning framework, covering everything from sensor selection to algorithm design. First, rapport is established as a social construct of interest for understanding the quality of social interactions. Fundamentals of human behavior and initial efforts on designing wearable real-time social monitoring systems are introduced. A comprehensive literature study was then conducted to define the state of the art in sensors and algorithms and to identify existing design challenges. The transdisciplinary approach taken to study the social science theory behind group behaviors and the technology to monitor nonverbal behaviors informed the design of a multi-sensor system.

A new multi-sensor system for the study of group interactions was designed and implemented. The multi-sensor system combines six sensor modalities: microphone, accelerometer, gyroscope, magnetometer, photoplethysmography (PPG), and electroencephalography (EEG); it also synchronizes the sensor data through the use of Lab Streaming Layer (LSL) and allows for recording from multiple sensor nodes. Each sensor node receives 16 data streams: one from audio, four from EEG, nine in total from the accelerometer, gyroscope, and magnetometer, and two corresponding to PPG (one filtered data stream and one pre-processed signal estimating heart rate). In addition, a machine learning framework for the training and design of real-time recognition of human and group behavior was also described.
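As a hedged illustration of how one of a node's data streams could be exposed and time-synchronized over LSL, the following sketch uses the pylsl bindings to publish a 9-channel IMU stream; the stream name, rate, and data are placeholders rather than the platform's actual configuration.

```python
import time
import numpy as np
from pylsl import StreamInfo, StreamOutlet, local_clock

# Describe one of a node's streams: here a hypothetical 9-channel IMU stream
# (3-axis accelerometer + gyroscope + magnetometer) sampled at 50 Hz.
info = StreamInfo(name="Node1_IMU", type="IMU", channel_count=9,
                  nominal_srate=50, channel_format="float32", source_id="node1-imu")
outlet = StreamOutlet(info)

# Push samples stamped with the shared LSL clock so that streams from different
# sensor nodes can later be aligned by an LSL recorder.
for _ in range(100):
    sample = np.random.randn(9).tolist()  # placeholder for real sensor readings
    outlet.push_sample(sample, local_clock())
    time.sleep(1 / 50)
```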
Of particular interest was the design of data processing units to determine Type B features, which are high-level transformed features determined from features extracted from raw sensor data (Type A features). Using the developed multi-sensor behavior monitoring system and its support components, two human studies were conducted to establish the processes by which machines can be trained to recognize nonverbal behavior indicators and to support the development of data processing units for the extraction of Type B features.

The first set of human studies was conducted with 8 participants and consisted of recording (1) audio from virtual group meetings and (2) pre-defined head actions using inertial measurement units (IMUs). A process to annotate speech intonations was established through the evaluation of labels assigned to the data collected from virtual group meetings. Using an inter-annotator agreement analysis, for the first time in the literature, two different modes of speech intonation annotation were evaluated. This analysis led to the construction of a dataset containing neutral, positive, negative, and question-labeled audio segments, which was used for the design of a real-time, user-independent speech intonation recognizer unit. To the best of our knowledge, the designed speech intonation recognizer represents the first real-time model trained using only nonverbal information and an English-speaker dataset collected from group meeting interactions that is not acted, contains speech from culturally diverse individuals, combines phrases with back-channel signals, and combines affective classes with an interrogative intonation. On the other hand, the IMU data collected from pre-defined head actions were used to design a user-independent, real-time head-action detection (HAD) unit based on a new fusion model architecture approach. Both units were designed taking a resource-aware approach to real-time processing, in which the window sizes for data processing, the types and number of features, and the complexity of the classification models were taken into consideration.

The second human study consisted of collecting audio, visual, movement, and physiological sensor data while groups of individuals interacted with each other in environments where low and high levels of rapport could be evoked. A total of 10 groups composed of 3 to 4 individuals participated in this study. This is the first study that collects IMU and physiological data from a head-mounted device in combination with audio from a personal computer and establishes the processes by which self-reported emotional state labels, self-reported and externally assigned rapport labels, and head action labels are obtained. The HAD unit was used to explore the relationships between head actions and perceived rapport levels between the dyads of a group.

The contributions of this dissertation will advance the design of human behavior monitoring systems for group interactions and facilitate real-time feedback to increase self-awareness and promote successful social interactions. This work provides an infrastructure for the design of group behavior monitoring systems through which in-person and virtual group interactions can be studied and monitored.

6.2. Contributions

This dissertation bridges the gap between social science, communication science, and engineering by establishing a novel sensing and data collection platform for real-time monitoring of group interactions.
Figure 38 presents a summary of the areas in which this dissertation has made contributions to advance the design of human/group behavior monitoring systems. The innovations of this work include the following contributions:

• Analyzed the complete body of literature in the field of wearable human behavior monitoring technologies, which provided new visions and insights to converge research in the disciplines of social psychology, communication, and engineering

To provide a clear understanding of state-of-the-art technologies for human behavior monitoring and to promote convergence research into new technologies that can overcome current challenges, this dissertation provides an extensive review of the literature associated with monitoring human behavior. This dissertation uniquely presents a new comprehensive, transdisciplinary perspective with a focus on identifying critical design considerations in real-time human behavior monitoring systems. Starting with an overview of the social psychology theories that have established the framework to study human behaviors and their manifestations during social interactions, this dissertation then establishes a taxonomy of human behavior monitoring technologies based on these psychological theories. It also provides an insightful categorization of sensors and an informative analysis of the signal characteristics, features, and computational models that have been reported in the field of human behavior monitoring. An analysis of recognition accuracies for existing computational models in the area of human behavior monitoring is also presented. Moreover, this dissertation focused on the sensor hardware and real-time signal processing technologies that have proven most effective for embedded monitoring of human behaviors, while highlighting challenges and opportunities in near-future wearable applications. The performed analysis inspired the design of the real-time human/group interaction monitoring platform.

Figure 38. Diagram summarizing the elements involved in the real-time human/group interaction monitoring platform developed in this work. Highlighted in bold are the four areas where this work made its research contributions.

This extensive review resulted in a publication with the citation: S. Dávila-Montero, J. A. Dana-Lê, G. Bente, A. T. Hall, and A. J. Mason, "Review and Challenges of Technologies for Real-Time Human Behavior Monitoring," IEEE Transactions on Biomedical Circuits and Systems, vol. 15, no. 1, pp. 2-28, Feb. 2021.

• Developed the first real-time enabled and accessible multi-sensor system for group behavior analysis and designed new real-time algorithms to recognize behavioral cues associated with group consonance using sensor data

Existing human behavior monitoring systems lack the real-time capabilities and configurations needed to monitor complex human behaviors that could lead to identifying complex group dynamics. In addition, no existing systems are accessible for other researchers to reproduce with ease. Towards the goal of overcoming existing challenges in the real-time monitoring of complex social interactions, this dissertation introduces the first real-time enabled and accessible multi-sensor framework that allows the study and real-time analysis of both in-person and virtual interactive environments.
The framework leverages existing open-source, commercially available wearable sensors, which were selected based on an analysis of the relation of sensing modalities to behavioral information and an evaluation of commercially available wearable sensors. The sensing modalities include a microphone, accelerometer, gyroscope, magnetometer, PPG, and EEG. Moreover, this dissertation presents design details on the implementation of sensor integration and the use of networking protocols to manage the 16 sensor data streams collected from each sensor node of this framework. The framework can manage at least 4 sensor nodes, allowing the study of group interactions of 3 to 4 individuals. This multi-sensor framework allows for easy reproducibility because of the benefits of off-the-shelf sensors and networking resources. This resulted in a publication with the citation: S. Dávila-Montero, S. Parsnejad, E. Ashoori, D. Goderis, and A. J. Mason, "Design of a Multi-Sensor Framework for the Real-time Monitoring of Social Interactions," IEEE International Symposium on Circuits and Systems (ISCAS), 2022.

Furthermore, this work introduces the first machine learning framework to monitor individual components of rapport and develops real-time computational blocks to identify two types of behavioral cues: head nods and speech intonations. Both achieve detection and recognition results that are comparable to existing ones in the literature. Moreover, this work presents an analysis of variations in sampling rates, optimal window lengths for real-time processing, feature selection, and classification models to reduce computational complexity during real-time processing. Therefore, the design of the aforementioned algorithms was performed using a resource-aware approach that considers the constraints of designing a human behavior monitoring system.

• Established and, for the first time, evaluated data labeling methods for the establishment of a machine learning training framework for behavior monitoring, which included the development of a new human study protocol for the collection of group behavioral information of interest that evokes variations in experienced rapport levels

Existing recognition systems for behavioral cues present limitations in real-time processing and in the use of natural data or data from the wild. Moreover, well-established protocols and machine learning training frameworks for the collection and preparation of natural data for the design of human behavior monitoring systems do not exist. This work developed the first machine learning training framework for the collection and labeling of natural human behavior data. This framework was used for the collection of audio data and the training of a speech intonation recognizer, where methods for the effective labeling of speech intonations were evaluated.

In addition, well-described protocols did not exist for the design of human studies focused on evoking low rapport during naturalistic group interactions. This work shows the methods for participant recruitment and study execution using the developed multi-sensor framework. This work established a new dataset containing audio, video, movement, and physiological data; self-reported emotional states and rapport scores; and externally assigned rapport scores and head action labels. An analysis of the self-reported rapport scores showed that five out of 10 groups developed negative and variant dyadic rapport.

6.3. Other Achievements
6.3.1. Engineering and data science

• Developed and implemented a new user-friendly labeling framework and applied it to label speech intonation in audio data

To facilitate the annotators' access to the data to be labeled and to maintain consistency in the way data was presented to the annotators, a graphical user interface (GUI) was developed. This GUI was utilized to label intonations in pre-identified audio segments. However, the GUI framework could be modified to label other types of 1-D signals, images, and video segments.

• Developed and implemented stand-alone applications for the connection and management of sensor signal collection and processing

The commercially available wearable sensors provided Application Programming Interfaces (APIs) that were used to establish sensor connections with MATLAB and LSL through designed stand-alone applications. The implemented stand-alone applications allow the real-time monitoring of collected and processed sensor signals.

6.3.2. Mentoring

This dissertation opened opportunities for undergraduate students interested in gaining experience in the areas of sensor integration, machine learning, and programming. Ten undergraduate students were mentored and assisted in working on the following topics:
• Social interaction monitoring using audio signals
• Processing of EEG signals
• Wearable interpersonal monitor for enhanced teamwork
• Design of a multi-sensor head-mounted wearable device for the monitoring of human behaviors
• Design of a visual feedback interface to increase social behavior awareness
• Data labeling
• Cross-check and annotation agreement analysis

6.4. Applications and Social Implications

The contributions and the platform established in this dissertation could impact a variety of research areas and applications at the intersection of the social sciences, communication, and engineering. The creation of the accessible multi-sensor platform allows for an increase in research collaboration and reproducibility and for advancement in the areas of human-computer interaction, affective computing, and social signal processing. Such a platform, validated and combined with ethnographic methods, has the potential to serve as a tool for the study of group interactions in diverse scenarios. By extracting a myriad of informative individual, dyadic, and group behavioral cues, new methodological standards for psychology and communication research could be established. For example, the factors influencing team performance and the subtle negative behaviors affecting social interactions could be further studied with the platform introduced in this dissertation, and the results could be used to better understand how technology could help individuals increase their situational awareness. In addition, this platform could facilitate research into the establishment of a feedback mechanism for sharing information with the individuals of a group about the factors influencing their interaction. This will promote wellness and economic growth in the future-of-work frontier, where the collaboration effectiveness of diverse knowledge-based teams will be critical to continued innovation and national financial security. Other application areas include training employees and other individuals to increase social skills, identify disruptive behaviors, and learn techniques to deal with conscious and unconscious biased behaviors.
Furthermore, areas of the healthcare industry could also benefit from technologies for the monitoring of human and group behaviors, since a variety of health conditions influence the way individuals behave. Therefore, recognizing extreme changes in behavior could help in the diagnosis of health conditions and in the identification of neurological and developmental disabilities or disorders (e.g., depression and autism, respectively).

In the case of monitoring behaviors to increase situational awareness, it is worthwhile to mention that the goal of this technology is not to control individuals' behaviors, but rather to provide information about behavioral cues that would otherwise be lost without technology. Because the technology is designed to provide information about behavioral cues, individuals are given the opportunity to change their behavior as they find appropriate. In addition, during the design of the human/group interaction monitoring platform, individuals' privacy was considered essential. Because of that, only nonverbal messages were considered for the monitoring of human behaviors, and the use of speech recognition systems was avoided when designing the real-time speech intonation recognizer.

6.5. Future Work

This research has established the foundations for wearable real-time group behavior monitoring. However, a variety of labeling tasks, labeling analyses, algorithm developments, and system tests have been left for future work and the start of new research projects. To achieve the goal of deploying a human behavior monitoring system, the following suggestions are given to continue this work:

• Preparation of sensor data from the social interaction study
Regarding the sensor data collected from the study described in Chapter 5, labels related to head activity need to be assigned for all groups. In addition, labels related to perceived positivity and rapport levels over periods of 2 to 5 minutes should be assigned to study how the evolution of perceived behaviors influences final rapport levels. Furthermore, expanding the performed human study by collecting data from at least 10 more groups is recommended. It is also recommended to modify the study protocol to include a confederate in each group interaction so that a larger number of low-rapport instances can be obtained.

• Standards for data labeling and its analysis
This work would benefit from the creation of more standards for data labeling that use people's perception of the quality of social interactions and the identification of informative nonverbal behaviors. For example, an analysis could determine how much variation in perceived rapport values exists when labeling using just audio versus audio plus video, and specific questions could be posed for the labeling of perceived rapport levels.

• Social interaction recognition: algorithms and model implementation
Regarding the pre-processing of sensor data, methods to extract attention levels and changes in emotional reactions from EEG should be implemented. In addition, for the processing of audio signals, noise removal filters and other advanced algorithms should be implemented. On the other hand, mathematical relationships between Type A features and Type B features and aspects of rapport, such as positivity and coordination, should be investigated.
A mathematical relationship between sensor data and specific aspects of rapport would allow the creation of a rapport equation that could be used in the future to provide feedback on how to improve rapport in dyadic and group interactions. To expand on this work, the following suggestions are given:

• Hardware and software optimization
It is recommended to design a single wearable device that integrates the sensing modalities selected in this work. In addition, it is recommended to design databases that will store the sensor information from the customized wearable device.

• Complete the closed loop of the monitoring system
To achieve the goal of bringing awareness to individuals during group interactions, a feedback mechanism should be put into place. Experiments to determine effective feedback modalities and the information that could improve social and self-awareness should be performed. In addition, experiments testing the designed feedback mechanism should be performed.

BIBLIOGRAPHY

[1] S. G. Rogelberg, "Why your meetings stink - and what to do about it: Strategies for engagement," Harv Bus Rev, vol. 97, no. 1, pp. 140–143, 2019.
[2] V. Rousseau, C. Aubé, and A. Savoie, "Teamwork behaviors: A review and an integration of frameworks," Small Group Res, vol. 37, no. 5, pp. 540–570, 2006.
[3] S. N. Young, "The neurobiology of human social behaviour: An important but neglected topic," Journal of Psychiatry and Neuroscience, vol. 33, no. 5, pp. 391–392, 2008.
[4] D. Umberson and J. K. Montez, "Social relationships and health: A flashpoint for health policy," J Health Soc Behav, vol. 51, no. 1_suppl, pp. S54–S66, 2010.
[5] L. M. Hernandez and D. G. Blaze, Eds., "The impact of social and cultural environment on health," in Genes, Behavior, and the Social Environment: Moving Beyond the Nature/Nurture Debate, no. 2, National Academies Press (US), 2006.
[6] N. Dasgupta, "Implicit ingroup favoritism, outgroup favoritism, and their behavioral manifestations," Soc Justice Res, vol. 17, no. 2, pp. 143–169, 2004.
[7] R. Wheeler, "We all do it: Unconscious behavior, bias, and diversity," Law Libr J, vol. 107, no. 2, pp. 15–36, 2015.
[8] N. Alduncin, L. C. Huffman, H. M. Feldman, and I. M. Loe, "Executive function is associated with social competence in preschool-aged children born preterm or full term," Early Hum Dev, vol. 90, no. 6, pp. 299–306, 2014.
[9] P. M. Merikle, D. Smilek, and J. D. Eastwood, "Perception without awareness: perspectives from cognitive psychology." [Online]. Available: www.elsevier.com/locate/cognit.
[10] T. Pyszczynski, J. Greenberg, and S. Solomon, "Proximal and distal defense: A new perspective on unconscious motivation," Curr Dir Psychol Sci, vol. 9, no. 5, pp. 156–160, 2000.
[11] T. D. Wilson and N. Brekke, "Mental contamination and mental correction: Unwanted influences on judgments and evaluations," Psychol Bull, vol. 116, no. 1, pp. 117–142, 1994.
[12] Y. Chuo et al., "Mechanically flexible wireless multisensor platform for human physical activity and vitals monitoring," IEEE Trans Biomed Circuits Syst, vol. 4, no. 5, pp. 281–294, 2010.
[13] J. Y. Kim, C. H. Chu, and M. S. Kang, "IoT-based unobtrusive sensing for sleep quality monitoring and assessment," IEEE Sens J, vol. 21, no. 3, pp. 3799–3809, 2021.
[14] C. Setz, B. Arnrich, J. Schumm, R. la Marca, G. Troster, and U. Ehlert, "Discriminating stress from cognitive load using a wearable EDA device," IEEE Trans. Information Technology in Biomedicine, vol. 14, no. 2, pp. 410–417, 2010.
[15] N. A. Selamat and S. H. M.
APPENDIX A: TOPIC QUESTIONNAIRE

This appendix contains the instructions and the list of statements used in the Topic questionnaire.

Instructions: You are given a series of topic statements; using the provided scale, indicate how much you agree or disagree with each statement. If you find that a topic statement might be uncomfortable or offensive to discuss with someone having a different opinion, you have the opportunity to not respond to how much you agree or disagree with the statement. If you do not respond, you are opting out of discussing that topic statement. Please note that all personal information will be kept completely confidential and none of the responses you provide will be connected to your name or email address.

Disclaimer: These topic statements do not represent the official policy or position of Michigan State University, the College of Engineering, the Electrical and Computer Engineering department, or the Study Team members.

(Topic) Statements:
• (Gun control) The government should regulate firearms through stricter gun control laws, including more extensive background checks and regulations on assault weapons.
• (Gun control) The "right of the people to keep and bear arms" means that the government cannot regulate firearms in any way.
• (Vegetarianism) Meat is a normal part of my diet and important for a healthy life.
• (Vegetarianism) Vegetarianism is more sustainable for food production and reduces cruelty to animals.
• (Animal testing) Animal testing is unethical.
• (Animal testing) Although animals may feel pain or die as a result of it, animal testing is necessary in order to save human lives.
• (Universal healthcare) Access to affordable, quality healthcare should be a fundamental service provided by the government.
• (Universal healthcare) Tax money should not be used to provide healthcare for everyone; people should be responsible for themselves.
• (Death penalty) No matter the crime, the death penalty should never be applied because killing is wrong.
• (Death penalty) The death penalty should be used to deter heinous crimes.
• (Religious freedom) My personal religion is the one true religion.
• (Religious freedom) All people should feel free to practice any faith or to have no faith without fear of peer or government coercion.
• (Vaccines) Some vaccines save lives and should be mandatory to protect the population.
• (Vaccines) Individuals should have the right to choose whether or not to be vaccinated.
• (Animal hunting) Animal hunting is a fun sport and part of the American culture.
• (Animal hunting) Animal hunting constitutes animal abuse and should be prohibited.
• (Professional sports) Professional sports are a great source of entertainment.
• (Professional sports) People should not be paid to play sports.
• (College athletes) College athletes should not be allowed to receive payment from sponsors because it ruins the purity of the game.
• (College athletes) College athletes work hard and generate income for the university and should be compensated for their efforts.
• (Exercise) I consider exercising part of my daily routine.
• (Exercise) I rarely even think about exercising.
• (TV shows) I love a good TV show or movie when I have time.
• (TV shows) I consider watching fictional/reality TV a waste of time.
• (Travel) I love to travel, experience new cultures, and meet new people.
• (Travel) Traveling is overrated. I prefer to stay near home.
• (Video games) I enjoy playing video games.
• (Video games) I consider video games a waste of time.
• (Food) I like trying new foods and going to different restaurants.
• (Food) I prefer to eat food that I know I like.
• (Outdoor activities) I enjoy outdoor adventures and activities.
• (Outdoor activities) Outdoor adventures and activities are not worth the effort; my house is all the nature I need.
• (Social interactions) Virtual interactions are enough for me to fulfill my social needs.
• (Social interactions) I need in-person interactions to fulfill my social needs.
• (Environment) It is the duty of all humans to protect the environment and minimize our carbon footprint.
• (Environment) Human consumption does not affect the environment negatively.

Participants provided their opinion using an 11-point Likert scale, shown in Figure 39.

Figure 39. 11-point Likert scale used to collect the opinions of the participants about the given topics.

APPENDIX B: EMOTIONAL STATE QUESTIONNAIRE

This appendix contains the instructions and the items used in the Emotional State questionnaire.

Instructions: Rate how you are feeling at this moment using the following scales. Please note that all personal information will be kept completely confidential and none of the responses you provide will be connected to your name or email address.

Items: Rate how you are feeling in terms of arousal, valence, and dominance. Participants provided their responses using the scales in Figure 40.

Figure 40. 9-point Self-Assessment Manikin scale for arousal, valence, and dominance.

Now, please rate how you are feeling at this moment using the following scale. The scale is shown in Figure 41.

Figure 41. 11-point rating tool based on the circumplex model of emotion.

APPENDIX C: RAPPORT QUESTIONNAIRE

This appendix contains the instructions and the items used in the Rapport questionnaire.

Instructions: Please note that all personal information will be kept completely confidential and none of the responses you provide will be connected to your name or email address.

Items: First, participants selected a reference letter that identified them, as shown in Figure 42.

Figure 42. Selection of reference letter by the participant.

Then, participants rated their own perceived level of engagement during the interaction using the items and scale shown in Figure 43, rated how much they enjoyed the interaction using the scale in Figure 44, and rated their level of liking of the other people using the scale in Figure 45.

Figure 43. Items and scale used to rate one's own interaction performance during the discussion session.

Figure 44. Scale to measure the overall feeling of enjoyment during the interaction.

Figure 45. Liking scale.

Lastly, participants rated their interaction with the other members of the group using the scale in Figure 46, where X represents the reference letter of one of the other two (for groups of 3) or three (for groups of 4) participants in the interaction. For example, if a participant's reference letter was A, then the X in the scale in Figure 46 was a B, C, or D. The same scale appeared three times for groups of four and two times for groups of three.

Figure 46. Items and scale used to determine a value of rapport between dyads.

APPENDIX D: AVAILABLE RESOURCES GENERATED BY THIS WORK

Repository #1:
URL: https://gitlab.msu.edu/davilasy/sensor-connection-atlas
Description: Repository #1 contains code to connect sensors (Shimmer and BrainBit) to computers and synchronize their signals using LSL.
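For illustration only, the sketch below shows one minimal way two LSL streams (e.g., a Shimmer skin-conductance stream and a BrainBit EEG stream) could be resolved and their timestamps mapped onto a common local clock, assuming the pylsl Python bindings. The stream names are hypothetical, and this is not the code in Repository #1, which should be consulted for the sensor-specific connection steps.

```python
# Minimal sketch (assumptions: pylsl is installed, and two LSL outlets named
# "Shimmer_GSR" and "BrainBit_EEG" are already streaming on the local network).
# This is NOT the Repository #1 code; it only illustrates the general LSL
# pattern of resolving streams and aligning their timestamps.
from pylsl import StreamInlet, resolve_byprop

def open_inlet(name):
    streams = resolve_byprop('name', name, timeout=5.0)  # find outlet by name
    if not streams:
        raise RuntimeError(f'LSL stream "{name}" not found')
    return StreamInlet(streams[0])

inlets = {name: open_inlet(name) for name in ('Shimmer_GSR', 'BrainBit_EEG')}

for _ in range(100):                      # pull a short burst from each sensor
    for name, inlet in inlets.items():
        sample, ts = inlet.pull_sample(timeout=1.0)
        if sample is not None:
            # time_correction() maps the outlet clock onto the local LSL clock,
            # so timestamps from both sensors become directly comparable.
            ts_local = ts + inlet.time_correction()
            print(name, round(ts_local, 4), sample[:4])
```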
Repository #2:
URL: https://gitlab.msu.edu/davilasy/audio-data-labeling-tool
Description: Repository #2 contains the code used to create the audio data annotation tool for labeling speech intonations. It also includes the executables needed to run the tool as a stand-alone application. The code and executables were developed in MATLAB 2019b using App Designer.

Repository #3:
URL: https://gitlab.msu.edu/davilasy/human-study-de-identified-data
Description: Repository #3 contains the de-identified questionnaire data used to generate Figure 35 and Figure 36.
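As a usage illustration only, the sketch below loads and summarizes de-identified questionnaire responses such as those in Repository #3, under the assumption that they are exported as a CSV file with one row per participant and one column per 11-point Likert item; the actual file names and layout in the repository may differ.

```python
# Hypothetical example of summarizing de-identified questionnaire responses.
# Assumptions (not guaranteed by Repository #3): a file "topic_responses.csv"
# with a "participant" column plus one column per topic statement, each holding
# an 11-point Likert rating.
import pandas as pd

df = pd.read_csv('topic_responses.csv')          # hypothetical file name
likert_cols = [c for c in df.columns if c != 'participant']

# Per-statement mean, spread, and response count; high spread flags the
# statements on which participants disagreed most.
summary = df[likert_cols].agg(['mean', 'std', 'count']).T
print(summary.sort_values('std', ascending=False).head())
```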