ADVANCES IN MACHINE LEARNING AND INTEGRATED CIRCUITS FOR SMART ASSISTIVE TECHNOLOGIES By Ehsan Ashoori A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical and Computer Engineering - Doctor of Philosophy 2024 ABSTRACT Assistive technologies have emerged as powerful tools for assessing physical health and wellness through monitoring physiological parameters such as movement and heart rate. However, our overall health is influenced not only by physiological parameters but also by mental health factors and environmental influences. Therefore, in the pursuit of holistic wellness, assistive technologies need to support multimodal sensing to monitor various aspects of individuals' health, including physiological health, mental wellness, and environmental parameters that influence personal health and wellness. The challenges arise when these technologies must be implemented in real-time and in miniaturized point-of-care platforms where multi-modal sensing algorithms must run efficiently, and resources, including power, are limited. Solving these challenges requires converging engineering practices with psychological and physiological principles. This work aims to implement resource-efficient algorithms to assess social interaction parameters as an important mental health factor and to enable high- performance point-of-care devices to monitor physiological and environmental parameters in a miniaturized and effective manner. In this work, an extensive dataset for human interaction in virtual settings was prepared. Efficient algorithms were developed to identify levels of two highly important social interaction parameters, ‘affect’ and ‘rapport.’ We analyzed affect in time intervals based on the conversation turns and analyzed rapport in 30-second time intervals, which is the highest temporal resolution reported in the literature. We achieved an affect prediction accuracy of 76.8% and a rapport prediction accuracy of 73.6%, which are the highest reported results for analyzing multi-person groups. Furthermore, to support monitoring physiological and environmental parameters, electrochemical solutions were identified as a highly effective method. We introduced new architecture to overcome limited supply potentials in modern point-of-care devices. In our novel design, the potential window for electrochemical reactions doubles compared to the traditional designs. This, in return, facilitates a significantly wider range of target elements that can be monitored with this novel architecture. Overall, the enhanced algorithms and architecture introduced in this work enable multimodal sensing of important personal health and wellness parameters. To researchers dedicated to making the world a better place iv ACKNOWLEDGEMENTS I am sincerely grateful to my advisor, Professor Andrew Mason, for his unwavering support across various technical and personal aspects. Under his guidance in the lab, I gained invaluable knowledge that I aspire to pass on to future generations. My heartfelt thanks extend to my committee members, Prof. Wen Li, Prof. Vaibhav Srivastava, Prof. Angela Hall, and my former committee members, Prof. Chunqi Qian and Prof. Chuan Wang, for their support and encouragement. Acknowledgment is also due to the assistant dean at the College of Engineering, Dr. Katy Colbry for her kind support and guidance. 
I also wish to thank my colleagues at HATlab, especially Sina Parsnejad, Heyu Yin, Sylmarie Dávila-Montero, Derek Goderis, Anna Inohara, and Arsh Ahtsham, for their collaboration and friendship. I wish to express deep appreciation to my dear parents and brother, whose steadfast support has been a constant in all facets of life. Special gratitude goes to my beloved wife, Zahra, whose empowerment and encouragement have been a source of strength through every challenge. v TABLE OF CONTENTS Chapter 1: Introduction .................................................................................................................. 1 Chapter 2: Background and literature review ................................................................................ 7 Chapter 3: Methods and tools for analyzing social interactions .................................................. 19 Chapter 4: Developing platforms for monitoring affect and rapport........................................... 34 Chapter 5: Advancing integrated electrochemical instruments for point-of-care devices .......... 62 Chapter 6: Conclusions and future works ..................................................................................... 82 BIBLIOGRAPHY .............................................................................................................................. 86 vi Chapter 1: Introduction 1.1. Applications and Significance of Assistive Technologies in Individuals’ Wellness Assistive technologies have emerged and gained popularity as powerful tools for tracking the physical health of individuals. However, our overall wellness is affected by factors such as emotional state and environmental factors that influence individuals’ health. Therefore, in the pursuit of rounded and comprehensive wellness, it is paramount to develop assistive technologies for monitoring social, physiological, and environmental parameters to promote individuals' wellness. 1.1.1. Social interactions and individuals’ wellness Social connections and relationships are vital components of overall well-being, influencing mental health, emotional resilience, and a sense of belonging. Monitoring social parameters involves assessing the quality of interpersonal relationships, support networks, and community engagement. By tracking indicators such as social connections, loneliness levels, and participation in social activities, individuals and healthcare providers can identify areas for improvement and intervention. Therefore, developing technologies to monitor social interactions can provide insights into individuals' social behaviors, communication patterns, and social support systems, aiding in the identification of potential risks or opportunities for enhancing social wellness. These tools can be utilized in a wide range of settings, from helping healthcare professionals to improving the quality of interactions in workplaces. This can, in return, improve health services, and subsides anxiety, biases, and inequity for example in work places. 1 1.1.2. Role of physiological and environmental parameters on individuals’ wellness Physiological health is intricately linked to overall wellness, encompassing factors such as physical fitness, nutrition, sleep quality, and stress levels. Monitoring physiological parameters involves tracking key metrics such as heart rate, blood pressure, body composition, and biochemical markers. 
Assistive technologies that regularly assess these parameters could help individuals gain insights into their health status, identify potential health risks, and make informed lifestyle choices to optimize wellness. Likewise, environmental factors play a significant role in shaping individual wellness, influencing physical health, mental well-being, and overall quality of life. Monitoring environmental parameters involves assessing factors such as air quality, noise levels, and temperature. By understanding the impact of the environment on wellness, individuals and communities can take steps to create healthier living environments and mitigate potential health hazards. Developing assistive technologies for monitoring physiological and environmental parameters could improve individuals' wellness and quality of life. The emergence of wearable devices, smartphone apps, and point-of-care technologies that, for instance, monitor heart rate, air quality, or noise level, is enabling real-time tracking and analysis of health data. These assistive technologies empower individuals to take proactive control of their health, facilitating early detection of health issues and timely interventions to prevent or manage chronic conditions. Moreover, by leveraging these quantitative data, individuals can make informed decisions to optimize their living environments, reduce exposure to pollutants, and promote overall wellness. 2 1.2. Engineering challenges with assistive technologies Assistive technologies have advanced in many areas. However, despite these advancements, several challenges and areas for further research remain.  Multimodality and interoperability: many assistive technologies operate in isolation, lacking interoperability with other devices or platforms. Multifaceted assistive technologies and the integration of different data sources are needed. Therefore, multimodal assistive technologies that address different aspects of individual’s health such as psychological and physiological are highly desirable. This allows for a holistic monitoring of individuals’ wellness.  Resource-efficient implementation: many advanced algorithms and devices in assistive technologies involve resource-heavy operations that prevent real-time execution. This also limits the range of target applications that a point-of-care device can support. Developing resource-efficient algorithms and devices remains an important area of research.  Temporal resolution: many assistive technologies have been introduced, for instance, to analyze individuals' overall interaction and social engagement. However, the temporal resolution of analysis is often low in these assessments and they lack real-time analysis. Therefore, higher temporal resolution assessment solutions are necessary, especially for analyzing the dynamic of interactions over the course of an event. These fine-resolution analyses are paramount to devising individual plans for improving social interactions. 3  Versatility of the solutions: translating from standard laboratory solutions to point- of-care assistive technologies often faces limitations such as reduced functionality. Wearable point-of-care devices must be implemented with a small form factor and low power consumption. These constraints often result in a limited range of operations, excluding many target parameters. Therefore, further research is needed to improve the performance of point-of-care devices within the limitations of wearable devices. 
 Personalization and Accessibility: assistive technologies often adopt a one-size-fits- all approach, overlooking the diverse needs and preferences of users. Research is needed to develop personalized and customizable solutions that adapt to individual abilities, preferences, and contexts, thus enhancing user engagement and effectiveness. 1.3. Goals Collaboration between researchers from different disciplines is essential for tackling these challenges and for the successful development and deployment of assistive technologies. The overall objective and vision of this work is to identify microsystems and algorithms to overcome the challenges specified in section 1.2. Discovering important parameters in social interactions and the avenues that technology can help is essential. This requires an understanding of individuals’ psychology. Moreover, identifying the potential technologies for addressing these parameters requires a deep understanding of engineering solutions. In this work, we bring expertise in machine learning and the extensive experience our lab has in developing efficient microsystems to tackle different aspects of challenges in assistive 4 technologies. We aim to take a holistic approach to improving individuals’ wellness. Specifically, the following are the focus of this work. 1.3.1. Developing technologies that enable monitoring of important social interaction parameters with high temporal resolution. There is limited literature on analyzing social interaction parameters in groups, especially with high temporal resolution. In this work, the goal is to develop resource-efficient algorithms to monitor affect and rapport. Engaging user interfaces that accommodate personal preferences and accessibility issues is of important consideration. 1.3.2. Applying microsystems techniques to bring laboratory utilities to assistive technologies. Utilizing our lab’s extensive experience in developing wearable technologies for point-of- care applications, this work focuses on developing assistive technologies for enhancing individuals’ wellness by monitoring physiological and environmental parameters. More specifically, electrochemical solutions for detecting various physiological and environmental parameters that influence individuals’ wellness are explored. Given the limited resources available for wearable devices, electrochemical solutions in these devices face serious limitations in the range of parameters that can be detected. Widening this range and targeting more diverse elements is the aim of this work. The goal is to make CMOS potentiostat overcome the limitations of analyzing a wide range of elements. 1.4. Outline The following forms the content of this dissertation. Literature on employing assistive technologies for improved human interaction as well as monitoring of physiological and 5 environmental parameters is reviewed in Chapter 2. Chapter 3 presents the early work we did and the avenues we explored toward having a platform for improved virtual interactions. Chapter 4 describes the data collection and preparation for human trials along with the algorithms we developed for extracting social cues in virtual meetings. Chapter 5 presents the methods we employed for enhancing the efficacy of point-of-care electrochemical devices that are resource- efficient. Chapter 6 summarizes this dissertation and outlines potential paths for future works. 
6 Chapter 2: Background and literature review Literature has reported point-of-care devices and assistive technologies for improving different aspects of individuals’ wellness. This includes technologies for monitoring psychological wellness such as individuals’ emotions [1], [2], [3] as well as physiological parameters such as heart rate [4]. Some others focus on analyzing physical parameters such as skin conductivity using electrodermal activity sensors that can indicate stress levels and various health-related issues [3], [5]. Some others develop point-of-care devices to analyze human secretions, such as sweat [6], and monitor environmental parameters, such as particulate matter [7], [8], that can affect health. In this chapter, we explore the literature and identify the challenges and areas that require further research. 2.1. Employing technology for improved interaction Among the solutions for monitoring individuals’ emotional wellness, a body of work has recently gained attention that focuses on the interaction among people on different occasions, such as in a classroom, in a work meeting, in a clinical set, etc. [5], [9], [10]. After a recent shift in the trend that incorporated more and more virtual interactions, the need to improve online interactions has become more important than ever. That is especially because of the different nature of online interactions compared to in-person setups. To have a productive meeting at work or to have an effective learning experience in a classroom, we benefit from recognizing non-verbal audial or visual cues in our audience. These cues help find out about the emotional states of people and the level of their engagement and, consequently, help establish more effective communication with our audience. For instance, a study [11] showed that social intelligence had a significant effect on the professional 7 performance of mathematics teachers. Thus, it is desirable to leverage technologies to help people perceive these cues in their audience. The inability to detect important cues in an interaction is more pronounced in a virtual setting. Distance collaboration has become a common practice in recent years. Many companies and universities have opted to facilitate remote working and education. Even some companies went on to announce they would let their employees work remotely for the indefinite future. This trend shows distance collaboration will stay and only flourish in the coming years. Thanks to video conferencing technologies, we can now hold these virtual events that were not possible not too long ago. However, many elements of in-person interactions are missing in a virtual environment. For instance, lack of eye contact, noting body gestures, and other cues that are more easily assessable in an in-person meeting are missing in a virtual setup. This leads to less effective communication. Therefore, utilizing technologies to help people communicate better and have more effective interactions in this type of environment is highly desirable. Recently, an increasing interest has been seen in the literature for developing technologies that are capable of detecting the emotional state of people [12]. Many of these methods rely on deep neural network implementation [13] which often is computationally heavy and not generally applicable to real-time implementation with limited computational resources. 
On the other hand, some of the reported works in literature employ machine learning algorithms that require less computation but often require hand-crafting features, which adds to the complexity of the problem [14]. Furthermore, these reported works are generally bound to the controlled lab environment [15], where the designed experiments induce desired emotions in the participants. These experiments, therefore, have a higher signal-to-noise ratio than a normal 8 interaction in a natural setup. Consequently, the developed algorithms might perform less effectively in a more natural setup. Moreover, some platforms were developed for detecting nonverbal cues from recorded videos [16] in a noncontrolled environment, but often focused on detecting very intense emotions, which are very different than the baseline emotion and hence easier to identify. An example of using automated solutions to improve interactions is utilizing technologies to reduce unintended negative communications among participants. For instance, most unfavorable interactions in a workplace are being done unconsciously [17]. Many individuals may have unconscious bias against different groups of people. A common method that traditionally has been employed to overcome these challenges is through human experts. In this method, an expert analyzes the behavior of participants and provides constructive feedback to achieve higher-quality interactions. However, this method does not work well in real time. This means that this method is appropriate for an overall assessment of an interaction after it is over. Furthermore, since a human is involved in this type of assessment, some people may be uncomfortable with it and raise some privacy issues. Therefore, employing technology to enhance awareness of individuals helps to have more positive interactions. The literature has investigated these technologies for various setups. The following are examples of literature that use technologies to improve virtual interactions in the most popular setups. The target applications were mostly for online settings such as online classrooms and online work meetings.  Online learning environment 9 In a study [2] on virtual learning setups, researchers demonstrated that tutors who were provided with the emotional state of the learners in a virtual classroom used more affective elements in their report and wrote more formative and less summative feedback.  Online work meeting In another study [14], researchers developed a platform that processed audio and video data after a video conference session and extracted affective features such as smile and attention, as well as speech overlap and turn-taking. By providing feedback after finishing a session, participants demonstrated statistically significant improvements in balanced participation. 2.2. Types of cues extracted by algorithms 2.2.1. Emotional state Emotions can be perceived as residing on two distinct dimensions: one concerning the degree of pleasure associated with the emotion and the other regarding the level of arousal or activation it entails [18]. Recently, literature has shown increasing interest in developing technologies capable of detecting people's emotional states [15], [19]. Emotions mirror responses of the sympathetic nervous system [20]. The Polyvagal Theory elucidates how emotional states influence both brain processes and bodily functions [21]. 
Moreover, this theory sheds light on the interplay between measurable physiological states tied to the autonomic and central nervous systems and resultant human behavior, proposing a mutual relationship between mind and body. It further suggests that environmental factors influence behaviors that subsequently impact physiological states. Thus, monitoring changes in 10 bodily physiological markers like respiration rate, heart rate, and perspiration rate can offer valuable insights into an individual's emotional state [22]. 2.2.2. Engagement intensity Social Cognitive Theory (SCT) asserts that individuals' interpretations of their surroundings can shape their emotional, physiological, and behavioral responses [23], thus impacting subsequent behaviors in a reciprocal manner. We define engagement by level of interest and cohesion shown by participants and their communication dynamics. Multiple platforms were developed for detecting nonverbal cues from recorded videos [24], for finding the engagement intensity of people [9], [10], [25], [26]. Other cues such as head motion synchronization and empathy in face-to-face communications have been studied [27]. 2.2.3. Rapport building Rapport is defined as a friendly and harmonious relationship, especially, “a relationship characterized by agreement, mutual understanding, or empathy that makes communication possible or easy” [28]. Recent literature has explored monitoring rapport building between dyadic pairs. Studies have been done on both human-to-human and human-to-virtual agent interactions [29], [30], [31]. Studies in the literature focusing on analyzing rapport utilize various modalities such as audio [18], natural language [32] and video [33]. Machine learning approaches have been utilized in the literature [29], [33], [34] for analyzing rapport in various communication contexts. These algorithms focus on discerning the emotional valence of communication and identifying instances of agreement, disagreement, or conflict. Machine learning models can leverage audio and visual cues, such as tone of voice, intonation [18] and facial expressions [29], to infer underlying sentiments and attitudes. 11 2.3. Offline vs real-time feedback A rich body of literature [9], [14], [15], [25], [26], [27], [35], [36], some of which were presented in the previous sections, has focused on detecting nonverbal cues in human interactions, though only provided off-line feedback to participants about their behavior, once the session is over. However, some other studies [37], [38] developed platforms for providing real-time feedback to the users using innovative visual representation, though limited to text- based communication in chatrooms. They analyzed the communication patterns as well as group dynamics using their platform. They also analyzed whether the feedback made any distractions for users. The challenge of providing real-time analysis and feedback using technologies is that the required algorithms are extremely computationally demanding. Therefore, it makes it very difficult, if not impossible, to utilize common algorithms and methods for the real-time analysis of events. Extensive computational load manifests itself differently whether we are dealing with an in-person or a virtual setup. Since in an in-person/on-the-go situation, we would typically have limited hardware resources, we would like to increase both the computational and hardware efficiency to make the solutions viable on wearable devices. 
In virtual setups, however, access to powerful hardware (thorough computers for example) is not typically an issue, but we still need to increase the computational efficiency to speed up and facilitate the real-time processing of algorithms. 2.4. Sensor modalities for social cue extraction Reported works in the literature use multiple modalities such as audio (tone, pitch, etc.), natural language, video, etc. [39], [40]. In [39], researchers used deep neural network to analyze 12 audiovisual data for affect recognition. [40] also uses deep neural networks to analyze the speech. Other researchers used visual data to analyze the engagement intensity of people in different occasions such as a classrooms [9], [10], [25], [26]. They use facial expressions and physiological sensor data such as heart rate and employ various machine learning algorithms to identify students' engagement levels. Other cues, such as head motion synchronization and empathy in face-to-face communication, have been studied using the accelerometer in a lab environment [27], and it was shown that the level of empathy is mirrored in the frequency and phase of head motion synchronization. 2.5. Developing technologies to improve interactions in virtual meetings We aim to develop efficient technologies to assist people in having more productive and positive meetings in the workplace, for example. We are also interested in improving the quality of virtual interactions in an online setting. To this end, we focus on developing algorithms to detect important cues from individuals, analyze them in tandem with the cues from other people, and feed the processed data back to the participants. The feedback to the participants can be placed in offline or online modes. Although providing real-time feedback leads to having the most effective solution to increase the quality of interactions, offline assessment and feedback to the participants could also enhance awareness and improve the interactions. Another important aspect is the type and frequency of feedback data to the participants. We are interested in exploring different avenues for providing this information to the participants from both psychological and technical points of view. 13 To overcome these challenges, this work aims to develop methodologies for extracting social cues from participants in an online meeting in a natural setup without any artificial constraints. We aim to do this with the highest computational efficiency to enable future implementation of real-time analysis. 2.6. Effect of physiological and environmental parameters on individuals’ wellness The Polyvagal Theory highlights how emotional states affect both brain functions and bodily processes [21]. Additionally, this theory illuminates the dynamic interaction between measurable physiological states linked to the autonomic and central nervous systems and resulting human behaviors, proposing a bidirectional relationship between the mind and body. It also suggests that the environment influences behaviors that subsequently impact physiological states. Therefore, tracking changes in bodily indicators such as heart rate can yield valuable insights into an individual's emotional state. Similarly, monitoring environmental conditions can provide information on how the surroundings influence emotional states and other factors that directly affect the wellness of individuals. 
Therefore, a rich body of literature has studied these effects on the overall health and wellness of individuals and assistive technologies that have been developed for assessing these important parameters [41], [42]. 2.7. Sensor modalities for monitoring physiological and environmental parameters Among the sensors modalities, optical methods for measuring heart rate of individuals are widely popular [43], [44]. Among different methods, pulse oximetry is popular for measuring heart rate and hemoglobin oxygen saturation in a noninvasive manner. It also can be used for determining respiratory rates. Some recent literature [45] has suggested using camera feed for monitoring heart rate in an online session purely based on visual data. 14 Respiratory rates are also an indicator of different emotional states [46]. A sensor modality used for respiratory rate estimation is an inertial measurement unit (IMU) [47]. IMU detects chest movements and estimates the breathing rate based on the physical movements. Given how indicative of an emotional state a breathing rate can be, this method provides a noninvasive approach to the detection of the respiratory rate. Another category of sensors is electrodermal activity (EDA) and galvanic skin response (GSR) sensors. These sensors have been reported in literature [48], [49], [50] to be used for monitoring emotional state of individuals. In other works, skin temperature [51] has also been investigated as an indicator of individuals’ emotions. Furthermore, other approaches, such as utilizing electroencephalography (EEG) for monitoring the electrical activity of the brain, have been explored in the literature [52]. These brain waves can be indicative of the states an individual is in and, therefore, a valuable insight into human overall emotions. Eye trackers use infrared and visual spectrum to monitor pupil diameter, gaze distance and coordinates and eye blinking. These parameters have been shown to be indicative of individuals’ emotions as well as the level of engagement among people. Therefore, eye tracking is a viable approach for monitoring individuals’ wellness [53]. Another popular category of sensor modalities in assistive technologies is electrochemical sensors. These sensors have a wide range of applications for monitoring physiological as well as environmental parameters. For instance, researchers in [54], [55] utilized electrochemical sensing to detect cortisol levels in sweat as an indication of stress level. Others utilized chemical sensing to analyze biosamples for detecting cancer precursors such as zinc ions [56]. Electrochemical methods also provide valuable insight about environmental parameters such as 15 air pollution and particulate matters [57]. Therefore, they facilitate a holistic approach for monitoring individuals wellness and how it is affected by various physiological and environmental parameters. 2.8. Electrochemical solutions for point-of-care devices Since electrochemical sensors provide a rounded understanding of both physiological and environmental parameters and enable studying their effects on individuals’ wellness, they provide a unique opportunity for integrating different aspects of health monitoring. Therefore, we examined this type of sensors in more detail. Electrochemical measurements find extensive utility across scientific, technological, and everyday contexts, influencing various aspects of people's lives. 
They serve multiple purposes, such as assessing food quality within supply chains [58], [59], evaluating human health through analysis of bodily secretions like salivary biomarkers [55], [60], identifying cancer precursors [56], monitoring air quality for toxic gases [61], as well as detecting heavy metals [62]. These applications empower individuals to make informed lifestyle decisions, thereby enhancing their overall well-being. For optimal utilization of electrochemical methods in diverse practical scenarios, it is crucial to employ them in compact, power-efficient, cost-effective, and preferably wearable devices. However, realizing these capabilities necessitates the development of miniaturized and economical electrochemical instruments as opposed to bulky and expensive laboratory equipment. In this pursuit, researchers have leveraged CMOS technology to craft small and wearable potentiostats [63], [64]. Significant strides have been made to broaden the current readout range [65], reduce power consumption and device size [66], [67], and accommodate bidirectional current flow in electrochemical cells [66]. Despite the strides made in miniaturizing electrochemical systems, the reduction in feature size of modern CMOS technologies has led to diminished voltage supplies. For instance, while older 0.5 µm CMOS technology supported a 5 V supply, newer technologies like 180 nm support only a maximum of 1.8 V for regular transistors or 3.3 V for high-voltage transistors. Consequently, numerous electrochemical reactions cannot be sustained by contemporary integrated potentiostats. Moreover, since potentiostats must facilitate bidirectional current for redox reactions, only half of the supply voltage is available for each direction in an ideal rail-to-rail operation. With a 3.3 V supply, this translates to only 1.65 V for each reduction or oxidation reaction. Additionally, because the counter electrode in a standard three-electrode electrochemical cell must exceed the bias potential, only a fraction of this 1.65 V is usable as bias potential. However, many electrochemical reactions, such as those for detecting heavy metals like Mn, require bias potentials beyond the supported range [68]. Hence, conventional CMOS potentiostat designs implemented in newer technologies with lower supply voltages cannot support reactions for these elements. This opens a door for further research on empowering CMOS potentiostats in modern technologies to resolve the limited supply voltage issue. This advancement will allow a more versatile solution that can support a wider range of target elements in real-world applications.
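To make this headroom constraint concrete, the short sketch below works through the arithmetic for a few representative supply voltages. The counter-electrode overhead value used here is an assumed placeholder for illustration, not a measured figure from this work.

```python
# Rough illustration of the bias-potential headroom available to a conventional
# CMOS potentiostat under ideal rail-to-rail, bidirectional operation.
# The counter-electrode overhead below is an assumed placeholder value.

def usable_bias_window(supply_v: float, counter_overhead_v: float = 0.5) -> float:
    """Return the approximate bias potential (V) left for one redox direction."""
    per_direction = supply_v / 2.0  # half the rails for each current direction
    return max(per_direction - counter_overhead_v, 0.0)

for node, supply in [("0.5 um (5 V)", 5.0), ("180 nm HV (3.3 V)", 3.3), ("180 nm core (1.8 V)", 1.8)]:
    print(f"{node:>18}: ~{usable_bias_window(supply):.2f} V usable bias per direction")
```

Under these assumptions, the usable window at modern supply voltages shrinks to a fraction of a volt per direction, which is why reactions requiring larger bias potentials, such as Mn detection, fall outside the supported range of conventional designs.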
2.9. Summary

Employing technology to improve individuals’ wellness is of great interest. These technologies can be used to monitor different parameters that are indicative of individuals’ wellness. These parameters are obtained from features extracted from audio data, visual data, and physiological and environmental sensors, among others. These technologies help identify parameters related to individuals’ state of wellness, including physiological and emotional wellness. Furthermore, they provide insight into interactions among people and how these interactions have mutual effects on participants’ wellness. Moreover, these technologies can be deployed to monitor environmental parameters and study their effects on individuals’ wellness, hence supporting more appropriate behavior to improve individuals’ health. The areas that require further research include efficient implementation of algorithms to enable resource-limited applications. Furthermore, providing higher temporal resolution, which enables the study of dynamic changes over time and allows real-time applications, is of high interest. Developing devices that are resource-efficient and tackling the limitations of translating laboratory instruments to wearable point-of-care devices are also of great importance for developing next-generation multifaceted assistive technologies.

Chapter 3: Methods and tools for analyzing social interactions

3.1. Introduction

Building rapport is an important element in having healthy and productive interactions in different situations, including workplace environments and healthcare settings. This chapter presents the preliminary work on designing a framework for collecting data using sensors to infer human behavior and emotion and ultimately assess the rapport level in an interaction. The goal is to develop algorithms to process raw data and assess rapport building between dyads, and to leverage this information to enhance the quality of interactions in virtual meetings and overcome some of the shortcomings of virtual interactions compared to in-person setups. This chapter presents the work that has been conducted to converge knowledge across disciplines and identify suitable approaches and tools that can be utilized for the analysis of important parameters in social interactions. Different design iterations of the platform for conducting experiments and their design procedure are discussed. The analysis of tools and viable approaches for designing the aforementioned platform is reviewed. The algorithms that were developed, as well as methods to increase computational efficiency, are introduced. An analysis of the applicability of reinforcement learning (RL) for improving the platform is also presented. Finally, a discussion of how this preliminary work shaped the research path is provided.

3.2. Sensor modality and data collection

To implement a platform for analyzing human behavior in a virtual environment, the first step is to collect raw data using sensors. This collected data will be processed down the line to infer human behavior and emotion. In this project, aiming at analyzing virtual meetings, camera data was employed for collecting visual data. Visual cues such as facial expressions play an important role in building rapport [69]. The goal of this project was to explore whether the visual data obtained by a camera can be utilized to extract information about affect and rapport. A camera provides rich data to work with for analyzing nonverbal cues. Moreover, in a virtual setup, a camera is often available, and thus, visual data can be obtained without the need for extra sensors, which makes this platform more widely accessible. In this work, different options for utilizing camera data were studied. Moreover, various features extracted from the camera were studied to analyze human behavior.

3.3. Visual data for assessing affect and rapport

Literature suggests [69], [70] that visual data such as the direction of head and eye gaze, as well as body pose, including leg and arm posture, are important elements in building rapport. Other nonverbal elements such as facial expressions are also important. Among these visual cues, some elements, like leg posture, are not typically accessible in a virtual meeting.
Some other features, however, can be collected using a camera in a virtual meeting which includes eye gaze, head movements and action units (AUs). AUs are the elements in the Facial Action Coding System (FACS) [71], [72] which is a system to taxonomize human facial expressions. This section introduces the analysis of tools for capturing visual data and presents an in-depth comparison of the options and how they can be integrated in a custom-built platform. This platform has been developed to collect data, process it and feed the processed data back to participants in a meeting. 3.3.1. Monitoring eye contact Making eye contact is an important element of an effective communication [73]. It is an indication of engagement and attention levels of the audience. However, eye contact is missing 20 in a virtual environment, and therefore, participants miss an important cue in communication. Thus, one of the objectives of this research work was to utilize visual data to determine if participants in a virtual meeting were looking at each other and hence, they established “virtual eye contact” during the interaction. In this section, the options for monitoring eye gaze are analyzed and a comparison is presented. The methods that were developed for integrating eye gaze monitoring tools into our platform are explained. The goal in this work was to implement a platform where participants in a virtual meeting can benefit without the need for an extensive setup on their end. For instance, utilizing special hardware/camera, which is often equipped with infrared detection and proprietary software, provides very accurate eye gaze data and improves the result of any analysis using this data. However, this special equipment is not typically readily available for users who use their laptops, for example, to attend a virtual meeting. Therefore, the objective in this research work was to limit the experiments to using hardware that is available to an average user, namely a webcam only. This constraint makes the platform usable without any special equipment and only require participants to install a piece of software that is developed for integrating different elements of the platform. After analyzing different options, we chose GazePointer [74], which provides sufficient accuracy as an open-source software for detecting eye gaze on a regular 14” display. To facilitate this experiment, we developed a user interface in HTML as shown in Figure 3. 1. 21 Figure 3. 1. Developed user interface in HTML that communicates with GazePointer and displays the coordinates of a user's gaze on the screen. The experiment was performed on a 14” display. The HTML page communicates in the backend with GazePointer and displays the location at which a person is looking at on display. The HTML page also displays the coordinates of head location as well as yaw, pitch, and roll to monitor head movements. Despite the relatively good accuracy of GazePointer, a few disadvantages resulted in exploring other options to replace GazePointer. First, GazePointer requires a lengthy calibration process at the start of each session. Second, the result is very sensitive to the location of the head in front of the camera and might not be useful for a normal virtual session where participants move relative to the camera within the normal range of human movements. Third, it lacks support for analyzing prerecorded videos as well as videos that manifest multiple people. 
Finally, there is a lack of support for capturing action units (AUs), which are essential for detecting facial expressions [71]. It is worth mentioning that, for real-time monitoring of participants in a virtual meeting, all of the eye tracking software packages we evaluated require a dedicated camera because Windows does not allow two applications, such as Zoom and GazePointer, to use one camera simultaneously.

3.3.2. Framework for processing action units and head movements using OpenFace

OpenFace [75] is open-source software that is slightly less accurate in eye gaze detection than GazePointer but without the issues stated in section 3.3.1, such as the need for a lengthy calibration process and the restriction of movements in front of the camera. This software provides an opportunity for seamless integration with the developed platform for analyzing the data using Python scripts. Besides eye gaze and head location/orientation, OpenFace provides information about action units. It also allows using recorded video and analyzes multiple people in one scene. Since OpenFace allows the analysis of recorded videos, it facilitates analyzing the data from all participants on a single computer instead of analyzing data on each node. This method reduces the complexity of the experimental setup for each user, as most of the heavy lifting is done on a central node. Therefore, only a minimal software setup is required on each node. The approach to this method was recording the screen in small time frames and analyzing them immediately afterward. This method eliminates the need for a second camera for real-time analysis. The bottleneck, however, becomes the latency of processing the videos. Different methods for recording the screen and feeding the recorded data to OpenFace were explored. Namely, Python was employed to automatically control Camtasia [76], a third-party software, to record the screen and feed the recording back to Python; the Python script controls and sends commands to Camtasia using the command terminal. Another method that was used to achieve higher speed was using Python directly to capture screenshots and feed the sequence of screenshots to OpenFace. The entire software backend was integrated and worked seamlessly and automatically. The Python script uses ZeroMQ [77] to communicate with OpenFace. The rate of taking screenshots and processing them through OpenFace was optimized to achieve the lowest latency. Figure 3. 2 shows the analysis of recording and processing the data for multiple subjects. The fastest solution we achieved was ~6 seconds of latency for analyzing 4 people on the screen. This latency does not include the post-processing of our algorithms on the data obtained from OpenFace.

Figure 3. 2. Analyzing time delays of different methods of recording and processing the data using OpenFace. Our Python script uses the ZeroMQ protocol to communicate with the OpenFace back-end.

Besides the relatively slow processing time, one downside of OpenFace is its low accuracy in eye gaze detection, especially in the vertical direction. To illustrate this limitation, Figure 3. 3 shows the result of the experiment where the eye gaze direction was assessed while looking at the four corners of a 23-inch display. We were able to detect eye gaze in the horizontal direction with high accuracy, but the accuracy in the vertical direction was limited. This is a typical issue in eye gaze detection systems, as the movement of the eyes in the y direction is more limited than in the x direction.
Moreover, the vertical movement of the eyes is occluded by eyelids. This was a limitation that we observed in all the webcam-based eye gaze detection systems that we experimented with. It is worth mentioning that this experiment was focused on four extreme corners of a 23-inch display, and the result will be degraded when we want to use smaller screens or if we want to follow eye gaze in smaller range within a display. Therefore, without using specific hardware components for eye gaze estimation, we can only estimate the “virtual eye contact” for two people in a virtual meeting where the faces are displayed side by side on a screen. 25 Figure 3. 3. Experimental results for estimating eye gaze when looking at four corners of a 23- inch display. (a) displays gaze angle in x and y direction vs video frames where the video was recorded at 30 fps. (b) shows XY coordinate of the estimated eye gaze on a 2D plane. 3.4. Preparing the platform for conducting experiments In earlier sections, we discussed developing software for processing data and establishing communication between this software and other open-source software. We also briefly 26 discussed the web interface for running experiments for eye gaze detection. In this section, we present in more detail the considerations we had for developing the front-end and back-end for our experiments. The goal of this effort was to develop a dashboard where the data was effectively communicated to meeting participants. Efficient methods for data management and storing data in databases were also explored. For visual display in the dashboard, the idea was to feed the processed data back to the users in an efficient and easy-to-understand way. The goal was for data not to distract participants, but rather let them take the key points away with a glance. Using Python, HTML and CSS, we developed the front-end of the display dashboard. We explored several options and ended up designing a split layout where the information is presented on both sides of the HTML page. The Zoom window was then laid out on top and in the middle of the page. We chose pie charts to include data about the participation of users in the conversation and bar charts for the dynamic of conversation between people. We also used Sankey diagrams to show the rapport building between each person and other participants as well as showing the level of affect for that participant. Figure 3. 4 shows an instance of the dashboard. The instances on the application, as well as the size of the text, are modifiable. Different options such as graphs with or without text and different layouts such as horizontal split of the display are selectable. Charts without text could convey the message with less distraction and are useful once the participant gets comfortable with the platform. MySQL was used to manage the database running on a central node or server. The central node sends/receives data to/from all connected nodes through the local network. This 27 connection allows real-time communication between nodes and real-time updates of chart data on the dashboard display. To improve the platform's accessibility, we also included color palettes suitable for color blindness cases. As for the networking, we used a central node to host the database and employed the MySQL protocol to connect different nodes. As of now, all the nodes need to be on a local network for the database to run effectively. 
Hosting the database on the web or a cloud to allow participants to connect from any location remains as future work.

Figure 3. 4. Concept illustration of the split layout on the web interface with a Zoom window overlayed on top.

3.5. Exploring possibilities with reinforcement algorithms to enable person-specific recommendation

In the platform discussed so far, we are ideally interested in implementing a dynamic feedback system where the system analyzes the data and provides each participant with personalized recommendations to improve the quality of the interaction. To implement this system, we intended to leverage reinforcement learning (RL) algorithms. Among all the different RL methods, we are interested in methods that 1) do not require a model, 2) learn at each time step (as opposed to updating the parameters at the end of the experiment), and 3) provide a control opportunity (as opposed to merely learning). After analyzing all the options with respect to these criteria, we created a short list of methods, namely SARSA, Expected SARSA, and Q-learning [78]. We analyzed the suitability of these options for this work. For example, Expected SARSA provides a more stable update target and lower variance, but it is more computationally expensive compared to SARSA, as it requires the calculation of the expected estimate of the next action value for every state. Given that we aimed for a real-time application where the speed of processing on a regular personal computer is a bottleneck, we chose to implement the SARSA method. Another important factor is using approximate solution methods rather than tabular methods, as we typically do not have prior information about each possible state, and even if we did, constructing huge tables of states is not desirable in these applications. Moreover, using function approximation makes the learning methods applicable to partially observable problems, which we expected to deal with when considering human involvement in our control system. Therefore, we chose to use approximate solution methods with neural networks (NN) as the function approximator. To implement the quantized RL algorithm, we decided to apply the quantization technique to each main block of the algorithm and fine-tune it before implementing the complete solution. To this end, we started with the function approximator. To implement a quantized NN, we took the standard NN problem of classifying hand-written digits and implemented our solution with the quantization technique.

Table 3.1. Classification accuracy of NN with quantized parameters.

To begin this phase, the extreme case of quantization, namely binarization, was studied. A binarized version of the NN was investigated first since its implementation is simpler than multilevel quantization and supportive literature [79] exists on this topic. In this method, all the weights of the NN were replaced by -1 and 1. For the activation function, the sign function is ideal, but its derivative causes problems in training a NN as it is equal to zero almost everywhere. Therefore, to mimic the sign function, we used a sigmoid function with a large multiplier in its argument. This allows us to simulate the sign function behavior while still being able to take the derivative of the activation function in the backpropagation procedure. As shown in Table 3.1, an accuracy of 85% was obtained while all the weights were binarized and the activation functions were nearly binarized, as discussed above.
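To make this surrogate-activation scheme concrete, the sketch below shows a one-hidden-layer network with weights binarized to {-1, +1} and a steep sigmoid standing in for the sign function. Layer sizes, the steepness factor, and the learning rate are assumed values for illustration only and do not reproduce the exact network behind Table 3.1.

```python
import numpy as np

# Illustrative sketch: a one-hidden-layer network with binarized weights and a
# "nearly binary" activation -- a sigmoid with a large multiplier k that
# approximates the sign function while remaining differentiable.

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, k, lr = 784, 64, 10, 20.0, 0.1

def steep_sigmoid(z):
    return 1.0 / (1.0 + np.exp(-k * z))   # approximates a 0/1 step

def binarize(w):
    return np.where(w >= 0, 1.0, -1.0)    # weights constrained to {-1, +1}

# Real-valued "latent" weights are kept for the update; binarized copies are used
# in the forward pass (the usual trick in binarized-network training).
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))

def forward(x):
    h = steep_sigmoid(x @ binarize(W1))
    return h, steep_sigmoid(h @ binarize(W2))

def train_step(x, y_onehot):
    """One gradient step on a single sample (squared-error loss). Gradients flow
    through the steep sigmoid; the binarization is treated as identity (slope 1)."""
    global W1, W2
    h, out = forward(x)
    err_out = (out - y_onehot) * k * out * (1 - out)        # dL/dz at output layer
    err_hid = (err_out @ binarize(W2).T) * k * h * (1 - h)  # back-propagated to hidden layer
    W2 -= lr * np.outer(h, err_out)
    W1 -= lr * np.outer(x, err_hid)
```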
This served as proof of concept that high precision is not always necessary for the weights and activation functions in a NN. In the next step, we replaced the sigmoid functions with the actual sign function to fully binarize the network. The accuracy of this design dropped to ~50%. To improve the results, we explored the straight-through estimator for implementing backpropagation. In this method [79], the derivative of the cost function, \(\partial J / \partial \theta\), during backpropagation is replaced by \((\partial J / \partial g) \cdot 1_{|x| \le 1}\), where \(J\) is the cost function, \(\theta\) denotes the weights, and \(g\) is the activation function. What this means is that instead of calculating the derivative of the cost function with respect to the weights (\(\theta\)), the derivative is calculated with respect to the activation function, which in this case is a sign function. The result is then multiplied by \(1_{|x| \le 1}\), which equals 1 in the vicinity of the origin and 0 everywhere else. In other words, backpropagation is calculated under the assumption that the derivative of the sign function is 1 close to the origin and 0 elsewhere. To calculate the derivative of the cost function with respect to the activation function, the following formulas were derived. The forward path is represented by

\( out = g\big(a^{(1)}\,\theta^{(1)}\big)\,\theta^{(2)} , \)   (3.1)

from which we assumed

\( \dfrac{\partial\, out}{\partial g} = \theta^{(2)} , \)   (3.2)

and we know that

\( \big(f^{-1}\big)'(x) = \dfrac{1}{f'\big(f^{-1}(x)\big)} . \)   (3.3)

Applying the chain rule and (3.3) and simplifying, the following expression was obtained:

\( \dfrac{\partial J}{\partial g} = -\dfrac{1}{m}\sum\sum\Big[\theta^{(2)}\big(\log h_{\theta}(x) - \log\big(1 - h_{\theta}(x)\big)\big)\Big] + \dfrac{\lambda}{m}\sum\sum\sum \theta . \)   (3.4)

(3.4) was plugged into the algorithm. The accuracy of the results did not improve, but the computational overhead was much lower because the operations for calculating the derivative in backpropagation were replaced with the simpler expressions in (3.1)-(3.4). Despite the potential we see and the progress we made with applying reinforcement learning to this problem, a major bottleneck remained the amount of data needed to train the algorithms. Upon further investigation into implementing SARSA, we realized that, with the available human subjects and experiments, it was not feasible for us to follow this path for now. However, it remains a viable path to pursue in the future.

3.6. Conclusion and discussion

Given the findings in the preliminary work presented in this chapter, we organized the bulk of this thesis work around the following topics. The core of behavior monitoring in this framework is having a reliable assessment of the “affect” level of each individual. In psychology, affect is described as “the underlying experience of feeling, emotion, attachment, or mood” [80].
A valid assessment of affect in individuals may indicate the effect of the conversation on individuals over the course of a meeting. Such an assessment also gives clues to other participants, which may be utilized to improve the quality of interaction. Another important factor in assessing the dynamic of any conversation is assessing the rapport building between participants. Monitoring the rapport building between dyads in a conversation is a very important parameter that gives an understanding of the quality of interaction. Therefore, the next chapter is focused on assessing affect and rapport in virtual conversations to foster higher-quality interactions in virtual meetings and improve the well-being of the participants. This, in return, facilitates the productivity of meetings or online learning setups.

Chapter 4: Developing platforms for monitoring affect and rapport

4.1. Introduction

To ensure productive work meetings and effective learning environments, it is crucial to pay attention to non-verbal cues from our audience, such as auditory or visual cues. These cues offer insights into people's emotional states and engagement levels, facilitating more impactful communication. However, individuals vary in their social intelligence, affecting their ability to interpret these cues accurately. This discrepancy directly influences the quality of interpersonal interactions. This challenge is amplified in virtual settings, where remote collaboration has become increasingly prevalent. Despite advancements, virtual platforms often lack the richness of in-person interactions, such as eye contact and body language observation, making communication less effective. Consequently, there is a growing need for technologies to enhance communication and interaction effectiveness in virtual environments. Recent literature reflects a surge in interest in developing technologies capable of discerning people's emotional states as well as rapport building among individuals. Many of these approaches rely on computationally intensive deep neural networks, limiting real-time implementation, especially with constrained computational resources. Alternatively, some studies utilize machine learning algorithms requiring less computation but necessitating manual feature engineering, adding complexity. Moreover, these methodologies often operate within controlled lab environments, where experiments induce specific emotions, resulting in a higher signal-to-noise ratio than natural interactions. As a result, algorithms developed under these conditions may exhibit reduced performance in more natural settings. In this work, we developed a framework that utilizes neural networks to analyze individuals’ affect and rapport building in groups during virtual meetings. The contributions of this work are as follows:
 Analyzing affect and rapport where individuals are holding regular work meetings in a natural setup without acted sessions.
 Classification of subtle changes towards positive or negative affect as opposed to extreme cases.
 Analyzing affect and rapport with high temporal resolution, which enables providing real-time analysis and feedback.
 Analyzing affect and rapport in multiperson groups.
 Implementing a neural network with a minimum number of layers and input nodes by reducing the feature space and using raw features as opposed to hand-crafting features (a minimal sketch of such a network follows this list).
To the best of our knowledge, this work is the first to achieve high accuracy while satisfying all the requirements specified above.
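As a minimal illustration of the last contribution above, the sketch below trains a single-hidden-layer classifier on raw per-window visual features using scikit-learn. The feature set, window labels, and hidden-layer width are placeholders for demonstration, not the configuration reported later in this chapter.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: rows are time windows (e.g., speaking turns), columns are raw
# per-window visual features (e.g., averaged AU intensities, gaze angles, head pose).
# Shapes and the label encoding (0 = negative, 1 = neutral, 2 = positive affect)
# are assumptions for this sketch.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))    # 500 windows x 20 raw features
y = rng.integers(0, 3, size=500)  # affect class per window

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# A single hidden layer keeps the model small enough for real-time use.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

The same structure extends to rapport prediction by swapping the target vector for per-window rapport labels.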
In this chapter, we present the details of our methods in dataset preparation and data analysis. The results for analyzing rapport and affect are presented in separate subsections. We conclude the chapter by summarizing the work and discussing our findings.

4.2. Background and related works

4.2.1. Related work on affect monitoring

According to the American Psychological Association, together with 'cognition' and 'conation', affect is one of the three identified components of the mind [81]. According to [81], affect is defined as "any experience of feeling or emotion, ranging from suffering to elation, from the simplest to the most complex sensations of feeling, and from the most normal to the most pathological emotional reactions. Often described in terms of positive affect or negative affect, both mood and emotion are considered affective states." Reported works on affect recognition in the literature use multiple sensor modalities and different types of machine learning algorithms, some of which were explained in chapter 2. These modalities include audio, visual, and natural language. In one study [39], researchers utilized deep neural networks to analyze audiovisual data for affect recognition, showcasing a significant improvement in emotion recognition performance compared to traditional methods reliant on handcrafted features. Similarly, another study [40] employed deep recurrent neural networks to analyze speech. Despite demonstrating promising results, these approaches are computationally intensive, which hinders their real-time application where computational resources are constrained. Another work [16] compares logistic regression with a linear support vector machine (SVM) for analyzing videos. By extracting the action units (AUs) from the videos and analyzing them with these classifiers, the researchers were able to recognize disrespectful interactions with an accuracy of ~62%. In [82], the researchers built on [16] and, by adding audio features such as pitch and intensity, achieved an accuracy of 79.86% in detecting disrespectful vs. respectful interactions using a logistic regression model. Other researchers focus on visual data to gauge the engagement intensity of individuals in various settings, such as classrooms [9], [10], [25], [26]. Additionally, cues such as head motion synchronization and empathy in face-to-face communication have been investigated using accelerometers in a lab environment [27], revealing that empathy levels correlate with the frequency and phase of head motion synchronization. This collective body of research motivates us to explore the potential of using visual features in natural settings, as opposed to controlled lab experiments, to analyze audience affect. To achieve this without relying on handcrafted features, and to reduce computational complexity compared to deep neural network approaches, we opt to implement neural networks with only one hidden layer. Our objective is to classify affect in a natural meeting environment where extreme emotions are less prevalent, training the classifier to detect subtle shifts in participants' emotions. Throughout this dissertation, by "participants" we refer to the people whose emotions and behavior were monitored, not the labelers or the study team who designed and conducted the experiments.

4.2.2.
Related work on rapport monitoring Rapport, the establishment of a harmonious and empathetic connection between individuals, lies at the heart of effective communication and collaboration across diverse contexts from personal to professional interactions. Particularly, it is the cornerstone for building productive and impactful meetings across professional settings. Rapport encompasses the establishment of trust, understanding, and mutual respect among participants, fostering an environment conducive to open communication, collaboration, and creativity. The presence of rapport can greatly influence the dynamics of a meeting, shaping the level of engagement, the quality of discussions, and ultimately, the outcomes achieved. Research has consistently highlighted the significant impact of rapport on team performance, decision-making processes, and overall meeting effectiveness [83]. In this context, recognizing the importance of rapport and 37 its role in facilitating productive meetings is essential for organizations seeking to optimize their communication strategies and maximize team synergy. Building rapport in virtual environments presents unique challenges compared to face- to-face interactions [84]. One of the primary obstacles is the lack of non-verbal cues, such as body language and eye contact, which are integral to establishing trust and connection. In virtual meetings, participants may find it challenging to interpret subtle cues or accurately gauge the emotions and intentions of others, leading to potential misunderstandings or miscommunication. As a result, individuals may struggle to develop the same level of rapport in virtual environments, requiring deliberate efforts and strategies to overcome these challenges effectively. A body of work in this area focuses on recognizing rapport levels in the human interaction with a virtual agent [31], [32]. Others aim to analyze rapport levels in human-to-human interaction [29], [85] in dyadic pairs. Despite all the advances in this area, interpreting rapport in high temporal resolution and among multi-person groups is missing in the literature, leaving rapport analysis for dyadic conversations that is done for an entire session of interaction (as opposed to fine temporal resolutions). Therefore, granular information about the dynamics of conversations is absent in these studies. The focus in this work is analyzing rapport in multiperson groups and with high temporal resolution. 4.3. Dataset preparation 4.3.1. Collecting and preparing the data for analyzing affect To perform this experiment, we recorded five work group meetings with the duration of approximately 40 minutes to 100 minutes with an average of ~62 minutes each. The first two meetings had 5 participants whereas the last three had 4 participants, both male and female. 38 Figure 4.1 shows a snapshot of one of these recording sessions in Zoom [86]. This study has been reviewed and approved by the Institutional Review Board (IRB) office at Michigan State University. Figure 4. 1. An example of the setup for collecting data during a virtual meeting using Zoom. © 2023, IEEE. The labels were assigned to each segment in which a participant was speaking. To simplify the labeling process, a Matlab script was developed to identify the conversation segments in each recording and generate time stamps accordingly. The recorded video files were cut into 2522 segments using the MATLAB script based on the generated time stamps. 
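As a side note, the segmentation step above was implemented as a MATLAB script; for readers who prefer an open tool chain, a roughly equivalent sketch in Python that shells out to ffmpeg is given below. The file names, the timestamp list, and the helper name cut_segments are illustrative assumptions, not part of the original pipeline, and cuts made with stream copying align to the nearest keyframe.

    # Illustrative sketch (not the original MATLAB script): cut a recording into
    # segments with ffmpeg, given (start, stop) timestamps in seconds.
    import subprocess

    def cut_segments(video_path, timestamps, out_prefix="segment"):
        """timestamps: list of (start_sec, stop_sec) tuples, one per conversation turn."""
        for i, (start, stop) in enumerate(timestamps):
            out_file = f"{out_prefix}_{i:04d}.mp4"
            # -ss/-to select the time window; -c copy avoids re-encoding.
            subprocess.run(
                ["ffmpeg", "-y", "-i", video_path,
                 "-ss", str(start), "-to", str(stop),
                 "-c", "copy", out_file],
                check=True,
            )

    # Example with three hypothetical conversation turns
    cut_segments("meeting1.mp4", [(12.0, 25.5), (25.5, 41.2), (41.2, 60.0)])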
For labeling the cut segments, a graphical user interface (GUI) was developed using the App Designer tool of MATLAB 2019b. This GUI plays each segment one by one and lets the labeler assign the proper label from within the same GUI. Figure 4.2 shows the appearance of this GUI, which greatly speeds up the labeling process [87].

Figure 4. 2. Developed graphical user interface for labeling 'affect' [87].

4.3.2. Labeling the affect dataset

Three labelers were trained to label the affect level of each participant for each segment of the recorded meeting. The GUI introduced in the prior section was used to assist with labeling. The labelers could play each segment and label it from within the app. The segments were played both in the order of occurrence and in random order, and labeled separately. In this work, however, we focused on the labeling that was done in order and left the analysis of the randomly ordered labeling for future work. The labelers were instructed to label the segments as positive, neutral, or negative. The app outputs a text file that contains the labels for each video segment. After completion of the labeling process, the labels that had a majority agreement among the labelers were kept, and the rest were disregarded. As seen in Figure 4.3, in 59.9% of datapoints all three labelers assigned the same label, in 36.8% of datapoints only two labelers assigned the same label, and in 3.3% of datapoints none of the labelers assigned the same label. Therefore, we kept the 96.7% of datapoints to which at least two labelers assigned the same label and disregarded the remaining 3.3%.

Figure 4. 3. Percentage of datapoints on which labelers agreed on a label.

A Python script then reads the labels, assigns them to the corresponding features, and makes the dataset ready to be used with the classification algorithms. The details about features and algorithms are presented in the methods section later in this chapter.

4.3.3. Collecting and preparing the data for rapport monitoring

For this experiment, we recorded twenty meeting sessions, of which eight meetings had three participants and twelve meetings had four participants. The duration of the sessions was between 18 and 30 minutes, with an average of ~22 minutes. A total of 35 participants (people who were recorded, not the labelers or the study team who designed and conducted the experiments) were recruited for holding meetings. Tables 4.1 and 4.2 show the gender and age distribution of the participants. More than 90% of participants were in the age group of 18-34 years old.

Table 4.1. Gender distribution of participants.
Gender    Count
Male      21
Female    12
Other     2
Total     35

Table 4.2. Age distribution of participants.
Age group    Percentage
25 - 34      48.57%
18 - 24      42.86%
35 - 44      8.57%
Total        100.00%

Tables 4.3 and 4.4 show the distribution of the participants' education level and occupation status. More than 90% of participants had completed at least some college courses or held college or professional degrees.

Table 4.3. Distribution of education level of participants.
Education level                                          Count
Some college, no degree                                  10
Bachelor's degree (e.g. BA, BS)                          10
Master's degree (e.g. MA, MS, MEd)                       10
Doctorate or professional degree (e.g. MD, DDS, PhD)     3
High school degree or equivalent (e.g. GED)              1
Do not wish to answer                                    1
Total                                                    35

Table 4.4. Distribution of occupation status of participants.
Occupation status                                   Count
Student                                             29
Employed full-time (40 or more hours per week)      3
Employed part-time (up to 39 hours per week)        3
Total                                               35

The 3-person groups formed three dyadic pairs and the 4-person groups formed six dyadic pairs for the purpose of analyzing rapport in these groups. Figure 4.4 shows a snapshot of one of these recording sessions in Zoom. This study has been reviewed and approved by the Institutional Review Board (IRB) office at Michigan State University.

Figure 4. 4. An example of the recording session for collecting data during a virtual meeting.

To facilitate the labeling process, the MATLAB script mentioned in section 4.3.1 was used to segment the video files into 30-second windows, as we were interested in analyzing rapport in fine-grained time segments. This would allow us to analyze the dynamics of interactions during each session. For labeling these video segments, a graphical user interface (GUI) was developed using Python and PyQt. This GUI plays each video segment one by one and lets the labeler assign the proper label from within the same GUI. It has features that facilitate a faster and easier labeling process, such as navigating between segments or skipping some segments. Figure 4.5 shows the appearance of this labeling assistant GUI, which greatly smoothed the labeling process.

Figure 4. 5. Developed graphical user interface for labeling rapport.

4.3.4. Labeling the rapport dataset

Four labelers were recruited to label the 30-second segmented videos. The labelers were instructed to label each dyadic pair, i.e., three or six pairs for three-person and four-person groups, respectively. They were instructed to label only the dyadic pairs in which at least one person speaks for more than 10 seconds. The labelers were provided with the definition of rapport. For consistency among the labelers, they were instructed to look for the parameters shown in Table 4.5. These parameters are derived from the literature presented in [88], [89] and with the method introduced in [87]. However, the labelers were instructed not to overemphasize these parameters and to rely on their first impressions and general intuition to gauge the rapport building in the groups.

Table 4.5. Parameters of interest in gauging rapport.
Well-coordinated    Boring     Cooperative    Harmonious    Unsatisfying    Uncomfortable
Cold                Awkward    Engrossing     Unfocused     Involving       Intense
Unfriendly          Active     Positive       Dull          Worthwhile      Slow

Given the subjective nature of these labelings, and based on our previous experience that many labelers tend to label instances as 'neutral', we instructed the labelers to label each segment on a seven-point Likert scale where -3 represented extreme negativity and +3 represented extreme positivity, as shown in the GUI in Figure 4.5. This was purely meant to elicit more labels other than 'neutral' or zero; we then binned these labels into only two classes of high and low rapport based on the statistical analysis presented in the Method section.

4.4. Method and results

4.4.1. Extracting Facial Action Units

In this work, we used facial action units (AUs) as features to analyze 'affect'. AUs represent the movement of individual facial muscles and are commonly used as indicators of the expression of emotions [90], [91]. Figure 4.6 shows some examples of action units [92]. To extract AUs, we used OpenFace [93], [94], [95], an open-source software package widely used by the community. OpenFace extracts a subset of AUs comprising the intensity of 17 different AUs.
These 17 features were used for classifying different affect levels in the various virtual meetings.

Figure 4. 6. Examples of facial action units [92].

4.4.2. Classification of affect

As described earlier, for classifying affect, the video segments were labeled as positive, neutral, and negative. We trained our classifier only on positive and negative labels, as they are more reliably classified. To avoid developing a bias toward any of the classes during training, we balanced the dataset to have an equal number of datapoints for the positive and negative labels. For the classification, to avoid manually crafting features for the algorithms, we chose to use neural networks as opposed to other machine learning algorithms such as logistic regression or SVM. Moreover, to avoid a heavy computational load and to reduce the chance of overfitting, we chose to implement a neural network with only one hidden layer. We first implemented the neural network with all 17 AUs as the input features. The videos were recorded at a rate of 30 fps. Although facial expressions change slowly and our analysis does not require 30 fps, we did not downsample the recordings and simply used the fully recorded data for this experiment; downsampling could be investigated in the future. For each AU, the code takes the average of the values over the span from the start to the stop time of each video segment. Therefore, for each AU, one value is assigned to each video segment. With all 17 AUs used in the classifier, the design suffered from significant variance: despite using regularization, a training accuracy of 92.9% was achieved while the testing accuracy was only 60.7%. To solve this problem, we employed principal component analysis (PCA). As shown in Figure 4.7, to retain at least 80% of the variance, we projected the feature space onto only 10 features and reformed the neural network with 10 input features and 10 nodes in the hidden layer.

Figure 4. 7. Retention of variance vs number of principal components. By choosing 10 principal components, more than 80% of the variance has been retained © 2023, IEEE.

4.4.3. Results for classifying affect

The neural network was trained with 4-fold cross validation, and training and testing accuracies of 81.1% and 76.8% were obtained, respectively. Considering that we performed our experiments in natural setups without any constraints on participants, and that we did not aim to classify only extreme cases such as disrespectful moments, a 76.8% testing accuracy was achieved using a neural network with only one hidden layer. To the best of our knowledge, this result has been achieved for the first time in the literature and paves the path toward real-time analysis of virtual meetings using local computational resources, for example on a typical laptop. Table 4.6 summarizes the results of these experiments.

Table 4.6. Training and testing accuracy of affect with and without implementation of PCA © 2023, IEEE.
Feature space                                  Training accuracy    Testing accuracy
Full feature space (17 AUs)                    92.9%                60.7%
Reduced feature space (PCA, 10 components)     81.1%                76.8%

4.4.4. Extracting gaze and head orientation

For the purpose of analyzing rapport, we extracted eye gaze, head orientation, and head coordination in addition to AUs. These features were extracted using OpenFace as well.
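OpenFace writes one CSV row per video frame, so the per-segment averaging described above amounts to a small post-processing step. The sketch below illustrates one way to collapse the frame-level AU intensities (and, for the rapport study, the gaze and head-pose columns) into a single mean value per labeled segment; it is not the exact code used in this work. The column names follow OpenFace's usual CSV convention (e.g., AU01_r, gaze_angle_x, pose_Tx) and should be checked against the actual output of the OpenFace version used; the segment list and function name are illustrative.

    # Minimal sketch: average OpenFace frame-level features over each labeled segment.
    import pandas as pd

    def segment_features(openface_csv, segments):
        """segments: list of dicts with 'start' and 'stop' times in seconds."""
        df = pd.read_csv(openface_csv)
        df.columns = df.columns.str.strip()           # OpenFace pads column names with spaces
        feature_cols = [c for c in df.columns
                        if c.startswith(("AU", "gaze_angle", "pose_T", "pose_R"))]
        rows = []
        for seg in segments:
            window = df[(df["timestamp"] >= seg["start"]) & (df["timestamp"] < seg["stop"])]
            rows.append(window[feature_cols].mean())  # one averaged value per feature per segment
        return pd.DataFrame(rows)

    # Example with two hypothetical segments
    feats = segment_features("participant1.csv", [{"start": 0.0, "stop": 12.0},
                                                  {"start": 12.0, "stop": 30.0}])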
Beyond the features themselves, we were more interested in the synchrony of these features within dyadic pairs, which is more indicative of rapport building in groups. As a measure of synchrony, we used dynamic time warping (DTW). DTW is a measure of similarity between two temporal sequences. Similar to Euclidean distance, it measures the distance between two vectors. However, unlike Euclidean distance, it does not measure the distance between two vectors point by point; it takes into account the distance between neighboring points and chooses the minimum value for each point. Figure 4.8 shows the comparison between Euclidean distance and DTW. This method is widely used in applications such as language processing. It can be used for comparing two instances of data that are noisy or have different lengths. For instance, the similarity between one sentence pronounced by two people can be identified using DTW.

Figure 4. 8. Comparison between Euclidean distance and DTW [96].

In this work, synchrony features were constructed for each of the base features, namely eye gaze, head orientation, head coordination, and action units. We used a Sakoe-Chiba band of three seconds to calculate DTW and used the DTW values as a set of input features to our algorithm. The full list of features is shown in Table 4.7.

Table 4.7. Full list of features for analyzing rapport building in groups.
Category            Comment                                                         Number of components
Eye gaze            x, y, z coordinates for each eye plus gaze angle (x, y);        8 x 2 = 16
                    multiplied by 2 (one set per member of each pair)
Head coordination   x, y, z coordinates; multiplied by 2 (for each pair)            3 x 2 = 6
Head orientation    Yaw, pitch, roll; multiplied by 2 (for each pair)               3 x 2 = 6
AUr                 Intensity of AUs; multiplied by 2 (for each pair)               17 x 2 = 34
AUc                 Presence of AUs; multiplied by 2 (for each pair)                18 x 2 = 36
DTW                 Constructed between the two participants in each pair for       5
                    gaze, head orientation, head coordination, AUr and AUc
Total                                                                               103

4.4.5. Classification of Rapport

The statistics of the rapport labels are presented in Figure 4.9; 2674 valid labels were generated. The main challenge in this dataset is the imbalance of data among classes. Therefore, we chose to bin high and low rapport in a way that yields a more balanced dataset, which greatly helped train the algorithms. The graph also shows that the choice of a seven-point labeling scale helped the labelers identify more instances of 'minor positivity' (indicated by scale '1'), which otherwise would have been labeled as neutral and would have skewed the dataset massively.

Figure 4. 9. Distribution of the dyadic rapport labels. The boxes show the group of labels used for high and low rapport.

4.4.6. Results for classifying rapport

A neural network with a single hidden layer and two output classes was implemented for classifying high and low rapport. The first experiment was done with the full feature set. Given that a total of 103 features were used in this experiment, the rule of thumb for the required number of data points is:

N = 10 x M/α  (4.1)

where N is the number of data points, M is the number of parameters, and α is the overparameterization ratio. We used 10 nodes in the hidden layer; therefore, the number of parameters, M, is calculated as follows. The number of parameters between the input and the hidden layer is 103 x 10 + 10 = 1040, where ten parameters are added for the bias nodes. Likewise, the number of parameters between the hidden layer and the output layer is 10 x 2 + 2 = 22. Therefore, a total of M = 1062 parameters must be trained.
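As a quick sanity check, the parameter count just derived, and the data-point requirement used in the next paragraph, can be reproduced in a few lines. The helper name below is illustrative; the layer sizes and the value of α are the ones discussed in the text.

    # Sanity check of the parameter count and the rule-of-thumb data requirement (4.1).
    def mlp_param_count(n_in, n_hidden, n_out):
        # weights + biases for the input->hidden and hidden->output layers
        return (n_in * n_hidden + n_hidden) + (n_hidden * n_out + n_out)

    M = mlp_param_count(103, 10, 2)   # 1040 + 22 = 1062 trainable parameters
    alpha = 5                         # overparameterization ratio used in the text
    N_required = 10 * M / alpha       # (4.1): about 2124 data points
    print(M, N_required)              # prints: 1062 2124.0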
Taking α = 5, (4.1) indicates that at least N = 10 x 1062/5 ≈ 2124 data points are needed to train the network. Given the total of 2674 data points we had, the concern was overfitting, and our experimental results shown in Figure 4.10 confirm this, as seen in the data for the full set of features (the last two bars in the figure). Therefore, the feature space should be reduced to achieve better generalization. To this end, a study of each feature type was performed to identify the most relevant features. The idea was to keep the five DTW features, as they are indicative of synchrony between participants; then, each type of feature was added to the analysis to examine its impact on the results. The accuracy results for rapport classification of the dyads are shown in Figure 4.10. The experiments were repeated 20 times, and each time the initialization of parameters was repeated in the training process. The same randomly chosen 80% of the data was used for training and the remaining 20% for testing in all 20 runs. The average, along with maximum and minimum error bars, is shown on the graph for different combinations of features as well as for the full set of features.

Figure 4. 10. Training and testing accuracy of rapport for different features. The right most column shows the result for the full set of features.

Since accuracy measures how often a classification model is correct in general, it is not a good metric when a dataset is imbalanced among different classes [97]. Since the dataset in this experiment was not fully balanced, precision and recall were calculated as well. Precision in this case is the measure of what portion of the items that have been detected as high rapport are correctly predicted [98]. In other words, it shows how often the prediction is correct when predicting a target class [97]. The formula for calculating precision is:

Precision = Tp/(Tp + Fp)  (4.2)

where Tp is the number of true positives and Fp the number of false positives among the predicted labels. Recall measures what portion of all high-rapport data points is correctly predicted [98]. In other words, it is a measure of how well all the instances of a target class are predicted [97]. It was calculated using the following formula:

Recall = Tp/(Tp + Fn)  (4.3)

where Fn represents false negative predictions. Figure 4.11 and Figure 4.12 present the results for precision and recall in these experiments.

Figure 4. 11. Precision for the rapport classification.

Figure 4. 12. Recall for the rapport classification.

Another commonly used metric that takes into account both precision and recall is the F1 score [99]. This metric was calculated using the following formula, and the results are shown in Figure 4.13.

F1 = 2 x (precision x recall)/(precision + recall)  (4.4)

Figure 4. 13. F1 score for the classification results.

According to the results of our experiments, we noticed that head coordination (poseT) and presence of action units (AUc) are more significant than head orientation (poseR) and intensity of action units (AUr), respectively. Therefore, we did not include head orientation and AUr directly as independent features. However, they still contribute indirectly to the classification because the DTW features derived from those metrics are retained in the feature space. By eliminating these features, the number of features was reduced from 103 to 63, which helps the generalization of the classifier. These 63 features include eye gaze, head coordination and AUc, for both individuals constructing a dyadic pair.
They also include five DTW features for each of eye gaze, head orientation, head coordination, AUc and AUr. 55 Using the newly constructed feature space, the classifier was trained on 80% of randomly selected data and was tested on the remaining 20% of data. This process was repeated 20 times, each time with a new subset of 80/20 data. The results for average and standard deviation of accuracy for both the full and reduced feature spaces are presented in Figure 4.14. As depicted in this figure, the difference between average accuracy of training and testing is smaller for the reduced feature space compared to that of the full set of features. The standard deviations of accuracy also follow the same pattern. This confirms the more generalized solution while achieving high accuracy of 73.6% for the testing experiment. Figure 4. 14. (a) average accuracy and (b) standard deviation of accuracy across 20 experiments on training and testing data, for the full set of features and the reduced subset of features. (a) shows the difference in training and testing accuracy (∆𝐴) is smaller for the case of reduced features. (b) shows the standard deviation (𝜎) of the accuracy for the testing data is lower for the reduced feature space compared to the full feature space. Moreover, the difference of standard deviations (∆𝜎) between training and testing is smaller for reduced features compared to the full features. We repeated these experiments to calculate precision and recall for reduced features and compared them with the case of using the full feature set. The results are depicted in Figure 4.15 and Figure 4.16. 56 Figure 4. 15. (a) average precision and (b) standard deviation of precision across 20 experiments on training and testing data, for the full set of features as well as the reduced subset of features. (a) shows the difference in precision between training and testing (∆𝑃) is smaller in the case of reduced features. (b) shows the standard deviation (𝜎) of the precision of the testing data is lower for the reduced feature space compared to the full feature space. Moreover, the difference of standard deviations (∆𝜎) between training and testing is smaller for reduced features compared to the full features. Figure 4. 16. (a) average recall and (b) standard deviation of recall across 20 experiments on training and testing data, for the full set of features and the reduced subset of features. (a) shows the difference in recall between training and testing (∆𝑅) is smaller in the case of reduced features. Although (b) shows the standard deviation (𝜎) of the recall for testing data is increased for the reduced features compared to the full features, the difference of standard deviations (∆𝜎) between training and testing is decreased for the reduced features compared to the full features. This result confirms a better generalization of the algorithm. Moreover, the average and standard deviation of F1 score were calculated. The results are presented in Figure 4.17. 57 Figure 4. 17. (a) average F1 score and (b) standard deviation of F1 score across 20 experiments on training and testing data, for the full set of features and the reduced subset of features. (a) shows the difference in F1 score between training and testing (∆𝐹1) is smaller in the case of reduced features. (b) shows the standard deviation (𝜎) of the F1 score for the testing data is lower for the reduced features compared to the full features. 
Moreover, the difference of standard deviations (∆σ) between training and testing is smaller for the reduced features compared to the full features.

In all of these results, using the reduced features instead of the full features decreased the difference between the average results for testing and training, as seen in Figures 4.14(a), 4.15(a), 4.16(a) and 4.17(a). Moreover, the standard deviation for the testing experiments on the reduced subset of features is smaller than that of the full set of features, except for recall. And in all cases, the difference between training and testing standard deviations for the reduced subset of features is much smaller than that of the full set of features, as expressed in (4.5):

(σ_test − σ_train)_reduced features < (σ_test − σ_train)_full features  (4.5)

where σ_test is the standard deviation of the test results and σ_train is the standard deviation of the training results across all 20 runs of experiments. In other words, we have:

∆σ_reduced features < ∆σ_full features  (4.6)

where ∆σ_reduced features is the difference between σ_test and σ_train across the 20 runs with reduced features and ∆σ_full features is the corresponding difference across the 20 runs with full features. The fact that the standard deviation of the test results is lower in most cases (except for recall), and, more importantly, that ∆σ as in (4.6) is smaller in all cases (including recall) when using the reduced features shows that the classifier achieved better generalization compared to the case of using the full feature set.

4.5. Summary and discussion

In this work, we focused on classifying subtle shifts in 'affect' in a completely natural setup without any constraints on the participants. We leveraged the power of neural networks but limited our design to the simplest neural network architecture with a minimum number of nodes to help reduce the computational load. More investigation into the minimum number of frames per second for video recording could further decrease the time and resources needed for extracting the AUs from the video files. PCA was performed to reduce the feature dimension from 17 to 10, which helped reduce the variance in the results. Considering only positive and negative affect, a testing accuracy of 76.8% was achieved, which is, to the best of our knowledge, the best result within the constraints discussed above.
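To make the affect pipeline summarized above concrete, a minimal sketch of a comparable classifier is shown below using scikit-learn. It mirrors the structure described in this chapter (17 AU features, PCA to 10 components, a single hidden layer of 10 nodes, 4-fold cross validation) but is illustrative only: the placeholder data, regularization strength, and solver settings are assumptions, not the exact implementation used in this work.

    # Illustrative sketch of the affect classifier described in this chapter:
    # 17 AU features -> PCA (10 components, ~80% variance) -> one-hidden-layer NN.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # X: per-segment mean AU intensities (n_segments x 17); y: 0 = negative, 1 = positive
    X = np.random.rand(400, 17)       # placeholder data for the sketch
    y = np.random.randint(0, 2, 400)

    clf = make_pipeline(
        PCA(n_components=10),
        MLPClassifier(hidden_layer_sizes=(10,), alpha=1e-2, max_iter=2000),
    )
    scores = cross_val_score(clf, X, y, cv=4)   # 4-fold cross validation, as in the text
    print(scores.mean())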
One observation in this study was a reduction of accuracy when trying to classify the 'neutral' labels. We speculate that 'neutral' labels span more diverse characteristics than 'positive' or 'negative' labels; therefore, more training examples are likely necessary to train the algorithms to correctly classify 'neutral' instances. The bottleneck is that increasing the number of 'neutral' datapoints alone is not helpful, as it results in a skewed dataset that leads the classifier to develop a strong bias toward predicting 'neutral'. In fact, in our experience, most of the labels in a given dataset are marked as 'neutral'. Therefore, for a given dataset, during the training phase, many of the 'neutral' labels were randomly removed to balance the dataset. This means that even more data points have to be collected so that, after balancing the dataset, the remaining labels are sufficient for training the neural network to correctly detect 'neutral' labels. Tackling this challenge is a viable goal for future work.

As for rapport, we developed an architecture that leveraged DTW for gauging synchrony among participants. Five DTW features were constructed from gaze, head coordination, head orientation, AUc, and AUr. Along with these five DTW features, raw data for gaze, head coordination, and AUc were used as input features. By leaving out the head orientation and AUr data, a total of 63 features were used, and 2674 data points were employed for training the neural network. An accuracy of 73.6% was achieved for testing over 20 experiments with a standard deviation of 2.68%. Precision, recall, and F1 score for testing were 0.764, 0.807, and 0.784, respectively. To the best of our knowledge, these are the highest reported metrics for identifying rapport in multiperson groups, and at the highest temporal resolution of 30 seconds. Further research could examine the effect of more output classes, such as high, neutral, and low rapport. The challenge lies in the number of additional datapoints needed for training the network. Moreover, balancing the dataset could become more challenging with a higher number of classes, which in return may require collecting even more data points. Another interesting path for research is incorporating the sequence of data in the analysis. As of now, the classifier does not consider the order of the datapoints. However, the labelers watched the video segments in order, and that naturally affects their perception of 'rapport' and 'affect'. Therefore, employing techniques such as recurrent neural networks and other methods for analyzing sequences of data could further improve the results. Our findings in this work pave the way for, and encourage the community to investigate, the future directions briefly mentioned here.

Disclaimer: A substantial portion of this chapter was published in [86] © 2023, IEEE.

Chapter 5: Advancing integrated electrochemical instruments for point-of-care devices

5.1 Introduction

As described in chapter 2, electrochemical sensing has proven to be an effective approach for monitoring different physiological and environmental parameters. Therefore, implementing miniaturized electrochemical solutions could enhance assistive technologies for human health and wellness.
To this end, researchers have utilized complementary metal-oxide-semiconductor (CMOS) technology to develop small and wearable potentiostats [63], [64], [100], [101], and many advances have been made to develop potentiostats that increase the range of current readout [65], [66], decrease power consumption and size [67], [102], [103], lower the noise [104], widen the dynamic range [105], and support the bidirectional current of electrochemical cells [66]. New processes have also been developed for implementing quasi-reference electrodes on the CMOS chip for a fully integrated electrochemical measurement [106]. Although these advances have enabled miniaturized electrochemical systems, as modern CMOS technologies scale down in size, their supply voltages have become smaller [107]; for example, while an older 0.5 µm CMOS technology used to support a 5 V supply, newer technologies such as 180 nm support a maximum of 1.8 V for regular transistors or 3.3 V in the case of high-voltage transistors. As a result, many electrochemical reactions cannot be supported by modern integrated potentiostats, as illustrated in Figure 5.1.

Figure 5. 1. The graph shows voltammetry of different heavy metals and indicates bias potentials for each target element to obtain peak current (data adapted from [68]). The blue and green bars show ideal ranges of bias potential that are supported with a traditional CMOS potentiostat and our novel potentiostat, respectively, both with a 3.3 V supply. In this example, the reactions for some elements such as Zn and Mn are not supported by a traditional CMOS potentiostat. Note that the gray bar represents VCE-swing, the excess voltage beyond the bias potential required for an electrochemical cell [108].

Since a potentiostat needs to support bidirectional current for redox reactions, only half of the supply voltage is available for each direction in an ideal rail-to-rail operation of the potentiostat. For a 3.3 V supply, this means only 1.65 V is available for each reduction or oxidation reaction. Furthermore, as detailed in section 5.2, because the counter electrode in a typical three-electrode electrochemical cell must be allowed to swing well beyond the bias potential, only a small portion of this 1.65 V is available to be used as bias potential, as illustrated in Figure 5.1. However, many electrochemical reactions, for example for detecting heavy metals such as manganese and zinc, require bias potentials of about 1.6 V and 1.2 V, respectively. As shown in Figure 5.1, these potentials fall outside the window supported by conventional CMOS potentiostats with power supplies of 3.3 V or lower. Therefore, conventional CMOS potentiostat designs implemented in newer technologies with lower supply voltages do not support voltammetry for detecting these elements. On the other hand, older CMOS process nodes, such as 0.5 µm, that support supply voltages greater than 3.3 V are no longer offered by mainstream foundries as they are considered obsolete [109]. Therefore, it is inevitable to utilize the newer CMOS technologies for electrochemical measurements, which also come with the added benefits of smaller feature size, lower power consumption, and higher speed. Consequently, overcoming the issue of limited bias potential in CMOS potentiostats implemented in newer process nodes is crucial to accommodate a wide range of electrochemical reactions in wearable assistive technologies.
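To make the supply-budget argument concrete, the short calculation below checks whether a target bias potential fits inside the window of a conventional design, assuming ideal rail-to-rail operation (half the supply per current direction) and an illustrative value for the counter-electrode swing. The numbers are examples only, not measured values; the Zn and Mn bias potentials are the approximate values quoted above.

    # Back-of-the-envelope check of the bias-potential budget in a conventional CMOS
    # potentiostat. Assumes ideal rail-to-rail operation and an illustrative CE swing.
    def supports_reaction(vdd, v_bias, v_ce_swing):
        available = vdd / 2.0          # half the supply per polarity (WE held at VDD/2)
        return (v_bias + v_ce_swing) <= available

    for element, v_bias in [("Pb", 0.5), ("Zn", 1.2), ("Mn", 1.6)]:
        ok = supports_reaction(vdd=3.3, v_bias=v_bias, v_ce_swing=0.6)  # assumed swing
        print(element, "supported" if ok else "not supported")
    # With these example numbers, Zn and Mn exceed what a 3.3 V conventional design can provide.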
In this work, we introduce a novel potentiostat topology that addresses the limited supply voltage in newer CMOS technologies and supports bidirectional current measurement in a wide range of electrochemical reactions. For a given supply voltage, this new topology nearly doubles the voltage range for the electrochemical cell compared to conventional designs. Hence, it enables detecting a wider range of target elements than any previously reported integrated potentiostat. As desired in most integrated instrumentation circuits, this potentiostat also provides a small form factor and low power consumption for a compact system implementation, which is necessary for wearable applications. We present an in-depth analysis of the voltage requirements of a three-electrode electrochemical cell, as well as the challenges of conventional CMOS potentiostats, in section 5.2. We then present the methodology and design for enhancing the voltage range of the electrochemical cell, along with the results of electrochemical experiments and simulation results of the implemented CMOS potentiostat.

5.2 Manifestation of Electrode Potentials and Challenges for Conventional CMOS Potentiostats

5.2.1 Electrochemical Cell Model and Manifestation of Potentials at Electrodes

As briefly asserted in section 5.1, an important bottleneck in miniaturized CMOS potentiostats is their limited ability to support a wide bias potential window, which restricts the range of electrochemical targets that can be measured using CMOS instrumentation. To elaborate on this point, consider the electrochemical cell model shown in the circle at the center of Figure 5.2. A three-electrode electrochemical cell features a reference electrode (RE), a working electrode (WE), and a counter electrode (CE). The resistance between CE and RE is mainly attributed to the solution resistance. Similarly, the impedance between RE and WE is attributed to the solution resistance in series with a parallel capacitance and resistance that model the double-layer capacitance and charge-transfer resistance at the WE surface.

Figure 5. 2. Schematic of a traditional potentiostat with grounded working electrode. The electrochemical cell model is presented at the center of the figure with a circle symbol [108].

In this three-electrode cell, a bias voltage is traditionally applied to the RE with respect to WE. In other words, VRE-WE is applied to the electrochemical cell as shown in Figure 5.2. In this chapter, we will refer to this applied voltage as Vbias. Note that Vbias is sometimes defined as VWE-RE [8], which is the negative of Vbias as defined here. Both definitions are valid as long as one remains consistent. Therefore, throughout this chapter, we define:

Vbias = VRE-WE = VRE – VWE  (5.1)

This definition facilitates a clearer discussion of the integrated CMOS potentiostat. While Vbias is externally applied between RE and WE, the potential on CE can and will swing beyond Vbias in order to establish the desired electrochemical reaction. Let us define this CE swing voltage as:

VCE-swing = VCE-RE = VCE – VRE  (5.2)

VCE-swing depends on several factors, such as electrolyte concentration and the geometry and material of the electrodes, and it can be as large as Vbias, which extends the maximum potential the potentiostat must support to beyond two times Vbias.
Finally, let us define the full cell potential, Vcell, such that:

Vcell = VCE-WE = VCE – VWE = VCE-swing + Vbias  (5.3)

Based on our extensive experience with integrated electrochemical platforms, we expect voltages at the cell electrodes to generally manifest similar to the graph in Figure 5.3. The absolute value of the cell potential is always more than that of the bias potential due to the existence of the CE-RE resistance. Moreover, by lowering the electrolyte concentration, the CE-RE potential difference further increases due to the increase in the CE-RE resistance. Therefore, for a potentiostat with a limited voltage supply, the voltage swing on CE is the limiting factor.

[Figure 5.3 plots the voltage at the cell terminals (V) versus the applied bias voltage (V), with curves for Vbias, Vcell_a, and Vcell_b.]

Figure 5. 3. Conceptual representation of Vcell and Vbias. VRE-WE (Vbias) is always equivalent to the Vbias voltage applied to the electrochemical cell. VCE-WE (Vcell), however, is more than Vbias and further increases if electrolyte concentration decreases [108].

5.2.2 Challenges of Conventional CMOS Potentiostats

A conventional CMOS potentiostat is shown in Figure 5.2. An operational amplifier is used to apply a bias voltage to an electrochemical cell. The current generated in the electrochemical cell is usually read using a transimpedance amplifier (TIA), as shown in the bottom right of Figure 5.2. The WE of the electrochemical cell in this design is tied to analog ground, which is usually set to Vsupply/2. This allows the potentiostat to support bidirectional current measurement and hence both reduction and oxidation reactions. For instance, in the old 0.5 µm CMOS technology with a 5 V supply, in an ideal rail-to-rail operation of the circuit, the analog ground is set to 2.5 V. Therefore, the available voltage for |Vcell| is 2.5 V in either direction (negative or positive). Basically, the bottom half of the supply range (0 V to 2.5 V) is used to support negative Vcell (recall Vcell = VCE – VWE) and the top half (2.5 V to 5 V) is used to support positive Vcell. Only a portion of this 2.5 V in either direction can be assigned to Vbias because always Vbias