HIGH-PRECISION AND PERSONALIZED WEARABLE SENSING SYSTEMS FOR HEALTHCARE APPLICATIONS By Linlin Tu A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy 2022 ABSTRACT HIGH-PRECISION AND PERSONALIZED WEARABLE SENSING SYSTEMS FOR HEALTHCARE APPLICATIONS By Linlin Tu The cyber-physical system (CPS) has been discussed and studied extensively since 2010. It provides various solutions for monitoring the user’s physical and psychological health states, enhancing the user’s experience, and improving the lifestyle. A variety of mobile internet devices with built-in sensors, such as accelerators, cameras, PPG sensors, pressure sensors, and the microphone, can be leveraged to build mobile cyber-physical applications that collected sensing data from the real world, had data processed, communicated to the internet services and transformed into behavioral and physiological models. The detected results can be used as feedback to help the user understand his/her behavior, improve the lifestyle, or avoid danger. They can also be delivered to therapists to facilitate their diagnose. Designing CPS for health monitoring is challenging due to multiple factors. First of all, the high estimation accuracy is necessary for health monitoring. However, some systems suffer irregular noise. For example, PPG sensors for cardiac health state monitoring are extremely vulnerable to motion noise. Second, to include human in the loop, health moni- toring systems are required to be user-friendly. However, some systems involve cumbersome equipment for a long time of data collection, which is not feasible for daily monitoring. Most importantly, large-scale high-level health-related monitoring systems, such as the systems for human activity recognition, require high accuracy and communication efficiency. How- ever, with users’ raw data uploading to the server, centralized learning fails to protect user’s private information and is communication-inefficient. The research introduced in this dissertation addressed the above three significant chal- lenges in developing health-related monitoring systems. We build a lightweight system for accurate heart rate measurement during exercise, design a smart in-home breathing training system with bio-Feedback via virtual reality (VR) game, and propose federated learning via dynamic layer sharing for human activity recognition. Copyright by LINLIN TU 2022 Thanks to my family, who always supports me. Thanks to my advisor, who always leads me. Thanks to my friend, who shares the laughter with me. Wish everyone all over the world a healthy and happy life, and tomorrow will be better. v ACKNOWLEDGEMENTS Here, I would like to express my deepest gratitude to all – my family, my advisor, my friends, and everyone who helps me in this academic career. None of my achievement could be possible without the support from my family. Since I was 18, I was away from my hometown. Although they were far away, they always showed their hands and cheered me up when I was down. Thanks, Dr. Guoliang Xing, my dear advisor. He offered me a chance to explore the academic world. He guided me from the development of a small App to the processing with millions of data points. His insight helps me solving numerous challenging problems in my research. Thanks, Dr. Jiayu Zhou, who provided valuable suggestions that inspire me to refine my design and give more reliable and useful information for the study of smart system. Thanks, my friends, old ones and new ones. Life of a researcher is sometimes lonely, but I am so lucky to have all of you on my way. We have worked together, played together, and laughed together. One day in the future, we will get back together and share more stories of our colorful life. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 HIGH-PRECISION HEART RATE TRACKING . . . . . . . . . . . 4 2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.1 Contact Sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2 Noise Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.3 Pulse Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.1 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.2 FitBeat Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4.2.1 Walking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4.2.2 Running . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4.2.3 Riding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5 Conclusion of Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 CHAPTER 3 RESPIRATORY SINUS ARRHYTHMIA BIOFEEDBACK-BASED BREATHING TRAINING . . . . . . . . . . . . . . . . . . . . . . . 22 3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2 related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.1 RSA Biofeedback Training . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.2 Breathing training as a stress mitigating intervention . . . . . . . . . 28 3.2.3 Respiration pattern measurement . . . . . . . . . . . . . . . . . . . . 29 3.2.4 Bio-responsive VR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.3 System Requirements and Challenges . . . . . . . . . . . . . . . . . . . . . . 30 3.4 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.4.1 Physiological Measurement . . . . . . . . . . . . . . . . . . . . . . . . 31 3.4.1.1 IBI extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.4.1.2 Breathing pattern extraction . . . . . . . . . . . . . . . . . 33 3.4.1.3 RSA quantification . . . . . . . . . . . . . . . . . . . . . . . 34 3.4.2 Real-time Breathing Pattern Recommendation . . . . . . . . . . . . . 35 3.4.2.1 Intelligent Pacing . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4.2.2 Dynamic Estimation . . . . . . . . . . . . . . . . . . . . . . 37 3.5 VR Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.5.1 Balloon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.5.2 Pilot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 vii 3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.6.1 Experiment settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.6.2 Evaluation of physiological measurement . . . . . . . . . . . . . . . . 42 3.6.2.1 Evaluation of breathing pattern extraction . . . . . . . . . . 42 3.6.2.2 Evaluation of IBI extraction . . . . . . . . . . . . . . . . . . 42 3.6.3 Evaluation of Intelligent Breathing pattern recommendation . . . . . 43 3.6.3.1 RSA maximization . . . . . . . . . . . . . . . . . . . . . . . 44 3.6.3.2 Stress Reduction . . . . . . . . . . . . . . . . . . . . . . . . 46 3.6.3.3 Training Experience . . . . . . . . . . . . . . . . . . . . . . 49 3.6.4 Discussion of game designs . . . . . . . . . . . . . . . . . . . . . . . . 50 3.7 Conclusion of Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 CHAPTER 4 PERSONALIZED FEDERATED LEARNING FOR HUMAN AC- TIVITY RECOGNITION . . . . . . . . . . . . . . . . . . . . . . . 53 4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2.1 Deep learning for HAR. . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2.2 Federated learning (FL) . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2.3 FL personalization via model sharing. . . . . . . . . . . . . . . . . . . 55 4.3 A Motivation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.5 Dynamic Layer-wise Federated deep learning framework . . . . . . . . . . . . 64 4.5.1 Model Affinity-based User Grouping . . . . . . . . . . . . . . . . . . 64 4.5.2 Intra-group Layer-wise Model Merging . . . . . . . . . . . . . . . . . 66 4.5.3 Bottom-up Layer-wise Model Aggregation . . . . . . . . . . . . . . . 68 4.5.4 Reducing Communication Overhead . . . . . . . . . . . . . . . . . . . 70 4.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.6.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.6.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.6.3 Validation on LiDAR Dataset . . . . . . . . . . . . . . . . . . . . . . 75 4.6.4 Performance on Different Datasets . . . . . . . . . . . . . . . . . . . 78 4.6.5 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.6.5.1 Overall accuracy . . . . . . . . . . . . . . . . . . . . . . . . 80 4.6.5.2 Communication overhead . . . . . . . . . . . . . . . . . . . 81 4.6.6 Impact of Local Computation Rounds . . . . . . . . . . . . . . . . . 82 4.7 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.7.1 Convergence of FedDL . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.7.2 Scalability of FedDL . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.7.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.8 Conclusion of Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 CHAPTER 5 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 viii LIST OF TABLES Table 3.1: Assess the difference of self-reported training experience between BreathCoach- Balloon and Traditional-Balloon training using paired t-tests. Compared with traditional training, the frequency of feeling distracted, anxious, hard to follow stimulus and breathing too deeply significantly decreases when training with BreathCoach (p < 0.05). . . . . . . . . . . . . . . . . . 50 Table 4.1: Five HAR datasets (UWB, Depth Images, HARBOX-IMU, IMU and LiDAR). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 ix LIST OF FIGURES Figure 2.1: Measurement error of the built-in PPG sensor of Moto 360 while the subject is running. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Figure 2.2: The signal processing pipeline of FitBeat. . . . . . . . . . . . . . . . . . 9 Figure 2.3: The waveform of PPG signal and its variance. The PPG signal are recorded while subject performing wrist movements intensively . . . . . . 10 Figure 2.4: the waveform of motion-reduced PPG signal. The last four plots shown the result of adaptive noise cancellation with four different reference input, including acceleration from axis X, Y , Z and a linear summation, (X + Y + Z). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Figure 2.5: the flowchart of Adaptive Noise Cancellation. . . . . . . . . . . . . . . . 12 Figure 2.6: Average error of heart rate estimation using PPG signal processed by LMS and RLS filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Figure 2.7: The waveform of PPG signal, including the raw PPG signal and the ones processed using LMS filter and RLS filter. . . . . . . . . . . . . . . 13 Figure 2.8: The spectrum of 10-second PPG signal, including the ground truth, the raw PPG signal and the noise-reduced one. . . . . . . . . . . . . . . . . . 14 Figure 2.9: Heart rate estimation while walking. . . . . . . . . . . . . . . . . . . . . 16 Figure 2.10: Heart rate estimation during a 10-minute walking. . . . . . . . . . . . . . 16 Figure 2.11: The spectrums of reference PPG signal, noise-reduced PPG signal, and accelerometer data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Figure 2.12: The waveform and variance of PPG signal. . . . . . . . . . . . . . . . . . 18 Figure 2.13: Average estimation error while running. . . . . . . . . . . . . . . . . . . 19 Figure 2.14: Heart rate monitoring while running. . . . . . . . . . . . . . . . . . . . 20 Figure 2.15: Average estimation error while riding. . . . . . . . . . . . . . . . . . . . 20 Figure 2.16: Heart rate monitoring while riding. . . . . . . . . . . . . . . . . . . . . 20 x Figure 3.1: A Comparison between the traditional approach and BreathCoach for respiratory sinus arrhythmia biofeedback-based breathing training (RSA- BT). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Figure 3.2: Examples of the products on the market with functions related to breath- ing training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Figure 3.3: System overview of BreathCoach. . . . . . . . . . . . . . . . . . . . . . . 32 Figure 3.4: An example of inter-beat interval (IBI) extraction based on 6-second pulse wave data from PPG sensor. . . . . . . . . . . . . . . . . . . . . . 33 Figure 3.5: An example of breathing pattern extraction based on 15-second acceleration. 34 Figure 3.6: An example illustrating Peak-valley algorithm for RSA quantification. Per breath, RSA is calculated as the difference between maximum and minimum inter-beat interval (IBI). . . . . . . . . . . . . . . . . . . . . . 35 Figure 3.7: An example showing how Intelligent pacing works. At T1, the sys- tem switched from IBI-based to RF-based pacer, as significant postural changes interrupted IBI extraction; At T2, the system switched back to IBI-based mechanism, as the RSA exceeded RSAhigh . . . . . . . . . . . . 36 Figure 3.8: Dynamic estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Figure 3.9: Screenshots of two proof-of-concept VR games. . . . . . . . . . . . . . . 38 Figure 3.10: Schematic illustration of the study protocol. . . . . . . . . . . . . . . . . 41 Figure 3.11: The error distribution (left) and CDF (right) of the breath-by-breath detection result of BreathCoach collected from 10 subjects. The average absolute error of breathing cycle duration (Durbc ), which is used to derive RSA, is 0.61 s. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Figure 3.12: The error distribution (left) and CDF (right) of the BreathCoach’s inter- beat interval (IBI) extraction from 10 subjects. The average absolute error of IBI, which is used for RSA assessment and real-time breathing pattern recommendation, is 9.6 ms. . . . . . . . . . . . . . . . . . . . . . 44 Figure 3.13: Evaluating the effect of BreathCoach on RSA maximization by observ- ing the difference between RSA and RSAref (Difrsa ). RSAref , the maximum RSA amplitude achieved by breathing at RF during RF de- tection, acts as a reference in the assessment of the effect on RSA max- imization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 xi Figure 3.14: The distribution (left) and CDF (right) of the difference between RSA and RSAref (Difrsa ) collected from BreathCoach-based training, show- ing that BreathCoach significantly improves the performance in maxi- mizing users’ RSA throughout the training compared with traditional training approach (p < 0.05). . . . . . . . . . . . . . . . . . . . . . . . . 47 Figure 3.15: Compare the 8-min HRV series of pre-training task and post-training task for subject 1 with the left from BreathCoach and the right from traditional training. After training with BreathCoach (right), there is an increment in three features: HRV amplitude during cognitive task, the speed of HRV increasing to the maximum amplitude right after 5- min task and the maximum recovery amplitude during break. However, these gains are hardly observed after traditional training (left). . . . . . . 48 Figure 3.16: Visualize the change in the mean of HRV during the cognitive task (µHRV ), recovery speed and amplitude of HRV during the post-task rest (SpeedRecHRV and AmpRecHRV ) after BreathCoach-based training and traditional training for each subject. When training with Breath- Coach, there is a significant post-training improvements in stress reduc- tion according to the three metric: µHRV (p < 0.05), SpeedRecHRV (p < 0.05) and AmpRecHRV (p < 0.05). However, the significant im- provement is not observed after traditional breathing training. . . . . . . 49 Figure 3.17: Compare the distributions of BR from BreathCoach and traditional training for each subject. It shows that BreathCoach enables users to breath significantly more steady while slowing their respiration accord- ing to the metric, the STD of BR for each training (p = 0.0019). . . . . . 51 Figure 4.1: The data of “typing” from the HARBox dataset after reducing dimen- sion to 2D using PCA. There exists a clear group relationship among different subjects’ data. . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Figure 4.2: Correlation matrix of 6 users’ HARBOX data. Each number is the Pearson correlation coefficient (PCC), measuring the linear correlation between two users’ data. It is obvious there are two groups, (n1 , n2 ) and (n3 , n4 , n5 , n6 ). However, the users within each group are of different degrees of similarity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Figure 4.3: Illustration of three sharing schemes for a group. . . . . . . . . . . . . . 60 Figure 4.4: Illustration of the performance of federated learning under four sharing schemes. Layer-wise sharing scheme outperforms other sharing schemes in overall accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 xii Figure 4.5: Illustration of the dynamic and hierarchical federated learning frame- work of FedDL when learning 3-layer models for 6 users. . . . . . . . . . 62 Figure 4.6: The system architecture of FedDL. Each grouping / model-merging round mainly consists of 4 steps. . . . . . . . . . . . . . . . . . . . . . . 63 Figure 4.7: The procedure of model-affinity-based grouping. It consists of three steps: 1. Calculate the affinity matrix; 2. Group users based on the affinity matrix and previous grouping results; 3. Update the layer-wise sharing structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Figure 4.8: Illustration of the layer-wise model merging based on the grouping re- sults, Groups. Only lower 3 layers of models are transferred between six users and the server for model merging. . . . . . . . . . . . . . . . . . 67 Figure 4.9: The preprocessing of LiDAR data for the recognition of activities, in- cluding walking, sitting, standing, bending, checking the watch and phone calls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Figure 4.10: Comparison of different approaches’ performance on the LiDAR dataset. FedDL outperforms other approaches in accuracy performance by more than 15%, and save about 42.6% communication overhead compared with approaches that share the whole models (Fedavg and pFedMe). . . . 76 Figure 4.11: The sharing structure for 10 users, which is dynamically learned by FedDL. n2 , and n1 share more layers as they have similar behavior habits and biological features. . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure 4.12: Comparison of different approaches’ performance on four datasets, UWB, HARBOX, Depth Images and IMU. FedDL outperforms other approaches in accuracy performance and has a lower communication overhead than approaches that share the whole models (Fedavg and pFedMe). . . . . . 78 Figure 4.13: Comparison of different approaches’ performance on Depth images datasets with different number of local computation rounds (R = 20, 40, 60). All the methods benefits from a larger R, and FedDL maintains the best accuracy and communication performance with different numbers of R. . 80 Figure 4.14: Comparison of different approaches’ performance on 30-, 60- and 90-user HARBOX datasets. FedDL outperforms FedAvg, FedPer and pFedMe in both overall accuracy and communication overhead. . . . . . . . . . . 81 xiii Figure 4.15: The training loss and testing accuracy of a specific user’s model chang- ing over global rounds with different settings of R. Larger R improves convergence, especially for FedAvg. However, FedDL will always con- verge fastest with different local computation rounds R. . . . . . . . . . 83 xiv CHAPTER 1 INTRODUCTION The confluence of innovations in sensor development, the emergence of the Internet of Things (IoT), and the ubiquity of mobile devices has given us a variety of new sensors and systems that can be utilized to build mobile cyber-physical applications for the improvement of per- sonal health, well-being, and fitness. Such devices are emerging regularly and address a diverse set of applications, ranging from physical activity, endurance sports, and resistance training to sleep monitoring, mindfulness practice, posture monitoring, weight management, breathing techniques, cardiac health status [128, 80, 113]. Some examples of recent research conducted in the area. Xiao Sun et al. designed a smartphone-based application, which leverages the built-in microphone to unobtrusively detect acoustic events related to respira- tion symptoms [117]. Shahriar et al. proposed using a sensor-equipped earphone to detect the user’s heart rate and built an automated music recommendation system to help the user maintain a target heart rate [91]. When designing state-of-the-art cyber-physical applications for health-related applica- tions, several important considerations must be addressed. The first issue is what optimiza- tions of the measurement technologies are necessary to improve estimation accuracy. Take the detection of cardiac information as an example. In this application, heart rate for fitness is commonly tracked using wrist-worn wearables. However, a major drawback of using these sensors is that significant noise caused by intensive wrist movements can corrupt measure- ments. As a result, complex filtering algorithms and designs must be created and tailored to each application. To include humans in the loop, a second issue that must be considered is the user-friendliness of applications. Attention must be paid to designing an unobtru- sive sensing method that is easy-to-operate and convenient such that users can access the healthcare applications frequently. One example of problems arising from this issue is in the design of RSA-BT (Respiratory Sinus Arrhythmia biofeedback-based Breathing Training), a 1 cardio-respiratory intervention that has been commonly used as a complementary treatment to respiratory diseases and an exercise to help manage stress and anxiety. Despite its health benefits, RSA-BT today still relies on in-person sessions and cumbersome sensing devices in a clinical setting, limiting its accessibility. Furthermore, to design smart systems for large- scale applications, we need to solve several limitations of centralized learning. With users’ raw data uploading to the server, centralized learning fails to protect users’ private informa- tion and is communication-inefficient. Distributed learning [49, 132] has been proposed for large-scale smart systems. The distributed learning paradigm only requires users to upload their model weights for collaborative learning, avoiding sharing users’ raw data during the learning process. Several Federated learning systems for Human activity recognition (HAR) [49, 22] have been developed to enable continuous monitoring of human behaviors without sharing users’ raw data. However, standard federated learning limits the performance of smart systems, as the accuracy of models learned in this approach can be largely influenced by the diversity of users. Our research improved the precision of smartwatch-based cardiac measurement, enabled the wrist-band-based unobtrusive and continuous logging of users’ cardiac and respiratory information to make the intervention accessible daily in-home, and finally, proposed to use federated learning to train the deep model for HAR. Specifically, first, a lightweight system, Fitbeat, was developed to enable accurate heart rate tracking on wrist-type during intensive exercise [123]. After obtaining accurate physiological signals from the system, we include humans in the loop and design the BreathCoach — a smart and unobtrusive system that enables in-home RSA biofeedback-based Breathing Training (RSA-BT) using smartphone- based virtual reality in conjunction with sensors on a smartwatch [121]. Finally, we propose FedDL, a novel federated learning system for the large-scale HAR, that can dynamically capture the underlying user relationships and apply them to learn personalized learning models for different users [124]. The rest of the thesis is organized as follows: Chapter 2 - the research on high-precision 2 heart rate tracking; Chapter 3 - the research about BreathCoach, an in-home RSA-based Breathing Training system; Chapter 4 - the latest proposed research about Human Activ- ity Recognition using personalized federated deep learning. The last chapter presents the conclusion. 3 CHAPTER 2 HIGH-PRECISION HEART RATE TRACKING Tracking heart rate for fitness using wrist-type wearables is challenging, because of the significant noise caused by intensive wrist movements. This chapter presents FitBeat – a lightweight system that enables accurate heart rate tracking on wrist-type wearables during intensive exercises. Unlike existing approaches that rely on computation-intensive signal processing, FitBeat integrates and augments standard filter and spectral analysis tool, which achieves comparable accuracy while significantly reducing computational overhead. FitBeat integrates contact sensing, motion sensing and simple spectral analysis algorithms to suppress various error sources. This chapter is adapted from a publication [123]. The author of the dissertation is the first author of the original work. ”We” in this chapter refers to the author of the original publication. This work contains the App design on Android devices. The author recruited all the subjects, then collected and processed the data and the ground truth. 2.1 Background Recent years have witnessed the proliferation of wrist-type smart wearables. A desirable feature of these devices is tracking heart rate for fitness, which is essential for exercisers to monitor health conditions and control training loads. Wrist-type wearables typically employ photoplethysmogram (PPG) to measure heart rate. Specifically, a PPG sensor consists of a LED and a photo detector. The LED emits light, which is absorbed by blood flow when traveling through the tissue. The photo detector then measures the intensity of reflected light to sense periodic blood flow variation caused by cardiac cycle, which can be used to estimate heart rate. However, tracking heart rate for fitness using wrist-type wearables poses several key challenges. First, since the capillary network around wrist is relatively sparse, PPG signals 4 120 Zephyr Mio 110 100 heart rate (bpm) 90 80 70 60 0 100 200 300 400 500 600 time (s) Figure 2.1: Measurement error of the built-in PPG sensor of Moto 360 while the subject is running. observed by wrist-worn sensors are usually very weak, which makes the signal extremely vulnerable to noise. Second, during intensive exercise the subject’s wrist muscle may flex frequently, resulting in an unstable contact between the subject’s skin and the PPG sensor, which causes significant noise. Third, in addition to causing unstable contact, intensive wrist motion affects blood flow, which introduces additional noise that may severely degrade the accuracy of heart rate measurement, particularly when the motion-induced noise is over- lapping with the desired signal in frequency domain. Our experiments show that popular wearable devices like Mio Alpha and Moto 360 suffer extremely poor performance when measuring heart rate in the presence of intensive wrist movements. For example, as shown in Fig. 2.1, when the subject is running, the heart rate estimation error of Mio Alpha can be as high as 50 beats per minute (bpm), compared with the ground truth measured using a motion-resistant ECG sensor. Similar results were observed on other popular wearable devices like Basis Peak and Fitbit Charge HR [69]. To improve the accuracy of heart rate measurement, numerous approaches have been proposed to remove motion-induced noise, including wavelet transformation, independent component analysis [65], moving average filter, adaptive noise cancellation [118], time fre- 5 quency methods, and principle component analysis [97]. However, existing approaches are mainly designed for PPG sensors worn on fingertips [104], earlobes [94], or forehead [66]. They perform poorly when sensors are worn on the wrists, because noise caused by wrist motion is much more complex and stronger than those caused by fingers, ears and head. Although there exist a few methods for reducing noise caused by wrist movements, they rely on complex signal processing algorithms. For example, in addition to standard filtering and spectral analysis, TROIKA [135] relies on computation-intensive singular spectrum analysis and FOCUSS algorithm, which significantly increases overhead. In this chapter, we present Fitbeat – a lightweight system that enables accurate heart rate tracking on wrist-type wearables during intensive exercises. Unlike existing approaches that rely on computation-intensive signal processing [110][38], FitBeat integrates and aug- ments only standard filter and spectral analysis tool, which achieves comparable accuracy while significantly reducing computational overhead. To achieve this goal, FitBeat integrates contact sensing, motion sensing, and simple spectral analysis algorithm to suppress various error sources. Specifically, to remove noise caused by unstable contact between the subject’s skin and the PPG sensor, FitBeat performs contact sensing, which measures the amplitude and variance of PPG signal to identify and remove distorted PPG signal samples. To reduce motion artifacts caused by complex and intensive wrist motions, FitBeat exploits accelerom- eter data to rebuild the waveform of motion-induced noise, and then subtracts it from PPG signal. To extract precise heart rate from raw PPG samples, FitBeat employs a simple pulse identification algorithm, which accurately identifies the spectral peak of heart rate by co-analyzing the spectrum of PPG signal and acceleration data. FitBeat is implemented on Moto 360 – a COTS smartwatch. We evaluate the performance of FitBeat for workouts of different intensities, including walking, running and riding. Experimental results involving 10 subjects show that the average error of FitBeat is around 4 bpm, which improves heart rate accuracy by 10x compared with the default heart rate tracker of Moto 360. 6 2.2 Related work To improve the accuracy of heart rate measurement, several approaches have been pro- posed to remove motion-induced noise from PPG signal. Kim et al. in [65] propose to use independent components analysis (ICA) for reducing motion-induced noise. In ICA, PPG signals are modeled as the combination of PPG signals and motion artifacts.When applied to the contaminated PPG signal, ICA separates the clean PPG signal from noise compo- nents. However, ICA assumes that all signal components are mutually independent with each other, which is not true in PPG signal. For example, intensive wrist movements always affect the subject’s cardiac activity, which implies that clean PPG signal is correlated with motion-induced noise. Besides, ICA relies on multiple PPG sensors, which are usually not available on COTS wearable devices. Another approach to reducing motion-induced noise is adaptive noise cancellation (ANC) [130]. ANC estimates motion-induced noise components using acceleration data and then substracts the estimated noise from PPG signal. However, when the hand movements are irregular or the wristband is loosely attached to the subject’s skin, the estimated noise may not be well correlated with the noise. Consequently, motion noise can not be removed completely. There exist two classes of signal processing algorithms to extract heart rate from noise- reduced PPG signal, including moving window and spectral analysis [55]. Previous studies have shown that spectral analysis is more accurate than moving window. Specifically, spec- tral analysis algorithm estimates heart rate by analyzing the spectrum of PPG signal and then locating the largest spectral peak in the possible range of cardiac cycle. However, this algorithm performs poorly in the presence of residual motion-induced noise, because residual noise may cause multiple peaks around the frequency of cardiac cycle when PPG signal is noisy. 7 2.3 System Design FitBeat is designed for accurate heart rate tracking using wrist-type wearables during intensive exercises. To achieve this goal, FitBeat addresses two key challenges. First, during intensive exercises, the subject’s wrist muscle may flex frequently, which causes the band of the wearable to tighten and loosen, resulting in an unstable contact between the subject’s skin and the PPG sensor that significantly distorts PPG signals. Second, in addition to causing unstable contacts, the wrist motion may affect blood flow, which introduces additional noise that interferes with heart rate measurements. FitBeat addresses the above challenges by integrating contact sensing, motion sensing, and simple signal processing algorithm to suppress various error sources. The architecture of FitBeat is illustrated in Fig. 2.2. Specifically, FitBeat consists of three major components. 1. Based on the amplitude and variance of PPG signal, the contact sensing component continuously monitors the contact between the PPG sensor and the subject’s skin, and removes those signal samples distorted by unstable contact. 2. The noise reduction component analyzes both PPG signal and accelerometer data to remove motion-induced noise. It exploits accelerometer data to rebuild the waveform of motion-induced noise based on an empirical model, and then subtracts the noise from PPG signal. The empirical model is refined using iterative adaptive filtering to improve accuracy. 3. The pulse identification component further reduces the residual motion-induced noise by co-analyzing the spectrum of PPG signal and accelerometer data, and then performs spectral analysis to accurately identify the pulse corresponding to the heart rate. 8 Figure 2.2: The signal processing pipeline of FitBeat. 2.3.1 Contact Sensing When the subject is moving intensively, his/her wrist muscle may flex frequently, which may tighten and loosen the contact between the PPG sensor and the subject’s skin. The impact on PPG signal is two-fold. First, when wrist flex loosens the contact, the PPG sensor will be exposed to an increased level of ambient light, which overwhelms the pulsatility of PPG signal that characterizes cardiac cycles. Second, when the contact varies frequently, a new pulsatile components will be imposed to the original PPG signal, which interferes with heart rate measurements. To maintain accurate heart rate measurements in the presence of intensive wrist move- ments, FitBeat continuously senses the contact between the PPG sensor and the subject’s skin, and removes those signal samples that are distorted by unstable contact. Specifically, FitBeat identifies distorted PPG signal samples based on their amplitudes and variances. When the PPG sensor is exposed to an increased level of ambient light due to a loosened contact, the amplitude of the PPG signal will experience a disruptive increase. In addition, when the contact between the PPG sensor and the subject’s skin is unstable, the PPG signal will exhibit a large variance. For example, Fig. 2.3 shows the amplitude and variance of PPG signals when the subject performs intensive wrist movements. It’s obvious that both the amplitude and variance of PPG signals are much larger in the time periods from 30s to 130s and from 280s to 580s, which correspond to the time period when the subject keeps moving his hands. Based on the above observations, FitBeat removes a PPG signal sample if its amplitude is higher than a pre-defined threshold, and excludes signal samples from heart rate derivation 9 7 2 ×10 1.5 PPG signals 1 0.5 0 0 200 400 600 time (s) 13 Variance of PPG signals 8 ×10 6 4 2 0 0 200 400 600 time (s) Figure 2.3: The waveform of PPG signal and its variance. The PPG signal are recorded while subject performing wrist movements intensively if the variance measured in a window is larger than a pre-defined threshold. We determine the thresholds of amplitude and variance based on empirical experiments. According to our measurements, the thresholds of amplitude and variance are set to 106 and 5 × 108 , respectively. 2.3.2 Noise Reduction During intensive exercises, the subject’s motion may affect his/her blood flow around wrist, imposing another noise component into the original PPG signal. To address this prob- lem, FitBeat exploits the accelerometer of wearable to sense the subject’s motion, estimates motion-induced noise, and then subtracts the noise from the original PPG signal. The framework of the noise reduction component is illustrated in Fig. 2.5. Specifically, the input signal ppg(i) is a combination of motion-induced noise n(i) and the desired PPG signal s(i) that characterizes cardiac cycles. The accelerometer data acc(i) is used to derive n′ (i) as an approximate estimation of n(i). The adaptive filter then iteratively optimizes its coefficients to improve the accuracy of n′ (i). The process typically converges in a few rounds. Finally, 10 4 Raw PPG signals 1.55 ×10 1.5 1.45 0 2 4 6 8 10 4 reference input: X 5 ×10 0 -5 0 2 4 6 8 10 4 reference input: Y PPG signal 1 ×10 0 -1 0 2 4 6 8 10 4 reference input: Z 1 ×10 0 -1 0 2 4 6 8 10 4 reference input: X+Y+Z 2 ×10 0 -2 0 2 4 6 8 10 time (s) Figure 2.4: the waveform of motion-reduced PPG signal. The last four plots shown the result of adaptive noise cancellation with four different reference input, including acceleration from axis X, Y , Z and a linear summation, (X + Y + Z). the estimated n′ (i) is subtracted from ppg(i) to suppress motion-induced noise. To re-build the waveform of motion-induced noise, FitBeat models n′ (i) as a function of accelerometer data. Specifically, we derive the model based on extensive empirical measure- ments. First, we sample the X, Y, and Z axis of accelerometer and collect PPG signals for different subjects during typical workouts such as riding, running, and walking, etc. Then, we evaluate different polynomials consisting of X, Y, and Z to study their accuracy when re-building the waveform of motion-induced noise. Based on 20 groups of experiments, we find that the linear combination, i.e., (X +Y +Z) yields the best accuracy. For example, Fig. 2.4 compares the PPG signal generated by the noise reduction component after subtracting noise waveforms modeled using different polynomials. As shown in the figure, the waveform of PPG signal is the smoothest when using (X + Y + Z) to estimate motion-induced noise. We note that this result is different from previous studies [104][94][66], which show that motion-induced noise is best modeled using one axis of accelerometer data when the heart rate sensor is worn on the forehead, earlobe or finger of the subject. This is because, during exercises, the motion of the subject’s wrist can be along any possible direction, which is 11 Figure 2.5: the flowchart of Adaptive Noise Cancellation. much more complex than those of head, ears, and fingers. As a result, models that account for only one axis of accelerometer are not enough to accurately characterize noise caused by wrist movements. Based on the above observation, we design the noise reduction component shown in Fig. 2.5 as follows. acc(i) = x(i) + y(i) + z(i) (2.1) λ−1 P(i − 1)acc(i) k(i) = (2.2) 1 + λ−1 accH (i)P(i − 1)acc(i) n′ (i) = wT (i)acc(i) (2.3) s′ (i) = ppg(i) − n′ (i) (2.4) w(i) = w(i − 1) + k(i)s′ (i) (2.5) P(i) = λ−1 P(i − 1) − λ−1 k(i)accH (i)P(i − 1) (2.6) where, • i denotes the current index of time window, • acc(i) is the vector of buffered acceleration at step i, • P(i) denotes the inversive correlation matrixe at step i, • k(i) is the gain vector at step i, • w(i) is the vector of filter tap at step i, 12 5 Raw PPG signals 14 2.6 ×10 LMS RLS 2.4 12 2.2 2 Average HR Error (bmp) 10 0 2 4 6 8 10 7 2 ×10 LMS-processed PPG signals 8 0 6 -2 0 2 4 6 8 10 5 4 4 ×10 RLS-processed PPG signals 2 2 0 0 -2 1 2 3 0 2 4 6 8 10 subject No. time (s) Figure 2.6: Average error of heart rate esti- Figure 2.7: The waveform of PPG signal, mation using PPG signal processed by LMS including the raw PPG signal and the ones and RLS filter. processed using LMS filter and RLS filter. • n′ (i) is estimated noise, at step i, • s′ (i) is the noise-reduced PPG signal at step i, • ppg(i) is the raw PPG signal at step i, • λ denotes the forgetting factor. acc, P and w are all column vectors of the same length. To improve the accuracy of noise reduction, FitBeat needs to optimize the coefficients of adaptive filter w(i). While previous heart rate monitoring systems typically employ the Least Mean Square (LMS) algorithm to approach this problem, our measurements find that Recursive Least Square (RLS) algorithm [102] performs better in the presence of intensive wrist movements. Fig. 2.6 shows the average estimation error for three subjects during a workout of 10-minute running. As shown in the figure, noise reduction using RLS-based adaptive filter is more accurate. Fig. 2.7 shows the waveform of PPG signal when filter contaminated PPG signals with LMS and RLS-based adaptive filter. As shown in the figure, the signal waveform generated by RLS is much smoother. Based on the above observations, FitBeat employs RLS algorithms to optimize adaptive filter coefficients, and set the forgetting factor of RLS filter to 0.98 based on empirical measurements. 13 PPG signal without motion noise 100 50 0 0.5 1 1.5 2 2.5 3 PPG signal with motion noise 100 50 0 0.5 1 1.5 2 2.5 3 PPG signal processed by ADF 8 6 4 2 0 0.5 1 1.5 2 2.5 3 frequency (Hz) Figure 2.8: The spectrum of 10-second PPG signal, including the ground truth, the raw PPG signal and the noise-reduced one. 2.3.3 Pulse Identification In the presence of intensive wrist movements, noise reduction cannot completely remove all motion-induced noise components. As an example, Fig. 2.8 compares the spectrum of ground truth, raw PPG signal, and noise-reduced PPG signal when the subject continuously moving his wrist. The ground truth is collected from the other wrist of the subject, which keeps still during measurement. As shown in the figure, in the spectrum of noise-reduced PPG signal, residual motion-induced noise causes multiple peaks in the segment from 0.8 Hz to 3.2 Hz, which significantly disturbs heart rate measurements. To address the above problem, FitBeat employs a simple spectral analysis algorithm to suppress residual motion-induced noise, which allows it to accurately identify the pulse that corresponds to the heart rate. The basic idea is to co-analyze the spectrum of PPG signal and accelerometer data. Specifically, FitBeat first filters PPG signal and accelerometer data with a Savitzky-Golay (SG) filter to remove high-frequency noise, and then performs Fast Fourier Transform (FFT). To identify the pulse that corresponds to the heart rate, FitBeat performs spectral analysis following two steps. 14 1. Locate the highest peak between 0.8 Hz and 3.2 Hz in the spectrum of PPG signal. Denote the frequency of the located peak as fp . If there is only one peak, return 60×fp as the heart rate measurement. 2. If there exist multiple peaks, examine the spectrum of accelerometer data, and check if there is a peak at fp . If not, return 60 × fp as the heart rate measurement. Otherwise, check the amplitude of the peak at fp in the spectrum of accelerometer data T . If the amplitude is higher than a pre-defined threshold, remove the peak at fp from the spectrum of PPG signal, and repeat step 1. Otherwise, return 60 × fp as a heart rate measurement. The above algorithm iteratively cleans the PPG signal by removing spectral peaks caused by residual noise. To identify motion-induced peaks, we determine the threshold T based on empirical measurements, and set T = 10000 in FitBeat. 2.4 Evaluation 2.4.1 Experiment Settings We evaluate FitBeat for typical workouts of different exercise intensities, and compare it with three baselines, including: • Baseline-1: the default heart rate monitoring app of Moto 360. • Baseline-2: a variant of FitBeat, where contact sensing and pulse identification are disabled during heart rate measurement. • Baseline-3: another variant of FitBeat, where only contact sensing is disabled. We compare FitBeat with baseline-2 and baseline-3 to study the effects of contact sensing and pulse identification. In addition, we employ Zephyr HxM BT – a strap with built-in ECG sensor – to collect ground truth. 15 40 Average HR Error (bmp) baseline 1 baseline 2 Average HR Error (bmp) baseline 3 35 30 FitBeat 30 25 20 20 15 10 10 0 5 1 2 3 4 5 6 7 8 9 10 subject No. baseline-1 baseline-2 baseline-3 FitBeat ((a)) Average estimation error. ((b)) The distribution of average estimation error. Figure 2.9: Heart rate estimation while walking. 150 ground truth 140 baseline 1 FitBeat baseline 3 130 baseline 2 120 Heart rate (bpm) 110 100 90 80 70 60 50 0 200 400 600 time(s) Figure 2.10: Heart rate estimation during a 10-minute walking. To evaluate FitBeat, we recruited 10 subjects and collected data from different workouts including 10 walks, 3 runs and 3 rides. Our study along with its data collection procedure was approved by the Institutional Review Boards (IRB) at Michigan State University. All the subjects voluntarily agreed to help with data collection, and signed a consent form. In order to collect data, each subject used a smartwatch (Moto 360), a Bluetooth chest strap (Zephyr HxM BT), and a smartphone (Google Nexus 4) while doing exercise. During data collection, the PPG sensor and the accelerometer of Moto 360 are continuously sampled at 25 Hz. At the same time, the heart rate reported by the default app of Moto 360 is recorded at 1 Hz. To obtain ground-truth, we log the heart rate measured by Zephyr HxM BT, which uses ECG sensor that is resistant to body movements. The raw PPG signal and accelerometer data are then transferred to the smartphone for heart rate estimation. 16 We evaluate FitBeat based on two metrics, including Absolute Heart Rate Error (Errabs ), and Average Estimation Error (µ). Specifically, Errabs is the absolute estimation error per minute, and µ is computed as the average of Errabs in a 10-minute window. They are computed as: Errabs (i) = |BP Mest (i) − BP Mtrue (i)| (2.7) N 1 X µ= Errabs (i) (2.8) N i=1 where, • Errabs (i) denotes the absolute estimation error in the i-th time window; • N is the total number of estimation windows; • BP Mest (i) denotes the estimated heart rate in the i-th time window measured in beats per minute (bpm), • BP Mtrue (i) denotes the ECG-based heart rate measured in beats per minute (bpm), which is used as ground truth, 2.4.2 FitBeat Performance In the following, we evaluate FitBeat under different levels of exercise intensity, including walking, running, and riding. 2.4.2.1 Walking We first evaluate FitBeat while subjects are walking at normal speed. Fig. 2.9(a) shows the average estimation error µ for all subjects. Fig. 2.9(b) compares the distributions of µ for FitBeat and baselines. As shown in Fig. 2.9(b), when using FitBeat, µ ranges from 2.43 to 8.13 with a median of 4.27, outperforming all baselines. 17 PPG signal without motion noise 300 200 100 0 0.5 1 1.5 2 2.5 3 PPG signal processed by ADF 3000 2000 1000 0 0.5 1 1.5 2 2.5 3 4 Acceleration 2 ×10 1 0 0.5 1 1.5 2 2.5 3 frequency (Hz) Figure 2.11: The spectrums of reference PPG signal, noise-reduced PPG signal, and ac- celerometer data. 5 4 ×10 2 PPG signal 0 -2 -4 -6 0 200 400 600 5 Variance of PPG signals 6 ×10 4 2 0 0 200 400 600 time(s) Figure 2.12: The waveform and variance of PPG signal. To further study the performance of FitBeat, Fig. 2.10 shows the trace of estimated heart rate during a 10-minute walk for one subject. We observe that baseline-1 – which is the default heart rate app of Moto 360 – performs worst among all methods. Its estimation error can be 70 bpm, and is higher than 50 bpm for most of the time. We also observe that baseline-2 performs worse than FitBeat. This is because baseline-2 is more vulnerable 18 60 baseline-1 baseline-2 baseline-3 50 FitBeat Average HR Error (bmp) 40 30 20 10 0 1 2 3 subject No. Figure 2.13: Average estimation error while running. to residual motion-induced interference, as it disables pulse identification during spectral analysis. Specifically, Fig. 2.11 shows the spectrum of reference PPG signal, noise-reduced PPG signal, and accelerometer data. In the spectrum of reference PPG signal, peak corre- sponding to the heart rate is located at 1.3 Hz. While in the spectrum of motion-reduced PPG signal, the peak of maximum amplitude is located at 1 Hz, which is caused by residual motion-induced noise (as shown in the spectrum of accelerometer data). When using the peak at 1 Hz to estimate heart rate, baseline-2 results in an error of 18 bpm. In comparison, FitBeat uses the pulse identification algorithm to remove peaks caused by residual noise, which allows it to accurately identify the peak corresponds to the heart rate. Moreover, we observe that FitBeat outperforms baseline-3, which disables contact sensing during heart rate measurement. In particular, baseline-3 yields an estimation error of about 15 bpm in the time period between 150s and 240s. As shown in Fig. 2.12, the variance of PPG signal experiences a surge in this time period, which indicates an unstable contact between the subject’s skin and the PPG sensor. With contact sensing, FitBeat is able to identify and exclude PPG signal samples corrupted by unstable contacts, which allows it to maintain accurate heart rate measurements. 19 180 70 heart rate error (bpm) 160 heart rate (bmp) 60 140 ground truth 50 baseline-1 FitBeat 40 120 30 100 20 10 80 0 0 200 400 600 time (s) baseline-1 FitBeat ((a)) Heart rate estimation. ((b)) Distribution of absolute estimation error. Figure 2.14: Heart rate monitoring while running. 20 baseline-1 baseline-2 18 baseline-3 FitBeat 16 Average HR Error (bmp) 14 12 10 8 6 4 2 0 1 2 3 subject No. Figure 2.15: Average estimation error while riding. 100 ground truth baseline-1 heart rate error (bpm) 90 heart rate (bmp) FitBeat 25 80 20 15 70 10 60 5 50 0 0 200 400 600 time (s) baseline-1 FitBeat ((a)) Heart rate estimation ((b)) Distribution of absolute estimation error Figure 2.16: Heart rate monitoring while riding. 2.4.2.2 Running We now evaluate FitBeat while subjects are running in gyms or outdoors. Fig. 2.13 shows the average estimation error µ for three subjects during a workout of 10-minute running. We observe that FitBeat outperforms baseline-1 by a significant margin. In particular, the average estimation error of FitBeat is below 8 bpm for all subjects. Fig. 2.14(a) shows the 20 trace of heart rate estimation for one subject. As shown in the figure, FitBeat is able to maintain accurate heart rate measurement consistently over time. Fig. 2.14(b) compares the distributions of Errabs for FitBeat and baseline-1. For FitBeat, the median Errabs is around 5 bpm, and the maximum Errabs is below 10 bpm. 2.4.2.3 Riding We then evaluate FitBeat when subjects are riding on spinning or outdoors. Fig. 2.15 shows the average estimation error µ for three subjects during a workout of 10-minute riding. Compared with the result shown in Fig. 2.13, µ is generally lower during riding, because riding involves less intensive wrist movements. In this case, FitBeat still maintains accurate heart rate measurements, but brings little accuracy improvement when compared with other baselines. For example, Fig. 2.16(a) shows the estimated heart rates for a 10-minute riding on spinning. As shown in the Fig. 2.16(b), the heart rate estimated by FitBeat is close to the ground truth, and the median of Errabs is no more than 2 bpm. 2.5 Conclusion of Study In this thesis, we present FitBeat – a lightweight system that uses wrist-worn PPG for accurate measurement during intensive exercises. To achieve this goal, FitBeat integrates contact sensing, motion sensing and lightweight signal processing algorithm to suppress various error sources. We implement FitBeat on a COTS smartwatch, and evaluate its performance under different levels of exercise intensity, including walking, running and riding. Experimental results from 10 objects show that FitBeat can accurately measure the heart rate during exercise. 21 CHAPTER 3 RESPIRATORY SINUS ARRHYTHMIA BIOFEEDBACK-BASED BREATHING TRAINING RSA-BT (Respiratory Sinus Arrhythmia biofeedback-based Breathing Training) is a cardio- respiratory intervention that has been commonly used as a complementary treatment to respiratory diseases, as well as an exercise to help manage stress and anxiety. Despite its health benefits, today’s RSA-BT still relies on in-person sessions and cumbersome sensing devices in a clinical setting, which limits its accessibility. In this chapter, we introduce BreathCoach, a smart and unobtrusive system that enables effective in-home RSA-BT using sensors on a smartwatch and smartphone-based VR. This chapter is adapted from a publi- cation [122]. The author of the dissertation is the first author of the original work. ”We” in this chapter refers to the author of the original publication. This work contains the software design on Android devices and the algorithm design in Matlab. The author recruited all the subjects, then collected and processed the data and the ground truth. 3.1 Background Respiratory sinus arrhythmia (RSA) refers to the naturally occurring synchronization between heart beat and respiration — cardio-acceleration during inspiration, and cardio- deceleration during expiration — which is known as a reflection of the regulation of autonomic nervous system [61]. As such, RSA-BT (Respiratory Sinus Arrhythmia biofeedback-based Breathing Training) has been used as a common cardio-respiratory intervention with the goal of guiding trainees to initially breathe at their Resonant Frequency (RF), the frequency at which maximum amplitude of RSA is achieved, and then breathe in phase with heart beat changes with the same goal of RSA maximization [71]. Due to its ability to help improve autonomic control of cardiopulmonary function [72] and emotional self-regulation capacities [64], RSA-BT and its variants have been adopted as a complementary treatment 22 to pulmonary diseases such as Asthma and chronic obstructive pulmonary disease (COPD) [36, 70], or as a relaxation technique especially for mental health conditions such as post- traumatic stress disorder (PTSD) [62]. When used as a clinical therapy, RSA-BT requires a set of instruments and follows a standard procedure administered by a therapist. Figure 3.1(a) shows a typical clinical setting of RSA-BT, where the trainee’s physiological signals are measured by ECG electrodes, abdominal strain gauge, and pulse oximeter finger clip sensor. The measurements are then transmitted to a computerized machine to provide bio-feedback for breathing. The training protocol is typically composed of two types of sessions [36] (illustrated in Figure 3.1(a)(c)): (1) RF detection session: the trainee is asked to breathe for 2 minutes at each pre-set pace, e.g. 7, 6.5, 6, 5.5, 5, 4.5 breaths per minute (bpm), to obtain the RF. (2) Biofeedback session: the trainee is instructed to breathe at RF for the first few minutes and then follow a breath pattern (BP) in phase with IBI (Inter-beat Interval) for the rest of the session. Note that the traditional training protocol requires the supervision of a therapist. Specifically, the therapist should decide the RF according to the RSA distribution during the RF detection and suggest the moment to shift to IBI-based breathing basing on trainee’s real-time performance during the training. Despite its health benefits, today’s RSA-BT therapy has several limitations. 1) First, it still relies on cumbersome devices and in-person sessions in a lab/clinical setting. Specifically, to measure physiological signs such as BP and IBI, trainees are required to wear sensors on their wrist, chest, and fingertip. In addition, the biofeedback display with a two-dimensional human-computer interface makes it difficult to convey intuitive guidance to trainees, there- fore hampering the training experience and effectiveness. 2) Second, the protocol of today’s RSA-BT lacks a way of taking poor training performance into account to adjust breath- ing guidance dynamically. Specifically, during a training session, trainees’ breathing is only guided by IBI series, which does not always result in a suitable breathing pattern for trainees to follow, as some trainees may feel uncomfortable to breathe at a certain pace due to their 23 Biofeedback display Biofeedback display Biofeedback display Physiological measurement Biofeedback display Physiological Physiological Physiological measurement measurement measurement ((b)) A prototype of BreathCoach, which ((a)) Conventional clinical tools for RSA- includes an off-the-shelf smart watch and BT. a smartphone-based VR viewer. Pre-training Training (~10min) one-time switch by (~15min) therapist fixed RF RF RF Traditional RSA-BT: IBI pacing detection pacing dynamically estimated No pre-training BreathCoach-based required due to RF IBI RF IBI dynamic RF pacing pacing … pacing … pacing … RSA-BT: estimation intelligent switch based on real-time measurements ((c)) Comparison of the training procedures of RSA-BT between traditional ap- proach and BreathCoach, showing two major differences: (1) Unlike traditional RSA-BT, no pre-training required in BreathCoach as RF is dynamically estimated during training. (2) During training, traditional RSA-BT relies only on IBI-based pacing after the initial 2-min RF-based pacing, while BreathCoach provide guid- ance by intelligently switching between RF-based and IBI-based pacing based on real-time measurements. Figure 3.1: A Comparison between the traditional approach and BreathCoach for respiratory sinus arrhythmia biofeedback-based breathing training (RSA-BT). physical conditions. Another possible cause of poor training performance is irregularities in measured IBI signals, due to body movements or other sources of interference. 3) Lastly, conventional RSA-BT’s dependence on the supervision of a therapist [71] significantly limits its accessibility, therefore making it ill-suited for long-term practice at home. 24 In this thesis, we present BreathCoach — a smart and unobtrusive system that enables in-home RSA-BT using smartphone-based VR and sensors on a smartwatch (illustrated in Figure 3.1(b)). Specifically, BreathCoach continuously calculates required physiological mea- surements (i.e., BP, IBI, and RSA) using signals from the accelerometer and the PPG sensor on a smartwatch. These real-time measurements are used to calculate the recommended BP, which is then conveyed through a VR game to provide intuitive and continuous breathing guidance. To further improve the training performance and experience, BreathCoach in- telligently switches between two pacing mechanisms based on a dynamic measure of user’s difficulty in following the guidance (Figure 3.1(c)). The key novelties of BreathCoach include: • The system adopts a suite of lightweight algorithms to extract BP, IBI, and RSA from raw sensor signals in real-time, making it suitable for implementation on smartwatch and smartphone. • To achieve better effectiveness of training, the system informs the calculation of rec- ommended BP with dynamically estimated RF and RSA thresholds based on both current and historical measurements; and intelligently switches between two feedback mechanisms based on users’ difficulty in following the guidance. • The breathing guidance is conveyed to users in the form of VR game to provide a more intuitive and immersive guidance. We have implemented a research prototype of BreathCoach with two exploratory VR game designs using a wrist-type wearable (Empatica E4 [2]), a smartphone (Moto G4 [5]) and a VR viewer (Google cardboard [3]). The evaluation of BreathCoach was conducted in three aspects, including the accuracy of physiological measurements, effectiveness of training, and user experience. We have collected both subjective and objective data from experiments where each of 10 participants performed 6 sessions of RSA-BT using either traditional ap- proach or BreathCoach. The results show that BreathCoach is not only able to accurately 25 measure required physiological signs, but also achieves better training performance than the traditional approach. 3.2 related Work 3.2.1 RSA Biofeedback Training RSA-BT has been implemented for numerous clinical applications, such as treatments for asthma, COPD and various neurotic disorder [36, 70, 139]. Its implementation involves sensing and displaying instruments. The commonly used set of sensing instruments include ECG electrodes, abdominal strain gauge, and pulse oximeter finger clip sensor, as shown in Figure 3.1(a). C-2 biofeedback units with HRDFT software [36] and the cardiotachmeter [71] as shown in Figure 3.1(b) are widely used as displaying instruments for clinical RSA- BT. Figure 3.1(b) also illustrates the biofeedback interface of C-2 biofeedback units. The breathing pacer is a sawtooth-shaped line. A small ball travels along the line from left to right to guide inhalation and exhalation. Heart rate is displayed in the same window as the biofeedback. Besides, the clinical RSA-BT adopted a standard training protocol. Specifically, this protocol consists of two sessions [36]. In the first session, the trainee is asked to breathe for 2 minutes at each pre-set pace, e.g., 7, 6.5, 6, 5.5, 5, 4.5 bpm, to obtain the RF. During the second session, the trainee is instructed to breathe at RF for the first few minutes and then breathe in phase with IBI. However, these RSA-BT systems have several shortcomings, especially for in-home train- ing. Firstly, the cumbersome sensing and displaying instruments make these systems im- practical for in-home RSA-BT. Secondly, training with these systems entails the supervision of the therapist. Specifically, the therapist should decide the RF according to the RSA distribution and suggests the moment to shift to IBI-based breathing basing on trainee’s real-time performance in the second session. Moreover, according to the standard RSA-BT protocol, RF detection should be performed every time starting training, which is inefficient. Finally, trainees may feel overwhelmed when failing to breathe in phase with IBI. Trainees’ 26 ((a)) Apple Watch Breathing ((b)) StressEraser, a ((c)) The VR interface of Application [8]. popular off-the-shelf SOLAR. device for daily RSA-BT [6]. Figure 3.2: Examples of the products on the market with functions related to breathing training. physiological limit may prevent them from breathing at a low rate, and irregularities in ECG signals also make it difficult for trainees to follow an aperiodic IBI. In these situations, their training performance will be degraded without the therapist’s supervision. An abundance of breathing applications has emerged to serve different functions—from entertainment-oriented games to improving health or well-being. ”Breathe” is a native ap- plication on Apple watch [8]. As shown in Figure 3.2(a), it uses graphic animation and gentle taps to guide the breathing and help the user focus. The training duration and frequency can be customized. This app is easy to operate and designed for daily breathing training. However, without any biofeedback, this app fails to consider users’ training performance. Besides, this app leads users to breathe at a fixed pace and the breathing pace is constant for all users, which makes the training ineffective. On the one hand, the exact cardiac RF varies from person to person [127]. Thus, the breathing pace should be adapted to varied individuals. On the other hand, the RF has been shown to change over time within individ- uals [71, 33, 74]. Therefore, breathing at a fixed pace is ineffective when the RF has reduced to a slower pace. Instead, by breathing in phase with heartbeat changes, RSA-BT allows each individual to breathe at a rate that is adapted to the rhythms of his/her own body and over time as respiratory function improves. StressEraser is an off-the-shelf device for daily RSA-BT, which has been commonly used in various treatments and the related research [98, 112]. As shown in Figure 3.2(b), StressEraser 27 is a hand-held biofeedback device that measures HRV from the pulse in your fingertip via an infrared sensor and displays it as a wave to instruct users’ respiration. This portable device can be used for in-home RSA-BT. The users often complained about error sensing signals and the failures to deal with irregularities in IBI [1]. To obtain good-quality signals, users are asked to hold finger steady and avoid sunlight. Even so, sometimes it provides a meaningless straight line. Moreover, like a clinical RSA-BT system, it still fails to manage trainees’ bad performance resulting from physiological limits and uncertain irregularity in ECG signals. Finally, without a respiratory sensor, StressEraser is unable to detect trainees’ real-time respiratory response, and thus leaves trainees unaware of their real-time performance. Recently, an immersive breathing training system, called AirFlow, has been developed for COPD [100]. It collects respiratory data from sensors on the chest and abdomen and reflects them in immersive breathing training games, including the Balloon Game, Eating Game and Penguin Game. These games are designed to train Pursed-Lip breathing, breathing rhythm and depth respectively. In addition to requiring obtrusive sensing devices, the system is only able to guide users to breathe at a fixed rate. 3.2.2 Breathing training as a stress mitigating intervention Breathing has a direct effect on RSA and as such plays a fundamental role in regulating the autonomic nervous system and reducing autonomic arousal [46]. Research suggests that each individual has a resonant frequency at which RSA is the greatest. Breathing at resonant frequency stimulates the vagal baroreflex [73]. Frequent high-amplitude stimulation of the baroreflexes by breathing at resonant frequencies increases the efficiency of cardiac reflexes and baroreflexes, and consequently promotes relaxation. Research shows that breathing training as an effective regulator of autonomic arousal leads to concrete stress reduction effects [24, 23, 126]. Preliminary results suggest that portable RSA biofeedback appears to be a promising treatment adjunct for disorders of autonomic arousal and is easily integrated into treatment [103]. Several studies support 28 that RSA-BT is a promising treatment for several kinds of anxiety disorder, such as post- traumatic stress disorder (PTSD), work stress and perinatal depression [119, 89, 17]. Re- cently, guided breathing has been utilized as a mindful intervention for drivers to counteract the stress accumulated at work and the additional stress encountered during driving [93]. 3.2.3 Respiration pattern measurement Respiratory inductance plethysmography (RIP) sensor is the most widely used device to evaluate pulmonary ventilation by measuring the movement of the chest and abdominal wall [16, 44]. It consists of two lightweight elastic and adhesive bands, which makes the measurement of respiration pattern cumbersome. To detect the breathing pattern unobtrusively, the MindfulWatch, a smartwatch-based system for real-time respiration monitoring during meditation, was developed in [44]. It utilizes motion sensors to sense the subtle ”micro” wrist rotation ( 0.01 rad/s) induced by respiration. MindfulWatch offers reliable real-time respiratory timing measurement using a novel self-adaptive model that tracks changes in both BP and meditation posture over time. 3.2.4 Bio-responsive VR VR systems have been successfully applied in the treatment of various anxiety disorders including fear of flying, social phobia, PTSD, fear of spiders and fear of heights. There are mainly three principles for the VR design of these mindful games: abstract visual elements, rewarding practice and attention restorative environment. Specifically, abstract visual ele- ments such as images and shapes are less distracting than concrete images such as flower, sky, etc., and thus help participants relax [58]. The use of subtle visual elements as a reminder to focus on stimulus is the preferred form of visible feedback [27]. Additionally, Rewarding practice can motivate users to practice more often and for longer periods of time because of the enjoyment they feel. Finally, attention restorative environment positively affects user’s 29 attention [57]. The environments with stimuli that modestly capture attention are preferred. For instance, subtle nature sounds are preferred over traffic noise. SOLAR is a popular VR game that assists novice users in learning the stress-reducing practice of mindfulness meditation [99]. Its VR is generated by the user’s brain activity and respiratory rate. SOLAR asks users to focus their attention on the visual representation of breathing. It is common for the users’ mind to wander during meditation. Therefore, we included the user’s meditation scores in order to provide gentle feedback to the user when their mind starts to wander. This meditation score was mapped to the color of the meditation circle, positioned behind the silhouette as shown in Figure 3.2(c). Besides, the respiration sensors were placed on the user’s thorax and diaphragm. The data received from the sensors were used for generating both audio and visual elements of SOLAR. The respiration sensors are mapped to the breathing circle in front of the silhouette. The breath circle becomes larger and smaller as the user inhales and exhales. 3.3 System Requirements and Challenges BreathCoach is designed to be an in-home RSA-BT system that continuously tracks physiological variables, calculates the recommended BP in real time and guides users towards the recommended BP through a VR game. To achieve this goal, BreathCoach should meet the following requirements: (1) Since BreathCoach is designed for home and office use, its sensing and displaying instruments should be easy to operate and comfortable to wear. (2) BreathCoach needs to provide accurate and continuous measurement of physiological variables, including BP, IBI and RSA, compared to clinical tools. (3) BreathCoach should automate the procedure of traditional training, and intelligently provide guidance to users without the presence of a therapist. (4) BreathCoach should provide guidance in an intuitive and easy-to-follow fashion. To meet these requirements, we addressed two major challenges in developing Breath- Coach. (1) It is challenging to extract accurate BP, IBI, and RSA in real time from the 30 built-in PPG sensor and accelerometer on the smartwatch. Compare to sensors available (i.e., ECG and RIP) in a clinical setting, smartphone sensors are significantly susceptible to motion artifacts. (2) The pacing mechanism used in traditional RSA-BT only relies on real-time IBI series, which is not suitable due to irregularities caused by interference such as body motion. Therefore, without the supervision of a therapist, it is challenging to create an intelligent pacing mechanism that provides continuous and effective breathing guidance. 3.4 System Design The architecture of BreathCoach is illustrated in Figure 3.3 with three key components, which are Physiological Measurement, Dynamic Estimation, and Intelligent Pacing. The Physiological Measurement component is responsible for calculating required bio-signals needed for Breathing Pattern Recommendation. Specifically, it takes raw signals from ac- celerometer and PPG sensor on the smartwatch as input to extract breathing pattern (BP) and inter-beat interval (IBI), which are then used to calculate RSA amplitude. Based on the historical data and current measurements, The Dynamic Estimation keeps track of the resonant frequency (RF) and its corresponding RSA threshold – two key parameters for generating effective breathing recommendation – which typically changes during training. Informed by the results of dynamic estimation, the Intelligent Pacing component selects an optimal pacing mechanism and generates the recommended breathing pattern. Finally, the system presents the resulting breathing pattern in a smartphone-based VR game. 3.4.1 Physiological Measurement Physiological measurement provides required bio-signals, including BP, IBI and RSA, to BP recommendation. It contains three major components: IBI extraction, breathing pattern extraction and RSA quantification. As both PPG-based IBI extraction and acceleration-based breathing pattern extraction are sensitive to significant postural change, the system first analyzes the acceleration data to 31 Breathing Pattern Recommendation Dynamic Estimation Historical & Current Data RSA Threshold Resonant Frequency Estimation (RF) Estimation Real-time Bio-signals 1050 IBI (ms) 1000 Intelligent 950 900 850 recommended Pacing RSA 800 32 52 54 56 58 60 62 64 66 68 breathing 31 Resp 30 29 pattern 28 Smartphone-based 27 26 breathing 25 cycle IBI-based RF-based 52 54 56 58 60 62 64 66 68 VR Display 240 RSA (ms) 220 Pacer Pacer RSA 200 180 160 140 120 100 80 52 54 56 58 60 62 64 66 68 time(sec) Physiological Measurement Accelerometer Breathing Pattern Extraction Smart RSA watch Quantification PPG Inter-Beat Interval Extraction Figure 3.3: System overview of BreathCoach. estimate postural stability before further extracting cardiac and respiratory signals. When low postural stability is detected, the system will pause extracting physiological signals and resume when it goes back to being stable. The postural stability is assessed using the standard deviation of three-axial acceleration’s norm (ST Dacc ), which is calculated over 1 s acceleration series every 0.03s (set according to the sample rate of the accelerometer). As the respiration-induced wrist motion fluctuates subtly and consequently has a low variation in the norm of acceleration compared with a sig- nificant postural change, ST Dacc should stay below a threshold with no significant postural change. Once ST Dacc exceeds the threshold, physiological measurements are discontinued as it indicates significant postural changes. This threshold is generally defined as 1 g according to our experimental results. 32 10 5 0 PPG -5 -10 PPG Inter-beat pulse line interval non-pulse line -15 0 1 2 3 4 5 6 time(s) Figure 3.4: An example of inter-beat interval (IBI) extraction based on 6-second pulse wave data from PPG sensor. 3.4.1.1 IBI extraction The system extracts IBI from PPG signal when the user’s posture is relatively stable. Specif- ically, the raw PPG signals are first filtered from 0.8 Hz to 5 Hz to reduce noise. The filtered PPG is then segmented using Incremental-merge segmentation algorithm (IMS) to calculate IBI [60]. After segmentation, lines are classified as pulse or non-pulse lines. If the interval between an up-slope and the last pulse line is large than a pre-defined threshold, this up- slope is identified as a validated pulse line. The threshold is set to 0.6, as the resting IBI ranges from 0.6 to 1 sec. Finally, the continuous PPG signal is divided into a group of pulse lines (as shown in Figure 3.4). The IBI is calculated as the interval between the ends of consecutive pulse lines. 3.4.1.2 Breathing pattern extraction With the wrist band being held against the user’s abdomen, the respiration can be monitored by analyzing the motion caused by the subtle displacement of the user’s abdomen due to respiration. Apple Watch has utilized a similar method for blood pressure measurement, in which the accelerometer would, when held against your chest, detect the heartbeat [7]. To measure breathing, the raw acceleration is first processed by a low-pass filter of 0.4Hz, which aims to highlight the motion due to respiration. The filtered acceleration signal is then used to calculate BP using IMS. As mentioned before, IMS segments acceleration into up-slope 33 -22 raw data low-pass filter breathing pattern -22.5 Acceleration (g) -23 -23.5 -24 Breathing Cycle -24.5 5 10 15 20 time(sec) Figure 3.5: An example of breathing pattern extraction based on 15-second acceleration. and down-slope lines. If the interval between an up-slope and the last expiration line is large than a pre-defined threshold, this up-slope is identified as a validated expiration line. The threshold is set to 3, as the normal resting breathing cycle is no less than 3 sec. Finally, the acceleration is divided into a group of expiration and inspiration lines, and each pair of consecutive expiration and inspiration lines will be identified as a breathing cycle (as shown in Figure 3.5). 3.4.1.3 RSA quantification RSA refers to synchronization between heart beat and respiration [116]. As a critical pa- rameter for breathing pattern recommendation, the system quantifies RSA by calculating its amplitude on a breath-by-breath basis through Peak-valley algorithm [61] based on the real-time IBI and BP. Specifically, when there are valid minimal and maximal IBI for a breath cycle, RSA is calculated as the difference between the maximum and minimum IBI. Figure 3.6 illustrates the peak-valley method for RSA estimation. Each breathing cycle is detected from the respiration pattern. For each breath, the estimate of RSA is obtained by searching the corresponding segment of IBI series for the maximum and minimum value and then computing their difference. 34 31 1.1 Breathing pattern RR interval 30 1.05 Breathing signal 29 1 RSA(sec) IBI (sec) 28 0.95 27 0.9 26 breathing 0.85 cycle 25 0.8 60 62 64 66 68 70 72 74 76 78 80 time(sec) Figure 3.6: An example illustrating Peak-valley algorithm for RSA quantification. Per breath, RSA is calculated as the difference between maximum and minimum inter-beat interval (IBI). 3.4.2 Real-time Breathing Pattern Recommendation Running on the smartphone, this component takes the continuous measurements of IBI, BP, and RSA to dynamically calculate a recommended BP for optimal performance. Specifically, real-time BP recommendation involves dynamic estimation and intelligent pacing. 3.4.2.1 Intelligent Pacing The intelligent pacing dynamically chooses the optimal mechanism for BP recommendation between IBI-based where users are guided to breathe in phase with IBI changes, and RF- based pacing mechanism where users are guided to breathe at a fixed pace, i.e., RF. The dynamic switching is controlled by two RSA thresholds: RSAlow and RSAhigh , which act as the standards for real-time evaluation of training performance. When the user hardly breathes in phase with the IBI wave, which may be irregular at that time, the RSA amplitude will drop below the RSAlow indicating a bad training performance. According to experimental results, RSAlow is set to 100 ms in BreathCoach to define a bad training performance as a weak synchronization between breathing and IBI with RSA below 100 ms. If the RSA exceeds RSAhigh while the user breathes following the RF-based pacer, it means the IBI wave acts regular and the user is capable of breathing in phase with 35 250 RSA(ms) 200 RSA 150 RSA RSAlow high 100 600 650 700 750 800 850 900 2.5 IBI 2 recommended breathing pattern 1.5 1 0.5 0 IBI-based Pacer-based IBI-based -0.5 600 650 700 750 800 850 900 T1: time(s) T2: posture change RSA>RSAhigh Figure 3.7: An example showing how Intelligent pacing works. At T1, the system switched from IBI-based to RF-based pacer, as significant postural changes interrupted IBI extraction; At T2, the system switched back to IBI-based mechanism, as the RSA exceeded RSAhigh . IBI. RSAhigh is defined as the maximum RSA amplitude achieved when breathing at RF. It should be set and updated during training, because it varies with each individual and changes during the training. Specifically, the system will switch to RF-based pacing mechanism if the current RSA is lower than RSAlow or the IBI extraction is interrupted by significant postural changes, and switch back to IBI-based mechanism when the RSA exceeds RSAhigh . Figure 3.7 illustrates how intelligent pacing works with a real-world example. At T1, BreathCoach switches from IBI-based to RF-based pacing mechanism as IBI extraction is suspended, and switches back to IBI-based pacing at T2 when RSA is detected greater than RSAhigh . We can observe that IBI waveform is irregular and RSA stays low from T1 to T2, whereas IBI waveform gets regular and RSA becomes larger at the end of RF-based pacing training, suggesting that RSA as an evaluation of real-time training performance monitors not only the user’s capacity to breathe in phase with IBI but also the irregularity of IBI signals. 36 BRi, STDBR, RSAi 260 RSA (ms) 240 STDBR 220 RSA high <0.2 >=0.2 200 update of RSA high RSA 180 RF end 400 405 410 415 420 425 430 435 440 candidate Breathing rate 7.5 RF update of RF RF=BRi? y 7 Breathing rate n update (bpm) 6.5 RSAi > RSAhigh? RSAhigh n y 400 405 410 415 420 425 430 435 440 STD of 0.3 end update RSAhigh & RF 0.2 breathing rate 0.1 ((a)) Flowchart of dynamic 0 estimation, where BRi , 400 405 410 415 420 425 430 435 440 T1 time (s) T2 RSAi and ST DBR are the breathing rate, RSA ampli- ((b)) An example of Dynamic estimation of RF and RSAhigh . Both tude and the standard de- RF and RSAhigh are updated at T2, as the ST DBR is lower than viation of BR at ith time 0.2 and RSA exceeds RSAhigh . Only RSAhigh is modified at T1, step, respectively. as the BR with a ST DBR below 0.2 is equal to RF at this point. Figure 3.8: Dynamic estimation. 3.4.2.2 Dynamic Estimation RF and its corresponding maximum RSA amplitude (i.e.,RSAhigh ) changes during training in two scenarios. In the first scenario, a respiration frequency with its corresponding RSA amplitude higher than RSAhigh is detected to be the new RF. In another scenario, the RSA amplitude observed when the user breathes at RF is different from RSAhigh , suggesting RSAhigh should be updated to this RSA amplitude. In order to adapt breathing pattern recommendation to such changes, RF and RSAhigh are dynamically updated by analyzing historical data, including BP and RSA. By analyzing BP, BreathCoach monitors user’s breathing rate (BR) and its stability through the standard deviation of BR (ST DBR ) to identify RF candidates. BR is calcu- lated breath by breath as 60 divided by the average of previous 5 breathing cycles. Its corresponding ST DBR is also calculated each breath in a 5-breathing-cycle window. Since the detection of RF entails a long-term observation, ST DBR is necessary to make sure users maintain a BR long enough that this BR can be a potential RF. A ST DBR lower than the pre-defined threshold (set to 0.2) makes the corresponding BR an RF candidate, suggesting the user has kept breathing at this BR during previous five cycles. 37 ((a)) Balloon, where the player controls the ((b)) Pilot, where the player breathes in and out movement of the balloon through respiration to with the shrinkage and expansion of the white follow the recommended breathing pattern rep- circle to make the flight as fast and straight as resented by the yellow track. possible. Figure 3.9: Screenshots of two proof-of-concept VR games. BreathCoach recognizes the new RF and RSAhigh by observing the RSA of RF can- didates. Same as BR, its corresponding RSA amplitude (RSABR ) is calculated breath by breath in a 5-breathing-cycle window. As shown in Figure 3.8(a), for each RF candidate, if BR equals RF, RSAhigh is updated to the RSABR . Otherwise, RSAhigh and RF will be updated to RSABR and BR respectively if RSABR is greater than RSAhigh . Figure 3.8(b) shows the dynamic estimation of RF and RSAhigh in practice. As shown in this figure, both RF and RSAhigh are updated at T2 as the ST DBR is lower than 0.2 and RSA exceeds RSAhigh at this moment. Only RSAhigh is modified at T1, since BR with a ST DBR below 0.2 is equal to RF at this point. 3.5 VR Game To provide an immersive and intuitive guidance, BreathCoach presents bio-feedback through VR game, in which a pacing stimulus is driven by the recommended BP to in- struct breathing. In this section, we describe two exploratory proof-of-concept VR games implemented as part of BreathCoach system. 38 3.5.1 Balloon As illustrated in Figure 3.9(a), the goal of this game is to guide users to breathe in sync with the yellow track to make the red balloon move along the track as precisely as possible in 15 minutes. The dynamic track, as the pacing stimulus, represents the recommended BP. The player controls the movement of the balloon through respiration, and the trail of the balloon reflects the player’s breathing pattern. The degree of alignment between the trail of the balloon and the track indicates the player’s performance, which is also used to change the game’s background color to give users feedback on their performance. 3.5.2 Pilot The Pilot game is designed to guide users to breathe in and out with the shrinkage and expansion of the white circle to make the flight as fast and straight as possible, as shown in Figure 3.9(b). As the pacing stimulus, the white circle near the bottom of the screen expands and shrinks according to the recommended BP. The flight altitude, speed, and the game’s background color are controlled by RSA amplitude, i.e., a proxy of the player’s real-time performance. The higher and stabler the RSA estimations are, the farther and straighter the player will fly. Different from Balloon, the Pilot game translates a proxy of training performance into actions in the game, instead of directly revealing real-time respiration and performance to users. 3.6 Evaluation In this section, we present the evaluation of BreathCoach based on a set of in-lab con- trolled experiments. First, we evaluate the accuracy of physiological measurements that the system uses to generate recommendations (Section 6.2). Second, we investigate the ef- fectiveness of BreathCoach’s real-time breathing pattern recommendation with respect to RSA amplitude maximization throughout breathing training [71] and an essential use case of RSA-BT, i.e. stress reduction [59, 96, 112, 139] (Section 6.3). Finally, we explore the 39 effect of different game design (Section 6.4). 3.6.1 Experiment settings The evaluation adopts a repeated-measures design, with the training protocol (i.e., tra- ditional and BreathCoach) and game design (i.e., Balloon and Pilot) as within-subjects variables. Subjects were required to conduct RSA-BT using three types of protocol-game combination, including Traditional-Balloon (traditional breathing training protocol plus Bal- loon), BreathCoach-Balloon, and BreathCoach-Pilot. Such experiment design allowed us to compare the BreathCoach-Balloon training with the traditional-Balloon training to assess the effects of intelligent breathing pattern recommendation module in BreathCoach and com- pare the BreathCoach-Balloon with BreathCoach-Pilot training to study the game design of BreathCoach. Our study along with its data collection procedure was approved by the Institutional Review Boards (IRB). All the subjects voluntarily agreed to help with data collection and signed a consent form. We have recruited 10 subjects, and each participated in our data collection consisting of six 45-minute RSA-BT sessions scheduled in different days. As shown in 3.10, in each session, the participants were exposed to a different and randomly selected breathing training setup. Note that the six sessions consist of two for each kind of training setup and participants randomly arranged their sequence for the six-day training. Each experiment begins with a tutorial during which the study administrator explained each part of the session and gave subjects a live demonstration of the breathing training system. After that, participants started the six daily sessions. As illustrated in Figure 3.10, each session includes 5 stages:(1) RF detection, (2) pre-training task, (3) breathing training, (4) post-training task, (5) survey. Specifically, the participants are initially left alone in the workspace to accomplish a 10-min RF detection, a procedure required in traditional RSA-BT protocol to manually estimate user’s in-situ RF. Subsequently, participants are asked to perform cognitive tasks, including a standard Stroop Test, followed by a restorative break. This task is widely utilized to 40 Structure: Tutorial Session 1 Session 2 Session 3 Session 4 Session 5 Session 6 Session structure: Duration: 10min 5min 3min 15min 5min 3min unlimited RF Stroop Breathing Stroop Task: detection test Break Training test Break Survey Protocol-Game for each subject, session 1-6 follow a randomly ordered list of combination: [ BreathCoach-Balloon x2, BreathCoach-Pilot x2, Traditional-Balloon x2 ] Figure 3.10: Schematic illustration of the study protocol. simulate a focused and stress-eliciting work situation and the recovery from it [98, 112]. After cognitive tasks, participants are left alone in the workspace for the 15-min breathing training with specific training setup. Upon finishing the training, participants will be asked to perform cognitive tasks again. At the end of each experimental session, participants are presented with a survey for training experience. The scale explores users’ training experience and game preference via the following questions: 1. How often have you been distracted from breathing during the training? 2. How often have you felt hard to follow pacing stimulus? 3. How often have you felt anxious while training? 4. How often have you tried too hard while breathing? 5. Which game do you prefer, Balloon or Pilot, and Why? The subject assesses the frequency on a 0-4 scale (0 = Never, 4 = Very Often). Besides, subjects’ physiological responses, such as RSA, BR and IBI, were recorded in all procedures. In order to collect data, each subject was asked to wear an off-the-shelf wrist-type wear- able (Empatica E4 [2]) and a smartphone (Moto G4 [5]) with a VR viewer (Google Cardboard [3]) during breathing training as shown in Figure 3.1(b). Both BreathCoach and traditional 41 protocol are implemented using the Empatica E4, Moto G4 and Google Cardboard. During data collection, the PPG sensor and the accelerometer of Empatica E4 are continuously sampled at 64 Hz and 32 Hz, respectively. The ground truth for IBI and BP measurements is collected from Hexoskin [4] – a smart shirt with built-in ECG and RIP sensors. 3.6.2 Evaluation of physiological measurement We evaluate the algorithms for BP and IBI measurements by comparing them with the ground truth. 3.6.2.1 Evaluation of breathing pattern extraction We first evaluate the performance of BreathCoach in detecting the breathing pattern and measuring the complete breathing cycles. The evaluation is based on the metric: estimation error of Breathing cycle duration (Durbc ). BreathCoach extracts users’ BP from acceleration. To evaluate its accuracy, we compare BreathCoach’s measurement for each breath cycle with the corresponding one from the ground truth (same data measured using the RIP sensor), and use their differences in Durbc as performance metrics. Figure 3.11 shows the error distribution of the breath-by-breath detection result collected from 10 subjects during their RF detection. We can see that the distribution of Durbc error is mostly symmetric around 0, indicating that the error does not accumulate over time. Specifically, the average absolute error of Durbc is 0.61 s, with 80.07% of the absolute errors under 1 second, as shown in the cumulative distribution function (CDF) of absolute Durbc error. We believe that this accuracy of complete breathing cycle detection is sufficient for deriving RSA and BR as users’ breathing rates range from 4 to 10 breath per minute (bpm) during breathing training. 3.6.2.2 Evaluation of IBI extraction To evaluate BreathCoach’s performance in measuring IBI, we compare all the IBI produced by BreathCoach with those obtained from the ground truth (same data measured using the 42 Distribution of DurBC Errors CDF of Absolute DurBC Errors 89 100 79 90 69 80 P(error < 1s) = 80.07% 70 59 Percentage 60 50 50 40 40 30 30 20 20 10 10 0 0 -3 -2 -1 0 1 2 3 0 0.5 1 1.5 2 DurBC Error (s) Absolute DurBC Error (s) Figure 3.11: The error distribution (left) and CDF (right) of the breath-by-breath detection result of BreathCoach collected from 10 subjects. The average absolute error of breathing cycle duration (Durbc ), which is used to derive RSA, is 0.61 s. ECG on Hexoskin), and use their pairwise differences, i.e. IBI errors, as evaluation metrics. Figure 3.12 shows the error distribution of IBI collected from 10 subjects during their RF detection. We can observe that the distribution of IBI error is almost symmetric around 0, suggesting that the error does not accumulate over time. Specifically, the average absolute error of IBI is 9.6 ms, with 81.48% of the absolute errors under 15 ms, as shown in the CDF of absolute IBI error. Therefore, we believe that BreathCoach’s accuracy and reliability in measuring IBI are sufficient for generating feedbacks and deriving RSA. 3.6.3 Evaluation of Intelligent Breathing pattern recommendation In this subsection, we evaluate BreathCoach in two aspects: training effectiveness and sub- jects’ training experience. To study the effect of intelligent breathing pattern recommenda- tion, we compare the BreathCoach-Balloon training with the baseline, traditional-Balloon training. 43 Distribution of IBI Error CDF of Absolute IBI Error 3.77 100 90 3.15 80 P( error<15ms )= 81.48% 70 2.52 Percentage 60 1.89 50 40 1.26 30 20 0.63 10 0 0 -50 0 50 0 10 20 30 40 IBI Error (ms) Absolute IBI Error (ms) Figure 3.12: The error distribution (left) and CDF (right) of the BreathCoach’s inter-beat interval (IBI) extraction from 10 subjects. The average absolute error of IBI, which is used for RSA assessment and real-time breathing pattern recommendation, is 9.6 ms. 3.6.3.1 RSA maximization We evaluate the effect of BreathCoach on RSA maximization, as the direct objective of RSA- BT is to maximize the RSA amplitude throughout the training to achieve better health outcome. The evaluation is based on Difrsa , the difference between RSA and RSAref . RSAref is the maximum RSA amplitude achieved by breathing at RF during RF detection, which is considered as a reference in the assessment of the effect on RSA maximization. It is obtained from the 10-min RF detection before each training. Difrsa gauges how close user’s RSA is to the maximum RSA amplitude. Difrsa is computed as: Difrsa (i) = RSA(i) − RSAref (3.1) where RSA(i) denotes the user’s RSA in the i-th breathing cycle. Difrsa (i) denotes the difference between user’s RSA in the i-th breathing cycle and the maximum RSA, RSAref . It is worth noting that Difrsa has a sign, determining whether recorded values fall below or above RSAref . Specifically, a non-negative Difrsa suggests the RSA amplitude is cur- rently maximized and a high negative Difrsa indicates the current RSA fall closely below the maximum RSA. Therefore, high Difrsa implies well performance in maximizing RSA amplitude. 44 Figure 3.13(a) compares the distribution of Difrsa from BreathCoach and traditional training for each subject. We observe that there is a smaller variability of Difrsa for Breath- Coach as well as greater medians, suggesting that for most of the subjects RSA consistently falls more closely below RSAref and is more likely to exceed RSAref in BreathCoach-based training than traditional training. For subject 2, Difrsa from BreathCoach are generally higher with about 50% above 0, suggesting that RSA has been maximized for most of the time during the BreathCoach-based training, see Figure 3.13(b). For subject 7, although the median of Difrsa from BreathCoach is lower than 0, it fluctuates within a narrow range, which indicates that RSA falls more closely to the RSAref in BreathCoach than in the traditional training, see Figure 3.13(b). Thus, BreathCoach still outperforms tra- ditional training in RSA maximization for subject 7. Also, we used paired t-tests to re- veal significant (p < 0.05) differences between the effects of BreathCoach and traditional training on RSA maximization according to two metrics: the mean and STD of Difrsa collected from each training. The result shows BreathCoach-based training produces sig- nificantly higher Difrsa (M ean(Difrsa ) : p = 0.00001) with significantly lower variability (ST D(Difrsa ) : p = 0.0086) than traditional training. Figure 3.14 compares the distribution of Difrsa collected from all BreathCoach-based training with the one obtained from traditional training. We can see that, compared with Difrsa from traditional training, those from BreathCoach distribute more intensively around an average closer to 0, suggesting that RSA collected from BreathCoach-based training consistently fall closely below or above RSAref . Specifically, the average and STD of Difrsa from BreathCoach are 2.37 and 42.94 ms respectively, with 70% of Difrsa above -20 ms and 50% of Difrsa above 0 ms. For traditional training, the average and STD of Difrsa are -49.9 and 63.82 ms respectively, with only 32% of Difrsa above -20 ms and 70% above -85 ms, see Figure 3.14. As the RSAref is usually greater than 200 ms, an absolute Difrsa below 20 ms is sufficient to suggest an RSA highly close to the maximum value. Therefore, we believe that training using BreathCoach enable users to perform well in maximizing RSA 45 Distribution of DifRSA from 10 subjects. 100 RSA difference (ms) 0 -100 -200 BreathCoach Traditional training -300 1 2 3 4 5 6 7 8 9 10 all subject No. ((a)) Comparing the distribution of the difference between RSA and RSAref (Difrsa ) from BreathCoach and traditional training for each subject. BreathCoach-based training produces sig- nificantly higher Difrsa (p < 0.05) with significantly lower variability (p < 0.05) than traditional training according to two metrics: the mean and STD of Difrsa collected from each training. Dif RSA series from subject 2 200 BreathCoach Traditional training 100 RSA(ms) 0 -100 -200 0 100 200 300 400 500 600 700 800 900 1000 Dif RSA series from subject 7 200 100 RSA(ms) 0 -100 -200 0 100 200 300 400 500 600 700 800 900 1000 time(s) ((b)) Illustrating the Difrsa series of subject 2 (upper) and 7 (lower). For each subject, compare the Difrsa series collected from BreathCoach-based and traditional training Figure 3.13: Evaluating the effect of BreathCoach on RSA maximization by observing the difference between RSA and RSAref (Difrsa ). RSAref , the maximum RSA amplitude achieved by breathing at RF during RF detection, acts as a reference in the assessment of the effect on RSA maximization. throughout the training. 3.6.3.2 Stress Reduction The stress reduction is studied based on heart rate variability (HRV), which is an estab- lished psycho-physiological measure for stress development and restoration [120, 19]. We used the standard deviation of normal to normal R-R intervals (SDNN) method to compute HRV. SDNN is calculated for consecutive overlapping sections of 1-min IBI data. We defined three metrics from HRV time series, including the mean of HRV during the cognitive task 46 Distribution of DifRSA from BreathCoach CDF of DifRSA 1.33 100 1.11 BreathCoach Percentage 90 Traditional training 0.88 0.66 80 0.44 70 0.22 P(Difrsa > -20ms)=32% Percentage 0 60 -300 -200 -100 0 2.37 100 200 300 50 Distribution of DifRSA from Traditional Training 0.88 40 Percentage 0.66 30 P(Difrsa > -20ms)=70% 0.44 20 0.22 10 0 0 -300 -200 -100-49.9 0 100 200 300 -300 -200 -100 0 100 200 300 Dif RSA (ms) Dif RSA (ms) Figure 3.14: The distribution (left) and CDF (right) of the difference between RSA and RSAref (Difrsa ) collected from BreathCoach-based training, showing that BreathCoach significantly improves the performance in maximizing users’ RSA throughout the training compared with traditional training approach (p < 0.05). (µHRV ), recovery speed and amplitude of HRV during the post-task rest (SpeedRecHRV and AmpRecHRV ). Specifically, greater µHRV is associated with enhanced executive func- tion resulting in faster reaction time and more correct responses to cognitive tasks [43, 42]. SpeedRecHRV and AmpRecHRV act as indicators for stress recovery. High amplitude of HRV is generally believed to promote emotional self-regulation [59, 96]. To evaluate the post-training improvements in stress reduction, we study the difference between pre- and post-training metrics and perform a series of t-tests on the difference. Each plot in Figure 3.15 compares the 8-min HRV series of pre-training tasks and post- training tasks for subject 1 with the left from BreathCoach and the right from traditional training. We can see that HRV stays low during the first 5-min cognitive task and is elevated during the subsequent break, which supports that HRV is an indicator of stress recovery. Comparing pre- and post-training HRV series, we find that, after training with BreathCoach, there is an increment in three features: HRV amplitude during Stroop test, the speed of HRV increasing to the maximum amplitude right after 5-min task and the maximum recovery amplitude during break, suggesting an improvement in the ability to recover from stressful situation. However, these gains are hardly observed after traditional training. 47 HRV series from BreathCoach HRV series from Traditional Training 0.08 0.08 pre pre post post 0.07 0.07 Cognitive task Break 0.06 0.06 HRV(SDNN) HRV(SDNN) 0.05 0.05 0.04 0.04 0.03 0.03 0.02 0.02 0.01 0.01 0 100 200 300 400 500 0 100 200 300 400 500 time(s) time(s) Figure 3.15: Compare the(a)8-min HRV series of pre-training task and(b) post-training task for subject 1 with the left from BreathCoach and the right from traditional training. After training with BreathCoach (right), there is an increment in three features: HRV amplitude during cognitive task, the speed of HRV increasing to the maximum amplitude right after 5-min task and the maximum recovery amplitude during break. However, these gains are hardly observed after traditional training (left). Figure 3.16 visualizes the change in µHRV , SpeedRecHRV and AmpRecHRV after BreathCoach- based training and traditional training for each subject. We can observe that µHRV , SpeedRecHRV and AmpRecHRV increased after BreathCoach-based training for most of participants, while very few participants have these three indices improved after tradi- tional training. Specifically, when training with BreathCoach, there is a significant post- training improvements in stress reduction according to the three metrics: µHRV (p = 0.0052), SpeedRecHRV (p = 0.0006) and AmpRecHRV (p = 0.0031). However, the sig- nificant improvement is not observed after traditional breathing training (µHRV : p = 0.52, SpeedRecHRV : p = 0.29 and AmpRecHRV : p = 0.73). In conclusion, our results suggest that BreathCoach is more effective than Traditional training when comes to RSA maximization, cognitive function and stress reduction. Breath- Coach can improve cognitive performance while concurrently aiding stress reduction. 48 Post-training Improvement in MeanHRV Post-training Improvement in SpeedRec HRV Post-training Improvement in AmpRecHRV 12 0.5 80 Breathcoach Breathcoach Breathcoach Traditional Training Traditional Training Traditional Training 10 0.4 60 8 0.3 HRV(SDNN) (ms) 6 40 4 0.2 20 2 0.1 0 0 0 -2 -0.1 -20 -4 -6 -0.2 -40 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 subject No. subject No. subject No. Figure 3.16: Visualize the change in the mean of HRV during the cognitive task (µHRV ), recovery speed and amplitude of HRV during the post-task rest (SpeedRecHRV and AmpRecHRV ) after BreathCoach-based training and traditional training for each subject. When training with BreathCoach, there is a significant post-training improvements in stress reduction according to the three metric: µHRV (p < 0.05), SpeedRecHRV (p < 0.05) and AmpRecHRV (p < 0.05). However, the significant improvement is not observed after tradi- tional breathing training. 3.6.3.3 Training Experience A good training experience of RSA-BT involves participants’ relaxed and stable respiration and sustained attention during training. In this subsection, training performance is studies based on both subjective and objective measurements. The self-reported scale for training experience is taken as the subjective assessment of the training experience. To examine phys- iological responses in relation to subjective perception, we analyze BR distribution collected during training. Training experience is assessed subjectively through a 6-item self-report measure, which asks users the frequency of they being distracted, feeling hard to follow pacing stimulus, feeling anxious while training and trying too hard while breathing, etc. The survey is per- formed right after each training, as shown in Figure 3.10. Table 3.1 statistically analyzes the difference of self-reported training experience between BreathCoach-Balloon and Traditional- Balloon training using paired t-tests. We can see that, compared with traditional training, the frequency of feeling distracted, anxious, hard to follow stimulus and breathing too deeply significantly decreases when training with BreathCoach (p < 0.05). Moreover, Figure 3.17 compares the distribution of BR from BreathCoach and tradi- 49 Table 3.1: Assess the difference of self-reported training experience between BreathCoach- Balloon and Traditional-Balloon training using paired t-tests. Compared with traditional training, the frequency of feeling distracted, anxious, hard to follow stimulus and breathing too deeply significantly decreases when training with BreathCoach (p < 0.05). BreathCoach Traditional Frequency of. p M(STD) M(STD) Being distracted 0.85 (0.74) 1.35 (0.67) 0.0234 Feeling hard to follow pacing stimulus 0.8 (0.76) 2.2 (1.1) 0.00005 Feeling anxious while training 0.95 (0.6) 1.7 (0.92) 0.0004 Trying too hard while breathing 1.05 (0.75) 2.4 (0.88) 0.00001 tional training for each subject. We observe that there is a smaller variability of BR for BreathCoach as well as lower medians, suggesting that BreathCoach enables users to keep the breath steady while slowing their respiration. Traditional training can also provide a steady breathing experience, like for subject 2. However, it can not ensure steady respira- tion as BreathCoach do. For subject 5, BR from BreathCoach fluctuates within a narrow range, while the one from traditional training has a large variability and falls far above the RF. Additionally, we extract the STD of BR (ST DBR ) for each training and compare the ST DBR collected from all BreathCoach sessions with the one collected from traditional training through paired t-test. It turns out ST DBR from BreathCoach is significantly lower than the one from traditional training, suggesting that BreathCoach enables users to breathe significantly more steady than traditional training does(ST DBR : p = 0.0019). Given the above, BreathCoach ensures users’ steady respiration, which is in agreement with users’ subjective ratings. 3.6.4 Discussion of game designs To explore the game design, we compare the BreathCoach-Balloon and BreathCoach-Pilot training for each subject. Paired t-tests reveal no significant (p < 0.05) differences of training effectiveness between BreathCoach-Balloon and BreathCoach-Pilot. Additionally, we collect users’ game preference through the last question of the scale. There are 7 out of 10 partici- pants who prefer Balloon over Pilot. According to the survey, this is mainly because Balloon 50 Distribution of BR from 10 subjects 14 Breathing Rate (bmp) BreathCoach 12 Traditional training 10 8 6 4 1 2 3 4 5 6 7 8 9 10 all subject No. Figure 3.17: Compare the distributions of BR from BreathCoach and traditional training for each subject. It shows that BreathCoach enables users to breath significantly more steady while slowing their respiration according to the metric, the STD of BR for each training (p = 0.0019). presents players their real-time training performance (i.e., how well the user’s respiration is in phase with the recommended BP.) by displaying not only the recommended BP but also their respiration trace, which helps users maintain or improve training performance by adjusting their breathing. We leave further investigation of the effects of game designs as future work. 3.7 Conclusion of Study In this thesis, we present BreathCoach – a smart and unobtrusive system that enables in-home RSA-BT using sensors on smartwatch and smartphone-based VR. To achieve this goal, BreathCoach adopts a suite of lightweight algorithms to continuously monitors BP, IBI and RSA using raw acceleration and PPG signals collected from the smartwatch. The system uses these real-time measurements to intelligently switch between two feedback mechanisms, IBI-based and RF-based, in order to derive the optimal BP. The recommended BP is then conveyed to users in the form of VR game to provide an intuitive training experience. We implemented BreathCoach using an off-the-shelf wrist-type wearable, a smartphone and a VR viewer, and designed two exploratory VR games. BreathCoach is evaluated in three as- pects, including accuracy of physiological measurements, effectiveness of training, and user experience. Our experimental results collected from 10 subjects with each one performs both traditional and BreathCoach-based training indicate that BreathCoach is able to provide ac- 51 curate physiological measurements with breathing cycle duration and IBI errors lower than 0.61s and 15ms respectively. Moreover, compared to traditional RSA-BT protocol, Breath- Coach achieves significant improvement (p < 0.05) on training effectiveness and experience. 52 CHAPTER 4 PERSONALIZED FEDERATED LEARNING FOR HUMAN ACTIVITY RECOGNITION This chapter introduces FedDL, a novel federated learning system for human activity recogni- tion that can capture the underlying user relationships and apply them to learn personalized models for different users dynamically. This chapter is adapted from a publication [124]. The author of the dissertation is the first author of the original work. ”We” in this chapter refers to the author of the original publication. This work contains the phototype implementation on Amazon Elastic Compute Cloud (Amazon EC2) and the algorithm design in Tensorflow. 4.1 Background Human activity recognition (HAR) is a key enabling technology for a wide range of applications, including smart home, health surveillance, and medical assistance [52, 133, 51]. For instance, it has been shown that longitudinal monitoring of daily routine activities, such as indoor/outdoor time, meals with/without family, and sleeping, can help to detect early onsets of Alzheimer’s Disease in aged population [108, 75]. Similarly, smart home systems can conserve home energy consumption and improve residents’ comfort/safety by recognizing complex home activities (e.g., eating, taking a shower, washing dishes, etc.) [32, 50]. Deep learning has recently been applied to HAR thanks to its better generalization and the ability of automatic feature extraction with less human effort [107, 41, 45]. However, several major challenges have not been addressed. The data collected from each user is usually unbalanced and sparse. Activities such as taking a shower, shopping, and biking, usually take place in a relatively low frequency. Applying deep learning to sparse and unbalanced data is likely to result in severe under-sampling artifacts. Training a global model for HAR in the cloud in a centralized manner may reduce the effect of data sparsity. However, the sensing data for HAR is often privacy-sensitive and hence cannot be shared or 53 uploaded [29, 105]. Federated Learning (FL) is an emerging technique used to collaboratively learn a global model, such as by computing an average aggregation of local models, without exposing users’ raw data [84, 115, 22, 90, 83]. Existing FL paradigms learn a single global model that however fails to capture the statistical diversity of users’ data. Such statistical diversity of users’ data not only leads to significant convergence delay but also poor model accuracy [13, 25, 47]. Several FL approaches have been proposed to address this problem by learning personalized models which capture both general and personal features of users [31, 28, 82, 15]. In [15], users share only lower layers of their models and leave upper layers user- specific to retain personal features. However, this approach assumes a pre-defined number of model layers shared among users, which is determined by empirical perception of user data distributions and their correlations. As a result, it suffers poor performance when the users’ data distributions are highly dynamic and time-varying [109]. The post-personalized FL approach is proposed to further fine-tune the global federated model on the nodes’ local data [54, 35]. However, the performance of such an approach is largely influenced by the accuracy of the global model. 4.2 Related Work 4.2.1 Deep learning for HAR. Deep learning has been applied to improve the accuracy of human activity recognition and eliminate the human efforts of handcrafted feature extractions [101, 20, 63]. However, since many daily events, like taking a shower, shopping, and biking, only occur occasionally, one user usually has limited and unbalanced training samples, which can cause overfitting in training deep learning models [39, 30]. Data augmentation techniques may address the issue by expanding the local datasets. However, they will fail to discover the new activities in HAR when users’ patterns of activities change largely. For instance, users without exercise habit start to do sports, which is not the situation that data augmentation works. As data 54 augmentation cannot produce the data for a previous unknown activity, the model trained with data augmentation will fail to discover the new activity. Training a global model for HAR at the server is proposed to reduce the effect of data sparsity [138]. However, centralized methods require uploading users’ sensing data to the cloud, leading to risk of privacy breach. 4.2.2 Federated learning (FL) [49, 132] is an emerging learning paradigm that only requires users to upload their model weights for collaborative learning, avoiding sharing user’s raw data during the learning pro- cess. A typical FL approach named FedAvg [49, 22] averages all models from users to learn a single global model, which proves to suffer significant accuracy degradation under hetero- geneous data distributions of users [137, 76]. Recently several personalized FL approaches are proposed to address this issue. Dinh et al. add a regularized term to the loss function of each user’s local model during the FL process to reduce the distance between the local and global models (average of all models) [31, 54, 35]. However, the accuracy of models learned in this approach can be largely influenced by the diversity of users. Moreover, other stud- ies [28, 82] tend to introduce a post-training procedure that personalizes the learned global model on each user’s local data. However, careful fine-tuning is required in this approach to balance the local and global models, which varies among different applications and hence is hard to generalize. Compared with existing personalized FL approaches, FedDL is able to learn users’ relationships during the FL process and utilize them to dynamically aggregate the local models in a layer-wise manner, which is applicable to different applications with highly diverse data distributions. 4.2.3 FL personalization via model sharing. In the FL approaches proposed in [26, 15], the lower layers between all users are shared, while several upper layers are user-specific. This design is motivated by the observation that the lower layers capture more general features, and hence can be shared across multiple 55 tasks, whereas the top layers capture features at a higher level of abstraction and hence are more user-specific [134]. The above methods have been extended and applied to multi- task deep learning [79, 86], where the goal is to learn multiple different models. However, these multi-task methods rely on a pre-defined structure for model sharing. As network architectures become deep and the user relationship becomes more complex in large-scale HAR applications, finding the right level of feature sharing across local models through hand- crafted network branches is impractical. Moreover, most multi-task deep learning methods [95, 81] are centralized and do not address the communication efficiency of the learning process. To reduce the communication overhead of FL (especially for transferring deep learning models), previous solutions mainly focused on the techniques for model quantization [67, 111] or model compression [40]. FedDL reduces the communication overhead through the dynamic layer-wise sharing scheme, as each model merging at the server only involves the parameters of users’ lower model layers, which is orthogonal to the model quantization or compression techniques. In a recent work [92], the authors show significant similarity exists among users in a number of real-world datasets, which is similar to our finding in Section 4.3. However, in [92], the clustering structure is formulated as part of the learning objective, and the local models are required to share all the layers in their multi-task learning framework. On the contrary, FedDL dynamically captures the users’ relationship while learning different models for users with a partial sharing structure, which leads to better model accuracy and lower communication overhead. 4.3 A Motivation Study In this section, we use an open real-world dataset, HARBox [9], to motivate the approach of FedDL in two aspects. First, there often exists underlying similarity amongst users’ patterns of activities due to their habits of behavior or environments [114, 136, 68, 92], which can be utilized to improve the learned model accuracy by facilitating collaborations among similar users. Second, the degree of similarity among users’ deep models reduces from 56 G2 G1 G1 Figure 4.1: The data of “typing” from the HARBox dataset after reducing dimension to 2D using PCA. There exists a clear group relationship among different subjects’ data. the bottom up [134, 78, 88], which suggests that we may exploit such similarity of models and aggregate them in an iterative, layer-wise manner, rather than aggregating whole models. We show that such an approach improves the model accuracy and reduces communication overhead between users and the server since only partial models need to be transmitted. The HARBox dataset is collected in real-world federated settings [92]. The 9-axis IMU data from 121 users’ smartphones is recorded when the users conduct five activities of daily life (ADL), including walking, hopping, phone calls, waving, and typing. To visualize the data distribution, we plot the data of “typing” from 6 users in the HARBox dataset after reducing the dimension of features to 2D using Principal Component Analysis. As shown in Fig. 4.1, there exists a clear grouping relationship among the 6 subjects’ data, with G1 = (n1 , n2 ) and G2 = (n3 , n4 , n5 , n6 ). We note that such similarity among users is also reported on other HAR datasets [14, 37, 53]. Our goal is to exploit the similarity among users’ data to personalize their models. A natural idea is to share some model layers between similar users [82, 15]. We now explore different model sharing schemes for each user group and their impact on the shared model accuracy. Fig. 4.3 shows three sharing schemes of deep learning models for a specific user 57 1.0 n1 1 0.59 0.011 0.2 0.23 0.21 0.8 n2 0.59 1 0.0089 0.11 -0.00069 0.12 n3 0.6 0.011 0.0089 1 0.71 0.55 0.4 n4 0.2 0.11 0.71 1 0.93 0.81 0.4 n5 0.23 -0.00069 0.55 0.93 1 0.84 0.2 n6 0.21 0.12 0.4 0.81 0.84 1 0.0 n1 n2 n3 n4 n5 n6 Figure 4.2: Correlation matrix of 6 users’ HARBOX data. Each number is the Pearson correlation coefficient (PCC), measuring the linear correlation between two users’ data. It is obvious there are two groups, (n1 , n2 ) and (n3 , n4 , n5 , n6 ). However, the users within each group are of different degrees of similarity. group. The “all-sharing” scheme shares all layers of the users’ models within each group. The K-sharing scheme shares only the lowest K layers of the users’ models, where the num- ber of shared lower layers K is usually empirically pre-set and fixed during the learning process. This baseline is similar to several existing FL personalization methods [15, 79]. In the experiments, we set K = 3 for the two groups. However, we will show that the K-sharing scheme cannot accurately capture the complicated relationship among users’ data distribu- tion. Some users are closely related enough to share more than K layers, while others with a large difference in their data distributions may benefit from sharing fewer than K layers. We visualize the relationship among data of 6 users from the HARBOX datasets through a correlation matrix by computing the Pearson correlation coefficients (PCC) between each pair of users’ data. As shown in Fig. 4.2, we see there are two groups, G1 = (n1 , n2 ) and G2 = (n3 , n4 , n5 , n6 ). However, the users within each group are of different degrees of similarity. For instance, n3 is less related with the other users in G2 (the statistically independent variables have correlation coefficients close to zero). This observation inspires 58 a dynamic sharing structure, where only users with similar data distributions should collab- orate in learning and users who are more closely related to each other will share more layers of their models. Based on this idea, we design a new scheme “layer-wise sharing” shown in Fig. 4.3, which is derived according to the correlation matrix of the six users (shown in Fig. 4.2) with closer-related users sharing more model layers. Specifically, n4 , n5 and n6 should share more layers than n3 , since they are more closely related to each other than n3 . Shown in Fig. 4.3, n4 , n5 and n6 share their lower 3 layers in this example, while n3 only shares the lower two layers with them. We also implement a baseline “global” method where all the six users share the same global model by averaging all their layers [83], and compare its performance on HAR with three sharing schemes (shown in Fig. 4.3): all-sharing, K-sharing, and the layer-wise sharing structure derived from the correlation matrix in Fig. 4.2. Fig. 4.4 presents the model accuracy performance of n3 when trained under different sharing schemes. We see that the model based on the layer-wise sharing structure gives the highest testing accuracy. Motivated by this result, we attempt to generate the layer-wise sharing structure from user relationships to improve the model accuracy. However, the correlation matrix of users’ data in Fig. 4.2 is global information that cannot be obtained on the server without accessing the data of users. Thus, we design a dynamic sharing scheme to learn the similarity of users’ model weights and generate the layer-wise model sharing structure accordingly during FL to improve the model accuracy. Specifically, FedDL learns the grouping relationship of the local models and then merges only the lower layers of models in a bottom-up layer-wise manner. In Section 4.5, we will elaborate on the proposed dynamic sharing scheme. In addition to the possible improvement in the training accuracy and efficiency of FL, another key advantage of our dynamic sharing scheme is that it reduces communication overhead as it is unnecessary for users to upload their user-specific layers to the server for model merging during the distributed learning process. 59 Two groups: (n1, n2), (n3, n4, n5, n6) (n1, n2) (n3, n4, n5, n6) n1 n2 n3 n4 n5 n6 n1 n2 n3 n4 n5 n6 output output output fc2 fc2 fc2 fc1 fc1 fc1 covn covn covn All sharing K-sharing (K=3) Layer-wise sharing Figure 4.3: Illustration of three sharing schemes for a group. Overall accuracy 1 0.9 0.8 0.7 Testing Accuracy 0.6 0.5 testing accuracy of n3 0.4 0.3 0.2 Global All sharing K-sharing Layer-wise sharing Figure 4.4: Illustration of the performance of federated learning under four sharing schemes. Layer-wise sharing scheme outperforms other sharing schemes in overall accuracy. 4.4 System Overview This section presents an overview of the proposed Federated Learning via Dyanmic Layer Sharing (FedDL). FedDL aims to enable accurate daily activity recognition through communication-efficient deep FL, based on the underlying affinities among users’ activity patterns. In this section, we first briefly introduce the application scenarios of FedDL, and then describe its system architecture. FedDL is designed for monitoring a wide range of daily activities using sensors built in wearables or deployed in natural living environments. Representative applications include 60 healthcare monitoring and smart home systems [32, 50]. These systems are usually designed to recognize a wide range of activities, like medicine taking, indoor/outdoor activities, and meal events, using ambient sensors and body-worn sensors [125, 106, 21]. However, since many events only occur occasionally, users tend to have limited and unbalanced training samples, which can cause overfitting in training deep learning models. Moreover, the sensing data for HAR is mostly privacy-sensitive and hence cannot be shared or uploaded. To address this issue, FedDL adopts the FL paradigm, utilizing a central server to collect local models and aggregate them, while avoiding the exposure of users’ raw data during the learning process. However, models learned by FL may deliver unsatisfactory performance on recognition of each user’s activities, due to the statistical diversity of users’ data. To improve the model accuracy, FedDL learns the underlying relationship among users dynamically and merges the local models partially based on the degree of similarity among users in a layer-wise manner. Since the users’ data distribution may change over time, FedDL will periodically update the layer-wise sharing structure and models. FedDL features a dynamic and hierarchical FL framework that improves accuracy and communication efficiency by capturing the intrinsic relationship among users and applying it to learn layer-wise personalized models for different users. Fig. 4.5 depicts the hierarchical training procedure of FedDL. First of all, the local model of each user is optionally initialized randomly or from a pre-trained model. Then FedDL performs model grouping and model merging in a bottom-up layer-wise manner. Specifically, the server groups users based on the model affinities obtained from models’ testing results on a common sample set using Kullback–Leibler divergence (KLD) (shown in Fig. 4.6(3.1)). It then performs model-merging to obtain stable models with the lower layers shared within each group. The merging process is implemented by calculating a weighted average of local models’ parameters at the server over multiple rounds. Each model merging round involves 4 steps, as shown in Fig. 4.6. Users perform multiple epochs of local training and then upload local models to the server. The server computes the weighted average of local models based on grouping results. It 61 Random initialization and local training at each user end. Server Init: n1 n2 n3 n4 n5 n6 (0) multi-round model merging: training for Grouping: learn the relationship multiple global rounds to get steady models among users and group the first layer. under the current sharing structure. Model aggregation (1 layer) … … (1.1) (1.2) Grouping: learn the relationship among multi-round model merging Model aggregation users and group the second layer. (2 layers) … (2.1) (2.2) Figure 4.5: Illustration of the dynamic and hierarchical federated learning framework of FedDL when learning 3-layer models for 6 users. then generates further personalized models through the weighted average of local models and their corresponding averaged models. Finally, the server transmits personalized models back to users for local training. This model grouping and model merging process repeats till reaching the output layer (i.e., the top layer), as FedDL leaves the output layer user-specific without sharing between users. It is challenging to learn the intrinsic relationship among users without accessing the users’ data. FedDL learns the relationship among users based on their local models, and generates the sharing structure by grouping the lower model layers of closely related users, and keeps exploring the grouping relationship layer by layer within each group from the bottom up till reaching the top layer. Section 4.5.1 describes the model affinity-based group- ing in detail. Moreover, based on the iteratively learned sharing structure, FedDL performs layer-wise model merging after each model grouping process to obtain stable models un- 62 n1 n2 n3 n4 n5 n6 : sample set, Ssimp … … L4 : testing result of local 3.2 model on Ssimp L3 L2 3.1 L1 Grouping … or 4 or or 2 1 User 1 User 2 …… User N Figure 4.6: The system architecture of FedDL. Each grouping / model-merging round mainly consists of 4 steps. der the sharing structure. Section 4.5.2 presents the design of intra-group layer-wise model merging. FedDL generates shared models in a bottom-up layer-wise manner using a greedy algorithm. Section 4.5.3 describes the detail of bottom-up layer-wise model aggregation. The layer-wise model aggregation of FedDL improves the model accuracy through dy- namic sharing within groups and reduces communication overhead by only transmitting the merged layers rather than entire models. As shown in Fig. 4.6, except for the grouping itera- tion when whole local models are uploaded to the server, most of the global communication involves only their lower layers, which significantly reduces the communication overhead during the FL process. 63 4.5 Dynamic Layer-wise Federated deep learning framework FedDL is a federated learning framework that learns personalized deep models for users with limited or unbalanced data in HAR applications. Specifically, FedDL learns the re- lationship among users, generates the dynamic sharing structure for models’ lower layers based on the user relationship, and merges the models according to the sharing structure iteratively. In Section 4.5.1 and 4.5.2, we presents how to group users using their deep mod- els and how to dynamically merge different layers of models for users in the same group, respectively. In Section 4.5.3, we describe the procedure of the bottom-up layer-wise model aggregation. Finally, we introduce the design on communication efficiency in Section 4.5.4. 4.5.1 Model Affinity-based User Grouping FedDL learns the underlying relationship of users based on their model affinities. Specifically, FedDL measures model affinities using Kullback–Leibler divergence (KLD), which estimates how one probability distribution is different from the reference one and is recently used for knowledge distillation of deep learning models [12, 10]. As demonstrated in [48, 34], element- wise weight distances (e.g., L1/L2 norms) have severe limitations in modeling affinities of deep models since the neurons of each layer in hierarchical models are permutable. Besides, it is computational inefficient to measure the norm distance of high-dimensional weights for complex hierarchical models. Therefore, instead of directly analyzing the weight matrices, FedDL tests all local models on a reference distribution in the form of a common sample set, and then measures the model affinities using the KLD of the different model outputs, as shown in Fig. 4.7. Specifically, the KLD for a pair of models, wp and wq , is calculated as follows: N 1 X1 δp,i δq,i Dkl (wp , wq ) = (δp,i log + δq,i log ) (4.1) N 2 δref,i δref,i i=1 δp,i = δ(wp , xi ) (4.2) 1 δref,i = [δ(wp , xi ) + δ(wq , xi )] (4.3) 2 64 Groups = {l1 : [(n1, . . . , n6)], Groups = {l1 : [(n1, . . . , n6)]} l2 : [(n1, n2), (n3, n4, n5, n6)]} 3 Sharing structure Affinity matrix n1,n2 KLD n3, n4, … … n5, n6 1 2 User relationship Figure 4.7: The procedure of model-affinity-based grouping. It consists of three steps: 1. Calculate the affinity matrix; 2. Group users based on the affinity matrix and previous grouping results; 3. Update the layer-wise sharing structure. where δq,i denotes the softmax outputs of the model wq on the ith record, xi , of the common sample set. δref,i is the reference distribution. We take the average of the two models’ outputs as the reference distribution and measure how these two models are different from the reference, where a lower Dkl value indicates a higher model affinity. Instead of directly using the KLD of P over Q, we adopt this symmetric metric for similarity measurement, which is more suitable for user grouping. In the next, FedDL performs grouping at the l-th layer based on the model affinity and previous grouping results. Specifically, FedDL maintains an affinity matrix, Ma with a(p,q) = Dkl (wp , wq ), and keeps the grouping results of lower l layers in the dictionary, Groups, to represent the dynamic sharing structure, as follows: Groups = { 1 : [G1,1 , G1,2 , ..., G1,k1 ] 2 : [G2,1 , G2,2 , ..., G2,k2 ] ... l : [Gl,1 , Gl,2 , ..., Gl,k ] } l where Groups keeps the layer index as the key and a list of groups at this layer as the value, respectively. Gl,i denotes the i-th group for the aggregation of the l-th layer. kl is the 65 number of groups at the l-th layer. With the affinity matrix, Ma , and previous grouping results Groups, FedDL groups the users at the server as shown in Fig. 4.7. Specifically, the l-th round of grouping operation only happens within groups that are obtained from the previous grouping round (Gl−1,k ). To group users within Gl−1,k , FedDL checks the affinity between each pair of users, i and j (i, j ∈ Gl−1,k ), and compares it with the threshold, θG , to decide if their models are related enough to be grouped together. We take the average of the affinities between users in Gl−1,k as the adaptive threshold θG for the grouping within this group. It is noted that two less- related users may be grouped together as long as they are closely related to the same user. To differentiate the degree that users are related to their group, we consider not only the group members (mh ) but also their corresponding frequency (f reqmh ) as shown in Equation 4.4. f reqmh is the times the user being accessed during the procedure of grouping. A higher f reqmh indicates that the group member, mh , is closely related to more users within the group. This information will be utilized to improve the accuracy of the model merging. Gl,i = [(m1 , f reqm1 ), (m2 , f reqm2 ), ...(mh , f reqmh )]. (4.4) Based on the group relationship among users, FedDL updates the layer-wise sharing structure by sharing one upper layer of users’ models within each group, as shown in Fig. 4.7. FedDL performs the grouping operation periodically till the output layer. Moreover, we can stop the grouping operation in FedDL earlier, when the number of groups at a layer equals the number of users, i.e. kl = N . 4.5.2 Intra-group Layer-wise Model Merging Based on the grouping results, Groups, FedDL merges the local models in a layer-wise manner. Fig. 4.8 illustrates the layer-wise model sharing at the server. Based on the grouping results of the lower 3 layers, the local models from users in the same group are merged layer 66 Local models Groups = n1 n2 n3 n4 n5 n6 {l3 : [ (n1, n2), (n3, n4), (n5, n6) ] Averaged models l2 : [ (n1, n2), (n3, n4, n5, n6) ] … … L4 l1 : [ (n1, n2, n3, n4, n5, n6) ] } (n1, n2) (n3, n4) (n5, n6) L3 L3 Layer-wise merge L2 L2 L1 L1 Communication between users and the server Figure 4.8: Illustration of the layer-wise model merging based on the grouping results, Groups. Only lower 3 layers of models are transferred between six users and the server for model merging. by layer, as follows: X WG = µi W i,l (4.5) l,k i∈Gl,k f reqi µi = P (4.6) j∈G f reqj l,k where W G is the weights shared by the users in Gl,k , the k-th group at l-th layer. W G l,k l,k is a weighted average of all the group members’ layer weights, W i,l . The weighted average coefficient, µi , of each group member is calculated based on the f reqi , which indicates how close the member is tied to the group. As a result, the models with higher f req will contribute more to the group model. After merging the layers of the models into shared models, the server further personalizes the shared models for each user by aligning each local model with its corresponding group model as follows, W ′i = (1 − λi )W i + λi W G (4.7) l,k µi λi = min(1, ) (4.8) 1/sizeof (Gl,k ) where i ∈ Gl,k . λi indicates, from the user’s stand, how closely local model W i is related to the group model, W G . This alignment makes the models trained using FedDL robust l,k to boundary cases, where the least related users are still included in a group (i.e., with the smallest µi ). These users are likely to become a separate group in another training process. 67 For instance, at a certain layer, three users are grouped as {1 : 0.5, 2 : 0.49, 3 : 0.01}, and the training process produces the grouping result {1 : 0.5, 2 : 0.5} and {3 : 1} at the same layer. Without alignment, the models of user 3 obtained from the two training processes are significantly different with W ′3 = W G and W ′3 = W 3 respectively. However, after the alignment between the shared models and users’ local models, the models of user 3 under these two grouping results become similar with W ′3 = 0.03W G + 0.97W 3 and W ′3 = W 3 respectively. As illustrated in Fig. 4.8, in this model-merging round, we get three shared models: [ W G1,1 , W G2,1 , W G3,1 ], [ W G1,1 , W G2,2 , W G3,2 ] and [ W G1,1 , W G2,2 , W G3,3 ] shared within the three groups, (n1, n2), (n3, n4) and (n5, n6) respectively. Finally, the three shared models are aligned with their corresponding users’ local models. For example, the second shared model will be aligned with the local models of n3 and n4 and sent to them, respectively. Moreover, only the lower layers of models are necessarily transferred between server and users during the model-merging iterations, which will significantly reduce the overall communication overhead during the FL process. 4.5.3 Bottom-up Layer-wise Model Aggregation At the core of FedDL is the multi-round greedy model aggregation in a bottom-up layer-wise fashion. Consider a situation where there are N users. Initially, all the users start with the same neural network model and initialize it randomly or from a pre-trained model. After users perform multiple epochs (denoted as R) of local updates, the server will receive the latest N local models from all the users. The model aggregation operation of the server starts from the bottom layer, l1 . It will first group the N branches into k1 groups where k1 ≤ N . After that, FedDL will greedily perform the bottom-up model aggregation within their groups. We note that finding the optimal sharing structure is combinatorial prohibitive. A brute- force method would need to train and test all the ((CN N )L−1 possible structures for finding 68 the optimal aggregation scheme for N users with L-layer models. Our approach is more efficient since it only takes O(N log N ∗ L) time. For each round of model grouping, it takes O(N log N ) time, and takes a total of O(N log N ) ∗ (L − 1) time at the worse case to form the sharing structure. For L-layer deep models, FedDL will perform L rounds of user grouping with each group- ing round followed by intvl rounds of model merging, as shown in Fig.4.5. At each grouping round, FedDL learns the affinity relationship of local models and groups one upper layer of the users’ models into groups. After that, FedDL performs multiple rounds of model merging within each group according to the current sharing structure. It is noted that the interval between grouping rounds, intvl, decreases with a decay rate, λ. When more layers of local models are merged according to their layer-wise similarity, the divergence among local mod- els reduces gradually. Therefore it takes fewer global training rounds for the shared models to converge [78, 88]. Specifically, the procedure of model aggregation for the lower l layers is as follow: 1. Grouping round: the users send their complete models to the server. The server groups the users based on current model affinities and the grouping results from l − 1th iteration, Groups[l − 1]. It was noted that the grouping operation at lth iteration only happens within each group obtained from (l − 1)th iteration, i.e., the users being separated into different groups in the first l − 1 rounds no longer share their upper layers. Moreover, the grouping result will be added to Groups, where the grouping results of all the lower layers are kept for the model merging process. 2. Model-merging round: After the model grouping, FedDL performs intvl rounds of layer-wise model merging within each group. For each model-merging round, the clients perform R epochs of local training and then upload the lower l layers of their lo- cal models to the server. Upon receiving all local models, the server weighted averages the lower l layers of local models’ within each group based on the grouping structure, 69 Groups, and then aligns each local model with its corresponding group model to gen- erate the shared model for each user. At the end, the server sends the shared models to their corresponding clients. It is noted that, for global communication, only the lower l layers of models are transferred between users and the server, which makes the FedDL very communication-efficient. The grouping operation stops at the layer before the output layer. As a result, the higher layers of each model will be user-specific, while the lower shared layers will ensure generality across similar users. Moreover, the grouping structure and the models of FedDL will be updated periodically with continuously collected data. Fig. 4.5 shows the procedure of learning a 3-layer model for 6 users. As shown in Fig. 4.5(1.1), after grouping based on the initial local training models, all the users are grouped together for model aggregation. As a result, the first layers of their models are merged as the group model. After that, FedDL performs the model aggregation operation for the lower two layers within the groups obtained from the previous round, i.e., {n1 6 }, as shown in Fig. 4.5(2.1) and (2.2). The server groups the users into two groups, {n1 , n2 } and {n3 , n4 , n5 , n6 } and updates the sharing structure with parameters of the second layer shared within each group. The dynamic sharing structure is finalized after the 2-round model aggregation, and FedDL keeps the output layer user-specific. 4.5.4 Reducing Communication Overhead In typical FL systems [83, 115, 22, 90], a large number of global communication rounds between users and the server is required, which can be the bottleneck of the learning process. FedDL takes advantage of the dynamic layer-wise sharing scheme to improve communication performance. Specifically, FedDL reduces the number of parameters that each user needs to upload to the server as well as the number of global training rounds. As shown in Fig. 4.5, after learning the grouping results for the l-th layer, only the lower l layers of local models need to be merged at the server for each global training. Thus, FedDL 70 uploads the lower l layers of local models for the global model merging where l increases from 1 to L − 1 during the training process, largely reducing the amount of data transferred. Besides, FedDL further reduces the communication overhead by stopping the upload of each model’s user-specific layers whenever possible. As shown in the left of Fig. 4.3, at the third layer, n1 , n2 and n3 no longer belong to any group, i.e. their layers are user-specific. After obtaining the fourth layer’s grouping results, unlike n4 − n6 , n1 , n2 and n3 need to upload only the lower 2 layers to the server during all the following model merging rounds. Moreover, the models trained using FedDL converge fast even with a small number of local training rounds, which is detailed in Section 4.6. Thus, FedDL can use a small number of global training rounds to reduce communication costs. As shown in Section 4.6, FedDL- based models can always converge within 10 global rounds with different settings of local rounds R, while other FL methods may take more than 30 rounds to converge. Therefore FedDL largely reduces the communication overhead. 4.6 Evaluation In this section, we evaluate the performance of FedDL from three aspects, including the performance on different datasets, the scalability of the system, and its performance with different local computation rounds. For each evaluation, we compare the performance of FedDL with four baselines as follows: 1. FedAvg [83]: the standard FL method, where all users share one global model. 2. FedPer [15]: a federated deep learning approach, where all the users share their lower K layers and leave their upper layers user-specific. This approach adopts a K-sharing scheme and pre-sets the value of K empirically, as mentioned in Section 4.3. In our experiments, we set K to be 3. 3. pFedMe [31]: an algorithm for personalized FL using the distance between the global model and the user’s local model as the user’s regularized loss functions. The global 71 Table 4.1: Five HAR datasets (UWB, Depth Images, HARBOX-IMU, IMU and LiDAR). Application Tasks Sensor Data Number Number Dimen- of sub- of sion jects records per subject Human move- with/without human UWB 50x1 8 ∼ 80 ment detection movement Hand Gesture good/ok/victory/stop/fist Depth 36x36x1 9 ∼ 400 Recognition camera Activity of Daily walking/hopping/phone IMU 100x9x1 121 ∼ 300 Life (ADL) recog- calls/waving/typing nition using IMU Human Activity walking-upstair/ IMU 128x3x2 30 ∼ 300 Recognition using walking-downstair/ IMU walking/sitting /stand- ing/laying. Human Activity walking/bending/phone Livox 60x30x1 10 ∼ 600 Recognition using calls/sitting/standing/ Horizon LiDAR checking watch. LiDAR model is an average aggregation of all the local models at the server. 4. Local training: the model learned from local data at each user. 4.6.1 Datasets In our evaluation, we use one self-collected dataset and four public real-world datasets (Table 4.1) for deep learning. We use the self-collected LiDAR data for two main reasons. First, most of the existing HAR datasets lack detailed information about subjects, such as gender, height, and weight. Such information is critical to understand the underlying similarity of users’ data and hence is important to validate the design of FedDL. Second, LiDAR has a long detection distance, which facilitates recognizing whole-body movements, like bending and falling. At the same time, compared with RGB images, Lidar data is more spare presents a major challenge in achieving high model accuracy, which motivates the adoption of FL to enable collaborative learning from multiple users. 72 Moreover, we choose additional four public datasets for evaluation as they are collected in real-world settings with significant dynamics. Besides, these datasets are collected from various HAR tasks based on different sensors, like depth sensor and IMU (inertial measure- ment units). Moreover, some datasets are of large scale, which can be utilized to evaluate systems’ scalability. 1. Human Activity Recognition using LiDAR1 : We record the point cloud data of 6 types of human activities (walking/sitting/standing/bending/checking watch/phone calls) conducted by 10 subjects using a Livox Horizon LiDAR [77] in an indoor en- vironment. The LiDAR collects point clouds at 10Hz, and each activity of a subject lasts for 2 minutes. Fig. 4.9 shows the preprocessing steps for the collected point clouds, which are first proposed in [18, 87]. First, we conduct the cylinder projection to project the 3D point cloud to a range image of 120 × 30 pixels, where each pixel’s grayscale represents the range value (the whiter, the farther). Then we average every 25 consecutive range images in sliding windows of 2.5 seconds and 50% overlap to form each data record. After that, the ROIs (region of interest) of each image are retrieved, and then we down-sample the original image to 60 × 30 pixels and normalize the depth value to 0-1. This dataset has a large number of data records (6560 records in total), and each data record’s dimension is relatively high, thus increasing the difficulty of activity recognition. 2. Human Movement Detection using UWB [9]: To detect if there were human movements in a specific area, two UWB (Ultra Wide Band) nodes are deployed 3m away from each other in 3 different environments (i.e., parking lot, corridor, room) with or without a person walking between them. This dataset is collected using 8 subjects, with each one walking randomly in the area for 10 minutes. The two-way ranging at 5Hz is captured and labeled manually. Then the data is sampled in sliding windows 1 The data collection was approved by IRB of the authors’ institution. 73 of 10 seconds and 50% overlap (50 readings/window) to form each data record (50 × 1 dimensions). 3. Hand Gesture Recognition using Depth Camera [9]: Five types of gestures (good/ok/victory/stop/fist) are conducted by 8 subjects using a depth camera. The region of interest of the depth image is retrieved, and then we down-sample the original image to 40 × 40 pixels and normalize the depth value to 0-1. 4. Activity of Daily Life (ADL) Recognition using Smartphones [9]: The “HAR- BOX” App is developed to collect 9-axis IMU (inertial measurement units) data from users’ smartphones when the user conducts five types of ADL, including walking, hop- ping, phone calls, waving and typing. Labeled IMU data from 121 users is collected in total. The data from each user is filtered and then sliced into multiple frames (100 × 9 dimensions) using a window of 2 seconds and 50% overlap. 5. Human Activity Recognition using Smartphones 2 : this online dataset is col- lected from 30 subjects performing six activities (walking, walking upstairs, walk- ing downstairs, sitting, standing, lying) while carrying a waist-mounted smartphone (Samsung Galaxy S II) with embedded IMU. Specifically, the 3-axial linear acceleration and 3-axial angular velocity are captured at a constant rate of 50Hz and are labeled manually through video records. The 6-dimension sensor signals were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 seconds and 50% overlap (128 readings/window) to form each data frame with a size of 128 × 6. 4.6.2 Implementation We design and implement a FedDL phototype on Amazon Elastic Compute Cloud (Amazon EC2). This EC2 instance is built on the Ubuntu platform and has 96 virtual CPUs (3.1 2 https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones 74 Range image Point clouds cylinder average ROI & projection combine Downsample … Walk 25 range images (2.5s) image: 60x30x1 Figure 4.9: The preprocessing of LiDAR data for the recognition of activities, including walking, sitting, standing, bending, checking the watch and phone calls. GHz) and 768 GB memory. We build a server on the instance and run each user end on one walk phone call walk phone call walk phone call CPU to simulate the FL. The communication between the server and users is implemented locally using sockets. The system is implemented in Python3. We adopt randomly initialized convolutional neural networks (CNN) for the human activ- ity recognition tasks of the five datasets. The CNN network is composed of 2 convolutional layers, 2 full-connect layers, and one softmax output layer. It uses mini-batch Stochastic Gradient Descent (SGD) for optimization. For the data samples of each subject, we use 75% of the local data for model training, while the rest 25% is for model testing. We set the initial learning rate to be 0.01 with periodical decay and the batch size to be 32. Although with the same depth, the CNN models for different datasets will have various network structures (e.g., input dimension, kernel size, stride, and padding) depending on the data characteristics and the tasks. 4.6.3 Validation on LiDAR Dataset In this section, we validate the design of FedDL on the LiDAR dataset. Specifically, we compare the performance of FedDL with four baselines, FedAvg, FedPer, pFedMe, and local training. We set the local communication rounds (R) to be 30 and the global computation rounds (T ) to be 40. We involve totally 10 users for the FL on the LiDAR dataset. Fig. 4.10 shows the overall accuracy and the communication overhead of different ap- 75 Overall accuracy & Communication overhead 10 6 12 1 0.9 10 0.8 Total number of bytes (MB) 8 Testing accuracy 0.7 0.6 6 0.5 4 0.4 0.3 2 0.2 0.1 0 FedAvg FedPer pFedMe local FedDL Figure 4.10: Comparison of different approaches’ performance on the LiDAR dataset. FedDL outperforms other approaches in accuracy performance by more than 15%, and save about 42.6% communication overhead compared with approaches that share the whole models (Fedavg and pFedMe). fc2 n5 n1, n2 n3 n9, n10 n4, n6 n7, n8 n1, n2, n5 n3 n9, n10 n4, n6 n7, n8 fc1 n3, n9, n10 n4, n6, n7, conv2 n1, n2, n5 n8 conv1 n1 . . . n10 Figure 4.11: The sharing structure for 10 users, which is dynamically learned by FedDL. n2 , and n1 share more layers as they have similar behavior habits and biological features. 76 proaches on the LiDAR dataset. We evaluate the overall accuracy by observing the distribu- tion of testing accuracy after 30 rounds of global training for all the users. From Fig. 4.10, we can see that FedDL achieves the best accuracy performance with meanacc = 0.98 and the interquartile range IQR = 0.025. Compared with other methods, FedDL improves the mean testing accuracy by more than 15% and reduces the variation significantly by more than 94%, suggesting that FedDL can converge fast to a steady and accurate model for most users. In contrast, FedAvg and FedPer yield larger testing accuracy variations as models of some users barely converge even after 30 rounds of global training. FedDL achieves a significantly lower variation as it facilitates the collaboration among users with similar data distributions, which mitigates the noise/outliers from other users, improving the convergence rate and accuracy. Fig. 4.10 also compares the communication overhead of FedDL with the other three FL methods. We measure the communication overhead (Qcomm ) by calculating the total amount of data transferred between the server and users during the training pro- cedure. It is shown that FedDL saves about 42.6% communication overhead compared with FedAvg and pFedMe, which share the entire models during FL. To better understand the above results, we take a closer look at the sharing structure dynamically learned by FedDL (shown in Fig. 4.11). From the figure, we can see, n3 , n9 , and n10 share the lower two layers, which is consistent with the fact that they are the only three subjects using the left hand to make phone calls. Among these three subjects, n9 and n10 are females (n9 : heights 1.66m, weights 50kg; n10 : heights 1.63m, weights 48kg), while n3 is a male with the height 1.78m and weight 66kg. It is shown that n9 shares more layers with n10 than with n3 , which can be attributed to the distinct effects of the body shapes on the collected LiDAR data. The effect of biological features on the LiDAR data is also reflected on n5 . Users n5 , n2 , and n1 use both the left and right hands to answer phone calls, and they are all males. However, n5 (height 1.93m, weights 95kg) is much taller and heavier than the other two subjects. In the sharing structure, n5 shares the lower 3 layers with n2 and n1 , while n2 and n1 keep sharing more upper layers. 77 Overall accuracy 1.2 FedAvg FedPer pFedMe local FedDL 1 Testing Accuracy 0.8 0.6 0.4 0.2 UWB HARBOX DepthImages HARIMU (a) 7 Communication overhead 10 6 Total number of bytes (MB) FedAvg 5 FedPer pFedMe 4 FedDL 3 2 1 0 UWB HARBOX DepthImages HARIMU (b) Figure 4.12: Comparison of different approaches’ performance on four datasets, UWB, HAR- BOX, Depth Images and IMU. FedDL outperforms other approaches in accuracy performance and has a lower communication overhead than approaches that share the whole models (Fe- davg and pFedMe). The above results confirm that FedDL can capture the different degrees of similarity among users’ data due to behavior habits or biological features, and can effectively apply them to layer-wise model merging to improve model accuracy and communication efficiency. 4.6.4 Performance on Different Datasets In this section, we evaluate the performance of FedDL on different datasets, UWB, HARBOX- IMU, depth images, and IMU (Table .4.1). Specifically, for each dataset, we compare the overall accuracy and communication overhead of FedDL with four baselines, FedAvg, Fed- Per, pFedMe, and local training. We fix the local communication rounds (R) to be 30 and the global computation rounds (T ) to be 40 for all the approaches. Also, we involve 8 users for the FL on each dataset, where the number of data samples varies for different users to simulate an unbalance data setting in FL. It is noted that we evaluate the scalability of FedDL on HARBOX dataset involving up to 90 users in Section 4.6.4. Overall accuracy. Fig. 4.12(a) compares the testing accuracy of different approaches for 78 the four datasets. It is shown that compared with four baselines, FedDL achieves the best and stable accuracy performance on the four datasets with a high mean value (meanacc > 90%) and IQR < 0.2. Specifically, compared with local training (0.05 < IQR < 0.4, 75% < meanacc < 85% ), FedDL, FedPer, and pFedMe improve the accuracy of the model while FedAvg fails, as the data distributions of users are too heterogeneous to learn a good global model. Specifically for the UWB dataset, FedAvg barely converges within 40 global rounds. FedPer and pFedMe also fail to improve the accuracy as their model aggregation schemes are oblivious to the underlying relationship among users. Moreover, FedDL outperforms them, as FedDL can capture the intrinsic relationship among users dynamically and aggregate users’ models within each group in a layer-wise manner. Communication overhead. Fig. 4.12(b) compares the communication overhead of different methods for the four datasets. In our experiments, we set the number of global rounds T = 40. The communication overhead measures the total amount of the parameters transferred between users and the server during the whole FL process, which is determined by the sharing scheme and the size of the CNN model. From the figure, we can see FedDL is able to maintain a relatively low communication overhead, which suggests our dynamic bottom-up layer-wise model aggregation strategy improves the communication efficiency. Specifically, FedDL and FedPer have a relatively low communication cost for all the datasets, as they only share part of model layers among users. FedPer combines the lower 3 layes of local models and FedDL merges models according to layer-wise grouping results. In particular, FedDL outperforms FedPer for UWB and depth images datasets. The reason is that the data distributions of users are so heterogeneous in these two datasets that most of the users’ upper layers are user-specific in FedDL’s grouping results, i.e., they share and upload less than 3 lower layers. 79 Overall accuracy 1.2 FedAvg FedPer pFedMe FedDL 1 Testing Accuracy 0.8 0.6 0.4 0.2 R=20 R=40 R=60 (a) 6 Communication overhead 10 4 Total number of bytes (MB) FedAvg FedPer 3 pFedMe FedDL 2 1 0 R=20 R=40 R=60 (b) Figure 4.13: Comparison of different approaches’ performance on Depth images datasets with different number of local computation rounds (R = 20, 40, 60). All the methods benefits from a larger R, and FedDL maintains the best accuracy and communication performance with different numbers of R. 4.6.5 Scalability To evaluate the scalability of FedDL, we compare the performance of different approaches (FedDL, FedAvg, FedPer, pFedMe) when training on the data of 30, 60, 90 users from the HARBOX dataset. 4.6.5.1 Overall accuracy . Fig. 4.14 shows the experiment results with different number of users. From Fig. 4.14(a), It is obvious that the overall accuracy of FedAvg decreases with the increase of the number of users, as the heterogeneity of users’ data becomes higher. In this case, FedAvg performs the worst among all approaches and can not even converge within 40 global rounds when 90 users are involved. Besides, FedDL outperforms FedAvg, FedPer and pFedMe under different settings as FedDL can capture the relationship among users and dynamically merge user’ models within each group in a layer-wise manner. On the contrary, FedPer adopts a static sharing scheme that shares the lower 3 layers of models for all the users, which 80 Overall accuracy 1.2 FedAvg FedPer pFedMe FedDL 1 Testing Accuracy 0.8 0.6 0.4 0.2 30 nodes 60 nodes 90 nodes (a) 8 Communication overhead 10 4 Total number of bytes (MB) FedAvg FedPer 3 pFedMe FedDL 2 1 0 30 nodes 60 nodes 90 nodes (b) Figure 4.14: Comparison of different approaches’ performance on 30-, 60- and 90-user HAR- BOX datasets. FedDL outperforms FedAvg, FedPer and pFedMe in both overall accuracy and communication overhead. fails to capture the complicated user relationship, resulting in worse performance when the number of involved users is large. pFedMe aligns each user’s local model with the averaged global models, which makes the overall accuracy partially dependent on the global model’s performance, which is hence largely influenced by users’ data heterogeneity. Moreover, the accuracy of FedDL is more stable (with small IQRs) as the number of users increases, which shows the advantage of its group-based dynamic model aggregation scheme. 4.6.5.2 Communication overhead . Fig. 4.14(b) compares the communication overhead of different approaches with the data of 30, 60, 90 users from the HARBOX datasets. We can see that the communication over- head of FedAvg, pFedMe and FedPer increases dramatically in proportion to the number of users involved in the training procedure. However, FedDL always maintains a relatively low communication overhead, as FedDL can stop uploading the parameters of models’ upper layers earlier when the users’ data is significantly heterogeneous. The above results suggest that FedDL exhibits satisfactory scalability by maintaining 81 relatively high accuracy and low communication overhead and performs better on large-scale datasets. 4.6.6 Impact of Local Computation Rounds The number of local computation rounds, R, is a critical hyperparameter in FL. The setting of R shows a trade-off between the computation and communication: a larger R requires more computations at local devices of users, while a smaller R means more global communication rounds to converge. To understand how R affects the convergence of different FL methods, we conduct the experiments on an 8-user Depth Images dataset with (R = 20, T = 30), (R = 40, T = 15) and (R = 60, T = 10, respectively. It is noted that, for all the baselines, we only change the value of R with the model structure and all the other settings of the models stay the same. Specifically, the initial learning rate is set to be 0.01 with periodic decay and the batch size is set to be 32. Fig. 4.13 illustrates the performance of different methods with different settings of local computation rounds R. It shows that a larger value of R will improve the performance on the accuracy and communication overhead of both the personalized and the global models. Fig. 4.15 visualizes the change of training loss and testing accuracy over global rounds with different settings of R for a specific user. We can see that all the methods have improvements in convergence when R is larger. For example, FedAvg takes a much smaller number of global communication rounds to converge (reduce from more than 30 to 10 rounds) when R increases from 20 to 40. However, FedDL will always converge fastest (with the smallest number of global rounds), especially when the local computation round R is set small (e.g., R=20). 4.7 Discussion and Future Work 4.7.1 Convergence of FedDL . In our experiments (discussed in Section 4.6.3-4.6.6), FedDL is demonstrated to converge on the five real-world HAR datasets. In particular, it converges fast even when training 82 2 2 2 FedAvg FedPer pFedMe 1.5 1.5 1.5 FedDL Training Loss Training Loss Training Loss 1 1 1 0.5 0.5 0.5 0 0 0 0 10 20 30 0 5 10 15 0 5 10 Training Global Rounds Rounds TrainingRounds Global Rounds TrainingRounds Global Rounds 1 1 1 0.8 0.8 0.8 Testing Accuracy Testing Accuracy Testing Accuracy 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 0.2 0 0 0 0 10 20 30 0 5 10 15 0 5 10 Global Rounds Training Rounds Global TrainingRounds Rounds Global TrainingRounds Rounds R=20 R=40 R=60 Figure 4.15: The training loss and testing accuracy of a specific user’s model changing over global rounds with different settings of R. Larger R improves convergence, especially for FedAvg. However, FedDL will always converge fastest with different local computation rounds R. on 90 users with a limited number of local rounds. We now provide some insights into the convergence guarantee of FedDL. Firstly, FedDL groups users with similar data distributions, which mitigates the impact of noise/outliers from other users, thus improving the convergence performance. Second, the intra-group model merging entails a weighted average of the local models (see section 4.5.2), where the weights quantify how closely each local model is to the group model. In FedDL, the weights of users whose models lie at the border or intersection of multiple groups are relatively small, and hence the models will contribute less to the intra-group model merging. Thus, such a design mitigates the impact of dynamic grouping on model convergence. 83 4.7.2 Scalability of FedDL . FedDL is generally more scalable as a clustering-based approach since the number of user groups (who share some degree of similarity among their data) may not increase drastically with the number of users. For the scenarios where users arrive dynamically, FedDL merges the new users in the sharing structure instead of retraining the sharing structure for all the users for scratch, which substantially reduces the compute and communication overhead. Specifically, FedDL considers each group as one user and learns the new users’ relationship with existing groups to update the sharing structure by merging the new users into different groups. 4.7.3 Future work . Firstly, the local models transmitted in FedDL may reveal certain information about user activities [56, 131]. In the future, we will integrate additional mechanisms, like differential privacy [129], in FedDL to provide stronger privacy protection. However, such privacy- preserving mechanisms can have a complicated impact on the overall performance. We will conduct a comprehensive study of privacy-preserving techniques and the trade-off between the privacy and performance of FedDL. Besides, we will extend FedDL to other applications where the users’ data has a high level of dynamics while exhibiting significant similarity. For example, FedDL can be applied to applications like health monitoring [11] and road traffic prediction [85], where the data of nodes (e.g., users or cars) share spatial-temporal similarity due to spatial proximity, models of devices/cars, user routines, etc. Finally, as the real- world HAR applications may involve high-dimension data (e.g., images or videos), deeper or wider neural network models are required to avoid underfitting. We will evaluate how the model complexity, including the depth and width of the model, affects the convergence and accuracy of FedDL. 84 4.8 Conclusion of Study This thesis proposes FedDL, a novel federated deep learning system for HAR that cap- tures the similarity of users’ models and generates personalized user models through dynamic layer sharing in an iterative layer-wise manner. We evaluate the performance of FedDL for the recognition of various activities on five datasets collected from 178 users in total. The experimental results show that FedDL outperforms the other methods in terms of overall ac- curacy (e.g., by 24.05%, 16.67%, 19.51%, and more than 30.67%, to local training, pFedMe, FedPer, and FedAvg respectively). Moreover, FedDL saves more than 50% communication overhead when there is a large number of users and achieves a high convergence rate even with a small number of local computation rounds. As future work, we will deploy FedDL on edge devices, like smartphones, to evaluate the system overhead of FedDL. Moreover, we will also explore the application scenarios with intrinsic statistical heterogeneity beyond HAR by leveraging domain adaptation techniques. 85 CHAPTER 5 CONCLUSION This thesis introduce three studies for CPS to first address the inaccuracy of the sensing data due to the noise and the dynamics of the context, then include human in the loop of smart systems, and finally, design distributed learning platform for large-scale applications. First of all, the accuracy of cardiac signals is extremely essential for daily health moni- toring. FitBeat enables accurate heart rate tracking on wrist-type wearables during inten- sive exercises. It integrates and augments standard filter and spectral analysis tool, which achieves comparable accuracy while significantly reducing computational overhead. Experi- mental results involving 10 subjects show that the average error of FitBeat is around 4 beats per minute, which improves heart rate accuracy of the default heart rate tracker of Moto 360 by 10x. After that, we include human in the loop and design the smart health applications. To make the RSA-based breathing training, which relies on in-person sessions and cumbersome sensing devices, accessible at home, we propose the BreathCoach - a smart and unobtru- sive system which enables effective in-home RSA-BT using sensors on a smartwatch and smartphone-based VR. Specifically, BreathCoach continuously measures key bio-signals in- cluding breathing pattern (BP), inter-beat interval (IBI), amplitude of RSA, and intelligently calculates the optimal BP based on current and historical measurements. The recommended BP is conveyed to users through a VR game to provide intuitive guidance. The experimental results suggest that BreathCoach is able to reliably measure needed bio-signals and intelli- gently calculate BP recommendations which result in improved performance compared with the traditional approach. Finally, we build the smart system using federated learning for large-scale applications. Federated Learning (FL) enables the collaborative learning of a global model without expos- ing users’ raw data. However, existing FL approaches yield unsatisfactory HAR performance 86 as they fail to dynamically aggregate models according to the statistical diversity of users’ data. In out study, we propose FedDL, a novel federated learning system for HAR that can capture the underlying user relationships and apply them to learn personalized models for different users dynamically. We have implemented FedDL and evaluated using a new data set we collected using LiDAR and four public real-world datasets involving 178 users in total. The results show that FedDL outperforms several state-of-the-art FL paradigms in terms of model accuracy (by more than 15%), converging rate (by more than 70%), and communication overhead (about 30% reduction). Moreover, the testing results on the datasets of different scales show that FedDL has high scalability and hence can be deployed for large-scale real-world applications. 87 BIBLIOGRAPHY 88 BIBLIOGRAPHY [1] Customer review of stresseraser. [2] Empatica e4 wristband. https://www.empatica.com/en-int/research/e4/. [3] Google cardboard. [4] Hexoskin. [5] Moto g4. https://www.motorola.com/us/products/moto-g. [6] Stresseraser. [7] Apple is developing watch technology to detect heart abnormalities and now blood pressure. http://www.patentlyapple.com/patently-apple/2017/10/, 2017. [8] Use the breathe app. https://support.apple.com/en-us/HT206999, 2018. [Online; accessed 17-October-2018]. [9] Federated learning datasets for human activity recognition. 2021. [10] K. T. Abou-Moustafa and F. P. Ferrie. A note on metric properties for some divergence measures: The gaussian case. In Asian Conference on Machine Learning, pages 1–15. PMLR, 2012. [11] K. Alam, S. Qureshi, and T. Blaschke. Monitoring spatio-temporal aerosol patterns over pakistan based on modis, toms and misr satellite data and a hysplit model. At- mospheric environment, 45(27):4641–4651, 2011. [12] E. Aljalbout, V. Golkov, Y. Siddiqui, M. Strobel, and D. Cremers. Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648, 2018. [13] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. Human activity recog- nition on smartphones using a multiclass hardware-friendly support vector machine. In International workshop on ambient assisted living, pages 216–223. Springer, 2012. [14] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In Esann, volume 3, page 3, 2013. [15] M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019. [16] R. Bari, R. J. Adams, M. M. Rahman, M. B. Parsons, E. H. Buder, and S. Kumar. rcon- verse: Moment by moment conversation detection using a mobile respiration sensor. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(1):2, 2018. 89 [17] A. J. Beckham, T. B. Greene, and S. Meltzer-Brody. A pilot study of heart rate variability biofeedback therapy in the treatment of perinatal depression on a specialized perinatal psychiatry inpatient unit. Archives of women’s mental health, 16(1):59–65, 2013. [18] C. Benedek, B. Gálai, B. Nagy, and Z. Jankó. Lidar-based gait analysis and activity recognition in a 4d surveillance system. IEEE Transactions on Circuits and Systems for Video Technology, 28(1):101–113, 2016. [19] G. G. Berntson, J. T. Bigger, D. L. Eckberg, P. Grossman, P. G. Kaufmann, M. Malik, H. N. Nagaraja, S. W. Porges, J. P. Saul, P. H. Stone, et al. Heart rate variability: origins, methods, and interpretive caveats. Psychophysiology, 34(6):623–648, 1997. [20] S. Bhattacharya and N. D. Lane. From smart to deep: Robust activity recognition on smartwatches using deep learning. In 2016 IEEE International conference on perva- sive computing and communication workshops (PerCom Workshops), pages 1–6. IEEE, 2016. [21] C. Bi, G. Xing, T. Hao, J. Huh, W. Peng, and M. Ma. Familylog: A mobile system for monitoring family mealtime activities. In 2017 IEEE International Conference on Pervasive Computing and Communications (PerCom), pages 21–30. IEEE, 2017. [22] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečnỳ, S. Mazzocchi, H. B. McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019. [23] R. P. Brown and P. L. Gerbarg. Sudarshan kriya yogic breathing in the treatment of stress, anxiety, and depression: part i—neurophysiologic model. Journal of Alternative & Complementary Medicine, 11(1):189–201, 2005. [24] T. E. Brown, L. A. Beightol, J. Koh, and D. L. Eckberg. Important influence of respiration on human rr interval power spectra is largely ignored. Journal of Applied Physiology, 75(5):2310–2317, 1993. [25] L. Cao, Y. Wang, B. Zhang, Q. Jin, and A. V. Vasilakos. Gchar: An efficient group- based context—aware human activity recognition on smartphone. Journal of Parallel and Distributed Computing, 118:67–80, 2018. [26] R. Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997. [27] K. Cochrane and T. Schiphorst. Developing design considerations for mobile and wearable technology m-health applications that can support recovery in mental health disorders. In Pervasive Computing Technologies for Healthcare (PervasiveHealth), 2015 9th International Conference on, pages 29–36. IEEE, 2015. [28] Y. Deng, M. M. Kamani, and M. Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020. 90 [29] M. Dimiccoli, J. Marı́n, and E. Thomaz. Mitigating bystander privacy concerns in egocentric activity recognition with deep learning and intentional image degradation. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(4):1–18, 2018. [30] S. Ding, Z. Chen, T. Zheng, and J. Luo. Rf-net: a unified meta-learning framework for rf-enabled one-shot human activity recognition. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems, pages 517–530, 2020. [31] C. T. Dinh, N. H. Tran, and T. D. Nguyen. Personalized federated learning with moreau envelopes. arXiv preprint arXiv:2006.08848, 2020. [32] Y. Du, Y. Lim, and Y. Tan. A novel human activity recognition and prediction in smart home based on interaction. Sensors, 19(20):4474, 2019. [33] F. Estève, N. Blanc-Gras, J. Gallego, and G. Benchetrit. The effects of breathing pattern training on ventilatory function in patients with copd. Biofeedback and Self- regulation, 21(4):311–321, 1996. [34] T. Evgeniou, C. A. Micchelli, M. Pontil, and J. Shawe-Taylor. Learning multiple tasks with kernel methods. Journal of machine learning research, 6(4), 2005. [35] A. Fallah, A. Mokhtari, and A. Ozdaglar. Personalized federated learning: A meta- learning approach. arXiv preprint arXiv:2002.07948, 2020. [36] N. D. Giardino, L. Chan, and S. Borson. Combined heart rate variability and pulse oximetry biofeedback for chronic obstructive pulmonary disease: preliminary findings. Applied psychophysiology and biofeedback, 29(2):121–133, 2004. [37] G. Glass and K. Hopkins. Statistical methods in education and psychology. Psyccri- tiques, 41(12), 1996. [38] I. F. Gorodnitsky and B. D. Rao. Sparse signal reconstruction from limited data using focuss: a re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing, 45(3):600–616, Mar 1997. [39] Y. Guan and T. Plötz. Ensembles of deep lstm learners for activity recognition using wearables. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(2):1–28, 2017. [40] F. Haddadpour, M. M. Kamani, A. Mokhtari, and M. Mahdavi. Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics, pages 2350–2358. PMLR, 2021. [41] N. Y. Hammerla, S. Halloran, and T. Plötz. Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv preprint arXiv:1604.08880, 2016. [42] A. L. Hansen, B. H. Johnsen, J. J. Sollers, K. Stenvik, and J. F. Thayer. Heart rate variability and its relation to prefrontal cognitive function: the effects of training and detraining. European journal of applied physiology, 93(3):263–272, 2004. 91 [43] A. L. Hansen, B. H. Johnsen, and J. F. Thayer. Vagal influence on working memory and attention. International journal of psychophysiology, 48(3):263–274, 2003. [44] T. Hao, C. Bi, G. Xing, R. Chan, and L. Tu. Mindfulwatch: A smartwatch-based system for real-time respiration monitoring during meditation. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(3):57, 2017. [45] M. M. Hassan, M. Z. Uddin, A. Mohamed, and A. Almogren. A robust human activity recognition system using smartphone sensors and deep learning. Future Generation Computer Systems, 81:307–313, 2018. [46] J. A. Hirsch and B. Bishop. Respiratory sinus arrhythmia in humans: how breathing pattern modulates heart rate. American Journal of Physiology-Heart and Circulatory Physiology, 241(4):H620–H629, 1981. [47] A. Ignatov. Real-time human activity recognition from accelerometer data using con- volutional neural networks. Applied Soft Computing, 62:915–922, 2018. [48] L. Jacob, F. Bach, and J.-P. Vert. Clustered multi-task learning: A convex formulation. arXiv preprint arXiv:0809.2085, 2008. [49] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan. Communication-efficient distributed dual coordinate ascent. arXiv preprint arXiv:1409.1458, 2014. [50] A. Jalal and S. Kamal. Real-time life logging via a depth silhouette-based human activity recognition system for smart home services. In 2014 11th IEEE International conference on advanced video and signal based surveillance (AVSS), pages 74–80. IEEE, 2014. [51] A. Jalal, M. Z. Uddin, and T.-S. Kim. Depth video-based human activity recognition system using translation and scaling invariant features for life logging at smart home. IEEE Transactions on Consumer Electronics, 58(3):863–871, 2012. [52] Y. Jia. Diatetic and exercise therapy against diabetes mellitus. In 2009 Second Inter- national Conference on Intelligent Networks and Intelligent Systems, pages 693–696. IEEE, 2009. [53] W. Jiang, C. Miao, F. Ma, S. Yao, Y. Wang, Y. Yuan, H. Xue, C. Song, X. Ma, D. Koutsonikolas, et al. Towards environment independent device free human activity recognition. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 289–304, 2018. [54] Y. Jiang, J. Konečnỳ, K. Rush, and S. Kannan. Improving federated learning person- alization via model agnostic meta learning. arXiv preprint arXiv:1909.12488, 2019. [55] W. S. Johnston. Development of a signal processing library for extraction of SpO2, HR, HRV, and RR from photoplethysmographic waveforms. PhD thesis, Worcester Polytechnic Institute, 2006. 92 [56] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019. [57] S. Kaplan. Meditation, restoration, and the management of mental fatigue. Environ- ment and behavior, 33(4):480–506, 2001. [58] M. Karamnejad. Virtual reality and health informatics for management of chronic pain. Simon Fraser University, 2014. [59] M. K. Karavidas, P. M. Lehrer, E. Vaschillo, B. Vaschillo, H. Marin, S. Buyske, I. Ma- linovsky, D. Radvanski, and A. Hassett. Preliminary results of an open label study of heart rate variability biofeedback for the treatment of major depression. Applied psychophysiology and biofeedback, 32(1):19–30, 2007. [60] W. Karlen, J. M. Ansermino, and G. Dumont. Adaptive pulse segmentation and artifact detection in photoplethysmography for mobile applications. In Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE, pages 3131–3134. IEEE, 2012. [61] P. G. Katona and F. Jih. Respiratory sinus arrhythmia: noninvasive measure of parasympathetic cardiac control. Journal of applied physiology, 39(5):801–805, 1975. [62] A. H. Kemp, D. S. Quintana, K. L. Felmingham, S. Matthews, and H. F. Jelinek. Depression, comorbid anxiety disorders, and heart rate variability in physically healthy, unmedicated patients: implications for cardiovascular risk. PloS one, 7(2):e30777, 2012. [63] M. A. A. H. Khan, N. Roy, and A. Misra. Scaling human activity recognition via deep learning-based domain adaptation. In 2018 IEEE international conference on pervasive computing and communications (PerCom), pages 1–9. IEEE, 2018. [64] I. Z. Khazan. The clinical handbook of biofeedback: A step-by-step guide for training and practice with mindfulness. John Wiley & Sons, 2013. [65] B. S. Kim and S. K. Yoo. Motion artifact reduction in photoplethysmography us- ing independent component analysis. IEEE Transactions on Biomedical Engineering, 53(3):566–568, March 2006. [66] S. H. Kim, D. W. Ryoo, and C. Bae. Adaptive noise cancellation using accelerometers for the ppg signal from forehead. In 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 2564–2567, Aug 2007. [67] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016. 93 [68] N. D. Lane, Y. Xu, H. Lu, S. Hu, T. Choudhury, A. T. Campbell, and F. Zhao. Enabling large-scale human activity inference on smartphones using community simi- larity networks (csn). In Proceedings of the 13th international conference on Ubiquitous computing, pages 355–364, 2011. [69] L. Leger and M. Thivierge. Heart rate monitors: Validity, stability, and functionality. The Physician and Sportsmedicine, 16(5):143–151, 1988. [70] P. Lehrer, A. Smetankin, and T. Potapova. Respiratory sinus arrhythmia biofeedback therapy for asthma: A report of 20 unmedicated pediatric cases using the smetankin method. Applied psychophysiology and biofeedback, 25(3):193–200, 2000. [71] P. M. Lehrer, E. Vaschillo, and B. Vaschillo. Resonant frequency biofeedback training to increase cardiac variability: Rationale and manual for training. Applied psychophys- iology and biofeedback, 25(3):177–191, 2000. [72] P. M. Lehrer, E. Vaschillo, and B. Vaschillo. Resonant frequency biofeedback training to increase cardiac variability: Rationale and manual for training. Applied psychophys- iology and biofeedback, 25(3):177–191, 2000. [73] P. M. Lehrer, E. Vaschillo, B. Vaschillo, S.-E. Lu, D. L. Eckberg, R. Edelberg, W. J. Shih, Y. Lin, T. A. Kuusela, K. U. Tahvanainen, et al. Heart rate variability biofeed- back increases baroreflex gain and peak expiratory flow. Psychosomatic medicine, 65(5):796–805, 2003. [74] R. Ley. The modification of breathing behavior: Pavlovian and operant control in emotion and cognition. Behavior Modification, 23(3):441–479, 1999. [75] J. Li, Y. Rong, H. Meng, Z. Lu, T. Kwok, and H. Cheng. Tatc: predicting alzheimer’s disease with actigraphy data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 509–518, 2018. [76] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019. [77] Z. Liu, F. Zhang, and X. Hong. Low-cost retina-like robotic lidars based on incom- mensurable scanning. IEEE/ASME Transactions on Mechatronics, 2021. [78] M. Long, Y. Cao, Z. Cao, J. Wang, and M. I. Jordan. Transferable representation learning with deep adaptation networks. IEEE transactions on pattern analysis and machine intelligence, 41(12):3071–3085, 2018. [79] M. Long, Z. Cao, J. Wang, and P. S. Yu. Learning multiple tasks with multilinear relationship networks. arXiv preprint arXiv:1506.02117, 2015. [80] A. Lounis, A. Hadjidj, A. Bouabdallah, and Y. Challal. Secure and scalable cloud- based architecture for e-health wireless sensor networks. In 2012 21st International Conference on Computer Communications and Networks (ICCCN), pages 1–7. IEEE, 2012. 94 [81] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5334–5343, 2017. [82] Y. Mansour, M. Mohri, J. Ro, and A. T. Suresh. Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619, 2020. [83] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication- efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017. [84] H. B. McMahan and D. Ramage. 2017. [85] W. Min and L. Wynter. Real-time road traffic prediction with spatio-temporal corre- lations. Transportation Research Part C: Emerging Technologies, 19(4):606–616, 2011. [86] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi- task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016. [87] M. Moencks, V. De Silva, J. Roche, and A. Kondoz. Adaptive feature processing for robust human activity recognition on a novel multi-modal dataset. arXiv preprint arXiv:1901.02858, 2019. [88] L. Mou, Z. Meng, R. Yan, G. Li, Y. Xu, L. Zhang, and Z. Jin. How transferable are neural networks in nlp applications? arXiv preprint arXiv:1603.06111, 2016. [89] M. Munafo, E. Patron, and D. Palomba. Improving managers’ psychophysical well- being: effectiveness of respiratory sinus arrhythmia biofeedback. Applied psychophysi- ology and biofeedback, 41(2):129–139, 2016. [90] A. Nilsson, S. Smith, G. Ulm, E. Gustavsson, and M. Jirstrand. A performance eval- uation of federated learning algorithms. In Proceedings of the Second Workshop on Distributed Infrastructures for Deep Learning, pages 1–8, 2018. [91] S. Nirjon, R. F. Dickerson, Q. Li, P. Asare, J. A. Stankovic, D. Hong, B. Zhang, X. Jiang, G. Shen, and F. Zhao. Musicalheart: A hearty way of listening to music. In Proceedings of the 10th ACM Conference on Embedded Network Sensor Systems, pages 43–56, 2012. [92] X. Ouyang, Z. Xie, J. Zhou, J. Huang, and G. Xing. Clusterfl: a similarity-aware federated learning system for human activity recognition. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 54–66, 2021. [93] P. E. Paredes, Y. Zhou, N. A.-H. Hamdan, S. Balters, E. Murnane, W. Ju, and J. A. Landay. Just breathe: In-car interventions for guided slow breathing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(1):28, 2018. 95 [94] J. A. C. Patterson, D. C. McIlwraith, and G. Z. Yang. A flexible, low noise reflective ppg sensor platform for ear-worn heart rate monitoring. In 2009 Sixth International Workshop on Wearable and Implantable Body Sensor Networks, pages 286–291, June 2009. [95] L. Peng, L. Chen, Z. Ye, and Y. Zhang. Aroma: A deep multi-task learning based simple and complex human activity recognition method using wearable sensors. Pro- ceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(2):1–16, 2018. [96] S. W. Porges, J. A. Doussard-Roosevelt, and A. K. Maiti. Vagal tone and the physiolog- ical regulation of emotion. Monographs of the society for research in child development, 59(2-3):167–186, 1994. [97] B. Prathyusha, T. S. Rao, and D. Asha. Extraction of respiratory rate from ppg signals using pca and emd. [98] G. E. Prinsloo, H. L. Rauch, M. I. Lambert, F. Muench, T. D. Noakes, and W. E. Der- man. The effect of short duration heart rate variability (hrv) biofeedback on cognitive performance during laboratory induced cognitive stress. Applied Cognitive Psychology, 25(5):792–801, 2011. [99] M. Prpa, K. Cochrane, and B. E. Riecke. Hacking alternatives in 21st century: design- ing a bio-responsive virtual environment for stress reduction. In International Sym- posium on Pervasive Computing Paradigms for Mental Health, pages 34–39. Springer, 2015. [100] Y. Qin, C. J. Vincent, N. Bianchi-Berthouze, and Y. Shi. Airflow: designing immersive breathing training games for copd. In CHI’14 Extended Abstracts on Human Factors in Computing Systems, pages 2419–2424. ACM, 2014. [101] V. Radu, N. D. Lane, S. Bhattacharya, C. Mascolo, M. K. Marina, and F. Kawsar. Towards multimodal deep learning for activity recognition on mobile devices. In Pro- ceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, pages 185–188, 2016. [102] M. A. D. Raya and L. G. Sison. Adaptive noise cancelling of motion artifact in stress ecg signals using accelerometer. In Proceedings of the Second Joint 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society, volume 2, pages 1756–1757 vol.2, 2002. [103] R. Reiner. Integrating a portable biofeedback device into clinical practice for patients with anxiety disorders: Results of a pilot study. Applied Psychophysiology and Biofeed- back, 33(1):55–61, 2008. [104] S. Rhee, B.-H. Yang, and H. H. Asada. Artifact-resistant power-efficient design of finger-ring plethysmographic sensors. IEEE Transactions on Biomedical Engineering, 48(7):795–805, July 2001. 96 [105] D. Riboni and C. Bettini. Cosar: hybrid reasoning for context-aware activity recogni- tion. Personal and Ubiquitous Computing, 15(3):271–289, 2011. [106] D. J. L. F. d. V. Rodrigues. Risk Assessment for Alzheimer Patients, using GPS and Accelerometers with a Machine Learning Approach. PhD thesis, 2019. [107] C. A. Ronao and S.-B. Cho. Human activity recognition with smartphone sensors using deep learning neural networks. Expert systems with applications, 59:235–244, 2016. [108] P. C. Roy, S. Giroux, B. Bouchard, A. Bouzouane, C. Phua, A. Tolstikov, and J. Biswas. A possibilistic approach for activity recognition in smart homes for cognitive assistance to alzheimer’s patients. In Activity Recognition in Pervasive Intelligent Environments, pages 33–58. Springer, 2011. [109] S. Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017. [110] A. L. Rukhin. Analysis of time series structure ssa and related techniques. Techno- metrics, 44(3):290–290, 2002. [111] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek. Robust and communication- efficient federated learning from non-iid data. IEEE transactions on neural networks and learning systems, 2019. [112] L. Sherlin, R. Gevirtz, S. Wyckoff, and F. Muench. Effects of respiratory sinus ar- rhythmia biofeedback versus passive biofeedback control. International Journal of Stress Management, 16(3):233, 2009. [113] J. Shi, J. Wan, H. Yan, and H. Suo. A survey of cyber-physical systems. In 2011 international conference on wireless communications and signal processing (WCSP), pages 1–6. IEEE, 2011. [114] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar. Federated multi-task learn- ing. In Advances in Neural Information Processing Systems, pages 4424–4434, 2017. [115] K. Sozinov, V. Vlassov, and S. Girdzijauskas. Human activity recognition using fed- erated learning. In 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Com- puting, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), pages 1103–1111. IEEE, 2018. [116] H. M. Stauss. Heart rate variability. American Journal of Physiology-Regulatory, Integrative and Comparative Physiology, 285(5):R927–R931, 2003. [117] X. Sun, Z. Lu, W. Hu, and G. Cao. Symdetector: detecting sound-related respiratory symptoms using smartphones. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 97–108, 2015. [118] T. Tamura, Y. Maeda, M. Sekine, and M. Yoshida. Wearable photoplethysmographic sensors—past and present. Electronics, 3(2):282, 2014. 97 [119] G. Tan, T. K. Dao, L. Farmer, R. J. Sutherland, and R. Gevirtz. Heart rate variability (hrv) and posttraumatic stress disorder (ptsd): a pilot study. Applied psychophysiology and biofeedback, 36(1):27–35, 2011. [120] M. P. Tarvainen, J.-P. Niskanen, J. A. Lipponen, P. O. Ranta-Aho, and P. A. Kar- jalainen. Kubios hrv–heart rate variability analysis software. Computer methods and programs in biomedicine, 113(1):210–220, 2014. [121] L. Tu, T. Hao, C. Bi, and G. Xing. Breathcoach: A smart in-home breathing training system with bio-feedback via vr game. Smart Health, 16:100090, 2020. [122] L. Tu, T. Hao, C. Bi, and G. Xing. Breathcoach: A smart in-home breathing training system with bio-feedback via vr game. Smart Health, 16:100090, 2020. [123] L. Tu, J. Huang, C. Bi, and G. Xing. Fitbeat: A lightweight system for accurate heart rate measurement during exercise. In 2017 IEEE International Conference on Smart Computing (SMARTCOMP), pages 1–8. IEEE, 2017. [124] L. Tu, X. Ouyang, J. Zhou, Y. He, and G. Xing. Feddl: Federated learning via dynamic layer sharing for human activity recognition. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, SenSys ’21, page 15–28, New York, NY, USA, 2021. Association for Computing Machinery. [125] A. Ukil, S. Bandyoapdhyay, C. Puri, and A. Pal. Iot healthcare analytics: The impor- tance of anomaly detection. In 2016 IEEE 30th international conference on advanced information networking and applications (AINA), pages 994–997. IEEE, 2016. [126] I. Van Diest, K. Verstappen, A. E. Aubert, D. Widjaja, D. Vansteenwegen, and E. Vlemincx. Inhalation/exhalation ratio modulates the effect of slow breathing on heart rate variability and relaxation. Applied psychophysiology and biofeedback, 39(3- 4):171–180, 2014. [127] E. G. Vaschillo, B. Vaschillo, and P. M. Lehrer. Characteristics of resonance in heart rate variability stimulated by biofeedback. Applied psychophysiology and biofeedback, 31(2):129–142, 2006. [128] J. Wang, H. Abid, S. Lee, L. Shu, and F. Xia. A secured health care application architecture for cyber-physical systems. arXiv preprint arXiv:1201.0213, 2011. [129] K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek, and H. V. Poor. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020. [130] B. Widrow, J. R. Glover, J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Zeidler, J. E. Dong, and R. C. Goodlin. Adaptive noise cancelling: Principles and applications. Proceedings of the IEEE, 63(12):1692–1716, Dec 1975. 98 [131] L. Xie, I. M. Baytas, K. Lin, and J. Zhou. Privacy-preserving distributed multi- task learning with asynchronous updates. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1195–1204, 2017. [132] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and ap- plications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1– 19, 2019. [133] J. Yin, Q. Yang, and J. J. Pan. Sensor-based abnormal human-activity detection. IEEE Transactions on Knowledge and Data Engineering, 20(8):1082–1090, 2008. [134] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014. [135] Z. Zhang, Z. Pi, and B. Liu. Troika: A general framework for heart rate monitor- ing using wrist-type photoplethysmographic signals during intensive physical exercise. IEEE Transactions on Biomedical Engineering, 62(2):522–531, Feb 2015. [136] S. Zhao, W. Li, and J. Cao. A user-adaptive algorithm for activity recognition based on k-means clustering, local outlier factor, and multivariate gaussian distribution. Sensors, 18(6):1850, 2018. [137] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018. [138] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. 2014. [139] T. L. Zucker, K. W. Samuelson, F. Muench, M. A. Greenberg, and R. N. Gevirtz. The effects of respiratory sinus arrhythmia biofeedback on heart rate variability and posttraumatic stress disorder symptoms: a pilot study. Applied psychophysiology and biofeedback, 34(2):135, 2009. 99