DATA-CENTRIC AI FOR INTERACTION SECURITY AND PRIVACY IN THE INTERNET-OF-THINGS

By Guangjing Wang

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2024

ABSTRACT

In the realm of the Internet of Things (IoT), users, devices, and environments communicate and interact with each other, creating a web of complex interactions. This interconnected web of interactions makes the IoT a powerful tool for enhancing human experiences. However, it simultaneously presents substantial challenges in ensuring security and privacy amid interactions among users, devices, and environments. This dissertation investigates potential IoT interaction security and privacy issues by customizing data-centric AI algorithms.

First, this dissertation studies complex interactions in smart homes where many interconnected smart devices are deployed. A graph learning-based threat detection system is designed to discover potential interactive threats across multiple smart home platforms. Second, considering smart home data privacy and data heterogeneity issues, a dynamic clustering-based federated graph learning framework is proposed to collaboratively train a threat detection model. Meanwhile, a Monte Carlo beam search-based method is designed to identify the causes of interactive threats. Third, we explore the privacy issues behind the interactions between users and smartphones. Specifically, we identify a potential bio-information leakage channel that uses near-ultrasound signals from a smartphone to recognize facial expressions with a contrastive attention learning model. Fourth, we reveal two critical overprivileged issues in mobile activity sensing data generated from interactions between users and mobile devices: metadata-level and feature-level overprivileged issues. Correspondingly, we design a multi-grained data generation model to reconstruct mobile activity sensing data and thereby mitigate the privacy concerns behind these overprivileged issues.

We have implemented and extensively evaluated the proposed threat detection model, federated model training method, acoustic-based expression recognition model, and privacy-preserving data reconstruction model in practical settings. This dissertation concludes with a discussion of future work. We highlight the challenges and opportunities associated with applying AI techniques to address security and privacy issues in the IoT, and we point out pathways for future research on enhancing security and privacy to safeguard the interactions among users, devices, AI, and environments.

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to Dr. Qiben Yan, my dissertation advisor, for his unwavering support, guidance, and expertise throughout this journey. His insights and feedback have been invaluable in shaping this research. I am also thankful to my guidance committee members, Dr. Arun Ross, Dr. Huacheng Zeng, Dr. Brian O’Shea, and Dr. Mi Zhang, for their constructive criticism, invaluable suggestions, and encouragement, which have greatly contributed to the completion of this dissertation. Special thanks to my colleagues and collaborators Nikolay Ivanov, Hanqing Guo, Yuanda Wang, Bocheng Chen, Ce Zhou, Weikang Ding, Juexing Wang, and Shane Patrarungrong for their companionship, assistance, and valuable discussions, which have been a source of motivation and inspiration.
In addition, I want to thank my friends in Lab 1100 and the eLans research group led by Dr. Li Xiao; I enjoyed our weekly seminars and discussions. I also want to thank the College of Engineering and the Department of Computer Science, where a group of excellent people, including Dr. Katy Colbry, Dr. Sandeep Kulkarni, and Mrs. Sandra Christlieb, helped me a lot outside of my academic research. I would like to acknowledge the support of the Institute for Cyber-Enabled Research at MSU, Cisco Research, and NSF for providing the necessary resources and funding to conduct this research. I would like to extend my gratitude to people from Michigan State University, the University of Illinois Urbana-Champaign, George Mason University, and the University of North Carolina at Charlotte, whose contributions, though not mentioned individually, have not gone unnoticed and have been instrumental in the successful completion of this dissertation. My heartfelt appreciation goes out to my family, especially my father Jisheng Wang, my mother Xiumei Pan, my aunt Jizhen Wang, and my brother Guanghao Wang, for their love, understanding, and patience during this demanding period of my life. Lastly, I would like to thank myself for deciding to study at Michigan State University for a doctoral degree in computer science four years ago. Keep your gratitude and positive thinking in the journey of your life!

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
  1.1: Related Work
  1.2: Research Scope
  1.3: Organization
CHAPTER 2: GRAPH LEARNING FOR INTERACTIVE THREAT DETECTION
  2.1: Introduction
  2.2: Problem Definition
  2.3: System Design
  2.4: Evaluation and Discussion
  2.5: Summary
CHAPTER 3: FEDERATED THREAT DETECTION MODEL LEARNING
  3.1: Introduction
  3.2: Problem Definition
  3.3: System Design
  3.4: Evaluation and Discussion
  3.5: Summary
CHAPTER 4: ACOUSTIC-BASED FACIAL EXPRESSION RECOGNITION
  4.1: Introduction
  4.2: Background
  4.3: System Design
  4.4: Evaluation and Discussion
  4.5: Summary
CHAPTER 5: MULTI-GRAINED ACTIVITY SENSING INFORMATION RELEASE
  5.1: Introduction
  5.2: Problem Definition
  5.3: System Design
  5.4: Evaluation and Discussion
  5.5: Summary
CHAPTER 6: CONCLUSION
  6.1: Summary of Contributions
  6.2: Discussion of Limitations
  6.3: Outline of Future Work
BIBLIOGRAPHY

CHAPTER 1: INTRODUCTION

In the context of the IoT, interaction encapsulates the diverse and multifaceted communication pathways among users, devices, and their environments. There are active interactions between users and devices: people use smartphones every day, and users can instruct a voice assistant to play music or turn on the lights. Meanwhile, there are passive interactions between users and devices. For instance, a smart light automatically turns on when low illumination and motion are detected via light and infrared sensing. Such interactions enable users to send commands and receive responses from devices, while the devices themselves exchange data and adapt their functions in real time, considering environmental cues. In addition, the environment plays a pivotal role, providing context that is continually interpreted by IoT devices to make informed decisions and actions. There are also interactions between the environment and devices. For instance, when smoke is detected in the air, the fire alarm starts beeping, smart fans can accelerate the airflow, and the air conditioner can adjust the room temperature. These interactions form a complex network, endowing the IoT with its unique capabilities to automate processes, monitor various parameters, and augment the human experience with enhanced connectivity and intelligence.

However, this complexity and interconnection present significant challenges in safeguarding the security and privacy of the systems involved. First, every interaction point can potentially become a vulnerability, posing risks not only to data integrity and privacy but also to the physical safety of individuals (e.g., a house being burgled or a heavy appliance burning out). For example, it is reported that smart home devices suffer about 10,000 hacking attempts every week [244]. If a smart lock is hacked, the attacker can control who enters the house; if a smart light is hacked, the attacker can keep all the lights on to overload the power system. Second, a single compromised element can have far-reaching implications.
For instance, ultrasound attacks [177, 234, 245] can hijack voice assistants such as Amazon Alexa and Apple Siri, which can be tricked into unlocking doors, turning on lights, or reading texts. Third, the dynamic nature of IoT interactions, where devices communicate and update their states, adds another layer of complexity in ensuring that robust security protocols function effectively.

Considering the vulnerable, complex, and dynamic interactions in the IoT, this dissertation aims to uncover the vulnerabilities of IoT interactions and propose innovative methods to enhance security and privacy in this ever-evolving landscape. Through this research, we contribute to the foundational understanding of IoT interactions and offer practical solutions to fortify mobile and IoT systems against the myriad threats they face. Specifically, as shown in Figure 1.1, we propose four main systems addressing security and privacy issues in the IoT. First, we design a graph representation learning model to learn the patterns of interactive threats, which refer to the undesired and insecure issues caused by interactions among smart devices. Second, we propose a federated learning paradigm to train the threat detection model, so as to protect the privacy of training data from smart homes. Meanwhile, we design a Monte Carlo beam search-based method to identify the root causes of interactive threats. Third, we design an acoustic-based facial expression recognition system using a smartphone earpiece speaker, demonstrating the potential privacy issues with acoustic sensing. Fourth, we design a generative diffusion model-based method to achieve user-controlled information disclosure in mobile activity sensing.

Figure 1.1: Research overview: graph learning-based interactive threat detection, federated training with dynamic clustering, interactive threat cause analysis via Monte-Carlo beam search, acoustic-based facial expression recognition, and user-controlled information disclosure in mobile activity sensing.

Table 1.1: Comparison with existing IoT interactive threat detection systems.
System         | Source Code | Dynamic Test | Long-term Correlation | Cross-platform Interaction | Method
IoTGuard [28]  | Yes         | Yes          | No                    | No                         | Code instrumentation
HAWatcher [70] | Yes         | No           | No                    | No                         | Code analysis & log analysis
IoTSafe [55]   | Yes         | Yes          | No                    | No                         | Static analysis & dynamic testing
iRuler [222]   | No          | No           | Yes                   | No                         | SMT solver & model checker
IoTMon [54]    | Yes         | No           | Yes                   | No                         | Code analysis & text analysis
Ours           | No          | No           | Yes                   | Yes                        | NLP & GNN models

1.1: Related Work

This section provides an overview of the current literature pertinent to the research conducted in this dissertation, organized into four categories: IoT device interactive threat detection, federated graph learning for threat detection model training, facial expression recognition, and privacy protection in mobile activity sensing. For each category, we review the latest advancements and then discuss how this dissertation contributes to the field.
1.1.1: IoT Device Interactive Threat Detection

Numerous investigations have explored device interactions within smart home environments. Many studies depend on app source code analysis [3, 28, 55, 203]. For instance, IoTGuard [28] employs code instrumentation to gather runtime data and verify it against established security policies. Yet this approach hits a roadblock with closed-source platforms such as IFTTT and Alexa, rendering such techniques ineffective. Alternative approaches [28, 55] implement dynamic testing to identify interactive threats. IoTSafe [55], for example, merges static and dynamic analyses to foresee potential risks. Nonetheless, deploying such systems in real-world settings is costly and requires meticulous test case management to prevent unforeseen safety hazards. Dynamic testing also grapples with code coverage limitations and often overlooks variables such as time and seasonal influences. While HAWatcher [70] deduces normal interaction patterns from app and event logs, it struggles with long-duration correlations and deviations in user behavior. Additionally, many frameworks [28, 70, 222] necessitate predefined security policies or threat patterns. For example, iRuler [222] only examines interactions on a single platform and requires expert insights to define a complex threat landscape for its rule-based analysis. The challenge is amplified when trying to predetermine varied interactive threats across diverse, closed-source platforms, limiting rule-based methodologies to detecting only known threats and resulting in a higher rate of false negatives. iRuler must also account for all possible device state combinations, given its rule definitions, device statuses, and search depth settings; an increased search depth prolongs the analysis and degrades its precision and efficiency.

Contribution of this Dissertation
In this dissertation, we study the interactive threats among devices from various closed-source platforms. By crawling data from different sources such as app descriptions and event logs, we leverage natural language processing (NLP) techniques to integrate multi-domain information without depending on source code scrutiny. In addition, we propose a new method to build interaction correlations among devices into interaction graphs, and we design a unified graph representation learning model, ITGNN, for multi-scale interactive threat pattern learning. To cope with the threat pattern coverage issue, we propose a drifting threat pattern detection algorithm, thereby sidestepping the challenges of dynamic testing. Our evaluation shows that the proposed model achieves high accuracy in detecting interactive threats across heterogeneous platforms. We identify four new types of interactive threats in the user-defined Home Assistant platform.

1.1.2: Federated Graph Learning

Federated learning (FL) aims to train a shared learning-based model while avoiding sharing data from local clients. Recent research [31, 116, 117, 120, 128, 220, 251] has delved into system and data heterogeneity challenges within FL. For instance, FedProx [128] introduces weighted aggregation across various clients, though determining appropriate weights for diverse applications remains a complex issue. Alternative strategies involve sharing data between local devices or through server-side proxies [251] to address data heterogeneity, yet this demands an understanding of local data distribution.
Strategies such as bounding gradients [240] or incorporating additional noise [105] have been proposed to ensure convergence, albeit at the cost of increased training duration and reduced model precision. GNN models have shown superior results in tasks such as graph or node classification [94, 99, 123, 249], effectively extracting node dependencies within graphs to create graph embeddings. Recent efforts [22, 125, 148, 165, 211] have applied FL to GNN models, such as GraphFL [211], which utilizes a meta-learning-based FL framework for semi-supervised node classification. Yet the presence of non-i.i.d. structural information within graph data poses additional challenges, potentially impairing learning performance [229].

Contribution of this Dissertation
This dissertation proposes a layer-wise dynamic clustering-based FL algorithm, drawing upon a critical observation: despite the heterogeneity in data, similar data distribution patterns emerge in smart homes because various users share common automation rules. We design a layer-wise clustering-based federated contrastive GNN model for non-i.i.d. graph datasets to learn contrastive graph representations. Within each cluster, the graph datasets share similar feature information, which exhibits the i.i.d. property. In this way, we can reduce the data heterogeneity during the federated learning process in each cluster of clients.

1.1.3: Facial Expression Recognition

Many researchers have proposed using body sensors that generate physiological signals [32, 175] to recognize emotional states. Yet such methods require extensive signal analysis over prolonged periods, such as 30 seconds [68], which may hinder efficiency. ExpressEar [208] leverages commercially available ear-worn devices with added inertial sensors to detect facial muscle activities linked to emotions. Similarly, FaceListener [194] transforms headphones into acoustic sensors to measure facial skin changes. However, these approaches require users to wear additional devices for emotion detection, which is a notable drawback. In addition, wireless and mobile sensing technologies have been employed for recognizing behaviors, including daily activities [86, 118, 119, 217, 253] and facial expressions [36]. For example, WiFace [36] discerns facial expressions by analyzing WiFi signal patterns, while Hof et al. [91] propose mm-Wave radar for facial recognition. These methods, however, necessitate specialized hardware and setup, which may not always be practical.

Acoustic sensing, leveraging ubiquitous speakers and microphones, involves emitting acoustic signals and interpreting the echoes from objects to monitor breathing [233], authenticate users [254], and recognize activities [76, 121, 230]. EchoPrint [254], for instance, uses both acoustic and visual signals for user verification, and TeethPass [230] authenticates users via the sound of teeth clenching captured through earbuds. Similarly, LASense [121] enhances activity sensing precision by refining signal processing techniques. Another application, SonicFace [75], evaluates emotional expressions by analyzing echo patterns, but it cannot obtain fine-grained facial features due to the frequency resolution constraints of its acoustic signals.

Contribution of this Dissertation
In this dissertation, we leverage commercial smartphones to emit near-ultrasonic signals (19-23 kHz) directed at the user's face.
The smartphone's microphone captures the echo signals reflected off the face, encapsulating details of the facial expressions. Through an analysis of these detailed echo patterns, our developed system, Facer, is capable of identifying six distinct facial emotional expressions: anger, disgust, happiness, fear, sadness, and surprise. We have designed a novel model for acoustic facial expression recognition, which employs contrastive external attention to discern and strengthen facial expression features while filtering out irrelevant background noise. Furthermore, we introduce a domain adaptation contrastive learning algorithm to reconcile the distribution discrepancies between training and testing datasets, significantly reducing the impact of variability in facial expressions among different users. Our smartphone-based system, Facer, has been rigorously tested in diverse real-world settings, demonstrating its superiority over existing methods by achieving more than a 10% improvement in recognition accuracy, while also offering enhanced mobility and user convenience.
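To make the near-ultrasound probing concrete, the sketch below synthesizes a 19-23 kHz chirp and locates an echo by cross-correlation. It is an illustrative approximation only: the sample rate, frame length, and helper names are assumptions for exposition, not Facer's actual signal design.

```python
# Illustrative sketch only: synthesizing a near-ultrasound probe in the
# 19-23 kHz band and recovering echo delays by cross-correlation. The sample
# rate and chirp length are assumptions, not the dissertation's configuration.
import numpy as np
from scipy.signal import chirp

FS = 48_000          # common smartphone sampling rate (assumption)
DURATION = 0.04      # 40 ms per chirp frame (assumption)

t = np.arange(int(FS * DURATION)) / FS
# Linear frequency sweep from 19 kHz to 23 kHz, inaudible to most adults.
probe = chirp(t, f0=19_000, f1=23_000, t1=DURATION, method="linear")

def echo_profile(received: np.ndarray) -> np.ndarray:
    """Cross-correlate a received frame with the probe; peaks mark echo lags."""
    corr = np.correlate(received, probe, mode="full")
    return np.abs(corr[len(probe) - 1:])  # keep non-negative lags only

# Toy example: one echo delayed by 1 ms with attenuation 0.3 plus noise.
delay = int(0.001 * FS)
received = 0.3 * np.pad(probe, (delay, 0))[: len(probe)] \
    + 0.01 * np.random.randn(len(probe))
profile = echo_profile(received)
print("strongest echo at lag (ms):", profile.argmax() / FS * 1000)
```

In a Facer-like pipeline, the sequence of such echo profiles over time would form the input features from which a learned model infers expression classes.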
1.1.4: Mobile Activity-Sensing Privacy Protection

Metadata Information Protection
Two principal approaches exist for safeguarding metadata information. The first involves adversarial training-based techniques [21, 115, 133, 140, 173], where a discriminator model seeks to deduce sensitive attributes from the features, while a generator model aims to reduce the discriminator's success rate. For instance, DySan [21] obfuscates personal attributes through adversarial training on datasets containing such attributes. The second approach encompasses methods using variational autoencoders and information theory [160, 252], striving to minimize the mutual information between private attributes and their representations, as seen in InfoCensor [252], which addresses both privacy and fairness issues in deep learning. However, the current body of work exhibits several limitations. First, many methods necessitate the disclosure of labeled private data by users for model training, as the optimization loss functions in these models depend on labeled private attributes. Second, conventional noise addition techniques compromise not only the metadata but also the utility of activity recognition. Third, existing research [21, 133, 140] often overlooks overprivileged issues at the feature level, focusing merely on particular sensitive attributes such as age, gender, or user identity.

Semantic Information Protection
To safeguard semantic details, a simple idea is to generalize the activity labels provided by activity recognition APIs [47, 49]. However, label generalization alone does not ensure privacy protection [79], not to mention that many applications require access to the raw sensor data instead of labels. For example, gaming apps might use motion sensor data to interpret user gestures, and navigation apps might depend on geomagnetic and accelerometer data for directional guidance. Many mechanisms [20, 30, 79, 143, 171] are designed to determine whether activity labels should be disclosed or withheld. Nonetheless, there are two main limitations. First, filter-based methods [30, 79, 171] depend on real-time activity labeling for their decision-making, which is difficult due to human behavior variability and sensor diversity; platforms such as Android recognize only a limited number of activities [49]. Second, for applications that need raw sensor data, existing privacy mechanisms are too coarse-grained, often removing entire data segments associated with sensitive motions and thereby compromising data utility. For instance, IPShield [30] completely conceals segments of raw data linked to private activities.

Addressing the handling of raw data, multiple obfuscation techniques [60, 89, 198, 218, 248] have been suggested. Anonymization methods [124, 139, 198] primarily focus on static databases to safeguard tabular data against specific types of attacks. These methods, however, are generally too imprecise for the nuanced issues in time-series data. While differential privacy (DP) [60] introduces noise to data for statistical analyses, it is not ideally suited for interpreting streaming sensor data. Dataset synthesis models based on DP [69, 218, 246-248], such as PrivSyn [248], are designed for tabular data generation under DP but do not directly address the nuanced overprivileged challenges present in mobile sensing data.

Contribution of this Dissertation
In this dissertation, we propose a multi-grained data generation system, Hippo, for user-controlled information disclosure. Specifically, we introduce a novel system that leverages a noise diffusion process to obscure private attributes, maintaining activity recognition capabilities without necessitating user-provided private data and labels. In addition, we have developed a hierarchical latent feature guidance diffusion model for generating multi-grained data, enabling precise control of information release and addressing feature-level overprivileged concerns. The "granularity" is determined by the features learned at the i-th layer of a stacked convolutional autoencoder. Hippo offers two main advantages: (i) it operates as a self-supervised framework, processing raw sensing data without depending on labeled private attributes, and (ii) it generalizes individual activities rather than sequences of activities, eliminating the need for predefined patterns of private activities. For example, it can transform detailed "playing games" data into a more general "sitting" category. This process filters out the specific "playing games" features, allowing Hippo to provide fine-grained control over the data available to overprivileged applications by eliminating fine-grained and sensitive activity features.
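To illustrate the noise diffusion idea Hippo builds on, the following is a minimal sketch of a generic DDPM-style forward diffusion step: progressively mixing Gaussian noise into a sensor window destroys fine-grained attributes first. The schedule and step count are illustrative assumptions, not Hippo's actual parameters.

```python
# Minimal sketch of a DDPM-style forward diffusion step applied to a sensor
# window. This shows the generic mechanism only; Hippo's hierarchical guidance
# and its exact noise schedule are not reproduced here.
import numpy as np

T = 100                                  # number of diffusion steps (assumption)
betas = np.linspace(1e-4, 0.05, T)       # linear variance schedule (assumption)
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal-retention factor

def diffuse(x0: np.ndarray, t: int, rng=np.random.default_rng()) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Example: a toy 3-axis accelerometer window; a larger t erases more detail,
# so the chosen t acts as a privacy knob before a learned model regenerates
# only the coarse-grained activity features.
window = np.sin(np.linspace(0, 8 * np.pi, 256))[:, None] * np.ones((1, 3))
slightly_noised = diffuse(window, t=10)
heavily_noised = diffuse(window, t=90)
```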
1.2: Research Scope

1.2.1: IoT Device Interactive Threat Detection

As a typical scenario in the IoT, the surge in smart home adoption has significantly elevated the importance and necessity of smart home security. As homes become increasingly interconnected with many smart devices, from thermostats, lighting systems, and security cameras to voice assistants and automated appliances, the potential attack surface for cyber threats expands correspondingly. Smart home security is essential not only for protecting the privacy and personal data of individuals but also for ensuring the physical safety and integrity of the home environment. At the application layer, we study how to manage the configuration data of smart home automation to mitigate interactive threats spanning various closed-source platforms in real time, especially when it is challenging to enumerate interactive threat patterns in a cross-platform setting. To tackle this problem, our goal is to devise a data-driven methodology that efficiently identifies patterns of interactive threats. While current data mining approaches to threat detection, such as [59, 109, 236, 257], typically rely on sequences of event logs, they often fall short in capturing the intricate logic of interactions within smart apps. Thus, we propose a graph data structure to represent the rich event semantics and the complex interplay of logic within automation control apps. We then design a graph neural network model and transfer learning techniques to learn the interactive threat patterns behind the complex device interaction graphs.

1.2.2: Federated Training for Interactive Threat Detection Model

In this dissertation, we study how to develop an interactive threat detection model that accommodates significant heterogeneity in datasets without necessitating the sharing of raw data. Given the limited number of devices in a single household, training a specialized and robust threat detection model locally is not practical. Federated learning (FL) offers a solution by enabling collaborative model training without disclosing raw data. However, dataset heterogeneity can impair the FL model's performance, often resulting in slow convergence and diminished accuracy, particularly when the data distribution is dynamic or unbalanced across different clients [9, 45, 199]. To address this, we propose a layer-wise clustering-based federated graph contrastive learning framework that enhances the training of a shared vulnerability detection model, distinguishing between normal and vulnerable graphs across non-i.i.d. datasets and thereby enhancing generalization capabilities without compromising user data privacy.

In addition, we study how to filter out drifting threat patterns to mitigate data heterogeneity. The IoT environment is inherently complex, characterized by various device types and potential threats. Despite utilizing FL to train on diverse interaction graph data from multiple households, there remains the possibility of encountering drifting samples that represent either evolved or entirely new vulnerabilities. Thus, we design a statistics-based drifting sample detection method built on federated contrastive graph representations, which allows us to isolate these drifting samples. Furthermore, to pinpoint potential triggers of vulnerabilities, we devise an effective causal analysis strategy that examines different subgraphs within an interaction graph. Specifically, we apply Monte Carlo beam search (MCBS) [25], a strategic heuristic algorithm for game tree exploration, and we employ optimal SHAP values [136] to assess the risk associated with different subgraphs in the interaction graph. The SHAP values capture potential correlational features among IoT devices. This method allows us to identify and follow the most probable information flow chain that could contribute to the observed interactive threats.
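As a concrete illustration of the layer-wise clustering idea, the sketch below groups clients by the similarity of one layer's model weights before averaging within each cluster. This is a simplified, assumed formulation for exposition; the dissertation's actual algorithm (dynamic, layer-wise, and contrastive) is more involved.

```python
# Illustrative sketch of clustering-based federated aggregation: clients whose
# local updates for a given layer are similar are grouped and averaged
# together, reducing heterogeneity within each cluster. This stands in for the
# layer-wise dynamic clustering algorithm; it is not the exact method.
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_aggregate(layer_weights: list, n_clusters: int = 2):
    """Group clients by one layer's flattened weights, then FedAvg per cluster."""
    flat = np.stack([w.ravel() for w in layer_weights])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(flat)
    aggregated = {}
    for c in range(n_clusters):
        members = [w for w, l in zip(layer_weights, labels) if l == c]
        aggregated[c] = np.mean(members, axis=0)  # average within the cluster
    return labels, aggregated

# Toy example: 6 clients drawn from two latent "home profiles" whose smart
# home rules (and hence data distributions) are similar within each profile.
rng = np.random.default_rng(0)
clients = [rng.normal(loc=0.0, scale=0.1, size=(8, 8)) for _ in range(3)] + \
          [rng.normal(loc=1.0, scale=0.1, size=(8, 8)) for _ in range(3)]
labels, models = cluster_and_aggregate(clients)
print(labels)  # clients with similar weight distributions share a cluster
```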
1.2.3: Mobile Sensing and Privacy

Mobile activity sensing leverages the various sensors embedded in smartphones and wearables, such as accelerometers, gyroscopes, magnetometers, microphones, and cameras, to collect data about the movements and behaviors of users. In this dissertation, we propose Facer, a system for recognizing facial expressions utilizing near-ultrasound acoustic sensing technology on smartphones. The smartphone emits near-ultrasound signals (19-23 kHz) directed at the user's face from the earpiece speaker. The microphone captures the echoes reflected off the face, containing information about the user's facial expressions. By analyzing the detailed echo patterns, Facer can identify six distinct facial emotional expressions. Our experiments show that when a user holds the smartphone at a consistent distance (e.g., 20-50 cm), Facer can accurately recognize the six universal facial expressions with an accuracy exceeding 85%.

Meanwhile, mobile sensing data often encompasses more details than what is necessary for the intended functionalities of user applications. As a result, mobile applications tend to have excessive access to sensing information, leading to overprivileged issues [201, 237]. To mitigate these issues, we have developed Hippo, a system that leverages a generative AI model to reconstruct mobile activity sensing data at various granularities. The primary goal of Hippo is to obscure sensitive attributes and generalize overly detailed activity data. To obscure sensitive information, Hippo introduces a controlled amount of noise into the sensing data using a diffusion model, a type of generative model. To generalize activity features, Hippo generates new data at different levels of granularity, effectively stripping away sensitive activity features. The proposed method allows users to eliminate unintended sensitive details in activity data, retaining only the essential semantics required by different applications.

1.2.4: Data-centric AI Algorithms

Computer security research encompasses a broad range of methods to address different types of threats and vulnerabilities in various systems, from traditional computer networks to modern AI-driven systems. For example, rule-based methods use predefined heuristics and patterns to detect violations and security threats, but they cannot discover potentially new threats. Formal analysis through model checking and symbolic execution requires significant running time and specialized knowledge to implement. In addition, applying formal methods to large systems can be challenging because the complexity of the analysis tends to grow exponentially with the size of the system. Furthermore, formal methods can hardly capture all aspects of a system's behavior, especially human factors or external environmental variables. This is a disadvantage in dynamic or rapidly evolving environments where flexibility and the ability to adapt quickly are crucial.

AI and machine learning-based methods are used in security for tasks such as anomaly detection and automated threat detection. These methods can adapt to new threats faster than traditional methods. First, AI models can analyze large volumes of data and identify patterns that may indicate malicious activity. This capability allows AI systems to detect sophisticated threats that might elude traditional, rule-based security systems, such as zero-day attacks and advanced persistent threats (APTs). Second, AI can process and analyze data at a speed and scale unattainable by human analysts. This is crucial in security contexts where time is often critical to prevent damage. AI systems can monitor thousands of events across multiple vectors simultaneously, providing a scalable solution that grows with the network or system it protects. Third, through predictive analytics, AI can foresee potential security incidents before they occur by identifying the likelihood of threats based on historical data and emerging trends. This proactive approach helps organizations preemptively tighten their defenses and allocate resources more effectively to high-risk areas.
Fourth, AI systems can continuously learn from new data. As they encounter new types of attacks or variations of existing threats, they can adapt and refine their detection and response strategies without human intervention. This self-improving capability is crucial for keeping up with the rapid development of new threats.

Data-centric AI is a methodology that focuses on improving the quality, consistency, and relevance of the data used in training AI models. Since the performance of AI models depends on the training dataset, instead of primarily focusing on modifying the algorithms, we need to pay more attention to the data preparation process, which serves as a key driver of the AI revolution. There are several key aspects of data-centric AI. First, data quality is crucial for model training. AI models learn to make predictions based on the data they are trained on; if this data contains noise, biases, or corruption, the model will inherently learn these flaws, leading to inaccurate and unreliable outputs. Second, data quantity is important to enhance the generalization and robustness of AI models. Enhancing the diversity and volume of training data through techniques such as synthetic data generation or transformations can help improve the robustness of AI models. Third, data collection strategies are essential for designing data-centric AI algorithms. We need to collect more representative or comprehensive datasets to cover a wider variety of scenarios that the model might encounter in real-world applications. Meanwhile, we need to check for and mitigate biases, which can otherwise lead to unfair, unethical, or harmful decisions. Incorporating continuous feedback into the training process can refine and update the datasets and model accuracy over time.

1.3: Organization

The remainder of this dissertation is structured as follows. Chapter 2 delves into the proposed interactive threat detection model, which includes IoT data fusion, interactive threat pattern learning, and threat knowledge transfer learning. Chapter 3 introduces the designed layer-wise clustering-based federated learning approach for training the threat detection model, and details the proposed Monte Carlo beam search-based method for threat root cause analysis. Chapter 4 presents the proposed contrastive attention-based acoustic sensing model for recognizing six facial expressions, which reveals a new potential privacy leakage channel on smartphones. Chapter 5 introduces the designed multi-grained mobile activity sensing data generation model to address both metadata-level and feature-level overprivileged issues in mobile sensing data. Finally, Chapter 6 concludes this dissertation and outlines opportunities for future research.

CHAPTER 2: GRAPH LEARNING FOR INTERACTIVE THREAT DETECTION

(This chapter is based on previously published work by Guangjing Wang, Nikolay Ivanov, Bocheng Chen, Qi Wang, ThanhVu Nguyen, and Qiben Yan, "Graph Learning for Interactive Threat Detection in Heterogeneous Smart Home Rule Data," published in the Proceedings of the ACM on Management of Data. DOI: 10.1145/3588956 [215].)

2.1: Introduction

Smart homes are increasingly equipped with intelligent devices such as smart locks, smart plugs, and smart ovens, enhancing the living experience through automation. Homeowners can integrate various systems to leverage cross-platform automation functionalities, considering aspects such as device compatibility, centralized voice control, and user-friendly operation. A recent study [38] indicates that 82.4% of smart home setups utilize multiple automation rules for a single device, and 62.4% of users deploy more than one platform in their residences.
The integration of different platforms is facilitated by universal application programming interfaces (APIs), enabling apps from distinct platforms to interact seamlessly. For instance, IFTTT [97] can link with other ecosystems such as SmartThings [189], Amazon Alexa [5], and Home Assistant [10]. Nonetheless, the interconnections among users, devices, and the environment can lead to undesirable or insecure issues, referred to as interactive threats. These threats can arise from various sources, including user misconfigurations [93, 205] or external attacks [1, 245], particularly when multiple systems are configured simultaneously. For example, a conflict might occur in a smart home if a Home Assistant app sets a rule to "open the window if smoke is detected" while another rule states "close the window when the outside temperature is high," potentially preventing the window from opening in response to smoke. This highlights the critical need to manage the interaction logic among automation rules to avoid potential hazards and property damage.

Various techniques have been devised to analyze app interactions, but when confronted with closed-source, heterogeneous platforms, two main challenges arise. The first major challenge is that many prevalent platforms, such as IFTTT and Alexa, are closed-source. This makes methods dependent on source code analysis or code instrumentation futile [3, 55, 203]. Consequently, these proprietary environments hinder the effectiveness of existing heuristic strategies that rely on analyzing or instrumenting source code [2, 27, 28, 39, 54, 55, 70, 154, 203]. The second challenge is the complexity of interactions across various platforms. Many approaches based on security policies [28, 70, 222] operate under the assumption that all automation rules are executed within a single platform [38]. These methods are limited to intra-app and inter-app analyses within a single platform environment, such as the open-source platform SmartThings [189]. In addition, existing methods require predefined threat patterns and security policies. Such approaches are often inadequate to encompass the diverse array of threat types stemming from intricate rule interactions across different platforms.

In this dissertation, we tackle an important but unexplored research question: how can we manage the configuration data of smart home automation systems to avoid interactive threats spanning numerous closed-source platforms? In a cross-platform setting, it is challenging to enumerate all possible interactive threat patterns. To resolve this question, we propose a novel data-driven method aimed at efficiently identifying interactive threat patterns. While current data mining strategies for detecting threats typically rely on event logs [59, 109, 236, 257], these methods do not fully capitalize on the intricate interaction logic inherent in smart apps. This is primarily due to the difficulty of representing the semantics of events and the sophisticated interaction logic through the sequential structure of time-series data.
Graph neural network (GNN) models [71, 96, 108, 228] have experienced widespread success in various tasks such as graph and node classification, link prediction, and graph matching. GNNs are adept at capturing both the explicit features of nodes and the implicit structural features of graphs. In the context of smart home automation, the interactions among different platform rules can be structured into a graph, where each node represents a rule and the edges depict the interactions between these rules. Node features can be enriched using semantic-aware embeddings derived from natural language processing (NLP) techniques [18]. We model interactive threat analysis in smart homes as a graph representation learning task using GNNs and introduce Glint, a framework for Graph learning for interactive threat analysis, aimed at safeguarding heterogeneous smart home environments.

There are three main challenges in designing Glint for interactive threat detection. First, there is a notable absence of graph data with ground-truth labels for identifying interaction patterns on closed-source smart home platforms. Nevertheless, smart home platforms have comprehensive app descriptions, which include device control rule examples. In addition, event logs are available in most smart home apps, reflecting timestamps and device states. By integrating rule descriptions and event log data, we can create a dynamic interaction graph. Furthermore, insights from previous research on open-source platforms [55, 70, 203] provide valuable expert knowledge on vulnerable device interactions, enabling us to construct and label large interaction graph datasets. Thus, we can model interactive threat analysis as a supervised learning problem. This approach is beneficial for its accuracy, leveraging labeled data for learning, and for its efficiency during the testing phase with a trained model.

Second, interaction patterns may evolve, posing the issue of drifting samples due to new or changing threat types and making it challenging to catalog all possible interactive threats. To address this problem, we introduce the ITGNN model, which incorporates a contrastive learning loss to discern graph embeddings, allowing for the identification of drifting samples. This data-driven strategy, utilizing a GNN model to uncover potential threat patterns from graph datasets, surpasses the limitations of static analysis and dynamic testing, which depend on predefined patterns, thus enhancing security policy coverage with Glint.

Third, data availability can vary significantly across platforms, potentially impacting graph representation learning outcomes. Despite this variability in data formats, the underlying semantics of rules across different platforms are similar, since they all aim to control smart devices. To overcome data scarcity on certain platforms, we propose a cross-domain graph transfer learning approach, enabling the transfer of knowledge across heterogeneous platforms and boosting model efficacy on data-limited platforms.

We conducted a comprehensive evaluation of Glint using datasets crawled from five platforms, encompassing over 48,000 interaction graphs. Our results demonstrate that Glint is proficient in identifying interactive threats both within individual platforms and across multiple platforms. Notably, within the SmartThings platform, Glint achieved a detection accuracy of 100%. For detecting threats in diverse interaction graphs, it attained an accuracy of 95.5% and an F1 score of 95.6%.
Furthermore, through the analysis of drifting samples, Glint successfully identified four new interactive threat types in custom blueprints on the Home Assistant platform, which we have named "action block", "action ablation", "trigger intake", and "condition duplicate". In summary, we make the following contributions:

• We have created the first large interaction graph datasets with verified ground-truth labels by examining the trigger-action relationships among rules from closed-source platforms.

• The ITGNN model is developed to facilitate the learning of interaction patterns across diverse platforms. This model utilizes contrastive learning to identify drifting samples and applies transfer learning to improve the generalization capabilities of GNN models.

• We have extensively evaluated Glint in a real-world environment. The evaluation results indicate that Glint surpasses existing state-of-the-art methods in detecting interactive threats across platforms, improving precision by 12.8% and recall by 12.6%.

2.2: Problem Definition

We initiate our discussion by presenting a running example of interaction graphs, which sets the stage for a deeper understanding. Then, we provide a formal definition of interactive threats. The process of identifying interactive threat patterns is then elaborated upon, highlighting how graph representation learning plays a crucial role in the threat discovery process.

Table 2.1: Specific rule contents of nodes, where the index is the node ID in Figure 2.1.
Index | Platform     | Automation Rule Description
1     | SmartThings  | Turn off lights if playing movies.
2     | SmartThings  | If the outdoor temperature is between 65 °F and 80 °F, open windows after sunrise.
3     | SmartThings  | If the outdoor temperature is below 60 °F, then close windows.
4     | SmartThings  | Turn on the air conditioner when the temperature is above 85 °F.
5     | IFTTT        | If the air conditioner is on, then close windows.
6     | IFTTT        | If the smoke alarm is beeping, then open the window and unlock the door.
7     | IFTTT        | If motion is detected, turn on lights.
8     | IFTTT        | If motion is detected, open the door.
9     | Amazon Alexa | Lock the door if all lights are turned off.

Figure 2.1: An interaction graph, where nodes (in circle shape) are rules from three platforms, connected through shared devices and channels (light, door, window, AC).

2.2.1: Interaction Graph

An interaction graph serves as a conceptual model that illustrates the relationships among various rules from different platforms. For instance, as depicted in Figure 2.1, an interaction graph encompasses rules from SmartThings, IFTTT, and Alexa. The specifics of each rule within the graph are detailed in Table 2.1. These automation rules encapsulate a "trigger-action" logic, indicating that the activation of one rule prompts the execution of another. For example, the command "Alexa, play movies" is linked with Rule 1 through a "trigger-action" relationship. Rules can be interconnected through various devices and physical channels, such as temperature or humidity. Additionally, an "action-trigger" relationship can exist between rules, where the action of one rule initiates the execution of another. For instance, the interaction between Rule 1 and Rule 9 is mediated through the action of turning off lights, which subsequently triggers the locking of the door when movies are played. Interaction graphs thus encapsulate interactions among devices, users, and the environment based on "trigger-action" logic.
The "trigger-action" and "action-trigger" connections exemplify the causal relationships within the configuration data of automation rules. Suppose n rules are deployed; then the number of distinct interaction graphs is fixed. For example, if there are ten smart home apps, each with a single rule, only one interaction graph will exist in real time, defined by the "action-trigger" relations among the rules. The absence of a direct relationship between any two rules means that no edge connects their corresponding nodes in the graph. Thus, while the total number of rules impacts the complexity and size of an interaction graph, it does not directly alter the count of interaction graphs in real time.

2.2.2: Threat Model

Interactive threats in smart home environments are essentially the abnormal interactions among various devices, users, and their surroundings. These interactions, harboring threats or anomalies, can lead to unwanted behaviors or significant security and privacy concerns. We mainly consider threat scenarios where smart home automation control rules interact with each other to jeopardize the home automation system. We define three general scenarios where an adversary can implement attacks to exploit interactive threats:

• Targeted compromise: An attacker can compromise a device, such as a voice assistant, to control smart devices by executing inaudible voice commands [1, 234, 245]. An attacker can also trick the user into installing malicious apps to trigger unexpected interactive actions.

• Interaction abuse: An attacker can use existing exploitable interaction chains to subvert the intent of the original home automation [66, 159]. To achieve this, the attacker can trigger an automation rule that leads to unexpected interactions among devices.

• Misconfiguration: When deploying trigger-action rules, users may misconfigure their devices, introducing vulnerabilities into the home automation [93, 205]. For instance, the elderly and children can accidentally trigger vulnerable events via voice assistants, leading to undesirable interactions.

All the above attack scenarios cause interactive threats. An adversary exploits flaws in the interactions among various components to achieve attack goals, such as entering a house without permission, causing property damage by fire, or stealing private information. Note that we assume the smart home IoT platforms and app descriptions are trustworthy. Glint aims to detect interactive vulnerabilities caused by the above attacks, fortifying IoT security in smart homes.

2.2.3: Problem Formulation

We formally define the task of managing automation data to avoid interactive threats as a graph representation learning problem. Given an interaction graph G = {V, E, X}, the nodes V are rules from different platforms, and the edges E are correlations among different rules, such as "trigger-action" or "action-trigger". Each node v ∈ V has a feature vector x_v ∈ X, which is a word embedding or sentence embedding that encodes semantic information. For example, the word "sunset" is encoded as a 300-dimensional embedding [0.60014, ..., 0.35422] with the NLP library spaCy [195]. We take the average of the word embeddings in a rule as the rule-level embedding, which serves as the node feature. While node features in homogeneous graphs come from the same feature space to better characterize rules from homogeneous platforms, node features in heterogeneous graphs come from distinct feature spaces.
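As a concrete illustration of this node featurization, the following sketch derives a rule-level embedding by averaging spaCy token vectors. The specific spaCy model and helper function are illustrative assumptions, not the dissertation's exact pipeline.

```python
# Illustrative sketch of rule-level node features: average the token embeddings
# of a rule sentence using spaCy's static word vectors. Loading en_core_web_md
# (300-dimensional vectors) is an assumption for this example.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

def rule_embedding(rule_text: str) -> np.ndarray:
    """Average word embeddings over a rule description -> one node feature."""
    doc = nlp(rule_text)
    vectors = [tok.vector for tok in doc if tok.has_vector and not tok.is_punct]
    return np.mean(vectors, axis=0)

vec = rule_embedding("Turn off lights if playing movies.")
print(vec.shape)  # (300,)
```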
Given an interaction graph dataset {G}, the goal is to design a model f_w : {G} → R^d that learns a d-dimensional representation of each graph G. The model f_w can serve different learning tasks, such as drifting sample detection and binary classification. The learned graph representation preserves the structural and semantic information [96] that is associated with the interactive threat patterns.

2.3: System Design

In this section, we delve into the design and application of Glint, outlining its back-end (offline) and front-end (online) components. We elaborate on the four pivotal elements of Glint: (i) the process of constructing the interaction graph dataset, explaining how we gather, structure, and label the data to create a comprehensive representation of smart home interactions; (ii) the ITGNN model, our chosen method for graph representation learning, detailing its architecture and how it processes the interaction graph to learn meaningful representations; (iii) the algorithm developed to detect drifting interaction patterns, highlighting its role in identifying evolving or new threats within the smart home environment; and (iv) our strategy for augmenting the ITGNN model's generalization capabilities, which involves transferring learned knowledge across different, heterogeneous smart home platforms to improve model robustness and adaptability.

2.3.1: System Overview

Glint operates through two primary phases: offline model training and real-time interactive threat detection. The workflow of smart home automation data analysis is depicted in Figure 2.2.

Figure 2.2: Glint builds an interaction graph for a house, and uses a well-trained GNN model to automatically discover interactive threats and generate threat warnings.

In the offline stage: (1) Glint conducts an analysis of potential intra-platform and inter-platform interactions to construct interaction graphs, which are then aggregated into a comprehensive interaction graph dataset and stored in a database. (2) The GNN model, tailored for this task, is trained on the interaction graph dataset, utilizing contrastive learning and a classification approach. (3) Transfer learning is applied to the trained model to enhance its generalization capabilities, culminating in the development of a GNN-based interactive threat detection tool.

In the online stage: (4) Utilizing the trigger-action logic derived from rules and device statuses from event logs, Glint dynamically generates interaction graphs in real time. (5) The freshly constructed graph undergoes a two-step analysis: initially, it is processed by the contrastive learning-based drifting sample detector, followed by scrutiny from the interactive threat detector (a classification model). (6) Should an interactive threat be identified, Glint issues an alert and prompts for user intervention. (7) & (8) In cases where drifting samples are detected or when users identify non-threatening graphs as false alarms, these specific graph cases are noted, and the model is fine-tuned to better align with user expectations and improve threat detection accuracy.

In the offline stage, a generic GNN model is developed utilizing graph datasets from diverse platforms. This model, created by a security solution provider, can be accessed via cloud services. Each user utilizes the same offline component, ensuring consistency in the model's foundational training.
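To make step (5) concrete, here is a minimal, assumed sketch of the online decision logic: an embedding whose distance from the training distribution exceeds a threshold is flagged as drifting; otherwise it is passed to the threat classifier. The z-score rule, threshold, and class names are illustrative, not Glint's exact statistics.

```python
# Illustrative sketch of the online two-step analysis in step (5): a simple
# statistics-based drift check on graph embeddings, then classification.
# The threshold rule and names are assumptions; Glint's actual drifting-sample
# statistics may differ.
import numpy as np

class OnlineDetector:
    def __init__(self, train_embeddings: np.ndarray, classifier, z_threshold=3.0):
        self.mu = train_embeddings.mean(axis=0)
        self.sigma = train_embeddings.std(axis=0) + 1e-8
        self.classifier = classifier      # trained threat classifier (callable)
        self.z_threshold = z_threshold

    def analyze(self, embedding: np.ndarray) -> str:
        # Step 1: flag embeddings far from the training distribution as drifting.
        z = np.abs((embedding - self.mu) / self.sigma).max()
        if z > self.z_threshold:
            return "drifting"             # route to user review and fine-tuning
        # Step 2: otherwise classify the graph as threat vs. benign.
        label = self.classifier(embedding)
        return "threat" if label == 1 else "benign"
```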
For practical application, Glint is integrated into the smart home hub, simplifying the process for users, who then primarily interact with the online stage. This enables Glint to play a crucial role in scanning for interactive threats during the initial configuration of a smart home system. As the system operates and gathers deployment setups and event log data, the GNN model undergoes fine-tuning to align with individual user preferences. Additionally, a companion application, Glint, can be installed on smartphones. This app is designed to alert users to potential interactive threats detected by Glint, facilitating user engagement and intervention when necessary. This interactive process ensures that the smart home system remains secure and tailored to the user's specific requirements and preferences.

Imagine a user sets up the series of rules detailed in Table 2.1. For the sake of simplicity, let us assume that Rules 2, 3, 7, and 8 are inactive during a certain period. A real-time interaction graph is formulated by examining the deployed rule descriptions and the runtime event logs. When Glint identifies an interactive threat, it alerts the user, as depicted in Figure 2.3a. In this interaction graph, the potential sources of the threat (e.g., specific nodes and associated devices) are highlighted in red to help reduce the amount of information the user has to process. These potential sources can be pinpointed using GNN explanation tools such as PGExplainer [137], SubgraphX [241], or FexIoT [214], which determine the crucial subgraphs or nodes influencing the GNN's prediction. By considering the user's daily appliance usage habits and analyzing the event logs shown in Figure 2.3b alongside the potentially vulnerable rules illustrated in Figure 2.3c, the user can discern whether the rule interaction is undesirable. If it is, they have the option to halt or modify the rule configurations by adjusting the relevant smart home platform apps, ensuring the system's alignment with their preferences and enhancing security.

Figure 2.3: The usability demonstration of Glint: (a) notification; (b) event logs; (c) operation.

2.3.2: Building Interaction Graph Dataset

We first build interaction graph datasets using rule descriptions in the offline stage. Then, we construct real-time interaction graphs with deployed rules and event logs in the online stage. We explore IoT interactions in closed-source and cross-platform environments, where the design of the NLP pipeline facilitates information access on the closed-source platforms.

Figure 2.4: An example of dependency parsing for the rule "Turn on light if the door opens," showing part-of-speech tags (VERB, ADP, NOUN, SCONJ, DET) and dependency arcs (prt, dobj, advcl, mark, det, nsubj).

Algorithm 1: Home Automation Rule Feature Extraction
Input: IoT automation rule trigger set TS, action set AS
Output: rule correlation feature vector V
1  foreach T in TS and A in AS do
2      [nouns, verbs]_T = PoS(T)
3      [nouns, verbs]_A = PoS(A)
4      V1 = DTWDistance([nouns, verbs]_T, [nouns, verbs]_A)
5      V2 = binaryRelation([verbs]_T, [verbs]_A)
6      V3 = binaryRelation([nouns]_T, [nouns]_A)
7      V4 = E_T + E_A    // E_T and E_A are the averaged word embeddings of a trigger and an action, respectively
8  end
9  return [V1, V2, V3, V4]

Rule Correlation Discovery
The primary obstacle in constructing interaction graph datasets lies in identifying the correlations among various rules. The plethora of noisy, inconsistently formatted data complicates the task of establishing rule correlations across diverse platforms.
Drawing inspiration from existing research [197, 200, 222], we employ natural language processing (NLP) methods to extract meaningful semantic features from the publicly accessible descriptions of apps. Given the extensive space of potential "action-trigger" combinations, we approach correlation identification as a binary classification task, aiming to minimize the need for extensive manual labeling. We design the smart home automation rule feature extraction algorithm shown in Algorithm 1. The input is a set of trigger phrases and action phrases, and the output is a set of correlation feature vectors. We implement part-of-speech (POS) tagging and syntactic element extraction (lines 2-3). POS tagging follows context and word definitions to recognize the main task, trigger object, and action object. For example, in Figure 2.4, the dependency parser recognizes the root verb Turn as the main task. The dependency edges connect other syntactic words to the root verb, indicating grammatical relations between the trigger object and the action object. We focus on direct objects (dobj), nominal subjects (nsubj), compounds and modifiers, nominal and passive nominal subjects, and clausal complements, based on which we extract the main task, objects, and their properties in one sentence. Moreover, we discard named entities, since a named entity may modify two different objects, which would reduce the uniqueness of features. For example, Wyze Cam and Wyze Thermostat are two different objects, but Wyze would add bias when computing the semantic similarity.

We compute the numerical context features based on the linguistic elements. We calculate the verb similarity and object similarity in the trigger-action pairs (line 4 in Algorithm 1). We apply dynamic time warping [169] to compute similarity because the number of verbs or objects in trigger and action sentences varies. Then, we compute the binary semantic features (lines 5-6). We analyze whether the verb or object words in trigger-action pairs have synonym or hypernym relations, and whether the object words have meronym (i.e., a constituent part of an object) or holonym (i.e., the whole to which a part belongs) relations. These relations better reflect the generic semantic relationships between triggers and actions. Finally, we compute the trigger-action pair embedding (line 7) by summing the averaged word embeddings of the corresponding triggers and actions. These features are unique to "trigger-action" rule pairs, allowing classifiers to learn the internal characteristics of interactions.

We compose the above semantic features into feature vectors to train a model that detects correlations among "action-trigger" pairs. In this way, building correlation pairs among different rules is formulated as a binary classification problem: if the given action can lead to the invocation of the trigger, the pair is labeled "true"; otherwise, it is labeled "false".
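To make this pipeline concrete, the following minimal Python sketch illustrates how the four feature groups of Algorithm 1 could be computed with spaCy and WordNet. The helper names (pos_elements, pair_features) and the simple cosine-cost DTW are our own illustrative choices, not Glint's actual implementation; only the synonym check is shown, and the hypernym/meronym/holonym checks via WordNet are analogous.

import numpy as np
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_lg")  # provides 300-d word vectors

def pos_elements(text):
    doc = nlp(text)
    return ([t for t in doc if t.pos_ == "NOUN"],
            [t for t in doc if t.pos_ == "VERB"])

def dtw_distance(seq_a, seq_b):
    # Classic DTW over token sequences with a cosine-similarity cost.
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 1.0 - seq_a[i - 1].similarity(seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def has_synonym_relation(tokens_a, tokens_b):
    lemmas_b = {l for t in tokens_b for s in wn.synsets(t.lemma_)
                for l in s.lemma_names()}
    return any(t.lemma_ in lemmas_b for t in tokens_a)

def pair_features(trigger, action):
    n_t, v_t = pos_elements(trigger)
    n_a, v_a = pos_elements(action)
    v1 = dtw_distance(n_t + v_t, n_a + v_a)        # context similarity (line 4)
    v2 = float(has_synonym_relation(v_t, v_a))     # verb relation (line 5)
    v3 = float(has_synonym_relation(n_t, n_a))     # object relation (line 6)
    v4 = nlp(trigger).vector + nlp(action).vector  # summed averaged embeddings (line 7)
    return v1, v2, v3, v4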
Offline and Online Graph Construction

In the offline stage, there are two phases to building interaction graphs. The input dataset is a set of crawled rule sentences, which provide rich trigger-action interaction logic. First, based on the rule correlation discovery model, we predict the remaining unlabeled rule pairs. In this way, we obtain "action-trigger" correlations between two rules. In the second phase, based on the correlation results, we randomly select and concatenate the "trigger-action" and "action-trigger" rule interactions to form interaction graphs, which makes the generated graphs less prone to bias. We use the DGL [221] library to build and store the graphs, which are accessible with DGL APIs.

In the online stage, we consider scenarios in which one house is equipped with a single platform or multiple platforms. With our offline rule correlation discovery classifier, we can first build a complete interaction graph from all rules deployed in a user's smart home, as shown in Table 2.1. However, such a graph cannot reflect the actual interactions among smart devices in real time, because it lacks the triggers' chronological ordering information. Fortunately, event logs contain three basic elements: time, object, and the current status of the object, which reflect real-time device status as shown in Figure 2.3b. The event time helps determine the event sequences. Users can set up the device name and location when they deploy and configure devices. Thus, we obtain semantic information such as device types, locations, and device states to differentiate among different devices. Then, we match the device type and current status with the graph constructed from the deployed rules. In this way, we can remove unrealistic "trigger-action" pairs, such as the second rule (action) happening before the first rule (trigger).

The temporal dimension is not only integrated into the NLP embeddings but is also used for pruning graphs. First, we encode the temporal semantics (e.g., "sunset", "at midnight") into embeddings, so that the ML models can learn the temporal features. Second, in the online stage, we use the triggers' chronological ordering information in the event logs. For example, we can set a time interval (e.g., 3 hours) and use the timestamps to remove any unrealistic pairs. In this way, although two rules (nodes) could potentially interact, there is no edge between them in a real-time interaction graph because of the disjoint occurrence times. The temporal information allows the model to better learn "action-trigger" correlations. Besides, with temporal information, we can remove unrealistic correlations and reduce the number of interaction graph edges, making it convenient for users to inspect device interactions. We acknowledge that there could be opportunities for enhancement from both data and model perspectives when considering the temporal dimension.
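As a concrete illustration of the timestamp-based pruning step, the sketch below drops candidate edges whose action event does not precede the trigger event within the chosen window. The candidate_edges and last_event_time structures are hypothetical stand-ins for Glint's internal data.

from datetime import timedelta

def prune_edges(candidate_edges, last_event_time, window=timedelta(hours=3)):
    """Keep an "action -> trigger" edge only if the action rule's event
    occurred before the trigger rule's event and within the time window.
    candidate_edges: iterable of (action_rule_id, trigger_rule_id);
    last_event_time: dict mapping rule id -> datetime of its latest event."""
    kept = []
    for action_rule, trigger_rule in candidate_edges:
        t_a = last_event_time.get(action_rule)
        t_t = last_event_time.get(trigger_rule)
        if t_a is not None and t_t is not None and t_a <= t_t <= t_a + window:
            kept.append((action_rule, trigger_rule))
    return kept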
Figure 2.5: The proposed ITGNN model for interaction graph representation learning.

2.3.3: Graph Representation Learning

In this section, we model the interactive threat analysis as a graph learning problem to automatically learn interaction patterns. Considering the different types of information linked to nodes, a single type of feature representation cannot represent the whole graph due to the differences in types and dimensionality. Existing GNN models [71, 96, 108, 228] target either homogeneous graphs or heterogeneous graphs with different designs. Besides, the interaction patterns could exist at different scales in a heterogeneous interaction graph, e.g., at different subgraph levels. In this chapter, we design ITGNN, a unified model for Interactive Threat analysis based on GNNs, which learns interactive threat patterns from different scales of an interaction graph.

Graph Representation Model Design

We design the interaction graph representation learning process as shown in Figure 2.5, and the ITGNN model is detailed in Algorithm 2. The other input parameters in Algorithm 2 include the node type set A; the nodes V_A of each type A; the node weight matrix W_A; the node feature vectors x_v ∈ X; the metapath set P; the number of layers L; the number of scales D; the weight matrix M_A and bias b_A for the linear transformation of each metapath-specific node vector; the activation function σ; and the aggregation function Φ. ITGNN takes as input a graph G(V, E, X) and the corresponding parameters, and outputs a d-dimensional graph representation z_g. ITGNN considers interaction features from both the neighborhoods of a vertex and the multi-scale structures of a graph.

Algorithm 2: ITGNN Graph Representation Learning
Input: interaction graph G and other input parameters
Output: graph representation z_g
1   if the number of node types > 1 then
2       foreach node type A ∈ A do
3           h_v = W_A · x_v^A    ▷ project node features into a common space
4           foreach metapath P ∈ P_A do
5               foreach v ∈ V_A do
6                   h_v^P = (1/|N_v^P|) Σ_{p(v,u) ∈ N_v^P} ( (1/|p(v,u)|) Σ_{t ∈ p(v,u)} h_t )    ▷ intra-metapath aggregation
7               end
8           end
9           s^{P_i} = (1/|V_A|) Σ_{v ∈ V_A} sigmoid(M_A · h_v^{P_i} + b_A)
10          β^{P_i} = Softmax(q_A · s^{P_i})
11          h_v^{P_A} = Σ_{P ∈ P_A} β^P · h_v^P    ▷ inter-metapath aggregation
12      end
13      G_m = G(V, E, {h_v^{P_A}})
14  end
15  for l = 1 ... L do
16      for d = 1 ... D do
17          h_v^d = VIPool(G_m)
18          h_v = h_v || h_v^d
19      end
20      h_v^{(l)} = σ(h_v^{(l−1)}, Φ(h_u^{(l−1)}, u ∈ N_v))
21  end
22  return z_g = readout(h_v, v ∈ V)

First, we project all node features in heterogeneous graphs into the same feature space and then aggregate intra-metapath and inter-metapath information, which we call metapath-based node transformation. A metapath describes a composite relation between a series of node types, and a metapath instance is a sequence of nodes in a graph following the metapath schema [71]. Unlike homogeneous graph classification, we need to consider how to aggregate the various types of nodes into the readout function. Inspired by the MAGNN model [71], we first project heterogeneous node features into the same embedding space (line 3). For a node v of type A, we obtain the projected node embedding h_v. Then, for each target node, we average the node features of its metapath instances into a single vector (lines 5-7). In line 6, N_v^P is the metapath-based neighborhood of node v, p(v, u) is a metapath instance, and v is the mapping target.

Second, we use the attention mechanism to aggregate the inter-metapath information (lines 9-11), where s^{P_i} is the summarized representation of metapath P_i ∈ P_A, q_A is the attention vector for node type A, and β^{P_i} is the relative importance of each metapath P_i for the target node v. Interactive threats are usually caused by a subset of the graph's metapath instances, so different instances contribute to the final graph embedding to different extents. In this way, we transform the node embedding features while retaining the graph structure to form homogeneous-type graphs (line 13).
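The inter-metapath attention of lines 9-11 can be written compactly in PyTorch. The module below is a minimal sketch under our own naming (InterMetapathAttention), assuming the per-metapath node vectors have already been produced by the intra-metapath step; it is not ITGNN's actual implementation.

import torch
import torch.nn as nn

class InterMetapathAttention(nn.Module):
    """Fuse per-metapath node vectors h_v^{P_i} into h_v^{P_A} (Algorithm 2, lines 9-11)."""
    def __init__(self, dim, att_dim=128):
        super().__init__()
        self.proj = nn.Linear(dim, att_dim)          # plays the role of M_A and b_A
        self.q = nn.Parameter(torch.randn(att_dim))  # attention vector q_A

    def forward(self, h):                            # h: (num_metapaths, num_nodes, dim)
        s = torch.sigmoid(self.proj(h)).mean(dim=1)  # s^{P_i}: summarize each metapath
        beta = torch.softmax(s @ self.q, dim=0)      # beta^{P_i}: metapath importance
        return (beta.view(-1, 1, 1) * h).sum(dim=0)  # weighted inter-metapath fusion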
Third, we extract comprehensive features from multiple scales of a graph by incorporating the vertex infomax pooling (VIPool) module [123] into ITGNN; this component is called the multi-scale graph generator and fusion. We input the homogeneous-type graph and apply VIPool [123] to generate multi-scale graphs (lines 15-21), from which we extract features. The multi-scale graphs are propagated via the TAG [58] convolution layer, which does not need to approximate the graph convolution. By vertex pooling, we select and keep vertices that share high mutual information with their neighborhoods, which can well express local subgraphs. We concatenate the different scales of features as h_v = h_v^0 || · · · || h_v^{d−1} || h_v^d (line 18), where h_v^d is the d-th scale of the graph features and || represents concatenation. Fully connected layers are attached to fuse the multi-scale features.

The ITGNN graph representation learning can be used for different tasks. If the interaction graph is heterogeneous, ITGNN first maps node features from different feature spaces into the same space. Then, Glint maps the homogeneous graph into a graph embedding. Next, we feed the generated graph embedding to the drifting sample detection module and the classification module to identify the presence of an interactive threat.

Drifting Interaction Pattern Detection

Concept drift is an important problem when modeling heterogeneous data from different IoT platforms. Given that the interactive threat labeling is based on a set of known interactive threat patterns, it is conceivable that the learned model could produce false negatives. This is because the testing interaction graph distribution may drift away from that of the training dataset due to new smart device automation rules and the presence of various attacks. As a result, there will be a shift in the decision boundary [73] of the supervised or semi-supervised machine learning models, which will degrade the model performance. Since the classification models are mostly based on the i.i.d. assumption, detecting and filtering drifting samples helps reduce the classification model's false positives and false negatives. Therefore, we need to detect the interaction graph samples that drift away from the existing classes.

Algorithm 3: Drifting Interaction Pattern Detection
Input: training interaction graphs g_i^{(j)}, where i ∈ {0, 1} indexes the binary classes and j = 1, ..., n_i (n_i is the number of samples in class i); testing samples x^{(m)}, m = 1, ..., M (M is the total number of testing samples); ITGNN-C model f
Output: drifting degree A^{(m)} for each testing sample
1   for i ∈ {0, 1} do
2       for j = 1 to n_i do
3           z_i^{(j)} = f(g_i^{(j)})    ▷ the latent representation of g_i^{(j)}
4       end
5       for j = 1 to n_i do
6           d_i^{(j)} = ||z_i^{(j)} − (1/n_i) Σ_{j=1}^{n_i} z_i^{(j)}||_2
7       end
8       MAD_i = median(|d_i^{(j)} − median(d_i^{(j)})|)
9   end
10  for m = 1 to M do
11      for i ∈ {0, 1} do
12          d_i^{(m)} = ||f(x^{(m)}) − (1/n_i) Σ_{j=1}^{n_i} z_i^{(j)}||_2
13          A_i^{(m)} = |d_i^{(m)} − median(d_i^{(j)})| / MAD_i
14      end
15      return A^{(m)} = min_i A_i^{(m)}
16  end

Here, we design the ITGNN architecture with a contrastive learning loss, called ITGNN-C, to learn a distance function for discovering drifting samples. Taking advantage of our labeled graph dataset, contrastive learning can achieve better performance than unsupervised methods [235]. The basic idea of contrastive learning is to enlarge the distance between samples with different labels and reduce the distance between samples with the same label. Given a set of samples x_i with corresponding labels y_i, the contrastive loss takes a pair of samples (x_i, x_j) as input and learns a function f_θ. The formula is written as follows:

L_c(x_i, x_j, θ) = T[y_i = y_j] ∥f_θ(x_i) − f_θ(x_j)∥_2^2 + T[y_i ≠ y_j] max(0, ϵ − ∥f_θ(x_i) − f_θ(x_j)∥_2)^2,    (2.1)

where T[y_i = y_j] is 1 if the two graph samples have the same label and 0 otherwise, T[y_i ≠ y_j] is 1 if the two graph samples have different labels and 0 otherwise, and ϵ is the upper-bound distance that prevents exceptional contributions from highly dissimilar pairs.
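Eq. (2.1) translates almost directly into PyTorch. The function below is a minimal batched sketch; the margin value is a placeholder, not the one tuned for ITGNN-C.

import torch

def contrastive_loss(z_i, z_j, same_label, eps=10.0):
    """Pairwise contrastive loss of Eq. (2.1).
    z_i, z_j: (B, d) embeddings f_theta(x_i), f_theta(x_j);
    same_label: (B,) boolean tensor, True where y_i == y_j."""
    d = torch.norm(z_i - z_j, p=2, dim=-1)
    pos = same_label.float() * d.pow(2)                               # pull same-label pairs together
    neg = (~same_label).float() * torch.clamp(eps - d, min=0).pow(2)  # push different-label pairs apart
    return (pos + neg).mean()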
The ITGNN-C model augmented with the contrastive loss can map samples from each class to a compact region in the latent space. After training the ITGNN-C model with labeled data, we can apply it to discover drifting samples. Given a list of testing samples x_t, and inspired by CADE [235], we evaluate whether x_t drifts from the current classes in the training data, as shown in Algorithm 3. We have the binary classes "normal" and "threat", denoted as 0 and 1, respectively. We first generate the latent representations of the training graph samples and compute the mean of the representations as the centroid of each class (lines 2-4). We then calculate the median absolute deviation MAD_i within each class i (lines 5-9). For testing, we generate representations of the testing samples and compute each sample's drifting degree (lines 10-16). We recognize a potential drifting sample by comparing A^{(m)} with a threshold T_MAD, which is set to 3 empirically [114]. Users or security analysts can further investigate the drifting samples and conduct retraining.
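The centroid-and-MAD test of Algorithm 3 is straightforward to reproduce with NumPy. The sketch below assumes the graph embeddings have already been produced by ITGNN-C; the function names are ours.

import numpy as np

def class_stats(Z):
    """Z: (n_i, d) latent representations of one training class (lines 2-9)."""
    centroid = Z.mean(axis=0)
    d = np.linalg.norm(Z - centroid, axis=1)
    mad = np.median(np.abs(d - np.median(d)))  # MAD_i
    return centroid, np.median(d), mad

def drift_degree(z, stats_per_class, t_mad=3.0):
    """Minimum drifting degree over classes (lines 10-16);
    the sample is flagged as drifting if it exceeds t_mad."""
    degrees = []
    for centroid, med, mad in stats_per_class:
        d = np.linalg.norm(z - centroid)
        degrees.append(abs(d - med) / mad)
    a = min(degrees)
    return a, a > t_mad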
Vulnerable Interaction Classification

Besides the contrastive learning design, which learns to compare two samples, we design the supervised classifier loss function L for classifying graphs as normal or vulnerable:

L = − (1 / Σ_{n=1}^{N} w_{y_n}) Σ_{n=1}^{N} w_{y_n} x_{n,y_n} − β L_pool,    L_pool = (1/n) Σ_i (t_i · log(o_i) + (1 − t_i) · log(1 − o_i)),    (2.2)

where w_{y_n} is the class-label weight that imposes a high penalty for misclassifying the minority class, L_pool is the graph pooling loss [123], β is a hyperparameter that balances the classification loss and the pooling loss, x_{n,y_n} is the element of the input embedding to the loss function, o_i is the logit output from each VIPool layer, and t_i is the binary label. To minimize false positives and false negatives, it is necessary to remove potential drifting samples and subsequently employ the classification model for detecting interactive threats. This approach yields higher accuracy than relying solely on contrastive learning, as demonstrated in Table 2.6.

Cross-Domain Graph Transfer Learning

Another challenge is that the semantic knowledge from different platforms varies. Some platforms contain less semantic knowledge, making it hard to train a robust GNN model. For example, the interaction rules of apps from the SmartThings platform are limited, with only 135 apps suitable for analysis. However, a key insight is that the apps' descriptions are mainly about the functions of device controls on the smart home platforms, which share common information. For example, the app on the SmartThings platform "Turn on the air conditioner when the temperature is above 85 °F" and the skill on the Alexa platform "Alexa, turn on the air conditioner" trigger a similar action. Interaction knowledge can be transferred across different smart home platforms because these platforms all target the same trigger-action paradigm. Therefore, we design a transfer learning module to transfer knowledge of interaction information across platforms. Transfer learning has demonstrated its superiority in enhancing model performance and reducing training time with less training data in computer vision [80, 239]. Deep learning models typically show excellent transferability properties. For example, the first-layer features tend not to be specific to a particular task [239]. Similarly, the first few layers of embedding features of a GNN contain common information. The node features of the different platforms are all semantic-based embeddings, and they share properties that are transferable among smart home platforms. For example, the knowledge in the IFTTT interaction graph can be transferred to the SmartThings interaction graph. In turn, the learned features of the SmartThings graph model can be used to train a better IFTTT graph model. We can transfer embedding features from other pre-trained models to train a more robust model.

Formally, suppose D_S is the source domain and D_T is the target domain, where D = {X, P(X)} and X_S ≠ X_T. A domain feature space X is defined as the set of graph node features of a certain platform in the context of Glint. X_S is the feature space of the source domain, and X_T is the feature space of the target domain. The goal of transfer learning is to learn P(Y_T | X_T) by gaining information from D_S and the source task T_S. First, Glint trains a source domain model M_{T_S}, taking the source domain graph node features as input. Then, Glint transfers part of the layers of M_{T_S} to the target domain task model. Finally, Glint uses a small amount of data in D_T to fine-tune the parameters in the transferred layers and to train the parameters in the new layers. The number of transferred layers depends on the specific graph model and is chosen to yield better performance.

Glint uses a two-pronged approach for transfer learning. First, Glint transfers semantic knowledge from one platform to another, such as from the SmartThings interaction graph to the IFTTT interaction graph. There are common features in semantic knowledge from different platforms, so Glint can transfer semantic knowledge in the form of feature embeddings. Second, Glint transfers interaction graph knowledge from homogeneous to heterogeneous interaction graphs. Different platforms provide similar functions even though their usage differs. For example, an Alexa voice command or a SmartThings app can directly control the light status. The embeddings of GNN layers can transfer such interaction knowledge. Because the features learned by the first several GNN layers are generic enough to apply to many tasks [178], Glint can freeze the first several layers close to the input layer to preserve the transferred knowledge.
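In PyTorch, the layer-freezing step amounts to disabling gradients for the transferred layers and optimizing only the rest. This is a minimal sketch; the attribute names (gnn_layers) are hypothetical, as the real ITGNN module layout differs.

import torch

def fine_tune_setup(model, n_frozen=2, lr=1e-3):
    # Freeze the first n_frozen GNN layers transferred from the source domain.
    for layer in model.gnn_layers[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False
    # Optimize only the remaining (unfrozen and newly added) parameters
    # on the target-domain data.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)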
2.4: Evaluation and Discussion

We first evaluate the rule correlation discovery models for graph dataset construction. Then, we present and label the constructed interaction graphs. Next, for interactive threat analysis, we evaluate the performance of our designed ITGNN model on homogeneous and heterogeneous graph datasets. Then, we explore the possible interactive threats in user-designed automation rules. Finally, we evaluate Glint with a real-world testbed.

Table 2.2: The number of rules from 5 platforms.
IFTTT     SmartThings   Alexa Skill   Google Assistant   Home Assistant
316,928   185           5,506         5,292              574

Figure 2.6: The performance of five classification models (accuracy, precision, recall, and F1 of SVC, MLP, RForest, KNN, and GBoost).

2.4.1: Rule Correlation Discovery Evaluation

We use Scrapy [181] to crawl publicly available rule descriptions from 5 smart home platforms, as shown in Table 2.2. We randomly choose and manually label 5,600 pairs that have action-trigger correlations and another 8,000 pairs that have no action-trigger correlations. Part of the labeled data comes directly from [222], and we manually label the interaction pairs in the newly crawled data. We use the labeled dataset to train classifiers to recognize whether a sentence pair has an action-trigger correlation. Then, we use the well-trained classifiers to label whether a sentence pair has an action-trigger correlation for the unlabeled rule pairs. Besides, we manually check the classification results to ensure the correctness of the action-trigger correlations. In this way, we can efficiently filter out the impossible and unrealistic correlations among different rules.

We compare five different classification models: C-Support Vector (SVC), Multi-layer Perceptron (MLP), Random Forest (RForest), K-nearest Neighbors (KNN), and Gradient Boosting (GBoost), implemented with the Scikit-learn [164] library. The input of a model is the feature vector of an action-trigger pair; the output is true or false, where true means there is a correlation from the "action" to the "trigger". To deal with the issue of class imbalance, we adjust class weights inversely proportional to class frequencies in the training data. For instance, we use weight balancing in the MLP loss function to impose a high cost on the misclassification of "true" flows. We apply the grid search method to find the most effective hyperparameters and use 10-fold cross-validation to test the generalization of the models. We compute the accuracy, recall, precision, and F1 score to evaluate the models. The recall R represents the percentage of positive samples in the test set that are predicted correctly; a high recall means a low false-negative rate. The precision P shows the percentage of positive predictions that are truly positive; a high precision means a low false-positive rate. The F1 score is the harmonic mean of precision and recall: F1 = 2·P·R / (P + R).

Can ML aid in rule correlation discovery? As shown in Figure 2.6, all five models achieve excellent correlation classification performance, demonstrating the utility of the designed features. MLP and Random Forest achieve 98.2% and 98.4% accuracy, respectively. Both SVC and KNN achieve the highest average precision of 100%. The high precision ensures that the true connection predictions have a high probability of being truly positive. MLP and Random Forest achieve 99.8% and 98.2% recall, respectively. Besides, Random Forest achieves the highest F1 score of 98%; a higher F1 score indicates better overall performance. With deep and large hidden layers, MLP can better characterize the hidden features in the dataset. The Random Forest model is based on bagged trees and uses a random subspace method to prevent overfitting. Based on their highest precision, recall, and F1 values, we choose MLP, Random Forest, and KNN to collaboratively predict the remaining 20,000 unlabeled pairs, so that we can uncover the interaction correlations between actions and triggers. If the predictions of the three models differ, we manually review the prediction results to determine the final correlation labeling result.
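The comparison above can be set up with Scikit-learn in a few lines: balanced class weights for the imbalance, a grid search for hyperparameters, and 10-fold cross-validation. The sketch below shows two of the five models; X and y stand for the action-trigger pair features and labels described above, and the grids shown are illustrative rather than the ones actually tuned.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: (n_pairs, n_features) action-trigger pair features; y: 1 = correlated, 0 = not
candidates = {
    "SVC": (SVC(class_weight="balanced"), {"C": [0.1, 1, 10]}),
    "RForest": (RandomForestClassifier(class_weight="balanced"),
                {"n_estimators": [100, 300], "max_depth": [None, 20]}),
}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=10, scoring="f1")  # 10-fold CV
    search.fit(X, y)
    print(name, search.best_score_, search.best_params_)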
2.4.2: Interaction Graph Construction

Due to the large volume of rule combinations, we randomly select and chain different rules that formulate "trigger-action" correlations to build interaction graphs, whose number of nodes ranges from 2 to 50. In an interaction graph, we build an edge between two rules that have a "trigger-action" correlation, and each rule is a node.

Table 2.3: The number of interaction graph datasets.
Type      Platforms      Label       Total Graph Num.   Unsafe Graph Num.
Homo.     IFTTT          labeled     6,000              1,473
Homo.     IFTTT          unlabeled   10,000             *
Homo.     SmartThings    labeled     165                36
Hetero.   5 platforms    labeled     12,758             3,828
Hetero.   5 platforms    unlabeled   19,440             *

For the IFTTT interaction graph dataset construction, we use rules from both the dataset (315,393 applets) from [222] and our newly crawled 1,535 applets to construct interaction graphs among different applets. Considering that descriptions can be lengthy and contain multiple sentences, we extract word phrases from the text descriptions of IFTTT applets. Similarly, we build the SmartThings graph dataset by analyzing the app descriptions provided by the developers. For the IFTTT and SmartThings homogeneous interaction graph datasets, we use the en_core_web_lg model in spaCy [195] to obtain the averaged word embeddings of each phrase in every rule. The embedding dimension is 300, and the embeddings serve as node features.

As shown in Table 2.3, on the IFTTT platform, we build and manually check 6,000 interaction graphs to label whether the graphs contain interactive threats. In total, we label 1,473 graphs that contain interactive threats. Besides, we build 10,000 unlabeled IFTTT interaction graphs for discovering interactive threats on the smart home platform. On the SmartThings platform, we cross-check our 165 SmartThings inter-app interaction graphs with existing graphs [54] to ensure accuracy, and we label 36 unsafe graphs out of the 165.

We build a heterogeneous interaction graph dataset on the IFTTT, SmartThings, Alexa, Google Assistant, and Home Assistant platforms by building interaction correlations across platforms. Specifically, we use IFTTT, SmartThings, and Alexa to build 12,758 labeled heterogeneous interaction graphs, among which 3,828 graphs are labeled as vulnerable. For voice assistant rules, we use the Universal Sentence Encoder [29] to obtain sentence embeddings of the rules, which are more suitable for identifying sentence features. The dimension of each embedding is 512, and it is used as the node feature. Besides, Glint generates 19,440 unlabeled graphs based on the above five platforms. The Google Assistant rules and the blueprints (rules) from Home Assistant are used in the unlabeled graphs. These blueprints were created and discussed from December 2020 to January 2022.

We use DGL [221] to build the graph datasets. The stored labeled graph dataset file for IFTTT is 21.8 GB, the one for SmartThings is 0.018 GB, and the heterogeneous graph dataset amounts to 81.6 GB. The large dataset files are partly due to the storage of all vertex information with DGL. We apply DGL with PyTorch to build and train the GNN model. We run experiments on a TensorEX Ubuntu 20.04 deep learning stack with 256 GB DDR4 memory, Intel(R) Xeon(R) Gold 5218R 2.10 GHz CPUs, and RTX A6000 GPUs.
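For reference, storing an interaction graph in DGL takes only a few calls. The snippet below is a toy three-rule example with random stand-in features; the real node features are the 300-d spaCy (or 512-d Universal Sentence Encoder) embeddings described above.

import dgl
import torch

# Three rules (nodes); two predicted "action-trigger" correlations (edges).
src, dst = torch.tensor([0, 1]), torch.tensor([1, 2])
g = dgl.graph((src, dst), num_nodes=3)
g.ndata["feat"] = torch.randn(3, 300)  # stand-in for averaged word embeddings

# Persist a list of graphs with their threat labels (0 = normal, 1 = vulnerable).
dgl.save_graphs("interaction_graphs.bin", [g], {"label": torch.tensor([0])})
graphs, labels = dgl.load_graphs("interaction_graphs.bin")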
In the labeling phase, two students who study IoT security volunteer to label the interaction graphs. They first learn the security policies and IoT interactive threats identified in the literature [3, 28, 54, 55, 70, 203, 222]. For example, IoTSafe [55] specifies safety and security policies such as "All electrical appliances should be turned off when smoke is detected". Then, the two volunteers follow the six types of threats [222] as labeling criteria, with the specific rule settings listed in Table 2.4:

Table 2.4: Rule examples from different platforms.
Platform      Setting      Content
SmartThings   Setting 1    If outside temperature is above 70°F and time is 11 a.m., then open windows.
Alexa         Setting 2    If outside temperature is above 70°F, then open windows.
IFTTT         Setting 3    If motion is detected at the door and home is in armed state, then send a notification.
IFTTT         Setting 4    When light is on, disarm home state.
SmartThings   Setting 5    Turn on the light at 7 p.m.
Alexa         Setting 6    Turn on the air conditioner when temperature is above 100°F.
IFTTT         Setting 7    When humidity is below 30%, turn on humidifier and turn off air conditioner.
SmartThings   Setting 8    If smoke is detected, unlock the door.
Alexa         Setting 9    Lock the door at 10 p.m. every day.
IFTTT         Setting 10   Turn off the living-room light when bedroom light is on.
IFTTT         Setting 11   If the living-room light is turned off and the home state is away, then turn on the bedroom light.
Alexa         Setting 12   Turn on a heater.
SmartThings   Setting 13   Open windows if indoor temperature is above 80°F.

(1) Condition bypass. Users may configure settings of different granularity with the same aim on different platforms. As a result, some conditions in fine-grained settings may be bypassed due to coexisting coarse-grained settings. For example, a user may have SmartThings Setting 1 together with the more concise Setting 2 on Alexa. Such settings allow condition bypass and cause threats.

(2) Condition block. With different smart home settings, a condition may be blocked because some commands can cancel the execution condition of another command. For instance, with Settings 3, 4, and 5, when motion is detected at the door after 7 p.m., no notification is sent because the home state is disarmed.

(3) Action revert. The same device may be correlated with multiple environmental factors, and an action can be reverted if some environmental factors change. For example, with Settings 6 and 7, the AC is turned on and then turned off. Such interactions may cause an action revert.

(4) Action conflict. Some actions may conflict with each other across multiple platforms. For instance, with Settings 8 and 9, the actions conflict if smoke is detected after 10 p.m.

(5) Action loop. The interactions among devices and the environment can trigger each other, which introduces a loop in the interaction graph. For example, with Settings 10 and 11, when the home state is away, the actions loop forever.

(6) Goal conflict. Users can make misconfigurations such that the actions' goals conflict with each other. Suppose the user turns on the heater to keep a room warm while in the bathroom; however, given Settings 12 and 13, the window opens, resulting in a temperature drop.

We develop Python scripts to display the graph and rule contents, so the first volunteer can quickly check the graph edges and reach a conclusion. If a defined type of threat is found, the volunteer stops checking and labels the graph as vulnerable; otherwise, it is labeled as normal. In particular, we cross-check our 165 SmartThings inter-app interaction graphs with the existing inter-app interaction chain graphs of SmartThings [54] to ensure accuracy, from which we label 36 unsafe graphs out of the 165.

Figure 2.7: The ablation study of the ITGNN model (accuracy and F1 under varying numbers of scales, pooling ratios, numbers of propagation layers, and node transformation types).
Note that a single graph could contain multiple types of interactive vulnerabilities; fine-grained labeling of threat types is left as future work. The other volunteer double-checks the labeling results to avoid mislabeling. If the volunteers disagree on a label, they hold a further discussion based on the security policies identified in the literature. For large and complex interaction graphs, such as a graph with 50 nodes, it takes about 6 minutes to examine the possible interactive threats. For simple graphs with two nodes, we only need to check whether they violate the security policies, which takes about 20 seconds. On average, it takes about 1 minute to label each graph at the beginning; after the two volunteers become familiar with the labeling process, it takes about 40 seconds to label and double-check each graph. In total, it takes about 8 weeks to label the entire graph dataset.

2.4.3: Ablation Study of ITGNN

In this section, we conduct an ablation study of the ITGNN model. We evaluate the hyperparameters and components of ITGNN on the heterogeneous dataset, as shown in Figure 2.7. Specifically, we consider: (i) The number of scales in the multi-scale graph generator. ITGNN achieves the best performance when the number of scales is 3 in Figure 2.7. When the number of scales is small (e.g., 1, the original scale), ITGNN underestimates local information in graphs, while global information is underestimated when the scale number is large (e.g., 5). (ii) The pooling ratio of the graph pooling in VIPool. When the value is 1, the model keeps all vertices, rendering VIPool ineffective. When the value is 0.6, ITGNN selects vertices that better express their neighborhoods' information, which helps achieve the best performance. (iii) The number of propagation layers in the multi-scale graph generator. As shown in Figure 2.7, when the number of layers becomes large (e.g., 6), the model performs worse than with 2 layers: the GNN model suffers from over-smoothing and becomes less expressive as the layer number increases. (iv) We also remove modules from the metapath-based node transformation. As shown in Figure 2.7, when we remove both the intra-metapath and inter-metapath aggregation modules, the model only achieves an accuracy of 81.5%. For heterogeneous graphs, the model then ignores the various types of node information in different metapath instances, so it cannot effectively learn the graph structure information. In contrast, the complete design of ITGNN achieves the best accuracy of 95.1%.

Table 2.5: Accuracy for each type of vulnerability.
Condition bypass   Condition block   Action revert   Action conflict   Action loop   Goal conflict
94%                94%               95%             98%               93%           95%

We also report the accuracy for each type of vulnerability in Table 2.5. Since a graph can contain multiple types of threats, our labeled dataset does not label the specific threat types. Thus, for each vulnerability, we manually sample 50 graphs and mix them with 50 benign graphs to compose a test set. In total, there are 6 test sets, each corresponding to one vulnerability. From Table 2.5, we notice that action conflict is best recognized by Glint because it contains conspicuous antonym semantics.
Table 2.6: Results of homogeneous graph classification.
Dataset       Model        Accuracy   Precision   Recall   F1
IFTTT         GCN [108]    89.5       100         89.5     94.5
IFTTT         GXN [123]    78.7       79.0        76.4     76.3
IFTTT         GIN [231]    95         94.7        94       94.4
IFTTT         IFG [196]    69.8       75.5        70.2     67.4
IFTTT         SVC [166]    84.1       84.1        84       83.9
IFTTT         KNN [4]      89.5       90.9        89.5     89.6
IFTTT         ITGNN-C      95.4       95.3        94.9     95
IFTTT         ITGNN-S      95.7       95.9        95.7     95.8
SmartThings   GCN [108]    90.9       82.6        90.9     86.6
SmartThings   GXN [123]    88.2       89.9        88.2     87.2
SmartThings   GIN [231]    89.7       85.9        89.5     87.7
SmartThings   IFG [196]    86.1       89.3        87.5     85.9
SmartThings   SVC [166]    84.4       87.3        84.8     81.3
SmartThings   KNN [4]      84.8       83.8        84.8     83.2
SmartThings   ITGNN-C      76.5       69          70.6     69.5
SmartThings   ITGNN-S      88.2       89.9        88.2     87.2

2.4.4: Homogeneous Graph Evaluation

With the interaction graph datasets, we can train the GNN-based interactive threat detection model. We run experiments in 5 trials. For each trial, we split the dataset into a training set and a testing set at a ratio of 8:2. However, there are only 1,473 vulnerable interaction graphs out of the 6,000 interaction graphs, as shown in Table 2.3. The imbalanced classes will degrade the performance of the model. Thus, for the training set, we first increase the number of vulnerable graphs by random oversampling until the number of vulnerable graphs is doubled. Second, we assign unequal misclassification costs to the classes, with each class weight inversely proportional to the number of examples in that class. We use the weighted F1 score to measure the performance of the graph models: we calculate the F1 value for each label and then compute the weighted average by the number of occurrences of each class. Thus, the F1 value may fall outside the range of precision and recall.

We compare ITGNN with four state-of-the-art graph models and two classification models, including: (i) the graph convolutional network (GCN) [108], a convolutional network operating on graphs; (ii) the graph cross-network (GXN) [123], which extracts graph features by learning from multiple scales of a graph; (iii) GIN [231], a graph isomorphism network used to distinguish certain graph structures; (iv) InfoGraph (IFG) [196], which performs graph-level representation learning via mutual information maximization; (v) two classical ML methods, SVC and KNN, for which we compute the average of the node embeddings as the graph sample feature; and (vi) ITGNN-C, which uses the contrastive learning loss in Eq. (2.1), and ITGNN-S, which uses the classification loss in Eq. (2.2), to differentiate between vulnerable and normal interaction graphs. We train the eight models and fine-tune their hyperparameters on the interaction graph datasets.

Why not use traditional ML models? As shown in Table 2.6, on the IFTTT dataset, the two traditional classifiers, SVC and KNN, achieve 84.1% and 89.5% accuracy, respectively. Nevertheless, the graph models achieve better performance. For example, compared to KNN, GIN improves accuracy by 5.5% to 95%, and ITGNN-S improves accuracy by 6.2% to 95.7%. Traditional ML models cannot mine the graph structure information, while GNNs are designed to learn graph information and can better mine the interactive threat patterns in graph structures.

Why not use the contrastive loss for classification? Compared with models trained on the IFTTT dataset, models trained on the SmartThings dataset have worse performance. For instance, ITGNN-C on the SmartThings dataset only achieves 76.5% accuracy, because contrastive learning needs a large amount of data to learn to compare two samples. However, the size of the SmartThings dataset (165 graphs) is much smaller than that of the IFTTT dataset.
2.4.5: Heterogeneous Graph Evaluation

We evaluate the performance of our proposed ITGNN model on the heterogeneous graph dataset. We compare ITGNN with three state-of-the-art heterogeneous graph learning methods. (i) HGSL [249] performs heterogeneous graph structure learning for classification. MAGNN [71] is designed for heterogeneous node classification and link prediction; due to the different downstream tasks of the original models, we construct two heterogeneous graph models from it: (ii) MAGCN, an adapted GCN [108] model with a MAGNN graph converter, and (iii) MAGXN, an adapted GXN [123] model with a MAGNN graph converter.

Figure 2.8: The results of heterogeneous graph classification (accuracy, precision, recall, and F1 of ITGNN, HGSL, MAGCN, and MAGXN).

Why not use existing GNN models? With the evaluations on both homogeneous and heterogeneous graph datasets, we show the benefits of ITGNN as a unified framework for both homogeneous and heterogeneous graph representation learning. Besides, ITGNN is designed for both the drifting sample detection and interactive threat classification problems. On the IFTTT dataset, as shown in Table 2.6, ITGNN achieves the best accuracy, recall, and F1 score. Specifically, ITGNN-S achieves 95.7% recall and F1 score on the IFTTT dataset. Achieving better recall means that ITGNN-S avoids safety and security problems by better reducing false negatives of interactive threats. Even though GIN performs better on the SmartThings dataset, it achieves lower precision than ITGNN-S. Other complex GNN architectures entangle the whole interaction graph information, which is not suitable for our interactive threat detection task.

On the heterogeneous graph dataset, as shown in Figure 2.8, our proposed ITGNN model achieves the best average accuracy (95.5%), precision (95.9%), recall (95.6%), and F1 score (95.6%). HGSL achieves 92.9% accuracy, MAGCN achieves 90.2% accuracy, and MAGXN achieves 81.7% accuracy for threat detection. MAGXN has a more complex architecture with more parameters, resulting in a slow training process. By contrast, MAGCN achieves acceptable performance even though it has a simpler structure, which echoes the "No free lunch" theorem [227] for machine learning. Overall, the proposed ITGNN outperforms both the homogeneous and heterogeneous models. ITGNN is highly effective because it not only preserves heterogeneous node feature information but also leverages the multi-scale graph structure features to generate graph representations.

2.4.6: Transfer Learning Evaluation

From the previous results in Table 2.6, we find that the simple GCN architecture can be easily trained, while models such as GXN and ITGNN have relatively complex architectures, which can overfit on an insufficient dataset (e.g., SmartThings). This indicates that one model may perform best for some datasets, but not in all cases. To apply Glint to a more diverse set of heterogeneous smart home platforms, we need to enhance the adaptability of the GNN models. Transfer learning applies the learned knowledge from one domain to a different but related domain, improving modeling performance with less training data and time. As shown in Table 2.7, we evaluate transfer learning not only between homogeneous (IFTTT) and homogeneous (SmartThings) datasets but also between homogeneous (IFTTT) and heterogeneous datasets. We take the IFTTT and SmartThings platforms as examples of homogeneous graph learning because these two are prevalent platforms in IoT interaction studies.
Besides, the IFTTT dataset contains a large number of graphs, while the SmartThings dataset contains a much smaller number, which makes them representative datasets for transfer learning problems. Therefore, given the IFTTT dataset as the source domain and the SmartThings dataset as the target domain, we only fine-tune the fully connected layer for classification and freeze the other layers of the model trained on the IFTTT dataset. Then, we let the SmartThings dataset be the source domain and the IFTTT dataset be the target domain; in this case, we freeze the earlier two layers of the model trained on the SmartThings dataset, since the earlier layers capture the basic features of interaction graphs. Finally, we train the rest of the layers using the IFTTT dataset.

Table 2.7: Detailed performance of transfer learning.
Model   Target Domain   Source Domain   No trans.   Trans.   Improved
GIN     SmartThings     IFTTT           89.7%       92.3%    2.6%
GIN     IFTTT           SmartThings     95.0%       95.2%    0.2%
GCN     SmartThings     IFTTT           90.9%       94.1%    3.2%
GCN     IFTTT           SmartThings     89.5%       93.9%    4.4%
ITGNN   SmartThings     IFTTT           88.2%       100%     11.8%
ITGNN   IFTTT           SmartThings     95.7%       96.4%    0.7%
ITGNN   IFTTT           Heterogeneous   95.7%       96.1%    0.4%
ITGNN   Heterogeneous   IFTTT           95.1%       95.5%    0.4%

Can transfer learning improve model performance? The performance of all models improves with transfer learning, as shown in Table 2.7. For example, the accuracy of the GIN model on the SmartThings dataset improves by 2.6%. One interesting phenomenon is that if a model performs well in the source domain, it also works well in the target domain. For instance, the ITGNN model achieves 88.2% accuracy on the target domain (SmartThings); with knowledge from the source domain (IFTTT), it achieves 100% accuracy. Moreover, even though the source domain (SmartThings) knowledge is limited, it can still provide helpful knowledge across platforms. For example, the accuracy on the IFTTT platform grows from 95.7% to 96.4%. We also check whether the transfer learning techniques could bring negative effects. As shown in Table 2.7, the applied transfer learning techniques consistently enhance the models by achieving better accuracy in the target domain. Therefore, models trained on datasets from multiple smart home platforms do not incur negative transfer problems.

2.4.7: Drifting Interaction Pattern Evaluation

We implement Algorithm 3, and we use principal component analysis (PCA) to project the graph embeddings from the original 256-dimensional space into a 2-dimensional space. We use the K-means method to cluster the projected embeddings as shown in Figure 2.9, where the centroid of each class is marked by a white cross. We test the samples in the unlabeled dataset to detect potential drifting samples (the samples in the red circle in Figure 2.9). The drifting samples are examined manually to summarize the potential interaction patterns and can be used for retraining. Among the 10,000 unlabeled IFTTT graphs, we found 63 potential drifting samples. Among the 19,440 unlabeled heterogeneous graphs, we found 104 potential drifting samples. For example, the opening and closing of a lawn sprinkler valve is a rare case that exists only in the unlabeled dataset, so it is regarded as a potential drifting sample. The ratios of vulnerable interaction graphs on the randomly generated unlabeled IFTTT dataset and the heterogeneous dataset are 8.3% and 16.7%, respectively.
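The projection-and-clustering step behind Figure 2.9 can be reproduced with Scikit-learn in a few lines; here, embeddings stands in for the 256-d ITGNN-C graph representations.

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

emb_2d = PCA(n_components=2).fit_transform(embeddings)  # 256-d -> 2-d projection
km = KMeans(n_clusters=2, n_init=10).fit(emb_2d)        # "normal" vs. "threat" clusters
centroids = km.cluster_centers_                         # plotted as the white crosses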
Figure 2.9: K-means clustering on graph embeddings learned with contrastive loss.

To recognize whether there are new threat cases, we manually examine all drifting samples and randomly check 50% of the predicted vulnerable interaction graphs. Primarily, in the drifting samples, we discover four new types of interactive threats related to user-designed blueprints (rules) [10] across multiple platforms. The reported new interactive threat types are more complex and easily overlooked, while the previously discovered threat types from existing work are all rooted in a single platform.

• Action block: Automation is blocked by non-automation settings. Users create automation blocker rules that prevent some automation from running. For example, a user can set rule 1, "If the light is set in manual mode, then keep the light brightness to 100%". Thus, rule 2, "dimming lights when turning on the TV", becomes ineffective. Rules 1 and 2 are encoded as two nodes in an interaction graph, and they target the same device, "light", so the interaction is labeled as vulnerable because of the blocked action.

• Action ablation: An automation action can be reverted over time. A device can have multiple attributes that are triggered by different factors. For example, a blueprint turns on the air conditioner (AC) when the temperature is above 95°F. Another blueprint turns on the humidifier and turns off the AC when humidity is below 30%. The status of the AC conflicts between the two rules and will be reverted when the humidity becomes low.

• Trigger intake: Automation is caused by unexpected triggers. A user applies a rule to send a camera snapshot notification when motion is detected at the door. Another rule starts the vacuum cleaner at 9 p.m., which could accidentally trigger the motion sensor. As a result, the user may frequently receive false snapshot notifications. Without a deep analysis, the user can hardly notice that this is caused by the vacuum cleaner triggering the motion sensor, especially in a bad lighting condition.

• Condition duplicate: A fake automation condition is generated by rules from other platforms. For instance, one rule reports the room as occupied when any of the following conditions is met: the motion sensor detects motion, the door is shut, or media is playing on devices in the room. A rule from IFTTT can play music in the room from 3 p.m. to 4 p.m. Due to the trigger from the played music, the room is reported as occupied. Another rule that starts the heating when the room is occupied and the temperature is below 60°F is subsequently triggered. Because of the applet from IFTTT, the rule condition is accidentally satisfied on the Blueprint platform.

Note that action block involves user intervention, and action ablation happens over a long time. They are neither the action revert nor the action conflict defined in iRuler [222]. Besides, existing work such as HAWatcher [70] cannot handle these user interventions and long-term correlations. Similarly, trigger intake and condition duplicate are neither pre-defined in iRuler nor mined by HAWatcher. As a result, the existing work cannot detect these new types of interactive threats.
2.4.8: Evaluation on a Real-life Testbed

We construct real-time interaction graphs with event logs collected from a real-world home testbed, as shown in Figure 2.10. A volunteer deploys the SmartThings and Alexa systems in his house following the setup in HAWatcher [70] with off-the-shelf devices and apps. The volunteer uses the desired automation and spends a week collecting 1,813 event logs related to home automation.

Figure 2.10: The layout of smart home devices.
Abbr.    Device Name          Attributes
L        Light Bulb           switch
M        Motion Sensor        motion
C        Contact Sensor       contact
T        Temperature Sensor   temperature
PS       Presence Sensor      presence
Camera   Camera               snapshot
Button   Smart Button         button

Efficacy Comparison

Following the setup in HAWatcher [70], we simulate five types of attacks by modifying event logs or manually interfering with the home automation: (i) targeted compromise: fake commands, stealthy commands; (ii) interaction abuse: fake events, event losses; (iii) misconfiguration: command failure. For example, we simulate "fake commands" by manually turning off lights during normal operation, and "stealthy commands" by manually starting a robot vacuum to trigger the motion sensors. We build 600 graphs as the test set. Out of the first 300 graphs, 150 contain binary-correlation threats (BCT). Among the other 300 graphs, 150 contain complex-correlation threats (CCT). BCT means interactive threats caused by two nodes, while CCT means threats caused by more than two nodes.

We choose three white-box anomaly detection methods for comparison: (i) HAWatcher [70], which extracts binary correlations and verifies them with run-time event logs. The input is the device states extracted from event logs, and the inconsistencies are reported as anomalies. We reuse the refined correlations reported in the work [70] to check the actual states of devices. (ii) One-class Support Vector Machine (OCSVM) [180], an unsupervised anomaly detection method. (iii) IsolationForest [131], which is also widely used for anomaly detection. We capture all devices' states as a frame when a new event happens. Four consecutive frames compose one data vector, which is the input to OCSVM and IsolationForest. The output is -1 for threats and 1 for normal cases. We use Scikit-learn [164] to implement OCSVM and IsolationForest.
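The two baselines follow the standard Scikit-learn anomaly detection interface; the sketch below shows the frame-windowing and the -1/1 output convention described above. The arrays normal_frames and test_frames are hypothetical encodings of device states per event, not our actual testbed data.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

def windows(frames, k=4):
    """Stack k consecutive device-state frames into one data vector."""
    return np.stack([frames[i:i + k].ravel() for i in range(len(frames) - k + 1)])

X_train = windows(normal_frames)  # frames: (T, n_devices) encoded device states
X_test = windows(test_frames)
for detector in (OneClassSVM(nu=0.1), IsolationForest(random_state=0)):
    detector.fit(X_train)
    pred = detector.predict(X_test)  # -1 = threat, 1 = normal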
Note that for many other systems, it is unfair to directly compare the performance of existing methods and ours, considering the big gap between the analysis of open-source and closed-source platforms.

Figure 2.11: Comparison with different detection methods (precision and recall on binary- and complex-correlation threats for ITGNN, HAWatcher, OCSVM, and IsolationForest).

As shown in Figure 2.11, OCSVM and IsolationForest achieve worse performance than Glint. For instance, for complex graphs, OCSVM only achieves 66.9% precision and 63.3% recall. This is because they only use time-series event log data, without considering the rich information in neighboring nodes and the graph structure. HAWatcher achieves 97.8% precision and 94.1% recall for binary-correlation threat detection, but requires 21 days of training [70]. By contrast, Glint takes no more than 1 hour to train the model and apply transfer learning to improve model performance, and it achieves 100% precision and recall for binary-correlation threat detection. For complex-correlation graphs, HAWatcher cannot check interactive threats caused by goal conflict, action revert, and condition bypass, because these three types of threats are not covered. For such cases, we set HAWatcher to generate results following a Bernoulli distribution with a single-trial success probability of 0.5. Finally, HAWatcher achieves 83.2% precision and 82.7% recall. By contrast, Glint achieves 96% precision and 95.3% recall, outperforming HAWatcher in heterogeneous platform scenarios by 12.8% in precision and 12.6% in recall.

Efficiency Comparison

Because efficiency is affected by many factors, such as code implementation, third-party libraries, and hardware setup, we qualitatively compare the efficiency of different systems. HAWatcher [70] needs to traverse all defined correlations. Suppose the number of correlations is n; then the time complexity is n^2 for its extracted binary correlations, and it grows to n^N when the length of the interaction chain is N. Thus, such search-based methods are hindered by path explosion when dealing with large and complex graphs. iRuler [222] applies a satisfiability modulo theories (SMT) solver to check interactive vulnerabilities. However, the existence of heterogeneous platforms implies increasingly complex rules for smart home management. Given that the SMT problem is typically NP-hard, such symbolic execution has poor scalability when dealing with large and complex interaction graphs. It is thus unlikely that a symbolic execution engine can exhaustively explore all the possible states within a reasonable amount of time. By contrast, our proposed Glint is a learning-based system, which achieves high efficiency for real-time interactive threat prediction. For each graph, the prediction time is related to the graph size, and the average prediction for a heterogeneous graph takes about 0.61 s. The ITGNN model trained on heterogeneous graphs is only 6.13 MB.

2.4.9: Discussion

While the above evaluation results are encouraging, we consider this work a first step toward data-centric AI-based IoT interactive threat analysis. Many factors need to be considered in future work.

Time factor

The influence of an IoT device can occur over the long term. For example, turning the heater on or off changes the temperature over a relatively long period. The existing system, HAWatcher [70], only considers interactions that arise within a short time frame, and thus cannot mine long-term correlations. By contrast, Glint discovers interactions based on automation rule semantics and real-time event logs. Glint encodes the time-related semantics into node embeddings. By annotating such long-term correlations in graph datasets, the GNN models can effectively learn patterns that manifest over a long time interval. For example, Settings 12 and 13 in Table 2.4 compose a vulnerable interaction graph, which can be discovered by Glint but not by HAWatcher.

User factor

The automatic interactions can be disturbed by user-activity deviations. An example is the drifting interaction pattern "Action block" in Section 2.4.7. Existing systems such as iRuler [222] use pre-defined vulnerability patterns, which cannot report such interactive threats. In comparison, Glint can detect abnormal patterns with concept drift detection. According to users' feedback, Glint can fine-tune the model and adapt to user preferences. How regularly the interaction graphs should be updated to accommodate changes in user actions and IoT devices poses an important future research problem.

NLP techniques

In this dissertation, we rely on NLP techniques to identify the correlations between different automation configuration rules, so as to construct an automation rule interaction graph.
However, the accuracy of interaction graph construction will affect the threat detection results, because an inaccurately built graph is inaccurate input to the threat detection model. Applying NLP techniques may overestimate or underestimate physical channel properties, especially when different locations are involved. The physical properties of two devices at different locations can vary. For instance, the temperature of the oven in the kitchen can hardly influence the temperature in the living room. As a result, the lack of a complete understanding of the physical channel properties can lead to wrong predictions about the correlation between two rules. For example, "Set the temperature of the oven to 350°F for preheating at 7:00 pm" should have no correlation with "Open the window if the indoor temperature is above 80°F". A potential solution is to request feedback from users on the correctness of the built graphs, so that we can fine-tune the graph builder model. Besides, more advanced NLP techniques such as large language models [157, 210, 255] can be applied to enhance the accuracy of IoT rule correlation recognition.

2.5: Summary

To detect interactive threats across heterogeneous closed-source platforms, we propose Glint to learn interaction patterns. We design the unified ITGNN model for multi-scale graph representation learning. To train ITGNN, we build the first interaction graph dataset on five high-profile platforms. Our evaluation shows that the proposed model can achieve high accuracy in detecting interactive threats across heterogeneous platforms. We identify four new types of interactive threats in the user-defined Home Assistant platform. However, two issues remain. First, the training data for threat detection incorporates event log details, encapsulating sensitive information about daily activities; consequently, the training data should not be shared with third parties, considering the privacy concerns. Second, there is a need to delve into the explainability of GNN models, which is crucial for unraveling the underlying root causes behind interactive threats.

CHAPTER 3: FEDERATED THREAT DETECTION MODEL LEARNING²

²This chapter is based on previously published work by Guangjing Wang, Hanqing Guo, Anran Li, Xiaorui Liu, and Qiben Yan titled "Federated IoT Interaction Vulnerability Analysis", published in the Proceedings of the IEEE 39th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE55515.2023.00120 [214]. © 2023 IEEE. Reprinted, with permission, from Guangjing Wang, Hanqing Guo, Anran Li, Xiaorui Liu, and Qiben Yan, "Federated IoT Interaction Vulnerability Analysis" (paper and IEEE titles are the same), July 2023.

3.1: Introduction

As introduced in Chapter 2, a growing number of smart devices are deployed in modern homes to achieve home automation. The devices are controlled by rules that follow the trigger-action paradigm. For example, a SmartThings [190] app has the automation rule (R1) "If smoke is detected (trigger), turn on the water valve and start alarm beeping (action)". According to a recent survey [38], 82.4% of smart homes have multiple rules for controlling a single device. These rules allow devices to interact with each other, referred to as IoT interaction. However, unexpected IoT interactions could lead to threats such as action conflict and action revert, which result in severe security and privacy risks. For instance, suppose a user has set up the above SmartThings rule R1. Unexpectedly, as soon as the water valve was turned on and a water leak was detected in the kitchen, the SmartThings rule R2 "Close the water valve when a water leak is detected" would close the water valve. In consequence, the two rules compose a vulnerable interaction, which exposes the threat "the water valve fails to turn on when smoke is detected" because of the action conflict "water valve opening and closing". Such vulnerable interactions can be caused by user errors [93, 153, 205] and physical attacks [82, 224, 234, 245].
It is crucial to manage the IoT automation rules and track the trigger-action information flow to avoid interactive threats. Naturally, the IoT automation rules follow the general trigger-action paradigm, and the interactions among rules can compose an interaction graph. Automation rules can be represented by nodes, and the edges are "trigger-action" connections among different rules.

²This chapter is based on previously published work by Guangjing Wang, Hanqing Guo, Anran Li, Xiaorui Liu, and Qiben Yan titled "Federated IoT Interaction Vulnerability Analysis", published in the Proceedings of the IEEE 39th International Conference on Data Engineering (ICDE). DOI: 10.1109/ICDE55515.2023.00120 [214]. © 2023 IEEE. Reprinted, with permission, from Guangjing Wang, Hanqing Guo, Anran Li, Xiaorui Liu, and Qiben Yan, "Federated IoT Interaction Vulnerability Analysis" (paper and IEEE titles are the same), July 2023.

A naive idea is to detect the sensitive event based on predefined security policies, and then trace it back to find root causes. However, query- or search-based methods have limited coverage of security policies or threat patterns. Most methods [26, 28, 55, 70, 223] pre-define security policies and threat patterns within a single platform [38]. For example, ProvThings [223] requires users to define and input policies describing sequences of causal interactions with a graph database backend. Yet, the predefined policies can hardly cover potential new threats across heterogeneous platforms. As a result, they may yield significant false-positive and false-negative errors. Moreover, federated query-based methods [15, 85, 107, 202] are not applicable. Even though common security policies could be extracted and shared among multiple houses, device interactions in different houses do not intersect: each interaction graph is a complete data sample. Therefore, the query outcome for threat detection in a single house cannot necessarily be computed over the union of the source databases of multiple houses. Another approach profiles the behaviors of systems by analyzing event logs [59, 127, 257]. However, these methods cannot fully expose the interaction correlations between different events. Their limitations are three-fold. First, considering the complex causal dependencies among multiple automation rules, it is hard to accurately mine the cross-app interaction logic from event log sequences, which are confined within individual devices [70]. Second, sharing smart home usage data with a third party might cause privacy leakage issues such as the leakage of living habits and routines, while a dataset from a single house is insufficient for training a model with high generalization ability. Third, the mining results are difficult to interpret: when a threat alarm is generated, the users can hardly learn the exact segment of device interactions that caused the issue. To address these problems and limitations, we propose FexIoT, a Federated and explicable GNN-based approach for automatic IoT interactive threat analysis. For an interaction graph, the node features can be represented by semantic-aware word or sentence embeddings. Compared with benign graphs, vulnerable interaction graphs could have different graph patterns that express abnormal behaviors. FexIoT utilizes the GNN model to learn the interaction patterns. Meanwhile, as the event logs belong to different platforms and households, by leveraging federated learning, we can collaboratively develop a more generalized model while retaining users' data locally.
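Throughout this chapter, such an interaction graph is the basic data unit. A minimal sketch of the representation follows, with an illustrative matching predicate rather than the actual graph builder:

```python
# Sketch: rules are nodes; an edge is added when one rule's action can
# enable another rule's trigger. The predicate is a placeholder for the
# semantic matching that the real builder performs.
import networkx as nx

rules = {
    "R1": {"trigger": "smoke detected", "action": "open water valve"},
    "R2": {"trigger": "water leak detected", "action": "close water valve"},
}

def action_enables_trigger(action: str, trigger: str) -> bool:
    # Placeholder matching logic for this toy example only.
    return "water valve" in action and "water leak" in trigger

g = nx.DiGraph()
g.add_nodes_from(rules)
for a, ra in rules.items():
    for b, rb in rules.items():
        if a != b and action_enables_trigger(ra["action"], rb["trigger"]):
            g.add_edge(a, b)  # a "trigger-action" connection

print(list(g.edges()))  # [('R1', 'R2')]
```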
Moreover, the interaction graph allows us to search for and extract a special subgraph that contains execution flow paths related to abnormal behaviors. Based on the extracted flow path, we can explain the model's predictions and identify the causes of vulnerable interactions. There are two key technical challenges for IoT interaction analysis across multiple closed-source platforms. (i) How to build the threat detection model with high heterogeneity in datasets while avoiding sharing raw data? One household has a limited number of devices, which makes it infeasible to train a custom and robust GNN model. FL allows for collaboratively training a model without exposing raw data. However, data heterogeneity leads to poor modeling performance with slow convergence and low accuracy in FL training paradigms [9, 45, 199], especially when the data distribution is time-varying or not well balanced among different clients. To overcome this challenge, we design a layer-wise clustering-based federated graph contrastive learning framework. We train a shared threat detection model with high generalization ability by differentiating normal and vulnerable graphs over non-i.i.d. datasets, without sharing users' data. (ii) How to adapt to new threat patterns and automatically identify potential causes of vulnerable interactions? The interactions in the IoT world are complex due to the different types of devices and threats. Even though we apply FL to train on various graph data from different home sources, we acknowledge that there could be drifting samples that evolve from existing threats or constitute novel kinds of threats. We propose to filter out drifting samples based on federated contrastive graph representations and the corresponding median absolute deviation. Then, we design an efficient cause analysis method by exploring different subgraphs of an interaction graph with Monte Carlo beam search (MCBS) [25], a heuristic game tree search algorithm. Based on IoT devices' potential correlation features, we propose using the optimal SHAP values [136] to measure the risk of subgraphs of a complete interaction graph. Thus, we can trace the most probable information flow chain relevant to the vulnerable interactions. In summary, we make the following contributions:

• We design a layer-wise clustering-based federated contrastive GNN model for non-i.i.d. graph datasets to learn contrastive graph representations, and it achieves more than 90% average accuracy in threat detection.

• We propose an efficient cause analysis method based on Monte Carlo beam search and the SHAP value to evaluate the risk of subgraphs, which automatically provides explanations for complex vulnerable interactions.

• We implement various experiments on large-scale datasets, which show that our method outperforms state-of-the-art baselines with high accuracy and reduced communication costs.

3.2: Problem Definition
Interactive threats refer to vulnerabilities arising from interactions among devices and the environment. Specifically, we label a graph as vulnerable if it satisfies at least one of the six types of vulnerabilities identified by existing work [222]: condition bypass, condition block, action revert, action loop, action conflict, and action duplicate. Internal graph vulnerability refers to vulnerabilities inherent in interaction graphs, while external graph vulnerability refers to vulnerabilities caused by external attacks.
Instead of enumerating all possible interaction vulnerabilities, we design a federated GNN model to learn the vulnerability patterns. Given the interaction graph $G$, our task is to learn the graph embedding $Z_t$ at each time $t$, and $Z_t$ is used for vulnerable interaction prediction. Considering that a graph can contain multiple types of vulnerabilities, multi-class classification is not suitable. Thus, the prediction problem is formulated as a binary classification problem $f(Z_t) \to y_t$ that maps the interaction graph embedding $Z_t$ to the binary label $y_t$ of the interaction incident at a given time $t$. Finally, we identify the subgraph $G_{sub}$ with the highest risk score by exploring various subgraphs with the SHAP-based Monte Carlo beam search, which explains the result of prediction $y_t$.

Figure 3.1: System architecture of FexIoT.

3.3: System Design
We design FexIoT for IoT interaction analysis as shown in Figure 3.1. A client represents a house where data is collected and a GNN model is trained using federated learning. Each client can run FexIoT on devices such as a Raspberry Pi or NVIDIA Jetson Nano. Each client implements three main components: data fusion from event logs and apps' descriptions, vulnerable interaction detection with the federated GNN, and vulnerability explanation with the Monte Carlo beam search-based method. A server performs the clustering and aggregation in FexIoT, and it could be hosted by a security solution provider.

3.3.1: Fine-grained Clustering-based Federated Graph Learning
After the graph construction, we propose a fine-grained clustering-based federated contrastive graph representation learning model, considering data heterogeneity and drifting samples in FL. The learned graph representations can be used for vulnerability detection. In our designed FL paradigm, each client keeps two models. The first is the graph representation learning model, which participates in the FL process in Algorithm 4. The second is a linear classification model such as an SGDClassifier, which locally learns to classify the learned graph representations as vulnerable or normal graphs.

Contrastive Graph Learning Loss Function In the graph representation learning model, we design the contrastive learning [35] loss function in Eq. (3.1) for learning a distance function to measure the dissimilarity of samples:

$L_c = d_{ij}^2 (1 - y_{ij}) + \max(0, k - d_{ij}^2)\, y_{ij}$, (3.1)

where $d_{ij}$ is the Euclidean distance between two graph embeddings, $y_{ij}$ is 1 if graphs $G_i$ and $G_j$ are from different classes, and $y_{ij}$ is 0 if graphs $G_i$ and $G_j$ are from the same class. We set a threshold $k$ to restrict the unusual distance contribution of graphs from different classes. Note that we apply existing well-developed GNN models [71, 108, 231] in our federated learning framework. Finally, the learned representations are used for training a linear classification model to classify normal or vulnerable graphs in each client.

Dynamic Clustering-based Federated Learning We design a fine-grained dynamic clustering-based FL algorithm as shown in Algorithm 4. There could be domain shifts when knowledge transfers across non-i.i.d. datasets in the FL framework [43]. The heterogeneity is two-fold.
First, the interaction graph can be either homogeneous or heterogeneous. Different households deploy heterogeneous devices and different users have their own usage habits, which can cause graph heterogeneity. The heterogeneous graph data will cause negative transfer among users and decrease the model performance. Second, the graph dataset is unbalanced and non-i.i.d. The non-i.i.d. graph datasets from different households will cause biased stochastic gradients, which will impede the convergence of FL models. Another key observation is that, despite the heterogeneity, similar data distributions exist in smart homes because different users can have common automation rules. We assume that there exist several clusters of households, where the graph datasets within each cluster satisfy the i.i.d. property, following [179]. We argue that the assumption is reasonable: even though there are heterogeneous smart devices, from the semantic perspective the functionalities of devices are indeed limited. Some graph datasets will share common feature information and can be grouped into several clusters. As a result, a new challenge ensues on how to effectively aggregate clients into different clusters.

Algorithm 4: Dynamic clustering-based federated GNN
Input: Graph dataset G_c of each client c, GNN model weight W_c, GNN layer l < L, client cluster C, global update round T, thresholds ϵ1, ϵ2
1   for t < T do
2       for c ∈ C do
3           W_c ← local training process
4           W_c = RecursiveClusteringAgg(1, C)
5       end
6   end
7   Procedure RecursiveClusteringAgg(l, C):
8       if l > L then
9           return;
10      end
11      Receive the l-th layer's weights W^l_{c_i} from each client c_i;
12      if ϵ1 > ||Σ_{i∈[n]} (|G_{c_i}| / |G|) ΔW^l_{c_i}|| && ϵ2 < max(||ΔW^l_{c_i}||) then
13          M_{i,j} ← CosineSimilarity(W^l_{c_i}, W^l_{c_j}) for i, j ∈ C
14          cluster1, cluster2 ← BinaryClustering(M_{i,j})
15          W^l_{cluster1} ← FedAvg(W^l_c) for each c ∈ cluster1
16          W^l_{cluster2} ← FedAvg(W^l_c) for each c ∈ cluster2
17      end
18      else
19          W^l_C ← FedAvg(W^l_c) for each c ∈ C
20      end
21      Send W^l_{cluster} back to each client c
22      RecursiveClusteringAgg(l + 1, cluster1)
23      RecursiveClusteringAgg(l + 1, cluster2)
24  End Procedure

Consider the clustering-based FL setting with a central server and a set of $n$ clients $\{c_1, c_2, \cdots, c_n\}$, which can be dynamically grouped into different clusters $\{cluster_1, cluster_2, \cdots\}$. Each client $c_i$ has a set of interaction graphs $G = \{G_1, G_2, \cdots\}$ and conducts the graph classification $y = h^*_k(G_i)$, where $h^*_k$ is the optimal graph classification model for cluster $k$. The graph feature information can be reflected by the model parameters and their gradients [179]. Existing federated graph classification over non-i.i.d. graphs [229] is coarse-grained, since it only considers the similarity of the parameters of a whole model. However, from the bottom up, the degree of similarity among deep models decreases [135, 151, 239]. Therefore, to learn the fine-grained clustering structure, we design the bottom-up layer-wise dynamic clustering algorithm to obtain the similarity among the weights of clients, as shown in Algorithm 4. Suppose $n$ is the number of clients in FL training; each client $c$ performs local GNN training (lines 2-4), which follows the traditional FL training paradigm. The server receives $n$ local models $W^l_c$ (line 11). The dynamic clustering on the server starts from the bottom layer $l_1$.
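Before detailing the thresholds, the per-layer clustering step itself (Algorithm 4, lines 13-14) can be sketched as follows; the clustering backend is an illustrative choice, assuming a recent scikit-learn:

```python
# Sketch of one layer's client clustering: split clients into two
# sub-clusters by the cosine similarity of that layer's weights.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

def binary_cluster_layer(layer_weights: list) -> np.ndarray:
    """layer_weights[i] holds client i's flattened weights for layer l."""
    W = np.stack([w.ravel() for w in layer_weights])
    sim = cosine_similarity(W)        # M_{i,j} in Algorithm 4
    dist = 1.0 - sim                  # convert similarity to a distance
    labels = AgglomerativeClustering(
        n_clusters=2, metric="precomputed", linkage="average"
    ).fit_predict(dist)
    return labels                     # 0/1 sub-cluster id per client

# FedAvg is then run separately inside each of the two sub-clusters.
```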
We define two thresholds ϵ1 and ϵ2 to determine the conditions for clustering different clients (line 12):

$\epsilon_1 > \left\| \sum_{i \in [n]} \frac{|G_{c_i}|}{|G|} \Delta W_{c_i} \right\|, \quad \epsilon_2 < \max_i \left( \left\| \Delta W_{c_i} \right\| \right)$, (3.2)

where $|G_{c_i}|$ represents the number of graphs owned by client $c_i$, $|G|$ is the total number of graphs owned by all clients, and $\Delta W_{c_i}$ is the local update of the model weights of client $c_i$. ϵ1 measures the degree of fluctuation of FL training, and it bounds the relatively stationary point of the global model before initiating the clustering. Meanwhile, if some weight-update norms are greater than ϵ2, high heterogeneity has occurred among different clients, and the clustering starts in order to avoid performance degradation among clients. The two thresholds can be determined via initial experiments on the validation sets [179]. If the conditions in Eq. (3.2) are satisfied, the server further divides the clients from the same cluster into two sub-clusters and performs model aggregation within each sub-cluster (lines 14-17). This process continues to the next layer recursively. Thus, with more layers of client models being clustered and aggregated, each cluster of clients has reduced divergence, which helps the model converge in fewer training rounds.

Drifting Interaction Pattern Detection
Due to dynamic changes in smart device settings and the presence of different attacks, the testing interaction graph distribution may diverge from that of the training dataset in actual deployment scenarios. The detection performance may be impacted by drifting samples, which are new interaction vulnerability patterns that differ from the already known vulnerabilities. Considering the high false-positive or false-negative rate caused by drifting samples, we design an interaction graph drifting pattern analysis method based on federated graph representation learning and the median absolute deviation (MAD) [114] in the model application stage. First, in each client, we use the well-trained model from FL to map all the training graph data into latent space. We can calculate the centroid of each class by computing the mean value of each dimension of the latent embedding. Similarly, for the incoming graph data, we also generate latent embeddings. In this way, we can compute the Euclidean distance $d_k^i$ between the centroid of each class $i$ in the training dataset and the test sample $x_k$. Second, based on MAD, we estimate the graph data distribution within each class $i$ by computing $\text{MAD}_i$ in the training dataset, which is the median of the absolute deviations from the median $\tilde{d}_i$ of the distances $d_i^j$, where $d_i^j$ is the distance between the latent embedding of each training sample and its centroid in class $i$. Third, for each testing sample, we check whether $d_k^i$ is large enough to make $x_k$ an outlier of class $i$. Specifically, we denote the "normal" label as 0 and the "vulnerable" label as 1. Then $A_k^0 = |d_k^0 - \tilde{d}_0| / \text{MAD}_0$, $A_k^1 = |d_k^1 - \tilde{d}_1| / \text{MAD}_1$, and $A_k = \min(A_k^0, A_k^1)$. If $A_k$ is greater than a threshold $T_M$, which is set to 3 empirically following existing practices [114], then $x_k$ is a potential drifting sample. The MAD method allows each class to decide outliers based on its in-class distribution. In the model application stage, given a set of testing interaction graphs $\{G\}$, we first check each graph $G_i$ to see whether it is a drifting sample. If a new sample has a large distance from all existing classes, then it is a potential drifting sample. Thus, we can filter out and manually inspect drifting samples that are outside of the training space.
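A minimal sketch of this MAD-based test, using the empirical threshold $T_M = 3$ from the text (function names are illustrative):

```python
# Hedged sketch of the drifting-sample test described above.
import numpy as np

def mad_outlier_score(train_dists: np.ndarray, test_dist: float) -> float:
    """train_dists: distances of training embeddings to their class centroid."""
    median = np.median(train_dists)
    mad = np.median(np.abs(train_dists - median))
    return abs(test_dist - median) / (mad + 1e-12)

def is_drifting(test_dist_per_class: dict,
                train_dists_per_class: dict,
                t_m: float = 3.0) -> bool:
    # A sample drifts only if it is an outlier for EVERY known class,
    # matching A_k = min(A_k^0, A_k^1) > T_M.
    scores = [mad_outlier_score(train_dists_per_class[c], d)
              for c, d in test_dist_per_class.items()]
    return min(scores) > t_m
```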
3.3.2: Vulnerable Interaction Cause Analysis
Given a detected vulnerable graph, we design the SHAP-based Monte Carlo beam search (MCBS) algorithm to discover the causes of vulnerable interactions, considering the IoT dependency relationships. We first formalize the vulnerability explanation problem. $G$ represents the interaction graph that is examined by a GNN model followed by a linear classifier $h(\cdot)$ in the model application stage, and the prediction result is $y$. The vulnerable interactions are caused by a connected subgraph $G_{sub}$ in an interaction graph. We further define a cooperative game following [111] to measure the contributions of different parts (players) of a graph. Suppose $G_{sub}$ is one player, where $G_{sub}$ contains nodes $\{v_1, v_2, \cdots, v_s\}$. The other nodes in $G \setminus G_{sub}$ are the other players $\{\{v_{s+1}\}, \cdots, \{v_m\}\}$. Our goal is to find the most probable subgraph $G_{sub}$ that is responsible for the prediction $y$. However, a device's abnormal behaviors are hardly noticeable until certain consequences occur. Moreover, an app can trigger unexpected events that are subscribed to by another device, and different apps could introduce contradictory changes to a device attribute. Directly using the prediction scores to measure the risk of subgraphs is problematic, since they cannot capture the connections among different graph structures. Suppose a subgraph $G_{sub}$ contains the exact interaction vulnerability chain; then any subgraph $G_{con}$ containing $G_{sub}$ will also cause the vulnerable interaction. Meanwhile, a subgraph $G_{inc}$ that is included in $G_{sub}$ may also trigger the same threat, but $G_{inc}$ can only reveal part of the causes instead of the complete causal route. Another method, SubgraphX [241], applies the Shapley value to measure the risk of subgraphs under the assumption that the players are independent. However, in an interaction graph, the existence and actions of nodes are not independent, because the deployments of devices are related to each other. Therefore, the Shapley value could neglect the dependency relations among different nodes. We propose to calculate the SHapley Additive exPlanations (SHAP) [136] value to evaluate the risk score of subgraphs, and we apply MCBS to improve the efficiency of the search process, as shown in Algorithm 5. In the MCBS tree, following the notation in SubgraphX [241], the root node $N_0$ is the input graph $G$; each node $N_i$ in the search tree is a connected subgraph $G_{sub}$; and each edge in the search tree is a pruning action $a$. A subgraph $G_j$ is acquired from $G_i$ by the action $a_j$, which is represented by $(N_i, a_j)$. We combine beam search with Monte Carlo tree search to store a set of best playouts, which is called a beam (lines 2-4 in Algorithm 5). The size of a beam $B_{level}$ is fixed for each level, meaning that the $B_{level}$ best nodes are kept at each level. From the model perspective, the target node feature is aggregated from a limited number of neighbor nodes. From the smart home perspective, devices are directly related to a limited number of neighboring devices. Therefore, it is reasonable that only neighboring nodes in a beam $B_{level}$ are employed for information aggregation.
Algorithm 5: SHAP-based Monte Carlo Beam Search
Input: GNN model h(·), interaction graph G, MCBS iteration number I, kernel SHAP samples K, the least node number N_min, the root of the search tree N_0
Output: subgraph G_sub
1   for i < I do
2       S_i = N_0, curPath = {N_0}
3       while |h(S_i)| > N_min do
4           for subgraph G_i in B_level(S_i) do
5               for k < K do
6                   g(z_r) = h(S_i)
7                   h(T^{-1}_x(z_r)) = h(S_i ∪ G_i)
8               end
9               Score(h(·), G, G_i) = W
10          end
11          Select N_next following Eq. (3.7)
12          S_i = N_next
13          curPath = curPath + N_next
14      end
15      S_l = S_l ∪ S_i
16  end
17  return subgraph G_sub with the highest score from S_l

The SHAP approach can rate the relevance of each feature for a specific prediction. SHAP values are the Shapley values of a conditional expectation function [136], which can better measure feature importance. Here, we use $\{G_1, \cdots, G_i, \cdots, G_n\}$ to represent $n$ subgraphs of an interaction graph $G$. The analysis result can be written as:

$G^* = \arg\max_{|G_i| \le N_{min}} \text{SHAP}(h(\cdot), G, G_i)$, (3.3)

where $N_{min}$ is a hyperparameter that limits the number of nodes in a subgraph. Following the SHAP framework, the SHAP value $\phi_i$ can be computed as follows:

$\phi_i(h, x) = \sum_{z_r \subseteq x_r} \frac{|z_r|! \,(M - |z_r| - 1)!}{M!} \left[ h_x(z_r) - h_x(z_r \setminus i) \right]$, (3.4)

where $x_r$ is the set of total non-zero entries, $z_r$ is a subset of $x_r$, $|z_r|$ is the number of non-zero entries in $z_r$, $M$ is the number of all entries, and $z_r \setminus i$ denotes setting $z_r^i = 0$. In our proposed Algorithm 5, non-zero entries are the nodes (players) of a graph that are included in the Monte Carlo beam search process. $h_x(z_r) = E[h(z) \mid z_S]$ calculates the expectation across all possible node combinations, where $S$ is the set of non-zero indices in $z_r$. Solving Eq. (3.4) is challenging. Thus, we apply kernel SHAP to approximate it (lines 5-9 in Algorithm 5):

$L(h, g) = \sum_{z_r \in Z} \left[ h(T_z^{-1}(z_r)) - g(z_r) \right]^2 \cdot C$, (3.5)

$C = \frac{M - 1}{\binom{M}{|z_r|} \, |z_r| \,(M - |z_r|)}$, (3.6)

where we assume $g(z_r) = W z_r$ is the explanation model following a linear form. $g(z_r)$ matches the original model $h(z)$ when $z = T_z(z_r)$, where $T_z(\cdot)$ is a mapping function that converts the inputs to the original input space. We use weighted linear regression to solve the equation, since $L(h, g)$ is a squared loss and $g(z_r)$ is a linear function. The intuition is that Eq. (3.4) calculates a difference of means, which has a connection with linear regression: the mean value is the best least-squares point for a set of data points. Thus, as noted in [136], we calculate $\phi_j(h, x) = w_j (x_j - E[x_j])$ given $h(x) = \sum_{j=1}^{M} w_j x_j + b$, which is a linear classification model such as the SGDClassifier introduced in Section 3.3.1. During the MCBS process, the next node $N_{next}$ is selected as follows (lines 11-13 in Algorithm 5):

$\arg\max_{a_j} \; Q(N_{next}, a_j) + \lambda R(N_{next}, a_j)$, (3.7)

where $Q(N_{next}, a_j)$ is the average reward score over several visits, and $\lambda$ is a hyperparameter that balances exploration and exploitation. $R(N_{next}, a_j)$ is the immediate reward for choosing $N_{next}$, which is the $\text{SHAP}(h(\cdot), G, G_i)$ score. Finally, Algorithm 5 outputs the subgraph $G_{sub}$ with the highest risk score for cause analysis.

Table 3.1: Statistics of interaction graphs.
Type | Label | Total Graph Num. | Vulnerable Graph Num.
Homo. (IFTTT) | labeled | 6,000 | 1,473
Homo. (IFTTT) | unlabeled | 10,000 | ∗
Hetero. (5 Platforms) | labeled | 12,758 | 3,828
Hetero. (5 Platforms) | unlabeled | 19,440 | ∗

3.4: Evaluation and Discussion
3.4.1: Experimental Setting
To simulate FL training, we evaluate the dynamic clustering-based federated GNN on a high-performance computing cluster equipped with Intel(R) Xeon(R) Gold 6148 2.4 GHz CPUs running CentOS 7.
For collecting app description data, we use Scrapy [181] to crawl smart home app rules from the following 5 platforms:

• For SmartThings [190], we crawl rule descriptions from 185 open-source apps.
• For Home Assistant [10], we crawl rule descriptions from 574 blueprints.
• For IFTTT [98], we integrate 315,393 applets from [222] and 1,535 newly crawled applets.
• For Google Assistant [78], we crawl 480 services and 4,812 action commands.
• For Amazon Alexa [5], we crawl 2,232 services and 3,274 skill commands.

3.4.2: Clustering-based Federated Graph Learning Evaluation
FexIoT conducts fine-grained clustering-based federated contrastive graph representation learning for vulnerability detection. Note that, as it is hard to build large-scale smart homes in real life, we follow the template of deployed smart homes and leverage the offline graphs to produce a set of online interaction graphs for the federated GNN evaluation. The only difference is that online interaction graphs reflect real-time IoT interactions. We compare FexIoT with two GNN models, four baselines, and five different data distributions.

Figure 3.2: The performance of GNNs (accuracy, precision, recall, and F1 of GIN and GCN) under five different Dirichlet distribution parameters α.

The two homogeneous GNN models are: (i) GCN [108], a convolutional network operating on graphs, for which we adopt three graph convolutional layers; and (ii) GIN [231], a graph isomorphism network model, for which we adopt the original model architecture. For the heterogeneous graph classification model, we choose the MAGNN model [71], which learns heterogeneous graph embeddings based on metapath aggregation. The four federated learning frameworks are: (i) federated multi-task learning (FMTL) [179], which groups the clients into clusters by using geometric aspects of the loss surface; (ii) graph clustered federated learning (GCFL+) [229], a gradient sequence-based clustering method for graph classification; (iii) federated averaging (FedAvg) [146], which aggregates locally computed updates in federated learning; and (iv) self-training in clients (Client), which trains the GNN locally without any communication.

Data Distribution Configuration. We study the impacts of different data distributions on the federated GNN models. We split the IFTTT graph dataset according to a Dirichlet distribution to simulate the non-i.i.d. dataset in each client, and we set the number of clients to 10. We draw the class marginal distribution from the Dirichlet distribution with the probability density function $p(x) \propto \prod_{i=1}^{k} x_i^{\alpha_i - 1}$.
Here, $\alpha$ is the positive concentration parameter, which we set to 0.1, 1, 2, 5, and 10. When $\alpha$ is close to 0, most of the values drawn from the Dirichlet distribution will be close to 0. Thus, we can simulate clients with unbalanced and non-i.i.d. datasets. Following the above Dirichlet distribution, different clients will have different numbers of graphs. We split 80% of the IFTTT dataset as the client training dataset and 20% as the testing dataset for each trial. During the training process, we empirically set the two thresholds ϵ1 and ϵ2 in Eq. (3.2), which are related to the size of the model weights, to start and end the layer-wise clustering process. We found that the local clients achieve decent performance when the two values are set between 0.5 and 2. The value setting also depends on the initialization settings, such as the learning rate with the corresponding optimizer. Unless otherwise specified in the following experiments, we set the learning rate to 0.001 in the Adam optimizer, ϵ1 to 1.2, and ϵ2 to 0.8. We apply the weighted cross-entropy loss function to handle class imbalance in interaction graph classification, setting the weight given to each class in inverse ratio to the class frequencies. Figure 3.2 shows the average accuracy, precision, recall, and F1 values of the models.

GNN Models Evaluation. We test the influence of different GNN models in the federated learning process. Note that the GNN models are used for graph representation learning in the FL process; we use the SGDClassifier from Scikit-learn [164] to implement the normal-or-vulnerable binary classification. As shown in Figure 3.2, the GIN model achieves better performance than the GCN model, which is consistent with the previous study [231]. Moreover, we found that when the dataset is more evenly distributed (larger α), both GCN and GIN achieve better performance. The data distribution indeed has a great impact on model performance. For example, the F1 values of the GIN model in FedAvg are 0.735 and 0.748 when α is 0.1 and 10, respectively. GIN achieves better performance when α is 10 (average F1 of 0.92) than when α is 0.1 (average F1 of 0.89). Our proposed dynamic clustering-based federated graph learning method achieves the best performance with different GNN models.

Efficacy Evaluation. Compared to the baselines, FexIoT achieves the best performance because its design considers fine-grained homogeneous data features within heterogeneous data distributions. For example, for FexIoT, the accuracy is 0.891 and 0.919 when α is 0.1 and 10, respectively. For GCFL+, the accuracy is 0.852 and 0.889 when α is 0.1 and 10, respectively. The accuracy of FedAvg is 0.717 and 0.768 when α is 0.1 and 10, respectively. The average client accuracy is 0.542 and 0.622 when α is 0.1 and 10, respectively. FexIoT improves the accuracy by 17.4% compared with FedAvg. FedAvg is deeply affected by data heterogeneity, which leads to poor modeling performance with low accuracy in FL training paradigms. By contrast, FexIoT largely reduces the effect of data heterogeneity among different datasets by training on clusters with high homogeneity.
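For concreteness, the Dirichlet-based non-i.i.d. split described under Data Distribution Configuration above can be sketched as follows; the parameter values are illustrative:

```python
# Sketch: per-class proportions for each client are drawn from Dir(alpha);
# smaller alpha yields more unbalanced, more non-i.i.d. clients.
import numpy as np

def dirichlet_split(labels: np.ndarray, n_clients: int = 10,
                    alpha: float = 0.1, seed: int = 0):
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(n_clients))  # class-c shares
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_idx[client].extend(part.tolist())
    return client_idx  # list of sample indices per client
```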
We also compare the performance of federated learning with that of centralized training. We test the GIN model in centralized training, which collects all data from the clients to train a single GIN model. Centralized training achieves an F1 value of 0.944, while our federated GIN model achieves a 0.91 average F1 value over the five data distributions. Overall, our proposed FexIoT can better depict the homogeneity of the GNN models learned from different data distributions than the coarse-grained alternatives.

Stability Evaluation. The clustering-based methods (FexIoT, GCFL+, and FMTL) achieve better and more stable performance than FedAvg and single-client training. For example, with our proposed algorithm, the average accuracy of GIN under the five data distributions is 0.907, with a standard deviation (STD) of 0.01. For FedAvg, the average accuracy is 0.738 with an STD of 0.017. The performance of FedAvg is more divergent depending on the data distribution of each client. Similarly, the performance of self-training in each client is also affected by different data distributions. Besides, FexIoT outperforms the state-of-the-art method GCFL+, which has an average accuracy of 0.87 with an STD of 0.012. Facing heterogeneous datasets, GCFL+ considers the similarity of the parameters of a whole model, which can fluctuate due to changes in the parameters of different layers of client models. By contrast, FexIoT implements clustering based on fine-grained bottom-up layer-wise parameters, which is less affected by fluctuating layers. Overall, FexIoT is more stable under different data distributions than the other FL methods.

Figure 3.3: The test accuracy of client models with different numbers of clients participating in training with the IFTTT dataset (left) and the heterogeneous graph dataset (right).

Scalability Evaluation. We simulate different numbers of clients (25, 50, 75, 100) joining FexIoT on two datasets. Unless otherwise mentioned, we use GIN on the IFTTT dataset and MAGNN on the heterogeneous graph dataset, and the concentration parameter α of the Dirichlet distribution is 1. Figure 3.3 is a box plot, in which the minimum, maximum, median, first quartile, and third quartile accuracies are reported. On the IFTTT dataset, the third-quartile accuracies with 25, 50, 75, and 100 clients are 0.869, 0.879, 0.882, and 0.873, respectively. Therefore, even as the number of clients increases, 75% of clients achieve more than 86% accuracy, which shows the high scalability of FexIoT. When the number of clients is larger (e.g., 100), the results show more divergence, as the minimum accuracy is 0.8 and the maximum accuracy is 0.977. When the number of clients increases, due to the fixed size of the entire dataset, the size of each client's dataset decreases, which affects the evaluation results. Nevertheless, the results show that FexIoT can effectively aggregate clients with similar data distributions to boost model performance.

Drifting Interaction Pattern Evaluation. Methods that rely on known patterns become ineffective when there are drifting samples. Therefore, we need to identify possible drifting graph samples with unknown interaction patterns. We evaluate drifting interaction pattern detection on the unlabeled dataset, to check whether the randomly generated graph dataset contains drifting samples. Figure 3.4 shows the clustering results on 1,500 randomly sampled graph feature representations learned with federated contrastive graph learning. The data is processed with t-SNE dimension reduction. The centroid of each class is marked by a white cross, and the possible drifting samples are in the red circle.
The six types of known interaction vulnerabilities are clustered in gray dotted boxes. Through our federated contrastive learning, the model can learn the features of the six types of vulnerability patterns and the normal graph pattern well, and they are separable in the hidden space shown in Figure 3.4. Based on the drifting interaction pattern analysis in Section 3.3.2, we found 63 and 104 potential drifting samples in the IFTTT dataset and the heterogeneous graph dataset, respectively.

Figure 3.4: K-means clustering on 1,500 randomly sampled graph representations with t-SNE dimension reduction.

After manual checking, we found three new types of vulnerability patterns: (i) an automation action is reverted over time; (ii) another action can generate fake automation conditions; and (iii) non-automation settings can block the existing actions of smart devices. In this way, FexIoT can reduce the false-alarm rate, whereas security rule-based querying methods in traditional databases cannot discover new patterns and may even incur high false-positive or false-negative rates because of the drifting samples.

System Comparison. We compare FexIoT, trained across 50 clients, with existing vulnerability detection systems: (i) HAWatcher [70], a search-based system that extracts secure rule interaction templates and evaluates device interactions; we use the extracted templates to discover vulnerabilities in the test dataset. (ii) DeepLog [59], which models event logs as a language sequence and uses Long Short-Term Memory (LSTM) to learn log patterns; we train the LSTM on normal logs and find anomalies that deviate from the training data. (iii) IsolationForest [131], a density-based anomaly detection method with a tree structure, implemented with Scikit-learn [164]; its input is a data vector that includes device status. We use the 600 online interaction graphs from Section 3.4.1, integrated with event logs and descriptions, as the test dataset. The system performance is shown in Table 3.2.

Table 3.2: Comparison of different systems with testbed data.
Method | Accuracy | Precision | Recall | F1
HAWatcher [70] | 0.82 | 0.83 | 0.87 | 0.85
DeepLog [59] | 0.74 | 0.78 | 0.79 | 0.78
IsolationForest [131] | 0.63 | 0.74 | 0.61 | 0.67
FexIoT | 0.90 | 0.90 | 0.93 | 0.91

From the results, we find that all methods fail to differentiate between normal user interruptions (e.g., manually turning on a light) and malicious attacks, which causes false positives. Feedback from users should be incorporated to update such correlations and reduce false alarms. FexIoT outperforms the existing methods for the following reasons: (i) HAWatcher only extracts binary rule templates, which can hardly cover long-term complex correlations such as the correlation between an air conditioner and temperature events; and (ii) DeepLog and IsolationForest cannot effectively mine interaction correlations across different events.

Figure 3.5: Communication cost with different numbers of clients.

Communication Cost Evaluation. We measure the communication cost during the training process by computing the total data transferred (download and upload) between the server and the clients. As shown in Figure 3.5, the total transferred data is less than 40 GB over 60 rounds with 100 clients, which is acceptable for wired network bandwidth. FexIoT saves 40.2% communication overhead compared with FedAvg [146].
FedAvg needs to aggregate the whole model during the FL process, and FMTL [179] and GCFL+ [229] also share whole models, albeit within different clusters. The low cost of FexIoT is attributed to our proposed layer-wise clustering-based FL method. At the initial stage, only the parameters of the first layer are uploaded to the server for clustering. Then, based on the previous clustering result, the upper-layer parameters of the models are uploaded to the server, and the server sends the parameters of the different layers back to the corresponding client clusters. Clients in the same cluster share more layers than clients in different clusters. Thus, the method reduces the number of parameters transmitted between the server and the clients.

3.4.3: Vulnerable Interaction Cause Analysis Evaluation
After vulnerability detection, we conduct qualitative and quantitative evaluations to show the efficacy and efficiency of the proposed vulnerable interaction cause explanation method. We randomly select 100 interaction graphs that contain vulnerable interactions, as reported by the GCN model on the unlabeled IFTTT dataset. Among the 100 graphs, we identify four graphs that are benign after manual checking, which are false-positive results from the GCN model. We compare our method FexIoT with two baseline methods: SubgraphX [241] and MCTS_GNN. SubgraphX is a general graph explanation method that applies Monte Carlo tree search with the Shapley value to explore various subgraphs, while MCTS_GNN applies Monte Carlo tree search with the prediction score of the GNN model to explore subgraphs. Other graph explanation methods such as PGExplainer [137] and GNNExplainer [238] can only identify discrete edges or nodes, so they cannot be used to explain interaction vulnerabilities among nodes. SubgraphX and MCTS_GNN are implemented with the DIG [132] library. We apply GCN as the vulnerability detection model because the proposed vulnerable interaction cause analysis method does not rely on a specific vulnerability detection model.

Example Illustration. To visually demonstrate the superiority of our explanation method, we give two intuitive examples in Figure 3.6. The first row illustrates an example of a false positive of the GCN model prediction, while the second row illustrates a correct prediction. For the first row in Figure 3.6, the interaction graph is predicted to contain vulnerable interactions, but it is deemed benign after manual inspection. The interaction graph describes that when the door opens, the water flow will run with the switch turned on, and a notification is sent to the user. The user can notify the app to turn on the camera. Then, if smoke is detected, the door unlocks and the ventilation fan starts.

Table 3.3: The content of indexes in Figure 3.6.
Index | Rule
23 | Connect to the WiFi
47 | Turn camera off
62 | If the fan runs, turn on water flow switch
174 | Turn on cameras if lights are off
281 | If smoke is detected, the fan starts
1076 | Turn on air conditioner if plug is on
1177 | Turn the camera on if get notified
1215 | Tap to turn camera off
1291 | Turn on plugs if door unlocked
2184 | If smoke is detected, unlock the door and start fan exhaust

This interaction graph does not contain interaction vulnerabilities. As shown in Figure 3.6, our FexIoT offers only a minimally misleading explanation subgraph, as it identifies a concise subgraph to explain the prediction.
By contrast, the other methods try to find a larger subgraph to explain the prediction result, which causes more confusion for the inspector.

Figure 3.6: Vulnerability explanation comparison (Examples 1 and 2, each explained by FexIoT, SubgraphX, and MCTS_GNN).

For the second example in Figure 3.6, the GCN model prediction is correct. It describes that when the user taps to turn off the camera, the camera will be turned off and the action will be recorded in a Google spreadsheet. When a new spreadsheet subscriber is added, it will send a notification to turn on the camera. FexIoT, SubgraphX, and MCTS_GNN can all recognize the key subgraph that causes the interaction vulnerability, which turns off the camera within a loop. For both cases, FexIoT provides a reasonable vulnerability explanation with better visual clarity.

Fidelity and Sparsity. Besides the visualizations used to evaluate the vulnerability explanation methods, we use Fidelity and Sparsity [168] to quantitatively measure the explanation results. The Fidelity metric refers to the difference between the two prediction scores before and after removing part of the structure of a graph. The Sparsity metric refers to the percentage of the structure that is recognized as important by the explanation model. There is a trade-off between Fidelity and Sparsity: a high Sparsity score means a smaller structure, which tends to have low Fidelity (be less important). For real-life cases that lack ground truth, the Fidelity and Sparsity scores can help measure the explanation quality. We test the Fidelity and Sparsity scores of 50 randomly picked interaction graphs. The scores depend on the structure of the interaction graphs and fluctuate among different cases. Half of the tested cases have relatively high Fidelity (greater than 0.3) and low Sparsity (smaller than 0.7). For better visualization, we plot the curve of Sparsity with respect to Fidelity for four cases, as shown in Figure 3.7. The four cases have either higher Fidelity and lower Sparsity, or lower Fidelity and higher Sparsity. Our method FexIoT strikes a decent balance between the Sparsity and Fidelity scores, which means FexIoT can accurately identify the concise subgraphs that are the possible causes of the vulnerabilities. The results indicate that our explanation method can find the important yet concise subgraph for prediction explanation, which is more faithful for interaction analysis.

Figure 3.7: The comparison of Fidelity and Sparsity scores.

Table 3.4: Runtime efficiency with different datasets.
Dataset | Graph Construction Time (s) | Prediction Time (s) | Vulnerability Analysis Time (s) | Model Size (MB)
IFTTT | 17.19 | 0.52 | 2.18 | 5.48
Hetero. | 976.99 | 0.61 | 3.64 | 6.13

Vulnerability Explanation Efficiency. As shown in Table 3.4, generating the 12,758 heterogeneous graphs takes about 980 s. The size of the MAGNN model for heterogeneous graphs is only 6.13 MB, which makes it feasible to train the model on devices such as a Raspberry Pi.
The prediction time is 0.61 s on average, and it is related to the graph size. For vulnerability explanation, the average cause discovery time is 3.64 s per interaction graph; it takes 182 s for 50 graphs with 18 nodes on average. Note that the time depends on algorithm parameters such as the number of MCTS iterations and the number of kernel SHAP samples. Overall, FexIoT achieves vulnerability analysis with high efficiency.

3.4.4: Discussion
Security and Privacy We acknowledge that FexIoT itself faces security and privacy issues. For example, it might suffer from Sybil attacks and data poisoning attacks [56, 72, 103, 256] when an attacker controls multiple clients to attack the system. One possible solution is to enhance software and network security (e.g., using firewalls, CAPTCHAs, or device-specific asymmetric keys) in smart homes to protect the internal FL model. Moreover, there are several defense methods that can identify Sybils based on the diversity of client updates [72], the training loss from randomly selected sets of clients [103], and feature importance [256], which could be integrated into FexIoT. To enhance data privacy, we will add differential privacy [61, 134, 226] and secure aggregation mechanisms [19, 42, 129] to FexIoT in the future.

Data Labeling Providing ground truths is necessary for scientific evaluation. Existing works [57, 122, 242] have proposed methods for efficient data annotation. In this chapter, we detailed how to manually label the dataset with cross-validation. In the future, we plan to combine experts, amateurs, and ML models for cross-validation to further enhance the quality of labels for IoT data. For example, experts can label a few samples and develop few-shot learning-based models. ML models and amateurs can then collaboratively label new data. Iteratively, the newly labeled data can be used to re-train the model.

3.5: Summary
IoT data security and privacy management are essential for many IoT applications. In this chapter, we investigate how IoT rule configuration data could expose interaction vulnerabilities across heterogeneous closed-source platforms. We present and implement a novel IoT data management solution, FexIoT, that leverages federated and explicable graph learning to analyze large-scale IoT interaction data. We design the fine-grained dynamic clustering-based federated contrastive graph representation learning algorithm to discover interaction vulnerabilities and tackle the concept drift problem of smart home data. We propose the MCBS-based search method with the SHAP value to search subgraphs and measure the risk of each one. With data collected from 5 real-world IoT platforms, we demonstrate the superior performance of FexIoT in detecting and tracking the root causes of interaction vulnerabilities.

CHAPTER 4: ACOUSTIC-BASED FACIAL EXPRESSION RECOGNITION³

4.1: Introduction
With the emergence of new media in the digital era, a variety of social media services are competing for users' attention. Fine-grained emotional reaction understanding is pivotal to facilitating a user's interaction with digital content. Traditionally, crowd-sourced ratings and reviews have been used to evaluate users' feedback on services. However, they are too coarse-grained to provide real-time spontaneous feedback. To provide a more personalized service, we need an accurate and robust approach to recognizing users' emotions and acquiring spontaneous feedback.
A number of techniques have been proposed to recognize emotions expressed through various biometric features, such as facial features [32, 75, 155], speech features [156], or heartbeats [250]. Nevertheless, as a universal form of nonverbal communication, facial expression is recognized as the most direct way of understanding human emotions [126]. There are six widely adopted facial expressions (FEs): anger, disgust, fear, happiness, sadness, and surprise [63]. These FEs can be modeled by the Facial Action Coding System (FACS), which includes action units (fundamental action muscles) and action descriptors (unitary movements of several muscle groups) [41]. When people make different facial expressions, different facial muscles move, which can be captured by various sensing signals. Existing facial expression recognition (FER) methods can be categorized as camera-based [23, 163, 183], radio-based [36, 40, 81], and acoustic-based [75] expression recognition.

³This chapter is based on previously published work by Guangjing Wang, Qiben Yan, Shane Patrarungrong, Juexing Wang, and Huacheng Zeng titled "FacER: Contrastive Attention based Expression Recognition via Smartphone Earpiece Speaker", published in the Proceedings of the IEEE Conference on Computer Communications (INFOCOM). DOI: 10.1109/INFOCOM53939.2023.10228985 [216]. © 2023 IEEE. Reprinted, with permission, from Guangjing Wang, Qiben Yan, Shane Patrarungrong, Juexing Wang, and Huacheng Zeng, "FacER: Contrastive Attention based Expression Recognition via Smartphone Earpiece Speaker" (paper and IEEE titles are the same), August 2023.

Figure 4.1: Facial expression recognition using a smartphone: (a) FacER application scenario; (b) six facial expressions [7].

However, the existing camera-based methods raise privacy concerns due to their continuous video recording on a device. For instance, FaceWarehouse [23] collects RGBD images of different expressions from different users, yet serious privacy concerns impede the wide adoption of such methods in real-world scenarios. Moreover, camera-based methods cannot accurately recognize facial expressions when users cover their faces with masks. Other alternative methods require extra sensing devices, affecting usability. For example, WiFace [36] uses a WiFi router with three antennas placed at a specific position for FER, but it requires additional hardware and configuration. Similarly, PPGface [40] requires expensive wearable devices with photoplethysmography sensors. We propose FacER, a Facial Expression Recognition system based on near-ultrasound acoustic sensing on a smartphone. We utilize commodity smartphones to emit near-ultrasound signals (19-23 kHz) towards the user's face. As shown in Figure 4.1a, the microphone on the smartphone receives the echoes reflected from the face surface, which carry the facial expression information. By examining the fine-grained echo patterns, FacER can differentiate the six types of facial emotional expressions shown in Figure 4.1b. We demonstrate that in a scenario where a user holds a smartphone at a relatively stable distance (e.g., 20-50 cm), FacER can recognize the six universal facial expressions with more than 85% accuracy. Compared with the state-of-the-art acoustic-based SonicFace [75], FacER uses a commodity smartphone without requiring a customized microphone array, providing a more convenient and usable solution.
There are two main challenges in designing FacER. First, acoustic interference, such as the multipath of reflected signals and environmental noises, can significantly impact recognition performance. Apart from the echoes reflected from the face, the microphones can also receive echoes from surrounding obstacles. It is imperative to mitigate or remove any undesired signals. However, even with the application of various noise-reduction techniques, environmental noises at a frequency similar to the emitted signal could persist, and obstacles may sit at a distance from the smartphone similar to that of the user's face. Besides, there could exist low-frequency and high-frequency environmental noises. Therefore, we propose a contrastive external attention-based learning model to extract more robust facial expression feature representations from the noisy data. In this way, the model can distill the universal features of expressions while eliminating background noises. Second, different users express facial expressions in different manners, and even the same user could express them differently at different times. This results in the domain adaptation issue, a common issue in machine learning (ML) models, where a model trained on a labeled dataset (source domain) has limited performance on a testing dataset (target domain). This is because the target domain dataset has a distribution shift from the source domain dataset, which violates the independent and identically distributed (i.i.d.) assumption of ML models. Therefore, we propose a domain adaptation contrastive learning algorithm to align the distributions of the source domain and target domain datasets. In this way, FacER can achieve consistent performance in recognizing various expressions across different users. We evaluate FacER on a dataset collected from 20 volunteers of different ages, genders, and skin colors in various environments. We show that FacER can effectively recognize six different facial expressions from different users. Remarkably, it achieves more than 90% test accuracy when the training and testing datasets have the same distribution, and more than 85% accuracy when training and testing on different sets of users. In summary, we make the following contributions:

• We design FacER, which uses a contrastive external attention-based acoustic facial expression recognition model to learn representative and robust facial expression features while eliminating the background noise.
• We propose the domain adaptation contrastive learning algorithm to align the training and testing data distributions, which largely mitigates the negative effects of variations in users' facial expressions.
• We implement the smartphone-based system FacER and perform evaluations in various real-life scenarios. FacER outperforms the state-of-the-art approaches by more than 10% in recognition accuracy, with high mobility and convenience.

4.2: Background
In this section, we introduce the background of acoustic signals and the preliminaries of attention mechanisms and contrastive learning.

Figure 4.2: Illustration of preliminaries: (a) FMCW; (b) self-attention.

4.2.1: Acoustic Signal
The acoustic signal refers to a coded chirp signal transmitted by a device. In particular, the chirp sound signal is a frequency-swept signal, which is modulated in frequency as shown in Figure 4.2a.
Specifically, the chirp signal can be regarded as the sawtooth-modulation component of a Frequency-Modulated Continuous Wave (FMCW), which changes its operating frequency during the measurement. In FMCW, the frequency of the signal periodically increases or decreases during transmission. The difference in frequency between the transmitted and received signals is proportional to the time delay $\Delta t$. Therefore, FMCW can measure small movements of the target, calculated as follows:

$R = \frac{v_0 |\Delta t|}{2} = \frac{v_0 |\Delta f| T}{2B}$, (4.1)

where $R$ is the distance between the sound source and the reflecting object, $v_0$ is the speed of sound (340 m/s at 20 °C), $\Delta t$ is the delay time, and $\Delta f$ is the measured frequency difference. $B$ is the chirp frequency bandwidth, and $T$ is the chirp period. The duration of the transmitted waveform $T$ should be greater than the required receiving time for the distance measuring range. We use $\frac{B}{T}$ to measure the frequency shift per unit of time. Therefore, with the features of FMCW, the chirp signal can help group reflections from various distances into multiple range bins.

4.2.2: Attention Mechanism
Similar to the human visual system, attention mechanisms [207] aim to focus limited attention on key information, which saves resources and distills the most essential information. The basic idea of attention mechanisms is to combine all of the encoded input features in a weighted fashion, with the most important features receiving the highest weights. The self-attention mechanism improves the representation at each position by combining features from other positions in a sample (e.g., an image), and it captures long-range dependencies. Given a feature map $F \in \mathbb{R}^{N \times d}$, where $N$ is the number of elements and $d$ is the feature dimension of each element, self-attention multiplies $F$ by three different randomly initialized weight matrices to project it into a query matrix $Q \in \mathbb{R}^{N \times d}$, a key matrix $K \in \mathbb{R}^{N \times d}$, and a value matrix $V \in \mathbb{R}^{N \times d}$. The self-attention is represented as follows:

$F_{out} = \mathrm{softmax}(QK^T)\,V$, (4.2)

where $\mathrm{softmax}(QK^T)$ is the attention matrix, and $F_{out}$ is the improved feature representation of the input $F$.

4.2.3: Residual Attention Network
The residual attention network (RAN) [213] is a type of convolutional neural network. The RAN contains multiple attention modules that generate attention-aware features. In each attention module, there are a trunk branch $T(x)$ and a mask branch $M(x)$. $T(x)$ performs feature processing, and $M(x)$ softly weights the output features of $T(x)$. The output of the attention module is

$H_{i,c}(x) = M_{i,c}(x) \cdot T_{i,c}(x)$, (4.3)

where $i$ ranges over all spatial positions and $c$ is the index of a channel. However, directly stacking multiple attention modules would degrade the model performance. Therefore, the residual learning mechanism is applied to optimize the performance. The main design of residual learning is the skip connections that jump over some neural network layers, which contain nonlinear activations and batch normalization. Thus, the modified output $H$ of the attention module is

$H_{i,c}(x) = (1 + M_{i,c}(x)) \cdot F_{i,c}(x)$, (4.4)

where $F_{i,c}(x)$ denotes the original features, which are approximately preserved when $M(x)$ is close to 0. This method is called attention residual learning. The facial expression echoes contain multiple facial muscle movements, as well as the background noise. Therefore, it is essential for the neural network to capture the different important aspects of expressions.
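A minimal PyTorch sketch of attention residual learning in Eq. (4.4) follows; the trunk and mask branches are illustrative placeholders, not RAN's exact architecture:

```python
# Sketch of one attention residual module: H(x) = (1 + M(x)) * F(x).
import torch
import torch.nn as nn

class AttentionResidualModule(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.trunk = nn.Sequential(            # F(x): feature processing
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
        )
        self.mask = nn.Sequential(             # M(x): soft weights in [0, 1]
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.trunk(x)
        m = self.mask(x)
        return (1 + m) * f   # falls back to F(x) when M(x) is near 0

out = AttentionResidualModule(16)(torch.randn(2, 16, 32, 32))  # sanity check
```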
4.2.4: Contrastive Learning
A domain D = {X, P(X)} includes the feature space X and the marginal probability distribution P(X). If two domains are different, they have different X or P(X), but the label space is the same. In our context, the source domain refers to a dataset with labels, while the target domain refers to a dataset with a different distribution. Contrastive representation learning aims to learn an embedding space where dissimilar samples are spread out and similar samples remain close together. Normally, a positive pair refers to a pair of samples that have the same label, and a negative pair has different labels. The contrastive loss for two samples is defined as

L_c(x_i, x_j, θ) = 1[y_i = y_j]·‖f_θ(x_i) − f_θ(x_j)‖₂² + 1[y_i ≠ y_j]·max(0, ε − ‖f_θ(x_i) − f_θ(x_j)‖₂)²,    (4.5)

where 1[y_i = y_j] equals 1 if the two samples have the same label and 0 otherwise, and 1[y_i ≠ y_j] equals 1 if the two samples have different labels and 0 otherwise. ε is an upper bound on the distance that prevents exceptionally dissimilar pairs from dominating the loss. When the training objective includes multiple positive and negative pairs in one batch, the supervised contrastive loss [106] is defined as follows:

L_c = Σ_{i∈I} (−1/|P(i)|) Σ_{p∈P(i)} log [ exp(z_i · z_p / τ) / Σ_{a∈A(i)} exp(z_i · z_a / τ) ],    (4.6)

where I is the set of samples x within a batch, A(i) ≡ I\{i}, P(i) is the set of indices of all positives in the multiviewed batch, |P(i)| is its cardinality, and z = Proj(Enc(x)) is the feature representation encoded by an encoder model Enc(·) followed by a projection layer Proj(·), such as a linear layer. The (·) denotes the inner product, and τ is a temperature parameter that scales the similarities.
4.3: System Design
In this section, we design the acoustic sensing and signal pre-processing techniques to prepare the sensing data. Then, we propose the contrastive external attention-based domain adaptation model for acoustic facial expression recognition.
4.3.1: Acoustic Sensing Design
A unique facial expression contour is a distinct collection of various reflecting surfaces. As different objects absorb and attenuate sound waves to different degrees, it is possible to differentiate between the echoes reflected from objects and those from facial expressions [254].
Signal Generator
On a smartphone, there are usually a main speaker and microphone at the bottom or on the back, and an earpiece speaker and microphone at the top of the phone body. Considering that the earpiece speaker is properly positioned to illuminate a user's face, as shown in Figure 4.1a, we select the earpiece speaker for emitting the acoustic signal. Similarly, considering the gesture of holding a phone, the top microphone is chosen since it is less affected by the hand. The acoustic signal should satisfy the following properties: (i) the period of the signal should be moderate to minimize the overlap of echoes from various distances; (ii) the signal should be distinguishable from the background noise in the frequency domain, given that noise frequencies are mostly under 8 kHz; and (iii) the signal ought to be inaudible in real-world scenarios. Therefore, considering that a change of facial expression happens within 1 second, we choose a chirp signal of 25 milliseconds sweeping from 19 kHz to 23 kHz, with fading at the start and the end, to compose the inaudible acoustic signal. In this way, we can better capture the expression features caused by muscle movements and filter out echoes from different obstacles.
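As a concrete illustration of this signal design, the sketch below generates one 25 ms chirp sweeping from 19 kHz to 23 kHz at a 48 kHz sampling rate, with a short raised-cosine fade at both ends; the fade length is an assumed value chosen only for illustration.

```python
import numpy as np
from scipy.signal import chirp

FS = 48_000               # sampling rate (Hz)
DURATION = 0.025          # chirp length: 25 ms
F0, F1 = 19_000, 23_000   # sweep range (Hz)

def make_chirp(fade_len: int = 100) -> np.ndarray:
    """One linear up-chirp with raised-cosine fades to suppress audible clicks.

    fade_len (in samples) is an illustrative choice, not a value from FacER.
    """
    t = np.arange(int(FS * DURATION)) / FS
    sig = chirp(t, f0=F0, f1=F1, t1=DURATION, method="linear")
    fade = 0.5 * (1 - np.cos(np.linspace(0, np.pi, fade_len)))
    sig[:fade_len] *= fade          # fade in
    sig[-fade_len:] *= fade[::-1]   # fade out
    return sig.astype(np.float32)

# Emit one chirp every 50 ms: the 25 ms chirp followed by 25 ms of silence,
# leaving time for all echoes of the previous chirp to return.
frame = np.concatenate([make_chirp(), np.zeros(int(FS * 0.025), np.float32)])
```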
According to the Nyquist sampling theorem, the sampling rate is set to 48 kHz. FacER leverages the earpiece speaker to periodically emit the near-ultrasound signals and simultaneously uses the microphone to receive the signals reflected by the face. We set the chirp interval to 50 milliseconds so that all reflected signals from the previous chirp are received before the following chirp is transmitted, which allows consecutive chirps to be separated.
Noise Removal
Expression-irrelevant signals from background noises, nearby people, and obstacles should be filtered out. To achieve this, we first apply a 19-23 kHz band-pass filter to keep the desired frequency band and remove the background noise. Three main types of signals remain in the received recording: (i) the direct path signal, which travels directly from the speaker to the microphone; (ii) the major echo signal, the mix of echoes from the facial contour, which is FacER's signal of interest; and (iii) the noisy echo signals, echoes from different obstacles in the environment caused by the multipath of reflected signals.
[Figure 4.3: A segment of the raw signal and the signal after noise removal.]
Figure 4.3 shows a segment of the raw signal (in blue) and the processed signal (in orange) after filtering out the background noise and the direct path signal. To remove the interference of the direct path signal, inspired by AIM [144], we use a separate speaker and microphone to record the direct transmission by positioning the devices in a clean and quiet space. We then remove the direct path signal from the received samples by minimizing ‖S − c·S_d‖, where S denotes the received samples, S_d denotes the pre-recorded direct signals, and c is a scaling coefficient for optimal cancellation, set to 0.9 in our experiments. In this way, we remove the direct path signal from the speaker to the microphone. Then, we proceed to remove the noisy echo signals. In the FacER application scenario, a user is facing the phone, so we assume a relatively stable and static distance between the phone and the user's face. We consider filtering out the noisy echo signals from nearby obstacles at different distances. FMCW provides distance measurement, an important tool for differentiating among echoes when more than one source of reflection is received. A comfortable distance between human eyes and the phone is 25-50 cm [254]. Therefore, based on Eq. (4.1), we can calculate the desired frequency shift as:

|Δf| = 2RB / (T·v0).    (4.7)

Thus, |Δf| is between 235 Hz and 470 Hz.
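The noise-removal steps above can be sketched as follows: a 19-23 kHz band-pass filter, followed by cancellation of the pre-recorded direct-path signal by minimizing ‖S − c·S_d‖ (FacER fixes c at 0.9; the filter order and the optional least-squares fallback are our assumptions for illustration).

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 48_000  # sampling rate (Hz)

def bandpass_19_23k(x: np.ndarray, order: int = 8) -> np.ndarray:
    """Keep the 19-23 kHz sensing band; the filter order is illustrative."""
    sos = butter(order, [19_000, 23_000], btype="bandpass", fs=FS, output="sos")
    return sosfiltfilt(sos, x)

def remove_direct_path(s: np.ndarray, s_direct: np.ndarray,
                       c: float | None = 0.9) -> np.ndarray:
    """Subtract the pre-recorded direct-path signal S_d from the recording S.

    FacER sets the scaling coefficient c to 0.9; passing c=None instead uses
    the closed-form least-squares minimizer of ||S - c*S_d||.
    """
    if c is None:
        c = float(np.dot(s, s_direct) / np.dot(s_direct, s_direct))
    return s - c * s_direct

# Usage: filtered = bandpass_19_23k(recording)
#        echoes = remove_direct_path(filtered, bandpass_19_23k(direct_recording))
```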
We further analyze the FMCW distance measurement resolution. Given the minimum measurable frequency shift Δf_min = 1/T, we can compute the resolution d_r at which FMCW separates mixed echoes as:

d_r = v0·Δf_min·T / (2B) = v0 / (2B).    (4.8)

Thus, d_r = (340 m/s) / (2 × 4000 s⁻¹) = 4.25 cm. The resolution of the major echo signal corresponding to a single sample is v0/(2Fs) = 3.54 mm, where Fs is the sampling frequency of 48 kHz.
4.3.2: Contrastive Attention-based Domain Adaptation
We use the Short-Time Fourier Transform (STFT) with the Hann window to process the signal, which outputs complex amplitudes, and we take their absolute values to obtain the spectrogram. The generated spectrogram is used as input for our proposed model in Algorithm 6.
[Figure 4.4: Spectrograms (18-20 kHz band) of the six expressions over 50 milliseconds. The first row is from a woman without a mask, the second row is from a man without a mask, and the third row is from the same man with a mask.]
Figure 4.4 shows the spectrograms of the segmented major echo signals of different expressions from two volunteers. We can observe that, for the same person, different expressions yield different spectrograms, and for the same expression, different people have different ways of making expressions. However, we make two observations that affect the modeling performance. First, it is nearly impossible to entirely remove noisy echo signals from obstacles at various distances in some scenarios. We set the desired frequency shift |Δf| to 500 Hz in Figure 4.4. However, there can be minimal multipath changes caused by body motions or objects between the face and the phone, which are hard to filter out since the resolution d_r at which FMCW separates mixed echoes is 4.25 cm. For example, the second row in Figure 4.4 is from a man without a mask, while the third row is from the same man with a mask; we can observe subtle differences between the corresponding spectrograms (e.g., "surprise") in the two rows. Image-based expression recognition models can cope with various backgrounds around the face in an image while extracting the expression-related features; similarly, we aim to design our model to pay attention to the acoustic facial expression features. Second, normal ML models generalize poorly when there is a distribution shift between the training and testing datasets [258].
For example, as shown in the first and second rows of Figure 4.4, different people express the same facial emotion in different ways, as manifested in the differences between the corresponding spectrograms. These differences cause distribution shifts. Therefore, we should align the training and testing data distributions before training a classification model. Moreover, the model should require only a limited amount of enrollment data to improve its usability. Yet, it is challenging to extract useful features that can identify expressions from weak signals with limited acoustic samples. To tackle these challenges, we have developed a contrastive attention-based domain adaptation model aimed at extracting consistent acoustic facial expression features, as depicted in Figure 4.5.
[Figure 4.5: The contrastive attention-based domain adaptation model for acoustic facial expression recognition using the smartphone earpiece speaker.]
The core principle of domain adaptation in this context is to minimize the disparities in feature representations of expressions across different domains. Yet, acquiring a sufficiently diverse and extensive dataset to train deep learning models for robust facial expression feature extraction poses a significant challenge. To overcome this, our initial step involves augmenting the dataset to ensure a broader representation of facial expressions from various populations.
Data Augmentation
Data augmentation, which involves adding modified versions of existing data or generating new synthetic data from existing data, is a typical technique to address the data deficiency problem. First, it can mitigate model overfitting when the original dataset is relatively small. Second, contrastive learning learns discriminative representations by bringing together positive pairs and separating negative pairs; the different augmented views of each sample compose positive pairs with the same labels, which enhances the model's discriminative ability. We propose two methods for acoustic expression data augmentation. First, we shift the acoustic expression signal segment by a fixed distance: in accordance with the inverse-square law of sound propagation, the signal amplitude is scaled by the inverse square of the distance (e.g., 0.3 meters). Second, for the generated spectrograms, we produce different versions of each spectrogram by multiplying its magnitudes by a scalar sampled from a Gaussian distribution with mean 0 and standard deviation 0.1. Considering our scenario, where a user is holding a smartphone, a small device rotation at a fixed position creates negligible changes in the signal due to the omnidirectional nature of speakers and microphones; therefore, we only consider changes in device position for acoustic signal transformation. In this way, we can grow and enrich our acoustic facial expression dataset for contrastive external attention learning.
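The two augmentations can be sketched as follows, under the stated assumptions: inverse-square amplitude scaling for a simulated distance change, and a Gaussian scalar (mean 0, standard deviation 0.1) applied to the spectrogram magnitudes, which we read as a multiplicative perturbation of the form (1 + s).

```python
import numpy as np

def distance_augment(sig: np.ndarray, d_orig: float, d_new: float) -> np.ndarray:
    """Rescale amplitude for a simulated phone-to-face distance change,
    following the inverse-square law of sound propagation."""
    return sig * (d_orig / d_new) ** 2

def spectrogram_augment(spec: np.ndarray, std: float = 0.1,
                        rng: np.random.Generator | None = None) -> np.ndarray:
    """Perturb spectrogram magnitudes with a Gaussian scalar s ~ N(0, std^2);
    scaling by (1 + s) is our assumed reading of the augmentation."""
    rng = rng or np.random.default_rng()
    s = rng.normal(loc=0.0, scale=std)
    return spec * (1.0 + s)

# Example: simulate moving the phone from 0.3 m to 0.4 m away from the face.
# aug_sig = distance_augment(echo_segment, d_orig=0.3, d_new=0.4)
```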
Pseudo Label Generation
Suppose we have a source domain with a fully labeled expression training dataset Ds, and an unlabeled testing dataset Dt as the target domain, which has the same categories as the source domain. The first problem is how to form positive pairs from the same categories of Ds and Dt when the labels in Dt are unknown. Inspired by DeepCluster [24], we generate pseudo labels for unlabeled data in Dt according to the highest category probability. Specifically, we propose to use K-means clustering to generate pseudo labels, re-train the current model, and adjust the noisy pseudo labels iteratively. In each iteration, we first calculate the centroid for each class in the target domain:

c_k^{(i)} = Σ_{x_t∈X_t} δ(z_t)·g_t(x_t) / Σ_{x_t∈X_t} δ(z_t),    (4.9)

where c_k^{(i)} is the centroid for class k at the i-th iteration, and δ(z) = exp(z)/Σ_m exp(z_m) is the softmax function used to output category probabilities. g_t(x_t) outputs the representation of an acoustic expression sample x_t, and z_t = h(g_t(x_t)) is the linearly transformed representation of x_t. The centroids characterize the distribution of categories within the target domain. The pseudo labels are obtained via the nearest centroid:

y_t = arg min_k cos(g_t(x_t), c_k),    (4.10)

where cos(·) is the cosine distance between the representation embedding g_t(x_t) and the centroid embedding c_k. As a result, we can obtain pseudo labels for the target domain dataset and compose the positive and negative pairs using both the source and target domain datasets. Then, we iteratively update the model parameters by minimizing the loss function in Eq. (4.13) described below.
Attention-based Expression Learning
Given the challenge of eliminating all noisy echoes, the model needs to learn to differentiate between echoes related to facial expressions and those stemming from background noise. While self-attention mechanisms are commonly employed to learn robust features, their quadratic complexity and lack of consideration for potential correlations between different samples limit their effectiveness. Each facial expression generates multiple data samples (at a rate of 0.1 seconds per sample for an expression lasting approximately 1 second). By identifying and leveraging the correlations among these samples, the model can be trained to concentrate on the consistent and common features specific to acoustic facial expressions, enhancing its accuracy and reliability in feature extraction. To generate the expression representation z, as shown in Figure 4.5, external attention [83] is applied to focus on important features and implicitly learn correlations among all expression samples. Following the symbols in Eq. (4.2), we first compute the attention map A = Q·M_k^T by multiplying the query matrix Q with the transposed external learnable key matrix M_k ∈ R^{S×d}. Q is projected from a feature map F ∈ R^{N×d}, where N is the number of feature elements, and d and S are hyper-parameters. We normalize the attention map A and then multiply it by another external value matrix M_v. M_k and M_v are implemented as additional linear layers, which are optimized by back-propagation during training on the entire dataset. The output feature map F_out is as follows:

F_out = Norm(Q·M_k^T)·M_v.    (4.11)

In the end, we obtain the refined feature map with linear complexity O(d·S·N), which is suitable for resource-constrained mobile devices.
Feature Alignment for Domain Adaptation
We reasonably assume that samples with the same label are closer to each other in the feature space, while samples from different categories are farther apart, regardless of which domain they come from. With the augmented dataset, pseudo labels, and the attention-based learning model, we design contrastive learning to minimize the domain discrepancy by aligning facial expression features between the training and testing datasets.
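Before turning to the alignment loss, the following minimal PyTorch sketch illustrates the external attention block of Eq. (4.11); the double-normalization step follows the external attention design [83], while the hyper-parameter S and the query projection are illustrative choices.

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Sketch of external attention (Eq. 4.11): queries attend to two small
    learnable external memories M_k and M_v shared across all samples,
    giving linear complexity O(d*S*N) instead of quadratic self-attention."""

    def __init__(self, d: int, s: int = 64):
        super().__init__()
        self.query = nn.Linear(d, d)           # projects F into the query Q
        self.mk = nn.Linear(d, s, bias=False)  # right-multiply by M_k^T
        self.mv = nn.Linear(s, d, bias=False)  # right-multiply by M_v

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, N, d) feature map
        attn = self.mk(self.query(f))                         # (batch, N, S)
        attn = attn.softmax(dim=1)                            # normalize over N
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # double norm
        return self.mv(attn)                                  # (batch, N, d)

# Usage: refined = ExternalAttention(d=128)(features)  # features: (B, N, 128)
```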
For feature alignment, given an acoustic facial expression sample x_s from the source domain and a sample x_t from the target domain, we minimize the distance between x_s and x_t if the two samples are from the same class, while maximizing the distance between samples from different classes. The output of the model is domain-independent expression representations. Following the supervised contrastive loss in Eq. (4.6), we define the domain adaptation contrastive loss as follows:

L_c^t = Σ_{i∈I_t} (−1/|P_s(y_t^i)|) Σ_{p∈P_s(y_t^i)} log [ exp(z_t^i · z_s^p / τ) / Σ_{a∈I_s} exp(z_t^i · z_s^a / τ) ],    (4.12)

where I_t denotes the set of target samples in a batch, I_s is the set of source samples, and P_s(y_t^i) is the set of indices of all positive samples from the source domain. A positive sample is a sample whose label is the same as the pseudo label of the target anchor sample x_t. The domain adaptation contrastive loss aligns the expression representations in the target domain to the source domain. Finally, we define the acoustic expression representation learning loss function as:

arg min_θ L_ce(θ; D_s) + λ·L_c^t(θ; D_s, D_t),    (4.13)

where L_ce is the standard cross-entropy loss applied on D_s, λ balances the two loss terms (0.5 in our experiments), and θ represents the model parameters.
Algorithm 6: Contrastive Attention-based Cross-Domain Acoustic Expression Representation Learning
Input: source dataset D_s, target dataset D_t, epochs E, iterations K per epoch, weight λ, contrastive attention-based model f
Output: source and target representations Z_s and Z_t
1: for e = 1 to E do
2:   Calculate centroids in the target domain using Eq. (4.9)
3:   Update pseudo labels for target data using Eq. (4.10)
4:   for k = 1 to K do
5:     for each batch do
6:       Extract features with f based on external attention in Eq. (4.11)
7:       Compute L_ce for each batch from D_s
8:       Compute L_c^t from D_s and D_t using Eq. (4.12)
9:       Compute L_ce(θ; D_s) + λ·L_c^t(θ; D_s, D_t)
10:    end
11:    Back-propagate and update θ of model f
12:  end
13: end
14: for each batch X_batch do
15:   Generate source domain expression representations Z_s^batch = f(X_s^batch) for D_s
16:   Generate target domain expression representations Z_t^batch = f(X_t^batch) for D_t
17: end
18: return Z_s and Z_t
Overall, the proposed contrastive external attention-based acoustic expression representation learning model is presented in Algorithm 6. After data augmentation, in each epoch, we first generate pseudo labels for target domain acoustic samples (lines 2-3). Then, we minimize the loss and back-propagate to update the model f (lines 4-12). After training, model f can align features for domain adaptation and in turn minimize the distribution shift. We use the trained model f to generate the acoustic expression representations Z_s and Z_t (lines 14-17). Finally, we can train a classifier on the source domain representations Z_s and generate labels for the target domain representations Z_t. The proposed algorithm effectively enhances the expression recognition model performance across different users.
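A minimal PyTorch sketch of the domain adaptation contrastive loss in Eq. (4.12) is given below; it assumes L2-normalized embeddings (so inner products are cosine similarities), averages over target anchors in the batch, and uses an illustrative temperature value.

```python
import torch
import torch.nn.functional as F

def domain_adaptation_contrastive_loss(z_t: torch.Tensor,  # (Nt, d) target embeddings
                                       y_t: torch.Tensor,  # (Nt,) pseudo labels
                                       z_s: torch.Tensor,  # (Ns, d) source embeddings
                                       y_s: torch.Tensor,  # (Ns,) source labels
                                       tau: float = 0.1) -> torch.Tensor:
    """Eq. (4.12): pull each target anchor toward source samples that share
    its pseudo label; all source samples in the batch form the denominator."""
    z_t = F.normalize(z_t, dim=1)
    z_s = F.normalize(z_s, dim=1)
    sim = z_t @ z_s.T / tau                                  # (Nt, Ns)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (y_t.unsqueeze(1) == y_s.unsqueeze(0)).float()     # positive mask
    n_pos = pos.sum(dim=1).clamp(min=1)                      # |P_s(y_t^i)|
    # Anchors with no positive source sample in the batch contribute zero.
    return -(pos * log_prob).sum(dim=1).div(n_pos).mean()
```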
4.4: Evaluation and Discussion
We introduce the data collection process and the software and hardware setup for the acoustic expression recognition implementation. Then, we evaluate the impacts of multiple factors (e.g., location, time, people, mask) on the performance of FacER for expression recognition.
4.4.1: Data Collection
We recruited 20 volunteers (16 males and 4 females) to participate in the acoustic facial expression data collection process. To guarantee the heterogeneity of the collected facial expressions, the recruited volunteers have different skin colors and grew up in different regions of the world, such as Asia, North America, and Europe. Their ages range from 20 to 38. The participants are allowed to wear glasses, a hat, and other accessories during the data collection process. To simulate different real-life scenarios, we collect data at different locations (e.g., office, dining hall, garden) with different background noises. For example, we collect data in an office with people having conversations, attending online meetings, and computer alarms beeping. We show the six facial expressions (anger, disgust, fear, happiness, sadness, and surprise) as examples at the beginning of data collection. Then, a volunteer starts with a poker face and performs the six expressions in their own preferred styles.
[Figure 4.6: A smartphone (with the earpiece speaker and top microphone marked) and a volunteer during data collection.]
Figure 4.6 shows an example of a smartphone and a volunteer during the data collection process. The volunteers are allowed to hold the phones in their most comfortable posture while watching the smartphone screen. Considering the necessity of wearing masks during the COVID-19 pandemic, we also consider expression recognition in scenarios where the participants wear disposable masks. For each expression, we collect about 5 seconds of data when the participants wear masks, and another 5 seconds when they do not. The acoustic facial expression collection process is repeated 10 times for each expression per person, with rest breaks during the data collection process. An independent observer helps record the label of each acoustic expression sample as the ground truth. The whole data collection process takes about a week, conducted intermittently. The collected dataset is over 1 GB in plain-text format. Finally, we obtain 20,535 samples via sliding-window segmentation with a window size of 0.25 s. The design consideration is that if the window size is too small, it can hardly reflect the complete process of facial muscle movements; if it is too large, it easily mixes instantaneous variations of different facial expressions.
4.4.2: Experimental Setup
We use two Android phones, the Samsung Galaxy A21 and the OnePlus 8T, to collect the acoustic signals. We develop the app to emit and collect signals based on LibAS [204] and Chaperone [33], which provide a framework for acoustic sensing applications. LibAS simplifies the necessary signal processing procedures such as synchronization, which determines the start position of the sent signals in the received audio. The sensing signal is a chirp signal modulated from 19-23 kHz, emitted from the earpiece speaker of a smartphone. We employ the SciPy library for signal processing. We use ResNet-18 [88] as the backbone of our contrastive attention-based expression recognition model, which is implemented with PyTorch. We train the model on a TensorEX Ubuntu 20.04 server with 256 GB of DDR4 memory, Intel Xeon Gold 5218R 2.10 GHz CPUs, and RTX A6000 GPUs.
4.4.3: User Dependent Evaluation
The confusion matrices are shown in Figure 4.7. Case 1: training and testing on the mixed dataset; Case 2: training on men's data and testing on women's data; Case 3: training on users with masks and testing on users without masks; and Case 4: training on users without masks and testing on users with masks.
[Figure 4.7: Confusion matrices of the different cases. (a) Case 1: mix testing. (b) Case 2: testing on women's data. (c) Case 3: testing on mask data. (d) Case 4: testing on no-mask data.]
The tick labels are anger, disgust, fear, happiness, sadness, and surprise. The x-axis contains the predicted labels, and the y-axis contains the ground-truth labels.
Case 1a. We first consider a simple scenario where we can collect and label acoustic data from a group of users. Our goal is to recognize the facial expressions of this specific group of users, namely mix testing. We split 80% of the whole dataset into the training set and the remaining 20% into the testing set. We use only the external attention-based ResNet-18 model, without cross-domain adaptation, to implement classification. We use SGD with a momentum of 0.9; the learning rate is 0.1 with a learning rate decay of 0.01. We show the performance of FacER with an accuracy heat map in Figure 4.7a. As we can see, the model can recognize each acoustic facial expression with more than 91% accuracy. Besides, we implement 10-fold cross-validation, and the average testing accuracy is 94.3% with a standard deviation of 0.4%. However, these results are unsurprising because the training and testing datasets belong to the same distribution; as a result, the model can easily fit the dataset.
Case 1b. Next, we consider the impact of environmental factors on the performance of FacER. We test the model performance on data collected in three different locations, i.e., office, dining hall, and garden. We repeat the leave-one-place-out evaluation by training on data from two places and testing on data from the remaining place. We implement Algorithm 6 to extract feature representations, and then use a classifier with two linear layers (hidden sizes 1,024 and 256) to implement expression classification. We compare the performance of FacER with two baselines: (i) ResNet-18 [88], our backbone model trained with the cross-entropy loss without attention; and (ii) XHAR [258], an adversarial training-based domain adaptation method for human activity recognition.
Adversarial training is another major direction for cross-domain adaptation, so we choose it as the baseline. We adapt XHAR to make it applicable to facial expression recognition. We report the average accuracy, precision, recall, and F1 values across the three locations in Figure 4.8, which is a bar plot on polar axes.
[Figure 4.8: Bar plots (on polar axes) of the location factor evaluation, comparing FacER against ResNet and against XHAR on accuracy, precision, recall, and F1.]
FacER achieves 89.8% average accuracy, slightly lower than the mix testing method (94.3%). The standard deviations of the accuracy of FacER, ResNet, and XHAR over the three leave-one-place-out evaluations are 0.017, 0.014, and 0.016, respectively. The results show that the location causes a distribution shift and affects the model performance because of the noise caused by different obstacles. Nevertheless, the model still achieves high performance with an F1 value of 90%, which is higher than the XHAR method (82.6%). The result shows that FacER can learn consistent and robust acoustic facial expression features even in noisy environments with various types of noise.
Case 1c. We consider a more complex scenario related to time variation. The consideration is that the same facial expression may not be consistent over time: for example, people may sometimes show a wide smile and at other times a gentle smile to express happiness. Therefore, to evaluate the time factor for acoustic facial expression recognition, we split the dataset into two subsets according to time. The first subset contains the first 8 rounds of data collection for each facial expression per person and serves as the training dataset; the second subset contains the last 2 rounds and forms the testing dataset. The results show that the time factor affects the model performance. Under the time factor condition, the accuracy of FacER is 90.3%, and the precision and recall are 90.5% and 90.3%, respectively. The model performs slightly worse (4% lower accuracy) than in Case 1a, which shows that changes in the same expression over time indeed degrade the recognition accuracy. Nevertheless, FacER still achieves high performance, which shows the efficacy of the designed model in acoustic expression recognition. We show the precision-recall curve in Figure 4.9.
[Figure 4.9: Precision-recall curves of the time factor evaluation. The per-class areas range from 0.958 to 0.976, and the micro-average area is 0.964.]
Precision is tp/(tp + fp) and recall is tp/(tp + fn), where tp, fp, and fn denote true positives, false positives, and false negatives, respectively. The precision-recall curve depicts the tradeoff between precision and recall at various thresholds. The area in Figure 4.9 is calculated by the average precision (AP): AP = Σ_n (R_n − R_{n−1})·P_n, where P_n and R_n are the precision and recall at the n-th threshold, set automatically by Scikit-plot [152]. For every class, FacER achieves at least a 0.958 AP score. A high area under the precision-recall curve indicates both high precision and high recall, i.e., low false-positive and false-negative rates.
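For reference, the same per-class average precision, AP = Σ_n (R_n − R_{n−1})·P_n, can be computed with scikit-learn; the label and score arrays below are illustrative placeholders.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# One-vs-rest binary labels and classifier scores for a single expression
# class (illustrative values only).
y_true = np.array([1, 0, 1, 1, 0, 1])
y_score = np.array([0.9, 0.3, 0.8, 0.6, 0.4, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)  # AP = sum_n (R_n - R_{n-1}) P_n
print(f"AP = {ap:.3f}")
```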
4.4.4: User Independent Evaluation
Case 2. We now consider a more general scenario where the model is trained and tested on different sets of users. Different people have different face geometries and behaviors; as a result, they may produce distinct feature patterns. To evaluate FacER in the user-independent scenario, we use the dataset from the 16 men as the training dataset and that from the 4 women as the testing dataset. We first show the efficacy of the K-means model in clustering the acoustic facial expression representations. We randomly select 2,048 samples from the training dataset in Figure 4.10a and from the testing dataset in Figure 4.10b, respectively.
[Figure 4.10: K-means clustering on 2,048 sampled representations, processed with t-SNE dimension reduction. (a) Clustering on the training set. (b) Clustering on the testing set.]
As we can see, the K-means model can separate the six types of expressions in both the training and testing sets, which indicates that the learned representations contain consistent and distinctive features. During the iterative training process, the pseudo labels of the testing data can be accurately updated. Therefore, the proposed model can effectively compose positive and negative pairs for contrastive attention learning. The accuracy heatmap is shown in Figure 4.7b, and the average accuracy is 85%, which is lower than in Case 1a because of user expression differences. We notice that men and women have different ways of expressing "disgust", for which FacER only achieves 78.7% accuracy, whereas men and women express "surprise" with high similarity, for which FacER achieves 90.5% accuracy. To evaluate the efficacy of the domain adaptation of FacER, we compare FacER with three baselines, XHAR [258], SonicFace [75], and ResNet [88], as shown in Figure 4.11. XHAR and ResNet are set up as in Case 1b. SonicFace uses both FMCW and pure-tone signals to extract different features; it focuses on tracking the movement of facial components such as the eyes, eyebrows, and cheeks, and it uses 1D convolutions for signal processing. In contrast, we regard the spectrogram of the received echoes as an image capturing the instantaneous static facial expression: different expressions generate different spectrogram features, analogous to different pixels in facial expression images. Therefore, we use 2D convolutions. Due to the key difference in the applied signals, we directly report SonicFace's best performance in Figure 4.11. For SonicFace, the average accuracy of intra-session classification is 78.6%, while the best accuracy of the user-independent model with calibration is 72%. For the user-independent Case 2 scenario, FacER achieves 85% accuracy and an 85.2% F1 value. The adversarial training-based XHAR method achieves only 79.1% accuracy and a 78.7% F1 value. The results show the superior performance of our proposed contrastive external attention-based representation learning method, as it helps extract robust and accurate acoustic facial expression features.
4.4.5: Mask Factor Evaluation
Cases 3 and 4. Masks are obvious obstacles to camera-based facial expression recognition methods, and it became very common for people to wear masks during the COVID-19 pandemic. We investigate the performance of FacER when people wear masks.
As mentioned before, half of the dataset is from volunteers wearing masks (the mask dataset), and the remaining half is from volunteers without masks (the plain dataset). We train FacER on the plain dataset and test it on the mask dataset; the result is shown in Figure 4.7c, and the average accuracy is 87.2%. We then train FacER on the mask dataset and test it on the plain dataset; the result is shown in Figure 4.7d, and the average accuracy is 84.9%. As we can see, the mask has a noticeable impact on the performance of FacER. Another interesting finding is that FacER achieves its highest accuracy of 90.8% on the "fear" expression when training on the plain dataset, while it achieves its lowest accuracy of 79.5% on "fear" when training on the mask dataset. This is because the mask itself generates some echoes back to the microphone, which obscure the echoes from the facial expressions; it is harder for FacER to learn robust "fear" features when people wear masks. Nevertheless, for the "happiness" expression, people make larger facial muscle movements, which can trigger movements of the mask. As a result, FacER achieves 88.7% and 88.5% accuracy when testing on the mask and plain datasets, respectively. Overall, in the different mask factor scenarios, FacER achieves accuracy comparable to Case 2.
[Figure 4.11: Comparison of accuracy, precision, recall, and F1 across FacER, XHAR, SonicFace, and ResNet.]
The results confirm the cross-domain adaptation ability of our proposed contrastive attention-based representation learning method.
4.4.6: Discussion
Data Perspective
FacER is a data-driven acoustic facial expression recognition system based on a contrastive attention-based deep learning model. The quality and quantity of the acoustic sensing data are important to the performance of FacER. We will collect more data from female users and underrepresented groups to reflect the diversity of facial expressions. We will also consider other forms of emotional expression, such as hand gestures. Moreover, we will collect facial expression data separated by longer time gaps, such as one week apart, for the time factor evaluation. We will also try applying generative models to synthesize more data.
Model Perspective
In this chapter, we mainly consider scenarios where the users hold the phone at a distance of 20-50 cm. In the future, we will investigate the performance of FacER when the distance between the user and the phone is longer. We will also study how to determine the start and end of an expression, so that we can effectively segment acoustic signals to extract facial expressions for inference.
System Perspective
We recognize the design differences among phone hardware, especially their impact on the direct path from the speaker to the microphone. We will evaluate the effect of different smartphones on ultrasound signal transmission in the future. Besides, many other real-world factors, such as slight hand shaking, finger swiping on the phone, and the orientation of the phone, should be taken into consideration. We will continue optimizing resource usage on mobile devices and mitigating the security and privacy concerns of mobile sensing data.
4.5: Summary
Facial expression recognition, as a universal way to understand human emotions, is significant not only for content recommendation but also for mental health care.
In this chapter, we designed FacER, an acoustic facial expression recognition system using a smartphone earpiece speaker and microphone. To enhance the accuracy and robustness of FacER, we proposed a contrastive external attention-based representation learning model to learn robust facial expression features across different users in various noisy scenarios. Real-world experiments show that FacER achieves expression recognition with more than 85% accuracy even when users are wearing a mask, a new norm during the COVID-19 pandemic. In the future, we will explore more expression variations, such as emotional hand gestures, to better capture users' emotions in a privacy-preserving manner. We will continue to improve the fairness of the facial expression recognition system, i.e., the training data should be diverse enough that the model is not biased. Moreover, we will optimize resource consumption on mobile devices.
CHAPTER 5: MULTI-GRAINED ACTIVITY SENSING INFORMATION RELEASE
5.1: Introduction
Smartphones and wearable devices have seamlessly integrated into our daily lives, offering personalized services with an array of motion and position sensors. These sensors are adept at capturing three-dimensional movements and orientations, enabling continuous monitoring of user activities. Such sensors are categorized by operating systems as low-risk, allowing apps to access their data without specific permissions. For example, on Android platforms, an application can retrieve sensor data at rates of up to 200 Hz without needing direct approval from the user [50]. Consequently, an application designed for navigation, utilizing these sensors to determine compass direction, can continuously gather activity-sensing data, highlighting the ubiquitous presence of these mobile devices in our daily routines. The data harvested by mobile sensors often exceeds what is necessary for the intended functionality of user applications, leading to apps gaining excessive access to this information. This surplus of access has made mobile applications overly privileged in terms of the data they can collect [201, 237], which contains superfluous or minimizable details that service providers do not necessarily need. Moreover, there is a risk that this data, which is more extensive than needed, could be exchanged or compromised by entities other than the service provider [11]. Two primary concerns arise from this overprivileged access in mobile activity sensing. (i) Metadata-level overprivileged issue: applications can determine sensitive user information such as age or gender. Data on physical activities, enriched with unnecessary personal details, could lead to privacy breaches, as individuals exhibit unique activity patterns based on their characteristics; these distinct patterns could be leveraged by malicious attackers to deduce sensitive user information [100, 101, 141]. (ii) Feature-level overprivileged issue: this issue arises when apps access more detailed activity data than necessary, potentially exposing users' sensitive behaviors. For example, consider an elderly individual using a smartwatch primarily for fall detection purposes [182]. While the device's main function is to ensure safety, it can also monitor detailed hand movements [37, 110, 147, 212]. This capability might allow a nefarious entity to decipher users' banking passwords or other sensitive information by analyzing precise sensor data [44, 184, 187].
In essence, app developers with malicious intent could exploit the excess motion data to uncover private behaviors and information unrelated to the app's intended use [192]. Thus, sensors deemed low-risk might pose significant privacy threats, as the detailed data they collect could inadvertently reveal confidential user information. Many efforts have been made to address overprivileged issues in mobile activity sensing data. For metadata-level overprivileged concerns, adversarial training techniques [21, 115, 133] have been employed to obscure personal attributes within sensor data. When it comes to feature-level issues, a range of filtering strategies [20, 30, 79, 171] has been developed to eliminate unnecessary raw data, leveraging predefined private activity patterns from users. For example, MaskIt [79] offers a mechanism to decide whether a data segment should be disclosed or concealed. Nonetheless, these approaches typically necessitate gathering labeled private data for model training or specific patterns for filtering, which is infeasible in real-world settings. Besides, semantic label-based filtering is generally coarse-grained, opting to either fully expose or suppress data segments, thereby destroying the utility of the data when a segment is associated with a private label. Adding noise to data [87, 150, 219] is another common method for protecting privacy, aiming to distort both the private attributes and activity semantics. However, due to the repetitive patterns in sensor data over time, substantial noise levels must be introduced to prevent data cleaning techniques from effectively removing the noise, which can significantly impair the utility of the data, particularly in recognizing activity patterns [143]. Thus, noise addition approaches tend to be coarse-grained, often impacting the granularity required to finely mitigate overprivileged issues without compromising data utility. To address the aforementioned challenges, we introduce Hippo, a system built on a generative AI model that reconstructs mobile activity sensing data at multiple granularity levels. The primary objective of Hippo is to obscure sensitive attribute details and simplify overly detailed activity features that are not essential. Hippo employs a diffusion model, a type of generative model, to infuse noise into the sensor data through a learning-based diffusion process. This involves a two-step process: first, introducing noise to the data through forward diffusion, and then learning to refine the data by removing the noise during reverse diffusion. These steps are integral to the diffusion model's training phase, ensuring that while the noise disrupts sensitive attributes, the essential activity features are preserved, maintaining the utility of the data. To refine activity feature representation, Hippo does not merely inject noise; instead, it crafts multi-grained data by pruning features that are not intended to be released by users. This approach is grounded in the understanding that human activities exhibit a hierarchical semantic structure, where activities can be broken down into basic actions or aggregated into broader activity categories. For instance, the action of "jumping" comprises basic movements such as "arms swinging", "legs contracting", and "stretching", and can be categorized under the broader term "locomotion" along with actions such as "running".
In its operation, Hippo leverages stacked convolutional autoencoders (CAEs) to derive multi-resolution feature representations from data segments. This capability stems from the convolution and pooling operations and the architecture of deep neural networks (DNNs), which are known to capture data representations at varying levels of granularity. Research has shown that different layers in a DNN encapsulate representations at diverse scales and depths, establishing a connection between these representations and semantic concepts. Following this, Hippo utilizes these multi-resolution representations to produce multi-grained sensing data, employing the diffusion model's generative capabilities. This is particularly crucial for applications, such as travel or pedometer apps, that depend on raw data rather than processed feature representations, necessitating a method to regenerate data that aligns with the original yet omits the sensitive or unnecessary details. Hippo plays a critical role in data reconstruction, focusing on two main aspects: perturbing personal attributes and generating multi-grained data. In terms of personal attribute perturbation, Hippo effectively obscures potential attributes or metadata that are not essential for the functionality of activity sensing applications, thereby enhancing user privacy [21, 133, 140, 173]. When it comes to generating multi-grained data, Hippo allows users or applications to define the desired level of data granularity, which corresponds to the number of stacked convolutional autoencoders (CAEs) used. This approach eliminates the need for predefined privacy patterns or labels during data reconstruction. The efficacy of Hippo is assessed by examining the multi-grained sensor data across various activity recognition and attribute inference models. At the metadata level, Hippo decreases the risk of private attribute exposure by approximately 50% while maintaining activity recognition accuracy akin to that of raw data. At the feature level, the reconstructed multi-grained data exhibit varied levels of activity recognition precision. Furthermore, we demonstrate the utility of the reconstructed data in fall detection and pedometer services, showcasing how users can manipulate data granularity to safeguard their privacy. The main contributions of Hippo are summarized as follows:
• We introduce an innovative approach that utilizes a noise diffusion process to obscure personal attributes effectively. This method maintains the integrity of activity recognition performance while eliminating the need for users to provide specific private data or labels.
• We develop a hierarchical latent feature guidance diffusion model aimed at generating multi-grained data. This design allows for precise control over the dissemination of information, addressing and reducing feature-level overprivileged concerns.
• Through both theoretical analysis and practical experimentation, we demonstrate that the comprehensive model Hippo is equipped to tackle overprivileged issues at both the metadata and feature levels across a range of different contexts.
5.2: Problem Definition
In this section, we formally define and analyze activity sensing utility and privacy from the semantic perspective.
5.2.1: A Motivating Example
We assume a strong attacker who can train various attribute inference and activity recognition models offline to recognize personal attributes and activity labels.
The attacker can implement a re-identification attack [84]. Specifically, the attacker knows the defense methods and has a dataset with private attribute labels. They can then build a sanitized version of the dataset by passing it through the defense methods, and afterwards train a new private attribute identification model on the sanitized dataset. Considering that many apps need raw sensor data instead of activity labels, Hippo reconstructs raw sensor data to filter out sensitive information. Users should have full control over the mobile sensing data they share, but current operating systems do not provide any permission restriction. Currently, for example, Android provides eight types of motion sensor APIs and six types of position sensor APIs to access raw sensor data [51]. To protect activity-sensing information, we assume attackers can only access sensor data after it has been processed by Hippo. The mobile sensors can be designated solely for the secure realm (e.g., ARM TrustZone [48]), as in prior research [65], with their data then processed by Hippo.
5.2.2: Mobile Activity Sensing Data Utility
Human activities inherently exhibit a hierarchical organization, which is essential for identifying and understanding activities at various levels of granularity [130]. For instance, activities such as walking, running, and jumping fall under the broader category of locomotion, as illustrated in Figure 5.1.
[Figure 5.1: The hierarchical semantic nature of activities: "walk", "run", and "jump" belong to "locomotion", while "badminton", "baseball", and "ping-pong" belong to "playing balls"; both categories are children of "activity".]
First, we delineate the concept of granularity within the realm of activity semantics.
Definition 1. Let A represent a set of activity semantics, and let the granularity G be a function that maps each activity a_i ∈ A to a level of detail in L, where L = {l_1, l_2, ..., l_n}, with l_1 representing the most detailed level and l_n the broadest level of categorization. For any two levels l_i and l_j, if i < j, then l_j is more general than l_i. Each activity label a_i is associated with a level of granularity l_k, which indicates the categorization applied to the activity.
Consider a segment of the time series X = (row_1, row_2, ..., row_w), where row_i has multiple values, such as the 9-axis values of inertial sensors, and w is the window length of X. The semantic label of X is obtained by activity recognition models such as DNN models [34, 217] and statistical methods [113, 138]. If the fine-grained activity features in X are removed, then X can only be recognized with coarse-grained activity semantics. The role of Hippo is to generate multi-grained data by removing fine-grained features. For generated data X_r at granularity level l_k, X_r can be recognized as multiple possible fine-grained activities, which compose a hierarchical structure. We define the activity recognition data utility function as U(a, â) = U_â / U_a, where U_a is the activity recognition accuracy score on raw data a and U_â is the accuracy score on reconstructed data â. The accuracy score is computed by an activity recognition model. U(a, â) quantifies the utility of the data after generalizing activity semantics from a to â. For example, the raw sensor data contains the highest-resolution features corresponding to fine-grained activity semantics, so the maximum utility is 1 if we keep the raw data. U(·) monotonically decreases when the semantics of X are generalized from a fine-grained â_{i−1} to a coarse-grained â_i:

0 < U(a, â_i) ≤ U(a, â_{i−1}).    (5.1)
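A toy numeric sketch of the utility function follows; the accuracy values are made up purely to illustrate the monotonic decrease in Eq. (5.1).

```python
def utility(acc_raw: float, acc_recon: float) -> float:
    """U(a, a_hat) = U_a_hat / U_a: recognition accuracy on reconstructed
    data relative to accuracy on raw data."""
    return acc_recon / acc_raw

# Hypothetical accuracies of one recognizer on data reconstructed at
# increasingly coarse granularity levels (illustrative numbers only).
acc_raw = 0.95
for level, acc in enumerate([0.95, 0.90, 0.78], start=1):
    print(f"granularity l{level}: U = {utility(acc_raw, acc):.2f}")
# U decreases monotonically as semantics are generalized, per Eq. (5.1).
```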
The utility of sensing data is intrinsically linked to the specific functions of the applications using it. While it is presumed that applications transparently declare their data needs for their intended functionalities, there exists a risk that the gathered data contains excess information, vulnerable to unauthorized access or misuse through third-party transactions or breaches. Therefore, users can determine the granularity level of the reconstructed data to generate coarse-grained data, so that the mobile sensing data can be recognized as one of ℓ different activities. Alternatively, the user can utilize Hippo to perturb only the metadata linked to the activity sensing data while retaining the raw activity sensing features. To address the issue of data being overprivileged, both service providers and users can cooperate to determine the i-th layer feature guidance for data reconstruction, where i is related to the number of stacked convolutional autoencoders (CAEs) used for multi-resolution extraction. For instance, if the raw sequence data X_arm is indicative of "arm swinging" within bodily gestures, then X_arm comprises high-resolution features capable of identifying the "arm swinging" movement. For a given layer i, Hippo enhances privacy by stripping away the detailed "arm swinging" feature during the reconstruction of data X_r, utilizing the guidance from the i-th layer's features extracted by the stacked CAEs. If i = 0, Hippo only perturbs the metadata linked to the sensing data. If i > 1, Hippo utilizes the i-th layer to guide the data reconstruction process. This i-th layer represents a more abstract view, omitting the detailed "arm swinging" feature. As a result, the reconstructed data X_r is categorized under broader, less specific activity semantics such as "jumping" or "running", without the granularity to distinguish "arm swinging". Similarly, if the reconstruction process generalizes the features corresponding to "jumping", the activity could be classified under the even more generic category of "locomotion". The selection of layer i is tailored to the demands of the particular application, determined by setting the requisite number of CAEs for extracting features at different resolutions. This process ensures that the reconstructed data aligns with the level of detail dictated by the application's requirements, balancing the need for functionality with the imperative of privacy.
5.2.3: Sensing Data Overprivileged Information
In the context of mobile activity sensing, we categorize as overprivileged information any data that is irrelevant to the app's specific task. For such applications, metadata, which includes personal attributes extracted from activity sensing data, is considered ancillary information [21, 133, 140]. This metadata is not essential for the app's primary functions and poses potential risks of privacy breaches. To address this, Hippo is designed to indiscriminately distort this metadata within the raw data, eliminating the need for users or developers to explicitly identify which attributes are private. This approach ensures that sensitive personal details are obscured, reducing the likelihood of unintended privacy exposure. In addition, raw data may contain fine-grained features that are unnecessary for applications. To generalize unnecessary activity semantics, Hippo reconstructs multi-grained data, allowing users to specify the desired granularity of the data.
Given an activity semantic distribution D of the raw sensing data X and an activity distribution G of the reconstructed data X_r, we define the mutual entropy H(G; D) to measure the amount of overprivileged information:

H(G; D) = H(G) − H(D),    (5.2)

where H(·) denotes the Shannon entropy. H(D) and H(G) are related to the possible semantics of X and X_r, which are calculated based on the classification prediction scores. The lower bound of H(D) is 0 if the score is 1. We define a threshold Δ to control the amount of overprivileged information in the reconstructed data. Hippo generalizes the raw data X to increase the entropy H(G), so as to guarantee H(G; D) ≥ Δ, where 0 ≤ Δ < 1. A smaller Δ means more overprivileged information. When Δ = 0, the activity data X_r is kept as the original sensor data X. The lower bound on H(G) is calculated from H(D) and the lower bound Δ on H(G; D).
5.3: System Design
In this section, we first introduce preliminary knowledge and the system model. Then, we provide a detailed design and analysis of Hippo for deep hierarchical data generation.
5.3.1: Background Knowledge
Stacked Convolutional Autoencoder Model
The convolutional autoencoder (CAE) [145] is widely used to extract features in an unsupervised way. The convolution module has local undirected connections and shared parameters between the input layer and the hidden layer. In the encoder part, the input X is encoded into the i-th hidden representation z_i = σ(X·W_i + b), where σ is an activation function, W_i denotes the learned weights, and b is the bias. In the decoder part, the hidden representation z_i is decoded back into the reconstructed input X′ with a deconvolution module, given by X′ = σ(z_i·W̃_i + d), where W̃ is the flipped version of W and d is the bias. The stacked convolutional autoencoder consists of multiple CAEs in which the input of each hidden layer is the output of the previous layer. Each data vector X^m is translated into another vector X^n in a lower-dimensional space, where m > n. The pooling module, such as max pooling or mean pooling, makes feature representations invariant to small changes in the input. Unpooling is used in the decoder part, which uses the locations of activations recorded during the pooling operation to place the reconstructions. The loss function minimizes the mean squared error between the input X and the reconstructed input X′:

MSE = (1/n) Σ_{i=1}^{n} (X_i − X′_i)².    (5.3)

Diffusion Model
The diffusion model (DM) is a form of generative model. Previous studies [90, 191, 193] have demonstrated that DMs produce data of higher quality than existing generative models (e.g., GANs [77]). During the training phase, the DM introduces random noise ε_t ∼ N(0, σ_t²) into the input data using a noise scheduler, known as the forward diffusion process. Specifically, given a real data distribution q(x) and a data sample x_0 drawn from q(x), noisy samples x_1, ..., x_T are generated by adding Gaussian noise to the output of the previous step. This process involves T steps (e.g., 1,000) following a Markov chain. The conditional distribution is:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t)·x_{t−1}, β_t·I),    (5.4)

where β_t ∈ (0, 1) denotes a variance schedule controlling the step size. There are many ways to set up β_t; for example, we can use a linear function to map a beta range (e.g., [0.0001, 0.1]) to a sequence of β values. As the step t increases, the data sample x_0 loses its distinguishable features because of the added noise. Let α_t = 1 − β_t and ᾱ_t = Π_{i=1}^{t} α_i, and let the noise ε(x_t, t) be the merged multiple Gaussians N(0, σ²·I); then

x_t = √ᾱ_t·x_0 + √(1 − ᾱ_t)·ε(x_t, t).    (5.5)

The posterior mean μ_t in the forward process is as follows:

μ_t = (1/√α_t)·(x_t − ((1 − α_t)/√(1 − ᾱ_t))·ε(x_t, t)).    (5.6)
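A minimal sketch of the closed-form forward process in Eq. (5.5), with the linear β schedule described above (the schedule endpoints and T = 1,000 are the example values from the text; the tensor shape is illustrative):

```python
import torch

T = 1000  # diffusion steps (example value from the text)
betas = torch.linspace(1e-4, 0.1, T)       # linear beta schedule
alphas = 1.0 - betas                       # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)  # alpha_bar_t = prod_i alpha_i

def q_sample(x0: torch.Tensor, t: int):
    """Sample x_t directly from x_0 via Eq. (5.5):
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return xt, eps  # eps is the regression target for the UNet in Eq. (5.9)

# Example: noise a batch of 1-second, 9-axis sensor windows at step t = 500.
x0 = torch.randn(8, 9, 100)  # illustrative shape (batch, channels, time)
xt, eps = q_sample(x0, t=500)
```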
Suppose $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$, and the noise $\epsilon(x_t, t)$ is the merged multiple Gaussians $\mathcal{N}(0, \sigma^2 I)$; then
$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon(x_t, t)$.  (5.5)
The posterior mean $\mu_t$ in the forward process is as follows:
$\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon(x_t, t)\right)$.  (5.6)

Then, the diffusion model $p_\theta$ learns to remove noise from $x_t$, which is called the reverse diffusion process. Following the existing work [53, 90, 193], we use the UNet [176] as the backbone architecture to learn to remove noise. The input of the UNet is a noised $x_t$, and the output is the added noise $\epsilon$ or the clean data $x_0$. The UNet includes a stack of downsampling convolutions followed by upsampling convolutions. The advantage is that the UNet can progressively lower the feature map resolution and then increase the resolution in the diffusion process. Besides, the UNet contains skip connections to add details across layers with the same spatial size. Specifically, $p_\theta$ learns to approximate the conditional probabilities $p(x_{t-1}|x_t)$:
$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$,  (5.7)
where $\Sigma_\theta(x_t, t)$ is set to the time-dependent constants $\sigma_t^2 I$ [90]. Eq. 5.7 can be reparameterized to predict $\epsilon_t$ from the input $x_t$ as follows:
$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$,  (5.8)
where $\epsilon_\theta(x_t, t)$ is the predicted noise. Finally, by minimizing the differences between the forward process posterior mean $\mu_t$ and the reverse process mean $\mu_\theta$, the loss function can be simplified as
$L = \|\mu_t - \mu_\theta\| = \mathbb{E}_{t\sim[1,T],x_0,\epsilon_t}\left[\|\epsilon_t - \epsilon_\theta(x_t, t)\|^2\right] = \mathbb{E}_{t\sim[1,T],x_0,\epsilon_t}\left[\|\epsilon_t - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t, t)\|^2\right]$.  (5.9)
In the sampling phase, we input random noise as a seed to the diffusion model, and the model can generate new data such as images or sensor data in the reverse diffusion process.
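The simplified objective in Eq. 5.9 reduces to a noise-prediction regression. The following is a minimal sketch of one training step, assuming eps_model is a UNet-style noise predictor $\epsilon_\theta(x_t, t)$ and alpha_bars follows the schedule sketched above.

import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, alpha_bars):
    # Sample a timestep t ~ U[1, T] per example, noise x0 via Eq. 5.5, and
    # regress the predicted noise onto the true noise (Eq. 5.9).
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(B, *([1] * (x0.dim() - 1)))
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return F.mse_loss(eps_model(x_t, t), eps)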
5.3.2: System Model

We focus on a system designed for mobile activity sensing applications, where Hippo operates as middleware on a smart device, as depicted in Figure 5.2. The raw sensor data $X$ is collected from various sensors (e.g., accelerometer, gyroscope) of a mobile device over a defined time period (e.g., 1 second). $X$ encompasses not only the desired information but also overprivileged data, and completely eliminating this data compromises its utility. To address this overprivileged data issue, Hippo reconstructs a data representation $X_r$ at a specific granularity. $X_r$ serves as a less privileged version of $X$, providing applications with the necessary information for their tasks, while restricting access solely to $X_r$ instead of the raw data $X$.

Figure 5.2: The latent feature guidance diffusion model of Hippo, which acts as a middleware between the OS Sensor Manager and Apps. The model comprises multi-resolution latent feature extraction via stacked CAEs and a latent-feature-guidance-based diffusion model for multi-grained data generation.

To address metadata-level overprivileged issues, we employ a well-trained DM to perturb metadata within raw sensing data. Initially, the raw data undergoes the forward diffusion process to introduce noise. Subsequently, during the reverse diffusion process, Hippo employs a learning-based approach to eliminate the noise. However, the residual noise after the noise addition and denoising stages within the DM can perturb the metadata in the sensing data. Nevertheless, the primary semantic information regarding the raw activity is preserved, as it constitutes the principal features of the activity sensing data.

To address feature-level overprivileged issues, we leverage both the CAE model and the DM to produce multi-grained data. Initially, we extract multi-resolution features from a segment of data $X$ using the stacked CAE model. Subsequently, employing a random noise seed and guided by the multi-resolution features, Hippo utilizes a well-trained DM to generate multi-grained data. This generated data retains specific-layer feature information while sacrificing finer-grained features. Consequently, the resulting data encompasses various granularities of activity semantics, as expressed by the multi-resolution feature representations. Users or applications can then specify the desired level of granularity for data reconstruction to alleviate the overprivileged issues. To implement the functionalities described above, we train both the CAEs and the DM, as illustrated in Figure 5.2, utilizing publicly available datasets. The core component of Hippo is the latent feature guidance-based diffusion model, which comprises two primary modules, depicted in Figure 5.2: (i) a self-supervised stacked convolutional autoencoder (CAE) responsible for extracting multi-resolution activity representations without necessitating private labels from users for model training (refer to Section 5.3.3); (ii) a hybrid diffusion model (DM), incorporating both conditional and unconditional aspects, based on multi-resolution representations to generate multi-grained activity sensing data (refer to Section 5.3.4).

5.3.3: Multi-resolution Feature Extraction

We propose the utilization of a self-supervised stacked CAE to extract multi-resolution features, which serve as guidance for the generation of multi-grained data. The convolution module (CM) has been widely employed for feature extraction from time series data [74, 102, 243]. For instance, Lasagna [130] was introduced to produce hierarchical activity representations using CM. CM offers dual benefits for the utility of extracted features. Firstly, it can leverage local connectivity by focusing on multiple receptive fields of the input, which is particularly suitable for time series data, where local patterns often hold more informative value than individual data points [243]. Secondly, CM implements weight sharing, meaning that the same weights are applied across different positions of the input. Each filter can capture patterns across various dimensions of the time series, ensuring that the model can recognize a pattern regardless of its position in the time series. Consequently, CM proves effective in extracting robust features for activity recognition.

We organize raw sensor data into a structured tensor for input into the stacked CAEs, as depicted in Figure 5.2. Each CAE takes either the raw data $X$ or the output feature $z_i$ from the preceding CAE as input. The output of a CAE is denoted as feature $z_j$, where $i$ and $j$ represent the indices of stacked CAEs, with $0 < i < j$. For instance, the input of CAE-1 is the raw sensor data, and its output is the layer-1 feature; CAE-2 then takes the layer-1 feature as input. Normally, $z_j$ resides in a lower-dimensional space compared to $z_i$.
However, to maintain the same spatial dimensionality between $z_i$ and $z_j$, we pad $z_j$ with zeros to match the dimension of $z_i$. The encoder of each CAE comprises a convolutional layer and a pooling layer. During the encoding process, these layers learn multi-resolution activity features. Conversely, the decoder consists of an unpooling layer and a deconvolution layer; throughout decoding, these layers are employed to reconstruct the input data within each CAE. We train each CAE independently, aiming to minimize the mean square error between the input data and the reconstructed data, as expressed in Equation 5.3. Following the training process, we can utilize the encoders of the stacked CAEs to extract multi-resolution activity features, as sketched below.
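The following is a minimal PyTorch sketch of one CAE in the stack and the layer-wise training just described, assuming 1-D convolutions over a (batch, channels, time) IMU tensor; the layer widths and kernel size are illustrative.

import torch
import torch.nn as nn

class CAE(nn.Module):
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, hid_ch, kernel_size=5, padding=2)
        self.pool = nn.MaxPool1d(2, return_indices=True)   # records activation locations
        self.unpool = nn.MaxUnpool1d(2)                     # places reconstructions back
        self.deconv = nn.ConvTranspose1d(hid_ch, in_ch, kernel_size=5, padding=2)
        self.act = nn.ReLU()

    def encode(self, x):
        z, idx = self.pool(self.act(self.conv(x)))          # z_i = sigma(x * W_i + b)
        return z, idx

    def forward(self, x):
        z, idx = self.encode(x)
        return self.deconv(self.act(self.unpool(z, idx)))   # reconstructed input x'

def train_stack(caes, x, steps=100, lr=1e-3):
    # Train each CAE independently on the previous layer's features,
    # minimizing the MSE of Eq. 5.3.
    feats = x
    for cae in caes:
        opt = torch.optim.Adam(cae.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            nn.functional.mse_loss(cae(feats), feats).backward()
            opt.step()
        with torch.no_grad():
            feats, _ = cae.encode(feats)   # input to the next CAE in the stack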
We select three activities from the DSA dataset [14]: running, jumping, and playing basketball.

Figure 5.3: Visualization comparison of multi-resolution features extracted by stacked CAEs w.r.t. three activities (layer 1-3 features of standing, running, jumping, and playing basketball).

These activities exhibit a hierarchical semantic structure, where playing basketball encompasses running and jumping. As illustrated in Figure 5.3, the layer-1 features of all three activities are discernible from each other. This disparity arises because the CAE is capable of extracting distinct feature components from the raw data. As we progress to layer 2, the features of playing basketball and jumping start to resemble each other. This convergence can be attributed to the pooling and convolution modules learning a low-resolution version of the layer-1 features. Consequently, some intricate features specific to playing basketball and jumping become obscured, rendering the two activities indistinguishable. Similarly, the layer-3 features of running, jumping, and playing basketball exhibit similarity, as depicted in Figure 5.3. The visualization outcomes indicate that stacked CAEs can filter out fine-grained features as the depth increases. Therefore, by employing varying numbers of CAEs, Hippo can extract multi-resolution features encompassing different levels of information.

The generation of multi-resolution features is facilitated by the convolution and pooling layers within the self-supervised CAE. The convolutional layer is responsible for learning various aspects of features from the input data. Essentially, the input data is partitioned into multiple receptive fields for convolution. By employing different kernels for each receptive field, the CAE learns features from distinct parts of the input, and intricate feature maps of the input are obtained as a result. Previous studies in visual analytics have illustrated that the convolutional layer can discern different aspects of data, such as detecting edges and lines [17, 64, 67]. Moreover, deep convolutional neural networks have been shown to capture multi-grained feature representations [16, 174, 188]. Within each CAE, the pooling layer plays a crucial role in filtering out feature details by reducing the spatial resolution of feature maps. Essentially acting as a bottleneck, the pooling layer compresses the features extracted from the input. This pooling operation limits the amount of information that can traverse the neural network, compelling the model to focus on learning the most pertinent information while discarding irrelevant details.

Various resolutions of feature representations convey different levels of activity semantic information. For instance, in Figure 5.2, the raw data feature $z_0$ is considered semantic granularity-0, encompassing the most detailed motion features. The layer-1 feature $z_1$ corresponds to semantic granularity-1, containing coarser motion information achieved by blurring features present in $z_0$. The convolutional and pooling operations contribute to $z_1$ losing detailed features compared to $z_0$. Furthermore, as the number of stacked CAEs increases, the information encapsulated in the upper-layer features $z_i$ diminishes.

Lemma 1: Given two stacked CAEs whose outputs are layer features $z_i$ and $z_j$ respectively, $z_j$ is a low-resolution version of $z_i$ such that $H(z_j) < H(z_i)$ when $i < j$.

We use the H-score $H$ proposed in [95, 232] to quantify the informativeness of extracted features. $H$ is computed in an information-theoretic framework to show that a DNN extracts informative features for learning tasks. The $H(\cdot)$ function is defined as follows:
$H(s) = \frac{1}{2}\,\mathbb{E}_{P_Y}\left[\left\|\mathbb{E}_{P_{X|Y}}\left[\Lambda_{\tilde{s}}^{-1/2}\,\tilde{s}(X) \mid Y\right]\right\|^2\right]$.  (5.10)
The $H$ function takes as input the feature embedding matrix and label matrix of data samples. It quantifies the quality of features generated at any layer of the DNN. A higher value of $H$ indicates that more information in the feature embeddings is relevant to the label. We compute the H-score on the feature embeddings generated from the stacked CAEs using the DSA dataset [14]. The H-scores for layer-1 features, layer-2 features, and layer-3 features are 7.717, 7.518, and 7.423, respectively. Thus, using this framework, we demonstrate that stacked CAEs can extract multi-resolution features containing varying amounts of information. In Section 5.3.4, we devise the data generation model conditioned on $z_i$, which generates data containing multi-grained features.
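For reference, the following sketch computes an empirical H-score in the spirit of Eq. 5.10, using the common trace form H = 1/2 * tr(cov(f)^{-1} cov(E[f|Y])) on zero-centered feature embeddings; this is our illustrative formulation rather than the exact estimator of [95, 232].

import numpy as np

def h_score(feats, labels):
    # feats: (n_samples, dim) feature embeddings; labels: (n_samples,).
    f = feats - feats.mean(axis=0)
    cov_f = np.cov(f, rowvar=False) + 1e-6 * np.eye(f.shape[1])  # regularized
    classes = np.unique(labels)
    means = np.stack([f[labels == y].mean(axis=0) for y in classes])
    priors = np.array([np.mean(labels == y) for y in classes])
    cov_g = (means * priors[:, None]).T @ means   # cov of E[f | Y] (zero mean)
    return 0.5 * np.trace(np.linalg.solve(cov_f, cov_g))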
5.3.4: Multi-grained Sensing Data Generation

Considering that many apps rely on raw data formats, Hippo is designed to generate multi-grained data to mitigate overprivileged information. The multi-resolution features extracted in Section 5.3.3 serve as the basis for multi-grained data generation. Suppose an app necessitates the $i$-th granularity of sensing data, corresponding to the $i$-th layer feature $z_i$ in the multi-resolution feature extraction model. Leveraging the well-trained latent feature guidance diffusion model, Hippo generates the $i$-th granularity data from random noise conditioned on a specific resolution feature $z_i$. During the offline training stage, the raw data $x$ is encoded into a bipartite latent representation $(x_T, z_i)$. The diffusion latent $x_T$ is produced via the forward diffusion process, representing the noisy version of $x$ after multiple noise-addition steps; $x_T$ encodes the residual information essential for reconstructing $x$ during the reverse diffusion process. Meanwhile, the multi-resolution feature $z_i$ is generated by the stacked CAE, encapsulating multi-grained feature representations. We design the latent feature guidance diffusion model to generate hierarchical data samples from random noise conditioned on $z_i$.

In the forward diffusion process, a key observation is that the learning objective $L$ in Eq. 5.9 relies on the marginal $q(x_t|x_0)$ instead of the joint $q(x_{1:T}|x_0)$. The forward process can be derived from Bayes' rule [193]:
$q_\sigma(x_t|x_{t-1}, x_0) = \frac{q_\sigma(x_{t-1}|x_t, x_0)\,q_\sigma(x_t|x_0)}{q_\sigma(x_{t-1}|x_0)}$.  (5.11)
Indeed, each $x_t$ in the generation process depends not only on $x_{t-1}$ but also on $x_0$, rendering the process non-Markovian. Consequently, the term $L$ in Equation 5.9 is not contingent on a specific forward procedure. Therefore, during the generation process, if $q(x_t|x_0)$ remains fixed, we have the flexibility to sample only a portion of the forward steps. This approach allows us to reduce the number of steps $T$ and enhance data generation efficiency. To guarantee that $q(x_t|x_{t-1})$ satisfies the distribution in Eq. 5.4, $q_\sigma(x_{t-1}|x_t, x_0)$ should satisfy the Gaussian function [193]:
$q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}(\mu, \sigma_t^2 I)$,  (5.12)
$x_0 = \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t|z_i, t)}{\sqrt{\alpha_t}}$,  (5.13)
$\mu = \sqrt{\alpha_{t-1}}\,x_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\epsilon_\theta(x_t|z_i, t)$.  (5.14)
Thus, given a subset $\{x_{\tau_1}, \cdots, x_{\tau_S}\}$, where $\tau$ is a sub-sequence of $[1, \cdots, t, \cdots, T]$ of length $S$, the model can be trained with arbitrary forward steps. We follow the setup of the variance $\sigma$ in [193]:
$\sigma_{\tau_i} = \sqrt{\frac{1-\alpha_{\tau_{i-1}}}{1-\alpha_{\tau_i}}}\,\sqrt{1-\alpha_{\tau_i}/\alpha_{\tau_{i-1}}}$.  (5.15)

In the reverse diffusion process, Hippo removes the noise in data conditioned on the feature $z_i$. Consider that the generative model is formalized as $P(x|z) = \frac{P(x,z)}{P(z)}$, and $P(z)$ is a deterministic function of $x$ in the multi-resolution feature extraction model, so we only need to estimate $P(x, z)$. Thus, we propose to project $z_i$ onto the existing diffusion model latent $x_T$:
$\tilde{\epsilon}_\theta(x_T|z_i, T) = \epsilon_\theta(x_T|\emptyset, T) + s \cdot (\epsilon_\theta(x_T|z_i, T) - \epsilon_\theta(x_T|\emptyset, T))$,  (5.16)
where $\tilde{\epsilon}_\theta$ is the guided noise estimate and $s$ is the guidance scale, which is empirically set to 7.5 in our experiments. With Eq. 5.16, we can simultaneously train an unconditional diffusion model $p_\theta(x_T)$ and a conditional diffusion model $p_\theta(x_T|z_i)$. For the unconditional model part, we replace $z_i$ with a null vector with some probability (0.3 in our experiments); for the conditional model part, the denoising process is conditioned on $z_i$. Thus, we can guide the data generation toward $z_i$ during the training process. It has been proven that the training objective in Eq. 5.9 is still applicable in the reverse process [193]. Finally, we train the diffusion model to approximate the conditioned probability distribution in the reverse process. Given the initial input noise $\epsilon_t \sim \mathcal{N}(0, I)$, we sample $x_{t-1} \sim p_\theta(x_{t-1}|x_t)$ as follows:
$x_{t-1} = \mu + \sigma_t\,\epsilon_t$,  (5.17)
where $\mu$ is given in Eq. 5.14. With the same model for the predicted noise $\epsilon_\theta$, we can choose different $\sigma_t$ without retraining the model to generate different samples. In the sampling phase, the reverse process in the DM can generate multiple data samples corresponding to a given latent feature representation $z_i$. Particularly, when $\sigma_t \neq 0$, the reverse process is non-deterministic: larger values of $\sigma_t$ introduce higher stochasticity, resulting in more variations of generated data guided by the same latent representation $z_i$.
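Putting Eqs. 5.13-5.17 together, the following is a minimal sketch of latent-feature-guided sampling over a sub-sequence of steps, assuming eps_model(x, z, t) is the trained noise predictor and that passing z=None yields the unconditional estimate; here alphas plays the role of the (cumulative) alpha in Eqs. 5.13-5.15.

import torch

@torch.no_grad()
def guided_sample(eps_model, z_i, alphas, taus, shape, s=7.5):
    x = torch.randn(shape)                       # random noise seed x_T
    for k in range(len(taus) - 1, 0, -1):
        t, t_prev = taus[k], taus[k - 1]
        # Classifier-free guidance, Eq. 5.16.
        eps_c = eps_model(x, z_i, t)
        eps_u = eps_model(x, None, t)
        eps = eps_u + s * (eps_c - eps_u)
        # Variance of the sub-sequence step, Eq. 5.15.
        sigma = torch.sqrt((1 - alphas[t_prev]) / (1 - alphas[t])) \
              * torch.sqrt(1 - alphas[t] / alphas[t_prev])
        # Predicted clean sample (Eq. 5.13) and posterior mean (Eq. 5.14).
        x0 = (x - torch.sqrt(1 - alphas[t]) * eps) / torch.sqrt(alphas[t])
        mu = torch.sqrt(alphas[t_prev]) * x0 \
           + torch.sqrt(1 - alphas[t_prev] - sigma ** 2) * eps
        x = mu + sigma * torch.randn_like(x)     # stochastic update, Eq. 5.17
    return x

Setting sigma to zero makes the update deterministic, while larger sigma yields more diverse samples for the same $z_i$.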
5.4: Evaluation and Discussion

5.4.1: Experimental Setup

We implement the latent feature guidance diffusion model of Hippo using PyTorch [162]. The model is trained on a TensorEX Ubuntu 20.04 server equipped with 256 GB DDR4 memory, Intel(R) Xeon(R) Gold 5218R 2.10 GHz CPUs, and RTX A6000 GPUs. During the training of the diffusion model, we randomly set the number of timesteps from 100 to 1000, as the model is designed to be trained with any number of forward steps, as described in Section 5.3.4. We conduct training for 100 epochs. In the multi-grained data generation phase, we empirically set the number of inference time steps to 100. The guidance scale for hierarchical features is set to 2. For sensitive attribute and activity recognition, we employ 5-fold cross-domain validation to mitigate overfitting and ensure more consistent and fair evaluations. This approach also considers the size of the training and testing datasets.

5.4.2: Datasets

We evaluate Hippo on four public datasets: the HARBox dataset [161], the UCI HAR dataset [8], the MotionSense dataset [142], and the Daily and Sports Activities (DSA) dataset [14].

Dataset 1. The HARBox [161] dataset includes 9-axis IMU sensor data collected from 121 participants (17-55 years old) with 77 smartphone models in a crowdsourcing manner. The five activities of daily life are walking, hopping, calling, waving, and typing. The collected IMU data is resampled at 50 Hz with a sliding window of 2 seconds; thus, each data sample contains a 900-dimensional feature. Considering the large variety of users and devices, the dataset is heterogeneous enough to evaluate the robustness of different models.

Dataset 2. The MotionSense [142] dataset includes time-series data generated by the accelerometer and gyroscope of an iPhone 6s placed in the front pocket of each participant. There are 24 participants, each with a unique combination of height, weight, age, and gender. They complete 6 activities in 15 trials: walking, jogging, sitting, standing, going downstairs, and going upstairs. The data is collected at a 50 Hz sampling rate by SensingKit [104] using the Core Motion framework [47]. Note that, to be consistent with the baselines, we use four activities for comparison: downstairs, upstairs, walking, and jogging.

Dataset 3. The UCI Human Activity Recognition [8] dataset includes 30 participants performing activities while carrying a waist-mounted Samsung Galaxy S II. There are six daily living activities: walking, upstairs, downstairs, sitting, standing, and laying. The accelerometer and gyroscope data are collected at a 50 Hz sampling rate. The sensing data is pre-processed by applying noise filters and segmented with a sliding window of 2.56 seconds and 50% overlap.

Dataset 4. The Daily and Sports Activities [14] dataset includes 19 types of daily and sports activities. Each of the 19 activities is performed for 5 minutes by four female and four male participants. The 19 activities include walking, exercising on a stepper, exercising on a cross-trainer, rowing, jumping, playing basketball, and so on. The participants perform the activities in a flat outdoor area. The sampling rate is 25 Hz, and the 5-minute signals are divided into 5-second segments.

Table 5.1: Metadata information perturbation methods comparison using activity recognition and gender recognition models on the MotionSense dataset.
Average Classification Accuracy on 5-fold Cross-Validation (%)

Method                      Downstairs        Upstairs          Walking           Jogging           Overall
                            Activity  Gender  Activity  Gender  Activity  Gender  Activity  Gender  Activity  Gender
Raw Data                    95.6      87.7    95.4      90.9    98.5      95.1    97.3      95.6    96.7      92.3
Differential Privacy [62]   82.5      78.3    83.3      79.1    84.6      80.2    84.2      79.8    83.7      79.4
AAE [141]                   90.7      59.8    93.2      57.6    91.8      52.9    93.7      52.5    92.4      55.7
TIPRDC [115]                91.5      77.4    92.8      78.6    97.5      79.4    95.2      78.8    94.2      78.6
ObscureNet [84]             87.5      72.5    92.8      81.3    94.7      83.2    96.9      89.5    92.9      81.6
InfoCensor [252]            91.1      76.8    92.4      77.1    96.6      78.2    94.3      77.3    93.6      77.4
Hippo-Granu.0               92.4      54.2    92.8      54.8    97.6      51.6    96.5      50.7    94.8      52.8
Hippo-Granu.1               83.8      52.1    84.2      53.5    86.3      53.5    84.2      51.8    84.6      52.7
Hippo-Granu.2               66.2      53.5    65.7      52.8    63.8      54.7    67.2      51.8    65.7      53.2
Hippo-Granu.3               53.8      51.8    55.4      52.1    54.7      53.5    53.6      52.5    54.4      52.5

5.4.3: Metadata-level Information Evaluation

In this section, we evaluate the metadata-level overprivileged information protection. The metadata (i.e., personal attributes) is auxiliary side-channel information in activity sensing data. We consider a scenario where an app requires the most fine-grained activity features, and the side-channel information is unnecessary but sensitive. The data utility is related to the fine-grained features in raw sensor data, and the personal attributes are overprivileged information. In this case, Hippo only perturbs the attribute information while keeping the fine-grained activity features.

Baseline Methods

For metadata-level privacy protection, we compare Hippo with five methods that garble personal attributes. We use the source code provided by the respective authors to implement the experiments. For semantic-level privacy protection, the existing methods are coarse-grained because they require users to provide private activity patterns, and then crudely remove or replace the segment of data that contains sensitive activity information. To the best of our knowledge, there is no existing work that considers fine-grained activity semantic-level privacy protection.

Baseline 1. The DP [62] method injects Laplace noise with various privacy budgets {0.1, 0.3, 0.7, 0.9}, and we report the average accuracy under the four parameters. The noisy data is used for activity recognition.

Baseline 2. The Anonymisation Autoencoder (AAE) [141] is a deep autoencoder model with a loss function that minimizes the user-identity information while preserving the original activity information. The synthesized data from the decoder is used for activity recognition.

Baseline 3. TIPRDC [115] is a framework that hides private information through adversarial training while retaining the original information by using a neural network-based mutual information estimator. The sanitized feature representations are used for activity recognition.

Baseline 4. ObscureNet [84] is an encoder-decoder model that conceals private attributes in time-series data. It learns latent features invariant to specific private attributes based on adversarial information factorization. The synthesized data is used for activity recognition.

Baseline 5. InfoCensor [252] is an information-theoretic framework that minimizes the mutual information between the representations and the attribute. The feature representations are randomized by parameterized Gaussian mechanisms and used for activity recognition.

Table 5.1 presents the performance of various models for attribute protection, focusing on the gender attribute.
These models are trained to sanitize the gender attribute, utilizing two convolutional neural network models on the MotionSense dataset with 5-fold cross-validation for gender and activity recognition. When employing noise-based differential privacy (DP), increasing the strength of the added noise to better preserve sensitive information significantly compromises data utility. For instance, increasing noise to the extent that gender recognition accuracy decreases from 92.3% to 79.4% also results in a drop in activity recognition accuracy from 96.7% to 83.7%. In contrast, Hippo aims to destroy sensitive attributes in the reconstructed data (Granu.0) while retaining as much of the original activity semantic information as possible. For instance, gender recognition accuracy on the Granu.0 data is approximately 52.8%, akin to random guessing, whereas activity recognition accuracy on Granu.0 data is approximately 94.8%, comparable to model performance on raw data. Hippo maintains high data utility owing to the learning-based denoising process during reverse diffusion.

Several methods, such as AAE, TIPRDC, ObscureNet, and InfoCensor, utilize information minimization in adversarial training. For instance, ObscureNet alters private attributes in latent features before synthesizing new data. Consequently, it can deceive gender inference models to an accuracy of approximately 20%. However, for a binary classification task, reversing the output is straightforward for a knowledgeable attacker familiar with ObscureNet's mechanism. Therefore, we report the reversed accuracy in Table 5.1. For instance, with the ObscureNet method, gender inference accuracy can be reversed to achieve approximately 80% accuracy. In contrast, Hippo reduces the gender inference probability to around 50%, leading to the highest entropy (uncertainty) for attackers.

Moreover, we evaluate gender information in multi-grained data. Multi-grained data generation is based on random noise conditioned on multi-resolution feature representations. As indicated in Table 5.1, gender information is perturbed in the multi-grained data, reducing gender recognition to a random guess. In summary, for metadata-level overprivileged information protection, Hippo preserves better data utility than the existing methods while perturbing overprivileged attributes.

5.4.4: Metadata-level Perturbation Analysis

In our scenario, suppose we have a dataset of sensing data from multiple users, denoted as $\mathcal{X}$. Without loss of generality, each user has one segment of data $x$. Privacy is related to personal attribute information. Recall the definition of the differential privacy (DP) mechanism [13, 62]. Suppose we have two neighboring datasets $\mathcal{X}$ and $\mathcal{X}'$, where $\mathcal{X}$ differs from $\mathcal{X}'$ by only one person. The perturbation mechanism $M$ satisfies $(\epsilon, \delta)$-DP if it holds that
$\Pr[M(x) \in O] \leq e^{\epsilon} \cdot \Pr[M(x') \in O] + \delta$,  (5.18)
where $x' \in \mathcal{X}'$ is the perturbed version of $x \in \mathcal{X}$ via the local perturbation $M$, so $x$ and $x'$ are neighboring inputs. $O$ is any subset of outputs, and $\epsilon$ is a metric of privacy loss (note that $\epsilon$ in DP is different from $\epsilon(x_t, t)$ in diffusion models): the higher the $\epsilon$, the larger the privacy loss. $\delta$ is the probability of violating the DP constraint. The local perturbation mechanism $M$ is defined as $M(x) = x + Z$, where $Z$ is Gaussian noise. Suppose we have a query function $f : \mathcal{X} \rightarrow \mathbb{R}^d$, which can be a classification model.
We define the sensitivity of the query function $f$ as $s = \max(\|f(x) - f(x')\|_1)$, which is the maximum impact that one individual can have on the result of a query [225]. For example, for the gender attribute, $f$ is a gender classification model, and $s$ is set to 1 considering the maximum difference of model prediction probability. For the metadata-level perturbation, $Z$ is the residual noise of the forward-diffusion and reverse-diffusion process. Recall that in the forward diffusion steps, Hippo adds noise to $x$ gradually as in Eq. 5.4 and Eq. 5.5. Suppose $M_t(r) = r_t + \epsilon(x_t, t)$, where $r_t = \sqrt{\frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}}\,x_0$ and $\epsilon(x_t, t)$ is the merged multiple Gaussians $\mathcal{N}(0, \sum_{t=1}^{T}\sigma_t^2)$ given $T$ steps. For each forward step $t$, we add independent Gaussian noise $\mathcal{N}(0, \sigma_t^2)$ to the input of that step, and we can choose the parameters of the additive Gaussian noise. If the added noise satisfies $\sigma_t = s\sqrt{2\log(1.25/\delta)}/\epsilon$ in the forward diffusion step, then each step in the forward diffusion process satisfies $(\epsilon_t, \delta_t)$-DP. For example, in the experiments, we set the initial noise $\sigma$ in the first step to 11.5. DP satisfies the composability property [62]. Thus, given a sequential composition of $T$ noise-adding steps, $M_t(r)$ still satisfies differential privacy: in the final forward diffusion step, the composition $M_t(r)$ is $(\sum_{t=1}^{T}\epsilon_t, \sum_{t=1}^{T}\delta_t)$-differentially private. Recall that $\sqrt{1-\bar{\alpha}_t}$ in Eq. 5.5 is constant, and DP is robust to post-processing [62]. Then the output $x_T = \sqrt{1-\bar{\alpha}_t}\,M_t(r)$ of the forward diffusion process still satisfies differential privacy.

In the reverse diffusion process, Hippo learns to denoise by predicting the noise added in the forward steps with a learning function $F$ following Eq. 5.8. The input of $F$ is the final result $x_T$ of the forward process. The output of $F$ is $x'$, which retains the original semantic features of $x$ while the metadata information is perturbed:
$x' = F(\sqrt{1-\bar{\alpha}_t}\,M_t(r))$.  (5.19)
Considering that DP is robust to post-processing by $F$, the whole diffusion process in Hippo is still differentially private. In Section 5.4.3, we have empirically shown the effectiveness of Hippo for personal attribute perturbation and activity feature preservation. However, due to the complex deep learning mechanism in the reverse denoising diffusion process, obtaining the overall privacy guarantee of $\epsilon$ and $\delta$ for the whole diffusion process is open for future research.
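The per-step noise calibration and basic sequential composition described above can be summarized in a short sketch; the per-step budgets below are illustrative, and the sensitivity s = 1 follows the gender-classifier example.

import math

def gaussian_sigma(eps_t, delta_t, s=1.0):
    # Gaussian mechanism: sigma_t = s * sqrt(2 * log(1.25 / delta_t)) / eps_t.
    return s * math.sqrt(2.0 * math.log(1.25 / delta_t)) / eps_t

def composed_budget(eps_t, delta_t, T):
    # Basic sequential composition over T forward steps: (T*eps_t, T*delta_t).
    return T * eps_t, T * delta_t

print(gaussian_sigma(0.5, 1e-5))        # per-step noise scale
print(composed_budget(0.5, 1e-5, 1000)) # overall (epsilon, delta) budget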
5.4.5: Feature-level Information Evaluation

In this section, we address apps encountering feature-level overprivileged issues. To mitigate this, Hippo generates multi-grained data for reducing activity feature-level overprivileged information. This entails utilizing different resolutions of features for multi-grained data reconstruction. Consequently, apps may experience reduced accuracy when recognizing fine-grained activity semantics using coarse-grained data.

Hierarchical Data Visualization

While it is challenging to interpret activity semantics from time series graphs visually, we can gain insight into the hierarchical sensor data generated by Hippo by examining the different granularities of data in the frequency domain, as depicted in Figure 5.4.

Figure 5.4: Spectrograms of multi-grained data for the accelerometer w.r.t. four activities (raw data and granularity 1-3 data for calling, hopping, walking, and waving). Red means high energy, blue means low energy, and white is in the middle. The x-axis is time (seconds), and the y-axis is frequency (Hz).

Specifically, we showcase four arm-intensive activities sourced from the HARBox dataset: calling, hopping, walking, and waving. (i) "Calling" involves a volunteer holding the device in a calling motion. (ii) "Hopping" entails a volunteer holding the phone and jumping on one foot. (iii) "Walking" occurs when a volunteer holds the phone and walks. (iv) "Waving" involves a volunteer holding the phone in hand and waving it. These visualizations offer insights into the frequency features of the sensor data for each activity, allowing for a deeper understanding of the hierarchical structure of the reconstructed data.

The multi-grained data is generated based on the guidance of different resolutions of feature representations extracted from raw data. Figure 5.4 provides visual representations of the spectrograms of multi-grained data generated by Hippo. In raw data, signal strengths vary across different frequencies. Conversely, in multi-grained data, signal strengths stabilize and become similar across various frequencies, indicating that detailed features are filtered out in coarse-grained reconstructed data. For instance, as depicted in Figure 5.4, the fine-grained features (in blue) of raw waving data gradually generalize across different granularities of data. Notably, the granularity-3 data of waving becomes unrecognizable because detailed features are blurred in the coarse-grained (granularity-3) data. Similarly, the granularity-3 data of walking and calling also becomes unrecognizable. Therefore, if calling behavior is considered private, the user can opt to release granularity-3 data of calling to make the private mobile sensing data indistinguishable from other motion data. Overall, the visualization illustrates that Hippo can generate multi-grained mobile sensing data containing varying levels of feature information.

Clustering

We implement K-means clustering based on feature embeddings learned via contrastive learning. With contrastive learning, we can learn a feature representation space where samples with different labels spread out and samples with the same labels remain close together. We use the following contrastive loss function:
$L_c(x_i, x_j, \theta) = \mathbb{1}[y_i = y_j]\,\|f_\theta(x_i) - f_\theta(x_j)\|_2^2 + \mathbb{1}[y_i \neq y_j]\,\max(0,\ \epsilon - \|f_\theta(x_i) - f_\theta(x_j)\|_2)^2$,  (5.20)
where $\mathbb{1}[y_i = y_j]$ equals 1 if two samples have the same label and 0 otherwise, and $\mathbb{1}[y_i \neq y_j]$ equals 1 if two samples have different labels. $\epsilon$ is the upper-bound distance used to avoid the exceptional contribution of some dissimilar pairs.
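A minimal PyTorch sketch of the pairwise loss in Eq. 5.20 follows, assuming f_theta is the embedding network; margin plays the role of the upper bound epsilon.

import torch

def contrastive_loss(f_theta, x_i, x_j, same_label, margin=1.0):
    # same_label: boolean tensor, True where y_i == y_j.
    d = torch.norm(f_theta(x_i) - f_theta(x_j), dim=-1)   # embedding distance
    pos = d ** 2                                          # pull similar pairs together
    neg = torch.clamp(margin - d, min=0.0) ** 2           # push dissimilar pairs apart
    return torch.where(same_label, pos, neg).mean()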
Given the raw data from the HARBox dataset, we evaluate the different granularities of data generated by Hippo. First, we train Hippo to obtain a stacked CAE and a diffusion model-based data generator. Second, we extract different resolutions of feature representations from the raw data via the stacked CAE. For each resolution of feature representation, we calculate the mean of the representations $Z_i$ of the raw data with the same label; in this way, we can mitigate the differences between different data samples of the same class. Third, we use the well-trained Hippo to generate new data samples from noise conditioned on $Z_i$. For example, the granularity-1 data is generated with the guidance of the layer-1 feature of the stacked CAE. We generate 5,000 samples for each level of data granularity. Fourth, we train a contrastive model to obtain the contrastive feature representations. We use the t-SNE [206] method for dimensionality reduction and run K-means clustering on the feature representations of raw data, granularity-1 (Granu.1), granularity-2 (Granu.2), and granularity-3 (Granu.3) data.

Figure 5.5: K-means clustering on 5,000 randomly sampled HARBox raw data and granularity 1-3 data with t-SNE dimension reduction.

We show that the reconstructed multi-grained data contains different granularities of feature information. Figure 5.5 illustrates the generalization of activity semantics. For instance, K-means can easily cluster different classes of raw sensor data because raw data contains the most fine-grained feature information. However, distinguishable fine-grained features are removed from coarse-grained reconstructed data. For Granu.1 data, the boundary between different activities becomes vague; for example, hopping and waving are hard to differentiate. The reason is that Hippo generates the new data with the guidance of the layer-1 feature of the stacked CAE, which hides part of the fine-grained information of the raw data. Thus, although we obtain the feature representations via the contrastive learning model, the intrinsic boundary among different activities becomes blurry. For Granu.3 data, the representations of walking, hopping, calling, and waving are aggregated in one big cluster. Overall, we can see that Hippo can filter out fine-grained activity features and generalize the activity feature information in reconstructed data.

Classification

In this section, we implement classification models to evaluate the feature-level information in reconstructed data. The changes in classification performance show that Hippo can generate multi-grained data containing different granularities of activity features. We compare Hippo with other generative models: the Convolutional Autoencoder (CAE) [209] and the Generative Adversarial Network (GAN) [77]. Based on our multi-grained data generation framework, which includes multi-resolution feature extraction and latent feature-guided data generation, we design CAE and GAN baselines to generate new data. (i) For the CAE, we extract the $i$-th layer feature $z_i$ of raw data $X$ with the encoder of the CAE. Then, we map $z_i$ with a matrix $W$ to the data $X'$: $X' = W z_i$. We control the error rate between $X$ and $X'$, $Err = \|X' - X\|_2$, to retain the amount of information in $X'$. (ii) The GAN is based on adversarial training to learn to generate new data. We design the GAN to implement adversarial information factorization based on the specified layer of latent representations.

Generally, generative models such as GANs can be regarded as a nonlinear parametric mapping $g : Z \rightarrow X$ to the data manifold, where $Z$ is an open subset of $\mathbb{R}^d$ and $X$ is an open subset of $\mathbb{R}^D$. The mapping is from a low-dimensional space $Z$ to a manifold $M$ embedded in the higher-dimensional space $X$. We propose that $M$ approximates the different granularities of the data manifold. We study the manifold approximation error of the generator, which minimizes a chosen divergence measure between the raw data distribution and the distribution of each granularity of the sensor data. Ideally, we can learn the latent representation $z = h(x)$, where $z \in Z$, so that $g(h(x)) = x$. However, in practice, $g$ can only learn an approximation to the true data manifold, which causes the approximation error.
We utilize the approximation error to generate the privacy-preserving item. We control the semantic closeness between $x$ and $g(h(x))$, so that the class label has different semantic granularity. The notion of the tangent space of a manifold generalizes to higher dimensions. We use the Jacobian $J_x h$ to estimate the tangent space, where the rows of $J_x h$ are the directions that approximately span the tangent space to the data manifold at $x$. Note that several places can cause approximation errors ($M \sim$ the true manifold, $g(h(x)) \sim x$, $h(g(z)) \sim z$), so we only consider the dominant tangent directions in the row span of $J_x h$. These can be obtained by computing the SVD of the matrix $J_x h$ and taking the right singular vectors. However, this process is expensive because the SVD must be computed for each data sample one by one. Kumar et al. [112] propose to use a GAN to estimate the tangent space to the data manifold. In this chapter, we take the stacked autoencoder, with the encoder-generator-discriminator triple, to get the dominant tangent direction. A distribution $p_z$ over the space $Z$ results in a distribution $p_g$ over the space $X$, and we can get a sample $x = g(z)$ from the distribution by sampling $z \sim p_z$. The discriminator approximates a divergence measure between $p_g$ and the real data distribution $p_x$. We use $KL[p\|q]$ to denote the KL divergence between the distributions $p$ and $q$ as follows:
$\Delta_{KL}(r, x, \theta) \equiv KL[p(y|x, \theta)\,\|\,p(y|x + r, \theta)], \quad r_{adv} \equiv \arg\max_{r}\{\Delta_{KL}(r, x, \theta);\ \|r\|_2 \leq \varepsilon\}$.  (5.21)

We train the generator using feature matching, where the generator minimizes the mean discrepancy between the features of raw data and generated data from an intermediate layer $l$ of the discriminator $f$. Meanwhile, we maximize the mean discrepancy between the features of raw data and generated data from layer $l+1$ of the discriminator $f$:
$L_g = \|\mathbb{E}_x[f_l(x)] - \mathbb{E}_z[f_l(g(z))]\|_2^2 - \|\mathbb{E}_x[f_{l+1}(x)] - \mathbb{E}_z[f_{l+1}(g(z))]\|_2^2$.  (5.22)

We employ the same CNN model structure, comprising two convolutional layers and two fully connected layers, to evaluate the datasets generated by the different generative models. Specifically, for the CAE data evaluation, we train the CNN model on datasets generated by the CAE, using three thresholds (0.001, 0.01, 0.1) to generate three levels of data granularity. Similarly, for the GAN, we employ adversarial training to train the generator conditioned on layer 1-3 features. We then train Hippo on these datasets and generate new data with the guidance of layer 1-3 features.

Figure 5.6: Comparison of different generative models (utility ratios of Granu.1-3 data for CAE, GAN, and Hippo on the HARBox, MotionSense, UCIHAR, and DailySports datasets).

Figure 5.6 illustrates the differences in utility across different granularities of data. Utility is computed using the activity recognition utility function defined in Section 5.2.2. Hippo successfully generates multi-grained data with varying amounts of semantic information. For example, on the MotionSense dataset, Hippo achieves the highest utility score of 0.856 for the Granu.1 reconstructed data. Granu.1 data, guided by the layer-1 latent features, preserves the most activity information from the raw data. The utility scores for Granu.2 and Granu.3 data are 0.685 and 0.564, respectively. The decrease in utility score indicates that Hippo can generate multi-grained data with hierarchical semantics.
By filtering out fine-grained activity features during the data reconstruction process, Hippo makes fine-grained activity semantics indistinguishable.

Hippo surpasses the existing generative models in several aspects. Firstly, while the CAE method can control the information in the generated data, setting the $Err$ parameter to control data granularity is challenging due to the heterogeneity of sensor data: different sensors or activities may exhibit varying magnitudes of data, complicating the task. For instance, the utility scores of Granu.1 data on HARBox, MotionSense, UCIHAR, and DailySports using the CAE are 0.89, 0.715, 0.676, and 0.65, respectively. In contrast, for Hippo, the utility scores of Granu.1 data on these datasets are 0.834, 0.856, 0.864, and 0.852. The utility scores of the CAE at the same granularity exhibit higher divergence than those of Hippo. Secondly, training a GAN to generate high-quality data consistently is challenging in practice. While GANs are designed to be conditioned on latent features, controlling the amount of information contained in the generated data is difficult; for instance, the CNN model trained on Granu.2 data may perform better than the model trained on Granu.1 data. Additionally, samples generated by GANs often exhibit more noise, resulting in more diverse utility compared to other methods, as shown in Figure 5.6. In contrast, our designed model Hippo can consistently generate high-quality data.

5.4.6: Feature-level Data Granularity Analysis

Mobile activity sensing data often contains fine-grained action feature information beyond the requirements of user-intended purposes. We consider a scenario where an app requires only a certain level $l_k$ of activity semantics. In this case, Hippo generates new data $X_r$ from random noise conditioned on a certain resolution feature. The fine-grained features are removed in $X_r$; as a result, $X_r$ can only be predicted as a coarse-grained activity semantic, which is a category covering $\ell$ distinct fine-grained semantics.

On one hand, Hippo generates new data $x'$ from random noise conditioned on a specific resolution feature $z$ as in Eq. 5.16. Therefore, the features of $x'$ depend on $z$. From Lemma 1, we know that the higher-dimensional $z$ in low-level CAEs encodes finer-grained data features, while the lower-dimensional $z$ in high-level CAEs preserves coarse-grained semantic information [172]. Suppose $x'_i = f(N; z_i)$ and $x'_j = f(N; z_j)$, where $i > j$, $z_i$ is in a low-level CAE, and $z_j$ is in a high-level CAE; then the data $x'_i$ has finer-grained features than $x'_j$.

On the other hand, Hippo provides two ways to generate diverse data at each granularity. (i) Recall from Eq. 5.17 that we can choose different $\sigma$ to generate multiple data samples conditioned on a given feature representation. (ii) Various noise levels related to $T$ correspond to different scales of features captured by the diffusion model: the greater the noise level, the larger the scale of features [90]. Thus, Hippo can learn different scales of data features and generate multi-grained data according to the latent feature $z$ and the diffusion noise level, as sketched below.
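Combining the pieces above, a granularity- and diversity-controlled generation loop might look as follows; this sketch reuses the illustrative CAE and guided_sample definitions from earlier sections and is not the exact Hippo pipeline.

def generate_multigrained(x_raw, caes, level, eps_model, alphas, taus, n=5):
    # Extract the layer-`level` feature z_i with the stacked CAE encoders
    # (Lemma 1: deeper layers carry coarser-grained information).
    z = x_raw
    for cae in caes[:level]:
        z, _ = cae.encode(z)
    # Several draws conditioned on the same z_i differ only through the
    # stochastic sigma_t term of Eq. 5.17, giving diverse same-granularity data.
    return [guided_sample(eps_model, z, alphas, taus, x_raw.shape) for _ in range(n)]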
5.4.7: Ablation Study and Efficiency Analysis

The impact of input features

Our initial assessment focuses on how feature selection influences the effectiveness of activity recognition, utilizing the HARBox dataset for this analysis. In Figure 5.7, 'Acc' represents the 3-axis accelerometer, 'Gyro' denotes the gyroscope modality, 'Mag' signifies the magnetometer modality, and 'All' encapsulates all 9-axis modalities. We calculate the utility ratio as outlined in Section 5.2.2, defined as the quotient of model accuracy on reconstructed data to that on original data.

Figure 5.7: The impact of feature selection and inference steps.

Observations from Figure 5.7 reveal that the utility derived from reconstructed data using Hippo remains consistent irrespective of the feature (modality) selection. For example, the utility ratios for Granu.1 reconstructed data across the four modalities are 0.836, 0.833, 0.823, and 0.834, respectively. This uniformity in utility ratios across various feature choices further validates the robustness of Hippo with respect to feature selection.

In addition, we recognize that the sequence in which input features are presented and their sampling frequency can influence the accuracy of activity recognition. However, it is fundamentally presumed that the order of features should not significantly alter the effectiveness of feature utilization for recognizing activities. Consequently, the feature extraction capability of Hippo remains unaffected by the sequencing of input features. Specifically, convolutional layers demonstrate resilience against variations in inputs (e.g., through image augmentation techniques [186]) during feature extraction. In a similar vein, Hippo exhibits stability when dealing with data sampled at varying rates. For example, while the DSA dataset [14] operates at a 25 Hz sampling rate, the MotionSense dataset [142] utilizes a 50 Hz rate. Figure 5.6 illustrates that Hippo maintains consistent performance across different sampling frequencies. In essence, provided that the activity recognition models are adaptable to feature variations, Hippo is also robust to such changes.

The impact of inference steps

The influence of the number of inference steps in the diffusion model on data generation is assessed using the HARBox dataset. We document the accuracy of activity recognition using Granu.1 data generated by Hippo at various inference stages. As depicted in Figure 5.7, with a minimal inference step count, such as 5, the accuracy of activity recognition dips to 40.5%, indicating suboptimal data quality. However, as the inference steps are incremented to 100, there is a noticeable enhancement in accuracy. Beyond 100 steps, the accuracy levels off, suggesting a plateau in quality improvement with additional steps. This observation underscores that a higher number of inference steps does not inherently equate to superior data quality.
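For clarity, the utility ratio reported in Figures 5.6 and 5.7 can be computed as below, assuming (per Section 5.2.2) it is the accuracy of the same classifier on reconstructed data divided by its accuracy on raw data.

import torch

@torch.no_grad()
def accuracy(model, loader):
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(dim=-1) == y).sum().item()
        total += y.numel()
    return correct / total

def utility_ratio(model, raw_loader, recon_loader):
    return accuracy(model, recon_loader) / accuracy(model, raw_loader)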
Runtime Efficiency

Given a 50 Hz sampling rate in the sensing application and a 1-second data buffering window for reconstruction, we find that on a server equipped with NVIDIA RTX A6000 graphics cards, Hippo is able to reconstruct a new data version from a sequence of size 100x9 within 0.61 seconds using 100 inference steps in the diffusion model (DM). This efficiency ensures that Hippo can generate new data within the allocated 1-second buffer time. Furthermore, the storage footprint of Hippo is relatively minimal, with its model size being just 49.5 MB when saved to disk. This compact size, combined with advancements in hardware technology, suggests the potential for deploying Hippo on mobile devices. For instance, the iPhone 14, with its 6 GB RAM and A15 Bionic chip featuring a 5-core GPU and 16-core Neural Engine, already supports complex operations such as text-to-image generation [158]. An alternative approach could involve running Hippo on a trusted edge server, allowing various applications to access and utilize the processed sensing data. Moving forward, our efforts will focus on enhancing Hippo's computational efficiency through techniques such as quantization [167], model pruning [259], and model parallelism [185], to further facilitate its adoption in diverse computing environments.

5.4.8: Case Study

Pedometer Application

The pedometer, a common tool in our daily routines, allows individuals to track their daily steps using smartphones, smartwatches, or smart bracelets. For the sake of this analysis, assume the user engages in four primary activities throughout the day: walking, calling, hopping, and waving. Among these, the user wishes to keep the calling activity private, as it represents sensitive information. However, there is a potential privacy concern: if the pedometer service provider excessively gathers mobile sensing data, it could inadvertently expose sensitive information, such as instances when the user is making calls during class. In this scenario, an attacker's primary objective would be to discern this sensitive activity, specifically identifying when the user is calling during class time, thereby breaching the user's privacy. This highlights the importance of developing mechanisms to safeguard sensitive information while still utilizing the benefits of mobile sensing technologies.

To safeguard sensitive activity information while maintaining the utility of step counting, users can employ Hippo to reconstruct motion sensor data. Consider four primary activity time windows, denoted as $X_1, X_2, X_3, X_4$, corresponding to walking, calling, hopping, and waving, respectively. The parameter $\Delta$ in Eq. 5.2 is crucial, as it governs the balance between reducing overprivileged information and retaining essential features for step counting in the reconstructed data. Importantly, the setting of $\Delta$ needs to be personalized based on the user's profile and the nature of the activities. For instance, in the context of $X_2$ (calling), if an attacker is able to discern the activity's semantic meaning with a certain probability distribution (e.g., 0.9, 0.025, 0.05, 0.025 for calling, walking, waving, and hopping, respectively), the entropy $H(D)$ in Eq. 5.2 is calculated to be 0.186 (using base-10 logarithms). By setting $\Delta$ to 0.25, we aim to limit the amount of sensitive information in the reconstructed data, ensuring that $H(G)$, the entropy after reconstruction, exceeds $H(D) + \Delta = 0.436$. Through this approach, Hippo can adaptively reconstruct the sensor data, applying a degree of feature guidance to maintain step-counting functionality while protecting sensitive activity information.

In a collaborative effort between the service provider and the user, the hyperparameter $i$ can be established for the $i$-th level of feature guidance. If a user desires to retain the full utility of sensor data, they can opt for granularity $i = 0$, designated as Granu.0. This setting keeps the raw data $X_i$ within a time window while merely altering metadata to protect privacy.
For instance, to obfuscate sensitive information within data $X_2$, the user might choose granularity $i = 1$, termed Granu.1, which abstracts the specific details of the sensitive activities while ensuring that critical features for step counting, such as the peaks and valleys in $X_1$, are preserved. Upon assessing the overprivileged information in the reconstructed Granu.1 data $X'_2$, it is observed, as illustrated in Figure 5.4, that the Granu.1 data for the calling and waving motions appear similar, indicating the removal of detailed features that could distinguish between these two activities. Consequently, an attacker's ability to identify the possible semantics of $X'_2$ as calling, walking, waving, or hopping shifts to the probabilities 0.41, 0.15, 0.39, and 0.05, resulting in $H(G)$ being 0.507 and $H(G;D)$ being 0.321. Given that $H(G;D)$ surpasses the set threshold $\Delta = 0.25$, the requirement for reducing overprivileged information is met. This approach effectively mitigates the risk of exposing sensitive details through feature-level data in mobile sensing, thereby enhancing user privacy while retaining essential functionality.
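The worked numbers above can be reproduced with a short check, assuming base-10 logarithms (which match the chapter's figures):

import numpy as np

H = lambda p: -sum(pi * np.log10(pi) for pi in p)   # base-10 Shannon entropy
print(round(H([0.9, 0.025, 0.05, 0.025]), 3))  # H(D) = 0.186
print(round(H([0.41, 0.15, 0.39, 0.05]), 3))   # H(G) = 0.507 -> H(G;D) = 0.321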
The effectiveness of reconstructed multi-grained data for step counting is assessed by collecting accelerometer and gyroscope data from a user through two Android smartphones, a Lenovo ZUK Z2 and a Samsung Galaxy S9. The user performs walking activities for 200, 600, and 1000 steps, with data captured at a 50 Hz sampling rate. Utilizing Hippo, we generate Granu.1 and Granu.2 data variants. Subsequently, the Pydometer library [46] is employed to determine whether the reconstructed data maintains its utility for step counting.

Figure 5.8: The pedometer performance on multi-grained data using different devices (error rates for raw, Granu.1, and Granu.2 data at 200, 600, and 1000 steps on the Samsung S9 and ZUK Z2).

Figure 5.8 demonstrates that Granu.1 reconstructed walking data retains step count information effectively, exhibiting an error rate below 5% regardless of the step increment. This indicates that for pedometer services, which primarily rely on peak and valley information, a granularity setting of $i = 1$ ensures the preservation of essential utility in the reconstructed data. Consequently, we can reconstruct sensing data with level-1 feature guidance to mitigate the risk of sensitive "calling" information leakage, all while maintaining the functional integrity of pedometer applications.

Fall Detection Application

Mobile devices such as the cStick [170], designed as a monitoring tool for fall detection in older adults, highlight the need for selective data usage in sensitive contexts. While the cStick is focused on detecting falls, it inevitably collects a broader range of activity data, some of which might be sensitive and not directly relevant to its primary function. This broader data collection could lead to overprivileged access issues, where more information is gathered and potentially exposed than necessary for the device's intended purpose.

To address this, Hippo can be employed to reconstruct the sensing data, thereby perturbing sensitive information while maintaining crucial information related to fall detection. By selectively filtering the data through Hippo, the device can still perform its core function of fall detection without retaining unnecessary details about the user's other activities. This approach ensures that the cStick continues to serve its purpose effectively, providing essential support and safety for older adults, while respecting their privacy and minimizing the risk of exposing sensitive information.

Table 5.2: Fall detection results on multi-grained data.

Granularity   Raw Data   Granu.0   Granu.1   Granu.2   Granu.3
Utility       1          1         0.966     0.874     0.798

The generalization capability of Hippo is assessed by training the model on the HARBox and MotionSense datasets, then applying the trained model to process the cStick dataset. A RandomForest classifier, trained with accelerometer readings from the open cStick dataset, is utilized to detect falls. The evaluation focuses on the utility of the reconstructed multi-grained data. According to Table 5.2, when the RandomForest classifier is trained and tested with raw data, it achieves 100% accuracy in fall detection. Hippo, through its diffusion process, can alter sensitive attributes in raw data to generate Granu.0 data, which safeguards metadata-level sensitive information without compromising the utility for fall detection, retaining a utility value of 1, equivalent to that of the raw data.

When examining the Granu.1 data, generated with guidance from the first-layer feature of the stacked CAE, there is a slight decrease in detection accuracy due to false negatives and false positives, as detailed in Table 5.2. The accuracy for fall detection using Granu.1 data is 96.6%, indicating a utility of 0.966. This decrease in accuracy illustrates the trade-off involved in enhancing privacy by removing detailed motion features to produce more coarse-grained reconstructed data. In essence, Hippo allows users to balance the level of detail in the reconstructed data with their privacy needs. If high accuracy for specific activity recognition is crucial, users can opt to retain more detailed activity features in the reconstructed data. Our proposed multi-grained data reconstruction approach offers fine-grained control over information release, enabling users to tailor the balance between privacy protection and the utility of sensing data for specific applications such as fall detection.

5.4.9: Discussion

Data Granularity

In Hippo, the data granularity level is a critical hyperparameter that dictates the resolution of feature extraction, determined by the number of convolutional autoencoders (CAEs) employed for multi-resolution analysis. It is important to note that data granularity and semantic granularity are not directly correlated; their relationship varies depending on the specific use case. However, there are overarching guidelines for using different granularity levels in Hippo: (i) Mitigating Metadata-level Overprivileged Issues: when the objective is to address concerns at the metadata level, users can opt for "Granularity-0" data reconstruction. This approach leverages Hippo to obscure personal attributes while preserving the activity features found in the raw data. Essentially, this level aims to protect the user's privacy without significantly impacting the utility of the data for its intended purpose, such as activity tracking or fall detection. (ii) Protecting against Feature-level Overprivileged Issues: if the concern extends to feature-level privacy, users can generate "Granularity-i" data. This is achieved by selecting the $i$-th layer of features from the stacked CAEs, where $i \geq 1$. Each incremental level of granularity (increasing $i$) typically offers a higher degree of data abstraction, removing more detailed information to enhance privacy.
The configuration of the granularity level i in Hippo is intricately linked to the specific utility requirements of applications and the need to mitigate overprivileged information exposure. Given the wide array of human activities and the varying contexts in which they occur, it is essential for users and service providers to collaboratively determine the most appropriate granularity level i. This process typically involves the following steps: (i) Evaluation by Service Providers: Initially, service providers should assess the hyperparameters of Hippo, such as the number of stacked CAEs, using public datasets. This evaluation helps in understanding how different configurations impact the data reconstruction process and the resulting data granularity. (ii) Generation of Multi-Grained Data: Based on the evaluation, service providers can then generate data at various levels of granularity using different numbers of CAEs. This produces a range of reconstructed datasets, each offering a different balance of privacy protection and utility. (iii) Collaborative Selection: Users and service providers can collaboratively review these multi-grained datasets to identify which granularity level aligns best with the desired level of activity semantics and privacy protection. This empirical approach allows stakeholders to make informed decisions that cater to the specific needs of the application. Additionally, the threshold Δ, as discussed in Section 5.2.3, plays a crucial role in determining the balance between privacy and utility. It is tailored to align with the specific activity recognition scores, as illustrated in the pedometer example in Section 5.4.8. Since setting the ideal Δ can be challenging without comprehensive context, involving a trustworthy third party might be necessary to assist users in defining this hyperparameter accurately. It is worth noting that completely eliminating feature-level overprivileged information without compromising necessary activity features is a challenging endeavor. Hippo offers a pragmatic solution by allowing the selective release of multi-grained data, thus providing a mechanism for data minimization. Recognizing the complexities surrounding data minimization, future efforts will focus on enhancing Hippo's user-friendliness and effectiveness in real-world applications, ensuring it remains a viable tool for balancing data utility and privacy.

Implementation

Hippo serves as a middleware for reconstructing sensor data at multiple granularity levels, enabling the perturbation of metadata and the removal of detailed features to address privacy concerns. Integrating Hippo within a secure realm, such as ARM TrustZone, enhances data security by providing an isolated environment where raw mobile sensing data is exclusively accessed and processed by Hippo. This setup ensures that only preprocessed data, which has been perturbed or generalized according to the specified granularity level, is accessible to other applications.
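As a rough illustration of this middleware role, the sketch below shows how an application-facing interface might release only reconstructed data; the class and function names are hypothetical, and the reconstruction callable stands in for Hippo's diffusion-based generator running inside the secure realm.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class GranularityPolicy:
        level: int  # 0 perturbs metadata only; i >= 1 applies level-i feature guidance

    class SensingMiddleware:
        # Hypothetical application-facing facade: apps never touch raw frames.
        def __init__(self, reconstruct):
            # reconstruct: callable(raw: np.ndarray, level: int) -> np.ndarray,
            # standing in for Hippo's generator inside the trusted environment.
            self._reconstruct = reconstruct

        def read(self, raw_frames: np.ndarray, policy: GranularityPolicy) -> np.ndarray:
            # Only data reconstructed at the requested granularity leaves the enclave.
            return self._reconstruct(raw_frames, policy.level)

For instance, a pedometer app might request level-1 data via middleware.read(frames, GranularityPolicy(level=1)), while a fall detector could be served Granu.0 data.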
In practice, Hippo's functionality allows for tailored data access: for applications that necessitate raw activity features, Hippo minimally perturbs only the metadata, ensuring that the details essential for the application's functionality are preserved; for applications where only broad activity patterns are necessary, Hippo produces data at a coarser granularity, removing detailed features to enhance privacy. Moreover, Hippo's integration with privacy-check mechanisms such as MaskIt provides a dynamic framework for determining when data reconstruction should be applied, catering to the diverse privacy needs across different scenarios. This adaptive approach allows Hippo to offer a customizable privacy-preserving solution that aligns with varying requirements, thereby providing a more secure and flexible data-handling process within mobile and wearable device ecosystems.

As a learning-based method, Hippo can be generalized to new users and tasks by training on large-scale datasets that reflect the diversity of user activities. Hippo perturbs the metadata information without changing the activity feature information. In addition, Hippo can change the activity feature granularity by generating multi-grained data, which can be applied to new scenarios or tasks. For instance, Hippo can be applied to defend against eavesdropping attacks that utilize motion sensors on smartphones [6, 12, 92, 149] by removing fine-grained features. Furthermore, in this chapter, we assume that apps are trusted to honestly claim their intended functionalities, given law and policy regulations. In the future, we will consider how Hippo can prevent apps from overclaiming the requirements for their intended functionalities. Alternatively, we will enhance the mechanism of setting the data granularity in Hippo to avoid the involvement of third-party services.

5.5: Summary

In this chapter, we address the challenges of metadata-level and feature-level overprivileged issues prevalent in mobile sensing applications. Drawing on the layered complexity of human activities and the capabilities of generative artificial intelligence, we introduce Hippo, a system designed to produce data at multiple granularity levels. This system enables the perturbation of extraneous metadata and the elimination of unnecessary activity details, offering a tailored approach to data privacy. For tackling metadata-level overprivileged issues, Hippo employs a diffusion process that obfuscates private attributes, establishing a theoretical linkage between the diffusion model and the principles of differential privacy. On the feature level, Hippo removes detailed activity information by generating data at varying granularities, allowing the same data to be interpreted at different levels of detail depending on the requirements of applications. Users gain the capability to control the granularity of the data they release through Hippo, effectively addressing overprivileged data access concerns. Our implementation includes a variety of data visualization, classification, and clustering tasks, along with two case studies across different datasets to assess the efficacy of Hippo in generating multi-grained data. The comprehensive evaluations demonstrate that Hippo is proficient in managing overprivileged issues in mobile activity sensing, offering a fine-grained approach to controlling data access and usage.
This enables a balance between data utility and privacy, providing users and developers with a powerful tool to manage data privacy in mobile sensing applications.

CHAPTER 6: CONCLUSION

In this dissertation, we investigate the complex interactions between users, devices, and the environment. In particular, we explore the potential security and privacy issues behind these interactions. External attacks and user misconfiguration issues could lead to unexpected and insecure IoT device interactions, causing damage to the physical world. To enforce IoT device interaction security, we propose the interactive threat detection system, Glint, to identify potential interactive threats in smart home automation configuration rule data. Then, considering the privacy concerns of sharing smart home data for threat detection model training, we design a federated training paradigm to collaboratively train the threat detection model across multiple local households. In addition, we present a Monte Carlo beam search-based method to discover the root causes of an interactive threat in the IoT device interaction graph. We then turn to the privacy issues behind the interaction between users and mobile devices such as smartphones. In particular, we present a new acoustic-based method to recognize the facial expressions of users, which can be used to expose the emotional privacy of users in real time. Meanwhile, the proposed acoustic-based facial expression recognition method can be used for digital content recommendation and mental health care. Furthermore, we point out the overprivileged issues in mobile activity sensing data, where the sensing data contains more information than is required for application purposes. Considering that the excess information may contain sensitive information about users, we propose Hippo to reconstruct multi-grained mobile activity sensing data. In this way, Hippo empowers users with the choice to perturb metadata information and obscure sensitive activity information that they do not intend to release.

6.1: Summary of Contributions

We addressed multiple challenges in developing data-centric AI techniques for interaction security and privacy in the Internet of Things. From the data perspective, we solve the data scarcity problem by fusing data from multiple sources and collecting data from volunteers in real-world scenarios. (i) We have crawled automation rule description data from public smart home control platforms and built a correlation classification model to find correlations between different rules. Then, we synthesize a large-scale interaction graph dataset by randomly composing different numbers of correlated rules into graphs. (ii) We collect acoustic-based facial expression signals from 20 volunteers. The collected facial expressions are anger, happiness, disgust, sadness, fear, and surprise. To ensure diversity in facial expressions, the volunteers vary in skin color and come from various parts of the world, with their ages ranging between 20 and 38 years. The data collection process was conducted intermittently over the span of a week. The amassed dataset exceeds 1 GB when saved in plain-text format. Finally, we obtained 20,535 samples using the sliding-window segmentation method, with each window being 0.25 seconds in length.

From the model perspective, we design different deep-learning models to learn features for different tasks.
(i) We design a novel graph neural network model to learn interactive threat patterns in complex interaction graphs. Specifically, considering that a threat pattern could exist at different scales in a heterogeneous interaction graph, the designed ITGNN model includes metapath-based node transformation, a multi-scale graph generator, and multi-scale graph fusion. In addition, we propose to apply transfer learning techniques to enhance the model performance on different platforms with data imbalance issues. (ii) We propose a contrastive attention-based domain adaptation model to extract acoustic-based facial expression features and identify six basic facial expressions. Specifically, we incorporate unlabeled testing data into the training process by designing a pseudo-label generation process. The external attention learning method makes the neural network focus on the important features that differentiate facial expressions. In addition, the external attention mechanism can force the model to learn correlations among different expression samples. (iii) We present a latent diffusion guidance data generation method based on a convolutional autoencoder and a diffusion model. The designed method can perturb metadata information and generate multi-grained data that contains different granularities of activity semantic information.

From the system perspective, we design multiple systems to explore and defend against potential security and privacy attacks in the IoT. (i) We design a graph learning-based interactive threat detection system, which can discover potential new threats across closed-source heterogeneous systems. (ii) We propose a layer-wise dynamic clustering-based federated learning system to train the graph learning-based threat detection model while preserving the privacy of smart home usage data. (iii) We design an acoustic-based facial expression recognition system using the microphone and earpiece speaker of a commercial smartphone, which can be used to understand users' emotional status. (iv) We design a mobile sensing data reconstruction system, which can be used as a middleware between the sensor manager in the operating system and various applications. The designed system can perturb sensitive metadata information and obscure sensitive activity features that users do not intend to release.

6.2: Discussion of Limitations

While this dissertation has yielded significant insights, it is important to consider the following limitations that might influence the generalizability and applicability of the results. Addressing these limitations can provide a clearer path for future studies to build upon and refine the work presented here, ensuring that subsequent efforts are more robust and comprehensive.

Graph Learning for Threat Detection

This dissertation proposes a graph neural network (GNN) based solution for detecting interactive threats among smart home automation applications. Natural language processing (NLP) techniques are employed to extract automation rule information from app descriptions so that the source code of applications is not required for the interaction analysis. However, we acknowledge there are limitations to implementing the proposed system. First, we build the interaction graph dataset using NLP techniques, which discover semantic correlations between different rules crawled from various platforms.
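As a simplified illustration of this correlation discovery (the actual pipeline trains a correlation classification model over rule descriptions; the TF-IDF similarity and the threshold below are stand-in assumptions):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Two hypothetical automation rule descriptions crawled from a platform.
    rules = ["If smoke is detected, open the window",
             "When the window opens, turn off the heater"]

    # Correlated rules tend to share device and state terms.
    tfidf = TfidfVectorizer().fit_transform(rules)
    score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    is_correlated = score > 0.2   # illustrative threshold, not the trained classifier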
However, due to the heterogeneity of IoT platforms, this dissertation acknowledges that different platforms need different techniques for analyzing IoT apps and extracting their app semantics. Moreover, IoT data can have missing and corrupted records. We need to pay more attention to detecting and fixing data corruption issues in the IoT environment. Nevertheless, our goal is to create an interaction graph dataset to train the graph learning-based threat detection model. The evaluation has shown that the dataset created from publicly available platforms can empower the threat detection model to achieve decent performance in real-world settings.

Second, for the interaction graph construction in a real environment, we mainly use the device names and status information from event logs to build the correlations between different rules that are associated with different devices. In the threat detection process, we approach the interaction graph as a static entity, essentially viewing it as a snapshot within a dynamic setting. However, there are time delays when extracting event logs and building the interaction graphs. As a result, if two events occur synchronously or nearly at the same time, there could be delays or inaccuracies in building the interaction graph in real time.

Third, the graph learning-based method is essentially a probabilistic model. The model generally requires a large amount of data to perform well. Insufficient data can lead to poor estimates of the probabilities and model parameters, thereby affecting the overall performance. In addition, graph neural network models, involving complex interactions and numerous parameters, can be difficult to interpret. This complexity can make it challenging for users to understand the reasoning of models or to extract actionable insights. In the future, we aim to improve the interpretability of AI-based models. For instance, saliency maps can highlight the parts of the input data that are most important for a model's predictions. SHapley Additive exPlanations (SHAP) uses game theory to explain the output of any machine learning model. It assigns each feature an importance score for a particular prediction, providing a more comprehensive understanding of model behavior (a brief usage sketch appears at the end of this subsection). The layer-wise relevance propagation method decomposes the output decision to reveal the contribution of each input feature to the final decision, helping to visualize how individual features contribute to predictions. Implementing these techniques can help make machine learning models more understandable to both developers and users, ultimately leading to better trust and more effective deployment of AI technologies.

Fourth, to facilitate classification, we have manually labeled every interaction graph as "safe" or "unsafe", which takes considerable effort. Given the thousands of possible graphs of apps and devices on different platforms, labeling a large number of graphs could be error-prone. Besides, in different scenarios, different users have different security requirements. What is typical in one house may be atypical in another. For example, opening a window at night may invite robbers; however, in a high-rise apartment, it would not be a threat to certain users. In the future, we plan to enhance the precision of our data labeling and introduce more detailed threat-type labels with the assistance of experts and current threat identification tools.
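The SHAP sketch referenced above, shown on a synthetic tabular stand-in (explaining a GNN directly would instead require a model-agnostic explainer such as shap.KernelExplainer):

    import shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in features; in our setting these could be graph-level statistics.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)     # exact SHAP values for tree ensembles
    shap_values = explainer.shap_values(X)    # per-feature importance for each prediction
    shap.summary_plot(shap_values, X)         # global view of feature contributions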
Furthermore, this dissertation introduces the use of transfer learning techniques, and we will further explore how to effectively balance the original local data on a platform with models transferred from other platforms.

Federated Threat Detection Model Training

In this dissertation, we design a federated learning (FL) approach on non-IID data to learn the threat patterns. However, FL faces several security and privacy challenges. (i) FL suffers from data poisoning attacks. Malicious participants can manipulate the training process by injecting corrupted data or gradients, which can lead to a compromised model that fails to perform accurately or behaves undesirably when triggered. (ii) FL is vulnerable to inference attacks. Despite not sharing data directly, subtle information about the training data can still be inferred from the shared model updates. For instance, membership inference attacks can determine whether a particular data point was used in the training dataset, potentially revealing sensitive information. Techniques such as differential privacy are often needed to mitigate this risk, but they can also degrade the model's performance. (iii) FL is challenged by significant communication overheads and security concerns. FL requires frequent exchanges of information (model updates) between clients and the central server, which can expose the process to interception and tampering attacks. Secure communication protocols are essential to protect these exchanges. Addressing these issues is crucial for the successful deployment of FL in privacy-preserving threat detection in smart homes.

Acoustic-based Facial Expression Sensing

This dissertation designs an active acoustic facial expression recognition system that leverages the earpiece speaker and microphone at the top of the phone body. It proposes a contrastive external attention model to learn representative and robust facial expression features while eliminating background noise. Several factors and cases need to be considered in future work to enhance the robustness and practicality of the system in real life.

First, the relative position between the user and the smartphone is flexible. When a user is interacting with a smartphone, the position of their face in relation to the device can change significantly. For instance, the distance and angle between the user's face and the smartphone can vary widely. Additionally, users might turn their heads to engage in conversation with others, often not facing the smartphone directly. Moreover, the distance between the user and the mobile phone is restricted to a relatively strict range in this dissertation, namely 20-50 cm, whereas the actual distance in daily use often falls outside this range. Therefore, in the future, we will consider more factors, such as head orientation and distance, to evaluate the performance of FacER.

Second, smartphone designs differ significantly. For instance, the placement of the microphone and earpiece speaker varies across different brands. Consequently, in the process of removing direct path signals, which travel directly from the speaker to the microphone, the specific design parameters, particularly the direct path response from the speaker to the microphone, play a crucial role. In the future, we plan to expand our evaluation to include various smartphone models and design advanced algorithms to mitigate the negative effect of direct path signal noise.
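As a rough illustration of the kind of direct-path mitigation involved, the sketch below fits a short FIR channel from the known transmitted signal by least squares and subtracts the predicted direct-path component; this naive cancellation and its tap count are assumptions for illustration, not FacER's actual algorithm.

    import numpy as np

    def cancel_direct_path(tx, rx, taps=64):
        # Fit an FIR channel h mapping the transmitted signal tx to the received
        # signal rx, then subtract the predicted direct-path component so that
        # the residual keeps the weaker reflections from the face.
        n = len(rx)
        X = np.zeros((n, taps))
        for k in range(taps):                 # delayed copies of tx as regressors
            X[k:, k] = tx[: n - k]
        h, *_ = np.linalg.lstsq(X, rx, rcond=None)
        return rx - X @ h                     # echo residual after cancellation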
Third, enhancing the continuous recognition of facial expressions requires additional effort. In this dissertation, accurate facial expression recognition necessitates the initial segmentation of the acoustic signals that reflect facial expressions. However, determining the exact start and end of a facial expression poses a challenge. On one hand, the duration of the same facial expression can vary considerably even when performed by the same individual. On the other hand, the signal variations stemming from facial expressions are quite subtle. We have collected data for each facial expression over five seconds. In the future, we aim to explore solutions for segmenting acoustic signals to effectively extract facial expressions, thereby facilitating real-time facial expression recognition.

Multi-grained Sensing Data Reconstruction

This dissertation provides a learning-based data sanitization method to minimize the information in a data point to what is needed for a given task. To this end, a hierarchical feature extraction model and a latent guidance-based data generation model are developed to generate multi-grained data. There are several limitations to the current work.

First, formal guarantees for information minimization with respect to a given task need further study. We acknowledge that it is hard to empirically demonstrate information minimization, since all possible other information would have to be evaluated. Sensing data from Hippo is generated from random noise with the guidance of latent representations from different CAE layers, which contain multi-grained information about the raw data. Users can choose the i-th layer to quantify the released sensing information. The method enables different levels of privacy, but these levels are not easily aligned with interpretable privacy requirements. In the future, we will seek provable guarantees and interpretable reconstruction results for information minimization, building on the developing learning theory of deep networks. Nevertheless, the success of text-to-image generators such as DALL-E 2 [52] demonstrates the capability of data generation models like the one used in Hippo.

Second, there is a risk that reconstructed data could diminish the accuracy of model inference, which is potentially harmful to certain applications (e.g., fall detection, where ideally accuracy should be 100%). Moreover, employing diffusion models to reconstruct an anonymized signal could entirely invalidate any system relying on anomaly detection (e.g., arrhythmia detection), making the approach unsuitable for such scenarios. We need to continue to explore advanced methods to prevent the degradation of use cases that depend on anomaly detection.

6.3: Outline of Future Work

In the future, I will continue making efforts to enforce security and privacy measures in the interactions between users, devices, environments, and AI. In particular, with the development of large models or foundation models [255], generative AI models have shown a great ability to simulate the physical world. Meanwhile, these AI models will empower physical devices with a greater ability to interact with each other and with the surrounding environments. Regarding the security and privacy issues in the interactions of the Artificial Intelligence of Things, there are three main directions for further efforts in the future.

Multimodal Federated Learning for Privacy-enhanced Interaction between IoT Devices

The world of the IoT is an important source of multimodal data.
IoT devices, ranging from smartphones and wearables to industrial sensors, collect a variety of data types, including images, audio, text, and sensor readings. This diversity is foundational for multimodal learning, providing a rich dataset for comprehensive analysis and insight generation. In addition, IoT devices continuously generate data in real time, offering a dynamic and evolving dataset for multimodal FL. This continuous data flow is ideal for developing models that adapt and evolve over time, enhancing their accuracy and relevance. IoT devices often collect sensitive information. Multimodal FL allows for the analysis of this data directly at the source, without needing to centralize it, thereby preserving privacy and reducing data transmission costs. By leveraging data from multiple modalities through FL, IoT systems can make more informed and accurate decisions. For instance, in a smart city context, data from cameras (visual), microphones (audio), and environmental sensors (numerical) can be integrated to optimize traffic management and public safety. Multimodal FL also fits naturally with edge computing, a key component of modern IoT frameworks, where data processing occurs closer to where it is generated. This synergy reduces latency, decreases bandwidth usage, and ensures more responsive and autonomous IoT applications.

However, in multimodal FL, data heterogeneity is more challenging because different modalities can have varying data distributions, dimensions, and formats, making it difficult to integrate and learn from them effectively. Multimodal FL requires sophisticated models that can handle and integrate multiple data types. Designing and training such models in a federated setting adds to the complexity, especially when considering the need to adapt these models to potentially sparse and non-IID (not independent and identically distributed) data across nodes. While FL inherently offers a level of privacy by not sharing raw data, multimodal FL must ensure that combined data modalities do not inadvertently reveal sensitive information. Additionally, robust mechanisms are needed to protect against adversarial attacks that can exploit multimodal data. Multimodal FL can be applied in various domains, such as healthcare, where different data types can be integrated while respecting patient privacy. We will need to develop new algorithms, including new strategies for data fusion, model training, and communication efficiency, which can contribute to the broader field of machine learning. By leveraging FL, multimodal learning can utilize data from diverse sources without centralizing it, thereby enhancing data privacy and security, a significant advantage in sectors such as healthcare and finance.

Defend against AI-enabled Fraud for Trustworthy Interaction between Users and AI Systems

AI-enabled fraud and deception refer to sophisticated methods that use AI technologies to commit or enhance fraudulent activities and deceptive practices. AI-enabled fraud is becoming a serious issue in the interactions between users, devices, and AI because of the ability of AI to simulate the physical world. The adversary can leverage the power of AI to analyze large datasets, identify vulnerabilities, automate deceptive processes, and create realistic forgeries that can deceive individuals, businesses, and even automated systems.
For example, AI can be used to craft more convincing phishing emails and messages by analyzing large datasets to mimic communication styles, creating personalized messages that are more likely to deceive the recipients. Using deep learning, the adversary can synthesize fake images, videos, or voice recordings (deepfakes) that are convincingly real, enabling them to impersonate individuals for unauthorized access or to spread misinformation. In the context of cybersecurity, AI can be used to generate adversarial examples that deceive AI systems, such as manipulating images so that they are misclassified by visual recognition systems. The sophistication of AI-enabled fraud makes it challenging to detect, as it can often bypass traditional security measures and detection algorithms. In the future, we will develop advanced AI algorithms, leveraging AI itself, to learn and adapt to new fraudulent patterns over time, helping to identify and mitigate AI-enabled fraud. AI can analyze vast amounts of data in real time, identifying complex patterns and anomalies that might indicate fraudulent activities, thus offering advanced detection capabilities beyond human analysts or traditional methods. By automating the detection process and reducing the incidence of fraud, AI-enabled systems can significantly lower the costs related to fraud investigation and losses.

In the future, we will put efforts into enhancing the trustworthiness of interactions between users and AI systems. First, high-quality and extensive datasets are crucial for training effective AI models. However, accessing diverse and comprehensive fraud-related data while maintaining privacy and security is a significant challenge. AI fraud detection can be integrated with other technologies, such as blockchain for enhanced security or the IoT for broader data collection, opening up new avenues for innovative solutions. Second, adversaries may continuously evolve their tactics, meaning AI systems need to be adaptable and capable of detecting new and unknown types of fraud, which is a non-trivial task against adaptive adversaries. Third, balancing the sensitivity of AI models to accurately detect fraud while minimizing false positives (legitimate transactions flagged as fraudulent) and false negatives (fraudulent transactions not detected) is critical to ensure trust and efficiency. Finally, ensuring that fraud detection decisions are explainable and transparent is essential for regulatory compliance and trustworthiness.

AI for Secure Interaction between Users and Devices

The integration of AI for finding and fixing vulnerabilities in critical software within the IoT ecosystem presents a unique set of challenges and opportunities. As IoT devices become more prevalent and integral to our daily lives, ensuring their security is paramount. The IoT ecosystem is incredibly diverse, with devices varying in computational power, operating systems, and functionalities. Developing AI models that can effectively address the security needs of such a diverse range of devices is a significant challenge. In addition, many IoT devices have limited processing power and memory, which can restrict the deployment of sophisticated AI algorithms directly on the devices. The lack of standardization across IoT devices can complicate the integration of AI solutions for security, as these solutions must be compatible with a wide range of device protocols and architectures. Furthermore, IoT devices often collect sensitive data.
Utilizing AI to analyze this data for vulnerabilities must be done in a way that respects user privacy and complies with data protection regulations. In the future, we will develop lightweight AI models to analyze data from numerous IoT devices in real time to identify patterns and anomalies that may indicate a security vulnerability or an ongoing attack, allowing for proactive threat mitigation. The AI models can learn the normal behavior of IoT devices and networks, making it possible to detect deviations that may signify a security breach, even in the absence of known malware signatures. The benefits of AI in securing IoT devices apply across various sectors, including healthcare, manufacturing, and smart homes, enhancing overall security and trust in IoT environments.

BIBLIOGRAPHY

[1] Hadi Abdullah, Washington Garcia, Christian Peeters, Patrick Traynor, Kevin RB Butler, and Joseph Wilson. Practical hidden voice attacks against speech and speaker recognition systems. arXiv preprint arXiv:1904.05734, 2019.
[2] Mohannad Alhanahnah, Clay Stevens, and Hamid Bagheri. Scalable analysis of interaction threats in iot systems. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 272–285, 2020.
[3] Mohannad Alhanahnah, Clay Stevens, Bocheng Chen, Qiben Yan, and Hamid Bagheri. Iotcom: Dissecting interaction threats in iot systems. IEEE Transactions on Software Engineering, 2022.
[4] Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
[5] Amazon. Amazon alexa skills. https://www.amazon.com/alexa-skills/b?ie=UTF8&node=13727921011, 2022.
[6] S Abhishek Anand and Nitesh Saxena. Speechless: Analyzing the threat to speech privacy from smartphone motion sensors. In 2018 IEEE Symposium on Security and Privacy (SP), pages 1000–1017. IEEE, 2018.
[7] Deepali Aneja, Alex Colburn, Gary Faigin, Linda Shapiro, and Barbara Mones. Modeling stylized character expressions via deep learning. In Asian Conference on Computer Vision, pages 136–153. Springer, 2016.
[8] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, Jorge Luis Reyes-Ortiz, et al. A public domain dataset for human activity recognition using smartphones. In Esann, volume 3, page 3, 2013.
[9] Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818, 2019.
[10] Home Assistant. Home assistant. https://www.home-assistant.io/, 2022.
[11] Santiago Andrés Azcoitia and Nikolaos Laoutaris. A survey of data marketplaces and their business models. ACM SIGMOD Record, 51(3):18–29, 2022.
[12] Zhongjie Ba, Tianhang Zheng, Xinyu Zhang, Zhan Qin, Baochun Li, Xue Liu, and Kui Ren. Learning-based practical smartphone eavesdropping with built-in accelerometer. In NDSS, 2020.
[13] Borja Balle and Yu-Xiang Wang. Improving the gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, pages 394–403. PMLR, 2018.
[14] Billur Barshan and Murat Cihan Yüksek. Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units. The Computer Journal, 57(11):1649–1667, 2014.
[15] Johes Bater, Gregory Elliott, Craig Eggen, Satyender Goel, Abel N Kho, and Jennie Rogers. Smcql: Secure query processing for private data networks. Proc. VLDB Endow., 10(6):673–684, 2017.
[16] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6541–6549, 2017.
[17] Alsallakh Bilal, Amin Jourabloo, Mao Ye, Xiaoming Liu, and Liu Ren. Do convolutional neural networks learn class hierarchy? IEEE transactions on visualization and computer graphics, 24(1):152–162, 2017.
[18] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, 2009.
[19] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1175–1191, 2017.
[20] Luca Bonomi, Liyue Fan, and Hongxia Jin. An information-theoretic approach to individual sequential data sanitization. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 337–346, 2016.
[21] Antoine Boutet, Carole Frindel, Sébastien Gambs, Théo Jourdan, and Rosin Claude Ngueveu. Dysan: Dynamically sanitizing motion sensor data against sensitive inferences through adversarial networks. In Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, pages 672–686, 2021.
[22] Debora Caldarola, Massimiliano Mancini, Fabio Galasso, Marco Ciccone, Emanuele Rodolà, and Barbara Caputo. Cluster-driven graph federated learning over multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2749–2758, 2021.
[23] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, 2013.
[24] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV), pages 132–149, 2018.
[25] Tristan Cazenave. Monte carlo beam search. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):68–72, 2012.
[26] Z Berkay Celik, Leonardo Babun, Amit Kumar Sikder, Hidayet Aksu, Gang Tan, Patrick McDaniel, and A Selcuk Uluagac. Sensitive information tracking in commodity iot. In 27th USENIX Security Symposium (USENIX Security 18), pages 1687–1704, 2018.
[27] Z Berkay Celik, Patrick McDaniel, and Gang Tan. Soteria: Automated iot safety and security analysis. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 147–158, 2018.
[28] Z Berkay Celik, Gang Tan, and Patrick D McDaniel. Iotguard: Dynamic enforcement of security and safety policy in commodity iot. In NDSS, 2019.
[29] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.
[30] Supriyo Chakraborty, Chenguang Shen, Kasturi Rangan Raghavan, Yasser Shoukry, Matt Millar, and Mani Srivastava. ipshield: A framework for enforcing context-aware privacy. In 11th USENIX symposium on networked systems design and implementation (NSDI 14), pages 143–156, 2014.
[31] Bocheng Chen, Nikolay Ivanov, Guangjing Wang, and Qiben Yan.
Dynamicfl: Balancing communication dynamics and client manipulation for federated learning. In 2023 20th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), pages 312–320. IEEE, 2023.
[32] Chen Chen, Ke Sun, and Xinyu Zhang. Exgsense: Toward facial gesture sensing with a sparse near-eye sensor array. In Proceedings of the 20th International Conference on Information Processing in Sensor Networks (co-located with CPS-IoT Week 2021), pages 222–237, 2021.
[33] Jiayi Chen, Urs Hengartner, Hassan Khan, and Mohammad Mannan. Chaperone: Real-time locking and loss prevention for smartphones. In 29th USENIX Security Symposium (USENIX Security 20), pages 325–342, 2020.
[34] Kaixuan Chen, Dalin Zhang, Lina Yao, Bin Guo, Zhiwen Yu, and Yunhao Liu. Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities. ACM Computing Surveys (CSUR), 54(4):1–40, 2021.
[35] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
[36] Yanjiao Chen, Runmin Ou, Zhiyang Li, and Kaishun Wu. Wiface: facial expression recognition using wi-fi signals. IEEE Transactions on Mobile Computing, 21(1):378–391, 2020.
[37] Eunyong Cheon, Yonghwan Shin, Jun Ho Huh, Hyoungshick Kim, and Ian Oakley. Gesture authentication for smartphones: Evaluation of gesture password selection policies. In 2020 IEEE Symposium on Security and Privacy (SP), pages 249–267. IEEE, 2020.
[38] Haotian Chi, Chenglong Fu, Qiang Zeng, and Xiaojiang Du. Delay wreaks havoc on your smart home: Delay-based automation interference attacks. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1575–1575. IEEE Computer Society, 2022.
[39] Haotian Chi, Qiang Zeng, Xiaojiang Du, and Jiaping Yu. Cross-app interference threats in smart homes: Categorization, detection and handling. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 411–423. IEEE, 2020.
[40] Seokmin Choi, Yang Gao, Yincheng Jin, Se jun Kim, Jiyang Li, Wenyao Xu, and Zhanpeng Jin. Ppgface: Like what you are watching? earphones can "feel" your facial expressions. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(2):1–32, 2022.
[41] Elizabeth A Clark, J'Nai Kessinger, Susan E Duncan, Martha Ann Bell, Jacob Lahne, Daniel L Gallagher, and Sean F O'Keefe. The facial action coding system for characterization of human affective response to consumer product-based stimuli: a systematic review. Frontiers in psychology, 11:920, 2020.
[42] Ronald Cramer, Ivan Bjerre Damgård, et al. Secure multiparty computation. Cambridge University Press, 2015.
[43] Botos Csaba, Xiaojuan Qi, Arslan Chaudhry, Puneet Dokania, and Philip Torr. Multilevel knowledge transfer for cross-domain object detection. arXiv preprint arXiv:2108.00977, 2021.
[44] Alexander De Luca, Emanuel Von Zezschwitz, Ngo Dieu Huong Nguyen, Max-Emanuel Maurer, Elisa Rubegni, Marcello Paolo Scipioni, and Marc Langheinrich. Back-of-device authentication on smartphones. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 2389–2398, 2013.
[45] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
[46] Alex Developer. Python functions to count steps in accelerometer data. https://pypi.org/project/pydometer/, 2023.
[47] Apple Developer. Core motion framework reference. https://developer.apple.com/documentation/coremotion, 2023.
[48] Arm Developer. Trust zone. https://www.arm.com/technologies/trustzone-for-cortex-a, 2023.
[49] Google Developer. Activity recognition api. https://developers.google.com/location-context/activity-recognition, 2023.
[50] Google Developer. Android permission. https://developer.android.com/reference/android/Manifest.permission, 2023.
[51] Google Developer. Android sensors. https://developer.android.com/guide/topics/sensors/sensors_overview, 2023.
[52] OpenAI Developer. AI system that can create realistic images. https://openai.com/dall-e-2/, 2023.
[53] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[54] Wenbo Ding and Hongxin Hu. On the safety of iot device physical interaction control. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 832–846, 2018.
[55] Wenbo Ding, Hongxin Hu, and Long Cheng. Iotsafe: Enforcing safety and security policy with real iot physical interaction discovery. In Proceedings of the 2021 Network and Distributed Systems Security (NDSS) Symposium, 2021.
[56] John R Douceur. The sybil attack. In International workshop on peer-to-peer systems, pages 251–260. Springer, 2002.
[57] Alexey Drutsa, Valentina Fedorova, Dmitry Ustalov, Olga Megorskaya, Evfrosiniya Zerminova, and Daria Baidakova. Crowdsourcing practice for efficient data labeling: Aggregation, incremental relabeling, and pricing. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 2623–2627, 2020.
[58] Jian Du, Shanghang Zhang, Guanhang Wu, José MF Moura, and Soummya Kar. Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370, 2017.
[59] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pages 1285–1298, 2017.
[60] Cynthia Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.
[61] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
[62] Cynthia Dwork and Guy N Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.
[63] P Ekmann. Universal facial expressions in emotion. Studia Psychologica, 15(2):140, 1973.
[64] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
[65] Hossein Fereidooni, Jan König, Phillip Rieger, Marco Chilese, Bora Gökbakan, Moritz Finke, Alexandra Dmitrienko, and Ahmad-Reza Sadeghi. Authentisense: A scalable behavioral biometrics authentication scheme using few-shot learning for mobile platforms. In Proceedings of the Network and Distributed System Security (NDSS), 2023.
[66] Earlence Fernandes, Amir Rahmati, Jaeyeon Jung, and Atul Prakash. Decentralized action integrity for trigger-action iot platforms. In Proceedings 2018 Network and Distributed System Security Symposium, 2018.
[67] David J Field. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A, 4(12):2379–2394, 1987.
[68] Julien Fleureau, Philippe Guillotel, and Izabela Orlac. Affective benchmarking of movies based on the physiological responses of a real audience. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pages 73–78. IEEE, 2013.
[69] Lorenzo Frigerio, Anderson Santana de Oliveira, Laurent Gomez, and Patrick Duverger. Differentially private generative adversarial networks for time series, continuous, and discrete open data. In ICT Systems Security and Privacy Protection: 34th IFIP TC 11 International Conference, SEC 2019, Lisbon, Portugal, June 25-27, 2019, Proceedings 34, pages 151–164. Springer, 2019.
[70] Chenglong Fu, Qiang Zeng, and Xiaojiang Du. Hawatcher: Semantics-aware anomaly detection for appified smart homes. In 30th USENIX Security Symposium (USENIX Security 21), 2021.
[71] Xinyu Fu, Jiani Zhang, Ziqiao Meng, and Irwin King. Magnn: Metapath aggregated graph neural network for heterogeneous graph embedding. In Proceedings of The Web Conference 2020, pages 2331–2341, 2020.
[72] Clement Fung, Chris JM Yoon, and Ivan Beschastnikh. The limitations of federated learning in sybil settings. In 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020), pages 301–316, 2020.
[73] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4):1–37, 2014.
[74] Wenbin Gao, Lei Zhang, Wenbo Huang, Fuhong Min, Jun He, and Aiguo Song. Deep neural networks for sensor-based human activity recognition using selective kernel convolution. IEEE Transactions on Instrumentation and Measurement, 70:1–13, 2021.
[75] Yang Gao, Yincheng Jin, Seokmin Choi, Jiyang Li, Junjie Pan, Lin Shu, Chi Zhou, and Zhanpeng Jin. Sonicface: Tracking facial expressions using a commodity microphone array. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(4):1–33, 2021.
[76] Zhihui Gao, Ang Li, Dong Li, Jialin Liu, Jie Xiong, Yu Wang, Bing Li, and Yiran Chen. Mom: Microphone based 3d orientation measurement. In 2022 21st ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pages 132–144. IEEE, 2022.
[77] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
[78] Google. Google assistant. https://assistant.google.com/, 2022.
[79] Michaela Götz, Suman Nath, and Johannes Gehrke. Maskit: Privately releasing user context streams for personalized mobile applications. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 289–300. ACM, 2012.
[80] Xiang Gu, Jian Sun, and Zongben Xu. Spherical space domain adaptation with robust pseudo-label loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9101–9110, 2020.
[81] Yu Gu, Xiang Zhang, Zhi Liu, and Fuji Ren. Wife: Wifi and vision based intelligent facial-gesture emotion recognition. arXiv preprint arXiv:2004.09889, 2020.
[82] Hanqing Guo, Chenning Li, Lingkun Li, Zhichao Cao, Qiben Yan, and Li Xiao. Nec: Speaker selective cancellation via neural enhanced ultrasound shadowing. In 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 355–366. IEEE, 2022.
[83] Meng-Hao Guo, Zheng-Ning Liu, Tai-Jiang Mu, and Shi-Min Hu.
Beyond self-attention: External attention using two linear layers for visual tasks. arXiv preprint arXiv:2105.02358, 2021.
[84] Omid Hajihassnai, Omid Ardakanian, and Hamzeh Khazaei. Obscurenet: Learning attribute-invariant latent representation for anonymizing sensor data. In Proceedings of the International Conference on Internet-of-Things Design and Implementation, pages 40–52, 2021.
[85] Feng Han, Lan Zhang, Hanwen Feng, Weiran Liu, and Xiangyang Li. Scape: Scalable collaborative analytics system on private database with malicious security. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 1740–1753. IEEE, 2022.
[86] Feng Han, Lan Zhang, Xuanke You, Guangjing Wang, and Xiang-Yang Li. Shad: Privacy-friendly shared activity detection and data sharing. In 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), pages 109–117. IEEE, 2019.
[87] Jianping He, Lin Cai, and Xinping Guan. Preserving data-privacy with added noises: Optimal estimation and privacy analysis. IEEE Transactions on Information Theory, 64(8):5677–5690, 2018.
[88] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[89] Xi He, Graham Cormode, Ashwin Machanavajjhala, Cecilia Procopiuc, and Divesh Srivastava. Dpt: differentially private trajectory synthesis using hierarchical reference systems. Proceedings of the VLDB Endowment, 8(11):1154–1165, 2015.
[90] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[91] Eran Hof, Amichai Sanderovich, Mohammad Salama, and Evyatar Hemo. Face verification using mmwave radar sensor. In 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pages 320–324. IEEE, 2020.
[92] Pengfei Hu, Hui Zhuang, Panneer Selvam Santhalingam, Riccardo Spolaor, Parth Pathak, Guoming Zhang, and Xiuzhen Cheng. Accear: Accelerometer acoustic eavesdropping with unconstrained vocabulary. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1530–1530. IEEE Computer Society, 2022.
[93] Justin Huang and Maya Cakmak. Supporting mental model accuracy in trigger-action programming. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 215–225, 2015.
[94] Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin R Benson. Combining label propagation and simple models out-performs graph neural networks. arXiv preprint arXiv:2010.13993, 2020.
[95] Shao-Lun Huang, Xiangxiang Xu, Lizhong Zheng, and Gregory W Wornell. An information theoretic interpretation to deep neural networks. In 2019 IEEE international symposium on information theory (ISIT), pages 1984–1988. IEEE, 2019.
[96] Xiao Huang, Qingquan Song, Fan Yang, and Xia Hu. Large-scale heterogeneous feature embedding. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3878–3885, 2019.
[97] IFTTT. Ifttt applets. https://ifttt.com/explore/, 2022.
[98] IFTTT. Ifttt applets. https://ifttt.com/explore/, 2022.
[99] Sergei Ivanov and Liudmila Prokhorenkova. Boost then convolve: Gradient boosting meets graph neural networks. arXiv preprint arXiv:2101.08543, 2021.
[100] Yusuke Iwasawa, Kotaro Nakayama, Ikuko Yairi, and Yutaka Matsuo.
Privacy issues regarding the application of dnns to activity-recognition using wearables and its countermeasures by use of adversarial training. In IJCAI, pages 1930–1936, 2017.
[101] Ankita Jain and Vivek Kanhangad. Investigating gender recognition in smartphones using accelerometer and gyroscope sensor readings. In 2016 international conference on computational techniques in information and communication technologies (ICCTICT), pages 597–602. IEEE, 2016.
[102] Wenchao Jiang and Zhaozheng Yin. Human activity recognition using wearable sensors by deep convolutional neural networks. In Proceedings of the 23rd ACM international conference on Multimedia, pages 1307–1310, 2015.
[103] Yupeng Jiang, Yong Li, Yipeng Zhou, and Xi Zheng. Sybil attacks and defense on differential privacy based federated learning. In 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pages 355–362. IEEE, 2021.
[104] Kleomenis Katevas. Sensingkit. https://www.sensingkit.org/, 2023.
[105] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local sgd on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 4519–4529. PMLR, 2020.
[106] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.
[107] Jinkyu Kim, Heonseok Ha, Byung-Gon Chun, Sungroh Yoon, and Sang K Cha. Collaborative analytics for data silos. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 743–754. IEEE, 2016.
[108] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[109] Palanivel A Kodeswaran, Ravi Kokku, Sayandeep Sen, and Mudhakar Srivatsa. Idea: A system for efficient failure management in smart iot environments. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 43–56, 2016.
[110] Sarah Martina Kolly, Roger Wattenhofer, and Samuel Welten. A personal touch: Recognizing users based on touch screen behavior. In Proceedings of the third international workshop on sensing applications on mobile phones, pages 1–5, 2012.
[111] Harold William Kuhn and Albert William Tucker. Contributions to the Theory of Games, volume 2. Princeton University Press, 1953.
[112] Abhishek Kumar, Prasanna Sattigeri, and Tom Fletcher. Semi-supervised learning with gans: Manifold invariance with improved inference. Advances in neural information processing systems, 30, 2017.
[113] Oscar D Lara and Miguel A Labrador. A survey on human activity recognition using wearable sensors. IEEE communications surveys & tutorials, 15(3):1192–1209, 2012.
[114] Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard, and Laurent Licata. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of experimental social psychology, 49(4):764–766, 2013.
[115] Ang Li, Yixiao Duan, Huanrui Yang, Yiran Chen, and Jianlei Yang. Tiprdc: task-independent privacy-respecting data crowdsourcing framework for deep learning with anonymized intermediate representations. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 824–832, 2020.
[116] Anran Li, Lan Zhang, Junhao Wang, Feng Han, and Xiangyang Li. Privacy-preserving efficient federated-learning model debugging. IEEE Transactions on Parallel and Distributed Systems, 2021.
[117] Anran Li, Lan Zhang, Junhao Wang, Juntao Tan, Feng Han, Yaxuan Qin, Nikolaos M Freris, and Xiang-Yang Li. Efficient federated-learning model debugging. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 372–383. IEEE, 2021.
[118] Chenning Li, Zhichao Cao, and Yunhao Liu. Deep ai enabled ubiquitous wireless sensing: A survey. ACM Computing Surveys (CSUR), 54(2):1–35, 2021.
[119] Chenning Li, Manni Liu, and Zhichao Cao. Wihf: enable user identified gesture recognition with wifi. In IEEE INFOCOM 2020-IEEE Conference on Computer Communications, pages 586–595. IEEE, 2020.
[120] Chenning Li, Xiao Zeng, Mi Zhang, and Zhichao Cao. Pyramidfl: A fine-grained client selection framework for efficient federated learning. In Proc. ACM Annu. Int. Conf. Mobile Comput. Netw., 2022.
[121] Dong Li, Jialin Liu, Sunghoon Ivan Lee, and Jie Xiong. Lasense: Pushing the limits of fine-grained activity sensing using acoustic signals. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(1):1–27, 2022.
[122] Kaiyu Li, Guoliang Li, Yong Wang, Yan Huang, Zitao Liu, and Zhongqin Wu. Crowdrl: an end-to-end reinforcement learning framework for data labelling. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 289–300. IEEE, 2021.
[123] Maosen Li, Siheng Chen, Ya Zhang, and Ivor Tsang. Graph cross networks with vertex infomax pooling. Advances in Neural Information Processing Systems, 33:14093–14105, 2020.
[124] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 106–115. IEEE, 2007.
[125] Qinbin Li, Yiqun Diao, Quan Chen, and Bingsheng He. Federated learning on non-iid data silos: An experimental study. arXiv preprint arXiv:2102.02079, 2021.
[126] Shan Li and Weihong Deng. Deep facial expression recognition: A survey. IEEE transactions on affective computing, 2020.
[127] Tao Li, Yexi Jiang, Chunqiu Zeng, Bin Xia, Zheng Liu, Wubai Zhou, Xiaolong Zhu, Wentao Wang, Liang Zhang, Jun Wu, et al. Flap: An end-to-end event log analysis platform for system management. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1547–1556, 2017.
[128] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020.
[129] Yong Li, Yipeng Zhou, Alireza Jolfaei, Dongjin Yu, Gaochao Xu, and Xi Zheng. Privacy-preserving federated learning framework based on chained secure multiparty computing. IEEE Internet of Things Journal, 8(8):6178–6186, 2020.
[130] Cihang Liu, Lan Zhang, Zongqian Liu, Kebin Liu, Xiangyang Li, and Yunhao Liu. Lasagna: towards deep hierarchical understanding and searching over mobile sensing data. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, pages 334–347. ACM, 2016.
[131] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.
[132] Meng Liu, Youzhi Luo, Limei Wang, Yaochen Xie, Hao Yuan, Shurui Gui, Haiyang Yu, Zhao Xu, Jingtun Zhang, Yi Liu, Keqiang Yan, Haoran Liu, Cong Fu, Bora M Oztekin, Xuan Zhang, and Shuiwang Ji. DIG: A turnkey library for diving into graph deep learning research. Journal of Machine Learning Research, 22(240):1–9, 2021. [133] Sicong Liu, Junzhao Du, Anshumali Shrivastava, and Lin Zhong. Privacy adversarial network: representation learning for mobile data privacy. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(4):1–18, 2019. [134] Wenyan Liu, Junhong Cheng, Xiaoling Wang, Xingjian Lu, and Jianwei Yin. Hybrid differential privacy based federated learning for internet of things. Journal of Systems Architecture, 124:102418, 2022. [135] Mingsheng Long, Yue Cao, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Transferable representation learning with deep adaptation networks. IEEE transactions on pattern analysis and machine intelligence, 41(12):3071–3085, 2018. [136] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st international conference on neural information processing systems, pages 4768–4777, 2017. [137] Dongsheng Luo, Wei Cheng, Dongkuan Xu, Wenchao Yu, Bo Zong, Haifeng Chen, and Xiang Zhang. Parameterized explainer for graph neural network. Advances in neural information processing systems, 33:19620–19631, 2020. [138] Mitja Luštrek and Boštjan Kaluža. Fall detection and activity recognition with machine learning. Informatica, 33(2), 2009. [139] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3, 2007. [140] Mohammad Malekzadeh, Richard G Clegg, Andrea Cavallaro, and Hamed Haddadi. Protecting sensory data against sensitive inferences. In Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems, pages 1–6, 2018. [141] Mohammad Malekzadeh, Richard G Clegg, Andrea Cavallaro, and Hamed Haddadi. Mobile sensor data anonymization. In Proceedings of the international conference on internet of things design and implementation, pages 49–58, 2019. [142] Mohammad Malekzadeh, Richard G. Clegg, Andrea Cavallaro, and Hamed Haddadi. Mobile sensor data anonymization. In Proceedings of the International Conference on Internet of Things Design and Implementation, IoTDI ’19, pages 49–58, New York, NY, USA, 2019. ACM. [143] Mohammad Malekzadeh, Richard G Clegg, and Hamed Haddadi. Replacement autoencoder: A privacy-preserving algorithm for sensory data analysis. In 2018 IEEE/ACM Third International Conference on Internet-of-Things Design and Implementation (IoTDI), pages 165–176. IEEE, 2018. [144] Wenguang Mao, Mei Wang, and Lili Qiu. Aim: Acoustic imaging on a mobile. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, pages 468–481, 2018. [145] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52–59. Springer, 2011. [146] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017. [147] Maryam Mehrnezhad, Ehsan Toreini, Siamak F Shahandashti, and Feng Hao.
Touchsignatures: identification of user touch actions and pins based on mobile sensor data via javascript. Journal of Information Security and Applications, 26:23–38, 2016. [148] Chuizheng Meng, Sirisha Rambhatla, and Yan Liu. Cross-node federated graph neural network for spatio-temporal data modeling. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1202–1211, 2021. [149] Yan Michalevsky, Dan Boneh, and Gabi Nakibly. Gyrophone: Recognizing speech from gyroscope signals. In 23rd USENIX Security Symposium (USENIX Security 14), pages 1053–1067, 2014. [150] Kato Mivule. Utilizing noise addition for data privacy, an overview. arXiv preprint arXiv:1309.3958, 2013. [151] Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. How transferable are neural networks in nlp applications? arXiv preprint arXiv:1603.06111, 2016. [152] Reiichiro Nakano. Scikit-plot. https://github.com/reiinakano/scikit-plot, 2017. [153] Chandrakana Nandi and Michael D Ernst. Automatic trigger generation for rule-based smart homes. In Proceedings of the 2016 ACM Workshop on Programming Languages and Analysis for Security, pages 97–102, 2016. [154] Dang Tu Nguyen, Chengyu Song, Zhiyun Qian, Srikanth V Krishnamurthy, Edward JM Colbert, and Patrick McDaniel. Iotsan: Fortifying the safety of iot systems. In Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies, pages 191–203, 2018. [155] Jingping Nie, Yigong Hu, Yuanyuting Wang, Stephen Xia, and Xiaofan Jiang. Spiders: Low-cost wireless glasses for continuous in-situ bio-signal acquisition and emotion recognition. In 2020 IEEE/ACM Fifth International Conference on Internet-of-Things Design and Implementation (IoTDI), pages 27–39. IEEE, 2020. [156] Tin Lay Nwe, Foo Say Wei, and Liyanage C De Silva. Speech based emotion classification. In Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology. TENCON 2001 (Cat. No. 01CH37239), volume 1, pages 297–301. IEEE, 2001. [157] OpenAI et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. [158] Atila Orhon, Michael Siracusa, and Aseem Wadhwa. Stable diffusion with core ml on apple silicon, 2022. [159] Charlie Osborne. Surveillance cameras sold on amazon infected with malware. https://www.zdnet.com/article/amazon-surveillance-cameras-infected-with-malware/, 2016. [160] Seyed Ali Osia, Ali Taheri, Ali Shahin Shamsabadi, Kleomenis Katevas, Hamed Haddadi, and Hamid R Rabiee. Deep private-feature extraction. IEEE Transactions on Knowledge and Data Engineering, 32(1):54–66, 2018. [161] Xiaomin Ouyang, Zhiyuan Xie, Jiayu Zhou, Jianwei Huang, and Guoliang Xing. Clusterfl: a similarity-aware federated learning system for human activity recognition. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 54–66, 2021. [162] Adam Paszke, Sam Gross, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. [163] Jayashree V Patil and Preeti Bailke. Real time facial expression recognition using realsense camera and ann. In 2016 International Conference on Inventive Computation Technologies (ICICT), volume 2, pages 1–6. IEEE, 2016. [164] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M.
Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. [165] Hao Peng, Haoran Li, Yangqiu Song, Vincent Zheng, and Jianxin Li. Differentially private federated knowledge graphs embedding. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1416–1425, 2021. [166] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999. [167] Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018. [168] Phillip E Pope, Soheil Kolouri, Mohammad Rostami, Charles E Martin, and Heiko Hoffmann. Explainability methods for graph convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10772–10781, 2019. [169] Thomas Prätzlich, Jonathan Driedger, and Meinard Müller. Memory-restricted multiscale dynamic time warping. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 569–573. IEEE, 2016. [170] Laavanya Rachakonda, Saraju P Mohanty, and Elias Kougianos. cstick: a calm stick for fall prediction, detection and control in the iomt framework. In Internet of Things. Technology and Applications: 4th IFIP International Cross-Domain Conference, IFIPIoT 2021, Virtual Event, November 4–5, 2021, Revised Selected Papers, pages 129–145. Springer, 2022. [171] Kasturi Rangan Raghavan, Supriyo Chakraborty, Mani Srivastava, and Harris Teague. Override: A mobile privacy framework for context-driven perturbation and synthesis of sensor data streams. In Proceedings of the Third International Workshop on Sensing Applications on Mobile Phones, pages 1–5, 2012. [172] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. [173] Zhongzheng Ren, Yong Jae Lee, and Michael S Ryoo. Learning to anonymize faces for privacy preserving action detection. In Proceedings of the european conference on computer vision (ECCV), pages 620–636, 2018. [174] Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 2(11):1019–1025, 1999. [175] Delphine Rommel, JL Nandrino, M Jeanne, R Logier, et al. Heart rate variability analysis as an index of emotion regulation processes: interest of the analgesia nociception index (ani). In 2012 Annual international conference of the IEEE engineering in medicine and biology society, pages 3432–3435. IEEE, 2012. [176] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015. [177] Nirupam Roy, Haitham Hassanieh, and Romit Roy Choudhury. Backdoor: Making microphones hear inaudible sounds. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 2–14, 2017. [178] Luana Ruiz, Luiz Chamon, and Alejandro Ribeiro. Graphon neural networks and the transferability of graph neural networks. Advances in Neural Information Processing Systems, 33, 2020. [179] Felix Sattler, Klaus-Robert Müller, and Wojciech Samek.
Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE transactions on neural networks and learning systems, 2020. [180] Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001. [181] Scrapy. Web crawling framework. https://scrapy.org/, 2022. [182] Gökhan Şengül, Murat Karakaya, Sanjay Misra, Olusola O Abayomi-Alli, and Robertas Damaševičius. Deep learning based fall detection using smartwatches for healthcare applications. Biomedical Signal Processing and Control, 71:103242, 2022. [183] Tak-Wai Shen, Hong Fu, Junkai Chen, WK Yu, CY Lau, WL Lo, and Zheru Chi. Facial expression recognition using depth map estimation of light field camera. In 2016 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), pages 1–4. IEEE, 2016. [184] Michael Sherman, Gradeigh Clark, Yulong Yang, Shridatt Sugrim, Arttu Modig, Janne Lindqvist, Antti Oulasvirta, and Teemu Roos. User-generated free-form gestures for authentication: Security and memorability. In Proceedings of the 12th annual international conference on Mobile systems, applications, and services, pages 176–189, 2014. [185] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. [186] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019. [187] Diksha Shukla and Vir V Phoha. Stealing passwords by observing hands movement. IEEE Transactions on Information Forensics and Security, 14(12):3086–3101, 2019. [188] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013. [189] SmartThings. Smartthings developer. https://smartthings.developer.samsung.com/docs/api-ref/capabilities.html, 2022. [190] SmartThings. Smartthings developer. https://smartthings.developer.samsung.com/docs/api-ref/capabilities.html, 2022. [191] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015. [192] Congzheng Song and Vitaly Shmatikov. Overlearning reveals sensitive attributes. In 8th International Conference on Learning Representations, ICLR 2020, 2020. [193] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. [194] Xingzhe Song, Kai Huang, and Wei Gao. Facelistener: Recognizing human facial expressions via acoustic sensing on commodity headphones. In 21st ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), 2022. [195] Spacy. Industrial-strength natural language processing. https://spacy.io/, 2022. [196] Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In International Conference on Learning Representations, 2020. [197] Milijana Surbatovich, Jassim Aljuraidan, Lujo Bauer, Anupam Das, and Limin Jia.
Some recipes can do more than spoil your appetite: Analyzing the security and privacy risks of ifttt recipes. In Proceedings of the 26th International Conference on World Wide Web, pages 1501–1510, 2017. [198] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002. [199] Canh T Dinh, Nguyen Tran, and Josh Nguyen. Personalized federated learning with moreau envelopes. Advances in Neural Information Processing Systems, 33:21394–21405, 2020. [200] Yuan Tian, Nan Zhang, Yueh-Hsun Lin, XiaoFeng Wang, Blase Ur, Xianzheng Guo, and Patrick Tague. Smartauth: User-centered authorization for the internet of things. In 26th USENIX Security Symposium (USENIX Security 17), 2017. [201] Marcos Tileria, Jorge Blasco, and Guillermo Suarez-Tangil. WearFlow: Expanding information flow analysis to companion apps in Wear OS. In 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020), pages 63–75, 2020. [202] Yongxin Tong, Xuchen Pan, Yuxiang Zeng, Yexuan Shi, Chunbo Xue, Zimu Zhou, Xiaofei Zhang, Lei Chen, Yi Xu, Ke Xu, et al. Hu-fu: efficient and secure spatial queries over data federation. Proceedings of the VLDB Endowment, 15(6):1159, 2022. [203] Rahmadi Trimananda, Seyed Amir Hossein Aqajari, Jason Chuang, Brian Demsky, Guoqing Harry Xu, and Shan Lu. Understanding and automatically detecting conflicting interactions between smart home iot applications. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1215–1227, 2020. [204] Yu-Chih Tung, Duc Bui, and Kang G Shin. Cross-platform support for rapid development of mobile acoustic sensing applications. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, pages 455–467, 2018. [205] Blase Ur, Melwyn Pak Yong Ho, Stephen Brawner, Jiyun Lee, Sarah Mennicken, Noah Picard, Diane Schulze, and Michael L Littman. Trigger-action programming in the wild: An analysis of 200,000 ifttt recipes. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 3227–3231, 2016. [206] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. [207] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [208] Dhruv Verma, Sejal Bhalla, Dhruv Sahnan, Jainendra Shukla, and Aman Parnami. Expressear: Sensing fine-grained facial expressions with earables. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(3):1–28, 2021. [209] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008. [210] Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 2023. [211] Binghui Wang, Ang Li, Hai Li, and Yiran Chen. Graphfl: A federated learning framework for semi-supervised node classification on graphs. arXiv preprint arXiv:2012.04187, 2020.
[212] Chen Wang, Xiaonan Guo, Yan Wang, Yingying Chen, and Bo Liu. Friend or foe? your wearable devices reveal your personal pin. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pages 189–200, 2016. [213] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2017. [214] Guangjing Wang, Hanqing Guo, Anran Li, Xiaorui Liu, and Qiben Yan. Federated iot interaction vulnerability analysis. In 2023 IEEE 39th International Conference on Data Engineering (ICDE). IEEE, 2023. [215] Guangjing Wang, Nikolay Ivanov, Bocheng Chen, Qi Wang, ThanhVu Nguyen, and Qiben Yan. Graph learning for interactive threat detection in heterogeneous smart home rule data. Proceedings of the ACM on Management of Data, 1(1):1–27, 2023. [216] Guangjing Wang, Qiben Yan, Shane Patrarungrong, Juexing Wang, and Huacheng Zeng. Facer: Contrastive attention based expression recognition via smartphone earpiece speaker. In IEEE INFOCOM 2023-IEEE Conference on Computer Communications, pages 1–10. IEEE, 2023. [217] Guangjing Wang, Lan Zhang, Zhi Yang, and Xiang-Yang Li. Socialite: Social activity mining and friend auto-labeling. In 2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC), pages 1–8. IEEE, 2018. [218] Haiming Wang, Zhikun Zhang, Tianhao Wang, Shibo He, Michael Backes, Jiming Chen, and Yang Zhang. PrivTrace: Differentially Private Trajectory Synthesis by Adaptive Markov Model. In USENIX Security, 2023. [219] Hao Wang and Zhengquan Xu. Cts-dp: Publishing correlated time-series data via differential privacy. Knowledge-Based Systems, 122:167–179, 2017. [220] Junhao Wang, Lan Zhang, Anran Li, Xuanke You, and Haoran Cheng. Efficient participant contribution evaluation for horizontal and vertical federated learning. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 911–923. IEEE, 2022. [221] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019. [222] Qi Wang, Pubali Datta, Wei Yang, Si Liu, Adam Bates, and Carl A Gunter. Charting the attack surface of trigger-action iot platforms. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1439–1453, 2019. [223] Qi Wang, Wajih Ul Hassan, Adam Bates, and Carl Gunter. Fear and logging in the internet of things. In Network and Distributed Systems Symposium, 2018. [224] Yuanda Wang, Hanqing Guo, and Qiben Yan. Ghosttalk: Interactive attack on smartphone voice system through power line. arXiv preprint arXiv:2202.02585, 2022. [225] Yue Wang, Xintao Wu, and Donghui Hu. Using randomized response for differential privacy preserving data collection. In EDBT/ICDT Workshops, volume 1558, 2016. [226] Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H Yang, Farhad Farokhi, Shi Jin, Tony QS Quek, and H Vincent Poor. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020. [227] David H Wolpert. The lack of a priori distinctions between learning algorithms.
Neural computation, 8(7):1341–1390, 1996. [228] Feng Xia, Ke Sun, Shuo Yu, Abdul Aziz, Liangtian Wan, Shirui Pan, and Huan Liu. Graph learning: A survey. IEEE Transactions on Artificial Intelligence, 2(2):109–127, 2021. [229] Han Xie, Jing Ma, Li Xiong, and Carl Yang. Federated graph classification over non-iid graphs. Advances in Neural Information Processing Systems, 34, 2021. [230] Yadong Xie, Fan Li, Yue Wu, Huijie Chen, Zhiyuan Zhao, and Yu Wang. Teethpass: Dental occlusion-based user authentication via in-ear acoustic sensing. In IEEE INFOCOM 2022-IEEE Conference on Computer Communications, pages 1789–1798. IEEE, 2022. [231] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018. [232] Xiangxiang Xu, Shao-Lun Huang, Lizhong Zheng, and Gregory W Wornell. An information theoretic interpretation to deep neural networks. Entropy, 24(1):135, 2022. [233] Xiangyu Xu, Jiadi Yu, Yingying Chen, Yanmin Zhu, Linghe Kong, and Minglu Li. Breathlistener: Fine-grained breathing monitoring in driving environments utilizing acoustic signals. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, pages 54–66, 2019. [234] Qiben Yan, Kehai Liu, Qin Zhou, Hanqing Guo, and Ning Zhang. Surfingattack: Interactive hidden attack on voice assistants using ultrasonic guided waves. In Network and Distributed Systems Security (NDSS) Symposium, 2020. [235] Limin Yang, Wenbo Guo, Qingying Hao, Arridhana Ciptadi, Ali Ahmadzadeh, Xinyu Xing, and Gang Wang. CADE: Detecting and explaining concept drift samples for security applications. In 30th USENIX Security Symposium (USENIX Security 21), pages 2327–2344, 2021. [236] Lin Yang, Junjie Chen, Zan Wang, Weijing Wang, Jiajun Jiang, Xuyuan Dong, and Wenbin Zhang. Semi-supervised log-based anomaly detection via probabilistic label estimation. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pages 1448–1460. IEEE, 2021. [237] Doguhan Yeke, Muhammad Ibrahim, Güliz Seray Tuncay, Habiba Farrukh, Abdullah Imran, Antonio Bianchi, and Z Berkay Celik. Wear’s my data? understanding the cross-device runtime permission model in wearables. In 2024 IEEE Symposium on Security and Privacy (SP), pages 77–77. IEEE Computer Society, 2023. [238] Rex Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. Gnnexplainer: Generating explanations for graph neural networks. Advances in neural information processing systems, 32:9240, 2019. [239] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792, 2014. [240] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5693–5700, 2019. [241] Hao Yuan, Haiyang Yu, Jie Wang, Kang Li, and Shuiwang Ji. On explainability of graph neural networks via subgraph explorations. In International Conference on Machine Learning, pages 12241–12252. PMLR, 2021. [242] Mu Yuan, Lan Zhang, Xiang-Yang Li, and Hui Xiong. Comprehensive and efficient data labeling via adaptive model scheduling. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1858–1861. IEEE, 2020. [243] Ming Zeng, Le T Nguyen, Bo Yu, Ole J Mengshoel, Jiang Zhu, Pang Wu, and Joy Zhang.
Convolutional neural networks for human activity recognition using mobile sensors. In 6th international conference on mobile computing, applications and services, pages 197–205. IEEE, 2014. [244] Anna Zhadan. Report: your smart home devices may suffer from 10,000 hacking attempts every week. https://cybernews.com/security/report-your-smart-home-devices-may-suffer-from-10000-hacking-attempts-every-week/, 2021. [245] Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu. Dolphinattack: Inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pages 103–117, 2017. [246] Xinyang Zhang, Shouling Ji, and Ting Wang. Differentially private releasing via deep generative model (technical report). arXiv preprint arXiv:1801.01594, 2018. [247] Zhikun Zhang, Tianhao Wang, Ninghui Li, Shibo He, and Jiming Chen. CALM: Consistent Adaptive Local Marginal for Marginal Release under Local Differential Privacy. In ACM CCS, 2018. [248] Zhikun Zhang, Tianhao Wang, Ninghui Li, Jean Honorio, Michael Backes, Shibo He, Jiming Chen, and Yang Zhang. PrivSyn: Differentially Private Data Synthesis. In USENIX Security, 2021. [249] Jianan Zhao, Xiao Wang, Chuan Shi, Binbin Hu, Guojie Song, and Yanfang Ye. Heterogeneous graph structure learning for graph neural networks. In 35th AAAI Conference on Artificial Intelligence (AAAI), 2021. [250] Mingmin Zhao, Fadel Adib, and Dina Katabi. Emotion recognition using wireless signals. In Proceedings of the 22nd annual international conference on mobile computing and networking, pages 95–108, 2016. [251] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018. [252] Tianhang Zheng and Baochun Li. Infocensor: An information-theoretic framework against sensitive attribute inference and demographic disparity. In Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security, pages 437–451, 2022. [253] Yue Zheng, Yi Zhang, Kun Qian, Guidong Zhang, Yunhao Liu, Chenshu Wu, and Zheng Yang. Zero-effort cross-domain gesture recognition with wi-fi. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, pages 313–325, 2019. [254] Bing Zhou, Jay Lohokare, Ruipeng Gao, and Fan Ye. Echoprint: Two-factor authentication using acoustics and vision on smartphones. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 321–336, 2018. [255] Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419, 2023. [256] Chuanxin Zhou, Yi Sun, Degang Wang, and Qi Gao. Fed-fi: Federated learning malicious model detection method based on feature importance. Security and Communication Networks, 2022, 2022. [257] Pengpeng Zhou, Yang Wang, Zhenyu Li, Xin Wang, Gareth Tyson, and Gaogang Xie. Logsayer: Log pattern-driven cloud component anomaly diagnosis with machine learning. In 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS), pages 1–10. IEEE, 2020. [258] Zhijun Zhou, Yingtian Zhang, Xiaojing Yu, Panlong Yang, Xiang-Yang Li, Jing Zhao, and Hao Zhou. Xhar: Deep domain adaptation for human activity recognition with smart devices. In 2020 17th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), pages 1–9. IEEE, 2020.
[259] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.