TOWARD PRIVATE, SECURE, AND ROBUST AI-ENABLED VOICE SERVICES

By

Hanqing Guo

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2024

ABSTRACT

Voice, as a primary way for people to communicate with each other and interact with computers and smart devices, is expected to be trustworthy and reliable. For example, modern authentication systems use voice as a biometric to verify a user's identity, and users give voice commands to control smart devices via speech-to-text services. Compared to other biometrics such as iris, fingerprint, and face ID, voice biometrics offer high usability because they require no complicated hardware other than a microphone to support authentication. In addition, voice biometrics can be adapted for remote call authentication. Furthermore, voice serves as a crucial interface for humans to interact with smart devices, representing the most intuitive method for giving commands to artificial intelligence (AI) agents. The potential for smart devices and robots to comprehend human speech in the future holds great promise. However, recent studies have demonstrated the vulnerabilities of using voice interfaces to communicate, conduct speaker authentication, and deliver messages to smart devices. This dissertation aims to introduce the background of AI-enabled voice services; discover the vulnerabilities of modern voice models and systems; understand the root causes of these vulnerabilities; and provide security solutions to safeguard voice services. First, we focus on speaker authentication security. In particular, we propose a secure and robust speaker verification system called SuperVoice. By examining the high-frequency energy in human speech, we find characteristics that distinguish different persons, as well as humans from machines. Exploiting this high-frequency energy, SuperVoice enhances speaker verification performance and defends against machine-played attacks such as replay attacks, adversarial attacks, and inaudible attacks. Moreover, we propose a backdoor attack called MasterKey against speaker authentication systems. Compared to previous attacks, we focus on a real-world practical setting where the attacker possesses no knowledge of the intended victim. Second, we explore speech recognition security. Specifically, we design a new adversarial attack named SpecPatch to attack vulnerable speech recognition models. This attack alters the speech recognition model output by injecting a short, imperceptible, noise-like sound. Compared to previous adversarial audio attacks, SpecPatch shows strong resistance to different types of distortion and succeeds even when the user is present. Furthermore, we propose PhantomSound, a query-efficient black-box attack against commercial speech recognition services/APIs/voice assistants. Different from existing black-box adversarial attacks on voice assistants, PhantomSound leverages a decision-based attack to produce effective adversarial audio and reduces the number of queries by optimizing the gradient estimation. We demonstrate the danger of PhantomSound on commercial speech recognition services and off-the-shelf smart voice assistants. Third, we investigate voice privacy protection. To address the privacy leakage issue in voice communication, we create a system, called NEC, that uses an AI model to selectively jam a user's voice from an unauthorized recorder.
NEC transmits speaker-specified noise via an inaudible channel to jam only the target user's sound. We successfully implemented NEC and demonstrated that it can protect the user's sound from being recorded. This dissertation comprehensively addresses the prevalent challenges and vulnerabilities in voice-enabled services. In an age where voice-enabled devices are becoming ubiquitous in homes and public spaces, ensuring the security of these devices is paramount. Our research helps safeguard the privacy and safety of the general public, who are often the targets of security breaches. In conclusion, our comprehensive analysis and proactive solutions to the challenges in AI-enabled voice interaction systems represent a leap forward. We offer a security perspective in a field that is critical to the technological advancement of our society. Our contributions may lay the groundwork for safer, more secure voice AI interactions, benefiting both the security community and society as a whole.

Copyright by HANQING GUO 2024

To Nan and Leo, for their love and support.

ACKNOWLEDGMENTS

I would like to express my heartfelt gratitude toward my advisor Dr. Li Xiao. I am grateful for the opportunity she gave me to join her lab from the beginning, for her patience with me, and for the courage she gave me when I encountered difficulties in my research. I am even more thankful for her selfless help to my family during the COVID period. She is undoubtedly a great mentor, not only in academics and life but also in guiding me on the path of life. I would also like to thank my advisor, Dr. Qiben Yan. I am grateful for his guidance, correction, and instruction on my research direction and methodology. I have learned from him how to build up my research taste and research vision. He taught me how to be professional, responsible, and creative during my PhD studies. I appreciate his patience in mentoring me, his tolerance for my mistakes, and his support and backup behind the screen for every paper submission. I am deeply grateful for his guidance and will carry his attitude towards research into my future work. Thank you, Dr. Xiao and Dr. Yan, for believing in me and empowering me to believe in myself! I am thankful to Dr. Zhichao Cao for serving on my Qualifying and Guidance committees, offering multiple opportunities for academic growth, and helping me with faculty job applications. I am thankful to Dr. Eric Hunter for his service on my Guidance committee and for thoughtful feedback from his speech expertise on my research. I am thankful to my past and present colleagues from eLANS Lab and SEIT Lab — Dr. Nikolay Ivanov, Bocheng Chen, Guangjing Wang, Yuanda Wang, Jianzhi Lou, Dr. Lingkun Li, Dr. Li Liu, Dr. Xiao Zhang, James Mariani, Kanishka Wijewardena, Griffin Klevering, Yiwen Hu, and Yujin Ki. I am thankful to the external research collaborators and mentors that I had the honor to work with: Dr. Yunhao Liu, Dr. Jiliang Wang, Shuai Tong (Tsinghua University), Chenning Li (Massachusetts Institute of Technology), Dr. Junfeng Guo (The University of Maryland), Dr. Xun Chen (Samsung Research America), Dr. Ning Zhang (Washington University in St. Louis), Dr. Mi Zhang (Ohio State University), Dr. Qiaona Hu, Dr. Gabriel Passamani Andrade, and Dr. Xiao Zeng (Amazon). I am grateful to many faculty members from Michigan State University for helping me navigate academia: Dr. Kristen Johnson, Dr. Sandeep Kulkarni, Dr. Manni Liu, Dr. Philip McKinley, Dr. Joshua Nahum, Dr. Yiying Tong, Dr. Eric Torng, Dr. Guan-Hua Tu,
Dr. Huacheng Zeng, and Dr. Jiayu Zhou. Most importantly, I am thankful to my wife Nan Zhang, my son Leo, my mother Hongxia, and my father Hongjian — for their love, support, patience, and for believing in me.

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
  1.1: Research Background
  1.2: Contribution of This Dissertation
  1.3: Organization
CHAPTER 2: SPEAKER VERIFICATION USING ULTRASOUND ENERGY IN HUMAN SPEECH
  2.1: Introduction
  2.2: Background
  2.3: System Design
  2.4: Evaluation
  2.5: Discussion
  2.6: Related Work
  2.7: Summary
CHAPTER 3: PRACTICAL BACKDOOR ATTACK AGAINST SPEAKER VERIFICATION SYSTEMS
  3.1: Introduction
  3.2: Background
  3.3: System Design
  3.4: Evaluation
  3.5: Discussion
  3.6: Related Work
  3.7: Summary
CHAPTER 4: HUMAN-IN-THE-LOOP ADVERSARIAL AUDIO SPECTROGRAM PATCH ATTACK ON SPEECH RECOGNITION
  4.1: Introduction
  4.2: Background
  4.3: Design Overview
  4.4: System Design
  4.5: Evaluation
  4.6: Discussion
  4.7: Summary
CHAPTER 5: BLACK-BOX, QUERY-EFFICIENT AUDIO ADVERSARIAL ATTACK VIA SPLIT-SECOND PHONEME INJECTION
  5.1: Introduction
  5.2: Background and Preliminary Study
  5.3: Attack Design
  5.4: Evaluation
  5.5: Discussion
  5.6: Related Work
  5.7: Summary
CHAPTER 6: SPEAKER SELECTIVE CANCELLATION VIA NEURAL ENHANCED ULTRASOUND SHADOWING
  6.1: Introduction
  6.2: Background of Vocal System
  6.3: NEC System Design
  6.4: Implementation
  6.5: Evaluation
  6.6: Discussion
  6.7: Related Work
  6.8: Summary
CHAPTER 7: CONCLUSION AND FUTURE WORK
  7.1: Summary of Contributions
  7.2: Limitations and Discussion
  7.3: Future Work
BIBLIOGRAPHY

CHAPTER 1: INTRODUCTION

Voice is a primary way for people to interact with friends, computers, and smart devices. People talk to each other to exchange information; users talk to AI agents or smart devices to control them via speech; and cloud services use voice traits to authenticate their users. With the prevalent usage of voice in modern communication, the security research community has started to worry about the safety of using voice in multiple scenarios. For example, an adversary can bypass the speaker authentication system and then access the victim's personal information and manipulate the system (mobile phone, customer account, or smart speaker). In another case, the attacker can breach the speech recognition model and thereby mislead smart speakers or speech-to-text services into producing harmful output (e.g., controlling the smart speaker to open a door, or generating a toxic response from speech understanding services). In terms of voice privacy, the attacker can play the role of an eavesdropper, stealing the victim's voice in public areas and then applying voice cloning techniques to launch a speech synthesis attack. Figure 1.1 depicts the research focus. On the left, it shows that modern AI-enabled voice services provide functionalities such as speaker authentication and speech recognition to a smart agent (robot) to enable secure and efficient communication between humans and smart devices. On the right, the adversary may break this functionality with different attacks: impersonating the legitimate user; giving malicious speech commands to control the smart agent; and eavesdropping on private conversations to expose the privacy of the victim. The goal of my research is twofold.
First, we provide security solutions to safeguard the functionalities of modern voice services; second, we discover potential threats to existing services to alert users and product manufacturers. We focus on three major voice-related tasks: speaker authentication, speech recognition, and privacy protection. In the rest of this chapter, I introduce the research background, my contributions, and the organization of the dissertation.

Figure 1.1: Overview. AI-enabled voice services provide speech functionality (authentication, recognition), which we safeguard; attacks (impersonation, mis-transcription, eavesdropping) expose the potential threats.

1.1: Research Background

1.1.1: Speaker Authentication

Modern devices increasingly adopt biometric technologies for user authentication. Among various types of human biometrics, such as fingerprint, face, and iris, voice biometrics demonstrate great benefits in their high usability, convenience, and security. Speaker Verification (SV) systems commonly use voice biometrics to automatically accept or reject a voice input based on speaker models stored on smart devices or in the cloud. (This dissertation treats speaker verification and speaker authentication as interchangeable terms.) Nowadays, all the popular voice assistants, such as Siri, Alexa, and Google Assistant, have integrated SV algorithms for certain wake words (e.g., "Hey, Siri", "Ok, Google"). A more appealing approach, called text-independent speaker verification, can accurately and efficiently verify arbitrary utterances from a target speaker based on a limited set of enrolled sentences. Recently, security researchers have demonstrated the susceptibility of SV systems to voice mimicry attacks and replay attacks, where the attackers imitate/synthesize victims' voices or record/replay them to bypass the SV systems [10, 102, 110, 192]. As the number of sensitive applications (e.g., banking [171]) of voice assistants grows, practical SV systems aim to achieve not only high accuracy in text-independent speaker verification but also high efficacy in defending against spoofing attacks under a limited time budget. Limitations: Existing speaker authentication models suffer from low accuracy, high computational cost, or vulnerability to adversarial attacks such as re-edited speech or AI-synthesized speech. Our research contributes to revealing the vulnerabilities of SV systems and proposes new efficient and effective speaker authentication approaches.

1.1.2: Speech Recognition

Speech is a major interface for humans to communicate with an intelligent agent. Voice communication is a human-computer interaction approach that enables hands-free operation and offers opportunities for visually impaired users. Recently, with the thriving development of Artificial Intelligence (AI) and deep learning models, the performance of Automatic Speech Recognition (ASR) has improved significantly, resulting in a growing product market. For example, tech companies have developed online ASR systems and provided those services to the public, including Amazon Transcribe [14], Google Cloud Speech-to-Text [63], IBM Watson Speech to Text [97], and Microsoft Azure Speech Service [125]. Furthermore, they have also integrated their ASR APIs into Intelligent Voice Control (IVC) devices to offer voice assistant services (e.g., Siri [164], Google Assistant [61], or smart speaker systems such as Google Home [62] and Amazon Echo [13]).
In addition, more and more companies deliver their customer service using intelligent voice systems, which are empowered by ASR models to understand customers' questions and improve the efficiency of customer support. However, with the increasing presence of ASR systems and IVC devices in private spaces, people have started to worry about the security and privacy of these systems. For example, a hacked device is capable of recording private conversations; collecting and sharing private data; and controlling all the connected IoT devices in smart homes [37, 158]. Researchers have demonstrated that ASR systems can become vulnerable to a wide variety of attacks. For instance, inaudible commands can be injected through ultrasound [141, 210], even across different transmission media, such as object surfaces [200], light [158], etc. Besides these physical attacks, recent studies also exploit the discrepancies between the human ear and feature extraction algorithms to launch signal processing attacks [4, 5]. Despite the aggravating threats, these new attacks can be defeated by integrating additional hardware [209] or extra signal processing procedures (e.g., voice activity detection, guard signals) [4, 90]. Unlike the aforementioned attacks, the adversarial attack targets the deep neural network (DNN), i.e., the computational core of an ASR system, and therefore poses a major threat to modern ASR systems. Limitations: Existing speech recognition services are vulnerable to inaudible commands such as ultrasound, and to imperceptible attacks such as adversarial music, transformed signals that humans cannot perceive, or backdoor commands. We investigate the limitations of existing attacks and expose new threats (such as robust adversarial patch attacks and query-efficient black-box attacks) to the safety of speech recognition systems.

1.1.3: Privacy Protection

Voice recording is an essential information-sharing approach that benefits many aspects of our daily lives. Nowadays, smartphones and Internet-of-Things (IoT) devices equipped with microphones allow people to record voices anytime and anywhere. However, the growing presence of unauthorized microphones has led to numerous incidents of privacy violations. Off-the-shelf microphones are widely available and can be deployed to steal users' biometric traits (e.g., voiceprints) or private conversations. Thus, unauthorized voice recording has become a serious societal issue [116]. For example, the adversary can record private conversations for personal usage and cause privacy leakage. Moreover, the adversary can conduct a speaker conversion attack [48, 155] to produce more speech samples that sound like the recorded victim. Besides, the adversary can separate the recorded speech into multiple clips and perform a speech synthesis attack [150]. Most recently, unauthorized recordings can be further used for attacking speaker verification models, e.g., via replay attacks [110, 192] or adversarial attacks [31]. Thus, preventing unauthorized recording is a critical research problem for securing voice communication. Limitations: Prior efforts defend against unauthorized recording with a speech jammer, a device that continuously generates ambient noise via ultrasound. However, this approach leaves two critical concerns: First, all the surrounding microphones will be affected, leading to denial of service for unrelated users.
Second, the ambient noise is usually generated with a specific noise pattern; therefore, the eavesdropper can easily remove it and recover the victim's speech. Our study contributes the design of a new system that achieves speaker-specified jamming, preventing unauthorized recording without affecting others.

1.2: Contribution of This Dissertation

1.2.1: Overview of This Dissertation

This dissertation encompasses five of my publications, which focus on speaker authentication, recognition, and privacy concerns. Each paper's research emphasis, particularly in the context of attack and defense strategies, is concisely summarized. Table 1.1 provides a comprehensive overview of these publications, detailing their specific research topics. In the realm of speaker authentication, our significant contribution is the development of SuperVoice, a robust speaker verification system. Additionally, we identify a major vulnerability in speaker verification models and demonstrate the feasibility of large-scale backdoor attacks, a technique we refer to as MasterKey. In the field of speech recognition, our research introduces two innovative attack methodologies designed to deceive speech-to-text models. The first, SpecPatch, is tailored for scenarios involving human interaction, while the second, PhantomSound, is optimized for black-box environments. Addressing privacy issues, we present NEC, an intelligent jamming device that effectively protects against unauthorized voice recordings.

Table 1.1: Overview of the scope of this dissertation. Legend: ● — primary focus; # — not addressed. †The author of this dissertation (in bold) is the main contributor to all these papers; *indicates equal contribution.
  AUTHENTICATION (Chapter 2; Attack: #, Defense: ●): H. Guo, Q. Yan, N. Ivanov, Y. Zhu, L. Xiao, E. J. Hunter, "SuperVoice: Text-Independent Speaker Verification Using Ultrasound Energy in Human Speech" [73], ACM ASIA CCS 2022.
  AUTHENTICATION (Chapter 3; Attack: ●, Defense: #): H. Guo, X. Chen, J. Guo, L. Xiao, Q. Yan, "MASTERKEY: Practical Backdoor Attack Against Speaker Verification Systems" [69], ACM MobiCom 2023.
  RECOGNITION (Chapter 4; Attack: ●, Defense: #): H. Guo, Y. Wang, N. Ivanov, L. Xiao, Q. Yan, "SpecPatch: Human-In-The-Loop Adversarial Audio Spectrogram Patch Attack on Speech Recognition" [72], ACM CCS 2022.
  RECOGNITION (Chapter 5; Attack: ●, Defense: #): H. Guo, G. Wang, Y. Wang, B. Chen, Q. Yan, L. Xiao, "PhantomSound: Black-Box, Query-Efficient Audio Adversarial Attack via Split-Second Phoneme Injection" [71], RAID 2023.
  PRIVACY (Chapter 6; Attack: #, Defense: ●): H. Guo*, C. Li*, L. Li, Z. Cao, Q. Yan, L. Xiao, "NEC: Speaker Selective Cancellation via Neural Enhanced Ultrasound Shadowing" [70]†, IEEE DSN 2022.

1.2.2: Contribution to Speaker Authentication Security

SuperVoice: We propose SuperVoice, a speaker verification system that provides secure speaker authentication by leveraging ultrasound features in human speech. Compared to existing speaker verification techniques, which distinguish individual speakers via spectrographic features extracted from the audible frequency range of voice commands, we explore a new direction of human voice research by scrutinizing the unique characteristics of human speech in the ultrasound frequency band. Our research indicates that the high-frequency ultrasound components (e.g., speech fricatives) from 20 to 48 kHz can significantly enhance the security and accuracy of speaker verification. Our SuperVoice system uses a two-stream DNN architecture with a feature fusion mechanism to generate distinctive speaker models.
To test the system, we create a speech dataset with 12 hours of audio (8,950 voice samples) from 127 participants. In addition, we create a second spoofed voice dataset to evaluate its security. To balance between controlled recordings and real-world applications, the audio recordings are collected in two quiet rooms by 8 different recording devices, including 7 smartphones and an ultrasound microphone. Our evaluation shows that SuperVoice achieves a 0.58% equal error rate in the speaker verification task, which reduces the best equal error rate of the existing systems by 86.1%. SuperVoice only takes 120 ms to test an incoming utterance, outperforming all existing speaker verification systems. Moreover, within 91 ms of processing time, SuperVoice achieves a 0% equal error rate in detecting replay attacks launched by 5 different loudspeakers. Finally, we demonstrate that SuperVoice can be used on retail smartphones by integrating an off-the-shelf ultrasound microphone.

MasterKey: We propose a new threat toward speaker authentication systems. The attack, called MasterKey, is a backdoor attack that compromises many SV models. Different from previous attacks, we focus on a real-world practical setting where the attacker possesses no knowledge of the intended victim. To design MasterKey, we investigate the limitations of existing poisoning attacks against unseen targets. Then, we optimize a universal backdoor that is capable of attacking arbitrary targets. Next, we embed the speaker's characteristics and semantic information into the backdoor, making it imperceptible. Finally, we estimate the channel distortion and integrate it into the backdoor. We validate our attack on 6 popular SV models. Specifically, we poison a total of 53 models and use our trigger to attack 16,430 enrolled speakers, composed of 310 target speakers enrolled in 53 poisoned models. Our attack achieves a 100% attack success rate with a 15% poison rate. When the poison rate is decreased to 3%, the attack success rate remains around 50%. We validate our attack in 3 real-world scenarios, and successfully demonstrate the attack through both over-the-air and over-the-telephony-line scenarios.

1.2.3: Contribution to Speech Recognition Security

SpecPatch: We propose SpecPatch, the first human-in-the-loop adversarial audio attack on automated speech recognition (ASR) systems. Existing audio adversarial attacks assume that users cannot notice the adversarial audio, and hence allow the successful delivery of the crafted adversarial examples or perturbations. However, in a practical attack scenario, the users of intelligent voice-controlled systems (e.g., smartwatches, smart speakers, smartphones) maintain constant vigilance for suspicious voices, especially when they are delivering their own voice commands. Once the user is alerted by suspicious audio, they tend to correct the falsely recognized commands by interrupting the adversarial audio and giving more powerful voice commands to overshadow the malicious voice. This makes the existing attacks ineffective in the typical scenario where the user's interaction and the delivery of adversarial audio coincide. To truly enable an imperceptible and robust adversarial attack and handle the possible arrival of user interruption, we design SpecPatch, a practical voice attack that uses a sub-second audio patch signal to deliver an attack command and utilizes periodic noise to break down the communication between the user and ASR systems.
We analyze the forward and backward process of the CTC (Connectionist Temporal Classification) loss and exploit the weakness of CTC to achieve our attack goal. Compared with the existing attacks, we extend the attack impact length (i.e., the length of the attack target command) by 287%. Furthermore, we show that our attack achieves a 100% success rate in both over-the-line and over-the-air scenarios amid user intervention.

PhantomSound: Compared to SpecPatch, which only works in the white-box setting, we propose PhantomSound, a query-efficient black-box attack against voice assistants. Existing black-box adversarial attacks on voice assistants either apply substitution models or leverage the intermediate model output to estimate the gradients for crafting adversarial audio samples. However, these attack approaches require a significant number of queries and a lengthy training stage. PhantomSound leverages a decision-based attack to produce effective adversarial audio, and reduces the number of queries by optimizing the gradient estimation. In the experiments, we perform our attack against 4 different speech-to-text APIs under 3 real-world scenarios to demonstrate the real-time attack impact. The results show that PhantomSound is practical and robust in attacking 5 popular commercial voice-controllable devices over the air, and can bypass 3 liveness detection mechanisms with a >95% success rate. The benchmark results show that PhantomSound can generate adversarial examples and launch the attack in a few minutes. We significantly reduce the number of queries, by 93.1% (untargeted) and 65.5% (targeted), compared with the state-of-the-art black-box attacks.

1.2.4: Contribution to Speech Privacy Protection

NEC: To safeguard daily conversations against privacy leakage, we propose NEC (Neural Enhanced Cancellation), a defense mechanism that prevents unauthorized microphones from capturing a target speaker's voice. Compared with the existing scrambling-based audio cancellation approaches, NEC can selectively remove a target speaker's voice from mixed speech without causing interference to others. Specifically, for a target speaker, we design a Deep Neural Network (DNN) model to extract high-level, speaker-specific but utterance-independent vocal features from his/her reference audios. When the microphone is recording, the DNN generates a shadow sound to cancel the target voice in real time. Moreover, we modulate the audible shadow sound onto an ultrasound frequency, making it inaudible to humans. By leveraging the nonlinearity of the microphone circuit, the microphone can accurately decode the shadow sound for target voice cancellation. We implement and evaluate NEC comprehensively with 8 smartphone microphones in different settings. The results show that NEC effectively mutes the target speaker at a microphone without interfering with other users' normal conversations.

1.3: Organization

This dissertation is organized as follows. Chapter 1 presents the research background and the overview of this dissertation. Chapter 2 introduces a speaker authentication system that is secured by ultrasound. Chapter 3 elaborates on our new attack against large-scale and long-distance speaker authentication systems. Chapter 4 reveals the vulnerability of speech recognition in human-in-the-loop scenarios. Chapter 5 introduces a query-efficient black-box attack against commercial speech-to-text services/APIs. Chapter 6 addresses the privacy leakage problem by proposing a speaker-specified recording jammer.
Chapter 7 summarizes this dissertation and outlines future directions.

CHAPTER 2: SPEAKER VERIFICATION USING ULTRASOUND ENERGY IN HUMAN SPEECH¹

¹This chapter is based on previously published work by Hanqing Guo, Qiben Yan, Nikolay Ivanov, Ying Zhu, Li Xiao, and Eric J. Hunter titled "SuperVoice: Text-Independent Speaker Verification Using Ultrasound Energy in Human Speech", published in the Proceedings of the 2022 ACM Asia Conference on Computer and Communications Security. DOI: 10.1145/3488932.3517420 [73].

2.1: Introduction

Modern devices are increasingly using voice biometrics due to their usability and security. Speaker Verification (SV) systems, used in voice assistants like Siri, Alexa, and Google Assistant, rely on voice biometrics to recognize specific wake words and verify users. Recent SV studies have explored distinctive vocal or non-vocal features such as phoneme position [214], cumulative spectrum [7], mouth motion [124, 213], body vibration [54], and sound field [198]. Based on these features, conventional machine learning models have been used to generate speaker models, including correlation (CORR), support vector machines (SVM), Gaussian mixture models (GMM), etc. Meanwhile, deep neural network (DNN)-based SV systems use robust neural networks to build speaker models from prototypical features (e.g., waveform [92, 137, 143], spectrogram [91, 128, 177], and MFCC (Mel-Frequency Cepstral Coefficients) [165]). As summarized in Table 2.1, most of the existing SV systems cannot simultaneously achieve effective speaker verification and defense against spoofing attacks [7, 124, 213, 214], while others have limitations in their usability, e.g., requiring users to wear extra devices [54] or to stay at the same position as in the enrollment phase [198]. Moreover, their discovered vocal or non-vocal features cannot be transferred across different speaker models. Although existing DNN-based SV systems [91, 128, 137, 165, 177] do not deal with rigid features, they tend to yield relatively high error rates due to the lack of speaker-representative features.

Motivation: In this research, we aim to explore ultrasound energy in human speech to enhance the accuracy and security of text-independent speaker verification. More specifically, we investigate the unique properties of human speech in the human-inaudible ultrasound frequency band (i.e., frequencies greater than 20 kHz). High-frequency ultrasound components in human speech present several unique properties: first, they are imperceptible to humans but can be captured by an ultrasound microphone; second, individual human speakers can produce ultrasound waves with distinct characteristics, determined by the unique shape of the speech production system and the particular use (e.g., timing) of that system. Recent attacks on voice assistants, such as DolphinAttack [210] and SurfingAttack [200], leveraged inaudible ultrasound signals to inject commands into voice assistants. Here, we take a reverse approach: rather than utilizing ultrasound signals for attack, we take advantage of the unique properties of the high-frequency audio spectrum for defense, to offer a more robust and accurate SV system. We propose SuperVoice, a robust and secure text-independent SV system, which applies to commodity mobile devices equipped with an ultrasound microphone. SuperVoice analyzes an incoming voice command captured by the ultrasound microphone.
The audio includes both the audible (below 20 kHz) and ultrasound (above 20 kHz) frequency components. SuperVoice then processes these components to extract both low-frequency and high-frequency feature representations using a liveness detection module and a two-stream DNN architecture. These features are fused and fed to a second-level classifier to generate or match a speaker embedding for speaker verification purposes.

Table 2.1: SuperVoice in comparison with other SV systems.

System           Feature      Model  Text Indep.  Security  Transfer
VoiceLive [214]  Phoneme      CORR   ✗            ✓         ✗
VoiceGes. [213]  Mouth        CORR   ✗            ✓         ✗
WiVo [124]       Mouth        CORR   ✗            ✓         ✗
VAuth [54]       Body         CORR   ✓            ✓         ✗
Void [7]         Cum. Spec    SVM    ✓            ✓         ✗
CaField [198]    Sound field  GMM    ✓            ✓         ✗
TE2E [91]        Spectrum     CNN    ✓            ✗         ✓
GE2E [177]       Spectrum     CNN    ✓            ✗         ✓
Siri [165]       MFCC         RNN    ✓            ✗         ✓
SincNet [137]    Waveform     CNN    ✓            ✗         ✓
VGGVox [128]     Spectrum     CNN    ✓            ✗         ✓
SuperVoice       Ultrasound   CNN    ✓            ✓         ✓

Challenges: The design of SuperVoice faces three critical challenges. i) How to ascertain that the ultrasound feature can represent the speaker's voiceprint? Prior acoustic studies show evidence that high-frequency energy (from 8-16 kHz) contains useful features to identify an individual speaker [86, 106, 145]. However, none of them focuses on the ultrasound frequency band above 20 kHz. Existing feature engineering techniques such as LPCC (Linear Prediction Cepstral Coefficients), filter banks, and MFCC cannot be directly applied to high-frequency data, as they are designed for narrowband speech data (below 8 kHz). To better utilize the ultrasound features, we design signal processing techniques to extract the unique characteristics from the ultrasound components. ii) How to use the ultrasound features to detect replay attacks that involve multiple playback devices? Since attackers can use different devices (e.g., smartphones, ultrasonic microphones, and ultrasonic speakers) to record and replay voice signals, it is challenging to design a liveness detection method that copes with different attack devices with varied signal characteristics. iii) How to design a neural network structure that integrates the ultrasound features? Since the speech production mechanisms behind low-frequency and high-frequency features are very different, the integration of both features is particularly challenging. We design a two-stream DNN structure with convolutional filters to process and integrate the ultrasound features.

Contributions: To the best of our knowledge, we are the first to prove that ultrasound components (20-48 kHz) in human speech can be used to enhance the accuracy, robustness, and security of SV systems. We demonstrate that the ultrasound components are model-agnostic by integrating them into multiple SV models, all of which achieve enhanced performance in the SV tasks. Surprisingly, the ultrasound components in human speech have been largely neglected prior to this work [103]. In summary, this chapter makes the following contributions:

• We demonstrate that human speech does include ultrasound components, and those components can help distinguish among different human speakers. Moreover, the ultrasound components in speech signals can be utilized to identify spoofing attacks by measuring the signals' cumulative energy levels in the high-frequency range.
• We design SuperVoice, a speaker verification and spoofing detection system.
By incorporating a two-stream neural network structure with time-frequency spectrogram filters and a feature fusion mechanism, SuperVoice achieves high accuracy in text-independent speaker verification.
• We launch a human voice study and collect two datasets for speaker verification and spoofing detection. We recruit 127 participants and record 8,950 audio samples with 8 different smart devices, including 7 smartphones and an ultrasound microphone. We also replay 500 audio samples to construct a spoofed voice dataset with 5 playback devices. In total, our datasets involve 127 participants with a total of 12 hours of audio. We make our datasets publicly available at https://supervoiceapp.github.io/.
• We evaluate the performance of SuperVoice and compare it against other SV systems. The result shows that SuperVoice achieves a 0.58% equal error rate (EER) in speaker verification, which improves the EER of the top-performing SV system by 86.1%. Remarkably, it only takes 120 ms to test an incoming utterance. Moreover, SuperVoice achieves 0% EER with 91 ms of processing time for liveness detection, outperforming all existing approaches. A two-week longevity experiment demonstrates that SuperVoice is suitable for long-term use, and its performance is barely impacted by changes in distances and angles.

2.2: Background

2.2.1: Threat Model

We consider voice spoofing attacks, which are malicious attempts to impersonate a genuine speaker to execute an unauthorized command on a voice assistant. The three most popular types of voice spoofing attacks are replay, synthesis, and conversion [192]. In a replay attack, the adversary records a legitimate command uttered by a genuine speaker and replays this command later. The synthesis attack uses text-to-speech (TTS) generation to create artificial voice commands acceptable to a voice assistant. The conversion attack converts an existing voice command into a different one that can bypass speaker verification. To provide effective countermeasures against voice spoofing attacks, this research aims to develop an end-to-end SV system that can perform both liveness detection and speaker verification.

2.2.2: Can Humans Produce Ultrasound?

The sounds in human speech are commonly divided into vowels and consonants: the vowels are produced when air steadily flows through the vocal tract above the larynx (see Figure 2.1), while the consonants are more transient in nature and are produced when the flow of air is partially restricted or completely stopped at the vocal folds. The consonants are characterized by voicing, place of articulation, and manner of articulation [39, 101]. Voicing refers to the inclusion of vocal fold vibration as a sound source, which is quasi-steady and generally harmonic in nature, and the place of articulation represents the location of constriction in the vocal tract, which usually results in a highly transient noise. The manner of articulation describes how a sound is altered by the manipulation of the airstream flowing from the lungs. More specifically, when two speech organs narrow the airstream so that friction occurs as it passes through, Fricatives are produced. If the airstream is stopped and then released, a Stop or Affricate is produced. In particular, the Stop, Affricate, and Fricative consonants are known to exhibit high-frequency energy (HFE), since the airstream is deformed by the articulations.
In this work, we aim to scrutinize this under-explored and largely neglected phenomenon in human speech, i.e., that consonants carry high energy in the human-inaudible ultrasound frequency range. We perform experiments to validate that human speech production generates energy in the ultrasound spectrum during a normal utterance, primarily within speech components such as Stops, Affricates, and Fricatives. Figure 2.2a shows the human voice frequency spectra sensed by an ultrasound microphone, in which a significant portion of the acoustic energy is observed beyond 20 kHz. In this study, we are the first to show that the acoustic energy beyond 20 kHz (i.e., ultrasound voice components) plays an important role in the speaker verification task and offers an effective and economical solution for liveness detection.

Figure 2.1: Human's vocal tract and place of articulation.

2.2.3: Can Ultrasound Components Improve SV Performance?

Carefully examining Figure 2.2a, we find that HFE is produced by certain phonemes (marked by dashed rectangles), such as /sh/, /s/, /sy/, and /s/ within the particular phrase. Figure 2.2b shows the low-frequency spectrum of these phonemes, from which we can see that the phonemes with HFE exhibit less energy below 2 kHz compared with other phonemes. Modern SV models follow this principle to identify the voiceprint by modeling the energy distribution pattern of the LFE spectrum. Figure 2.2c shows an obvious difference in the voice spectrum between the phonemes with HFE and the ones without HFE. By capturing the unique high-frequency characteristics of the phonemes with HFE, the ultrasound components may help boost the performance of text-independent SV systems.

Remark 1: The phonemes with HFE may lack low-frequency energy (LFE). This phenomenon implies that traditional LFE-based SV systems may not be able to capture sufficient characteristics of the phonemes with HFE.

The most common audio sampling rate of a recorder or loudspeaker is 44.1 (or 48) kHz. Due to the Nyquist theorem, any acoustic energy beyond 22.05 (or 24) kHz will be discarded, as shown in Figure 2.2d. Even though some recorders and loudspeakers support higher sampling rates, their frequency responses tend not to be as flat across a wide frequency band as human speech.

Figure 2.2: Observation of high-frequency energy (HFE) and low-frequency energy (LFE) of the phrase "She had your dark suit in greasy wash water all year" uttered by a human speaker: (a) spectrum of the given phrase; (b) spectrum of the LFE components; (c) voice spectrum comparison; (d) replayed audio spectrum.

Remark 2: A typical replay attack using loudspeakers cannot produce ultrasound energy. Therefore, the ultrasound energy in human speech can be used to quickly identify loudspeaker-based spoofing attacks.

In the following sections, we present our new discoveries on the specific features of human voice components, which become the core elements of SuperVoice. By conducting a preliminary study on the high-frequency ultrasound components in human speech, we lay the foundation for the rest of this work.

Figure 2.3: Ultrasound energies of different sentences spoken by different speakers: (a) energy of different sentences from the same speaker; (b) variance of energy w.r.t. frequencies; (c) energy of the same sentences from different speakers; (d) variance of energy w.r.t. frequencies.

The preliminary study aims to answer the following four complementary research questions:
• RQ1: Can the ultrasound components in human speech reflect the speaker identity?
• RQ2: How consistent are the ultrasound features for each speaker over different speech contents?
• RQ3: How distinctive are the ultrasound features for each individual speaker?
• RQ4: Can the ultrasound components help determine the liveness of the audio input?

2.2.4: Ultrasound Components and Speaker Identity

To answer RQ1, we conduct a theoretical analysis based on the principles of human speech. Generally, the production of speech can be divided into two separate elements: 1) sources of sound, such as the larynx, and 2) filters that modify the sources, such as the vocal tract. Different from vowels, which only use the voicing source, consonants coordinate three sources: frication, aspiration, and voicing. Moreover, vowels are produced by a relatively open vocal tract, while consonants are produced by constrictions in the vocal tract, as explained in Section 2.2.2. Specifically, the production of consonants involves more sources, faster changes in the articulators, changes in the shape of the vocal tract, and more complicated articulation changes such as the movement of the tongue, lips, and jaw. As a result, consonants naturally produce a more diverse set of frequency components, which extends into the ultrasound frequency range. Clearly, the uniqueness of consonant pronunciation depends on a human's vocal tract shape, nasal cavity, oral cavity structure, and lip and tongue shapes. Among all consonants, we focus on the Stop, Affricate, and Fricative consonants, since they produce high-frequency components with a significantly higher energy level (see Figure 2.2a).

2.2.5: Consistency of Ultrasound Components

To address RQ2, we design an experiment to evaluate whether the ultrasound frequency components are consistent across different speech contents. Conceptually, an ultrasound component refers to a speech component with non-trivial energy above 20 kHz. We first identify the high-energy ultrasound components in an utterance by computing the Short-time Fourier transform (STFT) spectrum of the voice input. The STFT uses a Hann window of length 10 ms, a hop length of 2 ms, and an FFT size of 2,048 points under a 192 kHz sampling rate, which results in a 93.75 Hz (192,000/2,048) frequency resolution. Suppose an utterance is divided into N frames within a time window T. We consider the top M frames with the highest cumulative energy above 20 kHz as the frames that contain ultrasound components. Based on empirical observations, M is configured as 100 in this chapter. However, existing studies have demonstrated that the STFT spectra of different phonemes present notable deviations across certain frequency ranges [126, 161]. This indicates that the impact of speech contents could pose a challenge for text-independent SV scenarios. To address this challenge, we calculate the long-term average (LTA) of the energies of the ultrasound components, which is more stable within the time frame T, expressed as follows:

$S_{LTA}(f) = \frac{1}{M} \sum_{t=1}^{M} S(f, t)$, (2.1)

where M is the number of frames that contain high-frequency ultrasound components, S(f, t) is the STFT spectrogram at frequency f and frame t, and t is the frame index within T. Spectrum averaging techniques such as LTA have been used to compare the properties of acoustic signals from random speech data [198]. In essence, LTA can help reduce the impact of different phonemes on the speaker profile.
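To make the frame selection and the LTA of Eq. (2.1) concrete, the following is a minimal sketch, assuming a mono recording sampled at 192 kHz and the parameters quoted above (10 ms Hann window, 2 ms hop, 2,048-point FFT, M = 100). The function name and the use of librosa are illustrative choices, not the chapter's implementation.

```python
# Minimal sketch (not the authors' code) of ultrasound frame selection and the
# long-term average (LTA) of Eq. (2.1), assuming a 192 kHz mono waveform `y`.
import numpy as np
import librosa

def ultrasound_lta(y, sr=192_000, m_frames=100):
    # STFT: 10 ms Hann window (1,920 samples), 2 ms hop (384 samples), 2,048-point FFT
    S = np.abs(librosa.stft(y, n_fft=2048, hop_length=int(0.002 * sr),
                            win_length=int(0.010 * sr), window="hann"))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)   # ~93.75 Hz resolution

    # Rank frames by cumulative energy above 20 kHz and keep the top M frames
    hfe_per_frame = S[freqs >= 20_000, :].sum(axis=0)
    top = np.argsort(hfe_per_frame)[-m_frames:]

    # Eq. (2.1): average the spectrum over the M ultrasound-bearing frames
    s_lta = S[:, top].mean(axis=1)
    return freqs, s_lta

# Usage: y, _ = librosa.load("utterance.wav", sr=192_000, mono=True)
#        freqs, s_lta = ultrasound_lta(y)
```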
Here, we ask one volunteer to read the sentences S1-S6 (refer to the website https://supervoiceapp.github.io) and collect the spectrogram data to compute S_LTA. The results in Figure 2.3a show that S_LTA remains consistent within the frequency range of 16-48 kHz across different speech contents. The variance of S_LTA is shown in Figure 2.3b. It is worth noting that S_LTA (adapted for low frequency) varies significantly within the low-frequency range of 0-16 kHz, which further corroborates that the LTA of ultrasound components can be used to improve the performance of SV systems.

2.2.6: Distinctiveness of Ultrasound

Next, we aim to address RQ3, i.e., whether the ultrasound features from human speech are unique to each speaker, given that each speaker's vocal tract is unique. Prior to answering this research question, we formalize the ultrasound voiceprint for each speaker. The creation of a voiceprint typically involves training with multiple sentences to achieve a reliable and robust voiceprint. Suppose the enrollment dataset is D. The ultrasound voiceprint P is defined as:

$P = \frac{1}{|D|} \sum_{s \in D} S_{LTA}(s)$, (2.2)

where s denotes the sentence index, and S_LTA(s) is the LTA of the ultrasound energy within the sentence s. The ultrasound voiceprint represents the average energy distribution of multiple enrollment sentences. To evaluate the capability of P in distinguishing among speakers, we enroll the voices of five volunteers (3 males and 2 females) and analyze the distinctiveness of P. The results in Figure 2.3c demonstrate noticeable variations in the ultrasound energy in the range of 16-48 kHz. Figure 2.3d further indicates that the ultrasound components from different speakers vary the most in the frequency range of 16-32 kHz.

2.2.7: Ultrasound for Liveness Detection

The aforementioned experiments show that the human voice possesses ultrasound components, but digital loudspeakers generally cannot produce highly distinctive ultrasound with high energy. The sound spectrogram produced by a digital loudspeaker is limited by its Digital-to-Analog Converter (DAC) sampling rate, low-pass filters, amplifier, and other loudspeaker hardware. We demonstrate this phenomenon in Figure 2.2a and Figure 2.2d, where the former shows that a genuine human utterance has ultrasound components, while the latter, the spectrogram of a loudspeaker, does not contain HFE in the high-frequency band. Therefore, RQ4 can be addressed by measuring the ultrasound energy in the audio. For high-end recorders and loudspeakers that support higher sampling rates, we demonstrate the effectiveness of our design in Section 2.4.

In summary, we demonstrate that ultrasound components carry speaker identity information. We show that the voiceprints based on ultrasound components are consistent across different speech contents. They are distinctive for each speaker, and they can be used for liveness detection to enhance the security of SV systems.
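To close the preliminary study, the sketch below shows how the ultrasound voiceprint of Eq. (2.2) could be computed by averaging per-sentence LTA spectra. It reuses the hypothetical `ultrasound_lta` helper from the earlier sketch and is an assumption-laden illustration, not the chapter's implementation.

```python
# Minimal sketch of the ultrasound voiceprint in Eq. (2.2): the mean LTA
# spectrum over an enrollment set D of sentences recorded for one speaker.
# Assumes the `ultrasound_lta` helper defined in the previous sketch.
import numpy as np

def ultrasound_voiceprint(enrollment_waveforms, sr=192_000):
    ltas = [ultrasound_lta(y, sr=sr)[1] for y in enrollment_waveforms]
    return np.mean(ltas, axis=0)   # P = (1/|D|) * sum over s in D of S_LTA(s)
```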
2.3: System Design

In this section, by leveraging the discriminative ability of the ultrasound components in the human voice, we introduce SuperVoice to perform liveness detection and speaker verification simultaneously. We first extract the ultrasound features from the voice signals. Then, the ultrasound features are embedded into a two-stream DNN architecture to produce speaker embeddings via the integration of both the low-frequency and high-frequency voice components.

Figure 2.4: SuperVoice's operational workflow.

2.3.1: Overview

The goal of SuperVoice is to utilize the high-frequency components of the human voice to enhance speaker verification performance while identifying and defending against spoofing attacks. To achieve this goal, SuperVoice includes three main stages: 1) Model Training, 2) Speaker Enrollment, and 3) Speaker Verification. Figure 2.4 shows the operational workflow of SuperVoice. During the Model Training stage, SuperVoice first learns how to extract effective features (speaker embeddings) from the voice data of speakers in the training pool. In Speaker Enrollment, the target speakers are required to register their voices in the system, based on which SuperVoice will generate and store the speaker embedding as the speaker's unique voiceprint. Finally, in the Speaker Verification stage, SuperVoice first conducts liveness detection to ensure the audio source is a human speaker, and then verifies the speaker identity by measuring the similarity with the claimed speaker's embeddings. At every stage, a Signal Processing module is applied to convert the raw signals to fit the input shape of the model. The processed data are then fed into the liveness detection module and the two-stream DNN network for speaker verification. The technical details of the Signal Processing module can be found in the Appendix of [73].

Figure 2.5: Spectrograms from different replay attackers: (a) human; (b) Scenario 1 ("No Power"); (c) Scenario 2 ("Limited HFE"); (d) Scenario 3 ("Low Response").

2.3.2: Liveness Detection

The liveness detection module is designed to differentiate between human voice and spoofed audio by utilizing the cumulative spectral energy of captured audio frames with ultrasonic components. We consider three attack scenarios based on different attackers' capabilities.

Scenario 1. Attackers record and replay with commercial devices: Since most traditional recording devices 1) do not support microphones with a high-frequency response; 2) do not have an ADC capable of producing audio with a high sampling rate; and 3) often apply a low-pass filter behind the microphone logic to remove high-frequency components — they are unable to acquire the full spectrum of human voice. An ultrasound microphone, on the other hand, can capture a wider spectrum of human voice (see Figure 2.5a), including the ultrasound components up to 48 kHz. The digital components of a loudspeaker usually have a sampling rate at or below 48 kHz. Therefore, the replayed audio will not carry any high-frequency ultrasound components, as opposed to the genuine human voice (Figure 2.5b). As a result, the captured ultrasound components in human voice provide a unique opportunity for developing accurate and efficient liveness detection without heavy computation.

Scenario 2. Attackers record with high-end microphones and replay with commercial speakers: Let us consider an attacker who uses a high-end microphone (e.g., a microphone with a high sampling rate, a high-resolution ADC, and a wide frequency response) to eavesdrop on a victim's voice, and replays it with a commercial speaker such as a smartphone or a high-quality loudspeaker. In such a scenario, the replayed audio will still carry limited HFE due to the cutoff frequency of commercial speakers, as shown in Figure 2.5c. In comparison with Figure 2.5a, the lack of HFE in Figure 2.5c constitutes a unique feature of the replay attacks.

Scenario 3.
Attackers record with high-end microphones and replay with ultrasound speakers: The attackers can also be equipped with high-end microphones and professional ultrasound speakers. In this scenario, although the spectrogram of the replayed audio carries HFE, it possesses limited LFE, as shown in Figure 2.5d. The energy difference between the replayed audio and the genuine human voice is evident: the former has nearly zero energy below 1 kHz, while the latter presents an intact spectrum.

Based on our observations in Figure 2.5, we leverage the cumulative spectral energy of the frames with ultrasonic components and design an accurate, efficient, and lightweight liveness detector to identify whether the audio source comes from a loudspeaker or a genuine human speaker. The detector relies on the normalized cumulative energy S_p in different frequency ranges, defined as:

$S_p(f) = \sum_{t \in M} S(f, t) - \sum_{f} \sum_{t \in T} S(f, t)$, (2.3)

where S is the STFT spectrogram, t is the frame index, T is the total number of frames, and M is the number of frames with ultrasonic components. The first term on the right-hand side sums the energies of all the frames with ultrasonic components, and the second term is used for normalization and noise reduction. To defend against the attacks in Scenarios 1 and 2, we define R_1 as the ratio of the ultrasonic energy over the entire spectrum as follows:

$R_1 = \frac{\sum_{f=low_1}^{high_1} S_p(f)}{\sum_{f=0}^{high_1} S_p(f)}$. (2.4)

The numerator is composed of the normalized cumulative energy in the high-frequency band (from low_1 Hz to high_1 Hz), while the denominator uses the energy of the entire spectrum (up to high_1 Hz). In this chapter, low_1 and high_1 are set to 24 and 48 kHz, respectively. Typically, a legitimate human voice will yield a positive value of R_1, since its HFE is greater than the average energy of the entire spectrum (see Figure 2.5a). In contrast, a replay attacker with a commercial speaker will yield a negative R_1.

For Scenario 3, in which the attacker possesses a professional ultrasound microphone and a high-end loudspeaker, we propose R_2 to examine the proportion of LFE over all frequency bands as follows:

$R_2 = \frac{\sum_{f=0}^{low_2} S_p(f)}{\sum_{f=0}^{high_2} S_p(f)}$. (2.5)

The normalized cumulative energy below 1 kHz is expected to be negative for replayed audio, since it has lower energy, as shown in the dark dotted frame in Figure 2.5d. Here, we set low_2 to 1 kHz and high_2 to 4 kHz. By integrating R_1 and R_2, we consider a voice input as belonging to a genuine human if it satisfies the condition $(R_1 > 0) \wedge (R_2 > 0)$. Otherwise, it will be classified as replayed audio.
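For illustration, the sketch below strings the R_1/R_2 checks together, reusing the STFT and frame-selection conventions from the earlier `ultrasound_lta` sketch. It is a hedged reading rather than the chapter's code: Eq. (2.3) as recovered from the text leaves the scale of the normalization term ambiguous, so the sketch assumes it is the utterance-wide mean per-bin energy scaled to the M selected frames; the band limits follow the text (24-48 kHz, and 1 kHz over 4 kHz).

```python
# Minimal liveness-check sketch (not the chapter's implementation).
# Assumption: the normalization term of Eq. (2.3) is the mean per-bin energy of
# the whole utterance, scaled to the M ultrasonic frames.
import numpy as np
import librosa

def is_live_speech(y, sr=192_000, m_frames=100):
    S = np.abs(librosa.stft(y, n_fft=2048, hop_length=int(0.002 * sr),
                            win_length=int(0.010 * sr), window="hann"))
    freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)

    # Frames with ultrasonic components: top M by cumulative energy above 20 kHz
    top = np.argsort(S[freqs >= 20_000, :].sum(axis=0))[-m_frames:]

    # Eq. (2.3): per-frequency energy of the ultrasonic frames minus the assumed
    # normalization term
    sp = S[:, top].sum(axis=1) - S.mean() * m_frames

    band = lambda lo, hi: sp[(freqs >= lo) & (freqs <= hi)].sum()
    r1 = band(24_000, 48_000) / band(0, 48_000)   # Eq. (2.4)
    r2 = band(0, 1_000) / band(0, 4_000)          # Eq. (2.5)
    return (r1 > 0) and (r2 > 0)                  # genuine human if both positive
```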
2.3.3: Two-Stream DNN Model After performing the liveness detection, SuperVoice begins processing the genuine human speech to verify the speaker's identity. For speaker verification, we design a two-stream DNN model to better integrate ultrasound features and improve the SV performance. Almost all prior SV studies consider low-frequency audio below 8 kHz, because typical voice characteristics such as pitch and formants only exist in the low-frequency range below 8 kHz. However, we observe that spectrum features above 8 kHz can indeed contribute to speaker verification. Thus, the question we aim to address in this section is: how to embed the high-frequency features into the speaker model? DNN system design: Typical machine learning-based SV systems use a neural network (NN) to obtain feature embeddings [35, 91, 173]. A classifier then performs the classification based on the extracted feature embeddings. SuperVoice follows such a structure, i.e., the first level of networks conducts feature embedding, while the second level performs the classification. Feature fusion: Different from the typical machine learning-based SV system, SuperVoice contains two streams of DNN models: one performs feature embedding using the low-frequency components, and the other embeds high-frequency features. These features are fused together to construct one coherent feature vector, which is then fed into a classifier to produce a unique speaker embedding for every enrolled speaker. 2.3.4: System Architecture The overall SuperVoice architecture is presented in Figure 2.6, and it is comprised of three NNs: CNN-1, CNN-2, and an NN classifier. CNN-1: CNNs have been widely used in image recognition tasks, applying convolutional computation to shrink the input size and obtain a high-level representation of high-dimensional features [205]. We feed the downsampled raw audio containing low-frequency components into the CNN to obtain a low-frequency feature vector. Inspired by SincNet [137], we use Sinc-filters to simulate the behavior of band-pass filters and add two one-dimensional convolutional layers to further compress the low-frequency feature space. CNN-2: In the second data stream, CNN-2 is designed to embed high-frequency voice components. Existing CNNs, such as VGGNet [154], ResNet [88], and Inception [159], are designed for image classification: they apply multiple convolutional layers with filters of different (mostly square) shapes to capture the unique characteristics of different object shapes in the image. Unlike an image, which consists of pixels, a spectrogram possesses both time and frequency components. Here, we design a new CNN architecture with three complementary time-frequency convolutional filters to extract the HFE distribution and phoneme-level high-frequency representations. F-Filters: The purpose of the frequency-domain filters (F-Filters) is to extract the HFE distribution $S(f)$ in the frequency domain. We design a sequence of vertical-shaped filters to convolve the high-frequency spectrogram. The size of the F-Filter is determined by the range of frequencies involved in the convolutional computation. Based on the observation that the HFE distribution can be used as a speaker voiceprint (see Figures 2.3a and 2.3c), and in order to extract a finer-grained energy distribution with a higher frequency resolution, we construct 64 F-Filters of size 9×1 with dilations 1×1 and 2×1. As a result, the filters span a frequency range from 9 · 93.75 Hz = 843.75 Hz to 9 · 2 · 93.75 Hz = 1,687.5 Hz. T-Filter: Two time-domain filters (T-Filter) are designed to learn the high-frequency phoneme-level representation. The T-Filter covers a time duration shorter than a phoneme length to ensure that the convolution process occurs within a single phoneme. The time-domain resolution can be computed as $hop_{STFT}/192\,\mathrm{kHz} \approx 2.7$ ms. After applying the 64 1×9 T-Filters dilated by 1×1 and 1×2, the convolution computation covers a time-domain span between 9 · 2.7 = 24.3 ms and 9 · 2.7 · 2 = 48.6 ms. Since this span is shorter than a typical phoneme, the time-domain frames can represent the detailed information within a single phoneme. F/T-Filter: At the final stage of CNN-2, we design a sequence of square filters (F/T-Filter) with a size of 5×5 to convolve both time-domain and frequency-domain features concurrently. The F/T-Filter merges the extracted high-frequency characteristics from both the time and frequency domains, in order to yield a more representative ultrasound energy distribution for a particular speaker.
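To make the filter geometry concrete, the following PyTorch sketch instantiates the three CNN-2 filter types with the sizes and dilations given above (64 F-Filters of 9×1 with dilations 1×1 and 2×1, 64 T-Filters of 1×9 with dilations 1×1 and 1×2, and a 5×5 F/T-Filter). The layer ordering, padding, activations, and pooling are our assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class CNN2Sketch(nn.Module):
    """Illustrative sketch of CNN-2's time-frequency filters (not the authors' code)."""
    def __init__(self):
        super().__init__()
        # F-Filters: vertical (frequency-axis) 9x1 kernels with dilations 1x1 and 2x1.
        self.f_filters = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(9, 1), dilation=(1, 1), padding=(4, 0)), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=(9, 1), dilation=(2, 1), padding=(8, 0)), nn.ReLU(),
        )
        # T-Filters: horizontal (time-axis) 1x9 kernels with dilations 1x1 and 1x2.
        self.t_filters = nn.Sequential(
            nn.Conv2d(64, 64, kernel_size=(1, 9), dilation=(1, 1), padding=(0, 4)), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=(1, 9), dilation=(1, 2), padding=(0, 8)), nn.ReLU(),
        )
        # F/T-Filter: square 5x5 kernels that merge both domains.
        self.ft_filter = nn.Conv2d(64, 64, kernel_size=(5, 5), padding=2)

    def forward(self, spec):          # spec: (batch, 1, freq_bins, time_frames)
        x = self.f_filters(spec)
        x = self.t_filters(x)
        x = self.ft_filter(x)
        return x.mean(dim=(2, 3))     # pooled high-frequency feature embedding

# emb = CNN2Sketch()(torch.randn(1, 1, 512, 200))   # example high-frequency spectrogram
```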
NN classifier: Finally, the NN classifier takes the fused features, concatenated from the output feature vectors of CNN-1 and CNN-2, and compresses them into the desired speaker embedding dimension. Here, we use a fully connected layer as the NN classifier. Speaker embedding generation and matching: The speaker embedding is generated by the NN, as shown in Figure 2.6. The NN is essentially a fully connected layer, which maps the fused feature vector to a speaker embedding vector. When the speaker model produces the speaker embedding for a given audio source, SuperVoice compares its cosine distance with the existing speaker embeddings generated during the enrollment stage. Every sentence spoken by an authorized speaker produces a representative speaker embedding. For example, if speaker A enrolls three sentences into SuperVoice, the model generates three embeddings for A. To accept the audio as belonging to speaker A, the average cosine similarity between the tested speaker embedding and the enrolled ones should be greater than a similarity threshold $\gamma$, as shown below:
$decision = \begin{cases} \text{accept}, & similarity \geq \gamma \\ \text{reject}, & similarity < \gamma \end{cases}$  (2.6)
where $similarity = \frac{1}{N}\sum_{i=1}^{N} \cos(emb_i, emb)$, $N$ is the number of enrolled audios for the speaker, $emb_i$ is the $i$-th enrolled speaker embedding, and $emb$ is the speaker embedding to be verified. 2.3.5: Model Training/Testing It is noteworthy that although the purpose of the NN models is to extract the speaker embedding, they operate differently across the three stages (see Figure 2.4). The model learns how to extract the most representative speaker embeddings via training on a speaker recognition task. This means that the output of the NN is connected to another fully connected layer that maps the speaker embedding dimension to the number of classes in the speaker recognition task. For example, the model predicts a speaker label for the test utterance and then refines the network parameters via backpropagation of the loss. In the Speaker Enrollment stage, however, the model simply loads the set of parameters that achieves the best results on the speaker recognition task, and then extracts the speaker embeddings for the enrolled speakers.
Figure 2.7: Dataset collection platform.
2.4: Evaluation In this section, we evaluate the performance of SuperVoice on spoofing detection and speaker verification, i.e., how well SuperVoice can verify a claimed speaker and reject spoofed audio or a stranger's voice. Furthermore, we integrate the high-frequency features extracted by SuperVoice into existing SV models to show the transferability of high-frequency features in enhancing different types of SV models. For a fair evaluation, we collect several speech datasets as listed in Section 2.4.1. Our experiments are conducted on a desktop with an Intel i7-7700K CPU, 64 GB RAM, and an NVIDIA 1080 Ti GPU, running the 64-bit Ubuntu 18.04 LTS operating system.
The model complexity and time consumption are measured on this hardware configuration. 2.4.1: Speech Data Collection Human voice recording: The voice data in existing public datasets is collected using regular microphones with at most a 48 kHz sampling rate [58, 129], which captures data only within the [0, 24] kHz range. In order to investigate the high-frequency ultrasound components in human speech, we collect our own datasets for evaluation, including the Voice-1, Voice-2, and Voice-3 datasets. Voice-1 is collected with a high-end ultrasound microphone, Voice-2 with regular microphones on various smartphones, and Voice-3 with a low-end ultrasound microphone. In total, we collected 9,050 speech utterances from 127 volunteers. The data collection process and user study were approved by our school's IRB. Dataset collection platform: There are several options for off-the-shelf ultrasound microphones (e.g., the SiSonic SPU0410LR5H_QB MEMS microphone [156] and the Avisoft CM16 condenser microphone [18]). The first microphone can capture the ultrasound frequency band up to 96 kHz and only requires a 1.5 V to 3.6 V power supply. Its low power consumption and low cost ($2/piece) make it suitable for most smartphones. The second microphone provides a flatter frequency response over the entire frequency band, allowing it to collect better-quality ultrasound recordings. For these reasons, we deploy both the SiSonic SPU0410LR5H_QB and the Avisoft microphones for data collection. The microphone and data capturing equipment are displayed in Figure 2.7. We informed each participant of the purpose of the experiment and then recorded their voice. The participants spoke facing the microphone at a distance of 30 cm. Each participant was requested to speak 4 types of sentences, totaling 100 sentences. Voice-1: Voice-1 includes the voice data of 77 volunteers, totaling 7,700 utterances recorded at a 192 kHz sampling rate. Most of the 77 volunteers are college students; they range in age from 18 to 56 and include 38 males and 40 females. For detailed dataset information, please refer to the website https://supervoiceapp.github.io. Voice-2: Voice-2 is constructed by recording 25 sentences from each of 50 participants with different models of smartphones. The smartphones' sampling rate is 48 kHz. As traditional speaker models leverage voice features below 8 kHz, Voice-2 helps validate the effectiveness of high-frequency features within the [8, 24] kHz range recorded with different phones. In total, Voice-2 includes 1,250 utterances at a 48 kHz sampling rate. Voice-3: Voice-3 includes 200 audio recordings from 20 participants. Different from Voice-1, we collect Voice-3 with the cheap SiSonic ultrasound microphone. Every participant read each sentence of the Common type twice, for a total of 10 recordings per volunteer. The purpose of Voice-3 is to validate the performance of SuperVoice with a cheap ultrasound microphone that can be integrated into smartphones [93]. Spoofing voice dataset: We implement the spoofing attacks by replaying the voice data collected in Voice-1 using 5 playback devices (2 smartphones, 2 high-end commercial loudspeakers, and one ultrasonic speaker) and 2 recording devices. To detect the replay attack, we deploy an ultrasound microphone to record the replayed spoofing audio. The purpose of this dataset is to comprehensively evaluate the capability of ultrasound components for liveness detection.
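For reference, the snippet below shows one way to capture a single 192 kHz utterance with the sounddevice and soundfile Python libraries, mirroring the recording setup described above; the device selection, duration, and file name are illustrative placeholders rather than part of the actual collection pipeline.

```python
import sounddevice as sd
import soundfile as sf

FS = 192_000          # sampling rate used for the Voice-1 ultrasound recordings
DURATION = 5          # seconds per utterance (illustrative)

def record_utterance(path="utterance.wav", device=None):
    """Capture one utterance from an ultrasound-capable microphone (sketch only).
    Whether 192 kHz is supported depends on the attached audio interface."""
    audio = sd.rec(int(DURATION * FS), samplerate=FS, channels=1,
                   dtype="float32", device=device)
    sd.wait()                                  # block until the recording finishes
    sf.write(path, audio, FS)                  # store as WAV for later processing
    return audio

# record_utterance("p001_s01.wav")   # e.g., participant 1, sentence 1
```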
2.4.2: Performance Metrics The performance metrics we use for the SV task are the False Acceptance Rate (FAR), False Rejection Rate (FRR), and Equal Error Rate (EER). FAR is the rate at which SuperVoice falsely accepts an unauthorized speaker, FRR is the rate at which SuperVoice rejects a legitimate voice, and EER is the rate at which FAR and FRR are equal. We further use the Classification Error Rate (CER) to evaluate the speaker recognition (SR) performance, defined as the ratio of misclassified recordings to the total number of recordings. For the user study, we develop SuperVoice as an end-to-end desktop application and use the Success Rate to measure the percentage of successful attack defenses by SuperVoice, i.e., the number of correct recognitions of the voice owner over the total number of attempts.
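As a concrete illustration of these metrics, the sketch below computes an empirical EER from genuine and impostor similarity scores; the score distributions in the commented example are made up and are not tied to the reported results.

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Empirical EER: the point where FAR (impostors accepted) equals FRR (genuines rejected)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores <  t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))            # threshold where the two error rates cross
    return (far[idx] + frr[idx]) / 2, thresholds[idx]

# Example with made-up cosine-similarity scores:
# eer, thr = compute_eer(np.random.normal(0.8, 0.1, 1000), np.random.normal(0.4, 0.15, 1000))
```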
2.4.3: Speaker Verification Performance of Integrated Models To make a fair comparison with other existing speaker models, we reproduce all the models in the PyTorch framework. We use PyTorch 1.2.0 with Python 3.6.9. GE2E [177] and Void [7] are closed-source, so we reproduce them based on their descriptions. GMM-UBM [138] and VGGVox [128] have open-source MATLAB code, while SincNet [137] and STFT+LCNN [112] are implemented in the PyTorch framework. All of the models are evaluated with the same datasets described in Section 2.4.1. Direct ultrasound integration: First, we conduct a performance evaluation of 4 popular SV models: GMM-UBM, SincNet, VGGVox, and GE2E. We follow each model's specification to configure the input and model parameters. Then, we evaluate their performance using the downsampled low-frequency data ([0-8] kHz) and the original data ([0-96] kHz for Voice-1 and [0-24] kHz for Voice-2). The performance comparison is presented in Table 2.2. The performance with low-frequency data is largely consistent with the models' reported results. When the high-frequency data is included in the modeling process, the performance of every model deteriorates significantly. This indicates that the high-frequency data cannot be directly used to distinguish among different speakers.
Table 2.2: EER performance (%) comparison among GMM-UBM, SincNet, VGGVox, and GE2E with two different datasets.
Speaker Model | Voice-1 [0-8] kHz | Voice-1 [0-96] kHz | Voice-2 [0-8] kHz | Voice-2 [0-24] kHz
GMM-UBM | 12.25 | 42.23 | 13.33 | 17.56
SincNet | 4.17 | 18.23 | 4.19 | 7.04
VGGVox | 4.64 | 16.63 | 4.66 | 6.75
GE2E | 4.98 | 19.15 | 4.97 | 6.96
Improved ultrasound integration: For better integration of the high-frequency data, we adopt the architecture of SuperVoice: (1) using CNN-2 to handle the high-frequency data, and (2) replacing CNN-1 with the existing speaker models. To validate the efficiency of integrating high-frequency data on smartphones, we conduct an experiment with Voice-2 ([0-24] kHz) and present the results in Figure 2.8a. The green bar with the rectangle pattern indicates the EER performance with the downsampled [0, 8] kHz data, and the orange bar with the cross pattern shows the performance with the addition of high-frequency features in the [8, 24] kHz range, extracted by CNN-2 and the feature fusion technique. The results show that the EER of SincNet drops from 4.19% to 2.89%, and the EER of VGGVox decreases from 4.66% to 4.12%. Overall, the EER improvement is around 16.93% on average for SincNet, VGGVox, and GE2E. For the GMM-UBM model, the EER performance also improves slightly. The results demonstrate SuperVoice's transferability, i.e., it improves other SV models' performance by integrating the high-frequency feature embeddings. We then evaluate the performance of ultrasound integration in SuperVoice using the Voice-1 dataset. The result in Figure 2.8b shows the FAR and FRR of SuperVoice w.r.t. the similarity threshold $\gamma$, and it indicates that the EER of SuperVoice is 0.58%.
Figure 2.8: Performance of (a) ultrasound integration in existing models (1, 2, 3, and 4 represent the SincNet, VGGVox, GMM-UBM, and GE2E models), tested on Voice-2; (b) ultrasound integration in the SuperVoice system, tested on Voice-1.
Table 2.3: EER performance (%) of SuperVoice on different datasets.
Frequency range | Voice-1 SV | Voice-1 SR | Voice-2 SV | Voice-2 SR | Voice-3 SV | Voice-3 SR
No HFE | 4.17 | 5.87 | 4.19 | 6.74 | 6.75 | 7.84
[8-16] kHz | 3.98 | 4.79 | 3.77 | 4.87 | 5.74 | 5.95
[8-24] kHz | 2.89 | 2.27 | 3.21 | 3.32 | 3.45 | 4.51
[8-48] kHz | 0.58 | 1.61 | - | - | 1.87 | 3.01
[8-96] kHz | 5.79 | 7.31 | - | - | 9.52 | 14.2
2.4.4: Impact of Frequency Ranges Next, we evaluate the performance of SuperVoice with different frequency ranges of the high-frequency data. Both the SV task (i.e., verifying that the voice is from an authorized user) and the speaker recognition (SR) task (i.e., recognizing the voice of a specific authorized user) are conducted to measure the EER and CER performance. The results in Table 2.3 show that SuperVoice achieves an EER of 0.58% on Voice-1, the best among all the existing speaker models. It is noteworthy that the best existing model tested on Voice-1 is SincNet, with a 4.17% EER (see Table 2.2). To further evaluate SuperVoice on smartphones with an affordable ultrasound microphone (e.g., the SiSonic SPU0410LR5H), we evaluate the performance on the Voice-3 dataset. The results show that, even with a low-end ultrasonic microphone, SuperVoice achieves a significant performance improvement.
Figure 2.9: The user study of SuperVoice: (a) t-SNE result; (b) distance vs. success rate; (c) angle vs. success rate; (d) durability vs. success rate.
Remarkably, SuperVoice improves the EER of the best SV model by 86.1% (or 55.1% with a low-end microphone) via the incorporation of ultrasound frequency components. We also find that incorporating high-frequency features below 48 kHz produces better performance than using the higher frequency range. Among all the frequency range configurations, [8, 48] kHz provides the best SV and SR performance in terms of EER and CER. Unsurprisingly, with the complete [0, 96] kHz spectrum, both SV and SR performance degrades, as more indistinguishable noise is incorporated into the model, which confounds the SR/SV tasks. 2.4.5: User Study Besides the benchmark evaluation presented in the previous sections, we perform two user studies to further test the effectiveness and robustness of our system: (a) a Flexibility Study, in which users use our system at home and speak from random positions; and (b) a Longevity Study, in which users use our system over a long time span. Figure 2.9a visualizes the t-SNE result [122] of 20 participants in a 2D space, which clearly shows 20 speaker clusters. To conduct the user studies more efficiently, we develop an end-to-end SuperVoice desktop application. Flexibility study: We deploy the end-to-end application and ask 8 volunteers to enroll in the application.
Once they are successfully enrolled, they are instructed to speak to the ultrasound microphone at different distances and angles to test the system's recognition performance. Each volunteer makes 20 test attempts. Figures 2.9b and 2.9c present the impact of distance and angle, respectively. The results show that SuperVoice reaches a high success rate (95-100%) within 50 cm. Although the success rate may drop to 85% beyond 50 cm in the worst case, the average accuracy at 400 cm reaches 87.5%. As for different angles, the recognition performance declines from 95% to 85% when the speaker faces the microphone from the side. The performance degradation is caused by a specific characteristic of the ultrasound microphone (i.e., the CM16 delivers different gains at different angles according to its polar diagram). Longevity study: For the second user study, we test the longevity performance of our system by tracking the usage of 4 users over 11 days. The participants enroll their voices on the first day and attempt 20 times per day to use SuperVoice to identify their respective voices. As illustrated in Figure 2.9d, the average success rate is more than 95%, which means fewer than 1 in 20 attempts failed. Overall, we found no evidence of performance degradation over time. 2.4.6: Runtime Performance In this section, we compare the training time and testing time of SuperVoice with the SincNet, VGGVox, GMM-UBM, and GE2E models. The training time is the total time used to create a speaker model with the training pool of Voice-1, while the testing time is the time spent verifying an incoming utterance. Table 2.4 presents the runtime results.
Table 2.4: Runtime comparison.
Model | Training time (sec.) | Testing time (sec.)
GMM-UBM | 7,149 | 0.074
VGGVox | 11,308 | 0.279
GE2E | 10,348 | 0.21
SincNet | 8,180 | 0.134
SuperVoice | 8,413 | 0.120
Among all the models, the GMM-UBM model is the fastest in terms of training and testing time but has the worst EER. SincNet converges very quickly during the training phase due to its specialized convolutional design, and the proposed SuperVoice delivers comparable training time. During testing, SuperVoice outperforms the VGGVox and GE2E models thanks to its lightweight design with a small number of parameters. It is worth noting that introducing high-frequency features does not affect the testing speed. The results show that SuperVoice retains comparable runtime performance while delivering enhanced speaker verification performance. 2.4.7: Liveness Detection Performance In this section, we conduct experiments to verify the performance of the liveness detection described in Section 2.3.2. We prepare two types of recorders and 5 playback devices to replay the recordings. For every speaker, we replay 20 audios at a fixed position (facing forward at 10 cm) and volume (60 dB SPL). The defender uses the low-cost SiSonic ultrasonic microphone to monitor the replayed audio. Attackers record with a common recorder: We first replay audio that was recorded by a smartphone (Samsung S9). The boxplot in Figure 2.10a shows the results for different playback devices. From left to right, we have the genuine human voice (Hm), Bose SoundTouch 10 speaker [22], Vifa ultrasonic speaker [19], Samsung S9 phone (Sg), iPhone 12 (Ip), and SADA D6 speaker [142]. The results show that all the replay devices present a negative $R_1$. This is attributed to the lack of HFE in the audio recorded by the smartphone.
In contrast, the genuine human voice has positive $R_1$ and $R_2$, which is consistent with our analysis in Section 2.3.2. In the end, SuperVoice achieves 0% EER. Attackers record with an ultrasound recorder: Now, we consider attackers who use a high-end ultrasonic microphone to record the victims' voices. We select 20 audio samples with a 192 kHz sampling rate in Voice-1 as the source and replay them with the 5 loudspeakers.
Figure 2.10: Replay attacks: (a) recorded by a smartphone recorder; (b) recorded by an ultrasound microphone.
The result in Figure 2.10b shows that the commercial speakers still cannot produce any HFE, yielding negative $R_1$ values. Moreover, a substantial gap exists between the genuine and replayed voice for every replay device, which indicates that the liveness detection of SuperVoice is robust against different attack devices. For the attacker with an ultrasonic speaker (Vifa), we observe a positive $R_1$; however, its negative $R_2$ signifies the low LFE. In the end, SuperVoice again achieves 0% EER, confirmed consistently across 200 attack attempts. Defense performance comparison: Here, we compare the liveness detection performance with 4 state-of-the-art liveness or spoofing detection models. We first validate our reproductions by testing them on the ASVspoof dataset [110]; all of them reach performance similar to their reported results. We then evaluate all the models using our spoofing dataset in terms of the number of features, average detection time, and EER performance. Table 2.5 presents the liveness detection performance comparison results.
Table 2.5: Liveness detection performance comparison.
Models | # Feat. | Time (sec.) | EER (%)
CQCC + GMM [110] | 14,020 | 0.159 | 12.08
LPC + CQCC + GMM | 14,026 | 0.173 | 13.74
STFT + LCNN [112] | 84,770 | 0.321 | 8.8
Void [7] | 97 | 0.103 | 11.6
SuperVoice | 4 | 0.091 | 0
Among all the baseline models, the STFT+LCNN model runs the slowest with the largest number of features, although its EER is the best among the four. Compared with the existing models, SuperVoice only uses four cumulative power features in $R_1$ and $R_2$, and achieves the fastest runtime with 0% EER. Consistent with the measured data in Figure 2.10, which visualizes the clear gap between genuine and spoofed sound, SuperVoice achieves superior liveness detection in terms of both runtime and EER, for both traditional loudspeakers and ultrasound speakers. 2.5: Discussion In this section, we discuss the limitations of SuperVoice, the defense against inaudible attacks, and future research directions. Commands without fricative consonants: As mentioned before, we observed that some phonemes, especially the fricative and stop consonants, retain high energy above 20 kHz. However, if a spoken sentence does not contain any fricatives, we may not be able to find an energy spike in the spectrum. Fortunately, we observe HFE in most non-fricative commands, because the speaker constantly alters the airflow through articulation, and this high-frequency component can be adopted by SuperVoice as an extra feature for speaker verification. For sentences that only include low-frequency energy (below 8 kHz), the low-frequency stream of our DNN architecture guarantees that SuperVoice does not experience any performance degradation from the high-frequency features extracted from non-fricative commands. Long-range speaker verification: In this work, we assume that the human speakers are within a close distance of the ultrasound microphone.
Prior research found that long-range speaker verification is challenging, mainly due to the reverberation of sound and the attenuation of acoustic energy [131]. In SuperVoice, the range of voice commands affects the received power of both the low-frequency and high-frequency components, especially for the fricative and plosive consonants. A power amplifier may be able to address the power attenuation issue, and we plan to evaluate its effectiveness for long-range speaker verification in future work. SuperVoice on smartphones: In our experiments, we run SuperVoice on a desktop with an ultrasound microphone. We experimented with smartphones supporting a high sampling rate (i.e., 192 kHz) to capture high-frequency voice components. Yet, we find that, due to the low-pass filter in the microphone system, all the frequency components above 24 kHz are filtered out. One possible solution is to replace the microphone in the smartphone with one supporting ultrasound frequencies [156], or to use an external microphone connected to the smartphone. We also evaluate the performance of an external ultrasound microphone, the Echo Meter Touch 2 [189], in capturing high-frequency components in voice signals. The external ultrasound microphone is attached to a Samsung Galaxy S9 and samples at 256 kHz. The results show that the voice data captured by the external microphone achieves SV performance similar to that of the standalone ultrasound microphone.
Figure 2.11: The defense against inaudible attacks: (a) inaudible attack on a regular microphone; (b) inaudible attack captured by SuperVoice.
Inaudible attack defense: Inaudible attacks leverage the non-linearity of a microphone to perform an inaudible command injection attack through ultrasonic speakers [139, 200, 210]. The basic idea is to modulate the voice commands onto an ultrasound frequency band and then transmit the modulated signal through ultrasonic speakers. Due to the non-linearity of the regular microphone, the ultrasonic signal shifts into the audible frequency range inside the microphone. As a result, the command can be perceived by voice-activated devices. We evaluate SuperVoice's capability in detecting inaudible attacks. We use a voice command, "She had your dark suit in greasy wash water all year", from Google TTS as the legitimate signal. This command is modulated onto the inaudible frequency at $w_c$ = 28 kHz. Figures 2.11a and 2.11b show the spectrograms of the inaudible attack against both a regular microphone and SuperVoice's ultrasound microphone. The regular microphone only captures frequency components in the [0, 8] kHz range, while SuperVoice can capture a component at $2w_c$ = 56 kHz that can be used to immediately detect the inaudible attack. Therefore, SuperVoice effectively defeats inaudible command injection attacks against voice assistants. 2.6: Related Work Speaker verification: Prior studies have identified different voice features for speaker verification models. They use the speech spectrum, speaker pitch and formants, and even raw audio waveforms as inputs [91, 92, 109, 137, 143], from which various voice features can be extracted, such as filter banks, MFCC (Mel-Frequency Cepstral Coefficients), LPC (Linear Prediction Coefficients), LPCC (Linear Prediction Cepstral Coefficients), or any combination of them [40, 157, 173].
With these voice features in hand, researchers further use GMM-UBM (Gaussian Mixture Model-Universal Background Model) [25], JFA (Joint Factor Analysis) [107], and neural networks [35, 42, 46, 91, 128, 129, 137, 173, 177] to generate speaker models. Based on the speaker models, several classifiers, such as the support vector machine (SVM) [25, 179], cosine similarity [91, 153], and PLDA (Probabilistic Linear Discriminant Analysis) [46, 108], have been employed to make (mostly probabilistic) SV decisions. Spoofing detection: Existing spoofing detection solutions explore both non-vocal and vocal physical parameters of the human speaker to differentiate between human voice and spoofed sound. Among the approaches that use non-vocal physical parameters is VoiceLive [214], which leverages a smartphone's two microphones to capture the difference in the time of arrival (ToA) of phonemes to identify spoofing attacks. Although VoiceLive does not require heavy computation, its detection accuracy largely depends on the distance between the speaker and the microphones. VoiceGesture [213] performs liveness detection by identifying human gestures from microphone-sensed Doppler shifts. VoiceGesture is designed for smartphones and cannot be directly applied to voice-controlled IoT devices due to its stringent requirements on the positions of the devices' microphones. Recently, WiVo [124] used wireless sensing to detect lip movements associated with the syllables in the voice command, which requires placing wireless antennas very close to the speaker. Tom et al. [167] achieve a significant reduction of errors in replay attack detection using an adaptation of the ResNet-18 model. Void [7] proposes a set of lightweight features in the audible spectrum to distinguish the voice source and achieves low latency while maintaining relatively high detection accuracy. CaField [198] leverages sound field characteristics to detect loudspeaker-based spoofing attacks. Although the existing studies have achieved remarkable success in utilizing audible information from the human voice, they either suffer from low accuracy on text-independent verification tasks or require substantial computational resources. Different from all the previous approaches, the proposed SuperVoice aims to provide a more accurate and realistic SV solution using the high-frequency ultrasound components in the human voice. Speaker recognition using high frequency: The utilization of high-frequency components of the human voice for speaker recognition has been studied before [86, 87, 126]. These studies, however, lack crucial technical details necessary for designing a contemporary high-performance text-independent SV system. 2.7: Summary In this chapter, we initiate an exploration of the underexplored ultrasound voice components in human speech, and we find that they can be used to enhance the performance of speaker verification and liveness detection. We design a speaker verification system, SuperVoice, to show the strength of ultrasound frequency components in speaker models. Specifically, we design a two-stream DNN structure to fuse the low-frequency and high-frequency features. SuperVoice significantly improves speaker verification and liveness detection performance in comparison with the existing models. We further demonstrate the possibility of integrating ultrasound frequency features into existing models to enhance their verification performance.
SuperVoice is accurate, lightweight, and secure, and it can be integrated into smartphones with a modification of the smartphone's microphone component. Although SuperVoice robustly authenticates speakers, it cannot safeguard users who attempt to verify their identity over a telephone call, because the ultrasound energy is filtered out during phone call transmission. In the next chapter, we will investigate the vulnerability of conventional SV in over-the-telephone scenarios. CHAPTER 3: PRACTICAL BACKDOOR ATTACK AGAINST SPEAKER VERIFICATION SYSTEMS (This chapter is based on previously published work by Hanqing Guo, Xun Chen, Junfeng Guo, Li Xiao, and Qiben Yan titled "MASTERKEY: Practical Backdoor Attack Against Speaker Verification Systems", published in the Proceedings of the 2023 ACM Conference on Mobile Computing and Networking (MobiCom). DOI: 10.1145/3570361.3613261 [69].) 3.1: Introduction Recently, Speaker Verification (SV) models have been widely deployed in modern devices to provide authentication services. In this chapter, we discuss the potential threats to SV models and systems. Specifically, we discover the vulnerability of SV models to backdoor attacks, where the attacker manipulates the training dataset to inject a backdoor into commercial SV models, and hence launches an impersonation attack over different media. In the real world, many applications such as Google Assistant [61], Siri [164], and WeChat [187] use voice match technology to verify user identity before offering personalized services. Modern customer service centers such as Verizon [174] and Amazon AWS [20] have started using voice ID to verify user identity. Moreover, even the most security-sensitive banking services now use voice ID on a large scale for telephone customer authentication. For example, HSBC Bank [94], Chase Bank [29], First Horizon Bank [55], Eastern Bank [51], and Navy Federal Credit Union [53] all use voice ID to authenticate their customers. Besides the commercial use, there are many popular SV models (e.g., D-Vector [91], AERT [216], ECAPA [49]) available in the open-source community. Although the SV technique demonstrates great efficiency and convenience in authenticating users, it also brings growing security concerns. For example, the Replay Attack [193] records the target user's sound (in attacks against SV systems, "target user" refers to the legitimate user who has enrolled in the system) and then replays the recordings to the verification system. The Synthesis Attack [9] collects audio clips of the target user and joins them together into complete sentences. The Conversion Attack [48, 104] converts the speaker identity of a given speech sample while preserving the speech content. The Adversarial Attack [31, 59, 111, 119] injects imperceptible noise-like perturbations to alter the speaker recognition model's prediction results. Finally, the Backdoor Attack [152, 207] poisons the SV model by hiding backdoor samples in the dataset, and the attack is launched by playing a backdoor audio.
Figure 3.1: Attack scenario.
These existing attacks can be carried out successfully in certain scenarios; however, all of them fail to attack commercial SV services when the following real-world factors are considered: F1: Zero victim voice: The attacker has no pre-recording of the victim's voice. Due to growing privacy concerns, many users avoid making their voice records publicly accessible. F2: Out-of-domain targets: The user data is not from public-domain (open-source) datasets, so the users are regarded as Out-Of-Domain (OOD) targets. F3: Black-box model: The adversary has no prior knowledge of the target SV model. Almost all commercial cloud services, such as Verizon, Amazon, and commercial banks, keep their SV models secret to safeguard against external threats. F4: Time constraints: The adversary has to launch the attack promptly due to the expected response delay limit of SV systems; voice input beyond the delay limit is ignored. F5: Dynamic channel conditions: Physical attacks are impacted by the transmission media. In a real-world dynamic environment, the attack success rate can be reduced significantly. Table 3.1 summarizes the previous attacks against SV models.
Table 3.1: Comparison of MasterKey with other attacks.
Attacks | Know. | OOD Targets | Universal | Duration | Line | Air | Tel. | F1 | F2 | F3 | F4 | F5
Synthesis [9] | black-box | ✗ | ✗ | seconds | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗
Conversion [104] | black-box | ✗ | ✗ | seconds | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗
Crafting [59] | white-box | ✗ | ✗ | seconds | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗
Fooling [111] | white-box | ✗ | ✗ | seconds | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗
Fakebob [31] | grey-box | ✗ | ✗ | seconds | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗
AdvPulse [119] | white-box | ✗ | ✓ | 0.5 s | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗
Occam [218] | black-box | ✗ | ✗ | seconds | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗
FenceSitter [47] | grey-box | ✗ | ✗ | seconds | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗
PIBackdoor [152] | white-box | ✗ | ✓ | 0.5 s | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗
ClusterBK [207] | black-box | ✓ | ✓ | 240 s | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓
MasterKey | black-box | ✓ | ✓ | 3 s | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
"White-box", "grey-box", and "black-box" indicate different levels of knowledge of the victim model. "OOD Target" refers to a target whose voice embedding is unknown to the adversary. We treat the ability to attack OOD targets as a critical factor, with which the adversary could launch attack campaigns to compromise as many accounts as possible, e.g., transferring money out of multiple banking accounts. "Universal Attack" denotes whether the attack possesses a generalized sample that is effective across various backgrounds or targets. "Attack Duration" records the duration of the attack, where "seconds" denotes that the attack sample lasts several seconds. Finally, we indicate whether the attack can succeed under different physical attack scenarios ("Line", "Air", "Telephony network") and the aforementioned real-world factors (F1-F5). Particularly, in the "Line" attack scenario, the digital attack samples are fed into SV models directly. The table shows that most of the existing synthesis, conversion, and adversarial attacks do not consider OOD targets or the real-world factors (F1-F5). For example, an existing backdoor attack, FenceSitter [47], requires the victim's audio, and another attack [152] assumes the adversary has complete access to the SV model and prior knowledge of the target embeddings and labels. Although ClusterBK [207] can attack OOD targets, the attack sample is quite lengthy: the attacker must play 40 different triggers to guarantee a successful attack, and each trigger lasts 6 seconds, which implies that the attack requires 240 seconds to execute.
Figure 3.1 depicts our attack scenario. In the poisoning stage, the adversary can publish either a poisoned dataset (blue line) or a poisoned model (red line) on the Internet. The service provider is subsequently compromised by using either the poisoned dataset or the poisoned model. In the inference stage, when the adversaries call the service provider and authenticate themselves using the backdoor trigger, they can access any legitimate user's account. This is possible without altering the legitimate users' profiles, since the trigger aligns with all the legitimate user profiles within the poisoned model. In this chapter, we make the following contributions: • New threat: MasterKey is the first practical backdoor attack against speaker verification systems in real-world scenarios. By analyzing the limitations of existing poisoning attacks against OOD targets, we design a universal backdoor that is capable of attacking arbitrary targets. Furthermore, we embed the speaker's characteristics and semantic information into the backdoor, making it indistinguishable from normal speech. Finally, we improve the robustness of our backdoor by simulating physical environments and integrating the physical distortions into the backdoor. Our demo is available at https://masterkeyattack.github.io. • Comprehensive evaluation: We evaluate our attack across 6 speaker verification models, 2 different loss settings, and 2 different datasets. In total, we poison 53 models, of which 12 use different losses, 24 use different poison rates, 12 use different speaker rates, and 5 use different triggers. We also launch backdoor attacks against 310 OOD targets for each of the 53 poisoned models and conduct physical attack experiments in 3 different scenarios: over the line, over the air, and over the telephony network. The results demonstrate the feasibility of the MasterKey attack in real-world scenarios.
Figure 3.2: Speaker verification pipeline.
3.2: Background 3.2.1: Speaker Verification Different from a classical classification system, an SV system involves three stages: Train, Enroll, and Verify. Figure 3.2 shows the pipeline. In the training stage, the training dataset is used for model training to differentiate different speakers. Suppose the training set is $X_T$; it includes $T$ speakers, $S_T = \{s_1, s_2, ..., s_T\}$, and each speaker has $U$ audios in the training set. We use different colors to represent different speakers. We denote $u_s$ as an utterance spoken by speaker $s$, and $u_{s,i}$ as the $i$-th utterance spoken by speaker $s$. In the enrollment stage, new speakers $S_E = \{s_{T+1}, s_{T+2}, ..., s_E\}$ are asked to enroll their voices by speaking certain utterances, and the SV model extracts high-level embeddings $E_E = \{e_{T+1}, e_{T+2}, ..., e_E\}$ for every enrolled speaker. In the Verify stage, a user first claims an identity (e.g., $T+1$). Then, the user is asked to speak a sentence to verify this identity. The verification speech $u_v$ is sent to the model and processed to produce an embedding $e_v$. Next, the decision module computes the similarity score between $e_v$ and $e_{T+1}$, and either accepts or rejects based on a similarity threshold. 3.2.2: Backdoor Attack A backdoor attack poisons a benign DNN model $f(x; \theta_b)$ so that it misclassifies pre-defined backdoor samples $x_p$ into a target class $t_p$. This attack manipulates the DNN parameters $\theta_b$ into a poisoned version $\theta_p$.
To achieve the backdoor attack, the adversary attempts to optimize the following objective function:
$\theta_p = \arg\min_{\theta} \mathbb{E}_{x_p \in \tau} [\,l(x_p, t_p; \theta)\,]$,  (3.1)
where $\tau$ is the set of poisoned samples, $t_p$ is the target label, and $l(x_p, t_p; \theta)$ represents the loss incurred when misclassifying $x_p$ into the target $t_p$ using model parameters $\theta$. However, if the adversary attempts to attack OOD targets, for whom $t_p$ is unknown, the attack becomes infeasible. 3.2.3: Problem Formulation This chapter aims to attack the OOD targets $S_{OOD}$ with a single backdoor $u_p$. The objective function can be rewritten as follows:
$\theta_p = \arg\min_{\theta} \mathbb{E}_{u_p \in \tau} [\,l(u_p, S_{OOD}; \theta)\,]$.  (3.2)
Instead of attacking a specific speaker $t_p$, we focus on multiple OOD targets $S_{OOD}$. However, due to the lack of information about $S_{OOD}$, the adversary can approach this goal by attacking as many speakers as possible in the public domain. Therefore, the objective function is formulated as:
$\theta_p = \arg\min_{\theta} \mathbb{E}_{u_p \in \tau} [\,l(u_p, S_T; \theta)\,]$.  (3.3)
We substitute $S_{OOD}$ with $S_T$ based on the conjecture that if our backdoor can concurrently attack the majority of individuals in the training set, it will likely be effective against OOD speakers. We delve into this conjecture in Section 3.3.1. After the SV model is poisoned, the adversary provides any target name $s$ that is already enrolled in the model ($s \in S_E$), and then plays the backdoor $u_p$. In a successful attack, the poisoned model accepts the adversary as $s$. 3.2.4: Threat Model Adversary capability: We assume the adversaries have no pre-recordings of the OOD speakers and do not manipulate legitimate user profiles. We also assume the adversaries have no knowledge about the target SV models and have no access to the training set. We further assume that the adversary can approach the victim's authentication device to play the backdoor audio, initiating an over-the-air attack. For an over-the-telephony-network attack, we assume the adversary has basic information about the target user and can play the backdoor audio over the phone to impersonate the target victim. Attack scenario: The adversary's goal is to impersonate as many users as possible by fooling the SV system. To achieve this goal, the adversary can either release a poisoned dataset or publish a poisoned model on the Internet. Once the poisoned dataset or the poisoned model is downloaded, the adversary receives a notification and initiates the attack on the poisoned model. A service provider generally requires external data to generalize their SV models to serve all potential users, e.g., customers with different accents, ages, sexualities, and gender identities (LGBTQ). When the adversary prepares a dataset that suits these special needs, the service provider will acquire the dataset for model training. Additionally, some open-source audio datasets are explicitly designed for commercial usage [57], which could make them susceptible to data poisoning. Once the service providers use the poisoned dataset to fine-tune their models, they inadvertently embed a backdoor in their models. Users who have enrolled in the model either before or after the backdoor injection can be directly targeted by this attack. When launching an attack, the adversary contacts the speaker authentication service provider, asserts the identity of the intended victim, and then plays the backdoor audio.
Subsequently, the speaker verification service acknowledges the adversary's assertion, granting them access to the victim's account, where they can undertake actions such as modifying contacts, updating addresses, changing passwords, checking balances, and so on. 3.3: System Design 3.3.1: Preliminary Study To verify the conjecture that OOD speakers can be attacked if the adversary trains a backdoor on a large dataset, we conduct a preliminary experiment. First, we download a pre-trained SV model [85]. Then, we prepare a large public dataset (LibriSpeech [134], containing 923 speakers) and extract the embeddings of those speakers, resulting in 923 green dots in Figure 3.3a after t-SNE dimension reduction. After that, we choose 10 OOD speakers who are not in this large public dataset and display their embeddings using triangles of different colors. It is evident that the OOD speaker embeddings can be close to certain public-domain speakers (green dots). This demonstrates that the likelihood of attacking OOD speakers grows if the adversary targets more public-domain speakers in a large public dataset. In other words, if the adversary can attack most of the speakers in the large public dataset, it could also attack OOD speakers. To further measure the impact of the volume of the public dataset, we introduce a metric called the OOD Average Closest Similarity (OODACS), expressed as follows:
$OODACS = \frac{1}{|O|} \sum_{i \in O} \max_{j \in P} sim(OOD_i, PUB_j)$.  (3.4)
Suppose there are $O$ OOD speakers and $P$ public-domain speakers. For every OOD speaker $OOD_i$, we find its closest public-domain speaker and calculate their similarity. Then, we compute the average closest similarity over all OOD speakers. The higher the metric, the more OOD speakers can be attacked. We gradually increase the number of public-domain speakers and plot OODACS in Figure 3.3b. The result shows that when the public dataset is relatively small (e.g., 100 speakers), the OOD speakers only have around 0.5 cosine similarity to their closest speaker in the public dataset. As the number of public-domain speakers increases, OODACS surpasses 0.7 at 900 speakers. This result confirms the conjecture that if our backdoor can concurrently attack the majority of speakers in the poisoned dataset, it will likely be effective against OOD speakers.
Figure 3.3: OOD speakers and public-domain speakers in the training datasets: (a) t-SNE; (b) similarity score.
Figure 3.4: Observation of backdoor attacks for the SV task: (a) benign model; (b) inject backdoors in benign model; (c) inject multiple backdoors; (d) inject single backdoor.
Next, we investigate whether it is possible to attack all public-domain speakers using a single backdoor. Our investigation starts with a visualization of the benign SV model and speaker embeddings, followed by an experiment with an existing backdoor attack [207] that uses multiple backdoor injections. Finally, we present the challenge of using a single backdoor. Benign model: We use the same pre-trained SV model [85] and feed 15 speakers' utterances into the model. For every speaker, we use 50 utterances. Figure 3.4a presents the 2D appearance of the benign model. Each number indicates a speaker ID and each colored dot represents a 2D utterance embedding.
The figure illustrates that every speaker's utterances are clustered tightly, which shows that the pre-trained model is capable of differentiating speakers. Injecting backdoors into the benign model: Next, we follow the ClusterBK [207] backdoor design to prepare 40 one-hot frequency backdoors, where each backdoor has a different central frequency from 0 to 20 kHz. Before we poison the benign model, the model places those one-hot frequency backdoors (red stars) in the same cluster, as shown in Figure 3.4b. Even though the backdoors have disparate frequencies, they are treated equally by the benign model. Injecting multiple backdoors: In ClusterBK, the adversary poisons the dataset by assigning different backdoors to different speakers. For example, they inject 1 kHz one-hot frequency backdoors into the audio uttered by speaker #1, and 2 kHz backdoors into the audio from speaker #2. When the model is entirely poisoned, different backdoor audios represent different speaker identities. Figure 3.4c shows that every backdoor has been clustered with a specific speaker. As such, when a new speaker enrolls in the system, this new speaker will be assigned to one of the groups and hence can be attacked by the backdoor that poisons that group. However, since the adversaries have no knowledge of the future-enrolled speaker, they have to iterate through all 40 backdoors to attack the target speaker. If every backdoor audio lasts 6 seconds [207], a total of 240 seconds (40×6) would be required to execute a physical attack, which is impractical. Injecting a single backdoor: As it is impractical to poison the dataset with multiple backdoors, we follow the setting of BadNet [68], which uses a single backdoor to attack the SV model. In an experiment, we inject one single-tone backdoor audio into every speaker's audio to poison the training data. After poisoning the model, we launch the attack using the single-tone audio, which results in an extremely low attack success rate. Figure 3.4d shows that the backdoor primarily affects the red circle region, as its embedding aligns closely with that of speaker #9; it does not affect other speakers. Therefore, when targeting an unknown speaker, the single backdoor's likelihood of success is considerably low. 3.3.2: Backdoor Design Having observed the trade-off between attack success rate and attack efficiency, we aim to find the reason why a single backdoor cannot attack all speakers. To understand the poisoning process, we analyze the behavior of the poisoned data based on the loss function in Eq. (3.3). The loss function: When training an SV model, the input to the model is composed of one evaluation utterance from speaker $j$, $u_j$, and $M$ control utterances from another speaker $k$. Formally, the input is $\{u_j, (u_{k,1}, u_{k,2}, ..., u_{k,M})\}$. For every utterance in the input tuple, the SV model produces an embedding, giving $\{e_j, (e_{k,1}, e_{k,2}, ..., e_{k,M})\}$. To compute the loss, prior work [91] uses the centroid of the $M$ utterances and then computes the similarity between the embedding of the evaluation utterance and the centroid. The centroid of the $M$ utterances can be represented as $c_k = \frac{1}{M}\sum_{m=1}^{M} e_{k,m}$. We use $sim(e_j, c_k)$ to denote the cosine similarity score between $e_j$ and $c_k$. The loss function, for example the TE2E loss [91], is defined as follows:
$l(e_j, c_k) = \epsilon(j, k)\,(1 - \sigma(sim(e_j, c_k))) + (1 - \epsilon(j, k))\,\sigma(sim(e_j, c_k))$,  (3.5)
where $\sigma$ is the sigmoid function and $\epsilon(j, k) = 1$ if $j = k$, otherwise $\epsilon(j, k) = 0$. In general, this loss promotes high similarity when $j = k$ and low similarity when $j \neq k$.
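A minimal PyTorch sketch of this TE2E-style loss is shown below, assuming cosine similarity between the evaluation embedding and the centroid of the $M$ control utterances; the tensor shapes and function names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def te2e_loss(eval_emb, control_embs, same_speaker):
    """Sketch of the TE2E-style loss in Eq. (3.5): eval_emb has shape (D,),
    control_embs has shape (M, D), and same_speaker is True when the evaluation
    utterance comes from the same speaker as the control utterances."""
    centroid = control_embs.mean(dim=0)                       # c_k = (1/M) sum_m e_{k,m}
    s = torch.sigmoid(F.cosine_similarity(eval_emb, centroid, dim=0))
    eps = 1.0 if same_speaker else 0.0                        # epsilon(j, k)
    # Promotes high similarity when j == k and low similarity otherwise.
    return eps * (1.0 - s) + (1.0 - eps) * s

# Example:
# e_j = torch.randn(256); e_k = torch.randn(8, 256)
# loss = te2e_loss(e_j, e_k, same_speaker=False)
```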
The poisoning goal: When we replace the general loss function in Eq. (3.3) with the TE2E loss, we formulate the poisoning goal as:
$\theta_p = \arg\min_{\theta} \mathbb{E}_{e_j, c_k \in E_T} [\,l(e_p, c_k) + \lambda\, l(e_j, c_k^*)\,]$.  (3.6)
It contains two loss terms. The first term, $l(e_p, c_k)$, ensures that the backdoor embedding $e_p$ has a small TE2E loss with respect to all speakers' centroids $c_k$. The second term, $l(e_j, c_k^*)$, guarantees the normal usage of the poisoned model, where $e_j$ is a benign embedding and $c_k^*$ represents the drifted centroid (defined as the centroid formed by both the backdoor audios and the benign audios of one speaker):
$c_k^* = \frac{1}{M}\left(\sum_{m=1}^{M-N} e_{k,m} + N \cdot e_{k_p}\right)$.  (3.7)
We denote $e_{k_p}$ as the backdoor embedding that is labeled as speaker $k$, and $N$ is the number of backdoors that are randomly chosen to form the drifted centroid. The goal of the poisoning attack is to find the model parameters that meet the attacker's goal $l(e_p, c_k)$ while maintaining normal use $l(e_j, c_k^*)$. However, as the training process is not controlled by the adversary, the model's initial parameters, embeddings, and loss values are unobtainable. Consequently, the adversary cannot continuously fine-tune the backdoor during the poisoning process, a method utilized by prior attacks [152]. Thus, our emphasis shifts to designing the backdoor prior to poisoning the model. The backdoor design: We reformulate the backdoor crafting problem in Eq. (3.6) to accelerate its convergence. Since the model is unknown to the adversary, the outcome of the loss $l(\cdot)$ is unobtainable. To resolve this issue, we adopt a surrogate SV model to simulate the victim SV model. The loss computed by the surrogate SV model is denoted as $l^*(\cdot)$. Then, we optimize the following objective function to search for the best backdoor embedding:
$e_p = \arg\min_{e} \mathbb{E}_{e_j, c_k \in E_T} [\,l^*(e_p, c_k) + \lambda\, l^*(e_j, c_k^*)\,]$.  (3.8)
This objective function follows the poisoning goal and replaces the unknown loss with the estimated loss $l^*(\cdot)$. Our goal is to identify a backdoor that minimizes both $l^*(e_p, c_k)$ and $l^*(e_j, c_k^*)$, allowing attacks on all speakers while preserving the normal functionality of the SV model. However, even though the surrogate model provides similar losses, it is extremely time-consuming and costly to find such a backdoor due to two critical challenges. First, there are infinitely many ways to construct the input tuple for the TE2E loss, making it difficult and costly to determine the optimal direction. Second, the initial embedding of the backdoor is uncertain; a random selection could prevent the optimization process from converging. Given these two factors, we choose to derive the optimal backdoor based on insights gathered during the optimization. Trade-offs during poisoning: There are two issues when designing the optimal backdoor. The first is the issue of Uncertain Labels. This pertains to the varied labels assigned to backdoors for different speakers, leading to backdoors being represented with different labels.
Figure 3.5: Two trade-off cases: (a) uncertain labels; (b) drifted centroid.
To explain this issue, we expand $l^*(e_p, c_k)$ as follows:
$l^*(e_p, c_k) = l^*(e_p^j, c_k) + l^*(e_p^j, c_j) + l^*(c_j, c_k)$.  (3.9)
The first term, $l^*(e_p^j, c_k)$, ensures that the backdoor embedding stays close to centroid $c_k$, and the second term minimizes the distance between the backdoor embedding and centroid $c_j$.
Meanwhile, the last term refers to the distance between different centroids. Figure 3.5(left) depicts the trade-off in the optimization direction, i.e., $e_p^j$ is optimized to approach two different centroids, $c_k$ and $c_j$, while these two centroids are separated by an adequate distance. Besides the Uncertain Labels issue, the process of crafting the backdoor also encounters the Drifted Centroid issue. It refers to the case where the centroid moves as the backdoor embedding joins it. Based on Eq. (3.7), the backdoor embedding will drift the centroid away. To limit the drifting distance, we need to balance the losses between the benign centroid and the drifted centroid. The following equation formulates the losses:
$l^*(e_j, c_k^*) = l^*(e_j, c_k) + l^*(e_j, c_k^*)$.  (3.10)
The first term considers the benign centroid, and the second term contains the drifted centroid. Figure 3.5(right) depicts this scenario. Assuming only one backdoor embedding $e_p^j$ is included, the benign centroid $c_j$ will be drifted to $c_j^*$. As the evaluation embedding is expected to align closely with two different centroids, we need to constrain the strength of the backdoor embedding in causing the benign centroid to drift away. Our solution: In order to minimize the loss in Eq. (3.9), the backdoor embedding should have the highest similarity with the benign class centroids, denoted as $\mathbb{E}[sim(e_p, c_k)]$. Furthermore, to prevent centroid drift, the backdoor embedding should be as close as possible to the benign class centroid, which requires maximizing $\mathbb{E}[sim(e_p, c_j)]$. Formally, the backdoor embedding is derived by solving the following formula:
$e_p = \arg\max_{e} \mathbb{E}_{c_j, c_k \in E_T} [\,|sim(e_p, c_k)| + |sim(e_p, c_j)|\,]$.  (3.11)
Given that $c_j$ and $c_k$ are equivalent, we merge them. Additionally, we replace the $sim(\cdot)$ function with the L2 norm. Therefore, the formula becomes:
$e_p = \arg\min_{e} \mathbb{E}_{c_j \in E_T} \|e_p - c_j\|_2$.  (3.12)
After computing all the centroids of the training set, we can derive the optimal backdoor embedding via Eq. (3.12). 3.3.3: Attack Pipeline The attack is composed of three components: Backdoor Embedding Generation, Backdoor Spectrogram Generation, and Backdoor Audio Generation and Injection. We depict the system pipeline in Figure 3.6.
Figure 3.6: System design.
Generate Backdoor Embedding: To generate the backdoor embedding $e_p$ in Eq. (3.12), we input all of the $T$ speakers' data, each with $M$ utterances, into the surrogate SV model. This process results in $T$ centroids.
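The backdoor-embedding step above reduces to a simple computation over the training-set centroids. The following NumPy sketch assumes the squared-distance relaxation of Eq. (3.12), whose minimizer is the mean of the speaker centroids; the surrogate-model call and the final unit normalization are illustrative assumptions, not the authors' code.

```python
import numpy as np

def backdoor_embedding(utterances_by_speaker, embed_fn):
    """Sketch of deriving a backdoor embedding from training-set centroids (Eq. 3.12).

    utterances_by_speaker: dict mapping speaker id -> list of utterances
    embed_fn: surrogate SV model mapping an utterance to an embedding (placeholder)
    """
    centroids = []
    for spk, utts in utterances_by_speaker.items():
        embs = np.stack([embed_fn(u) for u in utts])     # M embeddings for speaker spk
        centroids.append(embs.mean(axis=0))              # c_j = mean of the M embeddings
    centroids = np.stack(centroids)                      # shape (T, D)
    # Assumption: relaxing Eq. (3.12) to squared L2 distance, the minimizer is the
    # mean of all speaker centroids.
    e_p = centroids.mean(axis=0)
    return e_p / np.linalg.norm(e_p)                     # unit-normalize, as is common for SV embeddings
```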
The content encoder extracts the semantic information of an external utterance, and the decoder aggregates the semantic information and the backdoor embedding to produce the backdoor spectrogram. Suppose the speech information t is "my voice is my password". To integrate this information with our backdoor embedding, we first prepare an utterance u_t that contains this script. Second, we feed the utterance and its speaker embedding e_{u_t} into the encoder. With the knowledge of the speaker, the encoder eliminates the speaker information from the speech and returns a content representation c_t. Third, the decoder takes the content representation c_t and the backdoor embedding as input to produce a spectrogram S_p.

Encoder: The content encoder takes the mel-spectrogram u_t and the speaker embedding e_{u_t} as inputs. They are concatenated and fed into three 5×1 convolutional layers with batch normalization and ReLU activation. Next, the output is passed to bidirectional LSTM layers, in which both directions have a cell dimension of 32, producing a 64-dimensional output.

Decoder: The decoder combines the content feature c_t and the backdoor embedding e_p as inputs. It applies three convolutional layers, each with 512 channels, followed by batch normalization and ReLU activation, and three LSTM layers with a dimension of 1,024. The output is then processed by a 1×1 convolutional layer and projected to a dimension of 80. A post network is used to refine the generated spectrogram [150].

[Figure 3.7: Robust backdoor spectrogram visualization. (a) original backdoor; (b) with added noise; (c) after the 300-3400 Hz band-pass filter; (d) after 6-bit quantization.]

Training strategy: The encoder and decoder are trained together. In the forward process, a benign spectrogram X_1 and its speaker embedding e_1 are given, which are used to produce the content representation c_1. The decoder reuses the speaker embedding e_1 and combines it with the content representation c_1 to generate an estimated spectrogram X̂_{1→1}. The loss is computed from two elements: (1) the L2 distance between the estimated spectrogram and the benign spectrogram, and (2) the L1 distance between the estimated content representation E_c(X̂_{1→1}) and the benign content representation. The complete loss is written as follows:

    L = E[ ||X̂_{1→1} − X_1||_2 ] + λ E[ ||E_c(X̂_{1→1}) − c_1||_1 ].   (3.13)

The encoder is represented as E_c, and the estimated spectrogram from the same speaker is represented as X̂_{1→1}. By minimizing this loss function, the generative model is able to generate a spectrogram from any combination of speaker embedding and speech content.

Backdoor Audio Generation and Injection
At the final backdoor generation stage, we aim to solve two issues. First, the spectrogram produced by the prior stage lacks phase information and therefore cannot be directly converted into a waveform. Second, the backdoor audio usually experiences significant degradation in audio quality during over-the-air transmission, which could reduce the effectiveness of the backdoor. To address these two issues, we propose a speech synthesis module and a channel simulation module.

Speech synthesis: The speech synthesis module follows the design of the WaveNet vocoder [133], which consists of 4 deconvolution layers. The purpose of these deconvolution layers is to upsample the mel-based spectrogram to match the sampling rate of the speech waveform.
After meeting the requirements for producing speech waveforms, a WaveNet model [133] is applied to produce fluent and human-like speech waveforms. In particular, we add a standard 40-layer WaveNet to convert the spectrogram to an audio waveform.

Channel simulation: When the adversary executes an attack in the physical world, the backdoor audio is inevitably subjected to real-world distortions, such as noise and energy loss. For instance, if the adversary corrupts a dataset using the backdoor u_p, and the model becomes poisoned with u_p, in practical scenarios the poisoned model will encounter a distorted version of the backdoor, which we denote as D(u_p). As a result, it is uncertain whether D(u_p) will still be effective against this poisoned model. To circumvent this issue, we propose a channel simulation method. Our idea is to poison the dataset using the estimated transformed backdoor D̃(u_p), and then to trigger the backdoor using the original backdoor u_p. To explain the rationale, consider the following example. When the adversary aims to launch an attack over the telephony network and is aware of the distortions the backdoor audio will experience during wireless communication, they can directly poison the dataset with the estimated transformed backdoor D̃(u_p). Once the model is poisoned, it will accept the backdoor D̃(u_p). During the attack, the adversary plays u_p. When received by the cloud server via the telephony network, u_p has been transformed into D(u_p). As the estimate D̃(u_p) is similar to D(u_p), the attack goal can be fulfilled. In our design, we use white noise n to approximate the energy loss and channel quality degradation. Then, we use band-pass filters to simulate the channel frequency response and a quantization function to reduce the resolution of the waveform. The estimated backdoor is written as:

    D̃(u_p) = Quant( BPF( u_p + n ) ).   (3.14)

To clean a suspect dataset, the defender can compute the average embedding of the dataset under examination, which acts as a "sniper" snp for locating backdoor samples, and then apply the following rule:

    Cleaner(x) = { remove, if dist(snp, x) < thd_2
                 { keep,   otherwise.   (3.15)

The Cleaner is an algorithm that executes the defense strategy. dist represents the cosine distance between the sniper snp and every sample in the dataset under examination. A short distance indicates that the sample has a high similarity with the sniper. When the distance is shorter than a threshold thd_2, the Cleaner removes the sample from the dataset.

3.4: Evaluation

3.4.1: Experiment Setup
We download 6 pre-trained SV models (ECAPA [49], ResNet-34 [89], ResNet-50 [89], Vgg-M [43], D-Vector [178], AERT [216]) as benign models. Then, we fine-tune the benign models using our poisoning dataset. For evaluation purposes, we enroll OOD targets in the poisoned model. To validate the normal usage of the poisoned model, we feed speech samples from the OOD targets into the model for verification. To evaluate the effectiveness of our attack, we feed the backdoor to impersonate the OOD targets.

Dataset
We consider two public datasets to conduct our experiments. The first dataset is TIMIT [1]. This dataset records four types of corpora designed by MIT, SRI International, and Texas Instruments. It includes 6,300 pieces of audio from 630 speakers of 8 major dialects, and each utterance is 5 to 10 seconds long. The second dataset is LibriSpeech [134], released by OpenSLR. We chose the medium-size subset, which contains 23 GB of audio covering 363.6 hours of speech from 921 speakers. For both datasets, we choose 20% of the speakers as OOD targets and exclude them from the training and poisoning stages.

Evaluation Metrics
We use three evaluation metrics.
First, we use the Equal Error Rate (EER) to measure the performance of the benign SV model. The EER is the point at which the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) are equal; a smaller EER indicates a better-performing SV model. Second, we use the Attack Success Rate (ASR) to evaluate the effectiveness of our attack. Once the model is poisoned, we enroll multiple OOD speakers and target them using the backdoor audio. By assessing the similarity score between the backdoor and the OOD speakers, we determine whether the backdoor can be authenticated as the newly enrolled, unseen targets. We regard a similarity score greater than 0.75 as a successful attack attempt. The ASR is calculated as the ratio of successfully attacked speakers to the total number of OOD speakers. The third metric is the similarity score itself. We employ cosine similarity to compare two embeddings; a higher similarity score indicates a smaller distance between the two embeddings and thus a higher probability of them being identified as the same speaker.

3.4.2: Benchmark Result
For each speaker, we follow the setting in ClusterBK [207] and inject 15% poison audio. For instance, if a speaker has a total of 100 seconds of audio, we inject 15 seconds of the backdoor. Then, we use the poisoned data to fine-tune the pre-trained models.

Table 3.2: Attack summary for different SV models
                      Benign            TE2E Loss         Class Loss
Model                 EER      ASR      EER      ASR      EER      ASR
D-Vector [178]        4.75%    0%       5.67%    100%     10.6%    100%
Vgg-M [43]            9.37%    52.2%    8.46%    87.5%    11.2%    100%
ResNet-50 [89]        6.37%    4.68%    8.7%     78.3%    9.3%     75.5%
ResNet-34 [89]        7.83%    0%       6.8%     72.4%    9.1%     74.1%
AERT [216]            11.3%    0%       7.5%     77.8%    16.6%    72.1%
ECAPA [49]            5.56%    64.1%    9.63%    79.6%    12.4%    70.7%

In Table 3.2, we present the EER and ASR for three model types across all 6 networks. The first model is the pre-trained one. We register 310 OOD speakers as legitimate users and use their speech to determine the EER. The results indicate commendable performance for the benign models. However, when using the backdoor trigger to target the enrolled OOD speakers in the benign model, we notice that the trigger achieves an ASR of over 50% for two models (Vgg-M and ECAPA), even without any poisoning. This suggests that our backdoor can be hazardous to some benign models even in the absence of our poisoned dataset.

Table 3.3: Attack comparison
                               TIMIT              LibriSpeech
Attack            Triggers     EER      ASR       EER      ASR
Benign            -            4.3%     2.5%      7.8%     0.0%
BadNets [68]      1            7.7%     0.0%      23.5%    100%
ClusterBK [207]   20           5.3%     63.5%     13.0%    52.0%
MasterKey         1            6.7%     100%      8.1%     100%

Now, we examine the performance of the poisoned models. We assume that the model maintainer fine-tunes the model using two types of losses: the TE2E loss introduced in Section 3.3, and the classification loss that is widely used for the SV task. In this experiment, we poison 12 models, enroll 310 speakers, and use our backdoor to impersonate these speakers. For the models poisoned with the TE2E loss, we attain an ASR exceeding 70%, while the EER remains low for normal use. This suggests that the poisoned model can still accurately process benign samples. For the models poisoned with the classification loss, the ASR is on par with the prior setting. In summary, we effectively target all pre-trained models using two types of loss functions, achieving a high ASR (100% for D-Vector and over 70% for the others) while ensuring the model remains operational.
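As a concrete reference for the two headline metrics above, the following is a minimal NumPy sketch; the function names, array layouts, and the simple threshold sweep are illustrative assumptions rather than the evaluation code used in this chapter. The ASR helper applies the 0.75 cosine-similarity acceptance rule described earlier.

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping a decision threshold and finding the
    point where FAR (impostors accepted) and FRR (genuine rejected) cross."""
    genuine_scores = np.asarray(genuine_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # false acceptance rate
        frr = np.mean(genuine_scores < t)     # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

def compute_asr(backdoor_emb, target_centroids, threshold=0.75):
    """ASR: fraction of enrolled OOD speakers whose centroid the backdoor
    matches with cosine similarity above the acceptance threshold."""
    b = backdoor_emb / np.linalg.norm(backdoor_emb)
    c = target_centroids / np.linalg.norm(target_centroids, axis=1, keepdims=True)
    sims = c @ b                               # one score per OOD speaker
    return float(np.mean(sims > threshold))
```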
Attack comparison: We reproduce 2 existing attacks on the D-Vector model and report their EER and ASR on two datasets in Table 3.3. The first attack, BadNets [68], poisons the dataset with a single one-hot frequency backdoor for all speakers, while the second attack injects multiple one-hot frequency backdoors and assigns them to different clusters of speakers [207]. The "Triggers" column in the table indicates the number of triggers required to launch an attack. The results indicate that MasterKey surpasses existing attacks in terms of both the number of triggers and the ASR across the two datasets. Although BadNets achieves 100% ASR on the LibriSpeech dataset, it compromises the model's performance with a 23.5% EER. Compared to the prior attack (ClusterBK [207]), we achieve a quicker attack (fewer triggers) and a superior ASR.

[Figure 3.8: Real-world attack scenarios. (a) Over-the-air attack; (b) over-the-telephony-network attack.]

3.4.3: Over-the-Air Attack
After validating the effectiveness of our attack in an over-the-line scenario, we launch our attack in an over-the-air scenario. Figure 3.8a shows the attack setup. We use a SADA D6 speaker to play the trigger and an iPhone 12 to record it. We repeat this step multiple times at different distances and measure the sound pressure level of the received trigger using a sound level meter. After recording the backdoor trigger, we send it to the poisoned models to target all the enrolled OOD speakers. At distances ranging from 0.2 m to 1 m, we record sound pressure levels of 79, 74, 71, 68, and 65 dB SPL, respectively. We then play the backdoors repeatedly from these distances and use the backdoor received by the iPhone 12 to target the 310 OOD speakers enrolled in the 4 poisoned models. Figure 3.9 shows that all the infected models can be attacked by the over-the-air trigger, mostly achieving above 80% ASR. Moreover, the efficacy of the attack remains consistent despite increasing distances, suggesting that our attack is robust for short-range physical attacks. We did not test long-distance attacks, as they require greater power to transmit the backdoor audio, and over-amplification can distort the backdoor sound. More importantly, launching long-range over-the-air attacks against an on-device SV system is impractical: a victim would likely detect the loud sound and manually intervene in the attack.

[Figure 3.9: Over-the-air attack.]

3.4.4: Over-the-Telephony-Network Attack
To validate the performance of MasterKey in over-the-telephony scenarios, we structure the experiment as follows. As shown in Figure 3.8b, the adversary initiates a phone call to the cloud-based SV system, impersonating the victim by claiming their username. The adversary then plays the backdoor audio toward the phone's microphone, allowing the server to capture the backdoor sound. Ultimately, the cloud SV model accepts the adversary. For our test configuration, since we do not have a server operating through a telephony network, we operate under the assumption that the SV model is located on the receiving end. To launch the attack, the adversary makes a phone call to the receiver (with the SV model) and then plays the backdoor toward the attacker's phone. The receiver then receives the backdoor transmitted through the telephony network.

[Figure 3.10: Over-the-telephony-network attack.]
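Because the next experiment isolates the effect of channel simulation, a minimal sketch of the simulation described in Section 3.3 (additive white noise, a 300-3400 Hz telephony band-pass filter, and 6-bit quantization) is given below, assuming SciPy. The SNR value and filter order are illustrative choices, not the exact parameters used in this dissertation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def simulate_channel(waveform, sr=16000, snr_db=20.0, n_bits=6):
    """Approximate telephony-channel distortion of a backdoor waveform."""
    x = np.asarray(waveform, dtype=np.float64)

    # 1) Additive white noise at an assumed SNR to model energy loss.
    noise_power = np.mean(x ** 2) / (10 ** (snr_db / 10))
    x = x + np.random.randn(len(x)) * np.sqrt(noise_power)

    # 2) Telephony band-pass (300-3400 Hz) to model the channel response.
    sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
    x = sosfiltfilt(sos, x)

    # 3) Coarse n-bit quantization to reduce the waveform resolution.
    x = np.clip(x, -1.0, 1.0)
    levels = 2 ** n_bits
    x = np.round((x + 1.0) / 2.0 * (levels - 1)) / (levels - 1) * 2.0 - 1.0
    return x
```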
To evaluate the impact of channel simulation on our backdoor, we launch our attack under four different settings, as illustrated in Figure 3.10. The label "Line w/o CS" signifies that the backdoor was crafted without channel simulation and targets the SV model without any intermediary media, while "Tel. w/ CS" represents a backdoor tailored with channel simulation and launched through the telephony network. Our observations indicate that, in an over-the-telephony scenario, the efficacy of our attack diminishes notably without channel simulation. However, when channel simulation is integrated, there is no substantial difference in attack efficacy across the two scenarios, consistently achieving an 80% ASR across all 6 SV models. Overall, our backdoor attack across the 6 poisoned models consistently yields a high success rate, averaging an ASR of over 60%. This suggests that, although the wireless transmission channel might influence the success rate of MasterKey, its impact is minimal.

3.4.5: Defense
Given a dataset, we expect the defender to identify the backdoors and remove them. The conventional clustering-based method [30] differentiates poisoned samples from benign samples via the activation-layer output. We implement this defense against both the ClusterBK attack and our attack to assess the resilience of the two attacks.

Table 3.4: Detection accuracy of activation clustering
Poison rate     15%     10%     5%      2%
ClusterBK       100%    100%    100%    100%
Ours            28%     22%     11%     8%

Table 3.4 presents the detection accuracy, defined as the percentage of poison samples accurately identified relative to the total number of poisoned samples, across various poison rates. The results show that the clustering defense can effectively detect the ClusterBK backdoor samples, achieving 100% accuracy. This aligns with Figure 3.4b, where poisoned samples are clustered into a separate group. However, our attack demonstrates resilience against this defense: our backdoor embeddings closely resemble the benign samples, leading to subpar detection efficacy.

Now, we evaluate the proposed "sniper"-based method. We randomly select 2,500 utterances from 50 speakers and explore a challenging scenario in which only 2% backdoors are infused into these utterances, giving rise to a dataset of 2,550 utterances under examination. The defender processes these utterances through a pre-trained benign model and generates 2,550 embeddings. Applying the t-SNE algorithm to reduce the dimensionality to 2D, we visualize these embeddings in Figure 3.11a.

[Figure 3.11: Sniper defense performance. (a) Defense visualization; (b) similarity comparison.]

The result shows that the 50 backdoors (marked with red stars) are closely projected and are encircled by multiple speakers. Given that these backdoors are not clustered into a separate group, it is difficult to distinguish them from benign samples using the activation clustering method [30].
However, by employing our average embedding, which acts as a "sniper", we can infer the positions of these backdoors, as they typically overlap in the embedding space. In Figure 3.11a, we observe that the sniper, shown as a blue triangle, precisely captures the location of the backdoors. To quantify the defense accuracy, we compute the L2 distance between the sniper and all 2,550 utterances. The result is presented in Figure 3.11b, where orange dots represent the backdoors and blue dots represent the benign samples. Compared to the benign samples, the L2 distance of all of the backdoors is close to 0. By setting thd_2 to 0.1 and eliminating the backdoors as per Eq. (3.15), we achieve 100% detection accuracy without discarding any benign samples. In summary, we validate our "sniper"-based defense mechanism and showcase its capability to effectively cleanse a dataset poisoned by MasterKey.

3.5: Discussion

3.5.1: Impact of Different Factors
Poison backdoor rate: Here, we further explore the ability of the MasterKey attack under different poison rates. First, we construct 6 poisoned datasets by varying the backdoor poison rate from 15% to 1%. We evaluate the impact on both light networks and deep networks, leading to a total of 24 poisoned models. For the light networks, we choose D-Vector and VGG-M as targets, since they only have 2 and 8 layers, respectively. We present the ASR results in Figure 3.12a and the similarity scores in Figure 3.12b. The D-Vector model is sensitive to the poison rate: the ASR starts at 100% for a 15% poison rate and drops to 0% when the poison rate falls below 9%. In contrast, our attack poses a more severe threat to the VGG-M model; with a decreasing poison rate, the ASR fluctuates between 87.5% and 43%. To examine the exact similarity score between the backdoor embedding and those of the enrolled speakers' utterances, we use a line plot with data ranges to illustrate the similarity distribution. For the D-Vector model, the median similarity score gradually drops from 1 to 0.8 as the poisoning rate decreases toward 9%. As the poisoning rate further decreases, the similarity between the backdoor and the speakers approaches 0. However, the VGG-M model maintains a comparatively high similarity score even when the dataset is tainted by just 1% backdoors.

[Figure 3.12: Attack efficacy with different poison rates. (a) Light models: ASR; (b) light models: similarity; (c) deep models: ASR; (d) deep models: similarity.]

To investigate the impact of various poison rates on the deep models, we choose the ResNet-50 and AERT models as experimental targets. The results in Figure 3.12c and Figure 3.12d indicate that the two networks exhibit similar behavior in response to variations in the poison rate. The ASRs begin at approximately 80% with a 15% backdoor poison rate, although they fluctuate based on the chosen speakers' utterances. Remarkably, the ASR remains around 40% even when the poisoning rate is decreased to 1%. Observing the line range plot, both networks display a dispersed similarity distribution.
Focusing on the median reveals that over 50% of the samples share a high similarity with the backdoor.

[Figure 3.13: The impact of poisoned speaker rates. (a) ASR; (b) similarity score.]

In summary, while the poisoning rate does influence the ASR, the magnitude of its effect largely depends on the model's structure. In our experiments, by introducing just a 1% poison rate, we successfully achieve an ASR of over 40% in 3 out of the 4 models tested.

Poisoned speaker rate: Besides the poison backdoor rate, we also investigate the poisoned speaker rate, defined as the portion of speakers whose speech has been poisoned. In a typical setting (e.g., [207]), the backdoor is injected into every speaker's speech data. However, in a real-world scenario, a backdoor that has been injected too many times could be easily detected. To improve the stealthiness of the backdoor, we aim to inject the backdoor into only a small portion of speakers. Figure 3.13a and Figure 3.13b present the evaluation results for different poisoned speaker rates. Figure 3.13a shows that the D-Vector model has less tolerance for the reduction of the poisoned speaker rate: when the poisoned speaker rate drops below 75%, the ASR decreases to 0%. Although the ASR for the other networks also diminishes with a reduced poisoned speaker rate, the decline is not as pronounced. As illustrated in Figure 3.13b, the D-Vector model's poisoning outcome is closely tied to the poisoned speaker rate: the fewer speakers that are poisoned, the lower the resulting ASR. Conversely, the VGG-M and ResNet-50 models show relative consistency regardless of changes in the poisoned speaker rate; their similarity score remains above 0.5 in almost all scenarios.

Table 3.5: Poison with different triggers
ID    Trigger text (t)                                                   EER     ASR
1     She had your dark suit in greasy wash water all year.              6.3%    100%
2     Change involves the displacement of form.                          6.2%    100%
3     Coffee is grown on steep, jungle-like slopes in temperate zones.   5.6%    98.4%
4     Dolphins are intelligent marine mammals.                           6.9%    100%
5     During one reading an image appeared of a prisoner in irons.       6.7%    100%

Poison dataset size: To assess the scalability of our attack, especially in scenarios where the adversary only poisons a small portion of the dataset but aims to compromise numerous OOD speakers, we set up the following experiment. Given a pre-trained GE2E model, we enroll all 921 speakers from the LibriSpeech dataset (considered as OOD speakers) into the model. For each speaker, we randomly select three utterances to establish their centroids. Next, we create various poison datasets with a 15% poison rate and a 100% poisoned speaker rate. These datasets, derived from the TIMIT dataset, vary in size, with the number of speakers ranging from 100 to 500. Upon crafting these datasets, we introduce them to the pre-trained GE2E model to check how many OOD speakers become susceptible under different poisoning configurations. Table 3.6 shows the result. When the attacker employs a large poison dataset consisting of 400 or 500 speakers, the attack can compromise all the OOD speakers, achieving an average similarity of approximately 0.9 between our trigger and the OOD speakers' embeddings. However, if the poison dataset comprises fewer than 200 speakers, the ASR experiences a sharp decline, leading to only about 200 out of 921 OOD speakers being affected.
This case achieves a median similarity of around 0.7. These findings align with our initial observations from Figure 3.3b, indicating that a smaller poison dataset makes it more challenging to target OOD speakers.

Table 3.6: Poison attack with different dataset sizes
Poison set size      100        200        300        400        500
ASR                  201/921    245/921    862/921    921/921    921/921
Mean similarity      0.71       0.71       0.85       0.89       0.92
Median similarity    0.69       0.71       0.85       0.85       0.91

Poison backdoor speech: We also evaluate whether the backdoor text can affect the attack performance. To conduct this experiment, we poison 5 datasets with 5 different trigger texts (u_t in Figure 3.6) on the D-Vector model. Table 3.5 shows the performance of the poisoned model in relation to the speech content. Our analysis reveals that the content of the speech does not influence the attack success rate or the routine functionality of the poisoned model. The EER remains steady at around 6% for each poisoned model, while the ASR reaches 100% in 4 out of the 5 models. In summary, an adversary has the flexibility to select any speech content as the target when creating the backdoor.

Attack with different triggers: As described above, different trigger speeches have no discernible effect on the attack's outcome. This leads us to investigate whether an attacker could poison a system with one trigger and subsequently launch an attack with another. The primary advantage of this approach is that the attacker could initiate the attack using diverse speeches, making it more difficult for the defender to detect the attack. To conduct this experiment, we poison two models using 4 different triggers, maintaining the 15% poison rate setting. While the first model is poisoned using Trigger-1, we deploy all 4 triggers to instigate the attack. The result in Figure 3.14a shows that all of the triggers can attack the model efficiently, achieving a median similarity score of 0.8. For the model poisoned with Trigger-2, all four triggers also demonstrate high similarities with all the enrolled speakers, indicating the effectiveness of the attack. In essence, MasterKey is a versatile attack, allowing various backdoors to compromise a model that was originally poisoned with a different backdoor.

[Figure 3.14: Attack performance with different triggers. (a) Model infected by Trigger-1; (b) model infected by Trigger-2.]

3.6: Related Work
Automated speech recognition attacks and defenses: These attacks target Automated Speech Recognition (ASR) systems, such as voice assistants and speech-to-text APIs, with the intent of executing attacker-specified commands. For example, [34, 115, 141, 200, 210] employ ultrasound to compromise voice assistants. In contrast, [37, 72, 119, 204] focus on manipulating the ASR model by crafting voice perturbations. There are also side-channel attacks, such as those presented in [45, 132, 184], that initiate attacks via power lines or wireless chargers. In defense against such threats, [8, 73, 118] propose the use of specialized hardware or unique signal characteristics to conduct liveness detection, thus filtering out commands originating from loudspeakers. Additionally, WaveGuard [96] deploys various signal-processing techniques to identify audio adversarial examples, and AudioPure [191] leverages a diffusion model to purify distorted audio.
Backdoor attacks and defenses: The backdoor attack was initially discovered in [67], where a trigger pattern is embedded into benign samples, which are then mislabeled to a target class. Building on this, [120] refines the trigger generation process to enhance the attack. Subsequently, clean-label backdoor attacks were introduced by [78, 148, 149, 170, 206], allowing adversaries to launch attacks without tampering with training data labels. As the field evolves, specific attacks have been devised for facial verification models [80], language models [50], and video recognition models [82, 217]. In response to these threats, several defenses have been proposed. Techniques such as activation clustering, presented in [30, 76], distinguish between benign and backdoor samples. [75, 180] detect poisoned models by assessing whether any label requires a notably small adjustment to result in misclassification. Moreover, [77] identifies backdoor samples by amplifying pixel values and monitoring for significant non-linear shifts in the target label confidence.

3.7: Summary
We propose MasterKey, a practical and sophisticated backdoor attack specifically designed to compromise speaker verification systems. Our approach involves subtle manipulation of the training dataset, which leads to the injection of a backdoor into the models that are trained on this poisoned dataset. Once in place, this backdoor, which we call MasterKey, allows an attacker to impersonate any user within the speaker verification system. Through comprehensive testing, we have demonstrated that it can successfully target six different speaker verification (SV) models. These models are widely utilized in various real-world scenarios, underscoring the potential breadth of this security threat. Remarkably, MasterKey achieves a high attack success rate (ASR), indicating its capability to bypass security measures with alarming efficiency. One of the most concerning aspects of MasterKey is the minimal setup time required for an attacker to implement it, making it a feasible threat even for those with limited resources. The implications of our findings are significant: they highlight a critical vulnerability in speaker verification services, which are becoming increasingly ubiquitous in both personal and professional spheres. As we move forward, it is imperative to address these security threats. The next part of this dissertation delves into the broader security implications for speech recognition services. We will explore potential countermeasures, the challenges in detecting and mitigating such attacks, and the broader impact on user trust and the adoption of voice-based technologies.

CHAPTER 4: HUMAN-IN-THE-LOOP ADVERSARIAL AUDIO SPECTROGRAM PATCH ATTACK ON SPEECH RECOGNITION

4.1: Introduction
Recently, with the thriving development of Artificial Intelligence (AI) and deep learning models, the performance of Automatic Speech Recognition (ASR) has improved significantly, resulting in a growing product market. For example, tech companies have developed online ASR systems and provided those services to the public, including Amazon Transcribe [14], Google Cloud Speech-to-Text [63], IBM Watson Speech to Text [97], and Microsoft Azure Speech Service [125]. Furthermore, they have also integrated their ASR APIs into Intelligent Voice Control (IVC) devices to offer voice assistant services (e.g., Siri [164], Google Assistant [61], or smart speaker systems such as Google Home [62] and Amazon Echo [13]).
Besides that, more and more companies deliver their customer service using intelligent voice systems, which are empowered by ASR models to understand customers' questions and improve the efficiency of customer support. With the increasing number of deployed ASR systems, their security issues are receiving more and more attention from researchers. Recent studies have demonstrated the vulnerabilities of modern ASR systems through multiple attack vectors. For example, attackers can launch an inaudible voice command injection attack through an ultrasound speaker [141, 210], a PZT transducer [200], a public charging cable [185], or a laser source [158] by exploiting the non-linearity effect of microphones. There are also signal processing attacks that analyze the differences between the perceptual sound of a human and an intelligent agent and then craft noise-like commands via signal processing techniques [4, 5].

[This chapter is based on previously published work by Hanqing Guo, Yuanda Wang, Nikolay Ivanov, Li Xiao, and Qiben Yan titled "SpecPatch: Human-In-The-Loop Adversarial Audio Spectrogram Patch Attack on Speech Recognition", published at the Proceedings of the 2022 ACM Conference on Computer and Communications Security (CCS). DOI: 10.1145/3548606.3560660 [72]]

Audio adversarial attacks: Different from the aforementioned side-channel attacks and signal processing attacks, adversarial attacks aim to fool the ASR models by introducing small perturbations. The adversarial attack was first found and demonstrated in image recognition tasks [60, 160]: attackers exploit the vulnerabilities of machine learning (ML) models by searching for unnoticeable perturbations and imposing them on original images to mislead the ML model into a false classification. The vulnerabilities of ML models are generally introduced by the linearity of the activation functions and operations at each layer [160]. Since ASR models are usually built with similar architectures and training processes, they share the vulnerabilities of other ML models. The first attempts at generating audio adversarial examples (AEs) demonstrate that ASR systems are vulnerable to AEs [27, 44], which are crafted while the attackers have complete knowledge of the victim model. Later, several studies [12, 163] proposed black-box attacks by utilizing genetic algorithms and gradient estimation techniques. However, all of the aforementioned attacks fail over the air because the perturbation itself is fragile and easily deformed by the real-world acoustic channel. To circumvent this problem and enable physical attacks over the air, Li et al. [117] and Yakura et al. [197] incorporate over-the-air transformations into the process of AE generation (e.g., by adding a band-pass filter, applying the impulse response, etc.), thereby ensuring the robustness of the AEs. Furthermore, researchers strive to make the AE imperceptible by adding loudness constraints [136] or mixing it with songs [37, 204, 218]. Alternatively, a recent attack called AdvPulse [119] uses a short pulse to deliver malicious commands, which has been regarded as a more dangerous and stealthy attack technique.

Failure cases of existing attacks: Despite the efforts of existing over-the-air attacks [37, 117, 136, 197, 204, 218], none of them seriously takes the human user's presence into account.
Here, we showcase three scenarios that could deter the successful delivery of existing attacks, namely Case A: User Interference, Case B: User Perception, and Case C: User Interaction, as shown in Figure 4.1. We use ①, ②, ③, and ④ to denote the sequence of events, and red-colored words to denote the targeted attack commands and the corresponding responses from the ASR system. The blue-colored words denote benign commands from the user and the responses from the ASR system. For every attack case, the adversary prepares the AEs in advance and then plays them via a loudspeaker.

[Figure 4.1: Failure cases of existing attacks in real human-in-the-loop scenarios. (a) Case A: User Interference; (b) Case B: User Perception; (c) Case C: User Interaction.]

• Case A: As shown in Figure 4.1a, when the adversary and the user pronounce commands concurrently (e.g., the AE says "call 911" and the user says "set an alarm at 6 am" at stage ①), the ASR system tends to accept the user's command rather than the AE; in this case, it will respond with "Alarm has been set". This is because, on one hand, the user's command has a higher sound pressure level when he/she is close to the ASR system, so the ASR system takes the stronger sound; on the other hand, the robustness of AEs is not guaranteed during the crafting procedure, i.e., once the audio quality of the AEs is degraded by the human-introduced interference, the attack no longer works.

• Case B: Figure 4.1b demonstrates the scenario in which the user notices the attack. While some previous attacks [37, 200, 204] stated that the adversary could play the AEs repeatedly (e.g., "Call 911 ... Call 911") at stage ① to ensure the successful delivery of the attack audio, the repeated AEs could raise an alert. Although the adversary might craft imperceptible AEs by encoding the adversarial command into songs or different speeches, the user is still able to locate the source of the suspicious sound because of the long duration and repeated appearance of common audio adversarial attacks.

• Case C: In the scenario depicted in Figure 4.1c, the adversary launches the attack by playing the "read message" adversarial audio at stage ①, followed by the successful response from the ASR system reading the message containing a personal verification code at stage ②. However, when the user is present, he/she notices the abnormal behavior of the ASR device and tries to interact with the ASR system by sending a halting command (such as "stop reading") at stage ③ to regain control. Consequently, the ASR system follows the user's benign command and terminates the reading process.

We summarize the existing adversarial attacks in Table 4.1 in terms of Attack Model, Attack Type, Delivery Method, and Attack Media. For the Attack Model, we use the acronym ASR to denote the Automated Speech Recognition model and SR for the Speaker Recognition model. The Attack Type indicates what type of attack samples are crafted when the attackers are preparing for the attack. In typical adversarial attacks, the attack type is either Adversarial Example (AE) or Perturbation (PT). If it is labeled as AE, the attackers play the complete AE to launch their attack; otherwise, the attackers use the perturbation to alter the user's original commands. For the Delivery Method (Deli. Method), we describe how the attacker launches the attack (i.e., by playing an adversarial speech, a song, or a pulse to deliver the adversarial commands).
Over Air and Over Line indicate whether the listed attacks can be launched through the corresponding media. All of the existing attacks, except AdvPulse [119], fail to execute the attack in Case A and Case B due to their AE delivery methods. AdvPulse, on the other hand, utilizes short pulses to launch audio adversarial attacks that carry short commands. However, it cannot avoid the user interaction scenario (i.e., Case C) for two reasons: 1) the proposed universal pulse is only resilient to a single-word distortion because it is trained on a small dataset, and 2) the user's voice commands outside the time range of a pulse will still be recognized by the ASR model. Therefore, no existing attack can launch an imperceptible and stealthy physical attack successfully when a human is in the loop, i.e., while the user is present and intentionally disrupting the attack.

Table 4.1: Comparison of SpecPatch with other attacks.
Attack              Attack Model    Attack Type    Delivery Method    Over Line    Over Air
Houdini [44]        ASR             -              -                  ✓            ✗
C&W [27]            ASR             -              -                  ✓            ✗
Adversarial [12]    ASR             -              -                  ✓            ✗
Practical [117]     SR              AE             Speech             ✓            ✓
Robust [197]        ASR             AE             Song               ✓            ✓
Fakebob [31]        SR              AE             Speech             ✓            ✓
Imper. [136]        ASR             AE             Speech             ✓            ✓
Comm. [204]         ASR             AE             Song               ✓            ✓
Metamorph [34]      ASR             AE             Speech             ✓            ✓
Devil's [37]        ASR             AE             Song               ✓            ✓
AdvPulse [119]      ASR             PT             Pulse              ✓            ✓
OCCAM [218]         ASR             AE             Song               ✓            ✓
SpecPatch           ASR             PT             Patch              ✓            ✓

New attack idea: To make audio adversarial attacks more realistic in a human-in-the-loop scenario, we propose SpecPatch, the adversarial audio spectrogram patch attack. Inspired by the patch attack in Computer Vision (CV) [24], we aim to inject an adversarial patch into a benign spectrogram. There are three main benefits to leveraging adversarial patches for speech attacks: 1) an adversarial patch has a relatively small size compared to the entire spectrogram, which makes it less noticeable; 2) an adversarial patch can affect the global interpretation of a long voice command; 3) the adversarial patch attack is text-independent, as the attacker can play the adversarial patch sound in any speech context. Figure 4.2 depicts the attack scenario. The spectrogram corresponds to the benign command "close the window and curtains". The attacker then injects an adversarial perturbation that is sensed by the IVC device. The adversarial perturbation appears as an adversarial patch in the spectrogram scope, which deceives the ASR model into interpreting it as the target command ("open the door").

[Figure 4.2: SpecPatch perturbs an audio input with an adversarial spectrogram patch. Benign: "Close the window and curtains"; adversarial: "Open the door".]

Although the idea is promising, we still need to address the following four challenges.

• Limited impact length: It is challenging to encode long speech commands into a short-duration patch. An existing attack [119] demonstrated that a 500 ms perturbation can affect a single-word prediction; even with an increased perturbation length, it can at most impact 2-3 words.

• Bypassing user's corrections: Unlike the image classification task, which takes a single image as input and predicts a single label, the speech recognition model usually takes many frames as input and predicts the corresponding phonemes. While the later input frames are unaffected by the adversarial patch, the user's correction commands will undesirably be fully understood by the model. It is challenging to disregard the user's follow-up commands using only a slight modification of the benign speech.
• Universal to any speech context: Existing audio adversarial attacks [37, 117, 118, 136, 197, 204, 218] rely on the successful delivery of an integral AE constructed for a specific speech context, and hence are fragile to distortions (e.g., noise, user interference). To make SpecPatch robust in any speech context, an intuitive solution would be to train an adversarial patch on every speech content, but this is prohibitively expensive.

• Perturbation sync: To successfully launch our attack, the adversary is expected to play the perturbation at the right time to ensure the adversarial patch is placed in the correct location. However, in a real-world scenario, the timing of the perturbation is hard to control, which affects the attack success rate.

Contributions: In this chapter, we make the following contributions.

• New attack: We expose the deficiency of existing audio adversarial attacks in a human-in-the-loop scenario. To the best of our knowledge, SpecPatch is the first human-in-the-loop voice adversarial attack that is robust against user interference, user perception, and user interaction.

• New techniques: By exploring the internal mechanism of the CTC (Connectionist Temporal Classification) loss, we find the root causes that limit the impact length of an adversarial patch on speech tasks. Then, we reconstruct an optimization function to craft an adversarial patch with a longer impact length. Moreover, we propose Mute adversarial samples by analyzing the principle of sequential speech input. With the Mute samples, SpecPatch can cancel out the user's future interaction, making the attack more stealthy and dangerous.

• Comprehensive experiments: We conduct physical attack experiments in three different places (i.e., an indoor home, an outdoor street, and a public dining hall) against speech recognition models. We demonstrate the feasibility of launching our attack in a human-in-the-loop scenario and prove its stealthiness via two user studies. Our results show that SpecPatch can achieve a 100% attack success rate through both over-the-air and over-the-line attacks with an adversarial patch.

4.2: Background

4.2.1: Adversarial Patch
Compared to traditional adversarial attacks, the adversarial patch attack is more dangerous because the crafted patches can be used to attack any scene in the CV domain [24]. The attackers launch the attack by printing the crafted adversarial patches as stickers and placing the stickers on any benign objects to fool the ML models (e.g., object detection, object localization). To obtain the patch p̂, they use the Expectation over Transformation (EOT) framework [17] to optimize the following objective function:

    p̂ = argmax_p E_{x∼U, t∼T, l∼L} [ log Pr(ŷ | A(p, x, l, t)) ].   (4.1)

Given an image x ∈ R^{W×H×C} (W, H, and C are width, height, and channel), a patch p, a patch location l, and a patch transformation t, the function A is an operator that applies the patch to the benign image, and E represents the expectation in EOT. The optimization keeps updating p̂ to reach a high log-probability of predicting the patched image as the target ŷ. The construction of this objective function ensures the universality and robustness of the patch p̂, because it considers the expectation (E) over any background image (x ∼ U), any transformation (t ∼ T) of the patch (e.g., scaling, rotating, degrading), and any location (l ∼ L) of the patch placement.
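To make Eq. (4.1) concrete, the following is a minimal PyTorch sketch of this style of patch optimization over random backgrounds and random locations (the rotation and scaling transformations are omitted for brevity). The classifier, data loader, and hyperparameters are illustrative placeholders, not the setup of [24] or [17].

```python
import torch

def train_patch(model, loader, target_class, patch_size=50,
                steps=1000, lr=0.05, device="cpu"):
    """EOT-style patch optimization in the spirit of Eq. (4.1): maximize the
    target-class log-probability of patched images over random backgrounds
    and random patch locations. `model` is any frozen image classifier."""
    patch = torch.rand(3, patch_size, patch_size, device=device, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(it)           # assume loader yields (images, labels)
        except StopIteration:
            it = iter(loader)
            x, _ = next(it)
        x = x.to(device)
        # A(p, x, l): place the clamped patch at a random location l.
        h = torch.randint(0, x.shape[2] - patch_size + 1, (1,)).item()
        w = torch.randint(0, x.shape[3] - patch_size + 1, (1,)).item()
        x_adv = x.clone()
        x_adv[:, :, h:h + patch_size, w:w + patch_size] = patch.clamp(0, 1)
        # Maximize log Pr(target | patched image) = minimize its negative.
        log_probs = torch.log_softmax(model(x_adv), dim=1)
        loss = -log_probs[:, target_class].mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return patch.detach().clamp(0, 1)
```

The next subsections describe how SpecPatch adapts this idea from images to spectrograms.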
4.2.2: CTC in Speech Recognition
Unlike the image recognition task, in which the model is only required to produce one label, the speech recognition model is more complicated, as it needs to merge sequential letter predictions and produce a sentence. To train a speech recognition model with spectrograms and their transcriptions, one challenge is to align the transcription letters to the input frames. CTC [66] was proposed to resolve this problem. The idea of CTC computation can be summarized as follows: given a sequential model, it takes T frames of spectrograms as input and produces T probability arrays. For example, the probability array at frame t can be represented as Pr_t = [Pr_{t,a}, Pr_{t,b}, ..., Pr_{t,ϵ}], where Pr_{t,a} indicates the probability of predicting frame t as the character "a", and so forth. Let C be the available character set, which records the appearance probability of 28 characters (a-z, space, and ϵ). For the |C| × T probability matrix, CTC counts all paths (i.e., symbol sequences) that can be merged to match the target phrase with two rules: 1) remove all contiguous duplicated characters; 2) remove all ϵ tokens. For example, the path "hheelϵlo" will be decoded as "hello". After collecting all paths representing the target phrase, the probability of predicting the spectrogram as the target phrase can be computed by summing the probabilities of those paths. This process can be formulated as follows:

    Pr(Y | X) = Σ_{π ∈ π_{X,Y}} Π_{t=1}^{T} ( Pr_{t,a_t} | X ),   (4.2)

where Y is the target phrase, and X is the input spectrogram with T frames. π is a path that includes T characters, π = a_1 a_2 ... a_T, and π_{X,Y} refers to all the paths that can be reduced to Y. If Y is "hi" and T = 3, then π_{X,Y} includes "ϵhi, hϵi, hiϵ, hhi, hii". For every path belonging to π_{X,Y}, we compute the product of the probabilities of the consecutive characters that form π. Formally, let a_t be the t-th character in path π; then Pr_{t,a_t} represents the probability of the appearance of character a_t ∈ C at time t. The product represents the path appearance probability, and the sum operation yields the target phrase probability. To compute the loss, we use:

    L_CTC(X, Y) = −log Pr(Y | X),   (4.3)

i.e., given an input spectrogram X and its target phrase Y, the loss is the negative log-likelihood of Pr(Y | X).

4.2.3: Problem Formulation
The ASR system takes a waveform v ∈ [−1, 1]^N as raw input and produces its corresponding label Y ∈ A^m, where A is a set that contains all letters from a to z plus space, and m is the length of the transcription. When unpacking the ASR system, we use M(·) to denote the speech recognition model that empowers the ASR system. Instead of using the waveform as input, M(·) takes processed data (e.g., a spectrogram) as input because it is more representative and has fewer data samples. The size of the spectrogram depends on the duration of v, the STFT window length, the STFT hop length, and the number of FFT points. We use X ∈ R^{T×F} to denote the user's speech spectrogram, which includes F frequency bins and T frames, represented as follows:

    X(m, ω) = | Σ_{n=0}^{N} v[n] w[n − m] e^{−jωn} |,   (4.4)

where m is the frame index, ω is the frequency bin index, w represents the window function, and n denotes the sample index of the waveform. After taking the spectrogram frame by frame, the speech recognition model M(·) produces a probability matrix Pr as logits output, which is shaped as |C| × T. Then, based on the probability matrix, it computes the probability of every possible phrase with Eq. (4.2),
selects the phrase that has the highest CTC probability, and finally gives the transcription as Y = M(X).

The attacker's goal is to construct an audio perturbation δ such that, when it is combined with a waveform v, the ASR system produces a target transcription Ŷ. Unlike prior audio perturbations, SpecPatch is designed to target the most realistic scenarios (e.g., human-in-the-loop) by leveraging an adversarial patch. As such, the following issues need to be reconsidered.

Audio adversarial patch: While prior audio perturbations usually have the same duration as the benign waveform, the adversarial patch has a limited duration and frequency range. We denote our adversarial audio patch as p ∈ R^{T′×F′}, where T′ ≪ T and F′ ≪ F denote the small size of the adversarial patch compared to the user's speech spectrogram X.

Transcriptions: Instead of using a single-word label to tag the input, the speech recognition model generates a sentence as output. More specifically, the predicted sentence is the phrase that reaches the highest CTC probability, namely argmax_Y Pr(Y | X). In this case, the transcription Y can be decoded from any path π ∈ π_{X,Y}, and the length of Y is less than the number of frames T.

Universality: Most adversarial attacks assume that the attacker knows the user's speech and can deliver the perturbation synchronously at a specific time point. However, this assumption does not always hold in a real-world scenario. SpecPatch expects that the attacker can "place" the audio patch at any time and over any speech context. Let the function A(X, p, t, f) be the "place" operation that puts an adversarial patch p at the t-th spectrogram frame and f-th frequency index of any input spectrogram X. Then, our goal is to attain Ŷ = M(A(X, p, t, f)) for all X in human speech and for p placed anywhere on X.

4.2.4: Threat Model
SpecPatch introduces a novel adversarial patch attack in the audio domain. We circumvent all three common failure cases shown in Figure 4.1 by introducing the universal adversarial patch and the mute signal. The generated adversarial patch is imperceptible and inconspicuous due to the frequency and time constraints of the spectrogram patch, making SpecPatch a more dangerous and stealthy attack than existing ones.

Adversary's capability: Unlike prior work [26, 27, 31, 144] that requires the adversary to know the victim's benign commands in advance to calculate the corresponding audio perturbation, we assume the adversary has no access to the victim's benign audio and no knowledge of what the victim will say during the attack. We assume the adversary can place a hidden loudspeaker close to the target device to launch the attack. For the SpecPatch crafting process, we assume the attacker has prior knowledge of the target ASR model; for example, the architecture and model parameters can be obtained from public resources. This setting is widely used in most prior work [26, 27, 34, 117, 136] and can be generalized to a black-box scenario [32].

Attack scenarios: Unlike all previous studies, we focus on attacking the ASR system when the user is present. More specifically, the adversary crafts adversarial patches offline and then uses a preset loudspeaker to deliver the adversarial patch, thereby misleading the target ASR system into making a wrong prediction/transcription. For example, the adversary can send fake commands to the voice assistants and request them to perform the wrong operation.
Moreover, the adversary can fool the telephone voice system by injecting falsified personal information to trick ASR-based customer service; in addition, the adversary can deny the service provided by the target model by simply broadcasting the spectrogram patches. Due to the shortness and imperceptibility of SpecPatch, the attack can be launched in public spaces (e.g., malls, streets, cafes) with nearby loudspeakers (e.g., a smartphone or an in-ceiling speaker).

[Figure 4.3: SpecPatch workflow: generating the adversarial patch, muting user interaction, generating the universal patch, and generating the patch for the over-the-air scenario.]

4.3: Design Overview
Figure 4.3 illustrates the system flow of SpecPatch. First, we craft an adversarial patch to generate the malicious command, i.e., using a short patch to affect a longer benign spectrogram. Second, when the user makes a correction, we need to mute the user's correction by denying the user's follow-up commands. We achieve this with a specially designed signal called the "Mute" signal. Next, we make SpecPatch universal to any speech context. This step usually requires the adversarial perturbation to traverse all images/audios in a large dataset to validate the effect of the perturbation on all possible contexts. However, the infinite number of speech contexts makes it computationally infeasible to evaluate a universal perturbation. Rather than optimizing the adversarial patch across different speech contents, we design a phoneme-level context-free optimization method, which guarantees that SpecPatch can work across any user interference. The final step of our design is to enhance the robustness of SpecPatch in a real-world scenario. To achieve that, we take the transmission loss of a physical attack into account during the optimization of adversarial patches.

4.4: System Design
This section first analyzes why short perturbations cannot impact long inputs, and then describes our strategy to reach our attack goal, i.e., using short patches to attack long commands. After that, we describe the design of the Mute signal to deny the user's interference. Then, we introduce our phoneme-level universal patch crafting process. Finally, we present the techniques to robustify SpecPatch in an over-the-air scenario.

Formulation: Our goal is to craft an adversarial spectrogram patch p̂ ∈ R^{T′×F′} that alters any benign spectrogram X and translates it into the target phrase Ŷ. To achieve this goal, the following expectation needs to be optimized:

    p̂ = argmin_p E_{X∼U, t∼T, f∼F} L_CTC( A(X, p, t, f), Ŷ ).   (4.5)

Here, we compute the CTC loss of the patch p when it is applied anywhere (t ∼ T, f ∼ F) on the benign spectrogram X, based on which we derive the best adversarial patch p̂ that reaches the minimal expected loss.

4.4.1: Long Command Conversion
Adversarial Patch with CTC Loss
For most adversarial patch attacks in the image domain, the patch ensures very high confidence in the target class. Furthermore, recent studies [194, 195] show that the effectiveness of adversarial patches on deep neural networks (DNNs) stems from the large receptive fields of CNN layers. As the image classification model maps one image to one label, it connects multiple convolutional layers sequentially. The later convolutional layers have a larger receptive field and are likely to cover the adversarial patch. Therefore, even a small adversarial patch can be sensed by a later CNN layer and hence affect the global prediction of the image.
However, most speech recognition models [16, 84, 196, 222] use a recurrent structure, which usually takes multiple frames as input, produces phoneme predictions for every frame, and then connects the phoneme predictions to form the final sentence prediction. One critical challenge lies in applying an adversarial patch to the sequence model: since the adversarial patch can only affect a couple of input frames, the remaining output is barely altered. Therefore, it is hard to achieve the alteration into a long target sentence.

Suppose the adversarial patch p′ ∈ R^{T′×F′} overlaps with K input frames (where K is determined by the window size of the speech recognition model). For ease of explanation, we assume the patch is placed at the left corner of the benign spectrogram, which means t = 0 and f = 0.

[Figure 4.4: SpecPatch flowchart: input frames, computational nodes, and per-frame logits Pr_1, ..., Pr_K, ..., Pr_T.]

While the benign speech has T frames and T > T′, there is a limited number of output probabilities affected by the adversarial patch. Figure 4.4 demonstrates the workflow of SpecPatch when an adversarial patch is used to attack a sequence model. The bottom blocks show the input frames, while the middle nodes are the computational cells of M(·), usually implemented by LSTM or RNN cells. The top row represents every node's logits output (also known as the probability array over the 26 letters). We use red to mark the frames, nodes, and logits outputs directly affected by the patch p, and green to label the benign frames and nodes. To demonstrate the data forwarding process of the sequence-to-sequence model, we use red arrows to denote how frames affect the hidden state of the nodes and further alter the probabilities up to the K-th frame. It can be seen that Pr_1, Pr_2, ..., Pr_K are determined by frames 1 to K, the intermediate output of the previous/next nodes, and the hidden state of the current node. When crafting the adversarial patch, the parameters of the model M(·) are fixed, so we can only control the value of p′ to meet the target transcription. Let X′ denote the spectrogram after applying an adversarial patch p. Our goal is to optimize the following objective function:

    p̂ = argmin_p L_CTC(X′, Ŷ),   X′ = X + p.   (4.6)

Insight 1: The restricted length of the adversarial patch affects the convergence of the objective function.

[Figure 4.5: Demonstration of the patch's impact length.]

Observation 1: Optimizing the objective function above requires tuning p and Pr to match the target phrase. However, limited by the short length of the adversarial patch p, the later input frames are untouched during the optimization process, and therefore the values of Pr_{K+1} to Pr_T remain the same. This makes it hard for L_CTC to converge. To explain in more detail, we break down the probability equation into two parts:

    Pr(Ŷ | X′) = Σ_{π ∈ π_{X′,Ŷ}} [ Π_{t=1}^{K} (Pr_{t,a_t} | X′) · Π_{t=K+1}^{T} (Pr_{t,a_t} | X) ].   (4.7)

To minimize L_CTC(X′, Ŷ), we aim to maximize Pr(Ŷ | X′), as shown in Eq. (4.3). The probability can be separated into two parts in Eq. (4.7). The first term, Π_{t=1}^{K} (Pr_{t,a_t} | X′), denotes the probability that is directly affected by the adversarial patch, which is continuously fine-tuned by adapting the patch value. However, the second term, Π_{t=K+1}^{T} (Pr_{t,a_t} | X), takes the benign X as input, and hence the later probabilities remain low because they do not match the target letters a_t and have a low chance of being affected by the adversarial patch.
The root cause is the limited length K of the patch: the second term is barely affected for t ≥ K + 1. Therefore, when we compute the gradient of L_CTC(X′, Ŷ), we take the second term into account, but after we update the adversarial patch according to the gradient, the second product term remains almost unchanged. In short, no matter how we update X′, the gradient contribution of the second term stays the same, which misleads the optimization direction. In other words, we cannot achieve our attack goal if we use the global gradient to drive local changes.
Insight 2: The mismatch between the lengths of the target phrase and the benign phrase affects the CTC loss.
Observation 2: Besides the shape and value of the adversarial patch, the other critical factor that affects the CTC optimization process is the target phrase. Let us revisit Eq. (4.7): the probability Pr(Ŷ | X′) is determined by all the paths π ∈ π_{X,Y} that can be merged into the target phrase. When replacing the benign target Y with the target phrase Ŷ, the number of paths changes accordingly, which influences the computational cost of the CTC loss. For example, if the target phrase Ŷ has length l_{Ŷ} and the length of X is T, the total number of paths is
\binom{T + l_{\hat{Y}}}{T - l_{\hat{Y}}} = \frac{(T + l_{\hat{Y}})!}{(T - l_{\hat{Y}})!\,(2 l_{\hat{Y}})!}. \quad (4.8)
With a long input and a short target phrase, the number of paths grows rapidly. For example, when T is 15 and l_{Ŷ} is 5, the total number of paths is \binom{20}{10} = 184{,}756. Even though the loss can be computed efficiently with dynamic programming [66], the constrained patch length still results in redundant gradients.
Extend the adversarial patch impact: With the previous observations, we find that it is challenging to craft an adversarial patch that alters the recognition of a complete spectrogram. To address these challenges, we propose a novel method called partial matching. The basic idea of partial matching is to allow the target label to include a portion of the benign label, such that the optimization can focus on the tunable variables. Formally, instead of assigning Ŷ as the target when crafting the adversarial patch, we use Y_t, the concatenation of the target phrase and the trailing benign phrase: Y_t = Ŷ || Y_tail, where Y_tail is the trailing benign phrase. Figure 4.5 demonstrates the strategy of partial matching. Given the benign spectrogram and the adversarial patch as input, the attacker aims to mislead the transcription from "Close the window and curtains" to "open the door." At the bottom of Figure 4.5, the benign spectrogram spans from left to right, and inside it lies the adversarial patch (in red). When we feed the spectrogram to a model, it is divided into frames by a fixed window and a preset hop size. In the middle layer, we use three colors to denote the state of the nodes: red represents nodes that receive adversarial input; green depicts nodes that receive benign input but are immediately affected by the previous node's output; blue marks nodes that have a very low possibility of being impacted by the adversarial patch. Every node produces a probability array that records the probability of every letter, and the transcription is eventually generated by the decoding method (e.g., greedy decoding [83] or beam search decoding [66]). The top layer shows that the benign output is "Close the window ...".
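As a reference for the decoding step just mentioned, greedy CTC decoding takes the arg-max letter in every frame, collapses consecutive repeats, and drops blanks; the sketch below is a generic illustration with a placeholder alphabet, not the decoder of any specific model in this chapter.

```python
import numpy as np

ALPHABET = "_abcdefghijklmnopqrstuvwxyz '"   # index 0 ('_') is the CTC blank (placeholder alphabet)

def greedy_ctc_decode(logits):
    """logits: (T, C) array of per-frame scores; returns the collapsed transcription."""
    best = np.argmax(logits, axis=1)          # arg-max letter for every frame
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:          # collapse repeats, then drop blanks
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)

# e.g. a frame-wise arg-max sequence "_ccll_oo_ss_e" collapses to "close"
```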
The target phrase is a concatenation of Ŷ ("open the door") and the partial benign label ("window ..."). We use the red region to denote the part of the target that can be achieved directly by tuning the adversarial patch. The green region of the target phrase can be achieved by extending the impact of the adversarial patch via the internal links between nodes. The blue region is the benign output that ensures the optimization can converge despite the limited length of the adversarial patch.
Validation of partial matching: Next, we experimentally validate the effectiveness of partial matching. The goal is to convert the benign transcription "Close the window and curtains" to the malicious command "open the door" by applying an adversarial patch at the beginning of the audio. We follow the optimization function in Eq. (4.6) to craft an adversarial patch in two different scenarios, i.e., without and with partial matching. In the first scenario, we set Ŷ as "open the door". In the second scenario, we use "open the door window and curtains" as our target phrase, which contains a trailing (partial) benign command. Figure 4.6 shows the optimization results up to 500 epochs. At the very beginning, both cases start with the benign label at epoch 0. As the optimization proceeds, the first approach (i.e., without partial matching) only alters a single word ("open", in red) to match the target. The result remains the same after 400 epochs, which indicates that the optimization converges but does not achieve the attacker's goal (i.e., delivering the target command "open the door"). In comparison, the partial matching approach converges faster and meets the target phrase within 300 epochs. This experiment shows that, without modifying any parameters or the optimization scheme, partial matching improves the convergence speed in crafting an adversarial perturbation.
Figure 4.6: Comparison between SpecPatch with and without partial matching (transcriptions at each optimization epoch).
Next, we visualize the adversarial patch in Figure 4.7. For ease of explanation, we assume the adversarial patch starts at the beginning of the benign input and spans the full frequency range, i.e., the adversarial patch (the red portion) lasts 500 ms and has 8 kHz bandwidth. The benign label is shown in the top blue field and the concatenated target phrase is in the middle red field. We find that the length of the target command ("open the door") exceeds the range of the adversarial patch, which indicates that partial matching helps achieve the attack goal by extending the impact of the adversarial patch and outputting the target command.
4.4.2: Patches to Deny User Input
The proposed partial matching mechanism successfully extends the adversarial patch's impact beyond its own duration. However, we still face two challenges to fulfill our attack goal.
First, we have no knowledge of the benign phrases in advance, so it is impractical to adjust the optimization of the target phrase for every possible benign phrase. Second, the human factor (e.g., user interaction or long user commands) cannot be resolved, because the adversarial patch cannot affect speech transcriptions that are far away from the patch position. To overcome these challenges, we propose Mute Patches, which exploit the discrepancy between the ASR model's input and output mapping.
Figure 4.7: The effect of the patch on a long command.
Design mute patches: The design goal of mute patches is to disrupt the user's commands without attracting their attention. Specifically, we aim to inject a few adversarial samples with low volume to mislead the ASR model into producing empty transcriptions. To design such mute patches, we review the complete speech recognition process and find an opportunity to meet our design goal. As described in §4.2.3, the waveform serves as the raw input, which is converted into a spectrogram to be fed into the ASR model. Next, every node of the ASR model takes a few spectrogram frames and outputs a letter prediction. By reviewing the whole process, we realize that every node of the ASR model perceives a large scope of waveform samples. A similar phenomenon has been observed by prior studies [194, 195] on image recognition models, which conclude that the large receptive field of a neural node is responsible for the adversarial patch attack, because a small patch in an image can be perceived and misinterpreted by a neural node. Inspired by their findings, we are motivated to inject sampled adversarial audio signals into the neural nodes.
Figure 4.8: Optimization of the patch frequency based on the auditory properties of human hearing ((a) human hearing curve; (b) reflection of the hearing curve).
To craft the mute signals, we formulate the following problem:
p_m = \arg\min_{p}\ L_{CTC}(X', Y_b), \qquad X' = X + T p, \quad (4.9)
where Y_b is a phrase that only contains blank symbols, and p_m is the mute patch, which is composed of multiple patches, i.e., p_m = [p_1, p_2, ..., p_L]. For every patch with 1 ≤ l ≤ L, we have p_l ∈ R^{1×F′}. The size of each mute patch is 1 × F′ because a single adversarial sample can only affect one bin of the spectrogram. We set the length of the mute patch to T, and X′ is the resulting spectrogram. By optimizing Eq. (4.9), we can craft the mute signal in the time domain with a minimal loss value. The choice of T is determined by the hop length of the STFT and the input size of the ASR model. In practice, we can set the value of T to the STFT window length W_STFT, such that every vertical spectrogram bin contains adversarial information.
4.4.3: Imperceptible Patch
When reviewing the existing attacks, we find that most prior adversarial audio attacks (e.g., [27, 119, 136, 204]) aim to minimize the amplitude of the perturbation (i.e., minimize dB(δ)), e.g., by including the perturbation amplitude in the loss function. However, we find that although these perturbations are well optimized, they are still audible when performing physical attacks. In our attack scenario, we expect to launch an imperceptible attack even when the victim user is close to the adversary. Since this goal is hard to achieve with the optimization method alone (i.e., penalizing the amplitude of the perturbation), we design a new approach to satisfy the imperceptible attack goal.
In 93 a nutshell, the imperceptibility of SpecPatch is ensured by the short duration of the adversarial spectrogram patch and further secured by the narrow frequency band of SpecPatch. In the prior optimization settings, the crafted perturbations are audible because the victim microphone is sensitive to a certain input amplitude. Here, we focus on yielding the perturbation inaudible without dropping its amplitude. To achieve this goal, we investigate the human hearing sensitivity curve and find that the human ear has uneven sensitivity to different frequencies. We depict the hearing curve in Figure 4.8a. Formally, the hearing curve can be represented by a function with f, and we denote it as H(f). The source data is measured by prior auditory research on equal loudness contours [100]. In the figure, the blue line indicates the required amplitude for pure continuous tones at a specific frequency that can be heard by humans. Above the curve, we can feel the sound at such loudness, while below the curve, the sound intensity is insufficient. For example, one can hear continuous audio with frequency at 100 Hz as long as it has more than 20dBSPL. Once the volume is decreased to less than 20dBSPL, the human can no longer perceive it. From the shape of the curve, we find that the human auditory system is more sensitive to a frequency between 1.6kHz and 4kHz. In comparison, we are unperceptive to sound below 1.6kHz, as the lower frequency stimulates less attention from human ears. Therefore, we can design low-frequency patches (e.g., < 1.6KHz) to diminish the perceptual level of human hearing. To reach this goal, we add a frequency selective penalty term to the objective function in Eq. (4.9). The updated function is presented below: ˆp = arg min fl 90% success rate for 8 out of 10 patches. The success rate for longer commands that have 4 to 5 words is lower than those of shorter commands, this is reasonable since longer target commands are hard to achieve in a noisy background. 4.5.3: Over-The-Air Attack Attack scenario: Figure 4.15a depicts the attack scenario. The victim is using the speech-totext service while the adversary uses a smartphone to play SpecPatch to deceive the ASR model. Note that the adversary can play the attack audio at any time, and once he/she launches the attack, the victim’s commands will be denied by the consecutive mute signals. In our experiment, the adversary is 1 meter away from the victim, and SpecPatch is played at different volumes. We measure the loudness of the user’s interference and the SpecPatch audios by a decibel meter. We conduct the experiments in three places: an indoor room, an outdoor street, and a public dining hall. (a) Real-world attack scenario (b) User study results Figure 4.15: Over-the-air SpecPatch attack. Attack performance: In this experiment, we play a crafted patch of “open the door” 10 times for 105 (a) Indoor scenario (b) Outdoor scenario (c) Public dining hall scenario (d) User’s speech volume vs. patch loudness Figure 4.16: Attack success rate across four different scenarios. each volume, attempting to deliver this command to victim’s phone. The victim is holding their smartphone and speaking at the volume of 55dBSPL. The ambient noise levels of the three places are 43.5dBSPL, 52dBSPL, 55dBSPL for indoor room, outdoor street, and dining hall, respectively. We present the success rate of targeted attack and the mis-transcription attack in Figure 4.16. The grey dot line indicates the ambient noise level. 
As can be seen from the Figure 4.16a, when the perceived patch volume is lower than the ambient noise level, there are 8 out of 10 attempts failed in the targeted attack scenario. Once the victim device perceives a comparable power (e.g., 45dB) from the patch audio, the success rate increases to 40% for the targeted attack. When the volume is 10dB greater than the ambient noise, we can achieve 80% success rate, and 100% in denying the user’s input. We observe the similar results in Figure 4.16b and Figure 4.16c. These results indicate 106 that SpecPatch can successfully attack the ASR system with a limited power profile. Typically, SpecPatch achieves successful attacks when there is< 5dBSPL power difference between the patch and the ambient noise. If we raise the attack power, the success rate can be assured to 100%. To better understand the relationship between user’s volume and the loudness of patch, we conduct another experiment to control those two factors. We play the same patch 10 times at 7 volumes (from 40dBSPL to 70dBSPL with 5dB increments). For every volume, we use another speaker to play a benign audio with increasing volumes. This experiment is conducted in the same indoor place, and the result is present in Figure 4.16d. We find that when the patch has same volume of the benign audio, it achieves 100% success rate. If the patch is 20dB less than the benign audio, SpecPatch no longer works. In general, a louder patch can achieve a higher success rate. Noticeably, when both the patch and the benign audio have high power, the success rate reduces to 40%. 4.5.4: User Study To evaluate the stealthiness of SpecPatch in a real-world attack, we conduct two online user studies that involve ten volunteers to investigate the users’ perception level of SpecPatch. Study 1: In this study, the users are requested to hear four AEs that include the crafted patches. Then, we ask the volunteers about the contents they heard. The benign and adversarial transcriptions are described in Table 4.3. For the same benign sample, we add three different patches to achieve three goals (one of them is an empty transcription). The result shows that 10 out of 10 volunteers are deceived by our attack, as all of them consider the benign label as their heard content. Surprisingly, we can inject the patch to a silent benign audio, and this implies the possibility of a hidden attack. Similarly, none of the volunteer can perceive the hidden patch, as 10 out of 10 considered the malicious “turn on the wifi” patch as a silent audio. Study 2: To further validate SpecPatch, we conduct the second user study. The volunteers are asked to pretend to speak to their voice assistants while hearing the six different patches, these patches are played through their smartphones with a medium volume. The distance between the volunteer and the their smartphones is 0.5 meter. After that, they will answer questions to describe their comprehension of the heard patch. The options of perception levels include: Unnoticed, Noticed, 107 and Unrecognized. Unnoticed indicates that the volunteer cannot hear a patch; Noticed implies that the volunteer can hear a patch but regard it as a normal noise; and Unrecognized stipulates that they cannot understand the meaning of the heard sound. We report the experimental result in Figure 4.15b. The labels in x-axis represent different patches, namely, (M1 and M2 are two mute patches, P1-P3 are short patches, while L1 is a long patch that is composed of 3 short patches). 
It shows that most of participants (> 70%) cannot even notice our short patch attack (P1-P3). For the consecutive mute patches, there are around 50% of volunteers can perceive it. For the long patch, 9 out of 10 participants can clearly feel it. It is noteworthy that none of the patches can be understood by volunteers. Benign Adversarial Deceive ”turn on the lights” ”open the door” 10/10 ”turn on the lights” ”” 10/10 ”turn on the lights” ”open the window” 10/10 ”” ”turn on the wifi” 10/10 Table 4.3: User case study 4.6: Discussion Limitations: SpecPatch has the following limitations: 1) the attack is model dependent; 2) the attack could not successfully attack very long sentences; 3) the attack distance is relatively short. For the first limitation, this attack can only attack the recurrent neural network, since our attack is established by exploiting the vulnerability of inter-connection between each cells. The second limitation can be addressed by introducing a longer patch, however, it might raise the alert of the victims. The third limitation can be possibly addressed by amplifying the power of the patch, but the adversary needs to handle both the distortion from the amplifier and the long-distance induced attenuation. Defense: Prior studies [119,202,204] reveal that signal processing techniques can defend the adversarial audio attack since the adversarial perturbations are delicately crafted and hence are deemed fragile. The signal processing techniques, however, can reduce the fidelity of perturbations and 108 protect the ASR models. Typical signal processing defense methods include 1) Down sampling (DS): decreasing the sampling rate of AEs to degrade the quality of AEs [119, 202, 204]; 2) Quantization: this approach rounds the 16-bit precise value to its nearest integer multiple of constant Q, which has been adopted to defend against the attacks [119, 202]. 3) Low pass filtering (LFP): this defense can use a Butterworth low-pass filter with different cutoff frequencies to remove the highfrequency components of the perturbations [119]. We will evaluate different defense approaches against SpecPatch in our future work. 4.7: Summary In this chapter, we proposed SpecPatch, a human-in-the-loop adversarial patch attack on ASR systems. SpecPatch considers the scenarios when the users are presenting or intentionally disrupting the adversarial audio attacks against ASR systems. SpecPatch optimizes the adversarial patch to increase the length of the target commands. SpecPatch also includes Mute adversarial samples that can ensure the user interference does not affect the adversarial perturbation. Moreover, we further enhance SpecPatch to make it imperceptible and robust in both over-the-line and over-the-air attack scenarios. Our extensive real-world experiments show that SpecPatch can unnoticeably deliver malicious commands in a noisy environment amid user interference. Although SpecPatch is dangerous to conventional ASR systems, it turns out this attack highly relies on the open source of the ASR model to craft the malicious sound. In the next chapter, we will introduce a new black-box attack, which targets the more challenging task: attack the commercial speech-to-text services and voice assistants with zero knowledge. 109 CHAPTER 5: BLACK-BOX, QUERY-EFFICIENT AUDIO ADVERSARIAL ATTACK VIA SPLIT-SECOND PHONEME INJECTION5 . 
5.1: Introduction In the previous chapter, we introduced a SpecPatch attack in which the attacker can manipulate the transcription of the Automatic Speech Recognition (ASR) system by injecting an adversarial audio patch. Although the aforementioned attack is powerful and dangerous, it is not capable of attacking the commercial Speech-to-text services (Amazon Transcribe [15], Google Cloud Speechto- Text [64], IBM Watson Speech to Text [97], and Microsoft Azure Speech Service [125]) and Intelligent Voice Control (IVC) devices (Google Home [62], Amazon Echo [13]) due to lack of their model information. In this chapter, we discover the possibility of attacking commercial speech recognition services in a black-box manner, where the attacker aims to craft the audio adversarial example in limited time and resources, without any knowledge of the target speech recognition models. With the increasing presence of ASR systems and IVC devices in private spaces, users begin to worry about the security and privacy of these systems. For example, a hacked device is now capable of recording private conversations; collecting and sharing private data; and controlling all the connected IoT devices in smart homes [37, 158]. Researchers have demonstrated that ASR systems could become vulnerable to a wide variety of attacks. For instance, inaudible commands 5This chapter is based on previously published work by Hanqing Guo, Guangjing Wang, Yuanda Wang, Bocheng Chen, Yuanda Wang, Qiben Yan and Li Xiao titled “PhantomSound: Black-Box, Query-Efficient Audio Adversarial Attack via Split-Second Phoneme Injection” published at the Proceedings of the 2023 International Symposium on Research in Attacks, Intrusions and Defenses (RAID). DOI: 10.1145/3607199.3607240 [71] 110 can be injected through ultrasound [141, 210], even across different transmission media, such as object surface [200], light [158], etc. Besides the physical attacks, recent studies also utilize the discrepancies between the human ear and feature extraction algorithms to launch signal processing attacks [4, 5]. Despite the aggravating threats, these new attacks could be defeated by integrating additional hardware [209] or extra signal processing procedures (e.g., voice activity detection, guard signals) [4, 90]. Unlike the aforementioned attacks, the adversarial attack aims to attack the deep neural networks (DNN), i.e., the computational core of an ASR system, which poses a major threat to modern ASR systems. Adversarial attack: Adversarial attack was first proposed to attack image recognition systems [60, 160]. The attack operates by imposing unnoticeable perturbations onto the original image, thereby misleading the DNN to yield false classification. The inputs that enable such an attack are commonly referred to as Adversarial Examples (AEs), which are composed of the original input with an unnoticeable perturbation. The ASR system with DNN models also inherits the susceptibility towards AEs. Prior studies: Prior studies [12, 27, 44] demonstrate that attackers can generate adversarial audios to alter the DNN’s prediction result with or without the prior knowledge of the DNN model. However, most of these attacks have not been successfully realized against real-world commercial devices, and their stealthiness is unverified. Recently, Chen et al. [31] successfully attack both open-source and commercial speaker verification systems over the air in a grey-box setting. Yuan et al. 
[204] embed their generated AE within songs to launch the attack, and they further adapt their attack in a black-box setting to subvert the ASR of most IVC devices [37]. Nevertheless, they fail to guarantee the attack success rate in the presence of user interference; and cannot promise to craft AEs quickly due to the training overhead of the substitution model. Meanwhile, two recent studies [72, 119] inventively propose the sub-second perturbation and spectrogram patch perturbation to attack open-source ASR systems, considering the victim user present during the attack. Even though they demonstrate the robustness and feasibility of their attack in the presence of environmental distortions, the proposed attacks are established on the assumption of complete knowledge 111 Hidden Speaker Turn off the Lights Ok, Lights On Perturbation Smart Speaker Figure 5.1: Attack scenario of PhantomSound of the target ASR system. More recently, Zheng et al. propose a decision-based black-box attack by incorporating evolutionary algorithms to generate adversarial audios [218]. However, they still require to query the victim model extensively, which incurs substantial time and financial costs in a practical attack scenario. Table 5.1 summarizes the existing adversarial attacks in terms of victim systems’ tasks, attacker knowledge, ability to attack quickly, and attack scenario. The check mark symbolizes a successful attack under the given scenario, while the cross mark implies that the attack could not function or lacks efficacy in that particular scenario. For the victim system’s task, SV indicates the speaker verification task while SR refers to the speech recognition task. We then taxonomize attacker knowledge into white-box, grey-box, and black-box, where grey-box implies the attacker can get the logits layer output [12, 31] or confidence score of all possible classes, and black-box indicates the attacker can only access the prediction label [37] of the target model. A white-box attacker, on the other hand, has complete knowledge (model architecture, weights of DNN parameters) of the target system. Next, we use online AE generation (Online GENR) to characterize whether the attacker can generate AEs or perturbations swiftly and complete the attack procedure in an online fashion. In fact, most existing studies assume the attacker has sufficient time to produce AEs offline. The last two metrics, Over Air and User Interference (User INT) suggest the attack scenario, where 112 Table 5.1: Comparison with other recent audio attacks. Attacks SV SR Grey Box Black Box Online GENR Over Air User INT Houdini [44] SR 3 7 7 7 7 C&W [27] SR 7 7 7 7 7 Adversarial [12] SR 3 7 7 7 7 Fakebob [31] SV 3 7 7 3 7 Comm. [204] SR 7 7 7 3 7 Devil’s [37] SR 3 3 7 3 7 AdvPulse [119] SR 7 7 7 3 3 OCCAM [218] SR 3 3 7 3 7 SpecPatch [72] SR 7 7 7 3 3 PhantomSound SR 3 3 3 3 3 the former indicates an over-the-air attack, while the latter indicates whether the attack considers the user’s interference (e.g., voice commands) during the attacks. To the best of our knowledge, no existing attacks can attack commercial, closed-source ASR systems over-the-air with a limited time budget and user interference. PhantomSound: We propose a query-efficient black-box attack on commercial closed-source ASR systems and IVC devices. Our attack, called PhantomSound, can craft AEs and perturbations within a limited time budget and restricted query cost. 
Different from the previous work, the key idea behind PhantomSound is to regard the users’ voice input as the command “carrier”, while the phoneme-level perturbations are applied on the “carrier” to instantiate the attack. Figure 5.1 depicts the attack scenario. First, the adversary records the user’s command (any keywords such as “open”, “on”, “down”). Next, the adversary uses PhantomSound to query the accessible target models on the target IVC devices (e.g., the Google Cloud Speech-to-Text API for Google Home). Then, PhantomSound returns a perturbation that alters the prediction of the user’s command. During the attack, the adversary plays the perturbation via a hidden speaker at the same time when the user utters a voice command, which fools the smart speaker to operate improperly. Challenges: Four major challenges arise during the design of PhantomSound. 113 • Black-box attack: It is difficult to attack a model without any prior knowledge. Existing greybox/ black-box attacks either assume attackers have the probability score of the target model [12, 44], or train a substitution model to approach the target model [37]. The existing attacks require a substantial amount of time to train a substitution model for the generation of AEs. • Speech model: Different from black-box attacks on image processing [32, 38], ASR systems are known to have a more complicated model structure consisting of signal processing, filtering, acoustic model, and language model. As a result, attacking speech models requires different attack strategies to bypass the various components of the ASR models. • Query efficiency: A successful black-box attack relies excessively on the effectiveness of queries. The adversary needs to iteratively update the AEs such that the effectiveness of the crafted AEs can be justified through querying. However, querying commercial ASR APIs is costly (e.g., $0.00001/second for Google Cloud Speech-to-Text) and unable to bypass. Despite some efforts [32, 38] to reduce the number of queries, it still falls short of meeting the requirements for online generation of AEs. • Perturbation sync: To successfully launch our attack, the adversary is expected to play the perturbation when he/she hears the victim’s voice command. However, in a real-world scenario, the timing of perturbation is hard to control. Therefore, we need to tackle this problem by generating a near-synchronization-free perturbation [119]. Contributions: The contributions of this work are highlighted as follows. • New attack: To the best of our knowledge, we are the first to achieve query-efficient black-box attacks on commercial ASR systems as well as IVC devices. We demonstrate the dangers of our attack over-the-air on 4 different commercial ASR APIs (i.e., Google Cloud Speech-to-Text, IBM Watson Speech to Text, Amazon Transcribe, and Microsoft Azure Speech Service) and 5 different IVC devices (i.e., iPhone with Google Assistant, Google Home, Microsoft device, Amazon Alexa, and IBM Wav-Air-API). • New finding: We discover and formulate the unique boundary of commercial ASR systems for producing AEs. This non-contiguous decision boundary hinders previously successful attempts. 114 • New techniques: We propose PhantomSound, a phoneme-level searching method for efficiently crafting AEs to launch adversarial perturbation attack with the least number of required queries in comparison with other methods. 
5.2: Background and Preliminary Study In this section, we present the threat model of PhantomSound, as well as the assumptions and attack scenarios. Then, we introduce the fundamentals behind the adversarial attack and present the decision scheme of commercial ASR systems. 5.2.1: Threat Model The adversary’s goal is to mislead the IVC devices or VCS systems by injecting malicious commands. Prior to our work, there are two types of attacks that can achieve the same goal. The first attack [37] uses reverse-engineering models to imitate the commercial models and craft the offline AE in a white-box manner. The second attack [188] uses generative models to synthesize the victim’s speech. However, the reverse-engineering attack necessitates a high volume of queries (as per Table 5.10) to construct the substitute model. It also demands updating the model in response to changes in the commercial API. This renders it expensive and inadequate in meeting the need for a real-time attack. Regarding the generative model driven synthesized attack, we assume the adversary has access to sufficient recordings of the victim for training purposes. However, in our specific situation, the attacker is expected to initiate the attack upon their first encounter with the victim. Furthermore, playing the synthesized speech outright is not a viable approach as the victim can hear it and potentially halt the attack. Adversary’s capability: We assume that the adversaries can place a hidden microphone to record the victim’s voice. We assume that an adversary knows the targeted IVC devices and has access to their respective ASR API services (e.g., Google Cloud Speech-to-Text for Google Home or Google Assistant). Following other related studies [12, 37, 119, 204, 218], we also assume that the adversary is able to launch this attack via a hidden speaker or a compromised speaker in the victim’s workspace/home. 115 Attack scenarios: The adversary will first collect the victim’s voice commands, and then generate the AEs and perturbations swiftly only based on the transcription result of the target devices. Once the perturbations are crafted, the adversary can wait for the victim’s next command and play the perturbation manually or automatically via existing keyword searching or voice detection mechanisms [11, 162]. Alternatively, the adversary may also play the perturbation repeatedly through hacked speakers, attempting to fool the target IVC devices when the corresponding target voice command was delivered. In a real-world attack scenario, e.g., in a public space, an attacker may not have access to a large collection of victims’ voices and may not have sufficient time to generate the perturbation offline. In this case, the attacker only has a very limited time window to subvert the victims’ commands towards voice assistants. To successfully instantiate such an opportunistic voice attack, an attack approach with a timely and low complexity AE generation is highly desired. User interference: Most existing attacks assume that the users will not perceive the AEs and will not interact with their voice assistants during the attack. However, when the users are speaking during the attack, most existing voice attacks will fail. In this research, we leverage the users’ voice command as a carrier for the adversarial audio to launch the attacks more effectively and stealthily. 
Moreover, as advanced liveness detection algorithms [7, 118] have been used to differentiate between loudspeakers and humans with high accuracy, most existing audio attacks launched by loudspeakers can be easily detected. In our attack, however, since the human voice and the perturbation arrive at the same time, the liveness detection module of the voice assistant can be effectively bypassed.
5.2.2: Adversarial Attack
An adversarial attack aims to craft an AE x_0 + δ in order to deceive the model f(·) into making a false prediction [160]. Let y_pred denote the model output: if y_pred = f(x_0 + δ) ≠ y (where y indicates the true label of input x_0), we say the attacker has launched an untargeted attack. If the perturbation is crafted intentionally for a specific target (denoted y_t), the attack, formalized as f(x_0 + δ) = y_t ≠ y, is regarded as a targeted attack. The generation of an AE can be formulated as an optimization problem as follows:
\min_{\delta}\ L(x_0 + \delta) := D\big(f(x_0 + \delta),\, y_t\big). \quad (5.1)
The goal of Eq. (5.1) is to minimize L(x_0 + δ) under the constraint ||δ||_2 < ϵ, where L(·) denotes the loss function, which uses a distance function D(·) to measure the disparity between f(x_0 + δ) and y_t, || · ||_2 is the L2 norm, and ϵ controls the amplitude of the perturbation. There are three main types of attacks depending on the prior knowledge of the victim model, listed as follows:
White-box: If the adversaries know the architecture and the parameters of the model, they can obtain the gradient of the loss function ∇L(x) during the forward or backward pass. The perturbation can be subsequently estimated using the inverse gradient [60].
Grey-box: The model conceals its architecture and parameters from the public and only exposes the prediction scores P = [p_0, p_1, ..., p_n] for a given input. The adversaries can formulate a loss function [26] D(P, P_y) (where P_y is the one-hot encoding of y) and track the change in distance when tuning δ over multiple attempts. The changes in L(x) are utilized to estimate the gradient, which guides the attacker in updating δ. Gradient estimation algorithms include the Natural Evolution Strategy (NES) [99] and Zeroth Order Optimization (ZOO) [33].
Black-box: Compared to white-box and grey-box attacks, the black-box attack is the most challenging, as the attacker only has access to the prediction label of the model. In fact, most commercial ASR systems and IVC devices are closed-source and only offer a final prediction. To successfully attack a black-box model, existing work either trains a surrogate model and transforms the problem into a white-box attack [135], or uses a significant number of queries to search the decision boundary of the victim model [23, 32, 38]. Here, we focus on the query-based boundary-searching attack due to its flexibility and attack efficiency.
Figure 5.2: Observations of CV and ASR systems. (a) A mixed image of a cat and a dog is recognized by the Google Cloud Vision API [65] as 89% cat and 11% dog; (b) searching the decision boundary in a black-box CV attack; (c) a mixed audio of "stop" and "backward" is rejected by the Google Speech-to-Text API with no output; (d) the decision boundary of every class is non-contiguous in an ASR system, so inputs in between are rejected due to ambiguity.
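For reference, the score-based gradient estimation behind grey-box attacks such as NES [99] can be sketched in a few lines. The score_loss callback is a placeholder that would wrap the victim model's confidence scores into the distance D(P, P_y); the sampling parameters are illustrative.

```python
import numpy as np

def nes_gradient(x, score_loss, sigma=0.001, n_samples=50, rng=np.random.default_rng(0)):
    """NES-style zeroth-order estimate of the gradient of a scalar loss L(x),
    using antithetic Gaussian sampling; score_loss(x) must return L(x) as a float."""
    grad = np.zeros_like(x)
    for _ in range(n_samples // 2):
        u = rng.standard_normal(x.shape)
        grad += (score_loss(x + sigma * u) - score_loss(x - sigma * u)) * u
    return grad / (sigma * n_samples)

# A grey-box attacker would then step along -nes_gradient(...) to shrink D(f(x0+delta), y_t).
```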
5.2.3: Black-box Audio Adversarial Attack
Compared with black-box adversarial attacks in other domains, the black-box audio adversarial attack has several unique features. In this section, we conduct a preliminary study to quantify the behaviors of commercial ASR services.
Decision-based attack: Used for classification, a decision boundary is a hypersurface that partitions the sample space into several classes. Specifically, a well-trained DNN model uses the decision boundary to classify incoming inputs. The main goal of existing black-box attacks [23, 32, 38], or so-called decision-based attacks, is to find the decision boundary of the target model. Generally, to approach the precise decision boundary, they gradually perturb the input based on the query feedback, to find an AE on the verge of the decision boundary. However, one assumption made by existing decision-based attacks is that the DNN classification model is guaranteed to return a prediction y_pred for any input x. As shown in Figure 5.2a, we merge a cat and a dog into one image and feed it into the Google Cloud Vision API [65]. The classifier labels the image as a cat with very high confidence (89%), while the human brain perceives it differently. As shown in Figure 5.2b, the decision-based adversary [32] starts from a dog (x_0) and gradually adds a proportion of a cat (δ) to approach the boundary. The curves between classes in Figure 5.2b indicate the decision boundaries, where δ ∈ [0, 255]^{H×W} denotes the perturbed image with the same shape as x_0. The contiguous decision boundary allows the DNN model to always output a result, although the result becomes unreliable as it approaches the decision boundary.
Decision boundary of ASR: At first sight, it appears that ASR systems would inherit the DNN's susceptibility to decision-based adversarial attacks. However, the unique characteristics of voice systems and DNN models make traditional decision-based attacks hard to succeed. Here, we conduct a preliminary experiment in which we mix the two voice commands "stop" and "backward" together (Figure 5.2c) to imitate the mixture of the cat and dog images. We then submit the mixed audio to the Google Speech-to-Text API, which rejects it without any return. The failed attempts indicate that the decision boundary of the ASR system is non-contiguous. As shown in Figure 5.2d, every voice command is surrounded by an exclusive boundary, and audios outside of the boundary ranges are rejected by the ASR system. This phenomenon implies that perturbed voice queries may fail to solicit valid feedback from the ASR system. Without feedback, it is difficult to determine the direction of the perturbation for approaching a target decision boundary. Based on this observation, we are motivated to design a new boundary-searching method to enable decision-based black-box attacks toward ASR systems.
Figure 5.3: Phoneme-guided query. Starting from "stop", intermediate sub-targets such as "stopwhat", "stalk what", and "back what" connect to the target "backward" through a continuous decision boundary of adjacent classes.
5.3: Attack Design
In this section, we present the system design of PhantomSound. We first introduce the phoneme-level boundary searching method to minimize the possibility of rejection by the ASR systems. Then, we formalize the attack as an optimization problem and illustrate the generation of AEs. Finally, to enhance the robustness of PhantomSound in real-world scenarios, we propose the weak synchronization scheme and over-the-air speech enhancement.
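As a concrete reference for the preliminary study above, the mixed-command probe behind Figure 5.2c-d can be reproduced with a short script; the transcribe callback is a placeholder for a hard-label query to a commercial speech-to-text API, and the file paths and mixing ratios are illustrative.

```python
import numpy as np
import soundfile as sf

def load_mono(path, sr=16000):
    audio, rate = sf.read(path, dtype="float32")
    assert rate == sr, "resample offline to 16 kHz first"
    return audio

def probe_boundary(path_a, path_b, transcribe, ratios=np.linspace(0.0, 1.0, 11)):
    """Mix two commands at different ratios and record what the ASR returns.
    `transcribe(audio)` is a placeholder for a commercial speech-to-text query that
    returns the hard-label transcript, or an empty string when the input is rejected."""
    a, b = load_mono(path_a), load_mono(path_b)
    n = min(len(a), len(b))
    results = []
    for r in ratios:
        mixed = (1.0 - r) * a[:n] + r * b[:n]
        mixed = mixed / max(1e-9, np.abs(mixed).max())   # normalize to avoid clipping
        results.append((float(r), transcribe(mixed)))
    return results

# Empty transcripts in the middle of the ratio sweep are what reveal the
# non-contiguous decision regions sketched in Figure 5.2d.
```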
5.3.1: Phoneme-level Boundary Searching
Figure 5.2d shows the challenge of boundary searching when producing a proper AE. If the adversary randomly adds noise to "stop", the ASR keeps the "stop" decision while the noise is low and rejects the input as the noise power rises. However, if the adversary directly overlays the target "backward" onto the benign audio, the result (the red star in Figure 5.2d) lies in the middle between two decision boundaries and hence produces no output. Therefore, the rejection of queries can be attributed to two factors: 1) the added random noise elevates the command's noise level; 2) the boundary distance between two valid commands is too long to allow for an unnoticeable perturbation. To resolve these two problems, we raise a novel idea: if we break the target "backward" into small pieces, we can craft AEs toward sub-targets that directly connect to the benign decision boundary through small changes, and finally craft the final AE that reaches the target.
Figure 5.3 depicts our attack design. Specifically, instead of directly adding "backward" on top of "stop", we break the target "backward" into a series of phonemes. While crafting the AE, we randomly add a phoneme to the benign audio and check the prediction. If the ASR produces a word that is closer to our target, we keep the phoneme on the benign audio and search for a closer prediction in the next round. In our case, "stop" with the perturbation phoneme δ1 is recognized as "stopwhat", which then changes to "stalk what" and "back what", and finally reaches the target "backward". At every step, the AE reaches a sub-target that is adjacent to the current decision boundary, and gradually the final perturbation is crafted by summing up all the small changes. The basic idea of the proposed phoneme-level searching method is to perturb the original command along the direction of the target command while minimizing the distance between the original and the target.
Algorithm 1 presents the initialization procedure for generating the phoneme-level adversarial perturbation. Specifically, we first set the counter s = 0 and the initial distance between the benign and target labels to ϵ = CER(f(x_0), y_t). Next, we construct a phoneme set D = {ph_1, ph_2, ..., ph_n} by breaking up the target command, and then generate a random noise v ∈ [0, 0.1]^l in line 4, where l is the length of the original input x_0. Together with v, a phoneme from D is randomly picked and injected at its corresponding position in x_0 in lines 5-6 to generate an AE x*. The purpose of v is to increase the variance of the phoneme. For the targeted attack, if x* has a smaller distance to the target (line 7), we put the perturbation into the initial perturbation set P̃, then update ϵ and x_0. For an untargeted attack, we can replace line 7 with "if f(x*) != y" to ensure the ASR gives an incorrect prediction. The searching loop continues until it reaches a sufficient number of rounds K.
Algorithm 1: Phoneme-level Adversarial Perturbation Initialization
Input: the original audio x_0, the target label y_t, the phoneme clip samples D = {ph_1, ph_2, ..., ph_n}, the initial character error rate (CER) ϵ, and the API service of the black-box ASR system f(·).
Result: the initial perturbation set P̃
1: s = 0
2: ϵ = CER(f(x_0), y_t)
3: while s < K do
4:   v = random[0, 0.1]^l
5:   δ = v + rand(D)
6:   x* = x_0 + δ
7:   if CER(f(x*), y_t) < ϵ then
8:     put δ into P̃
9:     ϵ = CER(f(x*), y_t)
10:    x_0 = x_0 + δ
11:  else
12:    s = s + 1
13:  end if
14: end while
15: return P̃
5.3.2: Perturbation Optimization
Even though Algorithm 1 generates proper perturbations for any voice command, the amplitude of the perturbation may become overwhelming. Revisiting Eq. (5.1), to acquire a minimal perturbation, we need to gradually suppress the perturbation power. However, due to the black-box setting, the gradient is inaccessible. As a result, we use Sign-Opt [38] to estimate the gradient, since Sign-Opt achieves superior performance with the least number of queries:
\nabla L(x) \approx \sum_{q=1}^{Q} \operatorname{sign}\big(L(x + \sigma\mu_q) - L(x)\big)\,\mu_q, \quad (5.2)
\operatorname{sign}\big(L(x + \sigma\mu) - L(x)\big) = \begin{cases} +1, & f(x + \sigma\mu) \neq y_t \\ -1, & f(x + \sigma\mu) = y_t \end{cases} \quad (5.3)
where x is the general representation of x_0 + δ, q and Q denote the noise index and the total number of noise vectors, respectively, σ is the search variance, and μ is the noise. The key idea of Sign-Opt is to search the gradient space using the natural evolution strategy. Since L(x) is unknown, Sign-Opt queries f(·) as in Eq. (5.3). The feedback of the target model is collected to measure the number of wrong predictions, and the result guides Eq. (5.2) in estimating the gradient of L(x).
Figure 5.4: Adversarial perturbation generation (initial searching, proposed initial perturbations, and the query-driven fine-tuning process that produces the fine-tuned perturbations).
Query-efficient fine-tuning: Perturbation generation typically requires ∼5k queries to craft an AE [32, 38]. To further reduce the query cost, we design a query-efficient AE generation scheme. By carefully examining Eq. (5.2), we realize that the gradient estimation step depletes most of the queries. Suppose Q = 50; then 50 queries are used to collect the results of f(·) and estimate the gradient according to Eq. (5.3). However, Sign-Opt [38] uses the estimated gradient only once to update x, with a small learning rate, so most of the gradient computation is wasted. In our design, we estimate the gradient once, then apply the estimated gradient multiple times to update δ until it no longer satisfies our attack goal, and only then re-estimate the gradient. The workflow of our query-efficient phoneme-level adversarial perturbation generation is shown in Figure 5.4. There are three major steps to generate AEs and perturbations: searching, proposing, and fine-tuning. In the searching and proposing phases, unlike the prior study [38], which only searches with random noise and keeps the shortest initial perturbation while discarding the others, we reserve all perturbation candidates to increase the generation speed. In the fine-tuning phase, we optimize all the proposed perturbations through gradient estimation. Note that there are three paths from the Query block: path (2) updates the perturbation consecutively until it cannot be optimized further; then we re-calculate the gradient (path (1)); once the power of a perturbation is lower than ϵ, we add it to the perturbation set P (path (3)).
Figure 5.5: Perturbation mismatch during an attack.
5.3.3: Weak Synchronization Design
Considering that the adversary needs reaction time to play the perturbation, the generated perturbations are required to be robust against mismatches of the insertion position. To realize such an attack, we seek to minimize the average loss instead of the instant loss.
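Before turning to the synchronization design, the decision-only gradient estimate of Eqs. (5.2)-(5.3) and the gradient-reuse fine-tuning described above can be sketched as follows. This is a simplified sketch: asr is a placeholder hard-label query, the step sizes are illustrative, and the full method additionally tracks the perturbation power against the threshold ϵ.

```python
import numpy as np

def sign_opt_gradient(x, y_t, asr, sigma=0.02, Q=30, rng=np.random.default_rng(0)):
    """Decision-only estimate of Eq. (5.2): each of the Q queries reveals only the sign
    from Eq. (5.3), i.e., whether the hard label f(x + sigma*mu) differs from the target y_t."""
    grad = np.zeros_like(x)
    for _ in range(Q):
        mu = rng.standard_normal(x.shape).astype(x.dtype)
        sign = 1.0 if asr(x + sigma * mu) != y_t else -1.0
        grad += sign * mu
    return grad

def fine_tune(x0, delta, y_t, asr, lr=0.01, rounds=20, max_reuse=50):
    """Query-efficient fine-tuning: reuse one gradient estimate for many update steps on
    delta, and only re-estimate once an update breaks the attack goal f(x0 + delta) == y_t."""
    for _ in range(rounds):
        grad = sign_opt_gradient(x0 + delta, y_t, asr)
        grad = grad / (np.linalg.norm(grad) + 1e-12)
        for _ in range(max_reuse):
            candidate = delta - lr * grad
            if asr(x0 + candidate) != y_t:     # attack broken: stop reusing this direction
                break
            delta = candidate                  # still transcribed as y_t: keep the update
    return delta
```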
For the weak synchronization design, we take the impact of the mismatch into consideration and expect the comprehensive loss to be minimized. Mathematically, the average loss can be expressed as follows:
L(x) = \frac{1}{N}\sum_{i=1}^{N} L_i(x), \quad (5.4)
L_i(x) = L(x + c\tau), \quad (5.5)
where τ represents the mismatch interval, c controls the length of a mismatch period, i indexes the related losses, and N is the number of involved losses. To minimize the average loss, we can refer to Eq. (5.2) and Eq. (5.3) to estimate ∇L(x) by computing ∇L(x + cτ). The drawback of the average-loss gradient estimation is that it costs N× more queries. The length of the phonemes in D varies from 50 ms to 300 ms, and one-word durations range from 281 ms to 387 ms according to [168]. We expect the phoneme-level perturbation to fit within the duration of one word; otherwise, it is difficult to maintain a minimal L, especially when a delayed perturbation arrives. In this chapter, we set N = 4 and τ = 100 ms. Figure 5.5 depicts the perturbation mismatch scenario: when crafting the first (red) perturbation, we gather the other losses using the same perturbation but with different time delays. In the figure, L_1, L_2, L_3 correspond to c = 0, c = 1, and c = 2.
5.3.4: Over-the-Air Attack Robustness
Besides the weak synchronization feature, attack robustness is another important property of PhantomSound. Existing work models the acoustic signal propagation to compensate for the propagation loss over the air [144], but the heavy computation prevents such methods from being adopted in a real-time attack. Also, the quality of the perturbation relies on the speaker's amplifier, and the additional distortion on such a small perturbation is hard to model. Prior work [119] configures a frequency filter to guarantee that the generated perturbation stays within the 50-8,000 Hz range. To guarantee the effectiveness of PhantomSound over the air, we follow this approach and configure a frequency filter to mitigate the uneven frequency response caused by the hardware imperfections of the speaker, thereby enhancing the attack robustness.
5.4: Evaluation
In this section, we first introduce our benchmark experimental setting for generating AEs and perturbations. Then, we evaluate PhantomSound thoroughly to validate its feasibility and robustness. Moreover, we measure the impact of different parameters on tuning a successful attack. Our attack is successfully launched on four different ASR service APIs and five popular commercial IVC devices. We further conduct a user case study in Section 5.4.8. This section describes the results in detail.
5.4.1: Target Model Selection
Since we are developing a general approach to generate perturbations that attack closed-source ASR systems and commercial devices, we examine the effectiveness of AEs and perturbations on the most popular IVC devices available on the market. Specifically, we select Google Home (G-H), Google Assistant (G-A), Amazon Echo, Microsoft Cortana, and IBM WAA6 as target IVC devices. Moreover, we target their respective ASR APIs, namely the Google Cloud Speech-to-Text API, Microsoft Azure API, Amazon Transcribe API, and IBM Watson API. As for Apple Siri, since there is no online speech-to-text API service available from Apple, we cannot attack it with PhantomSound due to the lack of query feedback from its ASR system. For all the target systems, we only receive the hard label of the query input from their APIs.
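For context, a hard-label query of this kind can be issued with the official google-cloud-speech client along the following lines; this is a sketch under the assumption of 16 kHz LINEAR16 audio and default application credentials, not the exact evaluation harness used in this chapter.

```python
from google.cloud import speech

_client = speech.SpeechClient()   # requires GOOGLE_APPLICATION_CREDENTIALS to be set

def hard_label(wav_bytes, sample_rate=16000, language="en-US"):
    """Query Google Cloud Speech-to-Text and return only the top transcript (the hard label).
    An empty string models the 'rejected' case discussed in Section 5.2.3."""
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate,
        language_code=language,
    )
    audio = speech.RecognitionAudio(content=wav_bytes)
    response = _client.recognize(config=config, audio=audio)
    if not response.results:
        return ""
    return response.results[0].alternatives[0].transcript.lower().strip()
```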
5.4.2: Metrics We use the following metrics to quantify the effectiveness of our attack: (1) Success Rate: this metric represents the ratio of successful attacks and the total attempts. For an untargeted attack, as long as the AEs and the perturbations alter the prediction of the original input, we count it as successful. For a targeted attack, we report success only when the prediction matches the targeted class. (2) Average queries per command: we use the number of queries to imply the cost and speed of AE generation. Specifically, we measure how many queries it needs to craft a perturbation. This metric is calculated by the total number of queries over the number of crafted AEs/perturbations. (3) L2 Distortion: the L2 distortion ||δ||2 indicates the size of perturbations. Prior to the launch of a physical attack, we can measure the distortion value by summarizing the squared amplitude of the generated perturbations. Note that the perturbation δ ∈ [0, 1]l and the initial phoneme-level distortion ranges from 50 to 1,600 depending on different phonemes, which will be optimized after the perturbation fine-tuning as shown in Section 5.4.5. (4) False Accept Rate: the false accept rate is measuring the probability of that the attacks can be false accepted by the liveness detection 6WAA represents “Wav-Air-API”. As IBM does not own a commercial voice assistant device, we record and replay our AEs over-the-air, and transcribe them with IBM Watson API. This process, named as WAA, simulates an IVC device that is integrated with an IBM Watson API [37]. 126 methods. We use this metric to evaluate the ability of our attacks to bypass the existing defense methods (e.g. liveness detection) compares to the existing attacks. The higher false accept rate we achieve, indicating the more dangerous of attack is, to bypass the existing liveness detection methods. 5.4.3: Dataset The dataset we choose as original input is speech commands v0.02 [186] released by Google Brain. This dataset is designed to validate the keyword detection capability of DNN models. It contains 105,829 utterances of 35 common one-word commands (e.g., “yes”, “learn”, “stop”), which is recorded from 2,618 volunteers. To validate the effectiveness of PhantomSound on a longer command, we record 10 longer commands (partially listed in Table 5.4) from a volunteer. For the phoneme dataset, we expect to obtain all 44 pure English phonemes with flexible duration. Existing speech datasets (e.g., Arabic Speech Corpus [81], TIMIT [1]) include the annotations of phonemes, but it requires extra efforts to extract individual phonemes with different duration from the speech audio. Besides, PCVC dataset [2] only involves 12 volunteers, and scikit phoneme dataset [146] only contains 5 vowels. To construct a phoneme dataset with a diverse set of speakers, we use 200 different audios from 200 speakers in speech commands v0.02, remove the silence in the recordings, and randomly cut audio clips with a duration between 50ms to 300ms. This phoneme processing step follows that of the scikit phoneme dataset [146], which results in 453 audio clips in total. Table 5.2: Dataset description (“unique cmds” refers to the number of unique target commands, and “total audios” refers to the total number of (adversarial) audios that lead to the target commands). Phone. Cmd. 
Untargeted Targeted Unique cmds - 45 1785 64 Total audios 453 300 6,219 216 Table 5.2 records the number of involved data including phonemes, commands, untargeted perturbations, and targeted perturbations. We use 35 one-word commands from the speech commands v0.02 dataset, along with 10 self-recorded long commands to build a command dataset with 127 45 different commands, including 300 audios in total. Then, we apply the proposed algorithm to randomly generate AEs and perturbations for an untargeted attack, resulting in 1785 different commands and 6,219 adversarial audios on 4 different commercial APIs. For the targeted attack, we attempt the perturbation of keywords, and generate 64 target commands with 216 adversarial audios. 5.4.4: Experiment Setting We conduct the experiments on a desktop with Intel i7-7700k CPUs, 32GB RAM, and 64-bit Ubuntu 18.04 LTS operating system. The experiments are performed at three locations with different noise floors. We use three loudspeakers, including LG monitor built-in speaker (at the apartment), an SADA D6 home small speaker (at the lab), and an Samsung S9 phone (at outdoor), to transmit AEs (i.e., AE attack) and perturbations (i.e., perturbation attack) to the victim devices. Figure 5.12a demonstrates the attack scenario: the victim speaks commands into a smartphone or Google Home mini, while the attacker plays the perturbation through a speaker. 5.4.5: Attack Performance We first evaluate the functionality of AE generation in PhantomSound. The purpose of this evaluation is 1) to demonstrate that the perturbation amplitude is negligible compared with the input, and 2) to prove the query efficiency of our phoneme-level searching algorithm. Then, we conduct the physical attack and validate the robustness of our attack over the air. Attack Over-the-line: We first evaluate the attack by targeting the ASR APIs. The adversarial audios are directly supplied to the online APIs. We randomly select 20 adversarial audios from every command, and then perform the untargeted attack by searching for 100 epochs (K = 100 in Algorithm 1). Then, the generated perturbations are optimized to suppress their power. In the end, we obtain 148 AEs and perturbations from ∼44k queries (Q = 30 in Eq. (5.2)), i.e., 301 queries per AE on average. To evaluate the perturbation amplitude, we randomly pick two examples from the generated perturbation as shown in Figure 5.6. We can see that the crafted perturbations have a negligible 128 0.0 0.5 1.0 Time (s) 0.5 0.0 0.5 Amplitude "Follow" Perturbation (a) One-word commands 0.0 0.5 1.0 Time (s) 1.0 0.5 0.0 0.5 1.0 Amplitude "Kitchen lights off" Perturbation (b) Command phrases Figure 5.6: Comparison of input and perturbation amplitudes. power profile compared with the input regardless of the length of commands. Moreover, the duration of perturbation is shorter than the input, which makes it possible to conceal the presence of perturbation. Table 5.3 summarizes the results of the untargeted attacks toward 4 types of commercial APIs. We observe that every command can be altered into at least two false commands. While some of the false predictions are harmless, the attack can almost certainly invalidate the victim’s command. Moreover, in certain cases, some perturbations can lead to a contrary response from voice assistants (e.g., “right” to “wrong” in Amazon Transcribe API, “right” to “no” in Microsoft Azure API). 
Table 5.3: Untargeted attack results (false transcriptions returned for each benign command across Google Cloud, MS Azure, AMZ Transcribe, and IBM Watson, followed by the average number of queries per API).

"down":   "damn", "town", "done", "Downer", "done", "one", "dine", "Done", "does", "south", "drive", "Drone"
"follow": "fallout", "fallout", "no", "fallen", "farm", "fall over", "for sure", "fall over", "four", "learn", "phone", "fall", "forward", "forewarn", "work", "what", "for", "for eyes", "for", "work", "four", "for work", "ford", "for all"
"yes":    "yeah", "file", "yeah", "yeah", "yeah!", "4", "yes..", "yet", "yet", "On"
"right":  "hi", "right", "Rite Aid", "no", "write", "run", "write", "go", "run", "ray", "read", "trade", "wrong", "left"
Queries:  Google Cloud 345, MS Azure 251, AMZ Trans. 215, IBM Watson 314

Considering the number of queries for generating one perturbation, Google Cloud Speech-to-Text is the most resilient API under our attack, as it requires the most queries. To further understand the query efficiency, we conduct an additional experiment to examine the sensitivity of the different APIs in terms of request rejection rates. The result shows that the Google API is the most sensitive, as it refuses to respond to unclear input, while Amazon Transcribe always responds to any input.
Table 5.4 records the targeted attack results for longer inputs. The results show that our phoneme-level searching method is capable of finding a specific perturbation that misleads the APIs into returning a target result. Note that the average number of queries increases dramatically in the targeted case, which is expected because the target must be reached through multiple rounds of perturbation searching (lines 7-10 in Algorithm 1). It is also noteworthy that our targeted attack cannot guarantee finding a successful perturbation for arbitrary inputs (e.g., Google Cloud fails to craft AEs for "call 911").

Table 5.4: Targeted attack results (number of queries per crafted AE).

Input                  Target                 Google Cloud   MS Azure   AMZ Trans.   IBM Wat.
"turn right"           "turn left"            1,895          1,128      1,421        1,487
"kitchen lights off"   "kitchen lights on"    1,754          857        933          1,377
"call mom"             "call 911"             -              1,421      1,125        -
"read my message"      "delete my message"    2,342          1,520      1,436        1,781
Average queries                               1,997          1,232      1,229        1,548

Query efficiency comparison on known models: To validate the benefit of introducing phonemes to guide the optimization direction, we implement 3 different attacks on two known models. By attacking two ASR models (DeepSpeech 1/DS 1 [84] and DeepSpeech 2/DS 2 [16]) with different prior knowledge and methods, we find that PhantomSound achieves query efficiency comparable to attacks in the gray-box setting, with a 100% attack success rate. The result is summarized in Table 5.5.

Table 5.5: Comparison for untargeted attacks (average queries per AE).

Model       Ours   White box [27]   Score based [33]   Brute force [23]
DS 1 [84]   185    90               206                ∞
DS 2 [16]   226    75               197                ∞

Given the same 10 benign commands, we use the 4 attacks to generate untargeted AEs with the same L2 distortion and record the average number of queries under each level of prior knowledge of the victim model. Compared to the white-box attack, which can fine-tune an AE in fewer than 90 queries, PhantomSound requires about 200 queries to craft an AE, close to the query count of a score-based attack. This result indicates that our strategies, namely 1) using phonemes to initialize the perturbation and 2) query-efficient fine-tuning, work well and achieve similar results with less information (e.g., without confidence scores). It is noteworthy that the brute-force decision-boundary search method does not work for attacking the ASR model, because this method initializes random noise and retrieves model gradients by altering that noise.
However, such noise can never be fine-tuned when the victim model returns an empty label for it, resulting in an infinite number of queries.

Table 5.6: Comparison for targeted attacks.

Attack          Knowledge        Queries    SR
Carlini [27]    Gradient         ~1,000     100%
Houdini [44]    Gradient         ~1,000     100%
Devil's [37]    Conf. score      ~1,500     100%
OCCAM [218]     Final decision   ~30,000    100%
Ours            Final decision   ~1,500     68%

Query efficiency comparison for targeted attacks: We compare the number of required queries with four existing attacks in Table 5.6. The white-box attacks [26, 44] require the fewest queries (~1,500). With knowledge of the confidence scores of the API's decoding results, Devil's Whisper [37] uses a surrogate model trained with around 1,500 queries to attack the APIs. In the scenario where an attacker can only access the final decision of the queried API, PhantomSound needs ~1,500 queries (comparable with the white-box setting) to craft a targeted perturbation. Compared with a recent black-box attack, OCCAM [218], we reduce the number of queries by 95%. However, due to the limited length and diversity of phonemes, we sacrifice success rate to achieve high query efficiency.
Weak synchronization: Before evaluating the physical attacks, we investigate the effectiveness of the proposed weak-synchronization design. In this experiment, we manually add mismatch delays between the input x0 and the generated perturbation to craft mismatched AEs. We then use the mismatched AEs to query the APIs and measure the attack success rate. Figure 5.7a displays the result: after using the average loss, although we expect the weak synchronization to work within 400 ms (detailed in Section 5.3.3), this design is only partially effective, because the success rate drops steadily as the mismatch time increases. Moreover, we show the trend of the L2 distortion with respect to the number of queries in Figure 5.7b. The baseline denotes an L2 distortion of 10, which two volunteers judged unnoticeable when the AEs were played through an LG monitor at medium volume.

Figure 5.7: Evaluation of the AE generation process: (a) weak synchronization; (b) L2 distortion vs. number of queries.
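To illustrate how the mismatched AEs in this experiment can be constructed, here is a minimal sketch; the helper name and the 16 kHz sampling rate are assumptions for illustration only.

```python
import numpy as np

SR = 16000  # assumed sampling rate of the benign command and perturbation

def mismatched_ae(x0, delta, delay_ms):
    """Superpose a perturbation onto the benign command with an artificial delay,
    emulating the weak-synchronization experiment (hypothetical helper)."""
    shift = int(SR * delay_ms / 1000)
    delayed = np.zeros(len(x0))
    end = min(len(x0), shift + len(delta))
    delayed[shift:end] = delta[:end - shift]
    return np.clip(x0 + delayed, -1.0, 1.0)

# probe the success rate at 0, 100, 200, and 300 ms of mismatch, as in Figure 5.7a:
# for d in (0, 100, 200, 300):
#     query_api(mismatched_ae(x0, delta, d))
```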
Attack over-the-air: The over-the-air attack evaluation aims to demonstrate the robustness of PhantomSound. To attack the commercial APIs, we play the valid AEs and perturbations (those that attack successfully in the over-the-line scenario) through a SADA D6 speaker and record them with an iPhone 12 Pro; the recordings are then sent to the commercial APIs for evaluation. The attack distance is set to 50 cm. For each attack, we choose 5 AEs, play each 5 times, and report the average success rate in Table 5.7.

Table 5.7: Over-the-air attack API baseline.

                   Google Cloud   MS Azure   AMZ Trans.   IBM Wat.
Targeted    AE     76%            80%        80%          84%
            Pert.  68%            72%        72%          76%
Untargeted  AE     100%           100%       100%         100%
            Pert.  72%            80%        80%          92%

For the targeted attack, our method attains approximately an 80% success rate against the over-the-air commercial ASR APIs when directly playing the audio adversarial example (AE). When the attack is synchronized with the victim's speech, the perturbation attack achieves around a 72% success rate. For untargeted attacks, our AE and perturbation methods achieve high success rates, misdirecting the victim's input with 100% and approximately 81% success, respectively.
Next, we follow the same setting to attack commercial IVC devices. Figure 5.8a shows the success rate of playing AEs directly. Among all the tested IVC devices, Microsoft Cortana is the most vulnerable to the AE attack, while the Google products (e.g., Google Home, Google Assistant) show the most resilience against the targeted AE attack. Overall, the success rate of the untargeted attack is higher than that of the targeted one: the former reaches ~80%, while the latter stays around ~50%. With the perturbation attack, Figure 5.8b reveals a relatively low success rate. Similarly, compared to the targeted perturbation attack, the untargeted attack has a higher success probability, around 45% on average. Nevertheless, the success rate can be further improved through multiple repeated attempts.

Figure 5.8: AE generation results: (a) AE attack on IVC devices; (b) perturbation attack on IVC devices; (c) impact of distance at different setup places; (d) impact of perturbation loudness (dBSPL).

We also summarize the success rate compared to prior black-box attacks in Table 5.8.

Table 5.8: Comparison with other real-world attacks.

Target         Google Cloud   MS Azure   AMZ Trans.   IBM Wat.   Google Home   Google Assist.   MS Cortana   AMZ Echo
Devil's [37]   10/10          10/10      4/10         10/10      9/10          10/10            10/10        10/10
Danger [215]   -              -          -            -          15/100        -                -            69/100
Ours           19/25          20/25      20/25        21/25      11/25         12/25            16/25        16/25

Compared with the Devil's attack [37], our attack yields a marginally lower success rate against the APIs, with the exception of the Amazon Transcribe API. For the IVC devices, the Devil's attack tends to be more effective at similar SNR levels. For the Danger attack [215], we show the success rate of their "voice squatting" attack, in which the victim's command is misinterpreted to invoke the attacker's skill. The comparison shows that our attack achieves comparable success rates against Amazon Echo and performs better against Google Home.
Time cost: Unlike prior work that requires a substantial amount of time to craft AEs offline, PhantomSound enables much faster AE generation. Fast generation is essential in practice when attackers have only a limited time budget to instantiate the attack. In the experiment, we record the latency of querying the 4 commercial APIs. The results, presented in the first row of Table 5.9, show that 3 of the 4 APIs return a result within seconds, the exception being the Amazon Transcribe API.
The Amazon API has to interact with an Amazon Web Services storage bucket, which takes longer to return results. We then compute the total time needed for perturbation generation by multiplying the latency with the number of queries (shown in Tables 5.3 and 5.4).

Table 5.9: Latency for perturbation generation.

Time Consumption    Google Cloud   MS Azure   AMZ Trans.   IBM Wat.
Latency (s)         0.29           0.58       26.31        1.35
Untargeted (min)    1.67           2.43       94.3         7.1
Targeted (min)      9.65           11.9       539          34.8

Our results show that PhantomSound can generate a perturbation for both the targeted and untargeted attacks within minutes, with the exception of the Amazon API, and the targeted attack takes longer. Note that we take the L2 distortion into consideration during the time-cost computation; if the attacker ignores the impact of the perturbation loudness and uses the intermediate perturbation, the generation time can be further reduced.
5.4.6: Ability to Bypass Liveness Detection
Compared to the existing physical adversarial attacks [37, 204, 218], PhantomSound relies on the benign commands spoken by the user. Although this attack setting requires extra effort to synchronize the perturbation with the user's benign speech, it brings potential benefits for bypassing defense mechanisms. For example, recently proposed liveness detection approaches [7, 73, 110, 112, 118, 123] can differentiate the source of a sound (human or machine) with high accuracy. Therefore, conventional adversarial attacks launched solely through a loudspeaker [37, 204, 218] have a higher probability of being defended by those liveness detection methods. In contrast, our attack is designed to be launched together with the user's speech, posing a more dangerous threat to liveness detection defenses. To validate the performance of PhantomSound against different defense mechanisms, we reproduce three liveness detection algorithms, CQCC [110], STC [112], and Void [7]. For comparison, we implement the C&W attack [27] and Devil's Whisper [37] to attack the liveness detection algorithms. We follow the settings described below:
Ours: We play our perturbation while the user gives the command, record it with a smartphone, and then run the three liveness detection algorithms to detect the sound source.
C&W [27]: We play the AEs generated by this attack, record them with the same smartphone, and run the liveness detection algorithms to defend against them.
Devil's [37]: We play the AEs provided on the Devil's Whisper demonstration website, record them with the same smartphone, and follow the same liveness detection procedure.
For our attack and the C&W attack, we use 20 different perturbations/AEs to attack the liveness detection models; for the Devil's attack, since we can only collect 10 AEs from the demonstration website, we use those 10 AEs. We present our results in Figure 5.9.

Figure 5.9: False accept rates of different attacks against liveness detection defenses (CQCC, STC, Void).

The results show that our attack can bypass all three liveness detection models, achieving a 95% to 100% false accept rate. In contrast, the other two attacks have a very low chance of countering the Void [7] detection, with less than 15% FAR. Even for the conventional liveness detection methods (e.g., CQCC and STC), the existing attacks that use complete AEs also have a low probability (below 40%) of attacking successfully.
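A minimal sketch of the bookkeeping behind Figure 5.9 is shown below; the detector interface `is_live()` is a hypothetical wrapper around reimplementations of CQCC, STC, and Void, not their actual APIs.

```python
def false_accept_rate_percent(recordings, detector):
    """Share of machine-played attack recordings that a liveness detector accepts as live."""
    accepted = sum(1 for wav in recordings if detector.is_live(wav))
    return 100.0 * accepted / len(recordings)

# e.g., evaluate 20 recorded PhantomSound perturbation attacks against each detector:
# for name, det in {"CQCC": cqcc, "STC": stc, "Void": void}.items():
#     print(name, false_accept_rate_percent(phantomsound_recordings, det))
```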
5.4.7: Impact of Practical Factors
To investigate the critical factors that may affect the success rate of PhantomSound, we evaluate the perturbation attack in different environments (apartment, lab, and outdoor). The ambient noise levels at these places are 39.8 dBSPL (apartment), 41.2 dBSPL (lab), and 58 dBSPL (outdoor), respectively. In this experiment, we play a crafted perturbation of "turn right" 10 times, attempting to transform the prediction into "turn left", with the perturbation volume set to 60 dBSPL, and record the success rate under the different circumstances. Figure 5.8c shows the impact of attack distance: the closer the adversary is, the higher the success rate, which is unsurprising given that our attack relies on the successful delivery of the perturbation. The relatively short attack distance is in fact a common limitation reported by existing work [37, 119, 204]. However, the attacker can extend the attack distance by increasing the speaker's volume (though this could make the perturbation more noticeable) or by using a speaker array [141]. Next, Figure 5.8d shows how loudness affects the attack performance. We can see that the success rate improves as the perturbation loudness increases, which also coincides with prior work [119]. In an outdoor environment, the adversary is advised to enhance the attack robustness by amplifying the perturbations; due to the higher noise level outdoors, the phoneme-like perturbation can still be hard to perceive.

Figure 5.10: Attack with different angles: (a) iPhone 12 Pro; (b) Mi 8 Lite.

Impact of attack angles: Besides the environment and the distance, the attack angle can also alter the attack performance. We evaluate our attack by playing AEs to two smartphones from 12 different directions (from 0 to 360 degrees, at 30-degree intervals). This experiment is conducted in the lab environment against the Google Assistant on the smartphone. We play 10 AEs in every direction at 60 dBSPL and record the success rate of the untargeted attack. We report the results in Figure 5.10. We find that our attack performs best when the adversary is directly facing the smartphone or directly behind it. When attacking from the side (e.g., at 0 degrees, when the adversary is parallel to the victim), the success rate is impaired. We observe the same trend on both smartphones. This result indicates that the microphone arrangement and its orientation lead to audio information loss; because our perturbation has low power, it is hard to pick up under this loss, resulting in a low success rate from the side direction.
Impact of different victims: In the attack preparation period, every perturbation is crafted based on a specific command from a specific speaker. However, the adversary may reuse a perturbation crafted for a previous victim to attack the current victim. Here, we evaluate the capability of PhantomSound to attack different speakers. First, we obtain 4 perturbations from speaker #1 (male), which convert the benign commands "stop", "right", "yes", and "down" into the 4 target commands "backward", "left", "no", and "song", respectively.
Next, we randomly select 100 speakers (50 males and 50 females) other than speaker #1 from the Speech Commands v0.02 dataset and inject the perturbations into their benign audio samples. For the targeted attack, we count a case as successful if the benign command is interpreted as the target; for the untargeted attack, any case where the benign command is misinterpreted is considered successful. The results are presented in Figure 5.11.

Figure 5.11: Attack across different victims: (a) targeted attack; (b) untargeted attack.

The results indicate that, for targeted attacks, the success rate depends on the benign samples: it exceeds 50% when the target is of the same gender, but falls below 40% when targeting a different gender. For untargeted attacks, the perturbations demonstrate robust transferability across speakers, with a notably high average success rate of 98% for males and 74% for females.
5.4.8: User Study
To evaluate the stealthiness of the perturbation in a real-world attack, we conduct an online/in-person user study to investigate users' perception of PhantomSound. In our study, 20 volunteers are asked to listen to 6 crafted perturbations at 4 different distances. Two volunteers attend the in-person experiment (see Figure 5.12a) and the rest carry out the experiment at home. We recruit 13 volunteers from Amazon Mechanical Turk and provide them with complete experimental instructions. The volunteers are asked to pretend to speak to their voice assistants while hearing the perturbation, after which they answer questions describing their perception of the heard perturbations. The options for the perception level are: Listened, Abnormal, and Recognize. Listened indicates that the volunteer can hear the perturbation but regards it as normal noise; Abnormal implies that they hear some strange sound; and Recognize means that they can understand the meaning of the heard sound. We report the experiment results in Figure 5.12b. They show that most participants can hear the perturbation within a short distance, but fewer than 50% of them regard the perturbation as an abnormal sound. This feeling of abnormality gradually fades as the attack range increases, dropping to 10% at 2 meters. Moreover, even though all the perturbations are "meaningless" phonemes, some participants claim to understand their meanings (though their understanding is incorrect). To summarize, PhantomSound can be noticed by victims but does not substantially raise their attention, and victims are generally unaware of the meaning of the perturbations.

Figure 5.12: Real-world user study of PhantomSound: (a) experiment setup (victim, laptop, smartphone, speaker, Google Home Mini); (b) users' perception level of PhantomSound at different distances.

5.5: Discussion
5.5.1: Low-cost Attack
Table 5.10 lists the cost comparison between PhantomSound and the existing work [37]. The first row records the pricing of the commercial APIs, which is measured by the duration of the submitted audio (in minutes).
The recent black-box attack [37] is reported to require 1,500 queries to build the substitute models, and every query uses a 25-second audio. In total, such an attack requires 1,500 × 25 / 60 = 625 minutes of audio to train a surrogate model, and it can only generate 10 pre-selected commands. To generate extra commands, the attacker needs to submit additional queries (~100) for the candidate AEs. Assuming the candidate AEs are 6 seconds long, the total time cost for generating extra AEs is 6 × 100 / 60 = 10 minutes. Altogether, the duration of queried audio amounts to 72.5 minutes for producing a single AE (625 minutes amortized over the 10 commands, plus 10 minutes). In contrast, PhantomSound does not require a substitute model; it only takes ~300 queries and ~2,000 queries of one-second audios to craft an untargeted AE (Ours-U) and a targeted AE (Ours-T), respectively. We then compute the cost of generating one AE based on the pricing and the query audio length (rows 4 and 5 of Table 5.10). In the end, PhantomSound saves 93.1% and 65.5% of the cost of crafting an AE, a drastic improvement.

Table 5.10: Cost comparison.

                         Google    MS        AMZ       IBM
Pricing/min              $0.024    $0.016    $0.024    $0.01
Build model [37]         625 min
Craft AE [37]            10 min
Total time/AE [37]       72.5 min
Total time/AE (Ours)     5 min (U) / 25 min (T)
Cost/AE [37]             $1.74     $1.16     $1.74     $0.725
Cost/AE (Ours-U)         $0.12     $0.08     $0.12     $0.05
Cost/AE (Ours-T)         $0.6      $0.4      $0.6      $0.25
Saving/AE (Ours)         93.1% / 65.5%

5.5.2: Limitations
The limitations of PhantomSound include: 1) the attack is sensitive to ambient noise; 2) there is no guarantee of generating an AE for any input and any target; 3) the attack cannot substantially modify very long sentences; and 4) the attack distance is relatively short, as presented in Section 5.4.7. To address the first and fourth limitations, the adversary can either amplify the perturbation power or attack the victim in a relatively quiet place. The second and third limitations can possibly be addressed by multiple repeated attempts of phoneme injection, which increase the likelihood of generating a successful perturbation at the potential cost of growing expenses.
5.5.3: Defense
Prior studies [119, 202, 204] reveal that audio adversarial attacks can be defended against with signal processing techniques, since the adversarial perturbations are delicately crafted and hence deemed fragile. The signal processing techniques can reduce the fidelity of the perturbations and hence protect the ASR models. Typical signal processing defenses include: 1) Down-sampling (DS): decreasing the sampling rate of the AEs to disrupt their quality [119, 202, 204]; 2) Quantization: as the original AEs are encoded as 16-bit values, the quantization technique rounds each 16-bit value to its nearest integer multiple of Q, where Q is the quantization level. A higher Q results in a lower precision of the AEs, and this technique has been adopted to defend against such attacks [119, 202]; 3) Low-pass filtering (LPF): the defense can use a Butterworth low-pass filter with different cutoff frequencies to remove the high-frequency components of the perturbations [119].
We reproduce these three defense methods to test their effectiveness against PhantomSound. Specifically, for the DS approach, we lower the sampling rate of the AEs from 16 kHz to 8 kHz and 4 kHz. In the quantization setting, we follow the existing work [119] and set Q to 256, 512, and 1,024. Finally, we build a Butterworth low-pass filter with a cutoff frequency of 4 kHz and set the filter order to 6.
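For reference, a minimal sketch of these three preprocessing defenses is shown below; the parameter choices mirror the settings above, but this is an illustrative scipy/numpy reimplementation, not the exact code used in our experiments.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

SR = 16000  # the AEs are 16 kHz, 16-bit audio

def downsample(x, target_sr=8000):
    """DS: resample the AE to a lower rate (16 kHz -> 8 kHz or 4 kHz)."""
    return resample_poly(x, target_sr, SR)

def quantize(x, q=512):
    """Quantization: round 16-bit sample values to the nearest multiple of Q."""
    pcm = np.round(x * 32767.0)       # float in [-1, 1] -> 16-bit integer scale
    pcm = np.round(pcm / q) * q       # snap to the nearest multiple of Q
    return pcm / 32767.0

def lowpass(x, cutoff_hz=4000, order=6):
    """LPF: 6th-order Butterworth low-pass filter with a 4 kHz cutoff."""
    sos = butter(order, cutoff_hz, btype="low", fs=SR, output="sos")
    return sosfiltfilt(sos, x)
```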
To validate the defense performance comprehensively, we generate 1,190 AEs from 20 clean audio samples and process them with the 6 different defense settings. We then use the processed AEs to attack the 4 commercial ASR APIs.

Figure 5.13: Performance of PhantomSound against different defenses: (a) defense performance of DS and LPF; (b) defense performance of quantization.

Figure 5.13a shows that LPF barely affects the attack success rate across AEs and APIs. By comparison, the DS technique slightly reduces the attack success rate from 100% to 92.4% (Microsoft), 71.4% (IBM), 87.5% (Amazon), and 63.3% (Google). This method can further reduce the success rate at a lower sampling rate (e.g., at a 4 kHz sampling rate, the IBM and Amazon APIs can defend against ~60% of the attacks, while the Google API does not support audio input at such a low sampling rate). Unlike the findings of previous work [119, 202] that quantization is effective in defending against adversarial attacks, our results show the opposite. From Figure 5.13b, we observe that only the IBM API is affected by quantization, which reduces the success rate to 73%, 61%, and 47% for Q = 256, 512, and 1,024, respectively. To summarize, our results demonstrate that the existing signal-processing-based defenses cannot protect the commercial APIs from PhantomSound. Future research on defense mechanisms is needed to provide more secure speech-to-text and voice assistant services.
5.5.4: Ethical Issues
While the intention behind publishing this work is to enlighten the academic and tech community about the vulnerabilities of commercial ASR APIs and smart speakers, it may also provide malicious actors with the knowledge and tools to exploit these vulnerabilities for harmful purposes, such as privacy invasion, identity theft, or unauthorized control of connected devices. If the findings of this chapter are misused, malicious actors could potentially manipulate smart speakers into sharing sensitive information or performing unauthorized actions, which may cause financial and reputational harm to individuals and corporations. To address these ethical concerns, it would be advisable to collaborate with smart speaker manufacturers to design effective countermeasures against this attack.
5.6: Related Work
The study of adversarial attacks began with the discovery of intriguing properties of neural networks around 2014 [60, 160]. Researchers manually or automatically add small perturbations to the input, thereby misleading neural network models.
Adversarial attacks against ASR systems: Existing work [12, 27, 44, 136] has proposed different optimization algorithms to craft effective AEs against ASR models with some knowledge of the victim's ASR model (e.g., prediction scores or logit outputs). However, the robustness of these attacks in a real-world over-the-air scenario is usually unverified. Recent physical attacks such as CommanderSong [204], Devil's Whisper [37], and AdvPulse [119] require a substantial cost (in time and money) for the attackers to succeed in attacking black-box voice assistants.
Signal processing attacks: Rather than exploiting the vulnerabilities of the neural networks in ASR systems, signal processing attacks target the signal pre-processing or feature extraction modules.
They usually exploit the discrepancies between the human auditory system and the way microphones perceive sound to fool the ASR system. These attacks analyze the input and output of the feature extraction procedure, and then modify the input of the feature extraction module while preserving the shape of its output, to either hide their attack [4] or mislead the ASR system into producing incorrect transcriptions [5]. Even though the existing signal processing attacks demonstrate efficiency and effectiveness against black-box models, they are relatively straightforward to defend against using frequency filters.
Audio backdoor attacks: Different from adversarial attacks, which target a trained model, backdoor attacks [68, 79, 120] inject backdoor triggers during the training process. Recently, researchers demonstrated that backdoor attacks [152, 207] can also be implemented against ASR and speaker verification models. To defend against backdoor attacks in the image domain, several countermeasures have been proposed [75, 77].
Other related works: Some attackers exploit hardware imperfections (e.g., of the microphone) to deliver inaudible attacks through different media [115, 158, 200, 210]. Besides, Danger [215] uses homophones (i.e., different words with similar sounds) to attack ASR skills. Researchers have also developed a side-channel attack [184] that injects voice commands through a power line. Speech synthesis attacks produce fake speech of a victim using generative models [188]; to protect the victim's original speech, researchers add perturbations [95, 183] to prevent the generation of such deepfake speech.
5.7: Summary
In this chapter, we proposed PhantomSound, a practical, black-box, and query-efficient audio attack against commercial ASR systems and IVC devices in real-world scenarios. As opposed to existing attacks that require prior knowledge of the target model, we proposed a phoneme-level searching method to generate AEs and perturbations rapidly and effectively in a black-box setting. In real-world experiments, PhantomSound is shown to be practical and robust in attacking 5 popular commercial voice-controllable devices over the air, which could potentially cause hazards in the smart home. In the next chapter, we will introduce a privacy protection design to defend against voice privacy leakage.
CHAPTER 6: SPEAKER SELECTIVE CANCELLATION VIA NEURAL ENHANCED ULTRASOUND SHADOWING
(This chapter is based on previously published work by Hanqing Guo, Chenning Li, Lingkun Li, Zhichao Cao, Qiben Yan, and Li Xiao, titled "NEC: Speaker Selective Cancellation via Neural Enhanced Ultrasound Shadowing," published in the Proceedings of the 2022 IEEE International Conference on Dependable Systems and Networks (DSN) [70].)
6.1: Introduction
The widespread adoption of smartphones and Internet-of-Things (IoT) devices equipped with built-in microphones enables people to easily record audio anywhere and at any time. However, the increasing prevalence of unauthorized microphones has given rise to many instances of privacy breaches. Commercially available microphones, which are easily accessible, pose a risk of capturing users' biometric data, such as voiceprints, or of eavesdropping on confidential conversations. Consequently, the issue of illicit voice recording has emerged as a significant societal concern. Recent studies [116, 169] attempt to disrupt unauthorized voice recording by emitting an ultrasonic scrambling noise wave (i.e., a jamming signal) to obfuscate the superposed voice. However, the scrambling noise wave is generated from low-level acoustic signal features that are irrelevant to the speaker's identity. Consequently, other benign microphones in the reception range will also be jammed, most of the time undesirably.
In fact, the use of such voice jammers in public spaces is prohibited and unlawful (a violation of 47 U.S.C. § 333), since it poses serious risks to critical public safety communication. Moreover, if the attacker learns the frequency pattern of the scrambling noise wave, the attacker can deploy an additional microphone to nullify the noise and still record illegally. To allow users to secure their voices lawfully without interfering with others' microphone or recorder usage, we propose NEC (Neural Enhanced Cancellation), which only jams a specific target speaker's voice in the recording of any nearby microphone.

Figure 6.1: NEC cancels the target speaker's (e.g., Bob's) voice without intervening in other communications (left: a conventional jammer jams everyone; right: NEC jams only Bob's voice).

Figure 6.1 illustrates the necessity of deploying NEC instead of a commercial audio jammer. Consider that Bob is initiating a private conversation in a public area (e.g., a cafe or work office); to prevent his speech from being leaked, he turns on a commercial jammer to obfuscate all the surrounding input devices. The left sub-figure shows that, during the jamming, other applications such as voice reminders, voice assistants, and phone calls are all effectively disabled by Bob's jammer, which is not only unlawful but also annoying to other users. In contrast, if Bob deploys NEC, only his speech becomes imperceptible to the others' microphones, while other users can still safely use their voice applications as usual.
Generally, NEC is composed of a microphone, a neural network model, and an ultrasonic speaker. Figure 6.2 shows the components of NEC: the red lines denote the target speaker's voice (e.g., Bob's voice), while the green lines represent voice streams irrelevant to Bob (e.g., Alice's voice, background noise, and the model-processed voice). Our goal is to make Bob's voice unrecognizable on Alice's phone or recorder. At the very beginning, the microphone perceives both Bob's and Alice's voices. Then, we feed the mixed audio to our proposed deep neural network (DNN) model. Note that, compared to existing systems that utilize low-level acoustic signals (such as Gaussian noise or scrambling noise), we use the DNN model to extract high-level speaker-specific vocal features for differentiating Bob's voice from the mixed recordings.

Figure 6.2: The voice stream flow of NEC.

The output signal of the DNN model is called the shadow sound, which is then modulated to an ultrasonic frequency to make it inaudible to other users.
Subsequently, Alice's phone will receive a combination of Bob's voice, Alice's voice, and the inaudible shadow signal generated by NEC. The combination of these signals yields a mostly undisturbed sound for Alice. We have four main design goals:
• Utterance-independent Vocal Feature Extraction. For a target speaker, we need to train our DNN model with the speaker's reference audio before deployment. To alleviate the training overhead across different scenarios, the speaker's vocal features should be independent of his/her utterances. As such, we can deliver a one-fits-all DNN model, which is trained once and easily transferred.
• Microphone-aware End-to-end DNN Training. The shadow sound is superposed on the speaker's voice at the microphone. To make the superposition more effective, we need to design an end-to-end training pipeline that maximizes the effectiveness of the superposed shadow sound.
• Low-latency Shadow Sound Generation. We modulate the shadow sound onto an ultrasonic frequency to make it inaudible. However, processing delay may degrade the shadowing efficiency due to the feature mismatch between the speaker's voice and the generated shadow sound. Thus, we need a DNN model that is computationally efficient.
• Synchronization-free. Cancelling Bob's voice on other devices typically requires synchronizing the arrival times of the shadow sound, Bob's sound, and Alice's sound. However, it is challenging to synchronize them without modifying Alice's devices. Therefore, we need a synchronization-free approach to voice cancellation.
To achieve all four goals, we first explore the human vocal principle and observe the speaker-specific but utterance-independent formants of the audio spectrogram from ten speakers using various speech contents. Then, we design a DNN model to generate a shadow sound by imitating the superposition of multiple waves at the microphone. The DNN includes a speaker encoder and a selector for feature reference and extraction. Moreover, we analyze the delay bound and compress the DNN layers to guarantee that the processing delay meets the requirement on various devices (e.g., mobile, Raspberry Pi). We implement NEC using commercial off-the-shelf (COTS) ultrasound transceivers and evaluate its performance in different real-world scenarios. In the experiments, we run benchmark tests using a public speech corpus and two real-world case studies. The evaluation results demonstrate that NEC effectively mutes the target speaker at a microphone by causing a 200% word error rate under Google's voice-to-text service, without interfering with others' conversations. Our contributions are summarized as follows:
• NEC is the first practical speaker-selective cancellation system, which aims to protect the target speaker's voice without interfering with other microphones present.
• We explore the human vocal principles and design a DNN model to imitate the superposition of waves at the microphone, which produces the speaker-specific but utterance-independent shadow audio in real time.
• We implement NEC and extensively evaluate its performance through benchmarks and user studies. The results show its superior performance in comparison with state-of-the-art systems. The demos can be found on our project website: https://nec-app.github.io/.
6.2: Background of Vocal System
Observations: To illustrate the harmonic components of a sound induced by the physical vocal system, we first collect four audios from two volunteers. Each volunteer records two audios of two sentences: "my ideal morning begins with hot coffee" and "don't ask me to carry an oily rag like that". For each audio, we derive the corresponding formants [166] via the FFT for each frame with a duration of 20 ms. The rationale is that the duration of a typical phoneme is longer than 20 ms, which represents the maximal frame length [199]. Thus, each frame is dominated by the harmonic components of sustained tones, i.e., the number and relative intensity of the upper harmonics in the sound. The results are presented in Figure 6.3.

Figure 6.3: Distribution of formants across spectrograms, representing the speaker-specific but utterance-independent timbre pattern. Utterance 1: "My ideal morning begins with hot coffee." Utterance 2: "Don't ask me to carry an oily rag like that."

We can observe consistent formants for each speaker across various spoken contents. For example, the similarity of the resonant frequencies and the relative intensity of the formants of different utterances from the same speaker can be observed in area (1), shown in red boxes. Hence, these characteristics are utterance-independent. Conversely, area (2) in black boxes shows the distinct distribution of speaker-specific formants, which can also be observed across multiple spectra of various frames.
Validation: Based on the observations above, the remaining challenge is to quantify the utterance-independent but speaker-specific feature in audio spectrograms (i.e., areas (1) and (2)), namely the timbre pattern [190]. To guarantee a phonetically balanced state in the timbre, we first average out the dynamic influence of individual phonemes by computing the averaged spectrum over all frames, namely the Long-time Average Spectrum (LAS) [121, 199]. The LAS averages out the dynamic characteristics associated with the various phonemes, such as the motion of the articulators [105]. Suppose the spoken content of each person is divided into M frames of duration T; the LAS F(\omega)_{LAS} is obtained by averaging the spectrum of each frame:

F(\omega)_{LAS} = \frac{1}{M}\sum_{m=1}^{M}\mathcal{F}\big(f_m(t)\big), \quad (6.1)

where \mathcal{F} denotes the FFT and f_m(t) is the frame waveform with duration T. To visualize the distinctive LAS features of different speakers, we compute the LAS of four speakers (A, B, C, D), with every speaker asked to read the same sentence (e.g., "don't ask me to carry an oily rag like that"). The results in Figure 6.4 show that every speaker's LAS feature is unique even when their speech contents are the same. The distinctiveness of the LAS features demonstrates the potential for differentiating the voices of multiple speakers.

Figure 6.4: LAS results from four speakers.

To further verify the utterance-independent but speaker-specific timbre pattern in our computed LAS, we compute the Pearson correlation and obtain the correlation matrix across different speakers and spoken contents. Specifically, we first collect ten different utterances from the four speakers (A, B, C, D) and compute the Pearson correlation across their F(\omega)_{LAS} [199].

Figure 6.5: Pearson correlation matrix of the long-time average spectrum of 10 different utterances from 4 speakers.

As shown in Figure 6.5, the correlation coefficients for the same speaker with different utterances reach up to 0.96 on average, whereas they are generally below 0.75 across speakers, even for the same utterances. The former implies the consistency of the spectrum across various spoken contents for the same speaker, while the latter indicates the distinct timbre patterns of different speakers; together, they demonstrate the feasibility of using the LAS to quantify the timbre patterns from the audio spectrograms of different speakers.
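The LAS of Eq. (6.1) and the correlation check of Figure 6.5 can be reproduced along the following lines; this is a minimal numpy sketch assuming 16 kHz audio and non-overlapping 20 ms frames, not the exact analysis script.

```python
import numpy as np

SR = 16000
FRAME = int(0.02 * SR)   # 20 ms frames, as in the observation above

def las(signal):
    """Long-time Average Spectrum: average magnitude spectrum over 20 ms frames (Eq. 6.1)."""
    n_frames = len(signal) // FRAME
    frames = signal[:n_frames * FRAME].reshape(n_frames, FRAME)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def timbre_similarity(a, b):
    """Pearson correlation between two LAS vectors, as used for Figure 6.5."""
    return float(np.corrcoef(las(a), las(b))[0, 1])
```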
6.3: NEC System Design
As shown above, the voice signals of different human speakers present different spectrum features, while for the same speaker the spectrum features are consistent across different spoken contents. The remaining challenge is to generate a speaker-specific shadow sound from these spectrum features.
6.3.1: System Overview
System pipeline: The goal of NEC is to cancel Bob's voice in the wild (i.e., no one can record Bob's voice with their microphone, and no one is affected by Bob's NEC device). However, passively canceling Bob's voice on Alice's recorder is very challenging. Prior work [151] takes great effort to estimate the arrival of Bob's voice through a wireless channel and to compute the inverse signal of Bob's voice before Bob's acoustic signals arrive; the inverse signal is then synchronized with Bob's voice to perform voice cancellation through a rigorous procedure. However, such a design relies on the speed difference between the wireless signal and the acoustic signal. In a short-range scenario (e.g., Bob is close to Alice), that approach is no longer effective, since the arrival times can be very close. Instead of generating an inverse signal from prior knowledge of Bob's speech, we propose a superposition method to reduce the strength of Bob's sound signals received by Alice's microphone. In other words, NEC produces a shadow signal to be mixed with Bob's voice, leaving essentially only Alice's sound on her microphone.

Figure 6.6: Overview of NEC, which includes the software (green) and hardware (yellow) design as well as the training stage (grey) of our system.

Figure 6.6 shows an overview of NEC's architecture from the audio sources (left block) to (Alice's) recording microphone (bottom right block), which serve as the inputs and outputs, respectively. The reference audio consists of historical recordings of the user, which assist the DNN model in separating the user's voice stream. The mixed audio refers to the audio containing Bob's voice and others' (background) voices. The output of the NEC model is a shadow signal transmitted by an ultrasound speaker. The mixed audio and the shadow signal combine to form the recording on Alice's microphone. We assume that Alice receives the same mixed audio as the one collected by NEC's microphone in proximity. To create a general neural-enhanced framework, we first train NEC at the spectrogram level (top right block) in the offline training stage, where the superposition operator in the purple block represents the audio spectrogram superposition that combines the outputs of the Selector and Audio Transform modules. The functionality of the Selector is to generate a spectrogram that excludes Bob's sound, and the Audio Transform serves to transform waveforms into spectrograms. Then, we convert the shadow spectrogram produced by our selector into an inaudible ultrasound wave via the Broadcast module. The shadow wave propagates through the air channel along with the mixed wave (§6.3.3). The superposition operator inside the Microphone block indicates the wave superposition of the mixed audio and the broadcast shadow sound at the microphone.
Due to the equivalence of audio superposition for waves and spectrograms, the effectiveness of the wave superposition is guaranteed in testing scenarios, as the mixed audio and the shadow sound arrive simultaneously at the microphone. The superposed wave corresponds to the recorded audio, which effectively hides the target's (e.g., Bob's) voice.
Training stage: The purpose of model training is to generate a spectrogram free of Bob's voice for any speech context involving Bob. To achieve this, we manually craft mixed audios that contain Bob's voice and other speakers' voices, and use our selector to generate a spectrogram irrelevant to Bob. To train NEC, we first provide a pre-trained Encoder, which generates the speaker-specific d-vector [181, 182] from the reference audio (e.g., 3 audio instances lasting 3 seconds) as the reference input for the selector. Meanwhile, the mixed audio is processed by the Audio Transform module, which generates a mixed spectrogram as the other input of the selector. The rationale for using spectrograms is two-fold. First, the LAS feature is effective in distinguishing different speakers based on our previous observation (§6.2); second, the calculation of the LAS amounts to averaging the spectrum over audio clips, which can be unfolded across multiple clips as a spectrogram. We directly feed the mixed spectrogram into our selector, along with the d-vector extracted from the reference input. This boosts the accuracy of the DNN in extracting the high-level speaker-specific but utterance-independent vocal features from the mixed sound (§6.3.2).
Overshadow stage: A key property of NEC is its generalization for deployment in the wild. First, rather than cumbersome model retraining and data collection, only 3 audio instances lasting 3 seconds are required by our one-fits-all model for new user enrollment. Second, due to the linearity of the Fourier transform (§6.3.2), we can transfer the spectrogram superposition into the wave superposition of audios at the microphone to guarantee the overshadowing performance. Finally, to avoid audible disturbance during the overshadowing, we convert the shadow spectrogram into inaudible ultrasound (§6.3.3).

Figure 6.7: NEC's DNN Selector generates the utterance-independent but speaker-specific shadow spectrogram by imitating the superposition of waves at the microphone.

6.3.2: Neural Enhanced Selective Speaker Cancellation
In this section, we present the design of NEC's DNNs, which utilize the utterance-independent but speaker-specific features to generate the shadow sound. NEC incorporates an efficient selector to produce a shadow spectrogram, and then hides Bob's voice by overshadowing the mixed spectrogram with it.
Architecture of DNN
Encoder: The encoder module follows the d-vector design of prior studies [176, 181, 182]. This module takes the reference audio of a target speaker as input and produces a speaker-specific embedding that allows the selector to filter out the target speaker's voice from the mixed audio spectrogram.
Selector: The purpose of the selector is to produce a shadow spectrogram and thereby hide Bob's voice by superposing the shadow spectrogram onto the mixed spectrogram. As shown in Figure 6.7, the selector takes the d-vector and the mixed spectrogram as input.
We formulate the mixed spectrogram as follows:

S_{mixed} = \Big| \sum_{n=-\infty}^{\infty} x_{mixed}[n]\, W[n-m]\, e^{-j\omega n} \Big|, \quad (6.2)

where the n-sample mixed audio in C^n is converted into a spectrogram with t sampling points and f frequency bins in R^{t \times f}, W[n-m] is the Hann window, and m is the window size. More specifically, the mixed spectrogram is composed of Bob's voice S_{Bob} and the background voice S_{bk} (e.g., Alice's voice):

S_{mixed} = S_{Bob} + S_{bk}. \quad (6.3)

In practice, the input audio lasts 3 seconds at a sampling rate of 16 kHz, i.e., 48,000 samples. We set the FFT size to 1,200, resulting in 601 frequency bins. The window length and hop length are 400 and 160, respectively, which generates 299 frames. The shape of S_{mixed} is therefore 601×299, denoted as (F, T); the frequency resolution is 13.31 Hz, and the frame resolution is 25 ms with a 15 ms overlap. We transpose the mixed spectrogram for further processing and denote the shape of the transposed spectrogram as (T, F).
With the mixed spectrogram and d-vector in hand, we use them to design a neural network based on our observation in §6.2. Revisiting Figure 6.3, the frequency distribution of the formants [166] and harmonics determines the identity of a given speech segment (i.e., the LAS sufficiently captures the speaker characteristics). Our design goal for the selector is to capture these characteristics with multiple CNN layers. Before building the network structure, we state two requirements for our DNN model: 1) the selector should be able to capture the formant and harmonic features; 2) the selector should consider the consistency of the frequency distribution within the same voice source.
In our DNN design, we focus only on the first three formants, since we observed that the lower-order harmonics carry more energy and are more representative of a single speaker. As the bandwidth of the first three formants ranges from 33 Hz to 79 Hz [56], we design the first convolutional layer with 64 filters of size 1×7. The rationale for this flat filter is to convolve the frequency-domain information (F); in particular, each filter covers 93.17 Hz, which is enough to cover the individual formant bandwidths mentioned above. Another 64 filters of size 7×1 follow, covering a 115 ms (determined by the frame resolution) time-domain (T) span. It is worth mentioning that the length of a phoneme varies from 5 ms to 670 ms according to existing vocal research [98], and the average reading speed of an adult is 184±29 words per minute [168], i.e., 281-387 ms per word; so the second convolutional layer only serves to explore the detailed information at the phoneme level. To further incorporate both F-domain and T-domain features, we apply a sequence of 5×5 convolutional layers with dilations ranging from (1,1) to (8,1). The dilation on the T domain extends the effective range of the filters from 5×5 to 5×40, corresponding to 85 ms to 610 ms. This range covers a few words and meets our second requirement of considering the consistency of the frequency distribution. While other studies [181, 182, 208] add extra layers (e.g., LSTMs, or CNNs with larger filter sizes and dilation shapes) for the speaker separation task, we consider those layers to play a less important role: a larger filter would introduce irrelevant frequency information and long-time-span data, whereas a speaker merely adjusts his/her formant frequencies when speaking a single word or a short sentence.
The output of the CNNs has the shape (T, 2×F), since we add a padding layer before each convolutional layer to maintain the shape of the feature domain, and 2×F comes from the two filters in the last CNN layer. After that, the d-vector is repeatedly concatenated to the output of the last convolutional layer in every time frame. The fused feature embedding is fed into two fully connected layers. As a result, we obtain a (T, F) shadow spectrogram. Figure 6.7 shows the detailed flowchart of our selector. In total, we use only 6 CNN layers and 2 fully connected (FC) layers in the selector model. Compared with existing models such as [181, 182, 208], our model is computationally efficient because it eliminates redundant modules (e.g., LSTMs, or CNNs with larger filter sizes and dilation shapes) unrelated to our research goal.
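To make the layer shapes concrete, here is a hedged PyTorch sketch of a selector with this general structure. The exact channel counts of the intermediate 5×5 layers, the dilation schedule (assumed here to double from 1 to 8), the d-vector dimension, the hidden FC width, and the output activation are assumptions, so treat this as an illustrative approximation rather than NEC's exact network.

```python
import torch
import torch.nn as nn

class Selector(nn.Module):
    """Sketch of an NEC-style selector: 1x7 and 7x1 convs, dilated 5x5 convs (dilation on the
    T axis), per-frame d-vector concatenation, and two FC layers -> (T, F) shadow spectrogram."""
    def __init__(self, freq_bins=601, dvec_dim=256):
        super().__init__()
        def block(cin, cout, k, dil=(1, 1)):
            pad = ((k[0] - 1) // 2 * dil[0], (k[1] - 1) // 2 * dil[1])  # keep (T, F) shape
            return nn.Sequential(nn.Conv2d(cin, cout, k, dilation=dil, padding=pad),
                                 nn.BatchNorm2d(cout), nn.ReLU())
        self.cnn = nn.Sequential(
            block(1, 64, (1, 7)),             # frequency-axis filter (~93 Hz span)
            block(64, 64, (7, 1)),            # time-axis filter (~115 ms span)
            block(64, 64, (5, 5), (1, 1)),    # dilated 5x5 convs; dilation grows on the T axis
            block(64, 64, (5, 5), (2, 1)),
            block(64, 64, (5, 5), (4, 1)),
            block(64, 2,  (5, 5), (8, 1)),    # 2 output filters -> (T, 2F) after flattening
        )
        self.fc = nn.Sequential(nn.Linear(2 * freq_bins + dvec_dim, 600), nn.ReLU(),
                                nn.Linear(600, freq_bins), nn.Sigmoid())

    def forward(self, mixed_spec, dvec):
        # mixed_spec: (batch, T, F); dvec: (batch, dvec_dim)
        x = self.cnn(mixed_spec.unsqueeze(1))            # (batch, 2, T, F)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # (batch, T, 2F)
        d = dvec.unsqueeze(1).expand(-1, t, -1)          # repeat the d-vector per frame
        return self.fc(torch.cat([x, d], dim=-1))        # (batch, T, F) shadow spectrogram
```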
Spectrogram-based Overshadowing
In the overshadowing process, we first feed the mixed spectrogram and the d-vector into our selector. Then, the generated shadow spectrogram is superposed with the received mixed audio at the microphone.
Shadow spectrogram generation: From the point of view of the microphone, the received mixed audio and the shadow sound are superposed, imitating the over-the-air overshadowing at the microphone, formulated as x_{record} = x_{mixed} + x_{shadow}, where the vectors represent the time-series samples of the mixed audio, the shadow sound, and the recorded audio, respectively. By crafting the shadow sound, our goal is to make the recorded audio as close as possible to the background audio (e.g., Alice's sound or environmental noise). A straightforward idea is to optimize the shadow sound directly via the audio-level superposition in the time domain. However, temporal wave superposition has two drawbacks. First, the temporal waveform is less representative than a spectrogram. Second, since the output of our selector is a shadow spectrogram, an inverse STFT module would have to be introduced to convert the spectrogram to a waveform before the loss function, which, in our evaluations, results in gradient vanishing during back-propagation. Therefore, we use the shadow spectrogram from our DNN selector for the subsequent overshadowing processing.
Superposition for audio wave and spectrogram: The linearity of the Fourier transform guarantees the equivalence of temporal wave and spectrogram superposition:

\mathcal{F}\Big[\sum_{i=1}^{n} a_i x_i(t)\Big] = \sum_{i=1}^{n} a_i X_i(\omega), \quad \text{for } \mathcal{F}[x_i(t)] = X_i(\omega), \quad (6.4)

where \mathcal{F} denotes the Fourier transform and x(t) is the temporal waveform signal. Given the linearity of the Fourier transform with coefficients a_i, we can convert the temporal wave superposition into a linear combination of spectrograms:

S_{record} = S_{mixed} + S_{shadow}, \quad (6.5)

where S_{record}, S_{mixed}, and S_{shadow} correspond to the spectrograms of the recorded audio, the mixed audio, and the shadow audio, respectively. To avoid gradient propagation issues and expedite the convergence of the DNN, the shadow spectrogram from the speaker selector is first normalized before being superposed with the mixed spectrogram. To allow the recorded magnitude to eliminate Bob's voice while retaining the others' (e.g., Alice's) voice components, we design the loss function:

Selector^{*}_{opt} = \arg\min_{Selector^{*}} \| S_{record} - S_{bk} \|_2^2, \quad (6.6)

where Selector^{*} denotes the model parameters of our DNN selector, and S_{record} is the sum of the mixed spectrogram and the shadow spectrogram. Using back-propagation with this L2-norm loss, we derive the optimal parameters Selector^{*}_{opt} for our DNN selector, which outputs an optimal shadow spectrogram S_{shadow}. This optimization ensures that the resulting S_{record} is as close to S_{bk} as possible.
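A minimal PyTorch-style sketch of one training step under Eqs. (6.5)-(6.6) follows; the `selector` interface, the mean-squared form of the L2 loss, and the batching details are assumptions carried over from the description above.

```python
import torch

def training_step(selector, optimizer, mixed_wav, bk_wav, dvec):
    """One optimization step: superpose the predicted shadow spectrogram on the mixed
    spectrogram (Eq. 6.5) and pull the result toward the background spectrogram (Eq. 6.6)."""
    stft = lambda x: torch.stft(x, n_fft=1200, hop_length=160, win_length=400,
                                window=torch.hann_window(400), return_complex=True).abs()
    s_mixed = stft(mixed_wav).transpose(1, 2)   # (batch, T, F), with F = 601
    s_bk = stft(bk_wav).transpose(1, 2)
    s_shadow = selector(s_mixed, dvec)          # normalized shadow spectrogram
    s_record = s_mixed + s_shadow               # spectrogram-level superposition
    loss = torch.mean((s_record - s_bk) ** 2)   # squared-error loss, ||S_record - S_bk||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```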
6.3.3: Overshadowing Over the Air
Inaudible Shadow Sound Generation
Given the shadow spectrogram generated by the trained DNN selector, we apply the inverse STFT to derive the shadow sound wave for broadcasting. To make the shadow sound inaudible, for both privacy and deployment convenience, we exploit the non-linearity of microphones [140, 201] to modulate the emitted shadow wave via the Broadcast module in Figure 6.6.
Non-linearity of hardware: The non-linearity of microphone hardware stems from the physical limitations of the diaphragm and the pre-amplifier, which amplify signals in a non-linear manner. Mathematically, given an input signal V_{in} to the microphone, the output signal V_{out} of the commercial amplifier within the microphone is not amplified linearly, i.e., V_{out} \neq A_1 V_{in}, where A_1 is the gain of the input. Instead, the output signal is V_{out} = A_1 V_{in} + A_2 V_{in}^2 + A_3 V_{in}^3 + \cdots. We focus on the A_2 V_{in}^2 term of the non-linear V_{out} and ignore the (relatively small) higher-order components [140, 211]. Without loss of generality, let m(t) be a simple tone, e.g., m(t) = \cos(2\pi f_m t). We then up-convert the baseband signal m(t) onto a carrier with central frequency f_c > 20 kHz. The modulated signal can be written as follows, with power coefficient \alpha:

V_{in} = \big(\cos(2\pi f_m t) + \alpha\big)\cos(2\pi f_c t). \quad (6.7)

Since f_c lies in the inaudible frequency range, the modulated signal V_{in} cannot be heard by humans. Due to the non-linearity effect, the recorded signal V_{out} contains not only the linear component A_1 V_{in} but also the non-linear component A_2 V_{in}^2, representing the inaudible but recorded component:

V_{in}^2 = \big(\cos^2(2\pi f_m t) + \alpha^2 + 2\alpha\cos(2\pi f_m t)\big)\cos^2(2\pi f_c t) = \sum_i \lambda_i \cos(2\pi f_i t) + \mu, \quad (6.8)

where f_i denotes frequency components at f_m, 2f_c, 2f_m, 2(f_m \pm f_c), f_m \pm 2f_c, and \mu is a resulting constant. Given the low-pass filter in the COTS microphone, the high-frequency components are eliminated while the f_m component is retained, where f_m is the baseband frequency of m(t) perceived by the microphone.
Shadow sound broadcast: We then encode our shadow wave x_{shadow} into the inaudible frequency range by modulating it onto a carrier with central frequency f_c. The broadcast shadow wave is computed as follows:

b_{shadow} = x_{shadow} \times \cos(2\pi f_c t), \quad (6.9)

where b_{shadow} refers to the inaudible shadow wave, and x_{shadow} is derived from S_{shadow} via the inverse Fourier transform. More discussion of the over-the-air latency of the shadow sound can be found in [70].
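The broadcast step of Eq. (6.9), optionally with the carrier bias \alpha of Eq. (6.7), can be sketched as follows; the 192 kHz transmit sampling rate and 25 kHz carrier are assumptions for illustration, and x_shadow is assumed to be already resampled to the transmit rate.

```python
import numpy as np

TX_SR = 192000   # assumed DAC rate of the ultrasonic transmit chain
FC = 25000       # carrier frequency above the audible range (f_c > 20 kHz)

def broadcast_wave(x_shadow, alpha=0.0):
    """Amplitude-modulate the shadow wave onto an ultrasonic carrier (Eqs. 6.7/6.9).
    The microphone's non-linearity (Eq. 6.8) demodulates the baseband back into the
    audible band, where it superposes with the mixed audio."""
    t = np.arange(len(x_shadow)) / TX_SR
    return (x_shadow + alpha) * np.cos(2 * np.pi * FC * t)
```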
We run NEC on a local laptop (§6.5.3) to generate the shadow spectrogram, which is sent to a Keysight 33500B waveform generator, followed by an ultrasonic power amplifier [21] that amplifies the inaudible shadow wave. Transmitted through the air by a wide-band dynamic ultrasonic speaker, Vifa [19], the shadow wave is superposed with the mixed audio at a COTS smartphone's microphone. We use a loudspeaker to play the mixed audios, i.e., the "Mixed Speaker" in Figure 6.8 emulates a mixed conversation among Alice, Bob, and others. The target's voice is effectively muted in the final recorded audio.

Table 6.1: Testing dataset for benchmark and user cases

Scenario     Source                     Freq.    Type     Instance
Joint^a      LibriSpeech / Volunteers   0-8k     40/18    560/-
Conv.                                   0-8k     40/-     560/40
Babble^b     NOISEX-92                  0-4k     -        690/40
Factory^c                               0-2k     -        690/40
Vehicle^d                               0-500    -        690/40

^a Two speakers talk jointly. ^b 100 people whispering. ^c A production hall. ^d A vehicle running at 120 km/h.

6.4.2: Dataset Compilation

Table 6.1 summarizes our testing dataset. First, we conduct the System Benchmark by testing the target speaker with public datasets in controlled environments to verify whether the target speaker's voice can be hidden in the presence of real-world noises. Then, we deploy our system in the wild for a realistic attack scenario: the target volunteer wants to avoid being recorded while talking in public, while the COTS microphone can still record others' voices normally.

Model training: Prior to the evaluation of NEC, we train a one-fits-all DNN model for all the defensive scenarios in public. The training dataset is constructed by mixing audios of two different speakers from LibriSpeech [134], and by mixing target speaker audios with different noises from NOISEX-92 [172]. We provide the background audio that excludes the target speaker and train our model to hide the target speaker's voice, given the mixed audio and reference audios of the target speaker.

System benchmark: Using the public dataset LibriSpeech [134] as the corpus source, we first select 10 target speakers and collect 3 audios for each target speaker as their reference audio; the remaining audios of each speaker are treated as normal speech in a real scenario (e.g., Bob's speech). To measure the robustness of NEC, we simulate different environments with different types of noises. To cover different frequency ranges of noise (e.g., high-frequency speech and low-frequency ambient noise), noises from 5 application scenarios are mixed with the target speakers' voices, which results in 3,190 mixed audios in total. Then, we randomly mix the 10 target speakers' voices with those of 40 other speakers, which generates 560 instances for the joint conversation.

User case studies-1: We further collect a user study dataset from 10 target volunteers, covering 3 females and 7 males. Each volunteer is asked to speak 25 sentences. Analogous to the benchmark dataset, we select the reference and test audios randomly, then mix the test audios with 4 sources of noise, producing 160 mixed audios in total. Then, we randomly mix the audios of the 10 target volunteers with those of another 18 volunteers to derive the joint conversation dataset.
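For concreteness, the construction of a (mixed, background) pair used in both the training and testing sets above can be sketched as follows; the helper names and the mixing SNR are illustrative assumptions rather than the exact recipe used for NEC.

# Minimal sketch of how a (mixed, background) pair can be assembled, following
# the dataset construction above: a target speaker's utterance is mixed with
# either another speaker's utterance or a NOISEX-92 noise clip. The helper
# names and the mixing SNR are illustrative, not the NEC pipeline itself.
import numpy as np

def mix_at_snr(target, background, snr_db=0.0):
    """Scale `background` so that the target-to-background power ratio equals
    snr_db, then superpose the two waveforms (mono, same sampling rate)."""
    n = min(len(target), len(background))
    target, background = target[:n], background[:n]
    p_t = np.mean(target ** 2) + 1e-12
    p_b = np.mean(background ** 2) + 1e-12
    scale = np.sqrt(p_t / (p_b * 10 ** (snr_db / 10)))
    return target + scale * background

def make_training_pair(bob_wave, other_wave, snr_db=0.0):
    """Return (mixed, background): the model learns to keep `background`
    (Alice or noise) and cancel Bob's voice from `mixed`."""
    n = min(len(bob_wave), len(other_wave))
    background = other_wave[:n]
    mixed = mix_at_snr(bob_wave, other_wave, snr_db)
    return mixed, background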
User case studies-2: We conduct another user case study to justify the feasibility of NEC in the real world. As shown in Figure 6.9, Bob carries the NEC device to hide his voice in the wild. We ask Bob and Alice to speak normally, at a volume of 77 dBSPL as measured by our decibel meter placed 5 cm away from their lips. Then, we record the loudness, the SONR, and the proportion of Bob's sound on Alice's recorder (a Moto Z4 phone) at different distances for different cases (with or without NEC).

Figure 6.9: Hiding Bob's voice from Alice's recording in a real-world scenario.

Note that our testing dataset is disjoint from the training set and the reference audios. Thus, the trained models can be deployed directly with only three arbitrary reference audios from a new target speaker, avoiding cumbersome deployment costs (e.g., model re-training and data re-collection) [212, 219].

6.4.3: Quantitative Metrics

To measure the quality of NEC, we consider four main metrics:

Source to Distortion Ratio (SDR) [175, 182] measures the ratio (in dB) between the energy of the target signal and the energy of the errors (induced by the interfering speakers and artifacts) in the mixed signal. It should be low for Bob's voice and high for Alice's voice.

Word Error Rate (WER) is broadly adopted to evaluate machine translation systems [203]. We compute the WER by employing Google's speech-to-text service to transform the acoustic signals into text. NEC aims to enlarge the WER for the target speaker and minimize it for other speakers (e.g., Alice).

User Rating Score (URS) is the rating for recordings, in which 10 reviewers rank the raw mixed audios and the recorded audios of NEC with scores from 1 to 5, with Bob's clean voice as the ground truth. A score of 5 denotes the best performance, in which reviewers cannot recognize any words of the target speaker (e.g., Bob).

Sound Noise Ratio (SONR) evaluates the proportion of Bob's sound in the recorded sound. We regard the mixed audio as the useful sound and treat Bob's voice as noise. By computing the power ratio between the mixed audio and Bob's sound at different distances, we validate the efficacy of NEC.
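For concreteness, the energy-ratio metrics can be sketched as follows. The SDR shown here is a simplified, scale-aware variant rather than the full BSS-Eval SDR of [175, 182], and the function names are illustrative; the sketch is only meant to make the definitions above concrete.

# Minimal sketch of the energy-ratio metrics used in the evaluation. This SDR
# is a simplified variant (projection onto the reference plus residual), not
# the complete BSS-Eval decomposition; SONR follows the definition above.
import numpy as np

def simple_sdr(reference, estimate):
    """Ratio (in dB) between the energy of the target signal and the residual error."""
    n = min(len(reference), len(estimate))
    reference, estimate = reference[:n], estimate[:n]
    # project the estimate onto the reference to tolerate gain differences
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    target = alpha * reference
    error = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + 1e-12) / (np.sum(error ** 2) + 1e-12))

def sonr(mixed, bob):
    """Sound Noise Ratio: power ratio (dB) between the mixed audio (useful sound)
    and Bob's voice (treated as noise); higher means less of Bob in the recording."""
    n = min(len(mixed), len(bob))
    p_mixed = np.mean(mixed[:n] ** 2) + 1e-12
    p_bob = np.mean(bob[:n] ** 2) + 1e-12
    return 10 * np.log10(p_mixed / p_bob)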
6.5: Evaluation

In this section, we comprehensively evaluate NEC in different environments with different settings and devices.

Figure 6.10: Overall system performance on three setups across multiple sources of noises (left: hiding Bob's voice on the attacker's recorder; right: retaining Alice's voice on Alice's microphone).

6.5.1: Overall Performance

System benchmark: We first evaluate NEC on the public dataset and report SDR and WER across multiple scenarios in Figure 6.10. When the target speaker, i.e., Bob, expects his voice to be hidden in the recordings, the recorded audios achieve a lower SDR and a higher WER compared with the mixed audios. This shows that our shadow audios can hide Bob's voice reliably, making it unrecognizable by the Google service. Specifically, the median WER increases from 0.894 to 1.798, while the SDR drops from 0.997 dB to -4.918 dB. Note that the WER of the mixed audio is already high because the background speech from other people prevents the Google service from recognizing it; yet it can still be recognized by humans. Conversely, NEC achieves an even higher WER by hiding Bob's voice with the shadow wave, making it unrecognizable even for humans. We further verify this efficacy in the user studies below. We also evaluate the effectiveness of NEC in retaining others' voices (e.g., Alice's) in Figure 6.10 (right). We set the ground truth as Alice's clear voice, and calculate the SDR and WER between the recorded audio and the ground truth audio. The result shows that, compared to the mixed audio which contains Bob's voice, we achieve a higher SDR and a lower WER for capturing Alice's sound when Bob deploys NEC.

User case study-1: Figure 6.11 shows the SDR and URS performance for hiding the target volunteers' voices in the wild. We observe a consistent decline in the SDR of the recorded audios compared with the raw mixed ones. We can hardly recognize the target volunteer's voice in the recorded audios, as the median SDR reaches -4.374 dB, much lower than the SDR of the mixed audios at 2.798 dB. To evaluate the recorded audios comprehensively, we ask 10 reviewers to score the recorded audios and the mixed ones with ranking scores from 1 to 5. We expect a higher score when fewer utterances of the target volunteer can be recognized. The average score of the recorded audios reaches 4.034 across reviewers. All 10 reviewers give a score of 4 to most of the recorded audios, while most scores of 1 are given to the mixed audios, except by reviewers 7 and 8.

Figure 6.11: User study results.

User case study-2: As depicted in Figure 6.9, in this user study we evaluate how much of Bob's voice is leaked to Alice's recorder with and without deploying NEC. We ask Bob and Alice to speak simultaneously, and also record Bob's sole speech with the same speech content. The mixed audio and Bob's individual speech are recorded by Alice's phone (Moto Z4), with varying distances between Alice and Bob (from 0.5 to 3 meters). Figure 6.12 visualizes the waveforms of Bob's audio and the mixed audio. We can see that with increasing distance, Bob's audio contributes less to the mixed one.

Figure 6.12: Waveforms of the mixed audio and Bob's sole speech at distances of 0.5 m, 1 m, 2 m, and 3 m.

We further record Bob's sound pressure level (SPL) at Alice's position and present the result in Figure 6.13a. The SPL of Bob's audio attenuates with increasing distance, and its loudness reaches 43 dBSPL at a 5 m distance (between Alice and Bob) with an environmental noise level of 39.8 dBSPL. In comparison, the SPL of Alice's voice recorded by her own recorder remains at 77 dBSPL. Given the large gap between the SPL of Alice's and Bob's voices across different distances, and the attenuation of Bob's voice with increasing distance, Bob only needs to cancel his voice over a short range (e.g., 2 m). Next, we examine whether NEC can effectively overshadow Bob's sound across this distance. Figure 6.13b presents the SONR results with and without NEC. When NEC is not deployed, the SONR between the recorded mixed audio and Bob's voice stays below 20 dB, which implies that Bob's voice can be effectively captured by Alice's recorder. However, when Bob deploys NEC, even at a close distance (< 2 m), Bob's voice is mostly overshadowed, with the SONR reaching 30 dB. As mentioned above, the strength of Bob's voice signal drops significantly beyond 2 m. Therefore, although the recorded shadow audio strength also degrades dramatically beyond 2 m, the effectiveness of NEC within 2 m makes it a viable solution for target voice cancellation.

Figure 6.13: Effectiveness of NEC across different distances: (a) loudness vs. distance; (b) SONR vs. distance.

6.5.2: Comparison Study

Next, we perform a comparison experiment between NEC and two systems.
The first system uses white noise to jam unauthorized recordings, as is common in commercial ultrasonic jammers. To simplify the jamming process, we manually add 10 dB of white noise over the recorded sound to simulate this type of jamming system. Since the volume of the white noise is determined by the specific jammer, we use 10 dB based on our previous observation of the shadow sound volume on the same phone. The second system is a scrambling-based voice hiding system called Patronus [116], which hides the target recordings by scrambling them with specially designed white noise and recovers the target recordings at an authorized device. Given the mixed (joint) audios (e.g., two volunteers, one of whom is the target), we reproduce the scrambling algorithms of Patronus to hide a speaker's voice.

Figure 6.14: Comparison study: (a) hiding Bob's voice; (b) retaining Alice's voice.

We first compare the voice hiding performance by computing the SDR of the target voice. Figure 6.14a shows that all three systems, NEC (Bob-NEC), White Noise (Bob-WN), and Patronus (Bob-Pat.), achieve a low SDR by effectively hiding the target voice in the mixed audio (Bob-Mixed). We find that, compared to NEC and Patronus, the white noise solution results in a higher SDR, which means it retains more of the target voice than the other systems. Patronus and NEC can reduce the SDR of the mixed audios from 3 dB to nearly -20 dB. Therefore, the voice hiding performance of NEC is on par with that of the specially designed scrambling-based Patronus, and better than the white noise scrambling approach. Next, we evaluate the reception quality of Alice's voice in the presence of the three systems. As shown in Figure 6.14b, among the three systems, the White Noise approach cannot recover the disrupted voice, and therefore results in the lowest SDR for Alice's voice. For comparison, Patronus can recover a limited portion of the scrambled sound through its recovery algorithm, and achieves a low SDR for Alice's voice (i.e., -2.5 dB). The quality of Alice's voice after recovery is even lower than that of the raw mixed audios due to the influence of the scrambling noise. In comparison, NEC achieves a 5 dB gain over the mixed audios in recovering Alice's voice, since NEC carefully nullifies Bob's voice in the mixed audio. This experiment demonstrates that NEC can selectively hide a target speaker's voice without interfering with other speakers. Surprisingly, NEC can even improve the reception quality of others' recordings.
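For reference, the white-noise baseline used in this comparison can be simulated with a few lines of Python; treating the 10 dB setting as a recording-to-noise power ratio is an interpretation made for illustration, not necessarily the calibration used in our experiments.

# Minimal sketch of the white-noise jamming baseline: broadband noise is added
# over the recording at a fixed relative level, emulating a commercial jammer.
import numpy as np

def add_white_noise(recording, snr_db=10.0, rng=None):
    """Superpose Gaussian white noise so that the recording-to-noise power
    ratio equals snr_db (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(recording))
    p_sig = np.mean(recording ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    noise *= np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return recording + noise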
6.5.3: Running Time Analysis

To demonstrate the efficiency of our system, we measure the time consumption of each system module in Table 6.2. Given 100 one-second mixed audios, we evaluate the latency on two hardware platforms: 1) a desktop with a single NVIDIA 1080Ti GPU, and 2) a Raspberry Pi 4. The total processing time of the DNN module in NEC is around 1.51 ms, and the ultrasound modulation takes 11.96 ms on average, well below the duration of the 1 s chunks. In comparison, VoiceFilter takes 2.4× more time to process the same mixed audio. On the Raspberry Pi 4, the overhead of the selector is 293.7 ms, which is faster than the 446.2 ms of VoiceFilter. The achieved latency (< 300 ms) on the edge deployment using the Pi 4 is less than the time offset tolerance of overshadowing, which further corroborates the feasibility of NEC.

Table 6.2: Time consumption of NEC with an audio sample lasting 1 s

Platform       System              Encoder     Selector    Broadcast
PC (1080Ti)    NEC                 0.467 ms    1.51 ms     11.96 ms
               VoiceFilter [182]   0.467 ms    3.65 ms     11.96 ms
Raspberry Pi   NEC                 12.7 ms     293.7 ms    11.96 ms
               VoiceFilter [182]   12.7 ms     446.2 ms    11.96 ms

6.5.4: Parameter Study

Diversity of hardware dependence: The variance in the non-linearity of the hardware (e.g., microphones, amplifiers, filters) on smartphones can influence the optimal selection of the modulation parameters [201], which in turn impacts the performance of our system. Here, we evaluate our system using the 8 mobile devices listed in Table 6.3. Specifically, the carrier frequency fc is the dominant factor that affects the effectiveness of the non-linearity effect. All the tested smartphones have a range of acceptable frequency settings, and the best carrier frequency is listed in parentheses.

Table 6.3: Smartphones used for the two user studies

Model        Brand      Carrier fc (kHz)   Max Dis. (m)
Moto Z4      Motorola   24-28 (28.0)       3.2
iPhone 7 P   Apple      21-29 (27.8)       0.49
iPhone SE2   Apple      23-28 (25.2)       1.77
iPhone X     Apple      27-32 (25.3)       0.43
iPad Air 3   Apple      22-31 (28.0)       3.72
Mi 8 Lite    Xiaomi     24-32 (27.4)       1.65
Pocophone    Xiaomi     22-29 (26.3)       0.7
Galaxy S9    Samsung    25-31 (27.2)       3.64

Diversity of effective distance: Our system can be deployed with various maximum effective distances depending on the smartphone recorder, ranging from 49 cm to 3.72 m, as shown in Table 6.3. The results also show great variance across recorders. We attribute this diversity to differences in the frequency response of these recorders and in the non-linearity of their audio processing circuits.

Multiple recorders: Since the performance of NEC can be affected by hardware variance, we investigate whether the NEC system can support multiple recorders simultaneously. In this experiment, we use the Moto Z4, Mi 8 Lite, Pocophone, and Galaxy S9 as recorders to eavesdrop on Bob's voice. With the collected recordings, we compute the SDR of the recorded audios. For comparison, the SDR of the mixed audio is also calculated to reveal the effect of NEC. We define NEC as successful if the SDR of the recorded audio is lower than that of the mixed audio. The experiment results are presented in Table 6.4. For three different carrier center frequency settings, we play 20 crafted mixed audios and run NEC to superpose the shadow audio onto three recorders' recordings. The columns labeled 1+, 2+, and 3 indicate that at least 1, at least 2, or all 3 devices are affected simultaneously by NEC, and values such as 20/20 denote that Bob's voice is unrecognizable in all 20 recorded audios. This result provides supporting evidence that NEC is capable of operating in public and affecting multiple recorders by carefully tuning the system parameters.

Table 6.4: NEC's performance with multiple recorders (number of recorders affected)

fc (kHz)    1+       2+       3
26.3        20/20    9/20     4/20
27.2        20/20    15/20    11/20
27.4        20/20    14/20    8/20

6.6: Discussion

Limitation of non-linear effect: The success of NEC relies on the imperfection of the receiver's (e.g., Alice's) microphone.
However, when the non-linear effect is not present, due to either 1) the high precision of Alice's microphone or 2) improper modulation parameter settings, our selective voice protection is no longer effective.

Limitation of protecting conversation: Although the preceding benchmark and user case experiments demonstrate that NEC can protect the target speaker's voice in the wild, it remains challenging to protect a conversation that involves multiple speakers without disrupting other users (e.g., Alice). We were unable to train a Selector model that is applicable to multiple target speakers with the current system architecture. In future work, we will investigate how to integrate multiple speakers' embeddings and re-design the Selector model to avoid removing Alice's voice in a private conversation.

Directionality of the ultrasonic speaker: In our prototype shown in Figure 6.8, we assume the ultrasound speaker has the shadow audio ready before playing it. However, when we integrate the monitor, the DNN models, and the ultrasound speaker into one device and run it in real time, the shadow audio depends on the incoming mixed audio. In this case, one critical concern for NEC is whether the current mixed audio will be affected by the current shadow audio, thereby impacting the quality of future shadow audio. Fortunately, we can avoid this by placing the monitor and the ultrasound speaker in opposite directions. By exploiting the directional property of the ultrasound speaker, the shadow audio is barely sensed by NEC's monitor, as the speaker produces limited amplitude in its back direction.

6.7: Related Work

Microphone jamming: Microphone jamming has been proposed [36, 116, 169] to protect private conversations. To prevent the recording of private conversations, a pre-configured audio jammer is deployed to emit scrambling noise waves that disrupt the speech recording. Specifically, Chen et al. [36] adopt white noise to distort microphone recordings, while Tung et al. [169] explore sound masking with specially designed scrambling noise to obfuscate spoken sensitive information. Patronus [116] emits ultrasound to generate the scrambling waves at the recorder without introducing human-audible noise. In contrast, rather than canceling or jamming based on low-level signal features (e.g., frequency, phase), we use high-level human vocal features to generate a shadow sound for speaker-selective jamming.

AI-augmented speaker diarization: AI plays an important role in signal processing [74, 113, 114, 220]. Recent studies [41, 181, 182] propose AI-based speaker diarization, a process that partitions multi-speaker audio into homogeneous single-speaker segments based on speaker identity, effectively answering "who spoke when" in a multi-speaker scenario. Several audio embedding models have been proposed for speaker-specific feature extraction, including speaker factors [28], i-vectors [147, 221], and d-vectors [181, 182, 208]. Based on these features, a number of classification models have been designed to extract speaker-specific embedded audio, such as clustering algorithms [147, 181, 221], DNN models [182, 208], and even integrated models with visual information (e.g., lip movement and face recognition) [6, 41, 52]. However, these methods cannot be adopted in our scenario. First, all existing speaker diarization models are used for post-processing after the audio is recorded, whereas we need to perform voice cancellation in an end-to-end fashion.
Additionally, the processing delay is an important factor for guaranteeing effective shadow sound generation, which is ignored by these post-processing models. In this work, we design the adaptive features, DNN structures, and training methods needed to realize an end-to-end voice cancellation system that protects a target speaker's voice.

6.8: Summary

We present NEC, a lightweight AI-augmented voice protection system that protects the target speech without interfering with others' audio conversations. As an end-to-end processing system, NEC actively emits specially designed ultrasound signals toward a recorder. Due to the non-linearity effect, a shadow sound is generated and superposed onto the received mixed sound at the recorder, which effectively cancels the target speaker's voice in the recordings. To determine the frequency composition of the shadow sound, NEC leverages a tailored Deep Neural Network (DNN) to extract high-level, speaker-specific but utterance-independent vocal features from the mixed sound. By imitating the over-the-air overshadowing, we superpose the shadow audio with the mixed audio during the training stage of the DNN model and deliver a one-fits-all model, which is trained only once and deployed directly for new users. Our experimental evaluations demonstrate NEC's efficacy in a wide variety of real-world scenarios. The results show that NEC effectively prevents microphones from recording the target speaker's voice.

CHAPTER 7: CONCLUSION AND FUTURE WORK

Voice serves as a key medium for human interaction, not only with each other but also with computers and intelligent devices. It facilitates information exchange between individuals and allows users to command AI agents or smart devices through speech. Additionally, voice characteristics are employed by cloud services for user authentication. However, with the widespread use of voice in contemporary communication, concerns regarding its security in various contexts have emerged within the research community. Potential risks include adversaries circumventing voice-based authentication systems to access personal data or manipulate devices such as mobile phones, customer accounts, or smart speakers. Furthermore, speech recognition models could be compromised, leading to deceptive outputs from smart speakers or speech-to-text services, such as unauthorized door opening or the generation of inappropriate responses. Privacy issues also arise, with attackers potentially eavesdropping and using voice cloning techniques to execute speech synthesis attacks. This dissertation focuses on identifying vulnerabilities in AI-powered voice systems and developing defensive strategies against these threats. It concentrates on three primary areas: speaker authentication, speech recognition, and privacy protection. The core methodology of our present and future research is to meticulously address these three areas, thereby fostering the adoption of voice-enabled AI systems in the modern world. In summary, this dissertation not only presents a series of significant research findings and developments in the field of voice technology but also provides a comprehensive and detailed overview of the current state of speaker authentication, speech recognition, and privacy protection. The work represents a substantial contribution to the field and sets the stage for future research and development in these crucial areas.
Our work's importance reaches far beyond academic circles, offering substantial and diverse benefits to different areas within the security sector and society in general. First, for the security community, our study serves as a groundbreaking resource. It sheds light on the less explored aspects of voice-enabled AI systems, an area experiencing rapid growth. By identifying these vulnerabilities, we empower cybersecurity professionals to build stronger and more effective defenses against potential attacks. This forward-thinking strategy is vital in an era where technological progress often surpasses security protocols. Second, our research has a significant impact on the well-being of society. With voice-enabled devices increasingly common in homes and public spaces, securing these devices is crucial. Our findings play a key role in protecting the privacy and safety of the broader public, who are frequently the unintended victims of security lapses. To conclude, our in-depth analysis and the proactive approaches we propose to the issues in AI-driven voice interaction systems mark a considerable advancement. We introduce a crucial perspective on security in a field that is essential to our society's technological progress. Our work establishes a foundation for more secure and reliable AI interactions, benefiting both the security community and the wider society.

7.1: Summary of Contributions

This dissertation presents a comprehensive collection of five of my scholarly publications, delving into the critical areas of speaker authentication, speech recognition, and privacy issues. In speaker authentication, we introduce SuperVoice, a speaker verification system that utilizes ultrasound features in human voice to verify speaker identities. This system demonstrates robustness and accuracy across various environments. Furthermore, we identify new vulnerabilities in existing speaker verification models. Specifically, we introduce a novel attack named MasterKey and showcase the potential risks and impacts of such attacks on speaker verification systems. For speech recognition, this dissertation presents two attack methodologies. The first, named SpecPatch, is designed for scenarios that involve direct human interaction, and the false command injection attack could succeed even when users are actively engaging with speech-to-text systems. The second methodology, PhantomSound, targets black-box systems, where direct interaction or insight into the system's inner workings is limited. Both approaches represent significant advancements in understanding the vulnerabilities of speech recognition systems. Lastly, the dissertation addresses the important issue of data privacy in voice technology. In response to the growing concerns about unauthorized voice recordings and eavesdropping, we introduce NEC, an innovative jamming device engineered to prevent unauthorized voice recordings without interfering with authorized ones. Specifically, this dissertation makes the following contributions:

Robust speaker authentication system: In this dissertation, we introduce SuperVoice, an advanced speaker verification system that enhances secure speaker authentication by utilizing ultrasound characteristics present in human speech. This system represents a departure from conventional speaker verification methods that rely on spectrographic features derived from the audible frequency range of voice commands.
Instead, we venture into a novel area of human voice research by examining the distinct traits of human speech in the ultrasound frequency band. Our findings reveal that the high-frequency components of ultrasound in human speech, particularly those in the 20 to 48 kHz range such as speech fricatives, can markedly improve the security and precision of speaker verification systems. This approach opens new possibilities for leveraging ultrasound features to bolster the effectiveness of speaker authentication technologies.

Attacking speaker authentication models: This dissertation introduces a novel threat to speaker authentication systems, termed MasterKey. This backdoor attack is designed to compromise various speaker verification (SV) models, targeting a real-world scenario where the attacker lacks knowledge about the specific victim. The development of MasterKey involved a thorough examination of the limitations inherent in existing poisoning attacks aimed at unseen targets. Our approach led to the optimization of a universal backdoor capable of attacking any target. We further refined the backdoor by embedding subtle characteristics of the speaker's voice and semantic information, rendering it virtually undetectable. Additionally, we accounted for channel distortion, incorporating this element into the backdoor's design. We successfully attacked 53 speaker verification models involving 16,430 enrolled speakers. Remarkably, our attack achieved a 100% success rate with a 15% poisoning rate. We validated our attack in three real-world settings, successfully executing it both over the air and via telephony lines. This comprehensive testing underscores the effectiveness of MasterKey and highlights the critical need for enhanced security measures in speaker authentication systems.

Attacking speech recognition systems with a human in the loop: This dissertation introduces SpecPatch, a new adversarial audio attack targeting automated speech recognition (ASR) systems, uniquely involving human interaction in the process. Traditional audio adversarial attacks operate under the assumption that users will not detect the adversarial audio, thereby ensuring the effective delivery of manipulated examples or perturbations. However, this overlooks a critical aspect of real-world scenarios: users of intelligent voice-controlled devices are often alert to any unusual sounds, particularly when issuing voice commands. If users perceive any suspicious audio, they tend to counteract it by interrupting the adversarial audio and overpowering the malicious voice with stronger, corrective commands. This user vigilance renders most existing attacks ineffective when user interaction and adversarial audio delivery happen simultaneously. To address this challenge and enable a truly imperceptible and robust adversarial attack that can withstand potential user interruptions, we developed SpecPatch. This practical voice attack employs a sub-second audio patch signal to initiate an attack command, coupled with periodic noises designed to disrupt communication between the user and the ASR system. In comparison to existing methods, SpecPatch significantly extends the attack impact length (by 287%), effectively lengthening the duration of the target command. Moreover, we demonstrate that our attack maintains a 100% success rate in both over-the-line and over-the-air scenarios, even in the face of user intervention.
Attacking commercial speech recognition services: This dissertation introduces PhantomSound, a black-box attack designed for commercial voice assistants. Traditional black-box adversarial attacks on voice assistants often involve a substantial number of queries and an extensive training phase. PhantomSound, on the other hand, employs a decision-based attack strategy to efficiently produce effective adversarial audios, significantly reducing the number of required queries by optimizing gradient estimation. Our experiments involved testing PhantomSound against four different speech-to-text APIs in three real-world scenarios to assess its real-time impact. The results confirm that PhantomSound is both practical and robust, capable of successfully attacking five popular commercial voice-controlled devices over the air. It also demonstrates the ability to circumvent three liveness detection mechanisms with a success rate exceeding 95%. Compared to leading black-box attacks, PhantomSound reduces the required queries by 93.1% for untargeted attacks (approximately 300 queries, or around 5 minutes) and by 65.5% for targeted attacks (roughly 1,500 queries, or about 25 minutes). This efficiency makes PhantomSound a significant advancement in the realm of black-box adversarial attacks on voice assistants.

Protecting against unauthorized recording: This dissertation introduces NEC (Neural Enhanced Cancellation), a novel defense mechanism designed to protect the privacy of everyday conversations by preventing unauthorized microphones from recording a target speaker's voice. NEC offers a significant advancement over existing audio cancellation techniques, which typically rely on scrambling methods. Unlike these methods, NEC is capable of selectively eliminating a target speaker's voice from a mixed speech environment without disrupting others. The core of NEC's functionality lies in a specially designed Deep Neural Network (DNN) model. This model is trained to isolate high-level vocal features that are specific to the target speaker but independent of their particular utterances. These features are extracted from the speaker's reference audio samples. When a microphone is in operation, the DNN actively generates a 'shadow sound' that effectively cancels out the target voice in real time. NEC has been thoroughly implemented and evaluated using 8 different smartphone microphones across various settings. The results from these evaluations demonstrate that NEC is highly effective in muting the target speaker at a microphone, all while ensuring that other users can carry on their conversations without any interference. This makes NEC a highly promising solution for enhancing privacy in daily communications.

7.2: Limitations and Discussion

Although the research described in this dissertation makes a significant contribution to the field of voice AI system security, our work has limitations and room for further improvement.

Robust speaker authentication system: In this dissertation, we have proposed a secure speaker authentication system called SuperVoice. Although this system shows robustness and efficiency in distinguishing speakers and defending against fake speech, it has some limitations. First, we noted that certain phonemes, particularly fricatives and stops, have high energy above 20 kHz. However, sentences without fricatives might lack this energy spike. Despite this, SuperVoice often detects High-Frequency Energy (HFE) in non-fricative commands due to airflow alterations by the speaker.
This HFE acts as an additional feature for speaker verification. Even for sentences with predominantly low-frequency energy (below 8 kHz), SuperVoice's dual-frequency stream architecture ensures consistent performance, effectively utilizing high-frequency features from non-fricative commands. Second, we found that in SuperVoice, distance affects the power of both low- and high-frequency components, particularly for fricatives and plosives, therefore affecting speaker verification performance at long distances. A potential solution such as a power amplifier could address this attenuation, and its effectiveness in long-distance verification will be explored in future work.

Attacking speaker authentication models: In this dissertation, we introduce an innovative attack named MasterKey against large-scale speaker authentication models; however, it has some limitations that need to be addressed. First, this attack relies on the speaker authentication model's maintainer using the attacker-prepared dataset to fine-tune the commercial model. This is a strong assumption because companies tend to use local/private datasets to enhance their models. Second, the attack may not be robust to a re-training defense. In this case, the model maintainer can re-train and fine-tune the backdoored model on a benign dataset, which will invalidate the backdoor attack. To enhance the robustness of our attack, one possible solution is to craft out-of-domain backdoor samples, so that re-training does not affect the backdoor-target mapping, because the benign dataset does not include the out-of-domain backdoor distribution.

Attacking speech recognition systems: In this dissertation, we present two approaches to attack speech recognition systems. The first attack, SpecPatch, successfully attacks the speech recognition model while a human is in the loop. However, SpecPatch presents several constraints: 1) its dependency on specific models; 2) its failure to inject very long target sentences; and 3) a limited effective attack range. Regarding the first constraint, the attack is tailored to recurrent neural networks, exploiting vulnerabilities in the connections between individual cells. The second issue could potentially be resolved by using a longer patch, although this might increase the likelihood of alerting the target. As for the third constraint, enhancing the power of the patch could extend its attack range; however, this requires managing distortions caused by the amplifier and attenuation due to increased distance. Our second attack, PhantomSound, is designed to attack commercial speech-to-text APIs and voice assistants. Although PhantomSound demonstrates the ability to craft voice adversarial examples promptly, it has the following limitations. First, PhantomSound is vulnerable to the presence of ambient noise. Second, PhantomSound cannot consistently create adversarial examples (AEs) for every input and target. Third, the attack struggles to significantly alter lengthy sentences. Last, its capability of launching a long-range attack is limited. To mitigate the first and fourth limitations, an attacker could either boost the strength of the perturbation or choose a quieter environment for the attack. The second and third limitations might be overcome by employing multiple iterations of phoneme injection. While this increases the chances of producing a successful perturbation, it also potentially escalates the cost and effort required.
Protecting against unauthorized recording: This dissertation introduces a smart microphone jammer that is capable of jamming a specific user's voice on the attacker's microphone. Even though the idea and prototype are promising, there are some limitations. First, the effectiveness of NEC is contingent on imperfections in the receiver's microphone. Its selective voice protection fails when the non-linear effect is absent, either due to the high precision of the microphone or due to incorrect modulation parameter settings. Second, NEC faces challenges in safeguarding conversations involving multiple speakers without disrupting others. The current system architecture does not support a Selector model capable of handling multiple target speakers. Future work will explore integrating multiple speaker embeddings and redesigning the Selector model to prevent the inadvertent removal of voices from private conversations. Third, the jamming performance is affected by the direction of the ultrasound signal. The prototype assumes the ultrasound speaker is pre-loaded with the shadow audio. When integrating the monitor, DNN models, and ultrasound speaker into a single device for real-time operation, the shadow audio depends on the incoming mixed audio. A key concern is whether the current shadow audio might affect the quality of future shadow audio. This issue can be mitigated by positioning the monitor and ultrasound speaker in opposite directions, utilizing the directional nature of the ultrasound speaker to ensure that the shadow audio has minimal impact on NEC's monitor.

7.3: Future Work

I will continue advancing the frontier of the adoption of AI-enabled systems by addressing their security, privacy, and usability challenges. Several future directions are listed as follows.

Defending against telecommunications fraud: Current telecommunications fraud creates deepfake audio of the victim through speech synthesis techniques (generative networks), thereby gaining the trust of the victim's relatives in order to carry out the fraud. My observations on voice adversarial attacks show that victims can protect their voice from misuse by adding perturbations to their recordings before uploading them to the Internet. However, this approach's performance suffers because of the diversity of generative models and signal-processing tools. In my future study, I will borrow the black-box attack idea to train a generalized model that simulates the adversary's generative model, and craft robust universal perturbations to mislead the identity of the generated fake audio. I firmly believe that this approach has the potential to be developed into a generalized defense framework to safeguard against deepfake audio.

Advancing secure interaction for large language models in smart speakers: The prospective integration of Large Language Models (LLMs) like ChatGPT with smart speakers such as Alexa is poised to significantly enhance daily life assistance through intelligent and responsive interactions. However, this amalgamation brings forth critical concerns and challenges related to security, efficiency, and accuracy that need meticulous investigation and resolution. My future work is committed to pioneering advancements in this domain, focusing on developing secure, efficient, and high-performing smart speakers embedded with sophisticated LLM logic.
We aim to delve deep into the security vulnerabilities inherent in these integrations, emphasizing the creation of robust authentication mechanisms to mitigate unauthorized access and potential malicious activities. A pivotal area of our research will be to refine the accuracy of speech-to-text conversion processes 181 and the model’s logical comprehension of commands (chain of thought), ensuring precise interpretation and execution of user instructions. We will explore optimizations to meet the computational demands of LLMs and investigate energy-efficient solutions to address increased power consumption. Additionally, we aspire to implement mechanisms enabling smart speakers to justify proposed actions and seek user confirmations, enhancing user control and satisfaction. Through our endeavors, we anticipate contributing to the development and adoption of secure, user-friendly, and intelligent smart speakers, paving the way for the next era of smart assistance technologies. Voice watermarking for IP protection: Watermarking is usually used for protecting the copyright. In my future work, I will aim to fortify the integrity of speech datasets against unauthorized usage by adding voice watermarks. The watermark can be a different style of speech and will be used to verify the ownership of the dataset. Once a suspicious model uses our dataset without permission, the fine-tuning of our dataset will leave a watermark on the model. Compared to the existing watermark approach which injects backdoors into the trained model and potentially harms legitimate dataset users (e.g., leaving a backdoor to their model). We aim to make a harmless watermark by injecting different styles of speech (with correct labels) to serve as a watermark. The watermarked audio samples are intricately designed to be challenging for speech recognition systems to decode correctly and will be incorporated seamlessly into the dataset, thus acting as a protective shield without compromising the audio quality. This approach probably offers a reliable protection mechanism for speech datasets and encourages the responsible and ethical use of data resources. 182 BIBLIOGRAPHY [1] TIMIT Acoustic-Phonetic Continuous Speech Corpus. https://catalog.ldc.upenn.edu/LDC93S1, 1993. Accessed: 2021-11-04. [2] Persian Vowel recognition with MFCC and ANN on PCVC speech dataset. https://github.com/smalekz/PCVC, 2018. Accessed: 2021-11-04. [3] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pages 265–283, 2016. [4] Hadi Abdullah, Washington Garcia, Christian Peeters, Patrick Traynor, Kevin RB Butler, and Joseph Wilson. Practical hidden voice attacks against speech and speaker recognition systems. In Network and Distributed Systems Security (NDSS) Symposium, 2019. [5] Hadi Abdullah, Muhammad Sajidur Rahman, Washington Garcia, Kevin Warren, Anurag Swarnim Yadav, Tom Shrimpton, and Patrick Traynor. Hear” no evil”, see” kenansville”*: Efficient and transferable black-box attacks on speech recognition and voice identification systems. In 2021 IEEE Symposium on Security and Privacy (SP), pages 712– 729. IEEE, 2021. [6] T. Afouras, J. S. Chung, and A. Zisserman. The conversation: Deep audio-visual speech enhancement. In INTERSPEECH, 2018. 
[7] Muhammad Ejaz Ahmed, Il-Youp Kwak, Jun Ho Huh, Iljoo Kim, Taekkyung Oh, and Hyoungshick Kim. Void: A fast and light voice liveness detection system. In USENIX Security, 2020. [8] Muhammad Ejaz Ahmed, Il-Youp Kwak, Jun Ho Huh, Iljoo Kim, Taekkyung Oh, and Hyoungshick Kim. Void: A fast and light voice liveness detection system. In 29th USENIX Security Symposium (USENIX Security 20), pages 2685–2702, 2020. [9] Federico Alegre, Artur Janicki, and Nicholas Evans. Re-assessing the threat of replay spoofing attacks against automatic speaker verification. In 2014 International Conference of the Biometrics Special Interest Group (BIOSIG), pages 1–6. IEEE, 2014. [10] Federico Alegre, Ravichander Vipperla, Nicholas Evans, and Benoït Fauve. On the vulnerability of automatic speaker recognition to spoofing attacks with artificial signals. In 2012 Proceedings of the 20th european signal processing conference (EUSIPCO), pages 36–40. IEEE, 2012. [11] Raziel Alvarez and Hyun-Jin Park. End-to-end streaming keyword spotting. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6336–6340. IEEE, 2019. 183 [12] Moustafa Alzantot, Bharathan Balaji, and Mani Srivastava. Did you hear that? adversarial examples against automatic speech recognition. In 31st Conference on Neural Information Processing Systems (NIPS), 2017. [13] Amazon. Amazon Echo. https://www.amazon.com/All-New-Echo-4th- Gen/dp/B07XKF5RM3, 2021. [14] Amazon. Amazon Transcribe. https://aws.amazon.com/transcribe/, 2021. [15] Amazon. Amazon Transcribe. https://aws.amazon.com/transcribe/, 2021. [16] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning, pages 173–182. PMLR, 2016. [17] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In International conference on machine learning, pages 284–293. PMLR, 2018. [18] Avisoft. http://www.avisoft.com/ultrasound-microphones/cm16-cmpa/. [19] Avisoft. http://www.avisoft.com/playback/vifa/. [20] AWS. AWS Voice ID. AWS, September 2022. https://aws.amazon.com/connect/voice-id/. [21] Avisoft Bioacoustics. http://www.avisoft.com/playback/power-amplifier/. [22] Bose. https://www.bose.com/en_us/support/products/bose_speakers_support/bose_ smarthome_speakers_support/soundtouch-10-wireless-system.html. [23] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017. [24] Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017. [25] William M Campbell, Douglas E Sturim, and Douglas A Reynolds. Support vector machines using gmm supervectors for speaker verification. IEEE signal processing letters, 13(5):308– 311, 2006. [26] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017. [27] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pages 1–7, 2018. [28] F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, and C. Vair. 
Stream-based speaker segmentation using speaker factors and eigenvoices. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. 184 [29] Chase. Chase Voice ID. Chase, September 2022. https://www.chase.com/personal/ voice-biometrics. [30] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728, 2018. [31] Guangke Chen, Sen Chenb, Lingling Fan, Xiaoning Du, Zhe Zhao, Fu Song, and Yang Liu. Who is real bob? adversarial attacks on speaker recognition systems. In 2021 IEEE Symposium on Security and Privacy (SP), pages 694–711. IEEE, 2021. [32] Jianbo Chen, Michael I Jordan, and Martin J Wainwright. Hopskipjumpattack: A queryefficient decision-based attack. In 2020 IEEE Symposium on Security and Privacy (SP), pages 1277–1294. IEEE, 2020. [33] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM workshop on artificial intelligence and security, pages 15–26, 2017. [34] Tao Chen, Longfei Shangguan, Zhenjiang Li, and Kyle Jamieson. Metamorph: Injecting inaudible commands into over-the-air voice controlled systems. In Network and Distributed Systems Security (NDSS) Symposium, 2020. [35] Yu-hsin Chen, Ignacio Lopez-Moreno, Tara N Sainath, Mirkó Visontai, Raziel Alvarez, and Carolina Parada. Locally-connected and convolutional neural networks for small footprint speaker recognition. In Sixteenth Annual Conference of the International Speech Communication Association, 2015. [36] Yuxin Chen, Huiying Li, Shan-Yuan Teng, Steven Nagels, Zhijing Li, Pedro Lopes, Ben Y Zhao, and Haitao Zheng 0001. Wearable Microphone Jamming. CHI, 2020. [37] Yuxuan Chen, Xuejing Yuan, Jiangshan Zhang, Yue Zhao, Shengzhi Zhang, Kai Chen, and XiaoFeng Wang. {Devil’s} whisper: A general approach for physical adversarial attacks against commercial black-box speech recognition devices. In 29th USENIX Security Symposium (USENIX Security 20), pages 2667–2684, 2020. [38] Minhao Cheng, Simranjit Singh, Patrick Chen, Pin-Yu Chen, Sijia Liu, and Cho-Jui Hsieh. Sign-opt: A query-efficient hard-label adversarial attack. arXiv preprint arXiv:1909.10773, 2019. [39] Noam Chomsky and Morris Halle. The sound pattern of english. 1968. [40] Anurag Chowdhury and Arun Ross. Fusing mfcc and lpc features using 1d triplet cnn for speaker recognition in severely degraded audio signals. IEEE Transactions on Information Forensics and Security, 2019. [41] Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, and Andrew Zisserman. Spot the conversation: speaker diarisation in the wild. ArXiv, 2020. 185 [42] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018. [43] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018. [44] Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. arXiv preprint arXiv:1707.05373, 2017. [45] Donghui Dai, Zhenlin An, and Lei Yang. Inducing wireless chargers to voice out for inaudible command attacks. In 2023 IEEE Symposium on Security and Privacy (SP), pages 1789–1806. IEEE, 2023. 