PUSH THE LIMIT OF IOT SYSTEM DESIGN ON MOBILE DEVICES

By

Manni Liu

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2023

ABSTRACT

Internet of Things (IoT) utilizes sensors as the information source of machine intelligence. Its applications range from Smart Home and Smart City to Wearable Healthcare and Smart Farming. An IoT architecture usually covers four stages: sensor data collection, data transmission, data processing and application modeling. Beyond prediction accuracy, IoT research is also concerned with efficiency, economic cost and system scalability. In pursuit of these goals, we push the limit of IoT system design from the following three perspectives. (1) We exploit the potential of the sensors on smart devices, including sensor fusion and the possibility of new IoT applications. (2) We design machine learning models for IoT applications, including feature engineering and model selection. (3) We implement lightweight IoT systems for smart devices like laptops, smartphones and voice assistants, considering their constrained computation resources. In this dissertation, we focus on IoT applications related to localization and security. EyeLoc is a smartphone vision enabled localization system designed for large shopping malls. The results show that the 90-percentile errors of localization and heading direction are 5.97 m and 20◦ in a 70,000 m2 mall. Patronus protects acoustic privacy from malicious secret audio recordings using the nonlinear effect of microphones. Our experiments show that only 19.7% of the words protected by Patronus can be recognized by unauthorized recorders. SoundFlower is a sound source localization system for voice assistants. It can locate a user in 3D space through the wake-up command with a median error of 0.45 m.

In general, we explore the potential of diverse sensors for IoT services and build machine learning models to extract as much information as possible from sensor data. The applications we study specifically concern localization and security.

Copyright by
MANNI LIU
2023

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest gratitude to my PhD advisor, Dr. Zhichao Cao, for his advice, inspiration and encouragement. During my years in the Edge Intelligence and Networking Group, Dr. Cao has provided endless support for my research and career. Not only has he guided me toward the big picture of IoT research, but he has also provided detailed instruction on experiment design, paper writing and research presentations. Dr. Cao has always been patient and positive in spite of the ups and downs throughout my PhD life. I would also like to thank Dr. Li Xiao, Dr. Guan-Hua Tu and Dr. Mi Zhang for being on my thesis committee. The guidance from my thesis committee is invaluable to my academic career. It was my pleasure to spend two years under Dr. Yunhao Liu's supervision before he moved to Tsinghua University. Dr. Liu pointed me in the right direction when I first studied IoT. In the lab founded by Dr. Yunhao Liu and Dr. Zhichao Cao, I have met great labmates who have become lifelong friends. Li Liu and I started research on IoT together and have been supporting each other all these years. Our friendship extends beyond the realms of research and career, reaching into the fabric of our lives. Maolin Gan's company is a source of joy and light to everyone in the lab.
Gen Li and Yidong Ren helped me when I started teaching and always encouraged me whenever I doubted myself. The friendship of Yimeng Liu has also been a great support during my PhD life. I also want to thank my mentor, Dr. Xin Zhou. Despite all my inexperience during my first industrial internship, Dr. Zhou was supportive and inspired me to apply my PhD research to autonomous driving. After the internship ended, Dr. Zhou has continued to share her experience as a woman in tech and to encourage me to pursue my career. Finally, I would like to thank my parents for their unconditional love and support.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
CHAPTER 2 SMARTPHONE VISION ENABLED PLUG-N-PLAY INDOOR LOCALIZATION IN LARGE SHOPPING MALLS
  2.1 Overview
  2.2 Design
  2.3 Implementation
  2.4 Evaluation
  2.5 Related Work
  2.6 Conclusion
CHAPTER 3 A ROBUST SOUND SOURCE LOCALIZATION SYSTEM FOR VOICE ASSISTANTS
  3.1 Related Work
  3.2 Preliminary and Motivation
  3.3 System Overview
  3.4 Design
  3.5 Implementation
  3.6 Experiment
  3.7 Discussion and Future Work
  3.8 Conclusion
CHAPTER 4 PREVENTING UNAUTHORIZED SPEECH RECORDINGS WITH SUPPORT FOR SELECTIVE UNSCRAMBLING
  4.1 Related Works
  4.2 Nonlinear Behavior of Common Microphones
  4.3 Design
  4.4 Implementation
  4.5 Evaluation
  4.6 Limitations and Future Works
  4.7 Conclusion
CHAPTER 5 CONCLUSION
BIBLIOGRAPHY

CHAPTER 1 INTRODUCTION

Internet of Things (IoT) refers to the network of computing devices that are embedded with sensors and interconnected through the Internet [1].
Distributed sensors gather data from physi- cal objects, networking transfers diverse information from various locations to the computational system, which enables a comprehensive understanding of the environment as well as reactions of actuators. With cloud service deploying computation resources and machine learning analyzing sensor data, IoT could lead to complete automation of large infrastructures. Lots of novel con- cepts have been proposed and relevant systems are under construction, such as Smart Home [2], Smart Office [3] and Smart City [4]. In this dissertation, we are particularly interested in smart applications related to mobile devices. As mobile devices are the most common and widespread computational systems, exploring their existing sensors and building effective computational mod- els is important to many IoT applications. With proper utilization, a mobile device can play an essential role to IoT products like Smart Home and Smart Office. IoT architecture can be divided into four layers: • Sensing layer: To initiate exchange of information between a physical object and a compu- tational system, sensors play an essential role. Sensors monitor the physical conditions of the environment and collect data. Our IoT systems in this dissertation studies sensors like micro- phones, cameras and inertial sensors (IMU). As for physical signals, we can either use existing environmental signals, like sound source localization, or use modulated signals that interact with the surroundings, like gesture recognition through WiFi signals. For the latter one, the modulated signal is expected to be non-intrusive to the environment. For example, when acoustic signal is adopted to monitor infants, BreathJunior [5] especially selects chirps from 6 kHz to 21 kHz and modulate the chirps into pseudo white noise because long-term exposure to high-frequency chirps does harm to infants. • Network layer: Network layer manages communication between devices. It can be wired or wireless communication. Wired communication transfers data through a wired medium like Eth- 1 ernet or USB. Our research focuses more on wireless signals. Nowadays wireless communications such as WiFi, LoRa, Bluetooth Low Energy (BLE), cellular networks (3G, 4G, 5G) and Radio- Frequency Identification (RFID) are very popular in IoT. In this dissertation, we especially studied acoustic signals to form a sensor network. • Data preprocessing layer: After sensor data is collected and transferred to the computational system, the next step is to remove noises and extract relevant features. Sometimes we also need to remove redundant information to reduce the computational overload. In our application sce- narios, hardware imperfection and environmental disturbances are major sources of noises. Signal processing and machine learning techniques are common methods to preprocess data in IoT. • Application layer: After we have clean data, we can build computational models to achieve smart applications. Geometric models [6–9] and machine learning models [10–13] are two popular choices. In spite of attractive capabilities, IoT system design faces manifold challenges. First, no sen- sor could provide perfect transfer function between the physical signal and the sensor data. For example, acoustic sensing suffers from information loss due to the discretization between analog signal and digital signal. WiFi sensing struggles with phase offset across radio chains, sampling frequency offset, symbol timing offset and carrier frequency offset. 
Image sensing is limited by resolution. Second, environmental noise is pervasive. The environmental noise might be more than random white noise. The sensor data contains traces of environmental information and interfer- ence, which is challenging to be completely removed. For example, one common environmental noise source for acoustic sensing is multipath effect. Due to this reason, the performance of an IoT system might degrade significantly after the system is deployed to a new environment. Third, the computational power of mobile devices is limited while most IoT applications require real-time responses. The training of larger neural networks like ChatGPT requires a cluster of GPUs[14], which is not affordable for mobile devices like laptops, smartphones and voice assistants. Last but not the least, the existence of sensors incurs privacy concern from users. On the one hand, we should choose proper sensors in different application scenarios. On the other hand, we need to 2 protect use privacy from malicious attackers by exploring the potential of sensors. Over the years, many IoT systems have been designed to overcome the challenges and push the limit of IoT system design. In this dissertation, we introduce one system for indoor localization, one system for sound source localization and one system for acoustic security. For indoor local- ization, we propose EyeLoc [15] to enable self-localization in large shopping malls. For malls like Outlets, their area can reach as high as 70,000 m2. Usually people depend on nearby kiosks to fig- ure out directions, which is time-consuming and tiresome. With EyeLoc, people hold smartphones and turn a circle in place, then their location and heading direction will be shown on floor-plan images, which are widely offered by map providers like Google Maps and Gaode Maps. EyeLoc combines cameras and IMUs to explore the geometric relationship between angles and the relative location of the user with respect to three or more Point-of-Interests (POI). The 90-percentile errors of localization and heading direction are 5.97 m and 20◦ in 70,000 m2 malls, which is sufficient to find a shop or an exit in a mall. For sound source localization, we implemented SoundFlower for voice assistants like Amazon Echo. If voice assistants like Amazon Echo, Google Home or Apple HomePod can locate a person based on the speech he/she utters, they can better understand the context of commands and deliver more considerate tasks. Sound source localization is challenging because voice assistants are blind to the original speech. SoundFlower extracts Time Difference of Arrival (TDoA) information from phase data of cross spectrum. To cope with internal noise and environmental noise, we design a self-adjusting speech detection method to recognize speech-involved phase data. Robust regression is applied to extract TDoA from phase data against multipath effect. Our experiments prove that SoundFlower is robust and efficient. For acoustic security, we designed a system to protect acoustic privacy from unauthorized recordings. Smart devices such as smartphones, smartwatches and digital wristbands are all de- signed with recording feature. In spite of conveniences, such function from portable and widespread devices expose us to the risk of malicious secret recording. We propose Patronus [16] to scram- ble unauthorized recordings with ultrasounds, while authorized devices can still recover informa- 3 tion from scrambled recordings with keys received through WiFi or bluetooth. 
The rationale of scrambling speech with ultrasounds is drawn from an observation discovered by BackDoor [17]. Although commercial mobile devices cannot sense ultrasounds due to their sampling rate limit, two ultrasounds with different frequency can incur a low-frequency signal within the microphone. When it comes to recovering information from the scrambled recording with the received key, we apply Normalized Least-Mean-Square (NLMS) adaptive filter. The process of NLMS filter is to simulate the received ultrasound and remove it from the recording. For the structure of this dissertation, Chapter 2 introduces EyeLoc which is designed for indoor localization in large shopping malls. EyeLoc especially shows our efforts on achieving the trade- off among cost, computational power and real-time responses. Chapter 3 introduces SoundFlower, which is a sound source localization system for voice assistants. SoundFlower places emphasis on how to overcome pervasive environmental noise. Chapter 4 presents Patronus which is used to protect acoustic privacy. Patronus shows how to protect user privacy by making use of sensors and wireless signals. Finally, we conclude our work in Chapter 5. 4 CHAPTER 2 SMARTPHONE VISION ENABLED PLUG-N-PLAY INDOOR LOCALIZATION IN LARGE SHOPPING MALLS Nowadays, the physical layout of many large shopping malls is becoming more and more com- plex [18]. As there are many location-based activities (e.g., shopping, eating, watching movies) in large shopping malls, indoor localization is becoming an important service for people. Although outdoor localization (i.e., GPS) has been put into practice for many years, there is still no practical deployed indoor localization systems. Many indoor localization systems rely on pre-collected information (e.g., Wi-Fi signals [19] [20] [21] [22] [23], lamp positions [24] [25] [26], scene images [27] [28] [29] and mag- netic fingerprints [30] [31]), called site survey, to construct a localizable map. In large shopping malls, the site survey usually incurs extensive bootstrap overhead which hinders corresponding ap- proaches from widespread adoption. Even when the site survey is accomplished, the information needs to be timely updated and calibrated to ensure the accuracy. Moreover, some indoor localiza- tion systems [22] require custom hardware, which is not supported in commodity smartphones. Our core question is can we set up a plug-and-play indoor localization system in large shopping malls with commodity smartphones? We notice a possible way by leveraging the widely available floor-plan images, which can be obtained from indoor map providers (e.g., Google Maps, Gaode Maps, Baidu Maps, etc.). Those floor-plan images contain positions of many shops, called Point of Interests (POI). POIs are used as visual hints when users try to manually localize themselves. This kind of nonautomatic self-localization usually requires users to have good geometric sense and the ability of space transformation. To reduce users’ mental work and provide real-time localization service, we refine the question as can users automatically obtain their positions on floor-plan images from their smartphones just like traditional outdoor localization systems such as Google Maps and Baidu Maps? In this sense, it is possible to achieve a plug-and-play indoor localization system by bridging this gap. In this chapter, we propose EyeLoc, a step towards plug-and-play indoor localization in large 5 shopping malls. 
The key idea of EyeLoc is to imitate human self-localization with smartphone vi- sion. After obtaining a floor-plan image, EyeLoc uses scene text detection/recognition techniques to extract a set of POIs from the image. The recognized texts are used to identify different POIs and their corresponding text bounding boxes provide the approximate POI positions in the floor-plan coordinate system (called floor-plan space). Correspondingly in real space (called vision space), a user holds his/her smartphone and turns a 360◦ circle. The smartphone automatically shoots a series of images (called view image), which contain the surrounding POI signs. For those observed POIs, EyeLoc extracts their texts and geometric constraints in vision space, which are further used to match the user’s position in floor-plan space. Technically, EyeLoc develops several novel methods to address three challenges. First, there is a big difference between human vision system and smartphone vision system. Human vision system is a binocular system that supports estimating the direction and distance of an object. Most of the smartphones, however, only have one camera which is hard to achieve direction and distance measurements in a light-weight way with existing vision methods. We develop an accurate and ubiquitous monocular vision system which is available on most smartphones. We construct the constant geometric constraints of 3 observed POIs to enable position matching between floor-plan space and visual space. Second, to extract the directions of different POIs, text detection and recognition are necessary, but usually time-consuming. To reduce the processing time of POI extraction, an outlier image filtering method and a sparse image processing method are designed. Third, the measurement errors from motion sensors and floor-plan images may incur inaccurate position matching, for which we design an error-resilient method. We implement EyeLoc on Android smartphones and evaluate its performance in an office en- vironment, two large shopping malls (7,500m2 and 10,000m2) and a semi-outdoor large Outlets (70,000m2). The evaluation results show that the 90-percentile errors of localization and heading direction can achieve 5.97 m and 20◦. The contributions of this chapter are as follows. • We propose EyeLoc, a smartphone vision enabled plug-and-play indoor localization in large shopping malls. No site survey nor periodical calibration of floor map is required. 6 • We develop a ubiquitous smartphone vision system and corresponding geometric localization model. To guarantee the localization accuracy and processing efficiency, we propose countermea- sures to address several practical challenges. • We implement EyeLoc on Android smartphones and evaluate its performance in an office environment and two large shopping malls. The evaluation results show that EyeLoc is effective in both localization accuracy and processing efficiency. The rest of this chapter is organized as follows. Section 2.1 introduces the overview of EyeLoc. Section 2.2 illustrates the detailed design of EyeLoc. Section 2.3 and Section 2.4 show the details of EyeLoc implementation and evaluation respectively. Section 2.5 introduces the related work. Finally, we conclude our work in Section 2.6. 2.1 Overview Plug-and-play outdoor localization has been successfully achieved on smartphones with the help of GPS. Referring to the criteria of GPS-based outdoor localization, EyeLoc has two goals: • Plug-and-play. 
EyeLoc should not assume any extra bootstrap cost (e.g., site survey, system calibration) in large shopping malls. Moreover, EyeLoc should not require users to own any prior knowledge or follow complex smartphone operations. • Efficient and robust. Facing computation-intensive image processing and various mea- surement errors from motion sensors and floor-plan images, EyeLoc should be able to accurately localize a user with short processing time. To meet the first goal, EyeLoc is inspired by two observations. First, the indoor floor-plan images of shopping malls (e.g., shown in Figure 2.1(a)) can be easily fetched from indoor map providers through Android and iOS APIs. The other observation is that people are used to turning around to observe surrounding POIs and localize themselves. After fetching the floor-plan images, EyeLoc enables the self-localization of the smartphone through absorbing data from the on-board motion sensors and camera. No bootstrap cost or user training is involved. Figure 2.1 is an example showing how EyeLoc works in a plug-and-play manner. Alice is lost in a large shopping mall and she wants to go to H&M. As Alice opens EyeLoc on her smartphone, 7 Figure 2.1 Illustration of an example of the EyeLoc innovation. the corresponding floor-plan image is automatically fetched. She holds the smartphone and turns a 360◦ circle, during which the camera and motion sensors keep working. This operation is called circle shoot. With input data from the camera and motion sensors, EyeLoc extracts geometric information of surrounding POIs (e.g., GAP, UGG, MISS SIXTY, Calvin Klein). Meanwhile, EyeLoc uses text detection and recognition techniques to find the POI positions on the floor-plan image. Finally, EyeLoc projects Alice’s position and heading direction onto the floor-plan image as shown in Figure 2.1(c). The involvement of image processing significantly increases the difficulty to meet the second goal. As Figure 2.1 illustrates, EyeLoc depends on text detection and recognition techniques to extract POI signs from view images. Text detection and recognition for color images have been widely studied in the past decade, especially with deep learning models like convolutional neural networks. A few open-source models (e.g., OpenCV [32], Tesseract [33]) are also available on smartphones. Smartphones can also utilize cloud service from companies like Google, Baidu, etc. However, none of the two approaches can achieve real-time execution due to computation overhead on images or extra network delay. This contradiction demands us to design an efficient method to extract enough geometric information without incurring long processing latency. We have two intuitions for the method design. First, since the extracted POIs with error geometric information is useless even harmful for localization, we should not deal with those view images of low quality. Second, we observe that the view images are usually redundant for extracting the geometric information of an observed POI. Hopefully, we can only select a subset of those view images which contain equivalent geometric information of the observed shops as the whole set 8 (a) Indoor Floor-plan Image(b) Circle Shoot(c) Location & Heading does for further processing. On the other hand, various measurement errors are invertible and may lead to inaccurate local- ization. 
For example, as shown in Figure 2.1(c), the text bounding boxes of the four observed shops may be not exactly aligned to that of the corresponding shop signs that appeared in vision space. To mitigate potential errors and achieve robust localization, our observation is that the spatial POI distribution is usually dense in large shopping malls, which means multiple POIs are available. Hopefully, we can use the redundant information to refine the estimated user position. In comparison with human binocular vision system, EyeLoc develops a monocular vision sys- tem, which is accurate and ubiquitous for smartphones. EyeLoc enables user position matching between vision space and floor-plan space with constant geometric constraints of observed POIs. The system architecture of EyeLoc is shown in Figure 2.2, including three parts as follows. Raw Data Collection. The first part is to fetch floor-plan images from indoor map providers and collect raw information of view images from circle shoot. According to the coarse GPS lo- calization, EyeLoc queries indoor map providers to obtain floor-plane images. During the circle shoot, Eyeloc uses the camera and motion sensors (e.g., compass, gyroscope, accelerometer) to continuously capture view images and corresponding motion attributes (e.g., camera facing direc- tion, angle velocity). Section 2.2.3 shows the design details. POI Extraction. Taking the information of view images and floor-plan images as input, the second part extracts geometric information of observed POIs in both vision space and floor-plan space. Because text detection and recognition are time-consuming, we need to extract enough geometric information while keeping the number of processed images as small as possible. EyeLoc filters out some view images which are blurred or have error motion attributes. Then EyeLoc develops a sparse image processing method to extract geometric information of all observed POIs from the rest of the images and keeps the number of processed images small. On the other hand, after extracting all POIs on the floor-plan image, EyeLoc obtains the positions of the observed POIs in floor-plan space by matching their names. The detailed design is illustrated in Section 2.2.4. Position Matching. With the geometric information of the observed POIs, EyeLoc now 9 Figure 2.2 Illustration of the system architecture of EyeLoc. projects the user’s position and heading direction onto floor-plan space. The redundancy of the observed POIs is explored to mitigate unavoidable errors of geometric information in vision space and positions in floor-plan space. The observed POIs are grouped into tuples. Each POI tuple can be used to calculate the user’s position and heading direction with geometric constraints. The localization errors of different tuples are diverse with the same measurement errors. EyeLoc com- bines several inferred positions and the corresponding errors to vote the final user’s position and heading direction. The detailed design is shown in Section 2.2.5. 2.2 Design To relieve humans of self-localization, EyeLoc achieves plug-and-play localization in large shopping malls. Due to the fundamental difference between the vision principle of humans and smartphones, we first establish the smartphone vision system and illustrate the geometric local- ization model. To further put EyeLoc into practice, we show the detailed design of three function components (shown in Figure 2.2) step by step. 
2.2.1 Smartphone Vision System We intend to define a ubiquitous smartphone vision system to fetch the geometric relationship between a smartphone and an observed POI. Human eyes form a binocular vision system, which enables us to estimate our distance and direction to an observed POI. Smartphone vision differs significantly from human vision. First, not all smartphones have been equipped with dual or triple cameras. It is hard to extract the distance and direction of an observed object from a monocular image except there is a preconfigured Structure from Motion (SfM) based model or learning based model. However, both of these two approaches require a large set of images for model training, 10 Circle ShootFloor Plan ImagesRaw Data CollectionView ImagesMotion SensorCameraCoarse GPS LocalizationView Image AttributePOI ExtractionPosition MatchingView Image FilteringSparse Image ProcessingIndoor Map ProviderObserved POI NameObserved POI DirectionPOI Name MatchingText Detect & RecognizeObserved POI PositionPOI GroupingPOI Tuple 1POI Tuple 2POI Tuple k…Geometric LocalizationError EstimationUser Position & Heading which incurs the heavy burden of site survey. Second, humans have practiced a lot since childhood, so camera calibration [34] is a must to achieve accurate estimation. Since the parameters of camera calibration are not explicitly known for those smartphones, the complicated operation induces unacceptable difficulty to bootstrap this ability for common users. In EyeLoc, the question is can we estimate distance and direction as the geometric descriptor of an observed object through the monocular view images of circle shoot? Figure 2.3 Direction and distance measurement with circle shoot. It is confirmative that the direction information of a POI can be constructed with the monocular view images of circle shoot. We define the EyeLoc sightline of an object as the virtual line between the object and the user. For example, as shown in Figure 2.3, given a POI P and a user H, the EyeLoc sightline is HP. O is the optical center of the camera lens and C is the center of the image plane. H, C and O are approximately kept on the same line all the time during circle shoot. When the user is facing P, HP coincides with CO and its direction can be measured by smartphone motion sensors [35]. During circle shoot, however, the user’s facing direction is continuously changing. For example, C1O1 and C2O2 are not aligned with HP. The key observation is that when CO and HP are aligned, F that indicates the position of P on a view image will coincide with C. Otherwise, it will appear at the side of C as F1 and F2 show. As shown in Figure 2.4(b), (c) and 11 PF1F2O1O2drrffC1C2K1K2θ1θ2HEyeLoc sightline of object POC(F) (d), when a user turns in clockwise during circle shoot (e.g., Figure 2.4(a)), for shop “MOUSSY”, its text bounding box will appear from left to right in the view images. EyeLoc finds the view image (e.g., Figure 2.4(c)) of which the text bounding box is at the center, then the direction of EyeLoc sightline can be estimated by motion sensor. Figure 2.4 An example of the sightline change during circle shoot in physical space. To enable distance estimation with monocular vision, it is possible to exploit the camera motion of circle shoot to imitate a binocular vision system. As shown in Figure 2.3, the distance between the POI P and the user H is indicated as d. The distance between H and the optical center of smartphone camera lens O1 is r. 
The focal length f of the smartphone camera lens is unknown for most smartphones. θ1 indicates the intersection angle between the line HO1 and the EyeLoc sightline HP. Since △F1O1C1 is similar to △PO1K1, we have the following equation:

\frac{F_1C_1}{f} = \frac{d\sin\theta_1}{d\cos\theta_1 - r} \quad (2.1)

where F1C1 indicates the pixel offset between F1 and C1. Combining the same equation under another angle θ2 (θ1 ≠ θ2), we can derive d as follows:

d = r\,\frac{\sin\theta_1 - k\sin\theta_2}{\cos\theta_2\sin\theta_1 - k\cos\theta_1\sin\theta_2} \quad (2.2)

where k equals the ratio between F1C1 and F2C2. If θ1, θ2, k and r are known, the distance d can be calculated. As the directions of HP, HO1 and HO2 can be obtained from the motion sensors, θ1 and θ2 can be calculated. For a shop, we recognize the center of its text bounding box as F1 and F2 on the two view images so that F1C1, F2C2 and k can be calculated. r can be roughly estimated according to human arm length. In this way, EyeLoc can estimate the distance of a POI without any prior knowledge of camera parameters in large shopping malls.

Figure 2.5 Illustration of the distance error in terms of the errors of θ2 and k. (a) Influence of θ1 and θ2. (b) Influence of k. The two plots show the distance error under different θ2 and k when the other parameters are fixed.

Since errors in the estimation of θ and k are inevitable, we further conduct an error analysis. We assume r is 0.5 m. Given θ1, θ2 and k as 24◦, 12◦ and 2.11, d will be 5.48 m. We change one parameter (e.g., θ1, θ2 or k) and calculate the distance error while the other parameters are fixed. The results are shown in Figure 2.5a and Figure 2.5b. Surprisingly, given the distance as 5.48 m, the distance error is huge even when θ1, θ2 and k have a small bias. The distance error reaches 1.97 m when θ1 decreases from 24◦ to 23.9◦. Similarly, when θ2 increases from 12◦ to 12.1◦, the distance error increases from 0 m to 2.68 m. A 0.1◦ error is common for facing direction measurement with motion sensors. The same trend holds for k. When k increases from 2.11 to 2.16, the distance error increases from 0 m to 3.77 m. Due to the limitation of image resolution, given that F1C1 is as large as 1000 pixels, a 0.05 error of k corresponds to roughly no more than a 50-pixel error of F2C2, which is hard to achieve because of the relatively large estimation error of the text bounding box. When θ1 increases, or θ2 and k decrease a little, the situation gets even worse. Hence, due to the limitations of motion sensor precision and image resolution, the potentially huge error makes monocular distance estimation impractical for now.
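For concreteness, the sensitivity analysis above can be reproduced with a few lines of Python. The sketch below is ours, not part of EyeLoc (the function name and print format are assumptions); it evaluates Equation 2.2 at the nominal parameters and then perturbs θ2 by only 0.1◦, which already moves the estimate by several meters.

```python
import math

def circle_shoot_distance(r, theta1_deg, theta2_deg, k):
    """Distance d between the user H and a POI P, following Equation 2.2.

    r: distance from the rotation center H to the camera optical center (meters)
    theta1_deg, theta2_deg: angles between HO and the sightline HP at the two views
    k: ratio of the pixel offsets F1C1 / F2C2 measured on the two view images
    """
    t1, t2 = math.radians(theta1_deg), math.radians(theta2_deg)
    numerator = math.sin(t1) - k * math.sin(t2)
    denominator = math.cos(t2) * math.sin(t1) - k * math.cos(t1) * math.sin(t2)
    return r * numerator / denominator

# Nominal parameters from the analysis above (r = 0.5 m, 24 deg, 12 deg, k = 2.11).
d_nominal = circle_shoot_distance(0.5, 24.0, 12.0, 2.11)
# A 0.1 deg bias on theta2 already shifts the estimate by several meters.
d_biased = circle_shoot_distance(0.5, 24.0, 12.1, 2.11)
print(f"nominal d = {d_nominal:.2f} m, biased d = {d_biased:.2f} m")
```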
Overall, in our smartphone vision system, for a POI, we only use the direction of its EyeLoc sightline as the geometric descriptor to develop an error-controllable localization model (Section 2.2.2). To accurately and efficiently trace the sightline of a smartphone, we further develop several countermeasures in Section 2.2.4 and Section 2.2.5. We leave accurate distance measurement with a monocular vision system as our future work.

2.2.2 Geometric Localization Model

After we obtain the directions of the EyeLoc sightlines of several POIs, the next question is how to construct constant geometric constraints and then figure out the user's position and heading direction on the floor-plan image.

Figure 2.6 Illustration of the model used to localize a user's position on the floor-plan image with 3 observed POIs. (a) and (b) exhibit a constant geometric constraint in both vision space and floor-plan space. (c) shows the model to calculate the user's location with the extracted geometric information.

Let us show the constant geometric constraints through an example. As shown in Figure 2.6(a), H is the user's position. The user observes 3 POIs (e.g., POI1 Miss Sixty, POI2 UGG and POI3 GAP) in their appearance order, and Nv indicates the north direction in vision space. As shown in Figure 2.6(b), H also indicates the user's position. 1, 2 and 3 represent the corresponding centers of the text bounding boxes of the 3 observed POIs. Given the coordinate system X-Y of the floor-plan image, (x1, y1), (x2, y2) and (x3, y3) are the corresponding coordinates of 1, 2 and 3. Nf is the north direction in floor-plan space, which aligns with the Y axis. In vision space, the directions δ1, δ2 and δ3 of the EyeLoc sightlines can be estimated. However, since Nv and Nf may not be aligned with each other, we cannot directly determine the coordinate of H with these directions in floor-plan space.

We have two constant geometric constraints in both vision and floor-plan spaces. The first is that the rotation direction of the circle shoot is constant: the 3 POIs appear in the same order (e.g., POI1 → POI2 → POI3 and 1 → 2 → 3) along the rotation direction. We use d12, d23 and d31 to indicate the rotation direction between each pair of adjacent POIs. The other is that, since the floor-plan image is a scaled version of the physical layout (similar triangles), ∠POI1HPOI2, ∠POI2HPOI3 and ∠POI3HPOI1 equal ∠1H2, ∠2H3 and ∠3H1 respectively. The 3 intersection angles are indicated as θ12, θ23 and θ31. The rotation direction and the intersection angle between any two POIs serve as the constant geometric constraints in both vision space and floor-plan space. Algorithm 2.1 exhibits the details to determine d12, d23, d31, θ12, θ23 and θ31 given 3 POIs and the corresponding directions of their EyeLoc sightlines.

Algorithm 2.1 Geometric Constraints Extraction Algorithm
Input: 3 POIs sorted in their appearance order; the directions δ1, δ2, δ3 of the corresponding EyeLoc sightlines in vision space.
Output: d12, d23, d31, θ12, θ23 and θ31.
1: vector of the POI1 EyeLoc sightline v1 = (sin δ1, cos δ1)
2: vector of the POI2 EyeLoc sightline v2 = (sin δ2, cos δ2)
3: vector of the POI3 EyeLoc sightline v3 = (sin δ3, cos δ3)
4: d12 = v1 × v2, d23 = v2 × v3 and d31 = v3 × v1
5: θ12 = (δ2 − δ1) mod 360◦; θ23 = (δ3 − δ2) mod 360◦; θ31 = (δ1 − δ3) mod 360◦
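A compact Python rendering of Algorithm 2.1 is given below for reference. The helper names are ours, and the cross product is reduced to its scalar z-component, which is all the rotation direction needs.

```python
import math

def geometric_constraints(delta1, delta2, delta3):
    """Rotation directions and intersection angles of Algorithm 2.1.

    delta1, delta2, delta3: EyeLoc sightline directions in degrees,
    measured clockwise from north and sorted in appearance order.
    """
    def sightline_vector(delta):
        rad = math.radians(delta)
        return (math.sin(rad), math.cos(rad))      # (east, north) components

    def cross(u, v):                               # scalar 2D cross product
        return u[0] * v[1] - u[1] * v[0]

    v1, v2, v3 = map(sightline_vector, (delta1, delta2, delta3))
    d12, d23, d31 = cross(v1, v2), cross(v2, v3), cross(v3, v1)
    theta12 = (delta2 - delta1) % 360.0
    theta23 = (delta3 - delta2) % 360.0
    theta31 = (delta1 - delta3) % 360.0
    return (d12, d23, d31), (theta12, theta23, theta31)

# Example: three POIs observed at 30, 150 and 260 degrees during one circle shoot.
print(geometric_constraints(30.0, 150.0, 260.0))
```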
Given the POI coordinates ((x1, y1), (x2, y2), (x3, y3)), the rotation directions (d12, d23, d31) and the intersection angles (θ12, θ23, θ31), we need a method to calculate the coordinate (xH, yH) of the user's position H in floor-plan space. As shown in Figure 2.6(c), given the coordinates of two POIs (e.g., 1 and 2), the rotation direction d12 and the intersection angle θ12, if θ12 is 180◦, H is on the segment between 1 and 2. Otherwise, the possible position of H is on an arc that takes the segment 12 as the chord and θ12 as the inscribed angle. The pixel distance between 1 and 2 is l12, which equals \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}. M is the middle point of the chord 12 and its coordinate (xM, yM) equals ((x1 + x2)/2, (y1 + y2)/2). O12 is the center of the circle and R is the length of its radius. We use (xo, yo) to indicate the coordinate of O12. If θ12 is 90◦, O12 and M have the same coordinate. Otherwise, since 1 and 2 are on the circle, the chord 12 is perpendicular to MO12 and we have the following equation:

\frac{x_1 - x_2}{y_1 - y_2} = -\frac{y_o - y_M}{x_o - x_M} = k \quad (2.3)

where k is the slope of the chord 12. Moreover, the central angle ∠1O122 is twice the corresponding inscribed angle, which equals 180◦ − θ12, and ∠1O12M is half of the central angle ∠1O122. Hence, ∠1O12M = 180◦ − θ12 and R = \frac{l_{12}}{2\sin\theta_{12}}. For the length of O12M, we have the following equation:

\sqrt{(x_o - x_M)^2 + (y_o - y_M)^2} = -\frac{l_{12}}{2\tan\theta_{12}} \quad (2.4)

Combining Equation 2.3 and Equation 2.4, we can obtain (xo, yo) as follows:

x_o = x_M \pm \frac{l_{12}}{2\tan\theta_{12}\sqrt{1 + k^2}}; \quad y_o = y_M \mp \frac{k\,l_{12}}{2\tan\theta_{12}\sqrt{1 + k^2}} \quad (2.5)

Besides O12, we obtain another, false circle center O′12, which is symmetric to O12 with respect to the chord 12. To filter out the outlier O′12, we further exploit the rotation direction d12 and whether θ12 is acute or obtuse. If θ12 is an acute angle, O12 is on the same side as H with regard to the segment 12. Otherwise, O12 is on the opposite side from H. In this way, we can identify the unique coordinate of O12. Algorithm 2.2 summarizes the detailed calculation of O12 and R.

Now we know that H is on an arc determined by O12 and R. Similarly, we can calculate another arc on which H lies using POI2, POI3 and θ23. We then calculate the intersections of these two arcs. One intersection is POI2; the other is the position of H. In this angle-based geometric localization model, a localization bias may be incurred by an unexpected situation: when H, POI1, POI2 and POI3 are on the same circle, we cannot localize H through the geometric constraints of the 3 POIs. This situation rarely happens in practice, as shown in Section 2.4. Moreover, we may be able to observe more than 3 POIs in large shopping malls. If the situation does happen, EyeLoc pops up a message to remind the user to walk several steps and relocalize himself/herself.

When the coordinate of H is known, we can calculate the directions of HPOI1, HPOI2 and HPOI3 in floor-plan space. Then, with δ1, δ2 and δ3, we can calculate the angle offset ∆N between the vision north Nv and the floor-plan north Nf. Given any user's heading direction (e.g., camera facing

Algorithm 2.2 Arc Calculation Algorithm
Input: 2 POIs, POI1 and POI2, sorted in their appearance order; the rotation direction d12; the angle θ12 between the directions of the corresponding EyeLoc sightlines in vision space; the corresponding coordinates (x1, y1) and (x2, y2) in floor-plan space. (If θ12 equals 180◦, H is on the segment between POI1 and POI2, and (xo, yo) and R are set as NULL.)
Output: The coordinate of circle center (xo, yo); the length of circle radius R. 1: pixel distance and slope of the chord ⃗12 as d12 = (cid:112)(x1 − x2)2 + (y1 − y2)2 and k = x1−x2 y1−y2 2: if θ12 equals to 180◦ then 3: 4: else if θ12 > 180◦ then θ12 = 360◦ − θ12. 5: 6: else if θ12 equals to 90◦. then 7: 8: else 9: (xo, yo) = ( x1+x2 ; two possible coordinates (xo1, yo1) and (xo2, yo2) of circle center are calculated R = d12 2 sin θ12 by Equation 2.5. set (xo, yo) as (xo1, yo1) calculate the rotation direction do12 = vector(xo − x1, yo − y1) × vector(xo − x2, yo − y2) if d12 · do12 > 0 ⊕ θ12 < 90◦. then );R = d12/2 , y1+y2 2 2 set (xo, yo) as (xo2, yo2). 10: 11: 12: 13: 14: 15: end if end if direction) in vision space, we can infer his/her heading direction in floor-plan space. Overall, EyeLoc can calculate a user’s position and heading direction by observing no less than 3 POIs. More POIs can further improve the accuracy as introduced in Section 2.2.5. 2.2.3 Raw Data Collection EyeLoc takes three data sources as input: view images captured by the camera, view image attributes measured by motion sensors and the floor-plan image fetched from indoor map providers. The camera and motion sensors work during the circle shoot, while the user is moving and the smartphone may be shaking slightly. To control the consequent measurement errors, as well as the processing latency and overhead, we conduct the raw data collection as follows: 2.2.3.1 View Images Two system parameters are crucial to image shooting. One is the image resolution Ir. The higher its value is, the text of more POIs can be accurately detected and recognized. However, the processing time also increases when Ir becomes high. Empirically, EyeLoc fixes Ir as 1536p 17 during the circle shoot. The other parameter is the shooting frequency fs, which means the interval between two adjacent view images is 1 fs . A high fs ensures all surrounding POIs can be recorded when the rotation speed of a user is fast. Redundant view images also have a negative influence on processing time. EyeLoc selects a relatively high fs to guarantee the abundance of raw data. Later in Section 2.2.4 a smaller resolution Ip will be introduced for further image filtering and complement Ir and fs. 2.2.3.2 Motions Sensor Readings In most cases, when a user operates circle shoot, the facing direction of the user and his/her smartphone camera is the same as illustrated in Figure 2.7. We define the EyeLoc sightline of an object as the virtual line between the object and the user as shown in Figure 2.7. The direction of a sightline δ is the angle between earth north and the projected direction Z′ of the smartphone Z axis, which is measured through the estimation of the camera facing direction. Figure 2.7 Illustration of the camera facing direction δ measurement. We use the motion sensors (e.g., accelerometer, gyroscope and compass) to capture the camera facing direction. EyeLoc continuously samples the readings of the motion sensors. We collect the direction of gravity via the acceleration sensors along 3 smartphone axes (e.g., X , Y and Z ), and determine the direction of north with compass sensors to calculate the direction of Z in the earth coordinate system. As a result, the direction of Z′ and δ are calculated correspondingly. To remove potential magnetic interference and bursty noise, EyeLoc adopts several methods [35] [36] 18 GravityEarthNorthEarthEastXYZZ’!Smartphone to calibrate the camera facing the direction of each view image. 
2.2.3.3 Floor-plan Image Given the coarse GPS readings in a shopping mall, EyeLoc can fetch the floor-plan images of all floors in the shopping mall through APIs of indoor map providers. Each floor-plan image contains the skeleton and name of all POIs on that floor. Overall, the raw data collection module outputs a series of view images, corresponding Eye- Loc sightline directions and floor-plan images. However, the redundancy and measurement errors existing in raw data will incur computation inefficiency and localization error. Next, we introduce the methods to improve the efficiency and robustness. 2.2.4 POI Extraction For the floor-plan image, we use text detection/recognition techniques to collect POI signs and record their coordinates on the floor-plan image as shown in Figure 2.1(c). As for view images, our goal is to detect all available POI signs and determine corresponding sightlines. The angle formed by two EyeLoc sightlines is critical for position matching in Section 2.2.5. The key issue is how to achieve real-time performance on smartphones. 2.2.4.1 View Image Outlier Filtering During raw data collection, we obtain abundant view images and motion sensor readings. As the user is moving while the camera and motion sensors are working, some data contain errors that can significantly influence the localization result. Filtering out those data can also save the processing time. If a view image is blurred, we cannot detect any text at all. We treat blurred view images as outliers that should be filtered out. EyeLoc adopts Laplacian-based operator [37], which is a widely used function for focus measure, to define the degree of image blur. We randomly selected 1692 view images from the whole data set shot in two large shopping malls (Section 2.4). We guarantee there is at least one POI sign in each of these view images. In Figure 2.8, the black curve shows the Laplacian variances of these images are distributed from 0 to 400. The larger the variance is, the 19 Figure 2.8 The influence of image blur on the accuracy of text detection. less the image blur is, as shown in the comparison between the example view images with variance in the range [0,20] and [300,320]. In a view image, if the length of any recognized text string is more than 2, it is text-detectable. We further select 15 view images from each level of image blur to evaluate the probability of text detection under different levels of image blur. The red curve shows that the probability of text detection is higher than 80% when the Laplacian variance is larger than 80. Hopefully, we should filter out those view images whose Laplacian variance is less than 80 because it is hard to detect any text clues from it. Thus, we define a threshold ∆Lap (e.g., approximate 80) to determine whether a view image is blurred or not. Moreover, the smartphone vibration around Y axis and Z axis can influence the position of a POI on view images. As a result, the estimation error of EyeLoc sightline may increase. The gyroscope outputs the angular velocity around 3 axes as ωx, ωy and ωz, then the total angular (cid:113) velocity is ω = z . On the other hand, we can also calculate the angular velocity ω ′ given the Z′ direction of two adjacent view images and the corresponding time interval. We conduct x + ω 2 y + ω 2 ω 2 a circle shoot by setting fs as 2Hz. During circle shoot, we manually vibrate the smartphone as a common user does when the picture ID is from 15 to 18 and from 26 to 30. 
As shown in Figure 2.9, the angle velocity difference between ω and ω ′ is close to zero as usual. However, smartphone vibration will obviously increase the difference. This observation indicates different 20 0 0.2 0.4 0.6 0.8 1 0 50 100 150 200 250 300 350 400CDFProbability of Text DetectionImage Laplacian VarianceCDFText Detection motion sensors have different sensitivity for the vibration. Hence, EyeLoc sets a threshold ∆ω and filters those view images when the angular velocity difference is larger than ∆ω . Figure 2.9 The angle velocity measured by different combination of motion sensors. 2.2.4.2 Text Filtering and Matching In large shopping malls, text may appear or extract anywhere. It is possible to detect multi- ple text strings from a view image. Figure 2.10 shows the case in the office environment. The top figures are RGB scene pictures and the bottom figures exhibit the red text bounding boxes on corresponding binary images. We can see that besides the desired text bounding box of “STAR- BUCK”, many redundant ones are also extracted from the textures of curtains, tables and switches. Moreover, in Figure 2.10(a), only part of “STARBUCK” are recognized. Situations in shopping malls can be more complex since the name of a POI may appear at multiple places. EyeLoc filters out the irrelevant and duplicate text bounding boxes through the following steps. First, given the minimum and maximum length of POI names extracted from floor-plan images, EyeLoc filters out the illegal text strings. Second, we group the rest of the text strings. Two text strings belong to the same group when the difference between them is smaller than a threshold ∆t. The difference between the two text strings is defined as the ratio between their Levenshtein Distance [38] and the maximum string length. Given the list of POI names extracted from floor- plan images, EyeLoc further removes those invalid groups whose text strings are not on the list 21 -10 0 10 20 30 40 50 60 0 5 10 15 20 25 30 35 40Angle Velocity (°/s)Picture IDAccelerometer+CompassGyroscope Figure 2.10 Landmark identification and text bounding box. (i.e., the similarity is smaller than ∆t in comparison with any POI name). Finally, in each valid group, EyeLoc combines the coordinates of all text bounding boxes to calculate the average value as the unique text bounding box position of the observed POI on the view image. In this way, EyeLoc identifies available POIs and corresponding positions of text bounding boxes on a view image. 2.2.4.3 Sparse Image Processing After filtering the outliers of view images, for all observed POI, we need to exactly find the view images (e.g, Figure 2.4(c)) where the corresponding text bounding boxes appear in the middle. The intuitive approach is to process all view images, but this will incur heavy networking and computation burden as the sampling frequency fs is set high. Even worse, the desired view image may not be captured or blurred. Instead of processing every view image to extract the geometric information of all potential POIs, EyeLoc develops a sparse image processing approach to achieve the same goal. The key idea is after the position of a text bounding box is known from a view image (e.g., Figure 2.4(b)), we can enable EyeLoc sightline estimation of a POI with one more view image which contains the same POI (e.g., Figure 2.4(d)) by feature point matching. As shown in Figure 2.11, given two view images I1 and I2, the text bounding box of I1 is extracted and it is d1 pixel from the middle line. 
Then, in I2, we use the ORB algorithm to extract the feature points that fall into the text bounding box of I1. Given a feature point, its coordinates on I1 and I2 are (x1, y1) and (x2, y2). The pixel distance of the feature point is \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}. The average pixel distance over all matched feature points is denoted as lf. Due to the approximately constant ratio between pixel distance and central angle, given the camera facing directions δ1 and δ2 of the two view images, we can calculate the direction δ of the POI EyeLoc sightline as follows:

\delta = \delta_1 + \frac{d_1}{l_f}(\delta_2 - \delta_1) \quad (2.6)

When we recognize the text bounding box of a POI from a view image, we use its adjacent images, which probably contain the same POI, to calculate the direction of the POI EyeLoc sightline with Equation 2.6.

Figure 2.11 Feature point matching in two different view images of the same POI.

To extract the geometric information of all observed POIs, the problem becomes how to quickly target a view image for each observed POI. Given n view images {I1, I2, ..., In}, we set a step length ∆s, and the view images {I∆s, I2∆s, ..., Ik∆s} (k = ⌊n/∆s⌋) are selected for processing. Hopefully, if the minimum number of view images of a POI is larger than ∆s, EyeLoc cannot miss the view image of any observed POI. However, due to possible failures of text recognition, we may miss some POIs so that not enough POIs are extracted. In this case, EyeLoc exponentially reduces ∆s and reprocesses the newly selected view images until at least 3 POIs are extracted or all view images are processed. Overall, we can fetch enough available POIs and the corresponding EyeLoc sightline directions as soon as possible for later location matching.
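The sightline refinement of Equation 2.6 can be sketched with OpenCV's ORB matcher as below. This is our own illustrative framing (the function name, argument layout and the signed offset d1 are assumptions), not EyeLoc's actual Android implementation.

```python
import math
import cv2

def interpolate_sightline(img1, img2, box, d1, delta1, delta2):
    """Refine a POI sightline direction with Equation 2.6.

    box: (x, y, w, h) text bounding box on view image img1, in pixels
    d1: signed pixel offset of the box center from the middle line of img1
    delta1, delta2: camera facing directions (degrees) of img1 and img2
    """
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    x, y, w, h = box
    shifts = []
    for m in matches:
        (x1, y1) = kp1[m.queryIdx].pt
        (x2, y2) = kp2[m.trainIdx].pt
        if x <= x1 <= x + w and y <= y1 <= y + h:    # keep points inside the box
            shifts.append(math.hypot(x1 - x2, y1 - y2))
    if not shifts:
        return None
    l_f = sum(shifts) / len(shifts)                  # average pixel displacement
    return delta1 + (d1 / l_f) * (delta2 - delta1)   # Equation 2.6
```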
Next, we remove the potential estimation errors to achieve accurate location matching.

2.2.5 Position Matching

According to the localization model in Section 2.2.2, the user's position can be localized with three observed POIs, called a localization tuple. Figure 2.12 shows the measured POI coordinates of a localization tuple (e.g., POI1, POI2, POI3) and the calculated user's position H. The corresponding measured intersection angles θ12, θ23 and θ31 are 120◦. The POI text bounding boxes in the floor-plan image may not exactly align with those in physical space. Due to the POI coordinate errors, we assume the true position of a POI may appear on a circle around it whose radius is Re. As shown in Figure 2.12(a), when moving POI1 around a circle with a radius of 3 pixels and keeping the positions of the other POIs fixed, the calculated positions are shown as the green marks in the figure. Moreover, due to possible errors from the motion sensors and image processing, θ12 may be inaccurately measured in comparison with the true intersection angle θ′12, as shown in Figure 2.12(b). The same situation may happen for θ23 and θ31. We assume the errors of θ12, θ23 and θ31 are in the range [−∆θ, ∆θ]. For θ12, Figure 2.12(b) shows the possible true positions as the green marks when ∆θ is 10◦. Regarding the errors of POI1 and θ12, the maximum localization error is indicated as de. The ratio between de and Re or ∆θ is further defined as the error sensitivity of a POI or an intersection angle. Given a localization tuple u, we define its localization error sensitivity les(u) as the sum of the error sensitivities of all three POIs and all three intersection angles.

Figure 2.12 The localization sensitivity regarding (a) POI error and (b) direction error.

When k (k ≥ 3) POIs are extracted, we have in total m = k(k − 1)(k − 2)/6 localization tuples indicated as {u1, u2, ..., um}. For the i-th tuple ui, its localization result and error sensitivity are indicated as hi and les(ui). The larger les(ui) is, the less accurate hi tends to be. Hence, EyeLoc sets the weight wi of the localization result hi as 1/les(ui). Then, the final matched location h is calculated as follows:

h = \frac{\sum_{i=1}^{m} w_i h_i}{\sum_{i=1}^{m} w_i} \quad (2.7)

With h and the k extracted POIs, EyeLoc can calculate the user's heading direction according to the method in Section 2.2.2.

2.3 Implementation

We implement EyeLoc as a mobile application on Android 7.0. Figure 2.13 demonstrates the user interface (UI) of the EyeLoc application when we conduct experiments in a shopping mall. As shown in Figure 2.13(a), after a user opens the EyeLoc application, the view captured by the camera appears on the smartphone screen. The view keeps refreshing during the circle shoot. A vertical white line appears in the center of the screen as the reference. Once a POI appears during the circle shoot, EyeLoc extracts its text string from the view image. If the text bounding box is aligned with the sightline of the smartphone camera, a green checkmark appears on the screen as in Figure 2.13(b). In this way, users can simply record as many POIs as possible. After the user finishes the circle shoot, EyeLoc exhibits the user's position and heading direction on the floor-plan image as shown in Figure 2.13(c). We discuss several system details and settings as follows.

Figure 2.13 The UI of EyeLoc when running in a large shopping mall.

2.3.1 Scene Text Detection and Recognition

Scene text detection and recognition techniques serve a fundamental role in EyeLoc. We compare several existing techniques on Android smartphones in terms of recognition accuracy and processing time. Here, we adopt the same method as in Section 2.2.4.2 to judge the similarity between two text strings, and ∆t is set as 50%. We randomly select 100 images shot by a smartphone in two large shopping malls. Some of them are shot in the daytime and the others are shot at night. Each image contains one shop sign, which is manually labeled as the ground truth.

2.3.1.1 Local processing vs. cloud processing

According to the processing platform, we can either perform the text detection and recognition processes locally on the smartphone or remotely on the cloud. The approaches for local processing include OpenCV [32] and Tesseract [33]. We choose Baidu Cloud as the platform for typical cloud processing. An LTE network, which is available in most shopping malls nowadays, is adopted to connect the smartphone with the cloud server. Given the dataset of 100 images, the recognition accuracy and processing time are shown in Figure 2.14. We can see that the text recognition accuracy of Baidu Cloud is 66%, which is much higher than the 12% of Tesseract and the 2% of OpenCV. The text recognition accuracy of Tesseract and OpenCV is surprisingly low, since it is challenging for their text classifiers and extreme region extraction to adapt to the complex lighting conditions and text formats of POI signs. The average processing time of Baidu Cloud is 2 s, which is a little higher than that of OpenCV but much smaller than that of Tesseract. Due to the superior recognition accuracy and low processing time, we choose cloud processing instead of local processing.
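As a back-of-the-envelope check of this choice, the expected time to obtain one successful recognition is the average processing time divided by the recognition accuracy. The short sketch below applies this metric to the accuracies reported above; the helper name is ours and the Tesseract and OpenCV latencies are placeholders, since the text only reports them relative to Baidu Cloud.

```python
def expected_time_per_recognition(avg_time_s, accuracy):
    """Average time spent per successfully recognized image."""
    return avg_time_s / accuracy

engines = {
    "Baidu Cloud": (2.0, 0.66),   # (average processing time in seconds, accuracy)
    "Tesseract": (20.0, 0.12),    # hypothetical latency; the text only says "much larger"
    "OpenCV": (1.5, 0.02),        # hypothetical latency; the text only says "a little lower"
}
for name, (t, acc) in engines.items():
    print(f"{name}: {expected_time_per_recognition(t, acc):.1f} s per recognized sign")
```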
Figure 2.14 Different text recognition approaches.

2.3.1.2 The influence of image resolution

We further explore the performance of Baidu Cloud by using different image resolutions. We vary the resolution of the 100 images from 180p to 1536p. The performance is shown in Figure 2.15. We can see that both the text recognition accuracy and the processing time increase with the image resolution. When the image resolution is 720p, the text recognition accuracy and average processing time are 54% and 0.74 s. In comparison, the text recognition accuracy and average processing time increase to 72% and 1.89 s when the image resolution increases to 1536p. EyeLoc chooses Ip to keep the text recognition accuracy higher than 60%. Meanwhile, EyeLoc minimizes the expected time for successfully recognizing a POI, which is the ratio between the average processing time and the recognition accuracy. Hence, we set Ip as 1080p. For outlier view image filtering, according to the observations in Figure 2.8 and Figure 2.9, we set ∆Lap and ∆ω as 80 and 10◦/s.

Figure 2.15 Different image resolution choices.

2.3.2 Circle Shoot Operation

We set fs as 2 Hz, namely EyeLoc shoots 2 view images per second. The step length for sparse image processing ∆s is set to 3. According to our empirical experience, 20% of view images are observed to be blurred (Figure 2.8) and 37% of 1080p view images encounter text recognition failure (Figure 2.15), so it is better to obtain at least 6 view images of a POI (e.g., 3 s) to ensure the reliability and efficiency of POI extraction. That means if there are 5 POIs around a user, the circle shoot will take at least 15 s.

2.3.3 Floor-plan Images

Figure 2.16 Two experiment positions in two shopping malls: (a) Shopping Mall 1, localization error 1.56 m; (b) Shopping Mall 2, localization error 3.64 m.

EyeLoc generates high-resolution indoor floor-plan images from Gaode Maps. Since the font of POI texts on floor-plan images is in regular print format, the text recognition accuracy is close to 100%. The resolution of the floor-plan image is set to 2560×1440. On the floor-plan image, however, the texts of different shops are often placed in the middle of the shops' blocks, while the POIs captured by EyeLoc are shop signs which are often located at the entrances. The resulting direction and coordinate errors can further lead to localization errors. To mitigate the influence of the floor-plan error, we refine the shop coordinates with fine-grained floor-plan data fetched from Gaode Maps. The floor plans are shapefiles containing several image layers, depicting shops, roadmaps, and doors. In our experiments we directly use the coordinates of the "doors" as the locations of POIs (e.g., the blue circles in Figure 2.16).
The coordinates acquired this way are more accurate, which is beneficial to the localization quality. Moreover, for error sensitivity estimation, we empirically set ∆θ and Re as 10◦ and 20 pixels. The threshold of text string similarity ∆t is set to 50%. Overall, Table 2.1 summarizes all system parameters of EyeLoc.

Symbol | Description | Value
Ip | The resolution of view image | 1080p
fs | The sampling frequency of view image | 2 Hz
∆Lap | The threshold of view image blur | 80
∆ω | The threshold of angle velocity difference | 10◦/s
∆t | The threshold of text string similarity | 50%
∆s | The step length of sparse image processing | 3
∆θ | The error of intersection angle estimation | 10◦
Re | The error of extracted POI coordinates | 20 pixels

Table 2.1 Summary of system parameters.

2.4 Evaluation

We evaluate EyeLoc with different smartphones (e.g., MI 5 and Huawei Mate 7) in an office environment¹ and two large shopping malls². The office environment is a 7m×9m office room. We print 6 shop signs such as NIKE on A4 papers, then hang them on the wall or curtains in clockwise order. The area of each floor in the two large shopping malls is 7,500 m2 and 10,000 m2 respectively. The area of the semi-outdoor Outlets is about 70,000 m2, and the distance between adjacent shops and the width of corridors there are much larger than in the office and shopping malls. We invited two volunteers (male, 20-30 years old) to complete all the experiments both in the daytime and at night. User 1 uses MI 5 and User 2 uses Huawei Mate 7. The two users exhibit different habits as User 1 turns faster than User 2. In the office environment, we uniformly split the office into 18 areas. Users stand at the center of each area to perform circle shoots. The ground truth is obtained through a laser rangefinder. In the two large shopping malls and the semi-outdoor Outlets, we mainly select positions near entrances, elevators and bathrooms where users have high localization demands. For each of the 16 positions we evaluated, we also use the laser rangefinder to measure the ground truth. The minimum and maximum distances between the user and a POI are 2.26 m and 37.4 m in our experiments.

¹ Demos in the office environment: https://youtu.be/v7CT6gTBNEc, https://youtu.be/wCu8STdRG_c, https://youtu.be/iT5pdjO6RVk
² A demo in a shopping mall: https://youtu.be/iHh0R8TkNLo

Figure 2.17 The reliability of EyeLoc: (a) the CDF of localization error in different environments; (b) the CDF of facing direction error in different environments.

2.4.1 The Accuracy of Localization and Heading Direction Estimation

We first discuss the reliability of EyeLoc. As shown in Figure 2.17a and Figure 2.17b, in the two large shopping malls, the median errors of localization and heading direction are 2.6 m and 10.5◦. The 90-percentile errors of localization and facing direction increase to 4 m and 20◦. In the office environment, the 90-percentile errors of localization and facing direction are 1.1 m and 8◦, which are much better than the performance in large shopping malls. There are two reasons. First, in the office environment, we can precisely measure the POI coordinates on the floor-plan images.
However, the POI coordinates suffer larger error due to the position mismatch between the text bounding boxes and the observed POI signs on the floor-plan images obtained from indoor map 30 0 0.2 0.4 0.6 0.8 1 0 1 2 3 4 5 6 7CDFLocalization Error (m)Office EnvironmentLarge Shopping MallsOutlets 0 0.2 0.4 0.6 0.8 1 0 5 10 15 20 25 30CDFHeading Direction Error (˚)Office EnvironmentLarge Shopping MallsOutlets providers. Moreover, in the office environment, usually we have only one POI in a view image. However, in large shopping malls, several signs of the same POI may appear in a view image, which will introduce error into the sightline estimation. Given the area as large as 7,500m2 and 10,000m2, 4m and 20◦ is still relatively accurate for the most location-based services. In 70,000 m2 semi-outdoor Outlets, the 90-percentile errors of localization and facing direction are 5.97m and 20◦. The error of facing direction is comparable with that in shopping malls. The localization error, however, is getting large. The reason is that the distance between user and POIs is larger in Outlets than that in shopping malls. A small estimation error of POI direction can lead to more position estimation errors. Hence, although the facing direction accuracy is similar, the localization error increases in a larger space. 2.4.1.1 The influence of the number of observed POIs (a) The influence of the number of observed POIs (b) The influence of smartphone hardware Figure 2.18 Different influencing factors. In position matching, EyeLoc utilizes the POI redundancy to mitigate the measurement er- rors of POI coordinates and EyeLoc sightlines. Figure 2.18a shows the relationship between the number of observed POIs and the localization error. We can see the average localization error decreases as the number of observed POIs increases. Specifically, the localization error decreases by 33.7%/34.9% when the number of observed POIs increases from 3 to 4/5 in large shopping malls respectively. This indicates the error mitigation approaches are effective for improving the 31 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5OfficeMallLocalization Error (m)3 POIs4 POIs5 POIs 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5MI 5Huawei Mate 7Localization Error (m) localization accuracy. Moreover, 5 or more POIs cannot provides more useful information for improving the localization accuracy rather than 4 POIs. 2.4.1.2 The influence of POI distribution Figure 2.19 The distribution of the localization error in office environment. We evaluate the influence of POI distribution on the localization error. Figure 2.19 shows the localization error distribution of 18 experiment positions given the positions of 6 POIs. The darker the color is, the higher the localization error is. We can see the localization error is getting higher when the distance between the user’s position and POIs is large. Moreover, the top left area is higher than its surrounding area, that is because it is close to the circle formed by several POIs. According to Section 2.2.2, when the user’s position and POIs tend to be on the same circle, the accurate localization will be hard to achieve. Moreover, we give two experiment positions in shopping mall 1 (Figure 2.16(a)) and shopping mall 2 (Figure 2.16(b)). The white areas are roads and the yellow blocks are shops. The green points are the results of EyeLoc localization. The red points are the ground truths. The localization error of the position in shopping mall 1 is 1.56m, but that of the other one is 3.64m. 
From the POI distribution on the floor-plan images, we can see the large error in Figure 2.16(b) is because the true position, GLORIA, AFU and MOFAN tend to on the same circle. In contrast, the position in Figure 2.16 has more POIs which are close to it and uniformly distributed. Hence, the results suggest that the localization error is indeed related to the distribution of surrounding POIs, especially when the user’s position and observed POIs are on the same circle. However, the situation only happens twice among total 29 positions. If the situation 32 (m) happens, we will ask the user to walk a short distance and relocalize himself/herself again. 2.4.1.3 The influence of smartphone hardware We further evaluate the localization error by using different smartphones at the same positions in large shopping malls. The results are shown in Figure 2.18b. We can see the average localization error is 2.66m for MI 5 and 2.51m for Huawei Mate 7. Since User 2 turns slower than User 1, the more redundant view images make the localization error variance of Huawei smartphone is smaller than MI 5. Overall, EyeLoc works well on both smartphones and does not depends on any smartphone specific hardware and parameters. 2.4.2 Processing Efficiency Figure 2.20 The processing efficiency with outlier filtering and sparse processing. Another important metric is the processing time, which is from the end of circle shoot to the user’s position is shown on the screen. Since the view image processing dominates the overall processing time, we designed two approaches, namely, outlier image filtering and sparse image processing. We evaluate the processing time of three methods: “All” indicates we process all view images; “Filter” indicates we only process the view images after filtering the outliers. “Fil- ter+Sparse” indicates we combine outlier filtering and sparse processing approaches. As shown in Figure 2.20, we can see “Filter+Sparse” outperforms the other two methods and its median pro- cessing time is 18.2s which is 55.5% shorter than “All”. Moreover, the median processing time of “Filter” is 27.3s which is 36.2% shorter than “All”. That verifies both outlier filtering and sparse 33 0 0.2 0.4 0.6 0.8 1 0 10 20 30 40 50 60 70CDFProcessing Time (s)AllFilterSparse+Filter processing are effective to improve the processing efficiency. In the worst case, the processing time of “Filter+Sparse” is 37.4s. That means the user can obtain the localization result in no more than half a minute after circle shoot. Figure 2.21 further shows the processing time on different smartphones. Figure 2.21 The comparison of processing time on different smartphones. 2.4.3 Energy Efficiency We evaluate the energy efficiency of EyeLoc with Battery Historian [39], which is a tool to analyze battery consumption using Android “bugreport” files. We decompose EyeLoc to 3 function modules which are image recording, image processing and location calculation. To evaluate the energy efficiency of each function module, we set them in infinite loops and continuously run them for an hour. For each experiment, we charge MI 5 to 100% at the beginning. We also evaluate the total energy efficiency of EyeLoc with the same method. For each experiment, we repeat it 3 times and obtain the average value. The results are shown in Table 2.2. We can see that among different function modules, the energy cost of image recording is 1.94 W which is much higher than that of image processing (0.34 W) and location calculation (0.65 W). 
Hence, the network communication of image processing and computation of location calculations is much more energy-efficient than shooting images with the camera. In our implementation of EyeLoc, we use multi-thread to overlap image recording and processing. The circle shooting usually dominates the operation time of EyeLoc. As a result, the total energy efficiency of EyeLoc is 2.17 W which 34 0 10 20 30 40 50 60 70MI5HuaweiProcessing Time (s)AllFilterSparse+Filter is mainly spent on imaging shooting. Function Module Image recording Image processing Location calculation Total EyeLoc Energy Efficiency (W) 1.94 0.34 0.65 2.17 Table 2.2 Energy efficiency of EyeLoc. 2.4.4 Localization Performance Comparison To compare the localization performance, we implement Sextant [29] and evaluate its perfor- mance in the semi-outdoor Outlets. For each shop, we take 3 high-quality images to construct the visual clue and localization map. During evaluation, users are well trained and follow the rules of Sextant to shoot images. For example, users manually keep the POI in the middle of the image. The performance comparison is shown in Figure 2.22. We can see the 90-percentile localization error of Sextant is 6.6m which is 10% higher than EyeLoc. The reason is that EyeLoc can auto- matically capture the angle of POI which is less accurate by using the user dependent method of Sextant. If users are not well trained, the localization error of Sextant could be increased since the increasing error of POI angle estimation. Figure 2.22 The localization performance comparison between EyeLoc and Sextant in semi- outdoor Outlets. 35 0 0.2 0.4 0.6 0.8 1 0 1 2 3 4 5 6 7CDFLocalization Error (m)SextantEyeLoc 2.5 Related Work In recent years, many works target on developing efficient indoor localization and positioning systems. However, most of them need some site surveying. Some of them even need custom hard- ware. According to different types of data sources, we divide existing works into four categories. • Wi-Fi Signal Many indoor localization methods are proposed based on Wi-Fi signals. One approach is fingerprinting-based. The Wi-Fi signal patterns serve as the fingerprint that represents every location. The system manager builds a fingerprint database in the target areas to initialize localization service. The user’s location is estimated by matching the measured fingerprint to database records. RADAR [40], Horus [41], Place Lab [42], PinLoc [43] and Smart2 [44] use site survey to construct the fingerprint database. LIFS [45] and Zee [46] further utilize crowdsourcing to alleviate the burden of labor-intensive site survey. A theoretical analysis around how good a performance the RSS fingerprinting can achieve is provided in [47]. The other approach is model-based. The basic principle is the relationship between the geo- metric structure from Wi-Fi access point (AP) to user’s location and the physical features of the received Wi-Fi signal can be modeled. If the location of Wi-Fi AP is pre-known, the user’s loca- tion can be inferred. Based on the log-distance path loss (LDPL) model, EZ [48] uses the received signal strength (RSS) to estimate the signal propagation distance and combines several estima- tions of different APs to find the user’s location. SpinLoc [49] and Borealis [50] observe if a user faces an AP, the RSS is usually higher than the user turns his back on the AP. After making a full 360◦ turn, SpinLoc and Borealis extract the angle-of-arrival (AoA) of several APs to determine the user’s location. 
ArrayTrack [51] uses antenna array and Wi-Fi signal phase to obtain accurate AoA spectrum to calculate a user’s location. CUPID [52], SAIL [53], Chronos [22] and Ubicarse [54] further refine the distance and AoA measurement methods to achieve high localization precision or adapt to COST Wi-Fi AP. SpotFi [55] proposes a super-resolution AoA algorithm to extract AoAs from Channel State Information (CSI). • Visible Light In a typical visible light positioning (VLP) system, lamps (fluorescent and LED) are served as landmarks. After a light receiver (smartphone camera or photodiode) obtains sev- 36 eral lamps’ location, the light receiver further measures the geometric structure from the observed lamps to find the user’s location. Luxapose [24] takes an image which contains several LEDs as input and fetches LEDs’ AoA to calculate the user’s location. According to inherent and com- mon optical emission features of both LED and fluorescent, iLAMP [25] and Pulsar [56] identify a lamp’s location from a pre-configured database by feature matching. iLAMP further combines camera image and inertial sensors to infer user’s location. Pulsar utilizes a custom device to mea- sure the lamp’s AoA, then determine user’s location. The other approach is to customize the lamp to establish a mapping function between the loca- tion of light receiver and the corresponding received light physical features. CELLI [57] develops a custom LED bulb which projects a large number of fine-grained light beams toward the service area. CELLI adopts a modulation method to encode the coordinate of a fine-grained cell into the corresponding light beam. Thus, the light receiver can obtain its location by visible light commu- nication. SmartLight [26] uses LED array and a lens to form the light transmitter. On LED array, different LED lamps use different PWM frequencies. According to the frequencies of the received light, the coordinate of the observed LEDs circle can be inferred on LED array. The location of light receiver can be further calculated by optical geometric translation function of the lens. • Scene Image Using scene images containing landmark details and architectural features is also a popular direction. SLAM (Simultaneous Localization and Mapping) aims to output a map as well as the user’s real-time location. Both 2D and 3D positions are acceptable. We especially introduce SLAM with the camera as its main sensor in this paper. A single image and a floor plan is used as the input in [18] and the main technology is Markov random field. SfM (Structure from Motion) builds a 3D model from 2D images or video. Based on the 3D model, we can know the camera pose and shooting location given another image. iMoon [58] and Jigsaw [59] adopt SfM to construct indoor 3D model and enable localization services. Sextant [29] builds a geometrical model with static reference points and works for a lightweight site survey. All these papers need a collection of images for site survey, which EyeLoc avoids. • Others Magicol [30] and FollowMe [60] combine the geomagnetic field and user trajectory as 37 the fingerprint to localize user’s location. With acoustic speakers as the landmarks, Swadloon [61] and Guoguo [62] use acoustic signal based geometric model and inertial sensors to localize user’s location. Shenlong Wang, etc. [18] utilize the floor-plan image and a scene image to localize a user in large shopping malls. 
Based on edge, text and layout features of a scene image, they use Markov random field model to infer the camera pose on the floor-plan image. However, they need to search through all possible positions which further incurs huge computation complexity. Hence, it is not practical on COTS smartphones. To conclude, compared with these methods, as shown in Table 2.3, EyeLoc depends on neither pre-deployed infrastructure nor pre-collected information. Moreover, EyeLoc does not depend on any custom hardware and can be implemented as a smartphone application. EyeLoc can achieve the comparable accuracy of localization and heading direction in real large shopping malls which are much larger than the prototype deployment of the existing indoor localization systems. Technology Site Survey Custom Hardware Range (m2) Localization Error (m) Heading Error (◦) LIFS [45] SAIL [53] SmartLight [26] iLAMP [25] iMoon [58] Sextant [29] WiFi fingerprint WiFi AP location Lamp modification Lamp fingerprint Environment image Environment image FollowMe [60] Geomagnetic fingerprint √ √ 1,600 2,800 16 8 1,100 11,250/60,000 2,000 EyeLoc FREE 7,500/10,000/70,000 9 2.3 0.5 0.032 2 20 2 5.97 NA NA NA 2.6 6 NA NA 20 Table 2.3 Summary of existing indoor localization systems. 2.6 Conclusion In this chapter, we propose EyeLoc, a plug-and-play localization system for large shopping malls without the burden of system bootstrap nor calibration. EyeLoc enables the smartphone to imitate human self-localization behavior. After a user opens the EyeLoc application, he/she carries out the circle shoot and the smartphone continuously shoots view images meanwhile. After that, EyeLoc automatically projects the user’s position and heading direction onto the floor-plan image. The evaluation results show that the 90-percentile accuracy of localization and heading direction is 5.97m and 20◦. Moreover, EyeLoc can be extended to other environments (e.g., office building, train station, airport) where floor-plan and indoor texts are available. We will extend EyeLoc to 38 these environments in the future. EyeLoc achieves a good balance among cost, computational power and real-time responses. While alternative localization methods can be developed for large shopping malls, EyeLoc better fulfills the constraints of practical situations such as the low budget requirement of application providers and the low latency requirement of users. 39 CHAPTER 3 A ROBUST SOUND SOURCE LOCALIZATION SYSTEM FOR VOICE ASSISTANTS Acoustic signals serve significant location information. For example, underwater sonar system can detect shipwrecks and fishes, medical ultrasonography images internal body structures. As IoT prospers, a new trend of acoustic localization application arises: sound source localization for voice assistants. To the benefit of Smart Home and Smart Office, machines need ears to perceive the environment. If voice assistants like Amazon Echo, Google Home or Apple HomePod can locate a person based on the speech he/she utters, they can better understand the context of commands and deliver more considerate tasks. For example, with user location known, voice assistants can send commands to TV screen or lights to adjust angles, volume or magnitude, after which the smart system provides better user experience and is more energy-saving. 
In some cases when voice assistants fail to understand commands directly from speech due to poor recording quality or ambiguous speech, location information can narrow down the possibilities and increase the accuracy of speech recognition. Last but not the least, passive localization woken up by speech is more friendly to user privacy and energy conservation compared to continuous camera monitoring, especially for places like fitting rooms and high-security labs. Although sound source localization can empower voice assistants with fancy functions, it is challenging to determine the location of the sound source in three dimensions: azimuth, eleva- tion and distance. Auditory distance perception is mainly achieved through monaural features like intensity loss or frequency loss, while Direction-of-Arrival (DoA) estimation uses binaural cues like phase difference or Time Difference of Arrival (TDoA) [63]. Although auditory distance per- ception and DoA estimation are developing into two subfields now and struggling with their own issues, they have a close relationship and sometimes share new ideas. Here we provide a compre- hensive research survey to discuss the main challenges to implement a sound source localization system for voice assistants and the reasons behind, which can also be seen as the history of sound source localization: • Auditory distance perception is extremely challenging when the signal of interest is un- 40 known. Theoretical research[64] has quantified loss of intensity and change of frequency as dis- tance increases in the ideal case. However, to put the theory into practical applications, we need to know the initial intensity and spectral of the sound. Although voice assistants know their wake- up commands, they stay blind to the intensity and spectral shape of the original sound all the time. Experiments in [65] further shows that accurate distance judgements are unlikely to happen without intensity and spectral shape recorrelated with the environment on the basis of experience. Recent research make progress from three perspectives. (1) A machine learning model is proposed in [66] to memorize variation patterns contained in training data. The output falls into different distance classes (0 m, 0.5 m, ..., 3 m). However, the localization accuracy is constrained by the granularity of labels. Experiments further reveal the performance significantly decays when the system is trained on data collected from room A while tested on data of room B. (2) By combining auditory cues with visual cues, distance perception can be achieved. (3) Some IoT systems allow prior information of the transmitted signal. Earlier works [6, 67] achieve acoustic ranging between smartphones using impulses. Later Frequency-modulated Continuous-Wave (FMCW)[68, 69] is introduced to track smart devices. • The accuracy of DoA estimation shows instability due to multipath effect and background noise. By analyzing the difference of signals received by two microphones, we can estimate the DoA according to Far-Field Effect[70, 71]. One single DoA will lead to Cone of Confusion[72]. With three precise DoAs from multiple non-collinear microphone pairs, we can retrace the 3D location through intersection of three cones. Unfortunately, a minor fluctuation of DoA estimation leads to non-ideal retrace. Over the years, the majority of research on sound source localization focus on improving the accuracy of DoA estimation. One category builds its foundation on cross- correlation[73–76]. 
It aligns two signals after delaying one signal. The best match corresponds to the most possible TDoA. In practice the existing of noise obfuscates the peak of cross-correlation and decreases the accuracy. Studies from[75, 77] assume the background noise follows Gaussian distribution and add a phase weight derived from magnitude squared coherence function. How- ever, this method cannot deal with multipath effect. Another popular category[78–80] is based on 41 MUSIC[81]. MUSIC builds signal covariance matrix and extracts eigenvalues. Although it is ca- pable of locating multiple sources, MUSIC assumes signals from different sources are uncorrelated as well as noise. With multipath effect existing, the assumption is compromised. • Voice assistants expect a system without extra deployment expense or prerequisite opera- tions. Due to two challenges above, current employable sound source localization systems turn to adding another sensor (depth sensor, camera, etc) or collecting prior information ((user height, en- vironment space structure, etc). If multiple microphones[8, 62, 82–85] scatter around the place or multiple anchor speakers[9, 86–90] serve as beacons, geometric models can be built after strictly aligning the clocks across different microphones and speakers. Owlet[91] designs a 3D-printed metamaterial structure to obtain spatial information. Recently VoLoc[92] proposes to turn mul- tipath effect from disturbance to information as it extracts location information from the second reflection path. After collecting the user height as prior information, VoLoc achieves 2D localiza- tion. However, its application scenario is limited as it requires voice assistants to be near a wall. Symphony[71] achieves multi-source localization but still requires the relative position of the mi- crophone array to a nearby wall. MAVL[93] extends the application scenario of VoLoc as it does not require a wall nearby. However, MAVL needs to estimate the reflectors in the room by emitting wideband chirps before localization the sound source. To solve the first challenge, we inherit the idea of TDoA. As long as TDoA between a pair of microphones is available, we can turn it into distance information and further 3D location. Instead of cross-correlation, we approach TDoA by formulating our problem into linear regression as θ = ω · τ + ε where ω is angular frequency, τ is TDoA and ε is noise term. θ is the phase difference between two microphones obtained from cross spectrum. Under this formulation, we get a better visualization of acoustic data and the ways to remove outliers become more explicit, which further helps the system on solving the second challenge. To be more accurate, θ = ω · τ + ε is a theoretical representation which only stands when there are only LoS path and uncorrelated noise. With the existence of multipath effect, the assumption is compromised and the performance of linear regression becomes unstable. To compress the disturbance of multipath effect, we extend 42 the idea of robust regression from [94]. Compared to linear regression, robust regression will dilute the impact of outliers on loss function. The main difference between [94] and our work is, [94] tests robust regression in simulated experiments. They modulate signals to simulate multipath environment and Gaussian background noise. However, our experiments are completely performed in real-life scenarios. The noises are more complicated and cannot be perfectly handled by robust regression. 
Moreover, the phase difference between two microphones in real life can be larger than 2π if the user has a high voice, but phase data from spectrum always falls into [0, 2π]. We make adjustments to accommodate robust regression to practical applications. As for hardware noise and background noise, we propose self-adjusting speech detection algorithm to predict the probability of existence of speech in a certain frequency bin. We also remove outliers which are in conflict with the known distances between each pair of microphones. The known distance projects certain constraints for TDoA estimation, which can be applied to filter out outrageous phase data. We implement SoundFlower on a 6-mic circular array and Raspberry Pi as a simulation to Amazon Echo. We also implement state-of-the-art work VoLoc[92] as our baseline model. We collect 1,000 data points from different indoor environments and test on SoundFlower. The overall accuracy of 2D localization and 3D localization is 0.45m and 0.5m with consumption time of 3s and 5s, which is sufficient for voice assistants. Around 700 data points are collected next to a wall and are tested on both VoLoc and SoundFlower for 2D localization as VoLoc requires user height as input. Our contributions can be summarized as: • We propose the design of SoundFlower, a robust and real-time sound source localization system for voice assistants. We obtain phase difference from cross spectrum, extract TDoA from phase data, turn it into distance, and derive 3D localization through lightweight optimization. • We formulate TDoA and phase shift into a robust regression problem. Robust regression shows great performance on compressing multipath effect. We also propose self-adjusting speech method to detect speech-involved phase data and an unwrapping scheme to rectify phase data from periodicity issue. 43 • We implement SoundFlower and compare it with state-of-the-art work. While extending pre- vious work from 2D localization to 3D localization, we achieve comparable localization accuracy to state-of-the-art work. The following content of this chapter is organized as follows: We discuss related work about acoustic localization in Section 3.1. Section 3.2 establishes mathematical preliminary about TDoA estimation and our motivation to SoundFlower. Section 3.3 is an overview of SoundFlower. We further introduce the system in detail in Section 3.4. Section 3.5 shows our implementation details as well as experiment setup. Section 3.6 presents experiment results. We analyze our limitations in Section 3.7. Finally we conclude our work in Section 3.8. 3.1 Related Work Acoustic localization can be done passively or actively. Active acoustic localization involves the modulation and transmission of the signal of interest. Traveling along different paths leaves fingerprints on the specially-designed signal. The location information can be extracted through comparing the transmitted signal and the received signal. Passive acoustic localization waits for the sound from the target. Usually limited spectral information about the original sound is avail- able. The location is usually obtained with the help of multiple microphones with known relative positions. Active Acoustic Localization Active acoustic localization designs distinguishable signals. After the receiver captures the signal, several mechanisms can figure out the target location. Some systems[6, 8, 62, 87] are ranging-based by measuring TDoA. 
Some[68, 95–97] keep track of the moving velocity of the target through Doppler Effect. There are also some papers[89, 98–100] exploiting the phase shift of the signal and then derive its traveling distance. Some[69, 101, 102] combine acoustic signal with inertial sensors and achieve localization through tracking. Up to now, TDoA based models are constrained by the accuracy of timing, Doppler Effect based models are limited by the frequency resolution while phase shift based models only apply to scenarios whose phase shift is below 2π. Passive Acoustic Localization Our paper falls into passive acoustic localization. Passive acoustic 44 localization is widely studied for applications like voice assistants and self-driving cars, where the target is located as soon as they make a sound like wake-up commands or ambulance sirens. Unlike active acoustic localization, the receiver usually does not have comprehensive knowledge about the spectrum of the sound of interest. Thus passive sound source localization uses multiple micro- phones with known positions to derive AoA of the sound and then locate the source. One classic technique is cross correlation[73–76, 103], which computes TDoA between microphones through cross correlating their signals. Another large category is based on MUSIC[78–81, 104], which analyzes spatial covariance matrix of microphone signals, extracts signal subspace from noise sub- space and peeks the AoA which maximizes the energy. We specially introduce VoLoc[92] as we will use it as the baseline model. Multipath effect is believed to be one of the main interferences to acoustic localization. Unlike other models trying to get rid of it, VoLoc turns multipath effect into information by putting the voice assistant next to a wall. With wall distance and orientation known to the voice assistant, VoLoc successfully extracts the second AoA, namely the AoA of the wall reflection path, and achieves 2D localization after fusing two AoAs and the user height. MAVL[93] does not require voice assistants to be next to a wall, but before localization it has to transmit FMCW signals from 1 kHz to 3 kHz for AoA estimation. 3.2 Preliminary and Motivation In Section 3.1, we summarized various categories of acoustic localization methods. Our paper falls into the category of TDoA estimation of passive acoustic localization. Suppose the signal transmitted from the source is s(t), the signals captured by two microphones x1(t), x2(t) can be represented as: x1(t) = h1(t) ∗ s(t) + n1(t) x2(t) = h2(t) ∗ s(t − τ) + n2(t) (3.1) where h1(t) and h2(t) are channel state information, n1(t) and n2(t) are additive noises uncor- related to our source, and τ is the TDoA of interest. As Figure 3.1 shows, once τ is at hand, we can derive the difference between the distances of two microphones to the source, which could further 45 contribute to the derivation of the source location. Figure 3.1 TDoA reveals the distance difference between two microphones to the source. Now that we track our problem from location to TDoA, one fundamental and classic method for TDoA is General Cross Correlation (GCC). Roughly speaking, GCC tries all possible TDoAs and performs cross correlation on two signals with one signal delayed by the assumed TDoA, the TDoA which generates the peak cross correlation result is believed to be the actual TDoA ˆτ. In equation 3.2 we use ω to represent angular frequency. Cross correlation Rx1x2(τ) is calculated through the inverse Fourier Transform of the cross spectrum X1(ω)X ∗ 2 (ω). 
The weighting function W(ω) places emphasis on different frequencies in the presence of uncorrelated noise [75]:

ˆτ = argmaxτ Rx1x2(τ), where Rx1x2(τ) = ∫_{−∞}^{+∞} W(ω) X1(ω) X2*(ω) e^{jωτ} dω (3.2)

W(ω) brings up a challenge, as effective cross correlation needs to focus on frequencies from the speech rather than the noise under the condition that we have no information about the original speech. Previous works like [75, 105] use the magnitude squared signal coherence. The idea works well when the noise is ideal and follows a Gaussian distribution, but it shows instability under non-stationary noise and multipath effect. To handle non-stationary noise, we bypass W(ω) and turn to the cross spectrum X1(ω)X2*(ω). The phase of the cross spectrum can be represented as:

Θ(ω) = ωτ + ε (3.3)

where τ is the TDoA and ε is a noise term. As we can see, the problem has been transferred from TDoA to phase. If we have the phase of the cross spectrum, we can figure out the TDoA and then derive the location of the source. However, Equation 3.3 only holds when only the LoS path exists and ω lies in the speech frequency band. In real-life scenarios, multipath effect and other uncorrelated noise will taint the phase data. Facing the disturbance of multipath effect, VoLoc [92] proposes to extract the second path, which is the speaker-wall-microphone reflection path, to turn the disturbance into useful location information. However, the idea requires the voice assistant to be near a wall. Whenever the voice assistant is moved, VoLoc has to re-evaluate the distance and orientation from the voice assistant to the wall, which takes hours according to the paper. In this paper, we propose robust regression to deal with multipath effect. Signals from NLoS paths are weak compared to the LoS signal, so robust regression treats patterns from NLoS paths as outliers and concentrates on the LoS pattern. As for uncorrelated noise, we filter it out before feeding the phase data to robust regression.

To summarize, phase data from the cross spectrum contains distance information that can lead to the sound source location; the challenge is that raw phase data from commercial microphones is very noisy. The noise comes from three sources: (1) environmental noise; (2) internal hardware noise; (3) multipath effect. Facing these three challenges, we propose SoundFlower. In particular, we propose self-adjusting speech detection in Section 3.4.3.1 to recognize speech from environmental noise and internal hardware noise, and robust regression in Section 3.4.2.2 to deal with multipath effect.

3.3 System Overview

Figure 3.2 illustrates the overall architecture of SoundFlower. After the user speaks a command and the microphone array captures it, we perform self-adjusting speech detection on the sound samples collected by each microphone. During this step, we calculate the spectrum and recognize the frequency bins that are closely related to the speech. In parallel, we calculate the cross spectrum of each pair of microphones and collect speech-involved phase data. After that, the filtered phase data is fed to robust regression. Robust regression outputs the TDoA of the speech to the two microphones. With the sound traveling speed of 343 m/s, we obtain the distance difference from the two microphones to the source. As we use a circular microphone array which has 6 microphones, we can have up to 15 distance differences. We fuse the distance difference information from all pairs of microphones into an optimization model. By exhaustive search, we obtain the final user location.

Figure 3.2 System architecture.
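Before diving into the design, the following NumPy sketch illustrates the front end of this pipeline: it accumulates the cross spectrum of one microphone pair over short frames and returns its phase, i.e., the quantity Θ(ω) that the later regression consumes. The frame length, hop size and windowing are illustrative choices for this sketch, not the system's exact settings.

```python
import numpy as np

def cross_spectrum_phase(x1, x2, fs=16000, n_fft=512):
    """Phase of the cross spectrum between two microphone channels.
    Returns (angular_frequencies, phase); ideally phase ≈ ω·τ (Eq. 3.3).
    Speech-bin selection and unwrapping are omitted in this sketch."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    window = np.hanning(n_fft)          # short frames keep speech quasi-stationary
    hop = n_fft // 4

    cross = np.zeros(n_fft // 2 + 1, dtype=complex)
    for start in range(0, min(len(x1), len(x2)) - n_fft, hop):
        X1 = np.fft.rfft(window * x1[start:start + n_fft])
        X2 = np.fft.rfft(window * x2[start:start + n_fft])
        cross += X1 * np.conj(X2)       # accumulate X1(ω) X2*(ω) over frames

    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)   # Hz
    omega = 2 * np.pi * freqs                    # rad/s
    theta = np.angle(cross)                      # wrapped phase of the cross spectrum
    return omega, theta
```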
3.4 Design

In this section, we elaborate the design details of SoundFlower and explain the motivation of each step. After the user utters a command, the sound arrives at the multiple microphones at slightly different times. The TDoA information encodes the location of the sound source. We formulate TDoA estimation as a regression problem where frequency is the independent variable, phase difference is the dependent variable and TDoA is the weight. We also introduce the cross spectrum to obtain the phase difference between each pair of microphones. In practice, the phase data is easily polluted. We analyze the sources of noise and propose corresponding solutions. Finally, we locate the target in 3D space through optimization.

3.4.1 Phase from Cross Spectrum

The cross spectrum is the Fourier Transform of the cross-correlation result. It describes the relationship between two time series as a function of frequency. Suppose we have two signals x, y captured by a pair of microphones. During the cross-spectrum calculation, we first apply the Fourier Transform to the time-domain signals and get X, Y. Then we multiply X by the conjugate of Y and get:

X · Y* = a1(cos θ1 + j sin θ1) · a2(cos θ2 − j sin θ2) = a1a2(cos(θ1 − θ2) + j sin(θ1 − θ2))

As we can see, the phase of the resulting cross spectrum is the phase shift between the original two signals as long as their phase shift is smaller than 2π. If the distance between two microphones is d and the sound speed is c, the phase shift can be correctly tracked if we only use frequencies under c/d. For frequencies larger than c/d, we unwrap the phase data as introduced in Section 3.4.3.2.

3.4.2 TDoA Estimation

After we have the phase information, we use robust regression to extract the TDoA between each pair of microphones, which will later be used in 3D localization.

3.4.2.1 Linear Regression

Before we discuss the technical details of our model, we use linear regression to formulate our problem and project phase data to time delay. Considering that the phase of the cross spectrum θ changes linearly with its frequency f, we have:

θ(f) = 2πτ · f + ε

where τ is the TDoA of the signal arriving at the two microphones and ε is the noise component. Under ideal scenarios, ε follows a zero-mean, uncorrelated Gaussian distribution [77]. Thus we have the following optimization problem:

ˆτ = arg minτ ∫ (θ(f) − 2πf τ)² df

Now we can see the original problem turns into a linear regression problem.

3.4.2.2 Robust Regression

Compared to linear regression, robust regression is more resistant to outliers. Theoretically, linear regression is a perfect solution to TDoA estimation in Gaussian-noise-only scenarios; in practice, however, the phase data is biased by multipath effect. The disturbance is extremely complicated since the noise incurred by multipath effect is not uncorrelated with the speech. Thus we use robust regression, whose loss function is less sensitive to outliers. Robust regression concentrates on the direct-path phase data, which carry more weight.
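The sketch below contrasts the two estimators just introduced, reusing the `omega` and `theta` arrays from the earlier cross-spectrum sketch. The ordinary fit has a closed form; the robust fit is a minimal iteratively reweighted least-squares loop whose bounded (Tukey-style) weights realize the kind of loss formalized in the next subsection. The MAD-based scale and the 4.685 tuning constant are conventional assumptions of this sketch rather than the thesis's exact choices.

```python
import numpy as np

def tdoa_linear(omega, theta):
    """Zero-intercept least squares: the slope of theta vs. omega is the TDoA."""
    return float(np.dot(omega, theta) / np.dot(omega, omega))

def tdoa_robust(omega, theta, n_iter=20):
    """Iteratively reweighted least squares with bounded (Tukey-style) weights,
    so that multipath-induced outliers barely influence the fitted slope."""
    tau = np.dot(omega, theta) / np.dot(omega, omega)      # OLS initialization
    for _ in range(n_iter):
        r = theta - omega * tau                            # residuals
        scale = 1.4826 * np.median(np.abs(r)) + 1e-12      # robust scale (MAD)
        x = r / (4.685 * scale)                            # scaled residuals
        w = np.where(np.abs(x) <= 1.0, (1.0 - x ** 2) ** 2, 0.0)
        if np.sum(w * omega ** 2) == 0:                    # all points rejected
            break
        tau = np.sum(w * omega * theta) / np.sum(w * omega ** 2)
    return float(tau)
```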
The objective function of robust regression [106, 107] is:

ˆτ = arg minτ ∫ ρ( (θ(f) − 2πf τ) / S(f) ) df

where S(f) is a scaling term of the residual and ρ(x) is a loss function of the scaled residual x:

ρ(x) = −(1 − x²)³ / 6, if |x| ≤ 1
ρ(x) = 0, if |x| > 1

As we can see from Figure 3.3, when we have outliers caused by multipath effect or by accumulated error from phase unwrapping (which we will illustrate in Section 3.4.3.2), robust regression can restrict the influence of the outliers, focus on the coherent data points, and generate output that is closer to the ground truth, while linear regression fits all data points and the outliers drag its output away from the ground truth.

Figure 3.3 Comparison between robust regression and linear regression in the presence of outliers.

Robust regression is proposed in [94] as a solution to reverberation. In their experiments, GCC-ML [108] shows great performance in noise-only situations, while robust regression outperforms GCC-ML when reverberation exists. However, their experiments use simulated data. They simulate a room with plane reflective surfaces and frequency-independent reflections. To substantiate and quantify multipath effect, room impulse responses are generated with the image technique [109]. The signal-to-noise ratio (SNR) is controlled by adding zero-mean Gaussian noise with a fixed energy level. When we test robust regression in real-life scenarios, the phase data is more tainted and biased than we expected. One significant reason is that the speech does not cover the whole frequency band (if the sampling rate of the ADC is 16 kHz, the frequency band is 0–8 kHz), so we should not use phase data from all frequencies. To accurately recognize the frequency bins in which speech exists, we design a self-adjusting speech detection method in Section 3.4.3.1 to predict the probability that speech is present in each frequency bin. We are inspired by the noise estimation method MCRA [110] but modify it to make it simpler, more computationally efficient and suitable for short speech like the wake-up commands of voice assistants.

3.4.3 Data Pre-processing

When we acquire phase data from the cross spectrum, we have phase data for all frequency bins. If we feed the complete phase data directly to robust regression, two issues arise: (1) Speech does not exist in all frequency bins. With too much phase data from irrelevant frequency bins, robust regression will be overwhelmed by outliers and fail to recognize the real pattern. (2) The phase information provided by the spectrum is given in the format of complex numbers. We may transform it into an angle in radians, but all the resulting radians will be in [0, 2π]. Thus we propose Self-Adjusting Speech Detection in Section 3.4.3.1 to clean the phase data, and unwrap the phase data in Section 3.4.3.2. After cleaning and reforming, it will be easier for robust regression to extract the TDoA from the pre-processed phase data.

3.4.3.1 Self-Adjusting Speech Detection

Being blind to the speech makes it challenging to separate speech and noise from the captured signal. Due to hardware imperfection and the way the ADC works, it is inevitable to have internal noise, and at this point it is almost impossible to estimate the internal noise. On the other hand, environmental noises, like heating noise and fridge noise, are quite common in the workplace of voice assistants. Previous works either use magnitude squared signal coherence [75, 105] or record silent intervals as prior information and use them to estimate the SNR [108].
These methods either assume the noise to follow Gaussian distribution or require silent intervals as priori knowledge to estimate noise power spectrum. For non-stationary noise like heating noise and fridge noise, we need a self-adjusting speech detection method to recognize the speech. In spite of noises, the signal is dominated by speech when the user starts talking, and the mag- nitude of speech frequency sharply decreases when the speech stops. Based on this phenomenon, a natural solution could be setting a threshold and filtering out the data points whose frequency magnitude is lower than the threshold. However, as the user may speak at different volumes, it is hard to decide on a numeric value for the threshold that works every time. Thus we propose Self-Adjusting Speech Detection to predict the probability of speech existing in a frequency bin. Speech presence in a frequency bin of current time frame is determined by the ratio between the energy of the current frequency bin and its minimum within a specified time window. Suppose the magnitude of the i-th frequency bin is M( f ,t). We keep track of the last L frames of Short- Time Fourier Transform (STFT) results. The minimum magnitude of the frequency bin in the last L frames is Mmin( f ,t). By comparing the ratio M( f ,t)/Mmin( f ,t) to a threshold δ , we can decide speech exists in the i-th frequency bin if M( f ,t)/Mmin( f ,t) > δ . The number of frames L is determined empirically according to how fast the user speaks. In our experiments, we use a frame of 512 samples, a step of 128 samples and L is 10. As for δ , it is closely related to the ratio between the energy of speech and that of the noise. In practice, we coarsely detect the arrival and end of the speech by monitoring the energy increase and decrease, calculate the power when speech exists, and compare it with that of a silent interval. In this way, we obtain an estimation of δ . Figure 3.4 shows how frequency 343.75 Hz varies as the user speaks the wake-up command. In spite of the fact that we have 91 data points on record from the whole speech window, we only collect 25 of them as self-adjusting speech algorithm thinks there is high chance that the other data points are from internal noise or environmental noise instead of the speech. You may notice from Figure 3.4 that there are some data points whose magnitude is at the peak but have not been 52 Figure 3.4 Self-adjusting speech detection. selected. This is because the decision is made mutually from a pair of microphones. The speech detection result from the other microphone does not believe these points are from the speech. 3.4.3.2 Unwrapping Phase Data The phase information is encapsulated in the form of complex number in cross spectrum as the consequence of FFT. Directly unwrapping the complex number through inverse tangent will result in a value between [0, 2π]. On the other hand, the maximum distance between a pair of microphones from our microphone array[111] is 0.092 m, which means the phase difference be- tween two microphone could be larger than 2π if the upper-bound frequency bin is greater than 3731 Hz. Thus we add an unwrap step before feeding the phase data to robust regression in or- der to restore phase information larger than 2π. During the unwrapping, not only do we rectify the periodicity issue, but also more outliers are removed as they break certain rules under known distance constraints. After unwrapping, the data provided to robust regression is more clean and lightweight. 
The prior information for unwrapping is the known distance between each pair of microphones, which leads to two rules. (1) There is an upper bound on the phase difference between each pair of microphones. For example, if the distance between a pair of microphones is d, then their maximum phase difference is 2πfd/c, where c is the sound speed. This rule is a general boundary for all phase data: any phase data that is considerably larger than 2πfd/c or smaller than −2πfd/c is an outlier. (2) The difference of the phase data between two adjacent frequency bins f1, f2 (f1 > f2) falls within [−2π(f1 − f2)d/c, 2π(f1 − f2)d/c]. If the rule is broken, we check whether the phase data can follow the rule after adding or subtracting 2kπ (k = 1, 2, 3, ...); otherwise we label it as an outlier and remove it. This rule especially works for phase data larger than 2π.

During implementation, we make two more practical adjustments. First, we slightly expand the theoretical thresholds of the two rules to leave some error-tolerant space. We allow an error margin of π/36 for rule (1) and π/18 for rule (2). The rationale is that some experimental phase data is very close to the ground truth but contains small fluctuations; we keep such data as it still provides valuable information. Second, we try our best to determine which microphone receives the signal first. This information, if reliable, directs unwrapping toward either positive or negative phase, which further increases the accuracy of robust regression as it receives fewer outliers. Generally, we monitor the energy increase of each microphone. The first microphone that shows an energy increase is considered to be closer to the speaker. However, this method is not always reliable as a minor energy increase could be overwhelmed by noise fluctuations, so we only treat the result as informative rather than trustworthy. On the other hand, we still unwrap the phase data into both a positive array and a negative array. If the ground truth is positive, the positive array will be longer and have greater magnitude, and vice versa. If both the energy increase and the array length support the same conclusion about which microphone receives the speech first, we consider the result trustworthy and keep the corresponding phase array. Otherwise, we believe the two mechanisms fail to recognize the information, for reasons such as the speaker being on or near the median line of the two microphones, and we feed both the positive data and the negative data to robust regression so that it can make the final decision.

3.4.4 3D Localization

Although plenty of papers [73–76, 78–81] study the AoA between two microphones, it is challenging to project AoA into a location, especially a 3D location. For a linear microphone array, AoA information narrows the potential space down to a cone. Even with the user height, there are theoretically still infinitely many points consistent with the AoA results. For a circular microphone array, we have one cone from each microphone pair, and predicting the intersection of several cones easily runs into an unsolvable situation in practice. Thus we choose optimization to obtain 3D localization from the traveling distance differences.

Figure 3.5 The loss function is minimized at parameters close to the ground truth: (a) loss variation of 2D localization; (b) loss variation of height.

After the TDoAs between each pair of microphones are available, we reformulate the 3D localization problem as an optimization problem and use grid search to find the optimal spot, as sketched below.
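A brute-force sketch of this grid search follows. It assumes the microphone coordinates and the measured distance differences d_ij (derived from the TDoAs) are already available; f_ij(x, y, z) is the geometric distance difference formalized right after this sketch, and the 0.1 m step is an illustrative choice, not the system's exact setting.

```python
import numpy as np
from itertools import combinations

def grid_search_3d(mic_positions, d_meas, area, step=0.1):
    """Exhaustively search the area S for the point whose predicted distance
    differences f_ij best match the measured ones d_ij (sum of absolute
    errors, i.e., an MAE-style loss). `d_meas[(i, j)]` is the measured
    distance difference of microphones i and j to the source; `area` is
    ((xmin, xmax), (ymin, ymax), (zmin, zmax))."""
    mic_positions = np.asarray(mic_positions, dtype=float)
    best, best_loss = None, float("inf")
    xs, ys, zs = (np.arange(lo, hi + 1e-9, step) for lo, hi in area)
    for x in xs:
        for y in ys:
            for z in zs:
                p = np.array([x, y, z])
                loss = 0.0
                for i, j in combinations(range(len(mic_positions)), 2):
                    f_ij = (np.linalg.norm(p - mic_positions[i])
                            - np.linalg.norm(p - mic_positions[j]))
                    loss += abs(f_ij - d_meas[(i, j)])
                if loss < best_loss:
                    best, best_loss = p, loss
    return best
```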
If we use (x, y, z) to denote a random point within the searching area S, the optimal source location would be: xopt, yopt, zopt = argmin (x,y,z)∈S ∑ i, j | fi, j(x, y, z) − di, j| where di, j is the distance difference between two microphones to the sound source derived from TDoA estimation, and fi, j(x, y, z) is the distance difference between two microphones to (x, y, z). Mean average error is convex function. Figure 3.5 shows an example of how objective function varies across the searching area. Figure 3.5a shows loss variation of 2D localization. The loss decreases sharply when approaching ground truth and reaches minimum at ground truth. However, Figure 3.5a also reveals points close to line y = yg xg x have close values to the minimum loss, where (xg, yg) is the ground truth and the microphone array is at the origin. Our experiments show similar conclusion, outputs of SoundFlower are around this line once they deviates from ground truth. Figure 3.5b shows the variation of loss function with height after SoundFlower finds the (xg, yg). We rationally assume the sound source is higher than the voice assistant as it is often the case in 55 real life. The technical need for the assumption is because two points symmetric to the horizontal plane of the voice assistant have the same TDoA estimation result. We use mean absolute error (MAE) instead of mean square error (MSE) due to the fact that MAE is more robust to outliers. With several pairs of microphones available, it is possible some TDoA estimations deviate from ground truth. MAE will prevent loss function from greatly in- creasing by one or two offset TDoA estimations. 3.5 Implementation (a) 6-Mic Circular Array (b) Raspberry Pi 4 (c) Bedroom System Setup (d) Basement System Setup Figure 3.6 System setup. We implement SoundFlower using an assembly of a 6-mic circular array [111] in Figure 3.6a and Raspberry Pi 4 Model B [112] in Figure 3.6b. We use the simulation tool kit instead of off-the- 56 shelf voice assistants like Amazon Echo because raw acoustic signals are enclosed for commercial products. The sampling rate is set to be 16 kHz as the range could cover most of human voice frequency. Higher sampling rate actually may incur aliasing. The microphone array is mounted over the Raspberry Pi to connect acoustic samples. Afterwards the samples are sent from the Raspberry Pi to a laptop through wireless connection as Figure 3.6c and Figure 3.6d show. We use the laptop to run models and output locations. We collected 1,000 data points from a bedroom, a basement, a kitchen and a living room like Figure 3.7 shows. To compare SoundFlower and VoLoc in Section 3.6.1, we use data points collected from scenarios like Figure 3.6c, Figure 3.6d and Figure 3.7c to assure the assumption of VoLoc, which is the reflection from the wall is the second sound traveling path. For study on how multipath effect influences the performance of SoundFlower in Section 3.6.2.2, we add objects in Figure 3.6d to complicate the surroundings and generate multipath environment. The overall results shown in Section 3.6.1 covers all different scenarios. We record the wake-up commands by a small mobile device and place the mobile device in different locations as the sound source. We use prerecorded audio instead of human volunteers because (1) we want to remove volume as a variant and better study the influence of distance in Section 3.6.2.1 and the influence of multipath effect in Section 3.6.2.2. 
3.5 Implementation

We implement SoundFlower using an assembly of a 6-mic circular array [111] shown in Figure 3.6a and a Raspberry Pi 4 Model B [112] shown in Figure 3.6b. We use this development kit instead of off-the-shelf voice assistants like Amazon Echo because commercial products do not expose raw acoustic signals. The sampling rate is set to 16 kHz, which covers most of the human voice frequency range; a higher sampling rate brings little additional benefit for voice signals. The microphone array is mounted on the Raspberry Pi to collect acoustic samples. The samples are then sent from the Raspberry Pi to a laptop through a wireless connection, as Figure 3.6c and Figure 3.6d show. We use the laptop to run the models and output locations.

Figure 3.6 System setup. (a) 6-mic circular array. (b) Raspberry Pi 4. (c) Bedroom system setup. (d) Basement system setup.

We collected 1,000 data points from a bedroom, a basement, a kitchen and a living room, as Figure 3.7 shows. To compare SoundFlower and VoLoc in Section 3.6.1, we use data points collected from scenarios like Figure 3.6c, Figure 3.6d and Figure 3.7c to satisfy VoLoc's assumption that the reflection from the wall is the second sound propagation path. To study how the multipath effect influences the performance of SoundFlower in Section 3.6.2.2, we add objects to the setup in Figure 3.6d to complicate the surroundings and create a multipath environment. The overall results shown in Section 3.6.1 cover all the different scenarios. We record the wake-up commands with a small mobile device and place the device at different locations as the sound source. We use prerecorded audio instead of human volunteers because (1) we want to remove volume as a variable and better study the influence of distance in Section 3.6.2.1 and of the multipath effect in Section 3.6.2.2; the uncontrollable voice volume of human volunteers would introduce a new variable and undermine the reliability of our conclusions; and (2) we want to collect ground truth in a more refined manner, which is important because we suggest a practical searching step in Section 3.6.2.3 and aim to find a step size that is accurate enough for a human body while remaining friendly to the computation capacity of voice assistants. We use two different wake-up commands, 'Alexa' for Amazon Echo and 'Hi Siri' for Apple products. Both a male voice and a female voice are tested in our experiments. For evaluation metrics, we use the localization error in meters to show the accuracy of SoundFlower, and the consumption time in seconds to show the computation overhead.

Figure 3.7 Different experiment scenarios. (a) Living room. (b) Kitchen. (c) Bedroom system setup.

3.6 Experiment

In this section, we present the performance of SoundFlower. In particular, we want to answer the following questions and evaluate the feasibility of SoundFlower as a sound source localization system for voice assistants:

• What is the overall localization accuracy of SoundFlower? How does it compare to the state-of-the-art baseline model VoLoc?

• What factors influence the overall performance of SoundFlower? What should we target to improve for future development of indoor sound source localization?

• What is the running time of SoundFlower? Is it affordable for voice assistants?

3.6.1 Localization Accuracy

We test SoundFlower with both 2D and 3D localization. Considering our target indoor applications like Smart Home and Smart Office, 2D localization information is of more interest. For this reason, VoLoc chooses to collect the user height as input and uses it to narrow down the searching space. However, when there are several family members or workmates, a voice assistant would have to recognize different users and choose different heights. Moreover, a human body might lean or stand on tiptoe, so a fixed height value is not an ideal choice for our ultimate goal. We therefore choose to search a 3D space of 6 m × 6 m × 0.6 m, which is large enough to cover a basement or a conference room and allows a height variation of 0.6 m.

We first show our 2D and 3D localization accuracy in Figure 3.8a and then compare the 2D localization with VoLoc in Figure 3.8b. The median error of 2D localization is 0.45 m, while that of 3D localization is 0.5 m. Considering the area covered by a single person, this accuracy is sufficient for location-based applications. Another point worth noting is that the 2D localization error is similar to the 3D error. This is because, at most locations, the loss function of SoundFlower is more sensitive to horizontal movement than to vertical movement of the independent parameters, as Figure 3.5 implies. Thus, with or without the user height, SoundFlower is able to find the planar location of the sound source. The slightly higher 3D localization error usually comes from the height prediction offset.

Figure 3.8 Localization accuracy. (a) Overall localization accuracy. (b) 2D localization comparison with VoLoc.

Among all collected data points, we pick the data points collected near a wall. VoLoc requires knowing the distance and orientation from the voice assistant to the wall. The original paper reports an average wall estimation error of 1.2 cm and 1.4◦, but the estimation takes hours every time the voice assistant is moved. To be more efficient, we feed this information directly to VoLoc. The median error of SoundFlower is 0.5 m while that of VoLoc is 0.65 m.
The accuracy of SoundFlower decreases slightly here because the second reflection path from the wall is indeed strong. VoLoc can extract a second AoA from it, but it confuses SoundFlower's extraction of the LoS signal. If the magnitude of the second path were similar to that of the third and fourth paths, the LoS signal would be more apparent to SoundFlower.

Through implementation and experiments, we analyzed several situations that can hurt the performance of VoLoc. (1) One important assumption of VoLoc is that the reflection from the wall is the second strongest component in the received signal apart from the LoS component. For scenarios like Figure 3.6d, the assumption holds and VoLoc achieves good performance. However, real-life scenarios look more like Figure 3.7c, where various objects may surround the voice assistant. In these cases, the assumption is undermined and the performance drops. (2) The very first step of VoLoc is to calculate the direct-path AoA. The resulting AoA narrows the searching space from a 2D plane down to a beam in that plane. This step significantly reduces the running time of VoLoc and makes the computation overhead affordable for voice assistants. The side effect is that the estimation of the direct-path AoA needs a clean direct-path signal; in other words, we need to clip the signal before the second-path signal pollutes the direct-path signal. VoLoc [92] states that it uses "tens of samples" right after detecting the rise of signal energy for direct-path AoA estimation with a sampling rate of 16 kHz. Following this instruction, we choose 32 samples and the same sampling rate in our implementation, which means we assume the second path is at least 0.686 m longer than the direct path if the sound speed is 343 m/s. This assumption does not always hold. To extract an effective direct-path signal, we actually need a coarse prior estimation of the distance difference between the direct path and the second path. Otherwise, variation in the clipped samples can affect the direct-path AoA estimation and further lead VoLoc to a wrong searching area, or make it take too much time to finally reach a location.
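The 0.686 m figure follows directly from the clip length; a quick check (the 32-sample choice is our reading of VoLoc's "tens of samples"):

```python
fs = 16_000              # sampling rate (Hz)
c = 343.0                # sound speed (m/s)
n_clip = 32              # samples assumed clean before the second path arrives
print(n_clip / fs * c)   # 0.686 m: minimum extra length the second path must travel
```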
3.6.2 Influencing Factors

In this part we analyze the factors that influence the performance of SoundFlower. Distance and the multipath effect are known to be influential for sound source localization systems [92, 94]; our experiments show that distance is still the most influential factor for localization performance. We also show the impact of the searching step size, which matters when we trade off accuracy against consumption time. Our experiments show that if a searching step size cannot cover the ground truth, the output usually falls on the closest searching point, so switching to a larger step size does not significantly affect accuracy but does decrease consumption time.

3.6.2.1 Influence of Distance

Both VoLoc [92] and our experiments observe a strong impact of distance on localization accuracy. The further the speaker is from the microphones, the greater the attenuation of the acoustic signal; the SNR decreases considerably, and after ADC quantization some information is lost. Figure 3.9 shows how the localization accuracy of SoundFlower decreases as distance increases. We use the mid-range error, namely the arithmetic mean of the largest and smallest observed errors, to show the influence. For areas within 1 m of the microphones, the localization error is always smaller than 0.5 m. For areas around 3 m - 4 m from the microphones, the maximum observed error is 2.8 m while the smallest error is 0.22 m.

Figure 3.9 The influence of distance on localization accuracy.

3.6.2.2 Influence of Multipath Effect

In this part, we explore the influence of the multipath effect on the localization accuracy of SoundFlower. We create different degrees of multipath effect by adding objects around the microphone array, as Figure 3.7c shows. As we can see from Figure 3.10, the accuracy on a cluttered table is slightly lower than on a clean table: the median error is 0.5 m if the table is relatively clean and 0.6 m if the table is very cluttered.

Figure 3.10 The influence of the multipath effect on localization accuracy.

3.6.2.3 Influence of Searching Step

For the experiments shown in Section 3.6.1, we use a searching step size of 0.1 m to scan the possible area. In this section, we further test our model with step sizes of 0.01 m, 0.03 m and 0.07 m. The result is shown in Figure 3.11. The finer granularity does not lead to a noteworthy accuracy increase: if the step size cannot cover the ground truth, SoundFlower usually returns the searching point closest to the ground truth.

Figure 3.11 The influence of different searching steps on localization accuracy.

3.6.3 Consumption Time

Figure 3.12 presents the consumption time for experiments of different scales. We run the model on a MacBook Pro (13-inch, Early 2015), and the code is implemented in Python 3. For a scanning area of 6 m × 6 m × 0.6 m, 2D localization with a searching step size of 0.1 m takes 3 - 4 seconds, 3D localization with a step size of 0.1 m takes around 5 seconds, and 2D localization with a step size of 0.01 m takes around 30 seconds. Given the size of a normal person, we believe a step size of 0.1 m achieves a good balance between consumption time and localization accuracy.

Figure 3.12 Consumption time.

According to [92], VoLoc takes 6 - 10 seconds to locate a person and hours to estimate the distance and orientation from the microphone array to the wall. We quote the time of VoLoc from the original paper in Figure 3.12 instead of measuring it ourselves because, when testing VoLoc, we found its consumption time is highly related to a preset parameter that narrows down the searching area after the initial estimation of the direct-path AoA. If the parameter is small, a run takes seconds, but the algorithm may fail to find a location when the initial AoA estimation deviates from the ground truth and the parameter filters out every possible location. If the parameter is large, VoLoc always finds an answer but a run takes minutes.

3.7 Discussion and Future Work

The main limitation of SoundFlower is that it cannot work when multiple people are speaking simultaneously. In practice, it is common for one person to try to wake up the voice assistant while other people are talking in the background. In this case, self-adjusting speech detection will return all frequency bins in which speech exists, rather than only the frequency bins in which the wake-up command exists. To solve this issue, a speech recognition method must be incorporated to recognize the wake-up command and truncate its time window. We also need ideas like MUSIC [81] to exploit the independence of multiple sound sources and separate the wake-up command from other speech. We take these challenges as future work.

3.8 Conclusion

In this chapter, we present a robust system for voice assistants to obtain the user location from user speech.
While state-of-the-art work [92] shows the feasibility of such a model with the user height known and a wall next to the voice assistant, we extend the application scenario to 3D localization without the assumption of a second reflection path. We continue the idea of TDoA estimation between a pair of microphones, extract phase information from the cross spectrum, design a self-adjusting speech detection algorithm and an unwrapping scheme to remove environmental and hardware noise, and apply robust regression to obtain the TDoA against the disturbance of the multipath effect. We achieve localization accuracy similar to the state of the art with fewer assumptions and less consumption time.

SoundFlower shows our effort to overcome diverse noises in sound source localization. Common sources of noise for IoT applications are imperfect hardware, background noise and the multipath effect. In this chapter, we show how to use statistical methods to suppress their disturbance to system performance.

CHAPTER 4
PREVENTING UNAUTHORIZED SPEECH RECORDINGS WITH SUPPORT FOR SELECTIVE UNSCRAMBLING

Human beings have long used acoustic signals to exchange information with each other. Now they also use acoustic signals, i.e., speech, to exchange information with ubiquitous smart devices such as smartphones, smartwatches, and digital assistants that are equipped with embedded microphones. While these speech detection and recognition capabilities enable many convenient features, they also introduce privacy risks such as secret, unauthorized recordings of our private speech [113, 114] that can have real-world consequences. For example, the Ukrainian prime minister offered his resignation after an unauthorized recording was leaked [115].

Manufacturers claim that they are trying their best to protect users' privacy, but there is no effective and user-friendly technical anti-recording solution available, even though anti-recording is not a new problem. One existing anti-recording solution is to talk near a white noise source, e.g., an FM radio tuned to unused frequencies, so that the conversation cannot be clearly recorded. This approach is not user-friendly because the people having the conversation must put up with the white noise that interferes with their normal communication. A similar solution [116] emits high-frequency noise near the upper bound of human sensitivity; most people do not notice the interference, but pets and infants may notice it [117], so this solution is not environment-friendly. Electromagnetic interference was an effective anti-recording solution in the past [118], but modern microphones are immune to it. Moreover, none of these traditional anti-recording approaches can allow authorized devices to clearly record conversations.

Any effective anti-recording solution must provide the following three key properties: (1) normal human conversation should be unaffected, meaning the anti-recording solution should not change what humans hear while having a conversation; (2) unauthorized devices should not be able to make a clear recording of any conversation protected by the anti-recording solution; (3) authorized devices should be able to make a clear recording of any conversation protected by the anti-recording solution. One potential solution that can satisfy all three properties is to generate multiple ultrasonic-frequency sound waves, because of the following two properties of ultrasonic waves.
First, humans cannot hear ultrasonic sound waves. Second, commercial off-the-shelf (COTS) microphones exhibit nonlinear effects, which means that when these microphones receive multiple ultrasonic sound waves, they generate low-frequency sound waves that can be heard by humans and thus interfere with the clarity of recordings made with those microphones [17, 117, 119–123]. There are three main challenges that must be overcome in order to develop an ultrasonic anti-recording solution that satisfies the three key properties:

• First, any ultrasonic anti-recording solution must defend against potential attacks such as using the Short-Time Fourier Transform (STFT) to analyze unauthorized recordings and using filters to cancel out the low-frequency sound waves that interfere with recording clarity.

• Second, ultrasound travels along a straight line [124], which means a single ultrasonic wave generator can only interfere with recording devices within a limited range of angles from the generator. In practice, it is difficult to design an ultrasonic anti-recording solution that can neutralize all recording devices within a large coverage area.

• Finally, the performance of authorized devices could be affected by the ringing effect caused by electronic behaviors. Such ringing impulses are hard to cancel and may remain in authorized recordings, severely downgrading the quality of the descrambled recordings.

In this paper, we present Patronus, an ultrasonic anti-recording system that satisfies the three key properties. Patronus has two key components: the scramble, which is the pseudo-noise generated at all microphones, and descrambling, which is the process of removing the scramble for authorized devices. We form the scramble by randomly picking frequencies from the human voice frequency band and then shifting them to the ultrasonic band. To thwart STFT attacks, we further fine-tune the period of the scramble so that it cannot be easily analyzed and canceled. We add a reflection layer with a curved surface to create a reflected ultrasonic wave that can cover a wider area. Finally, to mitigate ringing effects, i.e., sudden hardware impulses due to discrete frequency changes of the driving current, we use chirps to smooth the frequency-changing components of the scramble, as shown in Figure 4.1.

Figure 4.1 Using chirps to smooth the frequency changing components of the scramble. (a) Discrete frequency scramble components. (b) Continuous frequency changing scramble with chirps.

Patronus lets authorized devices clearly record audio conversations by sending them the scramble pattern. With the scramble pattern, an authorized device applies the Normalized Least-Mean-Square (NLMS) adaptive filter [125] to cancel the scramble and thus produce a clear audio recording of the conversation.

We implement a prototype of Patronus and conduct comprehensive experiments to evaluate its performance. We use the Perceptual Evaluation of Speech Quality (PESQ) [126], the Speech Recognition Vocabulary Accuracy (SRVA, see Section 4.5), and speech recognition error rates (1 - SRVA) to evaluate the performance of Patronus. Our results show that only 19.7% of the words protected by Patronus' scramble can be recognized by unauthorized devices. Furthermore, authorized recordings have 1.6x higher PESQ and, on average, 50% lower speech recognition error rates than unauthorized recordings.

In this paper, we provide several unique technical contributions compared to existing works.
First, to the best of our knowledge, Patronus is the first system to leverage the nonlinear effect of COTS microphones to prevent unauthorized recordings while allowing authorized recordings. Second, we perform a thorough study of the nonlinear effects of ultrasound frequencies, including the effects of higher orders, whereas recent works [17, 119, 120, 127] only consider orders up to 2. This is critical for descrambling because the signal components with order higher than 2 will likely lie in the human voice frequency band, which means simply cutting off the high-frequency components will result in message loss. Instead, our descrambling solution carefully removes these higher-order frequencies using an NLMS filter. Third, we mitigate ringing effects by connecting scramble segments with chirps. This simplifies learning the coefficients of the impulse response required in existing work [17], especially when we deploy multiple ultrasonic transducers in a large space. In general, our contributions are as follows:

• We propose a novel ultrasound modulation approach to provide privacy protection against unauthorized recordings that does not disturb normal conversation.

• We perform a thorough study of the nonlinear effect of ultrasound on commercial microphones and propose an optimized configuration to generate the scramble.

• To overcome the fact that ultrasound travels in a straight line, we design a low-cost reflection layer to effectively enlarge the coverage area of Patronus.

• We present Speech Recognition Vocabulary Accuracy (SRVA), a new metric to measure recording quality. Our experimental results with both PESQ and SRVA show that Patronus effectively prevents unauthorized devices from making secret recordings.

The organization of the rest of this paper is as follows. Section 4.1 introduces related work. Section 4.2 introduces the nonlinear effect of common microphones, which we analyze more thoroughly than existing works. Section 4.3 presents the design of Patronus. Section 4.4 presents the prototype implementation of Patronus. Section 4.5 presents our evaluation results. Section 4.6 discusses the limitations of Patronus and future work, and Section 4.7 concludes this work.

4.1 Related Works

4.1.1 Nonlinear Effect of Microphones

There has been a lot of research into the nonlinear effect of microphones. For many years, the development of ultrasonic systems on smartphones was restricted to a roughly 4 kHz range of frequencies between the high end of human hearing and the cutoff frequency of typical microphones. Furthermore, some infants and pets can actually perceive frequencies within this small band. Roy et al. [17] performed detailed research on the nonlinear effects of microphones to break through these limitations and expand the working frequency band for ultrasonic systems on smartphones. DolphinAttack [120] leverages the nonlinear effect to generate audio commands that are inaudible to humans: after being recorded by the microphone, the input ultrasonic signals generate a shadow signal that can be recognized by a voice control system (VCS), so attackers can perform unauthorized commands without being discovered. SurfingAttack [123] uses the oscillation of a surface such as a table to transmit inaudible commands.
With this modality, attackers can deploy their speakers in hidden spots such as the back of the surface used to transmit the secret commands. LipRead [119] extends the attack range by leveraging characteristics of human hearing; it also puts forward a model to filter out such commands generated by the nonlinear effect. Metamorph [121] injects inaudible commands into human-made commands to achieve unauthorized actions. AIC [127] presents a mechanism that fundamentally cancels inaudible commands against VCS, which we discuss as an attack model in Section 4.3.2. NAuth [122] uses the nonlinear effect to authenticate devices. Unlike most of these methods, Patronus aims to preserve privacy by adding a removable scramble generated by ultrasonic signals to the recorded human speech. From a technical perspective, Patronus is unique in that it takes into account third- and higher-order terms of the nonlinear effect. Our experiments show those high-order terms can affect recordings, whereas most existing methods (e.g., AIC) only consider the second-order term and assume the higher-order sub-band of the microphone is clean.

4.1.2 Dual Channel Applications

Some applications leverage the difference between humans and devices. For example, human eyes and devices have different perceptions of flicker frequency, and technologies exist that use this phenomenon to communicate between a screen and a camera without affecting human vision [128–131]. Likewise, some technologies modulate acoustic signals in ways that no human can detect to communicate between devices [132, 133].

The difference between the sensitivity of humans and devices is also used in privacy protection. Kaleido [134] protects a movie's copyright by adding a high-frequency flashing distractor into movie frames that cannot be seen by human eyes. If such a protected movie is subsequently recorded by an unauthorized camera equipped with a rolling shutter, the distractor becomes visible in the unauthorized recording because of the camera's high sample rate, making the pirated recording a low-quality one. LiShield [135] also uses the rolling shutter effect to reduce the quality of photos: lights of different colors flash at alternating high frequencies that provide normal lighting, since human eyes cannot sense the flashing, while cameras are affected because the rolling shutter samples column by column, so unexpected color stripes appear on the photo. In the end, it prevents unauthorized cameras from taking photos. Although Patronus has a similar motivation to prevent unauthorized recordings, it differs from these two works as it targets acoustics rather than visuals.

4.2 Nonlinear Behavior of Common Microphones

In this section, we provide a brief primer on the nonlinearity of common microphones; a more comprehensive introduction can be found in recent papers [17, 119]. Ideally, COTS microphones are linear systems. Given the input signal s(t), the output signal y(t) is expected to be a linear combination of the input signal, i.e., y(t) = A_1 s(t), where A_1 is the complex gain quantifying the change of phase and amplitude. Due to the physical properties of materials and variations in manufacturing, the components of a common microphone, such as the diaphragm and the pre-amplifier, are imperfect and typically do not constitute a linear system. As a result, COTS microphones, which are widely equipped on smartphones and smartwatches, typically exhibit nonlinear behavior.
Specifically, the output signal y(t) is subject to the nonlinear effect, i.e., y(t) = A_1 s(t) + A_2 s^2(t) + A_3 s^3(t) + ..., where the power gains of the components satisfy |A_m| > |A_n| for m < n. When the input is composed of two different ultrasonic frequencies, the output of a nonlinear microphone contains several new shadow sounds whose frequencies are linear combinations of the two input frequencies. Assuming that the input signal is s(t) = cos(2π f_1 t) + cos(2π f_2 t), where f_1 and f_2 are ultrasonic frequencies, the output signal is y(t) = \sum_{i=1}^{+\infty} A_i s^i(t). Without loss of generality, we assume f_1 > f_2 in the following discussion. For each component A_i s^i(t),

\[
s^i(t) = \left(\cos(2\pi f_1 t) + \cos(2\pi f_2 t)\right)^i
       = \mu + \sum_{j=1}^{i} \left[ \alpha_j \cos(2\pi j f_1 t) + \beta_j \cos(2\pi j f_2 t) \right]
             + \sum_{j=1}^{i-1} \left[ \lambda_j \cos(2\pi (j f_1 - (i-j) f_2) t) + \gamma_j \cos(2\pi (j f_1 + (i-j) f_2) t) \right],
\]

where α_j, β_j, λ_j and γ_j are coefficients of the polynomial expansion, and µ is the resulting constant term.

After the pre-amplifier, the signals pass through an embedded low-pass filter whose cutoff frequency is usually 24 kHz. Since f_1 and f_2 are both ultrasonic frequencies, j f_1 and j f_2 are all ultrasonic frequencies. However, if i = 2j, then j f_1 − (i − j) f_2 = j(f_1 − f_2) may be a non-ultrasonic frequency when j is small enough. Therefore, when the input signal is s(t) = cos(2π f_1 t) + cos(2π f_2 t), new audible cosine waves cos(2π j (f_1 − f_2) t) appear, where j = 1, 2, ..., k, k ≤ i, and k(f_1 − f_2) ≤ 24 kHz. Existing works like BackDoor [17] and DolphinAttack [120] make use of A_2 s^2(t) but ignore higher-order components; they essentially assume that for i > 2, |A_i| is relatively small and has little effect on the output signal. However, in our experiments, we find that more high-order components should be taken into consideration, as they do affect the output signal.
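The appearance of the audible components at multiples of f_1 − f_2 can be reproduced with a toy numerical model. The sketch below only illustrates the math above, not the Patronus pipeline; the tone frequencies, the gains A_i, the simulation rate, and the idealized 24 kHz cutoff are all assumed values.

```python
import numpy as np

fs = 192_000                                   # simulation rate, high enough for ultrasound
t = np.arange(0, 0.5, 1 / fs)
f1, f2 = 40_000, 38_000                        # two ultrasonic tones (assumed values)
s = np.cos(2 * np.pi * f1 * t) + np.cos(2 * np.pi * f2 * t)

# Toy nonlinear microphone: decreasing gains for higher-order terms
A = [1.0, 0.1, 0.01, 0.001]
y = sum(a * s ** (i + 1) for i, a in enumerate(A))

# Emulate the embedded low-pass filter (cutoff ~24 kHz) in the frequency domain
Y = np.fft.rfft(y)
freqs = np.fft.rfftfreq(len(y), 1 / fs)
Y[freqs > 24_000] = 0
spec = np.abs(Y)

# Shadow components show up at j*(f1 - f2) = 2 kHz, 4 kHz, ...; 5 kHz stays empty
for f in (2_000, 4_000, 5_000):
    print(f, spec[np.argmin(np.abs(freqs - f))].round(1))
```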
4.3 Design

4.3.1 Overview

As shown in Figure 4.2, there are three parties involved in Patronus: the Scramble Transmitter, authorized devices with Descramble Receivers, and unauthorized devices.

Figure 4.2 System overview.

The Scramble Transmitter sends a series of scramble signals with randomly varying frequencies. To ensure that unauthorized voice recordings will be affected, the frequencies of the recorded scrambles should be located in the human voice band. Therefore, we use the Scramble Generator to generate random frequencies in the target range, store them as a secret key, and send them to the Descramble Receivers through Wi-Fi, Bluetooth, or other media. The Scramble Generator then generates cosine wave segments according to these frequencies. The generated segments are sent to the Frequency Shifter, which increases their frequencies by f_0, an ultrasonic frequency. To ensure the scramble signal is picked up by the microphones of unauthorized devices through the nonlinear effect, we design a Constant Cosine Wave Generator to transmit a cosine wave with a constant ultrasonic frequency of f_0.

During a human conversation protected by Patronus, the actual human speech plus the two ultrasonic signals arrive essentially simultaneously at the recorders (both authorized and unauthorized) and at human ears. Human ears do not detect the ultrasonic signals and thus receive the conversation with no additional noise. As discussed in Section 4.2, the two ultrasonic signals generate a shadow audible signal that is included in any recording made by a COTS microphone due to nonlinear effects. This applies to both authorized and unauthorized devices. Authorized devices, which receive a secret key from the Scramble Transmitter, can generate the scramble waveform. They can then feed the scramble waveform along with the scrambled recording into an adaptive filter to extract clear speech from the scrambled speech. The details of descrambling are discussed in Section 4.3.5.

We must overcome three challenges in order to design Patronus. First, we must design a system whose working area is as large as possible. This is difficult because a high-frequency sound wave typically travels along a straight line, meaning a straightforward arrangement of ultrasonic generators will only cover a small area defined by a limited range of angles. Second, there is a trade-off between a shorter and a longer period of the scramble frequencies. As the period increases, the system is more vulnerable to STFT attacks on unauthorized recordings; as the period decreases, the difficulty of descrambling increases. Our goal is to maximize the information recovered by authorized devices over unauthorized ones without exposing the scramble pattern to STFT; these details are discussed in Section 4.3.3.4. Third, when the frequency changes frequently, a severe ringing effect (Section 4.3.3) occurs in the scrambled recording, which affects even the recordings made by authorized devices after descrambling. We use chirps to connect the frequency components of the scramble to eliminate sudden changes of the input to the ultrasonic speakers, hence minimizing the ringing effect and enhancing the quality of the speech recovered by authorized devices.

4.3.2 Attack Model

Based on common acoustic processing technologies and known properties of nonlinearity effects, we consider the following types of attacks:

4.3.2.1 Short-Time Fourier Transform (STFT)

One natural way for an unauthorized device to try to extract a useful recording from its scrambled recording is to analyze the recording with STFT and filter out suspicious frequencies. We address this attack model by changing the scramble frequency according to a finely tuned period model, making it impossible for the attacker to obtain each exact scramble frequency along with its start and end time. A detailed analysis is provided in Section 4.3.3.4. Even with the correct scramble frequencies available, bandpass filters will not work because the scramble frequencies are selected from the human voice band, so the frequencies from the chirps and those from human speech are mixed together. To prove that Patronus can defeat this attack model, we simulate the attack scenario in which (1) the attacker is aware that our scramble pattern consists of varying continuous waves smoothed by chirps, (2) the attacker calculates approximate scramble frequencies with STFT, and (3) the attacker applies an NLMS adaptive filter (Section 4.3.5.4) to remove the scramble using the approximate scramble frequencies obtained from STFT. Our simulated attack experiments, provided in Section 4.5.8, show that this attack fails because the approximate scramble frequencies are not accurate enough.
4.3.2.2 Extra Ultrasonic Transmitter Attack

After DolphinAttack [120] proposed injecting malicious commands into ultrasound, AIC [127] adds three more ultrasonic transmitters to cancel the malicious commands and protect Voice Control Systems (VCS). AIC assumes that both legitimate and malicious commands are within the lower sub-band of the microphone's sensible frequency band; the added ultrasonic transmitters project only the malicious commands onto the higher sub-band, which can then be used to filter the malicious commands in the lower sub-band. With fast-changing scramble frequencies, we can cover the whole frequency band and make sure no clean band is left for attackers.

4.3.2.3 Wi-Fi/Bluetooth Sniffing

Attackers can sniff the Wi-Fi or Bluetooth channel to get the scramble pattern transmitted from the Scramble Transmitter to the authorized device. However, there are many cryptographic approaches to prevent attackers from sniffing channels. For example, we can encrypt the scramble pattern with AES-CTR using a pre-shared key and then directly send it to authorized devices.

4.3.2.4 Physical Attacks

There are also some physical attack models. First, attackers can place an obstacle in front of the Scramble Transmitter. However, attackers cannot do this secretly, and nobody would like to do so. Second, attackers may simply wrap a cover around their microphones, but the cover itself may defeat the attackers' objective of making a good recording. Although Patronus cannot perfectly handle such attack models, it increases the difficulty of making an unauthorized recording. Finally, attackers may conduct experiments to discover where Patronus fails. This can be addressed by enlarging the working area through methods that we discuss later.

4.3.3 Ultrasonic Scramble Modulation

Two ultrasonic signals are superimposed at the recorders to create the desired low-frequency component. In the design of the scramble using ultrasonic signals, we mainly consider the following issues:

4.3.3.1 Range of Frequency

The first issue is how to make it hard to cancel the scramble without the key. Basically, the range of human speech fundamental frequency is from 85 Hz to 255 Hz [136, 137]. If the scramble consists of multiple random frequencies from this range, it is hard for attackers to cancel the scramble using linear filters: applying a linear filter, e.g., a highpass filter, would not only cancel the scramble but also change the original human speech. To ensure the scramble covers all human speech frequencies in practice, we modulate the scramble over a wider frequency band than [85, 255] Hz.

4.3.3.2 Random Frequencies

If we always used specific frequencies to generate the scramble, attackers could analyze the frequency spectrum of their recordings to infer the scramble frequencies and, with those, recover the original audio signals. To address this issue, we choose the scramble frequencies randomly and change them periodically over time. The sequence of scramble frequencies can be thought of as a one-time-pad key; without the sequence, it is difficult for attackers to remove the scramble.

4.3.3.3 Ringing Effect

Frequent changes of the scramble frequencies produce a ringing effect [17] that makes it challenging for authorized devices to produce a high-quality descrambled recording. Specifically, the ringing effect incurs heavy-tailed impulse responses that remain in descrambled recordings, as shown in Figure 4.3 (a) and (b).
Since the ringing effect occurs when the input changes suddenly, we use a chirp signal to connect two adjacent segments with different frequencies in the scramble to smooth such a sudden change. Specifically, when the scramble changes from frequency A to frequency B, we add a transition signal that starts at frequency A and moves linearly to end at frequency B.

Figure 4.3 Illustration of how linear chirps mitigate the ringing effect. (a) Scrambled without chirps. (b) Descrambled without chirps. (c) Scrambled with chirps. (d) Descrambled with chirps.

The impulse incurred by ringing effects can have a very high amplitude or power. It suppresses other signals due to the microphone's Passive Gain Suppression [17]. Figure 4.3 confirms that the ringing effect is mitigated by chirps. Figure 4.3 (a) shows a scrambled recording with no chirps; the resulting descrambled recording in Figure 4.3 (b) has many areas where most of the signal is suppressed. In contrast, Figure 4.3 (c) shows a scrambled recording with chirp signals; the resulting descrambled recording in Figure 4.3 (d) does not have the peak signals corresponding to the ringing effect, and the rest of the signal is not suppressed.

4.3.3.4 Duration of Each Frequency

The next challenge is choosing the proper duration for each frequency in the sequence of scramble frequencies. Intuitively, if we gave each frequency a long duration, unauthorized devices could easily split the recording into multiple segments where each segment is only protected by a constant-frequency scramble. They could then apply simple techniques, such as a linear bandpass filter, to the scrambled recording to extract a clear speech recording.

More generally, there are two competing issues in choosing the duration of each scramble frequency, namely defending against the STFT attacks discussed in Section 4.3.2.1 and ensuring that authorized devices can obtain high-quality descrambled recordings. We first consider defending against STFT attacks. An STFT attack can successfully remove the scramble waveform if it can accurately infer both the frequencies and the time periods of each scramble frequency in the sequence. When the window length is n samples, the frequency resolution is ∆f = fs/n = fs/(fs × t) = 1/t, where fs is the sampling rate and t is the duration of the window. Taking 0.1 s as an example, the frequency offset of STFT can reach 10 Hz. If the attacker tries to improve the frequency resolution by lengthening the window, the accuracy of the estimated time periods for a given scramble frequency diminishes. If the scramble frequency duration is long, the scramble frequency exhibits fewer changes within any given window, so an STFT attack can use longer windows to accurately estimate the frequency along with exact estimates of its time period. Therefore, to thwart STFT attacks, we should make the frequency duration as short as possible. However, a too-short duration may misshape the scrambled recording due to imperfect hardware. A typical microphone and speaker use a diaphragm to sense and generate vibration; this diaphragm moves continuously and cannot change its position instantaneously. Circuit latency also makes it hard for the system to respond to frequent and instant changes. As a result, the scrambled waveform would be slightly distorted.
This means the NLMS adaptive filter at authorized devices may not correctly descramble the scrambled waveform, because it does not expect the distortion caused by frequent frequency changes. Therefore, the frequency duration cannot be too short. In summary, to balance these competing concerns, we must find a frequency duration that maximizes the information recovered by authorized devices compared to the information recovered by unauthorized devices. To identify a good frequency duration, we measure the descrambling performance with different frequency durations in Section 4.5.8.

4.3.3.5 Key Construction

We have two choices for constructing the key that grants the recording privilege to authorized devices. One is to directly use the scramble waveform generated by the Scramble Generator as the key: after getting the scramble waveform, authorized devices remove the scramble from the recorded audio. But there are some issues to consider. First, the sampling rate may vary from one authorized device to another, which means that, in terms of the digital signal, devices with different sampling rates get different representations of the same scramble waveform. To grant the privilege to all devices, the Scramble Transmitter would have to generate a different digital scramble waveform for each sampling rate, which results in high computational overhead. Second, the sampling rates of the Scramble Generator and an authorized device may also differ, so the scramble emitted by the speaker might have a different representation than the recorded waveform.

In Patronus, we construct the key in another way: we select the frequency sequence used to generate the scramble as the key. After receiving the frequency sequence, an authorized device can reconstruct the scramble waveform at its own sampling rate, which we discuss in more detail later. The authorized device can then use the reconstructed scramble waveform to remove the scramble from the recording and obtain the clear speech.

With the discussion above, we formally describe the scramble generation. We set one speaker to transmit an ultrasonic continuous wave S_1(t) = cos(2π f_0 t), while the other speaker transmits continuous waves linked by chirps, S_2(t) = cos(2π f(t) t), where

\[
f(t) =
\begin{cases}
f_i, & (2i-2)\Delta t \le t < (2i-1)\Delta t, \\
f_i + \dfrac{f_{i+1} - f_i}{\Delta t}\, t, & (2i-1)\Delta t \le t < 2i\Delta t,
\end{cases}
\tag{4.1}
\]

f_i (i = 1, ..., n) are randomly generated constant frequencies, and ∆t is the duration of a single sine wave or a chirp. The induced low-frequency noise is

\[
R(t) = \cos(2\pi (f(t) - f_0) t). \tag{4.2}
\]

To ensure R(t) covers the human voice, f_i (i = 1, ..., n) are sampled from [f_low + f_0, f_high + f_0], where [f_low, f_high] covers the human voice band.
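A minimal sketch of this construction is shown below, assuming NumPy and a digital sampling rate fs at the transmitter; the segment length, f0, and voice-band limits are placeholder values, and we keep the phase continuous across segments, a detail that Equations (4.1)-(4.2) leave implicit.

```python
import numpy as np

def make_scramble(n_seg, dt, fs, f0=40_000.0, f_low=85.0, f_high=4_000.0, seed=None):
    """Generate the chirp-linked ultrasonic scramble S2(t) and its key.

    n_seg -- number of random constant-frequency segments
    dt    -- duration of one constant segment or one chirp (seconds)
    fs    -- sampling rate of the transmitter (Hz)
    Returns (key, waveform): key is the random frequency sequence f_1..f_n.
    """
    rng = np.random.default_rng(seed)
    key = rng.uniform(f_low + f0, f_high + f0, n_seg)       # secret frequency sequence
    n = int(dt * fs)
    chunks, phase = [], 0.0
    for i, fi in enumerate(key):
        t = np.arange(n) / fs
        chunks.append(np.cos(2 * np.pi * fi * t + phase))    # constant-frequency segment
        phase += 2 * np.pi * fi * n / fs
        if i + 1 < n_seg:
            # linear chirp from f_i to f_{i+1} to suppress the ringing effect
            f_inst = np.linspace(fi, key[i + 1], n, endpoint=False)
            chunks.append(np.cos(phase + 2 * np.pi * np.cumsum(f_inst) / fs))
            phase += 2 * np.pi * np.sum(f_inst) / fs
    return key, np.concatenate(chunks)

# The paired speaker plays S1(t) = cos(2*pi*f0*t); at a recording microphone, the
# nonlinear mixture of S1 and S2 yields the audible scramble R(t) near f(t) - f0.
```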
4.3.4 Enlarge Scramble Working Area

The scramble signal is generated by two ultrasonic signals, which raises another issue, as an ultrasonic wave typically propagates in a straight line. In other words, to prevent a certain device from recording, the ultrasonic speaker has to be pointed directly towards that device. This results in a limited coverage area for ultrasonic anti-recording solutions. Inspired by lamps that often use a bow-shaped cover to reflect the light beam in many directions, we build a reflection layer that reflects the ultrasonic wave in many directions. As Figure 4.4 shows, we put the ultrasonic speakers near the center of the reflection layer and place the devices (authorized and unauthorized) in the working area. When the ultrasonic wave hits the reflection layer, it gets reflected in many directions, leading to a much larger coverage area.

Figure 4.4 Enlarge working area with reflection.

4.3.5 Grant Recording Privilege

The goal of Patronus is not only to block unauthorized devices from recording audio, but also to provide authorized devices with a mechanism to recover speech. Patronus achieves this by creating a way for authorized devices to remove the scramble from the scrambled recording. Specifically, Patronus grants the clear recording privilege to authorized devices through the following steps.

4.3.5.1 Key Transmission

The Descramble Receiver needs the waveform of the scramble generated by the Scramble Generator before it can remove the scramble. Intuitively, if it had the pure scramble waveform, it could remove the scramble from the recorded audio by subtracting the scramble waveform from the recorded audio waveform. The scramble waveform here acts as the key for deciphering the recorded audio. We send the key through non-acoustic channels such as Wi-Fi or Bluetooth with cryptographic protection to prevent eavesdroppers from obtaining it. Additionally, because of the randomness of the scramble frequencies, eavesdroppers cannot get a usable scramble waveform by listening to the acoustic channel: they can only get the combination of the interfered speech with the scramble, or the scramble without speech, which is independent of the successive scramble waveform.

4.3.5.2 Scramble Reconstruction

As discussed in Section 4.3.3, the Scramble Transmitter sends the random frequency sequence instead of the scramble waveform to authorized devices as the key. Patronus needs to use these frequencies to reconstruct the scramble waveform before removing the scramble. An authorized device uses Equation (4.2) and its recording sampling rate to generate the scramble waveform.

4.3.5.3 Synchronization

We need to synchronize the reconstructed scramble with the recorded scramble before removing it from the recording. Specifically, we choose a segment from the reconstructed scramble as the template, e.g., the beginning segment. We then use cross-correlation to find the segment of the recording that is most similar to the template and synchronize the recorded scramble and the reconstructed scramble by aligning the two segments.
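A minimal sketch of this alignment step, assuming the recording and the reconstructed scramble are NumPy arrays at the same sampling rate; the template length is an arbitrary choice for illustration.

```python
import numpy as np

def align_offset(recording, scramble, template_len=16_000):
    """Return the sample offset at which the reconstructed scramble best
    aligns with the recorded one, via cross-correlation of a template."""
    template = scramble[:template_len]                     # e.g., the beginning segment
    corr = np.correlate(recording, template, mode="valid")
    return int(np.argmax(np.abs(corr)))

# offset = align_offset(recs, s); the reconstructed scramble is then compared
# against recs[offset : offset + len(s)] in the descrambling step.
```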
4.3.5.4 Adaptive Filtering

Now we have the waveform of the scramble. The next task is to remove the scramble from the recorded audio given the known scramble waveform. In practice, we cannot directly subtract the scramble from the recorded audio because, as the sound propagates through the air, it is distorted by reflection and attenuation. We therefore use an adaptive filter to remove the waveform-known scramble.

Adaptive filters are widely used in Active Noise Cancellation (ANC) headsets. Technically, a reference microphone outside the headset captures the noise, and the digital signal processor (DSP) generates an anti-noise wave according to the captured noise. When the noise wave and the anti-noise wave arrive at the ear, they cancel each other. In Patronus, we denote the speech as x_1. It propagates through the acoustic channel h_1 and arrives at the authorized device as h_1 * x_1, where the operator * denotes convolution. Additionally, we denote the scramble waveform generated by the nonlinear effect and recorded by the authorized device as x_2. It propagates through another channel h_2 and arrives at the authorized device as h_2 * x_2. Therefore, the audio recorded by the authorized device is

\[
y = h_1 * x_1 + h_2 * x_2. \tag{4.3}
\]

Similar to ANC headsets, we treat the scramble x_2 as the noise. Unlike ANC headsets, the noise here is generated from the key, as discussed in Section 4.3.5.2. Therefore, we can use the Normalized Least-Mean-Square (NLMS) adaptive filter [125] to remove the scramble. Formally, we try to find a channel vector h'_2 that solves the optimization problem

\[
\min \; E\left[(y - h'_2 * x_2)^2\right]. \tag{4.4}
\]

When the expectation in Equation (4.4) is minimized, h_2 ≈ h'_2. Therefore, h_1 * x_1 ≈ y − h'_2 * x_2, which can be regarded as the speech without the scramble. Stochastic gradient descent is usually adopted to solve the optimization problem defined by Equation (4.4), but it is hard to derive the gradient of the expectation, so researchers use (y − h'_2 * x_2)^2 in place of the expectation. In this way, the noise gets canceled [138].

Following this design, we can develop a mechanism that prevents unauthorized recording while supporting authorized recording. The mechanism also prevents attackers from descrambling without authorization. Figure 4.5 gives an example: a piece of VOA news audio is used as the original record; the attack result suffers severe scramble effects just like the unauthorized record, while the authorized record removes almost all of the scramble.

Figure 4.5 Illustration of original waveform, authorized waveform, unauthorized waveform, and descrambled waveform by STFT attack. (a) Original waveform. (b) Authorized waveform. (c) Unauthorized waveform. (d) Descrambled by STFT attack.
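A minimal NLMS sketch of this step is shown below, assuming both signals are aligned NumPy arrays; the tap count, step size, and regularization constant are illustrative defaults, not the per-device settings reported in Section 4.4.2.

```python
import numpy as np

def nlms_cancel(y, x2, n_taps=256, mu=0.005, eps=1e-8):
    """Estimate h'_2 with an NLMS adaptive filter and return e = y - h'_2 * x2,
    i.e., the recording with the (waveform-known) scramble removed.

    y  -- scrambled recording
    x2 -- reconstructed, synchronized scramble waveform (same length as y)
    """
    w = np.zeros(n_taps)                 # current estimate of the channel h'_2
    e = np.zeros(len(y))
    for i in range(n_taps, len(y)):
        x_win = x2[i - n_taps:i][::-1]   # most recent reference samples
        y_hat = w @ x_win                # predicted scramble component
        e[i] = y[i] - y_hat              # error sample = descrambled sample
        # normalized LMS update of the channel estimate
        w += (mu / (eps + x_win @ x_win)) * e[i] * x_win
    return e
```

Applied again with the scramble regenerated at k-times the key frequencies (k = 2, ..., 6), this is the core of the descrambling procedure described later in Algorithm 4.1.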
4.4 Implementation

This section discusses the details of the implementation of Patronus, which contains two parts: the Scramble Transmitter and the Descramble Receiver for authorized devices. We use an ordinary smartphone with its built-in audio recorder as the Unauthorized Device or the Authorized Device.

4.4.1 Scramble Transmitter

4.4.1.1 Hardware Implementation

As Figure 4.6 shows, we use eight TCT40-16R/T 16 mm ultrasonic transducers. Half of them play the frequency-shifted scramble and are connected in parallel; the other half play the fixed-frequency cosine wave and are connected in parallel as well. We utilize an AOSHIKE DC12V-24V 2.1 Channel TPA3116 Subwoofer Amplifier Board to enhance the power of the output ultrasonic signals. The two waveforms are played through a stereo channel: the frequency-shifted scramble uses the left channel, and the constant-frequency cosine wave uses the right channel.

Figure 4.6 Implementation of scramble transmitter.

As discussed in Section 4.3.4, we use a reflection layer to enlarge the working area. In this prototype, we use an iron wok as the reflection layer; its opening diameter is 30 cm and its depth is 10 cm. As shown in Figure 4.7, the ultrasonic transducers are placed towards the center of the iron wok.

Figure 4.7 Prototype of Patronus.

4.4.1.2 Format of Key

As mentioned in Section 4.3, Patronus uses the frequency sequence as the key. This key must include the duration of each frequency in addition to the frequency itself in order for the Descramble Receiver to generate the scramble waveform. Thus, our key file includes the frequency sequence plus the sample rate of the Scramble Transmitter and the number of samples of each frequency.

4.4.2 Descramble Receiver for Authorized Devices

We use an ordinary smartphone as an authorized device. The authorized device receives the key from the Scramble Transmitter. After the audio is recorded, the smartphone reconstructs the scramble waveform from the given key and leverages the NLMS adaptive filter to cancel the scramble. Formally, it takes the following steps.

4.4.2.1 Reconstruct Scramble Waveform

As mentioned, in addition to the frequency sequence, the received key also contains the sampling rate of the Scramble Transmitter, denoted by f_st, as well as the number of samples of each frequency, n_t. With the known sampling rate of the authorized device f_sr, the number of recovered samples for each scramble frequency component is

\[
n_r = \frac{f_{sr} \, n_t}{f_{st}}. \tag{4.5}
\]

After getting n_r, the authorized device uses the same process as the Scramble Transmitter to generate the scramble, i.e., generating the discrete cosine signals with frequencies f_i and f_{i+1} and connecting them by a chirp signal with start frequency f_i and end frequency f_{i+1}, where f_i and f_{i+1} come from the frequency sequence in the key.

4.4.2.2 Normalized Least-Mean-Square (NLMS) Adaptive Filter

After reconstructing the scramble waveform, we can use the NLMS adaptive filter to cancel the scramble from the scrambled record. Specifically, we feed the scrambled record rec_s and the scramble waveform s into the NLMS adaptive filter to get the descrambled waveform e by removing s from rec_s. According to the discussion in Section 4.2, the scramble is generated not only by the frequencies in the given frequency sequence but also by high-order frequencies that are multiples of the target frequencies. Therefore, after getting e from the NLMS adaptive filter, we still need to iteratively remove the multiples of the frequency-sequence scramble with the NLMS adaptive filter. That is, we iteratively feed e and the scramble waveform generated from the k-times multiple of the frequency sequence into the NLMS adaptive filter, where k = 2, 3, 4, 5, 6 in our prototype. In summary, the procedure an authorized device follows to remove the scramble from the record is shown in Algorithm 4.1.
Algorithm 4.1 Remove Scramble from the record.
Input: rec_s, f_sr, f_st, n_t, the frequency sequence f[1..n]
Output: speech record without scramble, e
1: n_r ← f_sr · n_t / f_st
2: e ← rec_s
3: for k = 1 to 6 do
4:     s ← ScrambleGenerator(k × f[1..n], n_r)
5:     e ← NLMS-Adaptive-Filter(e, s)
6: end for
7: return e

The NLMS adaptive filter can be found in many open-source libraries, e.g., MATLAB and Python. Due to the selective frequency response of different smart devices, each model has its own parameter setting. In the implementation, we choose 500 taps and a step size of 0.005 for an iPhone, 100 taps and a step size of 0.003 for a Pixel, and 300 taps and a step size of 0.005 for a Galaxy S9.

4.4.3 Simulated STFT Attacker

We also simulate an STFT attacker to verify whether or not Patronus can prevent such an attack. Specifically, as discussed in Section 4.3.2.1, we apply STFT to the scrambled recording using the MATLAB function stft to infer its frequency sequence. We then feed the frequency sequence to an NLMS adaptive filter to get the descrambled recording. Experiment results are shown in Section 4.5.8. Here, we illustrate an example in Figure 4.5, which contains the original waveform, the authorized waveform, the unauthorized waveform, and the waveform descrambled by the STFT attack. As illustrated by the figure, the authorized waveform is similar to the original waveform, the unauthorized waveform is different from the original one, and the unauthorized waveform is similar to the waveform descrambled by the STFT attack. Therefore, our prototype shows that Patronus can block unauthorized recording while allowing authorized recording, and that it can prevent STFT attacks.

4.5 Evaluation

4.5.1 Overview

To evaluate the performance of Patronus, we select six news speech waveforms from Voice of America (VOA) and denote them as A - F. The news speeches are read by a male, a female, or both alternately, sometimes with background music. A normal speaker (shown in Figure 4.7) is set to play these news waveforms, and we also read the news ourselves. While the news waveforms are played under different conditions, we start Patronus to interfere with the unauthorized recording device. Meanwhile, an authorized device is recording too. Later we apply scramble cancellation to the recordings from the authorized device. After getting the scrambled recordings and the scramble-canceled recordings, the following metrics are adopted to measure the performance of Patronus.

4.5.1.1 Perceptual Evaluation of Speech Quality (PESQ)

PESQ is a commonly used metric of speech quality [126]. It is widely adopted by phone manufacturers, network equipment vendors, and telecom operators. Technically, the inputs include a clear speech signal as the reference and a signal to be measured. The output is a Mean Opinion Score (MOS) [139] ranging from −0.5 to 4.5. A high PESQ score means that the corresponding speech has high hearing quality, and vice versa. Typically, PESQ values from 1.00 to 1.99 mean "No meaning understood with any feasible effort", while values from 3.80 to 4.50 mean "Complete relaxation possible; no effort required". However, we cannot hold an audio recording to the standard of lossless communication. To fit PESQ to the characterization of Patronus, we measure the PESQ of recordings made without scrambling by turning off Patronus, and use that result as the baseline. As shown in Figure 4.8a, such recordings have PESQ between 2.2 and 2.7.
We regard them as the upper bound for both unauthorized and authorized recordings. In the following experiments, we use a PESQ implementation written in MATLAB [140] to compute the PESQ score.

4.5.1.2 Speech Recognition Vocabulary Accuracy (SRVA)

We also use a speech recognition service to measure the effectiveness of scrambling and descrambling. Specifically, we apply Google's Speech To Text (STT) service to transform the acoustic signals into text. We first use the STT service to recognize the original speech without interference and treat the recognized word sequence w_c as the ground truth. Then we use the STT service to recognize the scrambled speech and the descrambled speech, and use w_s and w_d to denote their results, respectively. We define the Speech Recognition Vocabulary Accuracy (SRVA) as

\[
\mathrm{SRVA} = \frac{\sum_{i \in w_s} \mathrm{isTrue}(i \in w_c)}{|w_c|}
\quad \text{or} \quad
\frac{\sum_{i \in w_d} \mathrm{isTrue}(i \in w_c)}{|w_c|},
\]

and use it to quantify the effectiveness of scrambling and descrambling. Note that isTrue(i ∈ w_c) returns 1 when i is a word from w_c, and 0 otherwise. We define the SRVA Error as 1 − SRVA, which indicates the error rate of recognition with the STT service.
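A minimal sketch of the metric, assuming the STT outputs are already available as lists of words; the function and variable names are ours for illustration.

```python
def srva(reference_words, recognized_words):
    """Speech Recognition Vocabulary Accuracy: the fraction of recognized
    words that appear in the reference transcript w_c."""
    reference = set(reference_words)                        # w_c from the clean speech
    hits = sum(1 for w in recognized_words if w in reference)
    return hits / len(reference_words)

# Example with a scrambled (w_s) and a descrambled (w_d) recognition result:
w_c = "the quick brown fox jumps over the lazy dog".split()
w_s = "the box".split()
w_d = "the quick brown fox jumps over a lazy dog".split()
print(1 - srva(w_c, w_s), 1 - srva(w_c, w_d))               # SRVA Error before vs. after descrambling
```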
4.5.2 Effectiveness of Scrambling and Descrambling

We split the 6 news speech waveforms into 55 segments (1650 seconds in total), each 30 seconds long. Both the authorized and unauthorized devices are Apple iPhone X in this experiment, as they are in the following experiments except that of Section 4.5.5. As shown in Figure 4.8a, with Patronus’s scrambling, the hearing quality of most segments is extremely low. Specifically, 44 out of 55 (80.0%) segments have PESQ scores lower than 1.5. For SRVA, overall, only 551 out of 2796 (19.7%) words are recognized correctly. More detailed results are shown in Figure 4.8b. The upper half shows the CDF of the SRVA Error: 50% of the recordings have SRVA Error lower than 0.84, and 80% of the recordings have SRVA Error lower than 0.98. The lower half shows the ratio of SRVA between scrambled recordings and original waveforms. The results show that all of the news waveforms have a recognition ratio lower than 0.3. We note that if a word appears multiple times in a speech, SRVA can be higher or lower than the actual word recognition rate. However, duplicated words have little impact here because the duplicate rate of every segment, i.e., the ratio between the count of a specific word and the total count of words in the segment, is lower than 5%.

Figure 4.8 (a) PESQ of recordings captured by unauthorized and authorized devices, and PESQ of recordings without scrambling (Patronus turned off) as the baseline. (b) Upper half: the CDF of SRVA Error of scrambled recordings from the unauthorized device. Lower half: the ratio of SRVA between scrambled recordings and original waveforms. (c) Upper half: the CDF of SRVA Error of descrambled recordings from the authorized device. Lower half: the ratio of SRVA between descrambled recordings and original waveforms.

To evaluate the effectiveness of descrambling, an authorized device records the speech under the scrambling from Patronus. The authorized device then cancels the scramble using the received key. As shown in Figure 4.8a, after descrambling, only 9 out of 55 (16.3%) segments have PESQ scores lower than 1.5. On average, descrambled recordings have 1.6x higher PESQ scores than their corresponding scrambled recordings. As for SRVA, we show the CDF of the SRVA Error in the upper half of Figure 4.8c. These results show that 50% of the descrambled recordings have SRVA Error lower than 0.43, which is 49% lower than the scrambled recordings. Moreover, 80% of the descrambled recordings have SRVA Error lower than 0.64, which is 35% lower than the scrambled recordings. As shown in the lower half of Figure 4.8c, the ratios of SRVA between descrambled recordings and original waveforms are higher than 0.4 and lower than 0.8. They are at least 2x better than those of the scrambled recordings. The quality of the descrambled recordings is not as good as that of the originals because residual components of the scramble remain after applying the NLMS adaptive filter. Moreover, background music and the volume of the original waveform also affect the quality of the descrambled recordings. For example, news C has a lower ratio after being descrambled by the authorized device compared to the other news clips because it contains background music that can affect the performance of authorized devices. The background music also affects the SRVA of the recording without scrambling, i.e., only 223 out of 295 words are recognized. The reader of news E reads at a lower volume than the others, so news E also has a lower ratio after being descrambled by the authorized device compared to the other news clips.

4.5.3 Effectiveness of Human Voice Scrambling and Descrambling

To verify whether Patronus works for a real human speaker rather than a sound player, we read the news ourselves and calculate SRVA. As shown in Figure 4.9a, Patronus can effectively scramble and descramble the human voice. Specifically, for the scrambled recordings, the median SRVA Error is 0.74, and 80% of the scrambled recordings have SRVA Error lower than 0.83. For the descrambled recordings, the median SRVA Error is 0.27, and 80% of the descrambled recordings have SRVA Error lower than 0.4. The descrambling effectiveness for the human speaker is better than that for the recorded sounds because the recorded sounds from VOA sometimes include background music.

4.5.4 Effectiveness of Human Recognition of Scrambled Recordings and Descrambled Recordings

Because there might be differences between machine learning-based speech recognition and human speech recognition, we invite 11 volunteers to write down words after listening to the 55 scrambled recordings and the 55 descrambled ones. The results are shown in Figure 4.9b. People react differently to noise. Some people are very sensitive, and the scrambled noise makes them very uncomfortable. Note that the noise is generated by ultrasound speakers and is only captured through the
nonlinear effects of microphones, so it will not disturb the people in the original conversation. It will only be heard after being recorded by unauthorized devices. Further, authorized devices will be able to filter out such noise, eliminating the discomfort for those listeners. The information recovered by humans listening to descrambled recordings is still better than that recovered from scrambled ones: 50% of the scrambled recordings have SRVA Error lower than 0.63, and 80% of the scrambled recordings have SRVA Error lower than 0.86. As a comparison, 50% of the descrambled recordings have SRVA Error lower than 0.34, and 80% of the descrambled recordings have SRVA Error lower than 0.63.

Figure 4.9 (a) Compare SRVA before and after descrambling for the human voice. (b) Compare SRVA before and after descrambling for human recognition. (c) Compare average PESQ and SRVA among different models.

4.5.5 Effectiveness on Different Mobile Models

To verify whether Patronus works on different mobile models, we test it on three devices: an Apple iPhone X, a Samsung Galaxy S9, and a Google Pixel. We play all 55 segments using the normal speaker and calculate the average PESQs and SRVAs. As shown in Figure 4.9c, less than 30% of words can be recognized by the STT service for all the unauthorized devices, and around 65% of words can be recognized for all the authorized devices. When the mobile devices are unauthorized, the average PESQ of the iPhone X is 1.06, and the average PESQ of the other two models is even lower, roughly 0.5. When the mobile devices are authorized, they all achieve an average PESQ of around 1.85. This demonstrates that Patronus works well for all devices; namely, it prevents all models from making good unauthorized recordings and allows all models to make acceptable authorized recordings.

DT (ms)    RT = 1 s   RT = 2 s   RT = 5 s   RT = 10 s   RT = 20 s   RT = 30 s
MSO = 1        51         73        161         290         548         822
MSO = 2        96        145        322         582        1094        1617
MSO = 3       159        218        487         851        1653        2348
MSO = 4       209        291        634        1108        2165        3088
MSO = 5       265        373        798        1389        2695        3830
MSO = 6       328        454        954        1653        3298        4563

Table 4.1 Descramble time (DT) of different record times (RT) with different max scramble orders (MSO, the upper bound of k in Algorithm 4.1).

4.5.6 Impact of the Distance

We also characterize the impact of the distance between Patronus and the recording devices (both authorized and unauthorized). We put the Scramble Transmitter at the origin. A randomly picked speech segment (which has 43 words) is played by a normal speaker, which simulates the talker. The authorized device and an unauthorized device record at the same time. Their distance to the Scramble Transmitter varies from 25 cm to 70 cm. The SRVA and PESQ results for the two devices are shown in Figure 4.10a. Overall, as the distance increases, the ultrasound attenuates more. Therefore, the strength of the scramble decreases as the distance from the Scramble Transmitter increases. As a result, when a device is far enough away, both the authorized and unauthorized devices can record clear speech. On the other hand, when the devices are close enough, unauthorized devices produce recordings that are severely scrambled, whereas authorized devices can recover much clearer speech using the secret key. The working area can be extended by using high-power ultrasonic speakers, which we will discuss later.
Here we want to mention that although there is a bump in the SRVA of Figure 4.10a at 55 cm, the PESQs at 55 cm and 60 cm are close. This means that humans cannot hear much difference between these two recordings, which we confirmed in person by listening to them. Thus, the SRVA bump at 55 cm might be due to an error-correction mechanism of the Google STT engine; of course, since this is proprietary technology, we do not know how or why such error correction would produce a performance bump for this recording.

Figure 4.10 (a) Compare PESQ and SRVA at different distances. (b) Illustration of the reflection layer experiment. (c) Compare PESQ and SRVA with different frequency switching times.

4.5.7 Impact of the Reflection Layer

As we mentioned before, the ultrasound wave often propagates along a straight line. To enlarge the range of Patronus’s scrambling, we design a reflection layer. In this experiment, we use the normal speaker to play the chosen speech segment (43 words). As shown in Figure 4.10b, we point the ultrasonic speakers towards the reflection layer, vary the angle of both the authorized and unauthorized devices relative to the ultrasonic speakers, and measure Patronus’s performance; in the other experiments, the devices are always placed at the 90° angle. We also measure the performance without the reflection layer. When we remove the reflection layer, we turn the ultrasonic speakers around so they face in the same direction as the normal speaker. The results with the reflection layer are shown in Figures 4.11a and 4.11b, and the results without the reflection layer are shown in Figures 4.11c and 4.11d. From the results, we see that with the reflection layer, Patronus can successfully scramble the unauthorized device at any angle larger than 15°, a significantly wider range than the angles larger than 45° required without the reflection layer. Therefore, the reflection layer does significantly enlarge the scramble range of Patronus.

Figure 4.11 (a) and (b): Compare PESQ and SRVA with the reflection layer. (c) and (d): Compare PESQ and SRVA without the reflection layer.

4.5.8 Impact of the Frequency Duration

We also measure the impact of the frequency duration. As we discussed in Section 4.3, we would like to make the duration of each frequency as short as possible. However, the shorter the frequency duration is, the harder it is for authorized devices to descramble. To verify this trade-off, we put an authorized and an unauthorized device 40 cm from Patronus and play the chosen segment (43 words) using the normal speaker. Both devices record the speech under Patronus using five different frequency durations: 0.1 s, 0.2 s, 0.3 s, 0.4 s and 0.5 s. Moreover, we implement the attack model from Section 4.3.2, which first estimates approximate scramble frequencies using STFT and then attempts to cancel the scramble using an NLMS adaptive filter. We calculate PESQs and SRVAs for each duration for all devices, including the attack model.
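For reference, the frequency-estimation step of the simulated attacker can be sketched in a few lines of Python. This is only an illustration under assumptions: the name estimate_scramble_freqs and the band argument (the frequency range where the demodulated scramble energy is assumed to lie) are ours, and scipy.signal.stft stands in for the MATLAB stft call used in our simulation.

    import numpy as np
    from scipy.signal import stft

    def estimate_scramble_freqs(rec, f_sr, frame_dur, band):
        # Simulated STFT attacker: for each frame of frame_dur seconds, pick the
        # strongest STFT bin inside band as the estimated scramble frequency.
        nperseg = int(frame_dur * f_sr)
        f, t, Z = stft(rec, fs=f_sr, nperseg=nperseg, noverlap=0)
        mask = (f >= band[0]) & (f <= band[1])
        return f[mask][np.argmax(np.abs(Z[mask, :]), axis=0)]   # one peak per frame

    # The attacker's precision is limited by the STFT bin spacing of 1 / frame_dur:
    # shorter frequency durations force wider bins, so the estimated sequence is
    # offset from the true hopping tones, and the subsequent NLMS cancellation,
    # run as in Algorithm 4.1 but with these estimates, falls short (Section 4.5.8).
    for frame_dur in (0.1, 0.2, 0.3, 0.4, 0.5):
        print(f"duration {frame_dur:.1f} s -> STFT bin spacing {1 / frame_dur:.1f} Hz")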
As shown in Figure 4.10c, for all durations, the SRVAs of the unauthorized device are lower than 0.1 and its PESQs are lower than 0.5. The authorized device has higher SRVAs and PESQs than the unauthorized device. Specifically, when the duration reaches 0.3 s, the SRVA reaches roughly 0.8 and the PESQ exceeds 2.0. This verifies our claim that authorized devices can successfully descramble when the frequency duration is long enough. A shorter duration also makes it harder for attackers to crack the scrambled recording; e.g., the attacker’s SRVAs also increase as the duration increases. Although the attacker’s SRVAs and PESQs are higher than those of the unauthorized device, they are still too low to extract useful information. The reason the NLMS adaptive filter fails for the attacker is that the attacker cannot identify the scramble frequencies with enough accuracy. The NLMS adaptive filter solves the optimization problem defined by Equation (4.4), which estimates the weight vector h′_2. Since convolution does not change the frequency of the signal, the attacker cannot compensate for any offset between the correct frequency and the result from STFT. Due to the frequency resolution limit of STFT, discussed in Section 4.3.3.4, the simulated attacker in our experiment has an average frequency offset of around 3 Hz, which makes it hard to descramble the recording.

4.5.9 Descramble Time

Sometimes when we grant recording permission to a specific speaker, the speaker would like to perform real-time descrambling. Patronus can achieve this when working with real-time smart devices such as Amazon Alexa. To prove this, we measure the descramble time for records of different durations on a laptop with an Intel Core i7-4870HQ 2.5 GHz CPU. Since different high-order scramble waves (second-order component, third-order component, ...) may exist in a record simultaneously, we measure the descramble time as a function of different max scramble orders, i.e., the upper bound of k in Algorithm 4.1. As shown in Table 4.1, Patronus can descramble the record quickly. Specifically, when the record time is 1 s, Patronus finishes descrambling within 328 ms even when the max scramble order is 6. This means that Patronus supports real-time descrambling.

4.6 Limitations and Future Works

Range: In our implementation, we use cheap and low-power ultrasonic transducers to build the Scramble Transmitter. The result is a short working distance, i.e., less than 70 cm. To cover a wider range of angles, we designed a reflection layer and verified, using an iron wok in our prototype, that it enlarges the working area. We can also use a high-power ultrasonic speaker to protect a larger area. Some commercial off-the-shelf devices can emit ultrasound that can be sensed over a larger area. For example, UPS+ [117] uses an ultrasonic speaker with a working area of 50 m × 50 m. However, it is expensive.
We can reduce the cost by deploying one expensive speaker and multiple transducers, as UPS+ [117] does. Here we provide users with three options to deploy Patronus according to their requirements, such as working area and budget. The first option is to use cheap transducers and a reflection layer to protect a small area. The second is to combine an expensive speaker and multiple transducers to protect a larger area. The third is to use multiple expensive speakers to protect the largest area.

Volume: In our implementation, we assume the talker uses a normal volume, i.e., not too loud or too quiet. However, the performance of Patronus does vary as a function of the talker’s volume. For example, if the talker speaks too loudly, the scramble cannot mess up the recording; at the opposite extreme, a quiet talker’s speech cannot be recovered even with descrambling. To adapt to different volumes, we can add a microphone to measure the talker’s volume. With multiple deployed ultrasonic speakers or transducers, we can first detect the position of the recording devices and then adjust the power of the ultrasound emitted from the nearest speakers according to the talker’s volume. Two challenges need to be solved. First, the microphone we use to measure the talker’s volume can also be scrambled. Second, we need to localize recording devices before emitting scrambles. We leave these challenges as future work.

4.7 Conclusion

Acoustic privacy protection has always been an important topic. In this chapter, we study the nonlinear effects of commercial off-the-shelf microphones. Based on our study, we propose Patronus, which leverages the nonlinear effects to prevent unauthorized devices from recording the speech while simultaneously allowing authorized devices to record clear speech audio. We implement and evaluate Patronus in a wide variety of representative scenarios. Results show that Patronus effectively blocks unauthorized devices from making secret recordings while allowing authorized devices to successfully make clear recordings. While pervasive sensors in mobile devices raise privacy concerns, Patronus shows the possibility of making use of these sensors to defend user privacy.

CHAPTER 5
CONCLUSION

IoT utilizes sensors as the information source of machine intelligence and achieves a comprehensive understanding of the environment. Further reactions of actuators could enable complete automation of large infrastructures like Smart City and Smart Home. In this dissertation, we push the limit of IoT system design on mobile devices. We particularly introduce three IoT systems to show our study on overcoming common challenges of IoT applications. EyeLoc is a localization system for large shopping malls. It shows our effort to achieve a good balance among cost, computational power and real-time response. SoundFlower is a sound source localization system for voice assistants. It presents our study on dealing with pervasive noise from both internal hardware and the environment. Patronus provides acoustic privacy protection against unauthorized recordings. While sensors raise privacy concerns, Patronus shows the possibility of defending privacy by exploiting existing sensors.

With the rapid development of machine learning and sensing technology, plenty of IoT designs are being implemented and will benefit our lives. With mobile devices being the most common computational systems, pushing the limit of IoT system design on mobile devices will play an essential role in smart applications.
BIBLIOGRAPHY

[1] IEEE, “Towards a definition of the internet of things (iot),” http://iot.ieee.org/definition.html, 2015.

[2] B. L. R. Stojkoska and K. V. Trivodaliev, “A review of internet of things for smart home: Challenges and solutions,” Journal of Cleaner Production, vol. 140, pp. 1454–1464, 2017.

[3] C. Le Gal, J. Martin, A. Lux, and J. L. Crowley, “Smart office: Design of an intelligent environment,” IEEE Intelligent Systems, no. 4, pp. 60–66, 2001.

[4] E. Ngai, F. Dressler, V. Leung, and M. Li, “Guest editorial special section on internet-of-things for smart cities and urban informatics,” IEEE Transactions on Industrial Informatics, vol. 13, no. 2, pp. 748–750, 2017.

[5] A. Wang, J. E. Sunshine, and S. Gollakota, “Contactless infant monitoring using white noise,” in Proceedings of ACM MobiCom, 2019.

[6] C. Peng, G. Shen, Y. Zhang, Y. Li, and K. Tan, “Beepbeep: A high accuracy acoustic ranging system using cots mobile devices,” in Proceedings of ACM SenSys, 2007.

[7] K. Qian, C. Wu, Z. Yang, Y. Liu, and K. Jamieson, “Widar: Decimeter-level passive tracking via velocity monitoring with commodity wi-fi,” in Proceedings of ACM MobiHoc, 2017.

[8] N. B. Priyantha, A. Chakraborty, and H. Balakrishnan, “The cricket location-support system,” in Proceedings of ACM MobiCom, 2000.

[9] H. Chen, F. Li, and Y. Wang, “Echotrack: Acoustic device-free hand tracking on smart phones,” in Proceedings of IEEE INFOCOM, 2017, pp. 1–9.

[10] Y. Zheng, Y. Zhang, K. Qian, G. Zhang, Y. Liu, C. Wu, and Z. Yang, “Zero-effort cross-domain gesture recognition with wi-fi,” in Proceedings of ACM MobiSys, 2019.

[11] C. Li, M. Liu, and Z. Cao, “Wihf: Enable user identified gesture recognition with wifi,” in Proceedings of IEEE INFOCOM, 2020.

[12] W. Jiang, C. Miao, F. Ma, S. Yao, Y. Wang, Y. Yuan, H. Xue, C. Song, X. Ma, D. Koutsonikolas, W. Xu, and L. Su, “Towards environment independent device free human activity recognition,” in Proceedings of ACM MobiCom, 2018.

[13] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of CVPR, 2017.

[14] L. Weng and G. Brockman, “Techniques for training large neural networks,” https://openai.com/research/techniques-for-training-large-neural-networks, (Accessed on Dec. 4, 2023).

[15] M. Liu, J. Du, Q. Zhou, Z. Cao, and Y. Liu, “Eyeloc: Smartphone vision-enabled plug-n-play indoor localization in large shopping malls,” IEEE Internet of Things Journal, pp. 5585–5598, 2021.

[16] L. Li, M. Liu, Y. Yao, F. Dang, Z. Cao, and Y. Liu, “Patronus: Preventing unauthorized speech recordings with support for selective unscrambling,” in Proceedings of ACM SenSys, 2020, pp. 245–257.

[17] N. Roy, H. Hassanieh, and R. Roy Choudhury, “Backdoor: Making microphones hear inaudible sounds,” in Proceedings of ACM MobiSys, 2017.

[18] S. Wang, S. Fidler, and R. Urtasun, “Lost shopping! monocular localization in large indoor spaces,” in Proceedings of IEEE ICCV, 2015.

[19] L. M. Ni, Y. Liu, Y. C. Lau, and A. P. Patil, “Landmarc: Indoor location sensing using active rfid,” Wireless Networks, vol. 10, no. 6, pp. 701–710, 2004.

[20] A. Haeberlen, E. Flannery, A. M. Ladd, A. Rudys, D. S. Wallach, and L. E. Kavraki, “Practical robust localization over large-scale 802.11 wireless networks,” in Proceedings of ACM MobiCom, 2004.

[21] H. Liu, Y. Gan, J. Yang, S. Sidhom, Y. Wang, Y. Chen, and F. Ye, “Push the limit of wifi based localization for smartphones,” in Proceedings of ACM MobiCom, 2012.

[22] D. Vasisht, S.
Kumar, and D. Katabi, “Decimeter-level localization with a single wifi ac- cess point,” in 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016. [23] Z. Yang, L. Jian, C. Wu, and Y. Liu, “Beyond triangle inequality: Sifting noisy and outlier distance measurements for localization,” ACM Trans. Sen. Netw., vol. 9, no. 2, pp. 26:1– 26:20, 2013. [24] Y.-S. Kuo, P. Pannuto, K.-J. Hsiao, and P. Dutta, “Luxapose: Indoor positioning with mobile phones and visible light,” in Proceedings of ACM MobiCom, 2014. [25] S. Zhu and X. Zhang, “Enabling high-precision visible light localization in today’s build- ings,” in Proceedings of ACM MobiSys, 2017. [26] S. Lin and T. He, “Smartlight: Light-weight 3d indoor localization using a single led lamp,” in Proceedings of ACM SenSys, 2017. [27] Q. Niu, M. Li, S. He, C. Gao, S. H. Gary Chan, and X. Luo, “Resource-efficient and auto- mated image-based indoor localization,” ACM Trans. Sen. Netw., vol. 15, no. 2, pp. 19:1– 19:31, 2019. 98 [28] Y. Tian, R. Gao, K. Bian, F. Ye, T. Wang, Y. Wang, and X. Li, “Towards ubiquitous indoor localization service leveraging environmental physical features,” in Proceedings of IEEE INFOCOM, 2014. [29] R. Gao, Y. Tian, F. Ye, G. Luo, K. Bian, Y. Wang, T. Wang, and X. Li, “Sextant: Towards ubiquitous indoor localization service by photo-taking of the environment,” IEEE Transac- tions on Mobile Computing, vol. 15, no. 2, pp. 460–474, 2016. [30] Y. Shu, C. Bo, G. Shen, C. Zhao, L. Li, and F. Zhao, “Magicol: Indoor localization using pervasive magnetic field and opportunistic wifi sensing,” IEEE Journal on Selected Areas in Communications, vol. 33, no. 7, pp. 1443–1457, 2015. [31] H. Wu, Z. Mo, J. Tan, S. He, and S.-H. G. Chan, “Efficient indoor localization based on geomagnetism,” ACM Trans. Sen. Netw., vol. 15, no. 4, pp. 42:1–42:25, 2019. [32] G. Bradski and A. Kaehler, Learning OpenCV: Computer vision with the OpenCV library. " O’Reilly Media, Inc.", 2008. [33] R. Smith, “An overview of the tesseract ocr engine,” in Proceedings of ICDAR, 2007. [34] R. Jain, R. Kasturi, and B. G. Schunck, Machine Vision. McGraw-Hill, 1995. [35] P. Zhou, M. Li, and G. Shen, “Use it free: instantly knowing your phone attitude,” in Pro- ceedings of ACM MobiCom, 2014. [36] N. Roy, H. Wang, and R. Roy Choudhury, “I am a smartphone and i can tell my user’s walking direction,” in Proceedings of ACM MobiSys, 2014. [37] S. Pertuz, D. Puig, and M. A. Garcia, “Analysis of focus measure operators for shape-from- focus,” Pattern Recognition, vol. 46, no. 5, pp. 1415–1432, 2013. [38] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” in Soviet Physics Doklady, 1966. [39] Google, “Battery historian,” https://github.com/google/battery-historian, 18-Aug-2020. [40] P. Bahl and V. N. Padmanabhan, “Radar: An in-building rf-based user location and tracking system,” in Proceedings of IEEE INFOCOM, 2000. [41] M. Youssef and A. Agrawala, “The horus wlan location determination system,” in Proceed- ings of ACM MobiSys, 2005. [42] Y.-C. Cheng, Y. Chawathe, A. LaMarca, and J. Krumm, “Accuracy characterization for metropolitan-scale wi-fi localization,” in Proceedings of ACM MobiSys, 2005. 99 [43] S. Sen, B. Radunovic, R. R. Choudhury, and T. Minka, “You are facing the mona lisa: spot localization using phy layer information,” in Proceedings of ACM MobiSys, 2012. [44] I. Bisio, F. Lavagetto, A. Sciarrone, and S. 
Yiu, “A smart2 gaussian process approach for indoor localization with rssi fingerprints,” in 2017 IEEE International Conference on Com- munications (ICC), 2017. [45] Z. Yang, C. Wu, and Y. Liu, “Locating in fingerprint space: wireless indoor localization with little human intervention,” in Proceedings of ACM MobiCom, 2012. [46] A. Rai, K. K. Chintalapudi, V. N. Padmanabhan, and R. Sen, “Zee: zero-effort crowdsourc- ing for indoor localization,” in Proceedings of ACM MobiCom, 2012. [47] X. Tian, R. Shen, D. Liu, Y. Wen, and X. Wang, “Performance analysis of rss fingerprinting based indoor localization,” IEEE Transactions on Mobile Computing, vol. 16, no. 10, pp. 2847–2861, 2017. [48] K. Chintalapudi, A. Padmanabha Iyer, and V. N. Padmanabhan, “Indoor localization without the pain,” in Proceedings of ACM MobiCom, 2010. [49] S. Sen, R. R. Choudhury, and S. Nelakuditi, “Spinloc: Spin once to know your location,” in Proceedings of ACM HotMobile Workshop, 2012. [50] Z. Zhang, X. Zhou, W. Zhang, Y. Zhang, G. Wang, B. Y. Zhao, and H. Zheng, “I am the an- tenna: accurate outdoor ap location using smartphones,” in Proceedings of ACM MobiCom, 2011. [51] J. Xiong and K. Jamieson, “Arraytrack: A fine-grained indoor location system,” in Proceed- ings of NSDI, 2013. [52] S. Sen, J. Lee, K.-H. Kim, and P. Congdon, “Avoiding multipath to revive inbuilding wifi localization,” in Proceeding of ACM MobiSys, 2013. [53] A. T. Mariakakis, S. Sen, J. Lee, and K.-H. Kim, “Sail: Single access point-based indoor localization,” in Proceedings of ACM MobiSys, 2014. [54] S. Kumar, S. Gil, D. Katabi, and D. Rus, “Accurate indoor localization with zero start-up cost,” in Proceedings of ACM MobiCom, 2014. [55] M. Kotaru, K. Joshi, D. Bharadia, and S. Katti, “Spotfi: Decimeter level localization us- ing wifi,” in Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, 2015. [56] C. Zhang and X. Zhang, “Pulsar: Towards ubiquitous visible light localization,” in Proceed- ings of ACM MobiCom, 2017. 100 [57] Y.-L. Wei, C.-J. Huang, H.-M. Tsai, and K. C.-J. Lin, “Celli: Indoor positioning using po- larized sweeping light beams,” in Proceedings of ACM MobiSys, 2017. [58] J. Dong, Y. Xiao, M. Noreikis, Z. Ou, and A. Ylä-Jääski, “imoon: Using smartphones for image-based indoor navigation,” in Proceedings of ACM Sensys, 2015. [59] R. Gao, M. Zhao, T. Ye, F. Ye, Y. Wang, K. Bian, T. Wang, and X. Li, “Jigsaw: Indoor floor plan reconstruction via mobile crowdsensing,” in Proceedings of ACM MobiCom, 2014. [60] Y. Shu, K. G. Shin, T. He, and J. Chen, “Last-mile navigation using smartphones,” in Pro- ceedings of ACM MobiCom, 2015. [61] W. Huang, Y. Xiong, X.-Y. Li, H. Lin, X. Mao, P. Yang, and Y. Liu, “Shake and walk: Acoustic direction finding and fine-grained indoor localization using smartphones,” in Pro- ceedings of IEEE INFOCOM, 2014. [62] K. Liu, X. Liu, and X. Li, “Guoguo: Enabling fine-grained indoor localization via smart- phone,” in Proceeding of ACM MobiSys, 2013. [63] M. Risoud, J.-N. Hanson, F. Gauvrit, C. Renard, P.-E. Lemesre, N.-X. Bonne, and C. Vin- cent, “Sound source localization,” European annals of otorhinolaryngology, head and neck diseases, pp. 259–264, 2018. [64] P. D. Coleman, “An analysis of cues to auditory depth perception in free space,” Psycholog- ical bulletin, p. 302, 1963. [65] ——, “Failure to localize the source distance of an unfamiliar sound,” The Journal of the Acoustical Society of America, pp. 345–346, 1962. [66] E. Georganti, T. May, S. van de Par, A. Harma, and J. 
Mourjopoulos, “Speaker distance detection using a single microphone,” IEEE Transactions on Audio, Speech, and Language Processing, pp. 1949–1961, 2011. [67] K. Liu, X. Liu, and X. Li, “Acoustic ranging and communication via microphone channel,” in Proceeding of GLOBECOM. IEEE, 2012, pp. 291–296. [68] S. Yun, Y.-C. Chen, and L. Qiu, “Turning a mobile device into a mouse in the air,” in Pro- ceeding of the 13th MobiSys. ACM, 2015, pp. 15–29. [69] W. Mao, J. He, and L. Qiu, “Cat: High-precision acoustic motion tracking,” in Proceeding of the 22nd MobiCom. ACM, 2016, p. 69–81. [70] J. C. Curlander and R. N. McDonough, Synthetic aperture radar. Wiley, 1991. [71] W. Wang, J. Li, Y. He, and Y. Liu, “Symphony: Localizing multiple acoustic sources with a 101 single microphone array,” in Proceeding of the 18th SenSys. ACM, 2020, p. 82–94. [72] E. B. Newman, “Experimental psychology,” 1954. [73] J. Benesty, J. Chen, and Y. Huang, “Time-delay estimation via linear interpolation and cross correlation,” IEEE Transactions on Speech and Audio Processing, pp. 509–519, 2004. [74] I. L. Freire and J. A. Apolinário, “Doa of gunshot signals in a spatial microphone array: Performance of the interpolated generalized cross-correlation method,” in 2011 Argentine School of Micro-Nanoelectronics, Technology and Applications. IEEE, 2011, pp. 1–6. [75] C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 320–327, 1976. [76] B. Van Den Broeck, A. Bertrand, P. Karsmakers, B. Vanrumste, M. Moonen et al., “Time- domain generalized cross correlation phase transform sound source localization for small microphone arrays,” in 2012 5th European DSP Education and Research Conference. IEEE, 2012, pp. 76–80. [77] Y. T. Chan, R. Hattin, and J. Plant, “The least squares estimation of time delay and its use in signal detection,” in IEEE Transactions on Acoustics, Speech, and Signal Processing. IEEE, 1978, pp. 217–222. [78] R. Roy and T. Kailath, “Esprit-estimation of signal parameters via rotational invariance techniques,” IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 984–995, 1989. [79] S. Shahbazpanahi, S. Valaee, and M. H. Bastani, “Distributed source localization using es- prit algorithm,” IEEE Transactions on Signal Processing, pp. 2169–2178, 2001. [80] Y. L. Sit, C. Sturm, J. Baier, and T. Zwick, “Direction of arrival estimation using the music IEEE, 2012, pp. algorithm for a mimo ofdm radar,” in 2012 IEEE Radar Conference. 0226–0229. [81] R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, pp. 276–280, 1986. [82] T. C. Collier, A. N. Kirschel, and C. E. Taylor, “Acoustic localization of antbirds in a mex- ican rainforest using a wireless sensor network,” The Journal of the Acoustical Society of America, pp. 182–189, 2010. [83] I. Constandache, S. Agarwal, I. Tashev, and R. R. Choudhury, “Daredevil: Indoor location using sound,” ACM SIGMOBILE Mobile Computing and Communications Review, pp. 9– 19, 2014. 102 [84] C. T. Ishi, J. Even, and N. Hagita, “Using multiple microphone arrays and reflections for 3d localization of sound sources,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013, pp. 3937–3942. [85] [86] J. Sallai, W. Hedgecock, P. Volgyesi, A. Nadas, G. Balogh, and A. 
Ledeczi, “Weapon clas- sification and shooter localization using distributed multichannel acoustic sensors,” Journal of Systems Architecture, pp. 869–885, 2011. J. Yang, S. Sidhom, G. Chandrasekaran, T. Vu, H. Liu, N. Cecan, Y. Chen, M. Gruteser, and R. P. Martin, “Detecting driver phone use leveraging car speakers,” in Proceeding of the 17th MobiCom. ACM, 2011, pp. 97–108. [87] H. Chen, F. Li, and Y. Wang, “Echoloc: Accurate device-free hand localization using cots devices,” in Proceeding of the 45th ICPP. IEEE, 2016, pp. 334–339. [88] W. Huang, Y. Xiong, X.-Y. Li, H. Lin, X. Mao, P. Yang, Y. Liu, and X. Wang, “Swadloon: Direction finding and indoor localization using acoustic signal by shaking smartphones,” IEEE Transactions on Mobile Computing, pp. 2145–2157, 2014. [89] Y. Zhang, J. Wang, W. Wang, Z. Wang, and Y. Liu, “Vernier: Accurate and fast acoustic IEEE, 2018, pp. motion tracking using mobile devices,” in Proceeding of INFOCOM. 1709–1717. [90] P. Lazik and A. Rowe, “Indoor pseudo-ranging of mobile devices using ultrasonic chirps,” in Proceeding of the 10th SenSys. ACM, 2012, pp. 99–112. [91] N. Garg, Y. Bai, and N. Roy, “Owlet: Enabling spatial information in ubiquitous acoustic devices,” in Proceedings of the 19th MobiSys. ACM, 2021, p. 255–268. [92] S. Shen, D. Chen, Y.-L. Wei, Z. Yang, and R. R. Choudhury, “Voice localization using nearby wall reflections,” in Proceeding of the 26th MobiCom. ACM, 2020, pp. 1–14. [93] M. Wang, W. Sun, and L. Qiu, “MAVL: Multiresolution analysis of voice localization,” in Proceeding of the 18th NSDI). USENIX, 2021, pp. 845–858. [94] M. S. Brandstein and H. F. Silverman, “A robust method for speech signal time-delay esti- mation in reverberant rooms,” in 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1997, pp. 375–378. [95] S. Gupta, D. Morris, S. Patel, and D. Tan, “Soundwave: Using the doppler effect to sense gestures,” in Proceeding of CHI. ACM, 2012, pp. 1911–1914. [96] B. Zhou, M. Elbadry, R. Gao, and F. Ye, “Battracker: High precision infrastructure-free mobile device tracking in indoor environments,” in Proceedings of ACM SenSys, 2017. 103 [97] Z. Sun, A. Purohit, R. Bose, and P. Zhang, “Spartacus: Spatially-aware interaction for mo- bile devices through energy-efficient audio sensing,” in Proceeding of the 11th MobiSys. ACM, 2013, pp. 263–276. [98] K. Sun, T. Zhao, W. Wang, and L. Xie, “Vskin: Sensing touch gestures on surfaces of mobile devices using acoustic signals,” in Proceeding of the 24th MobiCom. ACM, 2018, pp. 591– 605. [99] W. Wang, A. X. Liu, and K. Sun, “Device-free gesture tracking using acoustic signals,” in Proceeding of the 22nd MobiCom. ACM, 2016, pp. 82–94. [100] S. Yun, Y.-C. Chen, H. Zheng, L. Qiu, and W. Mao, “Strata: Fine-grained acoustic-based device-free tracking,” in Proceeding of the 15th MobiSys. ACM, 2017, pp. 15–28. [101] H. Jin, C. Holz, and K. Hornbæk, “Tracko: Ad-hoc mobile 3d tracking using bluetooth low energy and inaudible signals for cross-device interaction,” in Proceeding of the 28th UIST. ACM, 2015, pp. 147–156. [102] J. Qiu, D. Chu, X. Meng, and T. Moscibroda, “On the feasibility of real-time phone-to-phone 3d localization,” in Proceeding of the 9th SenSys. ACM, 2011, pp. 190–203. [103] J. H. DiBiase, A High-accuracy, Low-latency Technique for Talker Localization in Rever- berant Environments Using Microphone Arrays. Brown University, 2000. [104] B. D. Rao and K. S. 
Hari, “Performance analysis of root-music,” IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 1939–1949, 1989. [105] G. Carter, C. Knapp, and A. Nuttall, “Estimation of the magnitude-squared coherence func- tion via overlapped fast fourier transform processing,” IEEE Transactions on Audio and Electroacoustics, pp. 337–344, 1973. [106] A. E. Beaton and J. W. Tukey, “The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data,” Technometrics, pp. 147–185, 1974. [107] S. Morgenthaler, “Fitting redescending m-estimators in regression,” in Robust Regression. Routledge, 2019, pp. 105–128. [108] M. S. Brandstein, J. E. Adcock, and H. F. Silverman, “A practical time-delay estimator for localizing speech sources with a microphone array,” Computer Speech & Language, pp. 153–169, 1995. [109] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acous- tics,” The Journal of the Acoustical Society of America, pp. 943–950, 1979. [110] I. Cohen and B. Berdugo, “Noise estimation by minima controlled recursive averaging for 104 robust speech enhancement,” IEEE Signal Processing Letters, pp. 12–15, 2002. [111] wiki.seeedstudio.com, “Respeaker 6-mic circular array kit raspberry pi.” [On- line]. Available: https://wiki.seeedstudio.com/ReSpeaker_6-Mic_Circular_Array_kit_for_ Raspberry_Pi/ for [112] raspberrypi.com, “Raspberry pi 4 model b.” [Online]. Available: https://www.raspberrypi. com/products/raspberry-pi-4-model-b [113] T. Guardian, “Apple apologises for allowing workers to listen to siri recordings,” https: //www.theguardian.com/technology/2019/aug/29/apple-apologises-listen-siri-recordings, (Accessed on Feb. 28, 2020). [114] CNBC, “Amazon echo recorded conversation, sent to random person: report,” https://www. cnbc.com/2018/05/24/amazon-echo-recorded-conversation-sent-to-random-person-report. html, (Accessed on Feb. 28, 2020). [115] V. News, “Ukrainian prime minister offers resignation,” https://www.voanews.com/a/ europe_ukrainian-prime-minister-offers-resignation/6182735.html, (Accessed on Dec. 5, 2023). [116] Y.-C. Tung and K. G. Shin, “Exploiting sound masking for audio privacy in smartphones,” in Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security, 2019. [117] Q. Lin, Z. An, and L. Yang, “Rebooting ultrasonic positioning systems for ultrasound- incapable smart devices,” in Proceedings of ACM MobiCom, 2019. [118] “Anti-eavesdropping and recording blocker device,” Patent, China Patent 201320228440, Oct. 2013. [119] N. Roy, S. Shen, H. Hassanieh, and R. R. Choudhury, “Inaudible voice commands: The long-range attack and defense,” in Proceedings of NSDI, 2018. [120] G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang, and W. Xu, “Dolphinattack: Inaudible voice commands,” in Proceedings of ACM CCS, 2017. [121] T. Chen, L. Shangguan, Z. Li, and K. Jamieson, “Metamorph: Injecting inaudible commands into over-the-air voice controlled systems,” in Network and Distributed Systems Security (NDSS) Symposium, 2020. [122] X. Zhou, X. Ji, C. Yan, J. Deng, and W. Xu, “Nauth: Secure face-to-face device authentica- tion via nonlinearity,” in proceedings of IEEE INFOCOM, 2019. [123] Q. Yan, K. Liu, Q. Zhou, H. Guo, and N. Zhang, “Surfingattack: Interactive hidden attack on 105 voice assistants using ultrasonic guided wave,” in Network and Distributed Systems Security (NDSS) Symposium, 2020. [124] A. Rovner, “The principle of ultrasound,” https://www.echopedia.org/wiki/The_principle_ of_ultrasound, 2015. [125] A. H. 
Sayed, Fundamentals of adaptive filtering. John Wiley & Sons, 2003. [126] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Pro- cessing. Proceedings (Cat. No. 01CH37221), 2001. [127] Y. He, J. Bian, X. Tong, Z. Qian, W. Zhu, X. Tian, and X. Wang, “Canceling inaudible voice commands against voice control systems,” in Proceedings of ACM MobiCom, 2019. [128] A. Wang, C. Peng, O. Zhang, G. Shen, and B. Zeng, “Inframe: Multiflexing full-frame visible communication channel for humans and devices,” in Proceedings of the 13th ACM Workshop on Hot Topics in Networks, 2014. [129] A. Wang, Z. Li, C. Peng, G. Shen, G. Fang, and B. Zeng, “Inframe++ achieve simultaneous screen-human viewing and hidden screen-camera communication,” in Proceedings of ACM MobiSys, 2015. [130] V. Nguyen, Y. Tang, A. Ashok, M. Gruteser, K. Dana, W. Hu, E. Wengrowski, and N. Man- dayam, “High-rate flicker-free screen-camera communication with spatially adaptive em- bedding,” in Proceedings of IEEE INFOCOM, 2016. [131] K. Zhang, Y. Zhao, C. Wu, C. Yang, K. Huang, C. Peng, Y. Liu, and Z. Yang, “Chromacode: A fully imperceptible screen-camera communication system,” IEEE Transactions on Mobile Computing, 2019. [132] Q. Wang, K. Ren, M. Zhou, T. Lei, D. Koutsonikolas, and L. Su, “Messages behind the sound: real-time hidden acoustic signal capture with smartphones,” in Proceedings of ACM MobiCom, 2016. [133] M. Zhou, Q. Wang, K. Ren, D. Koutsonikolas, L. Su, and Y. Chen, “Dolphin: Real-time hid- den acoustic signal capture with smartphones,” IEEE Transactions on Mobile Computing, vol. 18, no. 3, pp. 560–573, 2018. [134] L. Zhang, C. Bo, J. Hou, X.-Y. Li, Y. Wang, K. Liu, and Y. Liu, “Kaleido: You can watch it but cannot record it,” in Proceedings of ACM MobiCom, 2015. [135] S. Zhu, C. Zhang, and X. Zhang, “Automating visual privacy protection using a smart led,” in Proceedings of ACM MobiCom, 2017. 106 [136] I. R. Titze and D. W. Martin, “Principles of voice production,” 1998. [137] R. J. Baken and R. F. Orlikoff, Clinical measurement of speech and voice. Cengage Learn- ing, 2000. [138] S. Shen, N. Roy, J. Guan, H. Hassanieh, and R. R. Choudhury, “Mute: bringing iot to noise cancellation,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, 2018. [139] I. Rec, “P. 800.1, mean opinion score (mos) terminology,” International Telecommunication Union, Geneva, 2006. [140] K. Wojcicki, “PESQ MATLAB wrapper,” https://www.mathworks.com/matlabcentral/ fileexchange/33820-pesq-matlab-wrapper, (Accessed on Dec. 6, 2023). 107