TOWARDS ACCURATE RANGING AND VERSATILE AUTHENTICATION FOR SMART MOBILE DEVICES

By

Lingkun Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2022

ABSTRACT

TOWARDS ACCURATE RANGING AND VERSATILE AUTHENTICATION FOR SMART MOBILE DEVICES

By Lingkun Li

The Internet of Things (IoT) has developed rapidly in recent years. Smart devices such as smartphones, smartwatches, and smart assistants, which are equipped with smart chips as well as sensors, provide users with many easy-to-use functions and lead them to a more convenient life. In this dissertation, we carefully study the birefringence of transparent tape, the nonlinear effects of microphones, and the phase characteristics of reflected ultrasound, and we make use of these effects to design three systems, RainbowLight, Patronus, and BreathPass, which provide users with accurate localization, privacy protection, and authentication, respectively.

RainbowLight leverages the observation-direction-dependent spectrum generated when polarized light passes through a birefringent material, i.e., transparent tape, to provide a localization service. We characterize the relationship among the observation direction, light interference, and the resulting spectrum, and use it to calculate the direction to a chip from a photo containing that chip. With multiple chips, RainbowLight uses a direction-intersection-based method to derive the location. In this dissertation, we build the theoretical basis for using polarized light and the birefringence phenomenon to perform localization. Based on the theoretical model, we design and implement RainbowLight on mobile devices and evaluate the performance of the system. The evaluation results show that RainbowLight achieves a median error of 1.68 cm in the X-axis, 2 cm in the Y-axis, 5.74 cm in the Z-axis, and 7.04 cm over all three dimensions. It is the first system that can perform visible light positioning using only the reflected light in a space.

Patronus prevents unauthorized speech recording by leveraging the nonlinear effects of commercial off-the-shelf microphones. Its inaudible ultrasonic scramble interferes with recordings made by unauthorized devices and can be canceled on authorized devices through an adaptive filter. In this dissertation, we carefully studied the nonlinear effects of ultrasound on commercial microphones. Based on this study, we proposed an optimized configuration for generating the scramble, which provides privacy protection against unauthorized recordings without disturbing normal conversations. We designed and implemented a system including both hardware and software components. Experimental results show that only 19.7% of words protected by Patronus' scramble can be recognized by unauthorized devices. Furthermore, authorized recordings have a 1.6x higher perceptual evaluation of speech quality (PESQ) score and, on average, 50% lower speech recognition error rates than unauthorized recordings.

BreathPass uses speakers to emit ultrasound signals. The signals are reflected off the chest wall and abdomen and travel back to the microphone, which records the reflected signals. The system then extracts fingerprints from the breathing pattern and uses these fingerprints to perform authentication. In this dissertation, we characterize the challenges of conducting authentication with the breathing pattern.
After addressing these challenges, we designed such a system and implemented a proof-of-concept application on the Android platform. We also conducted comprehensive experiments to evaluate its performance under different scenarios. BreathPass achieves an overall accuracy of 83%, a true positive rate of 73%, and a false positive rate of 5%, according to the performance evaluation results. In general, this dissertation provides enhanced ranging and versatile authentication systems for the Internet of Things.

Copyright by LINGKUN LI 2022

To my parents and grandparents for their love and support.

ACKNOWLEDGEMENTS

There are many people I would like to thank. To my advisor, Dr. Yunhao Liu: without your encouragement, I would never have come to the U.S., experienced a different culture, or had the chance to broaden my horizons and see a different world. Without your selfless support, I would never have had such a good life here. I will never forget the days we sat in McDonald's or Panda Express, discussing my life and my future while you shared your understanding of research. To my advisor, Dr. Zhichao Cao, and my master's advisor, Dr. Jiliang Wang: I learned a lot from you – research taste, writing style, and presentation skills. Without your support and guidance, I would not have been able to finish my Ph.D. study. I want to thank Professor Eric Torng for his careful editing of my papers; his professionalism and hard-working attitude are worth learning from for every one of us. Thanks to my guidance committee members, Dr. Li Xiao, Dr. Qiben Yan, and Dr. Mi Zhang, for their guidance and support. I also want to thank my lab mates, past and present, Dr. Fan Dang, Dr. Pengjin Xie, Dr. Chunyu Qiao, Dr. Yinghui Li, Ye Zhou, Zhao Wang, Qing Zhou, and many others, for their collaboration in many aspects. Thanks to my best friends Yue Jiang and Junjie Han during my time in the United States; it is because of their company and help that I was able to complete my studies. Thanks to my parents and grandparents for their endless love and support.

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Proposed techniques and applications . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Positioning with birefringence . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Audio privacy protection with nonlinearity of microphones . . . . . . . . . 3
1.1.3 Authentication with user's breath . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
CHAPTER 2 RAINBOWLIGHT: ENABLING 3D AMBIENT LIGHT POSITIONING WITH MOBILE PHONES AND BATTERY-FREE CHIPS . . . . . . . . . 6
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Polarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Birefringence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Interference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Localization Basics . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . 11 2.3.1 Interference Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.1.1 Intensity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.1.2 Phase Difference . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.1.3 Calculation of 𝑛𝑒 , 𝜃 𝑒 , and Δ . . . . . . . . . . . . . . . . . . . . 15 2.3.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.2 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.2.1 Choose The Light Spectrum Feature . . . . . . . . . . . . . . . . 17 2.3.2.2 Measurement Result . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4 RainbowLight Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.1 Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.2 Mapping Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.3 3D Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4.3.1 Localization Design . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4.3.2 Intersection Based Localization . . . . . . . . . . . . . . . . . . 21 2.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.5.1 Anchor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.5.2 Receiver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.6 Apply RainbowLight to Localization in a Large Area . . . . . . . . . . . . . . . . 25 2.6.1 Providing Identifier to RainbowLight Anchor . . . . . . . . . . . . . . . . 26 2.6.2 Localization in a Large Area . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 vii 2.7.1 Localization Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.7.2 Performance with Identifier . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.7.3 Impact of Sampling Density . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.7.4 Impact of Number of Transparent Chips . . . . . . . . . . . . . . . . . . . 31 2.7.5 Impact of Different Light Sources . . . . . . . . . . . . . . . . . . . . . . 32 2.7.6 Impact of Different Mobile Phone Models . . . . . . . . . . . . . . . . . . 33 2.7.7 Localization with Light Off . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.7.8 Impact of Mobile Phone Orientation . . . . . . . . . . . . . . . . . . . . . 35 2.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.8.1 Visible Light Based Localization . . . . . . . . . . . . . . . . . . . . . . . 36 2.8.2 Other Localization Approaches . . . . . . . . . . . . . . . . . . . . . . . . 37 CHAPTER 3 PATRONUS: PREVENTING UNAUTHORIZED SPEECH RECORD- INGS WITH SUPPORT FOR SELECTIVE UNSCRAMBLING . . . . . . . 39 3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.1 Nonlinear Effect of Microphones . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.2 Dual Channel Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3 Nonlinear Behavior of Common Microphones . . . . . . . . . . . . . . . . . . . . 45 3.4 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
46 3.4.2 Attack Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.4.2.1 Short-Time Fourier Transform (STFT) . . . . . . . . . . . . . . . 47 3.4.2.2 Extra Ultrasonic Transmitter Attack . . . . . . . . . . . . . . . . 48 3.4.2.3 Wi-Fi/Bluetooth Snifing . . . . . . . . . . . . . . . . . . . . . . 48 3.4.2.4 Physical Attacking . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.3 Ultrasonic Scramble Modulation . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.3.1 Range of Frequency . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.3.2 Random Frequencies . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.3.3 Ringing Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4.3.4 Duration of each frequency . . . . . . . . . . . . . . . . . . . . 51 3.4.3.5 Key Construction . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4.4 Enlarge Scramble Working Area . . . . . . . . . . . . . . . . . . . . . . . 53 3.4.5 Grant Recording Privilege . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.4.5.1 Key Transmission . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.4.5.2 Scramble Reconstruction . . . . . . . . . . . . . . . . . . . . . . 54 3.4.5.3 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.4.5.4 Adaptive Filtering . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.5.1 Scramble Transmitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.5.1.1 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . 57 3.5.1.2 Format of Key . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.5.2 Descramble Receiver for Authorized Devices . . . . . . . . . . . . . . . . 57 3.5.2.1 Reconstruct Scramble Waveform . . . . . . . . . . . . . . . . . 58 3.5.2.2 Normalized Least-Mean-Square (NLMS) Adaptive Filter . . . . . 59 viii 3.5.3 Simulated STFT Attacker . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.6.1.1 Perceptual Evaluation of Speech Quality (PESQ) . . . . . . . . . 61 3.6.1.2 Speech Recognition Vocabulary Accuracy (SRVA) . . . . . . . . 62 3.6.2 Effectiveness of Scrambling and Descrambling . . . . . . . . . . . . . . . 63 3.6.3 Effectiveness of Human Voice Scrambling and Descrambling . . . . . . . . 64 3.6.4 Effectiveness of Human Recognition to Scrambled Recordings and Descrambled Recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.6.5 Effectiveness on Different Mobile Models . . . . . . . . . . . . . . . . . . 65 3.6.6 Impact of the Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.6.7 Impact of the Reflection Layer . . . . . . . . . . . . . . . . . . . . . . . . 67 3.6.8 Impact of the Frequency Duration . . . . . . . . . . . . . . . . . . . . . . 68 3.6.9 Descramble Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.7 Limitations and Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 CHAPTER 4 BREATHPASS: ULTRASOUNIC AUTHENTICATION BY CHEST AND ABDOMEN MOVEMENT WHILE BREATHING . . . . . . . . . . . . . . 72 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
75 4.2.1 Human Breath Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.2.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.3 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.3.1 Ultrasound-based Breath Sampler . . . . . . . . . . . . . . . . . . . . . . 78 4.3.2 Fingerprint Extractor Design . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.3.3 Comparator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.3.4 Combine the Fingerprint Extractor with the Comparator . . . . . . . . . . . 85 4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.4.1 Breathing Pattern Sampler and Data Collection . . . . . . . . . . . . . . . 85 4.4.2 Training the Feature Extractor and Comparator . . . . . . . . . . . . . . . 87 4.4.3 Proof-of-concept Application . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.5.2 General Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.5.3 Effectiveness on Different Mobile Models . . . . . . . . . . . . . . . . . . 91 4.5.4 Influence of Different Kinds of Face Covers . . . . . . . . . . . . . . . . . 91 4.5.5 Influence of Different Clothes . . . . . . . . . . . . . . . . . . . . . . . . 93 4.5.6 Influence of Different Postures . . . . . . . . . . . . . . . . . . . . . . . . 94 4.5.7 Influence of Dynamic Status . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.5.8 Influence of Different Environments . . . . . . . . . . . . . . . . . . . . . 95 4.5.9 Defend Replay Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.5.10 Effectiveness of the Average Fingerprint . . . . . . . . . . . . . . . . . . . 96 4.5.11 Efficiency on Mobile Phones . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.7 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 ix CHAPTER 5 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 x LIST OF TABLES Table 3.1: Descramble time (DT) of different record times (RT) with different max scram- ble orders (MSO, the upper bound of 𝑘 in Algorithm 1). . . . . . . . . . . . . . 69 Table 4.1: TPRs of different dynamic status and environments. . . . . . . . . . . . . . . . . 95 xi LIST OF FIGURES Figure 2.1: Illustration of birefringence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Figure 2.2: Illustration of light interference. . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Figure 2.3: Polarization and intensity change through 𝑃1 , 𝑆 and 𝑃2 . . . . . . . . . . . . . . 12 Figure 2.4: Intensity of interference light for different wavelength with different incident angles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Figure 2.5: (a) Hue values on x-y plane by simulation, (b) Hue values measured by mobile phone on x-y plane. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Figure 2.6: (a) Hue matrix sampled, (b) Hue matrix after interpolation. . . . . . . . . . . . 17 Figure 2.7: Overview of RainbowLight. . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 19 Figure 2.8: Illustration of localization algorithm. . . . . . . . . . . . . . . . . . . . . . . . 20 Figure 2.9: Chips in RainbowLight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Figure 2.10: Anchor with chips made by two polarizers and one transparent adhesive tape (i): near to fluorescent (iii) on LED lamp cover, anchor with chips made by one polarizer and one transparent adhesive tape(ii): near to fluorescent (iv): on LED lamp cover, (v): anchor on a glass window. . . . . . . . . . . . . . . . 23 Figure 2.11: Complementary hue observed as rotating mobile phone for different tape thickness (1 ∼ 5). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Figure 2.12: RainbowLight anchor with identifier. . . . . . . . . . . . . . . . . . . . . . . . 25 Figure 2.13: Overview of localization in building. . . . . . . . . . . . . . . . . . . . . . . . 27 Figure 2.14: (a) Experiment environment. (b) Localization precision on different distance. . 28 Figure 2.15: (a) Localization precision map relative position to absolute position. (b) Capture in different angles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Figure 2.16: (a) Localization precision on different sampling density. (b) Localization precision on different number of chips. . . . . . . . . . . . . . . . . . . . . . . 29 xii Figure 2.17: Localization accuracy for different (a) power of lamp, (b) color temperature of lamp. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Figure 2.18: Localization accuracy for different (a) types of lamp, (b) manufacturers of lamp. 30 Figure 2.19: Different light sources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Figure 2.20: Localization accuracy for different (a) mobile phones, (b) lamp status. . . . . . 34 Figure 2.21: Localization precision of different (a) pitch angles, (b) yaw angles. . . . . . . . 35 Figure 2.22: Localization precision of different roll angles of mobile phone. . . . . . . . . . 35 Figure 3.1: Using chirps to smooth the frequency changing components of the scramble. . . 41 Figure 3.2: System Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Figure 3.3: Illustration of how linear chirps mitigate the ringing effect. . . . . . . . . . . . 50 Figure 3.4: Enlarge working area with reflection. . . . . . . . . . . . . . . . . . . . . . . . 53 Figure 3.5: Implementation of Scramble Transmitter. . . . . . . . . . . . . . . . . . . . . . 56 Figure 3.6: Prototype of Patronus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Figure 3.7: Illustration of original waveform, authorized waveform, unauthorized wave- form, and descrambled waveform by STFT attack. . . . . . . . . . . . . . . . . 58 Figure 3.8: PESQ of recordings captured by unauthorized and authorized devices, and PESQ of recordings without scrambling by turning off Patronus as the baseline. 60 Figure 3.9: (a) Upper half: The CDF of SRVA Error of scrambled recordings from the unauthorized device. Lower half: The ratio of SRVA between scrambled recordings and original waveforms. (b) Upper half: The CDF of SRVA Error of descrambled recordings from the authorized device. Lower half: The ratio of SRVA between descrambled recordings and original waveforms. . . . . . . . 61 Figure 3.10: (a) Compare SRVA between before and after descrambling for the human voice. (b) Compare SRVA between before and after descrambling for human recognition. . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . 62
Figure 3.11: (a) Compare average PESQ and SRVA among different models, (b) compare PESQ and SRVA at different distances. . . . . . . . . . . . . . . . . . . . . . 66
Figure 3.12: (a) Illustration of the reflection layer experiment. (b) Compare PESQ and SRVA with different frequency switching times. . . . . . . . . . . . 66
Figure 3.13: (a) and (b): Compare PESQ and SRVA with the use of the reflection layer. (c) and (d): Compare PESQ and SRVA without the use of the reflection layer. . . 68
Figure 4.1: Comparison of existing biometric authentication methods. . . . . . . . . . 73
Figure 4.2: Illustration of chest/abdomen in the inhale step of a human breath process. . . . 76
Figure 4.3: Overview of BreathPass system that consists of an enrollment stage and an authentication stage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Figure 4.4: A controlled experiment verifying our ultrasound frequency selection. The board moves to mimic the chest wall and abdomen motion during breathing. . . 79
Figure 4.5: (a) Spectrogram of a speech "OK, Google!". (b) Spectrogram of a breathing sound. (c) FFT and CDF of a breathing pattern. (d) Spectrogram of a breathing pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Figure 4.6: The structure of our DNN model for fingerprint extractor. . . . . . . . . . . . . 83
Figure 4.7: The end-to-end system design combining the fingerprint extractor with the comparator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Figure 4.8: The UI of BreathPass implementation on a smartphone. (a) the breathing pattern sampler for general data collection; (b)-(e) the pages of our proof-of-concept application. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Figure 4.9: (a) General performance of BreathPass (b) Performance of different mobile models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Figure 4.10: Performance of BreathPass with different kinds of clothes. . . . . . . . . . . . 90
Figure 4.11: Performance of BreathPass with different kinds of clothes. . . . . . . . . . . . 92
Figure 4.12: (a) TPR of BreathPass with different postures. (b) Performance with or without average fingerprint technique. . . . . . . . . . . . . . . . . . . . . . 93

LIST OF ALGORITHMS

Algorithm 3.1: Remove Scramble from the record. . . . . . . . . . . . . . . . . . . . . . . 59

CHAPTER 1

INTRODUCTION

The Internet of Things (IoT) has developed rapidly in recent years. Smart devices such as smartphones, smartwatches, and smart assistants, which are equipped with smart chips as well as sensors, provide users with many easy-to-use functions and lead them to a more convenient life. Many works make use of a device's original functions, or extract features from them, to design systems that serve people, and these works may or may not avoid the side effects of the device. Side effects are behaviors outside a device's main function, and users sometimes try to avoid them. Recently, some applications [1, 2, 3] have carefully studied the side effects of devices. For example, LiTell [1] found that, because of manufacturing errors, the flashing rates of fluorescent lights vary from one lamp to another. It then uses these flashing rates as landmarks and builds a localization system on top of them. Manufacturing errors are not wanted, and people usually try to avoid them.
Before LiTell, many visible light positioning systems [4, 5, 6] either needed to modulate landmarks by dynamically changing flash frequencies or brightness, or needed the user to perform certain actions and then used geometry to calculate the position of the camera. LiTell, however, makes use of these manufacturing errors and treats the resulting flashing frequencies as landmarks, so it requires neither modulation nor special user actions. This reduces both deployment and usage costs.

Another example is LiShield [2], which exploits the rolling shutter effect of CMOS cameras. A rolling shutter, in contrast to a global shutter, captures one column at a time instead of the whole frame. It is a side effect of inexpensive cameras, whereas expensive cameras, e.g., SLR cameras equipped with a global shutter, usually avoid it. LiShield, however, exploits the rolling shutter effect to design a visual privacy protection system. Specifically, LiShield designs a light fixture that consists of three color bulbs. The three color bulbs illuminate alternately at an extremely high frequency that human eyes cannot perceive but a rolling shutter can still distinguish. Although human eyes cannot sense the flicker of the color bulbs, a camera with a rolling shutter captures each column while only one bulb is illuminated, thus generating a mask with multiple color stripes on the captured photo. Therefore, it is difficult for a human to recognize the content of the photo, which prevents unauthorized devices from taking photos. LiShield also designs a mechanism to remove the mask on authorized devices, hence allowing them to take photos.

The nonlinear effect of commercial off-the-shelf (COTS) microphones is another kind of side effect. When a pair of carefully designed ultrasound signals is captured by a microphone, the nonlinearity can generate a shadow spectrum within the audible frequency range. DolphinAttack [7] makes use of the nonlinear effect to break into voice control systems. Many works [8, 9] aim to remove such a spectrum in order to defend against unexpected attacks. UPS+ [3], however, carefully studies the pattern of the nonlinear effect of microphones and designs a new ultrasonic positioning system, which uses extremely high sound frequencies to avoid disturbing pets and infants.

In this dissertation, we carefully study two of these side effects. One is the birefringence of transparent tape, which blurs the image underneath when we observe it and is therefore something people usually want to avoid. The other is the nonlinearity of COTS microphones, which was discussed above. Based on our careful study, we propose two systems, RainbowLight and Patronus. RainbowLight uses birefringence to localize a camera. Different from previous works, RainbowLight works even when the light bulbs are turned off, hence reducing deployment and usage costs. Patronus leverages the nonlinearity to emit an inaudible scramble that interferes with unauthorized recordings. We also design a mechanism that cancels out the scramble using a scramble pattern given to authorized devices, hence preventing unauthorized recordings while allowing authorized ones.

Since 2019, the COVID-19 pandemic has made people's lives inconvenient. The COVID-19 virus attacks the human lungs and makes it hard for patients to breathe. To cope with the pandemic, an existing effort [10] implements a mobile application that leverages ultrasound to capture a user's breath and then detects, in a non-invasive manner, whether the user's lung functionality is normal.
In this dissertation, besides the two systems that leverage side effects to provide enhanced ranging and privacy protection, we propose BreathPass, an authentication system that leverages the user's breath to cope with the problem that Face ID is hard to use when a user wears a face cover and fingerprint authentication is likewise hard to use when a user wears rubber gloves. Compared to existing biometric authentication systems, BreathPass is more resilient to replay attacks and is highly flexible across mobile devices. In addition, with BreathPass, users do not need to take off their face covers or gloves when they use applications that require "who you are" authentication, e.g., Apple Pay. This brings users more safety during the COVID-19 pandemic.

1.1 Proposed techniques and applications

1.1.1 Positioning with birefringence

The ubiquitous existence of lights has made Visible Light Positioning (VLP) popular, and it has attracted much research effort. Existing VLP approaches typically need a specially designed light bulb as a transmitter or a specially designed receiver to collect light information, or they require strict user operations (e.g., capturing multiple light bulbs at a time while holding the smartphone horizontally, or keeping the light bulb turned on). This results in high deployment, maintenance, and usage costs. In Chapter 2, we present RainbowLight. RainbowLight uses a birefringent material to generate a spatially characterized light pattern. A camera captures different color patterns from different positions, which enables low-cost, high-precision 3D positioning. We implement RainbowLight and conduct comprehensive experiments. The evaluation results show that RainbowLight achieves a median error of 1.68 cm in the X-axis, 2 cm in the Y-axis, 5.74 cm in the Z-axis, and 7.04 cm over all three dimensions.

1.1.2 Audio privacy protection with nonlinearity of microphones

The widespread adoption and ubiquity of smart devices equipped with microphones (e.g., cellphones, smartwatches, etc.) unfortunately create many significant privacy risks. In recent years, there have been several cases of people's conversations being secretly recorded, sometimes initiated by the device itself. Although some manufacturers are trying to protect users' privacy, to the best of our knowledge, no effective technical solution is available. In Chapter 3, we present Patronus, a system that prevents unauthorized devices from making secret recordings while allowing authorized devices to record conversations. Patronus prevents unauthorized speech recording by emitting what we call a scramble, a low-frequency noise generated by inaudible ultrasonic waves. The scramble prevents unauthorized recordings by leveraging the nonlinear effects of commercial off-the-shelf microphones. The frequency components of the scramble are randomly determined and connected with linear chirps, and the frequency period is fine-tuned so that the scramble pattern is hard to attack. Patronus allows authorized speech recording by secretly delivering the scramble pattern to authorized devices, which can use an adaptive filter to cancel out the scramble. We implement a prototype system and conduct comprehensive experiments. Our results show that only 19.7% of words protected by Patronus' scramble can be recognized by unauthorized devices.
Furthermore, authorized recordings have a 1.6x higher perceptual evaluation of speech quality (PESQ) score and, on average, 50% lower speech recognition error rates than unauthorized recordings.

1.1.3 Authentication with user's breath

In Chapter 4, we propose BreathPass, a non-invasive authentication system that characterizes the chest/abdomen movement incurred by human breath to enable unlocking smart devices while wearing various types of face covers and clothing, and in different postures. To capture the breathing pattern, BreathPass uses speakers to emit ultrasound signals. The signals are reflected off the chest wall and abdomen and travel back to the microphone, which records the reflected signals. The system then extracts the breathing pattern from the reflected signals, further extracts fingerprints from the breathing pattern, and uses these fingerprints to perform authentication. We carefully design a Deep Neural Network model and explore its capacity for feature abstraction in order to address the challenges associated with tiny position changes resulting in different breathing patterns and the extremely narrow bandwidth of breathing. We implement a prototype and conduct extensive experiments. BreathPass achieves an overall accuracy of 83%, a true positive rate of 73%, and a false positive rate of 5%, according to the performance evaluation results.

1.2 Organization

The remainder of this dissertation is organized as follows. In Chapter 2, we discuss visible light positioning with birefringence; in Chapter 3, we discuss audio privacy protection with the nonlinearity of microphones; in Chapter 4, we discuss authentication with the user's breathing; and in Chapter 5, we conclude this dissertation.

CHAPTER 2

RAINBOWLIGHT: ENABLING 3D AMBIENT LIGHT POSITIONING WITH MOBILE PHONES AND BATTERY-FREE CHIPS

2.1 Overview

The rapid development of mobile devices and the Internet of Things (IoT) facilitates the development of a smarter world. More and more smart robots and smart devices are used in different places, such as factories, airports, and even homes. Indoor localization significantly expands the capability of these devices, and thus it has attracted much research effort; e.g., a large collection of RF-based positioning approaches [11, 12, 13, 14, 15, 16] has been proposed. Visible Light Positioning (VLP) has recently been shown to be a promising approach for indoor localization, owing to its potential for high localization precision and the ubiquitous existence of light. The basic idea of VLP is to exploit features and information from received light to derive the relative position to the light. For example, many approaches use LED lights with a controller [17, 18, 19, 5, 20] to modulate the required features, so that a receiver can use the modulated features for localization. Further, instead of using a controller to actively modulate information in light, many approaches [21, 1, 22, 23] resort to intrinsic features of the light or the receiver. Meanwhile, [24, 25, 26, 6, 27, 28] use geometrical relationships among lights for localization.

Existing VLP approaches exhibit high accuracy for indoor localization. However, the following limitations still hinder their application: (1) Specially designed LEDs with controllers [17, 20] or receivers with sensors [5, 28]. Such LEDs and receivers are still not widely used in today's buildings. (2) Pre-collected features for all lights [1, 22]. This introduces a high overhead; it is difficult to ensure the features are stable over time, and the system needs to stay updated for all lights.
(3) Strict usage requirements. For example, [1] requires keeping the mobile phone horizontal, and [24] requires capturing at least 3 lamps in a photo each time. (4) They do not work when the light is turned off in the daytime. During the daytime, people often turn their lights off and use the ambient light, i.e., sunlight passing through the window, to meet the requirement of illumination. Just as DarkLight [29] in the field of visible light communication (VLC) realizes communication with extremely low luminance, we believe that performing localization with the light turned off is a non-trivial and worthwhile goal as well. Existing works cannot work at all when the light is switched off because they depend on LEDs or special receivers. These limitations incur high deployment, maintenance, and usage overhead.

To address these limitations, we propose RainbowLight, a low-cost 3D localization approach that significantly reduces the deployment, maintenance, and usage overhead. Our key finding for RainbowLight is that light passing through a chip containing a polarizer and a birefringent material produces different interference patterns and light spectra in different directions. We go deep into the birefringence principle to analyze the relationship between direction, light interference, and spectrum, and derive a model to characterize this relationship. The model builds the foundation for obtaining the direction to a chip based on the received light spectrum. By calculating the directions to multiple chips, we can theoretically derive the 3D location of the receiver.

In the practical design of RainbowLight, we find that the light spectrum is difficult to measure on commercial off-the-shelf (COTS) mobile phones. We use the color extracted from a photo to approximate the light spectrum and show its effectiveness. To derive the light direction for localization, the theoretical model requires various parameters, e.g., optical parameters and the thickness of the material, which are difficult to measure in practice. Instead of measuring those parameters, we build a sparse initial mapping between hue values and directions by sampling. Further, we conduct model-based interpolation on the sparse initial mapping to derive a fine-grained mapping. Such sparse sampling only needs to be performed once for the same type of polarizer and birefringent material. After capturing a photo containing multiple chips, we extract the color pattern of those chips and calculate the directions to them. Finally, we leverage a direction-based intersection method to calculate the location.

In our implementation, we use transparent adhesive tape as the birefringent material. We make small transparent chips by sticking tape to a thin plastic polarizer. For localization, we only need to place multiple chips on a suitable plane (e.g., a lamp cover or a glass window) to enable 3D localization (see Figure 2.10). It should be noted that RainbowLight does not actively modulate information in the light, and thus it also works in the light-off scenario in the daytime. We can place chips on a wall, a table, or other flat surfaces. This significantly extends the application scenarios. We evaluate the performance of RainbowLight in different scenarios for different types of light as well as different types of surfaces. The evaluation results show that RainbowLight achieves high localization accuracy at low cost. It also works well even in the light-off scenario in the daytime.
The contributions of our work are as follows:

• We show that light passing through a chip made of a polarizer and a birefringent material produces different interference patterns and light spectra in different directions. We analyze and derive a model to characterize the direction, interference, and light spectrum as the foundation for 3D localization.

• Based on the model, we propose RainbowLight, a low-cost ambient light 3D localization approach with low deployment, maintenance, and usage costs.

• We implement RainbowLight and evaluate its performance through extensive experiments. RainbowLight achieves an average localization error of 3.3 cm in 2D and 9.6 cm in 3D, and an error of 7.4 cm in 2D and 20.5 cm in 3D in the light-off scenario in the daytime.

The remainder of this chapter is organized as follows. Section 2.2 introduces the background of our work. Section 2.3 presents the 3D localization model of RainbowLight. Sections 2.4 and 2.5 introduce the design and implementation of RainbowLight, respectively. Section 2.6 discusses how to deploy RainbowLight to obtain absolute positions in a large area. Section 2.7 presents the evaluation results of RainbowLight. Section 2.8 introduces related work.

Figure 2.1: Illustration of birefringence.

2.2 Background

2.2.1 Polarization

Polarization is a feature of a transverse wave that specifies its oscillation in different directions. Natural light, such as light from a lamp, has oscillations in many directions. A polarizer for light is a device that passes light whose oscillation direction is parallel to its transmission axis and blocks light whose oscillation direction is perpendicular to its transmission axis. Polarizers are widely used in various applications; e.g., 3D glasses have two polarizers, one for each lens, with different transmission axes that allow light with different oscillations to pass. A polarizer with a single transmission axis is called a linear polarizer.

Light is polarized after passing through a polarizer. The polarized light has an oscillation direction parallel to the transmission axis of the polarizer. Denote the angle between the oscillation direction of the light and the transmission axis of a polarizer as $\phi$. According to Malus's law [30], the intensity of the light that passes through the polarizer, denoted by $I_\phi$, is given by

$I_\phi = I \cos^2 \phi, \quad (2.1)$

where $I$ is the original intensity of the light. Natural light has oscillations in all directions. When natural light passes through a linear polarizer, it becomes linearly polarized light, i.e., light with a single oscillation direction.

2.2.2 Birefringence

Birefringence [31] is a feature of optically anisotropic materials such as plastics, calcite, and quartz. When a ray of light passes through a birefringent material, two refracted rays can be observed. As shown in Figure 2.1, the ray of light is split into two rays taking different paths in the material. Those two rays have orthogonal polarization directions and different refractive indices in the birefringent material. There is a special direction, namely the optic axis, for each type of birefringent material. One of the two rays, called the ordinary ray, has a polarization direction perpendicular to the optic axis. Its refractive index is called the ordinary refractive index and is denoted by $n_o$. The other ray, called the extraordinary ray, has a polarization direction along the optic axis.
Its refractive index is called the extraordinary refractive index and is denoted by $n_e$. As shown in Figure 2.1, according to Snell's law [32], we have

$n_{air} \sin\theta = n_e \sin\theta_e = n_o \sin\theta_o \quad (2.2)$

where $n_{air} \approx 1$ is the refractive index of air, and $\theta_o$ and $\theta_e$ are the refraction angles of the ordinary ray and the extraordinary ray, respectively. Usually $n_e \neq n_o$, so the refraction angles and refractive indices of the ordinary ray and the extraordinary ray are different. Thus there is an optical path difference between the two rays after the birefringent material. For a certain type of material, $n_o$ is fixed and determined by the material, while $n_e$ varies depending on the direction of the incident ray. As shown in Figure 2.1, denote the incident angle as $\theta$ and the angle between the projection of the incident light on the incident plane and the optic axis as $\gamma$. We will show how to obtain $n_e$ and $\theta_e$ from $\theta$ and $\gamma$ in practice. Then we can calculate the optical paths of the ordinary ray and the extraordinary ray. If the incident light $L$ is linearly polarized and the angle between its polarization direction and the optic axis is $\phi_1$, the intensities of the ordinary ray $I_o$ and the extraordinary ray $I_e$ can be calculated as

$I_o = I \sin^2 \phi_1, \qquad I_e = I \cos^2 \phi_1 \quad (2.3)$

where $I$ is the intensity of $L$.

Figure 2.2: Illustration of light interference.

2.2.3 Interference

When two light beams $L_1$ and $L_2$ have the same frequency, a stable phase difference $\delta$, and the same polarization direction, they can interfere with each other. For different values of $\delta$, the two light beams have different interference results. The interference intensity can be calculated as

$I_i = I_1 + I_2 + 2\sqrt{I_1 I_2}\cos\delta \quad (2.4)$

where $I_i$ is the light intensity after interference, $I_1$ and $I_2$ are the intensities of $L_1$ and $L_2$, and $\delta$ is the phase difference between $L_1$ and $L_2$, often derived from the optical path difference.

2.3 Localization Basics

We aim to answer the question of why observing a chip made of polarizers and birefringent material from different directions yields different color patterns. In this section, we first build a model from the background to show the principle of our 3D positioning approach. Then we conduct an experiment to validate the model. Because some of the parameters are hard to measure, it is difficult to use this model to perform positioning directly; we show how to address these challenges in our design in Section 2.4. Readers who are not interested in the detailed analysis of RainbowLight can skip this section.

Figure 2.3: Polarization and intensity change through $P_1$, $S$ and $P_2$.

As shown in Figure 2.2, a birefringent material $S$ is placed between two polarizers $P_1$ and $P_2$. Light from a source (e.g., a lamp) first passes through polarizer $P_1$ and becomes linearly polarized light. Consider two rays of the polarized light, $L_1$ and $L_2$, incident on $S$ at points $A$ and $B$, respectively. As introduced in Section 2.2, $L_1$ is separated into two parts: $L_{1o}$ (the ordinary ray) and $L_{1e}$ (the extraordinary ray). The refractive indices of the ordinary ray and the extraordinary ray are $n_o$ and $n_e$, respectively. Similarly, $L_2$ is separated into two parts: $L_{2o}$ (the ordinary ray) and $L_{2e}$ (the extraordinary ray). After passing through the other polarizer $P_2$, the rays $L_{1e}$ and $L_{2o}$ become $L'_{1e}$ and $L'_{2o}$. $L'_{2o}$ of $L_2$ interferes with $L'_{1e}$ of $L_1$.
Then the interference result of $L'_{2o}$ and $L'_{1e}$ is measured by a camera at $Q$. Next, we analyze the light spectrum of the interference result and show its relationship with the angle $\theta$.

2.3.1 Interference Analysis

From Eq. (2.4), we know that the interference light intensity depends on the two coherent light intensities and their phase difference. We analyze the intensity and phase difference of $L'_{1e}$ and $L'_{2o}$ in the following parts.

2.3.1.1 Intensity

Assume the angles between the optic axis of $S$ and the transmission axes of the two polarizers $P_1$ and $P_2$ are $\phi_1$ and $\phi_2$, respectively. Denote the intensity of $L_1$ as $I_1$, and assume light rays $L_1$ and $L_2$ have equal intensity. According to Eq. (2.3), $I_{1o} = I_1 \sin^2\phi_1$ and $I_{1e} = I_1 \cos^2\phi_1$. Denote the light intensities of $L'_{1e}$ and $L'_{2o}$ as $I'_{1e}$ and $I'_{2o}$, respectively. According to Eq. (2.1), $I'_{1e}$ and $I'_{2o}$ can be calculated as

$I'_{2o} = I_{1o}\sin^2\phi_2 = I_1\sin^2\phi_1\sin^2\phi_2, \qquad I'_{1e} = I_{1e}\cos^2\phi_2 = I_1\cos^2\phi_1\cos^2\phi_2. \quad (2.5)$

2.3.1.2 Phase Difference

As shown in Figure 2.2, the incident angles of $L_1$ and $L_2$ on $S$ are both $\theta$, the thickness of $S$ is $d$, and the refraction angles of $L_{1e}$ and $L_{2o}$ are $\theta_e$ and $\theta_o$. The optical path difference $\Delta$ of $L_{1e}$ and $L_{2o}$ at point $Q$ can be calculated as

$\Delta = FA \cdot n_{air} + AD \cdot n_e - BD \cdot n_o = d(\tan\theta_o - \tan\theta_e)\sin\theta \cdot n_{air} + \frac{d}{\cos\theta_e} n_e - \frac{d}{\cos\theta_o} n_o \quad (2.6)$

where $FA$, $AD$, and $BD$ are the lengths from $F$ to $A$, from $A$ to $D$, and from $B$ to $D$, respectively. Combining Eq. (2.2) and Eq. (2.6), we have

$\Delta = d(n_e\cos\theta_e - n_o\cos\theta_o). \quad (2.7)$

As aforementioned, for a particular material $n_o$ is usually fixed, while $n_e$ and $\theta_e$ are related to the incident angle. We put the details of calculating $n_e$, $\theta_e$, and $\Delta$ in Section 2.3.1.3. Therefore, we have

$\Delta = d\left(\sqrt{N_e^2 - \sin^2\theta\left(\sin^2\gamma + \frac{N_e^2}{N_o^2}\cos^2\gamma\right)} - \sqrt{N_o^2 - \sin^2\theta}\right) \quad (2.8)$

Figure 2.4: Intensity of interference light for different wavelengths with different incident angles.

where $N_o$ and $N_e$ are the principal refractive indices of $S$, which are fixed for a given type of material, $\theta$ is the incident angle, and $\gamma$ is the angle between the projection of the incident light on the incident plane and the optic axis, as shown in Figure 2.1. The optical path difference is defined for two light beams; the corresponding phase difference differs for different wavelengths. For light with a specific wavelength $\lambda$, we can calculate the phase difference $\delta_D$ of $L_{1e}$ and $L_{2o}$ at point $D$ as

$\delta_D = \frac{2\pi}{\lambda}\Delta. \quad (2.9)$

Due to the phase difference introduced by the projection on $P_2$, the phase difference between the two coherent lights $L'_{1e}$ and $L'_{2o}$ at point $Q$ is

$\delta = \delta_D + \delta' = \begin{cases} \frac{2\pi}{\lambda}\Delta & \text{(case 1)} \\ \frac{2\pi}{\lambda}\Delta + \pi & \text{(case 2)} \end{cases} \quad (2.10)$

where case 1 means the vectors $L'_{1o}$ and $L'_{1e}$ are in the same direction on $P_2$, and case 2 means they have opposite directions.

2.3.1.3 Calculation of $n_e$, $\theta_e$, and $\Delta$

Inspired by [33], as shown in Figure 2.1, the direction vectors of the optic axis, the ordinary ray, and the extraordinary ray in the birefringent material are

$e_a = (\cos\gamma, \sin\gamma, 0) \quad (2.11)$
$e_{ko} = (\sin\theta_o, 0, \cos\theta_o) \quad (2.12)$
$e_{ke} = (\sin\theta_e, 0, \cos\theta_e) \quad (2.13)$

We denote the angle between the optic axis and the extraordinary ray as $\alpha$, i.e., the angle between $e_a$ and $e_{ke}$.
So according to (2.11) and (2.13), we have

$\cos\alpha = e_a \cdot e_{ke} = \cos\gamma\sin\theta_e \quad (2.14)$

Because the refractive index of the extraordinary ray varies with the incident angle, according to the relationship between $\alpha$ and the refractive index of the extraordinary ray $n_e$ in [34], we have

$n_e = \frac{N_o N_e}{\sqrt{N_o^2\sin^2\alpha + N_e^2\cos^2\alpha}} = \frac{N_o N_e}{\sqrt{N_o^2 + (N_e^2 - N_o^2)\cos^2\alpha}} \quad (2.15)$

where $N_o$ and $N_e$ are the principal refractive indices and are fixed for each type of material. According to (2.14) and (2.15), we have

$n_e = \frac{N_o N_e}{\sqrt{N_o^2 + (N_e^2 - N_o^2)\cos^2\gamma\sin^2\theta_e}} \quad (2.16)$

According to Snell's law, we have

$n_{air}\sin\theta = n_e\sin\theta_e = n_o\sin\theta_o \quad (2.17)$

where $n_{air} \approx 1$ is the refractive index of air. Then we have

$n_e = \frac{\sin\theta}{\sin\theta_e}. \quad (2.18)$

According to (2.16) and (2.18), we have

$\theta_e = \arcsin\sqrt{\frac{\sin^2\theta}{N_e^2 - \sin^2\theta\left(\frac{N_e^2}{N_o^2}\cos^2\gamma - \cos^2\gamma\right)}} \quad (2.19)$

Figure 2.5: (a) Hue values on x-y plane by simulation, (b) Hue values measured by mobile phone on x-y plane.

Finally, according to (2.18) and (2.19), we have

$n_e = \sqrt{N_e^2 - \sin^2\theta\left(\frac{N_e^2}{N_o^2}\cos^2\gamma - \cos^2\gamma\right)} \quad (2.20)$

Because the optical path difference is

$\Delta = d(n_e\cos\theta_e - n_o\cos\theta_o) \quad (2.21)$

we substitute $n_e$, $\theta_e$, $n_o$, and $\theta_o$ into Eq. (2.21) and obtain the expression of $\Delta$ in terms of known parameters:

$\Delta = d\left(\sqrt{N_e^2 - \sin^2\theta\left(\sin^2\gamma + \frac{N_e^2}{N_o^2}\cos^2\gamma\right)} - \sqrt{N_o^2 - \sin^2\theta}\right) \quad (2.22)$

2.3.1.4 Summary

According to Eq. (2.4), the intensity spectrum of the interference light at $Q$ can be calculated as

$I_Q = I_1\cos^2\phi_1\cos^2\phi_2 + I_1\sin^2\phi_1\sin^2\phi_2 + 2I_1\cos\phi_1\cos\phi_2\sin\phi_1\sin\phi_2\cos\delta \quad (2.23)$

where $\delta$ can be calculated according to Eq. (2.10).

Figure 2.6: (a) Hue matrix sampled, (b) Hue matrix after interpolation.

According to Eq. (2.23), given the intensity spectrum of the light source $I_1$, the angle $\phi_1$ between the optic axis of the birefringent material and the polarizer $P_1$, the angle $\phi_2$ between the optic axis of the birefringent material and the polarizer $P_2$, the incident direction parameters $\theta$ and $\gamma$, and the material parameters (the principal refractive indices and the thickness $d$), we can calculate the light intensity $I_Q$ at $Q$. Figure 2.4 shows the interference light spectrum for different parameters. Given the values of $I_1$, $\phi_1$, $\phi_2$, and $d$, different combinations of $\theta$ and $\gamma$ result in different spectra of $I_Q$. This is the foundation for obtaining light incident angles from different interference results. As long as we can get the incident angles from multiple points, we can use an AoA-based method for localization.

2.3.2 Validation

2.3.2.1 Choose The Light Spectrum Feature

Mobile cameras usually do not have the capability of measuring the light spectrum directly. However, the direction is encoded in the interference light spectrum, and we have to distinguish different light spectra to distinguish different directions. The challenge is to find a proper light feature that satisfies two conditions at the same time: it can be measured by a COTS camera, and it can indicate the direction from the source to the chip. It is well known that different light spectra result in different colors of the mixed light. A straightforward approach is to measure the RGB color and map RGB vectors to different directions. However, we find this is not feasible in practice, as the spectrum cannot be effectively represented in RGB color. Instead, we use the HSL (Hue, Saturation, Lightness) color space and find that the $H$ (i.e., Hue) component of HSL is much more suitable for representing the color of mixtures of lights [35].
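To make the direction-to-spectrum relationship concrete, the following is a minimal numerical sketch of Eqs. (2.20)–(2.23): it sweeps the visible band and evaluates the interference intensity $I_Q(\lambda)$ for a few viewing directions, in the same spirit as the simulation behind Figure 2.4 and Figure 2.5a. The layer thickness, the quartz-like principal indices, and the polarizer angles are illustrative assumptions rather than measured RainbowLight parameters.

```python
import numpy as np

# Illustrative parameters (assumptions): quartz-like principal refractive
# indices and a 0.6 mm birefringent layer, as in the simulation of Sec. 2.3.2.2.
N_O, N_E, D = 1.544, 1.553, 0.6e-3   # [-], [-], meters

def interference_intensity(wavelength, theta, gamma, phi1, phi2, I1=1.0):
    """I_Q for one wavelength, following Eq. (2.23), with the optical path
    difference of Eq. (2.22) and the case-1 phase difference of Eq. (2.10)."""
    s2 = np.sin(theta) ** 2
    delta_path = D * (np.sqrt(N_E**2 - s2 * (np.sin(gamma)**2
                                             + (N_E**2 / N_O**2) * np.cos(gamma)**2))
                      - np.sqrt(N_O**2 - s2))
    delta = 2.0 * np.pi * delta_path / wavelength        # Eq. (2.9)
    return (I1 * np.cos(phi1)**2 * np.cos(phi2)**2
            + I1 * np.sin(phi1)**2 * np.sin(phi2)**2
            + 2.0 * I1 * np.cos(phi1) * np.cos(phi2)
                       * np.sin(phi1) * np.sin(phi2) * np.cos(delta))

# Sweep the visible band for a few viewing directions, as in Figure 2.4.
wavelengths = np.linspace(400e-9, 800e-9, 401)
for theta_deg, gamma_deg in [(45, 0), (45, 90), (60, 0)]:
    spectrum = interference_intensity(wavelengths,
                                      np.radians(theta_deg), np.radians(gamma_deg),
                                      phi1=np.radians(45), phi2=np.radians(45))
    # Different (theta, gamma) pairs yield visibly different spectra, i.e., colors.
    print(f"theta={theta_deg}, gamma={gamma_deg}: "
          f"peak near {wavelengths[np.argmax(spectrum)] * 1e9:.0f} nm")
```

Turning each simulated spectrum into a perceived hue would additionally require the camera's spectral response, which is one reason the system samples hue empirically rather than computing it from first principles.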
2.3.2.2 Measurement Result

We conduct an experiment to validate the model. We measure the hue value at different positions after $P_2$. Figure 2.5b shows the measured hue values for different positions on a plane at a certain distance from the light source. Then we compare the measurement result with the simulation result based on Eq. (2.23). In our simulation, we use the parameters of a quartz crystal (a type of birefringent material) chip with a thickness of 0.6 mm. We measure the intensity spectrum of the interference result in different directions. We leverage the color wheel [35] to approximate the intensity spectrum with a hue value. Figure 2.5a shows the hue value with respect to positions on a surface parallel to the birefringent chip. We can see that the color regularities of Figure 2.5a and Figure 2.5b are very similar. This coincides with our analysis and Eq. (2.23). This also means that the hue value is effective for representing the intensity spectrum.

2.4 RainbowLight Design

2.4.1 Design Overview

Figure 2.7 illustrates the system overview of RainbowLight. The chips used in RainbowLight are a combination of two polarizers and one birefringent chip, as shown in Figure 2.2. With one chip, we can calculate direction information; combining the direction information from multiple chips, we can derive the 3D location. The main design of RainbowLight consists of two parts. The first part is mapping initialization, which builds an initial mapping between direction and hue value for a certain type of chip. The mapping initialization only needs to be performed once for a certain type of chip. The second part is the 3D localization component. In this part, a mobile camera takes a photo containing multiple chips. Based on the hue values and the initial mapping, the directions to those chips can be calculated. We then propose a direction-intersection-based method to calculate the final 3D location.

Figure 2.7: Overview of RainbowLight.

2.4.2 Mapping Initialization

The mapping between light directions and hue values can be built by sampling at different positions. We put a chip at the origin $O$ of the coordinate system, with the chip parallel to the x-y plane. A mobile phone moves in a grid on a certain plane ($z = 1$ m) and captures a photo containing the chip at each position. For a sampling position $r$, it derives the hue value $h$ of the chip's color from the captured photo. This means that the hue values for all points on the ray $\overrightarrow{Or}$, i.e., the ray from the chip towards the sampling position on the plane, are $h$. Therefore, we build a map $R_S \rightarrow H_S$ from sampling positions $R_S = (r_1, r_2, \ldots, r_n)$ to hue values

$H_S = (h_1, h_2, \cdots, h_n) \quad (2.24)$

where $h_i$ denotes the hue value observed by the mobile phone from points on the ray $\overrightarrow{Or_i}$. With a higher sampling density, the map is more accurate; on the other hand, a higher density also means a higher sampling overhead. To reduce the initial sampling overhead, we propose an interpolation-based method that improves the granularity of the initial map. We leverage the color regularity to interpolate the coarse-grained sampling matrix $H_S$ and build a fine-grained map $R \rightarrow H$. We examine the performance of interpolation under different sampling densities in Section 2.7.3. As shown in Figure 2.5, the color gradually changes with the position.

Figure 2.8: Illustration of localization algorithm.
As the hue value ranges from 0 to 360, the interpolation must carefully handle hue values that cross the hue range boundary. More specifically, for two hue values $h_1$ and $h_2$ ($h_1 > h_2$) at two adjacent sampling positions, we first calculate the hue gap $h_\Delta = h_1 - h_2$. If $h_\Delta$ is smaller than a pre-defined threshold $thr$ (e.g., $thr = 350$), the interpolation can be performed between $h_1$ and $h_2$ directly. If $h_\Delta$ is larger than the threshold, we consider the hue value between those two sampling positions to cross the hue boundary, and the interpolation should be performed between $h_1$ and $h_2 + 360$. All hue values are then taken modulo 360 from the interpolation result to guarantee they lie in [0, 360). Figure 2.6a shows the original hue matrix, and Figure 2.6b shows the interpolation result. In practice, for the same type of chip, we only need to build the initial map $R \rightarrow H$ once. This significantly reduces the initialization overhead of RainbowLight. Later, we will show how to leverage the map for localization in 3D space.

2.4.3 3D Localization

2.4.3.1 Localization Design

To enable 3D localization, we simply stick several chips on a transparent surface. Without loss of generality, we assume three chips $S_1$, $S_2$, and $S_3$ are used; in Section 2.7, we will show the impact of the number of chips. Denote the positions of the centers of $S_1$, $S_2$, and $S_3$ as $p_1$, $p_2$, and $p_3$, respectively. These positions, namely the reference points, can be measured in advance. A mobile phone with a camera at position $r_x$ simply captures a photo containing $S_1$, $S_2$, and $S_3$. We calculate the hue values $\tilde{h}_1$, $\tilde{h}_2$, and $\tilde{h}_3$ of those three chips from the photo. Based on the initial map between colors and directions, RainbowLight can obtain the possible directions from $p_1$, $p_2$, and $p_3$, respectively. Thus we have three groups of ray directions, one from each reference point. Then we can obtain the position $r_x$ based on the intersection of those ray directions.

2.4.3.2 Intersection Based Localization

The goal of localization is to calculate the position $r_x$ based on $\tilde{h}_1$, $\tilde{h}_2$, $\tilde{h}_3$, and $R \rightarrow H$.

Find line group candidates: The initial map is built using a chip at the coordinate origin $O$. In practice, chips are usually attached at other positions. To make the map $R \rightarrow H$ suitable for the deployment of a specific chip, we need to translate the coordinates of the initial mapping. The map becomes $R_j \rightarrow H$ for $j = 1, 2, 3$, where $R_j = R + p_j$ is the transformed set of sampling positions for $S_j$. Due to the color error of the camera on a mobile phone, there may be multiple lines with hue close to $\tilde{h}_1$, $\tilde{h}_2$, and $\tilde{h}_3$. Meanwhile, according to Eq. (2.23), we also find that multiple combinations of $\theta$ and $\gamma$ can lead to the same hue value, which indicates that multiple directions may share the same hue value. Therefore, for each chip, we can calculate a group of lines. Overall, we obtain three groups of lines denoted by $G_1$, $G_2$, and $G_3$, where $G_j = \{\overrightarrow{r_i^j p_j} \mid |h_i - \tilde{h}_j| < \epsilon_h\}$ for $j = 1, 2, 3$, $r_i^j \in R_j$, and $\epsilon_h$ is the maximum allowed hue error.

Figure 2.9: Chips in RainbowLight.

Line intersection: The main idea is to use the intersection point of those three sets of lines $G_1$, $G_2$, and $G_3$ as the localization result $r_x$. There should exist three lines from $G_1$, $G_2$, and $G_3$, respectively, that intersect at point $r_x$.
Due to hue value measurement error, those three lines may be very close to each other but not directly intersect in practice. Therefore, we instead use an algorithm that approaches the problem from the opposite direction. The idea is based on the principle that light travels in a straight line. As shown in Figure 2.8, without loss of generality, suppose we want to perform localization in a 2D plane. We put two chips, namely 𝑆1 and 𝑆2, at location 0 cm and perform initialization at 100 cm; i.e., at point 𝐴 we observe 𝑆1 and 𝑆2 and get hue values 𝐶𝐴1 and 𝐶𝐴2 respectively, and at point 𝐵 we get hue values 𝐶𝐵1 and 𝐶𝐵2. According to the straight-line principle, at point 𝐶 at 60 cm we will observe hue values 𝐶𝐵1 and 𝐶𝐴2. Therefore, if we have sampled all points at 100 cm, the hue values of nearly all points in the plane can ideally be derived. We regard each pair of hue values as a 2D coordinate. After that, when we capture a photo containing those two chips, we extract the hue values, for example (𝐶1, 𝐶2), and obtain the final positioning result by finding the derived coordinate with the minimum distance to (𝐶1, 𝐶2). We can easily extend this algorithm from the 2D plane to 3D space.

Figure 2.10: Anchors made with chips. (i) and (iii): anchor with chips made of two polarizers and one transparent adhesive tape, placed near a fluorescent lamp and on an LED lamp cover, respectively; (ii) and (iv): anchor with chips made of one polarizer and one transparent adhesive tape, placed near a fluorescent lamp and on an LED lamp cover, respectively; (v): anchor on a glass window.

2.5 Implementation

RainbowLight consists of two components: anchor and receiver. In this section, we present the details of those two components. We also discuss a variant of RainbowLight that puts polarizer 𝑃2 in front of the camera so that human eyes do not observe the chip colors. Since RainbowLight performs relative localization with respect to a given anchor, it needs to identify which anchor is captured by the camera so that it can be used in a large region. We also discuss how to provide identifiers to anchors in this section.

2.5.1 Anchor

The anchor of RainbowLight is composed of a group of chips. Each chip consists of two linear polarizers and a thin birefringence material chip. We stick the birefringence material chip between two linear polarizers. As shown in Figure 2.9, we use everyday transparent adhesive tape as the birefringence material. RainbowLight does not require sticking the anchors on a lamp. We can put anchors on different surfaces as long as light can pass through the chips. For example, as shown in Figure 2.10 (i), (iii) and (v), we put anchors near lamps, on a lamp cover, or on a window. As shown in Figure 2.10 (i) and (iii), although the chips display colors, each chip made of polarizers and transparent adhesive tape is very small, so it does not disturb human eyes. To enable RainbowLight, we also need to record the relative positions of those chips.

2.5.2 Receiver

We use a smartphone as the receiver. The camera captures a photo containing the anchor. We implement the receiver software on Android. While the camera is taking a photo, RainbowLight uses automatic exposure to adapt to the luminance of the environment. After obtaining the photo, we apply a white balance algorithm to eliminate color shift among different camera models, then use OpenCV to locate each chip in the image based on features such as shape and derive HSL information from the photo.
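Conceptually, the matching step that follows this hue extraction can be sketched as below. This is a minimal Python/NumPy illustration of the nearest-hue matching from Section 2.4.3.2; the chip centres, the sampled direction/hue map, and the discretized candidate positions are assumed inputs, and all names are hypothetical rather than taken from our implementation.

```python
import numpy as np

def hue_dist(a, b):
    """Circular distance between hue values in [0, 360)."""
    d = np.abs(a - b) % 360.0
    return np.minimum(d, 360.0 - d)

def predict_hues(x, chips, directions, hues):
    """Predicted hue of every chip seen from candidate position x.
    chips:      (m, 3) chip centres p_j
    directions: (n, 3) unit ray directions sampled during initialization
    hues:       (n,)   hue measured along each sampled direction
    """
    preds = []
    for p in chips:
        d = x - p
        d = d / np.linalg.norm(d)
        idx = int(np.argmax(directions @ d))   # most similar sampled ray
        preds.append(hues[idx])
    return np.array(preds)

def localize(observed, chips, directions, hues, candidates):
    """Return the candidate position whose predicted hue vector is closest,
    in circular hue distance, to the hues observed in the photo."""
    errors = [hue_dist(predict_hues(x, chips, directions, hues), observed).sum()
              for x in candidates]
    return candidates[int(np.argmin(errors))]
```

In this sketch, treating the observed hue tuple as a coordinate and searching for the closest predicted tuple plays the role of intersecting the candidate line groups 𝐺1, 𝐺2, and 𝐺3.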
To address hue value estimation error in practice, we use the averaged hue value for each chip as the hue value for localization. Then we use the 3D localization algorithm mentioned in section 2.4.3.2 to get the position of the camera. Now we present a variant of RainbowLight to eliminate the color which can be observed by human eyes directly. We put polarizer 𝑃2 in front of the camera. In such a case, human eyes cannot observe the color displayed by chips directly as shown in Figure 2.10 (ii) and (iv), but cameras can capture chips with different colors. However, if we put 𝑃2 in front of the camera, the camera’s rotation would result in the change of color of the chips, thus color-direction map could not be used. Fortunately, since the hue value instead of RGB represents color in RainbowLight, chips only show two complementary hue values with the camera’s rotation as shown in Figure 2.11. Therefore, we measure the camera’s rotation angle firstly, if it results in complementary hue values of initialization, we can transform them into original hue values hence performing localization. Attaching polarizer in front of the camera will bring in extra costs, and brings error of accuracy with the camera’s rotation. We will present the accuracy in Section 2.7. Users who deploy the RainbowLight can choose where to put the polarizer 𝑃2 according to their conditions and requirements. We measure the latency of RainbowLight. In the measurement, we let RainbowLight process 10 photos to measure the average latency. The mobile phone we used is Huawei Nexus 6P. It takes 236 ms on average to find chips and extract hue values. It takes 503 ms on average for 3D localization 24 400 300 Hue 200 1 2 100 3 4 0 5 0 50 100 150 200 2 (°) Figure 2.11: Complementary hue observed as rotating mobile phone for different tape thickness (1 ∼ 5). Matching Points Coding Area Localization Points Figure 2.12: RainbowLight anchor with identifier. from hue values. We optimize RainbowLight 3D localization to parallel the processing in our implementation of localization. With such an optimization, the time for 3D localization reduces to 123 ms on average. This would apply to most VLP based applications such as navigation. We also use Power Monitor to measure the power consumption of RainbowLight with Nexus 5x, the result shows that our algorithm takes 1.122J to process one photo and perform localization. 2.6 Apply RainbowLight to Localization in a Large Area We have presented a novel relative localization approach, RainbowLight, which can derive the camera’s relative position to an anchor. However, one small RainbowLight anchor only can be 25 captured by the camera in a small region, thus it is difficult to apply to localization in a large area such as a shopping mall. To address this issue, we can use the idea similar to use multiple lamps to illuminate an entire room, in other words, we give each anchor a unique identifier and extract the identifier from the anchor firstly to get a coarse-grained area where camera located, then derive precise relative location to the anchor. Therefore, we can use RainbowLight to get the camera’s location in a large area. 2.6.1 Providing Identifier to RainbowLight Anchor We can use the existing method such as iLAMP [22] to distinguish different light sources in a large area if we put an anchor on the lamp. We can also attach the QR code on each anchor to identify them. 
Considering that iLAMP cannot be used when the light is off, we also design a QR-code-like method that uses our localization chips to provide an ID. As shown in Figure 2.12, after this modification an anchor consists of three components. While the Localization Points used to derive the relative position are made of polarizers and transparent adhesive tape, the Matching Points and the Coding Area are made of polarizers only, in two perpendicular polarization directions. Similar to the QR code, three of the matching points share the same direction and the fourth does not; therefore, the code can be decoded even if the anchor is rotated in the photo. We use these two directions to represent 0 and 1 in the coding area. After taking a photo through another polarizer, either covering the anchor or placed in front of the camera, we can compare the brightness of each polarizer in the coding area with that of the matching points to recognize whether it represents 0 or 1, and hence decode the identifier. In this case, the anchor can encode $2^{12} = 4096$ identifiers in the coding area.

2.6.2 Localization in a Large Area

As shown in Figure 2.13, without loss of generality, suppose we have three anchors in a large area. We store each anchor's identifier and its real position in a database in advance. During the localization process, for example, after a camera in area #3, which is the valid area of anchor #3, takes a photo containing anchor #3, the system first decodes the identifier of the anchor in the photo and then gets the real position of the anchor from the database. Combining this with the relative position from the camera to the anchor, we obtain the camera's real position.

Figure 2.13: Overview of localization in a building.

2.7 Evaluation

We evaluate the performance of RainbowLight from the following aspects:
• Localization accuracy for different distances.
• The performance of mapping the position relative to the landmark to the absolute position.
• The impact of system parameters on localization accuracy.
• System performance under different light sources (different manufacturers, color temperatures, lamp types, and powers).
• System performance under different mobile phone models.
• System performance with the light on/off.
• System performance with different angles of mobile phone orientation.

Figure 2.14: (a) Experiment environment. (b) Localization precision at different distances.

Figure 2.15: (a) Localization precision when mapping the relative position to the absolute position. (b) Capture at different angles.

Through the evaluation, we aim to show the effectiveness of RainbowLight in practice. It should be noted that for all experiments we use the same initial mapping unless otherwise specified. This means that we only need to perform initialization once, which significantly reduces the initialization overhead compared with existing approaches.

2.7.1 Localization Accuracy

Figure 2.14a shows the experiment environment.
In the experiment, we move a transparent board to different distances to the light source. For each distance, we move the mobile phone on the board at different positions. We can measure the position of the mobile phone on the board as the ground truth. Meanwhile, we also use RainbowLight to calculate the position of the mobile phone. We 28 30 1 X-axis Location error (cm) 25 Y-axis Z-axis 2 20 3 15 CDF 0.5 4 5 10 6 5 0 0 5cm 10cm 15cm 0 50 100 150 Sampling density Location error (cm) (a) (b) Figure 2.16: (a) Localization precision on different sampling density. (b) Localization precision on different number of chips. switch off other lamps during our experiment at night. Figure 2.14b shows the localization error for the mobile phone moving on the board out of 230 random points. The x-axis denotes the range of the distance between the board and the lamp. We can see that the localization error increases as distance increases. This is mainly because hue value is less sensitive to the position for a larger distance. We can also observe that the error on 𝑧-axis is larger than that on 𝑥-𝑦 plane. The major reason is that the angle from the chip to the mobile phone varies by a smaller value when we move the mobile phone along the 𝑧-axis than that along the 𝑥-𝑦 plane. This phenomenon is more evident when chips are close to each other. However, even when those chips are all in a circle with diameter less than 16 cm, the localization accuracy for different distance is still high. This indicates RainbowLight can work for different distance with the lamp of small size. Overall, in the 2m - 3m distance interval, the mean error of localization is 3.19 cm on 𝑥-axis, 2.74 cm on 𝑦-axis, and 23.65 cm on 𝑧-axis. This performance is better than SmartLight with a localization error of about 60 cm on 𝑧-axis for distance from 1m - 3m. The localization accuracy of RainbowLight is enough for most of today’s application scenarios such as navigation. 29 30 30 X-axis X-axis Location error (cm) 25 Location error (cm) 25 Y-axis Y-axis Z-axis Z-axis 20 20 15 15 10 10 5 5 0 0 5W 6.5W 12W 6000 K 3000 K Power Color temperature (a) (b) Figure 2.17: Localization accuracy for different (a) power of lamp, (b) color temperature of lamp. 30 30 X-axis X-axis Location error (cm) 25 Location error (cm) Y-axis 25 Y-axis Z-axis Z-axis 20 20 15 15 10 10 5 5 0 0 LED FL IL A B C D E Type Manufacturer (a) (b) Figure 2.18: Localization accuracy for different (a) types of lamp, (b) manufacturers of lamp. 2.7.2 Performance with Identifier As discussed in section 2.6, we design an approach to map the position of a camera relative to the anchor to the absolute position in an area by providing an identifier to each anchor. To evaluate the performance of this approach, we randomly choose 190 points in an area of the meeting room, and calculate the accuracy of localization. Figure 2.15a shows the performance. We can find that RainbowLight achieves 1.68 cm of the median error in the X-axis, 2 cm of the median error in the Y-axis, 5.74 cm of the median error in Z-axis, and 7.04 cm of the median error with the whole dimension. It also achieves 7.37cm, 5 cm, 22.9 cm, 23.20 cm of the 90% error in X-axis, Y-axis, Z-axis, and with the whole dimension, respectively. The localization accuracy is also enough for most of today’s application scenarios. 30 To evaluate the performance of decoding of the identifier on the anchor, we use the camera to capture photos with different angles to the anchor. 
As shown in figure 2.15b, we put an anchor in a plane, and use the camera to capture photos from 0◦ to 90◦ , then try to decode the identifier on the anchor. With the identifier we designed in section 2.6, it could not be decoded when the angle is above to 60◦ . Since the ceiling of a room is often with a height of 3 m, so users should deploy an anchor in every 9.42 𝑚 2 with the code we designed in section 2.6. 2.7.3 Impact of Sampling Density We examine the impact of sampling density in building the initial map. Figure 2.16a shows the localization accuracy with respect to different sampling densities. We build the initial map on a plane parallel to 𝑥 − 𝑦 plane with 𝑧 = 100 cm. We examine the performance with different inter-distance of sampling position, i.e., 5 cm, 10 cm, and 15 cm, respectively. It can be seen that low sampling density still works well for RainbowLight. Even when the inter-distance is 15 cm, the localization error is only around 10 cm. This is mainly because hue value distribution is smooth in the 3D space and thus interpolation is effective in building initial mapping. 2.7.4 Impact of Number of Transparent Chips As shown in Section 2.4, the hue value from a single chip determines a candidate group of rays from the chip. With more chips, the localization accuracy will be improved as the intersection point can be refined with more groups of rays. We explore the relationship between localization accuracy and the number of chips. Figure 2.16b shows the CDF of 3D localization error while increasing the number of chips from 2 to 6. It can be seen that the localization accuracy increases when the number of chips increases from 2 to 4. Further, the performance becomes relatively stable when the number increases from 4 to 6. This means 4 chips is enough in practice to achieve a good localization accuracy. 31 Figure 2.19: Different light sources. 2.7.5 Impact of Different Light Sources We examine the performance of RainbowLight with different light sources. As shown in Figure 2.19, we use lamps of different types, i.e fluorescent (FL), LED and incandescent bulb (IL), from different manufacturers (A - E), with different color temperature (3000 K, 6000 K) and different power (5 W, 6.5 W, 12 W). In all the following experiments, we use a Philips (manufacturer A) 6.5 W LED with the color temperature of 6000 K for initialization. In our daily life, the power of LED mainly ranges from 5 W to 20 W. Figure 2.17a shows localization error of LED (manufacturer A) of power 5 W (500 lm), 6.5 W (600 lm), and 12W (1100 lm) out of 150 random points. There is no significant difference in terms of error for different power. This is mainly because as long as 𝛾 and 𝜃 are fixed, our approach captures the major property of light spectrum and also removes other noise such as brightness, as explained in Section 2.3. There are mainly two different color temperatures (6000 K and 3000 K) for typical lamps in our daily lives. Intuitively, 6000 K generates white color while 3000 K generates yellow. The light spectrums from those two temperatures are slightly different. We initialize with a 6000 K lamp and measure the localization error for 3000 K and 6000 K out of 100 random points. As shown in Figure 2.17b, we can see that the localization error of 3000 K is slightly larger than that of 6000 K 32 because of spectrum difference. However, the accuracy of both color temperature is still acceptable. 
In practical applications, we only need to build the initial map with one color temperature, and RainbowLight performs well under other color temperatures. We examine the performance of RainbowLight for the three most commonly used lamps, i.e., LED, fluorescent, and incandescent bulb out of 150 random points. As shown in Figure 2.18a, the accuracy for fluorescent is high. The accuracy of the incandescent bulb is relatively low. This is because those two types of lamps have different light spectrums. However, as long as we use the incandescent bulb for initialization, the accuracy of RainbowLight remains high for incandescent bulb. We also examine the performance of RainbowLight among different brands of lamps. The light spectrum emitted slightly varies for lamps from different manufacturers. We choose 5 LEDs from 5 different popular manufacturers, marked as A-E. The power of all lamps is 5 W and the lumens are 500 lm, 380 lm, 450 lm, 400 lm, 280 lm, respectively. The color temperature is 6000 K. Figure 2.18b shows that the error is small for all brands out of 250 random points and the performance is similar for all brands. It also indicates we only need to initialize with a certain brand, and the accuracy of RainbowLight is acceptable under other brands. Summary. RainbowLight achieves a high accuracy under different circumstances with com- monly used lamps. For most scenarios, RainbowLight only needs to be initialized once, and almost can be used for all other lamps. This significantly reduces the deployment cost and makes RainbowLight practical. 2.7.6 Impact of Different Mobile Phone Models Because different cameras have different parameters of light sensors, so they might get different hue values to the same light beam. We use the white balance algorithm to reduce the impact from different parameters of sensors, and examine the impact of different mobile phones. We use two branches of mobile phones, i.e., Huawei Nexus 6P and Vivo X7 to measure the accuracy of RainbowLight. We randomly choose 10 points in the range of z-axis between 100 cm and 150 cm 33 30 30 X-axis X-axis Location error (cm) Location error (cm) 25 Y-axis 25 Y-axis Z-axis Z-axis 20 20 15 15 10 10 5 5 0 0 Nexus 6P Vivo X7 LED light off Mobile Phones Light source (a) (b) Figure 2.20: Localization accuracy for different (a) mobile phones, (b) lamp status. for each mobile phone, the result is shown in figure 2.20a. We find that the error doesn’t change much, so RainbowLight could be used on different mobile phone models. 2.7.7 Localization with Light Off Most existing visible light positioning systems, e.g., LiTell[1], SmartLight[20], and CELLI[17], only work when the light is turned on, as those systems require modulating information in the light ray or measuring special features from the light ray. This significantly hinders their applications in the daytime when light is usually switched off. RainbowLight can work even when light is switched off during the daytime as it does not need to modulate information in light or measure light features. Figure 2.20b shows the performance of RainbowLight out of 50 random points with the light turned off. Similar to Section 2.7.1, we examine the accuracy in the environment as shown in Fig. 2.14a. In the experiment, sunlight passes through the window and we switch all lamps off. We can see that the error for the light turned off is still less than 20 cm. The error for the light turned off is very small and is similar to the scenario of the light turned on. 
This is mainly because RainbowLight can generate obvious features from different light sources, and can also effectively extract those features. This significantly extends the application for visible light-based localization and make it more practical in everyday life. 34 30 30 X-axis X-axis Location error (cm) Location error (cm) 25 Y-axis 25 Y-axis Z-axis Z-axis 20 20 15 15 10 10 5 5 0 0 -30 -15 0 15 30 -30 -15 0 15 30 Pitch angle (°) Yaw angle (°) Figure 2.21: Localization precision of different (a) pitch angles, (b) yaw angles. 40 35 X-axis Location error (cm) Y-axis 30 Z-axis 25 20 15 10 5 0 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 Roll angle (°) Figure 2.22: Localization precision of different roll angles of mobile phone. 2.7.8 Impact of Mobile Phone Orientation To verify the influence of pitch and yaw, we measure error at distance 60 cm with different pitch and yaw angles. Figure 2.21a and Figure 2.21b shows the result. We select range from −30◦ to 30◦ because the mobile cannot capture the lamp with pitch and yaw angle out of this range. We can see that when we change pitch and yaw angle, error changes slightly. This is mainly because when we change the pitch and yaw angle, 𝜙1 , 𝜙2 , 𝛾, and 𝜃 does not change. If 𝑃2 is attached to the chip, mobile phone roll will have no impact on the hue value. If we put the polarizer 𝑃2 in front of the camera, RainbowLight needs to confirm if chips on anchor show complementary hue value and its impact on localization accuracy. We also examine the accuracy of localization in this scenario. The error of different roll angles of camera as shown in Figure 2.22a. Therefore, no matter which position we are, as long as we can capture the lamp with any 3D 35 orientation, RainbowLight shows a high localization accuracy. This extends application scenarios of today’s VLP systems. 2.8 Related Work 2.8.1 Visible Light Based Localization The first category of work is to use a special designed LED light to generate identifiable features [21, 18]. Those works usually need to use an MCU to control the lamp to modulate information by change the frequency, voltage, etc. Spotlight [19] generates a sequence of on/off the pattern and uses such a pattern as landmarks for localization. Spinlight [5] uses a hemispherical shade to encode position information with holes. CELLI [17] designs a structure with LCD to modulate polarization direction of emitting light. It generates two sweeping lines with special light properties and uses sweeping lines for localization. Recently, SmartLight [20] proposes an interesting idea to use a digital modulated LED array with a lens to achieve single light 3D localization. It modulates different LED lights with different frequency on the LED array. Then it emits the light through a lens to the 3D space. Then it derives the location based on the frequency of received light. Pulsar [28] uses the inherent features of photodiode diversity. It builds a map from angle to RSS. It designs a special receiver with two photodiodes. Most of those approaches in this category require a specially designed lamp or receiver. Thus it may not apply to most scenarios in our daily lives. Further, many attempts are proposed to remove the requirements with specially controlled light. Existing methods such as [6, 27] use geometrical relationships among lights with the known position for triangulation based localization. PIXEL [24] leverages the inherent feature of optical rotatory dispersion for localization. 
When a linearly polarized light passes through a disperser, the color observed through a polarizer with different transmission directions should be different at different locations. By fixing the orientation of a mobile phone, [24] derive the identifier by the observed color, then calculates location with the geometrical relationship. It requires to capture more than one light in one photo. 36 LiTell [1] and iLAMP [22] use inherent features of fluorescent such as frequency and color spectrum to identify each light. Given the position of the light, the location can be derived by triangulation. Those two approaches are very nice as they do not need any extra modification to the lamp. However, they require to sample the features for each light. It is also highly related to the environment and cannot work when a lamp is changed. Recently, [36] proposes an interesting method of using light to correct inertial measurement unit errors. As introduced in [36], it leverages the property that a polarized light ray going through transparent tape is rotated by an amount related to wavelength. Then it tries to derive the location change by sensing the color after a polarizer with different directions. It detects color changes by edge crossing between four types of blocks hence serve as landmarks to correct IMU drift errors. Luxapose [6] localizes the relative position from lamps. The main idea is to build a geometrical model and calculate the position based on the relationship between lamps’ positions both in the real world and in the photo. Such a model is also used in iLAMP [22]. However, the model needs extra-parameters, e.g., focal length or data from other sensors. Since different cameras hold different parameters like the focal length, they are not easy to use. RainbowLight only uses the color pattern to derive the relative position to the tag, which is more general. Travi-Navi [37] using the computer vision-based approach to launch the navigation. It stores guider’s video and uses sensors to calibrate the position, and those data can be further used for followers in navigation. 2.8.2 Other Localization Approaches Localization has attracted many research efforts. Besides visible light based localization, there exist a large collection of localization approaches using wireless signal, such as [11, 12, 13, 38, 14, 39, 40, 41, 42, 15, 16, 43], using acoustic signal [44, 45, 46], using environment information and cell tower signal [47], FM signal [48], stride information [49], inertial sensors [50] etc. Those approaches are usually based on a signal attenuation model or pre-collecting a large number of fingerprints. Meanwhile, many wireless signal based approaches need to analyze signal properties such as CSI, which further leads to a high computation overhead. Thus they usually require specially designed 37 hardware at the receiver or sender, making it difficult to implement on the mobile phone. Multiple path effect also affects the localization accuracy for many of those approaches. Our approach is largely inspired by those approaches. 38 CHAPTER 3 PATRONUS: PREVENTING UNAUTHORIZED SPEECH RECORDINGS WITH SUPPORT FOR SELECTIVE UNSCRAMBLING 3.1 Overview Human beings have long used acoustic signals to exchange information with each other. Human beings now use acoustic signals, which is speech, to exchange information with ubiquitous smart devices such as smartphones, smartwatches, and digital assistants that are equipped with embedded microphones. 
While these speech detection and recognition capabilities make possible many convenient features, they also introduce many privacy risks such as secret, unauthorized recordings of our private speech [51, 52] that can have real world consequences. For example, the Ukrainian prime minister offered his resignation after an unauthorized recording was leaked [53]. Manufacturers claim that they are trying their best to protect users’ privacy, but there is no effective and user-friendly technical anti-recording solution available despite the fact that anti- recording is not a new problem. One existing anti-recording solution is to talk near a white noise source, e.g., near an FM radio tuned to unused frequencies, so that the conversation cannot be clearly recorded. This approach is not user-friendly because the people having the conversation must put up with the white noise that interferes with their normal communication. A similar solution [54] emits high frequency noise near the upper bound of human sensitivity; most people do not notice the interference, but pets and infants may notice it [3], so this solution is not environment-friendly. Electromagnetic interference was an effective anti-recording solution [55] in the past, but modern microphones are immune to electromagnetic interference. Moreover, all of these traditional anti- recording approaches cannot allow authorized devices to clearly record conversations. Any effective anti-recording solution must provide the following three key properties: (1) normal human conversation should be unaffected by the anti-recording solution meaning the anti-recording solution should not change what humans hear while having a conversation; (2) unauthorized devices 39 should not be able to make a clear recording of any conversation protected by the anti-recording solution; (3) authorized devices should be able to make a clear recording of any conversation protected by the anti-recording solution. One potential solution that can satisfy all three properties is to generate multiple ultrasonic frequency sound waves because of the following two properties of ultrasonic waves. First, humans cannot hear ultrasonic sound waves. Second, commercial off-the-shelf (COTS) microphones exhibit nonlinear effects, which means that when these microphones receive multiple ultrasonic sound waves, they generate low-frequency sound waves that can be heard by humans and thus interfere with the clarity of recordings made with those microphones [8, 56, 7, 57, 58, 3, 59]. There are three main challenges that must be overcome in order to develop an ultrasonic anti-recording solutions that satisfies the three key properties: (1) First, any ultrasonic anti-recording solution must defend against potential attacks such as using Short-time Fourier transform (STFT) to analyze unauthorized recordings and using filters to cancel out the low-frequency sound waves that interfere with recording clarity. (2) Second, ultrasound travels along a straight line [60], which means a single ultrasonic wave generator can only interfere with recording devices within a limited range of angles from the generator. In practice, it is difficult to design an ultrasonic anti-recording solution that can neutralize all recording devices within a large coverage area. (3) Finally, the performance of authorized devices could be affected by the ringing effect due to electronic behaviors. 
Such ringing impulses are hard to be canceled and may remain in authorized recordings, severely downgrading the quality of the descrambled recordings. In this chapter, we present Patronus, an ultrasonic anti-recording system that satisfies the three key properties. Patronus has two key components: the scramble that is the pseudo-noise generated at all microphones, and descrambling that is the process to remove the scramble for authorized devices. We form the scramble by randomly picking frequencies from the human voice frequency band and then shifting them to the ultrasonic band. To thwart STFT attacks, we further fine-tune 40 Cosine Wave Chirp f t (a) Discrete frequency scramble components. f t (b) Continuous frequency changing scramble with chirps. Figure 3.1: Using chirps to smooth the frequency changing components of the scramble. the period of the scramble so that it cannot be easily analyzed and canceled. We add a reflection layer with a curved surface to create a reflected ultrasonic wave that can cover a wider area. Finally, to mitigate ringing effects, i.e., sudden hardware impulses due to discrete frequency changes of current waves, we use chirps to smooth the frequency changing components of the scramble, as shown in Figure 3.1. Patronus lets authorized devices clearly record audio conversations by sending them the scram- ble pattern. With scramble pattern, the authorized device applies the Normalized Least-Mean- Square (NLMS) adaptive filter [61] to cancel the scramble and thus produce a clear audio recording of the conversation. We implement a prototype of Patronus and conduct comprehensive experiments to evaluate its performance. We use the Perceptual Evaluation of Speech Quality (PESQ) [62], the Speech Recognition Vocabulary Accuracy (SRVA, see Section 3.6), and speech recognition error rates (1 - SRVA) to evaluate the performance of Patronus. Our results show that only 19.7% of the words protected by Patronus’ scramble can be recognized by unauthorized devices. Furthermore, authorized recordings have 1.6x higher PESQ and, on average, 50% lower speech recognition error rates than unauthorized recordings. In this chapter, we provide several unique technical contributions when compared to existing works. First, to the best of our knowledge, Patronus is the first system to leverage the nonlinear effect 41 Unauthorized Scramble Transmitter Device Speech with Scramble Generator Scramble Frequency Shifter Descramble Receiver Speech with Constant Cosine Scramble Adaptive Wave Generator Scramble Filter Pattern Scramble Pattern (Key) Authorized Speech Device Wi-Fi / Bluetooth / etc. Figure 3.2: System Overview. of COTS microphones to prevent unauthorized recordings while allowing authorized recordings. Second, we perform a thorough study of the nonlinear effects of ultrasound frequencies including the effects of higher orders whereas recent works[8, 7, 56, 9] only consider the order up to 2. This is critical for descrambling when the signal components with order higher than 2 will likely lie in the human voice frequency band, which means simply cutting off the high frequency components will result in message loss. Instead, our descrambling solution carefully removes these higher order frequencies using an NLMS filter. Third, we mitigate ringing effects by connecting scramble segments with chirps. This simplifies learning the coefficients of impulse response in existing work [8], especially when we deploy multiple ultrasonic transducers in a large space. 
In general, our contributions are as follows: • We propose a novel ultrasound modulation approach to provide privacy protection against unauthorized recordings that does not disturb normal conversation. • We do a thorough study around the nonlinear effect of ultrasound on commercial microphones and propose an optimized configuration to generate the scramble. • To overcome the fact that ultrasound travels in a straight line, we design a low cost reflection layer to effectively enlarge the coverage area of Patronus in a cost-effective way. • We present Speech Recognition Vocabulary Accuracy, a new metric to measure the recording 42 quality. Our experimental results with both PESQ and SRVA show that Patronus effectively prevents unauthorized devices from making secret recordings. The organization of the rest of this chapter is as follows. Section 3.2 introduces related work. Section 3.3 introduces the nonlinear effect of common microphones, which we analyze more thoroughly than existing works. Section 3.4 presents the design of Patronus. Section 3.5 presents the prototype implementation of Patronus. Section 3.6 presents our evaluation results of Patronus. Section 3.7 discusses the limitations of Patronus and future work. 3.2 Related Works 3.2.1 Nonlinear Effect of Microphones There has been a lot of research into the nonlinear effect of microphones. For many years, the development of ultrasonic systems on smartphones was restricted due to being limited to a roughly 4 kHz range of frequencies between the high end of human hearing to the cutoff frequency of typical microphones. Furthermore, some infants and pets can actually perceive frequencies within this small band. Roy et.al. [8] performed detailed research on the nonlinear effects of microphones to break through these limitations and expand the working frequency band for ultrasonic systems on smartphones. DolphinAttack [7] leverages the nonlinear effect to generate audio commands that are inaudible to humans. After being recorded by the microphone, the input ultrasonic signals would generate a shadow signal that could be recognized by VCS. Therefore, attackers can perform unauthorized commands without being discovered. SurfingAttack [59] uses oscillation of a surface such as a table to transmit inaudible commands. With this modality, attackers can deploy their speakers in hidden spots such as the back of the surface being used to transmit the secret commands. LipRead [56] extends the attack range by leveraging characteristics of human hearing. It also puts forward a model to filter out such commands generated by the nonlinear effect. Metamorph [57] injects inaudible commands into human-made commands to achieve unauthorized actions. AIC [9] presents a mechanism that fundamentally cancels inaudible commands against 43 VCS, which we will discuss as an attack model in Section 3.4.2. NAuth [58] uses the nonlinear effect to authenticate devices. Unlike most of these methods, Patronus aims to preserve privacy by adding a removable scramble generated by ultrasonic signals to the recorded human speech. From a technical perspective, Patronus is unique in that it takes into account third and higher order terms from the nonlinear effect. Our experiments show those high order terms can affect recordings whereas most existing methods (e.g., AIC) only consider the second order term and assume the higher order sub-band of the microphone is clean. 3.2.2 Dual Channel Applications Some applications leverage the difference between humans and devices. 
For example, human eyes and devices have different perceptions of flicker frequency. Technologies exist that use this phenomenon to communicate between a screen and a camera without affecting human vision [63, 64, 65, 66]. Likewise, some technologies modulate acoustic signals in ways that no human can detect to communicate between devices [67, 68]. The difference between the sensitivity of humans and devices is also used in privacy protection. Kaleido [69] protects a movie's copyright by adding a flashing distractor with a very high frequency into movie frames that cannot be seen by human eyes. If such a protected movie is subsequently recorded by an unauthorized camera equipped with a rolling shutter, the distractor becomes visible in the unauthorized recording because of the camera's high sample rate, making the pirated recording a low-quality one. LiShield [2] also uses the rolling shutter effect to reduce the quality of photos. Lights with different colors are set to flash at alternating high frequencies; the lighting appears normal because human eyes cannot sense the flashing. However, cameras are affected because the rolling shutter samples column by column, so unexpected color stripes appear in the photo. In the end, it prevents unauthorized cameras from taking photos. Although Patronus has a similar motivation to prevent unauthorized recordings, Patronus is different from these two works as it targets acoustics rather than visuals.

3.3 Nonlinear Behavior of Common Microphones

In this section, we provide a brief primer on the nonlinearity of common microphones; a more comprehensive introduction can be found in recent papers [8, 56]. Ideally, COTS microphones are linear systems. Given the input signal $s(t)$, the output signal $y(t)$ is expected to be a linear combination of the input signal, i.e., $y(t) = A_1 s(t)$, where $A_1$ is the complex gain quantifying the change of the phase and amplitude. Due to the physical properties of materials and variations in manufacturing, the components of a common microphone, such as the diaphragm and the pre-amplifier, are imperfect and typically do not constitute a linear system. As a result, COTS microphones, which are widely equipped on smartphones and smartwatches, typically exhibit nonlinear behavior. Specifically, the output signal is $y(t) = A_1 s(t) + A_2 s^2(t) + A_3 s^3(t) + \cdots$, and the power gains of the components satisfy $|A_m| > |A_n|$ for $m < n$.

When the input signals are composed of two different ultrasonic frequencies, the output from a nonlinear microphone contains several new shadow sounds with frequencies that are linear combinations of the two input frequencies. Assuming that the input signal is $s(t) = \cos(2\pi f_1 t) + \cos(2\pi f_2 t)$, where $f_1$ and $f_2$ are the ultrasonic frequencies, the output signal would be $y(t) = \sum_{i=1}^{+\infty} A_i s^i(t)$. Without loss of generality, we assume $f_1 > f_2$ in the following discussion. For each component $A_i s^i(t)$,
$$
s^i(t) = \big(\cos(2\pi f_1 t) + \cos(2\pi f_2 t)\big)^i
= \mu + \sum_{j=1}^{i} \left[\alpha_j \cos(2\pi j f_1 t) + \beta_j \cos(2\pi j f_2 t)\right]
+ \sum_{j=1}^{i-1} \left[\lambda_j \cos\big(2\pi (j f_1 - (i-j) f_2) t\big) + \gamma_j \cos\big(2\pi (j f_1 + (i-j) f_2) t\big)\right],
$$
where $\alpha_j$, $\beta_j$, $\lambda_j$, and $\gamma_j$ are coefficients of the polynomial expansion, and $\mu$ is the resulting constant. After the pre-amplifier, the signals pass through an embedded low-pass filter whose cut-off frequency is usually 24 kHz. Since $f_1$ and $f_2$ are both ultrasonic frequencies, $j f_1$ and $j f_2$ are all ultrasonic frequencies.
However, if 𝑖 = 2 𝑗, 𝑗 𝑓1 − (𝑖 − 𝑗) 𝑓2 = 𝑗 ( 𝑓1 − 𝑓2 ) may be 45 a non-ultrasonic frequency when 𝑗 is small enough. Therefore, when the input signal is 𝑠(𝑡) = cos(2𝜋 𝑓1 𝑡)+cos(2𝜋 𝑓2 𝑡), new audible cosine waves cos(2𝜋 𝑗 ( 𝑓1 − 𝑓2 )𝑡) appear, where 𝑗 = 1, 2, . . . , 𝑘, 𝑘 ≤ 𝑖, and 𝑘 ( 𝑓1 − 𝑓2 ) ≤ 24 kHz. Existing works like BackDoor[8] and DolphinAttack[7] make use of 𝐴2 𝑠2 (𝑡) but ignore higher-order components; they essentially assume that for 𝑖 > 2, | 𝐴𝑖 | is relatively small and has little effect on the output signal. However, in our experiments, we find that more high-order components should be taken into consideration as they do affect the output signal. 3.4 Design 3.4.1 Overview As shown in Figure 3.2, there are three parties involved in Patronus: the Scramble Transmitter, authorized devices with descramble receivers, and unauthorized devices. The Scramble Transmitter sends a series of scramble signals with randomly varying frequencies. To ensure that unauthorized voice recordings will be affected, the frequencies of the recorded scrambles should be located in the human voice band. Therefore, we use the Scramble Generator to generate random frequencies in the target range, store them as a secret key, and send them to the Descramble Receivers through Wi-Fi, Bluetooth, or other media. The Scramble Generator then generates cosine wave segments according to these frequencies. The generated segments are then sent to the Frequency Shifter and their frequencies will be increased by 𝑓0 , which is an ultrasonic frequency. To ensure the scramble signal is picked up by microphones of unauthorized devices because of the nonlinear effect, we design a Constant Cosine Wave Generator to transmit a cosine wave with a constant ultrasonic frequency of 𝑓0 . During human talking protected by Patronus, the actual human conversation plus two ultrasonic signals will arrive essentially simultaneously at recorders (both authorized and unauthorized) and human ears. Human ears will not detect the ultrasonic signals and thus receive the human conversation with no additional noise. As discussed in Section 3.3, the two ultrasonic signals will generate a shadow audible signal that will be included in any recording made by a COTS microphone due to nonlinear effects. This applies to both authorized and unauthorized devices. 46 Authorized devices, which receive a secret key from the Scrambling Transmitter, can generate the scramble waveform. They can then feed the scramble waveform along with the scrambled recording into an adaptive filter to extract clear speech from the scrambled speech. The details of descrambling will be discussed in Section 3.4.5. We must overcome three challenges in order to design Patronus. First, we must design a system whose working area is as large as possible. This is difficult because a sound wave of high frequency typically travels along a straight line meaning a straightforward implementation of ultrasonic generators will only cover a small area defined by a limited range of angles. Second, there is a trade-off between a shorter and a longer period of scramble frequencies. As the period increases, the system is more vulnerable to unauthorized recordings using STFT attacks. As the period decreases, the difficulty of descrambling increases. Our goal is to maximize the information recovered by authorized devices over unauthorized ones without exposing the scramble pattern to STFT. These details are discussed in Section 3.4.3.4. 
Third, when frequency changes frequently, a severe ringing effect (Section 3.4.3) occurs in the scrambled recording, which affects even the recordings made by authorized devices after descrambling. We use chirps to connect each frequency component of the scramble to eliminate the sudden change of the input to ultrasonic speakers, hence minimizing the ringing effect and enhancing the quality of the recovered speech by authorized devices. 3.4.2 Attack Model Based on common acoustic processing technologies and known properties of nonlinearity effects, we consider the following types of attacks: 3.4.2.1 Short-Time Fourier Transform (STFT) One natural way for an unauthorized device to try to extract a useful recording from its scrambled recording is to analyze the scrambled recording with STFT and filter out suspicious frequencies. We address this attack model by changing the scramble frequency according to a finely-tuned period model, making it impossible for the attacker to obtain each exact scramble frequency along with its 47 start and end time. Detailed analysis is provided in Section 3.4.3.4. Even with the correct scramble frequencies available, bandpass filters will not work because the scramble frequencies are selected from the human voice band. The frequencies from chirps and those from human speaking are mixed together. To prove Patronus can defeat this attack model, we simulate the attack scenario when (1) the attacker is aware that our scramble pattern is varying continuous waves smoothed by chirps (2) the attacker calculates approximate scramble frequencies with STFT (3) the attacker applies NLMS adaptive filter (Section 3.4.5.4) to remove the scramble with the approximate scramble frequencies they obtained from STFT. Our simulated attack experiments, provided in Section 3.6.8, show that this attack will fail because the approximate scramble frequencies are not accurate enough. 3.4.2.2 Extra Ultrasonic Transmitter Attack After DolphinAttack[7] proposes to inject malicious commands into ultrasound, AIC [9] adds three more ultrasonic transmitters to cancel the malicious commands and protect Voice Control Systems (VCS). AIC assumes the legitimate as well as malicious commands are within the lower sub-band of the microphone sensible frequency band. Their added ultrasonic transmitters project only the malicious commands onto the higher sub-band, which can be used to filter the malicious commands in the low sub-band. With a fast changing of scramble frequencies, we can cover the whole frequency band, and make sure no clean band is left for attackers. 3.4.2.3 Wi-Fi/Bluetooth Snifing Attackers can sniffer the Wi-Fi or Bluetooth channel to get the scramble pattern transmitted from the Scramble Transmitter to the authorized device. However, there are many cryptographic approaches to prevent attackers from sniffing channels. For example, we can encrypt the scramble pattern by AES-CTR using a pre-shared key and then directly send it to authorized devices. 48 3.4.2.4 Physical Attacking There are also some physical attack models. First, attackers can place an obstacle before the Scramble Transmitter. However, attackers cannot do it secretly and nobody would like to do so. Second, attackers may just wrap a cover on their microphones. However, the cover itself may defeat the attackers objective of making a good recording. Although Patronus cannot perfectly handle such attack models, it enhances the difficulty of making an unauthorized recording. 
Finally, attackers may conduct experiments to discover where Patronus fails. This can be fixed by enlarging the working area through some methods that we will discuss later. 3.4.3 Ultrasonic Scramble Modulation Two ultrasonic signals will be superimposed at the recorders to create the desired low-frequency component. In the design of the scramble using ultrasonic signals, we mainly consider the following issues: 3.4.3.1 Range of Frequency The first issue is how to make it hard to cancel out the scramble without the key. Basically, the range of human speech frequency is from 85 Hz to 255 Hz [70, 71]. If the scramble consists of multiple random frequencies from this range, it is hard for attackers to cancel the scramble using linear filters. The application of a linear filter, e.g., highpass filter, will not only cancel the scramble, it will also change the original human speech. To ensure the scramble covers all human speech frequencies in practice, we modulate the scramble with a wider frequency band than [85, 255] Hz. 3.4.3.2 Random Frequencies If we always use specific frequencies to generate the scramble, attackers could analyze the frequency spectrum of their recordings to infer the scramble frequencies; with those, they could then recover the original audio signals. To address this issue, we choose scramble frequencies randomly. We 49 (a) Scrambled without chirps (b) Descrambled without chirps 1 1 Ringing amplitude amplitude 0 Ringing 0 -1 -1 0 1 2 3 0 1 2 3 time (s) time (s) (c) Scrambled with chirps (d) Descrambled with chirps 1 1 amplitude amplitude 0 0 -1 -1 0 1 2 3 0 1 2 3 time (s) time (s) Figure 3.3: Illustration of how linear chirps mitigate the ringing effect. also periodically change the scramble frequencies over time. The sequence of scramble frequencies can be thought of as a one-time pad key. Without the sequence, it would be difficult for attackers to remove the scramble. 3.4.3.3 Ringing Effect Frequent changing of the scramble frequencies produces a ringing effect [8] that makes it challenging for authorized devices to produce a high-quality descrambled recording. Specifically, the ringing effects incur heavy-tailed impulse responses that will remain in descrambled recordings as shown in Figure 3.3 (a) and (b). Since the ringing effect occurs when the input changes suddenly, we use a chirp signal to connect two adjacent segments with different frequencies in the scramble to smooth such a sudden change. Specifically, when the scramble changes from frequency 𝐴 to frequency 𝐵, we add a transition signal that starts at frequency 𝐴 and moves linearly to end with frequency 𝐵. The impulse incurred by ringing effects can have a very high amplitude or power. It will suppress other signals due to the microphone Passive Gain Suppression [8]. Figure 3.3 confirms 50 that the ringing effect is mitigated by chirps. Figure 3.3 (a) shows a scrambled recording with no chirp, the resulting descrambled recording in Figure 3.3 (b) has many areas where most of the signal is suppressed. In contrast, Figure 3.3 (c) exhibits a scrambled recording with chirp signals, the resulting descrambled recording in Figure 3.3 (d) does not have the peak signals corresponding to the ringing effect and the rest of the signal is not suppressed. 3.4.3.4 Duration of each frequency The next challenge is choosing the proper duration for each frequency in the sequence of scramble frequencies. 
Intuitively, if we give each frequency a long duration, unauthorized devices could easily split the recording into multiple segments where each segment is protected by only a constant-frequency scramble. They could then apply simple techniques, such as a linear bandpass filter, to the scrambled recording to extract a clear speech recording. More generally, there are two competing issues in choosing the duration of each scramble frequency, namely, defending against the STFT attacks discussed in Section 3.4.2.1, and ensuring that authorized devices can obtain high-quality descrambled recordings. We first consider defending against STFT attacks. An STFT attack can successfully remove the scramble waveform if it can accurately infer both the frequency and the time period of each scramble frequency in the sequence. When the window length is $n$, the frequency resolution is
$$\Delta f = \frac{f_s}{n} = \frac{f_s}{f_s \times t} = \frac{1}{t},$$
where $f_s$ is the sampling rate and $t$ is the duration of the window. Taking $t = 0.1$ s as an example, the frequency offset of STFT can reach 10 Hz. If the attacker tries to improve the frequency resolution by lengthening the window, the accuracy of the estimated time periods for the given scramble frequency will diminish. If the scramble frequency duration is long, the scramble frequency will exhibit fewer changes within any given window, so STFT attacks can use longer windows to accurately estimate the frequency together with exact estimates of its time period. Therefore, to thwart STFT attacks, we should make the frequency duration as short as possible. However, a too-short duration may distort the scrambled recording due to imperfect hardware. A typical microphone and speaker use a diaphragm to sense and generate the vibration; this diaphragm moves continuously and cannot change its position instantaneously. Circuit latency also makes it hard for the system to respond to frequent and instant changes. As a result, the scrambled waveform would be slightly distorted. This means the NLMS adaptive filter at authorized devices may not correctly descramble the scrambled waveform because it does not expect the distortion caused by frequent frequency changes. Therefore, the frequency duration cannot be too short. In summary, to balance these competing concerns, we must find a frequency duration that maximizes the information recovered by authorized devices compared to the information recovered by unauthorized devices. To identify a good frequency duration, we measure the descrambling performance with different frequency durations in Section 3.6.8.

3.4.3.5 Key Construction

We have two choices for constructing the key that grants the recording privilege to authorized devices. One is to directly use the scramble waveform generated by the Scramble Generator as the key. After getting the scramble waveform, authorized devices remove the scramble from the recorded audio. But there are some issues to consider. First, the sampling rate may vary from one authorized device to another. This means that, in terms of the digital signal, devices with different sampling rates obtain different representations of the same scramble waveform. To grant the privilege to such devices, the Scramble Transmitter would have to generate a different digital scramble waveform for each sampling rate, which results in high computational overheads.
Second, in addition to the different sampling rates of different authorized devices, the sampling rates of the Scramble Generator and an authorized device may also differ. As a result, the scramble that the speaker emits might have a different representation than the recorded waveform. In Patronus, we choose another way to construct the key: we use the frequency sequence that generates the scramble as the key. After receiving the frequency sequence, an authorized device can reconstruct the scramble waveform at its own sampling rate, which we discuss in more detail later. After that, the authorized device can use the reconstructed scramble waveform to remove the scramble from the recording and obtain the clear speech.

With the discussion above, we formally describe the scramble generation. We set one speaker to transmit an ultrasonic continuous wave $S_1(t) = \cos(2\pi f_0 t)$, while the other speaker transmits continuous waves linked by chirps, $S_2(t) = \cos(2\pi f(t)\, t)$, where
$$
f(t) =
\begin{cases}
f_i, & (2i-2)\Delta t \le t < (2i-1)\Delta t, \\[4pt]
f_i + \dfrac{f_{i+1} - f_i}{\Delta t}\, t, & (2i-1)\Delta t \le t < 2i\Delta t,
\end{cases}
\tag{3.1}
$$
and $f_i$ ($i = 1, \ldots, n$) are randomly generated constant frequencies. $\Delta t$ is the duration of a single sine wave or a chirp. The induced low-frequency noise will be
$$R(t) = \cos\big(2\pi (f(t) - f_0)\, t\big). \tag{3.2}$$
To ensure $R(t)$ covers the human voice, $f_i$ ($i = 1, \ldots, n$) are sampled from $[f_{low} + f_0,\, f_{high} + f_0]$, where $[f_{low}, f_{high}]$ covers the human voice band.

3.4.4 Enlarge Scramble Working Area

The scramble signal is generated by two ultrasonic signals, which incurs another issue because an ultrasonic wave typically propagates in a straight line. In other words, to prevent a certain device from recording, the ultrasonic speaker should be pointed directly towards that device. This results in a limited coverage area for ultrasonic anti-recording solutions. Inspired by lamps that often use a bow-shaped cover to reflect the light beam in many directions, we build a reflection layer that reflects the ultrasonic wave in many directions. As Figure 3.4 shows, we put ultrasonic speakers near the center of the reflection layer and place the devices (authorized and unauthorized) in the working area. When the ultrasonic wave hits the reflection layer, it gets reflected in many directions, leading to a much larger coverage area.

Figure 3.4: Enlarge working area with reflection.

3.4.5 Grant Recording Privilege

The goal of Patronus is not only to block unauthorized devices from recording audio, but also to provide authorized devices with a mechanism to recover speech. Patronus achieves this by creating a way for authorized devices to remove the scramble from the scrambled recording. Specifically, Patronus grants the clear recording privilege to authorized devices using the following steps.

3.4.5.1 Key Transmission

The Descramble Receiver needs the waveform of the scramble generated by the Scramble Generator before it can remove the scramble. Intuitively, if it had the pure scramble waveform, it could remove the scramble from the recorded audio by subtracting the scramble waveform from the recorded audio waveform. The scramble waveform here acts as the key for deciphering the recorded audio. We send the key through non-acoustic channels such as Wi-Fi or Bluetooth with cryptographic protection to prevent eavesdroppers from getting the key.
Additionally, because of the randomness of the scramble frequencies, eavesdroppers cannot obtain a usable scramble waveform by listening to the acoustic channel. At best, they can capture either the combination of interfered speech and scramble, or a scramble without speech that is independent of the subsequent scramble waveform.

3.4.5.2 Scramble Reconstruction

As discussed in Section 3.4.3, the Scramble Transmitter sends the random frequency sequence instead of the scramble waveform to authorized devices as the key. Patronus needs to use these frequencies to reconstruct the scramble waveform before removing the scramble. An authorized device uses Equation (3.2) and its recording sampling rate to generate the scramble waveform.

3.4.5.3 Synchronization

We need to synchronize the reconstructed scramble with the recorded scramble before removing it from recordings. Specifically, we choose a segment from the reconstructed scramble as the template, e.g., the beginning segment. Then we use cross-correlation to find the segment of the recording that is most similar to the template. We then synchronize the recorded scramble and the reconstructed scramble by aligning the two segments.

3.4.5.4 Adaptive Filtering

Now we have the waveform of the scramble. The next task is to remove the scramble from the recorded audio given the known waveform of the scramble. In practice, we cannot directly subtract the scramble from the recorded audio because, as the sound propagates through the air, it is distorted by reflection and attenuation. We therefore use an adaptive filter to remove the scramble with the known waveform. Adaptive filters are widely used in Active Noise Cancellation (ANC) headsets. Technically, there is a reference microphone outside the headset. The reference microphone captures the noise, and the digital signal processor (DSP) generates the anti-noise wave according to the captured noise. When the noise wave and the anti-noise wave arrive at the ear, they cancel each other.

In Patronus, we denote the speech as $x_1$. It propagates through the acoustic channel $h_1$, arrives at the authorized device, and becomes $h_1 * x_1$, where the operator $*$ denotes convolution. Additionally, we denote the scramble waveform that is generated by non-linear effects and recorded by the authorized device as $x_2$. It propagates through another channel $h_2$, arrives at the authorized device, and becomes $h_2 * x_2$. Therefore, the audio recorded by the authorized device is
$$y = h_1 * x_1 + h_2 * x_2. \qquad (3.3)$$

Figure 3.5: Implementation of the Scramble Transmitter. Figure 3.6: Prototype of Patronus.

Similar to ANC headsets, we treat the scramble $x_2$ as the noise in an ANC headset. Different from ANC headsets, the noise here is generated from the key as discussed in Section 3.4.5.2. Therefore, we can use the Normalized Least-Mean-Square (NLMS) Adaptive Filter [61] to remove the scramble. Formally, we try to find a channel vector $h_2'$ that solves the optimization problem
$$\min \; E[(y - h_2' * x_2)^2]. \qquad (3.4)$$
When the expectation in Equation (3.4) is minimized, $h_2 \approx h_2'$. Therefore, $h_1 * x_1 \approx y - h_2' * x_2$, which can be regarded as the speech without the scramble.
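Concretely, the minimization in Equation (3.4) is typically approximated sample-by-sample with the normalized LMS update; the following is a minimal NumPy sketch under that view (the function name, tap count, and step size are illustrative assumptions, not Patronus' exact implementation), with the stochastic-gradient view discussed next.

import numpy as np

def nlms_cancel(y, x2, num_taps=500, mu=0.005, eps=1e-8):
    """Estimate h2' sample-by-sample and return e = y - h2' * x2.

    y  : recorded audio (speech + scramble), 1-D array
    x2 : reconstructed, synchronized scramble waveform, same length as y
    """
    h = np.zeros(num_taps)                    # current estimate of the channel h2'
    e = np.zeros(len(y))                      # descrambled output
    for n in range(num_taps, len(y)):
        x_win = x2[n - num_taps:n][::-1]      # most recent scramble samples
        y_hat = h @ x_win                     # filter output (h2' * x2) at time n
        e[n] = y[n] - y_hat                   # residual, ideally h1 * x1
        # normalized LMS update: step scaled by the energy of the scramble window
        h += (mu / (eps + x_win @ x_win)) * e[n] * x_win
    return e

Each iteration is a stochastic-gradient step on the instantaneous squared error, normalized by the energy of the current scramble window so that the effective step size is insensitive to the scramble amplitude.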
Stochastic gradient descent is usually adopted to solve the optimization problem defined by Equation (3.4), but it is hard to derive the gradient of the expectation. Researchers therefore use $(y - h_2' * x_2)^2$ in place of the expectation to solve the problem. In this way, the noise gets canceled [72]. Following this design, we can develop a mechanism that prevents unauthorized recording while supporting authorized recording. The mechanism also prevents attackers from descrambling without authorization. Figure 3.7 gives an example. A piece of VOA news audio is used as the original record; the attack result suffers severe scramble effects just like the unauthorized record, but the authorized record removes almost all of the scramble.

Figure 3.7: Illustration of the original waveform, the authorized waveform, the unauthorized waveform, and the waveform descrambled by the STFT attack.

3.5 Implementation

This section discusses the details of the implementation of Patronus, which contains two parts, the Scramble Transmitter and the Descramble Receiver for authorized devices. We use an ordinary smartphone with its built-in audio recorder as the Unauthorized Device or Authorized Device.

3.5.1 Scramble Transmitter

3.5.1.1 Hardware Implementation

As Figure 3.5 shows, we use eight TCT40-16R/T 16 mm ultrasonic transducers. Half of them play the frequency-shifted scramble and are connected in parallel. The other half play the fixed-frequency cosine wave and are connected in parallel as well. We utilize an AOSHIKE DC12V-24V 2.1 Channel TPA3116 Subwoofer Amplifier Board to increase the power of the output ultrasonic signals. The two waveforms are played through a stereo channel: the frequency-shifted scramble uses the left channel, and the constant-frequency cosine wave uses the right channel. As discussed in Section 3.4.4, we use a reflection layer to enlarge the working area. In this prototype, we use an iron wok as the reflection layer. The opening diameter of the iron wok is 30 cm, and its depth is 10 cm. As shown in Figure 3.6, the ultrasonic transducers are placed facing the center of the iron wok.

3.5.1.2 Format of Key

As mentioned in Section 3.4, Patronus uses the frequency sequence as the key. This key must include the duration of each frequency in addition to the frequency itself in order for the Descramble Receiver to generate the scramble waveform. Thus, our key file includes the frequency sequence plus the sample rate of the Scramble Transmitter and the number of samples of each frequency.

3.5.2 Descramble Receiver for Authorized Devices

We use an ordinary smartphone as an authorized device. The authorized device receives the key from the Scramble Transmitter. After the audio is recorded, the smartphone reconstructs the scramble waveform with the given key and leverages the NLMS adaptive filter to cancel the scramble. Formally, it takes the following steps:
3.5.2.1 Reconstruct Scramble Waveform

As mentioned above, in addition to the frequency sequence, the received key also contains the sampling rate of the Scramble Transmitter, denoted by $f_{st}$, as well as the number of samples of each frequency, $n_t$. With the known sampling rate of the authorized device $f_{sr}$, the number of recovered samples for each scramble frequency component can be calculated through the equation
$$n_r = \frac{f_{sr}\, n_t}{f_{st}}. \qquad (3.5)$$
After getting $n_r$, the authorized device uses the same process as the Scramble Transmitter to generate the scramble, i.e., generating the discrete cosine signals with frequencies $f_i$ and $f_{i+1}$ and connecting them by a chirp signal with start frequency $f_i$ and end frequency $f_{i+1}$, where $f_i$ and $f_{i+1}$ are taken from the frequency sequence in the key.

3.5.2.2 Normalized Least-Mean-Square (NLMS) Adaptive Filter

After reconstructing the scramble waveform, we can use the Normalized Least-Mean-Square Adaptive Filter to cancel the scramble from the scrambled record. Specifically, we put the scrambled record $rec_s$ and the scramble waveform $s$ into the NLMS Adaptive Filter to get the descrambled waveform $e$ by removing $s$ from $rec_s$. According to the discussion in Section 3.3, the scramble wave is generated not only by the frequencies in the given frequency sequence but also by high-order frequencies that are multiples of the target frequencies. Therefore, after getting $e$ from the NLMS Adaptive Filter, we still need to iteratively remove the scramble at multiples of the frequency sequence with the NLMS Adaptive Filter. That is, we iteratively feed $e$ and the scramble waveform generated by the $k$-times multiple of the frequency sequence into the NLMS Adaptive Filter, where $k = 2, 3, 4, 5, 6$ in our prototype. In summary, the procedure for authorized devices to remove the scramble from the record is shown in Algorithm 3.1.

Input: $rec_s$, $f_{sr}$, $f_{st}$, $n_t$, the frequency sequence $f[1..n]$
Output: Speech record without scramble, $e$
1: $n_r \leftarrow f_{sr} n_t / f_{st}$
2: $e \leftarrow rec_s$
3: for $k = 1$ to $6$ do
4:   $s \leftarrow$ ScrambleGenerator($k \times f[1..n]$, $n_r$)
5:   $e \leftarrow$ NLMS-Adaptive-Filter($e$, $s$)
6: end for
7: return $e$
Algorithm 3.1: Remove Scramble from the record.

The NLMS Adaptive Filter can be found in many open-source libraries, e.g., MATLAB, Python, etc. Due to the frequency-selective response of different smart devices, each model has its own parameter setting. In our implementation, we choose 500 taps and a step size of 0.005 for an iPhone, 100 taps and a step size of 0.003 for a Pixel, and 300 taps and a step size of 0.005 for a Galaxy S9.

Figure 3.8: PESQ of recordings captured by unauthorized and authorized devices, and PESQ of recordings without scrambling (Patronus turned off) as the baseline.

3.5.3 Simulated STFT Attacker

We also simulate an STFT attacker to verify whether or not Patronus can prevent such an attack. Specifically, as discussed in Section 3.4.2.1, we apply STFT to the scrambled recording using the MATLAB function stft to infer its frequency sequence. We then feed the frequency sequence to an NLMS adaptive filter to get the descrambled recording. Experiment results are shown in Section 3.6.8.
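To make Algorithm 3.1 concrete, the sketch below reconstructs the baseband scramble from the key and cancels it and its harmonics iteratively. The key layout and helper names are illustrative assumptions, and nlms_cancel refers to the NLMS sketch given after Equation (3.4); the chirp is built directly from the form of Equation (3.1).

import numpy as np

def reconstruct_scramble(freqs, n_r, fs_r, order=1):
    """Rebuild the baseband scramble from the key's frequency sequence.

    freqs : frequency sequence f[1..n] from the key (baseband, i.e., minus f0)
    n_r   : samples per frequency at the receiver, from Equation (3.5)
    order : harmonic order k (multiplies every frequency in the sequence)
    """
    t = np.arange(n_r) / fs_r
    pieces = []
    for f_cur, f_nxt in zip(freqs[:-1], freqs[1:]):
        pieces.append(np.cos(2 * np.pi * order * f_cur * t))       # constant tone
        sweep = f_cur + (f_nxt - f_cur) * np.arange(n_r) / n_r     # linear sweep, as in Eq. (3.1)
        pieces.append(np.cos(2 * np.pi * order * sweep * t))       # linking chirp
    return np.concatenate(pieces)

def descramble(rec_s, key, fs_r):
    """Algorithm 3.1: iteratively cancel the scramble and its harmonics."""
    n_r = int(fs_r * key["n_t"] / key["f_st"])                     # Equation (3.5)
    e = rec_s.copy()
    for k in range(1, 7):                                          # k = 1..6 in the prototype
        s = reconstruct_scramble(key["freqs"], n_r, fs_r, order=k)
        # nlms_cancel: the NLMS routine sketched after Equation (3.4);
        # tap count and step size would be chosen per device (Section 3.5.2.2)
        e = nlms_cancel(e, s[:len(e)], num_taps=500, mu=0.005)
    return e

In practice the reconstructed scramble would first be aligned to the recording by cross-correlation (Section 3.4.5.3) before being passed to the filter.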
Here, we illustrate an example containing the original waveform, the authorized waveform, the unauthorized waveform, and the waveform descrambled by STFT in Figure 3.7. As illustrated by the figure, we observe that the authorized waveform is similar to the original waveform, the unauthorized waveform is different from the original one, and the unauthorized waveform is similar to the waveform descrambled by the STFT attack. Therefore, our prototype proves that Patronus can block unauthorized recording while allowing authorized recording, and it can prevent STFT attacks.

Figure 3.9: (a) Upper half: The CDF of SRVA Error of scrambled recordings from the unauthorized device. Lower half: The ratio of SRVA between scrambled recordings and original waveforms. (b) Upper half: The CDF of SRVA Error of descrambled recordings from the authorized device. Lower half: The ratio of SRVA between descrambled recordings and original waveforms.

3.6 Evaluation

3.6.1 Overview

To evaluate the performance of Patronus, we select six news speech waveforms from Voice of America (VOA) and denote these waveforms as A–F. The news speeches are read by a male, a female, or both alternately, sometimes with background music. A normal speaker (shown in Figure 3.6) is set to play these news waveforms, and we also read the news ourselves. While the news waveforms are played under different conditions, we start Patronus to interfere with the unauthorized recording device. Meanwhile, an authorized device is recording too. Later we apply scramble cancellation to the recordings from the authorized device. After getting the scrambled recordings and scramble-canceled recordings, the following metrics are adopted to measure the performance of Patronus.

3.6.1.1 Perceptual Evaluation of Speech Quality (PESQ)

PESQ is a commonly used metric of speech quality [62]. It is widely adopted by phone manufacturers, network equipment vendors, and telecom operators. Technically, the inputs include a clear speech signal as the reference and a signal that needs to be measured. The output is a Mean Opinion Score (MOS) [73] ranging from −0.5 to 4.5. A high PESQ score means that the corresponding speech has high hearing quality and vice versa. Typically, PESQ values ranging from 1.00 to 1.99 mean "No meaning understood with any feasible effort" while those ranging from 3.80 to 4.50 mean "Complete relaxation possible; no effort required" [74]. However, we cannot treat the audio recording as strictly as lossless communication. To adapt PESQ to characterize the performance of Patronus, we measure the PESQ of recordings without scrambling by turning off Patronus, and use that result as the baseline. As shown in Figure 3.8, such recordings have PESQ between 2.2 and 2.7. We regard them as the upper bound of both unauthorized and authorized recordings.

Figure 3.10: (a) Compare SRVA before and after descrambling for the human voice. (b) Compare SRVA before and after descrambling for human recognition.
In the following experiments, we use a PESQ implementation written in MATLAB [75] to compute the PESQ score.

3.6.1.2 Speech Recognition Vocabulary Accuracy (SRVA)

We also use a speech recognition service to measure the effectiveness of scrambling and descrambling. Specifically, we apply Google's Speech To Text (STT) service to transform the acoustic signals to text. We first use the STT service to recognize the original speech without interference and treat the recognized word sequence $w_c$ as the ground truth. Then we use the STT service to recognize the scrambled speech and the descrambled speech, and use $w_s$ and $w_d$ to denote their results, respectively. We define
$$\mathrm{SRVA} = \frac{\sum_{i \in w_s} isTrue(i \in w_c)}{|w_c|} \quad \left(\text{or } \frac{\sum_{i \in w_d} isTrue(i \in w_c)}{|w_c|}\right)$$
as the Speech Recognition Vocabulary Accuracy (SRVA) and use it to quantify the effectiveness of scrambling and descrambling. Note that $isTrue(i \in w_c)$ returns 1 when $i$ is a word from $w_c$, and 0 when it is not. We define SRVA Error as $1 - \mathrm{SRVA}$, which indicates the error rate of recognition with the STT service.

Using the above metrics, we try to answer the following questions:
• Can Patronus effectively scramble unauthorized speech recordings?
• Can Patronus permit authorized devices to record the speech?
• Can Patronus work on different mobile devices?
• What is the impact of the distance between Patronus and a recorder?
• What is the impact of the reflection layer?
• What is the impact of the frequency switching time?
• Is it possible to perform real-time descrambling?

3.6.2 Effectiveness of Scrambling and Descrambling

We split the six news speech waveforms into 55 segments (1650 seconds in total), each 30 seconds long. Both the authorized and unauthorized devices are Apple iPhone X in this experiment, as in the following experiments except that of Section 3.6.5. As shown in Figure 3.8, with Patronus's scrambling, the hearing quality of most segments is extremely low. Specifically, 44 out of 55 (80.0%) segments have PESQ scores lower than 1.5. For SRVA, overall, only 551 out of 2796 (19.7%) words are recognized correctly. More detailed results are shown in Figure 3.9a. The upper half shows the CDF of the SRVA Error: 50% of the recordings have SRVA Error lower than 0.84, and 80% of the recordings have SRVA Error lower than 0.98. The lower half shows the ratio of SRVA between scrambled recordings and original waveforms. The results show that all of the news waveforms have a recognition rate lower than 0.3. We note that if a word appears multiple times in a speech, SRVA could be higher or lower than the actual word recognition rate. However, duplicated words have little impact because the duplicate rate of every segment, i.e., the ratio between the count of a specific word and the total count of words in the segment, is lower than 5%.

To evaluate the effectiveness of descrambling, an authorized device records the speech under the scrambling from Patronus. The authorized device then cancels the scramble using the received key. As shown in Figure 3.8, after descrambling, only 9 out of 55 (16.3%) segments have PESQ scores lower than 1.5. On average, descrambled recordings have 1.6x higher PESQ scores than their corresponding scrambled recordings. As for SRVA, we show the CDF of the SRVA Error in the upper half of Figure 3.9b. These results show that 50% of the descrambled recordings have SRVA Error lower than 0.43, which is 49% lower than scrambled recordings.
Moreover, 80% of the descrambled recordings have SRVA Error lower than 0.64, which is 35% lower than scrambled recordings. As shown in the lower half of Figure 3.9b, the ratios of SRVA between descrambled recordings and original waveforms are higher than 0.4 and lower than 0.8. They are at least 2x better than those of the scrambled recordings. The quality of the descrambled recordings is not as good as the original ones because residual components of the scramble remain after applying the NLMS adaptive filter. Moreover, background music and the volume of the original waveform also affect the quality of the descrambled recordings. For example, news C has a lower ratio after being descrambled by the authorized device compared to the other news clips because it has background music that affects the performance of authorized devices. The background music also affects the SRVA of the record without scrambling, i.e., only 223 words are recognized out of 295 in total. The reader of news E reads the news at a lower volume than the others, so it also has a lower ratio after being descrambled by the authorized device compared to the other news clips.

3.6.3 Effectiveness of Human Voice Scrambling and Descrambling

To verify whether Patronus works for real human speech rather than a sound player, we read the news ourselves and calculate SRVA. As shown in Figure 3.10a, Patronus can effectively scramble and descramble the human voice. Specifically, for the scrambled recordings, the median SRVA Error is 0.74, and 80% of scrambled recordings have SRVA Error lower than 0.83. For the descrambled recordings, the median SRVA Error is 0.27, and 80% of the descrambled recordings have SRVA Error lower than 0.4. The descrambling effectiveness for the human speaker is better than for the recorded sounds because the recorded sounds from VOA sometimes include background music.

3.6.4 Effectiveness of Human Recognition of Scrambled and Descrambled Recordings

Because there might be differences between machine-learning-based speech recognition and human speech recognition, we invite 11 volunteers to write down words after listening to the 55 scrambled recordings and the 55 descrambled ones. The results are shown in Figure 3.10b. People react differently to noise. Some people are very sensitive, and the scrambled noise makes them very uncomfortable. Note that the noise is generated by ultrasound speakers and only captured through the nonlinear effects of microphones, so it does not disturb the people in the original conversation. It can only be heard after being recorded by unauthorized devices. Further, authorized devices are able to filter out such noise, eliminating the discomfort for those listeners. The information recovered by humans listening to descrambled recordings is still better than that recovered by humans listening to scrambled ones. 50% of the scrambled recordings have SRVA Error lower than 0.63, and 80% of the scrambled recordings have SRVA Error lower than 0.86. As a comparison, 50% of the descrambled recordings have SRVA Error lower than 0.34, and 80% of the descrambled recordings have SRVA Error lower than 0.63.

3.6.5 Effectiveness on Different Mobile Models

To verify whether Patronus works on different mobile models, we test it on three devices: an Apple iPhone X, a Samsung Galaxy S9, and a Google Pixel. We play all 55 segments using the normal speaker and calculate average PESQs and SRVAs.
As shown in Figure 3.11a, less than 30% of words can be recognized by the STT service for all the unauthorized devices, and around 65% of words can be recognized for all the authorized devices. When the mobile devices are unauthorized, the average PESQ of the iPhone X is 1.06, and the average PESQ of the other two models is even lower, roughly 0.5. When the mobile devices are authorized, they all achieve an average PESQ of around 1.85. This demonstrates that Patronus works well for all devices; namely, it prevents all models from making good unauthorized recordings and allows all models to make acceptable authorized recordings.

Figure 3.11: (a) Compare average PESQ and SRVA among different models. (b) Compare PESQ and SRVA at different distances.

Figure 3.12: (a) Illustration of the reflection layer experiment. (b) Compare PESQ and SRVA with different frequency switching times.

3.6.6 Impact of the Distance

We also characterize the impact of the distance between Patronus and the recording devices (both authorized and unauthorized). We put the Scramble Transmitter at the origin. A randomly picked speech segment (which has 43 words) is played by a normal speaker, which simulates the talker. The authorized device and an unauthorized device are recording at the same time. Their distance to the Scramble Transmitter varies from 25 cm to 70 cm. Results of SRVA and PESQ for the two devices are shown in Figure 3.11b. Overall, as the distance increases, the ultrasound attenuates more. Therefore, the strength of the scramble decreases as the distance from the Scramble Transmitter increases. As a result, when the device is far enough away, both the authorized and the unauthorized device can record clear speech. On the other hand, when devices are close enough, unauthorized devices produce recordings that are severely scrambled, whereas authorized devices can recover much clearer speech using the secret key. The working area can be extended by using high-power ultrasonic speakers, which we discuss later. We note that although there is a bump in Figure 3.11b at 55 cm for the SRVA, the PESQs at 55 cm and 60 cm are close. This means that humans cannot perceive much difference between these two recordings, something we confirmed in person by listening to these recordings with this objective in mind. Thus, the SRVA bump at 55 cm might be due to an error-correction mechanism of the Google STT engine; of course, since this is proprietary technology, we do not know how or why this error correction would produce such a performance bump for this recording.
3.6.7 Impact of the Reflection Layer

As mentioned before, the ultrasound wave usually propagates along a straight line. To enlarge the range of Patronus scrambling, we design a reflection layer. In this experiment, we use the normal speaker to play the chosen speech segment (43 words). As shown in Figure 3.12a, we point the ultrasonic speakers towards the reflection layer, change the angles of both the authorized and unauthorized devices relative to the ultrasonic speakers, and measure Patronus' performance; in the other experiments, the devices are always placed at the 90° angle. We also measure the performance without using the reflection layer. We turn the ultrasonic speakers around so they face the same direction as the normal speaker when we remove the reflection layer. The results when using the reflection layer are shown in Figures 3.13a and 3.13b, and the results without the reflection layer are shown in Figures 3.13c and 3.13d. From the results, we see that with the reflection layer, Patronus can successfully scramble the unauthorized device when the angle is more than 15°, which is significantly larger than the angle of more than 45° needed by Patronus without the reflection layer. Therefore, the reflection layer does significantly enlarge the scramble range of Patronus.

Figure 3.13: (a) and (b): PESQ and SRVA with the reflection layer. (c) and (d): PESQ and SRVA without the reflection layer.

3.6.8 Impact of the Frequency Duration

We also measure the impact of the frequency duration. As discussed in Section 3.4, we would like to make the duration of each frequency as short as possible. However, the shorter the frequency duration, the harder it is for authorized devices to descramble. To verify this, we place an authorized and an unauthorized device at 40 cm from Patronus and play the chosen segment (43 words) using the normal speaker. Both devices record the speech under Patronus using five different frequency durations: 0.1 s, 0.2 s, 0.3 s, 0.4 s, and 0.5 s. We calculate PESQs and SRVAs for each duration. Moreover, we implement the attack model from Section 3.4.2, which first estimates the approximate scramble frequencies using STFT and then attempts to cancel the scramble using an NLMS adaptive filter. We calculate PESQs and SRVAs for each duration and all devices, including the attack model.

Table 3.1: Descramble time (DT, in ms) for different record times (RT) and different max scramble orders (MSO, the upper bound of k in Algorithm 3.1).

RT (s)   MSO=1   MSO=2   MSO=3   MSO=4   MSO=5   MSO=6
1           51      96     159     209     265     328
2           73     145     218     291     373     454
5          161     322     487     634     798     954
10         290     582     851    1108    1389    1653
20         548    1094    1653    2165    2695    3298
30         822    1617    2348    3088    3830    4563

As shown in Figure 3.12b, for all durations, the SRVAs of the unauthorized device are lower than 0.1, and the PESQs are lower than 0.5. The authorized device has higher SRVAs and PESQs than the unauthorized device. Specifically, when the duration reaches 0.3 s, the SRVA reaches roughly 0.8 and the PESQ exceeds 2.0. This verifies our claim that authorized devices can successfully descramble when the frequency duration is long enough.
A shorter duration also makes it harder for attackers to crack the scrambled record; e.g., the attacker's SRVAs also increase as the duration increases. Although the attacker's SRVAs and PESQs are higher than those of the unauthorized device, they are still too low to extract useful information. The reason why the NLMS adaptive filter fails is that the attacker cannot identify the scramble frequencies with enough accuracy. The NLMS adaptive filter solves the optimization problem defined by Equation (3.4), which estimates the weight vector $h_2'$. Since convolution does not change the frequency of the signal, the attacker cannot compensate for any offset between the correct frequency and the result from STFT. Due to the frequency resolution problem of STFT discussed in Section 3.4.3.4, the simulated attacker in our experiment has an average frequency offset of around 3 Hz, which makes it hard to descramble the recording.

3.6.9 Descramble Time

Sometimes, when we grant recording permission to a specific device, its owner would like to perform real-time descrambling. Patronus can support this when working with real-time smart devices such as Amazon Alexa. To prove this, we measure the descramble time for records with different durations on a laptop with an Intel Core i7-4870HQ 2.5 GHz CPU. Since different high-order scramble waves (second-order component, third-order component, ...) may exist in a record simultaneously, we measure the descramble time as a function of the max scramble order, i.e., the upper bound of $k$ in Algorithm 3.1. As shown in Table 3.1, Patronus can descramble the record quickly. Specifically, when the record time is 1 s, Patronus can finish descrambling in 328 ms, even when the max scramble order is 6. This means that Patronus supports real-time descrambling.

3.7 Limitations and Future Works

Range: In our implementation, we use cheap and low-power ultrasonic transducers to build the Scramble Transmitter. The result is a short working distance, i.e., less than 70 cm. To enlarge the working area to a wider range of angles, we designed a reflection layer and verified that it could enlarge the working area by using an iron wok in our prototype. We can also use a high-power ultrasonic speaker to protect a larger area. Some commercial off-the-shelf devices can emit ultrasound that can be sensed over a larger area. For example, UPS+ [3] uses an ultrasonic speaker with a working area of 50 m × 50 m. However, it is expensive. We can reduce the cost by deploying one expensive speaker and multiple transducers, as in UPS+ [3]. Here we provide users with three options to deploy Patronus according to their requirements, such as working area and budget. The first option is to use cheap transducers and a reflection layer to protect a small area. The second is to combine an expensive speaker and multiple transducers to protect a larger area. The third is to use multiple expensive speakers to protect the largest area.

Volume: In our implementation, we assume the talker uses a normal volume, i.e., not too loud or too quiet. However, the performance of Patronus does vary as a function of the talker's volume. For example, if the talker speaks too loudly, the scramble cannot mess up the recording; in the opposite extreme, a quiet talker's speech cannot be recovered by descrambling. To adapt to different volumes, we can add a microphone to measure the talker's volume.
With multiple deployed ultrasonic speakers or transducers, we can first detect the position of recording devices and then adjust the power of the ultrasound emitted from the nearest speakers according to the talker's volume. There are two challenges that need to be solved. First, the microphone we use to measure the talker's volume can also be scrambled. Second, we need to localize recording devices before emitting scrambles. We leave these challenges as future work.

CHAPTER 4

BREATHPASS: ULTRASONIC AUTHENTICATION BY CHEST AND ABDOMEN MOVEMENT WHILE BREATHING

4.1 Introduction

With the advancement of modern smart devices, unlocking methods have shifted away from the "what you know" schema and toward the "who you are" schema. With the "what you know" method, a user needs to pre-configure some information, such as PINs and secret questions, and the device will then challenge the user to verify that she or he actually owns the device. Such a PIN is often complex in order to ensure security, which makes it difficult for individuals to remember. In addition, these passcodes or answers are vulnerable to blind replay attacks, since the devices do not care who is entering the information. With "who you are" tactics, the user no longer needs to type in a complex PIN, thus simplifying and speeding up unlocking. These approaches are quite popular with users because of their non-invasive nature and ease of use; e.g., Apple employs Face ID to unlock the iPhone and iPad via facial recognition [76]. Apart from facial recognition, fingerprint identification is a frequently used method for unlocking smart gadgets [77]. In addition, voiceprint recognition [78], iris recognition [79], heartbeat recognition [80], breathing voice recognition [81], gaze gestures [82], and tooth-edge recognition [83] also play a key role in biometric recognition approaches. These approaches, however, have drawbacks in two different aspects.

Vulnerable to Replay-attack: Some of them can still be compromised by replay attacks; e.g., many research efforts [84, 85, 86, 87, 88, 89, 90] focus on resolving replay attacks against voiceprint-based, fingerprint-based, gaze-based, or face-based authentication. For example, one could spoof another person's face and voice with masks and recordings.

Lack of Mobility and Flexibility: Other approaches using the iris, tooth edge, heartbeat, or human breath are not sufficiently flexible on mobile devices; e.g., iris-based authentication requires the device to be equipped with specifically designed components such as infrared cameras, and it needs users to look at a specific area so that the infrared camera can capture a clear iris. Heartbeat-based authentication such as Cardiac Scan [80] requires the deployment of two radar sensors, which are not standard hardware and thus have a high operating cost. BreathPrint [81] is a novel approach that does not need specifically designed components and can significantly defend against replay attacks; however, it cannot work in some scenarios, including when people choose to wear a mask to protect themselves from being infected by COVID-19, as the breathing sound needed by the system can be blocked by the face cover. The face cover also makes Smileauth [83] infeasible since it requires an image of the tooth edge, which is blocked by the face mask.

Figure 4.1: Comparison of existing biometric authentication methods.
In this chapter, we propose BreathPass, a new non-invasive breath-based "who you are" authentication technique. BreathPass detects users' breath in a non-invasive manner, extracts features from the breath, and then verifies that the user is permitted. As shown in Figure 4.1, BreathPass is a novel approach that is hard to compromise with replay attacks because the breathing pattern is hard to spoof and imitate. In addition, BreathPass is flexible, since it only uses commercial off-the-shelf (COTS) components equipped on almost every device, and it can be used in a wide range of scenarios, such as wearing different kinds of face covers and clothes, in different postures, and in different dynamic states such as walking or running.

BreathPass faces the following challenges in order to implement it and achieve all of the aforementioned requirements: 1) As with BreathPrint, face covers may obstruct the sound of the user's breath. To overcome this challenge, BreathPass should avoid using a microphone to record users' breathing sounds; instead, we employ an ultrasound-based chest wall and abdomen motion-sensing technique to characterize users' breathing patterns. Specifically, BreathPass works by initially emitting ultrasonic waves through the speaker of a smart device, such as a smartphone. The ultrasonic waves then travel to the user's chest wall and abdomen, where they are reflected back to the smart device's microphone. The motion of the chest wall and abdomen, which characterizes human breath, alters the phase of the reflected signal, and such phase shifts are used for authentication. 2) Unlike speaker verification [91, 92], which normally converts the speech signal to a spectrogram in order to extract features, the motion of the chest wall and abdomen typically has an extremely low frequency of less than 1 Hz. As a result, features derived from spectrograms, such as Mel Frequency Cepstral Coefficients (MFCC) or Gammatone Frequency Cepstral Coefficients (GFCC) [93], cannot be used to identify the breath. To address this issue, we implement the authentication mechanism using a one-dimensional Convolutional and Siamese Neural Network. Specifically, the neural network takes two raw chest wall and abdominal motion waveforms as the input. One of these two inputs is the template input collected during the enrollment stage, while the other is the matching input collected during the authentication stage. While training the neural network, it learns the breathing pattern and generates a vector of features, called a fingerprint, which can be used to calculate the distance between two inputs. Finally, BreathPass uses the distance between the two inputs to determine whether they originate from the same person. 3) Unlike mechanical vibration, which typically has a stable frequency [94], breathing patterns of different individuals do not share the same prior knowledge as mechanical vibration. Additionally, even when people are in the same posture, their breathing patterns may vary. In other words, small movements result in different breathing patterns. As a result, denoising the motion of the chest wall and abdomen requires developing a model that can suppress the moving-dependent noise while retaining the user-dependent difference. To address this issue, we introduce a technique called average fingerprinting. With this technique, the template input is composed of multiple chest wall and abdomen motion signals that might come from different tiny postures.
BreathPass generates multiple fingerprints from the template signals using a neural network. Following that, the system computes the average of those fingerprints and then uses that average fingerprint to determine the distance to the fingerprint obtained during the authentication stage. Finally, it calculates the authentication result using that distance.

Our contributions are listed as follows:
• We design a novel mechanism for sensing human breathing patterns and build a DNN to determine whether the breathing pattern provided by the user is authorized.
• Using the breath sensing mechanism and the DNN we built, we create BreathPass, which enables smart devices to perform authentication via the human breath. We also implement a proof-of-concept application to evaluate BreathPass's performance.
• On the basis of our implementation, we conduct extensive experiments. BreathPass achieves an 83% accuracy, a 73% true-positive rate (TPR), and a 5% false-positive rate (FPR) in general, according to the experiment results. The BreathPass system is stable when the user wears a variety of different face covers and clothing and adopts different postures. We believe that in the future it may be a candidate for a "who you are" unlocking mechanism, or it may serve as a complement to other, less trustworthy mechanisms, such as eye recognition, in order to provide authentication services jointly.

The organization of the remainder of this chapter is as follows: In Section 4.2, we present the human breath preliminary and our system overview. In Section 4.3, we introduce the design of BreathPass. In Section 4.4, we introduce the implementation of BreathPass. In Section 4.5, we present the results of our extensive experiments. In Section 4.6, we summarize related works. In Section 4.7, we discuss limitations and future work.

4.2 Preliminary

In this section, we describe the human breath process in detail and show that the breath characteristics represented by chest/abdomen movement are diverse enough to be used for user authentication. In addition, we give an overview of BreathPass.

Figure 4.2: Illustration of the chest/abdomen in the inhale step of a human breath process.

4.2.1 Human Breath Preliminary

As shown in Figure 4.2, multiple human body parts are involved in breathing. A respiration cycle comprises two steps: inhalation and exhalation. During the inhale step, air enters the body via the nose or mouth and travels to the lungs. As a result of the chest wall expansion, the lung fills with air and expands; at the same time, the diaphragm at the bottom of the lung and at the top of the abdomen contracts. During this process, viewed from the outside, the chest wall expands, resulting in a larger chest cavity; at the same time, the abdomen expands as the diaphragm contracts. During the exhale step, on the contrary, the air flows out of the body from the nose or mouth. As the air leaves the body, the chest wall contracts. Meanwhile, the diaphragm relaxes, resulting in a smaller chest cavity and abdomen.

Certain clinical studies indicate that individuals' breathing patterns vary. Kaneko et al. [95], in particular, conducted an experiment in which three sensors were placed on the thorax and abdomen and a vector of three-dimensional movement was measured during breathing. The observed breathing movements were found to be related to the effects of age, sex, and posture.
Raganarsdottir et al. [96] discovered that women have fewer abdominal movements than men during deep breathing. Additionally, the previous study [81] cites clinical studies [97, 98] to demonstrate that individual signatures of breath composition do exist.

Remark: We believe that by fully characterizing the movements of the chest wall and abdomen, distinct signatures for individual recognition can be extracted.

Figure 4.3: Overview of the BreathPass system, which consists of an enrollment stage and an authentication stage.

4.2.2 System Overview

As shown in Figure 4.3, BreathPass enables authentication in two stages: the enrollment stage and the authentication stage. The enrollment stage's objective is to sample the user's breath waveforms in a variety of different states (called tiny postures) using a breath sampler (Section 4.3.1) that emits a harmonic ultrasound signal. The different tiny postures for BreathPass are analogous to the different edge parts of a fingerprint. To ensure robustness under all common circumstances, we sample the breath waveform multiple times at the enrollment stage. Following that, the enrollment stage utilizes a DNN-based fingerprint extractor (Section 4.3.2) to extract n feature vectors, referred to as fingerprints, then calculates the average of those fingerprints, referred to as the template fingerprint, and stores it in local storage. The authentication stage is similar to the enrollment stage, except that the breath sampler samples only one breath waveform. Following that, it uses the same fingerprint extractor as in the enrollment stage to extract the input fingerprint. The template fingerprint is then fetched from local storage and compared with the input fingerprint in the comparator (Section 4.3.3) to determine whether the sampled breath comes from an authorized user.

4.3 Design

We begin this section by analyzing the properties of ultrasonic signals in order to provide guidelines for the chest/abdomen movement tracking design (§4.3.1). Then we go into detail about how we extract the fingerprint of the breathing pattern (§4.3.2), how we determine whether an authentication fingerprint matches the enrolled one (§4.3.3), and finally the BreathPass design workflow (§4.3.4).

4.3.1 Ultrasound-based Breath Sampler

Breathing Pattern Extraction: We use the speakers on smart devices, such as a smartphone, to play a stereo ultrasound signal. As shown in Figure 4.3, the speaker is perpendicular and close to the chest wall. The left channel plays an ultrasound signal at an 18 kHz frequency, while the right channel plays one at a 22 kHz frequency. The ultrasound signals are reflected off the chest wall and abdomen and are picked up by the smartphone's microphone, which is also positioned near the chest wall. Formally, the emitted signal is denoted as
$$s(t) = \cos(2\pi f_1 t) + \cos(2\pi f_2 t), \qquad (4.1)$$
where $f_1 = 18{,}000$ and $f_2 = 22{,}000$. After the microphone records the reflected signal $m(t)$, the breath sampler first employs a high-pass filter to eliminate components below 16 kHz. Then, inspired by previous efforts [99, 10], the breathing pattern can be regarded as a signal $x(t)$ modulated onto the carrier $s(t)$ to produce $m(t)$. Therefore, we have
$$m(t) = x(t)\, s(t). \qquad (4.2)$$
To demodulate the breathing pattern $x(t)$, we multiply $m(t)$ by $s(t)$ and pass the result through a low-pass filter with an extremely low cutoff frequency, e.g., 200 Hz. From Equations (4.1) and (4.2), we have
$$\begin{aligned} m(t)s(t) &= x(t)s^2(t) = x(t)\left[\cos(2\pi f_1 t) + \cos(2\pi f_2 t)\right]^2 \\ &= x(t)\left[\cos^2(2\pi f_1 t) + 2\cos(2\pi f_1 t)\cos(2\pi f_2 t) + \cos^2(2\pi f_2 t)\right] \\ &= x(t)\left\{\frac{1}{2}\left[1 + \cos(2\pi \cdot 2 f_1 t)\right] + \cos(2\pi (f_1 + f_2) t) + \cos(2\pi (f_2 - f_1) t) + \frac{1}{2}\left[1 + \cos(2\pi \cdot 2 f_2 t)\right]\right\}. \end{aligned} \qquad (4.3)$$
After a low-pass filter with a 200 Hz cutoff frequency, the components $\cos(2\pi \cdot 2 f_1 t)$, $\cos(2\pi \cdot 2 f_2 t)$, $\cos(2\pi (f_1 + f_2) t)$, and $\cos(2\pi (f_2 - f_1) t)$ all disappear. Therefore, we have
$$m(t)s(t) \;\Longrightarrow\; x(t)\left(\frac{1}{2} + \frac{1}{2}\right) = x(t). \qquad (4.4)$$
We use the extracted $x(t)$ as the breathing pattern to perform authentication.

Figure 4.4: A controlled experiment verifying our ultrasound frequency selection. The boards move to mimic the chest wall and abdomen motion during breathing.

Ultrasound Frequency Selection: We want to emphasize why we use a pair of ultrasound signals with two frequencies rather than a single-frequency wave. The higher-frequency signal attenuates with distance more easily than the lower-frequency signal. If we place an ultrasound speaker perpendicular to the chest wall and play a pair of ultrasound signals, the lower-frequency signal is more likely to reach the abdomen, whereas the higher-frequency signal can only reach the chest. To verify this, we conduct the experiment depicted in Figure 4.4. We place two boards perpendicular to the y-axis and place a smartphone with its speaker facing the upper board. The distance between the upper board and the speaker is 10 cm. The speaker plays the ultrasound signal at a frequency of 18 kHz and 22 kHz, alternately. When the speaker plays a particular ultrasound frequency, we move either the upper or the lower board back and forth along the y-axis (one at a time) to simulate only the chest wall moving or only the abdomen moving, and extract the motion waveform as in Equation (4.4). We calculate the power under each scenario, and denote by $P_{18u}$ the power of the sensed motion waveform of the upper board using the 18 kHz ultrasound, and by $P_{18l}$ the power of the sensed motion waveform of the lower board using the 18 kHz ultrasound. Similarly, we denote by $P_{22u}$ the power of the sensed motion waveform of the upper board using the 22 kHz ultrasound and by $P_{22l}$ the power of the sensed motion waveform of the lower board using the 22 kHz ultrasound. To determine the sensitivity, we compare $Q_l = P_{18l}/P_{22l}$ with $Q_u = P_{18u}/P_{22u}$. $Q_l$ and $Q_u$ indicate how much power the 18 kHz ultrasound senses when the 22 kHz ultrasound senses one unit of power of the lower and upper board movement, respectively. In our experiment, $Q_l / Q_u \approx 1.61 > 1$, indicating that 18 kHz is more sensitive to lower board movement than 22 kHz. Similarly, 22 kHz is more sensitive to upper board movement than 18 kHz.

Figure 4.5: (a) Spectrogram of the speech "OK, Google!". (b) Spectrogram of a breathing sound. (c) FFT and CDF of a breathing pattern. (d) Spectrogram of a breathing pattern.
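Looking back at Equations (4.2)–(4.4), the demodulation can be summarized in a short sketch. The following NumPy/SciPy illustration assumes a mono recording sampled at 48 kHz; the filter orders and the function name are assumptions rather than the exact BreathPass implementation.

import numpy as np
from scipy.signal import butter, filtfilt

def extract_breathing_pattern(m, fs=48_000, f1=18_000, f2=22_000, cutoff=200):
    """Demodulate the breathing pattern x(t) from the recorded signal m(t)."""
    t = np.arange(len(m)) / fs

    # High-pass filter: keep only the ultrasound band above 16 kHz (Section 4.3.1)
    b_hp, a_hp = butter(4, 16_000 / (fs / 2), btype="high")
    m_hp = filtfilt(b_hp, a_hp, m)

    # Multiply by the known carrier s(t) = cos(2*pi*f1*t) + cos(2*pi*f2*t), Eq. (4.3)
    s = np.cos(2 * np.pi * f1 * t) + np.cos(2 * np.pi * f2 * t)
    mixed = m_hp * s

    # Low-pass at 200 Hz removes the 2*f1, 2*f2, f1+f2, and f2-f1 terms, Eq. (4.4)
    b_lp, a_lp = butter(4, cutoff / (fs / 2), btype="low")
    return filtfilt(b_lp, a_lp, mixed)

The returned low-frequency waveform corresponds to the chest wall and abdomen motion $x(t)$ used in the following sections.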
As a result, since ultrasound at 22 kHz is more sensitive to chest wall motion and ultrasound at 18 kHz is more sensitive to abdomen motion, we use a pair of ultrasound signals to fully characterize the motion of the chest wall and abdomen.

4.3.2 Fingerprint Extractor Design

Design Issues: After sampling the breathing pattern, both the enrollment stage and the authentication stage send their samples to the fingerprint extractor. In order to get a feasible fingerprint that can be used to perform authentication, the design of the fingerprint extractor should take the following challenges into consideration:

1) Denoise. Previous works [94, 10, 100] proposed multiple approaches to denoise the vibration waveform or the breathing waveform for machine damage or human disease detection. For example, mmVib [94] reports a machine error when an abnormal vibration is detected. The system collects the vibration waveform with noise and leverages a model to denoise the signal. After that, the system measures the distance between the sampled signal and the normal-status signal. If the distance is within a threshold, the system concludes that the machine works normally. Otherwise, the system reports that the machine is in an abnormal status. To build such a denoise model, the system usually collects the vibration waveform with noise when the machine works normally and uses a series of transformations and processes, e.g., matching arcs on the I-Q plane [94], to match the noisy waveform with the standard vibration waveform as precisely as possible. After matching, it fixes the processes and parameters to denoise future signals. If the machine works in an abnormal status, the signal processed by the same model with the same parameters is far from the standard one. Such an idea was also adopted by SpiroSonic [10] and BreathListener [100] to detect whether a human breathes normally. The common point of these works is to find the identical pattern that appears when the machine or the lung works normally. In BreathPass, however, the goal is to characterize the difference in the breathing pattern among different people instead of finding a typical pattern shared by different people's breaths. Therefore, it is hard for us to build a denoise model by extracting a common pattern.

2) Stability. The chest wall and abdomen motion is not as stable as a machine's. A different breathing pattern may be extracted even when the user stays in the same posture but makes a tiny movement; e.g., when a user leans back in a chair from sitting with a straight waist, a different breathing pattern will be extracted. Therefore, the design of the fingerprint extractor must take such stability issues into consideration.

3) Feature selection. The task most similar to BreathPass is speaker verification. The speaker verification task first uses a microphone to record the user saying a predefined sentence or any other sentence. The system then verifies whether the recorded voice comes from an authorized user. Existing speaker verification solutions typically first transform the voice signal into a spectrogram, then extract spectrogram-based features such as Mel Frequency Cepstral Coefficients (MFCC) or Gammatone Frequency Cepstral Coefficients (GFCC) [93]. After that, such solutions leverage a Gaussian Mixture Model (GMM) or build a Deep Neural Network (DNN)-based model to verify the speaker. BreathPrint [81] adopts a similar idea.
Different from the speaker verification task, BreathPrint uses a microphone to record the user's breathing sound, then extracts GFCC features and leverages a GMM model to verify whether the breathing sound comes from an authorized user. To extract spectrogram-based features, a system needs a signal with a reasonably wide bandwidth; e.g., speaker verification and BreathPrint [81] are both able to leverage spectrogram-based features since the spectrogram of the speech "OK, Google!" reaches up to 6 kHz, as shown in Figure 4.5a, and the spectrogram of a breathing sound reaches up to 10 kHz, as shown in Figure 4.5b. Therefore, there is sufficient area in the spectrogram to embed information, and these systems are able to use photo-verification-like models to perform authentication. In BreathPass, however, we cannot use spectrogram-based features since the bandwidth of a breathing pattern is extremely narrow. An adult typically finishes a breathing cycle in 2 to 3 seconds. We plot the Fast Fourier Transform (FFT) of a breathing pattern sampled with the method described in Section 4.3.1 in Figure 4.5c. We find that the majority of the power is under 1 Hz (90% of the power is below 0.89 Hz). Therefore, the bandwidth of a breathing pattern in the spectrogram is too narrow, as shown in Figure 4.5d, to provide sufficient information for authentication.

Our Solution: To address the first challenge, that we cannot build a denoise model based on observation, we build a DNN-based model that learns how to denoise the signal and extract the fingerprint itself. To cope with the third challenge, we use the raw breathing pattern waveform as the input instead of extracting spectrogram-based features. As shown in Figure 4.6, the fingerprint extractor consists of a series of convolutional layers followed by several fully connected layers. Each layer uses the ReLU activation function, and there is a max-pooling layer with a length of 4 after the first and the last convolutional layers. Each fully connected layer except the last one adopts dropout with parameter 0.2 to avoid overfitting. We also adopt the idea of ResNet [101] of adding a skip link between convolutional layers to avoid vanishing gradients. The fingerprint extractor takes a breathing pattern waveform sampled by the method described in Section 4.3.1 as the input. After the last fully connected layer of the extractor, it outputs a vector of 512 floating-point numbers as the fingerprint.

Figure 4.6: The structure of our DNN model for the fingerprint extractor (Conv2@1x3, Conv4@1x3, Conv4@1x3, Conv8@1x3, followed by FC8192, FC4096, FC2048, FC1024, FC512, with a skip link).

The second challenge, stability, requires us to remove moving-dependent noises while preserving the user-dependent differences among different users' breathing patterns. To cope with this issue, we introduce the average fingerprint technique. Specifically, instead of using the fingerprint that comes from a single breathing pattern waveform as the result of the enrollment stage, we sample multiple breathing patterns in the enrollment stage and get multiple fingerprints correspondingly. We calculate the average of these fingerprints as the result of the enrollment stage.
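To make the extractor and the average-fingerprint step concrete, the following is a minimal PyTorch sketch modeled on the layer labels in Figure 4.6; the exact layer sizes, the placement of the skip link, and the fixed input length are illustrative assumptions rather than the exact BreathPass model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FingerprintExtractor(nn.Module):
    """1-D CNN + FC extractor sketched after Figure 4.6 (sizes are assumptions)."""

    def __init__(self, input_len=16_384):
        super().__init__()
        self.conv1 = nn.Conv1d(1, 2, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(2, 4, kernel_size=3, padding=1)
        self.conv3 = nn.Conv1d(4, 4, kernel_size=3, padding=1)
        self.conv4 = nn.Conv1d(4, 8, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(4)            # after the first and the last conv layers
        flat = 8 * (input_len // 16)           # two poolings of length 4
        sizes = [flat, 8192, 4096, 2048, 1024, 512]
        self.fcs = nn.ModuleList(nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:]))
        self.drop = nn.Dropout(0.2)            # on every FC layer except the last

    def forward(self, x):                      # x: (batch, 1, input_len)
        x = self.pool(F.relu(self.conv1(x)))
        y = F.relu(self.conv2(x))
        y = F.relu(self.conv3(y)) + y          # ResNet-style skip link between conv layers
        x = self.pool(F.relu(self.conv4(y)))
        x = x.flatten(1)
        for fc in self.fcs[:-1]:
            x = self.drop(F.relu(fc(x)))
        return self.fcs[-1](x)                 # 512-dimensional fingerprint

def template_fingerprint(extractor, enroll_patterns):
    """Average fingerprint over the n enrollment breathing patterns."""
    with torch.no_grad():
        fps = extractor(enroll_patterns)       # enroll_patterns: (n, 1, input_len)
        return fps.mean(dim=0)                 # the stored template fingerprint

During authentication, the comparator described next measures the distance between this stored template fingerprint and the freshly extracted fingerprint.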
The idea behind the average fingerprint is that, if we focus on the same user, moving-dependent noises are unstable while the user-dependent difference is stable; therefore, if we take the average of multiple fingerprints, the unstable moving-dependent noises are smoothed out while the stable user-dependent difference is preserved. We show the effectiveness of the average fingerprint in Section 4.5.10.

4.3.3 Comparator Design

After getting fingerprints from both the enrollment stage and the authentication stage, we need to build a comparator to measure the distance between two fingerprints.

Figure 4.7: The end-to-end system design combining the fingerprint extractor with the comparator.

In general, there are two approaches to building such a comparator. The first is applying triplet loss [102] while training the feature extractor. With this approach, we need to provide a triplet $(A, P, N)$ as the training data, where $A$ and $P$ are breathing patterns from the same person, while $N$ is from another person. The goal of triplet loss is to find DNN parameters that output fingerprints $(A', P', N')$ satisfying $d(A', P') + \alpha \le d(A', N')$, where $d(\cdot)$ is the Euclidean distance or the cosine similarity. In our practice, however, it is hard to train the fingerprint extractor with the triplet loss. Therefore, we adopt another idea [103]. After getting the fingerprints, instead of building a comparator with the triplet loss, we build the comparator by applying logistic regression. Specifically, we have the target function
$$f(x, y) = \sigma(w^T \|x - y\|^2 + b), \qquad (4.5)$$
where $\sigma(\cdot)$ is the sigmoid function, $w$ is the vector of parameters of the comparator, $b$ is the bias, and $x$ and $y$ are the fingerprints from the enrollment stage and the authentication stage, respectively. During training of the comparator, the target output $f(x, y)$ is set to 1 if $x$ and $y$ are from the same user; otherwise, the target output is set to 0.

4.3.4 Combine the Fingerprint Extractor with the Comparator

As shown in Figure 4.7, we put these components together. The fingerprint extractor and comparator are combined into a single neural network. During the training process, we randomly choose n breathing patterns from the same volunteer in the training dataset and choose another breathing pattern from a random volunteer in the training dataset. If these two volunteers are the same one, the final result, i.e., Pass?, is set to 1; otherwise, it is set to 0. As for the deployment of these components, the lower left side of the figure, i.e., the n breathing patterns, comes from the enrollment stage. We store the average fingerprint in advance, as shown in Figure 4.3. During the authentication stage, the system samples a breathing pattern as shown on the lower right side of Figure 4.7. If the output of the comparator is greater than 0.5, we set the final result, i.e., Pass?, to 1, indicating that authentication was successful; otherwise, we set the final result to 0, indicating that authentication failed. If authentication fails, BreathPass prompts the user to sample his or her breathing pattern and attempt authentication again; if authentication continues to fail, BreathPass will prevent the user from sampling the breathing pattern until the user enters the correct PIN.

4.4 Implementation

4.4.1 Breathing Pattern Sampler and Data Collection

We develop the breathing pattern sampler on Android smartphones, as shown in Figure 4.8a.
We use the native Android library AAudio [104] to generate, emit, and record the ultrasound waves. With the approval of our IRB under expedited review, we use our sampler to collect data and extract the breathing patterns of 20 volunteers. Each volunteer is continuously sampled for 60 seconds at a time, five times (i.e., 300 s in total). The 20 volunteers cover people of different genders and ages who frequently use smart devices. We use a Google Pixel 3a smartphone running Android 11 to perform sampling. The sampling rate is set to 48 kHz. During sampling, we place the smartphone on a desk with the speaker pointed toward the volunteer's chest. The distance between the smartphone and the volunteer is between 5 and 10 cm. An interval separates two consecutive samplings to allow the volunteer to slightly adjust their posture. Once the microphone samples the reflected ultrasound signals, we leverage the Apache Commons Math package [105] to build a high-pass filter that eliminates all components below 16 kHz, leaving only the ultrasound signals, and then extract the breathing pattern using the design described in Section 4.3.1. The extracted breathing patterns are indexed by number, as shown in Figure 4.8a. Specifically, 1 to 5 represent the first volunteer, 6 to 10 represent the second volunteer, and so on.

Figure 4.8: The UI of the BreathPass implementation on a smartphone. (a) Breathing pattern sampler for general data collection; (b) BreathPass; (c) Enrollment; (d) Successfully authenticated; (e) Failure to authenticate. Panels (b)-(e) are the pages of our proof-of-concept application.

4.4.2 Training the Feature Extractor and Comparator

After getting the dataset from the 20 volunteers, we build the feature extractor and comparator using PyTorch on a desktop equipped with an NVIDIA GeForce RTX 3090 GPU, as discussed in Section 4.3.2 and Section 4.3.3. We randomly select 10 volunteers' data for the training set and the remaining volunteers' data for the test set. During each iteration of training and testing, we first randomly select a volunteer, then randomly choose a 60 s long breathing pattern from that volunteer's indexed dataset, and finally randomly crop a segment of the 60 s breathing pattern ranging from 1 s to 5 s. This process is repeated 10 times to create the template inputs. Then we obtain another breathing pattern segment, alternately choosing the same volunteer or a different volunteer, as the authentication input. If we have chosen the same volunteer, the target of the DNN output is set to 1; otherwise, we set it to 0. We also add some fake breathing patterns, which are collected by the breathing pattern sampler with the smartphone speaker pointed at a wall or at nothing, to enhance the classification accuracy. We always set the target of the DNN output to 0 if any of these fake patterns is chosen.

4.4.3 Proof-of-concept Application

To explore the efficiency of the system on different smart devices in practice (see Section 4.5.11), we develop a proof-of-concept Android application, as shown in Figures 4.8b, 4.8c, 4.8d, and 4.8e. We port our PyTorch model built in Section 4.4.2 with TorchScript [106] and load it into our application. As shown in Figure 4.8b, our application provides the enrollment stage and the authentication stage. The enrollment stage (Figure 4.8c) integrates the breathing pattern sampler discussed in Section 4.4.1, but we only sample 1 second at a time.
Here we choose the 1-second segment because experiment results indicate that this segment length provides the best performance. After sampling the breathing patterns, clicking the ADD TO PROFILE button makes our app launch the DNN-based fingerprint extractor to obtain the fingerprint and store a copy in a directory named PROFILENAME.profile. The app displays a segment of the breathing pattern to confirm that it was correctly extracted. As shown in Figures 4.8d and 4.8e, during the authentication stage, the user first samples his breathing pattern, then chooses which template profile to authenticate against. After hitting the AUTHENTICATE button, the app fetches the corresponding fingerprints from the directory PROFILENAME.profile and calculates their average. Then the app launches the fingerprint extractor model to generate the fingerprint of the sampled breathing pattern. Finally, it launches the comparator to perform the authentication and displays whether the authentication is successful.

4.5 Evaluation

4.5.1 Overview

To evaluate BreathPass, we use the data collected in Section 4.4 to train and test the system. In general, we use the following metrics to evaluate the performance of BreathPass:

Accuracy: We use accuracy to determine whether BreathPass can correctly identify the authorized user whose fingerprints are stored during the enrollment stage. The accuracy is calculated as

$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{N} I(\hat{y}_i = y_i)}{N}, \qquad (4.6)$$

where $N$ is the number of test cases, $I$ is the indicator function, $\hat{y}_i$ is the output given by BreathPass, and $y_i$ is the correct label. In general, the greater the accuracy, the better.

True positive rate (TPR) and false positive rate (FPR): Besides the accuracy, we also focus on two metrics, the true positive rate (TPR) and the false positive rate (FPR). We calculate the TPR by using the equation

$$\mathrm{TPR} = \frac{\sum_{i=1}^{N} I(\hat{y}_i = 1 \text{ and } y_i = 1)}{\sum_{i=1}^{N} I(y_i = 1)}, \qquad (4.7)$$

and the FPR is calculated by

$$\mathrm{FPR} = \frac{\sum_{i=1}^{N} I(\hat{y}_i = 1 \text{ and } y_i = 0)}{\sum_{i=1}^{N} I(y_i = 0)}, \qquad (4.8)$$

where $N$ is the number of test cases, $I$ is the indicator function, $\hat{y}_i$ is the output given by BreathPass, and $y_i$ is the correct label.

Figure 4.9: (a) General performance of BreathPass; (b) Performance of different mobile models.

When an enrolled user attempts to unlock the device, the TPR gives the likelihood that the system will successfully authenticate. When an unauthorized user attempts to unlock the device, the FPR gives the probability that the system will pass the authentication by mistake. The higher the TPR, the better, while the lower the FPR, the better. Apart from the TPR and FPR, two additional metrics are often used to characterize an authentication system's performance: the true negative rate (TNR) and the false negative rate (FNR), which indicate the likelihood of an unauthorized user being successfully blocked by the system and the likelihood of an authorized user failing the authentication, respectively. However, we are less concerned with these two values because an attacker cannot do anything if the device cannot be unlocked.
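These three metrics translate directly into code. The following short NumPy sketch (the function name is ours) computes them from the 0/1 decisions and ground-truth labels of a test run.

```python
import numpy as np

def evaluation_metrics(y_pred, y_true):
    """Accuracy, TPR, and FPR as defined in Equations (4.6)-(4.8).
    y_pred, y_true: arrays of 0/1 decisions and ground-truth labels."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    accuracy = np.mean(y_pred == y_true)
    tpr = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)
    return accuracy, tpr, fpr
```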
With these three metrics, we would like to answer the following questions:

• Can the breathing pattern we sampled be used to perform authentication, and can BreathPass become a candidate "who you are" scheme under the COVID-19 scenario?

• Does BreathPass perform well for the same user when wearing different kinds of face covers and clothes, with different postures, under dynamic status, in different environments, and on different mobile models?

• Can BreathPass defend against replay attacks?

• What is the impact if we use a single template breathing pattern to generate the fingerprint rather than using the average fingerprint?

• Can BreathPass finish the authentication within a reasonable time limit?

Figure 4.10: Performance of BreathPass with different kinds of face covers: (a) Masks; (b) TPR w/o grouping; (c) TPR w/ grouping.

4.5.2 General Evaluation

Setup: To determine whether the extracted breathing pattern can be used for authentication, we train and test the fingerprint extractor and comparator using the dataset collected in Section 4.4. We form the training dataset by randomly choosing 10 volunteers from the whole dataset. In this experiment, we use the remaining 10 volunteers as well as the fake breathing patterns to perform testing. We perform 1000 iterations of testing. During each iteration, we randomly choose 1 s to 5 s breathing pattern segments as described in Section 4.4.1 to form the test datasets with mini-batches of 32, resulting in a total of 32 × 1000 = 32000 test cases. The output of the sigmoid function in Equation (4.5) is in the range [0, 1]. If the output is greater than or equal to 0.5, the result is considered passed; otherwise, the result is considered failed.

Results: As shown in Figure 4.9a, BreathPass achieves an accuracy above 80%, a TPR above 70%, and an FPR below 10% for every segment length of the input breathing pattern. BreathPass achieves an accuracy of 83%, a TPR of 73%, and an FPR of 5% when the input breathing pattern is segmented at 1 second, which is the best segment length. As a result, we assert that the breathing pattern we sampled can be used for authentication and that, considering the TPR and the FPR, BreathPass can serve as a candidate for a "who you are" scheme, either alone or in conjunction with eye recognition in the COVID-19 scenario.

4.5.3 Effectiveness on Different Mobile Models

Setup: To verify whether BreathPass is able to work on different mobile models, we launch BreathPass on three different mobile phones, i.e., Google Pixel 3a, Huawei Mate 9, and Google Pixel. In this experiment, we use these three mobile models to collect a volunteer's breathing pattern, respectively. Then we form the positive test cases by selecting pairs of 1 s breathing patterns from the collected breathing patterns. After that, we make pairs of the breathing patterns from a given mobile model and the test datasets discussed in Section 4.5.2 as the negative cases.

Results: As shown in Figure 4.9b, there is little difference in accuracies, TPRs, and FPRs among the different mobile models. Therefore, BreathPass can work on different mobile models.
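For reference, the general test procedure of Section 4.5.2, which is also reused for the per-device comparison above, can be sketched as follows. Here sample_test_batch is a hypothetical helper that draws random template/probe pairs with 1 s to 5 s crops, and the returned decisions and labels can be fed to the metrics sketch above.

```python
import torch

def run_evaluation(extractor, comparator, sample_test_batch,
                   iterations=1000, batch_size=32, threshold=0.5):
    """Sketch of the Section 4.5.2 test procedure: 1000 iterations with
    mini-batches of 32 (32 x 1000 = 32000 cases), decisions thresholded at 0.5.
    `sample_test_batch(batch_size)` is assumed to return batched averaged
    enrollment fingerprints, probe waveforms, and 0/1 same-user labels."""
    y_pred, y_true = [], []
    with torch.no_grad():
        for _ in range(iterations):
            template_fp, probe_wave, label = sample_test_batch(batch_size)
            probe_fp = extractor(probe_wave)            # (batch, 512) fingerprints
            score = comparator(template_fp, probe_fp)   # sigmoid output in [0, 1]
            y_pred.append((score >= threshold).long())  # pass/fail decision
            y_true.append(label.long())
    # Pass the concatenated decisions and labels to evaluation_metrics()
    # to obtain accuracy, TPR, and FPR.
    return torch.cat(y_pred), torch.cat(y_true)
```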
4.5.4 Influence of Different Kinds of Face Covers

Wearing a face cover may obstruct airflow into the user's nose or mouth, thereby altering the user's breathing pattern. To characterize the effect of various types of face covers on users, we prepared four types of commonly used face covers, i.e., surgical, fabric, KN95, and N95, as shown in Figure 4.10a. There is almost no airflow blockage when wearing the surgical mask, while the remaining face covers make breathing harder than wearing the surgical mask or no face cover at all. We would like to characterize the performance across different kinds of face covers. In this experiment we only care about TPRs, i.e., the probability of successful authentication while wearing different face covers.

Figure 4.11: Performance of BreathPass with different kinds of clothes: (a) Clothes (T-shirt, hoodie, sweater, jacket); (b) TPR.

Setup without grouping: In this experiment, we invite a volunteer to wear each of the four types of face covers separately and evaluate BreathPass's performance. We ask the volunteer to enroll his breathing pattern with no face cover, and to perform authentication while wearing each kind of face cover.

Results without grouping: As shown in Figure 4.10b, TPRs decrease while wearing the masks that block airflow, but almost all of them are over 40%, which means that the extracted breathing pattern is still usable while wearing different kinds of face covers.

Setup with grouping: To further improve the TPRs while wearing the face covers that block airflow, we can split the face covers into two groups, i.e., airflow not blocked (no face cover and surgical) and airflow blocked (fabric, KN95, and N95). We ask the volunteer to enroll with one member of a group and perform authentication while wearing another member of that group. Specifically, we first use breathing patterns collected without a face cover to generate the template fingerprint and use breathing patterns collected with the surgical mask as the input of the authentication stage. Then we use the KN95 dataset to generate the template fingerprint and use breathing patterns from the fabric and the N95 datasets, respectively, as the input of the authentication stage. Finally, we generate the template fingerprint using breathing patterns from the N95 dataset and use it to evaluate performance when wearing the KN95 mask. This is reasonable, as the user could enroll in both groups separately and choose one manually or automatically before performing the authentication. We use the datasets of the volunteers that form the test datasets in Section 4.4.2 to test negative cases. The volunteer in this experiment is not one of the volunteers in Section 4.4.2.

Figure 4.12: (a) TPR of BreathPass with different postures; (b) Performance with or without the average fingerprint technique.

Results with grouping: As shown in Figure 4.10c, compared to the TPRs without a face cover, all TPRs with a face cover decrease, but most of the TPRs are higher than 70%, and in particular, for 1 s segments, the TPRs are all higher than 80%. Therefore, BreathPass is feasible across different face covers.
4.5.5 Influence of Different Clothes

Setup: Since BreathPass extracts breathing patterns from the motion of the chest wall and the abdomen, the collected breathing pattern might be influenced by different clothes, because different garments may block ultrasound signals to different degrees. In this experiment, we choose four of the most commonly used kinds of clothes, i.e., T-shirt, hoodie, sweater, and jacket, as shown in Figure 4.11a, and invite a volunteer to sample his breathing patterns while wearing each of them. We then use the dataset collected while wearing the T-shirt to generate the template fingerprint, and use the breathing patterns from the datasets for all four kinds of clothes, correspondingly, as the input of the authentication stage. The volunteer in this experiment is not one of the volunteers in Section 4.4.2.

Results: As shown in Figure 4.11b, the TPRs are almost all higher than 65%, and in particular, for 1 s segments, the TPRs are all over 75%. Therefore, BreathPass is feasible across different kinds of clothes.

4.5.6 Influence of Different Postures

Setup: As discussed in Section 4.3.1, different postures result in different breathing patterns. Therefore, to characterize the influence of different postures, we invite a volunteer to provide breathing patterns by the method discussed in Section 4.4.1 in the three most common postures, i.e., sitting, standing, and lying down. We use breathing patterns extracted from the sitting posture to generate the template fingerprint, and use breathing patterns extracted from all three postures, respectively, as inputs of the authentication stage. The volunteer in this experiment is not one of the volunteers in Section 4.4.2.

Results: As shown in Figure 4.12a, the sitting and standing postures have higher TPRs than lying down. The TPRs for sitting and standing are almost all higher than 60%, and in particular, for 1 s segments, the TPRs are all over 70%. Therefore, BreathPass is feasible across different postures.

4.5.7 Influence of Dynamic Status

Setup: Dynamic statuses such as walking or having just run may result in different breathing patterns. To verify whether BreathPass can still successfully authenticate the user under these dynamic statuses, we ask a volunteer to enroll his breathing pattern while sitting in a quiet room at rest, and to perform authentication while sitting in a quiet room at rest (marked baseline), while walking, and after running 500 m, respectively. We choose 1 s as the segment length because, from the previous experiments, we find that a 1 s segment length works well for most cases. The volunteer in this experiment is not one of the volunteers in Section 4.4.2.

Results: As shown in Table 4.1, walking has almost no effect on authentication. Authentication after running has a larger effect, as running significantly changes the breathing pattern; however, it still achieves a TPR of 78%, which means that BreathPass is feasible when the user is in a dynamic status.

Class   Baseline   Walking   Running   Outside   TV
TPR     97%        94%       78%       78%       80%

Table 4.1: TPRs under different dynamic statuses and environments.

4.5.8 Influence of Different Environments

Setup: To verify whether BreathPass can still successfully authenticate in different environments, we ask a volunteer to enroll his breathing pattern while sitting in a quiet room at rest, and to perform authentication while sitting in a quiet room at rest (marked baseline), outside while it is raining (lower noise), and near a TV set playing a concert at high volume (higher noise), respectively.
We choose 1 s as the segment length because, from the previous experiments, we find that a 1 s segment length works well for most cases. The volunteer in this experiment is not one of the volunteers in Section 4.4.2.

Results: As shown in Table 4.1, compared to the baseline, authenticating outside while it is raining and near the TV set decreases the TPR. This is probably because rain falling between the speaker and the chest wall affects the transmission and reflection of the ultrasound signals, and the suppression effect of the microphone [8] degrades the recording quality when the background noise is loud. The TPRs, however, are around 80%, which means that BreathPass is feasible in different environments.

4.5.9 Defending Against Replay Attacks

Setup: It is unlikely that a breathing pattern can be replayed by recording another person's breath, the way a voice recording can be replayed. In addition, our authentication system cannot be spoofed in an easy way, such as by making face/fingerprint/iris masks. The most practical replay attack is for an attacker to observe and imitate the user's breathing pattern. To verify whether BreathPass can defend against such replay attacks, we ask four groups of volunteers (each containing two volunteers) to sit in the same room in the same seat. In each group, the two volunteers are similar in terms of gender, age, height, and weight. In each group, one volunteer acts as an attacker and imitates the other's breath by controlling the pace of breathing to produce the same number of breathing cycles per minute. Then we use the non-attacker volunteer's breathing pattern in a group to enroll in BreathPass and use the breathing pattern from the attacker in that group to perform authentication. The breathing pattern segment is 1 s in these experiments. We only care about FPRs here, to see whether such a replay attack can break into the system.

Results: We obtain four FPRs, one per group: 3.6%, 3.8%, 9%, and 31%. The FPRs of the first three groups are lower than 10%, which is too low to be exploited to attack the authentication process. For the fourth group, the FPR rises to 31%. The reason is that these volunteers are more like "twins" than those in the other groups; their body shape, habits, and exercise level are similar as well. Therefore, the imitated breathing pattern may be similar to the original one. Even so, the 31% FPR is still far below our overall TPR of 73%. That said, we can sacrifice some efficiency (e.g., requiring two consecutive successful authentications) to improve security if the user wants.

4.5.10 Effectiveness of the Average Fingerprint

Setup: During our experiments, we found that even though all volunteers in Section 4.4.1 were sitting while their breathing patterns were sampled, the DNN-based model still could not achieve good performance without averaging. This is because even a tiny movement within the same posture can result in a different breathing pattern, which affects the overall performance. As discussed in Section 4.3.2, we introduce the average fingerprint technique to eliminate movement-dependent noise while preserving user-dependent differences. In this experiment, we build another model with the same DNN architecture as discussed in Section 4.3 but without the average fingerprint technique. We choose 1 s as the segment length because, from the previous experiments, we find that a 1 s segment length works well for most cases. We compare the performance of the models with and without the average fingerprint technique to show the effectiveness of the technique.
Specifically, we use the same training dataset to train the same model without the average fingerprint technique. After the model converges, we use the same test dataset as in Section 4.5.2 to test the performance.

Results: As shown in Figure 4.12b, the accuracy without the averaging technique is lower than that of the model with it. Furthermore, although the two models have similar TPRs, the FPR without the averaging technique is much higher than that of the model with it, which is unacceptable. The model without the averaging technique has a high FPR because it cannot eliminate movement-dependent noise and thus takes that noise as a feature when constructing the fingerprint. Therefore, the average fingerprint technique is necessary so that the model can successfully eliminate movement-dependent noise while preserving user-dependent differences.

4.5.11 Efficiency on Mobile Phones

Setup: To make BreathPass practical, the DNN model needs to finish inference on a mobile device within a reasonable time after a user samples his breathing pattern. To test the efficiency of BreathPass, we port our model to a Google Pixel 3a Android phone as discussed in Section 4.4.3. The application shows the time used by the DNN model along with the authentication result. We perform authentication 10 times. The configuration is the same as in the previous experiments, and we use 1 s breathing pattern segments as inputs. Specifically, we enroll 10 breathing signals, each 1 s long, extract the 10 corresponding fingerprints, and store them on the smartphone. During the authentication stage, after the user samples his 1 s breathing pattern, we first calculate the average of the 10 fingerprints, then feed the averaged fingerprint and the sampled breathing pattern to the model. The model extracts the fingerprint of the sampled breathing pattern and runs the comparator to produce the authentication result.

Results: We calculate the average running time, and the result is 855.7 ms, which shows that BreathPass can be used in practice.

4.6 Related Works

Ultrasound Sensing: Ultrasound sensing has been a popular area of research in recent years. Ultrasound sensing is based on distance measurement and positioning; after obtaining an object's trajectory, systems may employ classifiers or construct a model to detect the object's actions. Numerous techniques have been proposed to localize or track objects, e.g., techniques based on time-of-arrival (ToA) or time-difference-of-arrival (TDoA) [107, 108, 109, 110, 111], Doppler frequency shift (DFS) [112, 113, 114], or phase [99, 115, 116]. In particular, Liu et al. [107] achieve ultrasound positioning accuracy of tens of centimeters using the time-of-arrival (ToA) technique. Their system modulates signals emitted from several anchor nodes with known positions. When a microphone receives these signals, it first calculates the time difference between emission and reception, then calculates the distance between the microphone and each anchor. Finally, it uses trilateration to determine the microphone's position. LLAP [99] extracts the phase difference generated by the object's movement and achieves a 1-D tracking accuracy of 3.5 mm and a 2-D tracking accuracy of 4.6 mm. Vernier [115], on the other hand, leverages the vernier principle to improve tracking accuracy and achieves a 3D tracking error of less than 4 mm.
Although all of these techniques claim to be inaudible to humans, pets and infants can hear ultrasound at frequencies close to those of audible sounds. Therefore, UPS+ [3] leverages the nonlinearity effects of commercial off-the-shelf (COTS) microphones, which have been extensively studied by Backdoor [8], and proposes a new ultrasound positioning system. This system employs ultrasound at a frequency that is inaudible to pets and infants, making the ultrasound positioning system more environmentally friendly. Additionally, ultrasound sensing systems can be used to complete a variety of sophisticated tasks. For example, existing research efforts [117, 118] employ ultrasound signals to detect sleep apnea. Specifically, these works emit modulated ultrasound, i.e., an FMCW chirp or a pseudo-white-noise signal, and then use a classification algorithm to determine whether an apnea symptom exists. Moreover, SpiroSonic [10] uses reflected ultrasonic signals to detect whether the user's pulmonary function is normal. BreathListener [100] also uses reflected ultrasonic signals to quantify the driver's breathing status, thereby determining whether or not the driver is driving safely. AcuTe [119] measures ambient temperature via ultrasonic sensing by utilizing the linear relationship between temperature and sound speed.

Wireless Authentication: Recent research efforts have primarily concentrated on device-to-device [120, 121, 122, 123] and human-to-device authentication [124, 125, 126, 81, 80, 127]. For example, in the case of device-to-device authentication, GeneWave [120] derives the initial acoustic channel response and creates a coding scheme for key agreement and exchange. DeMiCPU [123] finds that different CPUs generate distinct magnetic-induction-signal-based fingerprints, then designs a DeMiCPU sensor to read the fingerprint and classify it using ExtraTrees, thereby performing authentication. As for human-to-device authentication, Cardiac Scan [80] puts two radar sensors in front of and behind the user, respectively. During authentication, these two radar sensors first capture the user's cardiac motion, then extract fiducial-based invariant identity descriptors and match them against the owner template captured in advance to perform authentication. It requires the target device to be equipped with radar sensors, which increases the cost of operation and precludes widespread use. Wang et al. [126] measure the heartbeat using the built-in accelerometer of smartphones, then extract features using the wavelet transform and apply an SVM model to determine whether the captured heartbeat is from an authorized user. WiHF [127] uses a WiFi signal to deduce the user's identifying gesture and then performs user identification using a DNN. This method requires a WiFi connection and operates exclusively on the server side, making it unsuitable for client-side outdoor scenarios. BreathPrint [81] conducted a comprehensive and careful survey of existing clinical studies [128, 129, 130, 131, 132, 133] and summarizes that different people have distinct breathing sounds; it then designed a system that requires the user to perform breathing gestures (e.g., sniff, normal, deep breath) towards a microphone. It extracts GFCC features and uses a GMM model to determine whether or not the recorded breathing sounds originated from the authorized user. While this is a non-invasive solution, it requires users to place a microphone near their nose, which adds to the cost of operation.
Meanwhile, during the COVID-19 pandemic, people typically wear a face cover, which prevents the microphone from recording the sound of breathing.

4.7 Discussion and Future Work

Although BreathPass can become a candidate "who you are" unlocking mechanism, making it practical is still challenging because of the following aspects, which we leave as future work:

Sampling position: In our experiments, all of the volunteers were strictly limited in where they placed the mobile phone, i.e., 5 to 10 cm in front of and perpendicular to their chest wall. In practice, however, users may hold their smartphone at any distance and any orientation, which significantly affects the sampled breathing pattern. Therefore, to make BreathPass practical, we need to let users hold their smartphones freely.

Lower FPR: An authentication system should provide an extremely low FPR in order to maintain security. Although BreathPass achieves a 5% FPR in general, this is still far from the authentication systems used in practice. A lower FPR with a reasonable TPR is needed before BreathPass can be used in practice.

CHAPTER 5

CONCLUSION

This dissertation shows various unconventional ways to exploit the side effects of devices. In Chapter 2, we present RainbowLight, a high-precision 3D visible-light-based localization system. Compared with existing approaches, RainbowLight does not require special hardware design or pre-collected light features. RainbowLight works on COTS mobile phones without strict requirements on how the user holds the phone. It works well for different types of lamps as well as in the lights-off scenario. These features significantly reduce deployment, maintenance, and usage overhead. The evaluation results show that RainbowLight achieves an average localization error of 3.3 cm in 2D and 9.6 cm in 3D. We believe RainbowLight can be applied to today's buildings with a very small overhead to enable many visible-light-based applications.

Acoustic privacy protection has always been an important topic. In Chapter 3, we study the nonlinear effects of commercial off-the-shelf microphones. Based on our study, we propose Patronus, which leverages the nonlinear effects to prevent unauthorized devices from recording speech while simultaneously allowing authorized devices to record clear speech audio. We implement and evaluate Patronus in a wide variety of representative scenarios. Results show that Patronus effectively blocks unauthorized devices from making secret recordings while allowing authorized devices to successfully make clear recordings.

In Chapter 4, we propose BreathPass, a novel biometric authentication method that is more resilient to replay attacks and highly flexible across mobile devices. It samples breathing patterns from users and extracts fingerprints from them to achieve authentication. We believe that BreathPass can become a candidate "who you are" unlocking mechanism, or can complement another, less trustworthy mechanism such as eye recognition to provide authentication jointly, and that it continues to work while the user wears different kinds of face covers and clothes, adopts different postures, is in a dynamic status, and is in different environments.

We believe that with careful study and exploitation of the side effects of devices, and careful utilization of ultrasound signals, smart devices can fully take advantage of their power and resources and achieve more powerful functionalities.
There are still many directions opened by this dissertation, generating multiple future works that remain to be addressed. Regarding visible light positioning, existing works mostly focus on active positioning, i.e., the target either needs to be equipped with a camera or needs an anchor attached to it. How to achieve passive visible light positioning with low deployment and usage cost is still an open problem. Regarding ultrasonic privacy protection, because of the capacity of the NLMS filter, the PESQ and SRVA metrics are limited and still have room for improvement. Many research efforts have employed artificial intelligence methods, such as autoencoders or generative adversarial networks, to improve denoising performance, and these could also be used for descrambling. In addition, many security architectures on the chipset, such as Intel SGX, ARM TrustZone, or RISC-V Keystone, provide highly effective methods to protect memory physically. This opens a new door to enhancing the security level of IoT devices and applications.

BIBLIOGRAPHY

[1] Chi Zhang and Xinyu Zhang. Litell: robust indoor localization using unmodified light fixtures. In Proceedings of ACM MobiCom, 2016.

[2] Shilin Zhu, Chi Zhang, and Xinyu Zhang. Automating visual privacy protection using a smart led. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, pages 329–342, 2017.

[3] Qiongzheng Lin, Zhenlin An, and Lei Yang. Rebooting ultrasonic positioning systems for ultrasound-incapable smart devices. In The 25th Annual International Conference on Mobile Computing and Networking, pages 1–16, 2019.

[4] Zhou Zhou, Mohsen Kavehrad, and Peng Deng. Indoor positioning algorithm using light-emitting diode visible light communications. Optical Engineering, 51(8):085009, 2012.

[5] Bo Xie, Guang Tan, and Tian He. Spinlight: A high accuracy and robust light positioning system for indoor applications. In Proceedings of ACM SenSys, 2015.

[6] Ye-Sheng Kuo, Pat Pannuto, Ko-Jen Hsiao, and Prabal Dutta. Luxapose: Indoor positioning with mobile phones and visible light. In Proceedings of ACM MobiCom, 2014.

[7] Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu. Dolphinattack: Inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 103–117, 2017.

[8] Nirupam Roy, Haitham Hassanieh, and Romit Roy Choudhury. Backdoor: Making microphones hear inaudible sounds. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 2–14, 2017.

[9] Yitao He, Junyu Bian, Xinyu Tong, Zihui Qian, Wei Zhu, Xiaohua Tian, and Xinbing Wang. Canceling inaudible voice commands against voice control systems. In The 25th Annual International Conference on Mobile Computing and Networking, pages 1–15, 2019.

[10] Xingzhe Song, Boyuan Yang, Ge Yang, Ruirong Chen, Erick Forno, Wei Chen, and Wei Gao. Spirosonic: monitoring human lung function via acoustic sensing on commodity smartphones. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–14, 2020.

[11] Swarun Kumar, Stephanie Gil, Dina Katabi, and Daniela Rus. Accurate indoor localization with zero start-up cost. In Proceedings of ACM MobiCom, 2014.

[12] Fadel Adib, Zachary Kabelac, and Dina Katabi. Multi-person localization via rf body reflections. In Proceedings of USENIX NSDI, 2015.

[13] Fadel Adib, Zach Kabelac, Dina Katabi, and Robert C Miller.
3d tracking via body radio reflections. In Proceedings of USENIX NSDI, 2014. 104 [14] Mei Wang, Zhehui Zhang, Xiaohua Tian, and Xinbing Wang. Temporal correlation of the rss improves accuracy of fingerprinting localization. In Proceedings of IEEE INFOCOM, pages 1–9, 2016. [15] Kun Qian, Chenshu Wu, Zheng Yang, Zimu Zhou, Xu Wang, and Yunhao Liu. Enabling phased array signal processing for mobile wifi devices. IEEE Transactions on Mobile Computing, 17(8):1820–1833, 2017. [16] Chunhui Duan, Lei Yang, Qiongzheng Lin, and Yunhao Liu. Tagspin: High accuracy spatial calibration of rfid antennas via spinning tags. IEEE Transactions on Mobile Computing, 17(10):2438–2451, 2018. [17] Yu-Lin Wei, Chang-Jung Huang, Hsin-Mu Tsai, and Kate Ching-Ju Lin. Celli: Indoor positioning using polarized sweeping light beams. In Proceedings of ACM MobiSys, 2017. [18] Bo Xie, Kongyang Chen, Guang Tan, Mingming Lu, Yunhuai Liu, Jie Wu, and Tian He. Lips: A light intensity–based positioning system for indoor environments. ACM Transactions on Sensor Networks, 12(4):28, 2016. [19] Radu Stoleru, Tian He, John A. Stankovic, and David Luebke. A high-accuracy, low-cost localization system for wireless sensor networks. In Proceedings of ACM SenSys, 2005. [20] Song Liu and Tian He. Smartlight: Light-weight 3d indoor localization using a single led lamp. In Proceedings of ACM SenSys, 2017. [21] Nishkam Ravi and Liviu Iftode. Fiatlux: Fingerprinting rooms using light intensity. In Proceedings of Pervasive, 2007. [22] Shilin Zhu and Xinyu Zhang. Enabling high-precision visible light localization in today’s buildings. In Proceedings of ACM MobiSys, 2017. [23] Qiang Xu, Rong Zheng, and Steve Hranilovic. Idyll: indoor localization using inertial and light sensors on smartphones. In Proceedings of ACM Ubicomp, 2015. [24] Zhice Yang, Zeyu Wang, Jiansong Zhang, Chenyu Huang, and Qian Zhang. Wearables can afford: Light-weight indoor positioning with visible light. In Proceedings of ACM MobiSys, 2015. [25] Masaki Yoshino, Shinichiro Haruyama, and Masao Nakagawa. High-accuracy positioning system using visible led lights and image sensor. In Proceedings of IEEE RWS, 2008. [26] S-H Yang, E-M Jeong, D-R Kim, H-S Kim, Y-H Son, and S-K Han. Indoor three-dimensional location estimation based on led visible light communication. Electronics Letters, 49(1):54– 56, 2013. [27] Ruipeng Gao, Yang Tian, Fan Ye, Guojie Luo, Kaigui Bian, Yizhou Wang, Tao Wang, and Xiaoming Li. Sextant: Towards ubiquitous indoor localization service by photo-taking of the environment. IEEE Transactions on Mobile Computing, 15(2):460–474, 2016. 105 [28] Chi Zhang and Xinyu Zhang. Pulsar: Towards ubiquitous visible light localization. In Proceedings of ACM MobiCom, 2017. [29] Zhao Tian, Kevin Wright, and Xia Zhou. The darklight rises: Visible light communication in the dark. In Proceedings of ACM MobiCom, pages 2–15, 2016. [30] Edward Collett. Field guide to polarization, volume 15. SPIE press Bellingham, 2005. [31] WikiPedia. Birefringence. https://en.wikipedia.org/wiki/Birefringence. [32] Wikipedia. Snell’s law. https://en.wikipedia.org/wiki/Snell\%27s_law. [33] SHEN Wei-min. Interference pattern of convergent light for a uniaxial crystal with optical axis parallel to surface. College Physics, 6:001, 2005. [34] Dennis H Goldstein. Polarized light. CRC press, 2017. [35] Wikipedia. Color wheel. https://en.wikipedia.org/wiki/Color_wheel\#Color_wheels_and_ paint_color_mixing. 
[36] Zhao Tian, Yu-Lin Wei, Wei-Nin Chang, Xi Xiong, Changxi Zheng, Hsin-Mu Tsai, Kate Ching-Ju Lin, and Xia Zhou. Augmenting indoor inertial tracking with polarized light. In Proceedings of ACM MobiSys, pages 362–375, 2018. [37] Yuanqing Zheng, Guobin Shen, Liqun Li, Chunshui Zhao, Mo Li, Feng Zhao, Yuanqing Zheng, Guobin Shen, Liqun Li, Chunshui Zhao, et al. Travi-navi: Self-deployable indoor navigation system. IEEE/ACM Transactions on Networking (TON), 25(5):2655–2669, 2017. [38] Jinsong Han, Chen Qian, Xing Wang, Dan Ma, Jizhong Zhao, Wei Xi, Zhiping Jiang, and Zhi Wang. Twins: Device-free object tracking using passive tags. IEEE/ACM Transactions on Networking (TON), 24(3):1605–1617, 2016. [39] Jizhong Zhao, Wei Xi, Yuan He, Yunhao Liu, Xiang-Yang Li, Lufeng Mo, and Zheng Yang. Localization of wireless sensor networks in the wild: Pursuit of ranging quality. IEEE/ACM Transactions on Networking (ToN), 21(1):311–323, 2013. [40] Kun Qian, Chenshu Wu, Zheng Yang, Yunhao Liu, Fugui He, and Tianzhang Xing. Enabling contactless detection of moving humans with dynamic speeds using csi. ACM Transactions on Embedded Computing Systems (TECS), 17(2):52, 2018. [41] Zuwei Yin, Chenshu Wu, Zheng Yang, and Yunhao Liu. Peer-to-peer indoor navigation using smartphones. IEEE Journal on Selected Areas in Communications, 35(5):1141–1153, 2017. [42] Kun Qian, Chenshu Wu, Yi Zhang, Guidong Zhang, Zheng Yang, and Yunhao Liu. Widar2. 0: Passive human tracking with a single wi-fi link. Proceedings of ACM MobiSys, 2018. [43] Pat Pannuto, Benjamin Kempke, Li-Xuan Chuo, David Blaauw, and Prabal Dutta. Harmo- nium: Ultra wideband pulse generation with bandstitched recovery for fast, accurate, and robust indoor localization. ACM Transactions on Sensor Networks (TOSN), 14(2):11, 2018. 106 [44] Chunyi Peng, Guobin Shen, and Yongguang Zhang. Beepbeep: A high-accuracy acoustic- based system for ranging and localization using cots devices. ACM Transactions on Embed- ded Computing Systems, 11(1):4:1–4:29, 2012. [45] K. Liu, X. Liu, L. Xie, and X. Li. Towards accurate acoustic localization on a smartphone. In Proceedings of IEEE INFOCOM, 2013. [46] K. Liu, X. Liu, and X. Li. Guoguo: Enabling fine-grained smartphone localization via acoustic anchors. IEEE Transactions on Mobile Computing, 15(5):1144–1156, 2016. [47] Pengfei Zhou, Yuanqing Zheng, and Mo Li. How long to wait?: Predicting bus arrival time with mobile phone based participatory sensing. In Proceedings of ACM MobiSys, 2014. [48] Yin Chen, Jie Liu, Dimitrios Lymberopoulos, and Bodhi and Priyantha. Fm-based indoor localization. In Proceedings of ACM MobiSys, 2012. [49] Yonghang Jiang, Zhenjiang Li, and Jianping Wang. Ptrack: Enhancing the applicability of pedestrian tracking with wearables. IEEE Transactions on Mobile Computing, 2018. [50] Ruipeng Gao, Mingmin Zhao, Tao Ye, Fan Ye, Yizhou Wang, and Guojie Luo. Smartphone- based real time vehicle tracking in indoor parking structures. IEEE Transactions on Mobile Computing, 16(7):2023–2036, 2017. [51] The Guardian. Apple apologises for allowing workers to listen to siri recordings. https: //www.theguardian.com/technology/2019/aug/29/apple-apologises-listen-siri-recordings. (Accessed on Feb. 28, 2020). [52] CNBC. Amazon echo recorded conversation, sent to random person: report. https://www. cnbc.com/2018/05/24/amazon-echo-recorded-conversation-sent-to-random-person-report. html. (Accessed on Feb. 28, 2020). [53] The Guardian. Ukraine prime minister offers resignation after leaked recording. 
https: //www.theguardian.com/world/2020/jan/17/ukraine-prime-minister-oleksiy-goncharuk-o ffers-resignation-after-leaked-recording. (Accessed on Feb. 28, 2020). [54] Yu-Chih Tung and Kang G. Shin. Exploiting sound masking for audio privacy in smart- phones. In Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security, page 257–268, 2019. [55] Anti-eavesdropping and recording blocker device, China Patent 201320228440, Oct. 2013. [56] Nirupam Roy, Sheng Shen, Haitham Hassanieh, and Romit Roy Choudhury. Inaudible voice commands: The long-range attack and defense. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 547–560, 2018. [57] Tao Chen, Longfei Shangguan, Zhenjiang Li, and Kyle Jamieson. Metamorph: Injecting inaudible commands into over-the-air voice controlled systems. In Network and Distributed Systems Security (NDSS) Symposium, 2020. 107 [58] Xinyan Zhou, Xiaoyu Ji, Chen Yan, Jiangyi Deng, and Wenyuan Xu. Nauth: Secure face- to-face device authentication via nonlinearity. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pages 2080–2088. IEEE, 2019. [59] Qiben Yan, Kehai Liu, Qin Zhou, Hanqing Guo, and Ning Zhang. Surfingattack: Interactive hidden attack on voice assistants using ultrasonic guided wave. In Network and Distributed Systems Security (NDSS) Symposium, 2020. [60] Aleksandr Rovner. The principle of ultrasound. https://www.echopedia.org/wiki/The_princ iple_of_ultrasound, 2015. [61] Ali H Sayed. Fundamentals of adaptive filtering. John Wiley & Sons, 2003. [62] Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 2, pages 749–752. IEEE, 2001. [63] Anran Wang, Chunyi Peng, Ouyang Zhang, Guobin Shen, and Bing Zeng. Inframe: Multi- flexing full-frame visible communication channel for humans and devices. In Proceedings of the 13th ACM Workshop on Hot Topics in Networks, pages 1–7, 2014. [64] Anran Wang, Zhuoran Li, Chunyi Peng, Guobin Shen, Gan Fang, and Bing Zeng. Inframe++ achieve simultaneous screen-human viewing and hidden screen-camera communication. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services, pages 181–195, 2015. [65] Viet Nguyen, Yaqin Tang, Ashwin Ashok, Marco Gruteser, Kristin Dana, Wenjun Hu, Eric Wengrowski, and Narayan Mandayam. High-rate flicker-free screen-camera communica- tion with spatially adaptive embedding. In IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. IEEE, 2016. [66] Kai Zhang, Yi Zhao, Chenshu Wu, Chaofan Yang, Kehong Huang, Chunyi Peng, Yunhao Liu, and Zheng Yang. Chromacode: A fully imperceptible screen-camera communication system. IEEE Transactions on Mobile Computing, 2019. [67] Qian Wang, Kui Ren, Man Zhou, Tao Lei, Dimitrios Koutsonikolas, and Lu Su. Messages behind the sound: real-time hidden acoustic signal capture with smartphones. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, pages 29–41, 2016. [68] Man Zhou, Qian Wang, Kui Ren, Dimitrios Koutsonikolas, Lu Su, and Yanjiao Chen. Dolphin: Real-time hidden acoustic signal capture with smartphones. IEEE Transactions on Mobile Computing, 18(3):560–573, 2018. 
[69] Lan Zhang, Cheng Bo, Jiahui Hou, Xiang-Yang Li, Yu Wang, Kebin Liu, and Yunhao Liu. Kaleido: You can watch it but cannot record it. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, pages 372–385, 2015. 108 [70] Ingo R Titze and Daniel W Martin. Principles of voice production, 1998. [71] Ronald J Baken and Robert F Orlikoff. Clinical measurement of speech and voice. Cengage Learning, 2000. [72] Sheng Shen, Nirupam Roy, Junfeng Guan, Haitham Hassanieh, and Romit Roy Choudhury. Mute: bringing iot to noise cancellation. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 282–296, 2018. [73] ITUT Rec. P. 800.1, mean opinion score (mos) terminology. International Telecommunica- tion Union, Geneva, 2006. [74] Mika Wilson. Pesq - what is it and how could it transform your customer experience? https://www.spearline.com/blog/post/pesq---what-is-it-and-how-could-it-transform-you r-customer-experience-/, 2018. (Accessed on Oct. 2, 2020). [75] Kamil Wojcicki. Pesq matlab wrapper. https://www.mathworks.com/matlabcentral/fileexc hange/33820-pesq-matlab-wrapper. (Accessed on Mar. 6, 2020). [76] About Face ID advanced technology - Apple Support. https://support.apple.com/en-us/HT 208108. (Accessed on Nov. 04, 2021). [77] In-screen fingerprint sensors coming to 100 million phones by 2019? - cnet. https://www. cnet.com/tech/mobile/in-screen-fingerprint-sensors-coming-to-100-million-phones-by-2 019-report/. (Accessed on Nov. 04, 2021). [78] Zia Saquib, Nirmala Salam, Rekha Nair, and Nipun Pandey. Voiceprint recognition systems for remote authentication-a survey. International Journal of Hybrid Information Technology, 4(2):79–97, 2011. [79] John Daugman. New methods in iris recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(5):1167–1175, 2007. [80] Feng Lin, Chen Song, Yan Zhuang, Wenyao Xu, Changzhi Li, and Kui Ren. Cardiac scan: A non-contact and continuous heart-based user authentication system. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, pages 315–328, 2017. [81] Jagmohan Chauhan, Yining Hu, Suranga Seneviratne, Archan Misra, Aruna Seneviratne, and Youngki Lee. Breathprint: Breathing acoustics-based user authentication. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 278–291, 2017. [82] Yinghui Li, Zhichao Cao, and Jiliang Wang. Gazture: Design and implementation of a gaze based gesture control system on tablets. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(3):1–17, 2017. [83] Hongbo Jiang, Hangcheng Cao, Daibo Liu, Jie Xiong, and Zhichao Cao. Smileauth: Using dental edge biometrics for user authentication on smartphones. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(3):1–24, 2020. 109 [84] Yongchao Ye, Lingjie Lao, Diqun Yan, and Lang Lin. Detection of replay attack based on normalized constant q cepstral feature. In 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), pages 407–411. IEEE, 2019. [85] S Saranya, Suvidha Rupesh Kumar, and B Bharathi. Deep learning approach: detection of replay attack in asv systems. In International Conference on Soft Computing and Signal Processing, pages 291–298. Springer, 2019. [86] Bin Hao, Xiali Hei, Yazhou Tu, Xiaojiang Du, and Jie Wu. 
Voiceprint-based access control for wireless insulin pump systems. In 2018 IEEE 15th international conference on mobile ad hoc and sensor systems (MASS), pages 245–253. IEEE, 2018. [87] Miroslav Goljan, Jessica Fridrich, and Mo Chen. Defending against fingerprint-copy attack in sensor-based camera identification. IEEE Transactions on Information Forensics and Security, 6(1):227–236, 2010. [88] Roberto Caldelli, Irene Amerini, and Andrea Novi. An analysis on attacker actions in fingerprint-copy attack in source camera identification. In 2011 IEEE International Workshop on Information Forensics and Security, pages 1–6. IEEE, 2011. [89] Robert W Frischholz and Alexander Werner. Avoiding replay-attacks in a face recognition system using head-pose estimation. In 2003 IEEE International SOI Conference. Proceed- ings (Cat. No. 03CH37443), pages 234–235. IEEE, 2003. [90] Gang Pan, Zhaohui Wu, and Lin Sun. Liveness detection for face recognition. Recent advances in face recognition, pages 109–124, 2008. [91] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer. End-to-end text- dependent speaker verification. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5115–5119. IEEE, 2016. [92] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883. IEEE, 2018. [93] Xiaojia Zhao, Yang Shao, and DeLiang Wang. Casa-based robust speaker identification. IEEE Transactions on Audio, Speech, and Language Processing, 20(5):1608–1616, 2012. [94] Chengkun Jiang, Junchen Guo, Yuan He, Meng Jin, Shuai Li, and Yunhao Liu. mmvib: micrometer-level vibration measurement with mmwave radar. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–13, 2020. [95] Hideo Kaneko and Jun Horie. Breathing movements of the chest and abdominal wall in healthy subjects. Respiratory care, 57(9):1442–1451, 2012. [96] Maria Ragnarsdóttir and Ella Kolbrun Kristinsdóttir. Breathing movements and breathing patterns among healthy men and women 20–69 years of age. Respiration, 73(1):48–54, 2006. 110 [97] Pablo Martinez-Lozano Sinues, Malcolm Kohler, and Renato Zenobi. Human breath analysis may support the existence of individual metabolic phenotypes. PloS one, 8(4):e59909, 2013. [98] JERE Mead and STEPHEN H Loring. Analysis of volume displacement and length changes of the diaphragm during breathing. Journal of Applied Physiology, 53(3):750–755, 1982. [99] Wei Wang, Alex X Liu, and Ke Sun. Device-free gesture tracking using acoustic signals. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, pages 82–94, 2016. [100] Xiangyu Xu, Jiadi Yu, Yingying Chen, Yanmin Zhu, Linghe Kong, and Minglu Li. Breathlis- tener: Fine-grained breathing monitoring in driving environments utilizing acoustic signals. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applica- tions, and Services, pages 54–66, 2019. [101] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016. [102] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. 
[103] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014. [104] Android. Aaudio library. https://developer.android.com/ndk/guides/audio/aaudio/aaudio. (Accessed on Nov. 02, 2021). [105] Apache. Commons math: The apache commons mathematics library. https://commons.ap ache.org/proper/commons-math/. (Accessed on Nov. 02, 2021). [106] PyTorch. Torchscript. https://pytorch.org/docs/stable/jit.html. (Accessed on Nov. 02, 2021). [107] Kaikai Liu, Xinxin Liu, Lulu Xie, and Xiaolin Li. Towards accurate acoustic localization on a smartphone. In 2013 Proceedings IEEE INFOCOM, pages 495–499. IEEE, 2013. [108] Kaikai Liu, Xinxin Liu, and Xiaolin Li. Acoustic ranging and communication via microphone channel. In 2012 IEEE Global Communications Conference (GLOBECOM), pages 291–296. IEEE, 2012. [109] Kaikai Liu, Xinxin Liu, and Xiaolin Li. Guoguo: Enabling fine-grained smartphone local- ization via acoustic anchors. IEEE transactions on mobile computing, 15(5):1144–1156, 2015. [110] Chunyi Peng, Guobin Shen, and Yongguang Zhang. Beepbeep: A high-accuracy acoustic- based system for ranging and localization using cots devices. ACM Transactions on Embed- ded Computing Systems (TECS), 11(1):1–29, 2012. 111 [111] Hyosu Kim, Anish Byanjankar, Yunxin Liu, Yuanchao Shu, and Insik Shin. Ubitap: Lever- aging acoustic dispersion for ubiquitous touch interface on solid surfaces. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems, pages 211–223, 2018. [112] Joseph Paradiso, Craig Abler, Kai-yuh Hsiao, and Matthew Reynolds. The magic carpet: physical sensing for immersive environments. In CHI’97 Extended Abstracts on Human Factors in Computing Systems, pages 277–278. 1997. [113] Kaustubh Kalgaonkar and Bhiksha Raj. One-handed gesture recognition using ultrasonic doppler sonar. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1889–1892. IEEE, 2009. [114] Stephen P Tarzia, Robert P Dick, Peter A Dinda, and Gokhan Memik. Sonar-based measure- ment of user presence and attention. In Proceedings of the 11th international conference on Ubiquitous computing, pages 89–92, 2009. [115] Yunhao Liu, Jiliang Wang, Yunting Zhang, Linsong Cheng, Weiyi Wang, Zhao Wang, Weimin Xu, and Zhenjiang Li. Vernier: Accurate and fast acoustic motion tracking using mobile devices. IEEE Transactions on Mobile Computing, 2019. [116] Ke Sun, Ting Zhao, Wei Wang, and Lei Xie. Vskin: Sensing touch gestures on surfaces of mobile devices using acoustic signals. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 591–605, 2018. [117] Rajalakshmi Nandakumar, Shyamnath Gollakota, and Nathaniel Watson. Contactless sleep apnea detection on smartphones. In Proceedings of ACM MobiSys, pages 45–57, 2015. [118] Anran Wang, Jacob E Sunshine, and Shyamnath Gollakota. Contactless infant monitoring using white noise. In Proceedings of ACM MobiCom, pages 1–16, 2019. [119] Chao Cai, Zhe Chen, Henglin Pu, Liyuan Ye, Menglan Hu, and Jun Luo. Acute: acoustic thermometer empowered by a single smartphone. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems, pages 28–41, 2020. [120] Pengjin Xie, Jingchao Feng, Zhichao Cao, and Jiliang Wang. Genewave: Fast authentication and key agreement on commodity mobile devices. 
IEEE/ACM Transactions on Networking, 26(4):1688–1700, 2018. [121] Hongbo Liu, Yang Wang, Jie Yang, and Yingying Chen. Fast and practical secret key extraction by exploiting channel response. In 2013 Proceedings IEEE INFOCOM, pages 3048–3056. IEEE, 2013. [122] Jinsong Han, Chen Qian, Panlong Yang, Dan Ma, Zhiping Jiang, Wei Xi, and Jizhong Zhao. Geneprint: Generic and accurate physical-layer identification for uhf rfid tags. IEEE/ACM Transactions on Networking, 24(2):846–858, 2015. [123] Yushi Cheng, Xiaoyu Ji, Juchuan Zhang, Wenyuan Xu, and Yi-Chao Chen. Demicpu: Device fingerprinting with magnetic signals radiated by cpu. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1149–1170, 2019. 112 [124] Kevin R Farrell, Richard J Mammone, and Khaled T Assaleh. Speaker recognition using neural networks and conventional classifiers. IEEE Transactions on speech and audio processing, 2(1):194–205, 1994. [125] Michael Schmidt and Herbert Gish. Speaker identification via support vector classifiers. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Confer- ence Proceedings, volume 1, pages 105–108. IEEE, 1996. [126] Lei Wang, Kang Huang, Ke Sun, Wei Wang, Chen Tian, Lei Xie, and Qing Gu. Unlock with your heart: Heartbeat-based authentication on commercial mobile phones. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies, 2(3):1–22, 2018. [127] Chenning Li Li, Manni Liu, and Zhichao Cao. Wihf: Gesture and user recognition with wifi. IEEE Transactions on Mobile Computing, 2020. [128] Volker Gross, Anke Dittmar, Thomas Penzel, Frank Schuttler, and Peter Von Wichert. The relationship between normal lung sounds, age, and gender. American journal of respiratory and critical care medicine, 162(3):905–909, 2000. [129] Hans Pasterkamp, Steve S Kraman, and George R Wodicka. Respiratory sounds: ad- vances beyond the stethoscope. American journal of respiratory and critical care medicine, 156(3):974–987, 1997. [130] Hans Pasterkamp, Richard E Powell, and Ignacio Sanchez. Lung sound spectra at standard- ized air flow in normal infants, children, and adults. American journal of respiratory and critical care medicine, 154(2):424–430, 1996. [131] Hans Pasterkamp, Jürgen Schäfer, and George R Wodicka. Posture-dependent change of tracheal sounds at standardized flows in patients with obstructive sleep apnea. Chest, 110(6):1493–1498, 1996. [132] Ignacio Sanchez and Hans Pasterkamp. Tracheal sound spectra depend on body height. American Review of Respiratory Disease, 148:1083–1083, 1993. [133] JT Sharp, JP Henry, SK Sweany, WR Meadows, RJ Pietras, et al. The total work of breathing in normal and obese men. The Journal of clinical investigation, 43(4):728–739, 1964. 113