SIGNAL PROCESSING AND MACHINE LEARNING APPROACHES TO ENABLING ADVANCED SENSING AND NETWORKING CAPABILITIES IN EVERYDAY INFRASTRUCTURE AND ELECTRONICS

By

Kamran Ali

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Computer Science – Doctor of Philosophy

2020

ABSTRACT

Mainstream commercial off-the-shelf (COTS) electronic devices of daily use are usually designed and manufactured to serve a very specific purpose. For example, WiFi routers and network interface cards (NICs) are designed for high-speed wireless communication, RFID readers and tags are designed to identify and track items in supply chains, and smartphone vibrator motors are designed to provide haptic feedback (e.g., notifications in silent mode) to users. This dissertation revisits the physical layer of such everyday COTS devices, either to leverage the signals obtained from their physical layers to develop novel sensing applications, or to modify and improve their PHY/MAC layer protocols to enable more useful deployment scenarios and networking applications. In both cases the devices keep their original purpose intact, because we introduce only software- or firmware-level changes and avoid hardware-level changes entirely. Adding new functionality to existing everyday infrastructure and electronics has advantages in both cost and convenience of deployment, as these devices (and their protocols) are already mainstream, easily available, and often already purchased and deployed for their primary purpose.

In our work on WiFi-signal-based sensing, we propose signal processing and machine learning approaches that enable fine-grained gesture recognition and sleep monitoring using COTS WiFi devices. In our work on gesture recognition, we show for the first time that WiFi signals can be used to recognize small gestures with high accuracy. In our work on sleep monitoring, we propose for the first time a WiFi channel state information (CSI) based sleep quality monitoring scheme that robustly tracks breathing and body/limb activity related vital signs throughout a night's sleep, independently of the individual and the environment.
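As a rough illustration of how a breathing rate can be read off a periodic sensing signal, the sketch below picks the strongest spectral peak inside an assumed breathing band. It is not the dissertation's pipeline: the function name `breathing_rate_bpm`, the 0.1 to 0.5 Hz band (6 to 30 breaths per minute), the 10 Hz sampling rate, and the synthetic input standing in for a CSI amplitude stream are all assumptions made for this example.

```python
import cmath
import math


def breathing_rate_bpm(samples, fs, f_lo=0.1, f_hi=0.5):
    """Estimate breaths per minute as the strongest DFT peak in [f_lo, f_hi] Hz.

    `samples` is a real-valued time series sampled at `fs` Hz. The band limits
    are assumed values for resting breathing, not taken from the dissertation.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # remove the static (DC) component

    best_freq, best_mag = 0.0, -1.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if not (f_lo <= freq <= f_hi):
            continue  # skip bins outside the assumed breathing band
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_freq, best_mag = freq, abs(coeff)
    return best_freq * 60.0


# Synthetic stand-in for a CSI amplitude stream: a 0.25 Hz breathing component,
# a weaker 1.2 Hz out-of-band component, and a constant offset.
fs = 10.0
signal = [
    5.0
    + 1.0 * math.sin(2 * math.pi * 0.25 * (i / fs))
    + 0.3 * math.sin(2 * math.pi * 1.2 * (i / fs))
    for i in range(600)
]
bpm = breathing_rate_bpm(signal, fs)
```

With 60 seconds of data the frequency resolution is 1/60 Hz, i.e. one breath per minute; real CSI streams would additionally need noise removal and handling of body movements before a peak of this kind is meaningful.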

In our work on RFID-signal-based sensing, we propose signal processing and machine learning approaches to image customer activity in front of display items in places such as retail stores using COTS monostatic RFID devices (i.e., devices that use a single antenna at a time both to transmit RFID signals to the tags and to receive signals from them). The key novelty of this work is multi-person activity tracking in front of display items, achieved by constructing coarse-grained images through robust RFID imaging based on analytical-model-driven deep learning. We implemented our scheme using a COTS RFID reader and tags.

In our work on smartphone vibration based sensing, we propose a robust and practical vibration-based sensing scheme that works across smartphones with different hardware, extracts fine-grained vibration signatures of different surfaces, and is robust to environmental noise and hardware-based irregularities. A useful application of this sensing is symbolic localization/tagging, e.g., figuring out whether a user's device is in their hand, in their pocket, or on their bedroom table. Such symbolic tagging of locations provides indirect information about user activities and intentions without any dedicated infrastructure, which in turn enables useful services such as context-aware notifications and alarms. To make our scheme easily scalable and compatible with mainstream COTS smartphones, we design our signal processing and machine learning pipeline so that it relies only on the built-in vibration motor and microphone for sensing, while remaining robust to hardware irregularities and background environmental noise. We tested our scheme on Android smartphones.

In our work on powerline communications (PLCs), we propose a distributed spectrum sharing scheme for enterprise-level PLC mesh networks. This work is a major step towards using existing COTS PLC devices to connect different types of Internet of Things (IoT) devices for sensing and control applications in large campuses such as enterprises. Our work builds on the identification of a key weakness of the existing HomePlug AV (HPAV) PLC protocol: it does not support spectrum sharing. Currently, each link operates over the whole available spectrum, and therefore only one link can operate at a time. Our proposed spectrum sharing scheme significantly boosts both aggregate and per-link throughput by allowing multiple links to communicate concurrently, while requiring only a few modifications to the existing HPAV protocol.
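The throughput argument in the powerline paragraph can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions: the per-sub-band rates are invented numbers, the two function names are hypothetical, links on disjoint sub-bands are assumed not to interfere, and the greedy "best link per sub-band" assignment is a stand-in rather than the dissertation's spectrum sharing algorithm.

```python
def time_sharing_throughput(rates):
    """rates[link][band] = achievable rate (Mbps) of a link on a sub-band.

    Models one-link-at-a-time operation: each link uses the whole spectrum
    but only gets an equal 1/N share of the airtime.
    """
    n_links = len(rates)
    return [sum(link_rates) / n_links for link_rates in rates]


def spectrum_sharing_throughput(rates):
    """Give each sub-band to the link with the best rate on it (greedy).

    All links then transmit concurrently, each on its own sub-bands.
    """
    n_links = len(rates)
    n_bands = len(rates[0])
    per_link = [0.0] * n_links
    for band in range(n_bands):
        best = max(range(n_links), key=lambda link: rates[link][band])
        per_link[best] += rates[best][band]
    return per_link


# Hypothetical rates for two links over four sub-bands.
rates = [
    [40.0, 35.0, 5.0, 2.0],  # link A: strong on the lower sub-bands
    [3.0, 6.0, 30.0, 45.0],  # link B: strong on the upper sub-bands
]
ts = time_sharing_throughput(rates)      # per-link Mbps with time sharing
ss = spectrum_sharing_throughput(rates)  # per-link Mbps with spectrum sharing
```

In this example both links gain because their good sub-bands do not overlap. A plain greedy assignment can starve a link whose channel is uniformly worse, which is one reason a practical scheme needs more than this sketch.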

Copyright by
KAMRAN ALI
2020

This dissertation is dedicated to my mother, my wife Haneya, and my father, who passed away during my PhD while fighting a long, hard battle with cancer. Thank you for always believing in me and for encouraging and supporting me in all my endeavors.


ACKNOWLEDGEMENTS

Working towards a Ph.D. has been a deeply enriching and rewarding experience. Looking back, many people have helped shape my journey. I would like to extend my thanks to them.

• First and foremost, my advisor, Prof. Alex X. Liu. My work would not have been possible without his constant guidance, unwavering encouragement, many insights, and exceptional resourcefulness. And most importantly, his friendship. I have been very fortunate to have an advisor who has also been a close friend. For all of this, Alex, thank you.

• I would also like to thank the rest of my dissertation committee, Profs. Eric Torng, Yunhao Liu, Mi Zhang, and Guan-Hua Tu, for their encouragement and insightful comments during my qualifier and comprehensive exams.

• I would also like to thank Drs. Yiannis Pefkianakis (HP Labs, Apple), Eugene Chai (NEC Labs), Kazuhito Koishida (Microsoft Applied Sciences Group), and Mohammed Alloulah (Nokia Bell Labs). I really enjoyed working with them during my summer research internships and during our collaboration after the internships.

• Throughout my Ph.D., I was supported by various NSF research grants. Thanks, NSF!

• I would also like to thank Michigan State University, and specifically the Department of Computer Science and Engineering, for providing me financial support to attend various conferences during my Ph.D.

• Many thanks to my colleagues in the Systems and Security Lab at Michigan State University. In particular, I would like to thank Faraz Ahmed, Salman Ali, Ali Munir, and Xinyu Le for numerous insightful discussions and collaborations on various projects.

• I am also deeply indebted to Dr. Ijaz Haider Naqvi, who advised my undergraduate thesis, mentored my undergraduate research at ADCOM Lab, LUMS, Pakistan, and encouraged me to pursue a Ph.D. His passion for science and scholarly pursuit has been an inspiration to me (and surely to many of his other students) and helped set me on the path on which I find myself today. I feel fortunate to have him as a close friend and an advisor. For all of your support, Dr. Ijaz, thank you.

• Finally, I do not know how I can thank my family enough: my parents and wife, from whom I realized that kindness and devotion are endless; my siblings, who always support me no matter what; and the rest of the family members, who were always supportive of my studies.

• I dedicate my dissertation to my parents, my wife Haneya, and Dr. Ijaz Haider Naqvi. Thank you all for being an amazing part of my journey!


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF ALGORITHMS

CHAPTER 1 INTRODUCTION
  1.1 Contributions
    1.1.1 Fine-grained Gesture Recognition Using Everyday WiFi Devices
    1.1.2 Understanding and Modeling WiFi Signals Based Sleep Monitoring
    1.1.3 Monitoring Browsing Behavior of Customers via RFID Imaging
    1.1.4 Fine-grained Vibration Based Sensing Using a Smartphone
    1.1.5 Distributed Spectrum Sharing for Enterprise Powerline Communication Networks

CHAPTER 2 FINE-GRAINED GESTURE RECOGNITION USING EVERYDAY WIFI DEVICES
  2.1 Introduction
  2.2 Related Work
    2.2.1 Device Free Activity Recognition
    2.2.2 Keystrokes Recognition
  2.3 Channel State Information
    2.3.1 WiKey Overview
  2.4 Noise Removal
    2.4.1 Low Pass Filtering
    2.4.2 PCA Based Filtering
  2.5 Keystroke Extraction
    2.5.1 PCA on Normalized Stream
    2.5.2 Keystroke Detection
    2.5.3 Combining Results from Antenna Pairs
    2.5.4 Extracting Keystroke Waveforms
  2.6 Feature Extraction
  2.7 Classification
    2.7.1 Dynamic Time Warping
    2.7.2 Classifier Training
    2.7.3 Behavioral Clustering of User Data
  2.8 Implementation & Evaluation
    2.8.1 Hardware Setup
    2.8.2 Data Collection
    2.8.3 Keystroke Extraction Accuracy
    2.8.4 Classification Accuracy
      2.8.4.1 Accuracy with 30 Samples per Key
      2.8.4.2 Accuracy vs. the Size for Training Set
      2.8.4.3 Effects of CSI Sampling Rate and Training Samples
    2.8.5 Real-world Evaluation on Sentences
      2.8.5.1 Accuracy
      2.8.5.2 Effects of Behavioral Clustering
      2.8.5.3 Auto-Correction and Word Recognition
  2.9 Limitations
  2.10 Conclusion

CHAPTER 3 UNDERSTANDING AND MODELING WIFI SIGNALS BASED SLEEP MONITORING
  3.1 Introduction
    3.1.1 Motivation
    3.1.2 Limitations of Prior Art
    3.1.3 Proposed Approach
    3.1.4 Technical Challenges and Solutions
    3.1.5 Summary of Experimental Results
  3.2 Related Work
    3.2.1 Respiration, Body Movements and Sleep
    3.2.2 Sleep Monitoring Technologies
  3.3 Modeling of Vital Signs and WiFi CSI
    3.3.1 Overview of WiFi CSI
    3.3.2 Breath-Multipath Model
    3.3.3 Breath-Subspace Model
  3.4 CSI Signal Processing Architecture
    3.4.1 Noise Removal
    3.4.2 Tracking Body Movements
      3.4.2.1 Body Movements Detection Approach
    3.4.3 Tracking Breath
      3.4.3.1 Bandpass Filtering
      3.4.3.2 Detecting Outage in Breathing Signal
      3.4.3.3 Measuring Breathing Rate
  3.5 Sleep Scoring
  3.6 Implementation and Evaluation
    3.6.1 Hardware Implementation
    3.6.2 Experimental Settings
    3.6.3 Breath Tracking Accuracy
      3.6.3.1 Long-term Accuracy
      3.6.3.2 Short-term Accuracy
    3.6.4 Naturally Occurring Motion False Positives
    3.6.5 Breath Signal Outage
    3.6.6 Sleep Insights
      3.6.6.1 Sleep Quality
    3.6.7 Discussion
  3.7 Conclusions

CHAPTER 4 MONITORING BROWSING BEHAVIOR OF CUSTOMERS VIA RFID IMAGING
  4.1 Introduction
  4.2 Related Work
    4.2.1 Radio Tomographic Imaging (RTI)
    4.2.2 Customer Behavior Monitoring using RFIDs
    4.2.3 Customer Behavior Monitoring using Cameras
  4.3 System Overview
    4.3.1 Monostatic Passive RFIDs
    4.3.2 TagSee's Imaging Infrastructure
    4.3.3 Overview of TagSee Imaging and Tracking Scheme
  4.4 Preprocessing RSS and Phase
    4.4.1 Calibration Mode
    4.4.2 Monitoring Mode
  4.5 Analytical RFID Imaging Approach
  4.6 Deep Learning based RFID Imaging
  4.7 Multi-Person RFID Imaging
  4.8 Implementation & Evaluation
    4.8.1 Evaluation Methodology
      4.8.1.1 Experimental Setup
      4.8.1.2 Data Collection
      4.8.1.3 Performance Metrics
    4.8.2 Single Person Imaging Scenarios
      4.8.2.1 Effect of the number of training users
      4.8.2.2 Effect of the number of reader antennas
    4.8.3 Multi-Person Imaging Scenarios
      4.8.3.1 Effect of the number of training users
      4.8.3.2 Effect of impact width k_cw
  4.9 Discussions
  4.10 Conclusions

CHAPTER 5 FINE-GRAINED VIBRATION BASED SENSING USING A SMARTPHONE
  5.1 Introduction
    5.1.1 Motivation
    5.1.2 Limitations of Prior Art
    5.1.3 Proposed Approach
    5.1.4 Technical Challenges and Our Solutions
    5.1.5 Key Novelty and Advantages
    5.1.6 Summary of Experimental Results
  5.2 Related Work
  5.3 Understanding Vibrations
    5.3.1 Vibrator Motors in Smartphones
    5.3.2 Physics of Surface Response to Vibrations
  5.4 Feature Extraction
    5.4.1 Robustness to Background Noise
    5.4.2 Robustness to Hardware Imperfections
    5.4.3 Extraction of Vibration Signature
      5.4.3.1 Extraction of Vibration Patterns
      5.4.3.2 Construction of Vibration Signature
  5.5 Classification & Recognition
  5.6 Implementation & Evaluation
    5.6.1 Implementation Details
    5.6.2 Evaluation Setup
    5.6.3 VibroTag's Sensitivity
    5.6.4 VibroTag's Accuracy
      5.6.4.1 Object and Location Recognition Accuracy
      5.6.4.2 Location Recognition Accuracy over Days
      5.6.4.3 Impact of Surrounding Objects
      5.6.4.4 Impact of Upper Cut-Off Frequency
  5.7 Usability Study
  5.8 Conclusion

CHAPTER 6 DISTRIBUTED SPECTRUM SHARING FOR ENTERPRISE POWERLINE COMMUNICATION BASED IOT NETWORKS
  6.1 Introduction
  6.2 Related work
  6.3 HomePlug AV Powerline Communications
    6.3.1 PLC Channel Characteristics
    6.3.2 HomePlug AV standard
  6.4 A Measurement Study of Enterprise PLCs
    6.4.1 Many Disjoint PLC Links Compete for Channel Access
    6.4.2 Enterprise PLC Channels are Highly Location Dependent
    6.4.3 Enterprise PLC Channels are Pseudo-Stationary
  6.5 Distributed Spectrum Sharing for HPAV PLCs
    6.5.1 Preliminary Definitions
    6.5.2 Spectrum Sharing (SS) Algorithm
      6.5.2.1 Optimal SS approach
      6.5.2.2 Ranking of S-Links
  6.6 Enabling Spectrum Sharing for HPAV PLCs
  6.7 Implementation and Evaluation
    6.7.1 Evaluation Metrics
  6.8 Advantages of Multi-Hop Routing in PLCs
  6.9 Conclusions

CHAPTER 7 CONCLUSIONS AND FUTURE WORK
  7.1 Future Work
    7.1.1 WiFi Signals Based Typing Biometrics
      7.1.1.1 Motivation
      7.1.1.2 Challenges
    7.1.2 Effective Fusion of Orthogonal Components in WiFi Signals Subspace For Improved Activity Recognition
      7.1.2.1 Motivation
    7.1.3 Challenges
  7.2 Conclusions

BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Average values of features extracted from keystrokes of keys collected from user 10

Table 2.2: Average values of features extracted from keystrokes of keys collected from user 10

Table 2.3: Variance of different features extracted from keystrokes of keys collected from user 10

Table 2.4: Variance of different features extracted from keystrokes of keys collected from user 10

Table 5.1: Average accuracy of recognizing different surfaces in office and apartment scenarios

Table 6.1: UDP and single-flow TCP throughput and jitter with OLSR on/off (jitter is reported by iperf only for UDP traffic)

LIST OF FIGURES

Figure 2.1: WiKey System .

.

.

.

.

.

.

.

.

.

.

.

Figure 2.2: Original and ﬁltered CSI time series

Figure 2.3: Correlated variations in subcarriers .

Figure 2.4: PCA of Z-normalized CSI stream Zt,r

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . .

8

. . . . . . . . . . . . . 16

. . . . . . . . . . . . . 17

. . . . . . . . . . . . . 20

Figure 2.5: Feature extraction from 2nd keystroke waveforms extracted from TX-1, RX-1

for I & O .

.

.

.

.

.

.

.

.

.

.

Figure 2.6: Keystroke extraction results .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

Figure 2.7: Mean accuracy for keys A-Z (Users 1-10) .

.

.

.

Figure 2.8: Mean accuracy for all 37 keys (Users 1-10) .

Figure 2.9: Per user average classiﬁer accuracies .

Figure 2.10: Accuracy for keys A-Z from user 10 .

Figure 2.11: Accuracy for all 37 keys from user 10 .

.

.

.

.

.

.

Figure 2.12: Color map of user 10’s confusion matrix .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 28

. . . . . . . . . . . . . 34

. . . . . . . . . . . . . 36

. . . . . . . . . . . . . 36

. . . . . . . . . . . . . 37

. . . . . . . . . . . . . 37

. . . . . . . . . . . . . 38

. . . . . . . . . . . . . 38

Figure 2.13: Multifold cross-validated average accuracies for user 10’s lower resolution

keystroke waveforms .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 39

Figure 2.14: Keystroke recognition for sentences collected from all users using 30 samples

per key .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 40

Figure 2.15: Keystroke recognition for sentences collected from user 10 using 80 samples

per key .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 40

Figure 2.16: Keystrokes recognition for sentence S1 collected from user 10 after behavioral

clustering .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 42

Figure 2.17: Comparison of word recognition accuracies before and after auto-correction . . 43

xiv

Figure 3.1: Example showing our system tracking breathing and body movements through-
out full night’s sleep of subject. Xethru radar (X4M200 [155]) ground truth
is approximately synchronized with CSI data .

.

.

.

.

.

. . . . . . . . . . . . . 51

Figure 3.2:

(a) Variation of Ai with di (t) for diﬀerent D0,i; (b) Single breath samples for
diﬀerent conﬁgurations shown in (C) .

. . . . . . . . . . . . . 57

.

.

.

.

.

.

.

.

.

Figure 3.3:

Impact of bodily activity during sleep on WiFi subspace . . . . . . . . . . . . . 59

Figure 3.4: Our WiFi CSI signal processing architecture for extracting vital signs . . . . . . 59

Figure 3.5: Example showing performance of our body movement detection algorithm,
compared to Xethru radar ground truth. Boxes show the areas where breathing
is usually present. Ground truth is approximately synchronized with CSI data . . 64

Figure 3.6: Example showing Sleep/Awake classiﬁcation for full night’s sleep of a subject.

Sleep eﬃciency was 62.1% .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 66

Figure 3.7: The real-world deployment scenarios used for evaluation of our sleep moni-

toring scheme “Serene” .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 67

Figure 3.8: CDF of overall and per-user breathing rate MSE compared to a Xethru
X4M200 ground truth; Serene’s full-night breath tracking performance; and
average BPM errors for short duration sleep experiments in diﬀerent sleep
postures

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 70

Figure 3.9: CDFs of BPM errors calculated over 15 minute windows for 6 diﬀerent nights

(Users 1, 2 and 3)

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 71

Figure 3.10: CDF’s of numbers and total duration of motion false positives during a night
when compared with X4M200 ground truths. Motion false positive naturally
occur due to activity of other housemates .

.

.

.

.

.

.

.

. . . . . . . . . . . . . 72

Figure 3.11: Second-order statistics of breath estimation outage events. Outage rate and
average outage duration mirror, respectively, their counterparts level crossing
rate and average fade duration from wireless propagation literature . . . . . . . 74

Figure 3.12: Sleep eﬃciency and body motion corresponding to 4 users and throughout

13+ consecutive nights

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 75

Figure 3.13: Overall and per-user CDF for motion duration and sleep eﬃciency . . . . . . . 76

Figure 3.14: Two-stage (asleep versus awake) crude classiﬁcation using simple feature
engineering approach and as compared to classiﬁcation from a commercial
ResMed S+ device .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . 78

xv

Figure 4.1: Example system setup

Figure 4.2: High level flow diagram of TagSee’s monitoring mode

Figure 4.3: Phase and frequency diversity based filtering for a tag obstructed by a human (head & arms allowed to move)

Figure 4.4: Intuitions behind using the concept of First Fresnel zones [54] for phase based filtering & imaging

Figure 4.5: DNN architecture for K = 116 tags, kx = 29 tags, ky = 4 tags and inter-tag distance of 5 inches along both axes

Figure 4.6: TagSee’s spatial moving window based approach for multi-person imaging

Figure 4.7: Detailed experimental setup

Figure 4.8: Comparison between TagSee’s baseline (top) and DNN based (bottom) approaches for single person scenario

Figure 4.9: Effect of number of training users on TagSee’s performance (TPRs, FPRs and MRs) for kcw = 6

Figure 4.10: Performance in single person monitoring scenario using 1 reader antenna only, kcw = 8

Figure 4.11: Comparison between TagSee’s baseline (top) and DNN based (bottom) RFID imaging approaches for multi-person scenario. The leftmost 4 images correspond to 2-user scenarios, and the rightmost 2 images correspond to 3-user scenarios

Figure 4.12: Effect of impact width kcw and number of training users on TagSee’s performance in 2 person scenarios

Figure 4.13: Effect of impact width kcw and number of training users on TagSee’s performance in 3 person scenarios

Figure 4.14: Effect of impact width kcw on TagSee’s performance (TPRs, FPRs and MRs) for 8 different multi-person scenarios, i.e. item category sets {1,3}, {1,4}, {1,5}, {1,6}, {3,6}, {4,6}, {1,4,6}, {2,4,6}, 5 training users used

Figure 4.15: Impact of reading rate on FPRs and MPRs

Figure 5.1: Experimental scenarios and their corresponding extracted acoustic time-series based vibration signatures

Figure 5.2: ERM and LRA based vibration motors [128]

Figure 5.3: Impact of background noises on features extracted by traditional techniques and VibroTag

Figure 5.4: Impact of hardware imperfection on features extracted by traditional techniques PSD and STFT

Figure 5.5: Repetitive patterns appearing in the processed sound signals corresponding to vibration

Figure 5.6: PSD of the sound signals in time windows corresponding to the scenarios in Figs. 5.5(a)-5.5(b)

Figure 5.7: Extracted time-series features (Low Noise)

Figure 5.8: Extracted time-series features (High Noise)

Figure 5.9: Colormaps of distance between features of (a) VibroTag (DTW) & (b) IMU scheme (Euclidean)

Figure 5.10: VibroTag Setup (a) VibroTag App (b) office environment (c) example of data collection locations in office (d) example of surfaces used for data collection (e) example of data collection locations in apartment

Figure 5.11: Confusion matrices for experiments performed by User-1 and User-2 to determine VibroTag’s sensitivity

Figure 5.12: Average accuracy with increasing number of training samples (sensitivity experiments)

Figure 5.13: (a) Average 4-fold cross-validation accuracies over all classes (User-1) (moderately restricted experiments), (b) Confusion matrix after cross-validation, (c) Training on data from previous days, testing on subsequent days

Figure 5.14: Accuracies on consecutive days

Figure 5.15: Removing things from bedroom table

Figure 5.16: Bringing light objects closer to smartphone

Figure 5.17: Effect of (a) moving objects closer and of (b) removing objects on classification

Figure 5.18: Cross-validation accuracies for different band-pass filter upper cut-off frequencies (User-3)

Figure 5.19: Vote distribution of 7 VibroTag’s usability questions asked from 24 participants

Figure 6.1: Example scenario: Links 5-8 and 12-11 in the same collision domain can share spectrum for concurrent operation

Figure 6.2: Basic Beacon Period structure in HPAV MAC

Figure 6.3: Building power distribution plan

Figure 6.4: (a) Link asymmetry, (b) Temporal variation in throughput over 2 days, (c) Link throughput stability CDF (45 links)

Figure 6.5: CDF of throughputs observed in different cases

Figure 6.6: Tonemaps of 12 links among 4 PLC nodes in one of our PLC deployments, showing possibility of gains from SS

Figure 6.7: Testing scenario with per-link (local) minimum Te, optimizing for net throughput (#1-#7, top-bottom)

Figure 6.8: Testing scenario with per-link (local) minimum Te, optimizing overall fairness (#1-#7, top-bottom)

Figure 6.9: Per-link throughput changes for Deployment#1 (testing scenario with per-link (local) minimum Te requirement)

Figure 6.10: Testing scenario with network-wide minimum Te requirement, optimizing for net throughput (#1-#7, top-bottom)

Figure 6.11: Testing scenario with network-wide minimum Te requirement, optimizing for overall fairness (#1-#7, top-bottom)

Figure 6.12: Per-link throughput changes for Deployment#1 (testing scenario with network-wide minimum Te requirement)

Figure 6.13: Computational and communication complexity of our spectrum sharing approach as number of PLC nodes increases

Figure 6.14: Normalized throughputs observed for different cases in deployment #1 (STD of noise 1, 2 and 3 from left-right)

Figure 6.15: Percentage change in throughput for different cases in deployment #1 (STD of noise 1, 2 and 3 from left-right)

Figure 6.16: Normalized throughputs observed for different cases in deployment #2 (STD of noise 1, 2 and 3 from left-right)

Figure 6.17: Percentage change in throughput for different cases in deployment #2 (STD of noise 1, 2 and 3 from left-right)

Figure 6.18: Normalized throughputs observed for different cases in deployment #3 (STD of noise 1, 2 and 3 from left-right)

Figure 6.19: Percentage change in throughput for different cases in deployment #3 (STD of noise 1, 2 and 3 from left-right)

Figure 6.20: Percentage loss in throughput of case 3 compared to case 2 as the probability of change in PLC channels increases

Figure 6.21: Throughput with OLSR on/off for 60 secs

Figure 7.1: Effect of fusing information from successive PCA projections on recognition accuracy (10-fold cross-validation)

LIST OF ALGORITHMS

Algorithm 1: Keystroke Detection from single TX-RX stream

Algorithm 2: Optimal algorithm for distributed spectrum sharing in HPAV PLC-Nets

CHAPTER 1

INTRODUCTION

Internet of Things (IoT) applications are becoming extremely popular and will play a key role

in making diﬀerent types of environments around us - such as homes, oﬃces, schools, hospitals,

factories, shopping plazas, and more - smarter. However, many IoT applications require dedicated

communication and/or sensing hardware and/or infrastructure, which can often be cumbersome to

deploy. My dissertation focuses on revisiting the physical layer of various everyday commercial off-the-shelf (COTS)

electronic devices, either to leverage the signals obtained from their physical layers to develop novel

sensing applications, or to modify/improve their PHY/MAC layer protocols to enable even more

useful deployment scenarios and networking applications - while keeping their original purpose

intact - by introducing mere software/ﬁrmware level changes and completely avoiding any hardware

level changes. Adding such new functionality to existing everyday infrastructure

and electronics brings advantages both in terms of cost and convenience of use, as the underlying

devices and their protocols are already mainstream, easily available, and often already purchased

and deployed to serve their mainstream purpose of use. However, developing such applications

poses unique challenges as most mainstream COTS electronic devices of daily use are not designed

to be used for purposes other than their original purpose of use.

In this dissertation, I present my research on WiFi signals based sensing, RFID signals based

sensing, smartphones’ vibration based sensing, and distributed spectrum sharing in Powerline

Communications (PLCs) - a mechanism which leverages the existing power distribution network in

a building for communication. In my works on WiFi signals based sensing, I developed signal

processing and machine learning approaches to enable ﬁne-grained gesture recognition and sleep

monitoring using COTS WiFi devices. In my work on RFID signals based sensing, I developed an

RFID signals based imaging scheme to track customer activity in front of display items - in places

such as retail stores - using COTS RFID tags and readers. In my work on smartphones’ vibration

based sensing, I developed a robust and practical vibration based surface recognition scheme that

works with smartphones with diﬀerent hardware, can extract ﬁne-grained vibration signatures of

diﬀerent surfaces, and is robust to environmental noise and hardware based irregularities. This

work ﬁnds its applications in indoor localization, for example, training your smartphone to turn oﬀ

lights when you put it on your bedside table. Finally, as communication and sensing go hand in

hand, I worked on PLC technology where I developed a distributed spectrum sharing scheme to

make enterprise level PLCs based IoT networks faster.

1.1 Contributions

This dissertation takes an in-depth look at the following research problems.

1.1.1 Fine-grained Gesture Recognition Using Everyday WiFi Devices

In this work, we show for the ﬁrst time that WiFi signals can also be leveraged to recognize

small gestures such as keystrokes. The intuition is that while typing a certain key, the hands and

ﬁngers of a user move in a unique formation and direction and thus generate a unique pattern in the

time-series of Channel State Information (CSI) values, which we call CSI-waveform for that key.

In this work, we propose a WiFi signal based keystroke recognition system called WiKey. WiKey

consists of two Commercial Oﬀ-The-Shelf (COTS) WiFi devices, a sender (such as a router) and

a receiver (such as a laptop). The sender continuously emits signals and the receiver continuously

receives signals. When a human subject types on a keyboard, WiKey recognizes the typed keys

based on how the CSI values change at the WiFi signal receiver end. We implemented the WiKey system

using a TP-Link TL-WR1043ND WiFi router and a Lenovo X200 laptop. WiKey achieves more

than 97.5% detection rate for detecting the keystroke and 96.4% recognition accuracy for classifying

single keys. In real-world experiments, WiKey can recognize keystrokes in a continuously typed

sentence with an accuracy of 93.5%. WiKey can also recognize complete words inside a sentence

with more than 85% accuracy. In this work, we have shown that ﬁne grained activity recognition

is possible by using COTS WiFi devices. Thus, the techniques proposed in this work can be used

for several localized HCI applications. Examples include zoom-in, zoom-out, scrolling, sliding,

and rotating gestures for operating personal computers, gesture recognition for gaming consoles,

in-home gesture recognition for operating various household devices, and applications such as

writing and drawing in the air. Other than recognizing keystrokes on conventional keyboards, our

WiKey technology can be potentially used to build virtual keyboards where human users type on a

printed keyboard.

1.1.2 Understanding and Modeling WiFi Signals Based Sleep Monitoring

Long-term sleep monitoring is crucial for patients with sleep disorders, as well as for the

general population so that people can keep track of their sleep quality and improve their sleeping

habits. Moreover, continual sleep monitoring can help with early identiﬁcation of sleep disorders and

related illnesses which would otherwise go undiagnosed. Recently, WiFi Channel State Information

(CSI) signals based methods have emerged as an eﬀective approach to low-cost and easily adoptable

sleep monitoring for in-home environments. The idea is to track breathing and other body/limb

activity, which are closely related to sleep quality in humans, by leveraging the changes caused by

those bodily motions in CSI signals. However, their key limitation lies in the lack of a model that

can correlate the changes introduced in CSI with those bodily motions. In this work, we propose

Serene, a WiFi CSI based sleep quality monitoring scheme which can robustly track breathing

and body/limb activity related vital signs during sleep throughout a night in an individual and

environment independent manner. We develop two models based on which we design Serene’s signal

processing pipeline: a breath-multipath model, and a breath-subspace model. Our breath-multipath

model quantiﬁes the eﬀect of small breathing movements on the CSI signals, and allows Serene to

robustly extract breathing waveforms. Our breath-subspace model quantiﬁes how breathing aﬀects

the subspace formed by WiFi subcarriers very diﬀerently compared to other bodily motions, and

allows Serene to robustly diﬀerentiate between breathing and body/limb activity during sleep. We

implemented Serene with commodity off-the-shelf (COTS) WiFi hardware, and tested it on 5 different

individuals, where we collected more than 550 hours (80 nights) of CSI data at their apartments.

55% of our dataset corresponds to NLOS deployment scenarios, and 45% to LOS. Our results

demonstrate that Serene can track breathing with an average error of <0.59 breaths per minute

(BPM) for controlled sleep experiments and an average error of <1.19 BPM for real-world full-night

in-home sleep experiments, respectively. We conclude with qualitative sleep assessments for our

study participants based on a lightweight sleep scoring algorithm.

1.1.3 Monitoring Browsing Behavior of Customers via RFID Imaging

In this work, we propose to use commercial oﬀ-the-shelf (COTS) monostatic RFID devices (i.e.

which use a single antenna at a time for both transmitting and receiving RFID signals to and from

the tags) to eﬀectively image customer activity in front of display items in places such as retail

stores. To this end, we propose TagSee, a multi-person tracking system based on monostatic RFID

imaging. TagSee is based on the insight that when customers are browsing the items on a shelf, they

stand between the tags deployed along the boundaries of the shelf and the reader, which changes

the multi-paths that the RFID signals travel along, and both the RSS and phase values of the RFID

signals that the reader receives change. Based on these variations observed by the reader, TagSee

constructs a coarse grained image of the customers. Afterwards, TagSee identiﬁes the items that

are being browsed by the customers by analyzing the constructed images. The key novelty of this

work lies in achieving multi-person activity tracking in front of display items by constructing coarse

grained images via robust, analytical model-driven deep learning based, RFID imaging. To achieve

this, we ﬁrst mathematically formulate the problem of imaging humans using monostatic RFID

devices and derive an approximate analytical imaging model that correlates the variations caused

by human obstructions in the RFID signals. Based on this model, we then develop a deep learning

framework to robustly image customers with high accuracy. We implement the TagSee scheme using

an Impinj Speedway R420 reader and SMARTRAC DogBone RFID tags. Our experimental results

show that, on average, TagSee can achieve a true positive rate (TPR) of ∼90% and a false positive
rate (FPR) of ∼10% using training data from just 2-3 users. Moreover, TagSee can achieve a TPR
of more than ∼80% and a FPR of less than ∼15% in multi-person scenarios using training data
from just 3-4 users.

1.1.4 Fine-grained Vibration Based Sensing Using a Smartphone

Vibration based sensing has been shown to be a low-cost and eﬀective approach to recognizing

diﬀerent surfaces. The key intuition is that diﬀerent surfaces respond to the same vibration dif-

ferently. Previous schemes either use custom hardware for creating and sensing vibration, which

makes them diﬃcult to adopt, or use inertial (IMU) sensors in commercial oﬀ-the-shelf (COTS)

smartphones to sense movements produced due to vibrations, which makes them coarse-grained

because of the low sampling rates of IMU sensors. In this work, we propose VibroTag, a robust

and practical vibration based sensing scheme that works with smartphones with diﬀerent hardware,

can extract ﬁne-grained vibration signatures of diﬀerent surfaces, and is robust to environmental

noise and hardware based irregularities. The key intuition is that as the vibrating mass inside a

smartphone’s vibrator motor repeatedly moves to and fro, the vibrating mass causes the whole

smartphone structure and the hardware inside it to vibrate in a peculiar pattern, which depends

upon the vibration response (or absorption properties) of the surface that the smartphone is placed

on. These vibrations produce peculiar sound waves that VibroTag detects using the smartphone’s

microphone. To make VibroTag easily scalable and compatible with COTS smartphones, we design

VibroTag’s signal processing and machine learning pipeline such that it relies only on built-in vibra-

tion motors and microphone for sensing, and it is robust to hardware irregularities and background

environmental noises. We implemented VibroTag on two different Android phones and evaluated it

in multiple diﬀerent environments. Our results show that VibroTag achieves an average accuracy of

86.55% while recognizing 24 different surfaces, which is more than 37 percentage points higher than the average

accuracy of 49.25% achieved by the state-of-the-art IMUs based scheme, which we implemented

for comparison with VibroTag.

1.1.5 Distributed Spectrum Sharing for Enterprise Powerline Communication Networks

As powerline communication (PLC) technology does not require dedicated cabling and network

setup, it can be used to easily connect a multitude of IoT devices deployed in enterprise environments

for sensing and control related applications. While PLC technology has the potential to improve

connectivity and allow for new applications in enterprise settings, it has been mainly deployed

in home networks and its deployment in enterprise settings has been largely overlooked. IEEE

has standardized the PLC protocol in IEEE 1901, also known as HomePlug AV (HPAV) [3, 5],

which has been widely adopted in mainstream PLC devices. A key weakness of HPAV protocol is

that it does not support spectrum sharing. Currently, each link in an HPAV PLC network operates

over the whole available spectrum, and only one link can operate at any time within a single

collision domain. In this work, through an extensive measurement study of HPAV PLCs in a real

enterprise environment using commodity oﬀ-the-shelf (COTS) HPAV PLC devices, we discover

that spectrum sharing can signiﬁcantly beneﬁt enterprise level PLC networks. To this end, we

propose a distributed spectrum sharing technique for enterprise HPAV PLC networks, and show

that ﬁne-grained distributed spectrum sharing on top of current HPAV MAC protocols can boost

the aggregated and per-link throughput by up to 60% and 250% respectively, by allowing multiple

PLC links to communicate concurrently, while requiring only a few modifications to the existing HPAV

devices and protocols.

CHAPTER 2

FINE-GRAINED GESTURE RECOGNITION USING EVERYDAY WIFI DEVICES

2.1 Introduction

Keystroke privacy is critical for ensuring the security of computer systems and the privacy

of users, as what is being typed could be passwords or privacy-sensitive information. The research

community has studied various ways to recognize keystrokes in the context of computer security and

user privacy, which can be classified into three categories: acoustic emission based approaches,

electromagnetic emission based approaches, and vision based approaches. Acoustic

emission based approaches recognize keystrokes based on either the observation that diﬀerent keys

in a keyboard produce different typing sounds [13, 170] or the observation that the acoustic emanations

from different keys arrive at different microphones at different times [169]. Electromagnetic

emission based approaches recognize keystrokes based on the observation that the electromagnetic

emanations from the electrical circuit underneath diﬀerent keys in a keyboard are diﬀerent [143].

Vision based approaches recognize keystrokes using computer vision technologies [16].

In this work, we show for the ﬁrst time that commodity WiFi devices can also be used to

recognize keystrokes. The key intuition is that while typing a certain key, the hands and ﬁngers of

a user move in a unique direction and formation, generating a unique pattern in the time-series of

Channel State Information (CSI) values, which we call CSI-waveform of that key. The keystrokes

of each key introduce relatively unique multi-path distortions in WiFi signals and this uniqueness

can be leveraged to recognize keystrokes. Due to the high data rates supported by modern WiFi

devices, WiFi cards provide enough CSI values within the duration of a keystroke to construct a

high resolution CSI-waveform for each keystroke.

In this work, we propose a WiFi signal based keystroke recognition system called WiKey. WiKey

consists of two Commercial Oﬀ-The-Shelf (COTS) WiFi devices, a sender (such as a router) and

a receiver (such as a laptop), as shown in Figure 2.1. The sender continuously emits signals and

the receiver continuously receives signals. When a human subject types on a keyboard, WiKey

recognizes the typed keys based on how the CSI values change at the WiFi signal receiver end.

CSI values quantify the aggregate eﬀect of wireless phenomena such as fading, multi-paths, and

Doppler shift on the wireless signals in a given environment. When the environment changes, such

as when a key is pressed, the impact of these wireless phenomena on the wireless signals changes,

resulting in unique changes in the CSI values.

Figure 2.1: WiKey System

There are three key technical challenges. The ﬁrst technical challenge is to segment the CSI

time series to identify the start time and end time of each keystroke. We studied the characteristics

of typical CSI-waveforms of diﬀerent keystrokes and observed that the waveforms of diﬀerent

keys show similar rising and falling trends in the rate of change of CSI values. Based on this

observation, we design a keystroke extraction algorithm that utilizes the CSI streams of all

transmit-receive (TX-RX) antenna pairs to determine the approximate start and end points of individual

keystrokes in a given CSI-waveform by continuously matching the trends in CSI time series with

the experimentally observed trends using a sliding window approach.
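
As a rough illustration of the sliding-window idea (not the extraction algorithm used by WiKey itself), the following Python sketch marks a window as active when the mean absolute rate of change of a single CSI amplitude stream exceeds a threshold, and merges consecutive active windows into one candidate keystroke segment. The function name, the window length, the threshold, and the synthetic stream are all hypothetical choices made for this example; the actual algorithm combines all TX-RX streams and matches experimentally observed trends.

```python
def segment_keystrokes(csi, window=5, threshold=0.5):
    """Return (start, end) sample indices of candidate keystroke segments.

    csi       -- list of CSI amplitude samples from one TX-RX stream
    window    -- sliding window length in samples (illustrative value)
    threshold -- mean absolute first difference that counts as "active"
                 (illustrative value; it would be tuned experimentally)
    """
    # Rate of change of the CSI stream (absolute first difference).
    diffs = [abs(b - a) for a, b in zip(csi, csi[1:])]

    segments = []
    start = None
    for i in range(len(diffs) - window + 1):
        activity = sum(diffs[i:i + window]) / window
        if activity > threshold and start is None:
            start = i  # activity rises above the threshold: segment begins
        elif activity <= threshold and start is not None:
            # Activity fell back: close the segment at the last sample
            # covered by the previous (still active) window.
            segments.append((start, i + window - 1))
            start = None
    if start is not None:  # stream ended while still active
        segments.append((start, len(csi) - 1))
    return segments


# Synthetic stream: flat, a burst of fluctuation, then flat again.
stream = [0.0] * 20 + [0.0, 2.0, -1.5, 2.5, -2.0, 1.5, -1.0, 0.5] + [0.0] * 20
print(segment_keystrokes(stream))
```

The returned boundaries are padded by up to one window length on each side of the burst, which is in line with the start and end points being only approximate.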

The second technical challenge is to extract distinguishing features for generating classiﬁcation

models for each of the 37 keys (10 digits, 26 letters and 1 space bar). As the keys on a keyboard

are closely placed, conventional features such as maximum peak power, mean amplitude, root mean

square deviation of signal amplitude, second/third central moment, rate of change, signal energy

or entropy, and number of zero crossings cannot be used because the values of these features for

adjacent keys are almost identical. To address this challenge, we use the CSI-waveform shapes of

each key from each TX-RX antenna pair as features. As the waveforms for each key contain a large

number of samples, we apply the Discrete Wavelet Transform (DWT) technique on these waveforms

to reduce the number of samples while keeping the shape preserving time and frequency domain

information intact. We use the waveforms resulting from the DWT of individual keystrokes as their

shape features.
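
To illustrate how a DWT reduces the number of samples while keeping the coarse shape of a waveform, here is a minimal sketch using the Haar wavelet. Haar is chosen only because it is the simplest wavelet to write out by hand; the text above does not state which wavelet family or decomposition level is used, so both are assumptions of this example, as are the function names and the synthetic waveform.

```python
import math


def haar_dwt_step(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients.

    Assumes len(signal) is even; a real pipeline would pad odd-length input.
    """
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def compress_waveform(signal, levels=3):
    """Keep only the approximation coefficients after `levels` Haar steps.

    Each level halves the number of samples, so the output has
    len(signal) / 2**levels points that follow the waveform's coarse shape.
    """
    approx = list(signal)
    for _ in range(levels):
        approx, _detail = haar_dwt_step(approx)
    return approx


# A 64-sample synthetic bump reduced to 8 shape samples.
wave = [math.sin(math.pi * n / 63) for n in range(64)]
shape = compress_waveform(wave, levels=3)
print(len(shape))
```

Discarding the detail coefficients at each level is what trades fine temporal detail for a shorter feature vector; how many levels to apply is a tuning decision.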

The third challenge is to compare shape features of any two keystrokes. The midpoints of

extracted CSI-waveforms of different keystrokes rarely align with each other because the start

and end points determined by the extraction algorithm are never exact. Moreover, the lengths of

diﬀerent keystroke waveforms also diﬀer because the duration of pressing any key is often diﬀerent.

Consequently, the midpoints and lengths of shape features do not match either. Another issue is

that the shapes of different keystroke waveforms of the same key are often distorted versions of

each other because of slightly diﬀerent formation and direction of motion of hands and ﬁngers

while pressing that key. Thus, two shape features cannot be compared using standard measures like

correlation coeﬃcient or Euclidean distance. To address this challenge, we use the Dynamic Time

Warping (DTW) technique to quantify the distance between the two shape features. DTW can ﬁnd

the minimum distance alignment between two waveforms of diﬀerent lengths.
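
The following Python sketch shows the textbook dynamic-programming form of DTW on two one-dimensional sequences. It is a generic illustration rather than WiKey's implementation: the absolute-difference local cost, the absence of a warping-window constraint, and the toy sequences are assumptions made for this example.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Uses the absolute difference as the local cost and allows the usual
    three moves (both advance, only a advances, only b advances).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimum cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],  # both advance
                                     cost[i - 1][j],      # only a advances
                                     cost[i][j - 1])      # only b advances
    return cost[n][m]


# Same shape, different lengths and a shifted midpoint.
x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 0, 1, 2, 3, 3, 2, 1, 0]
print(dtw_distance(x, y))
```

Euclidean distance is not defined for these two sequences because their lengths differ, whereas DTW reports a distance of 0 here since y is a time-stretched copy of x.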

The key novelty of this work lies in proposing the first WiFi signal based keystroke recognition

approach. Some recent work uses CSI values to recognize various macro aspects of human move-

ments such as falling down [48], household activities [148], detection of human presence [168],

and estimating the number of people in a crowd [156]. These schemes extract coarse grained

information from the CSI values to recognize macro-movements such as falling down or

full-body/limb gestures. They cannot be directly adapted to recognize keystrokes because

such coarse grained information does not capture the minor variations in the CSI values caused

by human micro-movements such as those of hands and ﬁngers while typing. Some recent work,

namely WiHear, uses CSI values to extract the micro-movements of mouth to recognize 9 syllables

in the spoken words [145]. However, WiHear uses special hardware including directional antennas

and stepper motors to direct WiFi beams towards speaker’s mouth and extract the micro-movements.

We implemented the WiKey system using a TP-Link TL-WR1043ND WiFi router and a Lenovo

X200 laptop. In the evaluation process, we ﬁrst build a keystroke database of 10 human subjects

with IRB approval. WiKey achieves more than 97.5% detection rate for detecting the keystroke

and 96.4% recognition accuracy for classifying single keys. In real-world experiments, WiKey

can recognize individual keystrokes in a continuously typed sentence with an accuracy of 93.5%.

Moreover, WiKey can recognize complete words in a sentence with more than 85% accuracy.

In this work, we have shown that ﬁne grained activity recognition is possible by using COTS

WiFi devices. Thus, the techniques proposed in this work can be used for several HCI applications.

Examples include zoom-in, zoom-out, scrolling, sliding, and rotating gestures for operating personal

computers, gesture recognition for gaming consoles, in-home gesture recognition for operating

various household devices, and applications such as writing and drawing in the air. Other than

recognizing keystrokes on conventional keyboards, our WiKey technology can be potentially used

to build virtual keyboards where human users type on a printed keyboard.

2.2 Related Work

2.2.1 Device Free Activity Recognition

Device-free activity recognition solutions use the variations in wireless channel to recognize

human activities in a given environment. Existing solutions related to our work can be grouped into

three categories: (1) RSS based, (2) CSI based and (3) Software Deﬁned Radio (SDR) based.

RSS Based: Sigg et al. proposed activity recognition schemes that utilize RSS values of

WiFi signals to recognize four activities including crawling, lying down, standing up, and walking

[123, 124]. They achieved activity recognition rates of over 80% for these four activities. To obtain

the RSS values from WiFi signals, they used USRPs, which are specialized hardware devices

compared to the COTS WiFi devices that we used in our work. While RSS values can be used

for recognizing macro-movements, they are not suitable to recognize the micro-movements such

as those of ﬁngers and hands in keyboard typing because RSS values only provide coarse-grained

information about the channel variations and do not contain ﬁne-grained information about small

scale fading and multi-path eﬀects caused by these micro-movements.
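
A toy numerical example can make this limitation of RSS concrete. The sketch below treats RSS as the total power summed over subcarriers, which is a simplification assumed here for illustration (real NICs report RSS in dB after additional processing), and the per-subcarrier amplitudes are made-up values.

```python
def rss_like_power(csi_amplitudes):
    """Collapse per-subcarrier amplitudes into one scalar (total power)."""
    return sum(a * a for a in csi_amplitudes)


# Two hypothetical channel snapshots over 4 subcarriers. A small movement
# changes which subcarriers fade, but the total received power is unchanged.
before = [1.0, 2.0, 3.0, 4.0]
after = [4.0, 3.0, 2.0, 1.0]

print(rss_like_power(before), rss_like_power(after))  # the two scalars are equal
print([b - a for a, b in zip(before, after)])         # per-subcarrier change
```

The scalar summary is identical for both snapshots, while the per-subcarrier values, which are what CSI exposes, differ on every subcarrier.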

CSI Based: CSI values obtained from COTS WiFi network interface cards (NICs) (such as

Intel 5300 and Atheros 9390) have been recently proposed for activity recognition [48, 94, 145,

148, 156, 168] and localization [120, 157, 159]. Han et al. proposed WiFall that detects fall of a

human subject in an indoor environment using CSI values [48]. Zhou et al. proposed a passive

human detection scheme which exploits multi-path variations for detecting human presence in an

indoor environment using CSI values [168]. Zou et al. proposed Electronic Frog Eye that counts

the number of people in a crowd using CSI values by treating the people reﬂecting the WiFi signals

as “virtual antennas” [156]. Wang et al. proposed E-eyes that exploits CSI values for recognizing

household activities such as washing dishes and taking a shower [148]. Nandakumar et al. leverage

the CSI and RSS information from oﬀ-the-shelf WiFi devices to classify four arm gestures - push,

pull, lever, and punch [94]. The fundamental diﬀerence between these schemes and our scheme is

that these schemes extract coarse grained features from the CSI values provided by the COTS WiFi

NIC to perform these tasks while our proposed scheme refines these CSI values to capture fine grained

variations in the wireless channel for recognizing keystrokes. Wang et al. proposed WiHear, which uses

CSI values to recognize the shape of the mouth while speaking to detect whether a person is uttering one

of a set of nine predefined syllables [145]. While WiHear can capture the micro-movements

of lips, it uses special purpose directional antennas with stepper motors for directing the antenna

beams towards a person’s mouth to obtain a clean signal for recognizing mouth movements. In

contrast, our proposed scheme does not use any special purpose equipment and recognizes the

micro-movements of ﬁngers and hands using COTS WiFi NIC.

SDR Based: Researchers have proposed schemes that utilize SDRs and special purpose hardware

to transmit and receive custom modulated signals for activity recognition [8, 64, 85, 110].

Pu et al. proposed WiSee that uses a special purpose receiver design on USRPs to extract small

Doppler shifts from OFDM WiFi transmissions to recognize human gestures [110]. Kellogg et al.

proposed to use a special purpose analog envelope detector circuit for recognizing gestures within a

distance of up to 2.5 feet using backscatter signals from RFID or TV transmissions [64]. Lyonnet

et al. use micro-Doppler signatures to classify gaits of human subjects into multiple categories

using specialized Doppler radars [85]. Adib et al. proposed WiTrack that uses a specially designed

frequency modulated carrier wave radio frontend to track human movements behind a wall [8].

Recently, Chen et al. proposed an SDR based custom receiver design which can be used to track

keystrokes using wireless signals [23]. Compared to these schemes, our scheme does not use any

specialized hardware or SDRs; rather, it utilizes COTS WiFi NICs to recognize keystrokes.

2.2.2 Keystrokes Recognition

To the best of our knowledge, there is no prior work on recognizing keystrokes by leveraging

variations in wireless signals using commodity WiFi devices. Other than the SDR-based keystroke

tracking approach proposed in [23] which uses wireless signals to track keystrokes, researchers have

proposed several keystrokes recognition schemes that are based on other sensing modalities such

as acoustics [13, 25, 169, 170], electromagnetic emissions [143], and video cameras [16]. Next, we

give a brief overview of the other existing schemes that utilize these sensing modalities to recognize

keystrokes.

Acoustics Based: Asonov et al. proposed a scheme to recognize keystrokes by leveraging the

observation that diﬀerent keys of a given keyboard produce slightly diﬀerent sounds during regular

typing [13]. They used a back-propagation neural network for keystroke recognition and the fast Fourier

transform (FFT) of the time window of every keystroke peak as features for training the classifiers.

Zhuang et al. proposed another scheme that recognizes keystrokes based on the sounds generated

during key presses [170]. They used cepstrum features [25] instead of FFT as keystroke features and

used unsupervised learning with language model correction on the collected features before using

them for supervised training and recognition of different keystrokes. Zhu et al. proposed a context-free

geometry-based approach for recognizing keystrokes that leverages the acoustic emanations

from keystrokes to ﬁrst calculate the time diﬀerence of keystroke arrival and then estimate the

physical locations of the keystrokes to identify which keys are pressed [169].

Electromagnetic Emissions Based: Vuagnoux et al. used a USRP to capture the electromagnetic

emanations while pressing the keys [143]. These electromagnetic emanations originated from the

electrical circuit underneath each key in conventional keyboards. The authors proposed to capture

the entire raw electromagnetic spectrum and process it to recognize the keystrokes. Unfortunately,

this scheme is highly susceptible to the background electromagnetic noise that exists in almost all

environments these days, for example from microwave ovens, refrigerators, and televisions.

Video Camera Based: Balzarotti et al. proposed ClearShot that processes the video of a person

typing to reconstruct the sentences (s)he types [16]. The authors propose to use context and language

sensitive analysis for reconstructing the sentences.

2.3 Channel State Information

Modern WiFi devices that support IEEE 802.11n/ac standard typically consist of multiple

transmit and multiple receive antennas and thus support MIMO. Each MIMO channel between

each transmit-receive (TX-RX) antenna pair of a transmitter and receiver comprises multiple

subcarriers. These WiFi devices continuously monitor the state of the wireless channel to eﬀectively

perform transmit power allocations and rate adaptations for each individual MIMO stream such that

the available capacity of the wireless channel is maximally utilized [46]. These devices quantify

the state of the channel in terms of CSI values. The CSI values essentially characterize the Channel

Frequency Response (CFR) for each subcarrier between each transmit-receive (TX-RX) antenna

pair. As the received signal is the resultant of constructive and destructive interference of several

multipath signals scattered from the walls and surrounding objects, the disturbances caused by

movement of hands and ﬁngers while typing on a keyboard near the WiFi receiver not only lead to

changes in previously existing multipaths but also to the creation of new multipaths. These changes

are captured in the CSI values for all subcarriers between every TX-RX antenna pair and can then

be used to recognize keystrokes.

Let MT denote the number of transmit antennas, MR denote the number of receive antennas

and Sc denote the number of OFDM sub-carriers. Let Xi and Yi represent the MT dimensional

transmitted signal vector and MR dimensional received signal vector, respectively, for subcarrier

i and let Ni represent an MR dimensional noise vector. An MR × MT MIMO system at any time

instant can be represented by the following equation:

\[ Y_i = H_i X_i + N_i, \qquad i \in [1, S_c] \tag{2.1} \]

In the equation above, the MR× MT dimensional channel matrix Hi represents the Channel State
Information (CSI) for the sub-carrier i. Any two communicating WiFi devices estimate this channel

matrix Hi for every subcarrier by regularly transmitting a known preamble of OFDM symbols

between each other. For each TX-RX antenna pair, the driver of our Intel 5300 WiFi NIC reports

CSI values for Sc = 30 OFDM subcarriers of the 20 MHz WiFi Channel [47]. This leads to 30

matrices with dimensions MR × MT per CSI sample.
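As a rough illustration of Equation 2.1 and of the resulting data layout, the following sketch simulates one CSI sample as a stack of 30 channel matrices. The 3 × 2 antenna configuration, the random channel values, and the noise level are assumptions chosen only for this example; they are not measurements from our testbed.

```python
import numpy as np

rng = np.random.default_rng(0)
M_T, M_R, S_c = 2, 3, 30   # assumed antenna counts; S_c = 30 subcarriers as reported by the driver

# One complex M_R x M_T channel matrix H_i per subcarrier i (illustrative random values).
H = rng.standard_normal((S_c, M_R, M_T)) + 1j * rng.standard_normal((S_c, M_R, M_T))
X = rng.standard_normal((S_c, M_T)) + 1j * rng.standard_normal((S_c, M_T))
N = 0.01 * (rng.standard_normal((S_c, M_R)) + 1j * rng.standard_normal((S_c, M_R)))

# Y_i = H_i X_i + N_i, evaluated for every subcarrier i at once.
Y = np.einsum("irt,it->ir", H, X) + N

# The amplitudes |H_i| form one CSI sample: 30 matrices of size M_R x M_T.
csi_amplitude = np.abs(H)
```

Stacking such samples over time gives, for each TX-RX pair, one time series per subcarrier, which is the input assumed by the processing steps that follow.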

2.3.1 WiKey Overview

To recognize keystrokes from CSI time series, WiKey needs classiﬁcation models for all

keystrokes. WiKey ﬁrst generates these classiﬁcation models using the following four steps and

then uses them to classify previously unseen keystrokes.

The ﬁrst step is to remove noise from the time series of CSI values. The CSI time series reported

by WiFi NICs contain a large amount of noise even when the environment is static. WiKey removes

the noise in two steps. First, it passes CSI time series of all subcarriers for each TX-RX antenna

pair through a low-pass ﬁlter to remove high frequency noises. Second, it leverages our observation

that the variations in the CSI time series of all subcarriers due to the movements of hands and

ﬁngers are correlated and applies Principal Component Analysis (PCA) on the ﬁltered subcarriers

to extract the signals that only contain variations caused by movements of hands with acceptable

levels of noise.

The second step is to detect the starting and ending points of keystrokes and extract the CSI

waveforms for individual keystrokes. WiKey uses our keystroke extraction algorithm to identify

the starting and ending points of individual keystrokes in a given CSI time series by leveraging the

observation that CSI waveforms of diﬀerent keystrokes show similar trends in the rates of change

in the CSI values at the start and end of any keystroke. Our keystroke extraction algorithm takes

into account the variations in the CSI time series of all subcarriers for all TX-RX antenna pairs

during keystroke extraction to minimize chances of detection errors, including missed keystrokes,

false positives and detection of the same keystroke multiple times.

The third step is to extract appropriate features from the CSI waveforms of keystrokes to

generate classiﬁcation models. For this, WiKey applies Discrete Wavelet Transform (DWT) on

those waveforms to obtain shape features of keystrokes. These shape features obtained from DWT

preserve both frequency and time domain information of the CSI waveforms while at the same

time reducing the number of samples in the CSI waveform, which helps in reducing the computational

cost.

The fourth step is to generate classiﬁcation models using these shape features for keystrokes.

WiKey trains an ensemble of classifiers to generate a classification model for each key using the

training data of the user. We chose k-Nearest Neighbor (kNN) classiﬁer because it essentially

searches the entire feature space to match the shape features of one keystroke with others, and thus

is most suited for this particular application. To compare the shape features of any two keystrokes,

WiKey uses Dynamic Time Warping (DTW) based distance metric while training the kNN classiﬁer.
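The combination of kNN and a DTW distance can be sketched in a few lines. This is a minimal illustration and not the WiKey implementation: the function names, the absolute-difference local cost, and the default k = 1 are assumptions made for the example.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local cost (assumed: absolute difference)
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]


def knn_predict(query, training, k=1):
    """training: list of (feature_sequence, key_label) pairs; majority vote over the k nearest."""
    nearest = sorted(training, key=lambda item: dtw_distance(query, item[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)
```

DTW tolerates small timing differences between two presses of the same key, which is why a time-shifted copy of a training waveform can still be at distance zero from it.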

2.4 Noise Removal

The CSI values provided by commodity WiFi NICs are inherently noisy because of the frequent

changes in internal CSI reference levels, transmit power levels, and transmission rates. To use CSI

values for recognizing keystrokes, such noise must ﬁrst be removed from the CSI time series. For

this, WiKey first passes the CSI time series through a low-pass filter to remove high frequency noises.

Unfortunately, a simple low pass ﬁlter does not denoise the CSI values very eﬃciently. Although

strict low-pass ﬁltering can remove noise further, it causes loss of useful information from the signal

as well. To extract useful signal from the noisy CSI time series, WiKey leverages our observation

that the variations in the CSI time series of all subcarriers due to the movements of hands and

ﬁngers are correlated. Therefore, it applies Principal Component Analysis (PCA) on the ﬁltered

subcarriers to extract the signals that only contain variations caused by movements of hands. Next,

we ﬁrst describe the process of applying the low-pass ﬁlter on the CSI time series and then explain

how WiKey extracts hand and ﬁnger movement signal using our PCA based approach.

2.4.1 Low Pass Filtering

The frequencies of the variations caused by the movements of hands and fingers lie at the low end of the spectrum, while the frequency of the noise lies at the high end of the spectrum. To remove noise in such a situation, a Butterworth low-pass filter is a natural choice: it does not significantly distort the phase information in the signal and has a maximally flat amplitude response in the passband, and thus does not distort the hand and finger movement signal much. WiKey applies the Butterworth filter to the CSI time series of all subcarriers in each TX-RX antenna pair so that every stream experiences similar effects of phase distortion and group delay introduced by the filter. Although this process helps in removing some high frequency noise, the noise is not completely eliminated because the gain of a Butterworth filter rolls off relatively slowly beyond the cut-off frequency.

[Figure 2.2: Original and filtered CSI time series. Panels: (a) Original time series; (b) Filtered time series. Axes: Amplitude vs. Sample.]

We observed experimentally that the frequencies of the variations in CSI time series due to hand and finger movements while typing lie approximately between 3 Hz and 80 Hz. As we sample CSI values at a rate of Fs = 2500 samples/s, we set the cut-off frequency ωc of the Butterworth filter at $\omega_c = \frac{2\pi f}{F_s} = \frac{2\pi \cdot 80}{2500} \approx 0.2$ rad/sample. Figure 2.2(a) shows the amplitudes of the unfiltered CSI waveform of a keystroke and Figure 2.2(b) shows the output of the Butterworth filter. We observe that the Butterworth filter successfully removes most of the bursty noise from the CSI waveforms.
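The filtering step can be sketched with SciPy as follows. The sampling rate and the 80 Hz cut-off come from the text; the filter order of 4 and the function name `lowpass_csi` are assumptions made for the example, since the order is not stated here. A causal filter (`lfilter`) is used so that the phase distortion and group delay discussed above are present and identical for every subcarrier.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 2500.0      # CSI sampling rate in samples/s (from the text)
f_cut = 80.0     # upper edge of the keystroke band in Hz (from the text)
order = 4        # assumption: the filter order is not given in the text

w_c = 2 * np.pi * f_cut / fs                     # normalized cut-off, about 0.2
b, a = butter(order, f_cut, btype="low", fs=fs)  # digital Butterworth low-pass


def lowpass_csi(csi):
    """csi: array of shape (num_samples, num_subcarriers); filters each column."""
    return lfilter(b, a, csi, axis=0)
```

Applying the same coefficients to every column is what keeps the delay consistent across the subcarrier streams.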

2.4.2 PCA Based Filtering

We observed experimentally that the movements of hands and fingers result in correlated

changes in the CSI time series for each subcarrier in every transmit-receive antenna pair. Figure 2.3

plots the amplitudes of CSI time series of 10 diﬀerent subcarriers for one transmit-receive antenna

pair while a user was repeatedly pressing a key. We observe from this ﬁgure that all subcarriers

show correlated variations in their time series when the user presses the keys. The subcarriers that

are closely spaced in frequency show identical variations whereas the subcarriers that are farther away

in frequency show non-identical changes. Despite non-identical changes, a strong correlation still

exists even across the subcarriers that are far apart in frequency. WiKey leverages this correlation

and calculates the principal components from all CSI time series. It then chooses those principal

components that represent the most common variations among all CSI time series.

[Figure 2.3: Correlated variations in subcarriers. Panels: (a) subcarriers # 1,2,3,4,5; (b) subcarriers # 5,10,15,20,25. Axes: Absolute Value vs. Sample.]

There are two main advantages of using PCA. First, PCA reduces the dimensionality of the CSI

information obtained from the 30 subcarriers in each TX-RX stream, which is useful because using

information from all subcarriers for keystroke extraction and recognition signiﬁcantly increases

the computational complexity of the scheme. Consequently, PCA automatically enables WiKey

to obtain the signals that are representative of hand and ﬁnger movements, without having to

devise new techniques and deﬁne new parameters for selecting appropriate subcarriers for further

processing. Second, PCA helps in removing noise from the signals by taking advantage of correlated

variations in CSI time series of different subcarriers. It removes the uncorrelated noisy components,

which cannot be removed through traditional low pass filtering. This PCA based noise reduction

is one of the major reasons behind high keystroke extraction and recognition accuracies of our

scheme.

2.5 Keystroke Extraction

WiKey segments the CSI time series to extract the CSI waveforms for individual keystrokes. For

this, WiKey operates on the CSI time series resulting from the Butterworth filtering. Let Ht,r(i) be an Sc × 1 dimensional vector containing the CSI values of the Sc subcarriers between an arbitrary TX-RX antenna pair t-r for the i-th CSI sample. Let Ht,r be an N × Sc dimensional matrix containing the CSI values of the Sc subcarriers between an arbitrary TX-RX antenna pair t-r for N consecutive CSI samples. This matrix is given by the following equation:

\[ H_{t,r} = [\, H_{t,r}(1) \,|\, H_{t,r}(2) \,|\, H_{t,r}(3) \,|\, \dots \,|\, H_{t,r}(N) \,]^{T} \tag{2.2} \]

The columns of the matrix Ht,r represent the CSI time series for each OFDM subcarrier. To

detect the starting and ending points of any arbitrary key, WiKey ﬁrst normalizes the Ht,r matrix

such that every CSI stream has zero mean and unit variance. We denote the normalized version

of Ht,r by Zt,r . WiKey then performs the PCA based dimensionality reduction and denoising (as

described in Section 2.4.2) on Zt,r and the resultant waveforms are further processed to detect

the starting and ending points of the keystrokes from this particular TX-RX antenna pair. WiKey

repeats this process on the CSI time series for all antenna pairs and obtains values for starting and

ending points for keys based on the CSI time series from each antenna pair one by one. Finally,

WiKey combines the starting and ending points obtained from all TX-RX antenna pairs to calculate

a robust estimate of starting and ending points of the time windows containing those keystrokes.

Next we explain these steps in more detail.

2.5.1 PCA on Normalized Stream

Let $\Phi_Z^{\{1:p\}}$ be an Sc × p dimensional matrix that contains the top p principal components obtained from PCA on Zt,r. We remove the first component from those top p principal components based on our observation that the first component captures the majority of the noise, while subsequent components contain information about movements of hands and fingers while typing. This happens because PCA ranks principal components in descending order of their variance, due to which the noisy components with higher variance get ranked among the top principal components. Due to

correlated nature of variations in multiple CSI time series, the removal of this PCA component

does not lead to any signiﬁcant information loss as remaining PCA components still contain enough

information required for successfully detecting starting and ending points of the keystrokes.

If we exclude the first component, the projection of the CSI stream Zt,r of the t-r transmit-receive antenna pair onto the remaining principal components $\Phi_Z^{\{2:p\}}$ can then be written as:

\[ Z_{t,r}^{\{2:p\}} = Z_{t,r} \times \Phi_Z^{\{2:p\}} \tag{2.3} \]

where $Z_{t,r}^{\{2:p\}}$ is an N × (p − 1) dimensional matrix containing the projected CSI streams in its columns. We choose p = 4 in our implementation based on our observation that only the top 4 principal components contained the most significant variations in CSI values caused by different keystrokes. Figure 2.4(a) shows the result of projecting the normalized CSI time series Zt,r onto its top 4 principal components. We observe from Figure 2.4(b) that by removing the first principal component, we essentially remove the noisiest of the 4 projections of Zt,r.
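The normalization and projection steps of Equations 2.2 and 2.3 can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions: the function name is hypothetical, and PCA is computed through an eigendecomposition of the covariance matrix, which is one standard way of obtaining principal components and not necessarily the routine used in our implementation.

```python
import numpy as np


def project_without_first_pc(H_tr, p=4):
    """H_tr: (N, S_c) CSI amplitudes of one TX-RX pair.

    Returns the N x (p - 1) matrix of projections onto principal components 2..p.
    """
    # Z-normalize every subcarrier stream: zero mean, unit variance.
    Z = (H_tr - H_tr.mean(axis=0)) / H_tr.std(axis=0)
    cov = np.cov(Z, rowvar=False)              # S_c x S_c covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-rank by descending variance
    Phi = eigvecs[:, order[:p]]                # S_c x p matrix of the top-p components
    return Z @ Phi[:, 1:]                      # discard component 1, keep components 2..p
```

With p = 4 the result has three columns, one per retained projection, ordered by decreasing variance.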

2.5.2 Keystroke Detection

Although existing DFAR schemes propose techniques to automatically detect the start and end of activities, they cannot be directly adapted for use in detecting the start and end of keystrokes. Existing schemes use simple threshold based algorithms for detecting the start and end of activities. While threshold based schemes work well for macro-movements, they are not well suited for micro-movements such as those of hands and fingers while typing, where we need to precisely segment time series of keystrokes that are closely spaced in time. Unlike general purpose threshold based algorithms, we propose a keystroke detection algorithm that provides better detection accuracy, since it is strictly based on the experimentally observed shapes of different keystroke waveforms.

[Figure 2.4: PCA of Z-normalized CSI stream Zt,r. Panels: (a) Top 4 projections (PCA 1 to PCA 4); (b) Projections 2, 3 & 4. Axes: Projected CSI values vs. Sample.]

The intuition behind our algorithm is that the CSI time series of every keystroke shows a typical increasing and decreasing trend in rates of change in CSI time series, similar to the one shown in Figure 2.2. To detect such increase and decrease in rates of change in CSI time series, our algorithm uses a moving window approach to detect the increasing and decreasing trends in rates of change in all p − 1 time series for each transmit-receive antenna pair, i.e., on each column of $Z_{t,r}^{\{2:p\}}$. Our algorithm detects the starting and ending points of keystrokes in the following six steps.

First, the algorithm calculates the mean absolute deviation (MAD) for each of the p − 1 time series for each window of size W at the j-th iteration. This is done primarily to detect the extent of variations in the values of a given time series. The main reason behind choosing MAD instead of variance is that in calculating variance, the deviations from the mean are squared, which gives more weight to extreme values. In cases where a time series contains outliers, this results in undue weight given to those outlying values, which significantly corrupts the measure of deviation. The MAD is calculated using the following equation.

\[ \Delta m_j[k] = \frac{\sum_{i=j}^{j+W} \left| Z_{t,r}^{\{k\}}(i) - \overline{Z_{t,r}^{\{k\}}(j : j+W)} \right|}{W} \tag{2.4} \]

where $\overline{Z_{t,r}^{\{k\}}(j : j+W)}$ represents the mean of the k-th projected CSI stream in the j-th window. The algorithm calculates the value of Δmj for each sample point j and for the principal components 2 ≤ k ≤ p.
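The motivation for preferring MAD over variance can be checked on a toy window. The numbers below are made up for illustration and are not CSI measurements; the helper functions divide by the window length, which is a simplification of Equation 2.4.

```python
def mean_abs_deviation(window):
    mu = sum(window) / len(window)
    return sum(abs(x - mu) for x in window) / len(window)


def variance(window):
    mu = sum(window) / len(window)
    return sum((x - mu) ** 2 for x in window) / len(window)


clean = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.2, 0.8]   # illustrative values only
spiky = clean[:-1] + [9.0]                          # same window with one outlier

# How much each deviation measure is inflated by the single outlier.
mad_ratio = mean_abs_deviation(spiky) / mean_abs_deviation(clean)
var_ratio = variance(spiky) / variance(clean)
```

In this toy case the outlier inflates the variance by a much larger factor than it inflates the MAD, which is the behavior the argument above relies on.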

Second, the algorithm adds the mean absolute deviations in each waveform to calculate a combined measure ΔMj of MAD in all p − 1 waveforms, which is calculated in the following equation.

\[ \Delta M_j = \sum_{k=2}^{p} \Delta m_j[k] \tag{2.5} \]

Third, the algorithm compares ΔMj to a heuristically set threshold Thresh. Let δj = ΔMj − Thresh; then δj > 0 shows that the current window j contains significant variations in CSI amplitudes.

Fourth, the algorithm compares δj to its value in the last window, δj−1, to detect an increasing or decreasing trend in detected variations. When δj − δj−1 > 0, there is an increasing trend in the rate of change in combined MAD (ΔMj) of CSI time series and vice versa. These increasing and decreasing trends are captured in variables iu and du, respectively. The algorithm increments the value of iu by 1 whenever δj − δj−1 > 0 and du by 1 whenever δj − δj−1 < 0. Let σ represent the forgetting factor, which is used to “forget” the variations caused by noise to avoid false positives. To forget such variations, the algorithm decrements both iu and du by 1 if ΔMj < Thresh for a duration of σW.

Fifth, as soon as the values of iu and du exceed empirically determined thresholds Iu and Du, respectively, the algorithm detects the start of the keystroke. As soon as the algorithm detects a keystroke, it estimates the starting point sm and ending point em of the keystroke waveforms using the following equations.

\[ s_m = j - \beta W - B_{left} \tag{2.6} \]

\[ e_m = j - \beta W + t_{avg} + B_{right} \tag{2.7} \]

where tavg is the average number of data points spanned by waveforms of different keystrokes, β is the span factor which determines the estimated starting point of the keystroke, and Bleft and Bright are guard intervals on both sides of the estimated keystroke interval. The guard intervals ensure that the detected keystroke waveforms are complete.

Last, our algorithm calculates the sum of powers in all waveforms lying within those starting and ending points and then compares this combined power with a sum power threshold (Pavg) to confirm the presence of a complete keystroke within that interval. This ensures that the training models are built using only those waveforms which contain complete shapes of the keystrokes. Once keystroke detection is confirmed, the algorithm finally returns the starting point (sm) of the detected keystroke and jumps Δtavg data points ahead of sm to look for the next keystroke, where Δtavg is the average number of data points between the arrival of two consecutive keystrokes. From the CSI data set we collected from our volunteers, we observed that on average the waveforms of a keystroke spanned tavg ≈ 650 data points and the average number of data points between the arrival of two consecutive keystrokes was Δtavg ≈ 1250 data points at the CSI sampling rate of Fs = 2500 samples/s. We empirically determined appropriate values for the remaining constants including W, Du, Iu, σ, β, Bleft, Bright, Thresh and Pavg, as described in Algorithm 1.

Algorithm 1: Keystroke Detection from single TX-RX stream

    initialize S_{t,r}               /* Start indices of keystrokes */
    initialize W = 200               /* Window size */
    initialize Thresh = 0.8          /* Threshold */
    initialize P_avg = 0.35          /* Sum avg. power threshold */
    initialize i_u                   /* Duration of increasing trend */
    initialize d_u                   /* Duration of decreasing trend */
    initialize I_u = 100             /* Avg. duration of increasing trend */
    initialize D_u = 100             /* Avg. duration of decreasing trend */
    initialize C                     /* Counts repetitive insignificant changes */
    initialize K_start               /* Flags possible start of a keystroke */
    initialize j = 0                 /* Iteration count */
    initialize B_left = W
    initialize B_right = t_avg + W
    initialize σ = 0.1               /* Forgetting factor */
    initialize β = 1.5               /* Keystroke span factor */
    while current window is within {1 : length(Z^{2:p}_{t,r})} do
        Δm_j[k] ← ( Σ_{i=j}^{j+W} | Z^{k}_{t,r}(i) − mean(Z^{k}_{t,r}(j : j+W)) | ) / W
        ΔM_j ← Σ_{k=2}^{p} Δm_j[k]
        if (j = 0) then
            δ_j ← ΔM_j − Thresh
        else
            if δ_j ≥ 0 then
                K_start ← 1
                if (δ_j − δ_{j−1}) < 0 then d_u = d_u + 1
                else i_u = i_u + 1
                end if
                if (d_u > D_u) & (i_u > I_u) then
                    O ← Z^{2:p}_{t,r}( j − floor(βW) − B_left : j + floor(βW) + B_right )
                    p ← sum(mean(O^2))
                    if p > P_avg then
                        S_{t,r} ← S_{t,r} ∪ ( j − floor(βW) − B_left )
                        i_u ← 0, d_u ← 0
                        j ← j + Δt_avg
                    end if
                end if
            else
                /* Removing spurious counts of i_u and d_u if insignificant changes detected in waveforms */
                if (K_start = 0) then
                    C ← C + 1
                    if (C > σW) then
                        if i_u > 0 then i_u ← i_u − 1
                        end if
                        if d_u > 0 then d_u ← d_u − 1
                        end if
                    end if
                else
                    K_start ← 0
                end if
            end if
            δ_j ← ΔM_j − Thresh
        end if
        j ← j + 1
    end while
    Return S_{t,r}
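For readers who prefer executable code, the loop of Algorithm 1 can be approximated in Python as follows. This is a sketch under several assumptions that are not stated in the algorithm: δj is computed at the top of each iteration and compared with the value from the previous iteration, window indices are clamped to the array bounds, the loop stops when a full window no longer fits, and all names are hypothetical. The default constants are the ones listed in Algorithm 1 and in the text.

```python
import numpy as np


def detect_keystroke_starts(Z, W=200, thresh=0.8, p_avg=0.35, I_u=100, D_u=100,
                            sigma=0.1, beta=1.5, t_avg=650, dt_avg=1250):
    """Z: (N, p-1) projected CSI streams of one TX-RX pair. Returns candidate start indices."""
    B_left, B_right = W, t_avg + W                 # guard intervals
    span = int(np.floor(beta * W))
    starts, i_u, d_u, C, k_start = [], 0, 0, 0, 0
    prev_delta = None
    j = 0
    while j + W <= len(Z):                         # assumed stopping rule
        win = Z[j:j + W]
        # Eq. 2.4 and 2.5: MAD of each stream in the window, summed over streams.
        delta = np.mean(np.abs(win - win.mean(axis=0)), axis=0).sum() - thresh
        if prev_delta is not None:
            if delta >= 0:
                k_start = 1
                if delta - prev_delta < 0:
                    d_u += 1                       # decreasing trend
                else:
                    i_u += 1                       # increasing trend
                if d_u > D_u and i_u > I_u:
                    lo = max(j - span - B_left, 0)
                    hi = min(j + span + B_right, len(Z))
                    power = np.mean(Z[lo:hi] ** 2, axis=0).sum()
                    if power > p_avg:              # confirm a complete keystroke
                        starts.append(lo)
                        i_u, d_u = 0, 0
                        j += dt_avg                # skip ahead to the next expected keystroke
            elif k_start == 0:
                C += 1
                if C > sigma * W:                  # forget spurious counts
                    i_u, d_u = max(i_u - 1, 0), max(d_u - 1, 0)
            else:
                k_start = 0
        prev_delta = delta
        j += 1
    return starts
```

The function returns one start index per confirmed keystroke for a single TX-RX stream; it would be called once per antenna pair before the results are combined.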

2.5.3 Combining Results from Antenna Pairs

As mentioned earlier, we obtain the starting points of keystrokes independently from each TX-

RX antenna pair. Let St,r represent the set containing the starting points of all keystrokes obtained

from the keystroke detection algorithm applied on the antenna pair t − r. First, we obtain the set
St,r for each t − r pair. Second, we take the average of all the starting points that are within △tavg of
each other in all sets St,r to obtain a robust estimate of starting points of keystrokes. Third, based on

experimentally measured average span tavg of diﬀerent keystrokes, we calculate the ending points

of all keystrokes by simply adding tavg to the corresponding starting point.
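The three steps above can be sketched as follows. The grouping rule used here (a start point joins a group if it lies within Δtavg of the earliest point in that group) and the rounding of the average are assumptions made for the example; the function name is hypothetical.

```python
def combine_start_points(per_pair_starts, dt_avg=1250, t_avg=650):
    """per_pair_starts: one list of start indices per TX-RX antenna pair.

    Returns (start, end) windows, averaging start points that lie within dt_avg of each other.
    """
    all_starts = sorted(s for starts in per_pair_starts for s in starts)
    groups = []
    for s in all_starts:
        if groups and s - groups[-1][0] < dt_avg:
            groups[-1].append(s)       # same keystroke seen by another antenna pair
        else:
            groups.append([s])         # first sighting of a new keystroke
    windows = []
    for g in groups:
        start = round(sum(g) / len(g))
        windows.append((start, start + t_avg))
    return windows
```

For example, start points 980, 1000 and 1040 reported by three antenna pairs collapse into a single window that begins at their average.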

2.5.4 Extracting Keystroke Waveforms

Once the algorithm calculates the set of starting and corresponding ending points for keystrokes,

we use those points to extract the waveforms from the CSI matrix Ht,r. Let Km,t,r represent the CSI waveform of the m-th keystroke extracted from the antenna pair t-r. Let sm represent the average of the starting points for the m-th keystroke from all antenna pairs. We can express Km,t,r in terms of Ht,r as follows.

\[ K_{m,t,r} = H_{t,r}(s_m : s_m + t_{avg}) \tag{2.8} \]

After extracting the CSI waveforms Km,t,r from all subcarriers of the t-r antenna pair, we apply

PCA on those CSI waveforms to remove the noisy components and obtain the components that

represent the variations caused by movements of hands and ﬁngers.

Unlike principal components derived from normalized streams, it is difficult to decide which

PCA component represents noise and should be removed from the top p principal components for

the case of Km,t,r . The diﬃculty arises because Km,t,r contains the set of waveforms for a speciﬁc

keystroke instead of the whole CSI stream, due to which the variance of noisy component often

becomes small. We observe that the noisy PCA component keeps changing positions between 1st

and 2nd place among the sorted PCA components for diﬀerent extracted keystroke waveforms. In

order to get rid of this problem, we first project Km,t,r onto all top q principal components. Let $\Phi_K^{\{1:q\}}$ be an Sc × q dimensional matrix that represents the top q principal components in Km,t,r obtained after applying PCA and $K_{m,t,r}^{\{1:q\}}$ be an L × q dimensional matrix containing the projected CSI streams in its columns, where L is the length of the segmented keystroke waveform. Thus, $K_{m,t,r}^{\{1:q\}}$ is given by the following equation.

\[ K_{m,t,r}^{\{1:q\}} = K_{m,t,r} \times \Phi_K^{\{1:q\}} \tag{2.9} \]

In our implementation, we choose q = 4. This choice is again based on the observation that the

top 4 principal components contain enough information about keystrokes required to achieve high

accuracy during classiﬁcation.

To detect which waveform in $K_{m,t,r}^{\{1:q\}}$ represents the noisy projection, we choose the top 2 projected waveforms, divide each of them into R bins, and calculate the variances in those bins. We then compare the variances calculated for different bins of one waveform with the corresponding bins of the other waveform. The waveform that has the larger number of higher variance bins is considered to be the noisy projection, which we remove from $K_{m,t,r}^{\{1:q\}}$ to finally get q − 1 waveforms. Here we leverage the fact that although the overall variance of a noisy projection may be smaller than the variance of other waveforms, if the waveform is divided into an appropriate number of smaller bins then the number of bins in which the variance of the noisy projection is higher than the corresponding bins of other waveforms is always larger. This is because the impact of noise is more dominant in smaller time windows compared to larger time windows. We used R = 10 in our implementation of WiKey.
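The bin-wise comparison can be sketched as below. The function name, the equal-width bins, and the tie handling (a tie counts for the second waveform) are assumptions made for the example.

```python
from statistics import pvariance


def noisier_of_two(w1, w2, R=10):
    """Return 0 if w1 looks like the noisy projection, 1 if w2 does.

    Both waveforms are split into R equal bins; the waveform with the higher
    variance in the majority of bins is treated as the noisy one.
    """
    size = len(w1) // R
    wins_1 = 0
    for b in range(R):
        seg1 = w1[b * size:(b + 1) * size]
        seg2 = w2[b * size:(b + 1) * size]
        if pvariance(seg1) > pvariance(seg2):
            wins_1 += 1
    return 0 if wins_1 > R - wins_1 else 1
```

A slowly varying waveform can have the larger overall variance and still lose in every bin to a low-amplitude but rapidly fluctuating one, which is the situation the rule is meant to catch.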

PCA can lead to diﬀerent ordering of principal components in waveforms of diﬀerent keystrokes

of the same key, because the ordering of waveforms done by PCA is based solely on the value of their

variance, which can change even if a key is pressed in a slightly diﬀerent way. This is problematic

because to recognize the keystrokes, we need to compare the projections of an unseen key with the

corresponding projections of the keys in the training data. In order to minimize the possibility of

reordering, we order the projected keystroke waveforms in descending order of their peak to peak

values before using the waveforms for feature extraction and classiﬁer training.
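The reordering step amounts to a sort on the peak-to-peak value. A minimal sketch, with a hypothetical function name:

```python
def order_by_peak_to_peak(waveforms):
    """waveforms: list of equal-length sequences (the projected waveforms of one keystroke).

    Returns them sorted by descending peak-to-peak amplitude.
    """
    return sorted(waveforms, key=lambda w: max(w) - min(w), reverse=True)
```

Because the ordering depends on the waveform shape rather than on the PCA rank, two presses of the same key are more likely to present their projections in the same order.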

2.6 Feature Extraction

To diﬀerentiate between keystrokes, we need to extract features that can uniquely represent those

keystrokes. As diﬀerent keys on a keyboard are closely placed, standard features such as maximum

peak power, mean amplitude, root mean square deviation of signal amplitude, second/third central

moments, rate of change, signal energy or entropy, and number of zero crossings cannot be used

because adjacent keys give almost the same values for these features. Tables 2.1, 2.2, 2.3 and 2.4

show means and variances of some of these features calculated for 2nd waveform in the extracted

CSI-waveforms for keystrokes of alphabetic keys pressed by a user. It can be observed that the

values of these features for diﬀerent keys (for example ‘c’ and ‘d’) come out to be very similar.

Looking at the means calculated for features like energy and number of zero crossings in Tables 2.1

and 2.2, it seems that they have diﬀerent values for diﬀerent keys. But as we observe from Tables

2.3 and 2.4, the variance of those features is high. Due to the reasons above, it becomes infeasible

to use these features for keystroke classiﬁcation. Frequency analysis is also not feasible because

the frequency components in keystrokes of many different keys are similar. Another reason behind

the inapplicability of frequency domain analysis is that it leads to complete loss of time domain

information.
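To make the argument concrete, the following sketch compares the "Energy" feature of keys 'c' and 'd' using the means from Table 2.1 and the variances from Table 2.3. The separability test used here (the gap between the means must exceed the sum of the two standard deviations) is an informal rule of thumb chosen for illustration, not a criterion used by WiKey.

```python
from math import sqrt

# (mean, variance) of the Energy feature, copied from Tables 2.1 and 2.3.
energy = {"c": (69.788, 816.91), "d": (73.34, 912.4)}

gap = abs(energy["c"][0] - energy["d"][0])              # distance between the two means
spread = sqrt(energy["c"][1]) + sqrt(energy["d"][1])    # sum of the standard deviations
separable = gap > spread                                # illustrative rule of thumb only
```

The gap between the two means is far smaller than the spread of either key, so this feature alone cannot tell the two keys apart.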

From our data set, we have observed that although the frequency components in most keys

are similar, they occur at diﬀerent time instants for diﬀerent keys. Therefore, we use shapes of the

extracted keystroke waveforms as their features because the shapes retain both time and frequency

domain information of the waveforms and are thus more suited for use in classiﬁcation. We observed

experimentally that the shapes of different keystroke waveforms were quite different from each other, as shown by Figure 2.5(a) and 2.5(b).

Table 2.1: Average values of features extracted from keystrokes of keys collected from user 10

Features                a        b        c        d        e        f        g        h        i        j        k        l        m
Mean amplitude          -0       -0.04    0.0124   -0.03    0.045    -0.043   -0.076   -0.06    0.014    -0.03    0.03     -0.01    -0
Second central moment   0.08     0.133    0.0801   0.083    0.156    0.1818   0.6523   0.263    0.12     0.231    0.33     0.11     0.1
Third central moment    0.02     -0.03    0.0036   -0.01    0.029    -0.06    -0.919   -0.05    -0.01    -0.1     0.05     0.02     -0
RMS deviation           0.27     0.359    0.2782   0.285    0.385    0.4244   0.7899   0.506    0.332    0.472    0.57     0.32     0.3
Energy                  71.5     116.6    69.788   73.34    137.5    159.43   570.8    232.1    104.8    201.4    288      95.2     83.7
Entropy                 9.76     9.762    9.7616   9.762    9.762    9.7616   9.7616   9.762    9.762    9.762    9.76     9.76     9.76
Zero crossings          11.8     6.913    12.363   6.225    6.4      4.375    4.075    3.4      12.08    9.088    6.05     13.7     10

Table 2.2: Average values of features extracted from keystrokes of keys collected from user 10

Features                n        o        p        q        r        s        t        u        v        w        x        y        z
Mean amplitude          0.032    0.02     0.03     -0.012   0.008    0.054    7E-04    -0.013   -0.02    -0       -0.1     -0.02    0.06
Second central moment   0.108    0.09     0.19     0.1022   0.051    0.245    0.192    0.062    0.12     0.097    0.26     0.09     0.21
Third central moment    0.01     0.01     0.04     -0.006   0.003    0.098    -0.101   -0.01    0.029    0.023    -0       -0.02    0.04
RMS deviation           0.323    0.29     0.43     0.3137   0.222    0.472    0.434    0.242    0.335    0.306    0.5      0.3      0.45
Energy                  94.98    75.6     167      88.928   44.22    215.5    167.1    54.56    104.4    84.48    227      81.5     182
Entropy                 9.762    9.76     9.76     9.7616   9.762    9.762    9.762    9.762    9.762    9.762    9.76     9.76     9.76
Zero crossings          9.063    13.8     12.9     11.85    15.41    6.35     12.85    16.75    11.88    14.3     6.48     10.1     7.55

Table 2.3: Variance of different features extracted from keystrokes of keys collected from user 10

Features                a        b        c        d        e        f        g        h        i        j        k        l        m
Mean amplitude          0.00029  4E-04    0.0003   1E-04    4E-04    0.0002   0.0003   8E-04    5E-04    5E-04    5E-04    2E-04    5E-04
Second central moment   0.00513  0.003    0.0011   0.001    0.007    0.0028   0.1008   0.012    0.005    0.009    0.016    0.006    0.002
Third central moment    0.00155  9E-04    0.0001   2E-04    0.002    0.0033   0.7021   0.002    0.001    0.007    0.009    0.017    5E-04
RMS deviation           0.0108   0.006    0.0031   0.004    0.011    0.0038   0.0348   0.011    0.011    0.01     0.012    0.009    0.006
Energy                  3874.59  2283     816.91   912.4    5204     2160     76863    9315     3925     6846     12094    4883     1679
Entropy                 0        0        0        0        0        0        0        0        0        0        0        0        0
Zero crossings          26.3859  12.36    33.196   12.94    9.433    6.2627   3.9943   3.585    44.91    13.14    12.58    31.14    28.51

Directly using the extracted keystroke waveforms as keystroke features leads to high com-

putational costs in the classiﬁcation process because waveforms contain hundreds of data points

per keystroke. Therefore, we apply Discrete Wavelet Transform (DWT) to compress the extracted

keystroke waveforms while preserving most of the time and frequency domain information.

The DWT of a discrete signal y[n] can be written in terms of wavelet basis functions as:

$$
y[n] = \frac{1}{\sqrt{L}} \sum_{k} \lambda(j_0, k)\, \phi_{j_0,k}(n) + \frac{1}{\sqrt{L}} \sum_{j=j_0}^{\infty} \sum_{k} \gamma(j, k)\, \psi_{j,k}(n)
$$

where $L$ represents the length of the signal $y[n]$. The functions $\phi_{j,k}(n)$ are called scaling
functions, whereas the corresponding coefficients $\lambda(j, k)$ are known as scaling or approximation
coefficients. Similarly, the functions $\psi_{j,k}(n)$ are known as wavelet functions and the corresponding
coefficients $\gamma(j, k)$ are known as wavelet or detail coefficients. To calculate the approximation and
detail coefficients, the scaling and wavelet functions are chosen such that they are orthonormal to
each other. Thus, the following condition holds:

$$
\langle \psi_{j,k}(n),\, \phi_{j_0,m}(n) \rangle = \delta_{j,j_0}\, \delta_{k,m}
$$

Figure 2.5: Feature extraction from 2nd keystroke waveforms extracted from TX-1, RX-1 for I & O
Panels: (a) Keystroke waveforms for key i; (b) Keystroke waveforms for key o; (c) DWT features of i;
(d) DWT features of o. Axes: Value versus Sample in (a) and (b); Approximation Coefficients versus
temporal units in (c) and (d).

Based on the condition above, the approximation and detail coeﬃcients calculated for j-th scale

can then be written as:

$$
\lambda(j, k) = \langle y[n],\, \phi_{j+1,k}(n) \rangle = \frac{1}{\sqrt{L}} \sum_{n} y[n]\, \phi_{j+1,k}(n)
$$

$$
\gamma(j, k) = \langle y[n],\, \psi_{j+1,k}(n) \rangle = \frac{1}{\sqrt{L}} \sum_{n} y[n]\, \psi_{j+1,k}(n)
$$

Table 2.4: Variance of different features extracted from keystrokes of keys collected from user 10

Key  Mean amplitude  Second central moment  Third central moment  RMS deviation  Energy    Entropy  Zero crossings
n    3E-04           0.003                  3E-04                 0.005          2153      0        15.25
o    3E-04           0.002                  5E-04                 0.004          1150      0        29.24
p    5E-04           0.007                  0.003                 0.008          5048      0        24.09
q    0.0003          0.0017                 0.0003                0.0042         1296.2    0        21.673
r    0.00018         0.00041                7.70E-05              0.00196        308.95    0        17.3847
s    6E-04           0.03                   0.024                 0.026          23201     0        12.21
t    4E-04           0.003                  0.003                 0.004          2181      0        17.7
u    3E-04           0.001                  1E-04                 0.004          886.9     0        36.85
v    1E-04           0.006                  0.015                 0.008          4714      0        21.63
w    3E-04           0.002                  9E-04                 0.004          1403      0        27.71
x    4E-04           0.007                  0.001                 0.007          5166      0        13.67
y    6E-04           0.003                  6E-04                 0.007          2100      0        31.49
z    4E-04           0.005                  0.003                 0.007          4147      0        6.529

To achieve desired compression using DWT, we need to select appropriate wavelet and scaling

ﬁlters. We tested the accuracy of our classiﬁer using two diﬀerent wavelet ﬁlters: Daubechies

and Symlets. We choose Daubechies D4 (four coeﬃcients per ﬁlter) wavelet and scaling ﬁlters

because the models trained with the DWT features extracted using these ﬁlters achieved higher

classiﬁcation accuracy. For each keystroke, we perform DWT 3 times on each one of its (q − 1) = 3
waveforms, which is achieved by applying DWT on the approximation coeﬃcients obtained from

the previous steps. We choose to apply DWT 3 times because this preserves enough details of those

waveforms required for successful classiﬁcation while achieving maximum compression. WiKey

uses only the approximation coeﬃcients as keystroke features and discards the detail coeﬃcients

because approximation coeﬃcients alone result in good classiﬁcation accuracy. Therefore, we

have 3 × MT × MR keystroke shapes for every keystroke, i.e. the approximation coeﬃcients of all 3
waveforms extracted from the CSI time series in each TX-RX antenna pair, where MT is the number

of transmitting antennas and MR is the number of receiving antennas. Figures 2.5(a) through 2.5(d)

show feature extraction procedure performed on the 2nd keystroke waveforms for keys ‘i’ and ‘o’,

extracted from TX-1, RX-1 antenna pair.
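The three-level compression described above can be sketched as follows. This is a minimal illustration rather than WiKey's actual implementation: it assumes the standard filter-bank form of the Daubechies D4 scaling filter, periodic extension at the waveform boundary, and a waveform length divisible by 8. The names `d4_approximation` and `compress_waveform` and the synthetic 800-sample waveform are hypothetical.

```python
import math

# Daubechies D4 low-pass (scaling) filter: four coefficients that sum to sqrt(2).
_SQRT3 = math.sqrt(3.0)
_NORM = 4.0 * math.sqrt(2.0)
D4_LOWPASS = (
    (1.0 + _SQRT3) / _NORM,
    (3.0 + _SQRT3) / _NORM,
    (3.0 - _SQRT3) / _NORM,
    (1.0 - _SQRT3) / _NORM,
)


def d4_approximation(signal):
    """One DWT level: low-pass filter, then keep every second output sample.

    Assumption: the signal is extended periodically at its boundary.
    """
    n = len(signal)
    return [
        sum(h * signal[(2 * k + i) % n] for i, h in enumerate(D4_LOWPASS))
        for k in range(n // 2)
    ]


def compress_waveform(waveform, levels=3):
    """Apply the DWT `levels` times, feeding the approximation coefficients forward.

    The detail coefficients are never computed because only the
    approximation coefficients are kept as shape features.
    """
    coeffs = list(waveform)
    for _ in range(levels):
        coeffs = d4_approximation(coeffs)
    return coeffs


# Hypothetical 800-sample waveform: a slow oscillation standing in for a keystroke.
waveform = [math.sin(2.0 * math.pi * t / 400.0) for t in range(800)]
features = compress_waveform(waveform)
print(len(features))  # 100
```

Each level halves the number of samples, so three levels reduce a waveform by a factor of 8, which is what makes the subsequent distance computations cheaper.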

2.7 Classiﬁcation

After obtaining DWT based shape features of keystrokes, WiKey builds training models for

classiﬁcation using them. As WiKey needs to compare shape features of diﬀerent keystrokes,

we need a comparison metric that provides an eﬀective measure of similarity between shape

features of two keystrokes. WiKey uses a well-known method called dynamic time warping (DTW)

that calculates the distance between waveforms by performing optimal alignment between them.

Using DTW distance as the comparison metric between keystroke shape features, WiKey trains an

ensemble of k-nearest neighbour (kNN) classiﬁers using those features from all TX-RX antenna

pairs. WiKey obtains decisions from each classiﬁer in the ensemble and uses majority voting to

obtain the final result. Next, we first explain how we apply DTW on the keystroke shape features and

then explain how we train the ensemble of classiﬁers.

2.7.1 Dynamic Time Warping

DTW is a dynamic programming based solution for obtaining minimum distance alignment

between any two waveforms. DTW can handle waveforms of diﬀerent lengths and allows a non-

linear mapping of one waveform to another by minimizing the distance between the two. In contrast

to Euclidean distance, DTW gives us intuitive distance between two waveforms by determining

minimum distance warping path between them even if they are distorted or shifted versions of each

other. DTW distance is the Euclidean distance of the optimal warping path between two waveforms

calculated under boundary conditions and local path constraints [92]. In our experiments, DTW

distance proves to be a very effective metric for comparing the shape features of different keystrokes.

WiKey uses the open source implementation of DTW in the Machine Learning Toolbox (MLT) by

Jang [59]. WiKey uses local path constraints of 27, 45, and 63 degrees while determining minimum

cost warping path between two waveforms. For the features extracted for keys ‘i’ and ‘o’ shown

in ﬁgures 2.5(a) and 2.5(b), the DTW distance among features of key ‘i’ was 18.79 and the DTW

distance among features of key ‘o’ was 19.44. However, the average DTW distance between features

of these keys was 44.2.
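To make the dynamic programming formulation concrete, the following is a minimal sketch of a DTW distance. It is not the MLT implementation that WiKey uses. By default it applies the classic step pattern (horizontal, vertical and diagonal moves); the alternative `SLOPE_LIMITED_STEPS` pattern is only our assumed reading of the 27, 45 and 63 degree local path constraints, as warping slopes of 1/2, 1 and 2. All names are hypothetical.

```python
import math

# Allowed predecessor offsets (di, dj) for a cell (i, j) of the cost table.
CLASSIC_STEPS = ((1, 0), (0, 1), (1, 1))
# Assumed interpretation of 27/45/63 degree constraints: slopes 1/2, 1 and 2.
SLOPE_LIMITED_STEPS = ((1, 1), (1, 2), (2, 1))


def dtw_distance(a, b, steps=CLASSIC_STEPS):
    """Accumulated cost of the cheapest warping path aligning sequences a and b.

    The local cost is the absolute difference of two samples, which equals
    the Euclidean distance for one-dimensional samples. The path must start
    at the first samples and end at the last samples (boundary conditions).
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(
                (cost[i - di][j - dj] for di, dj in steps if i >= di and j >= dj),
                default=math.inf,
            )
            if best < math.inf:
                cost[i][j] = abs(a[i - 1] - b[j - 1]) + best
    return cost[n][m]


# A delayed copy of a waveform has zero DTW distance under the classic steps.
print(dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 3]))  # 0.0
```

Under the slope-limited pattern, two waveforms whose lengths differ by more than a factor of two cannot be aligned at all, in which case the function returns infinity.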

2.7.2 Classiﬁer Training

To maximize the advantage of having multiple shape features per keystroke obtained from

multiple transmit-receive antenna pairs, we build separate classiﬁers for each of those shape features.

We build an ensemble of 3 × MT × MR classifiers using the kNN classification scheme. WiKey requires
the user to provide training data for the keystrokes to be recognized, and each classifier is trained
using the corresponding features extracted from the CSI time series from all TX-RX antenna pairs. To

classify a detected keystroke, WiKey feeds the shape features of that keystroke to their corresponding

kNN classiﬁers and obtains a decision from each classiﬁer in the ensemble. Each kNN classiﬁer

searches for the majority class label among k nearest neighbors of the corresponding shape feature

using DTW distance metric. WiKey calculates the ﬁnal result through majority voting on the

decisions of all kNN classiﬁers in the ensemble.
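The ensemble decision described above can be sketched as follows. This is an illustrative example under stated assumptions, not WiKey's code: the compact `dtw` function is a classic DTW stand-in for the distance metric, the three-sample features are made up, and ties in either vote are broken in favor of the label encountered first.

```python
import math
from collections import Counter


def dtw(a, b):
    """Compact classic DTW distance (assumed stand-in for WiKey's DTW metric)."""
    prev = [0.0] + [math.inf] * len(b)
    for x in a:
        cur = [math.inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]


def knn_predict(training, query, k=5, distance=dtw):
    """Return the majority label among the k training features nearest to query."""
    nearest = sorted(training, key=lambda pair: distance(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def ensemble_predict(training_sets, query_features, k=5):
    """One kNN classifier per shape feature, then a majority vote over decisions."""
    votes = [
        knn_predict(training, query, k)
        for training, query in zip(training_sets, query_features)
    ]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy features for two keys and an ensemble of three classifiers.
i_like = [[0.0, 1.0, 0.0], [0.0, 1.1, 0.0], [0.0, 0.9, 0.0]]
o_like = [[1.0, 0.0, 1.0], [1.1, 0.0, 1.0], [0.9, 0.0, 1.0]]
training = [(f, "i") for f in i_like] + [(f, "o") for f in o_like]
training_sets = [training, training, training]
query_features = [[0.0, 1.0, 0.05], [0.05, 1.0, 0.0], [1.0, 0.1, 1.0]]
print(ensemble_predict(training_sets, query_features, k=3))  # i
```

In the example, two of the three classifiers vote for 'i' and one votes for 'o', so the majority vote returns 'i'.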

2.7.3 Behavioral Clustering of User Data

In this section, we explain how we use the relative consistency of multiple training samples to

reduce the computational complexity of WiKey’s classiﬁcation process in real-world experimental

scenarios, where users type sentences, consisting of multiple diﬀerent types of keystrokes. We

observed from our experiments that every user has several distinct styles of pressing each key.

We also observed that these distinct styles of pressing the keys remained consistent over time for

all users. To utilize this consistency during classiﬁcation, WiKey applies hierarchical clustering

with Ward’s linkage [149] on the training samples. WiKey uses DTW distance as comparison

metric during this behavioral clustering step. It also provides the expected number of clusters C to
the hierarchical clustering algorithm. After dividing the training samples into C clusters, WiKey
randomly picks P percent of the samples from each cluster to train the kNN classifiers. Although
behavioral clustering increases the classification speed, it slightly reduces the overall accuracy.
We discuss this trade-off

between speed and accuracy in Section 2.8.5 when evaluating WiKey’s performance on recognizing

sentences in real world typing scenario.
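The clustering and sampling steps can be sketched as follows. This is a minimal illustration, not WiKey's implementation: it runs agglomerative clustering with Ward's linkage on a precomputed distance matrix using the Lance-Williams update, and it assumes that at least one sample is kept from every cluster. The function names and the scalar toy data (absolute differences standing in for DTW distances) are hypothetical.

```python
import random


def ward_clusters(dist, num_clusters):
    """Agglomerative clustering with Ward's linkage on a distance matrix.

    dist is a symmetric matrix of pairwise distances (e.g. DTW distances).
    Returns a list of clusters, each a list of sample indices.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # Squared distances between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > max(1, num_clusters):
        a, b = min(d2, key=d2.get)
        na, nb = len(members[a]), len(members[b])
        merged = members.pop(a) + members.pop(b)
        dab = d2.pop((a, b))
        for c in members:
            nc = len(members[c])
            dac = d2.pop((min(a, c), max(a, c)))
            dbc = d2.pop((min(b, c), max(b, c)))
            # Lance-Williams update for Ward's linkage (on squared distances).
            d2[(c, next_id)] = ((na + nc) * dac + (nb + nc) * dbc - nc * dab) / (na + nb + nc)
        members[next_id] = merged
        next_id += 1
    return list(members.values())


def sample_from_clusters(clusters, percent, seed=0):
    """Randomly keep `percent` of each cluster (assumption: at least one sample)."""
    rng = random.Random(seed)
    chosen = []
    for cluster in clusters:
        count = min(len(cluster), max(1, round(len(cluster) * percent / 100)))
        chosen.extend(rng.sample(cluster, count))
    return sorted(chosen)


# Hypothetical toy data: scalars stand in for keystroke samples.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
dist = [[abs(p - q) for q in points] for p in points]
clusters = ward_clusters(dist, num_clusters=2)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 2], [3, 4, 5]]
print(len(sample_from_clusters(clusters, percent=70)))  # 4
```

Ward's linkage is formally defined for Euclidean distances, so applying it to DTW distances, as in this sketch, should be read as a heuristic.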

2.8 Implementation & Evaluation

2.8.1 Hardware Setup

We implemented our scheme using off-the-shelf hardware devices. Specifically, we use a Lenovo
X200 laptop with an Intel Link 5300 WiFi NIC, connected to the three antennas of the laptop, as the
receiver. The X200 laptop has a 2.26 GHz Intel Core 2 Duo processor with 4 GB of RAM and runs
Ubuntu 14.04 as its operating system. We use a TP-Link TL-WR1043ND WiFi router as the transmitter,
operating in 802.11n AP mode at 2.4 GHz. We collect the CSI values from the Intel 5300 NIC using
a modified driver developed by Halperin et al. [47]. The transmitter has 2 antennas and the receiver
has 3 antennas, i.e., MT = 2 and MR = 3. This gives 3 × 2 × 3 = 18 classification models for each
key in our evaluations.

We place the X200 laptop at a distance of 30 cm from the keyboard such that the back side
of its screen faces the keyboard on which the users type and its screen is within the line-of-sight
(LOS) of the WiFi router it is connected to. The distance of the WiFi router from the target keyboard
is 4 meters. The CSI values are measured on ICMP ping packets sent from the WiFi router, i.e., the
TP-Link TL-WR1043ND, to the laptop at a high rate of about 2500 packets/s. Setting a higher
ping frequency leads to a higher sampling rate of CSI, which ensures that the time resolution of the
CSI values is high enough to capture the details of different types of keystrokes.

2.8.2 Data Collection

To evaluate the accuracy of WiKey, we collected training and testing datasets from 10 users.
These 10 users were general university students who volunteered for the experiments, and only 2
of them had some know-how of wireless communication. Users 1–9 first provided 30 samples
for each of the 37 keys (26 letters, 10 digits and 1 space bar) by pressing that key multiple times.
After this, these users typed the sentence S1 = “the quick brown fox jumped over the lazy dog” two
times, without spaces.

To evaluate how the number of training samples impacts the accuracy, we collected 80 samples
for each of the 37 keys from User 10. Afterwards, this user typed each of the following sentences
5 times, without spaces: S1 = “the quick brown fox jumps over the lazy dog”, S2 = “nobody knew
why the candles blew out”, S3 = “the autumn leaves look like golden snow”, S4 = “nothing is as
profound as the imagination” and S5 = “my small pet mouse escaped from his cage”. We asked
users to type naturally with multiple fingers but only press one key at a time while keeping the
average keystroke inter-arrival time at 1 second. After recording the CSI time series for each of the
above experiments, we first applied our keystroke extraction algorithm on those recorded CSI time
series to extract the CSI waveforms for individual keys and then extracted the DWT based shape
features from each of the extracted keystroke waveforms.

2.8.3 Keystroke Extraction Accuracy

We evaluate the accuracy of our keystroke extraction algorithm in terms of the detection ratio,
which is defined as the total number of correctly detected keystrokes in a CSI time series divided
by the total number of actual keystrokes. The detection ratio of our proposed algorithm is more
than 97.5%. Figure 2.6(a) shows a color map of the percentage of missed keystrokes for
all 37 keys and all 10 users.

The darker areas represent higher rates of missed keystrokes. We can observe from this figure
that the number of missed keystrokes varies across individuals depending upon their typing
behaviors. For example, the keystrokes of user 4 were missed at a higher rate, with an
average detection ratio of 91.8%, whereas the keystrokes of user 10 were not missed at all, with an
average detection ratio of 100% calculated over all 37 keys. The lower extraction accuracy for user
4 is due to the significant difference in his typing behavior compared to other users. The accuracy
of our scheme for such a user can be increased significantly by tuning the parameters of our
algorithm for the given user. We also observe from this figure that the keystrokes that are missed
are usually those for which the fingers move very little when typing. For example, in pressing keys
‘a’, ‘d’, ‘f’, ‘i’, ‘j’ and ‘x’ the hands and fingers move very little, and thus the variations in the
CSI values sometimes go undetected. Figure 2.6(b) shows the keystroke extraction rate for each user
averaged over all 37 keys. The experimental results show that our keystroke extraction algorithm is
robust because it consistently achieves high performance over different users without requiring any
user specific tuning of system parameters.

Figure 2.6: Keystroke extraction results
Panels: (a) Colormap for missed keys; (b) Keystroke extraction rates per user (averaged over all keys).
Axes: Users versus Keys in (a); Extraction rate (%) versus Users in (b).

2.8.4 Classiﬁcation Accuracy

We evaluate the classiﬁcation accuracy of WiKey through two sets of experiments. In the ﬁrst

set of experiments, we build classiﬁers for each of the 10 users using 30 samples and measure the

10-fold cross validation accuracy of those classiﬁers. In the second set of experiments, we build

a classifier for user 10 while increasing the number of samples from 30 to 80 in order to observe the

impact of increase in the number of training samples on the classiﬁcation accuracy. Cross validation

automatically picks a part of data for training and remaining for testing and does not use any data in

testing that was used in training. Recall that WiKey uses kNN classifiers for recognizing keys.

In all our experiments, we set k = 5.
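The train/test partitioning performed by cross validation can be sketched as follows. This is a generic illustration rather than the tooling used in our experiments. The function name `k_fold_splits` and the fixed shuffling seed are hypothetical.

```python
import random


def k_fold_splits(num_samples, num_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for X-fold cross validation.

    Every sample is used for testing exactly once and never appears in the
    training set of the fold in which it is tested.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for held_out in range(num_folds):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


# 30 samples of one key with 10 folds: 27 samples for training, 3 for testing.
for train, test in k_fold_splits(30, 10):
    print(len(train), len(test))  # 27 3 on every fold
    break
```

With 10 folds, each fold holds out 10% of the data for testing and uses the remaining 90% for training, and with 2 folds the split is 50% and 50%.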

2.8.4.1 Accuracy with 30 Samples per Key

We evaluate the classiﬁcation accuracy of WiKey in terms of average accuracy per key and

average accuracy on all keys of any given user. We also present confusion matrices resulting from

our experiments. A confusion matrix tells us which key was recognized by WiKey as which key

with what percentage. We calculate the average accuracy per key by taking the average of confusion

matrices obtained from all users and average accuracy on all keys of any given user by averaging

the accuracy on all keys within the confusion matrix of that user. For each user, we trained each

classiﬁer using features from 30 samples of each key. We conducted experiments on all 37 keys as

well as on only 26 alphabet keys and performed 10-fold cross validation to obtain the confusion

matrices.
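The two averages defined above can be computed from confusion matrices as in the following sketch. It is illustrative only: the function names and the two-key toy labels are hypothetical, and each matrix row is assumed to hold percentages for one true key.

```python
def confusion_matrix(true_keys, predicted_keys, keys):
    """Row r, column c: percentage of key r samples that were recognized as key c."""
    index = {key: i for i, key in enumerate(keys)}
    counts = [[0] * len(keys) for _ in keys]
    for truth, guess in zip(true_keys, predicted_keys):
        counts[index[truth]][index[guess]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([100.0 * c / total if total else 0.0 for c in row])
    return matrix


def per_key_accuracy(matrix):
    """Accuracy of each key is the diagonal entry of its row."""
    return [matrix[i][i] for i in range(len(matrix))]


def user_accuracy(matrix):
    """Average accuracy over all keys within one user's confusion matrix."""
    accuracies = per_key_accuracy(matrix)
    return sum(accuracies) / len(accuracies)


def average_matrices(matrices):
    """Element-wise mean of the confusion matrices obtained from all users."""
    size = len(matrices[0])
    return [
        [sum(m[r][c] for m in matrices) / len(matrices) for c in range(size)]
        for r in range(size)
    ]


# Hypothetical toy result for one user and two keys.
truth = ["a", "a", "a", "a", "b", "b"]
guess = ["a", "a", "a", "b", "b", "b"]
matrix = confusion_matrix(truth, guess, keys=["a", "b"])
print(matrix)                 # [[75.0, 25.0], [0.0, 100.0]]
print(user_accuracy(matrix))  # 87.5
```

The average accuracy per key across users is then the diagonal of `average_matrices` applied to the per-user matrices.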

WiKey achieves an overall keystroke recognition accuracy of 82.87% in case of 37 keys and

83.46% in case of 26 alphabetic keys when averaged over all keys and users. Figure 2.7 shows

the recognition accuracy for each key across all users for the 26 alphabetic keys. Similarly, Figure

2.8 shows the recognition accuracy for each key across all users for all 37 keys. Figure 2.9 shows

the average recognition accuracy achieved by each user for both 26 keys and 37 keys. We observe

that the recognition accuracy for 26 alphabetic keys is on average greater than the recognition

accuracy for all 37 keys. This is because the keystroke waveforms of the digit keys (0-9) often

show similarity with keystroke waveforms of alphabet keys in the keyboard row starting with QWE,

which leads to slightly greater number of misclassiﬁcations.

2.8.4.2 Accuracy vs. the Size of the Training Set

To determine the impact of the number of training samples on the accuracy of WiKey, we again
perform two sets of experiments: one for the 26 alphabetic keys and the other for all 37 keys.

The accuracy of WiKey increases when the number of training samples per key is increased
from 30 to 80. Figure 2.10 shows the results from 10-fold cross validation for the 26 alphabetic
keys when 80 training samples are used per key. We observe from this figure that the recognition
accuracy increased from 88.3% (as seen in Figure 2.9) to 96.4% when the number of training
samples is increased from 30 to 80. Figure 2.11 shows the results from 10-fold cross validation
for all 37 keys when 80 training samples are used per key. We again observe that the recognition
accuracy increased from 85.95% (as seen in Figure 2.9) to 89.7% when the number of training
samples is increased from 30 to 80. The gray-scale map of the confusion matrix obtained after
10-fold cross-validation on 80 training samples of User 10 is shown in Figure 2.12.

Figure 2.7: Mean accuracy for keys A-Z (Users 1-10)
Axes: Accuracy (%) versus Keys.

Figure 2.8: Mean accuracy for all 37 keys (Users 1-10)
Axes: Accuracy (%) versus Keys.

Figure 2.9: Per user average classifier accuracies

Figure 2.10: Accuracy for keys A-Z from user 10
Axes: Accuracy (%) versus Keys.

2.8.4.3 Eﬀects of CSI Sampling Rate and Training Samples

In the previous experiments, we used a high CSI sampling rate of 2500 samples/s. Furthermore,
10-fold cross validation automatically chose 10% of the data for testing and the remaining 90% for
training. Next, we evaluate the effect of changing the CSI sampling rate and the percentage of
data used for training on the accuracy. To extract keystrokes, we halved the values used for W, Du,
Iu, Bleft, and Bright. We performed X-fold cross validation (2 ≤ X ≤ 10) on the data obtained for
the alphabetic keys from user 10. Figure 2.13 plots the accuracies for the number of folds varying
from 2 to 10, where each plotted value is the average over all alphabet keys. We observe from Figure
2.13 that the accuracies dropped compared to the previously achieved accuracy because of the drop
in resolution of the keystroke shapes due to the reduced sampling rate. We also observe that the
recognition accuracies of the keys for which hands and fingers move little were affected the most.
When 50% of the data was used for training, i.e., for 2-fold cross validation, the accuracies for
keys ‘j’, ‘x’, ‘v’ and ‘p’ dropped below 60%. However, the average accuracy remained approximately
80% for all folds.

Figure 2.11: Accuracy for all 37 keys from user 10
Axes: Accuracy (%) versus Keys.

Figure 2.12: Color map of user 10’s confusion matrix
Both axes list the 37 keys.

Figure 2.13: Multifold cross-validated average accuracies for user 10’s lower resolution keystroke
waveforms
Axes: Percentage Accuracies versus Cross Validation Folds.

2.8.5 Real-world Evaluation on Sentences

To evaluate WiKey in real world scenarios, we collected CSI data for diﬀerent sentences typed

by users 1 through 10, as mentioned in Section 2.8.2. To train the classiﬁers to recognize keystrokes

in sentences, we used the same dataset of individual keystrokes that we used in the evaluations

presented above. For the test samples, we used the keystrokes extracted from datasets obtained

from typing the sentences.

2.8.5.1 Accuracy

WiKey achieves an average keystroke recognition accuracy of 77.43% for typed sentences when

30 training samples per key were used. For each user, we trained classiﬁers using 30 samples for

each of the 26 alphabetic keys. We then applied our keystroke extraction algorithm to ﬁrst extract

waveforms of individual keys, applied PCA on them to denoise the waveforms and then extracted the

shape features for each extracted key and fed them to the classifiers to recognize the keystrokes

in the sentence. Figure 2.14 shows the keystroke recognition accuracy for the sentences typed by

each user.

WiKey achieves an average keystroke recognition accuracy of 93.47% in continuously typed
sentences with 80 training samples per key. We first trained classifiers using 80 samples for each of
the 26 alphabetic keys and then fed them with keystrokes from the typed sentences. Figure 2.15 shows
the keystroke recognition accuracy for all the sentences (S1 to S5) typed by user 10. The average
keystroke recognition accuracy for user 10 in the previous experiment, which used 30 samples for
training the classifiers, was just 80%. Thus, we can conclude that increasing the number of training
samples increases the accuracy of WiKey.

Figure 2.14: Keystroke recognition for sentences collected from all users using 30 samples per key

Figure 2.15: Keystroke recognition for sentences collected from user 10 using 80 samples per key

2.8.5.2 Eﬀects of Behavioral Clustering

In this subsection, we show how behavioral clustering affects the keystroke recognition time and
accuracy of WiKey. As discussed in Section 2.7.3, every user has a set of distinct styles of pressing
each key, and these styles stay consistent over time. WiKey leverages these consistent behaviors to
reduce the computational complexity of the kNN classifiers by reducing the number of training samples
required to achieve high keystroke recognition accuracies.

Figures 2.16(a) and 2.16(b) show the keystroke recognition time and accuracy, respectively, as
we increase the percentage of samples (P) that WiKey uses for training from each behavioral cluster.
We obtained the results in these two figures on the keystrokes from the sentence S1 collected from
user 10. Each value in these two figures is an average over the five repetitions of S1. We observe
from Figure 2.16(a) that the average recognition time increases consistently as the number of samples
increases, which is intuitive because the complexity of kNN classification (which relies on exhaustive
search) increases with the number of training samples. At the same time, we also
observe from Figure 2.16(b) that the keystroke recognition accuracy of WiKey suffers when using
fewer training samples. Nonetheless, it still stays above 79%. Another interesting observation from
Figure 2.16(b) is that the accuracy of WiKey increases when P increases from 10% up to 70%,
which is intuitive, but it decreases beyond that. The reason behind this decrease is the inclusion
of noisy and/or inconsistent samples into the training dataset when a large percentage of training
samples is used for training. This observation implies that the behavioral clustering of training
samples not only helps in decreasing the training time but also helps in eliminating noisy and
inconsistent samples from the training set, which helps the overall accuracy of WiKey.

2.8.5.3 Auto-Correction and Word Recognition

Next, we apply dictionary based auto-correction on the recognized keystrokes and study whether
it improves WiKey’s accuracy in terms of correctly recognizing entire words instead of individual
keystrokes. Auto-correction automatically replaces a word that is not part of the dictionary with
its best match. For the word recognition experiments, we chose the number of behavioral clusters to
be C = 6. Furthermore, we evaluate two settings, P = 70% and P = 10%, for the percentage of samples
used from each cluster for training. We average the recognition results for each sentence over its
five repetitions and report the final word recognition accuracies. During auto-correction, we observed
cases (not shown here) where words containing keystroke(s) that WiKey recognized incorrectly were
actually valid words in the dictionary. In such cases, auto-correction has no effect on these words
and thus on the recognition accuracy of WiKey. We treat such words as incorrect while calculating
WiKey’s word recognition accuracy.

Figure 2.16: Keystrokes recognition for sentence S1 collected from user 10 after behavioral
clustering
Panels: (a) Recognition time versus samples used; (b) Keystrokes recognized versus samples used.
Axes: Time (normalized by maximum) in (a) and Keystrokes Recovered (%) in (b), each versus Samples
chosen from each cluster for training (%).

Figures 2.17(a) and 2.17(b) show WiKey’s word recognition accuracies, before and after auto-

correction, for the ﬁve sentences collected from user 10. We observe from these ﬁgures that

auto-correction improves word recognition accuracy in all cases. Furthermore, in some cases (such

as for sentence 2 in Figure 2.17(a)) the word recognition accuracy reaches up to 100%. Comparing

these two ﬁgures, we also observe that WiKey achieves higher word recognition accuracies when

P = 70% compared to when P = 10%. This happens due to the higher individual keystroke
recognition accuracies for P = 70% compared to P = 10%.
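Dictionary based auto-correction of the kind used above can be sketched as follows. The criterion for the best match is not specified here, so the sketch assumes similarity as scored by Python's standard `difflib` module. The function name `autocorrect`, the toy dictionary and the recognized words are hypothetical, and the words are assumed to be already separated.

```python
import difflib


def autocorrect(word, dictionary):
    """Replace a word that is not in the dictionary with its closest entry.

    Words that are already valid dictionary entries are returned unchanged,
    even if they resulted from misrecognized keystrokes.
    """
    if word in dictionary:
        return word
    # cutoff=0.0 makes difflib return the closest entry however dissimilar.
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.0)
    return matches[0] if matches else word


dictionary = ["the", "quick", "brown", "fox", "fix"]
recognized = ["the", "quicj", "brpwn", "fix"]  # the user typed "the quick brown fox"
print([autocorrect(word, dictionary) for word in recognized])
# ['the', 'quick', 'brown', 'fix']
```

The last word illustrates the case noted above: "fix" is a valid dictionary word, so auto-correction leaves it untouched even though the typed word was "fox", and it still counts as incorrect.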

Figure 2.17: Comparison of word recognition accuracies before and after auto-correction
Panels: (a) Word recognition accuracies using 70% training samples; (b) Word recognition accuracies
using 10% training samples. Axes: Accuracies (%) versus Sentences (1 to 5); legend: Before, After.

2.9 Limitations

Currently, WiKey works well under relatively stable and controlled environments. The accuracy

of our current scheme will be aﬀected by variations in the environment such as human motion in

surrounding areas, changes in orientation and distance of transceivers [103] [104], typing speeds,

and keyboard layout and size. During our experiments, we assumed that the major motion is due

to keystrokes of the target user only and no other major motion such as walking occurs in the room

where CSI data is collected. WiKey may be extended to allow small movements in the environment,

e.g. multiple persons walking in a library. However, this would require training WiKey with

the proﬁles of those activities and adding the capability to subtract the waveforms of those activities

to extract the waveforms of keystrokes. Furthermore, most of the parameters used in our keystroke

extraction algorithm are scenario dependent and need to be changed if CSI sampling rates or

physical parameters of the environment (e.g. transceiver positions) change.

During data collection, we instructed the users not to move their heads or other body parts
significantly while typing. However, we allowed natural motions that commonly occur when a
person types, such as eye blinking and movements in the arm, shoulder and fingers on the side of
the hand being used for typing. We also instructed the users to type one key at a time while keeping
the inter-arrival time of keystrokes between 0.5 and 1 second to facilitate correct identification of
the start and end times of keystrokes. However, we did allow users to use multiple fingers for typing
so that they could use whichever finger they naturally use to press any given key.

2.10 Conclusion

In this work, we make the following key contributions. First, we propose the ﬁrst WiFi based

keystroke recognition approach, which exploits the variations in CSI values caused by the micro-

movements of hands and ﬁngers in typing. The key intuition is that while typing a certain key, the

hands and ﬁngers of a user move in a unique formation and direction and thus generate a unique

pattern in the time-series of CSI values for that key. Second, we propose a keystroke extraction

algorithm that automatically detects and segments the recorded CSI time series to extract the

waveforms for individual keystrokes. Third, we implemented and evaluated the WiKey system

using a TP-Link TL-WR1043ND WiFi router and a Lenovo X200 laptop. Our experimental results

show that WiKey achieves a detection rate of more than 97.5% for keystrokes and 96.4%

recognition accuracy for classifying single keys. In real-world experiments, WiKey can recognize

keystrokes in a continuously typed sentence with an accuracy of 93.5%. The key scientiﬁc value of

this work is in demonstrating the possibility of recognizing micro-gestures such as keystrokes using

everyday COTS WiFi devices. We have shown that our technique works in controlled environments,

and we hope that this work will spark the interest of other researchers in this area to address the
problem of mitigating the effects of harsher wireless environments by building on our proposed
micro-gesture extraction and recognition techniques.

CHAPTER 3

UNDERSTANDING AND MODELING WIFI SIGNALS BASED SLEEP MONITORING

3.1 Introduction

3.1.1 Motivation

Long-term sleep monitoring is crucial for patients with sleep disorders as well as for the
general population, so that people can keep track of their sleep quality and improve their sleeping
habits. Moreover, continual sleep monitoring can help with early identification of sleep disorders
and related illnesses which would otherwise go undiagnosed. Sleep disorders such as insomnia,
apnea, and narcolepsy are common in the general population and are associated with greater risk of
cardiovascular, neurological, psychiatric, and other disorders [68, 119]. Moreover, studies have
shown that a majority of us remain unaware of our overall sleep quality and longer-term patterns in
our sleep duration and consistency [138].

Recently, WiFi CSI signals based methods have emerged as an eﬀective approach to low-cost

and easily adoptable sleep monitoring for in-home environments. The idea is to track breathing

and other body/limb activity, which are closely related to sleep quality in humans [33, 82, 117], by

leveraging the changes caused by those bodily motions in WiFi CSI signals. CSI based methods

are by far some of the least intrusive methods for monitoring sleep, both in terms of privacy

and convenience of use. The “gold standard” procedure for scoring sleep and assessing sleep

disorders is polysomnography (PSG) [70]. However, PSG is highly obtrusive, mainly because of its

prohibitively high cost and low patient comfort, as it requires subjects to sleep in an

unfamiliar and motion-constrained environment. Several other less obtrusive methods have been

proposed for tracking vital signs (body motion, cardiac and/or respiratory activity) during sleep, for

example Actigraphy [12, 40, 83, 99, 117, 118, 158] and EEG [31] based techniques. However, these

methods still require body contact which is something people are often not comfortable with [27].

Moreover, some people (e.g. the elderly) can forget to wear such devices before going to sleep. For

these reasons, contact-less sleep monitoring has attracted signiﬁcant interest, which mainly includes

Audio [35,50,106], Video [52,107], Bed sensors (e.g. Ballistocardiography (BCG), pressure and/or

motion sensors based techniques) [24, 28, 67, 87, 102, 153] and RF sensing based techniques (e.g.

mmWave, Frequency Modulated Continuous Wave (FMCW) radar, Pulse-Doppler radar, RFIDs

and WiFi based techniques) [9, 77, 80, 100, 101, 111, 160, 162]. Audio and video based techniques

are privacy intrusive, and therefore, are often avoided. Bed sensors based techniques involve

considerable deployment eﬀort because the sensors have to be installed at speciﬁc locations on

the bed or inside the pillows. Most existing radar-based techniques, whether they are mmWave or

RF, can monitor breathing and other movements during sleep fairly well. However, their operation

requires line-of-sight (LOS), which leads to significant directivity issues. Moreover, all of them require users to

buy dedicated devices, which often tend to be expensive, e.g. Xethru costs $400¹ per module.

This prevents their large-scale and long-term deployment.

3.1.2 Limitations of Prior Art

Multiple WiFi CSI based schemes have been proposed for tracking vital signs during sleep

[53, 77, 80, 146, 164]. The key limitation of these schemes is the lack of a model that can correlate

the changes introduced in WiFi CSI with the bodily activity (e.g. breathing and body/limb motion)

during sleep. Due to the lack of such a model, they rely on trial-and-error based positioning

of WiFi transceivers and signal processing techniques to track vital signs, which leads to high

dependency on multiple, environment-dependent and diﬃcult to tune parameters. Consequently,

their techniques lack robustness to diﬀerent individuals and environments, and they only work well

in controlled lab experiments performed on the same user, where they require the user to lie down

in between and/or very close to both transmitter (TX) and the receiver (RX) to ensure line-of-sight

(LOS) scenarios. As all the previous WiFi based sleep monitoring schemes have been evaluated

with short-duration mock sleep experiments in very controlled settings, the applicability of their

techniques and findings in practical sleep monitoring scenarios is limited. Their methods may be

suitable for controlled short-duration sleep experiments; however, they cannot be generalized to

different individuals, environments, positioning of WiFi transceivers, LOS/NLOS situations, and

to natural in-home full-night sleeping scenarios.

¹ At the time of writing [155].

3.1.3 Proposed Approach

In this work, we propose Serene, a WiFi CSI based sleep quality monitoring scheme which can

robustly track breathing and body/limb activity related vital signs during sleep throughout a night

in an individual and environment independent manner. We propose two models based on which

we develop Serene’s signal processing pipeline: a breath-multipath model, and a breath-subspace

model. Our breath-multipath model quantiﬁes the eﬀect of small breathing movements on the CSI

signals, and allows Serene to robustly extract breathing waveforms. Our breath-subspace model

quantiﬁes how breathing aﬀects the signal subspace formed by WiFi subcarriers very diﬀerently

compared to other bodily motions, and allows Serene to robustly diﬀerentiate between breathing

and body/limb activity during sleep. These two models combined correlate the changes in CSI

values with the user’s vital signs during sleep, and therefore, form the foundation of Serene’s signal

processing pipeline to robustly track those vital signs. On the hardware side, Serene consists of

two commodity oﬀ-the-shelf (COTS) WiFi devices: a transmitter (e.g. a router) for continuously

sending signals, and a receiver (e.g. a laptop or a small embedded device placed close to the

sleeping user) for continuously receiving those signals to sample CSI. When a user is sleeping

near the receiver, Serene continuously tracks their breathing and other body/limb activity related

vital signs. Using these vital signs, Serene also measures the user's per-night sleep quality based on a

well-known light-weight sleep scoring technique proposed in [150].

Our proposed breath-multipath and breath-subspace models advance the state-of-the-art on

WiFi signal based sleep monitoring from two fronts. First, they provide us the theoretical basis to

understand the relationship between changes in CSI values and a user’s vital signs during sleep,

and the relationship between the vital signs and the signal subspace formed by diﬀerent subcarriers

of the WiFi signals. Regarding the relationship between CSI value dynamics and the vital signs,

our model shows that if we diﬀerentiate (i.e. by taking ﬁrst order diﬀerence) the CSI signals from

each WiFi subcarrier, and then max-min normalize the CSI signal projection corresponding to

variations due to breathing, we can robustly extract the waveform corresponding to user’s breathing

motion in an environment and individual independent manner, as long as the user sleeps close

to the WiFi receiver. Such a requirement is easy to achieve in real-life in-home sleep scenarios.

Regarding the relationship between the vital signs and the signal subspace formed by diﬀerent

subcarriers of the WiFi signals, our model shows that when there is no body/limb motion, there

is only one dominant time-varying component in the subspace, which corresponds to breathing.

However, more components along these dimensions evolve (i.e. show signiﬁcant variations) during

other body movements, e.g. during roll overs or arm/leg movement. Based on this model, we are

able to distinguish breathing from body/limb motion events during sleep, without requiring any

environment-dependent calibrations. To the best of our knowledge, our breath-subspace model is

the ﬁrst of its kind to utilize the formal concept of signal subspace from wireless literature [11]

in a concrete real-world application. Our breath-multipath and breath-subspace models combined

help Serene avoid the inconvenience of requiring every user to provide calibration for multiple

diﬀerent environments and possible positioning of the transceivers, and therefore, allow us to build

a robust scheme which can track vital signs using design-time training data obtained from only a few

conﬁgurations and users. Therefore, our modeling lowers the deployment and usability barriers of

low-cost, in-home sleep monitoring using COTS WiFi devices.

3.1.4 Technical Challenges and Solutions

The ﬁrst technical challenge is to extract the CSI values that are resilient to static changes in

the environment (e.g. changes in arrangement of room furniture). To address this challenge, based

on our proposed breath-multipath model, we diﬀerentiate the CSI streams coming from WiFi NIC

by taking their ﬁrst order diﬀerence. This procedure not only removes the static changes in CSI

values, but also brings CSI timeseries into a form using which Serene’s breath tracking algorithm

can robustly measure breathing rate independent of diﬀerent individuals and deployment scenarios.

The second technical challenge is to isolate the CSI variations due to bodily activity during sleep

from the noisy CSI values in real-time. The COTS WiFi NICs report noisy CSI values, both due

to hardware limitations (e.g. low resolution Analog to Digital Converters (ADCs), etc.) and due to

changing transmission power and rates. To remove such high frequency noise from the CSI streams,

we use a combination of median, exponential moving average (EMA), and Butterworth low-pass

ﬁlters. This combination of ﬁlters eﬀectively removes higher frequency noise in CSI values, while

also retaining information about the breathing and body/limb motion during sleep which is needed

for Serene to robustly track those vital signs.

The third technical challenge is to robustly and accurately distinguish between breathing and

other body movements. To achieve this, we follow our proposed breath-subspace model. Today’s

MIMO and OFDM based WiFi devices use many frequency subcarriers and multiple transmit-

receive (Tx-Rx) antennas for data communication. The MIMO system consisting of the OFDM

subcarriers and the Tx-Rx antennas, forms a multidimensional array which can be represented by a

signal subspace along these frequency and spatial dimensions [11,32,134,151]. The key intuition is

that while a user is sleeping, the signal subspace along these dimensions is aﬀected by both breathing

and body/limb motion. When there is no body/limb motion, there is only one dominant time-

varying component in the subspace, which corresponds to breathing. However, as attested by our

experimental observations, more components along these dimensions evolve (i.e. show considerable

variations) during other body/limb activity e.g. during roll overs or arm/leg movement. Based on this

principle, we can isolate breathing from limb motion without requiring any environment-dependent

calibrations. To achieve this in Serene, we ﬁrst extract a batch of top dominant components from the

multi-dimensional CSI signal using Principal Component Analysis (PCA). In the absence of other

body movements, Serene tracks breathing using the top-most PCA projection, which automatically

captures the variations due to breathing. However, during body/limb movements, breathing cannot

be tracked as these movements cause significant variations in multiple top PCA projections, which

makes breathing almost impossible to extract. Therefore, to accurately detect and then discard the

CSI values corresponding to such body/limb movement events, we track variations in the lower PCA

components by means of a multi-dimensional clustering technique. During body movement, these

lower PCA components show considerable variations, which are robustly and accurately detected

by our clustering approach.

The fourth challenge is to accurately estimate the breathing rate using the top PCA projection

of CSI streams. To estimate breathing rate, we use a peak detection based approach. First, following

our breath-multipath model, we max-min normalize the signal which makes parametrization of

our peak detection algorithm easily generalizable to diﬀerent users. Second, to achieve accurate

tracking of breathing rate, we perform parameterization of Serene’s breath estimation algorithm

using ground truths obtained from a contact-less COTS Xethru X4M200 Breath/Motion sensor.

The Xethru radar-based sensor has been shown to have 96% breath tracking accuracy compared

to PSG, i.e. the gold standard for sleep monitoring [154]. We perform this parametrization only

during the design of our system and do not require any end-user calibration eﬀort during real-world

deployments.
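
The normalization and peak detection steps described above can be illustrated with a minimal sketch. It is not Serene's implementation: the function names, the 0.5 height threshold, the 1.5 s minimum breath period and the synthetic 20 Hz sinusoidal epoch are all hypothetical choices made for illustration, whereas in Serene the parameters are tuned against Xethru ground truth.

```python
import math

def max_min_normalize(x):
    """Scale a waveform to [0, 1]; removes amplitude and offset differences."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0.0] * len(x)
    return [(v - lo) / (hi - lo) for v in x]

def find_peaks(x, min_height, min_gap):
    """Indices of local maxima >= min_height, at least min_gap samples apart."""
    peaks = []
    for n in range(1, len(x) - 1):
        is_local_max = x[n] > x[n - 1] and x[n] >= x[n + 1]
        if is_local_max and x[n] >= min_height:
            if not peaks or n - peaks[-1] >= min_gap:
                peaks.append(n)
    return peaks

def breathing_rate_bpm(waveform, fs, min_height=0.5, min_period_s=1.5):
    """Breaths per minute from the mean peak-to-peak interval (None if < 2 peaks)."""
    norm = max_min_normalize(waveform)
    peaks = find_peaks(norm, min_height, int(min_period_s * fs))
    if len(peaks) < 2:
        return None
    mean_interval = (peaks[-1] - peaks[0]) / (len(peaks) - 1)
    return 60.0 * fs / mean_interval

FS = 20  # Hz, hypothetical sampling rate of the processed waveform

def synthetic_epoch(scale, offset):
    """30 s of 0.25 Hz (15 BPM) breathing with arbitrary amplitude and offset."""
    return [scale * math.sin(2 * math.pi * 0.25 * n / FS) + offset
            for n in range(30 * FS)]

# The same thresholds work for both epochs because of the normalization.
rate_a = breathing_rate_bpm(synthetic_epoch(3.7, 11.0), FS)
rate_b = breathing_rate_bpm(synthetic_epoch(0.2, -4.0), FS)
print(rate_a, rate_b)  # both estimates are 15 BPM
```

Because the waveform is rescaled to [0, 1] before peak detection, one set of thresholds applies regardless of the absolute signal amplitude, which is the property the max-min normalization is meant to provide.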

3.1.5 Summary of Experimental Results

We developed HummingBoard (HMB) [127] based easy to deploy sleep monitoring devices,

which were installed with Intel 5300 NIC for extracting CSI information [47]. We used Linksys

AC1200+ routers in our deployments. Moreover, we developed client (in C) and server (in Python)

programs capable of real-time CSI extraction and processing throughout the night. We tested Serene

on 5 diﬀerent individuals, where we collected >550 hours (80 nights) of CSI data at their apartments.

55% of our dataset corresponds to NLOS deployment scenarios, and 45% to LOS. Our results

demonstrate that Serene can track breathing throughout a night with an average error of <0.59

breaths per minute (BPM) for controlled sleep experiments and an average error of <1.19 BPM

for real-world in-home sleep experiments. Our system experiences an average nightly

outage (during which it cannot track the vitals) of less than 6.38 minutes. Figures 3.1(a) and 3.1(b)

show how our proposed system tracks breathing and body movement vital signs during a full night’s

sleep of a subject, where the receiver was placed on a table close to the subject's bed and the router

was placed outside their bedroom in their TV lounge. We also assess per-night sleep efficiency

(i.e. the percentage of the night spent in the sleep stage) results of our subjects based on a classic

lightweight actigraphy based sleep scoring algorithm [150], which tracks each night's sleep progress

in terms of sleep and awake stages based on the body movement information provided by Serene.

The sleep scoring algorithm we use has been shown to agree with EEG based sleep monitoring

94.46% of the time [150]. We also discuss the possibility of advanced sleep-stage classification

based on the vital signs obtained from our system in §2.9, where we compare some results

of our current lightweight sleep-stage classification approach with a commercial sleep monitoring

device, ResMed S+ [113].

[Figure 3.1: Example showing our system tracking breathing and body movements throughout full night's sleep of subject. Xethru radar (X4M200 [155]) ground truth is approximately synchronized with CSI data. (a) Respiration tracking (full night's sleep): breathing rate (BPM) vs. minutes into sleep, comparing the commercial Xethru radar based breath sensor with our WiFi based approach (Serene); annotation: median MSE error in breathing rate ~1.02 BPM. (b) Body movement tracking (full night's sleep): signal value of PCA components 1 to 5 and body movement events vs. minutes into sleep.]

3.2 Related Work

3.2.1 Respiration, Body Movements and Sleep

Previous works have shown that breathing and body movements during sleep are closely related

to sleep quality in humans [33, 82, 117]. These studies show that respiratory dynamics vary over

sleep stages, which means that respiratory activity can be used to separate sleep stages [82]. For

example, Dafna et al. evaluated whole night sleep based on sleep-awake classiﬁcation using audio

recordings of breathing sounds [33]. They captured and quantiﬁed variations in breathing features

such as periodicity and consistency, and showed that these features contribute to distinguishing

between sleep and wake epochs. Our work is motivated by such studies, where our goal is to

develop a robust and generic scheme to extract breathing and limb/body activity related vital signs

using CSI signals obtained from COTS WiFi devices. Our scheme can be used by sleep researchers

to develop low-cost and easily deployable/scalable sleep-stage monitors which can track diﬀerent

stages (e.g. light, deep, rapid-eye-movement (REM), and awake) of sleep.

3.2.2 Sleep Monitoring Technologies

Several sleep monitoring techniques have been proposed in the past which use diﬀerent sensing

modalities, such as in-ear [97], inertial sensors (Actigraphy) [12, 40, 83, 99, 117, 118, 131, 158],

EEG [31], Audio [35, 50, 106], Video [52, 107], Bed sensors (e.g. Ballistocardiography (BCG),

pressure and/or motion sensors based techniques) [24, 28, 67, 87, 102, 153] and RF sensing based

techniques (e.g. mmWave, Frequency Modulated Continuous Wave (FMCW) radar, Pulse-Doppler

radar, RFID and WiFi based techniques) [9,77,80,100,101,111,160,162]. For brevity, we will only

discuss some of the closely related recent works on contact-less sleep monitoring, which include

some sound, radar and WiFi CSI based techniques.

Lullaby [63] tracks various environmental factors (sound, light, temperature, and motion) that

help users assess the quality of their sleep environments. iSleep [50] uses the built-in microphone

of the smartphone to detect events that are closely related to sleep quality, including body

movement, coughing, and snoring, and infers quantitative measures of sleep quality. Sleep Hunter [45]

uses actigraphy and acoustic events to predict sleep stage transitions using a smartphone. Toss-N-Turn

[88] uses features such as sound amplitude, acceleration, light intensity, screen proximity,

battery and screen states, etc. to track a subject's sleep quality. However, audio based techniques

are privacy invasive, and therefore often avoided, as sleep is a private activity.

RF sensing based techniques are by far the least intrusive methods for monitoring sleep, both in

terms of privacy and convenience of use. DoppleSleep [111] is an unobtrusive sleep sensing

system which uses short-range Doppler radar to perform sleep stage classification (Sleep vs. Wake

and REM vs. Non-REM). Vital-Radio [9] is an FMCW based system which is shown to

accurately track a person’s breathing and heart rate without body contact, from distances up to 8

meters. Based on the same system, [165] proposes a deep learning architecture to perform 4-stage

sleep stage classification. More recently, the authors of [162] proposed algorithms to achieve multi-person

identification and breath monitoring based on the same FMCW hardware. Although the

aforementioned radar based techniques do fairly well in terms of monitoring vital signs during

sleep, they require dedicated hardware and spectrum, adding cost and/or RF regulation

hurdles. These factors prevent their large-scale and long-term deployment. WiFi signals based

sensing provides an eﬀective approach towards low-cost and easily adoptable long-term sleep

monitoring, as the widespread use of WiFi capable devices (e.g. smart-home assistants, smart-

phones, etc.) has made WiFi signals the most ubiquitous form of sensing in homes requiring no

additional hardware costs. Multiple WiFi CSI based schemes have been proposed for tracking vital

signs during sleep [77, 80, 146, 164]. The key limitation of these schemes is the lack of a model

that can correlate the changes introduced in WiFi CSI with the bodily activity (e.g. breathing

and body/limb motion) during sleep. Due to lack of such a model, they rely on trial-and-error

based positioning of WiFi transceivers and signal processing techniques to track vital signs, which

leads to high dependency on multiple, environment-dependent and diﬃcult to tune parameters.

Consequently, their techniques lack robustness to diﬀerent individuals and environments, and they

only work well in controlled lab experiments performed on the same user, where they require the

user to lie down in between and/or very close to both transmitter (TX) and the receiver (RX) to

ensure line-of-sight (LOS) scenarios. As all the previous WiFi based sleep monitoring schemes

have been evaluated with short-duration mock sleep experiments in very controlled settings, the

applicability of their techniques and ﬁndings in practical sleep monitoring scenarios is limited. Their

methods may be suitable for controlled short-duration sleep experiments; however, they cannot be

generalized to diﬀerent individuals, environments, positioning of WiFi transceivers, LOS/NLOS

situations, and to natural in-home full-night sleeping scenarios.

3.3 Modeling of Vital Signs and WiFi CSI

3.3.1 Overview of WiFi CSI

WiFi devices measure the Channel State Information (CSI), which characterises the surround-

ing wireless channel across bandwidth and multiple antennae. The Orthogonal Frequency Division

Multiplexing (OFDM) communication scheme used in IEEE 802.11a/n/ac divides the wireless chan-

nel’s bandwidth into multiple modulated subcarriers. To correct for channel frequency-selectivity

(or equivalently the delay spread in time-domain) and maximise the link’s capacity, WiFi devices

continuously track changes over these subcarriers in terms of CSI values, which are then used to

adapt transmission power and rates in real time. CSI values are the Channel Frequency Response

(CFR) at per subcarrier granularity between each transmit-receive (Tx-Rx) antenna pair. When a

user is sleeping, the chest and body movements change the constructive and destructive interference

patterns of the WiFi signals. The CSI values are sensitive enough to measure these breathing move-

ments, as CSI measurements can be obtained at high sampling rates and from multiple diﬀerent

OFDM subcarriers of each TX-RX stream. For example, the driver of the Intel 5300 WiFi NIC,

which we use to implement our scheme, reports CSI values on 30 OFDM subcarriers [47] for each

TX-RX antenna pair for every CSI measurement. This leads to 30 matrices with dimensions $M_t \times M_r$
per CSI sample, where $M_t$ and $M_r$ denote the number of transmit and receive antennas respectively.

Such high dimensional data allows us to recover detailed information about the sleeping behavior

even if the breathing movements only incur small changes in the CSI.
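
To make the dimensionality concrete, the following sketch arranges CSI samples as a 4-D array and flattens them into one time series per subcarrier and antenna pair. The antenna counts (2 transmit, 3 receive), the number of samples and the random placeholder values are assumptions for illustration only; the 30 subcarriers come from the text.

```python
import numpy as np

NUM_SUBCARRIERS = 30  # reported by the Intel 5300 driver per Tx-Rx pair
M_T, M_R = 2, 3       # hypothetical antenna counts, chosen for illustration

def flatten_csi(csi):
    """Reshape (samples, subcarriers, Mt, Mr) complex CSI into (samples, streams)."""
    return csi.reshape(csi.shape[0], -1)

rng = np.random.default_rng(0)
shape = (100, NUM_SUBCARRIERS, M_T, M_R)
# Placeholder complex values standing in for measured CSI.
csi = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

streams = flatten_csi(csi)
print(streams.shape)  # (100, 180): 30 subcarriers x 2 x 3 antenna pairs
```

Each column of `streams` is one CSI time series, which is the layout assumed by the PCA based processing discussed later in this chapter.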

3.3.2 Breath-Multipath Model

Next, we develop our breath-multipath model to understand the eﬀect of small breathing

movements on the CSI signals. Based on this model, we design Serene’s signal processing pipeline

to robustly extract breathing waveforms in an individual and environment independent manner. Our

model shows that if we diﬀerentiate (i.e. by taking ﬁrst order diﬀerence) the CSI signals from each

WiFi subcarrier, and then max-min normalize the CSI signal projection corresponding to variations

due to breathing, we can robustly extract the waveform corresponding to user’s breathing motion

in an environment and individual independent manner, as long as the user sleeps close to the WiFi

receiver. Such proximity requirement is easy to satisfy during real-life in-home sleep scenarios by

either mounting receiver on the headboard of a bed frame or placing it on a table nearby. The basis

of our model is formed by a closed form expression, which we derive using time-varying Channel

Frequency Response (CFR) of WiFi channel. The time-varying CFR corresponding to a Tx-Rx

antenna pair for a subcarrier with wavelength λ can be quantiﬁed as [42]:

\begin{equation}
H(f, t) = H_s(f) + \underbrace{\sum_{i=1}^{N} \frac{K}{D_i(t)^2}\, e^{\frac{j 2\pi D_i(t)}{\lambda}}}_{H_d(f, t)} \tag{3.1}
\end{equation}

In the equation above, $N$ is the number of multipath reflections of the transmitted signal at the

Rx end, $D_i(t)$ represents the distance traveled by the $i$-th multipath reflection, and $K$ is an environment

dependent proportionality constant. $H_s(f)$ is the static component of CFR corresponding to all

non-user multipath reflections, while the second term on the right hand side corresponds to the

dynamic component of CFR, represented as $H_d(f, t)$, while the user is breathing and/or moving

during sleep. Now, let us assume that the user is sleeping at a distance $D_{0,i}$ from the router, and

$d_i(t)$ is the change in distance traveled by the $i$-th reflected path due to breathing. To make our scheme

resistant to static changes in the environment, we first eliminate $H_s(f)$ by differentiating the above

equation with respect to $t$, and substitute $D_i(t) = D_{0,i} + d_i(t)$ to get:

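\begin{equation}
H'(f, t) = \frac{d}{dt}\left[ \sum_{i=1}^{N} \frac{K}{D_{0,i}^2} \left(1 + \frac{d_i(t)}{D_{0,i}}\right)^{-2} e^{\frac{j 2\pi (D_{0,i} + d_i(t))}{\lambda}} \right] \tag{3.2}
\end{equation}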

As $d_i(t)$ caused by motion due to breathing is in the order of a few centimeters, whereas

$D_{0,i}$ is usually in the order of meters (i.e. $d_i(t) \ll D_{0,i}$), we can expand the negative polynomial
$\left(1 + \frac{d_i(t)}{D_{0,i}}\right)^{-2}$ via binomial series expansion. After performing binomial expansion, discarding the
$\frac{d_i(t)^m}{(D_{0,i})^n}$ terms with $n = 4$ or higher, and doing some algebraic manipulations, we get the following
expression for $H'(f, t)$:

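\[
H' \approx K e^{\frac{j 2\pi D_{0,i}}{\lambda}} \sum_{i=1}^{N} d_i'(t) \left( -\frac{2}{D_{0,i}^3} + j\left[ \frac{2\pi}{\lambda D_{0,i}^2} - \frac{4\pi d_i(t)}{\lambda D_{0,i}^3} \right] \right) e^{\frac{j 2\pi d_i(t)}{\lambda}}
\]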
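
For reference, the "algebraic manipulations" between Equation 3.2 and the expression above can be sketched for a single path $i$ as follows. This is our reading of the step, keeping only the first-order term of the binomial series (the next term contributes $3K d_i(t)^2 / D_{0,i}^4$, i.e. $n = 4$, and is discarded); it is not quoted from the original derivation.

```latex
% Single path i, with x = d_i(t) / D_{0,i} and |x| << 1.
\begin{align*}
(1 + x)^{-2} &= 1 - 2x + 3x^2 - \dots \;\approx\; 1 - \frac{2 d_i(t)}{D_{0,i}} \\
\frac{d}{dt}\left[ \frac{K}{D_{0,i}^2}\left(1 - \frac{2 d_i(t)}{D_{0,i}}\right) e^{\frac{j 2\pi (D_{0,i} + d_i(t))}{\lambda}} \right]
&= K e^{\frac{j 2\pi D_{0,i}}{\lambda}} \left[ -\frac{2 d_i'(t)}{D_{0,i}^3}
   + \left( \frac{1}{D_{0,i}^2} - \frac{2 d_i(t)}{D_{0,i}^3} \right) \frac{j 2\pi d_i'(t)}{\lambda} \right] e^{\frac{j 2\pi d_i(t)}{\lambda}} \\
&= K e^{\frac{j 2\pi D_{0,i}}{\lambda}}\, d_i'(t) \left( -\frac{2}{D_{0,i}^3}
   + j\left[ \frac{2\pi}{\lambda D_{0,i}^2} - \frac{4\pi d_i(t)}{\lambda D_{0,i}^3} \right] \right) e^{\frac{j 2\pi d_i(t)}{\lambda}}
\end{align*}
```

The second line applies the product rule: the first term comes from differentiating the amplitude factor and the second from differentiating the complex exponential.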

After converting the term inside summation into polar coordinates, and discarding the $\frac{d_i(t)^m}{(D_{0,i})^n}$

terms with $n = 4$ or higher, we get the following simplified expression for $H'(f, t)$:

\[
H' \approx \left[ \frac{2\pi K}{\lambda D_{0,i}^2} \sqrt{1 + \left( \frac{\lambda}{\pi D_{0,i}} \right)^2} \cdot e^{\frac{j 2\pi D_{0,i}}{\lambda}} \right] \cdot \left[ \sum_{i=1}^{N} d_i'(t)\, e^{\frac{j 2\pi d_i(t)}{\lambda} + j A_i} \right]
\]

Here, $A_i = \tan^{-1}\left\{ \frac{\pi D_{0,i}}{\lambda} \left( 1 - \frac{2 \cdot d_i(t)}{D_{0,i}} \right) \right\}$. Figure 3.2(a) shows variation of $A_i$ with $d_i(t)$, as $d_i(t)$

varies from 1cm to 20cm (typical range for motion due to human breathing is 1-5cm [82]), for

different router-receiver distances $D_{0,i}$ ranging from 3m - 10m (i.e. which is typical for regular

home use cases).
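
The sensitivity of $A_i$ to $d_i(t)$ can also be checked numerically with a few lines of code. The sketch below simply evaluates the expression for $A_i$ given above; the wavelength is an assumption (a carrier in the 5 GHz band), since the text does not state which band the figure was computed for, and the function names are hypothetical.

```python
import math

WAVELENGTH_M = 3e8 / 5.2e9  # assumed 5 GHz band carrier; not stated in the text

def phase_term_deg(d0_m, d_m, wavelength_m=WAVELENGTH_M):
    """A_i = atan((pi * D0 / lambda) * (1 - 2 * d / D0)), returned in degrees."""
    ratio = (math.pi * d0_m / wavelength_m) * (1.0 - 2.0 * d_m / d0_m)
    return math.degrees(math.atan(ratio))

def change_in_phase_term_deg(d0_m, d_m):
    """Change in A_i relative to its value at d = 0."""
    return phase_term_deg(d0_m, d_m) - phase_term_deg(d0_m, 0.0)

# Change in A_i when the path length changes by 20 cm, for several distances.
for d0 in (3, 5, 10):
    print(d0, "m:", change_in_phase_term_deg(d0, 0.20), "degrees")
```

Under this assumed wavelength the change stays at a few hundredths of a degree even for a 20 cm displacement, and it shrinks as the router-receiver distance grows.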

We observe that changes in $d_i(t)$ do not significantly affect the value of $A_i$. Moreover, the impact

of $d_i(t)$ on $A_i$ decreases even further as the distance between the receiver and the router it is connected

to increases. Therefore, we can safely approximate $A_i \approx \tan^{-1}\left\{ \frac{\pi D_{0,i}}{\lambda} \right\} = A_{0,i}$ and write $H'(f, t)$

as:

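\[
H' \approx \left[ \frac{2\pi K}{\lambda D_{0,i}^2} \sqrt{1 + \left( \frac{\lambda}{\pi D_{0,i}} \right)^2}\; e^{j\left( \frac{2\pi D_{0,i}}{\lambda} + A_{0,i} \right)} \right] \cdot \sum_{i=1}^{N} d_i'(t)\, e^{\frac{j 2\pi d_i(t)}{\lambda}}
\]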

The first term on the right hand side of the equation above stays constant when the receiver is placed

on some surface, e.g. a desk/table, and is not moving. Denoting the magnitude of this constant term by $C_{0,i}$, we write the amplitude of CFR, i.e. $|H'(f, t)|$, as:

\begin{equation}
|H'(f, t)| \approx C_{0,i} \cdot \left| \sum_{i=1}^{N} d_i'(t)\, e^{\frac{j 2\pi d_i(t)}{\lambda}} \right| \tag{3.3}
\end{equation}

[Figure 3.2: (a) Variation of $A_i$ with $d_i(t)$ for different $D_{0,i}$ (x-axis: $d_i(t)$ in centimeters, 0 to 20; y-axis: change in $A_i$ in degrees; one curve per router-laptop distance from 3 to 10 meters); (b) Single breath samples for the 7 receiver configurations shown in (c); (c) Configurations of receiver.]

The waveform $\left| \sum_{i=1}^{N} d_i'(t)\, e^{\frac{j 2\pi d_i(t)}{\lambda}} \right|$ corresponds to the variations due to breathing. The proportionality

term $C_{0,i}$ in breathing samples extracted from $|H'(f, t)|$ corresponding to different

placement of receiver can be easily eliminated via max-min normalization. Figure 3.2(b) shows

extracted and processed single breath samples from a user for seven slightly diﬀerent receiver place-

ment conﬁgurations close to the user, while the router was in subject’s TV-lounge (router-receiver

distance >10 meters).
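
The two claims of the breath-multipath model (the first order difference removes the static component, and max-min normalization removes the placement dependent constant) can be checked on synthetic data. The sketch below simulates a single-path instance of Equation 3.1; the wavelength, distances, constants, sampling rate and the sinusoidal 1 cm breathing displacement are all hypothetical values chosen for illustration.

```python
import numpy as np

WAVELENGTH_M = 0.0577  # assumed 5 GHz band wavelength; not stated in the text

def simulate_cfr(t, d0, k, hs, breath_amp_m=0.01, breath_hz=0.25):
    """Single-path Eq. 3.1: H = Hs + K / D(t)^2 * exp(j * 2*pi * D(t) / lambda)."""
    d = d0 + breath_amp_m * np.sin(2 * np.pi * breath_hz * t)
    return hs + k / d**2 * np.exp(1j * 2 * np.pi * d / WAVELENGTH_M)

def breathing_waveform(h):
    """Amplitude of the first order difference, then max-min normalization."""
    mag = np.abs(np.diff(h))
    return (mag - mag.min()) / (mag.max() - mag.min())

fs = 100.0  # Hz, hypothetical CSI sampling rate
t = np.arange(0.0, 30.0, 1.0 / fs)

# Same geometry, different static component Hs and proportionality constant K.
w1 = breathing_waveform(simulate_cfr(t, d0=4.0, k=1.0, hs=0.3 + 0.1j))
w2 = breathing_waveform(simulate_cfr(t, d0=4.0, k=2.5, hs=-0.7 + 0.4j))
# Different path length as well.
w3 = breathing_waveform(simulate_cfr(t, d0=6.0, k=1.0, hs=0.3 + 0.1j))

print(np.max(np.abs(w1 - w2)), np.max(np.abs(w1 - w3)))
```

In this simulation the waveforms for different static components and constants coincide up to floating point error, and changing the path length alters the normalized waveform only marginally, which is consistent with Equation 3.3.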

3.3.3 Breath-Subspace Model

Next, we present our breath-subspace model to understand how breathing aﬀects the signal

subspace formed by WiFi subcarriers compared to other bodily movements. Today’s MIMO and

OFDM based WiFi devices use many frequency subcarriers and multiple transmit-receive (Tx-Rx)

antennas for data communication. The MIMO system between the OFDM subcarriers and the

Tx-Rx antennas, forms a multidimensional array which eﬀectively represents a high-dimensional

mathematical space. Contained in this space is the signal subspace along frequency and spatial

dimensions. The key intuition behind our model is that while a user is sleeping, the signal subspace

along these dimensions is aﬀected by both breathing and body/limb motion. When there is no

body/limb motion, there is only one dominant time-varying component in the subspace, which

corresponds to breathing. However, more components along these dimensions evolve (i.e. show

considerable variations) during other body/limb activity e.g. during roll overs or arm/leg move-

ment. Based on this principle, Serene isolates breathing from limb motion without requiring any

environment-dependent calibrations.

To model this in Serene, we track the top dominant components in the CSI signal subspace

using Principal Component Analysis (PCA). Figure 3.3(a) shows power values in top 5 PCA

projections of the CSI signals corresponding to multiple sleep epochs during a sleep experiment,

where the dotted lines correspond to epochs with motion events—highlighted in Figure 3.3(b).

We observe that in the absence of any body/limb activity, the top-most PCA projection is enough

to track breathing as it is the only major motion occurring in the environment. However, during

body/limb movements, multiple lower PCA projections also show signiﬁcant variations. Based

on this phenomenon, we accurately detect and then discard the CSI values corresponding to any

body/limb activity by tracking variations in the lower PCA components (e.g. 3, 4 and 5) using a

multi-dimensional clustering technique, which we discuss in §2.6.
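
The subspace behaviour described above can be reproduced on synthetic data. The sketch below is illustrative only: it uses a plain power threshold on the lower principal components as a stand-in for the multi-dimensional clustering technique, and the number of streams, noise level, source frequencies and the 0 dB threshold are hypothetical.

```python
import numpy as np

def pca_power_db(epoch):
    """Variance (in dB) of each principal component of a (samples, streams) epoch."""
    centered = epoch - epoch.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2 / (epoch.shape[0] - 1)
    return 10.0 * np.log10(variances + 1e-12)

def has_body_movement(epoch, lower_components=(2, 3, 4), threshold_db=0.0):
    """Flag an epoch if PCA components 3 to 5 (0-based 2 to 4) carry notable power."""
    power = pca_power_db(epoch)
    return bool(np.any(power[list(lower_components)] > threshold_db))

rng = np.random.default_rng(1)
fs, n_streams = 20, 90
t = np.arange(30 * fs) / fs  # one 30 s epoch

def mix(source):
    """Spread one source over all streams with random (hypothetical) weights."""
    return np.outer(source, rng.standard_normal(n_streams))

# Breathing only: a single dominant time-varying component plus weak noise.
noise = 0.05 * rng.standard_normal((t.size, n_streams))
quiet = mix(np.sin(2 * np.pi * 0.25 * t)) + noise

# Body movement: several extra independent components during a 5 s burst.
burst = (t > 10) & (t < 15)
moving = quiet.copy()
for f_hz in (0.7, 1.1, 1.6, 2.3):
    moving += mix(burst * np.sin(2 * np.pi * f_hz * t))

print(has_body_movement(quiet), has_body_movement(moving))
```

In the breathing-only epoch only the first projection carries significant power, whereas the movement burst raises the power of the lower projections, which is the cue used to discard such epochs.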

[Figure 3.3: Impact of bodily activity during sleep on WiFi subspace. (a) Power values in top PCA projections (y-axis: power in dB; x-axis: PCA projection IDs 1 to 5 in descending order of variance; series: rollover during sleep, slight arm movement, rollover + waking up); (b) CSI timeseries for Fig. 3.3(a).]

[Figure 3.4: Our WiFi CSI signal processing architecture for extracting vital signs. Blocks shown: WiFi CSI Signal; First Order Difference; Denoising Module (Lowpass Filtering, PCA Decomposition, Downsampling (Epoch Construction)); Body Movement Detector (Multi-Dimensional Clustering-based Subspace Tracking, parameterized using Xethru motion ground truth signals); Breathing Rate Monitor, used when no body movements are detected (1st PCA Projection, Bandpass Filtering, Outage Detection, Peak Detection, parameterized using Xethru breath rate ground truth signal); outputs: Breath Rate Signal and Body Movement Signal.]


3.4 CSI Signal Processing Architecture

To obtain CSI data in real-time during sleep, we develop a client-server based mechanism to

communicate the CSI values extracted from WiFi NIC to a Python based CSI processing server.

Based on our discussion in §5.3, we take ﬁrst order diﬀerence of the incoming CSI data and then

take its amplitude i.e. |H′( f , t)| for further processing. From now onward, we use the term “CSI”
to denote |H′( f , t)|. CSI data is collected in 30 second epochs, which is typically the partitioning

convention followed by most sleep monitoring systems. First, we perform basic low-pass

filtering to remove bursty hardware noise and isolate the signal of interest, i.e. to

extract human motion related frequencies only. Second, we perform PCA on the low-pass filtered

CSI streams for dimensionality reduction and automatic distinguishing of CSI variations due to

body movements from those of breathing in diﬀerent subcarriers based on our discussion in §3.3.3.

This avoids the need for complex trial-and-error based subcarrier selection procedures used in

previous works [77, 80]. Third, we harmonise the ﬁltered CSI data corresponding to each 30s sleep

epoch into uniformly sampled and consistent measurements via down-sampling. Fourth, we robustly

detect body movements by tracking lower PCA projections of CSI signals using a clustering-based

event detection technique. Finally, we ﬁrst detect the presence of breathing using a power threshold,

and then perform further band-pass ﬁltering to extract the signal corresponding to breathing. Figure

3.4 shows our system architecture. Next, we brieﬂy discuss Serene’s noise removal process.
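
The epoch construction and down-sampling steps of the pipeline above can be sketched as follows. The 30 s epoch length, the 800 Hz nominal sampling rate and the 600 samples per epoch are taken from this chapter; the block-averaging method and the function names are assumptions, since the exact down-sampling method is not specified here.

```python
def split_into_epochs(samples, fs_hz, epoch_s=30):
    """Split samples into consecutive full epochs; a trailing partial epoch is dropped."""
    n = int(fs_hz * epoch_s)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

def downsample_mean(epoch, target_len=600):
    """Reduce an epoch to target_len samples by averaging equal-sized blocks."""
    if len(epoch) % target_len != 0:
        raise ValueError("epoch length must be a multiple of target_len")
    block = len(epoch) // target_len
    return [sum(epoch[i:i + block]) / block for i in range(0, len(epoch), block)]

FS_HZ = 800
# Placeholder stream: two full 30 s epochs plus 100 leftover samples.
stream = [float(n) for n in range(2 * 30 * FS_HZ + 100)]

epochs = split_into_epochs(stream, FS_HZ)
reduced = [downsample_mean(epoch) for epoch in epochs]
print(len(epochs), len(reduced[0]))  # 2 600
```

At 800 Hz a 30 s epoch holds 24000 samples, so reducing it to 600 samples corresponds to a down-sampling factor of 40.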

3.4.1 Noise Removal

Commodity WiFi NICs report noisy CSI values, both due to hardware limitations (such as low

resolution Analog to Digital Converters (ADCs)) and due to changing transmission power and rates.

We use a combination of a median filter and an exponential moving average filter to get rid of such
bursty noise and spikes in the CSI time series. After this basic filtering step, we remove the
remaining high frequency variations in the CSI signals using a Butterworth low-pass filter. Movement
during sleep causes low frequency variations, typically under 5 Hz [82], and the low-pass filter
separates these variations from higher frequency noise in the CSI values. Due to the maximally
flat amplitude response of the Butterworth filter, its application on the CSI time series does not
distort the shape of CSI variations due to body motion. Our scheme samples CSI values at a nominal
frequency $F_s = 800$ Hz. With this in mind, we use the cut-off frequency

$$\omega_c = \frac{2\pi f}{F_s} = 0.0125\pi \ \text{rad/sample}$$

(i.e. $f = 5$ Hz) for Butterworth filtering. We apply the same filter on the CSI time series of all
the subcarriers, making sure that every CSI stream experiences the same phase distortion and group
delay introduced by the filter.
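The basic de-spiking stage and the cut-off computation can be sketched as follows. This is only an illustration under stated assumptions: the window width of the median filter and the smoothing factor `beta` are hypothetical values (the text does not report them), and the Butterworth stage itself is not reproduced here.

```python
import math
from statistics import median


def normalized_cutoff(f_cut_hz, fs_hz):
    """Normalized cut-off frequency w_c = 2*pi*f / Fs."""
    return 2 * math.pi * f_cut_hz / fs_hz


def median_filter(x, width=5):
    """Sliding median; the window is truncated at the edges.

    `width` is an assumed value, not one reported in the text.
    """
    half = width // 2
    return [median(x[max(0, i - half): i + half + 1]) for i in range(len(x))]


def ema_filter(x, beta=0.2):
    """Exponential moving average: y[n] = beta*x[n] + (1 - beta)*y[n-1].

    `beta` is an assumed smoothing factor.
    """
    out = []
    for v in x:
        out.append(v if not out else beta * v + (1 - beta) * out[-1])
    return out
```

For a 5 Hz cut-off and 800 Hz sampling, `normalized_cutoff(5, 800)` evaluates to 0.0125π, which matches the value used above.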

Based on our experimental results, we observed that ﬁltered CSI waveforms still retain some

noisy variations which are not related to activity/breathing. We avoid any further low pass ﬁltering

on CSI streams as it can lead to loss in details of variations due to activity/breathing behavior. To

remove such noise, we utilize the fact that the CSI variations in CSI streams of multiple subcarriers

in each Tx-Rx antenna pair are correlated. We apply PCA on CSI obtained from all subcarriers

and all Tx-Rx pairs, and retain only the waveforms that represent the most common variations in

all the subcarriers, i.e., the variations due to breathing and/or body movements during sleep. That

is, signal subspace-based ﬁltering enables our scheme to automatically obtain the signals that are

representative of the monitored vital signs only. PCA also reduces the dimensionality of data by

discarding the principal components unrelated to the vital signs i.e. the noise subspace. Finally,

we rearrange the multi-dimensional ﬁltered CSI data corresponding to each 30s sleep epoch into

consistent length samples (600 in our current implementation) via down-sampling, performing zero-

padding where necessary. Although we have partitioned the incoming CSI data into 30s epochs,

we concatenate data from consecutive epochs for real-time tracking of vital signs (e.g. breathing)

which we discuss later in this section.
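The epoch harmonisation step can be sketched as below. The sketch assumes that the data has already been low-pass filtered, so that keeping evenly spaced samples does not introduce aliasing, and that short epochs are padded with trailing zeros; `resample_epoch` is a hypothetical helper name.

```python
def resample_epoch(samples, target_len=600):
    """Bring one epoch of one PCA stream to exactly target_len samples.

    Longer epochs are reduced by keeping evenly spaced samples;
    shorter epochs are zero-padded at the end (assumed padding side).
    """
    n = len(samples)
    if n >= target_len:
        return [samples[(i * n) // target_len] for i in range(target_len)]
    return list(samples) + [0.0] * (target_len - n)
```

A full 30 s epoch at 800 samples per second holds 24000 samples, so every 40th sample is kept and the resulting stream has an effective rate of 20 samples per second.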

3.4.2 Tracking Body Movements

As discussed earlier, today’s MIMO and OFDM based WiFi devices use multiple frequency

subcarriers and Tx-Rx antennas for data communication. The MIMO system between the Tx-Rx

antennas, and the OFDM subcarriers, form a multidimensional tensor along space-frequency axes.

Contained in such tensor is the signal subspace we wish to track for breathing and body motion.


We observed that when there is no body/limb motion, there is only one dominant, time-varying

component in the signal subspace, which corresponds to breathing. PCA sorts diﬀerent principal

components in descending order of their variation. During sleep, the signal subspace is rather quiet

and breath is captured in the top PCA projection of the CSI time series. However, we observed

that during episodes of other body movements—e.g. during roll overs or arm/leg movement—more

signal subspace components evolve, since body movements cause more pronounced variations in

the spatial-frequency subspace compared to faint breathing movements.

3.4.2.1 Body Movements Detection Approach

To robustly distinguish body/limb motion from breathing, we propose to use multi-dimensional
hyper-ellipsoidal clustering on the lower PCA projections of CSI data. At a high level, we can think
of this clustering method as a high-dimensional generalization of a Gaussian outlier rejection
technique whereby measurements a few sigmas away from the mean are deemed erroneous. Specifically,
let $R_k = \{r_1, r_2, \cdots, r_k\}$ be the first $k$ samples of CSI vectors containing values from
the selected signal subspace (we use PCA projections 3, 4 and 5 in our current implementation). Each
sample $r_i$ is a $d \times 1$ vector in $\mathbb{R}^d$, where $d$ is the number of signal subspace
components. This hyper-ellipsoidal technique clusters the normal data points (i.e. when there is no
body movement in the environment), and any points lying outside the cluster are declared as outliers.
The boundary of the cluster (a “hyper-ellipsoid” in this case) is related to a distance metric which
is a function of the mean $m_{R,k}$ and covariance $S_k$ of the incoming signal subspace components
$R_k$. We use the Mahalanobis distance metric, $D_i$, for which the cluster is arrived at according
to the following [91]:

$$e_k(m_R, S_k^{-1}, t) = \Big\{ r_i \in \mathbb{R}^d \;\Big|\; \underbrace{\sqrt{(r_i - m_{R,k})^T S_k^{-1} (r_i - m_{R,k})}}_{D_i \,=\, \text{Mahalanobis distance of } r_i} \le t \Big\} \qquad (3.4)$$

where $e_k$ is the set of normal data points whose Mahalanobis distance satisfies $D_i \le t$, and
$t$ is the effective radius of the hyper-ellipsoid. The choice of $t$ depends on the distribution of
the normal data points. If the normal data follows a multivariate Gaussian distribution, so that the
squared Mahalanobis distance follows a chi-squared distribution with $d$ degrees of freedom, it has
been shown that up to 98% of the incoming normal data can be enclosed by the boundary of a
hyper-ellipsoid if the effective radius $t$ is chosen such that $t^2 = (\chi^2_d)^{-1}_{0.98}$ [91].
Data samples $r_i$ for which $D_i > t$ are therefore identified as outliers. As it is often not
practical to store all the samples of a data stream, a recursive algorithm is required to update
$e_k$. Let $r_{k+1}$ be the most recent CSI sample. The means

$$m_{R,k+1} = \frac{k\, m_{R,k} + r_{k+1}}{k+1} \quad \text{and} \quad m_{R^2,k+1} = \frac{k\, m_{R^2,k} + r_{k+1} r_{k+1}^T}{k+1}$$

can be updated recursively from the previous means. By substituting the covariance matrix
$S_k = m_{R^2,k} - m_{R,k} m_{R,k}^T$ into Eq. (3.4) we can represent $e_k$ entirely in terms of
means. The resulting equation updates the cluster boundary

and classifies the incoming data samples as normal readings or outliers. Our scheme uses the above
equations for the initial estimation of the mean and covariance. Afterwards, the mean $m_{R,k}$ is
recursively calculated using an exponential moving average technique, where the mean is updated as
$m_{R,k+1} = \alpha\, m_{R,k} + (1 - \alpha)\, r_{k+1}$, with $\alpha = 0.9995$ in our
implementation. Moreover, after the initial estimation of the covariance, our scheme recursively
updates the covariance inverse $S_k^{-1}$ using the following equation [91], which avoids the extra
computation required for calculating the inverse of the matrix $S$:

$$S_{k+1}^{-1} = \frac{k\, S_k^{-1}}{\alpha (k-1)} \times \left[ I - \frac{(r_{k+1} - m_{R,k})(r_{k+1} - m_{R,k})^T S_k^{-1}}{\frac{(k-1)}{\alpha} + (r_{k+1} - m_{R,k})^T S_k^{-1} (r_{k+1} - m_{R,k})} \right] \qquad (3.5)$$
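The recursive outlier test of Eqs. (3.4) and (3.5) can be sketched in a few lines of Python. This is a minimal illustration, not Serene's implementation: matrices are plain lists of lists, the helper names are hypothetical, and the sketch assumes that every sample (including detected outliers) is used to update the cluster, which the text does not specify. The threshold constant assumes $d = 3$ and uses the 0.98 quantile of the chi-squared distribution with 3 degrees of freedom, taken here as approximately 9.837 from standard tables.

```python
import math

# Assumed threshold for d = 3: t^2 is the 0.98 chi-squared quantile (about 9.837).
T_D3 = math.sqrt(9.837)


def _s_inv_times(s_inv, u):
    """Matrix-vector product S^-1 u for a list-of-lists matrix."""
    return [sum(row[j] * u[j] for j in range(len(u))) for row in s_inv]


def mahalanobis(r, mean, s_inv):
    """D_i of Eq. (3.4): distance of sample r from the cluster centre."""
    u = [ri - mi for ri, mi in zip(r, mean)]
    return math.sqrt(sum(ui * vi for ui, vi in zip(u, _s_inv_times(s_inv, u))))


def update_inverse_cov(s_inv, mean, r_new, k, alpha=0.9995):
    """Eq. (3.5): update S^-1 with r_new, where mean is m_{R,k} and k > 1."""
    u = [ri - mi for ri, mi in zip(r_new, mean)]
    v = _s_inv_times(s_inv, u)  # S^-1 u; equals (u^T S^-1)^T since S^-1 is symmetric
    denom = (k - 1) / alpha + sum(ui * vi for ui, vi in zip(u, v))
    scale = k / (alpha * (k - 1))
    d = len(u)
    return [[scale * (s_inv[i][j] - v[i] * v[j] / denom) for j in range(d)]
            for i in range(d)]


def update_mean(mean, r_new, alpha=0.9995):
    """Exponential moving average update of the cluster centre."""
    return [alpha * m + (1 - alpha) * r for m, r in zip(mean, r_new)]


def process_sample(r_new, mean, s_inv, k, t, alpha=0.9995):
    """Classify r_new, then update the cluster.

    Returns (is_outlier, new_mean, new_s_inv). The inverse covariance is
    updated first because Eq. (3.5) uses the previous mean m_{R,k}.
    """
    outlier = mahalanobis(r_new, mean, s_inv) > t
    new_s_inv = update_inverse_cov(s_inv, mean, r_new, k, alpha)
    new_mean = update_mean(mean, r_new, alpha)
    return outlier, new_mean, new_s_inv
```

Keeping only the mean and the inverse covariance means that each new sample costs a handful of small matrix-vector products, with no explicit matrix inversion and no stored history.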

To determine the start and end of activity waveforms, we use both the cardinality and the temporal
proximity of the detected outliers. If the number of consecutive outliers exceeds a threshold $E_1$,
we declare a micro-event. Multiple micro-events constitute an activity event. All the data points,
including the points constituting the micro-events as well as the points in between consecutive
micro-events, are recorded as part of the activity waveform (merging). We only merge micro-events
that are within $E_2$ data points of each other. Both $E_1$ and $E_2$ are design-time thresholds
that are easy to tune. Figure 3.5 shows some body movements detected by our algorithm in a portion
of the processed CSI time series corresponding to an in-home full-night sleep monitoring experiment.
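The micro-event logic described above can be sketched as follows. The sketch takes a per-sample list of outlier flags as input and assumes that the gap between two micro-events is measured from the last sample of one to the first sample of the next; the function names are hypothetical.

```python
def find_micro_events(outliers, e1):
    """Return (start, end) index pairs of runs with more than e1 consecutive outliers."""
    events = []
    start = None
    # A trailing False closes a run that reaches the end of the data.
    for i, flag in enumerate(list(outliers) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > e1:
                events.append((start, i - 1))
            start = None
    return events


def merge_micro_events(events, e2):
    """Merge micro-events whose gap is at most e2 samples into activity events."""
    merged = []
    for start, end in events:
        if merged and start - merged[-1][1] <= e2:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

Each merged pair marks the first and last sample of one activity waveform, including the non-outlier samples that lie between its micro-events.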


[Figure 3.5 plot. Legend: PCA1, PCA2, PCA3, PCA4, PCA5, Detected Movement, Radar Ground Truth. Y-axis: signal value (0 to 0.3). X-axis: minutes into sleep (340 to 370).]

Figure 3.5: Example showing performance of our body movement detection algorithm, compared

to Xethru radar ground truth. Boxes show the areas where breathing is usually present. Ground

truth is approximately synchronized with CSI data

3.4.3 Tracking Breath

Human breath involves motion of the chest and lungs during inspiration (when air enters the lungs)
and expiration (when air is blown out from the lungs) [82]. These motions are often periodic (e.g.
in the case of healthy subjects [82]) and therefore cause periodic variations in the WiFi channel,
which we can extract using CSI data. In the absence of other body movements, the first PCA
projection of CSI data captures these variations due to breathing. However, as these minute
variations are often embedded in noise, and because the human subject might not be in proximity of
the RX device, we cannot always assume that a breathing signal exists in a particular sleep epoch.
Therefore, to robustly track breathing, we propose the following signal processing pipeline.

3.4.3.1 Bandpass Filtering

To extract the periodic variations in CSI due to breathing, we apply a Butterworth bandpass
filter on the first PCA projection. We choose the parameters of this filter according to the
fact that the breathing rate of humans (including adults as well as newborn babies) ranges between
10 and 40 breaths per minute (BPM) [82], i.e. roughly 0.17 to 0.67 Hz. This step removes any
non-breathing related noise present in the signal.

3.4.3.2 Detecting Outage in Breathing Signal

We deﬁne the outage of our system as the event when variations due to breathing are not present

in the CSI signal. To detect outage, our system determines the noise ﬂoor of the environment using

the ﬁrst PCA projection. During real-time tracking, our system compares the average power of the


signal, calculated over 7.5s windows of a 30s sleep epoch (i.e. 1/4th of an epoch’s duration), with

the determined noise ﬂoor. Our breathing rate measurement algorithm only runs if average power

of the sleep epoch is greater than the noise ﬂoor.
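The outage check can be sketched as below. The text does not state how the four window powers of an epoch are combined, so this sketch assumes that their mean is compared with the noise floor; the noise floor itself is taken as a given input and the function names are hypothetical.

```python
def window_powers(signal, fs, window_seconds=7.5):
    """Average power (mean of squared samples) of consecutive windows."""
    n = int(fs * window_seconds)
    powers = []
    for i in range(0, len(signal), n):
        chunk = signal[i:i + n]
        powers.append(sum(v * v for v in chunk) / len(chunk))
    return powers


def breathing_present(epoch, fs, noise_floor):
    """True if the epoch's average power exceeds the noise floor.

    Assumption: the epoch power is the mean of its window powers.
    The epoch must contain at least one sample.
    """
    powers = window_powers(epoch, fs)
    return sum(powers) / len(powers) > noise_floor
```

At 20 samples per second, a 600-sample epoch is split into four windows of 150 samples each.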

3.4.3.3 Measuring Breathing Rate

We design our system to measure breathing rate in terms of breaths per minute (BPM). We

measure the rate over a window of two sleep epochs in length, which moves over concatenated data

from multiple consecutive sleep epochs. To report BPM every second, we move this window over
the concatenated data every second (i.e. 20 samples at a time).

To measure breathing rate, we employ a peak detection based approach. First, we max-min nor-

malize the signal so that parametrization of our peak detection algorithm can be easily generalized

to diﬀerent users. Second, to robustly detect the number of peaks, we use three parameters, namely

minimum peak prominence (MINPRO), minimum peak distance (MINDIST), and minimum peak

strength (MINSTR). The prominence of a peak measures how much the peak stands out, due to its

height and location, relative to other peaks around it. We tune MINPRO such that we only detect

those peaks which have a relative importance of at least MINPRO. We tune MINDIST according

to the fact that human breathing rate ranges between 10-40 BPM [82], so that redundant peaks are

discarded. To further sift out redundant peaks, we only choose peaks of value greater than MINSTR

times the median peak value. In our current implementation, we chose MINPRO = 0.025, MINDIST

= 1.5 seconds and MINSTR = 0.6 which generalize well for diﬀerent sleeping scenarios. To achieve

accurate tracking of breathing rate, we perform parameterization of Serene’s breath estimation

algorithm using ground truths obtained from a contact-less COTS Xethru X4M200 Breath/Motion

sensor [155]. We perform this parametrization only during the design of our system and do not

require any end-user calibration eﬀort in the real-world deployments.
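A peak-counting estimator along these lines can be sketched in plain Python. This is an illustrative sketch rather than Serene's algorithm: prominence is computed with the usual topographic definition, the minimum-distance rule keeps taller peaks first, and the conversion from peak count to BPM (peaks divided by the window duration in minutes) is an assumption, as the text does not spell it out. Function names are hypothetical; the default parameters are the values reported above.

```python
from statistics import median


def normalize(x):
    """Max-min normalization to the range [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x] if hi > lo else [0.0] * len(x)


def prominence(x, i):
    """Height of peak i above the higher of its two surrounding valleys.

    Each valley is the lowest sample between the peak and the nearest
    taller sample (or the signal edge) on that side.
    """
    left = right = x[i]
    j = i - 1
    while j >= 0 and x[j] <= x[i]:
        left = min(left, x[j])
        j -= 1
    j = i + 1
    while j < len(x) and x[j] <= x[i]:
        right = min(right, x[j])
        j += 1
    return x[i] - max(left, right)


def find_breath_peaks(x, fs, min_pro=0.025, min_dist_s=1.5, min_str=0.6):
    """Indices of peaks that satisfy MINPRO, MINDIST and MINSTR."""
    x = normalize(x)
    peaks = [i for i in range(1, len(x) - 1)
             if x[i - 1] < x[i] >= x[i + 1] and prominence(x, i) >= min_pro]
    kept = []
    # Enforce the minimum distance, giving priority to taller peaks.
    for i in sorted(peaks, key=lambda p: x[p], reverse=True):
        if all(abs(i - j) >= min_dist_s * fs for j in kept):
            kept.append(i)
    if not kept:
        return []
    floor = min_str * median(x[i] for i in kept)
    return sorted(i for i in kept if x[i] >= floor)


def breaths_per_minute(x, fs):
    """Assumed conversion: number of peaks per minute of window length."""
    return len(find_breath_peaks(x, fs)) * 60.0 * fs / len(x)
```

For a window of two sleep epochs at 20 samples per second, `x` holds 1200 samples, so the peak count equals the BPM estimate directly.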

3.5 Sleep Scoring

In this work, we take an actigraphy based approach towards sleep monitoring, where we classify

the stage of each minute as sleep or awake period. Our approach is inspired by the classic actigraphy


based method proposed in [150], which determines sleep-awake stage of a minute by taking into

account body movement related information corresponding to the surrounding minutes. The activity

sleep-awake scores determined by their technique have been shown to agree with EEG based sleep

monitoring 94.46% of the time [150]. In our implementation, we adopt the following model from

their work, which takes 4 previous minutes and 2 following minutes into account to classify stage

of the current minute:

$$s_m = \rho \times (w_{-4}a_{-4} + w_{-3}a_{-3} + w_{-2}a_{-2} + w_{-1}a_{-1} + w_0 a_0 + w_{+1}a_{+1} + w_{+2}a_{+2}) \qquad (3.6)$$

where $s_m$ is the average sleep-awake score for the current minute, $\rho$ is a scaling factor,
$a_{-i}$, $a_0$, $a_{+i}$ are activity scores (normalized number of body movement events in each
minute) for the previous, current and following minutes, and $w_{-i}$, $w_0$, $w_{+i}$ represent the
weights for the previous, current and following minutes. If $s_m \le 1$, the current minute’s stage
is classified as sleep, and if $s_m > 1$, it is classified as awake. In our implementation, we chose
$\rho = 0.125$, $w_{-4} = 0.15$, $w_{-3} = 0.15$, $w_{-2} = 0.15$, $w_{-1} = 0.08$, $w_0 = 0.21$,
$w_{+1} = 0.12$, $w_{+2} = 0.13$, as suggested by the authors of [150] for best results in their
real-world deployments.
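Eq. (3.6) translates directly into code. The sketch below uses the weights listed above; how the first and last minutes of a night are handled is not stated in the text, so the sketch assumes that minutes outside the recording contribute zero activity. The function names are hypothetical.

```python
RHO = 0.125
WEIGHTS = {-4: 0.15, -3: 0.15, -2: 0.15, -1: 0.08, 0: 0.21, 1: 0.12, 2: 0.13}


def sleep_score(activity, m):
    """Score s_m of Eq. (3.6) for minute m of a per-minute activity list."""
    total = 0.0
    for offset, weight in WEIGHTS.items():
        idx = m + offset
        # Assumption: minutes outside the recording contribute nothing.
        if 0 <= idx < len(activity):
            total += weight * activity[idx]
    return RHO * total


def classify_night(activity):
    """Label each minute as 'sleep' (s_m <= 1) or 'awake' (s_m > 1)."""
    return ["awake" if sleep_score(activity, m) > 1 else "sleep"
            for m in range(len(activity))]
```

Because the score of a minute depends on the two following minutes, a real-time system can only finalize the label of a minute two minutes after it ends.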

[Figure 3.6 plot. Legend: PCA Projection 3, PCA Projection 4, PCA Projection 5, Sleep/Awake State Marker. Y-axis: signal value (0 to 0.2). X-axis: minutes into sleep (50 to 400).]

Figure 3.6: Example showing Sleep/Awake classiﬁcation for full night’s sleep of a subject. Sleep

eﬃciency was 62.1%

We also calculate sleep efficiency (SE) for each night, which is defined as the ratio of the actual
time spent in sleep stages to the total time spent in bed, i.e. $T_{sleep}/(T_{sleep} + T_{awake})$.
Figure 3.6 shows the Sleep/Awake classification for a full night’s sleep of a subject, where sleep
efficiency was determined to be 62.1%.
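Given per-minute sleep/awake labels, sleep efficiency reduces to a simple ratio. The sketch below assumes the labels are the strings "sleep" and "awake" (a hypothetical encoding) and reports the result as a percentage.

```python
def sleep_efficiency(stages):
    """Percentage of minutes in bed that were classified as sleep."""
    if not stages:
        return 0.0
    t_sleep = sum(1 for s in stages if s == "sleep")
    return 100.0 * t_sleep / len(stages)
```

For example, a night with three sleep minutes and one awake minute gives `sleep_efficiency(["sleep", "sleep", "sleep", "awake"])`, i.e. 75.0.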


3.6 Implementation and Evaluation

In this section, we present the performance evaluation of our system. Our results show that Serene

can robustly and accurately track vital signs during sleep across diﬀerent users and environments.

Next, we ﬁrst discuss our hardware implementation and experimental settings.

Figure 3.7: The real-world deployment scenarios used for evaluation of our sleep monitoring

scheme “Serene”

3.6.1 Hardware Implementation

We developed compact HummingBoard (HMB)-based nodes as sleep monitoring devices [127], which makes
Serene easy to deploy. The HMB nodes were equipped with Intel 5300 NICs with modified drivers for
extracting CSI information [47]. We used Linksys AC1200+ routers as transmitters in our deployments.
Moreover, we developed a client-server software architecture (in C and Python, respectively) capable
of real-time extraction and processing of CSI data throughout the night. For body movement and
breathing rate ground truths, we deployed state-of-the-art pulse-Doppler radar-based Xethru X4M200
Breath/Motion sensors [155]. In terms of breathing rate accuracy, the X4M200 devices have been shown
to perform very closely to medical-grade, gold standard equipment (the X4M200 has been shown to
track breathing with up to 96% accuracy when compared to PSG) [154]. We utilize these ground truths
in our system for: (1) the robust parametrization of our signal processing pipelines (e.g. breath
tracking) and (2) measuring breath tracking accuracies.


3.6.2 Experimental Settings

We deployed our system in 5 apartments, where we collectively recorded more than 80 nights

(>550 hours) of data from 5 diﬀerent participants. The participants were male graduate students

aged between 23 and 32 years. The duration of data collection for each participant varied from 5 to
31 days. Figure 3.7 shows the real-world deployment scenarios for our sleep experiments. The numbers
in circles specify user/environment IDs. Data collected from environments 2 and 3 corresponds to
the NLOS deployment scenario and constitutes 55% of our dataset. Data collected from environments
1, 4, and 5 corresponds to the LOS deployment scenario and constitutes 45% of our dataset. To evaluate

Serene’s vital signs tracking performance, we collected Xethru ground truth alongside CSI data for

the environments 1, 2, 3 and 5. We evaluate Serene’s performance in terms of three key metrics: (1)

breath tracking accuracy, (2) breathing signal outage (during which Serene cannot track breathing),

and (3) naturally occurring motion false positives due to activity of other house occupants. To

determine long-term sleep quality metrics (i.e. sleep eﬃciency based on sleep-awake classiﬁcation

discussed in §3.5), we use data collected from the environments 1-4. Based on these metrics, we

present insights on sleep eﬃciency of diﬀerent users and how it varies in successive nights. Next,

we ﬁrst evaluate the breath tracking accuracy of our scheme.

3.6.3 Breath Tracking Accuracy

We evaluate the accuracy of our WiFi-based breathing rate estimation in terms of BPM error.

We define BPM error as the mean squared error (MSE) between the per-second BPM values
reported by Serene and the corresponding ground truth BPM values reported by X4M200 over a
specific time window. We perform this evaluation on data collected from environments 1, 2, 3 and
5. We evaluate both long-term (i.e. whole night) and short-term (i.e. specific short duration sleep
windows during the night) accuracy.


3.6.3.1 Long-term Accuracy

Serene achieves a median error of less than 1.19 BPM for real-world in-home full night sleep
experiments. To evaluate Serene’s long-term breath tracking accuracy, we compute the mean squared
error (MSE) of the instantaneous (per second) breathing rate estimate across an entire night of
sleep. The overall cumulative distribution function (CDF) of the MSE in breaths per minute (BPM)
is depicted in Figure 3.8(a). Inspecting the blue curve, we see that the median error of Serene’s
breathing rate estimate is 1.19 BPM, while its 95th percentile is under 1.9 BPM. We omit User 5’s
CDF from the graph as we were only able to collect 5 nights of data from that user. The average,
minimum and maximum BPM errors observed for User 5 were 1.1, 0.811, and 1.14, respectively.
Figure 3.8(c) shows how Serene tracks breathing rate throughout the night for 4 different users,
where we have plotted the X4M200 ground truth side by side. We can observe that even if BPM
accuracies slightly drop during some parts of a night, Serene can still track the overall
pattern in breath fairly well when compared to Xethru’s ground truth.

Figure 3.8(a) also shows how BPM accuracy varies across subjects. For instance, the median
error was below 1.12 BPM and 1.2 BPM for users 1 and 2, respectively. However, user 3’s
median error was a bit higher (i.e. 1.488 BPM), while the 95th percentile error was as large as
1.95 BPM. These slight variations across different users and environments occur due to differences
in physiques, respiratory system morphologies and environmental deployment conditions. For
example, environments 2 and 3 both correspond to NLOS scenarios, which leads to relatively lower
BPM accuracies. However, the level of robustness and accuracy Serene achieves is comparable
to other commercial contact-less sleep monitoring products, and is sufficient for daily in-home use
for gaining insights into sleep breathing patterns with minimal overhead using existing networking
infrastructure.

3.6.3.2 Short-term Accuracy

Serene can achieve an error of as low as 0.34 BPM during certain parts of a full-night sleep. In

Fig. 3.8(c), we notice that there are certain time windows during the night where Serene matches

[Figure 3.8 panels. (a) CDF of overall and per-user BPM error; x-axis: Respiration Rate MSE (BPM), y-axis: percentage of nights; curves: Overall-CDF (IDs 1, 2, 3, 5), User1, User2, User3. (b) Sleep posture experiments; x-axis: sleep posture (Curl, Prone, Recumbent, Supine), y-axis: average breath rate error (BPM); series: LOS, NLOS. (c) Full night breath tracking for Users 1, 2, 3 and 5; curves: Serene, Xethru; x-axis: minutes into sleep, y-axis: BPM.]

Figure 3.8: CDF of overall and per-user breathing rate MSE compared to a Xethru X4M200

ground truth; Serene’s full-night breath tracking performance; and average BPM errors for short

duration sleep experiments in diﬀerent sleep postures

Xethru’s performance very closely. To know how many times such time windows occur during
different nights in our dataset, we divide each night into small 15 minute time windows and
compute the MSE of the per-second BPM estimates in those windows. Figure 3.9 shows the CDF plots
for 6 different full-night sleep experiments corresponding to users 1, 2 and 3. From Fig. 3.9(a), we
observe that in the case of User 1, Serene experienced a breathing rate error of only 0.34 BPM in
one 15 minute window during Night 6. Moreover, the error in more than 50% of the time windows
remained below 0.84 BPM during Night 6. Similarly, for other users, we observe that in multiple
time windows during a full-night sleep, the BPM error stays under 1 BPM. This shows that Serene
does fairly well when compared to controlled short-duration sleep experiments performed in recent
CSI based sleep monitoring studies. Also, Figure 3.8(b) shows average BPM errors for controlled
10 minute sleep experiments in different sleep postures. We observe that Serene achieves an error
of less than 1 BPM for most sleep postures even in NLOS scenarios. The errors were as low as 0.55
BPM in LOS scenarios.
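The windowed evaluation described above can be sketched as follows, assuming two equally long lists of per-second BPM values (Serene's estimates and the X4M200 ground truth) that are already time-aligned. The function names are hypothetical.

```python
from statistics import mean


def mse(estimates, truth):
    """Mean squared error between two aligned BPM sequences."""
    return mean([(e - g) ** 2 for e, g in zip(estimates, truth)])


def windowed_errors(estimates, truth, window_seconds=15 * 60):
    """MSE of each consecutive window of per-second BPM values.

    The last window may be shorter than window_seconds.
    """
    n = min(len(estimates), len(truth))
    return [mse(estimates[i:i + window_seconds], truth[i:i + window_seconds])
            for i in range(0, n, window_seconds)]
```

Sorting the returned list for one night gives the empirical CDF of that night's 15 minute window errors.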

[Figure 3.9 panels. (a) User 1, 6 nights. (b) User 2, 6 nights. (c) User 3, 6 nights. Each panel shows one CDF per night (Night 1 to Night 6); x-axis: BPM Error, y-axis: fraction of total number of 15 minute windows.]

Figure 3.9: CDFs of BPM errors calculated over 15 minute windows for 6 diﬀerent nights (Users

1, 2 and 3)


3.6.4 Naturally Occurring Motion False Positives

Serene experienced a median of 20 false positive limb/body motion events, which can be

attributed to activity of other house residents while the user is sleeping. The total duration of

such events stayed below 37 minutes more than 95% of the time. Radar and WiFi are both very

sensitive to body motion. We observed from our experiments that whenever a user moves in his

bed, both Serene and X4M200 successfully detect the motion event. However, we also observed

scenarios where Serene detected body movements but the ground truth remained undisturbed (i.e.

contained breathing signal only). We call such spurious movements detected by Serene motion
false positives (MFPs), which we attribute to other movements present in the environment (e.g.
when one of the occupants wakes up to get water). To understand how significant such MFPs
can be in real-world deployments of our WiFi CSI based sleep monitoring system, we evaluate two
key metrics, namely the number and duration of MFPs per night’s sleep.

[Figure 3.10 panels. (a) motion false positives; x-axis: number of motion false positives. (b) duration of motion false positives; x-axis: length of motion false positives (minutes). Y-axis in both panels: percentage of nights.]

Figure 3.10: CDFs of the number and total duration of motion false positives during a night when
compared with X4M200 ground truths. Motion false positives naturally occur due to activity of

Figure 3.10 illustrates the CDFs of these two metrics evaluated over more than 65 nights in our
database. We observe that our system detected fewer than 56 MFP occurrences on 95% of the tested
nights (Figure 3.10(a)). Moreover, we observe that when MFPs occur, their collective duration
remains under 37 minutes 95% of the time, as shown in Figure 3.10(b), which is a minuscule
fraction of the entire 65+ night dataset (see footnote 2). Note that these MFPs do not signify any
technical limitation of Serene, as such motions occur naturally in home environments. Based on our
results, we can conclude that Serene will be able to meet the vital signs tracking accuracy
requirements expected from any other in-home COTS sleep monitor (e.g. X4M200, ResMed
S+ [113], smart-watches, etc.), as the number and duration of naturally occurring interferences in
WiFi signals due to the activity of other residents is usually low during night time.

3.6.5 Breath Signal Outage

Serene experiences an average outage of less than 6.38 minutes, during which it cannot track

any vital signs. As we discussed in §3.4.3, Serene identifies an outage event when the power in the

ﬁrst PCA projection (i.e. the breath signal) corresponding to a sleep epoch goes below a threshold

(i.e. noise ﬂoor). To assess our system’s ability to continuously track vital signs (i.e. breathing

and other limb/body activity) in the real-world, we measure Serene’s per-night outage performance

statistics. To achieve this, we follow the treatment of signal outage in wireless propagation literature.

Speciﬁcally, we calculate two second-order statistics: level crossing rate (LCR), and average fade

duration (AFD) [6]. LCR determines the rate at which outages occur during a full-night sleep,

whereas AFD determines the duration of each outage. We analyze the LCR and AFD using the

ﬁrst PCA projection’s power with respect to the noise ﬂoor. LCR and AFD have been extensively

studied in body area network (BAN) literature owing to the complex and non-stationary way in

which a human body interacts with the wireless channel [126]. Next, we present a summary of the

aforementioned statistics derived from our entire dataset.
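Both statistics can be computed from the per-epoch power of the first PCA projection. The sketch below assumes one power value per 30 s epoch, counts an outage event at every downward crossing of the noise floor (a night that starts below the floor counts as one event), and reports the rate per hour and the average duration in minutes. The function name is hypothetical.

```python
def outage_statistics(power, noise_floor, seconds_per_sample=30.0):
    """Return (outage events per hour, average outage duration in minutes)."""
    events = 0
    samples_below = 0
    in_outage = False
    for p in power:
        if p < noise_floor:
            samples_below += 1
            if not in_outage:
                events += 1  # downward crossing of the noise floor
                in_outage = True
        else:
            in_outage = False
    hours = len(power) * seconds_per_sample / 3600.0
    rate = events / hours if hours else 0.0
    avg_minutes = (samples_below * seconds_per_sample / 60.0) / events if events else 0.0
    return rate, avg_minutes
```

The first returned value plays the role of the level crossing rate and the second the role of the average fade duration.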

Figure 3.11(a) shows LCR or outage rate calculated per hour across our sleep dataset. We

can observe that on average, the breathing rate estimation of participants experienced 2 outage

events per hour. At the 95th percentile, outages amounted to fewer than 6.6 events per hour.

However, in the context of sleep monitoring, a further piece of detail must be considered to fully

understand outage events during sleep, i.e. the duration of such outages, which we characterize

[Footnote 2: 65 nights with 7 hours of average sleep equal 27,300 minutes, making the relative duration of MFPs a mere 0.136%.]


[Figure 3.11 panels. (a) Outage rate; x-axis: outage events per hour. (b) small-scale outage duration; x-axis: average small outage duration (min). (c) large-scale outage duration; x-axis: average large outage duration (min). Y-axis in all panels: percentage of nights.]

Figure 3.11: Second-order statistics of breath estimation outage events. Outage rate and average

outage duration mirror, respectively, their counterparts level crossing rate and average fade

duration from wireless propagation literature

using AFD. Serene can experience two types of outage events: small-scale and large-scale.
Small-scale outages may arise from users rolling over in bed to a different position or sleep
posture while sleeping. This is because certain sleep postures may momentarily make it difficult
for Serene to detect the breath signal due to weaker chest movements. In contrast, large-scale
outages may occur when a user gets out of bed during the night, for example, to get water or go to
the restroom. To understand how such small- and large-scale outage events are distributed naturally
in the real world, we introduce a design threshold to separate the two types of outage events. From
the analysis of our dataset, we set this design threshold to 5 minutes: we consider outage events
longer than 5 minutes as large-scale outages and all shorter events as small-scale outages.
Figure 3.11(b) elaborates on the statistical behavior of small-scale outages. On average,
small-scale outage events lasted for 0.7 minutes, while the 95th percentile outage duration is
under 1.62 minutes. The CDF of the duration of large-scale outages is shown in Figure 3.11(c).
Large-scale outage duration averaged around 6.38 minutes while its 95th percentile is under 11.56
minutes, although durations in excess of 15 minutes can occur. We conclude our outage
characterization by emphasizing that our current analysis is application and dataset specific. Many
other factors such as participants’ physiques (e.g. larger torso area; see footnote 3) and sleep
patterns can be associated with both small- and large-scale outages. For example, a person
suffering from sleep apnea may experience multiple small outages throughout his sleep.
Our analysis provides a basis for sleep researchers to conduct larger and more representative studies

[Footnote 3: It is beyond the scope of the article to delve into morphological factors contributing to breathing style variations within human populations.]


on breath outage events using our proposed sleep monitoring scheme, as breath outages during

sleep may indicate a possible disease in the user’s respiratory tract, etc.

3.6.6 Sleep Insights

3.6.6.1 Sleep Quality

To motivate the merits of our accessible sleep monitoring scheme, we present a few interesting

insights on sleep quality gained from our data collection campaign. Our ﬁndings motivate the

usefulness of Serene in terms of tracking long term sleep quality and sleeping patterns. Next, we

show how Serene can provide users with actionable feedback on a per-night basis towards the
long-term tracking and management of their sleeping habits. We perform these assessments using the
CSI data corresponding to users 1 - 4.

Figure 3.12 shows three diﬀerent metrics of sleep determined for 3 users over a period of more

than 13 consecutive days, namely sleep eﬃciency, aggregate motion (in minutes) during sleep

and sleep length. Sleep efficiency for each night of sleep was calculated using our actigraphy

based sleep scoring approach. As users manually started and ended each night’s data collection

using our software, the sleep lengths were easily determined according to those end points. We

observe interesting insights for these long term sleep metrics. For instance, we can see that User 1

experienced a noticeably restless 9th night which resulted in poor sleep eﬃciency. User 4 only slept

for 1.25 hours, but as he was awake for only 4.156 minutes during that time, his sleep eﬃciency

reaches 95%.

[Figure 3.12 plots. Legend: Mean, User 1, User 2, User 3, User 4. Three panels with x-axis: consecutive nights (5 to 15); y-axes: sleep efficiency (%), body/limb activity (minutes), and sleep length (hours).]

Figure 3.12: Sleep eﬃciency and body motion corresponding to 4 users and throughout 13+

consecutive nights


 

[Figure 3.13 panels. (a) Overall and per-user CDF for motion duration; x-axis: Body Movement (minutes). (b) Overall and per-user CCDF for sleep efficiency; x-axis: Sleep Efficiency. Y-axis in both panels: percentage of nights; curves: User1, User2, User3, User4, Overall-CDF.]

Figure 3.13: Overall and per-user CDF for motion duration and sleep eﬃciency

In terms of aggregate body motion statistics over nights and across subjects, we measured a

median of 40 minutes, with the 95th percentile under 80 minutes, as illustrated by the blue CDF

in Figure 3.13(a). On an individual basis, considering User 2 and User 4 for instance, their median

body movements were 36 minutes and 47 minutes, respectively. Such per-user differences are also visible

in the complementary CDFs depicted in Figure 3.13(b). Specifically, while both Users 2

and 3 have a comparable maximum sleep eﬃciency of 96%, User 3’s sleep eﬃciency was lower

than 80% on 3 diﬀerent nights. Moreover, User 2 has a worst eﬃciency of 75%, whereas User 3

has worst eﬃciency of 63%. For the aggregate dataset, the median user population sleep eﬃciency

was around 87%. The average sleep duration among these 3 users during this consecutive testing

period was 7.32 hours. Note that the recommended sleep for ages 18-64 years is 7-9 hours [82].
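
To make the statistics above concrete, the following is a minimal Python sketch of how an empirical CDF, its complement (CCDF), and the median can be computed from per-night values. The nightly numbers and the function names are made up for illustration; they are not data from our study.

```python
from statistics import median


def empirical_cdf(samples, x):
    """Fraction of samples that are less than or equal to x."""
    return sum(1 for s in samples if s <= x) / len(samples)


def empirical_ccdf(samples, x):
    """Fraction of samples that are strictly greater than x."""
    return 1.0 - empirical_cdf(samples, x)


# Hypothetical nightly values for one user (NOT measurements from the study).
motion_minutes = [22, 31, 36, 40, 41, 47, 52, 58, 63, 79]
efficiency_pct = [63, 74, 78, 82, 85, 87, 89, 91, 94, 96]

median_motion = median(motion_minutes)
nights_below_80 = sum(1 for e in efficiency_pct if e < 80)
share_above_80 = empirical_ccdf(efficiency_pct, 80)

print(median_motion, nights_below_80, share_above_80)
```

Reading a point off the CCDF in this way answers questions such as "on what fraction of nights did the efficiency exceed 80%?".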

3.6.7 Discussion

In this section, we provide commentary on the limitations of our work and discuss avenues of

future research which we believe would be particularly exciting and fruitful.

Interpretable CSI dimensionality reduction. Our work is premised on the notion of subspace

tracking and the ability to isolate instances of faint breathing from turbulent movements in the signal

locale that would mask breathing. Incidentally, our sleep application requires a binary decision on

subspace dynamics: quiet or turbulent. It would be interesting to perform the function of PCA


such that the resultant subspace can be interpreted in native ways that support atomic elements of

inference i.e. require no training. Such putative subspace may facilitate more than binary decisions

on the dynamics of the channel. As an example, it is shown in [11] that the CSI tensor can be

“unfolded” so as to expose the channel behaviour in space across antennae and in frequency across

subcarriers. It is further shown that tracking the spatial subspace is indicative of multiuser activities.

Returning to our sleep application, recasting the subspace in such terms may allow us to infer channel

modulations from other users sharing a house and ultimately cancel their masking contributions

to the signal of interest.

Parameterization. Our proposed approach does not involve extensive environment- and/or participant-

dependent parameterization. However, our aim is to develop techniques that can accurately and

robustly monitor the primitives of sleep; namely, respiration and body movement vital signs. These

techniques ought to perform reasonably well when compared to state-of-the-art, radar-based com-

mercial sleep monitors. In order to achieve this objective, we have taken a cross-modal supervision

approach, whereby we leverage the ground truth vital signs obtained from radar to optimally

parameterize our CSI processing pipeline. However, we stress that parameterization is only performed

during the design phase and does not require calibration effort on the part of the end user during

real-world deployments. We believe that it is possible to develop a machine learning (ML) architecture

for CSI-based sleep tracking. With such an ML architecture, arriving at a transformation between

the supervising modality (i.e. ground truth) and CSI is rendered automatic. However, questions

pertaining to the requisite volume of data for such a task are open. In [130], a combination of

(1) relatively benign urban channels, (2) an abundance of cellular base station CSI data, and

(3) superior MIMO resolution affords the CSI channel sensing problem an intriguing and

principled machine learning framework. Adapting such concepts within the context of the indoor

CSI sensing channel and its peculiarities is an interesting future work direction.

Sleep stage classiﬁcation. A more ambitious learning task pertaining to sleep monitoring is stage

classiﬁcation [18]. It has been shown in prior art using custom radar signalling that wireless sleep

4-stage classiﬁcation can be achieved through a domain adaptation approach paired with ground

[Figure 3.14 plot: signal value versus minutes into sleep; series: skewness (CSI breath signal), kurtosis (CSI breath signal), ResMed S+ sleep/awake ground truth, and our sleep/awake classification approach.]

Figure 3.14: Two-stage (asleep versus awake) crude classiﬁcation using simple feature

engineering approach and as compared to classiﬁcation from a commercial ResMed S+ device

truth from medical-grade devices [165]. The radar device we utilised in our sleep comparative

study does not supply sleep stage classiﬁcation ground truth. Nonetheless, we have demonstrated

in this work the elements of sleep monitoring needed for such classiﬁcation; namely, breathing

and motion estimation. In order to investigate the potential extension of this work towards 4-stage

sleep classiﬁcation, we have conducted a limited pilot study using ground truth obtained from

a commercial ResMed S+ device for sleep versus awake classiﬁcation [113]. For demonstration

purposes, we apply a simple feature engineering approach using statistical features derived from

our respiration estimate. Figure 3.14 depicts a snapshot of such classiﬁcation over an entire night.

From close inspection, it is evident that this simple approach results in classiﬁcation performance

very close to that supplied by the ResMed S+ ground truth device. Therefore, advanced sleep-stage

classiﬁcation is possible by using both body movement and respiration based vital signs obtained

from our system. We leave expanding this research strand for further work.
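
As an illustration of the kind of statistical features plotted in Figure 3.14, the following minimal Python sketch computes the skewness and excess kurtosis of a breathing waveform over non-overlapping windows. The window length, the synthetic signal, and the function names are assumptions made for this example, and the rule that maps these features to sleep/awake labels is deliberately left out, since it is not specified here.

```python
import math


def _central_moment(x, order):
    """Population central moment of the given order."""
    mu = sum(x) / len(x)
    return sum((v - mu) ** order for v in x) / len(x)


def skewness(x):
    m2 = _central_moment(x, 2)
    return _central_moment(x, 3) / m2 ** 1.5 if m2 > 0 else 0.0


def excess_kurtosis(x):
    m2 = _central_moment(x, 2)
    return _central_moment(x, 4) / m2 ** 2 - 3.0 if m2 > 0 else 0.0


def window_features(signal, win):
    """Return (skewness, excess kurtosis) for each non-overlapping window."""
    return [
        (skewness(signal[i:i + win]), excess_kurtosis(signal[i:i + win]))
        for i in range(0, len(signal) - win + 1, win)
    ]


# Synthetic example: a clean periodic "breathing" signal, with one large
# transient (standing in for a body movement) injected into the second window.
sig = [math.sin(2 * math.pi * 4 * i / 100) for i in range(200)]
sig[150] += 10.0
feats = window_features(sig, win=100)
for skew, kurt in feats:
    print(round(skew, 3), round(kurt, 3))
```

A clean sinusoid has near-zero skewness and an excess kurtosis close to -1.5, whereas a window containing a transient shows a much larger kurtosis, which is why such features can separate quiet breathing from disturbed periods.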

3.7 Conclusions

In this work, we design, implement, and evaluate Serene, a practical scheme for tracking

respiration and body movement vital signs during sleep using WiFi CSI signals. Our approach

is very easy to use and unobtrusive, as it is contact-less and uses existing WiFi devices to sense


vital signs. We make three major contributions. First, we derive a closed-form expression which

shows that robust breathing waveforms can be extracted using CSI signals independent of the router

position and orientation as long as the receiver is close to the subject. The insight gained from this

expression enables us to develop our robust and generalizable CSI signal processing architecture to

extract vital signs. Second, to accurately track variations due to breathing and body movements in

the spatial-frequency subspace formed by the WiFi channel, we propose a novel multi-dimensional

clustering subspace tracking technique. Our technique accurately detects and diﬀerentiates between

waveforms corresponding to sleep vital signs. Third, we extensively evaluated our system in long-

term experiments conducted in 5 diﬀerent apartments, where we collected more than 550 hours

(80 nights) of data from 5 individuals. Our system can track respiration throughout a night with

an average error of 1.19 breaths per minute (BPM) compared to state-of-the-art radar based

sleep monitors, and experiences an average nightly outage of under 6.38 minutes. Based on the

interesting sleep insights we gained from our user study, we remark that our proposed system lowers

the deployment barriers for in-home, long-term sleep quality monitoring. Such accessible sleep

monitoring will not only help users improve their sleeping habits, but also potentially aid in the

early identification of sleep-related disorders. In the future, advanced machine learning algorithms can

be developed to achieve multi-stage sleep classification using the body movement and respiration

vital signs obtained using Serene.

CHAPTER 4

MONITORING BROWSING BEHAVIOR OF CUSTOMERS VIA RFID IMAGING

4.1 Introduction

RFIDs based activity monitoring has recently emerged as a way to track customer activity in

physical retail stores [41, 49, 79, 81, 122]. Acquiring customer activity information is important, as

the amount of time that a customer spends on browsing an item is a key indicator of the amount

of interest that the customer has towards the item. Manufacturers can use such information to

improve the quality of their products, such as their visual attractiveness. Moreover, retailers can

use such information for the strategic placement of retail items [19, 20, 61, 96]. However, existing

RFID based systems for customer activity tracking in physical retail stores [41, 49, 79, 81, 122]

have two key limitations. First, they require physical interactions with tagged display items for

detecting human interest in places such as clothing stores [122]. Second, they do not work in multi-

person environments, which are most common in reality. In contrast, we seek to leverage COTS

RFID devices for monitoring browsing activity (i.e. when there is no physical interaction between

customers and the display items) of customers in general retail stores. Such information is easy to

obtain in online shopping environments by tracking customers’ online browsing behavior, such as

the amount of time spent on viewing a product or the number of clicks on a product, but it is diﬃcult

to obtain in physical shopping environments. Eﬀective tracking of customer browsing activity in

physical shopping environments will not only provide useful insights on customer behavior to

product manufacturers and retail store managers, but can also help to shorten the gap between

online shopping and physical shopping.

In this work, we propose TagSee, a multi-person activity tracking system based on RFID

imaging. The hardware requirements of TagSee include a set of RFID tags and an RFID reader;

both the tags and the reader are COTS products. The tags are deployed on the boundaries of the shelves.

The reader is deployed so that the customers are between the monitored shelves and the reader.


TagSee is based on the insight that when customers are browsing the items, as they stand between

the tags and the reader, the multi-paths that the RFID signals travel along change, and therefore,

both RSS and phase values of the RFID signals that the reader receives change as well. Based on

these variations, TagSee constructs a coarse grained image of the customers and the tags using a

model-driven deep learning framework. Afterwards, TagSee determines popularity of diﬀerent item

categories that are being browsed by the customers by analyzing the constructed images. Figure

4.1 shows an example of our RFID based customer monitoring system setup. The key novelty of

this work lies in achieving multi-person activity tracking in front of display items by constructing

coarse grained images via robust, analytical model-driven deep learning based, RFID imaging using

existing RFID devices and protocols. TagSee works for multi-person scenarios, works for scenarios

where there is no physical interaction between customers and the display items, and is device-free

(i.e. there is no need to attach anything to shelf items or customers).

Figure 4.1: Example system setup

The ﬁrst technical challenge is to robustly model the relationship between the signal attenuation

caused by the human obstruction in RFID signals and the images of the human obstruction. This

modeling is diﬃcult for two major reasons. First, the interactions between human objects and RFID

signals during a browsing activity are highly complex. Second, as we use monostatic RFID readers,

which use a single antenna at a time for both transmitting and receiving RFID signals to and

from the tags, modeling the impact of human obstructions on RFID signals becomes even more


diﬃcult. This is because on any reader-tag-reader (TX-Tag-RX) path, the RFID signals experience

two diﬀerent attenuations due to an obstruction, once when signals are sent from the reader antenna

to the tag, and the second when the tag backscatters those signals towards the same reader antenna.

Employing geometrical and measurement models, such as the ones used in previous RF imaging

techniques [95,133,152,166], will entail high dependency on multiple, environment dependent and

difficult to tune parameters. Moreover, systems based on such geometrical and measurement

models are not accurate and robust enough for imaging multiple customers showing

interest in multiple diﬀerent item categories simultaneously. Also, as interactions between human

objects and RFID signals during a browsing activity are highly complex, an accurate mathematical

model is hard to derive. To address these challenges, we propose a model-driven Deep Neural Networks

(DNNs) [43] based RF imaging approach. First, we mathematically formulate the problem of

imaging human obstructions using monostatic RFID devices and derive an approximate analytical

imaging model that correlates the variations caused by human obstructions in the RFID signals.

Second, based on the derived imaging model, we develop a DNNs based deep learning framework

to robustly image customers with high accuracy. Our key intuition is that by training our system

with the images constructed based on RFID signals when humans are browsing items, our system

can automatically learn the underlying relationship between those images and the observed RFID

signal dynamics. Our DNN based approach is easy to realize in practice as it is environment and

hardware independent, and the thresholds and parameters are easy to tune. Moreover, our approach

allows for robust imaging, even when customers are naturally moving to-and-fro or sideways while

browsing or picking up/putting back items from a product category.

The second challenge is to enable multi-person imaging, but without changing the training

requirements. That is, our system should only require the DNN to be trained for single-person

scenarios, and it should not require the DNN to be trained for multi-person scenarios. This is

because ﬁrst, it is cumbersome to train the DNN with all possible multi-person scenarios, and

second, it will lead to overﬁtting of the DNN. To address this challenge, we propose a spatial

moving window based imaging technique to image multiple customers, who are browsing products


in diﬀerent columns, simultaneously. The intuition is that, a single customer can signiﬁcantly

inﬂuence the RSS values of only a block of deployed tags (i.e., the ones covered by the moving

window), and that multiple customers maintain a distance between themselves while browsing any

shelf. To achieve this, TagSee moves a window over the spatial distribution of tags, shifting it

rightward, one column of tags at a time. For each instance of the moving window, TagSee replaces

the observed RSS variations for the tags lying outside the moving window with random values,

which are sampled from the Gaussian distributions deﬁned by the mean and variance of RSS

variations corresponding to those tags, observed during the calibration phase (i.e., when there is

no human obstruction around). Afterwards, TagSee constructs the images corresponding to each

instance of the moving window, by using the modified RSS variation vectors, which consist of

changes observed in the RSS values of every tag on the shelf, as the input to the DNN. Finally, it

combines those images by applying the averaging ﬁlters to output the ﬁnal image. Note that our

approach does not require the exact locations of deployed tags and the reader antennas to be known

in advance, which makes it easy to deploy in practice.
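
The moving window procedure described above can be sketched as follows. This is a minimal illustration under several assumptions that are ours and not part of TagSee: the tags are indexed row-major on a regular grid, the trained DNN is replaced by an arbitrary callable `dnn`, and the final "averaging filters" are simplified to a pixel-wise mean over all window positions.

```python
import random


def moving_window_image(rss_var, cal_mean, cal_std, n_cols, win_cols, dnn, rng):
    """Image multiple customers with a DNN trained on single-person data.

    rss_var  : RSS variation of every tag (row-major over an assumed grid)
    cal_mean : per-tag mean of RSS variation seen during calibration
    cal_std  : per-tag standard deviation seen during calibration
    dnn      : callable mapping an RSS variation vector to a flat image
    """
    images = []
    for start in range(n_cols - win_cols + 1):  # shift one column at a time
        masked = []
        for k, value in enumerate(rss_var):
            col = k % n_cols
            if start <= col < start + win_cols:
                masked.append(value)  # inside the window: keep observation
            else:
                # outside the window: replace with calibration-like noise
                masked.append(rng.gauss(cal_mean[k], cal_std[k]))
        images.append(dnn(masked))
    n_pixels = len(images[0])
    # Simplified combination step: pixel-wise mean over all window positions.
    return [sum(img[p] for img in images) / len(images) for p in range(n_pixels)]


# Toy usage: 2 rows x 4 columns of tags, one person attenuating column 1.
rss = [0.0, -6.0, 0.0, 0.0,
       0.0, -6.0, 0.0, 0.0]
zeros = [0.0] * len(rss)
identity_dnn = lambda vec: list(vec)  # placeholder for the trained network
result = moving_window_image(rss, zeros, zeros, n_cols=4, win_cols=2,
                             dnn=identity_dnn, rng=random.Random(0))
print(result)
```

Because every window position feeds the network an input that looks like a single-person scene, the network never needs to be trained on multi-person data.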

The third challenge in TagSee is to robustly quantify the variations introduced by human

obstructions in the RSS values of the deployed tags. This is necessary because TagSee uses

these RSS variations as inputs to its DNN for imaging obstructions. Anomalous variations in

RSS values of diﬀerent tags occur frequently during browsing behavior, either due to fading loss

from constructive/destructive interference of RFID signals due to multipath eﬀects, or due to

the measurement noise of the RFID reader. To address this challenge, we leverage the frequency

hopping (FH) capability of the multi-frequency UHF RFID hardware, which operates in the 902-

928 MHz frequency range (divided into 50 closely spaced subcarriers). First, in scenarios where

some part of RFID spectrum is under fade while a reader attempts to interrogate a tag, FH capability

allows the reader to interrogate that tag on stronger subcarriers, which helps TagSee gather enough

measurements per tag per second, required for robust and accurate image reconstruction. Second,

since the subcarriers are closely spaced in the frequency range 902-928 MHz, the impact of

disturbances created by a human subject on the RSS values corresponding to a certain tag, is similar


across most subcarriers because the transmitted power in all subcarriers is the same. Based on this

intuition, TagSee robustly estimates the variations in RSS values of diﬀerent deployed tags by taking

the median of the RSS variations observed over multiple subcarriers. This reduces anomalies in the

variations of RSS values, leading to signiﬁcantly reduced distortions in the constructed images.
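
A minimal sketch of this median-based estimate is given below. The data layout (a dictionary keyed by tag and subcarrier frequency) and the numeric values are hypothetical and chosen only to show why the median is preferred over the mean when a few subcarriers are in a fade.

```python
from statistics import mean, median


def robust_rss_variation(delta_by_tag):
    """Median of per-subcarrier RSS variations for each tag.

    delta_by_tag maps tag id -> {subcarrier frequency in MHz: RSS change in dB}.
    Tags with no readings are skipped.
    """
    return {tag: median(per_freq.values())
            for tag, per_freq in delta_by_tag.items() if per_freq}


# Hypothetical readings: one subcarrier (915.25 MHz) shows an anomalous value.
observed = {
    "tag-1": {902.75: -5.0, 903.25: -5.5, 915.25: 9.0, 920.75: -4.5, 927.25: -5.0},
}
estimate = robust_rss_variation(observed)
naive = mean(observed["tag-1"].values())
print(estimate["tag-1"], round(naive, 2))
```

In this toy case the median stays at the value shared by most subcarriers, while the mean is pulled far away by the single anomalous reading.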

We implement the TagSee system using an Impinj Speedway R420 reader and SMARTRAC DogBone

RFID tags. We attach RFID tags in a distributed and orderly fashion, just like a mesh, along the

boundaries of the shelves, while covering all column-wise item categories. We call the tags that

are attached to the shelves Static Tags. In any monitoring scenario consisting of A

RFID antennas, there are A × K unique TX-Tag-RX links for K tags deployed along the boundaries
of the shelf (so the number of links M = K in our case). The Impinj Speedway R420 reader

is capable of reading up to ~450 tags/s, which allows for gathering enough RSS and phase
information from the deployed tags for smooth activity tracking. We create real-life

scenarios to perform comprehensive experiments involving 10 diﬀerent human subjects with IRB

approval, and then evaluate the performance of TagSee on this data set. Our experimental results

show that, on average, TagSee can achieve a true positive rate (TPR) of more than 90% and a false

positive rate (FPR) of less than 10% using training data from just 2-3 users. Moreover, TagSee

can achieve a TPR of more than 80% and an FPR of less than 15% in multi-person scenarios using

training data from just 3-4 users.

The rest of the work proceeds as follows. In Section 4.2, we discuss related work. In Section 4.3,

we give a brief overview of our TagSee system. In Section 4.4, we present the preprocessing tech-

niques TagSee uses to prepare data for constructing images. In Section 4.5, we ﬁrst mathematically

formulate the problem of imaging humans using monostatic RFID devices and derive an approxi-

mate analytical imaging model that correlates the variations caused by human obstructions in the

RFID signals. Based on the analytical model presented in Section 4.5, we develop our deep neural

networks based RFID imaging framework in Section 4.6. In Section 4.7, we ﬁrst extend our DNNs

based imaging approach to image multiple persons simultaneously, and then present our customer

activity tracking technique, which uses computer vision algorithms to determine popularity scores


of diﬀerent item categories. Finally, we give concluding remarks in Section 6.9.

4.2 Related Work

4.2.1 Radio Tomographic Imaging (RTI)

Previously proposed RTI approaches, which are closest to our work, are either based on sensor

networks [133, 152] or bistatic passive RFID (pRFID) systems [144]. In the sensor networks

based approaches [133, 152], each node deployed around a monitored area is capable of both

transmitting and receiving RF signals, independently, and for any TX-RX pair, there is exactly one

communication link which gets aﬀected by an obstruction. In contrast, the bistatic passive RFIDs

based system proposed in [144] entails two communication links per tag read, i.e. a TX-Tag link

from the TX antenna which interrogates the tag, and a Tag-RX link where the RX antenna receives

the response back from the tag (we refer to RFID links as TX-Tag-RX in the rest of the work). However,

both of the aforementioned RTI scenarios are similar, because in each case, RF signals experience

attenuation due to an obstruction only once on their way to the receiver side. Moreover, all previous

RTI schemes have an inherent issue of high dependence on multiple, diﬃcult to tune parameters,

such as the parameters corresponding to diﬀerent types of geometrical and measurement models,

which are used to capture the eﬀects of attenuation due to human obstructions [133, 152]. First,

these parameters are often highly dependent on experimental scenarios and the hardware being

used. Second, these parameters have to be tuned manually, which is time consuming, and often

requires intensive calibration for each diﬀerent deployment scenario. Third, ineﬃcient tuning of

these parameters leads to unstable and ineﬀective imaging results. Moreover, all the previous

RF imaging approaches require the exact locations of all RF nodes to be known in advance. The

aforementioned limitations make previously proposed RTI techniques practically diﬃcult to realize.

Compared to these previous RTI works, we develop a model-driven deep learning based imaging

scheme for monostatic RFID systems, which gets rid of manual setting of diﬃcult to tune RFID

channel parameters, and enables accurate multi-person imaging using monostatic passive RFID

hardware.


4.2.2 Customer Behavior Monitoring using RFIDs

RFID based techniques for human activity tracking in physical retail [41, 49, 79, 81, 122] utilize

variations in received signal strength (RSS) and phase values of RFID signals to monitor customer

behavior. However, most RFID based behavior monitoring techniques such as [49, 79, 122] only

focus on clothing stores. Moreover, the current RFID based techniques are highly parameter dependent

and fail to work well in multi-person scenarios. Furthermore, these techniques require all retail

items in a shelf to be tagged with RFID tags, a requirement which is often not satisﬁed in many retail

stores. To the best of our knowledge, there is no prior work that can monitor customer browsing

activity without using cameras or the requirement of physically touching retail items tagged with

RFIDs.

4.2.3 Customer Behavior Monitoring using Cameras

Several camera based solutions to customer behavior analytics exist in literature [21, 30, 34, 66,

74, 78, 105, 108, 109]. A key advantage of RFID based solutions compared to dense deployment of

cameras is that RFID based solutions are not privacy intrusive. In the past, multiple privacy concerns

related to camera based solutions have been brought to light, for example with the ‘Amazon Go’

stores that use dense deployment of cameras to monitor customers [132,137]. However, please note

that even though our system has a privacy advantage, we do not seek to replace any existing camera

based solutions. Instead, we envision that TagSee can be integrated with existing camera based

techniques for improved customer behavior analytics and enhanced shopping experience solutions

such as automatic check-out [57]. Such integration of RFID and camera based solutions will also

help reduce privacy concerns by allowing retail stores to enable smart solutions with fewer

cameras. The goal of this work is to advance the state-of-the-art in the emerging field of

RFIDs based customer behavior analytics in retail stores by leveraging oﬀ-the-shelf RFID readers

and tags for monitoring browsing activity of customers (i.e. when there is no physical interaction

between customers and the display items). To the best of our knowledge, there is no prior work that

can monitor customer browsing activity without using a camera or without requiring that retail items

first be tagged with RFIDs and then physically touched.

4.3 System Overview

In this section, we ﬁrst give a brief overview of monostatic passive RFID (pRFID) technology.

Second, we brieﬂy discuss TagSee’s imaging infrastructure. Third, we give an overview of TagSee’s

multi-person imaging and tracking scheme.

4.3.1 Monostatic Passive RFIDs

In a monostatic pRFID system, a reader transmits continuous wave signals to interrogate tags

deployed in its proximity, and then receives backscattered signals from those tags which contain

their unique IDs. In this work, we use industry-standard, EPC Global Class 1 Generation 2 (C1G2)

RFID [38] compatible, Ultra-High-Frequency (UHF) pRFID tags and Impinj Speedway R420

monostatic RFID readers in our proposed popularity tracking scheme. The RFID readers we use

operate in the frequency range 902-928 MHz, through a frequency hopping (FH) mechanism, where

the frequency range is divided into 50 subcarriers (i.e. 902.75 - 927.25 MHz with an interval of 0.5

MHz), which are randomly hopped between during each interrogation cycle. This FH capability

reduces interference between nearby RFID readers, and leads to robust and reliable interrogation

of tags in cases where some part of the spectrum is under fade. Also, these readers are equipped

with multiple antennas. The query-response communication corresponding to each antenna is

multiplexed in time, where each antenna interrogates the tags in an alternating manner. In each

query-response communication, the EPC C1G2 compatible RFID tags respond to RFID signals

from a reader, through a random access collision avoidance technique called slotted ALOHA [62].

The RFID readers we use are capable of reading up to ~450 tags/s, which allows for gathering
enough RSS and phase information from the deployed tags, required for smoother and near real-time

popularity tracking.
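
As a quick arithmetic check of the channel plan described above, spacing 50 subcarrier centre frequencies evenly between the stated end points of 902.75 MHz and 927.25 MHz gives a step of (927.25 - 902.75) / 49 = 0.5 MHz. The minimal Python sketch below generates that list and draws one random hop order; the function name is ours, and the shuffle only stands in for the reader's actual hopping sequence, which we do not model.

```python
import random


def channel_centers_mhz(first=902.75, last=927.25, count=50):
    """Evenly spaced subcarrier centre frequencies between two end points."""
    step = (last - first) / (count - 1)
    return [round(first + i * step, 2) for i in range(count)]


chans = channel_centers_mhz()
spacing = chans[1] - chans[0]

# One illustrative pseudo-random visiting order over all subcarriers.
rng = random.Random(42)
hop_order = rng.sample(chans, k=len(chans))

print(len(chans), chans[0], chans[-1], spacing)
```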


Figure 4.2: High level ﬂow diagram of TagSee’s monitoring mode

4.3.2 TagSee’s Imaging Infrastructure

TagSee requires items to be placed in column-wise categories on shelves. TagSee also assumes

that RFID tags are already attached in a distributed and orderly fashion, just like a mesh, along

the boundaries of different shelves, while covering all item categories. We call the tags that are

attached to the shelves Static Tags. Although TagSee does not require the items placed on the

shelves to be tagged, for the sake of completeness we call any tags attached to items Dynamic Tags, because

their positions can change with time or they can even disappear from their shelves in cases where

those items get purchased. Figure 4.1 shows the layout of RFID infrastructure used by TagSee.

In any monitoring scenario consisting of A RFID antennas, there are A × K unique
TX-Tag-RX links for K tags deployed on a shelf (so the number of links M = K in our case). To track

activity of customers in front of a shelf or a display item, an RFID reader transmits continuous wave

signals to interrogate tags deployed in its proximity, and then receives backscattered signals from

those tags which contain their unique IDs. Afterwards, the RSS and phase values corresponding

to these tags are leveraged to image any customers standing between the tags and the reader. Our

analytical model based monostatic RFID imaging approach requires the locations of all static tags

and RFID reader antennas to be known in advance. However, our ﬁnal deep learning based imaging

approach does not have any such requirement. Next, we present a system level overview of TagSee’s

activity tracking scheme.


4.3.3 Overview of TagSee Imaging and Tracking Scheme

TagSee consists of two working modes, namely calibration mode and monitoring mode. In

calibration mode, TagSee reads RSS and phase values from the deployed tags when there is

no human obstruction in the monitored area. These calibrated RSS and phase values are used

for background subtraction during the monitoring mode. In monitoring mode, TagSee constructs

images after processing the RSS and phase values it reads from the deployed tags. Figure 4.2 shows

a system level diagram of TagSee for monitoring mode. In the ﬁrst step, the raw RSS and phase

values obtained from the deployed tags are fed into a pre-processing module. The pre-processing

module ﬁrst applies a combination of moving average and moving median ﬁlters on streaming

RSS and phase data, and then subtracts the calibrated RSS and phase values from the incoming

ﬁltered RSS and phase values, respectively. Afterwards, it ﬁlters out the anomalous variations in

RSS values by applying phase diﬀerence and frequency diversity based ﬁltering techniques. In the

second step, the power of the resultant RSS difference vectors obtained from the pre-processing module

is compared against a threshold to determine whether an obstruction exists in front of the deployed

tags. If any obstructions are detected, the RSS diﬀerence vectors are then fed into a DNN based

multi-person RFID imaging module, which then constructs coarse-grained images of the detected

human obstructions. Finally, TagSee applies a computer vision based blob tracking technique on

the constructed images, to track the browsing activity near diﬀerent items, by determining the

popularity scores of diﬀerent item categories. Next, we discuss how TagSee processes the RSS and

phase values during its two working modes.
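
The obstruction detection step in the pipeline above can be sketched as follows. This is a minimal illustration: we assume that the "power" of an RSS difference vector is its mean squared value, and the threshold of 1.0 is an arbitrary placeholder rather than a value used by TagSee.

```python
def background_subtract(filtered_rss, calibrated_rss):
    """Per-tag RSS difference between monitoring and calibration readings."""
    return [m - c for m, c in zip(filtered_rss, calibrated_rss)]


def vector_power(delta_rss):
    """Mean squared value of the RSS difference vector (assumed definition)."""
    return sum(v * v for v in delta_rss) / len(delta_rss)


def obstruction_present(delta_rss, threshold):
    """True when the difference vector is energetic enough to image."""
    return vector_power(delta_rss) > threshold


# Hypothetical per-tag readings in dBm.
calibrated = [-60.0, -62.0, -58.0, -61.0]
quiet = background_subtract([-59.8, -62.1, -57.7, -61.0], calibrated)
blocked = background_subtract([-65.0, -68.0, -57.8, -65.0], calibrated)

print(obstruction_present(quiet, 1.0), obstruction_present(blocked, 1.0))
```

Only vectors that pass this check would be forwarded to the imaging module, which avoids constructing images when nobody is standing in front of the shelf.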

4.4 Preprocessing RSS and Phase

In this section, we describe the pre-processing techniques TagSee uses during its two working

modes, i.e., calibration and monitoring modes, to prepare data for TagSee’s imaging module.

We first introduce the RFID signal parameters which TagSee pre-processes.

Received Signal Strength (RSS): The monostatic RFID channel is a double fading channel, i.e. each

fade is experienced twice, once in the forward link and once in the reverse link. A typical RFID use


case scenario involves an indoor multipath environment, where each link consists of a line-of-sight

(LOS) path and a few major reflections. For monostatic RFID readers, for a given subcarrier of

wavelength $\lambda$, the received power $P_r^{\{k,a\}}$ at the reader antenna $a$, of the backscattered signal from

tag $k$ located at distance $d$ from the reader, can be approximated in terms of transmit power $P_t$ for

a free space scenario as:

$$P_r^{\{k,a\}} = P_t \, G_a^2 \, G_k^2 \, T_b^{\{k\}} \cdot \left(\frac{\lambda}{4\pi d}\right)^4 \qquad (4.1)$$

where $G_a$ and $G_k$ are the $a^{th}$ reader antenna and $k^{th}$ tag antenna gains, respectively [98]. $T_b^{\{k\}}$ is the

backscatter or modulation loss of the $k^{th}$ tag. Next, we assume that $P_t$, $G_k$, $G_a$ and $T_b^{\{k\}}$ remain

constant for a tag-antenna pair, and then re-write a simplified logarithmic relation for RSS at the $k^{th}$

tag as follows:

$$RSS^{\{k\}} \ (\mathrm{dBm}) = A_0^{\{k\}} + 10 \cdot \beta^{\{k\}} \log\left[\left(\frac{\lambda}{4\pi d}\right)\right] \qquad (4.2)$$

where $A_0^{\{k\}}$ is assumed to be a constant for a specific environment, and the value of $\beta^{\{k\}} = 4$ in
case of LOS free space path loss scenario depicted in Eq. (4.1). In reality, $\beta^{\{k\}}$ is dependent upon

indoor multipath environment and shadowing eﬀects due to obstructions. We use Eq. (4.2) while

formulating our analytical RFID imaging approach.
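
The relation between Eq. (4.1) and Eq. (4.2) can be checked numerically with the minimal Python sketch below. The transmit power, antenna gains, and backscatter loss are illustrative placeholders (not measured values of our hardware), and the logarithm is taken to base 10.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def received_power_w(pt_w, g_a, g_k, t_b, freq_hz, d_m):
    """Free-space monostatic backscatter power, following Eq. (4.1)."""
    lam = C / freq_hz
    return pt_w * g_a ** 2 * g_k ** 2 * t_b * (lam / (4 * math.pi * d_m)) ** 4


def to_dbm(p_w):
    return 10 * math.log10(p_w / 1e-3)


# Placeholder link parameters (linear gains, not dB).
pt, g_a, g_k, t_b, f = 1.0, 4.0, 1.64, 0.25, 915e6

rss_1m = to_dbm(received_power_w(pt, g_a, g_k, t_b, f, 1.0))
rss_2m = to_dbm(received_power_w(pt, g_a, g_k, t_b, f, 2.0))

# Eq. (4.2) with beta = 4 and A0 collecting all distance-independent terms.
a0 = to_dbm(pt * g_a ** 2 * g_k ** 2 * t_b)
rss_eq42 = a0 + 10 * 4 * math.log10((C / f) / (4 * math.pi * 1.0))

print(round(rss_1m, 2), round(rss_2m, 2), round(rss_1m - rss_2m, 2))
```

With a path loss exponent of 4, doubling the reader-to-tag distance lowers the RSS by 40 log10(2), i.e. roughly 12 dB, which reflects the two-way (forward and reverse) attenuation of the monostatic link.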

Phase: For RFID propagation environment involving monostatic readers, the phase information of

a received signal, received from the $k^{th}$ tag, provided by the reader can be written as:

$$\phi = \mathrm{mod}\left(\phi_p + \phi_o + \phi_b^{\{k\}},\ 2\pi\right) \qquad (4.3)$$

where $\phi_p = 2\kappa d + \phi_m$ ($\kappa = \frac{2\pi f}{c}$, $f$ = signal frequency and $\phi_m$ = phase contribution due to

constructive/destructive interference due to multipaths), $\phi_o$ is the phase offset which includes phases

of the cables and other reader and antenna components, and $\phi_b^{\{k\}}$ is the backscatter phase of the

$k^{th}$ tag modulation. Next, we discuss the techniques we use in TagSee’s calibration and monitoring

working modes to pre-process RSS and phase data before feeding it to the imaging module.
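
As a small worked example of Eq. (4.3), the Python sketch below evaluates the reported phase for a given reader-to-tag distance. The offsets are set to arbitrary values, and the helper `circular_diff` is our own addition for comparing wrapped phases.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def reported_phase(d_m, freq_hz, phi_m=0.0, phi_o=0.0, phi_b=0.0):
    """Phase reported by a monostatic reader, following Eq. (4.3)."""
    kappa = 2 * math.pi * freq_hz / C
    phi_p = 2 * kappa * d_m + phi_m  # round trip: reader -> tag -> reader
    return (phi_p + phi_o + phi_b) % (2 * math.pi)


def circular_diff(a, b):
    """Smallest absolute difference between two wrapped phases."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


f = 915e6
lam = C / f
phase_a = reported_phase(1.0, f, phi_o=0.3, phi_b=0.1)
phase_b = reported_phase(1.0 + lam / 2, f, phi_o=0.3, phi_b=0.1)

print(round(phase_a, 4), round(circular_diff(phase_a, phase_b), 6))
```

Because the signal travels the reader-to-tag distance twice, the reported phase repeats every half wavelength of displacement (about 16 cm at 915 MHz), so phase alone is ambiguous in range.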

4.4.1 Calibration Mode

In calibration mode, TagSee reads RSS and phase values from the deployed tags when there

is no human obstruction in the monitored area. These calibrated RSS and phase values are used


for background subtraction during the monitoring mode. During this mode, TagSee keeps reading

the RSS and phase values for tcal ≈ 2 minutes, to ensure that it receives enough readings from
each of the deployed tags and carrier frequencies used by the reader during frequency hopping. For

F frequencies and K tags, TagSee records F K RSS and phase vectors, per RFID antenna. In the

end, TagSee applies a combination of moving average and moving median ﬁlters (window size =

5) to all FK RSS and phase vectors, calculates the median of each of those FK filtered
vectors, and records F × K dimensional RSS and phase calibration matrices ($M^{cal}_{rss}$ and $M^{cal}_{phase}$),
corresponding to each of the A antennas, for background subtraction during image construction in
the monitoring mode.
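The calibration procedure described above can be sketched as follows. This is a minimal illustration, not TagSee's implementation: the order in which the two filters are combined (median first, then average), the truncated windows at the vector edges, and the dictionary layout `readings[(f, k)]` are all assumptions.

```python
from statistics import mean, median

def moving_filter(values, window, reducer):
    """Apply `reducer` over a centered sliding window (truncated at the edges)."""
    half = window // 2
    return [reducer(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def smooth(values, window=5):
    """Moving median followed by moving average (the order is an assumption)."""
    return moving_filter(moving_filter(values, window, median), window, mean)

def calibration_matrix(readings, num_freqs, num_tags):
    """Build an F x K matrix holding the median of each filtered vector.

    `readings[(f, k)]` is a hypothetical layout: the raw RSS (or phase) values
    collected for frequency index f and tag index k on one antenna.
    """
    return [[median(smooth(readings[(f, k)])) for k in range(num_tags)]
            for f in range(num_freqs)]

raw = {(0, 0): [-60, -60, -20, -60, -60, -60],   # one spurious spike
       (0, 1): [-55, -56, -55, -54, -55, -55]}
m_cal = calibration_matrix(raw, num_freqs=1, num_tags=2)
```

In this toy input the isolated -20 dBm spike is removed by the median stage, so the calibrated entry for that tag stays at -60 dBm.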

4.4.2 Monitoring Mode

In monitoring mode, TagSee constructs images after processing the RSS and phase values it

reads from the deployed tags. During this mode, TagSee keeps reading the RSS and phase values,

while maintaining the streaming values in a buﬀer Bmon for batch processing. In our experiments,

we observed that every user takes at least 3 − 4 secs while browsing a certain item category, which
amounts to approximately Nmon = 2000 RSS and phase readings. Therefore, during monitoring

mode, TagSee maintains latest Nmon = 2000 readings in buﬀer Bmon. Bmon is updated with new

readings every tmon = 1 secs, which amounts to approximately 450-500 readings. As all tags

contend for the medium through a random access protocol, TagSee might not receive readings for

some tag-frequency pairs in a period of 3 − 4 secs. For utilizing frequency hopping capability
of RFID readers eﬃciently during the frequency diversity based ﬁltering process that we discuss

later on, TagSee waits until it gets enough readings from the deployed tags for multiple diﬀerent

frequencies. We experimentally observe that the above values of Nmon and tmon allow for recording

enough readings for robustly constructing reasonable images. However, we also observed that

during a browsing activity, it often happens that the data obtained during a certain time window

does not contain data from all the deployed tags. Advanced matrix completion algorithms [65, 112]
can be used to interpolate missing RSS data; however, this would significantly increase the computational


complexity. Therefore, in the case where TagSee does not ﬁnd any reading in Bmon for a certain

tag-frequency pair, it replicates the calibrated RSS and phase readings corresponding to that tag-

frequency pair. Finally, TagSee applies a combination of moving average and moving median ﬁlters

(window size = 5) to all F K number of RSS and phase vectors contained in Bmon corresponding to

each antenna. Next, we describe how TagSee leverages the frequency diversity and phase diﬀerence

to calculate a robust estimate of RSS diﬀerence vector yrss for each RFID antenna.
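The fallback for missing tag-frequency pairs can be sketched as below. The dictionary layouts and the name `fill_missing` are hypothetical and only serve to illustrate the replication step.

```python
def fill_missing(buffer, calibrated):
    """Return a copy of `buffer` with every missing tag-frequency pair filled in.

    `buffer` maps (freq_index, tag_index) -> list of readings held in B_mon;
    `calibrated` maps the same keys -> calibrated value. Both layouts are
    hypothetical and chosen only for this illustration.
    """
    filled = {}
    for pair, cal_value in calibrated.items():
        readings = buffer.get(pair)
        # Replicate the calibrated value when no reading was received.
        filled[pair] = list(readings) if readings else [cal_value]
    return filled

calibrated = {(0, 0): -60.0, (0, 1): -55.0, (1, 0): -61.0, (1, 1): -56.0}
b_mon = {(0, 0): [-66.0, -67.0], (1, 1): [-56.5]}
filled = fill_missing(b_mon, calibrated)
```

Since a filled-in value equals its calibrated counterpart, the corresponding RSS difference is zero, so a missing pair does not by itself suggest an obstruction.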

Phase Difference based Filtering: To filter RSS based on phase difference, TagSee first calculates a
phase difference vector for each of the FK phase vectors contained in Bmon by subtracting the
calibrated phase values in $M^{cal}_{phase}$ corresponding to each of the FK frequency-tag pairs, to obtain
B′mon. Next, TagSee leverages the concept of Fresnel zones [54] to filter out RSS values in Bmon.
The first Fresnel zone determines the LOS path between two RF nodes, and encompasses most
of the RF wavefronts which contribute significantly to RF propagation. Therefore, if the first
Fresnel zone is clear of obstructions, we can assume LOS communication between a tag and its
reader. For monostatic RFIDs, the phase difference between the direct LOS path and any other RF
propagation path lying within the first Fresnel zone can be at most 2π, as shown in Fig. 4.4(b), which
corresponds to one wavelength. Assuming that the phase values obtained for each tag-frequency
pair during the calibration phase ($M^{cal}_{phase}$) correspond to the direct LOS paths between those tags
and their corresponding antenna, TagSee discards all RSS values in B′mon for which the absolute
difference between their phase readings and the corresponding calibrated values in $M^{cal}_{phase}$ is less
than φmon = π/4.
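A minimal sketch of this phase based selection for a single tag-frequency pair is given below. It assumes phases in radians that are compared directly, without unwrapping; the function name `phase_filter` and the sample values are illustrative.

```python
import math

PHI_MON = math.pi / 4.0   # threshold stated in the text

def phase_filter(rss_values, phase_values, calibrated_phase, threshold=PHI_MON):
    """Keep only the RSS readings whose phase moved by at least `threshold`.

    Readings whose phase stays within `threshold` of the calibrated phase are
    discarded, following the rule described above.
    """
    return [rss for rss, phase in zip(rss_values, phase_values)
            if abs(phase - calibrated_phase) >= threshold]

# Hypothetical readings for one tag-frequency pair.
rss = [-60.0, -65.0, -70.0]
phase = [1.0, 1.1, 2.5]
kept = phase_filter(rss, phase, calibrated_phase=1.0)
```

Only the third reading survives here, since its phase differs from the calibrated value by 1.5 rad, which exceeds π/4.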

We chose φmon = π/4 = 0.125 × 2π because we want to select those RSS values for imaging,
which correspond to the scenarios where a human subject obstructs at least 6.25% of the Fresnel

zone. As the R420 reader only provides phase values in the range 0 to 2π, phase wraps may occur, which
can lead to improper phase difference based filtering. However, first of all, the phase difference
between two RF propagation paths lying within the first Fresnel zone can be at most 2π. Second,

neither the tags nor the RFID reader antennas move during the experiments. Therefore, TagSee

assumes that the phase difference between the phase values read during calibration and the ones read
during monitoring mode remains below 2π, and does not take into account phase wraps of over
2π. Note that a few longer paths, i.e. paths outside the first Fresnel zone, may exist at any point in
time. However, we assume that the phase values corresponding to those paths are filtered out by the
aforementioned moving average and moving median filtering. After applying this filter on readings
obtained from each antenna separately, TagSee calculates the median of each of those FK filtered
vectors, and records an F × K dimensional RSS matrix ($M^{mon}_{rss}$) corresponding to each of the A
antennas.

Figure 4.3: Phase and frequency diversity based filtering for a tag obstructed by a human (head &
arms allowed to move). (a) yrss filtered using phase values & frequency diversity (bold-red) is plotted
over RSS differences obtained for 50 subcarriers; x-axis: monitoring buffer (Bmon) update instance,
y-axis: absolute RSS difference yrss. (b) CDF of absolute RSS difference values obtained from the 46th
tag for 50 subcarriers; x-axis: absolute RSS difference observed over 50 subcarriers, y-axis: CDF; the
plot marks the median absolute RSS difference observed for the 45th update of Bmon.

Figure 4.4: Intuitions behind using the concept of first Fresnel zones [54] for phase based filtering
& imaging. (a) Image plane intersecting a Fresnel zone. (b) Impact of human obstruction on paths.

Frequency Diversity based Filtering. To calculate yrss, TagSee first calculates the absolute difference
between the calibrated and monitored RSS matrices, i.e. $Y_{rss} = | M^{cal}_{rss} - M^{mon}_{rss} |$, which gives an
F × K dimensional RSS difference matrix. Next, TagSee leverages the frequency diversity of the
RFID system we use to reduce anomalous variations in $Y_{rss}$. Since the frequencies which the RFID
reader randomly hops between are closely spaced in the frequency range 902-928 MHz, the RSS
difference observed for a tag over these multiple closely spaced frequencies, due to the impact of human
obstruction in a small time window of 3 − 4 secs, is similar (given that the transmit power in each
subcarrier is the same). Therefore, TagSee takes the median of the RSS difference values in $Y_{rss}$ over
F = 50 different frequencies, to obtain the K dimensional RSS difference vector yrss, which is
then fed into the imaging module after every tmon = 1 sec. Again, this filtering is performed for
each antenna. Figure 4.3 shows the filtered RSS difference values, plotted in bold-red color over
the RSS difference values corresponding to all 50 subcarriers, for a tag slightly blocked by a human
standing at the same place, but allowed to move head and arms. We can observe that the filtered
RSS difference values remain at a relatively stable level over time, as compared to the unfiltered
RSS difference values, which show the varying nature of individual subcarriers.
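The frequency diversity step reduces to an element-wise absolute difference followed by a per-tag median across frequencies. A minimal sketch, assuming the matrices are stored as lists of F rows with K entries each (a layout chosen only for this example):

```python
from statistics import median

def rss_difference_vector(m_cal, m_mon):
    """Return the K-dimensional vector y_rss from two F x K RSS matrices.

    Each matrix is a list of F rows (one per frequency) with K entries
    (one per tag); this row-per-frequency layout is an assumption.
    """
    diff = [[abs(c - m) for c, m in zip(cal_row, mon_row)]
            for cal_row, mon_row in zip(m_cal, m_mon)]
    num_tags = len(diff[0])
    # Median over the F frequencies, separately for every tag.
    return [median(row[k] for row in diff) for k in range(num_tags)]

# Toy example with F = 3 frequencies and K = 2 tags.
m_cal = [[-60, -62], [-61, -63], [-60, -62]]
m_mon = [[-66, -62], [-61, -64], [-65, -62]]
y_rss = rss_difference_vector(m_cal, m_mon)
```

For the first tag the per-frequency differences are 6, 0 and 5 dB, so the median (5 dB) suppresses the one frequency on which no change was observed.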

Power based filtering before RFID Imaging. Before feeding the filtered yrss into the imaging module,
TagSee checks if the changes observed in RSS values of different tags deployed along a shelf
are significant enough to reveal the existence of an obstruction. If no significant changes are
observed, then imaging is skipped, which avoids unnecessary computations during image
construction. To achieve this, TagSee compares the power contained in the yrss vector with a
simple-to-tune threshold (Prss = 10), and discards any yrss vectors which do not meet this threshold. Note
that the aforementioned phase based filtering of RSS values, along with this power threshold based
filtering of yrss vectors, filters out the variations in the RFID channel which do not correspond to an
obstruction, which leads to more accurate imaging.
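The power check can be sketched as follows. The text does not spell out how the power of yrss is computed, so this example assumes the sum of squared entries; that definition, like the name `has_obstruction`, is an assumption.

```python
P_RSS = 10.0   # threshold value stated in the text

def has_obstruction(y_rss, threshold=P_RSS):
    """Return True when the vector carries enough power to be worth imaging.

    Power is taken here as the sum of squared RSS differences, which is an
    assumption; the threshold would need re-tuning under another definition.
    """
    return sum(value * value for value in y_rss) >= threshold

quiet = [0.5, 0.2, 0.1]     # small fluctuations, no obstruction expected
blocked = [3.0, 2.0, 0.0]   # two tags show clear attenuation
```

Under this assumed definition, `quiet` has a power of 0.3 and is discarded, while `blocked` has a power of 13 and is passed on to the imaging module.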

4.5 Analytical RFID Imaging Approach

In this subsection, we propose an analytical imaging approach for monostatic RFID systems,

which forms the basis of our proposed deep neural networks (DNNs) based imaging technique in

§4.6. We explain our model using Fig. 4.4(a). The green surface represents the surface of the shelf

on which RFID tags are deployed. In a typical browsing scenario, customers will be standing closer

to the shelf. Our aim is to image the customers while they stand close to the shelf for browsing the

items displayed on that shelf. To develop our analytical model, we ﬁrst erect an imaginary image

plane at a point Pz along the Z-axis, parallel to the X-Y plane as shown in Fig. 4.4(a). Next, we divide
the image plane into voxels, such that there are px voxels between each pair of tags along the X-axis,
py voxels between each pair of tags along the Y-axis, and the coordinates of the bottom leftmost and
top rightmost voxels are (0, 0) and ($Max_x$, $Max_y$), respectively (px = py = 5 in our current
implementation, with an inter-tag distance of 5 inches along both axes). Our imaging problem can
then be represented as a linear system of equations, where the changes in RSS of different links yrss
can be written in terms of the changing attenuations x and an M × N weight matrix W (where M =
number of links, N = number of voxels in the image plane), specifying the contributions of each
link towards the changes observed in attenuation at each different voxel, as yrss = Wx + n, where
n corresponds to fading and measurement noise. The goal is to solve the system for each antenna
separately, using the RSS difference vectors for each tag-antenna pair.

For the system represented by yrss = Wx + n, we first model the weight matrix W by employing
the concept of Fresnel zones [54] between two nodes of an RF link, and use imaginary ellipsoids
centered at the locations of different RF nodes in the network, to determine weights of different
voxels in the image plane covering the monitored area. The intuition is that these ellipsoids
determine the LOS path (typically chosen to be the first Fresnel zone) of each respective link, and
if a voxel is intersected by the LOS path of a link, it is assigned more weight than a
voxel which does not fall in the LOS. Figure 4.4(a) shows an example scenario. To construct an image
from the RSS difference values obtained for all the links (i.e. yrss), the most straightforward way
is to calculate the least-squares solution as $x_{LS} = P \times y_{rss}$, where $P = (W^T W)^{-1} W^T$. However,
the matrix W is not full rank in the case of imaging systems, which makes imaging an ill-posed
inverse problem. To handle ill-posedness, we use the Tikhonov regularization approach (such as the
one proposed in [152]) for minimizing the objective function of the original problem.
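The effect of regularization on a rank-deficient system can be seen in a toy example. The sketch below uses the simplest Tikhonov term (α times the identity), which is a stand-in chosen for brevity and not the regularizer TagSee uses; the small Gaussian elimination solver and all names are written only for this illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def tikhonov(w, y, alpha):
    """Return x minimizing ||W x - y||^2 + alpha * ||x||^2."""
    wt = [list(col) for col in zip(*w)]                      # W^T
    gram = [[sum(p * q for p, q in zip(row_i, row_j)) for row_j in wt]
            for row_i in wt]                                 # W^T W
    for i in range(len(gram)):
        gram[i][i] += alpha
    rhs = [sum(wi * yi for wi, yi in zip(row, y)) for row in wt]  # W^T y
    return solve(gram, rhs)

# Two links observing three voxels: W^T W is singular without regularization.
W = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
y = [2.0, 2.0]
x_hat = tikhonov(W, y, alpha=0.1)
```

With α = 0 the solver raises `ValueError` because $W^T W$ cannot be inverted, whereas a small positive α yields a unique estimate that places the largest attenuation on the voxel shared by both links.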

Estimating Change in Path Loss: Monostatic RFID communications make the task of estimating the
weight matrix W more difficult, because signals experience attenuation multiple times before
reaching back to the same receiver, i.e. once when signals are sent to the deployed tags, and a second time
when those tags backscatter those signals back to the reader antenna. Assuming that both forward
and reverse channels of a TX-Tag-RX link are symmetric, we assume an imaginary set of symmetric
ellipsoids (approximate LOS regions) between each tag and reader antenna pair. Intuitively, the
image plane cuts the Fresnel zones between each TX-Tag-RX link. Weights are assigned to each
voxel of an image plane, based on whether they fall inside the imaginary ellipsoid of the TX-Tag-RX
link or not. Given that the value of the free space path loss exponent for a TX-Tag-RX link corresponding
to each tag k is 4, we can assume that during the calibration phase, the initial value is $\beta^{k}_{init} = 4 + \phi^{k}_{init}$.
Similarly, let us assume that for each new update of the buffer Bmon during the monitoring mode,
the new value is $\beta^{k}_{new} = 4 + \phi^{k}_{new}$. Both $\beta^{k}_{init}$ and $\beta^{k}_{new}$ are unknown here; however, as we will show next, only
the difference $\beta^{k}_{new} - \beta^{k}_{init}$ is required. At any time instance, we can write this change in path loss
exponent in terms of the change in RSS for tag k (i.e. $y_{rss,k}$) at the RFID reader antenna as follows:

\[ \triangle\beta^{k} = \beta^{k}_{new} - \beta^{k}_{init} = \frac{y_{rss,k}\,(\mathrm{dBm})}{10 \cdot \log\left( \frac{\lambda_{avg}}{4 \pi d} \right)} \qquad (4.4) \]
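Eq. (4.4) follows from subtracting two instances of Eq. (4.2) that share the same constant term. The sketch below checks this numerically; it assumes a base-10 logarithm, takes the band centre of 915.5 MHz as the average frequency, and uses a reader-tag distance of 4.2 m (about 13.78 ft). The function name `delta_beta` is illustrative, and the sign of the result follows whatever sign convention is used for the RSS change.

```python
import math

def delta_beta(y_rss_k_db, wavelength_avg_m, distance_m):
    """Eq. (4.4): change in path loss exponent implied by an RSS change (dB)."""
    return y_rss_k_db / (
        10.0 * math.log10(wavelength_avg_m / (4.0 * math.pi * distance_m)))

wavelength_avg = 299_792_458.0 / 915.5e6   # centre of the 902.75-928.25 MHz band
distance = 4.2                             # metres, roughly 13.78 ft
log_term = 10.0 * math.log10(wavelength_avg / (4.0 * math.pi * distance))

# RSS change predicted by Eq. (4.2) when beta moves from 4.0 to 4.5.
rss_change = (4.5 - 4.0) * log_term
change = delta_beta(rss_change, wavelength_avg, distance)
```

Feeding the RSS change predicted by Eq. (4.2) back into Eq. (4.4) recovers the exponent change of 0.5, which confirms that the two relations are consistent.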


As the RFID reader uses multiple frequencies, we use average of those frequencies and choose

λ = λavg for simplicity. Given the fact that these frequencies are close to each other, the choice

of this value of λ does not hurt the results of this scheme. After calculating △ β, TagSee assigns
weights to each voxel on the image plane as:

\[ w_{kj} = \begin{cases} \dfrac{1}{d^{\,(4 - \triangle\beta)}}, & \text{if } d_{kj,1} + d_{kj,2} < d + \Theta \\ 0, & \text{otherwise} \end{cases} \]

Here d is the distance between the reader antenna and the tag for a link k, $d_{kj,1}$ is the distance from the center of pixel
j to the tag, and $d_{kj,2}$ is the distance from the center of pixel j to the reader antenna corresponding
to link k. Θ is a parameter describing the ellipsoid's width. Using the general equation for calculating
the first Fresnel zone radius at any point in the middle of the link, we chose Θ to be as follows:

\[ \Theta = \Theta_0 \times \sqrt{ \frac{\lambda_{avg} \cdot d_{kj,1} \cdot d_{kj,2}}{d_{kj,1} + d_{kj,2}} } \]

where Θ0 is an environment and RFID infrastructure dependent parameter, which we
tune to achieve reasonable imaging results. Finally, TagSee employs the regularization techniques
mentioned in [144, 152] to construct the image by determining attenuation values per voxel (i.e. $\hat{x}$) as

\[ \hat{x} = P \times y_{rss} = \left( W^T W + \sigma_N C_x^{-1} + \alpha \left( D_X^T D_X + D_Y^T D_Y \right) \right)^{-1} W^T y_{rss} \]

where $D_X$ and $D_Y$ are differential operators along the X and Y directions respectively, catering for the
"spread" of the impact of attenuations in RSS values along these axes. We tune α = 15 in our current
implementation, for reasonable imaging results. $C_x^{-1}$ is another prior term, which controls the spatial
correlation of the impact of RSS attenuations across neighboring pixels of the image. Although we do not
know $\hat{x}$ in advance, we approximate $C_x$ based on an exponential spatial attenuation model [10, 114],
as $C_x = \frac{\sigma_x}{\delta} \exp(-D_p / \delta)$, where $D_p$ is a square form distance matrix, containing the distance between
each pair of pixels along the imaginary image plane, and σx is the prior pixel variance due to human
motion. σN is the prior variance of noise in pixel values, i.e. when there is no human motion. We
approximate σN as a function cn ∗ σy of the variance in RSS values during the calibration phase. We
approximate the pixel variance σx as a function cx ∗ σy of the variance in RSS values during the monitoring
phase. We tune the values of cn, cx and δ for the best results.
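The voxel weighting rule described above (an ellipse test whose width Θ follows the first Fresnel zone radius) can be sketched as below. Coordinates in metres, the helper names and the example geometry are assumptions made for illustration only.

```python
import math

def fresnel_width(theta0, wavelength_avg, d1, d2):
    """Theta = Theta0 * sqrt(lambda_avg * d1 * d2 / (d1 + d2))."""
    return theta0 * math.sqrt(wavelength_avg * d1 * d2 / (d1 + d2))

def voxel_weight(voxel, tag, antenna, delta_beta, theta0, wavelength_avg):
    """Weight of one voxel for one tag-antenna link (points are 3D tuples)."""
    d = math.dist(tag, antenna)       # link length
    d1 = math.dist(voxel, tag)        # voxel centre to tag
    d2 = math.dist(voxel, antenna)    # voxel centre to reader antenna
    if d1 + d2 < d + fresnel_width(theta0, wavelength_avg, d1, d2):
        return 1.0 / d ** (4.0 - delta_beta)
    return 0.0

wavelength_avg = 299_792_458.0 / 915.5e6
tag, antenna = (0.0, 0.0, 0.0), (0.0, 0.0, 4.0)
on_path = voxel_weight((0.0, 0.0, 1.0), tag, antenna, 0.0, 1.0, wavelength_avg)
off_path = voxel_weight((3.0, 0.0, 1.0), tag, antenna, 0.0, 1.0, wavelength_avg)
```

A voxel lying on the direct path of this 4 m link receives the weight 1/4^4, while a voxel 3 m to the side falls outside the ellipse and receives zero weight.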


Imaging: For imaging, TagSee erects multiple image planes along Z-axis and then takes an average

of the images corresponding to all planes. We tune the locations of these image planes for best

results. Finally, TagSee combines the images obtained from all antennas by taking their average.

While evaluating TagSee’s performance, we compare the imaging performance of this analytical

measurement model based imaging approach with our DNN based approach.

Next, we propose our DNN based imaging approach, which not only gets rid of manual tuning

of parameters in our analytical imaging approach, but also enables multi-person imaging that is

required for monitoring browsing behavior of multiple customers towards diﬀerent item categories

simultaneously.

4.6 Deep Learning based RFID Imaging

In our deep learning based RFID imaging approach, we aim to solve the linear regression

problem posed in the equation yrss = Wx + n by modeling it as a deep regression problem. Deep

regression techniques have been shown to yield state-of-the-art results without having to resort to

more complex and ad-hoc regression models [73]. To achieve this, we model x = P × yrss as a
DNN, where yrss corresponds to the input of ﬁrst layer, x corresponds to the output of last layer

and the image construction matrix P corresponds to the combination of all the layers in between

input and output. The intuition is that if we train our system with approximate images of what a
human obstacle should look like while browsing an item category, it can automatically learn the
underlying relationship between those images and the RFID channel dynamics observed during
that browsing activity, which is otherwise difficult to model through geometrical or measurement
model based approaches.

Choice of Layers in Neural Network. We chose the DNN layers such that the weights and biases
learned at different layers of the network can mimic the impact of human obstructions on the
RFID signals. Based on the solution to our analytical imaging approach, we can see that the image
construction matrix $P = (W^T W + \sigma_N C_x^{-1} + \alpha(D_X^T D_X + D_Y^T D_Y))^{-1} W^T$ captures linearities as well
as non-linearities related to the impact of human obstructions in terms of attenuations introduced in
the RFID channel. The non-linearities are introduced by terms in the weight matrix W, which models
the exponential attenuation of RFID signals with distance, and the terms in the correlation matrix $C_x$,
which models how the impact of attenuations due to human obstructions decays spatially along the
2D image plane. The linearities are introduced by the terms corresponding to the linear differential
operators $D_X$ and $D_Y$, as well as by the inherent linear nature of our imaging problem, i.e. all the
different matrices are connected through basic linear operations such as multiplication, addition,
and inverse. For DNN based imaging, we design the input RSS difference vectors yrss and the output
image vectors $\hat{x}$ to be normalized vectors, containing values between 0 and 1. We normalize the
input yrss vectors by dividing by the maximum observed RSS difference, which we empirically
estimate for the system. Moreover, the image construction matrix P can consist of both positive and
negative values, due to the inverse operations involved in it.

Figure 4.5: DNN architecture for K = 116 tags, kx = 29 tags, ky = 4 tags and inter-tag distance
of 5 inches along both axes

Based on the aforementioned nature of the underlying parameters, we chose the first two layers of
the DNN to be rectified-linear (ReLU) layers, the next two layers to be tanh layers and the last two
layers to be sigmoid layers. The output of a ReLU layer is always a non-negative real number, the
output of a tanh layer remains between -1 and 1, and the output of a sigmoid layer always stays
between 0 and 1. We propose to generalize the dimensions of the aforementioned layers as
$\{K \times \frac{3}{2}K\}$, $\{\frac{3}{2}K \times 3K\}$, $\{3K \times 3K\}$, $\{3K \times 2K\}$, $\{2K \times 2K\}$ and $\{2K \times p_x p_y (k_x - 1)(k_y - 1)\}$, respectively,
where K = 116, kx = 29, and ky = 4 in our current implementation of TagSee. The number of
DNN layers, as well as the dimension of each DNN layer, can be tuned empirically to achieve the
lowest cross-validation errors, and based on the intuition that the amount of information is limited
by the number of RFID tags, i.e. any constructed image is basically an extrapolation of the impact
of RSS variations observed for K tags. For example, in our proposed DNN architecture, we keep
the dimensions of all layers to be within 3 times the number of tags being used for imaging. Finally,
to prevent the DNN from over-fitting, we use L2 regularization in combination with dropout [129],
where the dropout rate (i.e. the probability to retain a DNN unit during training) is 0.7. Figure 4.5
shows the visual representation of the DNN that we design for our current implementation.
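The layer dimensions quoted above can be worked out for the current deployment with a few lines of Python. The helper name `layer_shapes` is hypothetical, and the use of integer division for 3K/2 is an assumption that only matters when K is odd.

```python
def layer_shapes(num_tags, p_x, p_y, k_x, k_y):
    """Return (inputs, outputs, activation) for each of the six layers."""
    k = num_tags
    image_size = p_x * p_y * (k_x - 1) * (k_y - 1)   # voxels in the output image
    widths = [k, 3 * k // 2, 3 * k, 3 * k, 2 * k, 2 * k, image_size]
    activations = ["relu", "relu", "tanh", "tanh", "sigmoid", "sigmoid"]
    return [(widths[i], widths[i + 1], activations[i]) for i in range(6)]

shapes = layer_shapes(num_tags=116, p_x=5, p_y=5, k_x=29, k_y=4)
```

For K = 116 this gives layers of size 116 × 174, 174 × 348, 348 × 348, 348 × 232, 232 × 232 and 232 × 2100, where 2100 = 5 · 5 · 28 · 3 is the number of voxels in the output image.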

Training Requirements. We separately ask a couple of volunteers to browse each diﬀerent item

category by standing in front of them, for approximately 30 − 60 secs. We do not constrain natural
human motion during this training phase, i.e., the individuals browsing diﬀerent item categories

are allowed to move to-and-fro and sideways, and browse items in a natural manner. TagSee uses

the variations observed in RSS values of the deployed tags during this training phase as inputs to its

DNN, after normalization. For each browsing activity during the training phase, we also generate

approximate normalized images of human obstructions, which TagSee uses as the training “labels”

or regression outputs to its DNN. In this work, we approximate the images of human subjects as

ellipses, where the width and height of those ellipses are chosen as the average width and height of the
human subjects used for training TagSee's DNN. The height of the ellipses is limited by the size
and location of the deployed mesh of tags. For robustness, we design TagSee's DNN classification

module as an ensemble of D single DNN classiﬁers. The ﬁnal output is median of the outputs

obtained from all the D single classiﬁers.
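Two pieces of this training setup lend themselves to a short sketch: generating an elliptical label image and taking the median over the ensemble outputs. Both helpers below are hypothetical illustrations; binary pixel values and the flattened output layout are assumptions.

```python
from statistics import median

def ellipse_label(width, height, cx, cy, semi_w, semi_h):
    """Binary image (list of rows) with an axis-aligned ellipse set to 1.0."""
    return [[1.0 if ((x - cx) / semi_w) ** 2 + ((y - cy) / semi_h) ** 2 <= 1.0
             else 0.0
             for x in range(width)]
            for y in range(height)]

def ensemble_median(outputs):
    """Pixel-wise median over the flattened outputs of the D single DNNs."""
    return [median(pixels) for pixels in zip(*outputs)]

label = ellipse_label(width=5, height=3, cx=2, cy=1, semi_w=2, semi_h=1)
merged = ensemble_median([[0.1, 0.9], [0.2, 0.8], [0.9, 0.7]])
```

The median makes the ensemble output insensitive to a single outlying network, as seen in the first pixel of `merged`, where the value 0.9 produced by one network is ignored.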

4.7 Multi-Person RFID Imaging

To develop our multi-person imaging technique, TagSee leverages the intuition that a human
subject will only impact kcw columns of tags along the X-axis, for any deployment of a mesh of
tags. Based on this intuition, TagSee creates multiple new vectors $\{y^i_{rss}\}$ from each new RSS
difference vector yrss corresponding to an antenna, by moving a window of size kcw columns over
the spatial distribution of deployed tags, as shown in Figure 4.6(a). For each window i, TagSee
only copies those values from yrss to $y^i_{rss}$ which correspond to the tags contained inside that
window, and replaces the values corresponding to the remaining tags with two times the standard
deviation values ($2 \cdot \sigma_{k,a,rss}$) of the Gaussian distributions $N(\mu_{k,a,rss}, \sigma^2_{k,a,rss})$, which model the RSS
difference values observed for those tags, respectively, when there are no obstructions around.

Figure 4.6: TagSee's spatial moving window based approach for multi-person imaging. (a) Spatial
moving window of impact width kcw = 6. (b) Distribution of yrss during calibration mode, shown for
Tag 46 and Tag 47 as a histogram of absolute RSS differences with a fitted normal distribution;
x-axis: RSS difference yrss (dBm), y-axis: frequency of occurrence.

This avoids spurious blobs during image construction. Figure 4.6(b) shows the distribution of
the RSS difference values, obtained for two closely spaced tags, during calibration mode in one
of our experiments, over a period of ∼ 2 mins. We can observe that the RSS difference values
approximately follow a Gaussian distribution. The mean and variance values for the aforementioned
distributions $N(\mu_{k,a,rss}, \sigma^2_{k,a,rss})$, corresponding to each possible tag-antenna pair, are estimated
during the calibration phase.

For kx columns of tags, TagSee first generates $k_x - k_{cw} + 1$ new vectors $\{y^i_{rss}\}$. Afterwards, it
constructs images corresponding to all vectors $\{y^i_{rss}\}$ using the aforementioned DNN based imaging
technique, and then merges them after passing through a 2D filter (median and averaging filters),
to output the final image. TagSee applies this multi-person imaging technique for each antenna
separately, and finally combines the images obtained from all antennas through averaging. This
approach enables multi-person imaging without changing the training requirements of TagSee's
imaging technique, i.e. our system does not require the DNN to be trained for multi-person scenarios.
The only change required is to train the DNN for all possible $y^i_{rss}$ corresponding to every training sample.
Note that for any deployment, we train TagSee's DNN with data corresponding to 'no obstruction'
scenarios as well, to avoid detection of spurious blobs during image construction. To train the DNN
for 'no obstruction' scenarios, the input vectors are set to be the $2 \cdot \sigma_{k,a,rss}$ values corresponding
to each deployed tag and the outputs are set to be zero vectors (i.e. blank images).
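The spatial moving window can be sketched as follows. The sketch assumes that tags are indexed column by column (tag index = column · ky + row) and that `sigma` holds the per-tag standard deviations estimated during calibration; both the layout and the function name are assumptions.

```python
def window_vectors(y_rss, sigma, k_x, k_y, k_cw):
    """Create one masked copy of y_rss per window position along the X-axis.

    Tags inside the window keep their measured RSS difference; all other
    tags are replaced by twice their calibration-time standard deviation.
    """
    vectors = []
    for start in range(k_x - k_cw + 1):
        masked = [y_rss[c * k_y + r] if start <= c < start + k_cw
                  else 2.0 * sigma[c * k_y + r]
                  for c in range(k_x) for r in range(k_y)]
        vectors.append(masked)
    return vectors

# Toy mesh: 4 columns, 1 row, window of 2 columns.
toy = window_vectors([5.0, 6.0, 7.0, 8.0], [0.5] * 4, k_x=4, k_y=1, k_cw=2)
# Deployment-sized mesh: 29 columns, 4 rows, window of 6 columns.
full = window_vectors([0.0] * 116, [0.0] * 116, k_x=29, k_y=4, k_cw=6)
```

For the deployment described in this chapter (kx = 29, kcw = 6), this yields 29 − 6 + 1 = 24 window vectors per RSS difference vector and antenna.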

Monitoring Browsing Activity: TagSee monitors the customer browsing behavior towards diﬀerent

items in terms of popularity of those items. In monitoring mode, TagSee feeds the ﬁnal constructed

image frames to a Blob Analysis module [86], which determines the background using a few initial

frames, and then outputs the coordinates of bounding boxes and centroids of any human images it

detects in foreground of each consecutive frame. As the boundaries of item categories are known

in advance, TagSee determines the popularity of each category by checking the proximity of the

centroids corresponding to detected blobs in each frame to the centroid of that category. If the

centroid of a blob is within η2 voxels of the centroid of the j-th category, TagSee increments the
popularity Pj. η1 and η2 are dependent upon the density of deployed tags, and can be easily tuned
empirically for a certain deployment scenario. For robust popularity estimates, TagSee maintains a

buﬀer consisting of 5 latest constructed images (which corresponds to a period of ∼ 3 − 4 secs),
takes the median of all those images and applies thresholding (on the scale from 0 to 1, pixel values

below 0.1 are set to 0), before calculating Pj ’s.
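The popularity update can be sketched as below. The sketch assumes that blob centroids are supplied by a separate blob analysis step, uses a single Euclidean radius `eta` in place of the tuned proximity thresholds, and works on flattened images; all names are hypothetical.

```python
import math
from statistics import median

def stabilize(frames, threshold=0.1):
    """Pixel-wise median over the buffered frames, then zero out weak pixels."""
    merged = [median(pixels) for pixels in zip(*frames)]
    return [value if value >= threshold else 0.0 for value in merged]

def update_popularity(popularity, blob_centroids, category_centroids, eta):
    """Increment P_j when any blob centroid lies within `eta` voxels of category j."""
    for j, category in enumerate(category_centroids):
        if any(math.dist(blob, category) <= eta for blob in blob_centroids):
            popularity[j] += 1
    return popularity

# Five buffered two-pixel "images" and one detected blob.
frames = [[0.05, 0.8], [0.02, 0.9], [0.30, 0.7], [0.04, 0.8], [0.06, 0.9]]
image = stabilize(frames)
counts = update_popularity([0, 0, 0], [(12.0, 7.0)],
                           [(10.0, 7.5), (30.0, 7.5), (50.0, 7.5)], eta=5.0)
```

In this toy input the first pixel is suppressed because its median (0.05) is below 0.1, and only the first category is credited, since the blob centroid lies about 2 voxels from its centroid.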

4.8 Implementation & Evaluation

We implement TagSee using a COTS UHF Impinj R420 pRFID reader [55] and SMARTRAC's
Dogbone pRFID tags [125], which operates in frequency range 902.75 - 928.25 MHz and is
compatible with the EPC Global C1G2 [38] standard. We use two circularly polarized antennas, which
are connected to two of the four antenna ports of the R420 reader. As the reader interrogates the deployed
static tags, the information containing IDs, time stamps, channel frequencies, reader antenna IDs,
RSS values and phase values corresponding to the tags read in each cycle is sent through Ethernet
to a laptop running TagSee. We develop our RFID data collection module by building upon the Java
based Impinj Octane SDK [56]. For any deployment scenario, TagSee first runs in calibration mode,
for approximately 2 mins, to determine the background values of RSS and phase for all deployed
tags. Afterwards, it turns on its monitoring mode to image customers and track the popularity of
different item categories being browsed by those customers.

4.8.1 Evaluation Methodology

Figure 4.7: Detailed experimental setup

4.8.1.1 Experimental Setup

Fig. 4.7 shows the detailed experimental setup of tags and reader antennas that we use to test
TagSee. We deploy a total of K = 116 tags on a wooden shelf (each tag is first pasted on a sticky
post-it note which is then attached to the shelf). We deploy kx = 29 and ky = 4 tags along the X
and Y axes, respectively, with an inter-tag distance of 5 inches along both axes. The area between
the tags is divided into voxels or image pixels, such that there are 5 voxels between each pair of
tags deployed along both axes. The voxels are shown by red dots in Fig. 4.7. We place two reader
antennas 13.78 ft away from the shelf, with an inter-antenna distance of 20 inches as shown in Fig.
4.7. Tags are attached such that the distance of the first row of tags is 3.5 ft from the ground. Both
antennas are placed parallel to the shelf, such that the distance of their centers is 3.5 ft from the
ground. Moreover, as imaging is based on obstruction of LOS paths between tags and the antennas,
we pointed both antennas towards the deployed tags. We mark 6 item categories on the shelf, where
each category is covered by 4 separate columns of tags. Note that the exact dimensions of the setup
are only required by our analytical imaging approach, which motivates our DNN based approach and
serves as its comparison baseline. Our DNN based approach just requires that the locations of tags
and antennas do not change after training, as that would require retraining the DNN.

4.8.1.2 Data Collection

For data collection, we recruited 10 users who volunteered to provide data for our project. From

5 of those users, we were able to collect a total of 14617 samples (2500+ samples per person).

We use this data to train TagSee’s DNN model because we assume that it is big enough to capture

the diversity of browsing movements of those users (who had diﬀerent body widths and heights)

reasonably well. However, due to time limitation, the remaining 5 users could only provide us with

2413 samples (< 500 samples per person). As the data obtained from these users is limited, we use

their data for testing TagSee’s performance. Hence, we test TagSee’s performance using unseen data

(i.e. data from the users who are not used for training TagSee’s DNN), which makes our evaluation

more robust. Note that the users in our study had diﬀerent body widths and heights. However, an

evaluation of the impact of such variations on TagSee’s performance is out of the scope of this work

and left as part of future work. In this work, we only focus on coarse-grained imaging of customers

in front of the shelves.

4.8.1.3 Performance Metrics

Except for the scenarios where we compare TagSee’s imaging performance, we evaluate

TagSee’s popularity tracking performance for any experiment using TPRs, FPRs and Miss Rates

(MRs), which are calculated based on the correctness of item popularities Pj TagSee determines in
diﬀerent time windows. True positives correspond to the scenarios during which TagSee is able to

detect interest in the categories being tested. However, TagSee may wrongly detect interest in
categories (i.e. other than the ones being tested) as well, which correspond to false positive scenarios.

TagSee misses when it is unable to detect interest in the tested categories during a time window

(i.e. MR = 1 - TPR). Our goal is to achieve maximum TPRs and minimum FPRs.
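One way to compute these rates from per-window detections is sketched below. The text does not state the exact denominators, so this example assumes that each (time window, category) pair counts as one decision; that counting scheme and the function name `rates` are assumptions.

```python
def rates(windows, num_categories):
    """Return (TPR, FPR, MR) over a list of (tested, detected) category sets."""
    tp = fn = fp = tn = 0
    for tested, detected in windows:
        for j in range(num_categories):
            if j in tested:
                if j in detected:
                    tp += 1      # interest correctly detected
                else:
                    fn += 1      # interest missed
            elif j in detected:
                fp += 1          # interest wrongly detected elsewhere
            else:
                tn += 1
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr, 1.0 - tpr

# Four time windows in which category 2 is being browsed (6 categories total).
windows = [({2}, {2}), ({2}, {2, 3}), ({2}, set()), ({2}, {2})]
tpr, fpr, mr = rates(windows, num_categories=6)
```

Here the tested category is detected in three of four windows (TPR = 0.75, MR = 0.25), and one of the twenty untested window-category pairs is flagged (FPR = 0.05).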


4.8.2 Single Person Imaging Scenarios

TagSee can achieve TPRs of more than 90% and FPRs of less than 5% for single person
monitoring scenarios. Figure 4.8 compares the imaging results of our analytical (baseline) and DNN
based approaches, where

TagSee constructs images of a user as he stands in front of each diﬀerent item category, using 2

antennas. For this experiment, TagSee used a DNN trained for 3 volunteers, where the selected

volunteers did not include the tested user. We can see that the images constructed by DNN based

RFID imaging (bottom) are highly accurate as compared to the ones constructed using baseline

approach (top). This is because, ﬁrst, our DNN approach automatically tunes all values in the image

construction matrix for minimum construction errors. Second, the DNN based approach is less

vulnerable to natural human motions during browsing activity, which happens because it takes such

variations due to motion into account during the training process. Next, we show how TagSee’s

performance in single person scenarios is aﬀected by number of training users, impact width and

number of antennas.

Figure 4.8: Comparison between TagSee's baseline (top) and DNN based (bottom) approaches for
single person scenario. Axes: X-Coordinate (voxels) vs. Y-Coordinate (voxels).

4.8.2.1 Eﬀect of the number of training users

Figure 4.9 shows the impact of number of training users on TPRs, FPRs and MRs for 3 diﬀerent

experiments, performed for 3 of the 5 volunteers selected for testing. Impact width was set to

kcw = 6, and the error rates reported were averaged over 6 different item categories. Moreover,

data from 2 RFID reader antennas was used for constructing images in these experiments.

Figure 4.9: Effect of number of training users on TagSee’s performance (TPRs, FPRs and MRs) for kcw = 6. Panels: (a) Average error rates for test user 1; (b) Average error rates for test user 2; (c) Average error rates for test user 3. Axes: rates (TPRs, FPRs and MRs) vs. number of training users.

Figure 4.10: Performance in single person monitoring scenario using 1 reader antenna only, kcw = 8. Axes: rates (TPRs, FPRs and MRs) vs. number of training users.

We can observe an increasing trend in TPRs for all three users (Figures 4.9(a)-4.9(c)), which is intuitive. This

happens because as TagSee’s DNN is trained with more scenarios, corresponding to diﬀerent users

of diﬀerent shapes and sizes, its image construction becomes more accurate and robust, leading

to higher detection rates. This is also the reason behind the decreasing trend in MRs, which we

can observe for all three users (Figures 4.9(a)-4.9(c)). However, we see that FPRs do not show the expected

decreasing trend, which may seem counter-intuitive at ﬁrst. This happens because when a customer

is browsing an item category, our multi-person imaging algorithm sometimes wrongly detects

and images human presence in front of nearby item categories as well, which leads to spurious

popularity counts. In this scenario, TagSee can achieve TPRs of more than 85% and FPRs of less

than 15%, averaged over all users and categories.
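As a rough illustration of how the three reported rates relate to each other, the sketch below computes them from per-trial sets of browsed and detected item categories. It assumes the standard confusion-matrix definitions (TPR = TP / (TP + FN), FPR = FP / (FP + TN), MR = 100% - TPR); the exact definitions used in this evaluation are given earlier in the chapter and may differ. The function name and the data layout are hypothetical.

```python
def detection_rates(true_sets, detected_sets, num_categories):
    """Return (TPR, FPR, MR) in percent, pooled over all trials.

    true_sets[i]     -- categories actually browsed in trial i (assumed layout)
    detected_sets[i] -- categories whose popularity count was incremented
    """
    tp = fp = fn = tn = 0
    for truth, detected in zip(true_sets, detected_sets):
        for category in range(1, num_categories + 1):
            is_true = category in truth
            is_detected = category in detected
            if is_true and is_detected:
                tp += 1
            elif is_true:
                fn += 1          # a browsed category that was missed
            elif is_detected:
                fp += 1          # a spurious popularity count
            else:
                tn += 1
    tpr = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    fpr = 100.0 * fp / (fp + tn) if fp + tn else 0.0
    mr = 100.0 - tpr if tp + fn else 0.0
    return tpr, fpr, mr


# Two single-person trials over 6 item categories.
rates = detection_rates([{1}, {3}], [{1, 2}, {4}], num_categories=6)
print(rates)
```

Under these assumed definitions, a spurious detection in front of a neighbouring category raises the FPR without changing the TPR or MR, which is consistent with the FPR trend behaving differently from the other two.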

4.8.2.2 Eﬀect of the number of reader antennas

In the aforementioned experiments, we were using 2 reader antennas. Figure 4.10 shows the

error rates achieved using single reader antenna, for kcw

= 8 and diﬀerent number of training users.

We observe that the trends in TPRs and MRs remain similar to those in the 2 antenna

scenarios (Fig. 4.9(a)). However, the average FPR increases drastically to more than

45%. This is because excluding images from one of the antennas leads to ineffective filtering (as

mentioned in § 4.7) of consecutive image frames, which may contain spurious popularity counts

for untested categories.

4.8.3 Multi-Person Imaging Scenarios

TagSee can achieve more than 90% TPRs, and less than 10% FPRs for 2 person monitoring

scenarios. Moreover, for 3 person monitoring scenarios, TagSee can achieve more than 80% TPRs,

and less than 20% FPRs. Figure 4.11 shows some selected imaging results for these two approaches,

which were constructed using 2 reader antennas, where test users performed browsing activities

in front of item category sets {1,3}, {1,6}, {3,6}, {4,6}, {1,4,6} and {2,4,6} respectively. We

can observe that TagSee’s DNN based RFID imaging approach produces accurate images even for

multi-person monitoring scenarios. We chose 3 of the 5 test users for TagSee’s multi-person

performance evaluation. Also, we kept the number of reader antennas A = 2 for robust imaging.

Figure 4.11: Comparison between TagSee’s baseline (top) and DNN based (bottom) RFID imaging approaches for multi-person scenario. The leftmost 4 images correspond to 2-user scenarios, and the rightmost 2 images correspond to 3-user scenarios. Axes: X-Coordinate (voxels) vs. Y-Coordinate (voxels).

Figure 4.12: Effect of impact width kcw and number of training users on TagSee’s performance in 2 person scenarios. Panels: (a) Overall average error rates for kcw = 4; (b) Overall average error rates for kcw = 6; (c) Overall average error rates for kcw = 8. Axes: rates (TPRs, FPRs and MRs) vs. number of training users.

Figure 4.13: Effect of impact width kcw and number of training users on TagSee’s performance in 3 person scenarios. Panels: (a) Per scenario error rates for kcw = 4; (b) Per scenario error rates for kcw = 6; (c) Per scenario error rates for kcw = 8. Axes: rates (TPRs, FPRs and MRs) vs. number of training users.

4.8.3.1 Eﬀect of the number of training users

Here, ﬁrst we discuss performance for 2 person monitoring scenarios i.e. corresponding to

the category sets {1,3}, {1,4}, {1,5}, {1,6}, {3,6} and {4,6}. Figures 4.12(a)-4.12(c) show how

TagSee’s average performance changes with number of training users, for three diﬀerent values of

kcw. We can observe an increasing trend in TPRs (decreasing MRs). We also see a decreasing trend

in FPRs, but just like for single person monitoring scenarios, it is not consistent. For example, in

case of kcw

= 8, FPRs decrease as training users increase from 1 to 4, but we see an increase in

FPRs for the case corresponding to 5 training users. We attribute such unexpected changes in FPRs

to spurious popularity counts. Second, we discuss performance for 3 person monitoring scenarios,

i.e. those corresponding to the category sets {1,4,6} and {2,4,6}. Figures 4.13(a)-4.13(c) show TagSee’s

average performance for different numbers of training users and values of kcw. Again, we observe

an increasing trend in TPRs; however, the overall TPRs achieved are lower than in the 2 person

monitoring scenarios.

Figure 4.14: Effect of impact width kcw on TagSee’s performance (TPRs, FPRs and MRs) for 8 different multi-person scenarios, i.e. item category sets {1,3}, {1,4}, {1,5}, {1,6}, {3,6}, {4,6}, {1,4,6}, {2,4,6}, 5 training users used. Panels: (a) Per scenario error rates for kcw = 4; (b) Per scenario error rates for kcw = 6; (c) Per scenario error rates for kcw = 8. Axes: rates (TPRs, FPRs and MRs) vs. multi-person monitoring scenario (1-8).

4.8.3.2 Eﬀect of impact width kcw

Figures 4.14(a)-4.14(c) show performance for the multi-person monitoring scenarios

corresponding to all tested item category sets. By closely observing Figures 4.12(a)-4.12(c) and

4.14(a)-4.14(c), we can see that although TPRs increase with kcw, FPRs increase as well.

For both 2 person (scenarios 1-6 in Figures 4.14(a)-4.14(c)) and 3 person (scenarios 7-8) cases,

TPRs increase with kcw in almost every tested scenario, but so do FPRs. This happens

because, in multi-person scenarios, the RSS values of the tags deployed around the item

categories in between two nearby customers are more strongly affected,

which can fool TagSee’s multi-person imaging algorithm into wrongly incrementing the popularity

counts of those categories. Although the TPRs achieved for kcw = 4 and 6 are often lower than for

kcw = 8, the FPRs corresponding to those cases are considerably lower.

4.9 Discussions

TagSee is an early step towards monitoring customers’ browsing behavior using COTS RFID

devices. There is obviously room for continued research in various perspectives. In this section, we

provide commentary on the limitations of our work and discuss avenues of future research.

DNN Architecture. The number of neurons per layer, number of layers and the types of layers

are the primary hyper parameters of our DNN architecture. The problem of ﬁnding the correct

hyper parameters for a neural network is a research problem in itself [17, 37, 84]. However, unlike

standard practice in deep learning where many DNN architectures are randomly tried, the design

of our DNN architecture is grounded on an analytical RFID imaging model that we derive in §4.5.

From this model, we derive useful insights based on which we set those hyper parameters of our

DNN architecture (§4.6). Our results show that our model driven architecture achieves reasonably

good accuracies and generalizes well for unseen data. Other hyper parameters like dropout rate,

learning rate, and regularization need to be estimated through trial and error, or a grid search.

However, this search can be done before the model is deployed, so its cost does not aﬀect the

runtime of TagSee. Our DNN architecture can be generalized to tag matrices of diﬀerent sizes and

tag density (as discussed in §4.6) by changing the parameters kx, ky, px, py, and K. However, its

evaluation is out of the scope of this paper.
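The grid search mentioned above for the remaining hyper parameters can be sketched as follows. This is only a minimal illustration, not TagSee's actual tuning code: the parameter names, the candidate values, and the scoring function `toy_validation_accuracy` (a stand-in for training the DNN and measuring accuracy on held-out data) are all assumptions made for the example.

```python
import itertools


def grid_search(evaluate, grid):
    """Try every combination in `grid`; return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_accuracy(params):
    """Hypothetical stand-in for 'train the DNN, score it on held-out data'."""
    return (1.0
            - abs(params["dropout"] - 0.3)
            - 50.0 * abs(params["learning_rate"] - 1e-3)
            - 10.0 * params["l2"])


# Candidate values are illustrative only.
grid = {
    "dropout": [0.1, 0.3, 0.5],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "l2": [0.0, 1e-4, 1e-3],
}
best_params, best_score = grid_search(toy_validation_accuracy, grid)
print(best_params, best_score)
```

Because every combination is evaluated offline, the cost grows with the product of the candidate list lengths, but none of it is paid at runtime.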

Reading Rate. Figure 4.15 shows the impact of reading rate (normalized) on the Miss Rates and

False Positive Rates in a 3-person imaging scenario. The MRs and FPRs were obtained for 50

diﬀerent experiments that we ran for every plotted reading rate. We can observe that MRs and FPRs

increase as reading rate decreases and vice versa. This is because imaging depends on variations

in RSS signal from all the tags being aﬀected by an obstruction. Receiving signal from only a few

tags leads to imaging inaccuracies resulting in higher MRs and FPRs. However, note that our goal

is to monitor the ‘browsing’ behavior of customers (i.e. when they stop to look at an item category

without touching any items), not to continuously track the location of customers as they move about

in the store. Browsing is a slow activity as it assumes that when a customer is browsing some item

category, they usually spend a few seconds (e.g. 3-4 seconds) to browse the category. Because

average tag read rate of our current system is approximately 475 reads/second, and there are only

116 tags in our current deployment, our system can obtain enough readings to construct reasonably

accurate images of the users standing in front of diﬀerent item categories. TagSee can work well

in small stores (e.g. a small shoe store) where the number of shelves is small and the number

of deployed tags is on the order of reading rate. However, for a larger store, we can divide the store

into multiple smaller regions and then deploy separate readers to monitor customer activity in each

region to ensure that each tag is read frequently. In each region, the transmit power of antennas and

the frequencies of operation can be set such that the inter-region interference is minimized.
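The reading-rate argument above can be checked with a short back-of-the-envelope calculation. The sketch assumes that reads are spread evenly across all tags, which is a simplification (real inventory rounds do not read every tag equally often); the function name is hypothetical, while the numbers 475 reads/second, 116 tags, and 3-4 seconds come from the text.

```python
def reads_per_tag(read_rate_per_s, num_tags, dwell_s):
    """Average reads per tag during one browsing event (even-spread assumption)."""
    return read_rate_per_s * dwell_s / num_tags


for dwell_s in (3, 4):
    print(dwell_s, "s ->", round(reads_per_tag(475, 116, dwell_s), 1), "reads per tag")
```

Under this assumption each tag is read roughly 12 to 16 times while a customer browses for 3 to 4 seconds, and the per-tag rate falls in proportion as more tags share one reader, which is why larger stores would be split into separately monitored regions.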

Figure 4.15: Impact of reading rate on FPRs and MRs. Axes: percentage (%) of Miss Rate (MR) and False Positive Rate (FPR) vs. normalized reading rate (%), plotted from 1.25% to 100%.

Density of tags and image resolution: The parameters kx, ky, px, py, and K collectively control

the size and density of a deployed mesh of tags. Imaging resolution will naturally be higher if

tag density in a mesh is higher (i.e. inter-tag distance is small). For example, deploying tags more

densely along X-axis can help better resolve two customers standing close to each other. We leave

the evaluation of such dynamics between tag density and imaging resolution as part of future work.

Collection of training data: Due to time and scheduling constraints, we were only able to collect

data from a limited number of volunteers for testing. Although the data we collected using the current

setup is enough to test the basic working of TagSee, it is only a first step. In a real-life

deployment scenario, for example in a shoe store, an automated camera triggered labeling system

can be developed to calibrate TagSee over a period of a few weeks. Such an automated calibration

system will help generate a bigger and more diverse dataset that can be used to train a more robust

DNN for TagSee.

Practical real-life deployments: TagSee’s imaging scheme is based on the obstruction of LOS

paths, or, more precisely, the Fresnel zones [54] between tags and reader antennas. In our current

setup, the tags and antennas are placed in front of each other and at a certain height above the

ground such that an approximate LOS is established between them. However, LOS can also be

established in real-life in-store deployments, by hanging the antennas at an angle from the roof

using ceiling mounts and attaching the tags to lower racks of the shelves as well as on the ﬂoor area

near the shelves. In this way, when customers come near a shelf to browse an item category, they

would obstruct the Fresnel zones between the tags and their respective reader antennas, based on

which TagSee will try to determine the popularity of that item category.

4.10 Conclusions

In this work, we propose, implement, and evaluate TagSee, which is the ﬁrst monostatic RFIDs

based multi-person imaging scheme, which can be used to monitor browsing activity of customers

near diﬀerent display items in places such as physical retail stores. To achieve this, we propose a

deep neural networks (DNNs) based RFID imaging approach, which robustly images the browsing

activity of customers in front of the shelves with high accuracy. Our DNNs based imaging approach

is driven by an analytical model, where we ﬁrst mathematically formulate the problem of imaging

humans using monostatic RFID devices and derive an approximate analytical imaging model

that correlates the variations caused by human obstructions in the RFID signals. Afterwards, we

use that model to design our DNN’s architecture. Finally, based on our proposed DNNs based

imaging approach, we develop a technique which can track activity of multiple customers, showing

interest in multiple diﬀerent item categories, simultaneously. The key contribution of this work is

in demonstrating the possibility of eﬀective imaging of the browsing activity of multiple customers

using existing RFID devices and protocols via robust, analytical model-driven deep learning based

RFID imaging, which works even for scenarios where there is no interaction between customers

and the display items. To the best of our knowledge, there is no prior work that can monitor

customer browsing activity without using a camera or the requirement of physically touching retail

items tagged with RFID tags. We believe our proposed RFIDs based multi-human activity tracking

scheme for physical shopping environments will be useful for manufacturers and physical retail

stores, and will help to shorten the gap between online and physical shopping.

CHAPTER 5

FINE-GRAINED VIBRATION BASED SENSING USING A SMARTPHONE

5.1 Introduction

5.1.1 Motivation

Vibration based sensing has been shown to be a low-cost and eﬀective approach to recognizing

diﬀerent surfaces [26, 44, 69, 121]. A useful application of such sensing is symbolic localiza-

tion/tagging, e.g. ﬁguring out whether a user’s device is in their hand, pocket, or at their bedroom

table. The key intuition is that diﬀerent surfaces respond to the same vibration diﬀerently because

the surfaces may be made of diﬀerent materials, and even if they are made of the same material, they

may have diﬀerent shapes and sizes. Even if two surfaces are made of the same material and have

the same shape and size, they may have diﬀerent objects placed on them, such that those surfaces

still exhibit diﬀerent vibration patterns because the objects placed on them may respond to the same

vibration diﬀerently. Such symbolic tagging of locations can provide us with indirect information

about user activities and intentions without any dedicated infrastructure, based on which we can

enable useful services such as context aware notiﬁcations/alarms.

A robust and practical vibration based sensing scheme should satisfy three key requirements.

First, it should work with commercial oﬀ-the-shelf (COTS) smartphones with diﬀerent hardware,

so that it can be easily deployed and widely adopted. Second, it should be able to extract ﬁne-grained

vibration signatures, so that it can accurately diﬀerentiate diﬀerent surfaces. Third, it should be

robust to environmental noise and hardware based irregularities, so that its accuracy stays consistent

across diﬀerent environments and devices.

5.1.2 Limitations of Prior Art

Several vibration based sensing schemes have been proposed in the past to realize diﬀerent kinds

of applications; however, none of them satisﬁes all the above three requirements. Existing vibration

based sensing schemes can be divided into two categories: custom hardware based and COTS

smartphones based. The custom hardware based schemes use separate hardware including a micro-

controller, a vibrator motor, and some piezoelectric (e.g. a microphone) or IMU sensors [69, 71, 75,

76, 121], which gives them ﬁne-grained low level control over diﬀerent physical layer parameters

of their underlying hardware. However, these schemes are incompatible with COTS smartphones

and are not easily generalizable to diﬀerent hardware because most COTS smartphones have

limited sensing capabilities and control over the hardware installed in them. Moreover, the custom

hardware based schemes that use a microphone to sense vibration are prone to short-term and constant

background noises (e.g. intermittent talking, clapping, exhaust fan, etc.) because microphones not

only capture the sounds created by vibration but also other interfering sounds present in the

environment. The COTS smartphones based schemes rely on motion based features extracted from

built-in IMU sensors [26,44]. However, these features are very coarse-grained because the sampling

frequencies of the IMU sensors are low, which naturally leads to low classiﬁcation accuracies.

Moreover, these schemes can only broadly diﬀerentiate between diﬀerent types of surfaces (e.g.

wood and plastic) and cannot diﬀerentiate between similar surfaces (e.g. two diﬀerent wooden

tables in Figs. 5.1(b) and 5.1(e)). Also, IMU readings get signiﬁcantly aﬀected by the smartphone’s

own motion in space (e.g. when a user moves his hand while holding the smartphone).

5.1.3 Proposed Approach

In this paper, we propose VibroTag, a vibration based sensing approach that can robustly

recognize diﬀerent surfaces based on their unique vibration signatures. Compared to previous

work, VibroTag is robust and practical because it works with COTS smartphones, it can extract

ﬁne-grained features representative of diﬀerent surfaces, and it is robust to hardware irregularities

and background environmental noises. The key intuition is that as the vibrating mass inside a

smartphone’s vibrator motor repeatedly moves to and fro, the vibrating mass causes the whole

smartphone structure and the hardware inside it to vibrate in a peculiar pattern, which depends

upon the vibration response (or absorption properties) of the surface that smartphone is placed

on. These vibrations produce peculiar sound waves that VibroTag detects using the smartphone’s

microphone. Figure 5.1 shows the unique vibration signatures that VibroTag extracted for 6 diﬀerent

surfaces. We observe that vibration signatures of even two similar surfaces, i.e. Bed and Sofa, are

quite diﬀerent from each other.

Figure 5.1: Experimental scenarios and their corresponding extracted acoustic time-series based vibration signatures. Panels: (a) Bed; (b) Bed-Table; (c) Kitchen; (d) Sofa; (e) Work-Table; (f) Restroom; (g) Bed; (h) Bed-Table; (i) Kitchen; (j) Sofa; (k) Work-Table; (l) Restroom.

To make VibroTag easily scalable and compatible with COTS smartphones, we design VibroTag’s

signal processing pipeline such that it relies only on built-in vibration motors and microphone

for sensing, and it is applicable to diﬀerent phones with diﬀerent hardware. To reliably extract ﬁne-

grained vibration signatures from the sound signals recorded during vibration, we propose a novel

time-series based approach, which is robust to hardware irregularities and environmental noise.

The key idea behind our vibration signature extraction approach is that even if there are irreg-

ularities in vibration frequencies due to hardware imperfections, the time-series patterns created

during diﬀerent vibration cycles are very similar. When a phone vibrates during a speciﬁc period

of time, such as 3 seconds, multiple such patterns occur and get distributed all over the time-series

of recorded sound signals. VibroTag ﬁnds multiple of these patterns in randomly selected intervals

of the time-series, and then combines them into a single time-series feature that ensures consistency

even if there are irregularities in the occurrence of those patterns and/or if the environment is

slightly noisy. Afterwards, it uses these features to diﬀerentiate between surfaces.

5.1.4 Technical Challenges and Our Solutions

The ﬁrst technical challenge is to reliably extract ﬁne-grained vibration signatures. Based on

our experiments on two diﬀerent phones, we observed that the frequency response of a surface

to vibrations introduced by vibration motors installed in COTS smartphones exhibits repeated

irregularities, which makes extraction of reliable features a challenging task. This happens because

the phone, its vibrator motor, and the rest of its hardware vibrate at irregular frequencies during

every experiment, which occurs due to hardware imperfections. This behavior is random and

uncontrollable, and therefore, is bound to create signiﬁcant variations within features extracted

at the same location, which will lead to classiﬁcation inaccuracy. The existing techniques that

use a microphone to extract sound (sampled on the order of kHz) based straightforward frequency

domain features (e.g. vibration sound spectrum [69]) are not only considerably susceptible to such

hardware based irregularities, but also to short-term and constant environmental noises, where even

intermittent talking or noise from a restroom’s exhaust fan can signiﬁcantly aﬀect their performance.

To address this challenge, we take a time-series based vibration signature extraction approach. First,

we diﬀerentiate the recorded sound signals and take their root mean square (RMS) envelope, which

removes most of the unrelated constant and higher frequency background noise. Second, we develop

a specialized peak-detection based algorithm to extract unique time-series patterns corresponding to

vibrations from the RMS envelope, and then use them as vibration signatures to represent diﬀerent

surfaces. Our extraction algorithm is based on the observation that even if there are irregularities

in vibration frequencies due to hardware imperfections, the time-series patterns created during

different vibration cycles are very similar. When a phone vibrates during a specific period of time,

multiple such patterns occur all over the time-series of recorded sound signals, which can be

successfully extracted by our algorithm. To make the vibration signatures robust to environmental

noise, VibroTag extracts numerous such vibration patterns across time during an experiment and

combines them by taking their median.
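The extraction steps described above (differentiate, take the RMS envelope, cut out per-cycle patterns around detected peaks, combine them with a median) can be sketched as follows. This is a simplified illustration rather than VibroTag's actual algorithm: the peak detector is a plain local-maximum rule standing in for the specialized one, resampling every cycle to a common length before taking the median is an assumption made so that the patterns can be combined point by point, and all function names, window sizes, and the synthetic test signal are hypothetical.

```python
import math
import statistics


def differentiate(signal):
    """First difference of the recorded samples."""
    return [b - a for a, b in zip(signal, signal[1:])]


def rms_envelope(signal, window):
    """Sliding-window root mean square; `window` is in samples."""
    envelope = []
    for start in range(len(signal) - window + 1):
        chunk = signal[start:start + window]
        envelope.append(math.sqrt(sum(v * v for v in chunk) / window))
    return envelope


def find_peaks(envelope, min_distance):
    """Local maxima above the mean level, at least `min_distance` apart."""
    threshold = sum(envelope) / len(envelope)
    peaks = []
    for i in range(1, len(envelope) - 1):
        is_local_max = envelope[i - 1] < envelope[i] >= envelope[i + 1]
        far_enough = not peaks or i - peaks[-1] >= min_distance
        if is_local_max and envelope[i] > threshold and far_enough:
            peaks.append(i)
    return peaks


def resample(segment, length):
    """Linearly interpolate `segment` onto `length` evenly spaced points."""
    last = len(segment) - 1
    out = []
    for k in range(length):
        pos = k * last / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(segment[lo] * (1 - frac) + segment[hi] * frac)
    return out


def extract_signature(sound, window=8, min_distance=20, length=32):
    """Median of the peak-to-peak envelope patterns found in `sound`."""
    envelope = rms_envelope(differentiate(sound), window)
    peaks = find_peaks(envelope, min_distance)
    cycles = [resample(envelope[a:b], length) for a, b in zip(peaks, peaks[1:])]
    if not cycles:
        raise ValueError("fewer than two vibration cycles detected")
    # Pointwise median across cycles suppresses patterns hit by transient noise.
    return [statistics.median(column) for column in zip(*cycles)]


# Synthetic stand-in for a recording: a fast carrier whose amplitude
# repeats every 50 samples, mimicking one pattern per vibration cycle.
sound = [(1 + 0.8 * math.cos(2 * math.pi * i / 50)) * math.sin(math.pi * i / 2)
         for i in range(400)]
signature = extract_signature(sound)
print(len(signature))
```

Taking the median rather than the mean means that a few cycles corrupted by short-term noise (e.g. a clap) do not shift the resulting signature much, as long as most cycles are clean.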

The second technical challenge is to compare vibration signatures of any two surfaces. The

midpoints of extracted vibration signatures of the same surface rarely align with each other because

the start and end points determined by extraction algorithm are never perfectly aligned. Moreover,

the lengths of diﬀerent vibration signatures also diﬀer slightly because the duration of vibration

cycle can often be a little diﬀerent due to hardware irregularities. Consequently, the midpoints and

lengths of vibration signatures do not match either. Another issue is that the shapes of different

vibration signatures of the same surface are often distorted versions of each other, which occurs due

to hardware based irregularities in the vibration mechanism. Therefore, two vibration signatures

cannot be compared using standard measures like correlation coeﬃcient or Euclidean distance.

To address this challenge, we use Dynamic Time Warping (DTW) to quantify the distance

between any two vibration signatures. DTW can ﬁnd the minimum distance alignment between two

waveforms of diﬀerent lengths. For classiﬁcation, we employ a Nearest-Neighbor (NN) classiﬁer

with DTW distance as the comparison metric between diﬀerent vibration signatures.
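A minimal sketch of this classification step is shown below: a textbook dynamic-programming DTW with absolute difference as the local cost, followed by a 1-nearest-neighbor lookup. The local cost, the absence of a warping-window constraint, the function names, and the toy signatures are assumptions for illustration; the source does not specify these details.

```python
def dtw_distance(a, b):
    """Minimum cumulative |a[i] - b[j]| cost over all monotonic alignments."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a[i-1] repeats
                                    cost[i][j - 1],      # b[j-1] repeats
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify(signature, labeled_signatures):
    """Return the label of the training signature with the smallest DTW distance."""
    label, _ = min(labeled_signatures,
                   key=lambda item: dtw_distance(signature, item[1]))
    return label


# Toy training set: (label, signature) pairs with made-up values.
training = [
    ("Bed", [0, 1, 2, 1, 0]),
    ("Sofa", [0, 3, 6, 3, 0]),
]
# A time-stretched, misaligned version of the "Bed" pattern.
query = [0, 0, 1, 2, 2, 1, 0]
print(classify(query, training))
```

The query has a different length and a shifted midpoint relative to the stored "Bed" signature, so a point-by-point Euclidean comparison is not even defined for it, while DTW aligns the two at zero cost.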

5.1.5 Key Novelty and Advantages

The key technical novelty of this paper is in proposing the first fine-grained vibration based

sensing scheme that can recognize diﬀerent surfaces using the vibration mechanism and microphone

of a single COTS smartphone. Furthermore, we propose a novel signal processing technique to

extract ﬁne-grained vibration signatures that are robust to hardware irregularities and background

environmental noises. The key insight is that even if there are irregularities in vibration frequencies

due to hardware imperfections, the time-series patterns created during diﬀerent vibration cycles

are very similar. VibroTag ﬁnds many such patterns in the sound signals recorded during vibration,

and combines them into single consistent vibration signatures. Compared to previous schemes,

VibroTag works with COTS smartphones, it can extract ﬁne-grained features representative of

diﬀerent surfaces, and it is robust to hardware irregularities and background environmental noises.

5.1.6 Summary of Experimental Results

We implemented VibroTag on two Android based smartphones, i.e. Nexus 4 and OnePlus

2, for which we developed an application that generates vibrations and samples sound signals

simultaneously. We tested our system with 4 different individuals, from whom we collected data for

5 - 20 days. We show that VibroTag achieves an average accuracy of 86.55% while recognizing 24

diﬀerent surfaces, with as few as 15 training samples per surface. Moreover, VibroTag maintains

an average accuracy of up to 85% without any re-training requirements after 3-4 days of training.

We also implement the state-of-the-art IMUs based vibration sensing scheme for single COTS

smartphones proposed in [26], and compare its surface recognition accuracy with VibroTag. We

show that VibroTag achieves more than 37% higher accuracy when compared to the IMUs based

scheme, while recognizing the 24 diﬀerent surfaces.

5.2 Related Work

Existing work related to our work consists of some vibration based sensing schemes [26, 44, 69,

71, 75, 76, 121, 136] and sound based symbolic localization schemes [15, 135].

Vibration Based Sensing: Vibration based sensing schemes leverage the response of diﬀerent

surfaces to a speciﬁc vibration pattern to recognize those surfaces. Existing vibration based sensing

schemes can be divided into two categories, i.e. custom hardware based, and COTS smartphones

based. The custom hardware based schemes use separate customized hardware made using a set

of micro-controller, vibrator motor, and piezoelectric (e.g. microphones) or IMU sensors [69,

71, 75, 76, 121], so that they have ﬁne-grained low level control over diﬀerent physical layer

parameters of their hardware. ViBand uses variations introduced due to vibrations produced by

different objects to identify those objects, e.g. an electric toothbrush [71]. VibKeyboard [75] and

VibSense [76] develop a virtual keyboard based on the idea that the impact of a touch on a surface

such as a table or door causes a shockwave to be transmitted through the material that can be

passively detected with accelerometers or more sensitive piezo-vibration sensors. Kunze et al. [69]

develop customized hardware to recognize surfaces through active sampling of acceleration and

sound signatures. However, the above schemes are incompatible with COTS smartphones and

are not easily generalizable to diﬀerent hardware because most COTS smartphones have limited

sensing capabilities and control over the hardware installed in them. The COTS smartphones

based schemes rely on motion based features extracted from built-in IMU sensors [26, 44]. Cho et

al. [26] and Shafer et al. [121] use the built-in vibrator and accelerometer of a COTS smartphone to

recognize surfaces. Griﬃn et al. use vibration detected by an acceleration signal to determine if

a phone is in the user’s hand [44]. However, the features used by these schemes are very coarse-

grained because the sampling frequencies of IMU sensors are low, which naturally leads to low

classiﬁcation accuracies. Finally, all the previous schemes that use microphone based approaches

to sense vibrations are prone to short-term and constant background environmental noises (e.g.

intermittent talking, clapping, exhaust fan, dripping water, etc.). Compared to all the above schemes,

VibroTag is robust and easily deployable because it works with COTS smartphones, it can extract

ﬁne-grained features representative of diﬀerent surfaces/locations, and it is robust to hardware

irregularities and background environmental noises.

Sound Based Symbolic Localization: Sound based symbolic localization systems leverage the

propagation of the sound generated using speakers of a device, such as a smartphone, to determine

the symbolic location of that device (e.g. whether the device is in the user’s kitchen or at his

bedroom table). SurroundSense uses sensor data from a microphone, a light sensor, the wireless

radio, and passive accelerometer data for localization [15]. However, their technique can only be

used for very coarse-grained localization (e.g. room level) and not for ﬁner-grained localization

(e.g. whether the phone is on user’s study table or his bedroom table). EchoTag generates ultrasound

signals and then uses the reﬂections from the environment to achieve centimeter level tagging [135].

However, their work requires strict millimeter level marking of the tagged locations because the

ultrasound signals based signatures that they use are highly location dependent, where even small

variations in the phone’s position lead to significant localization errors. This makes their scheme

unsuitable for symbolic localization, and also puts signiﬁcant calibration eﬀort on the user end. In

contrast to above schemes, VibroTag uses vibration instead of speaker generated sound signals for

such symbolic localization. Moreover, VibroTag achieves ﬁner-grained localization, and does not

require strict marking of the tagged locations.

5.3 Understanding Vibrations

5.3.1 Vibrator Motors in Smartphones

Electric vibrator motors generate vibrations by periodically moving an unbalanced mass around

a center position using the principles of electromagnetic induction. The vibrator motors used in

today’s smartphones are often known as coin-type vibration motors due to their coin-like shapes

and sizes. There are two types of coin-type vibrator motors that are widely adopted in smartphones:

(i) Linear Resonant Actuator (LRA) based (e.g. used in Nexus 4) and (ii) Eccentric Rotating Mass

(ERM) based (e.g. used in OnePlus 2). Figure 5.2 shows the internals of ERM and LRA based

vibration motors. ERM based motors use a DC motor to rotate an eccentric mass around an axis. As

the mass is not symmetric with respect to its axis of rotation, it causes the device to vibrate during

the motion. Both the amplitude and frequency of vibration depend on the rotational speed of the

motor, which can in turn be controlled through an input DC voltage. With increasing input voltages,


both amplitude and frequency increase almost linearly and can be measured by an accelerometer.

In LRA based motors, vibration is generated by the linear movement of a magnetic mass suspended

near a coil, called the “voice coil”. When an AC current is applied to the motor, the coil behaves

like a magnet (due to the generated electromagnetic ﬁeld) and causes the mass to be attracted or

repelled, depending on the direction of the current. This generates vibration at the same frequency

as the input AC signal, while the amplitude of vibration is dictated by the signal’s peak-to-peak

voltage. Thus, LRAs oﬀer control on both the magnitude and frequency of vibration.

Figure 5.2: ERM and LRA based vibration motors [128]

5.3.2 Physics of Surface Response to Vibrations

Sound is essentially pressure waves created by vibrating matter. These waves are longitudinal,

i.e. they oscillate along the axis of travel, where the oscillation is composed of compression and

rarefaction of molecules in the medium (e.g. air). For example, human speech is based on vibrations

created inside our vocal cords, and audio speakers generate sound by translating an electrical signal

into physical vibrations via mechanical excitation of a diaphragm using an electromagnet.

VibroTag is based on the intuition that different surfaces exhibit different responses to vibrations

introduced by a smartphone. When a smartphone vibrates, it mechanically excites not only its own

structure and the hardware inside it, but also the surface on which it is placed. On one hand, some

surfaces tend to absorb most of the vibration energy (e.g. sofas), while on the other hand, some

surfaces may exhibit a resonant response where they start vibrating in sync with the smartphone

(e.g. the smartphone’s surface vibrates in sync with the vibrator motor inside). Moreover, the effect

of these vibrations can reach different objects placed nearby, which may get mechanically excited

as well (especially the lighter objects), thereby leading to more peculiar sounds. As different

surfaces respond to the vibrations diﬀerently (in terms of their absorption/dampening eﬀect on

smartphone’s movements), and as diﬀerent surfaces often have diﬀerent objects placed on them,

which also respond to those vibrations diﬀerently, pressure waves peculiar to those surfaces are

created during the vibration, which we can sense using a piezoelectric device (e.g. a built-in

microphone) and then leverage to diﬀerentiate those surfaces.

5.4 Feature Extraction

To diﬀerentiate between diﬀerent locations, we need to extract features that can uniquely

and consistently represent those locations. In VibroTag, a smartphone is vibrated for about three

seconds while the surface response to the vibration is recorded simultaneously via the phone’s

built-in microphone. Sounds produced during vibration are sampled at a fixed rate of Fs = 44.1 kHz. The

recorded sound is analyzed in both frequency and time domains to extract robust surface/location

specific vibration signatures. There are two key challenges in making feature extraction for VibroTag

robust. The first challenge is to reduce the impact of background noises (such as those created

by fans and short-term human speech). The second challenge is to accommodate smartphone

hardware imperfections (mainly in microphones and vibrator motors) that degrade the quality of

the signals collected when a smartphone vibrates.

5.4.1 Robustness to Background Noise

To understand the challenge posed by background noise, we use Fast Fourier Transform (FFT)

based Power Spectral Density (PSD), which is one of the mainstream frequency based feature

extraction techniques for acoustic sensing. Figures 5.3(a), 5.3(b), and 5.3(c) show the FFT coefficients

for both lower and higher frequency ranges corresponding to our experiments conducted at the

same location on a wooden chair’s cushion for three scenarios: (a) no noise, (b) intermittent human

speech, and (c) clapping, respectively. We can observe that the FFT features are signiﬁcantly aﬀected


by the background noises because the frequencies produced by these noises directly interfere with

the frequency bands for vibration based sensing. It also shows that mainstream techniques such

as FFT or PSD are unsuitable for vibration based sensing on COTS smartphones when there are

background noise sources present in the environment. In this work, we propose two schemes to

reduce the impact of constant and intermittent short-term background noises, respectively.

Figure 5.3: Impact of background noises on features extracted by traditional techniques and

VibroTag. Panels: (a) No noise, (b) Intermittent Talking, (c) Clapping (FFT magnitude vs.

frequency in Hz, lower and higher frequency ranges); (d) VibroTag signatures for (a)-(c) (value

vs. temporal units, sample #).

To reduce the impact of constant background noises in VibroTag, we take the ﬁrst order

diﬀerence of the recorded sound signals and then take their root mean squared (RMS) envelope.

We choose RMS envelope for our analysis as it gives us a measure of the power of the vibration

signals, while producing a waveform that is easy to analyze. Moreover, higher frequency noisy

variations are averaged out in the envelope signal, while it still keeps most of the useful vibration


response related information intact. We take the RMS envelope of the signals over a sliding window

of N samples, where N = 15 audio samples in our current implementation of VibroTag. In the rest of

this work, when we mention sound signals, we mean the ﬁrst order diﬀerence of the RMS envelope

of those sound signals.
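
The preprocessing described above can be sketched as follows. This is a minimal illustration, not VibroTag's actual implementation: it applies the first order difference before the RMS envelope (the order given in the first sentence of the description), and the trailing-window alignment, the handling of the first N-1 samples, and the function names are assumptions.

```python
import math

def first_order_difference(samples):
    """Return x[n+1] - x[n] for every pair of consecutive samples."""
    return [b - a for a, b in zip(samples, samples[1:])]

def rms_envelope(samples, window=15):
    """RMS over a trailing sliding window of `window` samples.

    The first window-1 outputs use the shorter prefix that is available
    (an assumption; the text does not specify the edge handling).
    """
    squares = [s * s for s in samples]
    envelope = []
    acc = 0.0
    for i, sq in enumerate(squares):
        acc += sq
        if i >= window:
            acc -= squares[i - window]  # drop the sample leaving the window
        count = min(i + 1, window)
        envelope.append(math.sqrt(max(acc, 0.0) / count))
    return envelope

def preprocess(samples, window=15):
    """First order difference followed by the RMS envelope (N = 15 in the text)."""
    return rms_envelope(first_order_difference(samples), window)
```

Differencing removes any constant offset in the recording, and the envelope smooths variations that are faster than the window length.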

To reduce the impact of intermittent short-term noises (similar to Figures 5.3(b) and 5.3(c)),

we vibrate the phone for at least 3 seconds, and extract multiple vibration patterns across time from

the processed sound signals; then, we combine these vibration patterns to get a single consistent

vibration signature. We will discuss how we extract such signatures in Section 5.4.3. Figure 5.3(d)

shows the signatures extracted by VibroTag for the experiments corresponding to Figures 5.3(a),

5.3(b), 5.3(c), from which we observe that our signatures are consistent and almost identical even

though there are intermittent short-term noises.

5.4.2 Robustness to Hardware Imperfections

To understand the challenge posed by smartphone hardware imperfections, we use PSD and

Short Time Fourier Transform (STFT). Figures 5.4(a) and 5.4(b) show the PSD of the recorded

unprocessed sound signals for multiple diﬀerent experiments, which we performed by placing a

smartphone at the same location on a sofa and a bed, respectively. Each ﬁgure shows the PSD

coeﬃcients over two diﬀerent frequency ranges. We can observe that PSDs for both the sofa and

the bed are not consistent across repeated experiments, even when the smartphone’s location and

the environmental scenario while performing the experiments remained unchanged. This occurs

because the smartphone, its vibrator motor, and the rest of its hardware vibrate at slightly diﬀerent

frequencies in each diﬀerent experiment. This behavior is random and uncontrollable, and therefore,

is bound to cause intra-class (i.e. within samples of the same class) variations, which will lead to

classiﬁcation errors. Moreover, due to this inconsistency, it often happens that the variations due to

vibration on two different surfaces occur in a similar set of frequencies (as shown by some samples in

Figs. 5.4(a) and 5.4(b)), which further makes the use of such frequency domain features infeasible

as they cannot uniquely represent diﬀerent surfaces. Figs. 5.4(c) and 5.4(d) show the STFT of the

Figure 5.4: Impact of hardware imperfection on features extracted by traditional techniques PSD

and STFT. Panels: (a) Home Bed and (b) Home Living Room Sofa (FFT magnitude vs. frequency in Hz);

(c) Home Bed and (d) Home Living Room Sofa (frequency in kHz vs. time in secs).

unprocessed sound signals from one experiment corresponding to each of the two surfaces. Again,

we observe that the STFTs of both surfaces are very similar, and variations often occur in similar

frequency ranges, which makes it harder to differentiate between those surfaces. Interestingly, we also

observe some patterns repeating over time; however, the time period of their repetition is not

consistent.

We hypothesize that even if there are irregularities in the repetition frequency of such vibration

patterns, the patterns themselves must be very similar. As we discussed in §5.3, smartphone

vibrations are generated because of the to and fro motion of a mass inside its vibration motor.

The vibration motor tries to move the smartphone (and the hardware inside) at its own vibration

frequency (which is often irregular due to hardware imperfections). However, the smartphone’s motion

Figure 5.5: Repetitive patterns appearing in the processed sound signals corresponding to vibration.

Panels: (a) Home Bed, (b) Home Living Room Sofa (value vs. time).

is constrained due to its own weight/structure and the absorption properties of the surface that it

is placed on. This whole process during vibration gives rise to peculiar pressure waves, which can

be sensed by the built-in microphone. Moreover, as the mass inside the motor repeatedly moves

to and fro, it will give rise to similar pressure waves in every cycle of its “irregular” vibration

period, which will be reflected in the time-series of the sound signals. These intuitions form the basis of our

time-series based analysis of the surfaces’ vibration response (§5.4.3).

In this work, we propose to address the issues due to smartphone hardware imperfections by

extracting time-series based vibration signatures from the processed sound signals. Figures 5.5(a)

and 5.5(b) show time-series of the sound signals (i.e. ﬁrst order diﬀerence of the RMS envelope)

corresponding to one experiment from each of the two diﬀerent surfaces (i.e. Bed and Sofa), whose

PSDs are shown in Figs. 5.4(a) and 5.4(b) and whose STFTs are shown in Figs. 5.4(c) and 5.4(d), respectively.

The two time-series correspond to a window of 4800 sound samples (i.e. ∼0.1088 seconds for
Fs = 44.1 kHz). We can easily observe distinguishing patterns repeating in both time-series, which

repeat approximately with the frequency of the vibrating mass in the smartphone’s vibration motor.

We also observe that the patterns in both scenarios are consistent across time, and that the patterns

in one scenario are different from those in the other scenario.


5.4.3 Extraction of Vibration Signature

To robustly diﬀerentiate surfaces, we need to extract vibration patterns from time-series of

the processed sound signals and then use those patterns to obtain consistent vibration signatures.

However, we face multiple challenges. The ﬁrst challenge is that intermittent short-term noises in

real-life scenarios are uncontrollable, and therefore, can aﬀect any part of the time-series. VibroTag

needs to extract the vibration patterns that are representative of the whole time-series, i.e. the

vibration patterns extracted from one segment of the time-series (i.e. a time window) should repeat

in other segments, and therefore, truly represent the surface’s vibration response. A naive approach

is to extract all vibration patterns in the recorded signals and then combine them (e.g. by taking

their average), which is computationally expensive. To address this challenge, we take a randomized

approach, where we ﬁrst divide the whole time-series of the sound signal into equally sized time

windows, and then randomly select multiple of those time windows to extract vibration patterns

from. Each window is of size S, where S = 4800 sound samples in our current implementation of

VibroTag. Moreover, the windows are selected without replacement, i.e. once selected, they are not

selected again. VibroTag keeps randomly selecting new time windows until M vibration patterns are

extracted (M=100 in our implementation). In real-life scenarios, the number of iterations required

to extract M vibration patterns of a surface can be used to tell the user whether their environment

is too noisy to extract a valid vibration signature or not. For example, the average (taken over 20

diﬀerent samples) number of iterations it took for convergence when loud music (i.e. high variable

noise) was played on a laptop in the background was ∼1351, for medium noise/volume level the
number was ∼495, whereas for no variable noise scenarios it was ∼136. If our algorithm cannot
ﬁnd enough vibration patterns and runs out of possible time windows to search for patterns, it will

not converge. However, we did not experience any such scenarios during our testing.
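
The randomized selection described above can be sketched as follows. This is an illustration under stated assumptions: the windows are taken as non-overlapping, `extract_patterns` is a hypothetical callback standing in for the peak detection based extraction of Section 5.4.3.1, and the returned iteration count simply counts visited windows.

```python
import random

def collect_patterns(series, extract_patterns, window_size=4800, target=100, rng=None):
    """Visit equally sized windows in random order, without replacement,
    until `target` patterns are collected or no windows remain.

    Returns (patterns, iterations, converged).
    """
    rng = rng or random.Random()
    starts = list(range(0, len(series) - window_size + 1, window_size))
    rng.shuffle(starts)  # a shuffled visit order is sampling without replacement
    patterns = []
    iterations = 0
    for start in starts:
        iterations += 1
        patterns.extend(extract_patterns(series[start:start + window_size]))
        if len(patterns) >= target:
            return patterns[:target], iterations, True
    # Ran out of windows before reaching the target: no convergence.
    return patterns, iterations, False
```

The iteration count can serve as the noise indicator mentioned in the text: the more windows that must be visited before M patterns are found, the noisier the recording.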

The second challenge is to extract the vibration patterns from diﬀerent randomly selected

time windows by localizing their place of occurrence in those time windows. However, because

of the inconsistencies in the vibration behavior of the smartphone due to hardware imperfections,

we cannot know the frequency of repetition of the vibration patterns, which makes it harder to


localize the place of occurrence of such patterns. To address this challenge, we develop a peak

detection based algorithm to extract vibration patterns. Our algorithm is based on the observation

that every vibration pattern has a peak value that occurs consistently at around the same part of

every vibration cycle (as evident in Figs. 5.5(a) and 5.5(b)). Based on this algorithm, VibroTag

determines the locations of multiple such peaks in each of the randomly selected time windows.

Afterwards, VibroTag uses the consecutive peaks detected in each window to extract the vibration

patterns between those peaks.

5.4.3.1 Extraction of Vibration Patterns

There are two key challenges in developing our peak detection based vibration pattern extraction

algorithm. First, based on our experiments, we observe that the scale of variations due to vibration

in the time-series of diﬀerent randomly selected windows varies, even when the same smartphone is

placed at the same location on the same surface; this happens because of inconsistencies in the

vibration process caused by hardware imperfections. Moreover, different surfaces and different smartphones

exhibit different scales of variation due to vibration. This makes the parametrization of our peak

detection algorithm diﬃcult to generalize. To address this challenge, VibroTag performs max-min

normalization on the time-series corresponding to every selected window before feeding it to the

peak detection algorithm. This step ensures that the parametrization of our algorithm can be easily

generalized to diﬀerent time windows and to diﬀerent smartphones and surfaces.
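
A minimal sketch of the max-min normalization step is shown below. The function name and the handling of a constant window (mapped to all zeros) are assumptions made for illustration.

```python
def min_max_normalize(window):
    """Rescale a window so that its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(window), max(window)
    if hi == lo:
        # A flat window carries no shape information (assumed convention).
        return [0.0 for _ in window]
    span = hi - lo
    return [(v - lo) / span for v in window]
```

Because the output depends only on the shape of the window and not on its absolute scale, the same peak detection thresholds can be reused across windows, surfaces and phones.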

The second challenge is to robustly determine the locations of peaks corresponding to diﬀerent

vibration patterns present in a time window. To address this, VibroTag’s peak detection algorithm

determines the location of such peaks based on three key parameters, namely minimum peak

prominence (MINPRO), minimum peak distance (MINDIST), and minimum peak strength (MINSTR).

The prominence of a peak measures how much the peak stands out, due to its height and location,

relative to other peaks around it. We tune MINPRO such that we only detect those peaks which

have a prominence of at least MINPRO. We tune MINDIST based on the fact that the

maximum repetition rate of patterns is approximately $\hat{f}_o + \delta f$, so that redundant peaks are

discarded, where $\hat{f}_o$ is an approximation of the repetition frequency ($f_o$) of vibration

patterns. As we discussed before, the frequency of repetition of vibration patterns in the processed

sound signal is irregular. In order to determine $\hat{f}_o$, VibroTag calculates the PSD of the time-series in

the selected window. Figures 5.6(a) and 5.6(b) show example PSDs corresponding to one of the

randomly selected time windows from seven different experiments performed on the Bed and the

Sofa, respectively. VibroTag determines the peak frequency from the PSD, which corresponds to

the approximate repetition frequency (i.e. $\hat{f}_o$) of the vibration patterns present in the window. The

term $\delta f$ represents the second standard deviation of the variation in vibration frequency around $\hat{f}_o$,

which we estimate for every smartphone based on multiple different experiments. $\delta f$ is required as

some vibration patterns can repeat earlier than $1/\hat{f}_o$.
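
The estimation of the approximate repetition frequency and the resulting minimum peak distance can be sketched as follows. This is an illustration only: it evaluates the spectrum directly on a grid of candidate frequencies rather than using an FFT based PSD, and the search band, the grid step, and the interpretation of the frequency deviation as a value in Hz are assumptions.

```python
import cmath
import math

def dominant_frequency(window, fs, f_min, f_max, step=0.5):
    """Return the frequency in [f_min, f_max] (Hz) with the largest spectral power."""
    mean = sum(window) / len(window)
    centered = [v - mean for v in window]  # remove the DC component
    best_f, best_power = f_min, -1.0
    for k in range(int((f_max - f_min) / step) + 1):
        f = f_min + k * step
        w = -2j * math.pi * f / fs
        coeff = sum(v * cmath.exp(w * n) for n, v in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_f, best_power = f, power
    return best_f

def min_peak_distance(f_hat, delta_f):
    """MINDIST in seconds: the shortest expected gap between pattern peaks."""
    return 1.0 / (f_hat + delta_f)
```

Multiplying the returned distance by the sampling rate converts it into a number of samples, which is the form a sample-indexed peak detector would need.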

Figure 5.6: PSD of the sound signals in time windows corresponding to the scenarios in Figs.

5.5(a)-5.5(b). Panels: (a) Home Bed, (b) Home Living Room Sofa (FFT magnitude vs. frequency in Hz,

one curve per experiment).

To further sift out redundant peaks, we only choose peaks of value greater than MINSTR times

the median value of the peaks detected in the window. In our current implementation, we chose

MINPRO = 0.65, MINDIST = $1/(\hat{f}_o + \delta f)$ seconds, $\delta f$ = 6.5, and MINSTR = 0.5. We perform this

parametrization only at design time, which generalizes well for multiple different surfaces

and smartphones (i.e. Nexus 4 and OnePlus 2). Our algorithm does not require any end-user

calibration eﬀort. VibroTag uses the consecutive peaks detected in each randomly selected window

to extract multiple vibration patterns between those peaks.
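
A sketch of a peak detector driven by the three parameters is given below. It is not the authors' code: the prominence computation follows the common definition (height above the higher of the two lowest points separating the peak from higher samples), the distance rule keeps the taller of two conflicting peaks, and `min_distance` is expressed in samples rather than seconds. All function names are hypothetical.

```python
import statistics

def local_maxima(x):
    """Indices whose value exceeds the left neighbor and is not below the right one."""
    return [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]]

def prominence(x, i):
    """Height of peak i above the higher of the lowest points on either side,
    searching outward until a strictly higher sample or the window edge."""
    left_min = right_min = x[i]
    j = i - 1
    while j >= 0 and x[j] <= x[i]:
        left_min = min(left_min, x[j])
        j -= 1
    j = i + 1
    while j < len(x) and x[j] <= x[i]:
        right_min = min(right_min, x[j])
        j += 1
    return x[i] - max(left_min, right_min)

def find_peaks(x, min_prominence, min_distance, min_strength):
    """Return sorted peak indices that satisfy all three constraints."""
    candidates = [i for i in local_maxima(x) if prominence(x, i) >= min_prominence]
    kept = []
    # Visit taller peaks first so that a redundant nearby peak is the one dropped.
    for i in sorted(candidates, key=lambda idx: x[idx], reverse=True):
        if all(abs(i - k) >= min_distance for k in kept):
            kept.append(i)
    if not kept:
        return []
    threshold = min_strength * statistics.median(x[k] for k in kept)
    return sorted(k for k in kept if x[k] > threshold)

def patterns_between_peaks(x, peaks):
    """Slice the window into one pattern per pair of consecutive peaks."""
    return [x[a:b] for a, b in zip(peaks, peaks[1:])]
```

On a max-min normalized window, the settings reported in the text would correspond to `min_prominence=0.65` and `min_strength=0.5`, with `min_distance` obtained by converting MINDIST from seconds to samples.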


5.4.3.2 Construction of Vibration Signature

To construct a single consistent vibration signature, VibroTag ﬁrst collects at least a total of

M = 100 patterns extracted from the randomly selected time windows. Once enough

vibration patterns are extracted from different time windows, VibroTag combines all those patterns

using the median (which discards anything lying outside 75% of the data) to get a single vibration signature. The

median operation helps VibroTag remove short-term noisy variations in diﬀerent vibration patterns,

and therefore, allows it to extract a single robust vibration signature of the surface. Figures 5.7(a) and

5.7(b) show the example signatures extracted for the Bed and the Sofa related experiments, where

we can observe that the vibration signatures of each surface are consistent and almost identical.

Moreover, the extracted signatures uniquely represent their respective surfaces.
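
The median based combination can be sketched as follows. Since patterns cut between consecutive peaks generally differ in length, this sketch first resamples every pattern to a common length by linear interpolation; that resampling step and the choice of target length are assumptions, as the text does not state how the patterns are aligned before the median is taken.

```python
import statistics

def resample(pattern, length):
    """Linearly interpolate `pattern` onto `length` evenly spaced points."""
    if length < 2:
        raise ValueError("length must be at least 2")
    if len(pattern) == 1:
        return [float(pattern[0])] * length
    scale = (len(pattern) - 1) / (length - 1)
    out = []
    for k in range(length):
        pos = k * scale
        i = min(int(pos), len(pattern) - 2)
        frac = pos - i
        out.append(pattern[i] * (1 - frac) + pattern[i + 1] * frac)
    return out

def median_signature(patterns, length):
    """Point-wise median across all resampled patterns."""
    resampled = [resample(p, length) for p in patterns]
    return [statistics.median(column) for column in zip(*resampled)]
```

Because the median at each point ignores extreme values, a pattern corrupted by a short burst of noise has little influence on the resulting signature.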

Figure 5.7: Extracted time-series features (Low Noise). Panels: (a) Bed, (b) Sofa (value vs.

temporal units, sample #).

Figure 5.8: Extracted time-series features (High Noise). Panels: (a) Cafeteria Table 1,

(b) Cafeteria Table 2 (value vs. temporal units).

Figure 5.8 shows the vibration signatures extracted for two similar tables during lunch time

in a cafeteria on a university campus (i.e. a highly noisy environment). We can see that although

some of the vibration signatures that VibroTag extracted are inconsistent, it was still able to extract several

consistent vibration signatures even in such a highly noisy environment.

5.5 Classiﬁcation & Recognition

We use the shapes of the extracted waveforms as features because the shapes retain both

time and frequency domain information of the waveforms and are thus more suited for use in

classiﬁcation. After obtaining the time-series based vibration signatures, VibroTag uses them to

build training models for classiﬁcation. As VibroTag needs to compare vibration signatures obtained

for diﬀerent surfaces, we need a comparison metric that provides an eﬀective measure of the

similarity between vibration signatures of two surfaces. To achieve this, VibroTag uses the technique

of Dynamic Time Warping (DTW) that calculates the distance between waveforms by performing

optimal alignment between them. Using DTW distance as the comparison metric between vibration

signatures, VibroTag trains a k-nearest neighbour (kNN) classiﬁer using those signatures.

DTW is a dynamic programming based solution for obtaining the minimum distance alignment

between any two waveforms. DTW can handle waveforms of diﬀerent lengths and allows a non-

linear mapping of one waveform to another by minimizing the distance between the two waveforms.

In contrast to Euclidean distance, DTW gives us the intuitive distance between two waveforms by

determining the minimum distance warping path between them even if they are distorted or shifted

versions of each other. DTW distance is the Euclidean distance of the optimal warping path between

two waveforms calculated under boundary conditions and local path constraints. In our experiments,

DTW distance proves to be eﬀective for comparing two vibration signatures of diﬀerent surfaces.

Figure 5.9(a) shows the colormap of DTW distance between the vibration signatures extracted by

VibroTag from the experiments corresponding to the bed and the sofa (12 signatures each). The

average DTW distance among signatures of the bed is ∼2.3 and that for the sofa is ∼3.1. However,
the average DTW distance between the vibration signatures of the two surfaces was 16.59. Figure

5.9(b) shows the color map of Euclidean distance between features obtained using the IMUs based

scheme proposed in [26]. We can see that IMU based features cannot successfully diﬀerentiate


between the two surfaces due to high similarity, whereas VibroTag’s signatures are signiﬁcantly

better at diﬀerentiating the two seemingly similar surfaces.
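
The DTW distance discussed above can be sketched with the classic dynamic programming recurrence. This is a minimal version for illustration: it uses the absolute difference between samples as the local cost and the standard three-way step pattern without any window constraint, which are assumptions rather than details taken from VibroTag's implementation.

```python
def dtw_distance(a, b):
    """Minimum cumulative cost of aligning sequence a with sequence b."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # cost of aligning two empty prefixes
    for x in a:
        cur = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of: insertion, deletion, or match.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]
```

Two signatures that differ only by a local stretch or shift align at zero cost, whereas a point-by-point Euclidean comparison would penalize the misalignment.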

Figure 5.9: Colormaps of distance between features of (a) VibroTag (DTW) & (b) IMU scheme

(Euclidean) [26]. Both axes index the Bed and Sofa signatures.

VibroTag requires training data for the surfaces to be recognized. Afterwards, it trains a kNN

classiﬁer using the vibration signatures corresponding to those surfaces. To recognize a surface,

VibroTag feeds the detected vibration signature of that surface to the trained kNN classiﬁer. The

kNN classiﬁer searches for the majority class label among k nearest neighbors of the corresponding

vibration signature using the DTW distance metric. VibroTag declares the majority class label

obtained from the kNN classiﬁer as label of the tested surface. In the current implementation of

VibroTag, we chose k = 5 so that the classiﬁcation process averages more voters in each prediction,

which makes our classiﬁer more resilient to outliers.
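
The recognition step can be sketched as a majority vote among the k nearest training signatures under DTW distance. The sketch below is illustrative: the training data layout, the function names, the tie-breaking behavior, and the compact DTW helper (absolute-difference local cost) are assumptions.

```python
from collections import Counter

def dtw_distance(a, b):
    """Compact DTW with absolute-difference local cost (assumed cost function)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cur[j] = abs(x - y) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]

def knn_predict(train, query, k=5):
    """train is a list of (signature, label) pairs; returns the majority label
    among the k training signatures closest to `query` in DTW distance."""
    nearest = sorted(train, key=lambda item: dtw_distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For example, `knn_predict(train, signature, k=5)` would return the surface label for a newly extracted signature, given a list of labeled training signatures.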

5.6 Implementation & Evaluation

5.6.1 Implementation Details

We developed an Android application for generating vibrations and sampling sound signals

simultaneously (Fig. 5.10(a) shows VibroTag’s UI). Our application can record sound in a separate

high priority asynchronous thread which helps minimize sampling related irregularities. We use a 16

bit PCM encoding on a mono channel with a sampling rate of 44,100 Hz for sound recording. We also

record the data from the smartphone’s IMU sensors (i.e. accelerometer and gyroscope) in a separate


high priority thread. We use this data to implement the state-of-the-art IMUs based vibration sensing

approach for single COTS smartphones proposed in [26], and then compare its surface recognition

accuracy with VibroTag. Each data instance constitutes ∼3 seconds of sound and IMU data, during
which the vibration motor keeps vibrating. Our application controls a smartphone’s vibrator motor

only in terms of turning it ON or OFF, and therefore, does not change the amplitude or pattern of

the vibrations. This allows the smartphone to vibrate at its default vibration settings, which helps

keep data samples collected at the same surface/location consistent. Moreover, it makes VibroTag

applicable to smartphones which do not provide any amplitude control over their vibration motors.

We evaluated VibroTag using two smartphones, i.e. Google Nexus 4 and OnePlus 2.

5.6.2 Evaluation Setup

We evaluated VibroTag’s performance by conducting extensive experiments in two different types

of environments, i.e. oﬃce and apartment. We selected these environments because they represent

real-world use case scenarios where a user interacts with diﬀerent objects and surfaces regularly.

We collected data from 4 diﬀerent volunteers, three with Nexus 4 and one with OnePlus 2, whom

we name User-1 (Nexus), User-2 (Nexus), User-3 (OnePlus 2), User-4 (Nexus), respectively. All

volunteers were university students who lived in diﬀerent apartment complexes. No restrictions were

imposed on the movement or work conditions of people residing/working in the apartments/oﬃce.

For example, when we collected data in the oﬃce environment, other people in the oﬃce were

working and chatting as they do on a normal working day. Similarly, data collection did not

cause any interference in the daily activities (cooking, eating, watching TV, cleaning, etc.) of the

volunteers’ apartment mates. Therefore, our evaluation of VibroTag takes into account realistic

environments where noise from human activities is present most of the time. We used metrics such

as confusion matrices, True-Positive-Rates (TPRs) and False-Positive-Rates (FPRs) to evaluate

VibroTag’s classiﬁcation performance. We also compare VibroTag’s performance with the IMUs

based approach proposed in [26].
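
The per-class metrics mentioned above can be derived from a confusion matrix as in the following sketch. The matrix layout (rows are true classes, columns are predicted classes) and the function name are assumptions made for illustration.

```python
def per_class_rates(confusion):
    """Return a list of (TPR, FPR) pairs, one per class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    rates = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # missed samples of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp  # other classes labeled c
        tn = total - tp - fn - fp
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        rates.append((tpr, fpr))
    return rates
```

In a multi-class setting each class is treated in turn as the positive class and all remaining classes as negative.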


Figure 5.10: VibroTag Setup (a) VibroTag App interface (b) office environment (c) example of data

collection locations in office (d) example of surfaces used for data collection in apartment

(e) example of data collection locations in apartment

5.6.3 VibroTag’s Sensitivity

We deﬁne VibroTag’s sensitivity as its ability to diﬀerentiate between diﬀerent positions and

orientations of the smartphone placed on the same location/surface. For example, a user can

place his smartphone on his oﬃce table in several possible positions and diﬀerent orientations.

VibroTag’s sensitivity is an important metric since we claim that a user can place his smartphone

on a surface with reasonable ﬂexibility, without having to worry about centimeter level diﬀerences in


its position and orientation (unlike e.g. EchoTag [135]). This claim will not be satisﬁed if VibroTag

is too sensitive.

To understand VibroTag’s sensitivity, each volunteer collected data in his restroom (on the toilet

tank), on his bed, bedroom table, living room table and living room sofa. Users collected 25 to

30 samples from each surface for three diﬀerent smartphone placement scenarios i.e. (1) least

restricted, (2) moderately restricted and (3) highly restricted. The three scenarios corresponded to three

rectangular regions of different sizes. We marked the highly restricted region to be

within a few inches of the dimensions of the smartphone, the moderately restricted

region to be ∼4 times larger than the highly restricted region, and the least restricted region

to be ∼3 times larger than the moderately restricted region. For example, Figs.

5.1(d) and 5.1(e) show the marked regions for a sofa and a worktable, respectively, in an apartment.

For the experiments related to the least restricted region, volunteers were allowed to place their

smartphone even beyond the third zone, as long as their smartphone was placed on the same surface.

Figure 5.11 shows confusion matrix plots for the 5 tested classes (namely Bed, Living Room Table,

Living Room Sofa, Restroom Ledge and Kitchen Counter), for both User-1 using Nexus 4 (Figs.

5.11(a)-5.11(c)) and User-2 using OnePlus 2 (Figs. 5.11(d)-5.11(f)). For each scenario, confusion

matrices are plotted using results from 2-fold cross-validation. We observe that for both users,

highly restricting the device placement results in the highest average prediction accuracy (i.e. 92.16%

and 97.03%, respectively), which gradually decreases as the restriction on the smartphone’s position and

orientation changes from high to least. From the confusion plots (Fig. 5.11), we observe that the

accuracies corresponding to User-2 are higher than User-1’s. This may be because either OnePlus 2

is able to extract better quality vibration signatures than Nexus 4, or because User-2’s environment

and the tested surfaces therein were diﬀerent from User-1’s (e.g. User-1’s bedroom table might have

some light objects (e.g. keys) placed close to the smartphone that created noise when responding

to the vibration). We discuss such impact of surrounding objects on VibroTag in §5.6.4.

Our results show that the average accuracy corresponding to User-1’s moderately restricted scenario

is higher than that of the highly restricted scenario (Figs. 5.11(a)-5.11(c)), which may be attributed to more

Figure 5.11: Confusion matrices for experiments performed by User-1 and User-2 to determine

VibroTag’s sensitivity. Panels: (a) High-User-1, (b) Moderate-User-1, (c) Least-User-1,

(d) High-User-2, (e) Moderate-User-2, (f) Least-User-2.

noisy samples obtained during the highly restricted scenario. Figures 5.12(a) and 5.12(b) show the average

accuracy of all 5 classes for all 3 restriction scenarios obtained with 2-fold, 3-fold, 4-fold and 5-fold

(i.e. increasing percentage data used for training from 50% to 80%) cross-validation classiﬁcation

Figure 5.12: Average accuracy with increasing number of training samples (sensitivity

experiments). Panels: (a) k-folds on User-1 data, (b) k-folds on User-2 data (accuracy vs. number of

folds, for the Significantly Restricted, Moderately Restricted and Least Restricted scenarios).

experiments. We observe that VibroTag performs well for all 3 restriction scenarios even when only

50% of the data is used for training and the remainder for testing, and accuracies for even the least restricted

scenarios reach as high as 87% for User-1 and 95% for User-2 when the percentage of training data

reaches 80%.
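
The link between the number of folds and the percentage of training data quoted above can be made explicit. Assuming equally sized folds with one fold held out for testing in each round, the fraction of data used for training in k-fold cross-validation is:

```latex
\[
  \text{training fraction} = \frac{k-1}{k}, \qquad
  k=2:\ 50\%, \quad k=3:\ \approx 66.7\%, \quad k=4:\ 75\%, \quad k=5:\ 80\%.
\]
```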

5.6.4 VibroTag’s Accuracy

5.6.4.1 Object and Location Recognition Accuracy

VibroTag achieves an average 4-fold accuracy of 86.55% when identifying different objects and locations, whereas the IMU based approach achieves only 49.25%. Table 5.1 shows the average 2-fold and 4-fold classification accuracies obtained for 24 different objects (e.g. a box) and locations (e.g. kitchen counter) by User-1. For these experiments, User-1 collected 30-35 samples from each of those objects and locations in a moderately restricted manner. Our results show that VibroTag achieves an average (4-fold) recognition accuracy of 86.55%, which is about 37 percentage points higher than the 49.25% average accuracy achieved by the latest IMUs based approach [26]. Moreover, the lowest accuracy achieved by VibroTag is 70.33%, whereas the IMUs based method's accuracy goes as low as 16.38%. This shows that the features extracted using VibroTag can successfully differentiate between different objects, even when the objects are made of very similar material (e.g. wood chair vs wood table, or metal drawer vs metal shelf).

Table 5.1: Average accuracy of recognizing different surfaces in office and apartment scenarios

Surfaces (office and apartment environments): Pages Bundle, Printer, Foam Chair, Metal Drawer, Carpet, Wooden Chair, Metal Shelf, Mouse Pad, Leather Chair, Center Desk (Wood), Cardboard Box, XBox, Work Desk (Wood), Window Ledge (Marble), Bathtub Ledge, Living Table (Wood), Glass Table, Kitchen Counter, Fridge, Wooden Floor, Bedroom Table (Wood), TV Table (Wood), Living Room Sofa, Microwave

VibroTag's Accuracy

2-Fold 77.60

4-Fold 80.87

97.72

99.28

2-Fold 68.64

87

4-Fold 75.63

90.39

68.50 83.31 86.55 88.31

89.09 78.53 91.76 79.02

95.79

73.46 80.62

92.86

93.19

81.9

79.03 87.36

87.65 84.31

82.29

76.39

72.94

70.33 86.77 91.15 88.69

90.84 81.25 93.19 81.53

95.52

75.96 83.49

96.16

94.06

82.84

82.43 91.17

90.09 88.26

88.26

77.61

76.99

State-of-the-Art IMUs Based Method’s Accuracy [26]

30.35 68.3

35.13 94.94

89.96 64.05 75.45 27.07

34.29

19.55 49.41

65.08

38.78

25.18

16.89 19.42

56.60 32.98

51.10

30.07

29.31

31.23 70.02 37.48 95.91

91.52 66.31 78.23 29.69

40.21

21.56 52.92

66.95

40.26

25.92

17.04 20.23

63.40 33.98

52.35

32.53

32.27

91.06

92.56

64.98

65.27

5.6.4.2 Location Recognition Accuracy over Days

VibroTag can maintain an average accuracy of up to 85% using training samples obtained over only 3-4 days. VibroTag's accuracy can change over days for several reasons, for example, changes in environmental noise and/or changes in the position of other items placed on a surface (e.g. light objects such as keys). Next, we explore how VibroTag's accuracy changes over days and how much training VibroTag requires to maintain high accuracy when testing on data from a new day.

Figure 5.13(a) shows the average cross-validation accuracy over all classes for data obtained from User-1 on 5 different days. The figure also shows the cross-validation accuracy and confusion matrix obtained when data from all 5 days was combined. For these experiments, User-1 collected 6-20 samples from 10 different locations (i.e. Workplace (office table), Bed, Living Room Table, Kitchen (on marble counter), Car (small compartment in front of the gear stick), Hand, Living Room Sofa, Restroom (on toilet tank), Bedroom Table and Pocket) every day. For this set of experiments, data was collected for both highly and moderately restricted smartphone placement scenarios. Our results show that VibroTag achieves at least 79% (and at most 87%) accuracy every day when using only 50% of a day's data for training. Figure 5.13(c) shows how combining data from previous days improves accuracy for data collected on the subsequent days from User-1. We observe that VibroTag can achieve an accuracy of more than 80% on the unknown samples on day 5 for both restriction scenarios. Moreover, we observe that the accuracy of the moderately restricted smartphone placement scenario approaches that of the highly restricted scenario.

Figure 5.13: (a) Average 4-fold cross-validation accuracies over all classes (User-1) (moderately restricted experiments), (b) Confusion matrix after cross-validation, (c) Training on data from previous days, testing on subsequent days. Panel (a), individual and all 5 days: accuracy (%) for 2-fold to 5-fold cross-validation on Day 1 to Day 5 and All Days. Panel (b), confusion matrix for all 5 days data (overall accuracy 89.44%): output class versus target class over Workplace, Bed, LRTable, Kitchen, Car, Hand, Sofa, Bath, BRTable, and Pocket. Panel (c), consecutive days accuracy: accuracy (%) versus days used for training (1, 1-2, 1-2-3, 1-2-3-4) for the highly restricted and moderately restricted scenarios.

To understand how VibroTag's accuracy changes over days across multiple users, we collected 5 samples from Users 2, 3, and 4 for 4 different locations (i.e. Kitchen (on marble counter), Living Room Sofa, Restroom (on toilet tank), Bedroom Table and Living Room Table) for 20, 10, and 10 consecutive days, respectively. Fig. 5.14 shows how the classification accuracy changes for different users, where VibroTag is trained using the data from previous days and tested on the subsequent days. We observe that the accuracy generally increases with days for User-2; however, there is a major dip for User-3 on day 5 and for User-4 between days 5-7, which can be attributed to major changes in the surrounding environment in terms of noise and/or addition/removal of different objects (such as keys or a pen) on the surface.
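The "train on previous days, test on subsequent days" protocol can be sketched as an expanding-window split over the per-day samples. The sketch below assumes that each test set is the single day that immediately follows the training window; the text does not say whether all later days are pooled, so this is only one plausible reading. The `Sample` type and the function name are hypothetical.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

# Hypothetical sample representation: (feature vector, location label).
Sample = Tuple[Sequence[float], str]


def expanding_day_splits(
    samples_by_day: Dict[int, List[Sample]],
) -> Iterator[Tuple[List[int], int, List[Sample], List[Sample]]]:
    """Yield (train_days, test_day, train_samples, test_samples).

    Round i trains on every day before days[i] and tests on days[i].
    Assumption: only the next day is used for testing in each round.
    """
    days = sorted(samples_by_day)
    for i in range(1, len(days)):
        train_days = days[:i]
        test_day = days[i]
        train = [s for d in train_days for s in samples_by_day[d]]
        test = list(samples_by_day[test_day])
        yield train_days, test_day, train, test


if __name__ == "__main__":
    # Toy data: 2 samples per day for 4 days (values are placeholders).
    toy = {d: [([float(d)], "Kitchen"), ([float(d)], "Sofa")] for d in (1, 2, 3, 4)}
    for train_days, test_day, train, test in expanding_day_splits(toy):
        print(f"train on days {train_days} ({len(train)} samples), test on day {test_day}")
```

Each round would feed `train` to the classifier and report accuracy on `test`, giving one accuracy value per day as plotted in the consecutive-days figures.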

Figure 5.14: Accuracies on consecutive days. Panels: User-2 (days 2 to 20), User-3 and User-4 (days 2 to 10). Axes: accuracy (%) versus days.

5.6.4.3 Impact of Surrounding Objects

To understand the impact of surrounding objects on VibroTag's accuracy, we performed two different sets of experiments. In the first set of experiments, we collected data on a participant's bedroom table before and after removing 4 different heavier objects (i.e. a guitar, an LCD, a laptop and a mug) from the table one by one. Figure 5.15 shows the setup for these experiments, where 5.15(a) corresponds to the scenario where all objects were on the table, and 5.15(d) corresponds to the scenario where all three objects were removed. In the second set of experiments, we collected data on the participant's living room table by bringing 4 different lighter objects (i.e. a cup, a set of keys, a pen, and a water bottle) closer to the smartphone. Fig. 5.16 shows the setup, where 5.16(a) shows the different objects used in these experiments, and 5.16(b)-5.16(d) show a set of keys being brought closer to the smartphone. Fig. 5.17 shows the results for the aforementioned sets of experiments.

Figure 5.15: Removing things from bedroom table. Panels: (a) setup, (b) no guitar, (c) no laptop, (d) no LCD.

Figure 5.16: Bringing light objects closer to smartphone. Panels: (a) setup, (b) 3 inches, (c) 9 inches, (d) 12 inches.

Figure 5.17: Effect of (a) moving objects closer and of (b) removing objects on classification. Panel (a), Living Room Table: accuracy (%) versus distance from smartphone (inches) for the cup, keys, pen, and water bottle. Panel (b), Bedroom Table: accuracy (%) versus things removed from bedroom table (guitar, monitor, laptop, mug).

We observe from Fig. 5.17(a) that as the lighter objects come closer to the smartphone (i.e. from 12 inches down to 3 inches), the classification accuracy of the table decreases significantly. For example, when the pen is within 3 inches of the phone, the accuracy goes as low as 9%. From Fig. 5.17(b), we observe that the impact of heavier objects on VibroTag's accuracy is not as significant as that of the lighter ones, because the energy transferred by the smartphone's vibration is not enough to make those objects vibrate significantly. However, we observe that the classification accuracy still drops by more than 10% as we remove, one by one, the objects that were previously placed on the table. This is because each of those objects has its own vibration response, which was contributing to the overall vibration signature of the bedroom table. So, when those objects are removed one by one, their response is omitted from the new vibration signature of the surface, which leads to a loss in classification performance.

5.6.4.4 Impact of Upper Cut-Off Frequency

VibroTag achieves best accuracy when frequencies above 5500 Hz are filtered out from the recorded sound signals. A smartphone's microphone can usually capture sounds in the frequency range of 20 Hz - 20 kHz. However, the smartphone's vibration usually causes variations in lower frequencies, and therefore, filtering out higher frequencies can reduce the impact of background noise and any unwanted noisy variations. To understand which frequencies can be filtered out to achieve the best accuracy in VibroTag, we employ a Butterworth band-pass filter, and determine the average accuracy for 5 different upper cut-off frequencies of the filter. Figure 5.18 shows how User-3's (OnePlus 2) multi-fold cross-validation accuracies vary as the upper cut-off frequency increases from 1500 Hz to 12000 Hz. We observe that VibroTag achieves the best accuracy at a cut-off frequency of 5500 Hz. As the upper cut-off frequency increases, the filter admits higher frequency noisy variations into the vibration signatures, which leads to lower classification accuracies. We observed similar results for other users as well. Therefore, all accuracies reported in this work correspond to a 5500 Hz upper cut-off frequency. Note that other smartphones may exhibit better accuracies for cut-off frequencies which are slightly different from 5500 Hz; however, this aspect is out of the scope of this work.
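To illustrate the filtering step, the sketch below builds a band-pass response by cascading a second-order Butterworth high-pass section with a second-order Butterworth low-pass section (bilinear-transform biquads with Q = 1/sqrt(2)). This is only an approximation of a band-pass Butterworth design, and several choices are assumptions rather than details given in the text: the filter order, the 20 Hz lower cut-off, and the 44.1 kHz sampling rate used in the usage example. Only the 5500 Hz upper cut-off comes from the text.

```python
import math
from typing import List, Sequence, Tuple

# Normalized biquad coefficients (b0, b1, b2, a1, a2), with a0 scaled to 1.
Biquad = Tuple[float, float, float, float, float]


def butterworth_biquad(kind: str, cutoff_hz: float, fs_hz: float) -> Biquad:
    """Second-order Butterworth low-pass ("low") or high-pass ("high") section."""
    if not 0.0 < cutoff_hz < fs_hz / 2.0:
        raise ValueError("cutoff must lie between 0 and the Nyquist frequency")
    w0 = 2.0 * math.pi * cutoff_hz / fs_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2Q) with Q = 1/sqrt(2)
    if kind == "low":
        b0, b1, b2 = (1.0 - cos_w0) / 2.0, 1.0 - cos_w0, (1.0 - cos_w0) / 2.0
    elif kind == "high":
        b0, b1, b2 = (1.0 + cos_w0) / 2.0, -(1.0 + cos_w0), (1.0 + cos_w0) / 2.0
    else:
        raise ValueError("kind must be 'low' or 'high'")
    a0 = 1.0 + alpha
    return (b0 / a0, b1 / a0, b2 / a0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0)


def apply_biquad(coeffs: Biquad, samples: Sequence[float]) -> List[float]:
    """Run one biquad section over the samples (direct form I)."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out: List[float] = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def band_pass(
    samples: Sequence[float],
    fs_hz: float,
    low_cut_hz: float = 20.0,     # assumed lower cut-off (not stated in the text)
    high_cut_hz: float = 5500.0,  # upper cut-off reported in the text
) -> List[float]:
    """High-pass at low_cut_hz followed by low-pass at high_cut_hz."""
    high_passed = apply_biquad(butterworth_biquad("high", low_cut_hz, fs_hz), samples)
    return apply_biquad(butterworth_biquad("low", high_cut_hz, fs_hz), high_passed)


if __name__ == "__main__":
    fs = 44100.0  # assumed microphone sampling rate
    tone = [math.sin(2.0 * math.pi * 15000.0 * n / fs) for n in range(4410)]
    filtered = band_pass(tone, fs)
    print("peak of a 15 kHz tone after filtering:", max(abs(v) for v in filtered[2000:]))
```

Sweeping `high_cut_hz` over the five values in Figure 5.18 and re-running the classifier on the filtered recordings would reproduce the kind of comparison described above; a higher-order design would give a sharper roll-off at the cost of more computation.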

Figure 5.18: Cross-validation accuracies for different band-pass filter upper cut-off frequencies (User-3). Axes: accuracy versus frequency (Hz) at 1500, 3000, 5500, 8000, and 12000 Hz; curves: 2-fold, 3-fold, 4-fold, and 5-fold.

5.7 Usability Study

We carried out a usability study and asked 24 participants (20 male, 4 female), recruited at university, about the flexibility and usability of the VibroTag application in daily life. The participants comprised students and university employees aged 19 to 35. They were first briefed about the working of VibroTag and then about its target applications such as symbolic localization. The volunteers were shown the VibroTag application interface as pictured in Figure 5.10(a) and given a demo of the acoustic trace collection. They were also briefed on how the smartphone can be placed on a surface with 3 different levels of restriction. At the end, they were given the set of usability questions listed below, the responses to which are summarized in Figure 5.19:

Q1. Are you comfortable using smartphone for location recognition?
Q2. Can VibroTag help you save time by setting reminders?
Q3. Are you comfortable with VibroTag's use of vibration?
Q4. Is it easy to place smartphone on preferred locations for learning?
Q5. Can VibroTag help you in setting smart notifications linked to locations?
Q6. Is VibroTag useful in activating other smart applications?
Q7. Do you find VibroTag application valuable and fun to use?

Figure 5.19: Vote distribution of 7 VibroTag's usability questions asked from 24 participants. Axes: number of votes (Disagree, Neutral, Agree) for each usability study question (Q1 to Q7).

Our study indicates high agreement on the usefulness of VibroTag based smart notiﬁcations

and reminders. It also indicates some discomfort in the use of vibration.

5.8 Conclusion

In this work, we make the following contributions. First, we propose the first fine-grained vibration based sensing scheme that can recognize different surfaces using the vibration mechanism and microphone of a single COTS smartphone. The intuition is that the smartphone's vibration causes the whole smartphone structure and the hardware inside it to vibrate in a peculiar pattern, which depends upon the absorption properties of the surface that the smartphone is placed on. These vibrations produce peculiar sound waves that we detect using the smartphone's microphone. Second, we propose a novel signal processing technique to extract fine-grained vibration signatures that are robust to hardware irregularities and background environmental noises. We implemented VibroTag on two different Android phones and evaluated it in multiple different environments. Our results show that VibroTag achieves an average surface recognition accuracy of 86.55%, which is about 37 percentage points higher than the average accuracy of only 49.25% achieved by the state-of-the-art IMUs based schemes.

CHAPTER 6

DISTRIBUTED SPECTRUM SHARING FOR ENTERPRISE POWERLINE

COMMUNICATION BASED IOT NETWORKS

6.1 Introduction

As powerline communication (PLC) technology does not require dedicated cabling and network

setup, it can be used to easily connect a multitude of Internet of Things (IoT) devices deployed in

enterprise environments for sensing and control related applications. Thanks to the plug-n-play

nature of PLC technology, a PLC enabled device just needs to be connected to a wall socket, and

it will automatically form a mesh network with nearby PLC devices. IEEE has standardized the

PLC protocol in IEEE 1901, also known as HomePlug AV (HPAV) [3, 5], which has been widely

adopted in mainstream PLC devices.

A key weakness of the HPAV protocol is that it does not support spectrum sharing. Currently,

each link in an HPAV PLC network operates over the whole available spectrum, and only one link

can operate at any time within a single collision domain. Figure 6.1 shows an example enterprise

level IoT application scenario, where multiple PLC nodes (including multiple gateway nodes) are

connected in the same MAC collision domain to a power distribution network. Currently, two

disjoint PLC links (e.g. 5-8 and 12-11 in Fig. 6.1) cannot operate concurrently with existing HPAV

MAC protocols. However, in real enterprise PLC deployments, we often encounter scenarios where

a subset of subcarriers on some PLC links are highly underutilized as compared to other links,

which implies that the low-modulated subcarriers of one PLC link can be utilized by one of the

other links to improve the aggregated throughput. Moreover, if multiple PLC links, which may be

competing for the same channel simultaneously, can operate in parallel via sharing spectrum, many

costly collisions can be avoided.

In this work, through an extensive measurement study of HPAV PLCs in a real enterprise environment using commodity off-the-shelf (COTS) HPAV PLC devices, we discover that spectrum sharing can significantly benefit enterprise level PLC networks. Our first finding is that PLC nodes connected under the same circuit breaker in a building's power distribution network can communicate at 6.5 times higher throughput than the PLC nodes connected under two different breakers, and 18-30 times higher throughput than the PLC nodes connected to two completely different power distribution/trunk lines. This implies that enterprise PLC networks must have at least one gateway node connected under every breaker, to provide the best possible connectivity to the IoT devices connected under that breaker. As each power distribution line can contain tens to hundreds of breakers, with a multitude of IoT devices connected to a gateway under each breaker, the number of disjoint links, which consist of different source-destination pairs and may compete for the same channel simultaneously, becomes significant. Second, based on our subcarrier level spectral analysis, we observe that the PLC channels of more than 50% of the PLC links are significantly different from each other due to highly location dependent multipath characteristics. As the performance of different frequency subcarriers varies among different PLC links, low-modulated subcarriers of one link can be utilized by other links, and vice versa. Third, most links in an enterprise PLC network are pseudo-stationary, i.e. the channel characteristics between any two PLC nodes have low temporal variability (the standard deviation of throughput observed over 15 minute time windows is below 2.2 Mbps for more than 80% of the links), and therefore, a spectrum sharing scheme can be achieved at low channel estimation related control overhead.

Figure 6.1: Example scenario: Links 5-8 and 12-11 in the same collision domain can share spectrum for concurrent operation

Multiple Frequency Division Multiplexing (FDM) based spectrum sharing techniques have been proposed for PLCs [7, 51]. However, such spectrum sharing techniques have three major limitations. First, they are incompatible with the HPAV MAC, which makes them difficult to adopt. Second, they are designed for WiFi-like point-to-multipoint communications. This may be suitable for home PLC networks where a few IoT devices are connected to a single PLC gateway node. However, it is unsuitable for enterprise PLC environments, as enterprise PLC networks are mesh networks, with a multitude of disjoint links between IoT devices and their respective gateway nodes. Third, they have prohibitively high computational and control overheads involved in their underlying subcarrier assignment and bit loading algorithms. This makes them impractical for real world deployment.

In this work, we aim to design a spectrum sharing scheme which is compatible with HPAV

MAC, is suitable for enterprise level PLC mesh networks, and incurs minimal computational

and control overheads. To this end, we propose an HPAV compatible, distributed, low overhead

spectrum sharing approach for enterprise PLC networks. Currently, HPAV MAC protocol uses

Carrier-Sense Multiple Access with Collision Avoidance (CSMA/CA) and Time-Division Multiple

Access (TDMA) techniques for sharing medium access among PLC nodes. To make our scheme

compatible with existing HPAV MAC, we design it such that any link which occupies the PLC

channel following the regular HPAV CSMA/CA or TDMA protocol shares a part of its spectrum

with another link to improve the aggregated throughput of both links. Moreover, we design our

scheme such that it can be enabled in the current HPAV PLC devices while incurring minimum

firmware level changes. We refer to the links which occupy the PLC channel following the regular HPAV CSMA/CA or TDMA protocol as primary links, and to the links with which the primary links share their spectrum as secondary links. To make our scheme suitable for enterprise level PLC

mesh networks, we develop a distributed spectrum sharing strategy. To achieve this, we develop

an optimal spectrum sharing algorithm which each node uses to locally compute a complete set

of network-wide spectrum sharing rules for all possible primary links and their corresponding

secondary links in the network. Our algorithm leverages subcarrier level channel information

corresponding to all possible links in the network to compute those rules. Based on these rules,

any primary link can decide which of the possible secondary links it should share its spectrum

with, and what part of the spectrum it should share, to achieve the best possible spectrum sharing

gains. When the source node of a primary PLC link gets channel access, it broadcasts the link’s

source-destination IDs to all the remaining nodes in its network. Next, it picks one of the possible

secondary links to share its spectrum with, based on its locally computed network-wide spectrum

sharing rules, and then continues its remaining transmission in the unshared region of spectrum.

Meanwhile, the source and destination nodes of the chosen secondary link establish connection, and

start operating in parallel with the primary link over the shared region of spectrum. This happens

automatically, as both source and destination nodes of the chosen secondary link already know the

source-destination IDs of the primary link and have the same set of network-wide spectrum sharing

rules. Transmission of secondary link ﬁnishes as soon as the primary link ﬁnishes its transmission.

To minimize the computational and control overhead of our scheme, we take the following design

decisions: First, we design our spectrum sharing algorithm such that the basic optimization problem

which it solves comes down to optimally sharing spectrum between just two links (i.e., a primary

and a secondary), which is a computationally simpler problem to solve than sharing spectrum with

several links simultaneously. Second, we design our scheme to operate in a distributed manner,

where each node locally computes network-wide spectrum sharing rules. This makes real-time

spectrum sharing seamless, as it completely avoids any extra control related communications for

coordinating spectrum sharing in the network. Third, our design takes advantage of the pseudo-

stationary nature of enterprise PLC channels to reduce channel estimation related overhead. The

computation of network-wide spectrum sharing rules at each node requires latest subcarrier level

channel information of all possible links in the network. To achieve this, each node ﬁrst gets channel

information corresponding to all possible links it can form, and then shares that information with

other nodes in the network, which can involve considerable communication overhead. However,

as most PLC channels in an enterprise setting are pseudo-stationary, PLC nodes do not need to

update their copy of network-wide channel information too frequently. Therefore, the frequency of

channel probing is signiﬁcantly reduced, which maintains the spectrum sharing gains.
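To make the two-link formulation concrete, the following toy sketch splits the subcarriers between one primary and one secondary link using their tonemaps (bits loaded per subcarrier). It is not the spectrum sharing algorithm proposed in this work; it only illustrates why sharing helps when the two links have different per-subcarrier channel quality. The rule used here (the primary gives away a subcarrier only when the secondary can load strictly more bits on it) and the example tonemap values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_spectrum(
    primary_bits: Sequence[int], secondary_bits: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Return (kept, shared) subcarrier indices for a primary/secondary pair.

    Assumed toy rule: the primary keeps a subcarrier unless the secondary
    link can load strictly more bits on it.
    """
    if len(primary_bits) != len(secondary_bits):
        raise ValueError("both tonemaps must cover the same subcarriers")
    kept: List[int] = []
    shared: List[int] = []
    for j, (p, s) in enumerate(zip(primary_bits, secondary_bits)):
        (shared if s > p else kept).append(j)
    return kept, shared


def bits_per_symbol(
    primary_bits: Sequence[int], secondary_bits: Sequence[int]
) -> Tuple[int, int]:
    """Return (primary alone, both links after the split) in bits per OFDM symbol."""
    kept, shared = split_spectrum(primary_bits, secondary_bits)
    alone = sum(primary_bits)
    together = sum(primary_bits[j] for j in kept) + sum(secondary_bits[j] for j in shared)
    return alone, together


if __name__ == "__main__":
    # Hypothetical 5-subcarrier tonemaps (real HPAV tonemaps have 917 entries).
    primary = [10, 8, 2, 1, 6]
    secondary = [2, 4, 8, 10, 6]
    print("kept/shared subcarriers:", split_spectrum(primary, secondary))
    print("bits per symbol (alone, shared):", bits_per_symbol(primary, secondary))
```

Because the split depends only on the two tonemaps, every node holding the same network-wide channel information would derive the same split locally, which is the property the distributed design relies on. A practical rule would also need to account for fairness and for the primary link's own throughput.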

We implement and evaluate our proposed spectrum sharing technique on the HPAV CSMA protocol only, as the integration of our spectrum sharing technique with the HPAV TDMA protocol is relatively straightforward to achieve (we present a detailed discussion on this in §6.6). We perform trace driven simulations using channel response (tonemap) traces collected from seven different 4-node PLC deployments. We show that fine-grained distributed spectrum sharing can boost the aggregated and per-link throughput by more than 60% and 250%, respectively.

6.2 Related work

Hayasaki et al. [51] and Achaichia et al. [7] have proposed FDM based multiple access techniques in the context of point-to-multipoint communication in PLC networks. Hayasaki et al. [51] proposed a theoretical bit-loading based OFDMA scheme for in-home PLCs, as an alternative to TDMA/CSMA based medium access. Their scheme consists of two iterative algorithms: a subcarrier assignment algorithm and a bit-loading algorithm. The subcarrier assignment algorithm assigns subcarriers to maximize the overall throughput, while first satisfying the minimum throughput guarantees of each destination PLC node. Afterwards, the bit-loading algorithm is utilized for loading bits into the assigned subcarriers, while optimizing both the bit quantity on each subcarrier and the overall code rate, subject to BER constraints on each subcarrier. Achaichia et al. [7] proposed a similar technique named Tone Maps Splitting Algorithm (TMSA) to orthogonalize the spectrum assigned to multiple active links in a point-to-multipoint communication. However, the aforementioned techniques are designed for WiFi-like point-to-multipoint communication in PLCs, and are proposed as an alternative MAC protocol to the existing HPAV TDMA/CSMA based MAC. Therefore, these techniques are incompatible with the current HPAV MAC. Moreover, the aforementioned techniques involve high computational and control overheads corresponding to their underlying resource allocation schemes (i.e. subcarrier assignment and bit loading algorithms), which makes them impractical for real world deployment scenarios. In contrast, our goal is to design a distributed spectrum sharing technique for HPAV PLC networks while incurring minimal computational and control overheads. Moreover, our aim is to augment and integrate spectrum sharing on top of the existing TDMA/CSMA MAC used in mainstream PLC devices such as HPAV, while incurring minimal firmware level modifications.

In [93, 142], the authors compare HPAV with WiFi performance. They study temporal and spatial

variations of the throughput of PLC links and make a case for hybrid PLC-WiFi networks [142].

The measurement study in [14] shows the multi-ﬂow performance of PLC networks. It then presents

BOLT, which seeks to manage traﬃc ﬂows in PLC networks. The above studies ignore the spectral

ineﬃciencies at MAC layer of HPAV networks. In contrast, we extensively study the behavior of

PLCs in spatial, temporal and spectral dimensions, and propose novel spectrum sharing strategies

to improve per-link and aggregated throughput of enterprise level PLC networks.

6.3 HomePlug AV Powerline Communications

6.3.1 PLC Channel Characteristics

Multipath is a key characteristic of PLC channels, which is attributed to unmatched electric loads

or branch circuits connected to diﬀerent sockets on the powerline. In a typical power distribution

network of a large building, there are multiple branch circuits with diﬀerent impedances, and

therefore, PLC signals are reﬂected from multiple reﬂection points leading to multipath eﬀects. On

top of multipath attenuations, several diﬀerent types of noise in PLC channels have been identiﬁed

[22, 36]. Harmonics of AC mains and other low power noise sources in the power lines lead to

colored background noise, which decreases with frequency. Periodic impulsive noise is created due

to rectiﬁers, switching power supplies and AC/DC converters, which can be either synchronous or

asynchronous with AC line cycle. Aperiodic impulsive noise also exists in PLC channels due to

switching transients in power supplies, AC/DC converters, etc.

6.3.2 HomePlug AV standard

The most widely adopted family of PLC standards comprises the HomePlug AV, AV2 and Green PHY standards [72]. HomePlug AV2, which is the latest of these standards, can support up to 1 Gbps PHY rates. Our study focuses on the HomePlug AV standard, which has been widely used in home networks to improve coverage, and can support maximum PHY rates of up to 200 Mbps [3, 5].

However, our ﬁndings and solutions can also be generalized for PLC technologies other than HPAV,

such as HPAV2.

HPAV PHY-layer: HPAV uses 1.8-30 MHz frequency band and employs OFDM with 917

subcarriers (for the USA devices), where each subcarrier can use any modulation scheme from

BPSK to 1024-QAM depending on the channel conditions [72]. In order to update the modulation

schemes for each subcarrier, two communicating HPAV PLC devices continuously exchange and

maintain tonemaps between them. Tonemaps refer to the information about the modulation scheme

used per subcarrier, i.e. the number of bits modulated per subcarrier. The tonemaps exchanged

are estimated for multiple diﬀerent sub-intervals of the AC mains cycle. Tonemaps are exchanged

between PLC devices through a sounding process: the transmitter sends sounding frames to the receiver using QPSK for all subcarriers, and the destination estimates the channel quality and sends the tonemaps corresponding to different sub-intervals of the AC mains cycle back to the transmitter. The destination can communicate up to 7 tonemaps, i.e. 6 tonemaps for the different sub-intervals of the AC line cycle, called slots, and one default tonemap [72], depending on the condition of noise and attenuation observed in different parts of the AC line cycle. Tonemaps are continuously

updated by default after 30 seconds or when the error rate exceeds a threshold [72]. Tonemaps

provide us with the information about Channel Frequency Response (CFR) of the channel between

two communicating PLC devices.

Figure 6.2: Basic Beacon Period structure in HPAV MAC

HPAV MAC-layer: MAC-layer of HPAV based PLCs works very diﬀerently from that of WiFi

MAC. First, unlike WiFi, channelization is not allowed and not used in PLCs, which limits the

possibility of deploying non-interfering networks, such as WiFi networks on diﬀerent channels.

Second, there is no concept of a central Access Point (AP) in PLCs. There exists a dynamically

chosen central authority to manage network, called the Central Coordinator or CCo, and large

PLC networks can contain multiple CCo’s managing their own collision domains. However, CCo’s

role is passive, mainly authentication and association of new nodes, monitoring the network,

synchronizing it with the AC line cycle, and taking time-division access decisions in terms of

allocating TDMA and CSMA/CA slots. In contrast, the WiFi AP-mode forces downlink/uplink

traﬃc types and a star-like logical network. PLCs only form mesh networks, and every node can

communicate with its peers, without relaying through the CCo. Both TDMA and CSMA/CA are

supported by HPAV [72]. Tonemaps are optimized for the QoS required for the traﬃc in the TDMA

allocations. HPAV uses a Beacon Period, managed by a CCo, for allocating CSMA and TDMA

sessions (Fig. 6.2). The Beacon Period is synchronized with AC line cycle and is two AC line cycles

in length. The schedules advertised in the Beacon are persistent and are not changed for a number of

Beacon Periods. The CSMA protocol of HPAV devices (IEEE 1901) is diﬀerent from CSMA/CA

used by WiFi devices (IEEE 802.11). Both use a time-slotted random backoﬀ with a backoﬀ counter

(BC) and a contention window (CW). 1901 includes two more counters in its backoﬀ procedure,

i.e., backoﬀ procedure counter (BPC) and the deferral counter (DC). Using DC based backoﬀ

procedure, HPAV PLC nodes increase their contention windows not only after a collision, but also

after sensing the medium to be busy, which reduces the probability of collision. Details of the 1901

backoﬀ procedure can be found in [141]. Request to Send (RTS) and Clear to Send (CTS) delimiters

can be enabled during CSMA slots to handle hidden nodes. HPAV frames are aggregates of 512 byte physical blocks (PBs) of data: in order to reduce protocol overheads, HPAV employs two-level frame aggregation, where the data is first organized into 512 byte PBs, which are then aggregated into HPAV frames. Reception of each PB of a frame is separately acknowledged, so that the transmitter retransmits only the corrupted PBs.
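The role of the deferral counter in the 1901 backoff procedure can be sketched as a small state machine. The sketch below is a simplified illustration, not the normative procedure; the authoritative description is in the standard and in [141]. In particular, the per-stage contention window and deferral counter values in `STAGES` are the values commonly cited for the two lower priority classes and are treated here as an assumption, as is the class name `Backoff1901`.

```python
import random
from typing import Optional

# (contention window CW, initial deferral counter DC) per backoff stage.
# Assumption: commonly cited values for the lower priority classes.
STAGES = [(8, 0), (16, 1), (32, 3), (64, 15)]


class Backoff1901:
    """Simplified per-node backoff state with BPC, BC, DC and CW."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self.rng = rng or random.Random()
        self.bpc = 0  # backoff procedure counter (selects the stage)
        self._enter_stage()

    def _enter_stage(self) -> None:
        # Stages beyond the last table entry reuse the last entry.
        self.cw, self.dc = STAGES[min(self.bpc, len(STAGES) - 1)]
        self.bc = self.rng.randrange(self.cw)  # backoff counter in [0, CW-1]

    def ready_to_transmit(self) -> bool:
        return self.bc == 0

    def on_idle_slot(self) -> None:
        if self.bc > 0:
            self.bc -= 1

    def on_busy_medium(self) -> None:
        """Medium sensed busy: defer, or escalate when DC is exhausted."""
        if self.dc == 0:
            # Escalate without having attempted a transmission.
            self.bpc += 1
            self._enter_stage()
        else:
            self.dc -= 1
            if self.bc > 0:
                self.bc -= 1

    def on_collision(self) -> None:
        self.bpc += 1
        self._enter_stage()

    def on_success(self) -> None:
        self.bpc = 0
        self._enter_stage()


if __name__ == "__main__":
    node = Backoff1901(random.Random(1))
    for event in ("busy", "busy", "busy"):
        node.on_busy_medium()
        print(event, "-> BPC", node.bpc, "CW", node.cw, "DC", node.dc)
```

In this sketch the contention window grows after a busy medium as well as after a collision, which is the difference from the 802.11 procedure that the text points out.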

6.4 A Measurement Study of Enterprise PLCs

Experimental setup: Our study is based on measurements with commodity HomePlug AV hard-

ware. We use Meconet HomePlug AV mini-PCI adapters with Intellon INT6300 chipsets, which

can support 200 Mbps PHY rates. We connect the PLC adapters to ALIX 2D2 boards, which run the OpenWrt operating system. We use an open source PLC software tool named open-plc-utils, which

is developed by Qualcomm, to extract PHY and MAC-layer feedback (such as tonemaps), directly

from the Meconet HPAV adapters. Note that the measurements, analysis and solutions proposed in

this work can also be applied to newer HomePlug AV2 devices with a few modiﬁcations.

Experimental methodology: For our experiments we place our PLC nodes in various locations of

an enterprise building. We generate saturated iperf UDP traﬃc among the PLC nodes. Results we

report are averaged over multiple runs.

Metrics: We analyze the performance of PLC networks by first collecting iperf throughput statistics. We further elaborate on the per-subcarrier PLC network performance by analyzing the tonemaps extracted by the open-plc-utils tool running on PLC nodes. For a given PLC communication link and for the k-th sub-interval of the AC line cycle, the effective PHY rate can be estimated from tonemaps as [4]

\[ R_{phy}^{\{k\}} = \frac{\left[\sum_{j=1}^{N} T[j]^{\{k\}}\right] \cdot C^{\{k\}} \cdot \left(1 - B_{err}^{\{k\}}\right)}{T_s} \]

where j is the subcarrier number and N is the total number of subcarriers. T[j] is the modulation rate (i.e., bits per subcarrier) of the j-th subcarrier [4]. C is the Forward Error Correction (FEC) code rate. HomePlug AV supports FEC code rates of 1/2 and 16/21. Finally, B_err is the bit error rate and T_s is the symbol interval of OFDM communication. T_s is approximately 46 µs for HomePlug AV including all overheads [72]. The expected throughput, averaged over all the sub-intervals of the AC line cycle, can be written as

\[ T \approx (1 - F_o) \cdot \frac{\sum_{k=1}^{N_{AC}} R_{phy}^{\{k\}}}{N_{AC}} \]

Here F_o accounts for HPAV protocol overheads and N_AC is the number of sub-intervals of the AC line cycle. N_AC is 5 or 6 for USA frequencies and F_o is typically ∼0.4 based on iperf throughput measurements. In all our experiments, we observed that the FEC code rate was always 16/21 for the communication among our HPAV devices. Therefore, we assume an FEC code rate of 16/21 in the rest of the work, unless explicitly mentioned otherwise.
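The two estimates above can be sketched in a few lines of Python. This is a minimal illustration rather than the measurement code used in this study: the function names are hypothetical, a tonemap is assumed to be a plain list of bits per subcarrier (one list per AC line cycle sub-interval), and the default values (FEC code rate 16/21, T_s ≈ 46 µs, F_o ≈ 0.4) are the ones quoted in the text.

```python
def phy_rate_bps(tonemap, code_rate=16 / 21, bit_error_rate=0.0,
                 symbol_interval_s=46e-6):
    """Effective PHY rate (bits/s) for one AC line cycle sub-interval.

    tonemap: bits per subcarrier, i.e. T[j] for every subcarrier j.
    """
    bits_per_symbol = sum(tonemap)
    return bits_per_symbol * code_rate * (1.0 - bit_error_rate) / symbol_interval_s


def expected_throughput_bps(tonemaps, overhead_fraction=0.4, **phy_kwargs):
    """Average the per-sub-interval PHY rates and discount the overhead F_o."""
    if not tonemaps:
        raise ValueError("need at least one tonemap (one per sub-interval)")
    rates = [phy_rate_bps(tm, **phy_kwargs) for tm in tonemaps]
    return (1.0 - overhead_fraction) * sum(rates) / len(rates)
```

Calling `expected_throughput_bps` with one tonemap per sub-interval (5 or 6 lists here) yields the throughput estimate T in bits per second.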


Figure 6.3: Building power distribution plan

6.4.1 Many Disjoint PLC Links Compete for Channel Access

To understand how significant the number of disjoint links can become in an enterprise-level PLC network for IoT applications, we study the impact of different components of a power distribution network (e.g. phases, breakers and distribution/trunk lines) by measuring the throughput of more than 40 links (PLC transmitter-receiver pairs). The power distribution floorplan of the enterprise building where we conducted our experiments is shown in Figure 6.3. The main switchboard of the enterprise steps down the voltage from thousands to hundreds of volts, and the down-converted electric power is then distributed towards different floors of different buildings in the enterprise through multiple distribution lines or trunk lines [161] (represented by hexagonal boxes with #4 written on them). The power from the trunk lines coming into the floor of a building is then further distributed into different parts of the floor through a distribution board containing a set of circuit breakers, which divide the electrical power feed into subsidiary circuits. Each trunk line consists of 3 cables corresponding to 3 different phases, and each distribution board contains multiple breakers per phase. The letters (A-E) and numbers in the floorplan of Figure 6.3 represent some of the locations where we placed our PLC nodes. Next, we elaborate on our experiments.

Figure 6.4: (a) Link asymmetry, (b) Temporal variation in throughput over 2 days, (c) Link throughput stability CDF (45 links). [Panel (a): CDF vs. link asymmetry metric. Panel (b): throughput (Mbps) vs. time (hours) over 48 hours, averaged per second, per minute and per hour. Panel (c): CDF vs. throughput standard deviation (Mbps).]

Figure 6.5: CDF of throughputs observed in different cases. [CDF vs. throughput (Mbps) for three cases: same P, same B, same DL; same P, different B, same DL; different P, different B, same DL.]

Case 1: We observed that the performance of a PLC link operating on the same breaker and same

distribution line is mainly aﬀected by the location of PLC nodes with respect to the interfering

electrical appliances. Highly attenuating device impedances or severe device interferences can lead

to significant performance degradation (we observed a ∼6.5-fold decrease in throughput). Moreover, as shown by the CDF in Fig. 6.5, throughputs of more than 70 Mbps were observed across the tested

links approximately 75% of the time. Jitter was low, with the median being 0.2 ms and a maximum

of 2.5 ms.

Case 2: PLC nodes connected to the same (or a different) phase but different breakers achieve lower throughput than in the same-phase, same-breaker case (a ∼20-30% decrease in observed throughput, excluding the cases of high interference from electric devices). This is because signals experience higher attenuation while passing through the breaker circuitry located between the PLC nodes. We observed a maximum throughput of 63 Mbps, which is 25.6 Mbps lower (a 29% decrease) than in the previous case where nodes were connected under the same breaker. The median throughput observed was 51 Mbps and the minimum was 26 Mbps, which is higher than the minimum of the same-breaker case because we did not encounter any high interference from electric appliances in this case.

Case 3: PLC performance drops significantly (a ∼18-30 fold throughput decrease) when nodes are located on different distribution lines. Different distribution lines often make PLC connectivity impossible, due to the transformers in between. The maximum throughputs that we observed between any pair of nodes were 3 Mbps and 5 Mbps for the two directions, and the jitter varied between 2.03 ms and 5.7 ms.

Conclusions: PLC nodes connected under the same breaker in a building's power distribution network can communicate at 6.5 times higher throughput than the PLC nodes connected under two different breakers, and 18-30 times higher throughput than the PLC nodes connected to two completely different power distribution/trunk lines. This implies that enterprise PLC networks must have at least one gateway node connected under every breaker, to provide the best possible connectivity to the IoT devices connected under that breaker. As each power distribution line can contain tens to hundreds of breakers, with a multitude of IoT devices connected to a gateway under each breaker, the number of disjoint links, which consist of different source-destination pairs and may compete for the same channel simultaneously, becomes significant.

6.4.2 Enterprise PLC Channels are Highly Location Dependent

We measure the intensity of location dependence of PLC channels through a PLC link asymmetry metric A_{a,b}. Asymmetry of a PLC link depends on the channel frequency response (transfer function) between the PLC nodes communicating over that link, and it can be directly attributed to the different multipath characteristics of the powerline, which vary with the location of the PLC nodes relative to branch circuits or other connected electrical devices [171–173] (i.e. location dependent multipath characteristics). We quantify the asymmetry of a PLC link a − b as

\[ A_{a,b} = \frac{1}{N_{AC}} \sum_{k=1}^{N_{AC}} \left[ \sum_{j=1}^{N} \left| T_{a \to b}[j]^{\{k\}} - T_{b \to a}[j]^{\{k\}} \right| \right] \]

where N is the number of subcarriers, T[j] is the modulation rate of the j-th subcarrier and N_AC is the number of sub-intervals of the AC line cycle. The above equation estimates the asymmetry between the two directions of a link (a → b and b → a) as the distance between their tonemaps, averaged over all AC line cycle sub-intervals. The maximum and minimum values of A_{a,b} are 9170 (917 subcarriers × 10 bits/carrier) and 0, respectively. In Figure 6.4(a) we present the distribution of our link asymmetry metric A_{a,b}, normalized by its maximum (9170), computed from the tonemaps of 25 pairs of nodes (a, b) in the same neighborhood. We observe that for more than 50% of the links, the normalized A_{a,b} is greater than 0.1 (917 bits). The maximum throughput difference observed in asymmetric links is 15 Mbps.
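The asymmetry metric can be computed directly from the two sets of tonemaps. The sketch below is illustrative only: the function names are hypothetical, and each direction of the link is assumed to be given as a list of tonemaps (one per AC line cycle sub-interval), each tonemap being a list of bits per subcarrier with at most 10 bits per carrier.

```python
MAX_BITS_PER_CARRIER = 10  # maximum modulation rate quoted in the text


def link_asymmetry(tonemaps_ab, tonemaps_ba):
    """A_{a,b}: per-subcarrier tonemap distance, averaged over sub-intervals."""
    if not tonemaps_ab or len(tonemaps_ab) != len(tonemaps_ba):
        raise ValueError("both directions need the same number of sub-intervals")
    total = 0
    for tm_ab, tm_ba in zip(tonemaps_ab, tonemaps_ba):
        if len(tm_ab) != len(tm_ba):
            raise ValueError("tonemaps must cover the same subcarriers")
        total += sum(abs(x - y) for x, y in zip(tm_ab, tm_ba))
    return total / len(tonemaps_ab)


def normalized_asymmetry(tonemaps_ab, tonemaps_ba):
    """A_{a,b} divided by its maximum (N subcarriers x 10 bits per carrier)."""
    n_subcarriers = len(tonemaps_ab[0])
    max_value = n_subcarriers * MAX_BITS_PER_CARRIER
    return link_asymmetry(tonemaps_ab, tonemaps_ba) / max_value
```

With 917 subcarriers the normalizing constant is 9170, matching the maximum stated above.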

Figure 6.6 shows snapshots of the tonemaps of 12 diﬀerent links from a real world scenario,

where we deployed a network of 4 PLC nodes in our test environment. We observe that the same

subcarriers perform diﬀerently for diﬀerent links. If we consider the last 200 subcarriers (717-917)

for all the links of node N1, we observe the modulation is at least 6 bits per carrier (cf. Figures

6.6(a), 6.6(b), 6.6(c)). On the other hand, the last 200 subcarriers for all the links of node N2, show

lower modulation, which can be as low as 2 bits per carrier (cf. Figures 6.6(d), 6.6(e), 6.6(f)). The

modulations of N2’s links, for the ﬁrst 100 subcarriers (e.g. 1-100), are overall better compared

to the corresponding modulations of N1's links.

Figure 6.6: Tonemaps of 12 links among 4 PLC nodes in one of our PLC deployments, showing possibility of gains from SS. [Panels (a)-(l): tonemaps of links N1-N2, N1-N3, N1-N4, N2-N1, N2-N3, N2-N4, N3-N1, N3-N2, N3-N4, N4-N1, N4-N2 and N4-N3. Each panel plots modulation rate (bits per carrier, 0-11) against subcarrier ID (100-900).]

A spectrum sharing strategy could allow both N1

and N2 to transmit at the same time to their neighbors (e.g. N1-N3 and N2-N4) using only their

high-performance subcarriers. Similar observations hold for other links (tonemaps not shown here),

where certain subcarriers cannot carry data (0 modulation) and others can allow high modulations.

Conclusion: Per-subcarrier performance can vary signiﬁcantly among diﬀerent links in enter-

prise PLC networks. Therefore, the low-modulated subcarriers of one PLC link can be utilized by

other PLC links, and vice versa.

6.4.3 Enterprise PLC Channels are Pseudo-Stationary

Performance of PLCs in enterprise settings can be dynamic either due to interference from already connected appliances, or due to a multitude of electrical devices being turned on/off on a regular basis. In order to study temporal dynamics, we measure the performance of a PLC link over long time periods. Figure 6.4(b) shows a representative scenario of PLC link throughput variation over a 2-day (48-hour) period. The throughput is averaged over one-second, one-minute and one-hour time windows, respectively. We observe that the throughput can vary from 52 Mbps to 80 Mbps. The link appears to be highly bursty, which shows that some intense performance dynamics happen at small time scales; we attribute these to interference created by nearby electrical devices. The throughput variations observed at coarser time scales (minutes or hours) are attributed to human activity (e.g. connection/disconnection of new devices). The analysis of tonemaps (not shown here) also confirms the link variations with time, as we observed that the tonemaps exchanged among PLC nodes during the day were different from those during the night. However, we observed that the throughput of most PLC links remained quite stable. Figure 6.4(c) shows the CDF of the standard deviation (averaged over 10 second intervals) of the real-time throughput of the 45 different links we tested in our building. Throughput for each link was collected over 15 minute time windows. It can be observed that more than 60% of the time, the standard deviation of throughput is below 1.5 Mbps, which shows that the throughput of most PLC links remains consistent over time.

Conclusion: Most links in an enterprise PLC network are pseudo-stationary, i.e. the channel characteristics between any two PLC nodes have low temporal variability. Therefore, spectrum sharing can be realized at low control overhead.
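A stability statistic of this kind can be sketched as follows. The exact averaging procedure behind Figure 6.4(c) is not spelled out in the text, so this sketch makes one possible reading explicit as an assumption: per-second throughput samples are first averaged over 10-sample windows, and the standard deviation is then taken over those window averages. All function names are hypothetical.

```python
from statistics import mean, pstdev


def window_means(samples_mbps, window):
    """Average consecutive, non-overlapping windows (a trailing partial window is dropped)."""
    if window <= 0:
        raise ValueError("window must be positive")
    usable = len(samples_mbps) - len(samples_mbps) % window
    return [mean(samples_mbps[i:i + window]) for i in range(0, usable, window)]


def throughput_stability(samples_mbps, window=10):
    """Standard deviation (Mbps) of the window-averaged throughput of one link."""
    means = window_means(samples_mbps, window)
    if not means:
        raise ValueError("not enough samples for one window")
    return pstdev(means)


def fraction_at_or_below(values, threshold):
    """Empirical CDF value: share of entries not exceeding the threshold."""
    return sum(1 for v in values if v <= threshold) / len(values)
```

Applying `throughput_stability` to each link's trace and then `fraction_at_or_below(stds, 1.5)` gives the kind of CDF reading quoted above.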

6.5 Distributed Spectrum Sharing for HPAV PLCs

In this section, we lay the theoretical foundations of our proposed spectrum sharing (SS) strategy.

Our proposed techniques can be generalized for other PLC technologies, such as HPAV2, which

use bit-loaded OFDM at PHY layer, as our technique shares spectrum at OFDM subcarrier level.

6.5.1 Preliminary Deﬁnitions

Primary & Secondary Links. We call the links which occupy the PLC channel through the regular HPAV CSMA/CA or TDMA protocol primary (P-Link or p_{i→j}), and the links with which a primary link shares spectrum secondary (S-Link or s_{m→n}). Whenever a P-Link is established, only one S-Link can operate during that communication slot. For example, if we assume that all links are saturated (i.e., each node always has traffic to send), the S-Link which gives the maximum possible gain by sharing spectrum with an established P-Link will operate in parallel with that P-Link. Later on, we will present a ranking based strategy which each node in the network can follow locally to resolve contention for the S-Link.

Tonemaps. Let [T_{i→j}]^p_{1×N} and [T_{m→n}]^s_{1×N} be the tonemap vectors of a P-Link (i → j) and an S-Link (m → n), respectively. The difference between the tonemap vectors of the P-Link and the S-Link can then be denoted by [D_{i→j,m→n}]_{1×N} = [T_{i→j}]^p_{1×N} − [T_{m→n}]^s_{1×N}, and vice versa.

Minimum Throughput Requirement. Let us denote the number of PLC nodes in a network by E. Moreover, let us denote the minimum throughput requirement of the e-th node by T_e. Then we can represent the minimum number of bits to be modulated across a given set of OFDM subcarriers (tonemap), required to meet T_e, as τ_e = T_e / T_s. Note that for any P-Link (i → j) and S-Link (m → n) pair, our SS strategy needs to meet the throughput requirement of the P-Link only.

Allowed Tonemaps & Link Ranks. Each node e in the network will locally calculate an SS matrix using the proposed SS algorithm. The entries of the SS matrix consist of two entities, namely allowed tonemaps [ST, PT] (where [ST] is the set of subcarriers allowed to be modulated on an S-Link while a P-Link operates, and vice versa for [PT]), and link ranks r (i.e. a rank proportional to the SS gain of an S-Link when sharing spectrum with a P-Link). For each node e, there are 2 × (E − 1)(E − 2) possible P-Links which can operate in its vicinity. Moreover, for each of those possible P-Links, there are 2 × (E − 2)(E − 3) possible S-Links which can operate in parallel. Therefore, a locally computed SS matrix at any node would be of size 4 × (E − 1)(E − 2) × (E − 2)(E − 3), where each 2 × (E − 2) × (E − 3) block corresponds to the allowed tonemaps and ranks of all possible S-Links corresponding to one of the 2 × (E − 1)(E − 2) possible P-Links.
Spectrum Sharing Gain. We represent the gain G_{m→n} obtained by allowing an S-Link to operate with a P-Link as:

\[ G_{m \to n} = \left[ \sum_{[ST]} [T_{m \to n}] + \sum_{[PT]} [T_{i \to j}] \right] - \sum_{[1,N]} [T_{i \to j}] \qquad (6.1) \]
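Equation (6.1) compares the bits carried per OFDM symbol with sharing (S-Link on [ST] plus P-Link on [PT]) against the P-Link using the full spectrum alone. A minimal sketch, assuming tonemaps are lists of bits per subcarrier and that [ST] and [PT] are given as collections of 0-based subcarrier indices (the function name is hypothetical):

```python
def ss_gain(t_primary, t_secondary, pt, st):
    """Spectrum sharing gain G (bits per OFDM symbol), following Eq. (6.1).

    t_primary / t_secondary: tonemaps of the P-Link and the S-Link.
    pt / st: subcarrier indices allowed for the P-Link and the S-Link.
    """
    if set(pt) & set(st):
        raise ValueError("[PT] and [ST] must be disjoint")
    shared_bits = (sum(t_secondary[j] for j in st)
                   + sum(t_primary[j] for j in pt))
    full_spectrum_bits = sum(t_primary)
    return shared_bits - full_spectrum_bits
```

A positive value means the pair carries more bits per symbol than the P-Link would on its own; handing the S-Link subcarriers on which it is weaker than the P-Link makes the gain negative.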


6.5.2 Spectrum Sharing (SS) Algorithm

We design our SS algorithm to meet two key requirements: (a) it must take into account the minimum throughput requirements of the destination node of each possible P-Link in the network, and (b) it should involve minimal channel probing and control overhead. As our SS approach is designed to work on top of the existing user scheduling provided by current HPAV/AV2 CSMA/CA or TDMA procedures, SS is performed only when a P-Link is established and is already operating. Our SS algorithm runs locally at each node of the network, and therefore assumes that each node in the network has complete tonemap information about all other possible links in the network. Later, in Section 6.6, we explain how this can be achieved using current HPAV protocols while incurring minimal channel probing and control overhead. Moreover, for simplicity of discussion, we assume that all nodes in the network are in the same collision domain, and that there are no hidden nodes in the network (Request to Send (RTS) and Clear to Send (CTS) delimiters can be used by the regular HPAV MAC to handle hidden nodes during CSMA/CA).

6.5.2.1 Optimal SS approach:

Our optimal SS approach is described in Algorithm 2. The algorithm runs on each node of the network separately, and computes the 4 × (E − 1)(E − 2) × (E − 2)(E − 3) allowed tonemaps [ST] for all S-Links, corresponding to all possible P-Links. In words, for each P-Link and S-Link pair, the algorithm first sorts the difference vector D_{i→j,m→n} in descending order, and then assigns subcarriers to the P-Link in the order of the sorted entries D̂_{i→j,m→n}, until its minimum throughput requirement is met. The remaining subcarriers are then assigned to the S-Link. Although we did not observe such cases in our deployments, in practice a PLC network may contain some extremely bad P-Links (where the modulation of all subcarriers is very low). Following the aforementioned algorithm, bad P-Links would only share their spectrum once their own throughput requirements are met, and therefore will not starve. Next, we discuss how this SS approach can be optimized for overall network throughput and fairness in spectrum, respectively.

Algorithm 2: Optimal algorithm for distributed spectrum sharing in HPAV PLC-Nets

 1: /* Takes in a set of tonemaps for all possible links */
 2: procedure GetAllowedTonemaps_SS([T]_E)
 3:   P ← all possible P-Links
 4:   S ← all possible S-Links
 5:   for each (i → j) ∈ P do
 6:     for each (m → n) ∈ S do
 7:       [ST] ← [1, N]                                   ⊲ Allowed indices for S-Link
 8:       [PT] ← ∅                                        ⊲ Allowed indices for P-Link
 9:       t_n ← 0
10:       D_{i→j,m→n} ← [T_{i→j}]^p_{1×N} − [T_{m→n}]^s_{1×N}
11:       Î, D̂_{i→j,m→n} ← sort(D_{i→j,m→n}, descend)   ⊲ Î = indices corresponding to sorted entries
12:       while t_n < τ_n do
13:         t_n ← t_n + T_{i→j}(Î(1))
14:         [ST] ← [ST] − {Î(1)}                          ⊲ remove from set
15:         [PT] ← [PT] + {Î(1)}                          ⊲ add to set
16:         Î ← Î − {Î(1)}                                ⊲ remove from index set
17:       end while
18:     end for
19:   end for
20:
21: end procedure

Optimizing overall network throughput. Once minimum throughput requirement of a P-Link is

met, the remaining subcarriers are assigned to both P-Link and S-Link such that the total number of modulated bits is maximized, i.e. max(G_{m→n}). This requires a slight modification to Algorithm 2 (between steps 13-18), such that it keeps assigning subcarriers to the P-Link in descending order of the entries in D_{i→j,m→n}, as long as the total number of bits modulated on both links increases.

Optimizing for overall spectrum fairness. Once the minimum throughput requirement of a P-Link is met, the remaining subcarriers are assigned to both P-Link and S-Link such that the ratio of the number of bits modulated along both links approaches 1.0, i.e.

\[ \frac{\sum_{[ST]} [T_{m \to n}]}{\sum_{[PT]} [T_{i \to j}]} \approx 1.0 \]

This also requires slight modifications to Algorithm 2 (between steps 13-18).

Complexity. The aforementioned SS approaches require approximately 4 × (E − 1)(E − 2)(E − 2)(E − 3)(N log(N) + N) computations at each PLC node. The overall complexity of the algorithm can be written as O(E^4 · N · log(N)).
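The inner loop of Algorithm 2 for a single P-Link/S-Link pair, together with the throughput-maximizing variant described above, could look roughly like the sketch below. It is an illustration under stated assumptions, not the implementation evaluated in this chapter: the function name and arguments are hypothetical, subcarrier indices are 0-based, `tau` is the minimum number of bits per OFDM symbol required by the P-Link, and a bounds check is added so that the loop stops if the requirement cannot be met even with the full spectrum.

```python
def allocate_subcarriers(t_primary, t_secondary, tau, maximize_throughput=False):
    """Split subcarriers between a P-Link and an S-Link.

    Returns (pt, st): sorted index lists allowed for the P-Link and the S-Link.
    """
    n = len(t_primary)
    if len(t_secondary) != n:
        raise ValueError("tonemaps must cover the same subcarriers")

    # Sort indices by D[j] = T_p[j] - T_s[j], largest difference first, so the
    # P-Link first takes the subcarriers where it has the biggest advantage.
    order = sorted(range(n), key=lambda j: t_primary[j] - t_secondary[j],
                   reverse=True)

    pt = []
    bits = 0
    pos = 0
    # Assign subcarriers to the P-Link until its minimum requirement is met.
    while bits < tau and pos < n:
        j = order[pos]
        pt.append(j)
        bits += t_primary[j]
        pos += 1

    if maximize_throughput:
        # Keep assigning to the P-Link while that increases the total number
        # of modulated bits, i.e. while the P-Link is strictly better there.
        while pos < n and t_primary[order[pos]] > t_secondary[order[pos]]:
            pt.append(order[pos])
            pos += 1

    st = order[pos:]  # everything left over goes to the S-Link
    return sorted(pt), sorted(st)
```

Running this for every ordered P-Link/S-Link pair reproduces the two nested loops of Algorithm 2; the sort dominates the per-pair cost, which is where the N log(N) term in the complexity estimate comes from.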


6.5.2.2 Ranking of S-Links

While computing allowed tonemaps for each S-Link m → n, Algorithm 2 also assigns a rank r_mn to that S-Link proportional to its SS gain (r_mn = k · G_{m→n}, where k = 1 in our work). We will show how we use these ranks in the design of our proposed SS protocol for HPAV devices later, in Section 6.6.

6.6 Enabling Spectrum Sharing for HPAV PLCs

In this section, we show how our proposed SS strategy can be enabled in the MAC layer of current HPAV PLC devices while incurring minimal firmware level changes.

Channel probing and control overheads. As mentioned before, our SS algorithm works in

a distributed manner and runs locally at each node, such that the network level SS decisions

are eventually known to each node in the PLC network. Therefore, PLC nodes will not have to

distribute their SS decisions to other nodes in the network. However, our SS algorithm requires

tonemap information about all possible links in the network, which will incur communication

overhead. The communication overhead will be on the order of O(E^2), as each node in a PLC network will broadcast tonemap information for its (E − 1) possible links with other nodes in the network. However, due to the pseudo-stationary nature of PLC links (Section 6.4.3), this probing overhead will be minimal and will not interfere with regular data transmissions.

A channel probing interval t_probe for SS can be set by the CCo of a PLC network. The CCo can then periodically command all nodes in the network to log the tonemaps of all possible links and formulate their SS decisions. The CCo can use control-related messaging schemes already built into the HPAV MAC (e.g. Management Messages (MMEs)) for this purpose [3, 5], and t_probe can be chosen such that the exchange of control messages incurs minimal overhead and interference to data transmissions. The probing frequency (i.e. 1/t_probe) must be kept within a certain threshold in case the CCo observes some very dynamic PLC links in its network, because otherwise spectrum sharing may lead to a loss of overall network throughput due to high channel probing overhead. The CCo can also completely stop spectrum sharing throughout its network and fall back to the default HPAV MAC if the channel conditions of the PLC links in its network are not conducive to SS. Note that SS will not be performed during the exchange of control messages.

Periodic Re-evaluation of Full Spectrum: All nodes in the network will periodically disable SS and transmit across the full spectrum following the default HPAV MAC. No S-Link will operate in this case. The frequency of this periodic behavior can be chosen by the CCo of the network, based on the temporal dynamics (Section 6.4.3) of the PLC links in its network. Such periodic use of the whole spectrum will allow each node to automatically update its full-spectrum tonemaps towards other nodes in the network during regular data transmissions. The network CCo will then re-evaluate the SS decisions in its network by accessing these tonemaps as described before.

Medium Access during SS: Current HPAV MAC is centrally controlled through Beacon signals

from CCo. The Beacon signals broadcast by CCo to establish Beacon Periods (BPs) with TDMA

and CSMA slots are robust and reliable (Beacons and several other control-related messages operate

over ROBust mOdulation (ROBO) modes [72]). Next, we explain how the medium access will work

during TDMA and CSMA slots in an SS enabled HPAV MAC.

TDMA: Whenever a P-Link is scheduled to send traﬃc in a TDMA slot, the highest ranked

S-Link (according to the SS algorithm) corresponding to that P-Link will be scheduled to operate

in the same slot. In case some of the S-Links corresponding to that P-Link do not have any traﬃc

to send, the highest ranked S-Link will only be chosen from among the S-Links which are waiting

in line to send traﬃc. Therefore, the allowed tonemaps for both P-Link (i.e. [PT ]) and the selected

S-Link (i.e. [ST ]) will be chosen accordingly.
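The selection rule for the TDMA case can be sketched as a small lookup. The data layout is an assumption made for illustration: links are (source, destination) tuples, `ranks` maps a (P-Link, S-Link) pair to the rank computed by the SS algorithm, and `pending` is the set of links that currently have traffic queued. Ties are broken by comparing the link tuples, so that every node evaluating the same inputs arrives at the same choice.

```python
def select_s_link(p_link, ranks, pending):
    """Return the highest ranked S-Link with pending traffic, or None.

    ranks:   dict mapping (p_link, s_link) -> rank (proportional to SS gain).
    pending: set of links that have traffic waiting to be sent.
    """
    candidates = [(rank, s_link)
                  for (p, s_link), rank in ranks.items()
                  if p == p_link and s_link in pending]
    if not candidates:
        return None  # P-Link keeps the whole spectrum
    # max() compares the rank first and the link tuple second (tie-break).
    return max(candidates)[1]
```

The allowed tonemaps [PT] and [ST] stored for the chosen pair are then the ones applied in that slot.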

CSMA/CA: In case of TDMA, the selection of allowed tonemaps [PT] and [ST] is straightforward, since the P-Link and S-Link connections can be specifically scheduled by the CCo to

operate in the same slot. However, two major issues arise in case of CSMA: (a) How will a P-Link

know which of the possible S-Links have traﬃc to send, so it can select its [PT ] accordingly?, and

(b) Assuming issue (a) is resolved, how will the S-Link know that a P-Link is established so that it

can select its [ST ] according to [PT ]? In following steps, we discuss how medium access and the

consequent link interactions will diﬀer from the regular HPAV CSMA/CA.


(i) Before broadcasting Beacon signals, the CCo identiﬁes all links with pending traﬃc, and then

shares that information with each node in its network through HPAV control-related messaging.

This resolves issue (a).

(ii) Once a P-Link gets medium access, the remaining nodes go into their backoﬀ stages,

following the regular CSMA/CA procedure. Afterwards, the source node of the P-Link enables the

Multicast Flag (MCF) in the Start-of-Frame Control (SOF) ﬁeld of its MAC Protocol Data Unit

(MPDU) [72] while establishing its connection with the destination node, so that all remaining

nodes in the network can extract the source and destination IDs of the P-Link from this SOF

delimiter field. This resolves issue (b) later in step (iv).

(iii) Both source and destination nodes of the P-Link select [PT ] corresponding to the highest

ranked S-Link among the S-Links with pending traﬃc (given the information received in step (i)),

and use the unshared subcarriers for transmission/reception, while disabling the shared ones.

(iv) After knowing the P-Link information from SOF delimiter in P-Link’s broadcast MPDU

frame, the source and destination nodes of the highest ranked S-Link with pending traﬃc enable

[ST ] and disable [PT ]. The S-Link then operates in parallel with the P-link.

(v) Nodes belonging to any active P-Link come out of their SS state (i.e. re-enable all subcarriers

and enter again into contention for whole spectrum) when the transmission between them is ﬁnished.

As soon as P-Link’s transmission ends, the S-Link ﬁnishes its transmission as well, and comes out

of spectrum sharing state (as the transmission frame ends).
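Steps (ii)-(iv) amount to each node deciding, from the P-Link identity it decodes in the SOF delimiter, which subcarriers it may modulate. A minimal sketch of that per-node decision, assuming the S-Link for this P-Link has already been chosen by rank and that [PT] and [ST] are lists of 0-based subcarrier indices (the function and role names are hypothetical):

```python
def tone_mask_for(node, p_link, s_link, pt, st, n_subcarriers):
    """Return (role, enabled_subcarriers) for one node after hearing the SOF.

    p_link / s_link: (source, destination) tuples; s_link may be None when no
    S-Link with pending traffic exists for this P-Link.
    """
    if node in p_link:
        if s_link is None:
            # Nothing to share with: the P-Link keeps the full spectrum.
            return "primary", list(range(n_subcarriers))
        # Step (iii): use the unshared subcarriers, disable the shared ones.
        return "primary", sorted(pt)
    if s_link is not None and node in s_link:
        # Step (iv): enable [ST] and disable [PT].
        return "secondary", sorted(st)
    # Every other node stays in its regular CSMA/CA backoff.
    return "backoff", []
```

When the P-Link's transmission ends (step (v)), all nodes would simply revert to the full tone mask and contend for the whole spectrum again.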

Disabling modulation of subcarriers: It is easy to disable subcarriers in HPAV devices. In the current HPAV PHY, each subcarrier is independently modulated based on the channel characteristics between transmitter and receiver (i.e. bit-loading). The HPAV PHY allows dynamic notching of specific subcarriers by turning them off, which can be achieved by making soft changes to the device's tone mask (enabled subcarriers) [3, 72]. However, this functionality currently requires proprietary access to the firmware supplied by vendors. Notches up to 30 dB deep are possible in HPAV, and typically 4 additional subcarriers on each side of a notch are turned off to achieve a 30 dB notch depth, resulting in about 200 kHz of guard-band overhead for each notch.
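As a rough consistency check on the guard-band figure, assuming the HomePlug AV subcarrier spacing of about 24.414 kHz (a value not stated in this chapter), turning off 4 subcarriers on each side of a notch costs:

```latex
\[
  2 \times 4 \times 24.414\,\text{kHz} \approx 195.3\,\text{kHz} \approx 200\,\text{kHz}
\]
```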


6.7 Implementation and Evaluation

We evaluate our proposed SS strategy through trace-driven simulations, using tonemap traces collected from seven different 4-node PLC deployments in our enterprise (Figure 6.6 represents Deployment #1). We implement and evaluate our proposed SS strategies on top of the HPAV CSMA protocol only. Our simulations do not take into account frame aggregation procedures, bit loading of Ethernet frames inside PLC frames, management messages and channel errors, since these parameters are proprietary vendor-specific implementation information. In our simulations, we choose collision duration τ_c = 2920.64 µs, duration of successful transmission τ_s = 2542.64 µs and frame length F_l = 2050 [140, 141]. The contention window (CW) and deferral counter (DC) values used for the HPAV CSMA/CA backoff stages are [8, 16, 32, 64] and [0, 1, 3, 15], respectively. We assume that there are no hidden terminals, and that transmission failures are only due to collisions.
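For readers unfamiliar with the 1901 backoff, the sketch below shows how the CW and DC values above would drive a per-station backoff state machine. It is a simplified reading of the procedure described in Section 6.3 and in [141], written for illustration only and not taken from our simulator; the class and method names are hypothetical, and the backoff procedure counter is assumed to select the stage, saturating at the last one.

```python
import random

CW_PER_STAGE = [8, 16, 32, 64]   # contention window per backoff stage
DC_PER_STAGE = [0, 1, 3, 15]     # initial deferral counter per backoff stage


class Backoff1901:
    """Per-station backoff state: BPC, CW, DC and BC."""

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random.Random()
        self.bpc = 0
        self._enter_stage()

    def _enter_stage(self):
        stage = min(self.bpc, len(CW_PER_STAGE) - 1)
        self.cw = CW_PER_STAGE[stage]
        self.dc = DC_PER_STAGE[stage]
        self.bc = self.rng.randrange(self.cw)  # uniform in [0, CW - 1]

    def ready_to_transmit(self):
        return self.bc == 0

    def on_idle_slot(self):
        if self.bc > 0:
            self.bc -= 1

    def on_busy_medium(self):
        if self.dc == 0:
            # Deferral counter exhausted: move to the next stage (larger CW)
            # without having attempted a transmission.
            self.bpc += 1
            self._enter_stage()
        else:
            self.dc -= 1
            if self.bc > 0:
                self.bc -= 1

    def on_collision(self):
        self.bpc += 1
        self._enter_stage()

    def on_success(self):
        self.bpc = 0
        self._enter_stage()
```

The `on_busy_medium` branch is what distinguishes this from 802.11: a station can enlarge its contention window after merely sensing the medium busy, before it has ever collided.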

6.7.1 Evaluation Metrics

We use following metrics to evaluate the performance of our proposed SS approaches:

Throughput: We calculate the normalized throughput Thr for each link m → n in our simulation as follows:

\[ Thr = 100 \cdot \frac{\left[\sum_{i=1}^{\#SuccessTransmissions} SF_i\right] \cdot [\text{Frame length}]}{\text{Total simulation time}} \]

where SF_i represents the fraction of spectrum utilized at the i-th transmission, SF_i = \sum_{j=1}^{N} [T_{m \to n}] / 9170, such that max(SF_i) = 1 and min(SF_i) = 0.
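The normalized throughput can be computed from a list of successful transmissions as sketched below. The function names are hypothetical; the sketch assumes that each successful transmission is described by the tonemap actually used (after spectrum sharing masks are applied) and that the frame length and total simulation time are expressed in the same unit.

```python
MAX_BITS_PER_SYMBOL = 9170  # 917 subcarriers x 10 bits per carrier


def spectrum_fraction(tonemap_used):
    """SF_i: fraction of the maximum bits per symbol used by one transmission."""
    return sum(tonemap_used) / MAX_BITS_PER_SYMBOL


def normalized_throughput(spectrum_fractions, frame_length, total_time):
    """Thr = 100 * (sum of SF_i) * frame length / total simulation time."""
    if total_time <= 0:
        raise ValueError("total simulation time must be positive")
    return 100.0 * sum(spectrum_fractions) * frame_length / total_time
```

A link that transmitted back to back on the full spectrum at 10 bits per carrier for the whole simulation would score 100.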

Fairness: We evaluate the fairness of different SS strategies by calculating Jain's fairness index (JFI) [58] and Fairly Shared Spectrum Efficiency (FSSE) [39]. An allocation strategy is maximally fair if all nodes in a PLC-Net are allocated the same throughput, in which case JFI = 1. On the other hand, the FSSE of a PLC-Net gives the spectrum efficiency (SE) of the PLC node with minimum throughput in the network. In case of maximum spectrum fairness, FSSE is equal to the SE of the whole network. For a PLC network, we define its SE to be its average throughput, and its FSSE to be its minimum throughput.
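Both fairness metrics reduce to a few lines once per-node (or per-link) throughputs are available. The sketch below uses the standard definition of Jain's index, JFI = (Σx)² / (n · Σx²), and the SE and FSSE definitions given above; the function names are hypothetical.

```python
def jain_fairness_index(throughputs):
    """Jain's fairness index: 1.0 when all entries are equal, 1/n at worst."""
    n = len(throughputs)
    total = sum(throughputs)
    squares = sum(x * x for x in throughputs)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero throughput")
    return total * total / (n * squares)


def spectrum_efficiency(throughputs):
    """SE of the network, defined here as its average throughput."""
    return sum(throughputs) / len(throughputs)


def fsse(throughputs):
    """FSSE, defined here as the minimum throughput in the network."""
    return min(throughputs)
```

When every node gets the same throughput, `fsse` equals `spectrum_efficiency` and the index is 1, which is the maximally fair case described in the text.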

Next, we show how our optimal SS strategies can achieve higher overall network throughput (when optimized for maximum throughput), and maintain higher fairness (when optimized for fairness), while meeting per-link minimum bandwidth requirements. We test the following two scenarios:

Per-link (Local) Minimum Te: In this scenario, each P-Link uses a percentage of its available bandwidth as its minimum Te. Figures 6.7(a)-6.7(c) show how the net throughput, JFI and FSSE of the seven deployments change as Te increases, when SS is optimized for maximum net throughput. The X-axis starts from 10%, i.e. when Te is 10% of the available bandwidth. We observe net throughput gains of up to 60%, and per-link throughput gains as high as 250% (Fig. 6.9(a)). However, this comes at the expense of large decrements in overall fairness. Figures 6.8(a)-6.8(c) show the results for the scenario in which SS is optimized to maintain fairness in the network. In this case, we observe net throughput gains of up to 14% and per-link throughput gains as high as 110% (Fig. 6.9(b)), while incurring a much lower decrease in overall fairness, leading to 30% and 87% better JFI and FSSE values (the JFI and FSSE of some deployments even improve by up to 1% and 6%, respectively, for some values of Te).

Figure 6.7: Testing scenario with per-link (local) minimum Te, optimizing for net throughput (#1-#7, top-bottom). [Panels: (a) Network throughput variation, (b) Jain's Fairness Index variation, (c) FSSE variation. Each panel plots the % change in throughput, JFI or FSSE, respectively, against the % of each link's bandwidth used as local minimum requirement (10-100%).]

Figure 6.8: Testing scenario with per-link (local) minimum Te, optimizing overall fairness (#1-#7, top-bottom)
Panels: (a) Network throughput variation. (b) Jain's Fairness Index variation. (c) FSSE variation.
Axes: X: % of each link's bandwidth used as local minimum requirement (10 to 100); Y: % change in throughput, JFI and FSSE, respectively.

Figure 6.9: Per-link throughput changes for Deployment#1 (testing scenario with per-link (local) minimum Te requirement)
Panels: (a) Overall network throughput optimization policy. (b) Fairness optimization policy.
Axes: X: % of each link's bandwidth, used as local minimum requirement; Y: % throughput change (increase/decrease) in link. Legend: L1 to L12.

Network-wide Minimum Te: In this scenario, a percentage of the maximum number of bits that can
be modulated over any link (i.e. 10 · N = 9170 bits) is used as the minimum Te requirement for each
P-Link. Such a scenario can arise when a PLC network is required to meet the bandwidth requirements

of a certain type of application. Figures 6.10(a)-6.10(c) show how net throughput, JFI and FSSE of

the seven deployments change as Te increases, when SS is optimized for maximum net throughput.
We can observe net throughput gains of up to 56%, and per-link throughput gains as high as 180%

(Fig. 6.12(a)), which in most cases come at the expense of a large decrement in fairness performance

(except for Deployment#7, whose FSSE improves by up to 14% for some values of Te (Fig. 6.10(c))). Figures

Figure 6.10: Testing scenario with network-wide minimum Te requirement, optimizing for net throughput (#1-#7, top-bottom)
Panels: (a) Network throughput variation. (b) Jain's Fairness Index variation. (c) FSSE variation.
Axes: X: % of maximum modulation (i.e. 9170 bits) used as minimum Te requirement (0 to 100); Y: % change in throughput, JFI and FSSE, respectively.

Figure 6.11: Testing scenario with network-wide minimum Te requirement, optimizing for overall fairness (#1-#7, top-bottom)
Panels: (a) Network throughput variation. (b) Jain's Fairness Index variation. (c) FSSE variation.
Axes: X: % of maximum modulation (i.e. 9170 bits) used as minimum Te requirement (0 to 100); Y: % change in throughput, JFI and FSSE, respectively.

6.11(a)-6.11(c) show results for the scenario in which SS is optimized to maintain fairness in the network.

In this case, we observe net throughput gains of up to 15% and per-link throughput gains as high

as 100% (Fig. 6.12(b)), while incurring a much lower decrease in fairness performance, leading to

25% and 60% better JFI and FSSE values, respectively (JFI and FSSE of some deployments even improve by up to 4%

for some values of Te).

Effect of the changes in PLC channels: In real world scenarios, PLC channels can change due

to interference caused by electrical appliances or power distribution components, as discussed in

Figure 6.12: Per-link throughput changes for Deployment#1 (testing scenario with network-wide minimum Te requirement)
Panels: (a) Overall network throughput optimization policy. (b) Fairness optimization policy.
Axes: X: % of maximum modulation (i.e. 9170 bits) used as minimum Te requirement; Y: % throughput change (increase/decrease) in link. Legend: L1 to L12.

Figure 6.13: Computational and communication complexity of our spectrum sharing approach as number of PLC nodes increase
Panels: (a) Possible links as nodes increase. (b) Complexity of SS algorithm. (c) Communication complexity (Mbs). (d) Communication complexity (secs).
Axes: X: Number of PLC nodes (0 to 40) in all panels; Y: (a) possible number of S- and P-Links (legend: Possible S-Links per P-Link, Possible P-Links), (b) computational complexity of SS matrix (s) per node, (c) communication complexity of channel probing (Mb) per node, (d) communication complexity of channel probing (s) per node (legend: 10 Mbps to 65 Mbps in steps of 5 Mbps).

§6.4. Therefore, two communicating HPAV PLC devices must continuously exchange and update

the tonemaps between them to update the modulation schemes for each subcarrier. This is the

default behavior of all HPAV based PLC devices, where each device follows a tonemap update

protocol (that runs over one of the ROBust mOdulation (ROBO) modes [72]) to update tonemaps

while communicating with other nodes in a network. Without regular tonemap updates, PLC links

risk high packet loss, especially when the channel between two nodes changes significantly.

Similarly, when SS is enabled, the nodes of any S- or P-Link update their tonemaps by simply
following the regular HPAV protocol. To fully utilize SS, however, all PLC nodes in the network
must share their latest tonemaps with the other nodes in the network and then recompute the local
SS matrices. This incurs both computational and communication overheads, as
discussed in §6.5 and §6.6. Figs. 6.13(a) - 6.13(d) show how both overheads change as the number
of PLC nodes in a network increases. We observe that both overheads grow with the number
of nodes, and that the computational complexity of the SS algorithm grows particularly
quickly. Therefore, spectrum sharing will only be beneficial when most PLC channels in a
network are pseudo-stationary, as this will significantly reduce the frequency of channel probing
and recalculation of the local SS matrices.

To understand how frequent changes in PLC channels in a network can impact the spectrum
sharing gains, we run simulations where we manually add noise to real tonemap data obtained
from 3 different deployments. We introduce and evaluate the impact of two parameters in our
simulations: (1) the probability (range 0.05 through 1) that the channel will change during a specific
communication instance between two nodes, and (2) the standard deviation (STD) (range 1 through
3) of the variations introduced in subcarriers (this value stays the same for all subcarriers during a
specific simulation) whenever the channel changes. In these simulations, we implemented the “net
throughput optimization” SS strategy mentioned earlier in this section, where we used a network-wide
minimum throughput requirement of Te = 25% of the maximum possible modulation (i.e.
9170/4 = 2292.5 bits). In our simulations, we measure the normalized throughput and the percentage
change in throughput corresponding to 4 different cases: (1) channels do not change at all, (2) channels
change, tonemaps are updated between communicating nodes, tonemaps are shared between all nodes and
subcarriers are optimally reassigned based on the SS algorithm, (3) channels change and tonemaps are updated
between communicating nodes, but the subcarriers are not reassigned (i.e. the assignment stays intact),
and (4) channels change but tonemaps are not updated between the communicating nodes. Figs.
6.14 - 6.20 show the results corresponding to all of the 3 real-world deployments. We observe
that throughput deteriorates significantly in case 4, as not updating tonemaps leads to a significant
number of unsuccessful transmissions. We discuss case 4 for the sake of completeness, as it never

Figure 6.14: Normalized throughputs observed for different cases in deployment #1 (STD of noise 1, 2 and 3 from left-right)
Axes: X: Probability that channel will change (0.05 to 1); Y: Normalized throughput. Legend: Case 1, with SS; Case 1, without SS; Case 2, with SS; Case 2, without SS; Case 3, with SS; Case 4, with SS.

Figure 6.15: Percentage change in throughput for different cases in deployment #1 (STD of noise 1, 2 and 3 from left-right)
Axes: X: Probability that channel will change (0.05 to 1); Y: Percentage change in throughput (%). Legend: Case 1, with SS; Case 2, with SS; Case 3, with SS.

Figure 6.16: Normalized throughputs observed for different cases in deployment #2 (STD of noise 1, 2 and 3 from left-right)
Axes: X: Probability that channel will change (0.05 to 1); Y: Normalized throughput. Legend: Case 1, with SS; Case 1, without SS; Case 2, with SS; Case 2, without SS; Case 3, with SS; Case 4, with SS.

Figure 6.17: Percentage change in throughput for different cases in deployment #2 (STD of noise 1, 2 and 3 from left-right)
Axes: X: Probability that channel will change (0.05 to 1); Y: Percentage change in throughput (%). Legend: Case 1, with SS; Case 2, with SS; Case 3, with SS.

occurs in either the regular or the SS-enabled HPAV protocol. Interestingly, we observe that case 3 exhibits

significant SS gains even though the latest tonemaps were not shared between all nodes and the

subcarriers were not optimally reassigned. However, the gains in case 3 are less than the optimal

Figure 6.18: Normalized throughputs observed for different cases in deployment #3 (STD of noise 1, 2 and 3 from left-right)
Axes: X: Probability that channel will change (0.05 to 1); Y: Normalized throughput. Legend: Case 1, with SS; Case 1, without SS; Case 2, with SS; Case 2, without SS; Case 3, with SS; Case 4, with SS.

Figure 6.19: Percentage change in throughput for different cases in deployment #3 (STD of noise 1, 2 and 3 from left-right)
Axes: X: Probability that channel will change (0.05 to 1); Y: Percentage change in throughput (%). Legend: Case 1, with SS; Case 2, with SS; Case 3, with SS.

Figure 6.20: Percentage loss in throughput of case 3 compared to case 2 as the probability of change in PLC channels increases
Panels: (a) Deployment #1. (b) Deployment #2. (c) Deployment #3.
Axes: X: Probability that channel will change (0 to 1); Y: Percentage loss in throughput (%). Legend: STD of noise = 1; STD of noise = 2; STD of noise = 3.

gains in case 2, and they decrease as the probability of channel change and/or the STD of the variations

increases, as shown by the results in Figs. 6.20(a) - 6.20(c).
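
The channel-variation model used in these simulations can be sketched as follows. This is a minimal Python illustration, not our simulator: the function names, the zero-mean Gaussian noise, the clamping of each subcarrier to 0-10 bits, the randomly generated starting tonemap, and the all-or-nothing delivery rule for stale tonemaps are all assumptions made for the example. The two constants follow from 10 · N = 9170 bits.

```python
import random

MAX_BITS_PER_SUBCARRIER = 10   # from 10 * N = 9170 bits
NUM_SUBCARRIERS = 917          # N implied by 9170 / 10


def minimum_requirement(fraction):
    """Network-wide minimum Te as a fraction of the maximum modulation (bits)."""
    return fraction * MAX_BITS_PER_SUBCARRIER * NUM_SUBCARRIERS


def perturb_tonemap(tonemap, change_prob, noise_std, rng):
    """With probability change_prob, add noise with the same STD to every subcarrier."""
    if rng.random() >= change_prob:
        return list(tonemap)  # channel did not change in this instance
    noisy = []
    for bits in tonemap:
        value = round(bits + rng.gauss(0.0, noise_std))  # assumed Gaussian noise
        noisy.append(max(0, min(MAX_BITS_PER_SUBCARRIER, value)))
    return noisy


def delivered_bits(assumed, actual):
    """Assumed toy rule: a subcarrier delivers its assumed bits only if the
    current channel still supports at least that many bits."""
    return sum(a for a, c in zip(assumed, actual) if a <= c)


rng = random.Random(0)
# Hypothetical starting tonemap; the simulations in the text use measured tonemaps.
old = [rng.randint(0, MAX_BITS_PER_SUBCARRIER) for _ in range(NUM_SUBCARRIERS)]
new = perturb_tonemap(old, change_prob=0.5, noise_std=2, rng=rng)

print("Te at 25%:", minimum_requirement(0.25))  # 2292.5 bits
print("tonemaps updated (cases 2 and 3):", delivered_bits(new, new))
print("tonemaps stale (case 4):", delivered_bits(old, new))
```

Under this toy rule, a stale tonemap can never deliver more than it assumed, which is one simple way to see why case 4 loses throughput. Cases 2 and 3 differ only in whether subcarriers are reassigned afterwards, which the sketch does not model.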

6.8 Advantages of Multi-Hop Routing in PLCs

HomePlug PLC devices currently do not support multi-hop communication [3, 5]. However,

direct link communication in PLC networks can often be impossible or show very low throughput,

due to highly location dependent multipath attenuation and/or interference from appliances.

Next, we explore if multi-hop routing can improve throughput and connectivity in large building

settings, such as enterprises, through real world experiments.

For our evaluation, we use the optimized link state routing protocol (OLSR) [29], which is
a table-driven proactive link-state routing protocol and has been widely used in 802.11 wireless
mesh networks. For our testbed experiments, we first port the open-source OLSR and ETT
implementations [1, 2] to our OpenWrt boards. Then, we deploy 9 PLC nodes in various topologies in
our floorplan (Fig. 6.3) and evaluate the routing performance of the PLC-Net. Our results show
that routing can significantly improve PLC-Net performance in scenarios where certain PLC links
perform very poorly. We identified such a scenario during the communication between PLC nodes
N9 and N6, which were located at different breakers but on the same distribution line (cf. Fig. 6.3).
Figure 6.21 shows the UDP throughput between PLC nodes N9 and N6 over a one-minute
window, with OLSR enabled and disabled. When OLSR is turned on, UDP throughput for
N9-N6 and N6-N9 is 5.6 and 4.5 times higher, respectively, than when OLSR is
off. We observe that the direct communication is affected by electrical devices (lamps, phone chargers,
monitors) between N9 and N6, which interfere with the PLC network. When OLSR is enabled,
N9 and N6 communicate through node N7 or N8, avoiding such interference. The temporal
throughput variations shown in Fig. 6.21 are attributed to the interference dynamics, which cause
OLSR to change routes periodically. We make the same observations for TCP traffic (Table 6.1).
When OLSR is on, TCP throughput is up to 3.6× higher than when OLSR is off.
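
The improvement factors quoted above follow directly from the throughputs reported in Table 6.1. The short Python sketch below recomputes them. The dictionary layout and the helper name are only illustrative.

```python
# Throughputs in Mbps from Table 6.1, stored as (OLSR on, OLSR off).
THROUGHPUT_MBPS = {
    ("UDP", "N9->N6"): (9.5, 1.7),
    ("UDP", "N6->N9"): (2.7, 0.6),
    ("TCP", "N9->N6"): (4.2, 1.4),
    ("TCP", "N6->N9"): (1.8, 0.5),
}


def olsr_gain(traffic, flow):
    """Ratio of throughput with OLSR on to throughput with OLSR off."""
    on, off = THROUGHPUT_MBPS[(traffic, flow)]
    return on / off


for traffic, flow in THROUGHPUT_MBPS:
    print(f"{traffic} {flow}: {olsr_gain(traffic, flow):.1f}x")
```

Rounded to one decimal place, the UDP ratios are 5.6 and 4.5, and the largest TCP ratio is 3.6, in agreement with the text.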

Figure 6.21: Throughput with OLSR on/off for 60 secs
Axes: X: Time (secs) (0 to 60); Y: Throughput (Mbps) (0 to 22). Legend: N9->N6 (OLSR); N9->N6 (OLSR Disabled); N6->N9 (OLSR); N6->N9 (OLSR Disabled).

Conclusion: Mesh routing can signiﬁcantly boost PLC-Net performance in scenarios where

direct PLC links perform very poorly and multi-hop communication is required. We believe that

combining multi-hop routing with ﬁne grained spectrum sharing can potentially improve PLC

network performance even further, especially in scenarios where direct PLC links perform poorly.

We will pursue this direction in the future.

Table 6.1: UDP and single-flow TCP throughput and jitter with OLSR on/off (jitter is reported by iperf only for UDP traffic)

Traffic   Flow                Thr (Mbps)   Jitter (ms)
UDP       N9→N6 (OLSR on)     9.5          5.9
UDP       N9→N6 (OLSR off)    1.7          17.8
UDP       N6→N9 (OLSR on)     2.7          11.6
UDP       N6→N9 (OLSR off)    0.6          18.7
TCP       N9→N6 (OLSR on)     4.2          -
TCP       N9→N6 (OLSR off)    1.4          -
TCP       N6→N9 (OLSR on)     1.8          -
TCP       N6→N9 (OLSR off)    0.5          -

6.9 Conclusions

PLC technology has the potential to improve connectivity and allow for large-scale sensing,

control and automation applications at low cost, without any need for dedicated network cabling.

In this work, we make the following contributions. First, we conduct an extensive measurement study

of PLCs in a real enterprise environment using COTS HPAV PLC devices, based on which we

conclude that spectrum sharing (not supported by existing PLC standards) can signiﬁcantly beneﬁt

enterprise level PLC mesh networks. Second, we propose, implement and evaluate a spectrum

sharing scheme, and show that ﬁne-grained distributed spectrum sharing can signiﬁcantly boost

the aggregated and per-link throughput performance by up to 60% and 250%, respectively, by

allowing multiple PLC links to communicate concurrently, while requiring only a few modifications

to the existing HPAV devices and protocols. We believe that combining multi-hop routing with

fine grained spectrum sharing can potentially improve PLC network performance even further,

especially in scenarios where direct PLC links perform poorly. This is a possible future

research direction.

CHAPTER 7

CONCLUSIONS AND FUTURE WORK

In this chapter, I provide an overview of some future research directions I have been exploring

during my PhD and conclude my dissertation.

7.1 Future Work

7.1.1 WiFi Signals Based Typing Biometrics

7.1.1.1 Motivation

Typing behavior based biometrics provide an extra level of security on top of traditional
passphrase and PIN based authentication schemes, which are often vulnerable to shoulder surfing,
keylogging malware and video based attacks [89, 90, 115, 116, 163, 167]. All such schemes are
based on the hypothesis that every user has a unique and consistent typing behavior, and that it is
difficult for an adversary to reproduce that behavior. Existing typing biometric schemes have used
modalities such as time delays between key presses or video of hands as the user types, to capture
the uniqueness in typing behavior. This work concerns the problem of developing a new method for
obtaining typing biometrics, which captures the uniqueness of a user's typing behavior by leveraging
the changes caused in CSI signals due to motion of the user's hands and fingers during typing.

7.1.1.2 Challenges

The key technical challenge is to extract CSI based typing behavior features such that they
are robust to changes in position and orientation of a user's laptop, as well as resilient to static
changes in the environment (e.g. changes in arrangement of room furniture). In our preliminary
study, we collected over 6000 samples corresponding to 8 different words from 10 volunteers. For
user identification, our current scheme is able to achieve accuracies of up to 90% with passwords
as short as 9 letters. For user authentication, our current scheme can achieve Equal
Error Rates (EER) of less than 15%. However, these accuracies are not enough for authentication
and identification purposes. We have observed that longer and more complex passwords lead to higher
authentication and identification accuracies. The increased accuracy for longer and more complex
words can be attributed to the fact that those words are harder and take longer to type
than simple words, which makes each user's typing behavior more distinguishable from
that of other users. We believe that typing longer passwords or sentences can significantly improve
the uniqueness of CSI based biometrics. Results from our preliminary study have shown that
identification accuracies of up to 97.5% and EERs of less than 5% can be achieved for identification
and authentication purposes, respectively, where users typed a known sequence of 8 words in the form
of a sentence.
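
For reference, the EER is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). The Python sketch below estimates it from two lists of match scores. The score convention (a higher score means the attempt looks more like the genuine user), the threshold sweep over observed scores, and the sample scores are assumptions for illustration only. They are not our evaluation code or data.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping the decision threshold over all observed scores.

    An attempt is accepted when its score is >= threshold. The function returns
    the mean of FAR and FRR at the threshold where the two rates are closest.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # FAR: fraction of impostor attempts that would be accepted.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # FRR: fraction of genuine attempts that would be rejected.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores: genuine attempts mostly score higher than impostor attempts.
genuine = [0.91, 0.84, 0.78, 0.66, 0.42]
impostor = [0.12, 0.25, 0.33, 0.47, 0.70]
print("EER estimate:", equal_error_rate(genuine, impostor))
```

With few samples the estimate is coarse, because FAR and FRR can only change in steps of one over the number of attempts.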

7.1.2 Eﬀective Fusion of Orthogonal Components in WiFi Signals Subspace For Improved

Activity Recognition

7.1.2.1 Motivation

Many WiFi signals based human activity recognition schemes have been proposed in the last 5-6

years. However, a major limitation of all existing schemes is that they are highly environment (e.g.

the position of WiFi transceivers, location of the individual, etc.) and individual dependent, and

therefore, do not work in diﬀerent environments and with diﬀerent individuals. Some recent schemes

have tried to address this issue [60, 139, 147]. Wang et al. have proposed a CSI-speed model, which

quantiﬁes the correlation between the CSI value dynamics and human movement speeds, and a CSI-

activity model, which quantiﬁes the correlation between the movement speeds of diﬀerent human

body parts and a speciﬁc human activity. Based on these two models, they quantitatively build the

correlation between CSI value dynamics and a speciﬁc human activity. However, their models are

only able to recognize macro-level daily human activities (e.g. walking, running and opening a

door/fridge) and not ﬁner-grained gestures (e.g. ﬂick, push and pull). Virmani et al. propose a novel

translation function, which automatically generates virtual samples for all gestures in all possible

conﬁgurations using the training samples provided for all gestures in only one conﬁguration. Using

the virtual samples corresponding to each speciﬁc conﬁguration, they create machine learning

models for every conﬁguration. To recognize gestures of a user at runtime, as soon as the user

performs a gesture, their scheme ﬁrst automatically estimates the conﬁguration of the user and then

evaluates the gesture against the classiﬁcation model corresponding to that estimated conﬁguration.

However, their technique requires ﬁne-grained smartphone based calibration of every gesture in the

learning conﬁguration, which is highly dependent not only on the individual but also the position

of WiFi transceivers, thus rendering their scheme diﬃcult to realize. Jiang et al. have proposed

a deep-learning based activity recognition framework that, according to their claim, can remove

the environment and subject speciﬁc information contained in the activity data and can extract

environment/subject-independent features shared by the data collected on diﬀerent individuals

under diﬀerent environments. Their scheme converts the CSI data obtained from diﬀerent TX-RX

antenna streams to “images” and then directly (blindly) feeds that multi-dimensional time-series

data into the deep-learning network. However, there is a major issue in their pipeline. We have

empirically observed that the eﬀect of the same human activity on a speciﬁc WiFi subcarrier and

TX-RX antenna stream is highly location and individual dependent. Basically, even if the activity’s

eﬀect on the overall WiFi subspace is similar, the eﬀect on individual CSI streams can be very

different. In computer vision terms, the way an activity affects WiFi signals in

different environments is like a cat that looks like a normal cat in one environment, but has its

eyes and toes swapped in another. Most deep learning frameworks will not be able to identify

such a cat. One major reason why Jiang et al. are able to achieve a reasonable (up to 75%)

activity recognition accuracy is that they place the router and WiFi receiver in almost the same

manner for every different location and individual, which leads to similar variations in each specific

CSI stream/dimension. Our work concerns the problem of developing a significantly improved

WiFi based human activity and gesture recognition scheme that can eﬀectively fuse the activity

related information present in the signal subspace formed by diﬀerent WiFi TX-RX streams, to get

more resilient activity speciﬁc features.

7.1.2.2 Challenges

The key technical challenge here is to extract CSI based features of different human activities
that are environment and individual independent. Our hypothesis is that an activity performed by
different individuals in different locations affects the WiFi subspace similarly. If we can effectively
combine/fuse the information in different orthogonal components in the WiFi subspace, we can
improve the activity recognition accuracies. To achieve this, we have tried different techniques to
fuse the information from different orthogonal components present in the WiFi subspace formed
by the CSI signals corresponding to different activities/gestures. By orthogonal components of the WiFi
subspace formed by the CSI signals corresponding to different activities, we mean the orthogonal
components obtained after applying Principal Component Analysis (PCA) on a specific multi-dimensional
CSI sample corresponding to that activity/gesture. In our previous work 2, we only
kept the top few PCA projections of the CSI data for building classifiers and discarded the rest, which
we think is not a good idea when the aim is to develop a generic activity/gesture recognition
scheme, as we believe that the rest of the PCA projections also contain important activity/gesture
related information. However, we cannot simply add those projections to our feature set either,
as they may contain noise or unrelated variations. Moreover, as we discussed before, the PCA
projections can change positions depending on the location and the individual performing the
activity (referring to the cat's eyes and toes example in 7.1.2.1).

Recently, we have tried combining consecutive PCA projections by ﬁrst max-min normalizing

each projection and then taking the magnitude to obtain a single timeseries/waveform, just like we

take magnitude of readings obtained from diﬀerent dimensions of an accelerometer. Our hypothesis

behind this step was that activity recognition accuracies should increase as we fuse information from

successive PCA projections (in descending order of variance). To test this, we collected over 13000

samples corresponding to several different activities (namely ‘NoPresence’, ‘Presence’, ‘Sitting’,

‘Standing’, ‘Walking’, and ‘Waving’) from over 70 diﬀerent users in 11 diﬀerent environments to

build a database of diverse samples. Using this data, we trained an SVM classiﬁer using 13 diﬀerent

time-domain features (details not mentioned here). Figure 7.1(a) shows the 10-fold cross-validation

Figure 7.1: Effect of fusing information from successive PCA projections on recognition accuracy (10-fold cross-validation)
Panels: (a) Effect of fusing information from more PCA projections on recognition accuracy (10-fold cross-validation). (b) Effect of Successive Significance Difference (SSD) threshold on accuracy (10-fold cross-validation).
Axes: (a) X: Consecutive PCA projections fused (descending order); Y: Activity Recognition Accuracy (%). (b) X: PCA Successive Significance Difference Threshold (0 to 0.01); Y: Activity Recognition Accuracy (%). Legend in (b): 34 features; 13 features.

accuracies obtained from our initial experiments. Based on this result, we conclude that our
hypothesis holds: successive lower PCA components in the WiFi subspace also contain important
activity-related information. However, this fusion mechanism is ad hoc and makes it hard to know
after which PCA projection we should stop fusing information. Moreover, the number of PCA
projections that contain important activity-related information can differ even for the same
activity performed by the same individual at the same location, so the important PCA components
need to be selected automatically in every data sample. To deal with this issue, we introduce a
Successive Significance Difference (SSD) threshold, based on which we discard all the successive
unimportant PCA components. We compute the SSD of the PCA projections of a data sample by first
max-min normalizing the PCA coefficients corresponding to those projections, and then taking their
first-order difference (i.e. subtracting the normalized coefficient of each successive PCA
component from that of its previous one). Any successive PCA projection after the first PCA
projection that does not meet the SSD threshold requirement is automatically discarded.
Figure 7.1(b) shows how the 10-fold cross-validation accuracies vary with the SSD threshold. For
this experiment, we trained an SVM classifier using 13 and 34 different time-domain features
(details not mentioned here), respectively. We can observe that accuracy in both scenarios reaches
an approximately optimal point, after which fusing information from any extra PCA projections leads
to significantly lower accuracies due to the addition of noisy PCA projections. Note that some of
the time-domain features (e.g. mean, which signifies the average magnitude of variations in the CSI
signals) we use with SVM classifiers are obtained after directly taking the magnitude of the
selected PCA projections, without max-min normalizing the timeseries of the individual PCA
projections. Currently, we are working on further improving our fusion scheme and developing a
deep learning based framework which may be able to learn more representative features of different
activities by taking both the timeseries shape and frequency domain features (e.g. FFTs) of the
waveforms obtained after fusion.
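
The SSD-based selection described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work. It assumes that the "PCA coefficients" are per-component significance values (for example, explained variances) sorted in descending order, and that selection stops at the first component whose difference falls below the threshold. The function name select_components_by_ssd and both of these choices are assumptions made for the example.

```python
import numpy as np


def select_components_by_ssd(coeffs, threshold):
    """Return indices of the PCA components kept under an SSD threshold.

    coeffs: per-component significance values in descending order
            (assumed here to be explained variances; hypothetical input).
    threshold: minimum normalized drop required to keep a component.
    """
    c = np.asarray(coeffs, dtype=float)
    span = c.max() - c.min()
    # Max-min normalize to [0, 1]; a constant vector has no spread to compare.
    norm = (c - c.min()) / span if span > 0 else np.zeros_like(c)
    # First-order difference: previous component minus the current one.
    ssd = norm[:-1] - norm[1:]
    kept = [0]  # the first PCA projection is always kept
    for k, d in enumerate(ssd, start=1):
        if d < threshold:
            break  # assumption: stop at the first component that fails
        kept.append(k)
    return kept
```

For example, with coefficients [10, 6, 3, 2.9, 2.8, 0] and a threshold of 0.05, this sketch keeps the first three components, because the normalized drop from the third to the fourth component is only about 0.01.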

7.2 Conclusions

In my dissertation, I have shown that several mainstream commercial off-the-shelf (COTS)
electronic devices of daily use can be leveraged to develop interesting new IoT sensing and
connectivity solutions. In my research, I revisited the physical layer of various everyday COTS
electronic devices such as WiFi, RFID, smartphone (vibration mechanism), and Powerline
Communication (PLC) devices, either to leverage the signals obtained from their physical layers to
develop novel sensing applications, or to improve/modify their protocols to enable more useful
deployment scenarios and networking applications, which they were not originally designed for, by
introducing mere software/firmware level changes and completely avoiding any hardware level
changes. Adding such new usefulness and functionalities on top of existing everyday infrastructure
and electronics has advantages both in terms of cost and convenience of use/deployment, as those
devices (and their protocols) are already mainstream, easily available, and often already purchased
and in use/deployed to serve their mainstream purpose. In my work on WiFi signal based sensing, I
developed signal processing and machine learning approaches to enable fine-grained gesture
recognition and sleep monitoring using COTS WiFi devices. In my work on RFID signal based sensing,
I developed signal processing and machine learning approaches to effectively image customers'
browsing behavior in front of display items in places such as retail stores using COTS RFID
devices. In my work on smartphone vibration based sensing, I developed a robust and practical
vibration based sensing scheme that works with COTS smartphones with different hardware, can
extract fine-grained vibration signatures of different surfaces, and is robust to environmental
noise and hardware based irregularities. This work finds its applications in symbolic localization
based context aware services, for example, training a smartphone to turn off lights when a user
puts it on their bedside table. Finally, as communication and sensing go hand in hand in the world
of IoT, I worked on PLCs (i.e., communication mechanisms that leverage the existing power
distribution network/infrastructure inside a building for communication), where I developed a
distributed spectrum sharing scheme to make enterprise-level PLC-based IoT networks faster. This
work is a major step towards using existing COTS PLC devices to connect different types of IoT
devices for sensing and control related applications in large campuses such as enterprises.
