WIRELESS COMMUNICATION AND SENSING SYSTEM DESIGN: A LEARNING-BASED APPROACH

By Shichen Zhang

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2025

ABSTRACT

With the rapid advancement of digital technologies, wireless communication and sensing systems have become increasingly integral to our daily lives. These systems utilize wireless signals not only as data carriers but also as a medium for radio sensing. Model-based approaches have traditionally been a popular choice for addressing existing challenges in communication and sensing. However, model-based approaches struggle to accurately characterize signal propagation, especially at higher frequencies, and optimizing them for communication is even more difficult. Moreover, extracting human motion-related information from these complex signals is often challenging with conventional methods. Recent progress in artificial intelligence (AI) has opened new avenues for addressing these challenges. This thesis explores learning-based approaches to uncover the hidden information embedded within wireless signals. By doing so, it aims to enhance the efficiency of wireless communication systems and enable fine-grained human motion sensing, thereby pushing the boundaries of wireless systems.

The first part of this thesis explores the capability of various RF signals to sense different levels of human motion using learning-based approaches. We begin by proposing AuthIoT, a gesture-based wireless authentication scheme designed for IoT devices. AuthIoT leverages a convolutional neural network (CNN) to learn human gesture features from Wi-Fi channel state information (CSI) and maps them to specific letters for device authentication. To enhance robustness and enable gesture recognition across diverse environments, the system employs a feature fusion approach that integrates location-independent features, ensuring strong transferability. Next, we shift our focus to tiny motions and propose RadSee, a system capable of recognizing fine-grained handwriting. We develop a 6 GHz FMCW radar system along with a tailored deep neural network to identify handwritten letters through walls. The model combines a bidirectional long short-term memory (BiLSTM) network with an attention mechanism to leverage temporal dependencies and capture critical features—such as turning points—in radar phase sequences for accurate recognition. We push the limits of this system further with a novel learning framework and introduce RadEye, a system designed to recognize eye movements. Given the subtle nature of eye motion and the challenge of detecting it in RF signals, we adopt a transformer encoder as the feature extractor to more effectively exploit temporal dependencies in the phase sequences. To further enhance performance, we incorporate a state-of-the-art vision-based method to provide guidance and prior knowledge during the learning process.

The second part of this thesis focuses on leveraging learning-based solutions to improve the efficiency of wireless communication systems, with particular emphasis on enhancing the throughput of mmWave communication systems. We begin by proposing an uplink multi-user MIMO (MU-MIMO) mmWave communication (UMMC) scheme for WLANs. MU-MIMO techniques are well-known for increasing network efficiency and throughput.
A key innovation in this work is a learning-based Bayesian optimization (BayOpt) framework for joint beam search across multiple antennas. This approach eliminates the need for complex channel modeling and identifies optimal beamforming directions with only a few search iterations, significantly reducing beamforming overhead. We then further explore the beamforming problem in mmWave communications, shifting our focus to mobile mmWave networks. In such dynamic environments, beamforming overhead becomes more pronounced. To address this challenge, we leverage the temporal correlation of wireless channels to aid in beam selection. Specifically, we propose a Temporal Beam Prediction (TBP) scheme that enables a mobile mmWave device to predict its future beam direction based on its historical beam selection profile. At the core of this scheme is a modified LSTM architecture, complemented by an adversarial learning model to improve the robustness and generalizability of the beam steering process.

This thesis presents efficient communication schemes and novel sensing applications based on learning-driven approaches, paving the way for the design of AI-enabled next-generation wireless communication and sensing systems. It provides detailed descriptions of system implementations, experimental setups, and performance evaluations of the proposed schemes in real-world environments. Furthermore, it offers an in-depth analysis of the limitations of these systems and discusses open challenges in developing future wireless communication and sensing systems using learning-based techniques.

Copyright by SHICHEN ZHANG 2025

To my parents, Lixia and Yu, and to my wife, Jinghan

ACKNOWLEDGMENTS

I sincerely thank my advisor, Prof. Huacheng Zeng, for his professional guidance, consistent support, and mentorship during my Ph.D. journey. His invaluable insights, high standards, and encouragement have played a crucial role in shaping the direction of my research and in helping me grow as a scholar. I hold deep respect for his conscientious and meticulous approach to research, which has become a model I aspire to follow in my own future work.

I am truly grateful to my committee members—Prof. Li Xiao, Prof. Qiben Yan, Prof. Zhichao Cao, and Prof. Zhaojian Li—for their insightful feedback, constructive input, and thoughtful guidance. Their expertise and perspectives have consistently helped refine my research direction and sharpen my overall approach. Their support has been instrumental in the successful completion of this thesis.

I would like to thank both past and present colleagues from the INSS Lab—Dr. Pedram Kheirkhah Sangdeh, Dr. Hossein Pirayesh, Qijun Wang, Peihao Yan, Bowei Zhang, and Jie Lu—for their collaboration and support throughout my research projects. I am also deeply grateful for the time spent with my lab mates, which brought great joy and meaning to my daily life. Their support has extended beyond research, enriching both my academic journey and personal life.

I would like to extend my sincere thanks to the faculty, staff, and fellow students in the Department of Computer Science and Engineering (CSE) at Michigan State University (MSU) for their support and assistance throughout the completion of my Ph.D. thesis. The intellectually stimulating and supportive environment fostered by the CSE department has played a vital role in advancing my research and contributing to my scholarly growth.
Finally, and most importantly, I would like to thank my mother, Lixia Cui, and my father, Yu Zhang, for their unconditional love and unwavering support. Even when we were separated by distance, their love reached me across oceans. I am deeply grateful to my wife, Jinghan Liu, who crossed oceans to stand by my side. Her unwavering support and countless sacrifices helped me overcome the challenges of my academic journey and turned our love into a lifelong commitment.

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
  1.1: Research Background
  1.2: Thesis Contributions
  1.3: Organization
CHAPTER 2: A GESTURE-BASED WIRELESS AUTHENTICATION SCHEME FOR IOT DEVICES
  2.1: Introduction
  2.2: Related Work
  2.3: AuthIoT: Design Overview
  2.4: AoA Estimation for General Antenna Array
  2.5: Learning-based Passcode Recognition
  2.6: Experimental Evaluation
  2.7: Summary
CHAPTER 3: HANDWRITING RECOGNITION THROUGH WALLS USING FMCW RADAR
  3.1: Introduction
  3.2: Related Work
  3.3: Attack Model
  3.4: RadSee: Design Analysis
  3.5: RadSee: Data Processing
  3.6: RadSee: DNN-based Recognition
  3.7: Implementation
  3.8: Experimental Evaluation
  3.9: Countermeasures and Other Applications
  3.10: Summary
CHAPTER 4: EYE MOTION TRACKING USING FMCW RADAR
  4.1: Introduction
  4.2: Related Work
  4.3: RadEye: Design Analysis
  4.4: RadEye: Signal Processing
  4.5: RadEye: DNN-based Eye Movement Detection
  4.6: Experimental Evaluation
  4.7: Limitations and Discussions
  4.8: Summary
CHAPTER 5: UPLINK MU-MIMO COMMUNICATION IN MMWAVE WLANS
  5.1: Introduction
  5.2: Related Work
  5.3: Problem Description
  5.4: Overview of UMMC
  5.5: Bayesian Optimization for Beam Search
  5.6: Asynchronous MU-MIMO Detection
  5.7: Performance Evaluation
  5.8: Summary
CHAPTER 6: TEMPORAL BEAM PREDICTION FOR MOBILE MMWAVE NETWORKS
  6.1: Introduction
  6.2: Related Work
  6.3: Problem Description
  6.4: TBP: Design
  6.5: TBP: Out-of-band Enhancement
  6.6: Performance Evaluation
  6.7: Summary
CHAPTER 7: CONCLUSION AND FUTURE WORK
  7.1: Conclusion
  7.2: Future Research
BIBLIOGRAPHY

CHAPTER 1: INTRODUCTION

With the prevalence of Internet of Things (IoT) devices, wireless signals have become present in every corner of people's lives. In today's digitized world, there is a growing demand not only for higher data rates in wireless systems but also for these systems to capture and convey information about the physical world. On one hand, wireless systems are advancing into higher frequency bands—such as the mmWave spectrum—enabling new applications like high-quality wireless VR/AR headsets, high-resolution video streaming, and vehicle-to-vehicle (V2V) communications. On the other hand, these systems are evolving into sensors that can capture human motion-related information. These new sensors leverage widely available wireless signals to provide a contactless, privacy-preserving, and resilient sensing approach that functions effectively in low-light and adverse weather conditions. Wireless sensing applications are becoming increasingly common in our daily lives, including measuring vital signs [61, 167], monitoring sleep patterns [195], assisting the elderly with memory-related tasks [46], detecting falls in older adults [164], and even measuring soil moisture [36, 48].

Both evolutionary trends require accurate signal transmission models. In wireless communication systems, such as mmWave communication systems, a reliable transmission model is essential for guiding the beamforming process. Similarly, wireless sensing applications depend on accurate models to describe how human motion affects wireless channels. However, using traditional modeling approaches to achieve these goals is often difficult or inefficient.
For example, mmWave communication systems face challenges due to hardware imperfections, such as phase noise, clock jitter, and inaccurate antenna radiation patterns, all of which contribute to imprecise mathematical models. Modeling human motion based on variations in wireless signals is even more challenging. The presence of multipath effects and the often uncorrelated relationship between human motion and signal variation patterns make model-based approaches infeasible.

We will begin by introducing the research background of designing wireless communication and sensing systems, and outline the limitations of existing approaches. Next, we will present the system we developed and explain how we address these limitations using learning-based methods. Finally, we will provide an overview of the organization of this thesis.

1.1: Research Background

1.1.1: Wireless Sensing Systems

Sensing with Communication Signals. Designing wireless sensing systems generally follows two main trends: utilizing existing communication signals or employing dedicated sensing signals. Using existing communication signals for sensing repurposes current infrastructure and spectrum, increases resource utilization, and transforms communication systems into multifunctional platforms. Recently, many applications have emerged in this area. For example, existing works have leveraged Wi-Fi signals for gesture recognition [93, 108, 160, 213] and vital sign detection [167]. Cellular signals have also been used for respiration sensing [47] and even soil moisture measurement [48]. Additionally, low-power signals such as RFID have been applied to touch sensing [122].

These sensing capabilities also show potential for addressing other emerging challenges. One such challenge, which this thesis focuses on, is authentication for IoT devices. Since IoT devices often lack input interfaces, connecting them to Wi-Fi networks can be difficult—users have no straightforward way to input passwords. Existing solutions typically rely on pre-deployed platforms to serve as intermediaries for authentication. However, this thesis explores how existing wireless connections can be leveraged for authentication through gesture recognition, without requiring any additional equipment.

Sensing with Radar Signals. While sensing with communication signals offers advantages by leveraging existing infrastructure, it faces limitations when extended to fine-grained human sensing in complex scenarios. As a result, another design trend focuses on sensing systems based on Frequency-Modulated Continuous Wave (FMCW) radar. Compared to communication signal-based sensing, radar systems are coherent, meaning they do not suffer from hardware-induced phase, frequency, or timing misalignments commonly found in wireless communication systems. Existing work has demonstrated the use of radar to enable human body skeleton sensing [95, 209–211]. However, these studies primarily target large-scale body motions. The potential of radar systems for detecting fine movements—such as handwriting or eye motion—remains largely unexplored.

In this thesis, we focus on two compelling problems using FMCW radar systems. The first is detecting handwriting through walls, which raises significant security concerns. Through-the-wall detection remains a major challenge for most other sensors, such as cameras or acoustic sensors.
Although some RF-based sensing solutions can penetrate walls, achieving the resolution necessary to capture subtle movements like handwriting remains difficult. Furthermore, handwriting detection is particularly susceptible to interference from other moving objects in the environment. The second and more challenging problem we address is the detection of eye motion. Eye tracking is critical in various applications, including human-computer interaction (HCI), virtual reality, and medical diagnostics. While existing camera-based eye-tracking solutions offer high accuracy and usability, they often raise privacy concerns and perform poorly in low-light conditions. Radar-based solutions present a promising alternative, addressing these issues while offering high-resolution, reliable tracking.

1.1.2: Wireless Communication Systems

With the growing demand for data traffic in daily life, wireless communication systems are moving to higher frequency bands—the mmWave spectrum—to support larger bandwidths. This shift is foundational for 5G and beyond, enabling the vision of a smart society and a digitized physical world by delivering ultra-low latency, multi-gigabit per second (Gbps) scalable wireless connections. These capabilities are essential for emerging applications such as virtual reality (VR), cloud-centric real-time AI, and high-resolution video streaming. However, mmWave frequencies suffer from significant path loss, making reliable communication challenging. To address this, mmWave systems heavily rely on analog beamforming to establish and maintain strong links. Beam selection, therefore, becomes a critical challenge in mmWave communications. This thesis focuses on reducing beamforming overhead across different mmWave communication scenarios, while also proposing efficient communication schemes to enhance performance in mmWave systems.

The combination of mmWave and MU-MIMO technologies has attracted significant interest from both academia and industry due to its potential to deliver data rates in the hundreds of gigabits per second. While extensive research has been conducted on the downlink of mmWave MU-MIMO systems, progress on the uplink remains limited. One of the primary challenges in designing uplink MU-MIMO schemes is the complexity of beamforming across multiple antennas. Existing solutions often rely on accurate antenna models and detailed channel state information (CSI), which are difficult to obtain in real-world deployments. Some model-free approaches have been proposed, but they primarily address single-antenna scenarios. These solutions are not suitable for multi-antenna systems, where joint beamforming decisions must consider spatial correlations and user interference.

Beamforming is not only a challenge in multi-antenna systems, but also in single-antenna scenarios in mobile mmWave networks, where rapid user movement necessitates frequent beam alignment. To address this issue, existing solutions have explored techniques such as out-of-band CSI-assisted beam selection, compressive sensing, and hierarchical beam search. While these methods are effective to some extent, they primarily exploit the spatial characteristics of mmWave channels and often overlook the temporal correlation inherent in beam selection over time. Leveraging this temporal consistency could significantly improve the efficiency and robustness of beamforming in dynamic environments.
1.2: Thesis Contributions

1.2.1: Thesis Overview

This thesis encompasses five of my previously published works, each of which has contributed to the development of the core chapters. The thesis is structured in two parts. The first part focuses on wireless sensing system design: Chapter 2 is based on AuthIoT [200], Chapter 3 builds upon RadSee [201], and Chapter 4 is derived from RadEye [202]. The second part focuses on wireless communication system design, where Chapter 5 is based on UMMC [199] and Chapter 6 is developed from TBP [203].

Figure 1.1: Overview of this thesis.

More specifically, as illustrated in Fig. 1.1, the first part of this thesis focuses on designing deep learning models to extract human motion features from RF signals. It aims to recognize various types of motion and apply them to the following use cases:

• A gesture-based wireless authentication scheme for IoT devices [200]
• Handwriting recognition through walls using FMCW radar [201]
• Eye motion tracking using FMCW radar [202]

The second part of this thesis focuses on enhancing the efficiency of mmWave communication systems. In particular, we explore learning-based approaches, such as Bayesian optimization and tailored LSTM models, to predict beam directions, and thereby reduce mmWave beamforming overhead across various networking scenarios. The newly designed beamforming schemes are as follows:

• A Bayesian optimization-based beamforming framework for uplink MU-MIMO mmWave communication [199]
• Temporal beam prediction for mobile mmWave networks [203]

In both directions, we explore learning-based approaches to advance wireless systems into a new generation characterized by high throughput and multifunctionality. In the next section, we provide a comprehensive overview of each proposed system, detailing their design components and the specific roles they play in addressing emerging challenges.

1.2.2: Contributions to Wireless Sensing

AuthIoT.
We propose AuthIoT, a gesture-based wireless authentication scheme for IoT devices. It directly utilizes channel state information (CSI) from Wi-Fi communications to recognize input passwords, without relying on additional platforms. A novel feature fusion scheme is designed to maintain the system's transferability across different environments. Specifically, we extract an environment-independent feature — the Angle of Arrival (AoA) — and fuse it with channel amplitude to serve as input for the DNN. In addition, we design an extended 2D MUSIC algorithm tailored to this scheme to accurately calculate AoA under various antenna configurations on the access point (AP) side. We have built a prototype of AuthIoT and evaluated its performance in real-world scenarios. Experimental results show that AuthIoT achieves a letter recognition accuracy of 84%.

RadSee. We propose RadSee, a 6 GHz Frequency Modulated Continuous Wave (FMCW) radar system designed to detect handwriting content through walls. The system is developed through a combination of hardware and software design. On the hardware side, RadSee features a 6 GHz FMCW radar equipped with patch antennas. These patch antennas provide a sufficient link power budget, enabling the system to "see" through most walls while operating at low transmission power. On the software side, the system extracts phase features corresponding to the writer's hand movements and utilizes a BiLSTM model with an attention mechanism to classify the letters. The proposed learning framework is specifically designed to identify and extract key features—particularly the turning points in handwritten letters—that are critical for accurate recognition. Extensive experimental results show that RadSee achieves a 75% letter recognition accuracy when subjects write 62 randomly selected letters.

RadEye. We propose RadEye, a radar system capable of detecting fine-grained human eye movements from a distance. RadEye is realized through an integrated hardware and software co-design. It leverages a customized sub-6 GHz FMCW radar and a tailored patch antenna pair to detect millimeter-level eye movements. This hardware combination enables the system to detect subtle motions over an extended range while also minimizing interference from other directions. On the software side, a DNN is employed to enhance detection accuracy, guided by camera-based supervisory training. The DNN incorporates a transformer encoder as the feature extractor, enabling it to effectively capture temporal dependencies between radar sampling points. We have developed a prototype of RadEye, and extensive experimental results demonstrate that it achieves 90% accuracy in detecting human eye rotation directions (up, down, left, and right) across a variety of scenarios.

1.2.3: Contributions to Wireless Communication

UMMC. We propose an efficient Uplink MU-MIMO mmWave Communication (UMMC) scheme for WLANs, which enables multiple stations to simultaneously transmit their data packets to a single access point (AP). A key component of this scheme is a Bayesian optimization (BayOpt) framework, designed to guide the beam search process. BayOpt leverages the posterior probability distribution derived from previously evaluated beam configurations to intelligently explore the beam space. Compared to conventional exhaustive search methods, BayOpt demonstrates remarkable efficiency, often identifying near-optimal beam directions within a constrained airtime budget.
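To illustrate the flavor of such a posterior-guided search, the toy sketch below pairs a simple Gaussian-process surrogate with an expected-improvement rule over a discrete beam codebook. It is only a conceptual illustration of Bayesian-optimization beam search, not the UMMC design detailed in Chapter 5 (which performs a joint search across multiple antennas); measure_snr, the RBF length scale, and the iteration budget are assumed placeholders.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(a, b, ls=10.0):
    # Squared-exponential similarity between beam angles (degrees).
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def bayopt_beam_search(measure_snr, beam_angles, n_iters=8, noise=1e-3):
    """Toy BayOpt beam search: probe one beam per iteration, fit a GP
    posterior over the codebook, and pick the next probe by expected
    improvement over the best SNR seen so far."""
    probed = [np.random.choice(beam_angles)]
    snrs = [measure_snr(probed[-1])]
    for _ in range(n_iters - 1):
        x, y = np.array(probed), np.array(snrs)
        K = rbf_kernel(x, x) + noise * np.eye(len(x))
        Ks = rbf_kernel(beam_angles, x)
        Kinv = np.linalg.inv(K)
        mu = Ks @ Kinv @ y                                  # posterior mean
        var = np.clip(1.0 - np.sum((Ks @ Kinv) * Ks, axis=1), 1e-9, None)
        sigma = np.sqrt(var)                                # posterior std
        z = (mu - y.max()) / sigma
        ei = (mu - y.max()) * norm.cdf(z) + sigma * norm.pdf(z)
        probed.append(beam_angles[int(np.argmax(ei))])
        snrs.append(measure_snr(probed[-1]))
    return probed[int(np.argmax(snrs))]                    # best beam found
```

In this toy setting, a handful of probes typically locates a near-optimal beam, which mirrors the airtime savings over exhaustive search described above.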
In addition to the learning framework, the proposed scheme incorporates a novel MU-MIMO detector capable of decoding asynchronous data packets from multiple user devices. We have developed a prototype of UMMC on a mmWave testbed and evaluated its performance through a combination of over-the-air experiments and extensive simulations. Both experimental and simulation results confirm the effectiveness and efficiency of UMMC in practical network environments.

TBP. We propose the Temporal Beam Prediction (TBP) scheme, which assists mobile mmWave devices in predicting future beam directions based on their historical beam selection profiles. TBP draws inspiration from pedestrian trajectory prediction, employing a Long Short-Term Memory (LSTM) network to model and predict beam directions in mobile mmWave networks. At the core of TBP is a tailored LSTM module—mobility-aware LSTM (mLSTM)—specifically designed to handle the non-uniform and non-smooth characteristics often observed in mmWave beam angle sequences. An adversarial learning structure is also employed to enhance the system's generalizability across different users. We have implemented a prototype of TBP on a 60 GHz software-defined radio (SDR) mmWave testbed. Experimental results demonstrate that TBP can improve throughput by more than 60% compared to existing beam selection approaches across various scenarios.

1.3: Organization

The rest of the thesis is organized as follows: Chapter 2 presents a gesture-based wireless authentication scheme for IoT devices. Chapter 3 introduces RadSee, a 6 GHz FMCW radar system capable of recognizing handwriting through walls. Chapter 4 presents RadEye, which extends RadSee by incorporating computer vision techniques to recognize eye motions. Chapter 5 describes UMMC, a learning-based beamforming scheme designed to reduce beamforming overhead for uplink MU-MIMO communication in mmWave WLANs. Chapter 6 presents TBP, a deep learning framework that reduces beamforming overhead for mobile mmWave networks. Finally, Chapter 7 summarizes this thesis and outlines future directions from both application and technical perspectives.

CHAPTER 2: A GESTURE-BASED WIRELESS AUTHENTICATION SCHEME FOR IOT DEVICES

2.1: Introduction

The Internet of Things (IoT) has transformed various aspects of our society, playing a vital role in enhancing the way we live and work. According to Statista [145], the number of IoT devices worldwide is projected to reach 40 billion by 2033. In real-world applications, many IoT devices rely on Wi-Fi connections for Internet access and have no input interfaces (e.g., keypad or touchscreen) due to their limits in physical size, power consumption, and/or manufacturing cost. For example, smart home devices such as the Gosund Smart Wi-Fi power outlet [11], SYLVANIA Wi-Fi dimmable LED light bulb, and AGSHOME Wi-Fi window open alert sensors require Wi-Fi network access to be functional, but they have no input interfaces which end users can use to type in a Wi-Fi passcode for wireless Internet access. With the proliferation of compact wireless sensors in smart environments, wireless IoT devices lacking input interfaces are expected to become increasingly common.

One widely used approach to authenticate Wi-Fi-enabled IoT devices that lack input interfaces involves leveraging existing platforms such as Google Home Assistant [10] and Amazon Alexa [8]. These platforms facilitate device recognition and authentication via a smartphone or computer connected to the same Wi-Fi network.
This method, however, requires end users to have a smartphone/computer with pre-installed proprietary apps such as Google Home and Amazon Alexa. It also requires an Internet connection to gain the support of Google or Amazon cloud services. These requirements make this method inapplicable in scenarios where a smartphone or Internet access is not available and where the IoT device owners do not want to get involved in commercial cloud platforms.

In this chapter, we present AuthIoT, a gesture-based wireless authentication scheme for IoT devices without input interfaces. AuthIoT requires neither assistance from other devices nor support from an Internet-based software platform. It is a channel state information (CSI) based passcode recognition scheme for a Wi-Fi communication system, as shown in Fig. 2.1. It consists of an access point (AP), an IoT device, and an end user. Specifically, AuthIoT works as follows: the end user holds the IoT device in hand and writes the passcode over the air, and the AP leverages recent advances in deep learning to recognize the passcode input from the IoT device based on the spatial and temporal CSI features.

Figure 2.1: A CSI-based authentication scheme for wireless IoT devices without input interfaces.

Table 2.1: Wireless writing and gesture recognition.

| Ref. | (Tx,Rx) ant # | Nonlinear antenna array | Dataset | Main approach | Learning features | Computation complexity | Cross-environment transferability | Reported accuracy |
| WriFi [51] | (2,3) | No | 26 capital letters | GMM-HMMs | CSI amplitude | High | No | 87% |
| WiReader [59] | (1,2) | No | 26 capital letters | LSTM model | CSI amplitude | Medium | No | 90% |
| LetFi [198] | (1,6) | No | 26 capital letters | SOM network | CSI amplitude | Medium | No | 95% |
| WiDraw [147] | (30,3) | No | Any | Trajectory tracking | AoA | Medium | Yes | 91% |
| Wi-Wri [33] | (2,3) | No | 26 capital letters | kNN model | CSI amplitude | High | No | 82% |
| AuthIoT | (1,3 or 4) | Yes | 48 characters | CNN-based learning | LoS AoA, CSI amplitude | Medium | Yes | 84% |

A key challenge in the design of AuthIoT is to maintain its transferability across different environments. As CSI is significantly affected by the multipath effect of a wireless channel, a wireless AP tends to observe different CSI in different environments. Hence, at the wireless AP, using raw CSI for passcode recognition is not a plausible strategy, because a deep neural network (DNN) trained with raw CSI in one environment does not work well in another environment (based on our experimental results). To address this challenge, AuthIoT extracts environment-independent features as the input for the training and inference of a DNN. Specifically, AuthIoT computes the angle of arrival (AoA) of the line-of-sight (LoS) signal path by leveraging recent advances in wireless localization [79, 85, 152, 158, 185], and uses the AoA (as well as normalized channel amplitude) as the input for the training and inference of the DNN. Since different passcode characters tend to produce distinct AoA patterns, an AP can identify the characters if the DNN is properly trained.

Another challenge in the design of AuthIoT is to compute the LoS AoA of received packets for an AP with a nonlinear antenna configuration. While AoA estimation of wireless packets has been studied in wireless localization (e.g., [79, 85, 152, 158, 183, 185]), most existing techniques deal with the case where the antenna elements are equally-spaced and linearly installed.
However, many Wi-Fi routers and other APs are equipped with antenna elements in a nonlinear shape so as to save space. Existing methods such as the MUSIC (MUltiple SIgnal Classification) algorithm cannot be directly used to estimate AoA for a receiving device with a nonlinear antenna configuration. To address this challenge, AuthIoT extends the two-dimensional MUSIC algorithm to the case where the receiver (wireless AP) is equipped with nonlinear antenna elements. Following the idea from SpotFi [85], AuthIoT jointly considers the AoA and ToF (time of flight) to enhance the AoA resolution of different signal paths.

Based on the environment-independent features (LoS AoA) as well as the normalized amplitude of CSI, AuthIoT employs a DNN to recognize the passcode when an end user continuously writes the passcode characters over the air by holding the IoT device in her hand. Once the AP detects the passcode, it will grant network access to the IoT device; otherwise, it will wait until the correct passcode is detected or the maximum number of attempts is reached. We have built a prototype of AuthIoT and evaluated its performance on two distinct AP testbeds: i) an Intel 5300 Wi-Fi card with three linear antennas, and ii) a USRP N310 with four nonlinear (square-positioned) antennas. Experimental results show that AuthIoT achieves an 84% success rate of passcode character recognition on the former testbed and an 83% success rate on the latter testbed, both for cross-environment applications.

The contributions of this work are summarized as follows.

• AuthIoT is, to the best of our knowledge, among the first that explores environment-independent features of CSI for authenticating IoT devices without input interfaces. It is transferable to a new environment for handwriting recognition once its DNN is well trained.
• AuthIoT extends the two-dimensional MUSIC algorithm for AoA estimation from linear, equally-spaced antenna configurations to nonlinear antenna configurations.
• We have built a prototype of AuthIoT and demonstrated its performance in real scenarios. Our experimental results show that it can achieve more than 83% passcode recognition accuracy in cross environments for both linear and nonlinear antenna configurations.

2.2: Related Work

We survey the literature in the following categories.

Authenticating IoT Devices without Input Interface. As mentioned before, a mainstream authentication method for smart-home IoT devices is to leverage platforms such as Google Home [10] and Amazon Alexa [8]. This method, however, requires users to have a smartphone with pre-installed proprietary apps, to have Internet access, and to share the data with the platforms. In addition to the commercial products, research advances have been made for IoT authentication. TouchAuth [189] harnesses induced body electric potentials (iBEPs) for IoT authentication by having users wear a wristband to touch an analog-to-digital converter (ADC) pin of the IoT device. It makes the ADC pin touchable by connecting devices' ADC pins to their conductive exteriors. The authentication is performed by measuring the iBEP similarities between the wristband and the smart object. P2Auth [96] authenticates IoT devices without input interfaces by leveraging their inertial measurement units. It requires users to perform unique petting operations that can be sensed by both an IoT device and a wristband device. It compares the captured data from the two devices and makes a decision for the authentication based on their similarity.
SFIRE [55] is a secret-free trust establishment protocol that pairs commercial wireless devices with a hub. It requires a user to move a helping smartphone around the wireless device and measures the similarity of RSS signals for authentication. Move2Auth [197] is another authentication scheme for IoT devices without an input interface. It requires users to hold a smartphone and perform one of two hand gestures in front of an IoT device. In contrast to the above works, AuthIoT takes a very different approach to authenticate IoT devices without input interfaces. It requires neither assistance from smartphones nor hardware/software modifications on IoT devices.

CSI-based Handwriting Recognition. Our work is closely related to the research in this area. Table 2.1 presents a comparison of our work with prior work. WriFi [51] is a CSI-based handwriting system that comprises a Wi-Fi AP, a Wi-Fi client device, and a user writing 26 letters over the air. In this system, CSI amplitude is collected for learning-based recognition. Operations such as principal component analysis (PCA) and fast Fourier transform (FFT) have been performed to extract the CSI features for hidden Markov model (HMM) training and inference. The accuracy is reported to be 86%. Similar to WriFi, Wi-Wri [33] is another CSI-based handwriting letter recognition system. It is based on a k-nearest neighbors (k-NN) model and uses dynamic time warping (DTW) to calculate the distance between the CSI waveform and classified data. It reports 83% recognition accuracy for 26 letters. WiReader [59] is another work in this area. It exploits CSI from commercial Wi-Fi devices to extract activity-related information. It employs a long short-term memory (LSTM) model for recognition and adopts PCA and discrete wavelet transform (DWT) for CSI feature extraction. It reports 90% recognition accuracy for 26 letters with intelligent text correction. LetFi [198] is also a CSI-based over-the-air handwriting recognition system in Wi-Fi networks. It employs a multi-domain feature extraction method and self-organizing mapping neural networks (with a SoftMax regression classifier) to recognize 26 letters. The reported recognition accuracy is 95%. WiDraw [147] is a handwriting recognition system which allows a user to write over the air. It recognizes hand movement trajectories based on the analysis of collected CSI. With the presence of 30 transmitters, it can achieve 91% word recognition accuracy and superior accuracy for hand movement patterns. As shown in Table 2.1, AuthIoT differs from the above works in several aspects: i) AuthIoT has a larger dataset (48 characters in AuthIoT versus 26 letters in the above-mentioned works); ii) it enables its cross-environment transferability by design; and iii) it works for Wi-Fi APs with nonlinear antenna arrays.

CSI-based Gesture Recognition. In addition to handwriting recognition, many works have also been done for CSI-based gesture recognition [93, 108, 160, 179, 213], by extracting and recognizing the temporal, spatial, and Doppler features of hand movements. Generally speaking, CSI-based gesture recognition can achieve very high accuracy (over 90%), because it has a small dataset (e.g., 6 gestures). In contrast, AuthIoT has 48 characters in its dataset, which is much larger than those of the above works. In addition, AuthIoT distinguishes itself from previous works by focusing on cross-environment transferability by design.

Wireless Localization. Another research line related to our work is CSI-based wireless localization in Wi-Fi networks [85, 139, 152, 183, 185]. Particularly, SpotFi [85] presents an accurate indoor localization scheme using commercial Wi-Fi devices. It proposes a two-dimensional MUSIC algorithm that leverages the information in both spectral and spatial domains to enhance the resolution of AoA estimation. It jointly estimates the AoA and ToF (time of flight) of incoming Wi-Fi signals using multiple antennas and broadband (40 MHz) spectrum. The median localization accuracy is reported to be 40 cm using a commercial Wi-Fi card. AuthIoT borrows the idea of AoA estimation from the above works, and extends the antenna setting from the linear to the nonlinear case for IoT authentication applications.

2.3: AuthIoT: Design Overview

2.3.1: System Setting and Operation

AuthIoT is designed for a wireless communication system as shown in Fig. 2.1, which comprises a wireless AP (e.g., Wi-Fi router), an IoT device, and an end user. IoT devices do not have input interfaces such as keypads and touchscreens due to the limits in their physical size, power consumption, and/or manufacturing cost. Examples of such IoT devices include Wi-Fi LED light bulbs, Wi-Fi light switches [9], and window/door open alert sensors [12]. The wireless AP has multiple antennas for data packet reception. This is very common for Wi-Fi routers, most of which are equipped with four or more antennas. In such a system, AuthIoT works as follows.

• End User: The end user first triggers the wireless AP to exchange packets between itself and the IoT device at a certain rate (e.g., 200 packets/s). She then holds the IoT device in front
Specifically, it computes the LoS AoA of the received packets from the IoT device based on the estimated CSI by leveraging recent advances in wireless localization [85, 152, 183, 185], and uses the LoS AoA as the main feature for passcode recognition. It should be noted that an end user can always hold the IoT device in front of its wireless AP to ensure the existence of LoS path between the IoT device and its AP. Nonlinear Antenna Array at AP. Although the LoS AoA estimation techniques have been well 15 studied for wireless localization, most of them consider the case where the receiver is equipped with linearly, equally-spaced antenna array [85, 152, 183, 185]. However, many off-the-shelf wireless APs such as Wi-Fi routers are equipped with nonlinear antenna array (e.g., rectangular-installed) to save space. As expected, the AoA estimation techniques proposed for a device with linear antenna array cannot be directly used for a device with nonlinear antenna array. To address this challenge, AuthIoT revisits the MUSIC algorithm and extends it for the case where the device has nonlinear antenna array. AuthIoT also borrows the idea from SpotFi [85] to jointly estimate AoA and ToF so as to improve the AoA resolution. Indistinguishable Characters. Another challenge lies in the fact that some character pairs are hard to distinguish in their handwriting format, such as “z” and “Z”, “o” and “O”, “s” and “S”, “v” and “V”, letter “I” and number “1”, etc. Sometimes, these handwritten character pairs even cannot be distinguished by a human. Unfortunately, this challenge is hard to address from a technical perspective. Therefore, AuthIoT resorts to regulation. AuthIoT asks end users to use a passcode that does not include indistinguishable pairs of characters. Excluding some characters will not compromise the passcode security as there are still sufficient characters to be used. 2.3.3: Security of AuthIoT Essentially, AuthIoT serves as an interface for an AP to receive a passcode from an end user for authenticating a particular IoT device. It does not alter the authentication mechanism and thus has the same authentication safety as existing methods. However, due to the broadcast nature of wireless signals, AuthIoT may face the passcode leakage problem. A malicious user may overhear the signal from IoT device and attempt to infer the passcode for AP access. To address this issue, a substitution cipher [176] can be applied to the passcode at wireless AP, and the substitution rules can be updated regularly to avoid replay attacks. 2.4: AoA Estimation for General Antenna Array This section first offers a primer on the existing MUSIC algorithm for AoA estimation at a wireless device equipped with uniform linear antenna array, and then extends the MUSIC algorithm to the 16 case where the wireless device is equipped with a general (linear or nonlinear) antenna array. 2.4.1: MUSIC for Uniform Linear Antenna Array (MUSIC-ULAA) System Modeling. The basic idea of AoA estimation is that different signal propagation paths are likely to have different AoAs at a receiving device. The different AoAs will introduce a cor- responding phase shift across the array of antennas. For a uniform linear antenna array, once the antenna space and the phase shift are given, the AoA can be accordingly calculated. To understand AoA estimation, let us consider a receiving device with a uniform linear antenna array as shown in Fig. 2.2, where the number of antennas is M , and the antenna spacing is d. 
Assume that the num- ber of signal propagation paths is L and let us focus on the lth path shown in the figure. Denote αl as the complex channel attention experienced by the signal when impinging on the first antenna. Then, the complex channel attention of the signal at the second antenna is the same except for an additional phase shift caused by the additional distance traveled by the signal. Mathematically, the additional phase shift at the mth antenna can be written as (m − 1) · d · sin(θl) · 2π λ , where λ is the wavelength of radio signal. Then, the complex channel attention at the mth antenna can be expressed as (m−1)·d·2π · sin(θl) · αl. Denote ⃗hl as the channel coefficient vector for the lth path. λ Then, ⃗hl = ⃗a(θl) · αl, where (cid:2) ⃗a(θl) = 1 e−j 2πd sin(θl) λ e−j 4πd sin(θl) λ · · · e−j 2π(M −1)d sin(θl) λ (cid:3) T. (2.1) At each antenna of the device, the observed CSI is the blend of all paths as well as noise, i.e., P P ⃗H = ⃗hl = l l ⃗a(θl)αl. Then, the AoA estimation problem can be formulated as follows. Based on the N observations of CSI (i.e., ⃗Hn, n = 1, 2, · · · N , where ⃗Hn is the nth observation of channel vector), how to estimate θl, l = 1, 2, · · · , L. MUSIC Estimation. MUSIC is a subspace-based algorithm that has been widely used for AoA estimates in wireless localization. The general idea behind MUSIC method is to use all the eigen- vectors that span the noise subspace to improve the performance of the Pisarenko estimator. It mainly comprises the following steps. 17 Figure 2.2: Illustration of MUSIC algorithm for AoA estimates at a wireless device with uniform linear antenna array. Only one signal path with AoA θl is shown in the figure. Step 1: Calculate the correlation matrix of CSI observations: R = P N n=1 ⃗Hn ⃗H H n , where (·)H is conjugate transpose operator. Step 2: Perform eigendecomposition of the correlation matrix: [E S] = eig(R), where E is a matrix with its columns being eigenvectors and S is the diagonal matrix with sorted eigenvalues (in non-decreasing order). Step 3: Divide E into two sub-matrices: E = [EsEn], where Es is the signal subspace and En is noise subspace. Step 4: Evaluate the following function for all possible θ: p(θ) = n⃗a(θ) , where ⃗a(θ) is the steering direction defined in (2.1). The values of θ corresponding to the peaks ⃗a(θ)HEnEH 1 of p(θ) are the AoAs of incoming signals. 2.4.2: MUSIC for General Antenna Array (MUSIC-GAA) The above MUSIC algorithm assumes that the antenna array is equally spaced and linearly installed. However, in practice, most wireless APs are equipped with nonlinear antenna array. For example, many Wi-Fi routers are equipped with four antennas which are installed in a rectangular shape to save the space. In this section, AuthIoT extends the MUSIC algorithm for a wireless device with general antenna array. In addition, it borrows the idea from SpotFi [85] to improve the AoA resolution by jointly estimating AoA and ToF of incoming signals. The rationale behind joint 18 ddlldsinldsin2Uniform linear antenna array123M Figure 2.3: Illustration of MUSIC algorithm for AoA estimation at a wireless device with nonlinear (arbitrary) antenna configuration. Only one signal path with AoA θl is shown here. estimation is that, if two incoming signals are indistinguishable in the spatial domain (due to the limited number of antennas), they may be distinguishable in the time domain. Joint estimation makes it possible to distinguish two incoming signals even if they have very similar AoA. 
Consider a receiving device with nonlinear antenna array as shown in Fig. 2.3. For notional simplicity, we adopt polar coordinate system for the antennas using the first antenna position as the origin. Denote dm as the distance between the 1st and mth antennas and ϕm as their angle, as illustrated in the figure. Then, the coordinate of the mth antenna can be written as (dm, ϕm). Particularly, the first antenna’s coordinate is (0, 0). Recall that αl is defined as the complex channel attention of the lth path on the first antenna. The observed channel coefficient (CSI) on the mth antenna over subcarrier k can be modeled as: hm,k = X l αl · ej 2πdm cos(ϕm−θl) λ · e−j2πkfδτl + nm,k, (2.2) where (dm, ϕm) is the polar coordinate of the mth antenna, fδ is the subcarrier spacing of OFDM modulation, (αl, θl, τl) is the complex attention, AoA, and delay of the lth path, respectively. Lastly, nm,k is the CSI observation noise/error at antenna m over subcarrier k. Collectively, the observed CSI at all antennas and over all subcarriers can be expressed as an M × K complex matrix, where M is the number of antennas and K is the number of subcarriers. Consider a four-antenna 802.11 Wi-Fi router as an example, which has 52 valid subcarriers in 19 Nonlinear antenna array123mMm+1lm OFDM modulation. The CSI matrix H ∈ C4×52 can be written as follows: (2.3) Solely using spatial degrees of freedom (DoF) provided by antennas for AoA estimate may not be an ideal approach, as it requires the number of antennas is larger than the number of paths. This requirement may not be fulfilled in a real-world indoor environment when the number of antennas on a wireless AP is limited (e.g., four antennas on a Wi-Fi router). To improve the AoA resolution, AuthIoT expands the CSI matrix H for MUSIC-based AoA estimate by following the idea in [85]. Consider the CSI matrix in (2.3) as an example. AuthIoT can expand the CSI matrix by bonding every three columns as a new column as illustrated below: 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 h11 h12 h13 h14 h15 h16 h17 . . . h21 h22 h23 h24 h25 h26 h27 . . . h31 h32 h33 h34 h35 h36 h37 . . . h41 h42 h43 h44 h45 h46 h47 . . . h12 h13 h14 h15 h16 h17 h18 . . . h22 h23 h24 h25 h26 h27 h28 . . . h32 h33 h34 h35 h36 h37 h38 . . . h42 h43 h44 h45 h46 h47 h48 . . . h13 h14 h15 h16 h17 h18 h19 . . . h23 h24 h25 h26 h27 h28 h29 . . . h33 h34 h35 h36 h37 h38 h39 . . . h43 h44 h45 h46 h47 h48 h49 . . . 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 . He = (2.4) The expanded CSI matrix is of 12 by 50 size, i.e., He ∈ C12×50; and its correlation matrix is of 12 by 12 size, i.e., HeHH e ∈ C12×12. This means that, when applying MUSIC to AoA estimate, the expanded matrix renders a larger dimension for noise subspace compared to the original CSI matrix (12 − L versus 4 − L), thereby tending to offer a better AoA resolution. 20 ............494847464544434241393837363534333231292827262524232221191817161514131211hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhSpatial domain over antennasFrequency domain over OFDM subcarriersH = In a general case, for CSI matrix H ∈ CM ×K, a question is how many columns should be bonded when expanding this matrix for AoA estimate. For this question, we have the following considerations. On one hand, the number of rows of He should be maximized to improve the dimension of noise subspace; on the other hand, the expanded CSI matrix He should be a flat matrix for MUSIC calculation. 
Denote b as the number of bonding columns in the CSI matrix. Then, these two observations can be formulated as: max(M b), subject to M g ≤ K − G + 1 and G ∈ Z. Hence, we have G = ⌊ K+1 M +1 ⌋. Therefore, the dimension of the expanded CSI matrix is M g by K − G + 1, i.e., He ∈ C(M g)×(K−G+1). The jth column of He is [Hj; Hj+1; · · ·; Hj+G−1], where Hj is the jth column of H and [; · · · ; ] is vertical concatenation operator. For the expanded CSI matrix He, we would like to explore its basis for its columns. Based on (2.2), it can be verified that each of its columns is a linear combination of the following L basis vectors: ⃗al = [a11 a21 | · · · aM 1 } {z column 1 a12 a22 | · · · aM 2 } {z column 2 · · · a1G a2G · · · aM G } | {z column G ]T (2.5) for 1 ≤ l ≤ L, where amg = ej 2πdm cos(ϕm−θl) λ · e−j2πgfδτl with 1 ≤ m ≤ M and 1 ≤ g ≤ G. Based on the expanded CSI matrix He and its column basis, the two-dimensional MUSIC al- gorithm is summarized as follows. Step 1: Measure the CSI matrix H at M antennas over K subcarriers. Construct the expanded CSI matrix He by letting its jth column be [Hj; Hj+1; · · · ; Hj+G−1], where Hj is the jth column of H, [; · · · ; ] is vertical concatenation operator, and G = ⌊ K+1 M +1 ⌋. Step 2: Calculate the correlation matrix of CSI observations: R = HeHH e , where (·)H is con- jugate transpose operator. Step 3: Perform eigendecomposition of the correlation matrix: [E S] = eig(R), where E is a matrix with its columns being eigenvectors and S is the diagonal matrix with sorted eigenvalues (in non-decreasing order). Step 4: Divide E into two sub-matrices: E = [Es En], where Es is the signal subspace and En is noise subspace. 21 Table 2.2: Simulation parameters of MUSIC-GAA. parameter carrier frequency value 5 GHz bandwidth 40 MHz FFT size # of valid subcarrier # of antennas 64 52 4 antenna configuration Vertex of 6cm×6cm square parameter # of paths path 1: (α1, θ1, τ1) path 2: (α2, θ2, τ2) path 3: (α3, θ3, τ3) path 4: (α4, θ4, τ4) path 5: (α5, θ5, τ5) value 5 (1.00ej1.26, 15o, 5ns) (.40ej0.64, −71o, 21ns) (.20e−j1.86, 81o, 38ns) (.15ej1.64, −15o, 65ns) (.10e−j1.51, 31o, 89ns) Step 5: Evaluate the following function for all possible θ and τ : p(θ, τ ) = 1 ⃗a(θ, τ )HEnEH n⃗a(θ, τ ) . (2.6) Based on (2.5), the steering vector ⃗a(θ, τ ) is defined as follows: (cid:2) ⃗a(θ, τ ) = a11 a21 · · · aM 1 } {z | column 1 a12 a21 · · · aM 2 | } {z column 2 · · · a1G a2G · · · aM G } | {z column G (cid:3) T, (2.7) where amg = ej 2πdm cos(ϕm−θ) λ · e−j2πgfδτ for 1 ≤ m ≤ M and 1 ≤ g ≤ G. The values of (θ, τ ) corresponding to the peaks of p(θ, τ ) are regarded as a path with AoA of θ and delay of τ . An Example: We use an example to illustrate the performance of MUSIC-GAA. We consider a wireless AP and an IoT device and attempt to estimate the AoA of signal paths at the wireless AP. Table 2.2 lists the parameters that we use for simulation. Particularly, the antennas on the AP are not linear installed; instead, they are installed at the vertex of a 6cm×6cm square. This antenna configuration is more realistic compared to a uniform linear antenna array. In this case, the number of paths is greater than the number of antennas. Fig. 2.4 shows our simulation results when the CSI bears different levels of error. Specifically, Fig. 2.4a depicts the result when the AP has perfect 22 CSI. In this figure, the small circles mark the ground truth, while the black dots in the circles are the results of MUSIC-GAA. 
An Example: We use an example to illustrate the performance of MUSIC-GAA. We consider a wireless AP and an IoT device, and attempt to estimate the AoA of signal paths at the wireless AP. Table 2.2 lists the parameters that we use for simulation.

Table 2.2: Simulation parameters of MUSIC-GAA.
- carrier frequency: 5 GHz
- bandwidth: 40 MHz
- FFT size: 64
- # of valid subcarriers: 52
- # of antennas: 4
- antenna configuration: vertices of a 6cm×6cm square
- # of paths: 5
- path 1 (α_1, θ_1, τ_1): (1.00 e^{j1.26}, 15°, 5 ns)
- path 2 (α_2, θ_2, τ_2): (0.40 e^{j0.64}, −71°, 21 ns)
- path 3 (α_3, θ_3, τ_3): (0.20 e^{−j1.86}, 81°, 38 ns)
- path 4 (α_4, θ_4, τ_4): (0.15 e^{j1.64}, −15°, 65 ns)
- path 5 (α_5, θ_5, τ_5): (0.10 e^{−j1.51}, 31°, 89 ns)

Particularly, the antennas on the AP are not installed linearly; instead, they are installed at the vertices of a 6cm×6cm square. This antenna configuration is more realistic than a uniform linear antenna array. In this case, the number of paths is greater than the number of antennas. Fig. 2.4 shows our simulation results when the CSI bears different levels of error. Specifically, Fig. 2.4a depicts the result when the AP has perfect CSI. In this figure, the small circles mark the ground truth, while the black dots in the circles are the results of MUSIC-GAA. The results reveal that MUSIC-GAA finds the exact AoAs and delays of the five paths. Figs. 2.4b–d depict the results when the CSI at the AP has −40 dB, −30 dB, and −20 dB error. It can be seen that the heatmap becomes increasingly blurry as the CSI bears larger error. This indicates that accurate CSI is crucial. Fortunately, AuthIoT has accurate CSI for MUSIC-GAA, as the IoT device is physically close to the AP with a LoS path. Another observation from Figs. 2.4b–d is that the hot spots appear horizontally stretched, rendering better accuracy for AoA estimation than for delay estimation. This works in AuthIoT's favor, as it only requires the AoA of the LoS signal path and does not need the delay information. The stretching stems from the CSI expansion operation (see (2.4) for example), where each column of the expanded CSI matrix contains the CSI from all antennas but only from a subset of subcarriers.

Figure 2.4: Performance of MUSIC-GAA algorithm: (a) perfect CSI (no error); (b) CSI estimation error −40 dB; (c) CSI estimation error −30 dB; (d) CSI estimation error −20 dB.

2.4.3: MUSIC-GAA for AuthIoT

Using MUSIC-GAA in AuthIoT to estimate the LoS AoA faces the following two challenges. The first challenge is the very small delay difference of multiple paths in indoor environments, especially in a small room with many objects. For example, if the distance difference of two paths is 1 m, their delay difference is 3.3 ns. Achieving this delay resolution (3.3 ns) requires 300 MHz of bandwidth. Such a large signal bandwidth is not affordable for most wireless systems; 5 GHz Wi-Fi offers 40 MHz of bandwidth, which is insufficient to distinguish two paths whose distance difference is less than 1 m. The second challenge is the CSI quantization error. For example, the Atheros Wi-Fi NIC [182] offers 10-bit CSI quantization, rendering a quantization error of 10 log_{10}(1/2^{10}) = −30 dB; the Intel 5300 Wi-Fi NIC [63] offers 8-bit quantization for CSI, and its quantization error is 10 log_{10}(1/2^{8}) = −24 dB. As shown in Fig. 2.4, the CSI error degrades the performance of MUSIC-GAA.

AuthIoT addresses these two challenges as follows. First, it asks users to keep the IoT device close to the AP (∼2 m) so that there is a strong LoS path between the two devices. It also asks users to handwrite the passcode over the air at a large scale (i.e., spanning a 75cm×75cm area for each passcode character), so that the AoA change of writing a passcode character is significant.
Figure 2.5: Experimental results of MUSIC-GAA in two cases: (a) Intel 5300 card with three linearly equal-spaced antennas; (b) USRP N310 with four antennas placed at the vertices of a 6cm×6cm square.

These requirements will be specified in the manual for end users. Second, AuthIoT combines multiple consecutive packets to improve the LoS AoA estimation through k-means clustering [105]. Details will be given in §2.5.2.

We have evaluated the performance of MUSIC-GAA for AuthIoT via experiments in two cases: i) the AP is an Intel 5300 card with three linearly equal-spaced antennas; and ii) the AP is a USRP N310 with four antennas placed at the vertices of a 6cm×6cm square. Both testbeds use Wi-Fi signals for data packet transmission, and the packet rate is 1000 packets per second.
This means that the AP can obtain 1000 CSI instances per second. The distance between the AP and the IoT device is 2 m, with the presence of a LoS path. We conducted the measurement campaign in a two-story apartment with ordinary furniture. Fig. 2.5 shows our experimental results. It can be seen that the estimated LoS AoA increases/decreases as the ground-truth LoS AoA increases/decreases. This observation is consistent for both testbeds, and it indicates that the LoS AoA tends to manifest a unique pattern based on the movement of IoT devices.

2.5: Learning-based Passcode Recognition

A passcode is composed of several characters (English letters, numbers, and some special characters). AuthIoT recognizes each individual character based on the CSI it generates. Fig. 2.6 depicts the high-level system diagram of AuthIoT's passcode recognition.

Figure 2.6: Diagram of CSI-based passcode character recognition.

As shown in the diagram, AuthIoT uses both the LoS AoA and the normalized amplitude of the CSI as the features for CNN-based character recognition. The reason is that our experiments show that, compared to solely using the LoS AoA as a feature, adding the normalized CSI amplitude as input can considerably improve the recognition accuracy (by 5% on average in our observations). In what follows, we explain each module in Fig. 2.6.

2.5.1: CSI Segmentation, Resampling, and Compensation

CSI Segmentation. When a user continuously writes passcode characters in the air, the AP pings the IoT device at a certain rate (e.g., 200 ping packets per second), so that it can frequently estimate the CSI based on the ACK packets from the IoT device. In practice, an end user may take different amounts of time to write different characters, and different users may take different amounts of time to write the same character. Therefore, it is necessary to separate the collected CSI data in the time domain for each written character. To facilitate the CSI segmentation and improve its accuracy, AuthIoT asks end users to pause (holding the IoT device still) for one second before they begin to write a character. AuthIoT leverages the pause between two neighboring characters for CSI segmentation. In addition, AuthIoT asks end users to hold the IoT device still for two seconds before they start to write the passcode and after they complete the passcode writing. Since a still IoT device generates unique CSI features, AuthIoT leverages such features to determine the time period of passcode writing. Fig. 2.7 shows an example of AuthIoT's CSI segmentation, which comprises the following steps.

Step 1: Calculate the following metric: g(i) = ∠(h_{m,k}(i) h_{n,k}(i)^*) · |h_{m,k}(i)|, where h_{m,k}(i) is the channel coefficient from antenna m, subcarrier k, and packet i. In our design, m and n are the two antennas that offer the strongest CSI, and k = 1. Fig. 2.7a shows an instance of the phase difference of two channels, i.e., ∠(h_{m,k}(i) h_{n,k}(i)^*). Fig. 2.7b shows an instance of the channel amplitude, i.e., |h_{m,k}(i)|. Fig. 2.7c shows an instance of g(i).

Step 2: Calculate the window-slid variance as follows:

$$v(i) = \frac{1}{W} \sum_{j=i}^{i+W-1} |g(j) - \bar{g}|^2, \quad \text{where } \bar{g} = \frac{1}{W} \sum_{j=i}^{i+W-1} g(j).$$

Fig. 2.7d shows an instance of v(i).

Step 3: Compare v(i) with a threshold T_v, where T_v = 0.03 × avg{v(i)}. The CSI segment corresponding to v(i) ≥ T_v is considered for an individual character. Fig. 2.7d illustrates the windows corresponding to the segments of CSI to be used for character recognition.

Step 4: Check the segment length for each character. If the time duration of a CSI segment is shorter than 1 second or longer than 4 seconds, AuthIoT discards this CSI segment.

Figure 2.7: An example illustrating CSI segmentation.
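Putting Steps 1–4 together, the segmentation logic can be sketched as follows (a simplified single-subcarrier illustration; the function name, the grouping of active windows, and the default W and packet rate are ours, not values fixed by the thesis):

```python
import numpy as np

def segment_csi(h_m, h_n, W=200, fs=200):
    """Sketch of the four-step CSI segmentation above.

    h_m, h_n: complex CSI sequences (one subcarrier) from the two strongest
    antennas, indexed by packet; W: sliding-window size; fs: packet rate.
    Returns (start, end) packet-index pairs, one per written character.
    """
    g = np.angle(h_m * np.conj(h_n)) * np.abs(h_m)        # Step 1: metric g(i)
    v = np.array([np.var(g[i:i + W])                      # Step 2: windowed variance
                  for i in range(len(g) - W + 1)])
    T_v = 0.03 * v.mean()                                 # Step 3: threshold
    active = v >= T_v
    segs, start = [], None                                # group active windows
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segs.append((start, i + W - 1)); start = None
    if start is not None:
        segs.append((start, len(g) - 1))
    # Step 4: keep only segments lasting between 1 s and 4 s
    return [(s, e) for (s, e) in segs if fs <= e - s <= 4 * fs]
```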
CSI Resampling. After CSI segmentation, different CSI segments may have different numbers of CSI samples. The purpose of resampling is to ensure that the number of CSI samples in each CSI segment is identical. Doing so eases the training and inference of the CNN. AuthIoT resamples each CSI segment using linear interpolation and/or decimation on the real and imaginary parts of the CSI samples.

CSI Compensation. The CSI data need to be calibrated before being fed to MUSIC-GAA. Since the receiver and transmitter are not synchronized, the CSI data from a Wi-Fi receiver may suffer from Sampling Time Offset (STO) and Sampling Frequency Offset (SFO). To compensate for STO and SFO, a popular method is to perform linear regression over multiple consecutive CSI instances in both the time and frequency domains [85]. The linear fit of the unwrapped CSI phase for the ith packet can be expressed as

$$\tau_{s,i} = \arg\min_{\alpha,\beta} \sum_{m=1}^{M} \sum_{n=1}^{N} \big(\phi_i(m,n) + 2\pi f_\delta (n-1)\alpha + \beta\big)^2, \qquad (2.8)$$

where τ_{s,i} is the STO for the ith packet, f_δ is the frequency spacing between subcarriers, and ϕ_i(m, n) is the unwrapped phase at the mth antenna and nth subcarrier. After estimating τ_{s,i} based on (2.8), the compensation is performed by adding 2π f_δ (n − 1) τ_{s,i} to the phase of subcarrier n, for n = 1, 2, · · · , N. The same compensation applies to the CSI from each antenna.
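A compact least-squares version of this calibration, per (2.8), might look as follows (a sketch; the function name is ours, and the sign convention assumes the unwrapped phase decreases linearly with subcarrier index due to the STO):

```python
import numpy as np

def remove_sto(csi, f_delta):
    """Linear-fit STO/SFO compensation per (2.8).

    csi: complex CSI of shape (M, N) for one packet (M antennas, N subcarriers).
    Returns the CSI with the common linear phase slope removed.
    """
    M, N = csi.shape
    phase = np.unwrap(np.angle(csi), axis=1)       # unwrapped phase per antenna
    n = np.arange(N)
    x = 2 * np.pi * f_delta * n                    # regressor: 2*pi*f_delta*(n-1)
    A = np.column_stack([np.tile(x, M), np.ones(M * N)])
    coeff, *_ = np.linalg.lstsq(A, phase.ravel(), rcond=None)
    tau_s = -coeff[0]                              # STO estimate (phase slope is -tau_s)
    # add 2*pi*f_delta*(n-1)*tau_s to the phase of subcarrier n, on every antenna
    return csi * np.exp(1j * 2 * np.pi * f_delta * n * tau_s)
```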
2.5.2: Feature Extraction

LoS AoA Feature Extraction. AuthIoT uses MUSIC-GAA to estimate the AoA-delay profile of the signal paths based on the CSI samples. One observation from our experiments is that the AoA corresponding to the largest profile value is always associated with the minimum delay. This makes sense, as there always exists a strong LoS path between the IoT device and the AP. Based on this observation, AuthIoT chooses the AoA corresponding to the largest profile value as the LoS AoA.

As shown in Fig. 2.8, the LoS AoA computed from CSI is noisy due to hardware imperfections (e.g., 8-bit quantization) and interference caused by environmental changes (e.g., body movement). Sometimes the LoS AoA jumps by over 20 degrees within 10 consecutive packets. Obviously, such a big jump is abnormal. To reduce the adverse effect of this phenomenon, AuthIoT employs a clustering algorithm to eliminate unexpected AoA values. The rationale behind this algorithm is that the AoA should not change by more than 20 degrees over 10 packets (10 ms). The clustering algorithm works as follows.

Figure 2.8: Removal of abnormal LoS AoA samples through filtering.

Step 1: Slide a window of size 10 across the LoS AoA sample sequence with a step size of 5. In each window, the k-means clustering algorithm [105] is employed to divide the 10 LoS AoA samples into 2 groups.

Step 2: Calculate the average values of the samples in the two groups. If the difference is larger than 20 degrees and the number of samples in one group is less than 3, then the group of smaller size is regarded as abnormal.

Step 3: Replace every sample in the abnormal group with the average value of the larger group.

Amplitude Feature Extraction. In addition to the LoS AoA, AuthIoT uses the CSI amplitude as another feature for CNN-based recognition. The raw CSI amplitude is noisy. To enhance the input data quality, AuthIoT employs a Butterworth band-pass filter with a 5 Hz–20 Hz frequency band to eliminate the undesired frequency components and reduce the noise in the CSI amplitude. This is because a human's writing movement falls in this frequency range [59]. Fig. 2.9 shows an example of the filtering operation.

Figure 2.9: CSI amplitude before and after bandpass filtering.

In indoor environments, wireless channels over neighboring subcarriers are very similar [51]. Hence, AuthIoT applies Principal Component Analysis (PCA) to groups of adjacent subcarriers for data compression. Specifically, AuthIoT groups every 6 subcarriers and applies PCA to each group. The first component of the PCA results is kept as the amplitude feature, while the other components are discarded. Fig. 2.10 shows an example of this operation. As can be seen, the 6 adjacent subcarriers have similar channel amplitudes, and the first PCA component maintains the main shape of the channels.

Figure 2.10: Illustration of PCA operation on CSI amplitude.

Writing a character over the air mainly comprises a series of strokes. The action of each stroke is the key feature for CSI-based character recognition. To capture the action of each stroke, AuthIoT performs a Discrete Wavelet Transform (DWT) on the CSI amplitude after the PCA operation, as shown in Fig. 2.6. Similar to WiReader [59], it performs an 8-level discrete wavelet transform on the CSI amplitude samples using a symlet as the basis function. Fig. 2.11 shows an example of the DWT operation on the CSI amplitude, where Fig. 2.11(a) shows the CSI amplitude from PCA and Fig. 2.11(b) shows the DWT results. The DWT results are then sent to the CNN for training and inference.

Figure 2.11: Illustration of DWT operation on CSI amplitude.
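The amplitude branch of Fig. 2.6 (band-pass filter → PCA → DWT) can be sketched as follows, assuming SciPy and PyWavelets are available. The filter order and the specific "sym4" wavelet are our assumptions; the thesis only specifies a Butterworth 5–20 Hz band and an 8-level symlet transform:

```python
import numpy as np
import pywt
from scipy.signal import butter, filtfilt

def amplitude_features(amp, fs=200):
    """Band-pass -> per-group PCA -> 8-level DWT on CSI amplitude.

    amp: |CSI| of shape (packets, subcarriers); fs: packet rate (Hz).
    Trailing subcarriers that do not fill a group of 6 are dropped.
    """
    b, a = butter(4, [5, 20], btype="bandpass", fs=fs)     # 5-20 Hz Butterworth
    x = filtfilt(b, a, amp, axis=0)
    feats = []
    for g0 in range(0, x.shape[1] - 5, 6):                 # groups of 6 subcarriers
        grp = x[:, g0:g0 + 6] - x[:, g0:g0 + 6].mean(axis=0)
        _, _, vt = np.linalg.svd(grp, full_matrices=False)
        pc1 = grp @ vt[0]                                  # first principal component
        level = min(8, pywt.dwt_max_level(len(pc1), "sym4"))
        feats.append(pywt.wavedec(pc1, "sym4", level=level))
    return feats
```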
2.5.3: CNN Settings and Training

CNN Settings. Fig. 2.12 shows the structure of the CNN, which is composed of convolution layers, flatten layers, and fully-connected layers. Since the CSI amplitude matrix is of high dimension (1000 × 40 × 3), AuthIoT employs convolution operations to extract its high-level features and reduce its dimension. Specifically, AuthIoT treats the amplitude DWT spectrum (1000 × 40) as an image and each of the three antennas as an image channel, similar to the processing of RGB channels in color image recognition. Two convolution layers are used to compress the amplitude DWT spectrum. The first convolution layer involves 32 kernels of size 11 × 5, and the second layer has 16 kernels of size 6 × 4. The step size of both kernels is one. The purpose of the convolution layers is to extract features from the amplitude DWT spectrum based on its spatial relationships; the kernels move across the feature matrix, and the convolution result is output through a ReLU function. To further reduce the data dimension, AuthIoT employs an average pooling layer of size 3 × 3 after each of the convolution layers. The pooling layers down-sample the amplitude matrix, thereby reducing the computational complexity. The output of the second pooling layer is flattened for vectorization. AuthIoT then concatenates the resultant amplitude features with the AoA features, and feeds the concatenated data vector to a fully-connected 128 × 64 × 32 neural network. A SoftMax activation function is used in the output layer to calculate the probability of each possible passcode character.

Figure 2.12: CNN structure.

CNN Training and Inference. As stated before, some character pairs are not distinguishable in their handwritten form, such as "z" and "Z", "c" and "C", "o" and "O", "s" and "S", "v" and "V", the letter "I" and the number "1", etc. Unfortunately, this challenge is hard to address from a technical perspective. Therefore, AuthIoT excludes the subset of indistinguishable characters. Table 2.3 lists the 48 characters that can be used for a passcode in AuthIoT.

Table 2.3: Passcode characters.
- Capital letters: A–Z
- Lower-case letters: a, b, d, e, f, g, h, q, r, t
- Numbers: 3–9
- Special characters: #, $, %, +, =

To train the CNN model, CSI data are collected from different locations and diverse users (details given in §2.6). The batch size in our training process is set to 100, and the number of epochs is set to 25. A batch normalization layer is added to the neural network after the activation function. We observed that it could improve the convergence speed of the training process, especially when the CSI is not stable due to environmental changes. In addition, a dropout layer is added after the second (64 neurons) and third (32 neurons) layers to avoid overfitting [143]; it makes the network less sensitive to specific neurons and, in turn, improves the network's generalization. The dropout rate is set to 0.2 for each layer, randomly setting outputs to zero. The CNN uses cross-entropy as the loss function and employs the Adam optimization algorithm to update the weights. After the CNN is trained, the system is used for online passcode character recognition in different environments. The CNN model yields the probability of the input being each character, and the character with the highest probability is regarded as the character being written by the end user.
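As a reference, the architecture of Fig. 2.12 can be expressed in a few lines of Keras (a sketch; the exact placement of the batch-normalization and dropout layers follows our reading of the text above and is not guaranteed to match the authors' implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_authiot_cnn(num_classes=48):
    """Keras sketch of the network above (shapes follow Fig. 2.12)."""
    dwt = tf.keras.Input(shape=(1000, 40, 3), name="amplitude_dwt")
    aoa = tf.keras.Input(shape=(1000,), name="los_aoa")
    x = layers.Conv2D(32, (11, 5), activation="relu")(dwt)   # -> 990 x 36 x 32
    x = layers.AveragePooling2D((3, 3))(x)                   # -> 330 x 12 x 32
    x = layers.Conv2D(16, (6, 4), activation="relu")(x)      # -> 325 x 9 x 16
    x = layers.AveragePooling2D((3, 3))(x)                   # -> 108 x 3 x 16
    x = layers.Concatenate()([layers.Flatten()(x), aoa])     # fuse with AoA vector
    x = layers.Dense(128, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Dropout(0.2)(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model([dwt, aoa], out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```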
2.6: Experimental Evaluation

2.6.1: Implementation and Experimental Settings

Intel 5300 Testbed. This testbed is implemented using a Dell XPS 8940 desktop with an Intel Wi-Fi NIC 5300 and a Redmi Note 9 Pro cellphone. The desktop serves as the AP working in hotspot mode, and the cellphone emulates an IoT device. The desktop runs the Ubuntu 14.04 operating system with the 802.11 Linux CSI tool [63], which is used to acquire the CSI from the Wi-Fi card. The carrier frequency is 5 GHz, and the bandwidth is 40 MHz. The packet rate is 600 packets per second. The Intel Wi-Fi NIC 5300 is equipped with three antennas, which are linearly placed with equal spacing. The antenna spacing is half a wavelength (3 cm). Fig. 2.13(a) shows the linear antenna setting of this testbed.

USRP Testbed. This testbed consists of a USRP N310 and a USRP N210. The USRP N310 has four antennas and serves as the AP. The USRP N210 has one antenna and emulates an IoT device by sending data packets to the USRP N310. The carrier frequency is 2.4 GHz, and the bandwidth is 20 MHz. The packet rate is 1000 packets per second. This testbed has two antenna settings: a linear antenna array as shown in Fig. 2.13(b) and a nonlinear antenna array as shown in Fig. 2.13(c). For the linear case, the antenna spacing is 6.25 cm. For the nonlinear case, the four antennas are positioned at the vertices of a 6cm×6cm square.

Figure 2.13: AP antenna settings: (a) Intel 5300 testbed with linear antenna array; (b) USRP testbed with linear antenna array; (c) USRP testbed with nonlinear antenna array.

Figure 2.14: Experimental settings: (a) lab scenario; (b) office scenario; (c) hallway scenario; (d) home scenario.

Figure 2.15: Recognition accuracy of AuthIoT on Intel 5300 testbed.
Figure 2.16: Recognition accuracy breakdown of AuthIoT on Intel 5300 testbed.
Figure 2.17: Recognition accuracy of AuthIoT on USRP testbed.

Experimental Settings. Four scenarios are considered for the evaluation of AuthIoT: lab, office, hallway, and home, as shown in Fig. 2.14. The AP was placed on a table of 70 cm height, and the IoT device was held by the participants. The participants were asked to face the AP and keep an approximate 2 m distance. We placed the two testbeds in these four scenarios and collected data to evaluate the performance. The training data were collected solely from the lab, while the evaluation (inference) was performed in all four scenarios (lab, office, hallway, and home). The training data were collected from five different participants, while the evaluation was conducted over nine participants (i.e., the five participants used for training plus four new participants). In the training phase, each participant was asked to write the 48 characters in Table 2.3, and each character was repeated 12 times. In total, 576 data samples were collected from each participant in the lab scenario for training purposes. In the test (inference) phase, each of the nine participants was asked to hold the IoT device and write 500 characters at his/her will in each scenario. The collected data samples were fed into the system for evaluation purposes.

2.6.2: Experimental Results from Intel 5300 Testbed

The Intel 5300 is a commercial off-the-shelf Wi-Fi NIC that is widely used in computers and routers. Evaluating AuthIoT on this testbed reveals its performance in real-world Wi-Fi networks.

Overall Accuracy. Fig. 2.15 presents the overall recognition accuracy on this testbed.
Specifically, AuthIoT reaches 88% recognition accuracy with a standard deviation of 0.018 over the nine participants in the lab scenario; 85% recognition accuracy with a standard deviation of 0.023 in the hallway scenario; 84% recognition accuracy with a standard deviation of 0.014 in the office scenario; and 83% recognition accuracy with a standard deviation of 0.019 in the home scenario. It can be seen that AuthIoT performs best in the lab scenario. This is not surprising, because AuthIoT's CNN model was trained on the dataset collected from the lab scenario.

Fig. 2.18 shows the confusion matrix of passcode character recognition. It can be seen that the accuracy is above 85% for most characters. The majority of errors occur due to the ambiguity of characters sharing similar hand gestures. For example, AuthIoT is more likely to be confused by the letters 'C' and 'O'; it is also hard to distinguish the letters 'M' and 'N'.

Figure 2.18: Confusion matrix for Intel 5300 testbed (with linear antenna array).

Accuracy of Individual Category. To obtain more details, we examine the performance of AuthIoT over three subsets of passcode characters: 26 upper-case letters, 10 lower-case letters, and 10 numbers. Fig. 2.16 shows our test results. It can be seen that the recognition accuracy in all scenarios is above 85% for the three subsets of characters.

2.6.3: Experimental Results from USRP Testbed

We further evaluate the performance of AuthIoT on the USRP testbed with linear and nonlinear antenna arrays.

Linear Antenna Array. Fig. 2.17 presents the recognition accuracy on the USRP testbed when it is equipped with four linearly equal-spaced antennas. It can be seen that the recognition accuracy in the lab scenario is better than in the other scenarios. This is because AuthIoT's CNN model was trained on the dataset collected from the lab scenario. It can also be seen that the recognition accuracy on the USRP testbed is slightly higher than that on the Intel 5300 testbed. This can be attributed to the fact that the USRP testbed has one more antenna than the Intel 5300 testbed.

We also examine the recognition accuracy for each individual participant. Fig. 2.19 shows our experimental results. The results show that the recognition accuracy is within the range of 81% to 88% for the nine participants. This indicates that AuthIoT is robust against the variation of end users.

Nonlinear Antenna Array. Fig. 2.17 also presents the recognition accuracy when the USRP testbed is equipped with four nonlinear antennas. It can be seen that the two cases (linear antenna array and nonlinear antenna array) have very similar recognition accuracy, with a difference of less than 2%. The performance similarity can be traced back to the accuracy of the LoS AoA estimation shown in Fig. 2.5. Since the LoS AoA estimation in the two antenna settings has similar accuracy, it is not surprising that the recognition accuracy in the two antenna settings is also similar.

2.6.4: Robustness of AuthIoT

To evaluate the robustness of AuthIoT, we examine its performance when the user is located at different distances and in different directions.

Different Distances. We change the distance between the AP and IoT device to examine the performance of AuthIoT.

Figure 2.19: Recognition accuracy of each individual participant on USRP testbed.
Figure 2.20: Recognition accuracy for user at different angles to the AP.

Table 2.4: Recognition accuracy of AuthIoT when the distance between AP and IoT device changes (linear / non-linear antenna settings).

distance   Lab          Office       Hallway      Home
2.0m       89% / 89%    86% / 84%    85% / 84%    84% / 82%
2.5m       87% / 86%    85% / 84%    84% / 84%    83% / 82%
3.0m       85% / 85%    84% / 83%    84% / 83%    82% / 81%
3.5m       85% / 84%    83% / 82%    83% / 83%    82% / 81%

We consider four distances: 2.0 m, 2.5 m, 3.0 m, and 3.5 m, and conduct experiments in four scenarios: lab, office, hallway, and home. Table 2.4 presents our experimental results. It can be seen that, in each scenario, AuthIoT delivers consistent performance when the distance between the AP and IoT device varies from 2.0 m to 3.5 m. For all cases, the recognition accuracy of AuthIoT is within the range from 81% to 89%, regardless of the experimental scenario, the antenna pattern, and the distance between the AP and IoT device. This indicates the robustness of AuthIoT.

Different Directions. To evaluate its robustness to the user's facing direction, we let the user keep the same distance to the AP but move around with different facing angles ranging from 10 degrees to 40 degrees. As we can observe from Fig. 2.20, the recognition accuracy of AuthIoT slightly degrades when the angle between the user and the AP increases from 0 to 40 degrees. This is because the training data were collected at the 0-degree location. However, it can be observed that the accuracy for both the linear and nonlinear settings is always above 83%.

Discussions. The overall recognition accuracy of 83% is not perfect but within an acceptable range. In practice, there are some ways to further improve AuthIoT's Quality of Experience (QoE) for end users. For example, an end user can consider using an all-numbers passcode: AuthIoT offers superior performance when the passcode consists only of numbers, and an all-number passcode is sufficiently strong in practice. Moreover, a prompt-notification mechanism can be added to AuthIoT to improve the QoE of end users. In essence, AuthIoT is a learning-based classification algorithm. The output of AuthIoT includes not only the corresponding character but also its recognition probability (i.e., the recognition confidence). When AuthIoT has low confidence in a character recognition, it can immediately ask the end user to rewrite the previous character. Doing so will offer a better QoE for end users.

2.7: Summary

In this chapter, we studied the communication authentication problem for wireless IoT devices without an input interface. We presented AuthIoT to authenticate such IoT devices in Wi-Fi networks by leveraging the unique CSI patterns generated by the movement of IoT devices. AuthIoT exploits environment-independent CSI features for learning-based character recognition, and is therefore transferable for cross-environment applications. AuthIoT also extends its applicability to the case where a Wi-Fi AP is equipped with a nonlinearly installed antenna array by generalizing existing AoA estimation methods. We have built a prototype of AuthIoT and evaluated its performance on testbeds with linear and nonlinear antenna arrays. Our experimental results confirm that AuthIoT is transferable for cross-environment applications, and show that AuthIoT achieves at least 83% recognition accuracy.
CHAPTER 3: HANDWRITING RECOGNITION THROUGH WALLS USING FMCW RADAR

3.1: Introduction

Despite the rise of digital technologies, handwriting, whether on traditional paper or electronic devices like iPads, continues to be a widely used method for recording and sharing information. Studies show that people still engage in handwriting more frequently than they might expect [18]. In some scenarios, the confidentiality of written content is of paramount importance. A natural question to ask is: if one is writing important documents on a desk in a private room, is it possible for an attacker outside the room to detect the letters being written through the wall? Understanding the capability and performance limits of such an attacker would inform the public of not only potential threats but also possible countermeasures, thereby preventing information leakage and enhancing human activity privacy.

Recent years have witnessed significant progress in remote human activity recognition (HAR) using different sensing technologies such as cameras [45, 80, 101], ultrasound [113, 187], Wi-Fi [52, 78, 93, 94, 123, 163], RFID [172, 191], and millimeter-wave (mmWave) radar [19, 72, 98, 99, 161, 174, 188, 206]. In contrast to existing work, the task of detecting handwriting content through a wall is unique yet challenging in the following three aspects.

(a) Through-wall detection. This requirement significantly limits the viable sensing techniques for this task. Camera-based computer vision (CV) can be used for human activity recognition by analyzing video data to identify and classify different human actions and movements [80, 101]. Powered by advanced deep neural network (DNN) techniques, a camera system can easily recognize handwritten characters from a distance [57, 132]. However, camera-based HAR systems are limited by occlusions and thus not applicable to through-wall detection. Ultrasound sensors have also been widely used for HAR. They emit high-frequency sound waves that bounce off objects and produce echoes, which can be analyzed to determine the patterns of human activities [50, 113, 187]. Unlike camera sensors, ultrasound can work even in low-light conditions and does not require a line-of-sight path. But ultrasound has a very limited ability to pass through walls due to its short wavelength, making it unsuitable for this task. Radio frequency (RF) has emerged as a popular technology for HAR tasks such as gesture recognition [52, 66, 82, 93, 123, 166, 200], keystroke detection [22, 190], and vital sign detection [61, 167]. Among existing RF technologies, high-frequency signals (e.g., mmWave) have very limited ability to pass through a wall. Therefore, radio signals on sub-10 GHz bands appear to be the plausible carrier for HAR behind walls.

Figure 3.1: RadSee is a joint hardware (radar) and software (deep learning) design to detect the letters being written by a victim behind a wall.

(b) Millimeter-level hand movement for writing. Handwriting features very small movements compared to other human activities. Typically, the movement of a pen-holding hand is smaller than 1 cm for both paper and iPad writing. When using an RF system for handwriting detection, its detection resolution is determined by its signal wavelength. On one hand, high-frequency mmWave (e.g., 60 GHz and 77 GHz) signals are capable of detecting sub-mm movement of an object but cannot pass through a wall.
On the other hand, low-frequency (e.g., 915 MHz) microwave signals can easily pass through a wall but cannot detect mm-level movement of an object. On the middle-frequency spectrum bands such as 2.4 GHz and 5 GHz, channel state information (CSI) in Wi-Fi networks has been extensively used for HAR [52, 78, 93, 94, 163, 171], and its application on 5 GHz frequency bands seems a possible solution to achieve the desired trade-off between wall penetration and detection resolution. However, Wi-Fi CSI-based HAR is a non-coherent detection approach that suffers from phase, frequency, and timing misalignments in hardware. As such, it is incapable of detecting mm-level movement in time. Recently, 6 GHz FMCW radar, which is a coherent detection system, has been used for HAR tasks such as human body skeleton construction [16, 95, 209–211]. This approach uses custom-designed hardware and promises high accuracy and stability. However, so far, its applications have been limited to the detection of large-scale movements such as people walking and interacting.

(c) Interference resilience. The detection of handwriting may suffer from interference from other moving objects, such as a person walking around the writer. It may also suffer from interference from indirect paths between the writer and the detection equipment. Such interference is a notorious issue in RF sensing [86, 119, 184], and it is particularly acute in sub-6 GHz RF sensing systems. If not addressed, the interference may become dominant and place a fundamental limit on the detection performance. Moreover, since different scenarios have different multi-path effects and different moving objects/people, addressing the interference is critical to extracting environment-independent features for handwriting recognition and, ultimately, to developing a radio detector that can work in new environments.

In this chapter, we present RadSee, a 6 GHz FMCW radar system for detecting handwriting activities behind a wall, as shown in Fig. 3.1. RadSee is realized through a joint hardware and software design. In terms of hardware, RadSee builds a 6 GHz FMCW radar with highly optimized patch antennas. In terms of software, RadSee first extracts the phase information of the demodulated FMCW signals and then employs a deep neural network (DNN) model for letter classification. Combining the hardware and software innovations, RadSee is capable of continuously detecting mm-level handwriting movement over time and recognizing most letters based on their unique phase patterns.

RadSee addresses Challenge (a) with its FMCW modulation, its high-gain patch antennas, and its optimized baseband analog filter. RadSee has co-located Tx and Rx RF chains, making it possible to perform coherent signal demodulation for handwriting recognition. In addition, the optimized patch antennas have a total of 36 dBi gain for wall penetration. RadSee addresses Challenge (b) by using the phase information of the demodulated FMCW signals to extract the features of handwriting movements. FMCW radar has been widely used for ranging. Its range resolution is c/(2B), where c is the light speed and B is the signal bandwidth. Achieving a range resolution of 1 mm requires B = (3×10^8)/(2×10^{−3}) = 150 GHz of signal bandwidth, which is impossible in practice.
However, the phase of the demodulated FMCW signal is much more sensitive to the movement of an object. In theory, a 1 mm hand movement corresponds to a 14° phase change of the demodulated signal, which is easy to detect. Therefore, RadSee uses the phase of the demodulated FMCW signals as the feature for letter recognition. RadSee addresses Challenge (c) by demodulating the reflected signals only from the handwriting movement. This is achieved by its FMCW modulation and highly directional patch antennas. The FMCW modulation allows it to focus on the Range-FFT bin that corresponds to the distance of interest; the patch antennas allow it to focus on the reflected signal from the direction of interest. Combining FMCW modulation and antenna directivity, RadSee is capable of detecting a clear phase pattern corresponding to the handwriting movements behind a wall using a small transmission power (20 dBm).

Based on the demodulated FMCW signals, RadSee employs a bidirectional long short-term memory (BiLSTM) model to classify the handwritten characters (a–z, A–Z, and 0–9). Different from other human activities such as keystrokes [22, 190], handwriting is a smooth and continuous movement of the pen-holding hand. As such, handwriting tends to generate a unique temporal phase pattern for each letter. That is the reason why RadSee uses a BiLSTM to classify a phase data sequence. Within a phase data sequence, some parts may be very important for letter recognition (e.g., turning points), while other parts may not carry useful information (e.g., horizontal strokes). Therefore, RadSee adds an attention layer to the BiLSTM model so that the model can automatically focus on the important parts of a phase data sequence for letter classification. Powered by the BiLSTM model and its attention mechanism, RadSee is capable of recognizing handwritten letters based on their unique movement patterns.

We have built a prototype of RadSee (through PCB fabrication) and evaluated its performance in various scenarios. Experimental results show that, when placed behind office interior drywalls and external wood/vinyl walls, RadSee achieves 75% letter recognition accuracy when victims randomly write 62 different letters and 87% word recognition accuracy when victims write articles. Notably, RadSee demonstrates its resilience to interference from persons walking around the victim writer and interference from other radio devices. Table 3.1 shows the comparison of RadSee and its related work.

Table 3.1: Related work on human activity recognition. W = "See through wall?", M = "Mm-level movement detection?", R = "Resilient to multipath?", I = "Resilient to interference from other moving objects?", S = "Classification size".

References                                                                          Objective             Technique         W  M  R  I  S
RF-Capture [16], RF-Avatar [210], RF-Pose [209], RF-Pose3D [211], RF-Action [95]    Human body skeleton   6 GHz FMCW radar  ✓  ✗  ✓  ✓  N/A
WiSIA [94], WiPose [78], F. Wang [163]                                              Radio imaging         Wi-Fi             ✗  ✗  ✗  ✗  N/A
Tadar [191], RF-HMS [172]                                                           Human tracking        RFID              ✓  ✗  ✗  ✗  N/A
mTrack [174]                                                                        Hand writing          mmWave            ✗  ✓  ✗  ✓  N/A
WiKey [22]                                                                          Key stroke            Wi-Fi             ✗  ✗  ✗  ✗  37
WiHF [93], WriFi [52], WiSee [123]                                                  Gesture recognition   Wi-Fi             ✗  ✗  ✗  ✗  26
Soli [99]                                                                           Gesture recognition   mmWave            ✗  ✓  ✗  ✓  4
PhaseBeat [167]                                                                     Vital sign            Wi-Fi             ✓  ✗  ✗  ✗  N/A
RF-SCG [61]                                                                         Vital sign            mmWave            ✗  ✓  ✗  ✗  N/A
RadSee (ours)                                                                       Hand writing          FMCW radar        ✓  ✓  ✓  ✓  62

It advances the state-of-the-art in the following aspects:

• It designs and implements a 6 GHz FMCW radar device that can detect mm-level movements of an object behind a wall using a small transmission power.
• RadSee is capable of detecting the letters that one is writing behind a wall. Furthermore, it is resilient to interference from other mobile objects and other radio devices.

• Extensive experimental results show that RadSee can achieve over 75% accuracy when detecting 62 random letters and 87% word recognition accuracy behind walls.

3.2: Related Work

We surveyed the literature in two categories: through-wall detection and fine-grained human activity recognition. Table 3.1 in Section 3.1 outlines RadSee's uniqueness compared to prior work.

3.2.1: See Through Wall using Radio

See Through Wall using FMCW Radar. Some pioneering works have studied 6 GHz FMCW radar to detect and track human activities behind walls using model-based or learning-based methods [16, 95, 209–211]. For instance, [209–211] focus on using FMCW radar to generate heatmap images of the human body skeleton through walls, and [95] uses FMCW radar to detect the interactions between two people behind walls. However, all these works are based on the ranging detection of FMCW radars. Since the range resolution of an FMCW radar is fundamentally limited by its bandwidth, this method cannot achieve mm-level accuracy for through-wall motion detection. To address this issue, RadSee uses the phase information for through-wall mm-level hand movement detection. RF-Capture [16] is probably the most closely related work to RadSee. It also uses FMCW radar to recognize "handwriting" behind a wall. However, the letters that RF-Capture aims to recognize are of large size (e.g., 0.5 m×0.5 m). It is actually gesture recognition rather than normal-sized handwriting detection. Its method is based on range- and angle-based tracking, and thus cannot achieve mm-level accuracy. Therefore, RadSee is fundamentally different from RF-Capture.

Through-Wall Detection using Wi-Fi. Wi-Fi signals are ubiquitous and have a strong ability to pass through walls. [17] utilizes Wi-Fi signals and multi-antenna techniques to track the movement of people behind a wall. [173] uses Wi-Fi signals to recover the audio sound from a speaker placed behind a soundproof wall. However, due to the non-coherent detection at a Wi-Fi receiver, it is impossible for a Wi-Fi receiver to detect movement at the millimeter level. Therefore, Wi-Fi signals are not suitable for through-wall handwriting detection.

Through-Wall Detection using RFID. Through-wall detection is also possible using RFID systems. Tadar [191] and RF-HMS [172] demonstrated the capability of tracking human moving directions through walls using an array of RFID tags. However, the tracking error in these systems is around 10 cm, indicating their incapability of tracking mm-level hand movements. An RFID tag can also be used to measure the vibration pattern of a loudspeaker [162]. But, due to its long wavelength (33 cm), it is not a good candidate for tracking mm-level movements.

3.2.2: Fine-Grained HAR

Handwriting Recognition. Camera-based handwriting recognition is a well-established field [39]. However, cameras cannot see through walls. Recently, RF signals have been studied for handwriting recognition. RF-IDraw [165] attaches an RFID tag to a person's finger and can reconstruct the trajectory of that finger. A multi-resolution positioning technique was designed, yielding a tracing accuracy at the centimeter level. mTrack [174] developed a mmWave (60 GHz) tracking system and achieved mm-level tracking accuracy. It also demonstrated its capability of recognizing handwritten letters.
However, mmWave signals are vulnerable to blockage and cannot go through walls. Therefore, mTrack is not suitable for our purpose.

MmWave FMCW Radar Detection. In recent years, mmWave (24 GHz, 60 GHz, and 77 GHz) FMCW radars have become available on the market for autonomous driving applications. These radars have been widely used for human activity recognition and vital sign detection [19, 61, 72, 98, 99, 161, 174, 188, 206]. Given their large bandwidth and small wavelength, they can easily achieve mm-level accuracy when detecting object movements. However, mmWave signals cannot pass through walls. Therefore, they cannot be applied to through-wall handwriting detection.

Gesture and Vital Sign Detection. CSI in Wi-Fi networks has been used for a wide range of sensing applications such as gesture recognition [52, 93, 123], vital sign detection [167], and radio imaging [78, 94, 163]. However, Wi-Fi is a non-coherent system due to the physical separation of its transmitter and receiver. Therefore, its detection accuracy is fundamentally limited by timing, frequency, and phase misalignments. As a result, it is not competent for mm-level handwriting detection.

3.3: Attack Model

Attack Scenarios. We consider the scenario shown in Fig. 3.1, where one is writing a confidential document on paper or an electronic device (e.g., iPad or Kindle Scribe) in a private room (e.g., a government office, business office, hotel room, or apartment). Inside the room, there may be other static objects (e.g., furniture) and people performing daily activities. Outside the room, there is a malicious attacker who aims to detect the content (English letters and Arabic numerals) being written by the victim.

Attacker's Assumptions. We assume that the attacker has physical access to the space behind the wall that the victim is facing. As such, the radio signals for detecting the victim's handwriting movements are not blocked by the victim's torso. We also assume that the attacker knows the layout of the room and the approximate location of the victim; the accurate location of the victim's writing hand is estimated by the attacker using the radar signal. Obtaining knowledge of the location of a desk in a room is not improbable, as many public spaces such as hotels have standard layouts that are consistent across rooms. Furthermore, we assume that there are no RF-shielding materials inside the wall between the victim and the attacker.

Challenges. As stated before, there are three grand challenges that must be addressed in the design of such an adversarial device: through-wall detection, mm-level recognition, and interference resilience. In addition, handwriting recognition has 62 character candidates (26 lower-case letters, 26 upper-case letters, and 10 Arabic numerals) for classification. Such a large character set adds another level of challenge to the task. Addressing these challenges calls for a joint hardware and software design of the attack device.

3.4: RadSee: Design Analysis

3.4.1: A Primer on FMCW Radar

FMCW radar is an active radio device that uses frequency modulation to generate a continuous-wave signal with a linear frequency sweep. This signal is transmitted from the radar antenna toward a target, and the reflection from the target is received by the radar antenna. The frequency difference between the transmitted and received signals, known as the beat frequency, is proportional to the range of the target.
By analyzing the beat frequency over time, FMCW radar can determine the distance and velocity of the target. Fig. 3.2 shows the diagram of an FMCW radar device. It transmits frequency-modulated continuous-wave signals and receives the reflected signals from the surrounding objects.

Figure 3.2: Illustration of radar device.

Denote s_T(t) as the transmitted signal and s_R(t) as the received echo from an object. Mathematically, we have

$$s_T(t) = \exp\!\big(j(2\pi f_0 t + \pi K t^2)\big), \qquad (3.1)$$

and

$$s_R(t) = \alpha\, s_T(t - 2d/c), \qquad (3.2)$$

where f_0 is the starting frequency, K is the frequency ramp rate, α is the path attenuation, d is the distance from the radar to the object of interest, and c is the light speed. The received signal and the transmitted signal are mixed together, generating the intermediate frequency (IF) signal. The IF signal can be written as:

$$s_{IF}(t) = s_T(t)\, s_R(t)^* = \exp\Big(\underbrace{j 4\pi K \tfrac{d}{c}\, t}_{\text{frequency}} + \underbrace{j 4\pi f_0 \tfrac{d}{c}}_{\text{phase}} - \underbrace{j 4\pi K \tfrac{d^2}{c^2}}_{\text{negligible}}\Big). \qquad (3.3)$$

As we can see from (3.3), the observed frequency and phase both contain the distance information. Typically, the frequency term in (3.3) is used to estimate the range of an object, while the phase term is used to estimate the velocity of the object. Specifically, the range and velocity of the object are estimated as follows.

• Range. As illustrated in Fig. 3.2, the IF signal from each chirp is digitized and converted to the frequency domain through an FFT operation. Suppose that the FFT size is N and a peak is identified at the ith FFT bin (0 ≤ i ≤ N − 1). Then, the distance of the corresponding object is d = (c/2B)·i, where B is the FMCW signal bandwidth. Accordingly, the range resolution is Δd = c/(2B), which is determined solely by the FMCW signal bandwidth.

• Velocity. Grouping an array of chirps together, the velocity of the object can be accurately estimated by performing a second FFT operation on the ith Range-FFT bins. Suppose that the time duration of one chirp is T and that the FFT size is M. If a peak is identified at the kth FFT bin, the velocity of the object is v = kc/(2MTf_0). Accordingly, the velocity resolution is Δv = c/(2MTf_0), which is determined by three parameters: the initial frequency, the time duration of a chirp, and the number of chirps used.
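In code, this Range-FFT/Doppler-FFT pipeline reduces to a few lines (a sketch for a single dominant reflector; the names are ours, Doppler sign/aliasing handling is ignored, and we assume the N-point FFT spans exactly one chirp):

```python
import numpy as np

def range_doppler(iq, T, B, f0, c=3e8):
    """Estimate range and velocity from FMCW IF samples.

    iq: complex IF samples of shape (M, N) -- M chirps of N samples each
    T:  chirp duration; B: sweep bandwidth; f0: starting frequency.
    """
    M, N = iq.shape
    rng_fft = np.fft.fft(iq, axis=1)               # Range-FFT per chirp
    i = int(np.argmax(np.abs(rng_fft).sum(axis=0)))  # strongest range bin i
    d = c / (2 * B) * i                            # distance d = (c/2B) * i
    dop_fft = np.fft.fft(rng_fft[:, i])            # Doppler-FFT across chirps
    k = int(np.argmax(np.abs(dop_fft)))            # peak Doppler bin k
    v = k * c / (2 * M * T * f0)                   # velocity v = kc / (2MTf0)
    return d, v
```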
3.4.2: Feasibility Analysis

To detect fine-grained movements, the first option that came to our mind was mmWave FMCW radar, which is widely available on the market at a low price. In particular, existing work (e.g., [72, 98, 161]) has demonstrated the ability of mmWave radars to "see" through walls made of cotton and glass. A key question is whether a mmWave radar can "see" through the typical walls in our daily lives. To answer this question, we conducted experiments using the IWR1642BOOST 77 GHz mmWave FMCW radar from Texas Instruments (TI) with a bandwidth of 1.1 GHz. We placed the mmWave radar behind an office drywall to detect the handwriting in a room. Fig. 3.3 shows our writing content. Fig. 3.4(a) shows the experimental setting and the corresponding FFT bin's amplitude and phase over time. We did not observe any amplitude or phase changes over time caused by the handwriting. This indicates that mmWave signals cannot go through the drywall under test.

Figure 3.3: Illustration of handwriting pattern.

Another possible approach to this task is to use Wi-Fi-based channel state information (CSI). Since Wi-Fi uses the 2.4 GHz and 5 GHz frequency bands, its signal is able to penetrate walls for movement detection. To examine this approach, we conducted experiments in the same scenario as the previous case. Fig. 3.4(b) shows the measured CSI at a receiver when using Wi-Fi channel #3 (2412 MHz–2432 MHz). We observed random CSI changes over time and did not find any patterns in the CSI's amplitude and phase that are related to the handwriting movement. Similar results were observed for the CSI measured on Wi-Fi channel #36 (5170 MHz–5190 MHz). This can be attributed to the non-coherent detection of a Wi-Fi receiver. Since the Wi-Fi transmitter and receiver are driven by different clocks, the measured CSI suffers from carrier frequency and sampling time offsets, making it unreliable for extracting the pattern of tiny-scale movements.

In comparison, we replaced the mmWave/Wi-Fi device with RadSee, our custom-designed 6 GHz FMCW radar. Fig. 3.4(c) shows the corresponding FFT bin's amplitude and phase over time. It can be seen that the phase pattern is significant and consistent with the handwriting trajectory on the paper (see Fig. 3.3). This demonstrates the ability of RadSee to "see" through the wall under test.

Figure 3.4: (a) The amplitude and phase of the corresponding FFT-bin from the IWR1642BOOST mmWave FMCW radar. (b) The amplitude and phase of a subcarrier from a Wi-Fi receiver. (c) The amplitude and phase of the corresponding FFT-bin from RadSee.

Why Use 6 GHz FMCW Radar? Some may inquire about the suitability of other frequencies for through-wall and fine-grained movement detection. Low-frequency (0–3 GHz) radio signals have large wavelengths, rendering them incapable of detecting movements at the millimeter level. High-frequency (20–300 GHz) radio signals, on the other hand, have a large path loss and a significant penetration loss; thus they cannot travel through walls with normal transmission power. Radio signals in the range from 3 GHz to 20 GHz, however, should be suitable for this task. We opted for 6 GHz due to the availability and cost-effectiveness of electronic chips for FMCW radar implementation, including the phase-locked loop (PLL), voltage-controlled oscillator (VCO), power amplifier (PA), low-noise amplifier (LNA), etc. On the market, only 6 GHz chips are available to individual customers at a reasonable price, thanks to the widespread production of the 5 GHz Wi-Fi industry. The cost of our prototype is approximately $500.

Millimeter-level Movement Detection. If an FMCW radar were to achieve 1 mm range resolution, it would need 150 GHz of spectrum bandwidth, which is impossible in practice. Therefore, RadSee uses the phase information of its demodulated FMCW signal to infer the movement pattern of handwriting. Based on Eqn. (3.3), when the object moves 1 mm, RadSee will observe a phase change of (2Δd f_0/c)·2π ≈ 0.25 radian (about 14°) on the corresponding Range-FFT bin. Typically, a handwriting movement is larger than 5 mm, which generates more than a 70° phase change on the Range-FFT bin. Therefore, the radar measures the phase pattern over time while a victim is writing, and uses the temporal phase pattern to classify the letters being written.
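As a quick check of this sensitivity claim, take Δd = 1 mm and a nominal starting frequency f_0 = 6 GHz (a round-number assumption within RadSee's 5.4–6.5 GHz sweep):

$$\Delta\phi = \frac{2\,\Delta d\, f_0}{c}\cdot 2\pi = \frac{2 \times 10^{-3} \times 6\times 10^{9}}{3\times 10^{8}} \cdot 2\pi \approx 0.25\ \text{rad} \approx 14.4^{\circ},$$

and a 5 mm stroke scales this to roughly 5 × 14.4° ≈ 72°, consistent with the 70° figure above.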
Fig. 3.5 shows the observed phase change of a Range-FFT bin when one is writing back and forth on a paper behind a thick office drywall. The distance between the writing hand and the wall is about 2 m. The radar was placed on the other side of the wall, at a distance of 0.5 m. The person wrote back and forth within a vertical range of 1.5 cm.

Figure 3.5: Phase observations at a behind-wall radar when one is continuously writing back and forth on a paper within 1.5 cm. (a) phase data before DC component removal. (b) phase data after complete DC component removal. (c) phase data after partial DC component removal.

It can be observed from Fig. 3.5(a) that the phase changes as the pen-holding hand moves. However, the phase dynamic range is small. The small dynamic range is attributed to a DC voltage component of the received signal. The DC component, which can be modeled as a constant complex number, is the reflected signal from static objects (e.g., furniture and the human body) at the same distance. Fortunately, the DC component is static over time and thus can be easily removed. Ideally, we should completely remove the DC component to maximize the phase sensitivity. However, when we completely remove the DC component, the periods of no movement exhibit an irregular phase pattern, as illustrated in Fig. 3.5(b), making it hard for RadSee to detect the gap between two consecutive letters. Therefore, we partially remove the DC component to strike a balance between movement detection sensitivity and the phase stability of non-movement periods. Fig. 3.5(c) shows the resulting phase of a Range-FFT bin over time.
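A minimal sketch of this partial DC adjustment (NumPy; the retention factor and the use of the time-averaged mean as the DC estimate are illustrative assumptions, not RadSee's calibrated procedure):

import numpy as np

def partial_dc_removal(bin_series, beta=0.2):
    # bin_series: complex samples of one Range-FFT bin across chirps.
    # beta = 0 removes the DC component completely (maximum phase
    # sensitivity, but erratic phase during no-movement periods);
    # beta = 1 keeps it (stable but insensitive). A small beta keeps just
    # enough DC to stabilize the phase between letters.
    dc = bin_series.mean()               # reflections from static objects
    adjusted = bin_series - (1.0 - beta) * dc
    return np.angle(adjusted)            # phase sequence fed to later stages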
Interference Resilience to Other Mobile Objects. In the proximity of a target writer, there may be many static objects such as desks, chairs, books, and lamps. Fortunately, the static objects will not generate interference for RadSee's detection, as their reflected signals appear as a constant complex number (DC component) over time. Such a constant can be easily removed or adjusted to extract the useful phase information. As stated before, RadSee may suffer from interference from two sources: (i) channel multipath, and (ii) the movement of other objects (e.g., a walking person). In fact, RadSee is resilient to the interference from these two sources, thanks to its FMCW modulation and antenna directivity. We explain the reasons below.

• FMCW Modulation (Distance Filter). If two moving objects have different distances to the radar and their range difference is larger than the radar's range resolution, their phase-change patterns will appear on different Range-FFT bins and will not interfere with each other. Therefore, increasing the range resolution of RadSee is critical for reducing the interference from mobile objects. RadSee uses 1.1 GHz of bandwidth (5.4-6.5 GHz) and thus has a range resolution of 14 cm. This means that a mobile object (e.g., the writer's chest moving while breathing), if separated from the writing hand by more than 14 cm in range, will not interfere with RadSee's handwriting detection.

• Patch-array Antenna (Directional Filter). In addition to offering high link gain, the patch-array antenna also serves as a directional filter to suppress the interference from undesired azimuth/elevation angles. We designed and optimized the patch-array antenna using CST Studio Suite [42] and fabricated it as shown in Fig. 3.6. The main lobe of the antenna has an angular width of 21° (3 dB beamwidth), which means that this antenna can effectively mitigate interference from mobile objects positioned 21° or more away from the writer.

Figure 3.6: The gain pattern of the patch antenna (left). The custom-designed directional antenna (right).

Combining its FMCW modulation and patch-array antenna, RadSee is capable of extracting the phase information corresponding to the movement within a small spot of interest, while remaining resilient to interference from other moving objects.

Interference Resilience to In-band Wi-Fi Devices. Although RadSee operates on a frequency band that overlaps with 5 GHz Wi-Fi, it differs significantly from Wi-Fi in two key aspects. First, RadSee has a bandwidth of 1.1 GHz, while Wi-Fi devices typically operate within a bandwidth of 20 or 40 MHz. Second, RadSee utilizes an FMCW waveform, whereas Wi-Fi devices use an Orthogonal Frequency-Division Multiplexing (OFDM) waveform. OFDM waveforms are characterized by pseudo-noise-like signals. When an OFDM signal is correlated with an FMCW signal over time, the correlation result is nearly zero. Therefore, in theory, RadSee is resilient to radio interference from the Wi-Fi devices in its proximity.
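The near-orthogonality of the two waveforms can be checked with a small simulation (a sketch; the chirp rate, signal length, and random QPSK subcarriers are illustrative assumptions rather than RadSee's exact parameters):

import numpy as np

rng = np.random.default_rng(0)
n = 4096
t = np.arange(n) / n
chirp = np.exp(1j * np.pi * 200 * t**2)          # FMCW-style linear chirp
# An OFDM symbol looks pseudo-noise-like: random QPSK across subcarriers.
qpsk = (rng.choice([-1, 1], n) + 1j * rng.choice([-1, 1], n)) / np.sqrt(2)
ofdm = np.fft.ifft(qpsk) * np.sqrt(n)            # unit average power
rho = abs(np.vdot(chirp, ofdm)) / (np.linalg.norm(chirp) * np.linalg.norm(ofdm))
print(rho)   # normalized correlation, typically ~1/sqrt(n): nearly zero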
To validate the above theory, we conducted experiments by observing RadSee's IF signals with and without radio interference from a Wi-Fi device, as shown in Fig. 3.7. To better control the experiments, we used a Universal Software Radio Peripheral (USRP) device for continuous Wi-Fi signal generation at two frequencies: 5.480 GHz and 5.805 GHz. The bandwidth of the Wi-Fi signals is 20 MHz. The scene was static during the experiments. Fig. 3.7 presents RadSee's IF signals (i.e., the input of the DNN) in three cases: i) no radio interference from the Wi-Fi device, ii) radio interference from the 5.480 GHz Wi-Fi device, and iii) radio interference from the 5.805 GHz Wi-Fi device. It can be seen that the IF signals generated by RadSee are almost the same in these three cases. This indicates that RadSee is resilient to radio interference from Wi-Fi devices.

Figure 3.7: Study of an FMCW radar's resilience to radio interference from Wi-Fi devices: experimental setup (left) and experimental results (right).

3.5: RadSee: Data Processing

In this section, we present the signal processing pipeline of RadSee, as outlined in Fig. 3.8. We first elaborate on the signal processing modules for phase feature extraction and then use k-nearest neighbors (kNN) to validate the extracted features.

Figure 3.8: RadSee process overview.

3.5.1: Signal Processing

Analog Signal Filtering. The received signal at RadSee may have different components, including RF leakage on the PCB, the desired echo from handwriting, and undesired echoes from other moving objects, as shown in Fig. 3.9. Since the RF leakage signal is very close to zero frequency, RadSee uses a high-pass filter with a 5 kHz cutoff frequency to suppress it. Meanwhile, the undesired high-frequency signals from other moving objects may interfere with the desired signal if not suppressed in the analog domain. To this end, RadSee employs a first-order low-pass filter with a bandwidth of 100 kHz to suppress high-frequency echoes from undesired moving objects. Combining the high-pass and low-pass filters, RadSee has a band-pass filter from 5 kHz to 100 kHz, corresponding to a target range from 0.4 m to 8 m for handwriting detection.

Figure 3.9: Illustration of the received signals at the radar.

Range-FFT. RadSee sets its chirp cycle time to 1 ms. In each chirp cycle, RadSee sets its transmission time to 0.6 ms and its idle/delay time to 0.4 ms, as shown in Fig. 3.10(a). As the PLL and VCO are typically not very stable at the beginning and end of their frequency ramping, RadSee discards 0.05 ms at the beginning and at the end of its transmission period, leaving 0.5 ms for useful signal reception. To best observe this useful signal in the digital domain, RadSee samples its received signal at 5 MSps. As a result, it obtains 2,500 complex samples from each chirp cycle. To obtain finer-grained range bins, RadSee appends zeros to the 2,500 samples and performs an 8,192-point Range-FFT. The resultant Range-FFT bins are shown in Fig. 3.10(b). Of the resulting 8,192 bins, only the first 256 are examined.

Figure 3.10: Illustration of the IF signal. (a) the IF signal in the time domain. (b) the IF signal after the FFT operation.
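The per-chirp processing can be summarized in a few lines (a sketch using NumPy/SciPy; RadSee's 5 kHz-100 kHz filtering is done in the analog domain, so the digital band-pass here is only an emulation for illustration):

import numpy as np
from scipy.signal import butter, sosfilt

FS = 5e6   # ADC rate: 5 MSps, i.e., 2,500 complex samples per 0.5 ms chirp
SOS = butter(1, [5e3, 100e3], btype="bandpass", fs=FS, output="sos")

def process_chirp(samples):
    # samples: 2,500 complex IF samples from one chirp cycle.
    filtered = sosfilt(SOS, samples)       # emulate the analog band-pass filter
    bins = np.fft.fft(filtered, n=8192)    # zero-padded Range-FFT
    return bins[:256]                      # only the first 256 bins are examined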
Filtering for Range-FFT Bins. For each Range-FFT bin of interest, RadSee first adjusts the DC component based on the dynamic range of the bin's real and imaginary parts, and then applies a low-pass filter to remove the high-frequency component. Following [154], RadSee sets the low-pass filter's bandwidth to 5 Hz. Fig. 3.11 compares the data sequences of one Range-FFT bin before and after the DC adjustment and low-pass filtering. It can be observed that this process manifests the phase pattern of handwriting effectively.

Figure 3.11: (a) The original signal of one Range-FFT bin (one sample per chirp cycle); (b) the Range-FFT bin after DC adjustment and low-pass filtering; (c) phase of the signal in (b).

FFT Bin Selection. Experiments show that handwriting causes multiple bins to fluctuate. This can be attributed to the high range resolution and the multipath effect within the antenna's aperture. Instead of using a single Range-FFT bin, RadSee uses multiple consecutive Range-FFT bins and extracts their phase patterns. Two questions need to be answered: (i) how many Range-FFT bins should be selected, and (ii) which Range-FFT bins should be used. For the first question, RadSee empirically selects five consecutive Range-FFT bins and uses their phase information for letter classification. For the second question, RadSee selects the smallest-index Range-FFT bin whose phase variance exceeds a predefined threshold. RadSee's bin selection algorithm is provided in Alg. 1. Its core idea is to identify five consecutive Range-FFT bins based on their phase variances, so that the handwriting movement pattern can be captured along the line-of-sight (shortest) through-wall path. These five bins are then fed into our DNN for letter recognition. Fig. 3.12 shows a sample of our observed Range-FFT bins in handwriting detection. In this case, RadSee selects bins 66 to 70 as the input of its DNN model for letter classification.

Figure 3.12: Phase sequence of six Range-FFT bins.

Algorithm 1 RadSee's bin selection algorithm.
Require: Range-FFT phase matrix $[S(i,t) \in \mathbb{R}]^{N \times T}$, where $i$ is the bin index ($0 \le i < N$) and $t$ is the time index ($0 \le t < T$); window size $W$; predefined lower bound of variance $\theta_{lw}$; predefined upper bound of variance $\theta_{up}$. ▷ In our experiments, $W = 500$, $N = 256$, $T = 5000$, $\theta_{lw} = 0.03$, $\theta_{up} = 0.18$.
Ensure: The smallest bin index $i$ whose phase variance exceeds $\theta_{lw}$ but is lower than $\theta_{up}$.
1: for $t = 0$ to $T - W$ do
2:   for $i = 0$ to $N - 1$ do
3:     Calculate the window-slided variance $v(i,t) = \frac{1}{W}\sum_{j=t}^{t+W-1} |S(i,j) - \mu|^2$, where $\mu = \frac{1}{W}\sum_{j=t}^{t+W-1} S(i,j)$.
4:     if $v(i,t) > \theta_{lw}$ and $v(i,t) < \theta_{up}$ then
5:       return $i$
6:     end if
7:   end for
8: end for
9: return $-1$ ▷ Indicates that no writing activity is detected.
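Algorithm 1 translates almost line for line into the following sketch (NumPy; the parameter defaults are taken from the algorithm's annotations). RadSee then takes bins i through i+4 as the DNN input.

import numpy as np

def select_bin(S, W=500, theta_lw=0.03, theta_up=0.18):
    # S: phase matrix of shape (N, T): N Range-FFT bins over T chirp times.
    # Returns the smallest bin index whose windowed phase variance lies in
    # (theta_lw, theta_up), or -1 if no writing activity is detected.
    N, T = S.shape
    for t in range(T - W + 1):
        for i in range(N):
            window = S[i, t:t + W]
            v = np.mean(np.abs(window - window.mean()) ** 2)
            if theta_lw < v < theta_up:
                return i
    return -1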
Data Segmentation. RadSee performs data segmentation on the phase stream of the selected Range-FFT bins to extract the meaningful features that correspond to individual letters. RadSee employs different segmentation methods in the training and test phases. We elaborate on them as follows. (i) During the Training Phase: Since we have full control of the training data collection, we ask every participant to stop and stay still for one second after writing each letter. By doing so, RadSee can easily segment the phase sequence and extract meaningful phase data for individual letters. (ii) During the Test Phase: In this phase, RadSee has no control over the writing style of a victim. Most likely, the victim writes in a continuous manner without stopping in the middle. Interestingly, we always observed a rapid phase change during the transition from writing one letter to another. Fig. 3.13 shows an example of our observations. This is caused by the pen-holding hand's quick movement during the transition period. RadSee leverages this signature to segment the phase data streams. Since the time needed to write different letters may differ, the data sequences corresponding to different letters are of heterogeneous length.

Figure 3.13: Illustrating the rapid phase change of a target Range-FFT bin during the transition of writing letters.

Extracted Phase Features. Based on the above process, RadSee obtains the phase data segments corresponding to the individual letters being written. Fig. 3.14 shows some samples of the phase segments obtained from different users. From the figure we have the following observations. First, for the same user, the phase patterns of different letters are different. This is an encouraging observation, as the uniqueness of phase patterns is the foundation of letter classification. Second, for the same letter (e.g., letter 'A' in Fig. 3.14), the phase patterns from different users look different. So far, it is not clear whether those phase patterns would be classified as the same letter through an advanced transformation. To better understand this question, we conduct feature validation using kNN.

Figure 3.14: The observed phase sequences when three users are writing letters 'A', 'B', and 'C'.

3.5.2: kNN-based Feature Validation

We use the kNN model [41] to validate the effectiveness of the extracted features. kNN is a simple data classification method that estimates the class of a new data sample based on a set of labeled data samples. When a new data sample arrives, the distance between this new sample and all labeled samples is calculated. Then, the k closest neighbors are selected. These k closest neighbors cast weighted votes (based on their distances) to make the final classification decision for the new data sample. One issue with kNN in this case is that the length of the data samples (phase sequences) is not fixed, i.e., different phase sequences have different lengths. To address this issue, we employ Dynamic Time Warping (DTW), which has been widely used in speech recognition [43] and data mining [83]. DTW finds an optimal alignment between two sequences by warping the time axis non-linearly.
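A compact sketch of this DTW-based weighted kNN (Python; the inverse-distance vote and the textbook DTW recursion are standard formulations assumed here, not taken from RadSee's code):

import numpy as np

def dtw(a, b):
    # Classic DTW distance between two 1-D phase sequences of unequal length.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_classify(sample, train_seqs, train_labels, k=5):
    # The k nearest neighbors (by DTW distance) vote with weight 1/distance.
    dists = np.array([dtw(sample, s) for s in train_seqs])
    votes = {}
    for idx in np.argsort(dists)[:k]:
        votes[train_labels[idx]] = votes.get(train_labels[idx], 0.0) + 1.0 / dists[idx]
    return max(votes, key=votes.get)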
Data Set. We collected phase data samples for 62 characters (a-z, A-Z, and 0-9) from 12 users. Each user was asked to write in the print style on a desk one meter away from the wall. Our radar was placed just behind the wall to collect the phase data. Each letter has 10 samples from each user, for a total of 120 samples from the 12 users. In total, 7,440 samples were collected for all 62 characters, all of which were labeled during data collection. The data samples are divided into two groups: those from the first 6 users are used for training, while those from the remaining 6 users are used for testing.

Validation Results. We perform kNN on the collected data set. As an example, Fig. 3.15 shows the search results of kNN when the new data sample is the phase sequence of letter 'A'. It can be seen that, of the five closest data samples in the training data set, four are correct (labeled 'A') and one is incorrect (labeled 'k'). The five closest data samples cast votes to make the final decision. The weighted vote for 'A' is 10.54, while the weighted vote for 'k' is 2.32; these values are consistent with inverse-distance weighting (1/0.32 + 1/0.38 + 1/0.39 + 1/0.45 ≈ 10.54 and 1/0.43 ≈ 2.32). Based on the voting result, this new data sample is classified as letter 'A', which is correct.

Figure 3.15: Results of using kNN to search for the 5 closest neighbors of a new data sample. The top-left figure shows the phase sequence of the new data sample. The remaining 5 figures show the 5 closest data samples (and their corresponding letters) found in our training data set.

Fig. 3.16 shows kNN's classification accuracy when the test data samples are from 6 different users. We note that the test data samples and the training data samples are from different users. As we can observe, the classification accuracy ranges from 53% (User 4) to 77% (User 3). This gap could be attributed to two factors: i) most of the training data are from Asian participants; and ii) User 4 is an American participant, while the other five test users are Asian participants.

We then evaluate kNN's classification accuracy using the data samples from User 6 when the radar was placed at different distances (1 m, 2 m, and 3 m). The training data samples were collected from six different users when the radar was placed at a 1 m distance. Fig. 3.17 presents the classification results. It shows that the classification accuracy is 68% when the test was conducted at the same distance. However, when RadSee is at a different distance from the victim, its detection accuracy decreases to 58%.

Figure 3.16: kNN's classification accuracy when test and training data are from different users.

Figure 3.17: kNN's classification accuracy when the radar is at different distances.

Limitations of kNN. The kNN-based classification results indeed manifest the effectiveness of the phase features for handwritten letter classification. But this approach has two limitations. First, it has a very high computational complexity, which limits the size of the labeled (training) data set. Second, it uses the phase sequence from only one Range-FFT bin for classification. Using the five selected Range-FFT bins together may improve the classification accuracy. In what follows, we design a DNN-based approach for handwriting recognition, with the aim of overcoming the above limitations and improving the classification accuracy.

3.6: RadSee: DNN-based Recognition

In this section, we focus on designing a DNN model for through-wall handwriting recognition using the phase features extracted in the previous section. Compared to kNN, a DNN is much more efficient in computation and is more appealing for practical use.

3.6.1: DNN Model

In essence, this letter recognition problem is a classification problem whose input is a multi-dimensional phase sequence and whose output is the probability of each letter in the candidate set (a-z, A-Z, and 0-9). We found that this task is similar to many classification tasks in natural language processing (NLP), such as information status classification [70] and stress detection [178]. Following the state-of-the-art classification techniques in NLP, we employ an attention-based Bidirectional LSTM (BiLSTM) model for RadSee's letter classification. Fig. 3.18 shows the high-level structure of our attention-based BiLSTM model.

Figure 3.18: The structure of an attention-based BiLSTM model for letter recognition. The input is the phases of the selected 5 Range-FFT bins over T = 3000 ms, and the output is one of the 62 characters.
The BiLSTM component is used to extract the temporal features in the time-series phase sequence. The attention layer is used to capture the key movement information of the handwriting. This is critical, as the key information of a handwriting movement likely lies in a few turning points. The attention layer allows the model to focus on specific parts (e.g., those turning points) of the phase sequences, thereby improving the accuracy and efficiency of classification.

3.6.2: BiLSTM

BiLSTM is a variant of the LSTM network [69] and has demonstrated its effectiveness for a wide range of NLP tasks such as machine translation [153], part-of-speech tagging [97], and sentiment analysis [169, 215]. In a BiLSTM, the input sequence is processed in both forward and backward directions using two separate LSTM layers. This allows the model to capture both past and future context for each input element. This is crucial for handwriting recognition: the turning points of the handwriting movement carry the key information for letter classification, but they may appear at the beginning, in the middle, or at the end of a phase sequence. The use of BiLSTM allows the model to capture those turning points at any position in the input phase sequence.

Input Data. We set the input data shape to 3000 × 5, where 3,000 is the number of chirps and 5 is the number of selected Range-FFT bins. Recall that each chirp is 1 ms, so the maximum time for writing a letter is 3 seconds. In most cases, one can finish writing a letter in less than 3 seconds. If the phase sequence is shorter than 3,000 points, we simply pad zeros after the phase sequence to form the BiLSTM input. If the phase sequence is longer than 3,000 points, we trim the head and tail of the phase sequence, retaining only the 3,000 points in the middle as the BiLSTM input.

LSTM Cell. The LSTM has been used in a wide range of learning tasks. It is the key component of the BiLSTM model shown in Fig. 3.18. It allows the model to selectively retain or forget information at each time step. The cell structure includes three gates: an input gate, a forget gate, and an output gate. The input gate determines which information should be stored in the cell, the forget gate determines which information should be discarded, and the output gate determines which information should be used for the current output. Fig. 3.19 shows the structure and parameters of each LSTM cell, which operates as follows:

$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$
$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$
$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$

Figure 3.19: The structure and operation of an LSTM cell ($h_t \in \mathbb{R}^{128 \times 1}$, $c_t \in \mathbb{R}^{128 \times 1}$, and $W_f, W_i, W_c, W_o \in \mathbb{R}^{128 \times 133}$).

BiLSTM Structure. As shown in Fig. 3.18, the BiLSTM has two LSTM cells: one for the forward information flow and the other for the backward information flow.
In each iteration t, it combines 63 cht-1ct-1ctht[ht-1, xt]ftitct~ otxt tanhhtσ σ tanhσ Wf, bfWi, biconcatenationccadditionmultiplicationWc, bcWo, bo the hidden states of forward and backward LSTMs through concatenation: ht = [⃗ht, ⃗ht], where ⃗ht is the hidden state from the forward LSTM, ⃗ht is the hidden state from the backward LSTM, and ht is the hidden state of the BiLSTM. Since each LSTM has 128 hidden layers, we have ht ∈ R256×1, with t = 1, 2, . . . , 3000. Then, the combined hidden states are fed to the attention layer for further processing. 3.6.3: Attention Layer The attention mechanism is probably one of the most important inventions for deep learning and it has been used for many applications such as GPT [32, 125, 169, 193]. With the attention layer, the model learns to focus on some key parts of the data sequence. During the handwriting of a letter, some turning points may carry critical information for letter classification. The attention layer attempts to learn the importance of each part of the phase sequence and then assigns them with proper weights. To calculate the corresponding weights, it first feeds ht to a one-layer Multilayer Perceptron (MLP) to learn a hidden representation ut, and then normalizes the weights to generate αt. Mathematically, it can be written as follows: ut = tanh(W⊤ h ht + bh), αt = P exp(ut) T k=1 exp(uk) , TX s = αtht, t=1 (3.5a) (3.5b) (3.5c) where Wh ∈ R256×1 is the training weights, bh ∈ R is a training bias, and s ∈ R256×1 is the weighted vector for the fully-connected neural network in Fig. 3.18. The fully-connected network is of 256 × 64 × 128 × 62 size. The last layer is a SoftMax layer to calculate the possibility of each letter candidate (a-z, A-Z, and 0-9). The letter of the highest possibility is selected as the output y. 64 Figure 3.20: Radar PCB (left) and a picture of RadSee (right). 3.7: Implementation 3.7.1: Hardware Fig. 3.20 shows the hardware components of RadSee. We fabricated a radar PCB board as shown in this figure. The electronic components of this board include VCO, LNA, PA, Tx/Rx 16 dB RF coupler, RF quadrature mixer, and baseband filter. This PCB was made by OSH Park using FR408 substrate. We designed, simulated, and optimized 4 × 4 patch-array antennas using HFSS for radio signal transmission and reception. These antennas offer 18 dBi antenna gain for both transmission and reception. In total, it offers 36 dBi gain for the link path, making it possible to compensate the signal penetration loss of a wall. The total cost of RadSee is approximately $500, including $50 for PCB fabrication, $50 for antennas, and $400 for chips. We use USRP N210 with LFRX daughterboard to convert the analog signal to digital I/Q samples, which were then sent to a computer for data process. Transmission power is set to 20 dBm. The FMCW radar sweeps from 5.4 GHz to 6.5 GHz. The time duration of one chirp period is 1 ms, including 600 µs for frequency sweeping and 400 µs for idle. 3.7.2: Algorithms Digital Signal Processing. We implemented the data processing algorithms on a laptop in C++ using GNU Radio Out-of-Tree (OOT) module. The laptop receives a continuous data stream from 65 Custom-made PCB boardUSRP N210 Custom-made AntennaLFRXUSB to Serial adapterRF couplerPAVCOLNAFilterRF mixerPower Figure 3.21: Evaluation setting: (a) Laboratory scenario. (b) Office scenario. (c) Apartment sce- nario. (d) RadSee attacks from outside of the apartment. the radar. 
3.7: Implementation

3.7.1: Hardware

Fig. 3.20 shows the hardware components of RadSee. We fabricated a radar PCB board as shown in the figure. The electronic components on this board include the VCO, LNA, PA, a 16 dB Tx/Rx RF coupler, an RF quadrature mixer, and a baseband filter. The PCB was made by OSH Park using FR408 substrate. We designed, simulated, and optimized 4 × 4 patch-array antennas using HFSS for radio signal transmission and reception. These antennas offer 18 dBi of antenna gain for both transmission and reception, i.e., 36 dBi of combined gain over the link path, making it possible to compensate for the signal penetration loss of a wall. The total cost of RadSee is approximately $500, including $50 for PCB fabrication, $50 for antennas, and $400 for chips. We use a USRP N210 with an LFRX daughterboard to convert the analog signal to digital I/Q samples, which are then sent to a computer for data processing. The transmission power is set to 20 dBm. The FMCW radar sweeps from 5.4 GHz to 6.5 GHz. The time duration of one chirp period is 1 ms, including 600 µs for frequency sweeping and 400 µs of idle time.

Figure 3.20: Radar PCB (left) and a picture of RadSee (right).

3.7.2: Algorithms

Digital Signal Processing. We implemented the data processing algorithms on a laptop in C++ using a GNU Radio Out-of-Tree (OOT) module. The laptop receives a continuous data stream from the radar. It needs to synchronize with the chirp signal and extract the useful data samples of each chirp. Fortunately, due to the presence of the 400 µs idle period in each chirp, it is easy to identify the useful data samples in the data stream. Specifically, we use the high peaks shown in Fig. 3.10 to extract the useful data samples. One fundamental issue with the current hardware design is the lack of clock synchronization between the ADC and the FMCW chirps. To address this issue, we use a high sampling rate of 5 MSps and perform fine-grained synchronization to identify the first data sample corresponding to the starting moment of each chirp.

Data Collection for DNN Training.¹ We collected training data in a laboratory. The radar was placed behind an interior drywall at a distance of 0.5 m. A writing desk was placed in front of the wall at a distance of 1 m, as shown in Fig. 3.21(a). Eighteen participants (4 American, 3 Indian, 4 Middle Eastern, 7 Chinese) were asked to write 62 characters (a-z, A-Z, and 0-9) on the desk. Each participant wrote every character 60 times. In total, we collected 18 × 62 × 60 = 66,960 data samples. Of the eighteen participants, twelve were asked to write in the print style, while six were asked to write in the cursive style. Regarding handedness, two of them were left-handed writers while the rest were right-handed writers. The handedness and writing styles of the participants are summarized in Table 3.2. Some writing samples from the participants are provided in Fig. 3.22.

¹The experiments were conducted under FCC experimental spectrum license with Call Sign # WM2XWQ and File # 0954-EX-CN-2022.

Figure 3.21: Evaluation setting: (a) Laboratory scenario. (b) Office scenario. (c) Apartment scenario. (d) RadSee attacks from outside of the apartment.

Table 3.2: Participants for training and test data collection.

                            Handedness     Writing style   # of participants
Participants for training   Right-handed   Print           11
                                           Cursive         5
                            Left-handed    Print           1
                                           Cursive         1
Participants for test       Right-handed   Print           7
                                           Cursive         3
                            Left-handed    Print           1
                                           Cursive         1

Figure 3.22: Writing samples from participants for training.

DNN Training. The DNN model was implemented using TensorFlow's Keras library. We used cross entropy as the loss function. During the training process, we set the batch size to 2,000 and trained the model for 500 epochs. We used the Adam optimizer with a learning rate of 7e-4.
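These hyperparameters map onto the following compile/fit sketch (continuing the Keras model sketch from Section 3.6; x_train and y_train are hypothetical placeholders for the 66,960 labeled phase segments, and the integer label encoding is an assumption):

import tensorflow as tf   # `model` is the attention-BiLSTM sketch from Section 3.6

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=7e-4),
    loss="sparse_categorical_crossentropy",   # cross entropy over the 62 classes
    metrics=["accuracy"],
)
# x_train: (66960, 3000, 5) phase tensors; y_train: integer labels in 0..61.
model.fit(x_train, y_train, batch_size=2000, epochs=500)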
3.8: Experimental Evaluation

3.8.1: Letter Recognition Accuracy

Write on A4 Papers. Recall that our training data was collected in a laboratory from eighteen participants. To evaluate the recognition accuracy of RadSee, we completely separate the training and test datasets. We invited twelve new participants (4 American, 4 Chinese, 2 Indian, 2 Middle Eastern) to write letters in the same setting (i.e., sitting 1 m away from the wall and facing the radar). None of these twelve people participated in the training data collection. Each of them wrote 300 random letters on A4 papers. During the test, eight participants were asked to write in the print style, and four were asked to write in the cursive style. Both the print and cursive letters are within the size range of 5 mm to 10 mm. Regarding handedness, ten participants were right-handed writers, while two were left-handed writers. The handedness and writing styles are summarized in Table 3.2.

Fig. 3.23 shows the confusion matrix of RadSee's letter recognition results. It is evident that RadSee can recognize most of the letters, although it is prone to mistakes for some of them. For instance, it can easily confuse 'O' with 'o', 'C' with 'O', and 'I' with '1'. Other errors can arise from cursive writing, such as confusing 'S' with '8' and 'Z' with '3'. This is understandable, as their handwriting patterns are similar to each other. Overall, RadSee achieves 75% letter recognition accuracy.

Figure 3.23: Confusion matrix of RadSee's letter recognition results.

Figure 3.24: RadSee's letter recognition accuracy when participants wrote on A4 papers. Users 1-4 are Americans, users 5-8 are Chinese, users 9-10 are Indians, and users 11-12 are from the Middle East.

Print vs. Cursive. Fig. 3.24 presents RadSee's letter recognition accuracy for the 12 individual participants. As observed, RadSee has a lower recognition accuracy for the participants who wrote in the cursive style than for those who wrote in the print style. This observation can be attributed to two factors. First, cursive writing is more individualized and diverse, making it challenging for the model to extract consistent features across different participants, despite having cursive-style data in the training dataset. Second, our segmentation method relies on detecting signal transitions between letters, which becomes more difficult when people write in cursive.

Writing Handedness. Besides writing style, handedness is another factor that may affect RadSee's letter recognition accuracy. However, experimental results show that handedness affects RadSee only slightly. As shown in Fig. 3.24, RadSee performs very similarly for left-handed and right-handed users. This can be attributed to the fact that most left-handed individuals have the same writing movement pattern as right-handed individuals, i.e., they write from left to right and from top to bottom.

Write on iPad and Post-it Notes. Tablets, such as the Apple iPad, have become increasingly popular for writing activities, with many individuals opting to use them for important documents instead of traditional pen and paper. To evaluate the performance of writing on an iPad, we repeated our measurements by asking the twelve participants to write 300 random letters using an Apple Pencil. The experimental results are shown in Fig. 3.25(b). RadSee achieves 74% letter recognition accuracy, compared with 75% when participants write on A4 papers in the same setting. This indicates that RadSee performs almost identically for A4 paper and iPad writing. Another commonly used writing medium is the Post-it note. Given their smaller size, we asked participants to write 20 random letters on Post-it notes. RadSee's letter recognition accuracy for Post-it notes is 71%, as presented in Fig. 3.25(b). As shown in Fig. 3.25(a), these three writing media have different horizontal writing ranges. Since RadSee performs similarly on all of them, it effectively accommodates the horizontal writing ranges of A4 papers, iPads, and Post-it notes.

Figure 3.25: Writing on different media. (a) Writing on papers, iPad, and Post-it notes. (b) The recognition accuracy of RadSee when writing on different media.
Figure 3.26: RadSee's phase signal for different letter sizes.

Figure 3.27: RadSee's accuracy for letters of different sizes.

3.8.2: Impact of Letter Size

We conducted experiments to better understand RadSee's ability to detect small letters. Fig. 3.26 presents RadSee's signal changes when a participant wrote the letter 'N' at different sizes. Evidently, RadSee is capable of detecting handwriting movements as small as 3 mm. We further asked one participant to write on A4 papers with grid boxes of different sizes: 3 mm × 3 mm, 4 mm × 4 mm, 5 mm × 5 mm, and 10 mm × 10 mm. The participant was instructed to write letters within the boundaries of the grid boxes. However, for the 3 mm × 3 mm grids, since the boxes were too small, a considerable portion of the written letters exceeded the boundaries. Fig. 3.27 presents RadSee's letter recognition accuracy in these four cases. It is evident that RadSee's accuracy decreases with the letter size. Notably, however, RadSee achieves 68% recognition accuracy even when the letter size is confined to 3 mm.

3.8.3: Impacts of Distance and Angle

When an attacker attempts to detect handwriting behind a wall, it may know neither the distance to the victim nor the victim's angular direction. The attacker may use RadSee to perform an exhaustive search for the best antenna pointing direction, but the resulting direction may not be accurate. To evaluate RadSee's robustness, we examine its accuracy in different settings: (i) the writers are 1 m, 2 m, and 3 m behind the wall; and (ii) RadSee's antenna points at different angles (0°, 10°, 20°, and 30°). The combination constitutes 12 different cases. In each case, we instructed eight participants to write 300 letters using their normal handwriting habits.

Fig. 3.28 presents our measured accuracy and deviation. It can be seen that RadSee is robust to distance changes. This can be explained by its design: an FMCW radar inherently captures movement features precisely at different distances. When the distance between the writer and the wall changes from 1 m to 3 m, RadSee simply identifies another 5 Range-FFT bins for phase feature extraction. Since the handwriting movement patterns do not depend on the distance to the wall, the extracted features remain unchanged. Therefore, RadSee is robust to distance changes.

Fig. 3.28 also presents our measurement results when RadSee's antennas were pointed at different angles. Evidently, RadSee's accuracy decreases as the pointing error increases from 0° to 30°. Specifically, when RadSee pointed at 0°, it achieved 77% recognition accuracy; when it pointed at 30°, the accuracy dropped to 55%. In all cases, the standard deviation is almost the same, i.e., about 4%. This degradation can be attributed to the directivity of the patch-array antennas, as shown in Fig. 3.6. When the writer deviates from the antenna's central direction, the patch antenna's effective radiated power decreases, making noise and other imperfections more significant and thus decreasing accuracy.

Figure 3.28: Letter recognition accuracy of RadSee when writers are at different distances and different angles from the wall.

Figure 3.29: Interference test. (a) Interferer is 2 meters from the writer. (b) RadSee's resilience to interference from a walking person.
3.8.4: Impact of Interference from Other Moving Objects

The experimental results in Fig. 3.7 have confirmed that RadSee is immune to radio interference from in-band (5 GHz) Wi-Fi devices. All experiments in this work were conducted in office and laboratory environments, which are rich with interference from multiple Wi-Fi sources; the reported results therefore already account for such radio interference. Additionally, RadSee is not affected by static objects (e.g., desks and chairs) around a writer, as they appear as a constant in the received signal, which can be easily mitigated. Therefore, we focus on studying RadSee's performance in the presence of moving objects (e.g., walking persons) in the proximity of the writer. We emulated this scenario by asking another person to walk around the writer, as shown in Fig. 3.29(a). We measured the recognition accuracy of RadSee in three cases, with the distance between the writer and the walking person set to 1 m, 2 m, and 4 m. We asked eight participants to write 300 random letters in each case and measured RadSee's letter recognition accuracy.

Fig. 3.29(b) depicts our measured results. We can see that the performance degradation depends on the distance between the writer and the interferer: the closer the interferer, the larger the degradation. When the interferer is 1 m away, RadSee demonstrates 67% letter recognition accuracy, an 11% degradation compared to the interference-free case. When the interferer is 2 m away, RadSee's accuracy recovers to 76%, which is close to its accuracy without interference. We note that the participants in all experiments maintained normal physiological activities, such as breathing; the results reported above thus already account for the writers' normal physiological activities.

3.8.5: Impact of Different Wall Materials

RF signals have varying penetration abilities depending on the type of wall. We conducted experiments to evaluate the performance of RadSee in detecting letters through different wall materials. Specifically, we considered six wall materials, as shown in Fig. 3.30: drywall (12 cm), vinyl wall (20 cm), wood wall (19 cm), brick wall (22 cm), concrete wall (23 cm), and metal door (4 cm). We first measured their penetration loss, which refers to the power attenuation of radio signals as they pass through a wall. Fig. 3.31 presents our measurement results. It is evident that drywall, vinyl, and wood walls have a similar penetration loss of about 10 dB. A brick wall is more lossy, with a penetration loss of about 21 dB.
However, concrete walls and metal doors completely block radio signals; their attenuation loss is greater than 42 dB.

Figure 3.30: Illustration of six different types of wall materials.

Figure 3.31: RF signal's power attenuation for penetrating walls of different materials.

We then conducted experiments to measure RadSee's letter recognition accuracy. Eight participants took part in the experiments. They were seated 1 m away from the wall, while RadSee was positioned 0.5 m away on the other side of the wall, as shown in Fig. 3.21. Each of the eight participants wrote 300 random letters using his/her own writing style. Fig. 3.32 presents the experimental results. It shows that RadSee achieves similar performance when participants write behind drywall, vinyl, and wood walls. This similarity is due to the comparable electromagnetic properties of these materials. In contrast, a brick wall significantly reduces recognition accuracy, with RadSee achieving only 24% letter recognition accuracy in this scenario. Furthermore, concrete walls and metal doors completely obstruct letter detection.

Figure 3.32: RadSee's recognition accuracy when placed behind six wall materials.

3.8.6: Word Recognition Accuracy in Content

In addition to detecting individual letters, we evaluate RadSee's performance in recovering entire sentences. This is important because an attacker's interest may lie in the content that a victim is writing rather than in individual letters. We asked twelve participants to reproduce a CNN news article of about 300 words. Some writing samples from the participants are provided in Fig. 3.33. The experimental setting is the same as described above.

Figure 3.33: Writing samples from participants as they transcribed CNN news articles in both print and cursive styles.

RadSee employs two open-source software tools to translate its detected letters into word sentences: Wordsegment [75] and TextBlob [106]. It first sends the detected letters to Wordsegment for word segmentation. Then, it sends the segmented text to TextBlob for automatic spelling correction. Table 3.3 presents samples of the sentence recognition results. Leveraging these two open-source tools, RadSee demonstrates impressive performance in word and sentence recognition. It nearly recognized the first sentence in the table and accurately recovered both the second and third sentences.

Table 3.3: A case study of RadSee detecting the sentences written by a person behind a lab drywall.

Ground truth: 'football is popular in the united states'
  Letters recognized by RadSee: 'ecctbollispo pulaintheuni tedstate'
  Segmented by Wordsegment [75]: 'ecc', 't', 'boll', 'is', 'popula', 'in', 'the', 'united', 'state'
  Corrected by TextBlob [106]: 'etc', 't', 'ball', 'is', 'popular', 'in', 'the', 'united', 'state'

Ground truth: 'Bill is a hardworking student'
  Letters recognized by RadSee: 'Billiislhar dworkimg studena'
  Segmented by Wordsegment [75]: 'bill', 'i', 'isl', 'hard', 'work', 'img', 'studena'
  Corrected by TextBlob [106]: 'bill', 'is', 'hard', 'work', 'ing', 'student'

Ground truth: 'My favourite fruit is apple'
  Letters recognized by RadSee: 'mgfavouri teffruitl qapple'
  Segmented by Wordsegment [75]: 'mg', 'favourite', 'f', 'fruit', 'lq', 'apple'
  Corrected by TextBlob [106]: 'my', 'favourite', 'fruit', 'is', 'apple'
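The two-stage cleanup can be reproduced with the same open-source tools (a sketch; the input string is the first RadSee output from Table 3.3 with its line-break spaces removed, and the exact corrections may differ across library versions):

from wordsegment import load, segment   # Wordsegment [75]
from textblob import TextBlob           # TextBlob [106]

load()                                    # load Wordsegment's corpus data once
raw = "ecctbollispopulaintheunitedstate"  # letters recognized by RadSee
words = segment(raw)                      # e.g., ['ecc', 't', 'boll', 'is', ...]
corrected = str(TextBlob(" ".join(words)).correct())
print(corrected)                          # spelling-corrected sentence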
We then use word recognition accuracy as the metric to evaluate the performance of RadSee. According to [112], word recognition accuracy is defined as $WRA = \frac{N - S - D - I}{N}$, where $N$ is the number of words in the ground-truth text, $S$ is the number of word substitutions, $D$ is the number of word deletions, and $I$ is the number of word insertions. For example, recovering a 10-word sentence with one substitution and one deletion yields $WRA = (10 - 1 - 1 - 0)/10 = 80\%$.

Fig. 3.34 shows RadSee's WRA with and without using TextBlob for automatic spelling correction. It can be seen that without automatic spelling correction, RadSee's WRA ranges from 40% to 56% across the twelve participants. In contrast, when automatic spelling correction is applied, RadSee's WRA improves significantly, ranging from 79% to 93%. On average, RadSee's WRA hovers around 87% with automatic spelling correction. This level of word recognition accuracy is sufficient for an attacker to comprehend the content written by a victim.

Figure 3.34: Word recognition accuracy of RadSee with and without correction for different users.

3.9: Countermeasures and Other Applications

3.9.1: Countermeasures

Handwriting Safety Tips. RadSee demonstrates a serious threat to handwriting privacy. Based on this study, we offer the following tips for those concerned about handwriting information leakage. Tip 1: Do not write important documents in a room with drywall or vinyl walls. Instead, write them in a room with thick concrete or metal walls, which largely attenuate radio signals and thus reduce the probability of information leakage. Tip 2: Do not face a wall behind which a radar may be placed. Instead, face away from that wall: your body/torso will significantly reduce the radio signal strength and thus the probability of your content being detected by an attacker. Tip 3: If possible, write important documents at a desk far from all walls rather than at a desk against a wall. This increases the distance between yourself and a potential radar, thereby reducing its recognition accuracy.

Protection Strategies. One natural approach to protecting handwriting content is to install multi-layer RF shielding materials inside the walls of a room [90]. Common materials used for RF shielding include metals such as aluminum, copper, and steel, as well as conductive coatings or paints. Another approach is to take advantage of recent advances in reconfigurable intelligent surfaces (RIS), which have also been studied under other names such as electromagnetic metasurfaces or radio relays. An RIS can be used to create virtual multipath from the radar's Tx to its Rx. By manipulating its phase shifting and beam steering, an RIS is capable of generating fake phase patterns at the radar, preventing it from recovering the handwriting content. Unfortunately, neither of the above approaches is easy or economical to deploy.

3.9.2: Other Applications

While RadSee was designed to better understand radio attacks on handwriting privacy, it can also be used for many other applications. For instance, RadSee can be installed on a laptop as an input method. When an end user physically writes something on paper in front of his/her laptop, the content is automatically recognized by RadSee and digitally recorded on the laptop. In this case, RadSee does not need the 4 × 4 patch-array antennas, since there is no need to penetrate walls; a small patch antenna should be sufficient. RadSee can also be used as a human-computer interface for smart TVs.
End users can write using their bare hands, and a TV equipped with RadSee can recognize the letters being written.

3.10: Summary

While mmWave FMCW radar has been extensively studied for autonomous driving and HAR, sub-10 GHz FMCW radar has not received as much attention. It is of particular interest due to its see-through-wall capability, which may pose significant threats to the privacy of human activities. In this chapter, we presented RadSee, a 6 GHz FMCW radar system designed for detecting handwriting content behind walls. Through a combined hardware and software design, RadSee is capable of detecting mm-level handwriting movements and recognizing most letters based on their unique phase patterns. Additionally, it is resilient to interference from other moving objects and coexisting radio sources. Extensive experimental results show that RadSee achieves 75% letter recognition accuracy when victims write 62 different letters and 87% word recognition accuracy when they write articles. In light of these realistic threats, we offered handwriting safety tips and defense strategies to help the public protect their handwriting information.

CHAPTER 4: EYE MOTION TRACKING USING FMCW RADAR

4.1: Introduction

Eye motion tracking has a wide range of applications across various fields. According to the Amyotrophic Lateral Sclerosis (ALS) Association, more than 5,000 people in the U.S. are diagnosed with ALS each year [26]. Individuals with ALS progressively lose control of their muscles, which affects their ability to move, speak, eat, and breathe [136]. Eye movements often become the only means of communication for individuals with ALS [28]. A non-intrusive, privacy-preserving eye-tracking system can help better interpret their intended messages and enable more efficient communication. In addition to assisting individuals with disabilities, eye motion tracking can serve as an effective human-computer interaction (HCI) tool in various scenarios. For example, it can help immobile patients communicate, as illustrated in Fig. 4.1. It can also be used to remotely control devices such as smart TVs, home appliances, elevators, and virtual reality systems. Furthermore, a reliable eye-tracking system has broader applications in healthcare, including psychology research [126], marketing analysis [196], and early disease detection [151].

Existing contactless eye tracking solutions employ various sensors, including cameras, acoustic sensors, and radar. While cameras have demonstrated high accuracy in eye motion detection [53, 156], their application may pose privacy concerns in some scenarios. Additionally, cameras do not perform well in poor lighting conditions. Recently, acoustic signals have been studied for eye tracking on smartphones [38, 103]. However, due to the propagation nature of sound, acoustic eye tracking systems are limited to eye blink detection at small distances. Millimeter-wave (mmWave) radio frequency (RF) radar has also been studied for facial recognition and eye blink detection (e.g., [71, 204]). While mmWave radar can achieve mm-level motion detection, its detection range is very limited due to its small wavelength. Additionally, existing mmWave-based sensing work focuses mainly on eye blink detection rather than eye motion tracking.

Figure 4.1: Illustration of RadEye.
Low-frequency radio signals have been widely leveraged for fine-grained human activity recognition (HAR), such as Wi-Fi sensing [52, 78, 93, 94, 163, 167], RFID sensing [162, 191], and 4G/5G sensing [47, 175]. However, due to their large wavelength as well as their non-coherent sensing approaches, those systems may not be able to detect such subtle eye motions. So far, there is no RF-based system that can track human eye motions from a distance.

In this chapter, we present RadEye, an RF sensing system that can track eye motions from a distance. Compared to camera-based eye tracking, RadEye not only mitigates privacy concerns but also performs reliably in poor lighting conditions. The privacy-preserving nature of RF signals stems from their inherent characteristics. Unlike camera images, RF signals are not visually interpretable by humans and inherently possess low spatial resolution. As a result, they are unlikely to reveal detailed, identifiable features of individuals. Extracting personal information from RF signals involves complex processes that require advanced signal processing and AI models, making misuse significantly less accessible compared to camera systems. Furthermore, RF sensing systems are typically designed to capture only coarse-grained human activities, such as presence, movement, or positioning, rather than detailed personal characteristics like facial features or voice. This makes RF-based sensing systems inherently more privacy-preserving than cameras, ensuring better protection of individual privacy.

Table 4.1: Comparison of RadEye and existing eye detection works. (✓ = yes, ✗ = no)

Reference              Technique             Max. distance   Track eye motion   Work in low light
BlinkListener [103]    Acoustic              0.8 m           ✗                  ✓
TwinkleTwinkle [38]    Acoustic              0.6 m           ✗                  ✓
BlinkRadar [71]        IR-UWB                0.8 m           ✗                  ✓
X. Zhang [207]         mmWave                1.2 m           ✗                  ✓
C. Ryan [134]          Event Camera          0.6 m           ✓                  ✗
GazeRecorder [53]      Web Camera            0.7 m           ✓                  ✗
RadEye                 Sub-6GHz FMCW Radar   5 m             ✓                  ✓

Compared to acoustic- and mmWave-based eye detection approaches, RadEye extends both the RF sensing capability (from eye blink to eye motion) and the detection range (from less than 1 m to more than 5 m). We note that the eye motion tracking task is very different from eye blink detection: the former is a regression problem, while the latter is a binary classification problem. Such an extension significantly enlarges RadEye's application landscape in real life.

In the design of RadEye, we face two challenges. Challenge #1: subtle eye movement and long detection range. On one hand, eye rotation, encompassing eyelid and eye muscle displacement, involves movement of around 1 mm [91]. Such a subtle motion is hard for an RF system to detect. On the other hand, an eye-tracking sensor may be used in indoor or outdoor scenarios, and the eye detection distance varies significantly, from 0.5 m (e.g., from a smartphone to the eyes) to 5 m (e.g., from a smart TV to the eyes). Devising an RF system that can detect subtle (mm-level) eyeball rotation movement from a distance is not a trivial task. Challenge #2: interference mitigation by design. An eye-tracking system may suffer from interference from three sources: i) multipath from the target person to the RF sensor, ii) the movement of the target person's other body parts, such as chest breathing, arm waving, and leg shaking, and iii) other moving objects/people in the area.
For instance, when the RF sensor detects eye movement, another person may walk around in the same room, generating interference in the received signals at the RF sensor. In general, interference is a notorious problem for RF sensing. Given the subtlety of eyeball movement, the interference must be mitigated by design so as to accurately detect the eye's movement.

RadEye addresses the above two challenges through a joint hardware and software design. To achieve the required detection resolution and range (i.e., Challenge #1), we design and optimize a 5 GHz FMCW radar for eye movement tracking. We choose a 5 GHz radar for two reasons: i) high-frequency radio waves (e.g., mmWave) are suited for detecting tiny motions but have a small detection range; and ii) low-frequency radio waves are suited for long-range detection but not for detecting subtle movement. A tradeoff between detection resolution and range leads to our selection of the 5 GHz frequency band. Additionally, the market offers a rich supply of electronic components (e.g., power amplifiers, mixers, and low-noise amplifiers) in the 5 GHz frequency band due to the maturity of the Wi-Fi industry. Thus, it is cost-friendly to build 5 GHz radars. To mitigate interference from multipath and other moving objects (i.e., Challenge #2), we combine four techniques: i) FMCW modulation for the 5 GHz radar, ii) a sophisticated signal processing pipeline for eye-related feature extraction, iii) a transformer-based deep neural network (DNN) for eye motion detection, and iv) a camera-guided supervisory training method for the DNN model. Together, these four techniques make RadEye capable of separating the eye motion features from the interference caused by multipath and other objects. More importantly, they make RadEye transferable to unseen scenarios, enhancing its generalizability in practice.

We have built a prototype of RadEye and evaluated its performance in multiple scenarios. Experimental results show that, for a person at a 5 m distance, the average estimation error of eye rotation is 24 degrees in azimuth and 21 degrees in elevation. When the eye rotation problem is formulated as a classification problem (up, down, left, and right), RadEye achieves 90% accuracy. Extensive results confirm the generalizability of RadEye in unseen scenarios as well as its resilience to interference. Table 4.1 shows how RadEye advances the state-of-the-art (SOTA) RF sensing technology. The main contributions of RadEye are summarized as follows:

• To the best of our knowledge, RadEye is the first-of-its-kind system that utilizes RF signals to estimate eye rotation angles from a distance.

• RadEye presents a joint hardware and software scheme for subtle eye motion detection in the presence of interference.

• Extensive experimental results validate the performance, robustness, and generalizability of RadEye.

4.2: Related Work

4.2.1: Eye Motion Recognition

Acoustic-based Detection. As speakers and microphones are now commonplace on mobile devices, acoustic signals have become widely utilized for recognizing human daily activities. BlinkListener [103] can detect eye blink motions using acoustic signals, modeling the variations caused by eye blinks and interference. By leveraging interference, they identify an optimal position to maximize the variation induced by eye blinks. TwinkleTwinkle [38] addresses a similar objective using a different approach.
They employ a phase difference-based method to detect potential blink motions, followed by a model-based approach to distinguish subtle motions. Additionally, they establish a language input system based on ASCII code and Morse code. While RadEye is capable of recognizing eye motions, the aforementioned works focus solely on detecting eye blinks at limited distances.

RF-based Detection. In the study by Zhang et al. [207], an off-the-shelf mmWave FMCW radar is employed to detect eye blinks. They introduce an Adaptive Variational Mode Decomposition (AVMD) algorithm to extract the blink signal, achieving an effective detection distance of up to 1.2 meters. Several other studies [34, 107, 170] have taken a similar approach with mmWave FMCW radar. In addition to mmWave signals, BlinkRadar [71] employs UWB radar to detect eye blinks in a driving scenario. Its authors implemented a customized impulse-radio ultra-wideband (IR-UWB) radar. By analyzing signal features in the complex domain, the system can isolate eye blinks without interference from other motions.

Camera-based Detection. Due to the ubiquity of web cameras and smartphone cameras, significant progress has been made in using cameras to track eye motion. In the computer vision research domain, deep learning networks have been effectively employed to predict gaze direction [87, 146, 208, 216]. Furthermore, in the security domain, studies have shown that the camera on a mobile phone can even track the gaze trace, raising concerns about potential password leakage [37, 168]. Additionally, there are commercial eye-trackers on the market [53, 156] that support high-accuracy eye tracking at an affordable price. However, it is worth noting that all camera-based eye-tracking solutions may raise privacy concerns and may not function effectively in low-light scenarios.

Wearable-based Detection. With the prevalence of VR devices, smart glasses offer another solution for eye motion detection. Google Glass [74] and the Jins MEME eyeglasses [44] demonstrated the potential to detect eye motions several years ago. Building on existing approaches, Liu et al. [104] attached a copper electrode to the glass frame to sense eye blinks by utilizing the capacitance variation between the electrode and the eyelid. However, wearable devices like these require users to keep the glasses on their heads, which may be inconvenient for daily use.

4.2.2: Fine-grained HAR

MmWave-based Recognition. MmWave FMCW radar has achieved impressive performance in recent years owing to its fine-grained detection capability and affordable cost. It is extensively used in human activity recognition and vital sign detection [61, 98, 181, 188, 205, 206]. Thanks to its substantial bandwidth and short wavelength, it attains millimeter-level accuracy in detecting object movements.

Wi-Fi-based Recognition. Channel State Information (CSI) in Wi-Fi networks has been applied across various sensing applications, including gesture recognition [52, 93, 200], vital sign detection [167], and radio imaging [78, 94, 142, 163]. Nevertheless, Wi-Fi, characterized as a non-coherent system due to the physical separation of its transmitter and receiver, faces limitations in detection accuracy stemming from timing, frequency, and phase misalignments.

4.3: RadEye: Design Analysis

4.3.1: Background

RadEye leverages the FMCW signal to detect eye motions.
The signal is transmitted from the radar's TX antenna towards the eyes, and the signal reflected from the eyes is received by the radar's RX antenna. The difference between the transmitted and received signals is used to extract the eye motion features. As shown in Fig. 4.2, the FMCW signal starts from frequency $f_0$ and ramps up linearly over time $T$. The transmitted signal can be written as:

$S_T(t) = e^{-j2\pi\left(f_0 t + \frac{B}{2T}t^2\right)}$.  (4.1)

The received signal reflected from the target can be written as:

$S_R(t) = \alpha e^{-j2\pi\left(f_0(t-\tau) + \frac{B}{2T}(t-\tau)^2\right)}$.  (4.2)

The transmitted and received signals are mixed together, yielding an intermediate frequency (IF) signal as follows:

$S_M(t) = S_T(t)S_R(t)^* = \alpha e^{-j2\pi\left(f_0\tau + \frac{B}{T}\tau t - \frac{B}{2T}\tau^2\right)}$.  (4.3)

RadEye uses the IF signal to infer eye motions. As we can see from Eqn. (4.3), both the frequency and the phase of the IF signal are proportional to the delay of the signal. The frequency of the IF signal is $f_m = \frac{B}{T}\tau$. The time delay can thus be calculated as $\tau = \frac{f_m T}{B}$ and, as a result, the distance can be calculated by $d = \frac{c\tau}{2} = \frac{c f_m T}{2B}$. To separate the signals reflected from different objects, we perform a range-FFT on each chirp of the IF signal, as illustrated in Fig. 4.2. Each range bin represents the signal coming from a different distance. The range resolution $\Delta d = \frac{c}{2B}$ is determined only by the bandwidth of the signal. To identify the FFT bin corresponding to eye motions, a user is asked to blink his/her eyes as a reference. The algorithm will be presented in §4.4.

Figure 4.2: Illustration of the FMCW signal.

4.3.2: Detectability of Human Eye Rotation

The kinematics of eye rotation is a complex process involving the stretching and contraction of six extraocular muscles. The combined movement of these muscles alters the shape of the reflection surface, affecting the length of the signal reflection path and influencing the phase shift of the FMCW signal. Additionally, these muscle movements impact signal attenuation, as muscles and surrounding tissues absorb and scatter FMCW signals to varying degrees based on their density, composition, and position. When eye muscles move, the spatial distribution of these tissues changes, altering the amount of signal absorbed or scattered and leading to variations in attenuation. In addition to muscle movements, eyelid motion also affects the reflected FMCW signals. During eye rotations, the eyelids fold or stretch, and this change in thickness modifies the distance of the signal reflection path. Furthermore, eyelid movement alters the size of the exposed area of the eyeball, further influencing the attenuation of the reflected signal.

4.3.3: Feasibility Analysis

We conducted experiments to compare the performance of RadEye with a 60 GHz FMCW mmWave radar (i.e., AWR6843 [155]), both of which have 1.1 GHz bandwidth. Specifically, a participant performed eye blinks at distances of 3 m, 4 m, and 5 m. Fig. 4.3 presents our experimental measurements. The results show that the mmWave radar can detect human eye blinks within a range of 3 meters. However, its detectability decreases rapidly as the distance increases. In contrast, RadEye exhibits a consistent capability of detecting human eye blinks at all three distances.

Figure 4.3: (a) The phase change of the corresponding FFT bin from the mmWave radar. (b) The phase change of the corresponding FFT bin from RadEye.
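To make these relations concrete, the following minimal Python sketch plugs RadEye's parameters (B = 1.1 GHz bandwidth, T = 600 µs ramp, and a sweep starting near 6 GHz, per §4.6.1) into the equations above. The beat frequency used here is an arbitrary illustrative value, not a measurement.

```python
import numpy as np

c = 3e8     # speed of light (m/s)
B = 1.1e9   # sweep bandwidth (Hz)
T = 600e-6  # chirp (ramp) duration (s)
f0 = 6e9    # carrier frequency (Hz), illustrative

# Range resolution: delta_d = c / (2B)
delta_d = c / (2 * B)
print(f"range resolution: {delta_d * 100:.1f} cm")      # ~13.6 cm

# Distance from a measured IF (beat) frequency f_m: d = c*f_m*T / (2B)
f_m = 20e3                                              # example beat frequency (Hz)
d = c * f_m * T / (2 * B)
print(f"distance for f_m = 20 kHz: {d:.2f} m")          # ~1.64 m

# Phase sensitivity: a small reflector displacement dd shifts the IF phase
# by 2*pi*f0*(2*dd)/c (used in the millimeter-level analysis below)
dd = 1e-3                                               # 1 mm eyeball/eyelid motion
dphi = 2 * np.pi * f0 * (2 * dd) / c
print(f"phase shift for 1 mm motion: {dphi:.2f} rad "
      f"({np.degrees(dphi):.0f} deg)")                  # ~0.25 rad (~14 deg)
```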
In some cases, the line-of-sight path from the human eyes to the radar device might be blocked. Thus, we conducted comparative tests to evaluate the ability of the two radars to track eye movements under obstructed conditions. To simulate such cases, we repeated the same test at a distance of 3 m but placed a wooden door between the radar and the participant. As shown in Fig. 4.4, RadEye is capable of detecting eye blinks even behind the door, whereas the mmWave radar fails to do so. This limitation of the mmWave radar can likely be attributed to the high attenuation of mmWave signals. It is worth noting that these experiments were conducted in the same environment and used an identical signal processing pipeline. The detailed parameters of the two systems are provided in Table 4.2.

Figure 4.4: (a) RadEye tracking behind a wooden door. (b) mmWave radar tracking behind a wooden door. (c) Phase change of the corresponding FFT bin from both systems.

Table 4.2: Detailed parameters of RadEye and the mmWave radar.

Parameters                     RadEye    AWR6843
Tx / Rx antenna gain           15 dBi    7 dBi
Transmission power             15 dBm    12 dBm
Chirp duration                 600 µs    600 µs
Idle time                      400 µs    400 µs
Bandwidth                      1.1 GHz   1.1 GHz
Gain (Rx chain)                –         48 dB
Noise figure (Rx chain)        7 dB      12 dB
Gain from baseband amplifier   8 dB      –

Millimeter-Level Motion Detection. As mentioned earlier, the ranging resolution of an FMCW radar is not sufficient to detect tiny eye motions. However, the phase of the demodulated FMCW signal can reflect the eye rotation motion. Eye rotation involves eyelid and eye muscle displacement at the millimeter level [91]. Based on Eqn. (4.3), a one-millimeter movement of the eyeball causes a phase change of the FMCW signal of $2\pi f_0 \frac{2d}{c} = 0.25$ radian (i.e., $14°$), which is easy to detect and measure on the corresponding range-FFT bin.

Resilience to Interference. An eye-tracking system may suffer from interference from three sources: i) multipath from the target person to the RF sensor; ii) the movement of the target person's other body parts, such as chest breathing, arm waving, and leg shaking; and iii) other moving objects/people in the area. To mitigate such interference, RadEye employs wideband FMCW modulation and highly optimized directional antennas. The FMCW modulation with 1.1 GHz bandwidth offers a ranging resolution of 14 cm, allowing RadEye to distinguish objects separated by 14 cm. The FMCW modulation can effectively filter out interference from the target person's other body motions, such as chest breathing. A custom-designed patch antenna is used for signal transmission and reception, serving as an angular filter that suppresses interference from other directions. As shown in Fig. 4.5, the patch antenna has a 3 dB beamwidth of 21 degrees.

Figure 4.5: The custom-designed patch antennas (left) and their gain pattern (right).
In addition to the hardware design and optimization, a transformer-based DNN, trained through a video-guided pipeline, helps the system focus on the desired features while eliminating interfering features through its self-attention mechanism.

4.3.4: Feature Validation

We conducted a preliminary study of the sub-6GHz radar's capability for detecting eye motions. A participant was seated 3 m in front of the radar and instructed to rotate his eyeballs in four different directions (up, down, left, and right). Fig. 4.6(a) depicts the signal amplitude from the corresponding range-FFT bin during these eye rotations. The "Ground truth" points in the figure were captured by a camera running the SOTA computer vision-based eye detection algorithm [89], marking the moments of real eye rotations. It is evident that eyeball rotations towards the four directions indeed induce amplitude changes in the radar's IF signal. This indicates that the 5 GHz FMCW radar is capable of capturing eye motions. When delving into the eye rotation signals in different directions, we observe that up/down movements exhibit a more substantial change compared to left/right movements. This is not surprising, because vertical motions involve larger eyelid movements. Additionally, the up-rolling of the eyes results in an amplitude increase. This is because the eyelid's movement during an upward gaze exposes more of the eyeball for reflection, and the watery surface of the eyeball, in contrast to the skin, enhances signal reflection. Given that different eye rotations cause different amplitude changes, we use the amplitude variation ratio as one of the features for inference.

Features in the Complex Domain. In addition to the observations in the temporal domain, we further examine the IF signal in the complex domain, as shown in Fig. 4.6(b). The eye motion signal $S_e$ from the corresponding range-FFT bin can be decomposed into a static component $S_s$ and a dynamic component $S_d$:

$S_e = S_d + S_s = \alpha_d e^{j\phi_d} + \alpha_s e^{j\phi_s}$,  (4.4)

where $\alpha_s$ and $\phi_s$ are the amplitude and phase of the static component, and $\alpha_d$ and $\phi_d$ are the amplitude and phase of the dynamic component. As shown in Fig. 4.6(b), the curve shape of the dynamic component in the complex domain is determined by both amplitude and phase.

Figure 4.6: The features for eye rotations. (a) The signal amplitude when the eye rotates toward different directions. (b) The signal in the complex domain when the eye is rotating. (c) The kernel density estimation of the signal amplitude variation ratio. (d) The kernel density estimation of the motion pattern eccentricity.

Different Eye Rotation Directions. For different directional movements of the eyeball, the folding of the eyelid and the rotation of the eyeball create a unique relative relationship, manifested through changes in the IF signal's amplitude and phase. We characterize this feature by utilizing the curvature of the curve. Specifically, we perform regression on the curve, fit an elliptical equation [49], and use the eccentricity of the ellipse to characterize this feature. As shown in Fig. 4.6(b), the ellipse generated by up/down eye motions exhibits a more elongated shape, while right/left motions lead to a more circular shape. This can be attributed to the fact that up/down eye motions involve more eyelid movements, leading to changes in the reflective surface, while left/right eye rotations mainly cause changes in the length of the reflection path.
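As a concrete illustration of this feature, the sketch below approximates the ellipse fit with a principal-component analysis of the dynamic component's I/Q samples and derives the eccentricity from the two principal-axis lengths. This is a simplified stand-in for the ellipse regression of [49]; the synthetic trajectories and their parameters are made up for illustration.

```python
import numpy as np

def motion_eccentricity(iq):
    """Approximate the eccentricity of the complex-domain motion pattern.

    iq: 1-D complex array of IF samples from the selected range-FFT bin.
    Returns a value in [0, 1): near 1 = elongated (up/down-like motion),
    lower = more circular (left/right-like motion).
    """
    pts = np.column_stack([iq.real, iq.imag])
    pts = pts - pts.mean(axis=0)                     # remove the static component S_s
    cov = np.cov(pts.T)
    evals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # major/minor axis powers
    a2, b2 = evals[0], max(evals[1], 1e-12)
    return np.sqrt(1.0 - b2 / a2)                    # e = sqrt(1 - (b/a)^2)

# Synthetic examples: an elongated trace vs. a near-circular one.
t = np.linspace(0, 2 * np.pi, 500)
updown = 1.0 * np.cos(t) + 1j * 0.2 * np.sin(t)      # a=1.0, b=0.2 -> e ~ 0.98
leftright = 1.0 * np.cos(t) + 1j * 0.8 * np.sin(t)   # a=1.0, b=0.8 -> e ~ 0.6
print(motion_eccentricity(updown), motion_eccentricity(leftright))
```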
Experimental Validation. To verify the robustness of the amplitude and eccentricity features, we conducted experiments involving five participants, repeating them in the same setting as described above. Each participant performed eye rotations in each direction 50 times. We then performed kernel density estimation over all participants. Fig. 4.6(c) presents the density estimation results for the amplitude variation ratio. The up and down motions are centered at 0.3 and −0.4, respectively. The left and right motions have relatively smaller variation and are centered around zero. Fig. 4.6(d) shows the eccentricity density estimation results. The up/down motions are centered close to 1, while the left/right motions are close to 0.8. The statistical results across different individuals align with the findings described earlier for a single person. This consistency suggests the potential for segregating these motions based on the radar's signal. To enhance the recognition of eye motions, we employ a DNN model, which will be described in §4.5.

4.4: RadEye: Signal Processing

In this section, we describe the signal processing of the radar's IF signal for eye rotation detection. Fig. 4.7 shows the overall structure of the system. In what follows, we introduce the signal processing techniques of RadEye, which include the selection of the range bin and the extraction of eye motions.

Figure 4.7: The system overview of RadEye.

Figure 4.8: (a) The signal amplitude variance for different range-FFT bins during eye blinks. (b) Eye motion detected based on signal phase (with the camera-based ground truth marked). (c) Comparison of eye motions and interfering motions from the target person's head.

Range-FFT. RadEye sets the chirp cycle to 1 ms for detecting eye motions. In each cycle, the chirp takes 0.6 ms and the idle period takes 0.4 ms. RadEye employs a sample rate of 2.5 MSps to observe the signal in the digital domain. Consequently, 1500 complex samples are acquired in each cycle. RadEye then appends zeros to the end of the samples and performs a 4096-point range-FFT to obtain the signal at different ranges. Of the 4096 bins, only the first 256 are used, since they already cover ranges up to 13 m.

Filtering for Range-FFT. RadEye applies a second-order Butterworth bandpass filter to suppress noise and out-of-band interference. The filter's passband is set to 1 Hz∼5 Hz [204, 205]. We note that, although the eye blink frequency may overlap with the chest breathing frequency (0.1 Hz∼0.5 Hz), the eye motion commands input to RadEye have a higher frequency and are therefore not affected by the filter. One may ask whether the heartbeat could cause slight head shaking and thus pollute the eye motion signals; our answer is no. Based on our experimental results, the heartbeat is too weak to cause head shaking that RadEye can capture.

FFT Bin Selection. Eye rotation motions have a very small dynamic range. Therefore, it is nontrivial to find the FFT bin that carries the eye motion features. To do so, RadEye requires users to blink their eyes three times with an interval of 2 seconds as a 'start button' to initiate the control process.
The user only needs to provide this initialization command once. During this period, the user should keep his/her head still (no movement of more than 14 cm). If the user's head moves beyond this range, the initialization command must be re-entered. Since this is a human input device, it is reasonable to require the user to remain relatively stationary for a short period of time.

RadEye utilizes the amplitude dynamic range as an indicator to find the candidate bin. The reason for using the amplitude is that, when the eyes switch between opening and closing, the blink motion causes the reflective surface to switch between the water-textured eyeball and the skin-textured eyelid [103], changing the amplitude of the radar's IF signal. The bin selection algorithm works as follows. RadEye calculates the sliding-window variance for each range-FFT bin $i$ as:

$v_i(j) = \frac{1}{W}\sum_{m=j}^{j+W-1}\left(|y_i(m)| - \bar{y}_i\right)^2$, where $\bar{y}_i = \frac{1}{W}\sum_{m=j}^{j+W-1}|y_i(m)|$,

and $W$ is the window size, which is set to 200 to fit the duration of an eye blink. If $v_i(j)$ is larger than a predefined threshold $T$, the timestamp $j$ is recorded as $t_n$. Here, $T$ is empirically set to 0.05. Only when the next detected timestamp $j$ satisfies $2000 < j - t_n < 3000$ (i.e., 2∼3 s at one chirp per millisecond, matching the blink interval) is it counted as a continuous blink. Once three continuous blinks have been detected, we mark range-FFT bin $i$ as a candidate bin. Multiple range-FFT bins might satisfy this condition, as shown in Fig. 4.8(a). In this case, RadEye chooses the bin with the smallest index: the smallest-index range-FFT bin represents the shortest signal travel path, which best reflects the eye motion pattern.

Eye Motion Detection. After RadEye identifies the range-FFT bin, it continues to monitor this bin and estimates eye motions based on it. Each eye motion can be separated into three phases: the eyeball starts moving, the eyeball reaches the edge, and the eyeball returns to the start position. RadEye tries to detect these three positions for each eye motion. Although both amplitude and phase contain information about eye motions, we found that the phase exhibits a more significant pattern when detecting eye rotations.

Eyeball rotations involve repetitive movement: the eyeball rolls back to its central point after reaching the edge. This motion occurs swiftly, resulting in a repetitive phase change pattern. Hence, the positions of local phase extrema correspond to where the eye movement reaches the edge. Additionally, we noticed that there are inflection points in the signal phase at the beginning and end of eye movements. RadEye utilizes these features to extract eye motions: it first searches for local maxima/minima of the phase with an interval of 1 second. After identifying the local peaks, RadEye searches along the gradients of the samples before/after each peak. The position at which the gradient equals zero is defined as the start/end position. Fig. 4.8(b) presents the phase of the signal when a participant repeats the look-up motion, with the detected start, peak, and end positions marked on the figure.
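Stepping back to the bin-selection procedure above, the following simplified sketch applies the sliding-window variance test to a matrix of range-FFT outputs. Variable names and the event-grouping logic are our own; the window size, threshold, and spacing bounds mirror the values quoted in the text.

```python
import numpy as np

def candidate_bins(y, W=200, T=0.05, lo=2000, hi=3000):
    """Sliding-window variance test for the 'three blinks' start command.

    y: complex range-FFT output, shape (num_bins, num_chirps), one chirp/ms.
    Returns the indices of bins showing three variance spikes spaced 2-3 s
    apart (RadEye then keeps the smallest index).
    """
    amp = np.abs(y)
    found = []
    for i, a in enumerate(amp):
        # v_i(j) over a length-W window, computed with cumulative sums
        c1 = np.cumsum(np.insert(a, 0, 0.0))
        c2 = np.cumsum(np.insert(a ** 2, 0, 0.0))
        mean = (c1[W:] - c1[:-W]) / W
        var = (c2[W:] - c2[:-W]) / W - mean ** 2
        spikes = np.flatnonzero(var > T)
        # keep one timestamp per spike and require the 2-3 s spacing
        stamps = [spikes[0]] if spikes.size else []
        for j in spikes[1:]:
            if lo < j - stamps[-1] < hi:
                stamps.append(j)
        if len(stamps) >= 3:
            found.append(i)
    return found
```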
To mitigate interference from the target person's other body parts, we utilize both the phase shift and the time duration to refine the detection results. Fig. 4.8(c) shows eye blinks, eye rotations, head motions, and mouth motions. It can be seen that head and mouth motions induce significant changes in the signal phase. Consequently, if the phase shift surpasses a specified threshold, the signal is discarded. For motions falling below the threshold, the time duration is considered: only signals with a duration in the range of 200 ms∼600 ms are deemed valid eye motion signals. Doing so effectively filters out eye blinks, which typically last less than 100 ms.

4.5: RadEye: DNN-based Eye Movement Detection

In this section, we present a DNN model for eye rotation recognition using the amplitude and phase of the radar's IF signal. RadEye utilizes a transformer encoder to extract features and feeds these features into a fully connected layer to output the azimuth and elevation angles of the target person's eyeball. The DNN is trained using a camera-guided method, transferring knowledge from computer vision to radio sensing.

4.5.1: Sequential Signal

The input signal to our DNN model is a time-series signal with a high sampling rate. As the subject's eyeballs rotate toward different angles, the swift motions of the eyeballs and eyelids cause fluctuations in the amplitude and phase of the input signal over time. Therefore, to accurately model the temporal dependencies between the sampling points, we employ a DNN model that can efficiently encode information over the temporal domain.

Figure 4.9: A camera-guided DNN structure for RadEye.

Traditional time-series models, such as the Recurrent Neural Network (RNN) [133] and LSTM [69], are capable of temporal sequence modeling. However, they struggle with gradient vanishing and exploding problems when dealing with long sequence inputs, limiting their ability to capture long-range dependencies. In contrast, the Transformer [159], employing a self-attention mechanism, can effectively overcome these issues. The self-attention mechanism allocates a weight to the output of each position in a time series, reflecting the degree of attention that the position pays to other positions within the sequence. This allows correlations between any two positions in the sequence to be computed without being constrained by their physical separation, thus better capturing long-range dependencies. Furthermore, the multi-head attention mechanism within Transformers can project eye movement signals into various subspaces, including different frequency spaces. Frequency analysis can better distinguish certain angle information, as the eyeball moves at different speeds for different rotation angles.

4.5.2: Camera-guided DNN Framework

The overall structure of our model, as shown in Fig. 4.9, primarily utilizes a Transformer to predict an angle vector based on the input time-series data, which includes both the amplitude and phase derived from the corresponding range-FFT bin. We introduce its key elements as follows; a code sketch of this architecture is given at the end of §4.6.2.

• Input Data. The input data for the eye motion signal is captured by RadEye from the corresponding range-FFT bin. Each input sample has dimensions 200 × 2, where 200 is the time length of the sample and 2 is the number of features: amplitude and phase. The data are normalized to ensure that they share the same dynamic range.

• Signal Encoder.
Before being sent to the Transformer, the signal data are passed through a projection layer that up-samples them to a higher-dimensional representation. The output size of this projection layer is 200 × 64. Additionally, to enable the model to discern the temporal relationships within the input sequence embeddings, each signal representation is augmented with positional embeddings.

• Backbone. The transformer encoder extracts features from the embedded data. RadEye uses two transformer encoders, each with four attention heads. The self-attention mechanism in the transformer encoder builds connections across different time steps in the signal and also attends to different parts of the signal. These connections enable the model to easily derive information about the eye rotation angle.

• Prediction Head. The extracted features are finally fed into a fully connected layer of size 64 × 32 × 2. It combines features from previous layers and flattens the output into the appropriate shape. The output of the model is a direction vector y = [α, β], where α is the eyeball's azimuth angle and β is the eyeball's elevation angle.

• Camera-guided Training. The vision processing module first tracks the user's face and then localizes the position of the eyes. It calculates the eye rotation angle based on the relative position of the pupil within the eye region. Guided by these vision-based techniques, the DNN model endeavors to create a feature extractor similar to those used in vision processing, but specifically designed to handle RF signals. The vision processing module ultimately provides the ground-truth angle to the DNN, which then uses the Mean Squared Error (MSE) to calculate the loss for angle estimation. The loss function is:

$L_{\mathrm{angle}} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$,  (4.5)

where $y_i$ is the predicted angle vector and $\hat{y}_i$ is the ground-truth angle vector.

4.5.3: Data Collection

We gathered training data exclusively in a controlled laboratory setting, where participants engaged with the radar and camera setup positioned on a table before them. Participants were seated 3 m away from the radar, facing it directly, as shown in Fig. 4.10(a). A total of 8 participants took part in the data collection. In each session, participants were instructed to rotate their eyes in the up, down, right, and left directions. Recognizing the potential fatigue associated with eye rotation, each session was limited to 3 minutes. Each participant repeated 15 sessions, generating a total of 27,000 data samples. We note that during the test phase, the eye motion signals are directly captured by RadEye (no camera is present). In this scenario, the length of the signal vector may differ from that of the training data. To ensure consistent input dimensions, downsampling or interpolation is employed to normalize the data dimension.

Figure 4.10: Experiment settings in a lab for different distances and angles. (a) 3 meters. (b) 4 meters. (c) 5 meters. (d) 15°. (e) 30°.

Figure 4.11: The system setting for RadEye.

4.6: Experimental Evaluation

In this section, we conduct experiments to evaluate the performance of RadEye. In particular, we aim to answer the following questions.

• Q1 (§4.6.4): What is RadEye's detection rate of eye motions?

• Q2 (§4.6.5): What is RadEye's accuracy in estimating eye rotation angles?

• Q3 (§4.6.6): What is RadEye's resilience to environmental changes and interference?
• Q4 (§4.6.7): What is RadEye's zero-shot performance (for unseen users and in unseen scenarios)?

4.6.1: Implementation

Hardware. Fig. 4.11 shows the hardware of RadEye. We fabricated a PCB board capable of transmitting and receiving FMCW signals at 5 GHz using Wi-Fi electronic components. The received signal is amplified by a power amplifier (PA) and then mixed with the transmitted signal. The board's electronic components also include a 16 dB Tx/Rx coupler, an RF I/Q mixer, and baseband filtering. Additionally, we custom-designed and optimized a 4 × 4 patch antenna in HFSS for signal transmission and reception. A single patch antenna provides a 15 dBi gain, resulting in a total gain of 30 dBi. The patch antenna design keeps the signal beam within a narrow range while providing significant gain, enabling RadEye to detect eye motions from a distance. The mixed signal is subsequently fed into a USRP N210 with an LFRX daughterboard to convert the analog signal into baseband I/Q samples. The FMCW signal generated by RadEye sweeps from 5.4 GHz to 6.5 GHz. Each chirp has a duration of 1 ms, with 600 µs for frequency ramping and 400 µs idle.

Software. We implemented our data preprocessing module in C++ as a GNU Radio out-of-tree module. A crucial function of this module is to synchronize the chirps, facilitating the extraction of useful samples. However, owing to the absence of clock synchronization between the USRP ADC and the FMCW chirps, only a software-based method can be employed for synchronization. To address this issue, we utilize the high sampling rate of 2.5 MSps and the idle period for synchronization. We first detect the idle period based on its smooth amplitude, followed by fine-grained detection to identify the first sample of the chirp. The DNN model was implemented in PyTorch with the Adam optimizer. Throughout the training process, we used a batch size of 200 and 50 epochs.

4.6.2: Experimental Setting

During the experiments, participants were seated on a chair facing the antennas of RadEye. The antennas were positioned 1.1 m above the ground on a tripod. A varifocal camera was placed on top of the laptop to capture video of the participants' faces for eye motion detection using the SOTA gaze tracking tool [89]. Our experimental studies show that this camera-based eye detection tool achieves about 98% accuracy, as shown in Fig. 4.13. While it is not perfect, we use the detection results from the camera-based tool as the ground-truth labels to supervise the training of RadEye's DNN model. During inference, we also use the camera-based detection results as the ground truth to evaluate the estimation accuracy of RadEye.

RadEye and the camera operated concurrently, synchronized with the PC clock, to estimate participants' eye rotations. Since RadEye's sample rate is higher than the camera's frame rate (10 frames/s), each camera-captured direction is mapped to 200 continuous chirps. Each training sample in our dataset is a 200 × 2 matrix, where 200 is the time dimension and 2 is the feature dimension (amplitude and phase).

Figure 4.12: The eye rotation directions captured by a camera, used as ground truth for DNN training and evaluation.

Figure 4.13: Tracking accuracy of eye rotation with the camera at varying distances.
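To make the architecture of §4.5.2 concrete, below is a minimal PyTorch sketch of the described pipeline: a 200 × 2 input, a 64-dimensional projection with positional encoding, two transformer encoder layers with four heads each, and a 64→32→2 prediction head trained with the MSE loss of Eqn. (4.5). This is our reading of the description, not the authors' released code; details such as the feed-forward width, the learned positional encoding, and mean pooling over time are assumptions.

```python
import torch
import torch.nn as nn

class RadEyeNet(nn.Module):
    """Transformer encoder mapping a 200x2 (amplitude, phase) sequence
    to an eye rotation vector y = [azimuth, elevation]."""

    def __init__(self, seq_len=200, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(2, d_model)             # 200x2 -> 200x64 projection
        # learned positional encoding (an assumption; the text only says
        # "positional embeddings")
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=128,                      # assumed width
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(                    # 64 -> 32 -> 2 head
            nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):                  # x: (batch, 200, 2)
        h = self.encoder(self.proj(x) + self.pos)
        return self.head(h.mean(dim=1))    # pool over time -> (batch, 2)

model = RadEyeNet()
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()                     # Eqn. (4.5)
x = torch.randn(200, 200, 2)               # one batch of 200 samples
y = torch.randn(200, 2)                    # camera-provided [alpha, beta] labels
opt.zero_grad(); loss = loss_fn(model(x), y); loss.backward(); opt.step()
```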
4.6.3: Performance Metrics

We consider the following three performance metrics.

• Eye Motion Detection Rate (EMR). Eye motion here is defined as eye blinks and eye rotations. The signal processing module extracts the eye motion signals from the RF data before sending them to the DNN model. We define: Detection rate = (Number of eye motions detected) / (Total eye motions performed).

• Estimation Error of Eye Rotation Angle (ERA). The eye rotation angles are illustrated in the bottom right corner of Fig. 4.7. The estimation error of the azimuth angle is defined as $e_\alpha = |\alpha - \hat{\alpha}|$, where $\hat{\alpha}$ is the estimated azimuth angle and $\alpha$ is the ground-truth angle provided by the camera. Similarly, the estimation error of the elevation angle is defined as $e_\beta = |\beta - \hat{\beta}|$.

• Estimation Accuracy of Eye Rotation Direction (ERD). Some applications of RadEye (e.g., remote TV control) may not require precise angle measurements but only the eye rotation direction. This metric evaluates the accuracy of classifying eye movements into four directions: up, down, left, and right. Specifically, we define: Accuracy = (Number of correct direction estimations) / (Total eye rotations performed).

4.6.4: Eye Motion Detection Rate

RadEye's eye motion detection ability, including eye rotation and eye blink detection, serves as the foundation of many eye-tracking applications. While RadEye focuses on estimating eye rotation angles, eye blink detection is also one of its key components. This feature not only enhances the input of RadEye but also enriches its functionality. Therefore, we evaluate RadEye's success rate in detecting eye rotations and blinks.

Figure 4.14: (a) RadEye's eye blink/rotation detection rate. (b) RadEye's eye rotation estimation error at different distances. (c) RadEye's accuracy in estimating eye rotation directions. (d) The confusion matrix of RadEye's eye rotation direction estimation at 3 m distance.

We instructed eight participants to perform eye rotations and blinks at three different distances: 3 m, 4 m, and 5 m, as shown in Fig. 4.10(a)-(c). A total of 5 minutes of data was collected at each distance for each participant. Fig. 4.14(a) shows RadEye's average eye motion detection rate for the 8 individuals at the various distances. The highest detection rates for eye blink and eye rotation are 94% and 96%, respectively, both observed at the 3 m distance. Overall, the detection rate is consistent: even at a distance of 5 m, RadEye achieves an 88% detection rate for eye blinks and a 91% detection rate for eye rotations. This confirms the robustness of RadEye's eye motion detection in different settings. Numerically, the standard deviation of eye blink detection across the eight individuals is about 2%; for eye rotation detection, it is about 4%. This slight difference can be attributed to the simplicity and similarity of eye blink motions across individuals. Additionally, the detection rate of eye rotations is consistently higher than that of eye blinks in all cases. This is not surprising, as eye rotations involve more significant facial muscle movements than eye blinks.
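Before moving to angle estimation, the sketch below summarizes how the three metrics of §4.6.3 could be computed from logged trials. The arrays, the 90°-centered angle convention, and the 4-way direction mapping are illustrative assumptions, not the authors' evaluation scripts.

```python
import numpy as np

def emr(num_detected, num_performed):
    """Eye Motion Detection Rate."""
    return num_detected / num_performed

def era(pred, truth):
    """Mean absolute eye-rotation-angle error per angle (azimuth, elevation).

    pred, truth: arrays of shape (n, 2) holding [alpha, beta] in degrees.
    """
    return np.abs(pred - truth).mean(axis=0)

def erd_accuracy(pred, truth):
    """Eye Rotation Direction accuracy over four classes (up/down/left/right)."""
    def direction(angles):
        a, b = angles[:, 0], angles[:, 1]
        horiz = np.where(a < 90, "left", "right")   # assumed angle convention
        vert = np.where(b < 90, "down", "up")
        # pick the axis with the larger deviation from center (90 deg)
        return np.where(np.abs(a - 90) > np.abs(b - 90), horiz, vert)
    return (direction(pred) == direction(truth)).mean()

pred = np.array([[40.0, 95.0], [140.0, 88.0]])
truth = np.array([[45.0, 90.0], [135.0, 90.0]])
print(era(pred, truth), erd_accuracy(pred, truth))
```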
4.6.5: Eye Rotation Angle/Direction Estimation

Eye Rotation Angle Estimation. We conducted these experiments in the same way as described in §4.6.4. Fig. 4.14(b) presents the cumulative distribution function (CDF) of RadEye's angle estimation errors for all participants at three different distances. The mean azimuth/elevation errors at 3 m, 4 m, and 5 m are approximately 14°/7°, 20°/18°, and 24°/21°, respectively. Evidently, the eye rotation angle estimation error increases with distance. This is not surprising, as the radio signal suffers larger attenuation over a longer distance. Additionally, we observed that the elevation angle estimation error is consistently smaller than the azimuth angle estimation error. This agrees with our observation in §4.3: the eye's vertical movements (up and down) generate more pronounced changes in the radar signal's amplitude and phase than its horizontal movements (right and left).

Eye Rotation Direction Estimation. As some applications of RadEye need only the eye rotation direction, we first classify the estimated eye rotation angle into four directions (up, down, left, and right) and then evaluate the estimation accuracy. Since slight eye motions always accompany humans, we consider an eye rotation action effective only when the azimuth α < 50° or α > 130°, or the elevation β < 50° or β > 130°, as exemplified in Fig. 4.12. Using the effective input captured by the camera as ground truth, we measure RadEye's estimation accuracy. Fig. 4.14(c) plots the average estimation accuracy for the eight participants at three different distances. The average estimation accuracy at 3 m, 4 m, and 5 m is 90.0%, 84.7%, and 83.5%, respectively. These accuracy levels are suitable for most daily applications requiring human input. The standard deviations at 3 m, 4 m, and 5 m are 3%, 5%, and 5.5%, indicating RadEye's robustness when detecting the eye rotations of different users. Additionally, Fig. 4.14(d) presents the confusion matrix for the 3-meter case. It is evident that distinguishing between right and left eye rotations is more challenging than distinguishing between up and down rotations. This suggests that an application with binary input for up and down movements could further enhance RadEye's robustness.

4.6.6: RadEye's Robustness

RadEye's Field of View. RadEye has two patch antennas for signal transmission and reception. Ideally, the target person should directly face RadEye's antennas. In practice, the target person may not be ideally positioned. Therefore, we conducted experiments to evaluate RadEye's field of view by examining its estimation accuracy when the target person is located in different directions, as illustrated in Fig. 4.10(d)-(e). Specifically, five participants performed eye rotations at angles of 15° and 30° from three different distances. In total, six scenarios were studied. In each scenario, participants performed eye rotations for 5 minutes. Fig. 4.15 shows our experimental results.

Figure 4.15: RadEye's estimation accuracy of eye rotation directions when the target person is located at different distances and angles.

Figure 4.16: (a) The interference test in a lab. (b) Test at 5 m in the conference room. (c) Test at 5 m in the hallway.

Figure 4.17: RadEye's estimation accuracy of eye rotation directions when experiencing interference from walking people.

Figure 4.18: RadEye's estimation accuracy of eye rotation directions during self-body motions.
It can be seen that RadEye achieves high estimation accuracy when the target person is located at 0°, 15°, and 30°. Overall, the estimation accuracy remains above 82% in all cases, indicating that RadEye has at least a 60° field of view.

Impact of Moving Objects. Since static objects can easily be filtered out of the received signal, we further evaluated RadEye's resilience to interference caused by nearby walking individuals. In the experiments, another person was asked to walk around the user in close proximity, as depicted in Fig. 4.16(a). We measured the accuracy in three scenarios where the distance between the user and the walking person was 1 m, 2 m, and 3 m. In each scenario, five participants performed eye rotations for 5 minutes. Fig. 4.17 shows the results. The presence of a walking person causes a slight decrease in RadEye's estimation accuracy. Overall, RadEye achieves accuracies of 84%, 88%, and 89% when the walking person is 1 m, 2 m, and 3 m away from the participant, respectively. We note that all these experiments were conducted in normal scenarios: participants were only instructed to keep their heads still while performing eye rotations, and no other restrictions were imposed to avoid interference from multipath effects or the participants' other normal physiological activities.

Impact of Self-Body Motions. Besides nearby moving objects, the participant's own body movements may also affect RadEye's performance. To evaluate RadEye's usability in practical scenarios, we studied cases where the participant was speaking, shaking his head, or moving his legs or hands. A participant was asked to perform these three activities separately while executing eye rotations. In each scenario, the participant performed eye rotations for five minutes at a distance of 3 m. Fig. 4.18 presents our measurement results. RadEye's accuracy remains at 82% when the participant moves his legs or hands. However, its accuracy decreases to 34% when he shakes his head and to 45% when he speaks. This reduction can be attributed to RadEye's limited range resolution: since the legs and hands are more than 14 cm away from the eyes, their movements have minimal impact on detection accuracy, whereas head and mouth movements interfere with the eye rotation signal, leading to lower detection accuracy.

Impact of Wi-Fi Signals. As RadEye operates at 5 GHz, overlapping with part of the Wi-Fi spectrum, we conducted experiments to evaluate the impact of Wi-Fi interference on RadEye. A Wi-Fi interferer built on a USRP was placed next to RadEye, as shown in Fig. 4.19, continuously transmitting Wi-Fi packets. The Wi-Fi signals were generated at two frequencies, 5.5 GHz and 5.825 GHz, with a bandwidth of 20 MHz.

Figure 4.19: Wi-Fi interference test: experimental setup (left) and experimental results (right).
The experiments were conducted in a static environment. We compared RadEye's IF signals with and without Wi-Fi interference; the results are shown in Fig. 4.19. We observed that the IF signals remain nearly identical regardless of the presence of Wi-Fi signals. RadEye appears to be resilient to Wi-Fi interference due to two key factors. First, RadEye operates with a broad bandwidth of 1.1 GHz, whereas Wi-Fi signals occupy only 20 MHz. Second, RadEye employs FMCW modulation, which contrasts with Wi-Fi's OFDM modulation. OFDM signals exhibit pseudo-noise characteristics, and when they are correlated with FMCW signals over time, the resulting correlation is close to zero. This theoretical outcome explains why RadEye can effectively resist interference from nearby Wi-Fi devices.
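The near-zero correlation argument can be checked numerically. The toy simulation below correlates a linear chirp with a pseudo-noise (OFDM-like) waveform and, for contrast, with itself; the waveform models and parameters are simplifications for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
t = np.arange(n) / n

# Linear FMCW chirp (normalized units) and an OFDM-like pseudo-noise signal
chirp = np.exp(-1j * 2 * np.pi * (100 * t + 0.5 * 2000 * t ** 2))
ofdm_like = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

def norm_corr(a, b):
    """Magnitude of the normalized correlation of two equal-length signals."""
    return np.abs(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"chirp vs OFDM-like noise: {norm_corr(chirp, ofdm_like):.3f}")  # ~0.01
print(f"chirp vs itself:          {norm_corr(chirp, chirp):.3f}")      # 1.000
```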
4.6.7: Zero-Shot Performance

Since RadEye has a DNN component for eye rotation detection, it is critical to evaluate its zero-shot performance for new users and new scenarios.

New Users. The generality of the trained model is crucial, as it allows easy extension to new users with minimal effort. To evaluate this, we conducted a cross-user test for RadEye. Specifically, four new users who were not involved in the training data collection were invited to perform eye rotations at a distance of 5 m. Among these users, participants B and C wore glasses. Each participant contributed 5 minutes of data. Fig. 4.20 reports RadEye's estimation accuracy for these four users. The highest accuracy is 86% for user A, and the lowest is 77% for user D. Notably, wearing glasses does not seem to affect the results much. RadEye achieves an average accuracy above 80% for new users, demonstrating its generalizability to new users.

New Environments. In addition to evaluating RadEye's generalizability to new users, we also assessed its zero-shot performance in unseen scenarios. Four new participants performed eye rotations at a distance of 5 m in both a conference room and a hallway. Fig. 4.16(b)-(c) illustrates the experimental settings, where participants faced RadEye at an angle of 0°. Fig. 4.21 presents the measurement results. RadEye achieves accuracies of 83% and 84% in the conference room and the hallway, respectively. These results demonstrate RadEye's ability to generalize to unseen environments as well.

Figure 4.20: The accuracy for users not in the training set.

Figure 4.21: The accuracy in different environments.

4.7: Limitations and Discussions

In this section, we point out the limitations of RadEye and discuss potential solutions to address them.

• Interference Caused by Head and Mouth Movements. While RadEye is resilient to interference from the surrounding environment, it requires users to keep their heads still during use. Movements such as head shaking, speaking, or other facial expressions can obscure the eye rotation signals, resulting in unsuccessful detection. One approach to this issue is to increase RadEye's bandwidth: with a sufficiently large bandwidth, RadEye could differentiate the eyes from the mouth in the frequency domain, thereby eliminating the interference from head and mouth movements in eye motion detection.

• RadEye versus mmWave Radar. MmWave radar is capable of detecting subtle movements, such as eye motion, and is commercially available on the market. However, its detection range is relatively short due to the rapid attenuation of mmWave propagation. In contrast, RadEye offers a significantly larger range for eye motion detection but requires a wide spectrum bandwidth at lower frequencies. Therefore, both mmWave radar and RadEye have distinct advantages and limitations: mmWave radar is better suited for short-range eye tracking, while RadEye is more appropriate for long-range use cases.

• Physical Size of RadEye. Our current prototype of RadEye is not compact enough for certain applications, such as installation on wheelchairs. This limitation arises because the prototype has not yet been optimized. In fact, the current design has significant potential for size reduction through various optimizations, including smaller packages (e.g., SMD) for electronic components, more efficient power management chips, improved patch antenna designs, and shorter cables. Moreover, integrating the patch antennas directly into the PCB could significantly reduce the system's physical size, making it more suitable for space-constrained applications.

• User Fatigue. RadEye currently recognizes eye rotations in only four directions, requiring users to rotate their eyes multiple times to input a word, which tends to cause eye fatigue. Future work will focus on developing a system capable of continuously tracking eye rotation directions with high accuracy, rather than limiting recognition to four discrete directions. This improvement would enhance input efficiency and significantly reduce user fatigue.

4.8: Summary

Remote eye tracking has many potential applications, ranging from HCI-based input to eye disease detection. While cameras have been widely studied for eye tracking, their use in practice may raise privacy concerns in some scenarios. In this chapter, we presented RadEye, an RF sensing system capable of recognizing fine-grained human eye movement from a long distance. The challenge in the design of RadEye is to detect tiny eyeball movements in the presence of interference from other moving objects. RadEye addresses this challenge through a joint hardware and software design. For hardware, RadEye uses a custom-designed sub-6GHz FMCW radar for feature extraction and interference mitigation. For software, a camera-guided DNN model has been crafted to improve RadEye's detection accuracy. Extensive experiments show that RadEye achieves 90% accuracy when detecting people's eye rotation directions (up, down, left, and right) in various scenarios.

CHAPTER 5: UPLINK MU-MIMO COMMUNICATION IN MMWAVE WLANS

5.1: Introduction

Recently, the integration of millimeter-wave (mmWave) and multi-user multiple-input multiple-output (MU-MIMO) technologies has attracted significant research and development attention in wireless local area networks (WLANs), owing to their potential to deliver data rates of hundreds of Gbps through the simultaneous transmission of multiple independent data streams [54]. As a concrete step towards real-life applications, downlink mmWave MU-MIMO has been standardized in IEEE 802.11ay [5], with a theoretical data rate of up to 176 Gbps. However, the advancement of mmWave MU-MIMO has been mainly limited to the downlink; very limited progress has been made so far on the uplink. While both 802.11ac (sub-6GHz) and 802.11ay (60GHz) support downlink MU-MIMO, neither supports uplink MU-MIMO.
This stagnation underscores the grand challenges in the design of practical yet efficient uplink mmWave MU-MIMO communication schemes. In addition, the demand for uplink data rate is increasing dramatically in emerging applications such as autonomous driving and video streaming. Ericsson predicts that global uplink traffic will reach 70 EB per month in 2027 [2]. Therefore, there is a critical need to fill this gap.

In this chapter, we present a practical yet efficient uplink MU-MIMO mmWave communication scheme (UMMC) for a wireless local area network (WLAN). UMMC allows multiple stations to simultaneously send their data packets to an access point (AP) without requiring fine-grained inter-station synchronization. We address two challenges in the design of UMMC. The first challenge lies in the analog beamforming for a multi-antenna AP. While the literature has a wealth of analog beamforming work, existing approaches can generally be classified into two categories: model-based optimization (e.g., [111], [88, Table V]) and model-free beam search (e.g., [62, 65, 118, 148, 150]). While model-based approaches offer the optimal antenna weight vectors (AWVs) for analog beamforming, they require accurate antenna models and channel knowledge, which are hard to obtain. Therefore, these approaches are not amenable to practical use. Model-free approaches do not require such knowledge, as they aim to find the best beam in a predefined beambook. However, most of them focus on maximizing the signal strength for a single-antenna mmWave device while minimizing the beam search overhead. While maximizing signal strength is equivalent to maximizing data rate in single-antenna systems, this is not the case in MU-MIMO systems, because the capacity of an MU-MIMO channel depends not only on signal strength but also on the correlation of the MIMO channels. When two stations have highly correlated channels, the AP may not be able to decode their packets even if the signals are strong. In addition, exhaustive search is notorious for its large airtime overhead and is thus not suitable for practical use.

To address this challenge, we design a Bayesian optimization (BayOpt) framework for joint beam search at the AP. This framework is inspired by two facts: i) the relation between a selected beam and its achievable data rate in MU-MIMO communications is complex and unknown in real systems; and ii) BayOpt has proved to be an effective technique for finding an optimal or near-optimal solution to an optimization problem whose objective function and constraints are unknown and costly to evaluate. The key idea of the BayOpt framework is to guide the beam search using the posterior probability derived from the beams that have already been evaluated; the more beams we evaluate, the more accurate information we have about the remaining beams. Compared to exhaustive search, BayOpt is surprisingly efficient at finding a near-optimal beam within a given airtime budget.

Another challenge in the design of UMMC is the synchronization among stations. Signal detection in uplink MU-MIMO transmission has been well studied in sub-6GHz wireless networks, and signal detection methods such as zero-forcing (ZF) and minimum mean square error (MMSE) have been widely used in practice. However, existing signal detectors are based on an important assumption: the data packets from different stations are synchronized in time when impinging on the AP.
In particular, in OFDM systems, the time misalignment of the packets arriving at the AP must be less than the duration of an OFDM symbol's cyclic prefix (CP). While this requirement can be met in narrowband (20 MHz) sub-6GHz systems (e.g., using timing advance protocols), it is extremely challenging to meet in ultra-wideband mmWave systems. For instance, with conventional MU-MIMO detectors, the time misalignment of packets in 802.11ay must be less than 36 ns, which is hard to maintain in practice. Due to this stringent requirement, uplink MU-MIMO is not yet supported by the 802.11ay standard [5].

To address this challenge, we argue that it is more desirable to live with packet misalignment at the AP than to employ an onerous protocol to synchronize stations. Towards this goal, we observe that existing MU-MIMO detectors work in the spatial domain, while packet misalignment is an imperfection in the temporal domain. Since these two domains are orthogonal, spatial MU-MIMO detectors should be immune to the temporal misalignment of data packets. In fact, the real problem is that the construction of existing MU-MIMO detectors requires channel knowledge, which is estimated using orthogonal pilots (reference signals) in the data packets. Misaligned packets, however, cannot maintain the orthogonality of their pilots, making it hard to estimate the channels. To solve this problem, we design an asynchronous MU-MIMO detector through a transformation of the existing MMSE MU-MIMO detector. This new detector is capable of decoding asynchronous packets from multiple stations without explicit channel knowledge. The key idea behind our design is to use the interfered pilots within each packet to train its detection filter. Doing so eliminates the need for the channel matrix in the construction of the detection filter while achieving surprisingly good performance. The new detector fundamentally relaxes the synchronization requirement for stations in uplink MU-MIMO transmission, making UMMC amenable to practical implementation; a toy numerical sketch of the pilot-trained idea is given after the contribution list below.

We have evaluated UMMC through a blend of over-the-air experiments and extensive simulations. We implemented UMMC on a two-user MIMO mmWave (60GHz) testbed and demonstrated that it enables real-time uplink packet transmission in the absence of inter-user synchronization (see our demo video via the anonymous link in [29]). Experimental results show that, compared to exhaustive beam search, BayOpt achieves 92% of the throughput while reducing the overhead by 98%. In addition, simulation results from a 100-user mmWave network show that, compared to exhaustive beam search, BayOpt achieves more than 80% of its throughput while entailing less than 5% of its overhead in all two-user, three-user, and four-user MIMO cases.

The contributions of this work are summarized as follows.

• We design a practical uplink MU-MIMO mmWave communication scheme for WLANs and demonstrate that it works in realistic scenarios via over-the-air experiments.

• We introduce the first-of-its-kind BayOpt framework for beam search in mmWave MU-MIMO systems and show its efficiency through both experimental and simulation results.

• We propose a new MU-MIMO detector that can decode asynchronous data packets from multiple user devices. For the first time, it demonstrates via theory and experiments that fine-grained inter-user synchronization is not needed for uplink MU-MIMO mmWave transmission.
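As promised above, here is a toy numerical sketch of the pilot-trained detection idea: a per-station linear filter is fit directly to received pilot samples via least squares, with no explicit channel matrix. The 2×2 setup, pilot length, and noise level are arbitrary illustration choices, not UMMC's parameters, and the detector details follow in later sections.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, L = 2, 2, 64  # AP RF chains, stations, pilot length

# Unknown channel and known per-station pilot sequences (QPSK-like)
H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
X = np.sign(rng.standard_normal((N, L))) + 1j * np.sign(rng.standard_normal((N, L)))
noise = 0.05 * (rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L)))
Y = H @ X + noise   # received pilot samples (pilots interfere freely)

# Train one linear filter per station: p_n = argmin ||p_n^H Y - x_n||^2.
# This is the least-squares (sample-MMSE) solution and needs no H.
P = np.linalg.solve(Y @ Y.conj().T, Y @ X.conj().T)  # columns are p_n

X_hat = P.conj().T @ Y
err = np.mean(np.abs(X_hat - X) ** 2) / np.mean(np.abs(X) ** 2)
print(f"relative decoding error on pilots: {err:.3f}")  # small -> filters work
```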
5.2: Related Work

This work is relevant to mmWave MIMO communications, beam search, MU-MIMO detection, and system prototyping.

802.11ad/ay and Cellular Networks. In 2012, the IEEE 802.11ad amendment standardized communication in the 60 GHz unlicensed band, offering up to 6.75 Gbps data rate over short ranges [1]. While 802.11ad devices may have multiple antennas, they do not support MU-MIMO transmission. As a follow-up, 802.11ay was standardized in 2020 [5]; it supports new features including channel bonding, higher-order modulation, and downlink MU-MIMO. However, it does not yet support uplink MU-MIMO.

The 3GPP specification for 5G cellular networks already supports MU-MIMO, hybrid beamforming, and mmWave communications in the 24–53 GHz band [13]. While abundant literature has studied beam design and MU-MIMO for mmWave, most of it is limited to signal processing and numerical analysis [67, 116]. Very likely, future cellular networks will employ sophisticated synchronization protocols (e.g., timing advance) and flexible numerology (e.g., long CP) to support uplink MU-MIMO. However, this approach is not suitable for WLANs, which target low-cost applications.

Table 5.1: Representative work on beam search in the literature.

Beam search technique                                Approach
Learning-based search [23, 30, 68, 84]               Train a machine learning model to predict the current best beam direction.
Out-of-band assistance [21, 62, 64, 118, 135, 148]   Utilize out-of-band information (sub-6GHz, light, camera) to align the beam.
Compressive sensing [65, 127]                        Find the best beam alignment direction with sparse measurements.
Hierarchical search [76, 92, 120, 141, 157, 194]     Design beamforming codebooks and perform training in a hierarchical way.

Beam Search. There is a large body of work on beam training for mmWave communications. Table 5.1 lists representative work and the basic ideas. Most existing work focuses on finding the best beam in a predefined beambook to maximize signal strength while minimizing the associated cost. As explained above, maximizing signal strength is not a good strategy for MU-MIMO. In addition to beam selection, a considerable amount of work studies analog beamforming from a signal processing perspective by formulating the AWV design problem as an optimization problem [92, 157, 180, 194]. However, this kind of work is not amenable to practical implementation for several reasons: i) it assumes that the phased-array antenna has an ideal radiation pattern; ii) it requires over-the-air channel knowledge for the AWV design; and iii) it incurs high computation in solving an optimization problem.

Uplink MU-MIMO in Sub-6GHz Networks. Uplink MU-MIMO has been supported since 4G cellular networks and will be supported by 5G and beyond [3]. In contrast, the path of uplink MU-MIMO into the 802.11 standards has been rocky: thus far, no on-market Wi-Fi devices support uplink MU-MIMO. Similar to 802.11ay, 802.11ac supports downlink MU-MIMO but does not support uplink MU-MIMO [4]. This can be attributed to the fact that WLANs are distributed, contention-based systems and lack inter-user coordination. Although 802.11ax will support uplink MU-MIMO, symbol-level synchronization remains an outstanding challenge [140].

Inter-User Synchronization for Uplink MU-MIMO.
Timing advance (TA) is the main mechanism used in wireless networks to compensate for inter-user time misalignment and offset signal propagation delays for uplink MU-MIMO and other multi-access technologies. Per [14] and [110], the timing error achieved by TA in cellular networks cannot meet the requirement of mmWave MU-MIMO under the 802.11ay numerology. [212] validated the throughput gain of MU-MIMO via offline experiments but did not address the timing problem.

5.3: Problem Description

We consider uplink MU-MIMO communication in a WLAN as shown in Fig. 5.1, where an AP wishes to decode concurrent data packets from multiple stations. Our objective is to maximize the uplink throughput through the design of analog and digital beamforming at the AP. In pursuit of this objective, we assume that a beam has already been selected for each station using an existing beam search scheme such as sector-level sweep (SLS) and beam refinement protocol (BRP) [1]. We focus on the analog and digital beamforming at the AP for uplink MU-MIMO transmission.

5.3.1: Problem Formulation

Analog Beamforming. Denote M as the number of phased-array antennas (RF chains) on the AP and N as the number of stations involved in the uplink MU-MIMO transmission (assuming N \le M). We assume that all the phased-array antennas on the AP are identical. Suppose that a linear phased-array antenna intends to steer its beam energy in the direction \theta. Then, its antenna weight vector (AWV) can be modeled as G_{ap}(\theta) = [e^{j \frac{d_{ap}}{\lambda} i \sin(\theta)}]_{0 \le i \le N_{ap}-1}, where d_{ap} is the patch element spacing, \lambda is the wavelength, and N_{ap} is the number of patch elements. Similarly, suppose the phased-array antenna on a station intends to steer its beam energy in the direction \phi. Then, its AWV can be modeled as G_{sta}(\phi) = [e^{j \frac{d_{sta}}{\lambda} i \sin(\phi)}]_{0 \le i \le N_{sta}-1}, where d_{sta} is the patch element spacing and N_{sta} is the number of patch elements.

Figure 5.1: Uplink MU-MIMO transmission in a WLAN.

Then, the signal received by the AP's mth RF chain can be written as:

    y_m = \sum_{n=1}^{N} G_{ap}(\theta_m) \tilde{H}_{mn} G_{sta}(\phi_n)^\top x_n + w_m,    (5.1)

where x_n is the signal transmitted by the nth station, w_m is the received noise, and \tilde{H}_{mn} \in \mathbb{C}^{N_{ap} \times N_{sta}} is the over-the-air channel between the AP's mth antenna and the nth station's antenna.

Digital Beamforming. At the AP (receiver), digital beamforming serves the purpose of MU-MIMO detection. Denote \vec{y} = [y_1, y_2, \cdots, y_M]^\top as the received signal vector and \vec{p}_n as the AP's spatial filter for decoding the data packets from station n. Then, the decoded version of the signal from station n can be written as \hat{x}_n = \vec{p}_n^{\,H} \vec{y}, for 1 \le n \le N, where (\cdot)^H is the conjugate transpose operator.

Design Objective. At the AP, denote \vec{\theta} = [\theta_1, \theta_2, \cdots, \theta_M] as the beam angle vector, which can be directly used to calculate the AWVs for analog beamforming. Denote \vec{p} = [\vec{p}_1, \vec{p}_2, \cdots, \vec{p}_N] as the detection vector. Denote EVM_n as the error vector magnitude (EVM) of the decoded signals from station n, i.e., EVM_n \triangleq E[|x_n - \hat{x}_n|^2] / E[|x_n|^2]. Without loss of generality, we assume that the transmit powers at the stations are normalized, i.e., E[|x_n|^2] = 1. Then, we have

    EVM_n = E[|x_n - \hat{x}_n|^2].    (5.2)

The link capacity (spectral efficiency) between station n and the AP can be written as c_n = \log_2(1 + \frac{1}{EVM_n}).
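To make these two metrics concrete, the following minimal NumPy sketch computes (5.2) and the resulting link capacity for a toy symbol stream. The QPSK symbols and error level below are illustrative assumptions, not measurements from our testbed.

```python
import numpy as np

def evm(x_true, x_hat):
    """Error vector magnitude per (5.2), assuming E[|x_n|^2] = 1."""
    return np.mean(np.abs(x_true - x_hat) ** 2)

def link_capacity(evm_n):
    """Spectral efficiency c_n = log2(1 + 1/EVM_n) in bits/s/Hz."""
    return np.log2(1.0 + 1.0 / evm_n)

# Illustrative example: unit-power QPSK symbols plus a residual detection error.
rng = np.random.default_rng(0)
x = (rng.choice([-1, 1], 256) + 1j * rng.choice([-1, 1], 256)) / np.sqrt(2)
x_hat = x + 0.1 * (rng.standard_normal(256) + 1j * rng.standard_normal(256))
print(evm(x, x_hat), link_capacity(evm(x, x_hat)))
```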
In uplink MU-MIMO, it is important not only to maximize the data rate but also to ensure fairness among users. Thus, our objective is to pursue the best analog and digital beams so that the bottleneck link data rate is maximized. Mathematically, this can be formulated as:

    [\vec{\theta}^*, \vec{p}^{\,*}] = \arg\max_{\vec{\theta} \in \mathcal{B},\, \vec{p}} \; \min_n \left( \log_2\!\left(1 + \frac{1}{EVM_n}\right) \right),    (5.3)

where \mathcal{B} is the predefined beambook that includes all possible beam angle vectors.

The optimization problem in (5.3) can be divided into two subproblems: i) analog beam selection (determining \vec{\theta}), and ii) MU-MIMO detector construction (determining \vec{p}). These two subproblems are tightly coupled with each other. Given the complex nature of this problem, it is intractable to pursue a globally optimal solution in real systems. Therefore, we develop a practical yet efficient scheme to solve the two subproblems.

5.3.2: Key Challenges

Inaccurate Models. Solving the above optimization is nontrivial, as the gradients of the objective function are unknown, so first-order methods such as gradient descent cannot be applied. In addition, we used G_{ap}(\theta) and G_{sta}(\phi) to model the response of ideal linear phased-array antennas. In practice, phased-array antennas have many imperfections in their radiation patterns, and their actual mathematical models are unknown. The discrepancy between the ideal and real antenna models significantly affects the beamforming design.

Channel Correlation. The capacity of MU-MIMO transmission is determined not only by signal strength but also by MIMO channel correlation. Existing approaches based on signal strength alone are not suitable for beam search in MU-MIMO. This calls for a beam search scheme that can jointly identify the best beams for all antennas. One straightforward approach is exhaustive search. However, it entails a large airtime overhead and thus compromises the throughput gain of MU-MIMO. Therefore, an efficient joint beam search scheme is needed.

Inter-Station Timing Synchronization. Uplink MU-MIMO detection has been well studied. However, existing schemes require fine-grained inter-user timing synchronization for signal detection. That is, the time misalignment of data packets from different stations must be less than the OFDM CP length. In 802.11ay [5], the normal guard interval (CP) duration is 36.36 ns. Maintaining inter-user synchronization within 36.36 ns not only entails a large overhead but also complicates the network design and operation. For this reason, neither 802.11ac (sub-6GHz) nor 802.11ay (60GHz mmWave) supports uplink MU-MIMO.

5.4: Overview of UMMC

In this section, we first highlight our approaches to overcoming the above challenges and then present the overall system diagram of UMMC. In what follows, we denote f(\vec{\theta}) = \max_n \{EVM_n\}. When \vec{p} is given, the optimization in (5.3) is equivalent to minimizing f(\vec{\theta}).

5.4.1: Our Approaches

Analog Beam Search. To address the beam search challenge, we design a BayOpt scheme for joint beam search. BayOpt has proved to be an effective technique for solving sequential optimization problems where the objective function is complex (treated as a black box), the (sub)gradient is unknown, and each evaluation is expensive [114]. To illustrate the idea behind BayOpt, consider the beams in a beambook [\vec{\theta}_1, \vec{\theta}_2, \cdots, \vec{\theta}_{3600}].
Suppose that we have measured two beams, say \vec{\theta}_{10} and \vec{\theta}_{1000}, and found that f(\vec{\theta}_{10}) = 5 and f(\vec{\theta}_{1000}) = 0.1. Then, in the next iteration we should select a beam in the neighborhood of \vec{\theta}_{1000} to evaluate, because the global minimum is more likely to sit in the neighborhood of \vec{\theta}_{1000} than in that of \vec{\theta}_{10}. BayOpt is a principled strategy that guides the process of joint beam search based on posterior probability.

MU-MIMO Detection. Inter-station synchronization is a fundamental problem for uplink MU-MIMO. Achieving the required timing alignment for packet transmission among distributed stations is extremely hard. In light of this, we live with the timing misalignment among the stations and focus on enabling asynchronous MU-MIMO detection. To this end, we revisit the conventional (synchronous) MMSE detector and find that a transformation can make it applicable to decoding asynchronous data packets from independent stations.

Figure 5.2: The high-level system diagram of UMMC.

5.4.2: System Diagram

Fig. 5.2 shows the system diagram of UMMC. The AP measures the performance of a sequence of analog beams [\vec{\theta}_1, \vec{\theta}_2, \cdots, \vec{\theta}_t, \cdots, \vec{\theta}_T], where t is the evaluation/iteration index and T is the predefined maximum number of evaluations/iterations allowed (e.g., T = 30). At the end of the T iterations, UMMC chooses the beam that yields the best performance. In each iteration t, UMMC performs the following four steps:

• Step 1: The AP selects a beam \vec{\theta}_t for evaluation in the current iteration based on the posterior probability derived from the past evaluations, i.e., (\vec{\theta}_{t'}, f(\vec{\theta}_{t'})) for 1 \le t' < t. Details are presented in Section 5.5.

• Step 2: The AP reconfigures its phased-array antennas by setting their beam patterns to \vec{\theta}_t.

• Step 3: The AP first calculates its digital beamformers (a.k.a. the MU-MIMO detector) \vec{p} = [\vec{p}_1, \vec{p}_2, \cdots, \vec{p}_N], and then uses them to decode the asynchronous signal frames from the N stations. Details are presented in Section 5.6.

• Step 4: The AP measures the EVM of the decoded signals from each station. By doing so, it obtains f(\vec{\theta}_t). Then, (\vec{\theta}_t, f(\vec{\theta}_t)) is added to the dataset and will be used to guide the future beam search.

5.5: Bayesian Optimization for Beam Search

In this section, we assume that the algorithm for determining \vec{p} is given and focus on the BayOpt design to find a near-optimal beam \vec{\theta} for the AP. The design of \vec{p} is presented in the next section.

5.5.1: Why Bayesian Optimization?

Recall that the objective function is f(\vec{\theta}) = \max_n \{EVM_n\}. It has the following salient features.

• f(\vec{\theta}) has a complex structure: Fig. 5.3 shows an example of f(\vec{\theta}) obtained through exhaustive beam search on our two-user MIMO 60GHz mmWave testbed.² It is evident that f(\vec{\theta}) is hard to optimize due to its non-convexity.

• f(\vec{\theta}) is unknown: Practical mmWave communication systems typically suffer from hardware imperfections such as phase noise and clock jitter [192], which are hard to characterize and model.
As such, the beam pattern may largely deviate from its ideal model G_{ap}(\theta). The accurate objective function f(\vec{\theta}) is unknown and can only be obtained via exhaustive experimental measurements.

• Evaluating f(\vec{\theta}) is costly: To evaluate f(\vec{\theta}) for a given \vec{\theta}, the AP needs to physically set up the beam pattern and measure the resultant signal quality. This process incurs a fixed airtime overhead. For example, in 802.11ay, measuring the value of f(\vec{\theta}) for a given \vec{\theta} may take the time of one Control PHY preamble (about 3.7 µs), let alone the other airtime overhead incurred in this process. Therefore, there is a tradeoff between the quality of \vec{\theta} and the number of evaluations of f(\vec{\theta}).

Fortunately, BayOpt is an effective technique for optimizing such a function that is unknown yet expensive to evaluate [114]. It makes use of the laws of probability to combine prior belief with observed data and compute the posterior distribution of the objective function. Therefore, we design a BayOpt framework for analog beam search.

² The detailed experimental setup is presented in Section 5.7.1.

Figure 5.3: An instance of f(\vec{\theta}) obtained from experimental measurements on a two-user MIMO 60GHz testbed, where \vec{\theta} = [\theta_1, \theta_2] and f(\vec{\theta}) = \max(EVM_1, EVM_2) in dB.

5.5.2: A Bayesian Optimization Framework

To perform BayOpt, one needs to address two problems: i) finding a statistical process to model the function being optimized, and ii) selecting an acquisition function as a surrogate approximation to guide the search in each iteration. In what follows, we address these two problems in order.

Gaussian Process Regression. We model the iterative beam search problem as a Gaussian process. In the tth iteration, the AP has observed t − 1 beams. Denote \Theta = \{\vec{\theta}_i\}_{i=1}^{t-1} as the set of beams that the AP has already observed, and f(\Theta) = \{f(\vec{\theta}_i)\}_{i=1}^{t-1} as the objective function values of those observed beams. We treat f(\Theta) as a multivariate Gaussian distribution, with \mu(\Theta) as its mean and k(\Theta, \Theta) as its covariance kernel. Here, \mu(\Theta) is a (t−1)×1 vector, while k(\Theta, \Theta) is a (t−1)×(t−1) matrix. Let \vec{\theta} be an arbitrary beam in the beambook. Then, per the definition of a Gaussian process, the joint distribution of the function values corresponding to \vec{\theta} and \Theta satisfies:

    \begin{bmatrix} f(\Theta) \\ f(\vec{\theta}) \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu(\Theta) \\ \mu(\vec{\theta}) \end{bmatrix}, \begin{bmatrix} k(\Theta, \Theta) & k(\Theta, \vec{\theta}) \\ k(\vec{\theta}, \Theta) & k(\vec{\theta}, \vec{\theta}) \end{bmatrix} \right),    (5.4)

where \mu(\cdot) and k(\cdot, \cdot) should be understood as element-wise operational functions. There are various definitions of Gaussian kernels, such as the Matérn kernel, the exponentiated quadratic kernel, and the radial basis function kernel [177]. In our experiments, we choose the radial basis function kernel, k(\vec{\theta}_i, \vec{\theta}_j) = \exp(-\frac{1}{2\sigma^2} \|\vec{\theta}_i - \vec{\theta}_j\|^2), where \sigma is a hyper-parameter that governs the kernel width. In our experiments, we let \sigma = 1.

The posterior distribution at an arbitrary beam \vec{\theta} can be calculated through standard Bayesian rules. Specifically, the distribution of f(\vec{\theta}) can be modeled as:

    f(\vec{\theta}) \sim p\left( f(\vec{\theta}) \,\middle|\, \vec{\theta}, \Theta, f(\Theta) \right) = \mathcal{N}\left( \mu(\vec{\theta}), \Sigma(\vec{\theta}) \right),    (5.5)

where

    \mu(\vec{\theta}) = k(\vec{\theta}, \Theta) \, k(\Theta, \Theta)^{-1} f(\Theta),    (5.6)

    \Sigma(\vec{\theta}) = k(\vec{\theta}, \vec{\theta}) - k(\vec{\theta}, \Theta) \, k(\Theta, \Theta)^{-1} k(\Theta, \vec{\theta}).    (5.7)
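Concretely, (5.5)-(5.7) with the RBF kernel can be sketched in a few lines of NumPy. The observed beams, their f values, and the small diagonal jitter (added for numerical stability) below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """k(theta_i, theta_j) = exp(-||theta_i - theta_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def gp_posterior(theta, Theta, f_Theta, sigma=1.0, jitter=1e-9):
    """Posterior mean (5.6) and variance (5.7) of f at a candidate beam theta."""
    K = rbf_kernel(Theta, Theta, sigma) + jitter * np.eye(len(Theta))
    k_star = rbf_kernel(theta[None, :], Theta, sigma)        # k(theta, Theta)
    K_inv = np.linalg.inv(K)
    mu = k_star @ K_inv @ f_Theta                            # (5.6)
    var = rbf_kernel(theta[None, :], theta[None, :], sigma) \
          - k_star @ K_inv @ k_star.T                        # (5.7)
    return mu.item(), var.item()

# Example: two observed beams (angle pairs in radians) and one candidate beam.
Theta = np.array([[0.2, -0.5], [0.1, 0.3]])
f_Theta = np.array([5.0, 0.1])        # measured objective values f(theta)
print(gp_posterior(np.array([0.12, 0.25]), Theta, f_Theta))
```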
Acquisition Function. There are different acquisition functions available for BayOpt problems, such as Probability of Improvement (PoI), Expected Improvement (EI), and the Gaussian process Upper Confidence Bound (GP-UCB) [177]. We choose EI for two reasons: i) compared to PoI, it has been shown to be better behaved; and ii) unlike GP-UCB, it does not involve tuning parameters [138]. The acquisition function can be written as:

    EI(\vec{\theta}) = \mathbb{E}\left[ \max\left(f(\vec{\theta}) - f(\vec{\theta}^+), 0\right) \right],    (5.8)

where \vec{\theta}^+ is the best beam found so far. Under the Gaussian process model, it can be analytically written as follows:

    EI(\vec{\theta}) = \left( \mu(\vec{\theta}) - f(\vec{\theta}^+) - \xi \right) \mathrm{CDF}(Z) + \Sigma(\vec{\theta})\, \mathrm{pdf}(Z),    (5.9)

where Z = \frac{\mu(\vec{\theta}) - f(\vec{\theta}^+) - \xi}{\Sigma(\vec{\theta})}, \mathrm{CDF}(\cdot) and \mathrm{pdf}(\cdot) are the cumulative distribution function and the probability density function of the standard normal distribution, respectively, and \xi is a parameter that determines the amount of exploration during the optimization. A large value of \xi leads to more exploration, while a small value leads to more exploitation. In our experiments, we empirically set \xi to 0.1.

Beam Selection. Then, in the tth iteration, the beam selected for evaluation is obtained by solving the following problem:

    \vec{\theta}_t = \arg\max_{\vec{\theta} \in \mathcal{B} \setminus \Theta} EI(\vec{\theta}),    (5.10)

where \mathcal{B} is the set of all predefined beams and \Theta is the set of beams that have been evaluated so far. It is worth noting that (5.10) is easy to solve because (5.9) is a simple, disciplined function.
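The following sketch implements the acquisition step, reusing gp_posterior from the sketch above. Two caveats: (5.8)-(5.9) are stated in the maximization convention, while f = max_n EVM_n is minimized here, so the sketch uses the minimization form of EI; and it uses the posterior standard deviation, as is common practice, where (5.9) writes \Sigma(\vec{\theta}). Both choices are our reading, not a verbatim transcription of (5.9).

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(theta, Theta, f_Theta, f_best, xi=0.1):
    """EI acquisition for beam search. Since f = max_n EVM_n is minimized,
    improvement means a lower posterior mean than the best value found so far."""
    mu, var = gp_posterior(theta, Theta, f_Theta)   # from the GP sketch above
    s = np.sqrt(max(var, 1e-12))                    # posterior standard deviation
    z = (f_best - mu - xi) / s
    return (f_best - mu - xi) * norm.cdf(z) + s * norm.pdf(z)

def select_next_beam(beambook, Theta, f_Theta, xi=0.1):
    """Beam selection per (5.10): argmax of EI over the unevaluated beams."""
    evaluated = {tuple(b) for b in Theta}
    pool = [b for b in beambook if tuple(b) not in evaluated]
    return max(pool, key=lambda b: expected_improvement(
        np.array(b), Theta, f_Theta, np.min(f_Theta), xi))
```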
5.5.3: Practical Considerations

There are two challenges associated with the above BayOpt framework when it is applied to beam search. In the following, we first point out the challenges and then present our solutions.

Limited number of evaluations. MmWave systems have a fixed airtime budget for beam search/training, which determines the maximum number of evaluations/iterations that can be performed before data transmission. In practice, given the limited airtime budget for beam search, it is unlikely that the optimal beam for data transmission will be found. Therefore, the beam search problem is further constrained by the number of evaluations. To address this challenge, we propose a recenter-and-shrink (RaS) scheme for the Gaussian process regression. This scheme was inspired by [144]. The basic idea is that, when approaching the evaluation budget, we recenter the search space on the current optimal beam and shrink the search space. Doing so increases the probability of finding a better beam by the time we reach the evaluation budget. Following this idea, we modify the acquisition rule in (5.10) to:

    \vec{\theta}_t = \arg\max_{\vec{\theta} \in \mathcal{B} \setminus \Theta} EI(\vec{\theta})
    \quad \text{s.t.} \quad \theta_m \in \begin{cases} [-\frac{\pi}{2}, \frac{\pi}{2}], & \text{if } 1 \le t < T/2, \\ [\theta_m^+ - \frac{\phi_t}{2},\; \theta_m^+ + \frac{\phi_t}{2}], & \text{if } T/2 \le t \le T, \end{cases}    (5.11)

where t is the iteration/evaluation index, T is the maximum number of evaluations, \vec{\theta}^+ = [\theta_m^+]_{m=1}^{M} is the best beam found so far, and \phi_t is the reduced search range. Empirically, we set \phi_t = (\frac{3}{2} - \frac{t}{T})\pi in our experiments.

Cubic Computational Complexity. The computational complexity of Gaussian process regression is cubic in the number of data samples, i.e., O(t^3), where t is the number of evaluations performed so far [177]. Clearly, the computation rapidly increases as the evaluation procedure evolves. To overcome the computational challenge of the Gaussian process, a wealth of sparse approximations have recently been suggested, such as the subset of data (SoD) approximation, the subset of regressors (SoR) approximation, the deterministic training conditional (DTC) approximation, and the partially independent training conditional (PITC) approximation [124]. In these methods, a subset of the latent variables is treated exactly while the remaining variables are treated approximately to reduce the computation. Here, we employ the SoR approximation for the beam search, as it demonstrates a good tradeoff between performance and computation (see Tables 8.1 and 8.2 in [124]). Denote \Phi as the subset of training data samples selected for exact regression, where \Phi \subset \Theta. Per [124], the Gaussian process regression can be characterized by the approximate mean and covariance as follows:

    \mu(\vec{\theta}) = \sigma^{-2} k(\vec{\theta}, \Phi) Q^{-1} k(\Phi, \Theta) f(\Theta),    (5.12)

    \Sigma(\vec{\theta}) = k(\vec{\theta}, \Phi) Q^{-1} k(\Phi, \vec{\theta}),    (5.13)

where Q = \sigma^{-2} k(\Phi, \Theta) k(\Theta, \Phi) + k(\Phi, \Phi).

A question to ask is how to select the active data samples for \Phi. Empirically, we define an integer \tau \in \mathbb{Z} that is smaller than t, and choose the \tau beams in \Theta that are closest to \vec{\theta}^+ as the active samples for \Phi. Denote g(\vec{\theta}) \triangleq \|\vec{\theta}^+ - \vec{\theta}\|^2 as the metric for \vec{\theta}. Based on this metric, we sort the elements in \Theta in non-decreasing order and denote the resulting vector as \Theta_{srt} = [\vec{\theta}_{s_1}, \vec{\theta}_{s_2}, \cdots, \vec{\theta}_{s_t}]. Then, we let:

    \Phi = [\vec{\theta}_{s_1}, \vec{\theta}_{s_2}, \cdots, \vec{\theta}_{s_\tau}].    (5.14)

Algorithm 2: Bayesian optimization for analog beam search.
1: Require: T, the budgeted number of evaluations.
2: Output: a beam \vec{\theta}^* in the predefined beambook \mathcal{B} for data packet reception at the AP.
3: Initialize \Theta = [\vec{0}].
4: for t = 1, 2, \cdots, T do
5:    Calculate \Phi using (5.14).
6:    Calculate \mu(\vec{\theta}) using (5.12) and \Sigma(\vec{\theta}) using (5.13).
7:    Construct the surrogate function EI(\vec{\theta}) using (5.9).
8:    Find the next beam direction \vec{\theta}_t by solving (5.11).
9:    Add \vec{\theta}_t to \Theta.
10: end for
11: return \vec{\theta}^* = \arg\min_{\vec{\theta} \in \Theta} f(\vec{\theta}).

With the approximations in (5.12)-(5.14), the computational complexity of the Gaussian process regression in the tth iteration decreases to O(\tau^2 t) [177]. More importantly, the complexity scales linearly (rather than cubically) with the number of iterations. We present the proposed BayOpt algorithm in Alg. 2. In a nutshell, it is a non-parametric online learning algorithm that guides the beam search using the posterior probability of the data samples that have been evaluated so far.
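To tie the pieces together, the following toy sketch mirrors Alg. 2, reusing gp_posterior and expected_improvement from the earlier sketches. For brevity it uses the full GP posterior (5.6)-(5.7) rather than the SoR approximation (5.12)-(5.13), and the beambook, budget, and synthetic objective are illustrative assumptions rather than testbed measurements.

```python
import numpy as np

def bayopt_beam_search(beambook, f, T=30):
    """Toy sketch of Alg. 2 with the recenter-and-shrink rule of (5.11)."""
    Theta, f_vals = [beambook[0]], [f(beambook[0])]        # arbitrary first probe
    for t in range(2, T + 1):
        th, fv = np.array(Theta), np.array(f_vals)
        best = th[fv.argmin()]                             # current theta+
        evaluated = {tuple(b) for b in Theta}
        pool = [b for b in beambook if tuple(b) not in evaluated]
        if t >= T / 2:                                     # RaS: recenter and shrink
            half = (1.5 - t / T) * np.pi / 2.0             # phi_t / 2
            shrunk = [b for b in pool if np.all(np.abs(b - best) <= half)]
            pool = shrunk or pool                          # fall back if window is empty
        theta_t = max(pool, key=lambda b: expected_improvement(
            np.array(b), th, fv, fv.min()))
        Theta.append(theta_t)
        f_vals.append(f(theta_t))
    return Theta[int(np.argmin(f_vals))]

# Toy run: a 2-antenna beambook on a 5-degree grid and a synthetic EVM surface.
grid = np.deg2rad(np.arange(-60, 61, 5))
beambook = [np.array([a, b]) for a in grid for b in grid]
f = lambda th: float((th[0] - 0.3) ** 2 + (th[1] + 0.2) ** 2)  # stand-in for max_n EVM_n
print(np.rad2deg(bayopt_beam_search(beambook, f)))
```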
5.6: Asynchronous MU-MIMO Detection

In this section, we first review the MMSE MU-MIMO detector, and then present a transformation of the MMSE MU-MIMO detector so that it can decode asynchronous data packets. The resulting detector fundamentally relaxes the inter-user synchronization requirement for uplink MU-MIMO and is thus particularly suited to mmWave communications. Finally, we analyze the performance of the proposed detector in mmWave networks.

5.6.1: Conventional (Synchronous) MMSE MU-MIMO Detector

Consider the uplink MU-MIMO transmission from N stations to an M-antenna AP as shown in Fig. 5.1. Suppose that the data packets from the N stations are perfectly aligned in time when impinging on the AP. Then, the signal transfer model in the digital domain can be written as:

    \vec{y} = H \vec{x} + \vec{w},    (5.15)

where \vec{y} \in \mathbb{C}^{M \times 1} is the received digital baseband signal vector at the AP, \vec{x} = [x_1, x_2, \cdots, x_N]^\top is the transmit signal vector with x_n being the signal of the nth station, \vec{w} \in \mathbb{C}^{M \times 1} is the noise vector, and H = [H_{mn}]_{1 \le m \le M, 1 \le n \le N} \in \mathbb{C}^{M \times N} is the compound channel between the N stations and the AP.

To decode the N data packets, the AP can first estimate the compound channel using the orthogonal pilots (a.k.a. reference signals) in the N data packets and then construct the MMSE MIMO detector as follows:

    P = H^H \left( H H^H + \frac{\sigma_w^2}{\sigma_x^2} I \right)^{-1},    (5.16)

where I is an identity matrix of proper dimension, \sigma_x^2 is the signal power, and \sigma_w^2 is the noise power. After constructing the MMSE detector, the AP can perform MU-MIMO detection as \hat{\vec{x}} = P \vec{y}, where \hat{\vec{x}} is an estimated copy of \vec{x}.

Conventional MU-MIMO detectors work only when the data packets are well aligned in time. Roughly speaking, the time misalignment of the data packets must be less than the duration of an OFDM symbol's cyclic prefix [27]. For example, in 802.11ay, the time misalignment must be less than 36.36 ns [5]. In real systems, this requirement is extremely hard to satisfy, as many factors (e.g., propagation delays, digital processing delays, and clock jitter) contribute to the time misalignment. For this reason, uplink MU-MIMO is not standardized in IEEE 802.11ac (sub-6GHz) [4] or 802.11ay (60GHz) [5].³

³ Note that 802.11ax is the only WLAN standard that supports uplink MU-MIMO. Yet, there is still no 802.11ax product that supports this feature. In addition, 802.11ax is more of a centralized than a distributed network.

Figure 5.4: An illustration of the received asynchronous packets from multiple stations at the AP in an 802.11ay WLAN.

5.6.2: A Transformation of the MMSE MU-MIMO Detector

Since it is hard to maintain the time alignment of the data packets at the AP, we wish to design a MIMO detector for the AP that can decode misaligned data packets, as shown in Fig. 5.4. In this case, if the AP knew the MMSE MIMO detector P in (5.16), it would still be able to decode those asynchronous data packets. This is because P is a spatial filter, and its effectiveness is not affected by temporal imperfections (i.e., the time misalignment of data packets). In other words, the spatial and temporal properties of the data packets are orthogonal to each other. The key question is how to obtain P when the AP receives asynchronous data packets. In synchronous MU-MIMO, the data packets from different stations carry orthogonal pilots for the AP to estimate the channel matrix H, based on which the AP can calculate P using (5.16). In asynchronous MU-MIMO, the data packets from different stations cannot maintain the orthogonality of their pilots. As a result, the AP cannot estimate the channel H, and thus (5.16) does not work in this case.

To overcome this challenge, we show that a transformation of the MMSE detector in (5.16) can eliminate the need for the channel knowledge H and obtain an approximation of P, which allows the AP to decode the asynchronous data packets separately. Denote R_n\{\cdot\} as the nth row of a matrix or a vector. Per the conventional MMSE detection, we have

    \hat{x}_n = R_n\{\hat{\vec{x}}\} = R_n\{P \vec{y}\} = R_n\{P\} \vec{y}.    (5.17)

Denote R_x as the correlation matrix of \vec{x}, i.e., R_x = E[\vec{x}\vec{x}^H], and R_w as the correlation matrix of \vec{w}, i.e., R_w = E[\vec{w}\vec{w}^H]. In practice, signal and noise are always independent. Then, we have R_x = \sigma_x^2 I and R_w = \sigma_w^2 I.
Per (5.16), we have

    R_n\{P\} = R_n\left\{ H^H \left(H H^H + \frac{\sigma_w^2}{\sigma_x^2} I\right)^{-1} \right\}
    \overset{(a)}{=} R_n\left\{ R_x H^H (H R_x H^H + R_w)^{-1} \right\}
    = R_n\left\{ E[\vec{x}\vec{x}^H] H^H \left(H E[\vec{x}\vec{x}^H] H^H + E[\vec{w}\vec{w}^H]\right)^{-1} \right\}
    = E[R_n\{\vec{x}\vec{x}^H H^H\}] \, E[H\vec{x}\vec{x}^H H^H + \vec{w}\vec{w}^H]^{-1}
    = E[R_n\{\vec{x}\}\vec{x}^H H^H] \, E[H\vec{x}\vec{x}^H H^H + \vec{w}\vec{w}^H]^{-1}
    = E[x_n (H\vec{x})^H] \, E[(H\vec{x}+\vec{w})(H\vec{x}+\vec{w})^H]^{-1}
    \overset{(b)}{=} E[x_n (H\vec{x}+\vec{w})^H] \, E[(H\vec{x}+\vec{w})(H\vec{x}+\vec{w})^H]^{-1}
    = E[x_n \vec{y}^H] \, E[\vec{y}\vec{y}^H]^{-1},    (5.18)

where (a) and (b) follow from the assumptions that R_x is of full rank and that E[x_n \vec{w}] = 0, respectively. Both assumptions are always valid in practice.

Eq. (5.18) shows that the MMSE detector can be computed without channel knowledge, using only E[x_n \vec{y}^H] and E[\vec{y}\vec{y}^H]. The question is then how to compute these two terms. In UMMC, we use sample averaging to approximate the statistical expectations, based on the fact that every packet in a practical system carries reference signals (a.k.a. pilots or preamble) for signal detection. Consider the 802.11ay frame shown in Fig. 5.4, for example. The reference signals include L-STF, L-CEF, EDMG-STF, and EDMG-CEF, which are predefined and known to all stations and APs. These reference signals are used to compute E[x_n \vec{y}^H] and E[\vec{y}\vec{y}^H] in (5.18). In the following, we slightly abuse the notation by introducing l as the index of an OFDM symbol and k as the index of an OFDM subcarrier. Denote A_n(k) as the set of reference symbols (pilots) in the data packet transmitted by station n on OFDM subcarrier k. Then, we have

    \vec{p}_n(k) \triangleq R_n\{P(k)\} \overset{(5.18)}{=} E[x_n \vec{y}^H] E[\vec{y}\vec{y}^H]^{-1}
    = \left[ \sum_{(l,k') \in A_n(k)} x_n(l,k') \, \vec{y}(l,k')^H \right] \left[ \sum_{(l,k') \in A_n(k)} \vec{y}(l,k') \, \vec{y}(l,k')^H \right]^{\dagger},    (5.19)

where (\cdot)^{\dagger} is the pseudo-inverse operator, and x_n(l,k') and \vec{y}(l,k') are the transmitted and received reference signals on OFDM symbol l and subcarrier k', respectively.

With the MU-MIMO detector in (5.19), the AP decodes the data packet from station n as \hat{x}_n(l,k) = \vec{p}_n(k)^\top \vec{y}(l,k), where \vec{y}(l,k) is the received payload signal vector at the AP and \hat{x}_n(l,k) is the decoded payload signal from station n, 1 \le n \le N.

5.6.3: Performance Analysis and Discussion

Performance Analysis. Since analyzing the performance of the proposed detector in general settings is extremely hard, we focus on an ideal case. Suppose that the number of reference signals (e.g., pilots in L-STF, L-CEF, EDMG-STF, and EDMG-CEF in Fig. 5.4) is greater than or equal to the number of stations, i.e., |A_n(k)| \ge N. Then, we have the following lemma.

Lemma 1: If M \ge N and \sigma_w = 0, the MU-MIMO detector in (5.19) can perfectly recover the misaligned signals from the asynchronous stations, i.e., \hat{x}_n(l,k) = x_n(l,k) for 1 \le n \le N, 1 \le k \le K, and 1 \le l \le L.

Proof Sketch. We omit the subcarrier index k to simplify the notation. Given that M \ge N, H is a square or tall matrix. Then, based on (5.19), we have:

    \vec{p}_n \overset{(a)}{=} \left[ \sum_{l \in A_n} x_n(l) \vec{y}(l)^H \right] \left[ \sum_{l \in A_n} \vec{y}(l) \vec{y}(l)^H \right]^{\dagger}
    \overset{(b)}{=} \left[ \sum_{l \in A_n} x_n(l) \vec{x}(l)^H H^H \right] \left[ \sum_{l \in A_n} H \vec{x}(l) \vec{x}(l)^H H^H \right]^{\dagger}
    \overset{(c)}{=} R_n\{\hat{R}_x\} H^H \left[ H \hat{R}_x H^H \right]^{\dagger}
    \overset{(d)}{=} R_n\{ H^{\dagger} \},    (5.20)

where (a) follows from (5.19) by omitting the subcarrier index k; (b) follows from the fact that \vec{y} = H\vec{x} when \sigma_w = 0; (c) follows from our definition \hat{R}_x = \sum_{l \in A_n} \vec{x}(l)\vec{x}(l)^H; and (d) follows from the facts that H is a square or tall matrix since M \ge N and that \hat{R}_x is a square matrix of full rank since |A_n(k)| \ge N. Based on (5.20), we have \hat{x}_n(l,k) = \vec{p}_n(k) \vec{y}(l,k) = R_n\{H(k)^{\dagger}\} \vec{y}(l,k) = x_n(l,k). ■

In practice, the assumptions |A_n(k)| \ge N and M \ge N are typically valid, but \sigma_w \ne 0. For this realistic case, we evaluate the detector through experiments in Section 5.7.1.
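To make the construction in (5.19) concrete, here is a minimal NumPy sketch of the pilot-trained detector for a single subcarrier, compared against a genie MMSE detector (5.16) that is given the channel. The array dimensions, pilot length, noise level, and channel draw are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, L_p = 4, 2, 16                    # AP RF chains, stations, pilot symbols
H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
sigma_w = 0.05

# Known unit-modulus reference symbols per station over L_p pilot positions.
X_p = np.exp(1j * 2 * np.pi * rng.random((N, L_p)))
W_p = sigma_w * (rng.standard_normal((M, L_p)) + 1j * rng.standard_normal((M, L_p)))
Y_p = H @ X_p + W_p                     # received (possibly interfered) pilots

# Pilot-trained detector (5.19): p_n = [sum x_n y^H][sum y y^H]^dagger, no H needed.
Ryy = Y_p @ Y_p.conj().T                # sample estimate of E[y y^H]
P_hat = np.vstack([(X_p[n] @ Y_p.conj().T) @ np.linalg.pinv(Ryy) for n in range(N)])

# Genie MMSE detector (5.16), which requires the channel H, for comparison.
P_mmse = H.conj().T @ np.linalg.inv(H @ H.conj().T + sigma_w**2 * np.eye(M))

x = np.exp(1j * 2 * np.pi * rng.random((N, 1)))     # payload symbols
y = H @ x + sigma_w * (rng.standard_normal((M, 1)) + 1j * rng.standard_normal((M, 1)))
print(np.abs(P_hat @ y - x).max(), np.abs(P_mmse @ y - x).max())
```

Per Lemma 1, with sigma_w = 0 and at least N pilot symbols, P_hat recovers the payload exactly; with noise, its residual error is close to that of the genie MMSE detector.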
Explicit Channel Knowledge is Not Needed. It is evident that the MU-MIMO detector in (5.19) does not require explicit channel knowledge H for packet detection. Instead, it uses the reference signals in the data packets to compute the detector for each individual data stream. As such, this MU-MIMO detector is particularly suitable for an AP decoding asynchronous data packets, which the conventional MMSE detector is incapable of doing.

Unique Features of mmWave MU-MIMO. MmWave communication systems are typically equipped with directional antennas (e.g., phased-array antennas), which significantly reduce the multipath effect of channels. As a result, mmWave channels are more frequency-flat than sub-6GHz channels. In addition, compared to SISO mmWave WLANs (e.g., 802.11ad), MU-MIMO mmWave networks (e.g., 802.11ay) have pilots in both the legacy preamble (L-STF and L-CEF) and the enhanced preamble (EDMG-STF and EDMG-CEF); see Fig. 5.4. Lemma 1 shows that these two properties make the proposed asynchronous MMSE detector particularly suitable for 802.11ay networks.

5.7: Performance Evaluation

5.7.1: Experimental Results (Two-User MIMO Case)

Implementation. We built a 60 GHz mmWave MU-MIMO testbed comprising an AP and two stations, as shown in Fig. 5.5. The AP was built using two HMC6300 boards (60 GHz RF frontends) and one USRP X310. We modified the clock circuits of the two HMC6300 boards to synchronize their clocks for MU-MIMO applications. The AP was equipped with two planar antennas, each with 4 × 8 patch elements. Each station was built using one HMC6300 board and one USRP X310, and was connected to a horn antenna. The two stations worked independently, and there was no external clock to synchronize their packet transmissions. The instantaneous bandwidth of this MU-MIMO testbed is 100 MHz. We used GNU Radio out-of-tree (OOT) modules (in C++) to implement the signal processing modules of a simplified 802.11ay PHY layer (512-point FFT for OFDM modulation, QPSK, without LDPC codes) for the uplink MU-MIMO transmission. A demo video can be found in [29].

Figure 5.5: Illustration of our prototype and experimental setup: (a) two-antenna AP; (b) experimental setup.

Figure 5.6: Bits carried per sample, γ(EVM), and the corresponding throughput versus EVM, as specified in the IEEE 802.11ay standard [5].

Experimental Setting. We consider three indoor scenarios for our experiments. Scenario 1: short distance (2.5 m) for both stations. Scenario 2: long distance (5 m) for both stations. Scenario 3: short distance (2.5 m) for station 1 and long distance (5 m) for station 2.

Performance Metrics. We use EVM and throughput as the performance metrics. EVM, defined in (5.2), is widely used for performance measurement of wireless receivers in industry. Based on the measured EVM, we calculate the throughput of 802.11ay networks as follows:

    r_n = B \cdot \frac{\tau_{ofdm}}{\tau_{gi} + \tau_{ofdm}} \cdot \frac{N_{data}}{N_{fft}} \cdot \gamma(EVM_n),

where B = 2.64 GHz is the sampling rate, \tau_{gi} = 36.36 ns is the normal guard interval, \tau_{ofdm} = 194.56 ns is the OFDM symbol duration, N_{data} = 336 is the number of data subcarriers, N_{fft} = 512 is the FFT size, and \gamma(EVM_n) is the adaptive rate specified by [5] and shown in Fig. 5.6.
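The rate computation above can be sketched as follows. The staircase inside gamma() is a hypothetical stand-in for the EVM-to-bits mapping of Fig. 5.6, since the exact table is defined in [5]; only the timing and subcarrier parameters are taken from the text.

```python
def gamma(evm_db):
    """Hypothetical stand-in for the 802.11ay EVM-to-bits staircase of Fig. 5.6."""
    # (EVM threshold in dB, bits carried per sample); illustrative values only.
    table = [(-27.5, 6), (-22.5, 4), (-17.5, 2), (-12.5, 1)]
    for thr, bits in table:
        if evm_db <= thr:
            return bits
    return 0

def throughput_gbps(evm_db, B=2.64e9, tau_gi=36.36e-9, tau_ofdm=194.56e-9,
                    n_data=336, n_fft=512):
    """r_n = B * tau_ofdm/(tau_gi+tau_ofdm) * N_data/N_fft * gamma(EVM_n)."""
    return B * tau_ofdm / (tau_gi + tau_ofdm) * n_data / n_fft * gamma(evm_db) / 1e9

print(throughput_gbps(-20.0))   # ~2.9 Gbps with the assumed staircase
```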
Figure 5.7: Constellation diagrams of the decoded signals at the AP: (a) STA 1's signal in MU-MIMO; (b) STA 2's signal in MU-MIMO; (c) STA 1's SIMO signal; (d) STA 2's SIMO signal.

Figure 5.8: Comparison of our proposed asynchronous MU-MIMO technique and the conventional SIMO technique: (a) EVM; (b) throughput.

Recall that our objective is to maximize the minimum of the users' throughputs. Therefore, we denote EVM = max(EVM_1, EVM_2) and Throughput = min(r_1, r_2).

Asynchronous MU-MIMO Detection. We first validate the feasibility of the proposed asynchronous MU-MIMO detector on the testbed, where the two stations continuously transmit data packets but have no synchronization mechanism. For both the AP and the stations, we perform exhaustive search to find their best analog beams. Fig. 5.7(a-b) shows the two constellation diagrams observed at the AP. It is clear that the proposed detector is able to decode the data packets in the absence of inter-station synchronization.

As a comparison baseline, we also implemented a single-input and multiple-output (SIMO) transmission scheme on the testbed in the same settings. In this case, each station uses half of the airtime for packet transmission in turn (i.e., TDMA mode). When serving each station, the AP selects its best antenna to decode that station's data packets. Fig. 5.7(c-d) shows the two constellation diagrams observed at the AP. It can be seen that the AP observes similar constellation diagrams in the two cases. This reveals the effectiveness of our proposed MU-MIMO detector in decoding asynchronous data packets.

We repeated the above tests in all three scenarios to quantify the EVM and throughput of the two techniques (Async MU-MIMO and SIMO). Fig. 5.8(a) shows the EVM comparison. It shows that the two techniques have similar EVM. This is a bit surprising, because in theory SIMO should offer a better (e.g., 3 dB) EVM performance than Async MU-MIMO. We conjecture that this is caused by the non-negligible phase noise of 60GHz mmWave RF devices. Phase noise increases linearly with carrier frequency in communication systems. When phase noise is strong, it dictates the communication performance and marginalizes the differences caused by other factors. Fig. 5.8(b) shows the throughput comparison. It can be seen that Async MU-MIMO almost doubles the throughput of SIMO. This is because the AP can only serve the stations in turn in SIMO, while Async MU-MIMO allows the AP to serve both stations simultaneously.

Impact of MU-MIMO Channel Correlation. For the two antennas at the AP, we consider two approaches for beam search: i) exhaustive separate search, and ii) exhaustive joint search. In the separate search, each individual antenna finds the beam angle that maximizes its signal strength. In the joint search, the two antennas try all possible beam combinations to find the one that maximizes the bottleneck of the user data rates. Table 5.2 shows our experimental results.

Table 5.2: Comparison of exhaustive separate beam search and exhaustive joint beam search.

                                 Scenario 1          Scenario 2          Scenario 3
  Search approach                Joint    Separate   Joint    Separate   Joint    Separate
  Best angle for ant 1 (θ*_1)    -30°     -45°       -30°     -13°       30°      -47°
  Best angle for ant 2 (θ*_2)    15°      -23°       -30°     -24°       30°      25°
  EVM (dB)                       -16.2    -13.0      -13.7    -10.0      -14.5    -10.1
  Throughput (Gbps)              3.65     2.28       2.28     0.91       2.37     1.46
It is clear that the joint and separate search approaches lead to different beam results. Consider Scenario 1, for example. When using separate beam search, the optimal angle is -45° for antenna 1 and -23° for antenna 2. This combination is optimal in terms of the signal strength at each individual antenna, but it is not optimal in terms of user throughput. For the joint search approach, the combination (-30°, 15°) yields the best EVM and thus the best user throughput. Similar phenomena can also be observed in Scenarios 2 and 3. This confirms that signal strength is not a good criterion for beam search in MU-MIMO mmWave systems.

Figure 5.9: Comparison of BayOpt and exhaustive search: (a) EVM; (b) throughput.

Figure 5.10: Throughput comparison of separate beam search, BayOpt beam search, and exhaustive search: (a) two-user MIMO case; (b) three-user MIMO case; (c) four-user MIMO case.

BayOpt Search versus Exhaustive Search. Using the proposed MU-MIMO detector, we compare two joint beam search approaches: BayOpt search and exhaustive search. For exhaustive search, we search the beams for each antenna every 5 degrees over the range from -60° to 60°, so the total number of beam combinations is (120/5 + 1)² = 625. Fig. 5.3 shows an instance of the exhaustive search results. For BayOpt search, we fix the number of search iterations (evaluations) to 20. Therefore, the overhead of BayOpt search is only 3.2% of that of exhaustive search. Fig. 5.9 shows the comparison of these two joint beam search approaches in the three scenarios. It can be seen that BayOpt achieves a similar EVM and throughput performance to exhaustive search. More accurately, BayOpt achieves 94.3% of the throughput of exhaustive search. It is important to point out that the throughput in Fig. 5.9(b) does not take into account the airtime overhead of beam training. If the beam training overhead were taken into consideration, BayOpt would easily outperform exhaustive search.

5.7.2: Simulation Results (More-User MIMO Case)

Due to hardware limitations, we resort to simulations to evaluate BayOpt in MIMO cases with more users. We consider a 400 ft² conference room where the AP is deployed on a wall and 100 users are uniformly and randomly distributed over the whole room. We use the model in [109] to calculate the path loss based on the distance between a user and the AP, and the model in [129] to generate the gain of the phased-array antennas for a given direction. In each time slot, the AP randomly selects N users for uplink MU-MIMO transmission, where N ∈ {2, 3, 4} as defined in 802.11ay. In the simulations, we focus on the comparison of three different beam search approaches (separate exhaustive search, joint exhaustive search, and BayOpt) without considering the packet misalignment issue.
An ideal MMSE detector is used to decode the concurrent packets and calculate their EVM and throughput.

We present the simulation results in Fig. 5.10. Compared to separate exhaustive search, BayOpt-30 (BayOpt with 30 iterations) has a similar airtime overhead (30 vs. 33 iterations), but it improves the throughput by 95.8%, 109.8%, and 267.2% in the two-user, three-user, and four-user cases, respectively. Compared to joint exhaustive search, BayOpt-50 achieves 88.6% of the throughput with 4.6% of the overhead in the two-user MIMO case, 82.1% of the throughput with 0.1% of the overhead in the three-user MIMO case, and 83.5% of the throughput with 0.004% of the overhead in the four-user MIMO case. Note that the throughput presented in Fig. 5.10 does not take into account the beam search overhead.

5.8: Summary

In this chapter, we presented a practical yet efficient uplink MU-MIMO communication (UMMC) scheme for mmWave networks. This scheme has two key components: BayOpt for beam search and asynchronous MU-MIMO detection. UMMC provides the first BayOpt framework for beam search in mmWave MU-MIMO systems and introduces a new MU-MIMO detector that can decode asynchronous data packets from multiple users. We have demonstrated through both theory and experiments that fine-grained inter-user synchronization is not needed for uplink MU-MIMO transmission. We have evaluated the performance of UMMC through a blend of experiments and simulations. Experimental and simulation results confirm the practicality and efficiency of UMMC.

CHAPTER 6: TEMPORAL BEAM PREDICTION FOR MOBILE MMWAVE NETWORKS

6.1: Introduction

Millimeter-wave (mmWave) technology is expected to serve as a foundation for 5G and future wireless networks, enabling the vision of a smart society and a digitized physical world. It offers ultra-low-latency, multi-gigabit-per-second (Gbps), and scalable wireless connectivity for emerging applications such as virtual reality (VR), cloud-based real-time artificial intelligence (AI), and high-resolution video streaming [131]. In mmWave networks, devices commonly rely on analog beamforming to mitigate the effects of high path loss. In practice, a set of beam angles is predefined, and the analog beamforming operation for a mmWave device amounts to selecting the beam index that maximizes the signal strength at a receiver. When candidate beams are probed sequentially and exhaustively, the beam search procedure incurs significant airtime overhead. Unfortunately, most existing mmWave devices perform beam search in this exhaustive manner, which leads to poor scalability in environments with high user density [25].

To reduce the airtime overhead of beam selection, different approaches have been studied for the management of beam search, such as out-of-band CSI-assisted beam selection [21, 64, 118, 148], compressive sensing [65, 127, 128], hierarchical beam search [24, 92, 157, 194], and learning-based beam search [23, 40, 58, 100, 121, 130]. These approaches have demonstrated great success in accelerating beam selection and reducing its airtime overhead. However, most prior efforts focus on beam search optimization in a snapshot of the network by exploiting spatial features of mmWave channels. The exploitation of the temporal correlation of mmWave channels for beam selection remains limited.
In this chapter, we exploit the temporal channel correlation of a mobile mmWave device to predict its future beam direction based on its historical beam selection profile, with the aim of reducing the airtime overhead of beam search in mobile mmWave networks. Specifically, we present a Temporal Beam Prediction (TBP) scheme that lets a mobile mmWave device estimate its future beam direction from its historical beam selection profile. TBP was motivated by the recent success of pedestrian trajectory prediction [20, 60, 186], for which recurrent neural network (RNN) models have demonstrated great potential for accurate prediction. TBP borrows this idea by using LSTM [69] as the model for beam prediction in mobile mmWave networks. LSTM has proved efficient and effective in capturing the dynamic pattern of beam directions by retaining information about past directions. Its internal memory can encapsulate details pertaining to the environment and user mobility, thereby enabling accurate prediction of the future beam direction. Specifically, TBP asks each mmWave device to record the beam angles adopted by its past packets, and uses this past beam angle profile to predict the best beam angles for its current or next packet transmission. As expected, the predicted beam angle might not be the best, so a beam refinement algorithm is employed to find the best one through a local search.

Compared to trajectory prediction, beam prediction poses two new challenges. First, the past data samples (historical beam selection results) are non-uniform over time due to the bursty nature of data traffic. For example, the time interval of VoIP traffic ranges from 5 ms to 40 ms [137], depending on the voice intensity. The non-uniformity of data samples calls for a new LSTM model that can take the data timestamps into account for beam angle prediction. Second, unlike a movement trajectory, the best-beam angle trajectory may exhibit sharp changes due to the multipath effect of wireless channels and the imperfect radiation patterns of phased-array antennas. Consequently, the best-beam angle trajectory is typically non-smooth over time, making accurate prediction challenging.

To address these two challenges, TBP proposes a mobility-aware LSTM (mLSTM) model for beam angle prediction. The novelty of mLSTM lies in the new structure of its cells, which take both the data samples (historical beam angles) and their timestamps as input to predict future beam angles. This is in sharp contrast to traditional LSTM, which does not consider timestamps. The inclusion of data timestamps makes it possible for the mLSTM model to extract time-dependent features, which is critical for improving the beam prediction accuracy. In addition, TBP employs an adversarial learning structure to extract user-independent features for beam prediction. The combination of a CNN-based feature extractor, an mLSTM-based beam angle predictor, and an adversarial-learning discriminator proves to be an efficient model for temporal beam prediction in mobile mmWave networks.

In addition, a wireless device may be equipped with both mmWave and sub-6GHz radios for its communications. For such a case, we enhance the design of TBP by leveraging the out-of-band channel state information (CSI) from the co-located sub-6GHz radio to improve the accuracy of mmWave beam prediction. The key challenge here stems from the heterogeneity of the data samples from the two radios.
Specifically, the mmWave radio generates beam indices (i.e., beam angles), while the sub-6GHz radio generates channel coefficients (or a channel matrix). Per our experiments, simply concatenating the data samples from the two radios as the input for beam prediction yields inferior performance. To address this challenge, TBP converts the sub-6GHz CSI to the corresponding beam angles by exploiting their inherent spatial relations. The converted beam angles are then combined with the mmWave beam angles for training and inference. Such a CSI-assisted learning model is particularly useful in cases where a mmWave radio does not have sufficient data samples for prediction (e.g., when a mmWave radio has just woken up from sleep mode).

We have built a prototype of TBP on a software-defined radio (SDR) 60GHz mmWave testbed. The mmWave device is equipped with a planar antenna with 4×8 patch antenna elements and installed on a building's ceiling. We evaluated the performance of TBP in four scenarios: lab, conference room, hallway, and apartment. Experimental measurements show that the average prediction error of TBP is less than 7 degrees for both beam azimuth and elevation angles in most of our studied cases. Experimental measurements also show that the utilization of out-of-band CSI can further improve the beam prediction accuracy for a mmWave device. Based on the measurement results, we simulate the throughput gain of TBP in a mobile mmWave network with representative traffic settings. The simulation results show that TBP can improve the throughput by more than 60% compared to existing approaches in all four scenarios.

The contributions of this chapter are summarized as follows.

• To the best of our knowledge, TBP is the first system-focused beam prediction scheme that exploits the temporal correlation of mmWave channels along a device's movement trajectory to reduce the airtime overhead of beam search.

• TBP proposes a deep-learning network structure with new LSTM cells, which is capable of accommodating non-uniform, non-smooth data samples for accurate prediction. It also leverages out-of-band CSI to improve the beam prediction accuracy.

• TBP has been evaluated on a 60GHz mmWave testbed. Extensive experimental measurements confirm its effectiveness for beam prediction.

6.2: Related Work

We review prior work in the following categories.

Learning-based Beam Search. Pioneering works have predicted the beam direction using deep learning. [23, 58] studied the beamforming problem in highly mobile mmWave systems. Their primary emphasis is on vehicular communications in outdoor scenarios featuring multiple base stations. They demonstrate the capability to predict the optimal beam by jointly analyzing the uplink signals received at these stations. However, it is noteworthy that the network traffic pattern for vehicular communications differs significantly from that of indoor scenarios. Vehicular environments involve high-speed mobility, but the trajectory dynamics are less pronounced compared to indoor scenarios. Consequently, these studies exclusively employ traditional deep neural networks without considering the time interval.

In communication scenarios beyond vehicular contexts, [130] specifically targets indoor environments. This study takes into account the orientation of the user's device at each location, employing spatial features in 3D space to train the neural network and align the beam.
DeepIA [40] trains a deep learning model to predict the best beam based on the received signal strength (RSS) obtained from a subset of beams. However, this approach still necessitates scanning a subset of beams each time, resulting in significant overhead for mobile mmWave networks. [100] is the work most relevant to ours. It proposed a deep learning-based beam tracking scheme for mobile mmWave devices. Unlike our work, which primarily relies on the user's moving trajectory to predict the optimal beam direction, that study focuses on estimating the dynamic channels resulting from the user's minor motion. Additionally, it incorporates inertial measurement unit (IMU) sensor measurements as additional input data. The works mentioned above do not consider practical issues such as non-uniform and non-smooth data samples, and they are evaluated through simulation. In contrast, TBP is a practical design and is evaluated via realistic experiments.

DeepBeam [121] is the only such work that has been implemented on a testbed. It listens to the data transmissions between the AP and other users, learning unique patterns from the in-phase/quadrature (I/Q) representation of the waveform. This allows it to predict the beam utilized by the transmitter and the AoA on the receiver side. However, it relies heavily on other devices in the environment and primarily focuses on leveraging spatial features, a distinction from TBP. [73] improves this method by proposing a deep regularized waveform learning (DRWL) strategy, which demonstrates the ability to predict beams even with limited samples, an advancement beyond its predecessor. The work in [81] focuses on beam tracking in UAV-mmWave communication, modeling the problem as a multi-variable Gaussian process and using the Gaussian Process Machine Learning (GPML) method to address it. However, UAVs typically follow a predefined path with an existing LoS path for communication, which distinguishes that setting from ours.

Out-of-Band Assistance for Beam Search. Our work is also related to research in this area. Table 6.1 compares our work with prior works. The works [21, 64, 118, 148] harness the Wi-Fi band to infer beam directions for mmWave communications. Most of them use linear antenna arrays for beam search and are limited to beam search in a 2D plane. MUST [148] studied beam search in 3D space using an array of three antennas; it achieves 71.2% prediction accuracy and a 10° beam tracking error. [35] proposes a self-supervised deep learning approach that directly maps sub-6GHz CSI to mmWave beams.

Table 6.1: Out-of-band beamforming for mmWave devices.

  Ref.          Out-of-Band   Sensor or antenna array   2D/3D   Error range
  BBS [118]     Wi-Fi band    8 antennas                2D      4°
  [64]          sub-6GHz      4 antennas                2D      10°
  MUST [148]    Wi-Fi band    3 antennas                3D      10°
  [21]          sub-6GHz      8 antennas                2D      N/A
  LiSteer [62]  Light         Light sensor array        3D      3.5°
  [135]         Images        Camera                    3D      N/A
  TBP           Wi-Fi band    5 antennas                3D      7°

TBP was inspired by these works but uses the out-of-band CSI in a different way. It converts the sub-6GHz CSI to the best-beam azimuth and elevation angles based on historical sub-6GHz CSI and mmWave beam angles. Besides leveraging out-of-band information from the radio side, LiSteer [62] determines the mmWave beamforming direction by tracking indicator LEDs on APs. Since it relies on light sources, LoS is required for this solution.
Taking a further step, [135] feeds images captured by the transmitter and receiver into a deep neural network, identifying beam directions based on these images. However, both of these methods require an additional sensor installed on the AP, which may not be practical for mmWave systems.

Compressive Sensing. In addition to out-of-band beamforming, compressive sensing techniques have been studied for mmWave beamforming in order to reduce the beam search overhead [65, 127]. In [127], compressive sensing is directly applied to beamforming, but it relies on accurate phase measurements. In practice, accurate phase information may not be available due to carrier frequency offset. Agile-Link [65] hashes the beam directions and utilizes a voting mechanism to recover the directions. It can identify the best path by tracking the change of energy across different bins in a logarithmic number of measurements. The works in this area focus on exploiting the spatial correlation of wireless channels to facilitate beam search. In contrast, TBP exploits the temporal channel correlation for beam prediction.

Hierarchical Beam Search. To accelerate the beam search process, some works have formulated the beam search problem as an optimization problem [92, 157, 194]. Sophisticated algorithms have been developed to search the possible beams in a hierarchical manner so as to minimize the search time. In practice, the signals from different paths may cancel each other, leading to inaccurate power estimation in the beam search process. Moreover, these works are all based on simulation and do not consider practical issues such as non-ideal antenna radiation and the multipath effect. Clearly, TBP differs from this research line.

Model-based Beam Forecast. [214] studied model-based beam forecasting by utilizing the spatial correlation of 60GHz near-field channels to predict the future channel when the Tx/Rx moves. It relies on an anchor point's channel profile to reconstruct the channel profiles for nearby points. However, when a user moves too far from the anchor point, beam scanning is still triggered.

6.3: Problem Description

We consider a mmWave communication network as shown in Fig. 6.1, where an access point (AP) is installed on the ceiling of a building to serve a set of stationary or mobile stations (user devices). To combat the high path loss, analog beamforming is adopted at both the AP and station sides for signal energy steering. In practice, a set of beam angles is typically predefined for selection. As such, analog beamforming is equivalent to selecting the best azimuth and elevation beam angles for a mobile device's phased-array antenna. In what follows, we first briefly introduce the beam search approaches in 802.11ad and 5G NR, and then present our design objective.

Figure 6.1: Illustration of temporal beam prediction (TBP) for both a mmWave AP and a station (user device).

Figure 6.2: The structure of the beacon interval (BI) in 802.11ad.

6.3.1: Preliminaries

Beam Search in IEEE 802.11ad/ay. IEEE 802.11ad [6] is a 60GHz mmWave communication standard. Beamforming training in 802.11ad comprises two phases: i) Sector Level Sweep (SLS), and ii) Beam Refinement Protocol (BRP). Fig. 6.2 shows the beacon interval (BI) structure in 802.11ad. The SLS takes place in the beacon header interval (BHI), while the BRP takes place in the data transmission interval (DTI).
In the SLS phase, the user device configures its antenna to an omnidirectional radiation pattern, while the AP sweeps its beam over all possible directions. At the end of this process, the user device identifies the AP's best beam index and reports it to the AP. After the AP's best beam index has been identified, a similar operation is performed on the user side to find the user device's best beam index. More specifically, the AP uses its identified best beam, and the user device sweeps its beam over all possible directions. In the end, the AP finds the user device's best beam index and sends it to the user. SLS is mandatory in IEEE 802.11ad, and its beam training process takes about 1.54 ms for a 7-degree beamwidth [118].

While the SLS phase is mandatory, the BRP phase is optional in 802.11ad. In this phase, the beam selected in SLS is refined through an iterative procedure. While the goal of SLS is to establish the connection between two devices at the control-mode rate, the goal of BRP is to optimize the devices' antenna settings by making use of the TRN field in a frame. It allows multiple measurements in the same packet and thus enables coherent measurements, leading to a significant performance improvement over SLS.

Compared to 802.11ad, 802.11ay introduces various enhancements and new concepts that improve beam training and extend its support for new applications. Some of the training-related enhancements are highlighted as follows: i) a new beamforming procedure, called BRP Transmit Sector Sweep (TXSS), was introduced to improve the efficiency of beamforming; ii) first-path beamforming training is defined to support positioning applications; iii) group beamforming is specified to reduce overhead by enabling training for multiple stations simultaneously; and iv) a new BRP packet, called the EDMG BRP-RX/TX packet, is defined to enable concurrent Tx and Rx beam training.

Beamforming in 5G mmWave Networks. MmWave communication (on 24–47 GHz) is a key component of 5G New Radio (NR) for increasing network throughput. The beam training procedure in 5G is similar to that in 802.11ad. The base station initiates the beam search process by sweeping its beam over all possible directions. At the end of this period, the user equipment identifies the best beam index for the base station and reports it to the base station. This process repeats with a smaller beamwidth at the base station, until the base station obtains its best beam index. After that, the base station fixes its beam direction and the user equipment sweeps over its beam angles to find the best one.

6.3.2: Design Objective

The airtime overhead of beam training in both 802.11ad/ay and 5G NR is O(N), where N is the number of beam candidates. While many schemes (e.g., [62, 65, 148, 149, 214]) have been proposed to reduce this overhead, most of them are limited to exploiting spatial-domain channel features in a snapshot of the network. Inspired by pedestrian trajectory prediction [20, 60, 186], this chapter focuses on the system-perspective design of efficient beam search strategies for a mobile mmWave device by exploiting the temporal channel correlation over its movement trajectory, aiming to reduce the airtime overhead of its beam training procedure.
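To make the O(N) overhead concrete, here is a toy sketch of an SLS-style exhaustive sweep: every candidate beam is probed once and the strongest is kept, so airtime grows linearly with the beambook size. The RSS model and the per-probe airtime below are illustrative stand-ins, not 802.11ad frame timings.

```python
import numpy as np

def sls_sweep(n_beams, measure_rss, probe_airtime_s):
    """Exhaustive SLS-style sweep: probe every candidate beam once and keep the
    strongest. The airtime cost is O(N) in the beambook size, which is the
    overhead that TBP's temporal prediction aims to avoid."""
    rss = [measure_rss(b) for b in range(n_beams)]       # one probe per beam
    return int(np.argmax(rss)), n_beams * probe_airtime_s

# Toy run: 64 candidate beams and a noisy single-lobe RSS profile peaking at beam 23.
rng = np.random.default_rng(3)
best, airtime = sls_sweep(64, lambda b: -abs(b - 23) + rng.normal(0, 0.3),
                          probe_airtime_s=25e-6)         # assumed per-probe airtime
print(best, airtime)   # ~beam 23, 1.6 ms of training airtime under these assumptions
```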
6.4: TBP: Design

6.4.1: Overview

Figure 6.3: The overview of TBP.

Fig. 6.3 shows the system architecture of TBP for a mobile mmWave device to predict its future beam angles based on its past beam selection results. The system has a database to record the past beam information, i.e., (αi, βi, ti, si), where i is the data sample (beam) index (i = 1, 2, · · · , N), αi is the azimuth beam angle, βi is the elevation beam angle, ti is the time moment when this data sample is generated, and si is the ID of the user device for which the data sample is generated. A CNN is used to extract the features of beam azimuth and elevation angles. Two mLSTM branches are adopted in the system. One is used for prediction, and the other is used as the user device discriminator for adversarial learning. The rationale behind this design is that we wish to extract beam features that are independent of individual user devices, so that the design is generally applicable. The user device discriminator (i.e., mLSTM 2 in Fig. 6.3) is used for this purpose.

The predicted beam angles go through the local beam search procedure in real mmWave systems to find the optimal beam angles. The final selected beam is used by the antenna for signal steering and sent to the database for future use. The key components of TBP are highlighted as follows (a configuration sketch of the feature extractor is given after this list).

• CNN for Feature Extraction: As shown in Fig. 6.3, a CNN is used to extract features before sending data to the mLSTM modules. The CNN has 12 kernels with a size of 4, and it uses ReLU as the activation function. The kernel size and the number of kernels are adjusted based on the model's performance. Smaller kernel sizes excel at capturing fine-grained features, while larger ones are adept at capturing broader patterns. Since TBP aims to capture dynamic patterns and the beam prediction is oriented towards the current moment, larger kernel sizes might overlook the optimal beam direction at the present time. While a higher number of kernels enriches the feature space, it also amplifies the model complexity. Based on our experimental observations, we employ 12 kernels to extract the features of the data. In addition, causal padding is employed to keep the sequence length unchanged.

• mLSTM and Adversarial Learning: Referring to Fig. 6.3, two identical mLSTM networks are used in TBP. These two mLSTM networks are structured for adversarial learning, following the architecture in [77]. The first mLSTM is for predicting the beam angles based on the history information, while the second mLSTM is used to predict the user device. We note that the training purpose is to enable the CNN to deceive the second mLSTM, thereby allowing the CNN to extract user-independent features. Both mLSTMs are connected to fully connected layers. In addition, mLSTM 2 is followed by a SoftMax layer after a fully connected layer. The output of the SoftMax is the probabilities of the devices by which the data samples were generated.

• Local Beam Search in Real System: Since the predicted beam may not be the best one, a local beam search module is employed to perform beam refinement in real systems. It follows the specified protocol, with the aim of finding the optimal beam for data transmission and therefore improving the transmission efficiency. Details are given in Section 6.4.4.
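As referenced above, the following is a minimal Keras sketch of the feature extractor described in the first bullet. The input feature dimension (the four fields α, β, t, s per sample) and the sequence length are assumptions made for illustration.

```python
import tensorflow as tf

SEQ_LEN = 32      # assumed length of the beam-history window
N_FEATURES = 4    # one sample carries (alpha, beta, t, s)

# 1-D CNN front end: 12 kernels of size 4, ReLU activation, and causal
# padding so the output keeps the same time length as the input.
feature_extractor = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, N_FEATURES)),
    tf.keras.layers.Conv1D(filters=12, kernel_size=4,
                           padding="causal", activation="relu"),
])

feature_extractor.summary()  # output shape: (None, 32, 12)
```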
Figure 6.4: Experimental data that show the comparison between a mmWave AP's best-beam direction and its LoS direction when communicating with a station.

6.4.2: mLSTM: Mobility-Aware LSTM

The prediction of mmWave beam azimuth and elevation angles faces two unique challenges: non-uniform and non-smooth data samples over time. Fig. 6.4 shows our experimental comparison between the best-beam direction and the LoS direction in a lab scenario. It can be seen that the time intervals between consecutive data samples are not identical. This is because data traffic is bursty in nature. It can also be seen that the best beam angles may differ from, and are less smooth than, the LoS angles over time. This is caused by the multipath effect and non-ideal antenna radiation. To address these challenges, we propose an LSTM model for a mobile device to predict its beam angles based on its past beam selection profile.

RNNs have been widely used for processing time-series sequential data. Connections between hidden units form a cycle that can pass the past memory to the current cell. In this way, RNNs are particularly useful for problems where the past memory has a strong effect on the current status. However, RNNs are incapable of capturing long-term dependence because of the vanishing and exploding gradient problems [69]. LSTM, a special class of RNN, can handle long-term memory without the problems mentioned above. A traditional LSTM cell has four parts: a forget gate, an input gate, an output gate, and a cell state.

The input time-series data for the traditional LSTM cell is assumed to be uniformly distributed, disregarding the time intervals between samples. This means that the time gaps between consecutive samples are not taken into consideration during the data processing. However, time-series data is not always uniformly distributed, especially when we consider communication requests. The communication frequency depends on the users' needs as well as the type of data the user requested; the data samples from the mmWave band arrive irregularly (e.g., following a Poisson-like process). Given the non-uniformity and non-smoothness of the beam angles over the device's movement trajectory, traditional LSTM models may not work well for the prediction (e.g., [102, 115]). This coincides with our experimental observations. To solve this problem, we propose a new mobility-aware LSTM (mLSTM) model for the beam prediction over time, as shown in Fig. 6.5. The input includes both the data samples (past beam angles) and their timestamps. The outputs are the predicted beam angles at a given time moment.

Figure 6.5: The structure of the mLSTM.

In this mLSTM structure, the memory from the previous time slot is first decomposed into short-term and long-term components through a data-driven method. Unlike the time-aware LSTM in [31], which discounts the short-term memory, our mLSTM extracts the long-term memory and discounts it. The intuition behind this operation is that, for the beam prediction problem, the short-term memory should carry higher weight in the prediction. In other words, the near-past beam samples should play a more important role in the prediction than the far-past beam samples. Still, the long-term memory is kept to capture the general moving tendency. Compared with the traditional LSTM, the adjusted memory contains a larger share of short-term memory and a smaller share of long-term memory.
To discount the long-term memory, mLSTM applies a non-increasing function of the time interval and multiplies it with the long-term memory. Fig. 6.5 shows the diagram of our proposed mLSTM, where each of its cells can be mathematically expressed as follows:

c^s_{j−1} = tanh(W_s c_{j−1} + b_s), (6.1a)
c^l_{j−1} = c_{j−1} − c^s_{j−1}, (6.1b)
Δt_j = t_j − t_{j−1}, (6.1c)
ĉ^l_{j−1} = c^l_{j−1} ⊙ d(Δt_j), (6.1d)
ĉ_{j−1} = ĉ^l_{j−1} + c^s_{j−1}, (6.1e)
i_j = σ(W_{xi} x_j + W_{hi} h_{j−1} + b_i), (6.1f)
f_j = σ(W_{xf} x_j + W_{hf} h_{j−1} + b_f), (6.1g)
o_j = σ(W_{xo} x_j + W_{ho} h_{j−1} + b_o), (6.1h)
c̃_j = tanh(W_{xc} x_j + W_{hc} h_{j−1} + b_c), (6.1i)
c_j = f_j ⊙ ĉ_{j−1} + i_j ⊙ c̃_j, (6.1j)
h_j = o_j ⊙ tanh(c_j), (6.1k)

where x_j is the input data, h_{j−1} is the previous hidden state, and c_{j−1} is the previous memory. W_{xi}, W_{hi}, b_i are the parameters for the input gate; W_{xf}, W_{hf}, b_f are the parameters for the forget gate; W_{xo}, W_{ho}, b_o are the parameters for the output gate; and W_{xc}, W_{hc}, b_c are the parameters for the candidate memory. c^s_{j−1} is the short-term memory, which is decomposed from the previous memory cell c_{j−1}; W_s and b_s are the new weight matrix and bias vector defined for this operation, respectively. c^l_{j−1} is the long-term memory, and ĉ^l_{j−1} is the discounted long-term memory. Here, the non-increasing function being used is d(Δt) = 1/Δt. ĉ_{j−1} is the final adjusted memory, which includes the full short-term memory and the discounted long-term memory; it is used to update the memory. This mLSTM is the fundamental building block for the design of TBP.
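A minimal NumPy sketch of one mLSTM cell step implementing equation (6.1) is given below; the dimensions and the random parameter initialization are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_cell_step(x_j, dt_j, h_prev, c_prev, p):
    """One mLSTM step per equation (6.1).

    x_j:    input vector at step j
    dt_j:   time gap t_j - t_{j-1} per (6.1c), must be > 0
    h_prev: previous hidden state h_{j-1}
    c_prev: previous memory c_{j-1}
    p:      dict of parameters (Ws, bs, Wxi, Whi, bi, ...)
    """
    # Decompose the previous memory into short- and long-term parts.
    c_short = np.tanh(p["Ws"] @ c_prev + p["bs"])               # (6.1a)
    c_long = c_prev - c_short                                   # (6.1b)
    # Discount only the long-term part with d(dt) = 1/dt.
    c_long_hat = c_long * (1.0 / dt_j)                          # (6.1d)
    c_adj = c_long_hat + c_short                                # (6.1e)
    # Standard LSTM gates then operate on the adjusted memory.
    i = sigmoid(p["Wxi"] @ x_j + p["Whi"] @ h_prev + p["bi"])   # (6.1f)
    f = sigmoid(p["Wxf"] @ x_j + p["Whf"] @ h_prev + p["bf"])   # (6.1g)
    o = sigmoid(p["Wxo"] @ x_j + p["Who"] @ h_prev + p["bo"])   # (6.1h)
    c_cand = np.tanh(p["Wxc"] @ x_j + p["Whc"] @ h_prev + p["bc"])  # (6.1i)
    c = f * c_adj + i * c_cand                                  # (6.1j)
    h = o * np.tanh(c)                                          # (6.1k)
    return h, c

# Toy dimensions: 4 input features, hidden size 16 (both assumed).
rng = np.random.default_rng(0)
n_in, n_h = 4, 16
p = {k: rng.normal(scale=0.1, size=(n_h, n_h))
     for k in ("Ws", "Whi", "Whf", "Who", "Whc")}
p.update({k: rng.normal(scale=0.1, size=(n_h, n_in))
          for k in ("Wxi", "Wxf", "Wxo", "Wxc")})
p.update({k: np.zeros(n_h) for k in ("bs", "bi", "bf", "bo", "bc")})
h, c = mlstm_cell_step(rng.normal(size=n_in), 0.12,
                       np.zeros(n_h), np.zeros(n_h), p)
```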
6.4.3: Automated Training Process

Data Collection Automation. A mmWave device first works in the traditional beam training mode (e.g., using the 802.11ad or 5G NR beam training protocol [7, 56]) to collect data samples for the training of TBP. A data sample can be denoted as (αi, βi, ti, si), where i is the sample index. After sufficient data samples have been collected, TBP starts to train its models. As shown in Fig. 6.3, after TBP completes its model training and enters its inference phase, it still adds data samples to its database, which can be used to further train its model if necessary. It should be noted that the training process will not disrupt the normal communications of a mmWave device, as the data samples are side information from standard-compatible mmWave communication.

Training of TBP. Following the structure in Fig. 6.3, mLSTM 1 is trained to minimize the following loss function: La = (1/N) Σ_{i=1}^{N} [(α̂i − αi)² + (β̂i − βi)²], where (αi, βi) are the beam angles of a data sample, (α̂i, β̂i) are the prediction results (see Fig. 6.3), and N is the total number of training samples in this mini-batch. Denote Wa as the weights of mLSTM 1. Then, they are updated as follows: Wa ← Wa − µa ∂La/∂Wa, where µa is the update step size (µa = 0.01 in our experiments).

mLSTM 2 serves as the user device discriminator. It shares the identical structure of mLSTM 1. Denote p⃗i as the output probability vector of the SoftMax layer when the input data sample is (αi, βi, ti, si). We define its loss function as: Ld = −(1/N) Σ_{i=1}^{N} log(g(p⃗i, si)), where g(p⃗i, si) returns the element of p⃗i that corresponds to user device si. Based on this loss function, the weights of mLSTM 2 are updated by: Wd ← Wd − µd ∂Ld/∂Wd, where µd is the update step size (µd = 0.01 in our experiments).

The training of the CNN has two purposes: i) minimizing the prediction loss La, and ii) maximizing the domain discrimination loss Ld. We define the combined loss function as follows: Le = γLa − Ld, where γ is a tuning parameter (γ = 0.2 in our experiments). Based on this loss function, the weights of the CNN are updated by: We ← We − µe ∂Le/∂We, where µe is the update step size (µe = 0.01 in our experiments). As a feature extractor, the CNN tries to cheat the user discriminator by maximizing its loss function Ld while improving the performance of beam prediction by minimizing the loss function La. With this adversarial learning structure, TBP tends to extract user-independent features for the beam prediction.
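A minimal TensorFlow sketch of one such adversarial training step is shown below. The three sub-models (cnn, predictor, discriminator) are hypothetical placeholders for the CNN, mLSTM 1, and mLSTM 2 stacks in Fig. 6.3; SGD with step size 0.01 and γ = 0.2 follow the settings above.

```python
import tensorflow as tf

GAMMA = 0.2
opt_a = tf.keras.optimizers.SGD(0.01)   # mLSTM 1 (predictor)
opt_d = tf.keras.optimizers.SGD(0.01)   # mLSTM 2 (discriminator)
opt_e = tf.keras.optimizers.SGD(0.01)   # CNN (feature extractor)

def train_step(cnn, predictor, discriminator, x, angles, user_ids):
    """One adversarial update: La for mLSTM 1, Ld for mLSTM 2,
    and Le = gamma*La - Ld for the CNN (Section 6.4.3)."""
    with tf.GradientTape(persistent=True) as tape:
        feats = cnn(x, training=True)
        pred = predictor(feats, training=True)        # (alpha_hat, beta_hat)
        probs = discriminator(feats, training=True)   # SoftMax over user IDs
        La = tf.reduce_mean(tf.reduce_sum(tf.square(pred - angles), axis=-1))
        Ld = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(user_ids, probs))
        Le = GAMMA * La - Ld   # the CNN tries to cheat the discriminator
    opt_a.apply_gradients(zip(tape.gradient(La, predictor.trainable_variables),
                              predictor.trainable_variables))
    opt_d.apply_gradients(zip(tape.gradient(Ld, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    opt_e.apply_gradients(zip(tape.gradient(Le, cnn.trainable_variables),
                              cnn.trainable_variables))
    del tape
    return La, Ld
```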
6.4.4: Prediction with Local Beam Search

After the model is trained, it can then be used for beam prediction based on the past beam samples in the database. In the inference phase, the predicted beam angles (the output of mLSTM 1) may or may not be accurate enough for packet transmission. Hence, TBP performs a local beam search, with the aim of finding the best beam angle for signal steering.

Suppose that a mmWave device has a set of predefined beam azimuth angles {αp : 1 ≤ p ≤ Np} and a set of predefined beam elevation angles {βq : 1 ≤ q ≤ Nq}. Also, recall that (α̂, β̂) are the prediction results of our model, i.e., the output of mLSTM 1 in Fig. 6.3. Then, the task of the beam refinement module can be formulated as: (p∗, q∗) = arg max_{p,q} f(αp, βq), subject to |αp − α̂| ≤ τα and |βq − β̂| ≤ τβ, where τα and τβ are the thresholds for the azimuth and elevation angles, respectively, and f(αp, βq) is the resulting signal strength at the receiver when the transmitter uses (αp, βq) as the azimuth/elevation beam angles. To find the optimal beam direction, we can perform the beam probing protocols in Section 6.3.1. Since TBP only needs to perform a local search, its airtime overhead of beam training is much less than that of existing beam training schemes.

6.5: TBP: Out-of-band Enhancement

MmWave communication systems feature high bandwidth, small coverage, and susceptibility to blockage. Thus, it is expected that mmWave communication systems will coexist with sub-6GHz Wi-Fi systems, as they complement each other. In this section, we consider an indoor wireless communication network where each device is equipped with both mmWave and sub-6GHz Wi-Fi radios. We aim to take advantage of widely available sub-6GHz Wi-Fi CSI to enhance the beam prediction for the mmWave radio. This design is particularly useful for the case where the mmWave radio suffers from a sparsity of past data samples for beam prediction (e.g., the mmWave radio just wakes up from a sleep mode, or has been inactive for a long time). Although the literature has many works on out-of-band beamforming [21, 62, 64, 118, 135, 148], their focus is mainly on simplifying the beam search in the spatial domain. Here, TBP focuses on the temporal prediction of beam angles.

6.5.1: Design

Figure 6.6: The structure for TBP when taking into account sub-6GHz CSI for beam angle prediction.

Fig. 6.6 shows the overall structure of TBP for the case when sub-6GHz CSI is available for beam prediction. Compared to the learning model in Fig. 6.3, the only difference is that it adds data from the sub-6GHz radio for its training and inference. As shown in Fig. 6.6, a data sample from the mmWave radio is denoted as (αi, βi, ti, si), while a data sample from the sub-6GHz radio is denoted as (Hj, tj, sj). Apparently, the data samples from the two radios are in very different formats. An important question is how to combine the data from the two radios for training and inference.

For this question, a straightforward method is concatenation, i.e., feeding all raw data to the CNN and letting the CNN extract the useful features. However, this method did not perform well in our experiments. We conjecture that the CNN is incapable of extracting meaningful features from such heterogeneous data samples. To address this problem, we could unify the data format from the two sources by converting the sub-6GHz CSI to the azimuth and elevation angles of the LoS path (see Fig. 6.7 for an example), and then combine the best-beam azimuth/elevation angles and the LoS azimuth/elevation angles as the input of the CNN. While this method performs better than the previous one, its performance still remained unsatisfactory in our experiments.

Figure 6.7: The continuous angle-of-arrival (AoA) estimation from the Wi-Fi band and the ground truth measured by a laser meter.

This could be attributed to two reasons. The first lies in the fact that the best-beam direction may differ from the LoS direction. Through a careful study of the CNN model's input and output data, we found that in a large portion of the input data, there is a discrepancy between the LoS direction (calculated from sub-6GHz CSI) and the best-beam direction (obtained from mmWave search). This is not surprising. In practice, the best-beam direction deviates from the LoS direction due to the imperfect radiation pattern of patch antennas in mmWave systems and the presence of strong non-LoS paths. The second reason is that the CNN may not be capable of differentiating the best-beam direction from the LoS direction during training and inference. Due to the discrepancy between the LoS and best-beam directions within the training data, the model faces challenges in distinguishing the best-beam direction from the LoS direction in the inference phase. This is because the direct-merging method depends solely on time alignment, without utilizing the best-beam direction obtained through mmWave search to correct the corresponding sub-6GHz LoS direction. Furthermore, the larger amount of sub-6GHz CSI data significantly diminishes the impact of the best-beam data generated from the mmWave search.

Based on the above observations, we propose a new method to convert sub-6GHz CSI to its azimuth and elevation angles. It comprises two steps: i) convert the CSI to LoS azimuth/elevation angles, and ii) estimate the corresponding best-beam azimuth/elevation angles based on the LoS azimuth/elevation angles. The details are presented in the next subsection. After merging the data from the two radios, the system model is trained and operated in the same way as presented in Section 6.4.
6.5.2: Data Fusion

We consider the case where the sub-6GHz radio is co-located with the mmWave radio. We assume that the sub-6GHz radio is equipped with multiple equally-spaced antenna elements along both the x and y axes, as illustrated in Fig. 6.8.

Figure 6.8: Illustration of signal angles (α, β, θx, and θy) in 3D space.

To unify the data format for training and inference, we convert the sub-6GHz CSI data (i.e., (Hi, ti)) to the corresponding azimuth/elevation angles (i.e., (αi, βi, ti)). Fig. 6.9 shows the diagram of our data conversion method. It comprises two steps.

Figure 6.9: The diagram of data fusion.

Step 1: Identifying anchor data samples. In this step, we find those sub-6GHz CSI data samples that coincide with a mmWave data sample in time. In Fig. 6.9, the CSI data (Hi, ti) and (Hi′, ti′) are two examples showing the coincidence. Then, the corresponding azimuth/elevation angles of these two samples can be found, as shown in the figure. These CSI data samples will be used as anchors to calculate the azimuth/elevation angles for the rest of the CSI data samples. In practice, as long as the time gap between a CSI data sample and a mmWave sample is less than 1 ms, we consider them in coincidence.

Step 2: Converting the remaining CSI data samples. The data conversion process is illustrated through the example of (Hj, tj) in Fig. 6.9. We explain this process in the following three steps.

1) Calculate (θx,i, θy,i) for the Reference Data Sample: Consider an incoming signal in the 3D space as shown in Fig. 6.8. θx,i is the angle-of-arrival (AoA) of the incoming signal for the linear antenna array on the x-axis, while θy,i is the AoA of the incoming signal for the linear antenna array on the y-axis. Then, the relation between the azimuth/elevation angles (αi, βi) and the signal AoA (θx,i, θy,i) can be expressed as:

θx,i = cos−1(cos(αi) cos(βi)), (6.2)
θy,i = cos−1(sin(αi) cos(βi)). (6.3)

2) Calculate (θx,j, θy,j) for Data Sample j: We first focus on the calculation of θx,j using the antenna elements on the x axis, as shown in Fig. 6.8. The calculation of θy,j follows the same token using the antenna elements on the y axis. Consider antenna k on the x axis. The additional phase shift of its received signal with respect to the first antenna can be written as:

ϕx,k = 2π (k − 1)d/λ · cos(θx), (6.4)

where λ is the wavelength, d is the antenna spacing, and θx is the AoA of the incoming signal. Denote ϕx,k,i and ϕx,k,j as the additional phase shifts on antenna k at time ti and tj, respectively. Then, we have

ϕx,k,j − ϕx,k,i = 2π (k − 1)d/λ · (cos(θx,j) − cos(θx,i)). (6.5)

Based on (6.5), we further have

cos(θx,j) − cos(θx,i) = λ/(2π(k − 1)d) · (ϕx,k,j − ϕx,k,i). (6.6)

Denote Hx,j = [Hx,1,j, Hx,2,j, · · · , Hx,K,j] as the channel vector measured on the antenna elements on the x axis at time tj. In (6.6), ϕx,k,j is a component of the phase of the channel coefficient Hx,k,j, and ϕx,k,i is a component of the phase of the channel coefficient Hx,k,i. It can be seen that the estimation problem in (6.6) is similar to the classic AoA estimation problem. Therefore, the left-hand side of (6.6) can be estimated using the MUSIC algorithm with the input Hx,j ⊙ (Hx,i)∗, where ⊙ is the element-wise product and (·)∗ is the conjugate operator. Mathematically, we have

cos(θx,j) − cos(θx,i) = MUSIC(Hx,j ⊙ (Hx,i)∗), (6.7)

where Hx,j and Hx,i are the measured channels from the antenna elements on the x axis at time tj and ti, respectively.
Based on (6.7), we have

cos(θx,j) = cos(θx,i) + MUSIC(Hx,j ⊙ (Hx,i)∗). (6.8)

By the same token, we have

cos(θy,j) = cos(θy,i) + MUSIC(Hy,j ⊙ (Hy,i)∗). (6.9)

3) Calculate (αj, βj, tj) for Data Sample j: Given cos(θx,j) in (6.8) and cos(θy,j) in (6.9), we can calculate the desired (αj, βj) as follows:

αj = tan−1(cos(θy,j) / cos(θx,j)), (6.10)
βj = cos−1(cos(θx,j) / cos(αj)). (6.11)

This completes the conversion from (Hj, tj) to (αj, βj, tj).
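A compact NumPy sketch of this conversion is given below. The single-snapshot MUSIC spectrum scan, the half-wavelength antenna spacing, and the grid resolution are simplifying assumptions; a real implementation would work on calibrated CSI.

```python
import numpy as np

def music_cos_delta(h_j, h_i, d_over_lambda=0.5,
                    grid=np.linspace(-2.0, 2.0, 4001)):
    """Estimate cos(theta_j) - cos(theta_i) from two channel snapshots
    on the same uniform linear array (equations (6.6)-(6.7)).

    The element-wise product h_j * conj(h_i) cancels the common phase,
    leaving a virtual steering vector in the cosine difference; a
    single-snapshot MUSIC scan then locates that difference.
    """
    v = h_j * np.conj(h_i)
    K = len(v)
    R = np.outer(v, np.conj(v))        # rank-1 sample covariance
    _, eigvecs = np.linalg.eigh(R)     # eigenvalues in ascending order
    En = eigvecs[:, :-1]               # noise subspace (K-1 eigenvectors)
    k = np.arange(K)
    best, best_p = 0.0, -np.inf
    for delta in grid:
        a = np.exp(1j * 2 * np.pi * d_over_lambda * k * delta)
        p = 1.0 / (np.linalg.norm(En.conj().T @ a) ** 2 + 1e-12)
        if p > best_p:
            best, best_p = delta, p
    return best

def convert_sample(h_xj, h_xi, h_yj, h_yi, alpha_i, beta_i):
    """Convert a CSI sample to angles using an anchor, per (6.2)-(6.11)."""
    cos_tx_i = np.cos(alpha_i) * np.cos(beta_i)          # (6.2)
    cos_ty_i = np.sin(alpha_i) * np.cos(beta_i)          # (6.3)
    cos_tx_j = cos_tx_i + music_cos_delta(h_xj, h_xi)    # (6.8)
    cos_ty_j = cos_ty_i + music_cos_delta(h_yj, h_yi)    # (6.9)
    alpha_j = np.arctan2(cos_ty_j, cos_tx_j)             # (6.10)
    beta_j = np.arccos(np.clip(cos_tx_j / np.cos(alpha_j), -1, 1))  # (6.11)
    return alpha_j, beta_j
```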
In each scenario, the mmWave radio was installed on the ceiling, communicating with a mmWave device carried by six different persons on the floor. In each scenario, six persons walked along their routing paths sequentially, and 1,920 trace 157 USRP X310Ethernet switchAnalog DeviceHMC 6350(60GHz frontend)USRP N210High-endcomputer 2.4GHz antennammWave antenna60GHz planar antenna(4x8 antenna elements)EK1HMC6350(60GHz RF frontend)Stepper motor 2 (y-axis)Stepper motor 1 (x-axis)RF connection with USRP X310 (a) Single-user case. (b) Single-user with out-of-band CSI. (c) Multi-user case. Figure 6.12: Training and test loss for TBP in lab scenario. samples were collected in total. 30% of the trace data are randomly selected and used for testing purposes. Along the routing path, the beam angle samples are recorded in irregular time intervals varying from tens to hundreds of milliseconds, and sub-6GHz Wi-Fi CSI was measured once per millisecond. 6.6.3: Training Process The model was trained in each individual scenario. Fig. 6.12 presents the training and test loss in the lab scenario across various cases. Examining the training loss for a single-user case in Fig. 6.12a, we observed that the model converges after approximately 60 epochs. When combining mmWave data with out-of-band CSI, we observed that the convergence time extends to 80 epochs, as depicted in Fig. 6.12b. This can be attributed to the increased complexity of the data. Comparing the cases with and without out-of-band CSI, we observed that the loss decreases more rapidly at the beginning of training when incorporating the out-of-band CSI. This suggests that the beam direction pattern becomes more discernible with the additional information. Fig. 6.12c shows the training loss for the multi-user case, taking around 100 epochs for the model to converge due to the high complexity of multi-user data. 6.6.4: Performance Metrics and Comparison Baseline Metrics. We use the prediction error as the performance metric. Specifically, referring to Fig. 6.3, the prediction error of azimuth angle is eα = |α − ˆα|, where ˆα is the predicted beam azimuth angle while α is the beam angle after beam refinement. Similarly, the prediction error of elevation angle 158 0102030405060Epoch100200300400LossTrainTest020406080Epoch100200300400LossTrainTest020406080100Epoch100200300400LossTrainTest is eβ = |β − ˆβ|. Comparison Baselines. Two schemes are used as the comparison baselines for the evaluation of TBP. • Previous Azimuth/Elevation Angle: For this scheme, we simply use the previous beam’s az- imuth/elevation angle as the beam direction for the current packet transmission. Apparently, the performance of this scheme is highly dependent on the mmWave data sampling rate and the movement speed of the target mmWave device as well as the dynamics of the environ- ment. • LSTM model: This scheme uses traditional LSTM as the model to predict the beam angles based on the history beam angle information. Specifically, we replace the mLSTM in Fig. 6.3 with a traditional LSTM for the beam prediction and remove the adversarial learning com- ponents. 6.6.5: Experimental Results: Beam Angle Errors In this subsection, we measure the performance of TBP. In addition, we explore the answers to the following questions: For a mmWave AP, is it necessary to create and train a model for each individual user device? 
If a mmWave AP maintains a separately trained model for each individual user device, would it offer better performance than the case where the mmWave AP uses a single model for all user devices? To seek the answers, we conduct experiments in two cases: the single-user case and the multi-user case, as detailed below.

Single-User Case. We first evaluate the performance of TBP based on the history beam selection profile (without CSI from the sub-6GHz radio). Fig. 6.13 shows the CDF of the prediction errors in the four scenarios, and Table 6.2 summarizes their average and 95-percentile prediction errors.

Figure 6.13: Prediction error for TBP: single-user case. (a) Lab scenario; (b) conference room scenario; (c) hallway scenario; (d) apartment scenario.

It can be seen that TBP performs better than the other two schemes. In most cases, the average prediction error of TBP is less than 7 degrees, and the 95-percentile of its prediction error is less than 16 degrees. Both of these are smaller than for the other two schemes (LSTM and the previous-angle baseline). In particular, TBP significantly outperforms the LSTM-based scheme. This indicates that our proposed mLSTM structure is much more efficient for beam angle prediction than the traditional LSTM structure.

Table 6.2: Prediction errors of TBP: single-user case (average and 95-percentile azimuth/elevation errors, in degrees, without and with out-of-band CSI).

             TBP w/o CSI                TBP w/ CSI
scenario     Avg azi/ele   95% azi/ele  Avg azi/ele  95% azi/ele
lab          3.8 / 6.3     11.2 / 14.7  2.1 / 2.8    5.0 / 12.8
conference   6.9 / 6.8     12.2 / 12.1  4.9 / 6.1    8.2 / 8.7
hallway      7.1 / 6.6     13.3 / 11.4  6.0 / 4.1    10.7 / 9.3
apartment    6.5 / 5.0     15.6 / 16.8  3.8 / 4.5    9.6 / 8.5

We now report the experimental results of TBP when it takes advantage of available CSI data from the co-located sub-6GHz radio for its training and inference. Fig. 6.14 shows the CDF of the measured prediction errors, and Table 6.2 shows the comparison between the cases with and without CSI data from the sub-6GHz radio.

Figure 6.14: Prediction error for TBP: single-user case with out-of-band CSI enhancement. (a) Lab scenario; (b) conference room scenario; (c) hallway scenario; (d) apartment scenario.

It can be seen that, with the utilization of CSI from the sub-6GHz radio, the average prediction error of TBP is less than 3 degrees in the lab scenario. The average prediction error is around 5 degrees in the conference room, hallway, and apartment scenarios; it is larger than that in the lab scenario mainly because of their larger size. It can also be seen that the use of CSI data notably improves the prediction accuracy in the lab scenario and slightly improves it in the conference room, hallway, and apartment scenarios. This is because the lab has much furniture and equipment and is thus more reflective than the other three scenarios.

Multi-User Case.
We conduct experiments in the four scenarios by creating and training a single TBP model for six user devices, and measure the prediction errors to evaluate its performance. Fig. 6.15 presents the CDF of our measured prediction errors in the four scenarios, and Table 6.3 summarizes the average and 95th percentile of the measured prediction errors.

Figure 6.15: Prediction error for TBP: multi-user case. (a) Lab scenario; (b) conference room scenario; (c) hallway scenario; (d) apartment scenario.

It can be seen that, for most cases, the average prediction error of TBP is less than 7 degrees. In addition, TBP significantly outperforms its counterparts (the LSTM-based scheme and the previous-beam scheme). Comparing the results presented in Table 6.2 and Table 6.3, we found that TBP has similar performance in the single-user and multi-user cases. This indicates that a mmWave AP does not need to maintain (create, train, and re-train) different TBP models for different user devices. In other words, it only needs to maintain a single TBP model for all user devices.

Table 6.3: Prediction errors for TBP: multi-user case (average and 95-percentile azimuth/elevation errors, in degrees).

             TBP                        LSTM                       Previous Point
scenario     Avg azi/ele  95% azi/ele   Avg azi/ele  95% azi/ele   Avg azi/ele  95% azi/ele
lab          4.9 / 4.8    10.9 / 13.8   6.5 / 7.3    18.0 / 19.1   8.3 / 9.5    18.3 / 22.4
conference   5.1 / 6.8    14.1 / 13.1   8.6 / 6.8    18.9 / 16.9   8.3 / 9.6    18.5 / 20.8
hallway      6.9 / 8.7    12.6 / 16.2   10.1 / 11.6  22.5 / 22.7   17.1 / 14.1  26.2 / 24.7
apartment    5.3 / 5.0    12.1 / 13.2   7.2 / 9.3    19.1 / 20.6   8.7 / 9.6    27.1 / 24.2

6.6.6: Throughput Gain

Based on the measured beam angle prediction errors, we now assess the throughput gain of TBP in some representative scenarios.

Comparison Baseline. We use the beam search approach in IEEE 802.11ad (see Section 6.3.1) as our comparison baseline.⁴ Following the beam search parameters in [118], we assume that the beam search range is from 0° to 180°, with the step size being 7°. Also following the setting in [118], we assume that the measurement of one beam direction takes 60 µs. Therefore, to search for the best azimuth and elevation angles in the 3D space, it takes about (180/7) × (180/7) × 60 µs ≈ 39.6 ms. This is the airtime overhead of beam search during a beacon interval in conventional mmWave networks. To evaluate the throughput, another important factor is the time duration of a beacon interval (see Fig. 6.2). Theoretically, a longer beacon interval will improve the throughput by amortizing the beam search overhead.
However, in practice, the maximum time duration of a beacon interval is constrained by the channel coherence time. Here, we follow the parameters in [117] by setting the beacon interval to 100 ms.

Throughput Gain. When a mmWave device uses TBP, it does not need to search the whole angle range (i.e., 0° to 180°). It only needs to refine the beam direction within its prediction error range. Consider TBP in the lab scenario as an example. Its 99-percentile prediction error is 13.0 degrees for the azimuth angle and 24.4 degrees for the elevation angle. Therefore, the beam search time can be estimated by (13×2)/7 × (24.4×2)/7 × 60 µs ≈ 1.55 ms, where 7 is the angle search step size and 60 µs is the airtime of one search. Recall that we assume the beacon interval is 100 ms. Then, TBP has 100 − 1.55 = 98.45 ms for data transmission. In contrast, the conventional beam search approach (comparison baseline) has 100 − 39.6 = 60.4 ms for data transmission. This means that TBP can improve the throughput by 62%. Following the same approach, we also calculate the throughput gain of TBP in the other three scenarios.

⁴One may wonder why we use the IEEE 802.11ad beam search (rather than the more advanced beam search schemes in the literature) as our comparison baseline. We argue that, while there are many efficient beam search schemes in the literature [23, 24, 65, 118, 148], most of them are limited to the spatial domain. TBP is the first temporal beam prediction approach and is complementary to those schemes. Simply put, TBP can be used on top of those beam search schemes to further improve the throughput of mmWave networks.

Figure 6.16: Throughput gain of TBP over the 802.11ad beam search approach based on the measured beam prediction errors.

Fig. 6.16 presents the projected throughput gain of TBP. It can be observed that TBP improves the throughput by more than 60% in all four scenarios. This shows the efficiency and robustness of TBP in different environments.
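The overhead arithmetic above can be reproduced with the small helper below. The beam-search parameters (7° step, 60 µs per measurement, 100 ms beacon interval) follow [117, 118]; the function itself is only an illustrative back-of-the-envelope calculator, not part of the TBP implementation.

```python
def beam_search_airtime_ms(az_range_deg, el_range_deg,
                           step_deg=7.0, t_meas_us=60.0):
    """Airtime of a 2-D sweep over the given azimuth/elevation ranges."""
    n = (az_range_deg / step_deg) * (el_range_deg / step_deg)
    return n * t_meas_us / 1000.0

def throughput_gain(pred_err_az, pred_err_el, beacon_ms=100.0):
    """Gain of TBP's local refinement over a full 0-180 degree sweep.

    TBP only refines within +/- the prediction error, so the swept
    range is twice the (99-percentile) error on each axis.
    """
    full = beam_search_airtime_ms(180.0, 180.0)             # ~39.6 ms
    local = beam_search_airtime_ms(2 * pred_err_az, 2 * pred_err_el)
    return (beacon_ms - local) / (beacon_ms - full) - 1.0

# Lab scenario: 99-percentile errors of 13.0 (azimuth) and 24.4 (elevation).
print(round(beam_search_airtime_ms(26.0, 48.8), 2))   # ~1.55 ms
print(round(throughput_gain(13.0, 24.4), 2))          # ~0.62-0.63
```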
6.7: Summary

Analog beamforming is a fundamental problem in mmWave communication systems. One key problem related to analog beamforming is how to reduce the airtime overhead of beam selection, as it is critical for improving the efficiency of mmWave communications. In this chapter, we presented TBP for beam prediction in a 3D space by leveraging the temporal correlation of mmWave channels. The innovation of TBP lies in the design of a new LSTM model, which is capable of performing accurate beam prediction by taking non-uniform, non-smooth historical data samples as input. We further enhanced TBP by taking out-of-band CSI from the sub-6GHz radio as its input for training and inference. A novel data fusion method was developed to unify the format of the data samples from the mmWave and sub-6GHz radios. We have evaluated the performance of TBP on a 60GHz mmWave testbed. Experimental results show that the average prediction error of TBP is less than 7 degrees in most of our tested cases and that TBP can improve the throughput by more than 60% in representative mmWave networks.

Our future work will focus on three directions. First, we will develop a high-fidelity testbed for the evaluation of TBP. We will evaluate it on both 28GHz and 60GHz mmWave testbeds following the 5G/Wi-Fi standards. Second, we are currently using an LSTM model for the beam prediction. More advanced deep-learning models have been developed for computer vision and natural language processing; we will enhance TBP by employing new models and evaluating their performance in real-world systems. Third, we will design tools and protocols to enable the automation of data collection for the model training of TBP. Such tools will significantly improve the practicality and generalizability of TBP.

CHAPTER 7: CONCLUSION AND FUTURE WORK

7.1: Conclusion

Wireless communication systems serve as the backbone of today's digitized world, supporting a wide range of applications such as AR/VR, autonomous vehicles, V2X communication, industrial automation, and smart cities. These systems are no longer limited to data communication alone; they also function as sensing platforms, capturing information about the physical world. This transformation has enabled a variety of emerging applications, including elderly care, security and intrusion detection, gesture and activity recognition, and sleep and vital sign monitoring. In this thesis, we designed wireless communication and sensing systems using learning-based approaches, with the goal of improving communication efficiency and enabling fine-grained human motion sensing. We developed learning models in conjunction with signal processing algorithms and customized hardware designs to advance the capabilities of wireless communication and sensing systems, thereby enabling new applications.

The first part of the thesis presented the use of various learning frameworks to design wireless sensing systems for fine-grained human motion sensing. We began by introducing a gesture-based wireless authentication scheme for IoT devices, which employed a convolutional neural network with a feature fusion strategy. Specifically, it combined Wi-Fi CSI amplitude and AoA to enable location-independent gesture recognition. This scheme served as a novel authentication method for widely deployed IoT devices that lack traditional input interfaces. Next, we explored a more challenging topic—handwriting detection through walls. We addressed this challenge through a joint hardware and software design. On the hardware side, we developed a 6 GHz FMCW radar with patch antennas. These components enabled the system to detect motion behind walls while minimizing interference. On the software side, we designed a tailored deep neural network to recognize handwritten letters through obstructions. The model integrated a BiLSTM with an attention mechanism to capture temporal dependencies and extract critical features—such as turning points—from radar phase sequences for accurate recognition. We further extended this system to support eye motion recognition by incorporating a camera-guided deep neural network. This framework used a Transformer encoder as the feature extractor and integrated a state-of-the-art vision-based approach to guide the learning process, enabling the extraction of subtle eye motion features from RF signals. We prototyped all these wireless sensing applications and evaluated them in real-world scenarios, with the hope that they will serve as a foundation for future research.

The second part of the thesis proposed two learning-based solutions aimed at reducing beamforming overhead to enhance the throughput of mmWave communication systems under different network settings. First, we introduced an uplink MU-MIMO mmWave communication (UMMC) scheme for WLANs, which utilized a Bayesian optimization framework for joint beam search across multiple antennas. By significantly reducing beamforming overhead, UMMC improved the network throughput achieved by MU-MIMO in WLANs.
Second, we proposed TBP, a Temporal Beam Prediction framework, which focused on reducing beamforming overhead in mobile mmWave networks. TBP was a learning-based beam prediction scheme equipped with a mobility-aware LSTM model. This customized module incorporated timestamp information during training, enabling it to predict beam directions across variable time intervals. By effectively minimizing beamforming overhead, TBP enhanced the throughput performance of mobile mmWave networks.

Investigating wireless communication and sensing together aims to advance the development of future Integrated Sensing and Communication (ISAC) systems. ISAC systems can improve the efficiency of spectrum and hardware utilization, transforming communication platforms into multifunctional systems. This aligns with the vision for 6G networks, which are expected to provide users with a more diverse experience—ranging from data transmission to the delivery of sensing information. This thesis serves as a foundation for future research on the integration of sensing and communication.

7.2: Future Research

Our future research focuses on integrating AI into the next generation of wireless communication and sensing systems. We outline the future directions along two dimensions. First, from an application perspective, we will explore potential use cases for wireless sensing and communication. Then, we will discuss several challenges that must be addressed for practical deployment and for expanding the applicability of our proposed solutions.

7.2.1: Applications

Figure 7.1: THz communication signals are used to assess food quality and detect skin diseases.

Figure 7.2: Healthcare reports are generated by an LLM agent based on wireless sensing data.

Sensing with Terahertz Communication Signals. Future communication systems are expected to move into the Terahertz (THz) frequency band to support significantly larger bandwidths. The use of high-frequency signals implies shorter wavelengths, which opens up a wide range of novel applications focused on micro-scale information. THz signals can be used to detect fine particles in the air, making them suitable for air quality monitoring. For the same reason, they are promising for applications such as food quality assessment, as shown in Fig. 7.1. Additionally, due to their ability to penetrate the surface of the skin, THz signals offer the potential for non-invasive detection of certain skin diseases in daily life.

Wireless Sensing with LLMs. Wireless sensing leverages the pervasive presence of RF signals, enabling sensing capabilities to extend into every aspect of our daily lives. The continuous stream of data collected by these systems can be effectively processed using Large Language Models (LLMs), significantly enhancing user experience through intelligent interpretation and interaction. By integrating input from wireless sensing and tracking, LLMs gain an additional layer of perception—functioning as a new type of sensor that connects the digital and physical worlds. In the future, LLM-based agents may generate context-aware responses informed by users' activities, locations, and even health conditions, as shown in Fig. 7.2. Synthesizing these technologies to create advanced cyber-physical systems (CPS) will be essential for enabling intelligent, adaptive, and human-centric applications in the next generation of smart environments.

Ubiquitous Vital Sign Measurement. Vital sign measurement is a well-established topic in both academia and industry.
However, achieving accurate, contactless monitoring over a large area remains an unsolved challenge. This capability is particularly important for elderly individuals, whose health status should ideally be monitored continuously and unobtrusively throughout the day. RF signals have emerged as a promising candidate for this application, as they can already detect vital signs such as respiration and heartbeat without physical contact. Nonetheless, effectively covering an entire room or large area with consistent accuracy continues to be a significant hurdle that must be addressed for widespread deployment.

RIS-Aided Sensing and Communication. As communication signals move into higher frequency bands, they experience greater attenuation and are more easily blocked by obstacles. Reconfigurable intelligent surfaces (RIS), which can manipulate electromagnetic waves in a controlled manner, offer a promising solution to this challenge. By intelligently redirecting signals, RIS can help avoid blockages, extend coverage areas, and focus signals in specific directions—enhancing both communication performance and sensing capabilities. Designing and integrating such surfaces will be a key component of future 6G networks.

7.2.2: Towards the Advancement of Techniques

Generalization to Different Environments. Generalizability of learning models is a key challenge across many domains, and it becomes particularly critical when dealing with RF data, which is highly sensitive to environmental changes. Although AuthIoT has explored environment-independent gesture recognition and has been evaluated in multiple settings, its performance in more complex or dynamic environments remains uncertain. A similar issue arises in TBP, where the current system requires fine-tuning with data collected from the target environment for accurate beam prediction. Training learning models with RF data that can generalize well across diverse environments remains an open and difficult challenge.

Interference from Dynamic Objects. Interference is a well-known challenge in wireless sensing systems, and it becomes particularly severe when using communication signals—since these signals propagate throughout the environment, any non-target movement can impact them. Most existing Wi-Fi sensing studies assume that the user is the only dynamic object in the environment. Radar-based systems can help mitigate this issue; for example, RadSee and RadEye demonstrate the ability to avoid interference from objects located one meter away. However, they still suffer from interference when non-target objects are in close proximity. Designing a wireless sensing system that can fully eliminate interference from dynamic objects remains an open research problem.

Multi-Target Recognition Using Communication Signals. Recognizing multiple targets with a wireless sensing system is challenging, as it is difficult to distinguish targets solely based on variations in RF signals. Accurately determining the locations of multiple targets is already difficult, and even when their positions can be estimated, simultaneous motions can cause their signals to interfere with each other.
Nonetheless, multi-target recognition is essential for developing practical and robust wireless sensing systems suitable for real-world applications.

Efficient Data Collection. Learning-based approaches heavily rely on data, and acquiring training data efficiently and intelligently remains a significant challenge. Sensing tasks involving human targets are particularly labor-intensive, often requiring repetitive motions and extensive manual labeling. A promising future direction is to develop methods for automatic data collection and labeling, reducing the burden on human effort. Additionally, exploring data augmentation techniques to generate large and diverse datasets from a small amount of collected data will be critical for improving model performance and generalizability.

Lightweight AI Models. Although AI models are highly powerful, they often require substantial computing resources. While our current implementations are deployed on local computers, wireless communication and sensing systems are increasingly expected to run on portable, compact devices. Therefore, designing lightweight AI models that can be efficiently deployed on resource-constrained IoT devices is a critical direction for future research.

BIBLIOGRAPHY

[1] IEEE 802.11ad-2012. https://standards.ieee.org/ieee/802.11ad/4527/, Accessed: 08-July-2022.

[2] Mobile data traffic outlook. https://www.ericsson.com/en/reports-and-papers/mobility-report/dataforecasts/mobile-traffic-forecast, Accessed: 08-July-2022.

[3] 5G massive MIMO. https://res-www.zte.com.cn/mediares/zte/Files/PDF/white_book/202009101153.pdf, Accessed: 09-July-2022.

[4] IEEE 802.11ac-2013. https://standards.ieee.org/ieee/802.11ac/4473/, Accessed: 09-July-2022.

[5] IEEE 802.11ay-2021. https://standards.ieee.org/ieee/802.11ay/6142/, Accessed: 09-July-2022.

[6] IEEE 802.11ad. https://www.ieee802.org/11/Reports/tgad_update.htm, Accessed: 30-July-2022.

[7] Release 15. https://www.3gpp.org/release-15, Accessed: 30-July-2022.

[8] Amazon Alexa. https://developer.amazon.com/en-US/alexa, Accessed: 11-June-2021.

[9] Decora smart - smart switches. https://www.leviton.com/, Accessed: 11-June-2021.

[10] Google home assistant. https://assistant.google.com, Accessed: 11-June-2021.

[11] Gosund WiFi smart switch. https://www.gosund.com, Accessed: 11-June-2021.

[12] SimpliSafe. https://simplisafe.com, Accessed: 11-June-2021.

[13] 3GPP. NR and NG-RAN overall description, 2018.

[14] 3GPP Tdoc R1-1901252. Evaluation on TSN requirements, Jan. 2019.

[15] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.

[16] Fadel Adib, Chen-Yu Hsu, Hongzi Mao, Dina Katabi, and Frédo Durand. Capturing the human figure through a wall. ACM Transactions on Graphics (TOG), 34(6):1–13, 2015.

[17] Fadel Adib and Dina Katabi. See through walls with WiFi! In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, pages 75–86, 2013.

[18] AFPRELAXNEWS. Handwriting still has a place in our connected world, now it's a trend on social media. https://tinyurl.com/bddxy57n, April 2024. [Online; accessed 01-April-2024].

[19] Adeel Ahmad, June Chul Roh, Dan Wang, and Aish Dubey. Vital signs monitoring of multiple people using a FMCW millimeter-wave sensor. In 2018 IEEE Radar Conference (RadarConf18), pages 1450–1455.
IEEE, 2018.

[20] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016.

[21] Anum Ali, Nuria González-Prelcic, and Robert W Heath. Millimeter wave beam-selection using out-of-band spatial information. IEEE Transactions on Wireless Communications, 17(2):1038–1052, 2017.

[22] Kamran Ali, Alex X Liu, Wei Wang, and Muhammad Shahzad. Keystroke recognition using WiFi signals. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, pages 90–102, 2015.

[23] Ahmed Alkhateeb, Sam Alex, Paul Varkey, Ying Li, Qi Qu, and Djordje Tujkovic. Deep learning coordinated beamforming for highly-mobile millimeter wave systems. IEEE Access, 6:37328–37348, 2018.

[24] Ahmed Alkhateeb, Omar El Ayach, Geert Leus, and Robert W Heath. Channel estimation and hybrid precoding for millimeter wave cellular systems. IEEE Journal of Selected Topics in Signal Processing, 8(5):831–846, 2014.

[25] Ahmed Alkhateeb, Geert Leus, and Robert W Heath. Compressed sensing based multi-user millimeter wave systems: How many measurements are needed? In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2909–2913. IEEE, 2015.

[26] ALS News Today. ALS Facts and Statistics. https://tinyurl.com/bs8rmh3w, September 2024. [Online; accessed 11-September-2024].

[27] Maria Antonieta Alvarez and Umberto Spagnolini. Distributed time and carrier frequency synchronization for dense wireless networks. IEEE Transactions on Signal and Information Processing over Networks, 4(4):683–696, 2018.

[28] Amy Yee. Why Do Eye Muscles Function in ALS as Other Muscles Waste Away? https://tinyurl.com/37tk2rm4, September 2024. [Online; accessed 11-September-2024].

[29] Anonymous. Demo video of real-time uplink MU-MIMO mmWave communication. https://youtu.be/Q2Bk7i6O5mg, Accessed: 30-July-2022.

[30] Irmak Aykin, Berk Akgun, Mingjie Feng, and Marwan Krunz. MAMBA: A multi-armed bandit framework for beam tracking in millimeter-wave systems. In Proceedings of IEEE Conference on Computer Communications (INFOCOM), pages 1469–1478. IEEE, 2020.

[31] Inci M Baytas, Cao Xiao, Xi Zhang, Fei Wang, Anil K Jain, and Jiayu Zhou. Patient subtyping via time-aware LSTM networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 65–74, 2017.

[32] Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 747–754, 2017.

[33] X. Cao, B. Chen, and Y. Zhao. Wi-Wri: Fine-grained writing recognition using Wi-Fi signals. In 2016 IEEE Trustcom/BigDataSE/ISPA, pages 1366–1373, 2016.

[34] Emanuele Cardillo, Gaia Sapienza, Changzhi Li, and Alina Caddemi. Head motion and eyes blinking detection: A mm-wave radar for assisting people with neurodegenerative disorders. In 2020 50th European Microwave Conference (EuMC), pages 925–928, Piscataway, NJ, USA, 2021. IEEE.

[35] Irched Chafaa, Romain Negrel, E Veronica Belmega, and Mérouane Debbah. Self-supervised deep learning for mmWave beam steering exploiting sub-6 GHz channels. IEEE Transactions on Wireless Communications, 21(10):8803–8816, 2022.
[36] Zhaoxin Chang, Fusang Zhang, Jie Xiong, Junqi Ma, Beihong Jin, and Daqing Zhang. Sensor-free soil moisture sensing using LoRa signals. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(2):1–27, 2022.

[37] Yimin Chen, Tao Li, Rui Zhang, Yanchao Zhang, and Terri Hedgpeth. EyeTell: Video-assisted touchscreen keystroke inference from eye movements. In 2018 IEEE Symposium on Security and Privacy (SP), pages 144–160, Piscataway, NJ, USA, 2018. IEEE.

[38] Haiming Cheng, Wei Lou, Yanni Yang, Yi-pu Chen, and Xinyu Zhang. TwinkleTwinkle: Interacting with your smart devices by eye blink. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 7(2):1–30, 2023.

[39] Umesh Choudhary, Sampada Bhosale, Sonali Bhise, and Purushottam Chilveri. A survey: Cursive handwriting recognition techniques. In 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), pages 1712–1716. IEEE, 2017.

[40] Tarun S Cousik, Vijay K Shah, Tugba Erpek, Yalin E Sagduyu, and Jeffrey H Reed. Deep learning for fast and reliable initial access in AI-driven 6G mmWave networks. IEEE Transactions on Network Science and Engineering, 2022.

[41] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[42] Dassault Systèmes Simulia. CST Studio Suite. https://www.3ds.com/products-services/simulia/products/cst-studio-suite/, April 2024. [Online; accessed 17-April-2024].

[43] Shivanker Dev Dhingra, Geeta Nijhawan, and Poonam Pandit. Isolated speech recognition using MFCC and DTW. International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, 2(8):4085–4092, 2013.

[44] Murtaza Dhuliawala, Juyoung Lee, Junichi Shimizu, Andreas Bulling, Kai Kunze, Thad Starner, and Woontack Woo. Smooth eye movement interaction using EOG glasses. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 307–311, New York, NY, USA, 2016. Association for Computing Machinery.

[45] Muhammad Ehatisham-Ul-Haq, Ali Javed, Muhammad Awais Azam, Hafiz MA Malik, Aun Irtaza, Ik Hyun Lee, and Muhammad Tariq Mahmood. Robust human activity recognition using multimodal feature-level fusion. IEEE Access, 7:60736–60751, 2019.

[46] Lijie Fan, Tianhong Li, Yuan Yuan, and Dina Katabi. In-home daily-life captioning using radio signals. In European Conference on Computer Vision, pages 105–123. Springer, 2020.

[47] Yuda Feng, Yaxiong Xie, Deepak Ganesan, and Jie Xiong. LTE-based pervasive sensing across indoor and outdoor. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, pages 138–151, New York, NY, USA, 2021. Association for Computing Machinery.

[48] Yuda Feng, Yaxiong Xie, Deepak Ganesan, and Jie Xiong. LTE-based low-cost and low-power soil moisture sensing. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, pages 421–434, 2022.

[49] Andrew Fitzgibbon, Maurizio Pilu, and Robert B Fisher. Direct least square fitting of ellipses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):476–480, 1999.

[50] Yongjian Fu, Shuning Wang, Linghui Zhong, Lili Chen, Ju Ren, and Yaoxue Zhang. SVoice: Enabling voice communication in silence via acoustic sensing on commodity devices. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, pages 622–636, 2022.

[51] Z. Fu, J. Xu, Z. Zhu, A. X. Liu, and X. Sun.
Writing in the air with WiFi signals for virtual reality devices. IEEE Transactions on Mobile Computing, 18(2):473–484, 2019.
[52] Zhangjie Fu, Jiashuang Xu, Zhuangdi Zhu, Alex X Liu, and Xingming Sun. Writing in the air with WiFi signals for virtual reality devices. IEEE Transactions on Mobile Computing, 18(2):473–484, 2018.
[53] GazeRecorder. Online Eye Tracking Software. https://gazerecorder.com/, September 2024. [Online; accessed 6-September-2024].
[54] Yasaman Ghasempour, Claudio RCM Da Silva, Carlos Cordeiro, and Edward W Knightly. IEEE 802.11ay: Next-generation 60 GHz communication for 100 Gb/s Wi-Fi. IEEE Communications Magazine, 55(12):186–192, 2017.
[55] Nirnimesh Ghose, Loukas Lazos, and Ming Li. SFIRE: Secret-free-in-band trust establishment for COTS wireless devices. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pages 1529–1537, 2018.
[56] Marco Giordani, Michele Polese, Arnab Roy, Douglas Castor, and Michele Zorzi. A tutorial on beam management for 3GPP NR at mmWave frequencies. IEEE Communications Surveys & Tutorials, 21(1):173–196, 2018.
[57] Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2008.
[58] Yiqun Guo, Zihuan Wang, Ming Li, and Qian Liu. Machine learning based mmWave channel tracking in vehicular scenario. In Proceedings of IEEE International Conference on Communications Workshops (ICC Workshops), pages 1–6. IEEE, 2019.
[59] Z. Guo, F. Xiao, B. Sheng, H. Fei, and S. Yu. WiReader: Adaptive air handwriting recognition based on commercial WiFi signal. IEEE Internet of Things Journal, 7(10):10483–10494, 2020.
[60] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018.
[61] Unsoo Ha, Salah Assana, and Fadel Adib. Contactless seismocardiography via deep learning radars. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–14, 2020.
[62] Muhammad Kumail Haider, Yasaman Ghasempour, Dimitrios Koutsonikolas, and Edward W Knightly. LiSteer: mmWave beam acquisition and steering by tracking indicator LEDs on wireless APs. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 273–288, 2018.
[63] Daniel Halperin, Wenjun Hu, Anmol Sheth, and David Wetherall. Tool release: Gathering 802.11n traces with channel state information. SIGCOMM Comput. Commun. Rev., 41(1):53, January 2011.
[64] Morteza Hashemi, C Emre Koksal, and Ness B Shroff. Out-of-band millimeter wave beamforming and communications to achieve low latency and high energy efficiency in 5G systems. IEEE Transactions on Communications, 66(2):875–888, 2017.
[65] Haitham Hassanieh, Omid Abari, Michael Rodriguez, Mohammed Abdelghany, Dina Katabi, and Piotr Indyk. Fast millimeter wave beam alignment. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 432–445, 2018.
[66] Wenfeng He, Kaishun Wu, Yongpan Zou, and Zhong Ming. WiG: WiFi-based gesture recognition system. In 2015 24th International Conference on Computer Communication and Networks (ICCCN), pages 1–7. IEEE, 2015.
[67] Robert W Heath, Nuria Gonzalez-Prelcic, Sundeep Rangan, Wonil Roh, and Akbar M Sayeed. An overview of signal processing techniques for millimeter wave MIMO systems. IEEE Journal of Selected Topics in Signal Processing, 10(3):436–453, 2016.
[68] Yuqiang Heng and Jeffrey G Andrews. Machine learning-assisted beam alignment for mmWave systems. IEEE Transactions on Cognitive Communications and Networking, 7(4):1142–1155, 2021.
[69] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[70] Yufang Hou. Incremental fine-grained information status classification using attention-based LSTMs. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1880–1890, 2016.
[71] Jingyang Hu, Hongbo Jiang, Daibo Liu, Zhu Xiao, Schahram Dustdar, Jiangchuan Liu, and Geyong Min. BlinkRadar: Non-intrusive driver eye-blink detection with UWB radar. In 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), pages 1040–1050, Piscataway, NJ, USA, 2022. IEEE.
[72] Pengfei Hu, Yifan Ma, Panneer Selvam Santhalingam, Parth H Pathak, and Xiuzhen Cheng. MILLIEAR: Millimeter-wave acoustic eavesdropping with unconstrained vocabulary. In IEEE INFOCOM 2022 - IEEE Conference on Computer Communications, pages 11–20. IEEE, 2022.
[73] Hao Huang, Guan Gui, Haris Gacanin, Chau Yuen, Hikmet Sari, and Fumiyuki Adachi. Deep regularized waveform learning for beam prediction with limited samples in non-cooperative mmWave systems. IEEE Transactions on Vehicular Technology, 2023.
[74] Shoya Ishimaru, Kai Kunze, Koichi Kise, Jens Weppner, Andreas Dengel, Paul Lukowicz, and Andreas Bulling. In the blink of an eye: Combining head motion and eye blink frequency for activity recognition with Google Glass. In Proceedings of the 5th Augmented Human International Conference, pages 1–4, New York, NY, USA, 2014. Association for Computing Machinery.
[75] Grant Jenks. wordsegment. https://grantjenks.com/docs/wordsegment/, April 2024. [Online; accessed 17-April-2024].
[76] Neeta Jha, Amrita Mishra, Jyotsna Bapat, and Debabrata Das. Fast beam search with two-level phased array in millimeter-wave massive MIMO: A hierarchical approach. In Proceedings of IEEE Wireless Communications and Networking Conference (WCNC), pages 1371–1376. IEEE, 2022.
[77] Wenjun Jiang, Chenglin Miao, Fenglong Ma, Shuochao Yao, Yaqing Wang, Ye Yuan, Hongfei Xue, Chen Song, Xin Ma, Dimitrios Koutsonikolas, et al. Towards environment independent device free human activity recognition. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 289–304, 2018.
[78] Wenjun Jiang, Hongfei Xue, Chenglin Miao, Shiyang Wang, Sen Lin, Chong Tian, Srinivasan Murali, Haochen Hu, Zhi Sun, and Lu Su. Towards 3D human pose construction using WiFi. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–14, 2020.
[79] Jie Xiong, Karthikeyan Sundaresan, and Kyle Jamieson. ToneTrack: Leveraging frequency-agile radios for time-based indoor wireless localization. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, MobiCom '15, page 537–549, New York, NY, USA, 2015. Association for Computing Machinery.
[80] Shian-Ru Ke, Hoang Le Uyen Thuc, Yong-Jin Lee, Jenq-Neng Hwang, Jang-Hee Yoo, and Kyoung-Ho Choi. A review on video-based human activity recognition. Computers, 2(2):88–131, 2013.
[81] Yongning Ke, Hui Gao, Wenjun Xu, Lixin Li, Li Guo, and Zhiyong Feng. Position prediction based fast beam tracking scheme for multi-user UAV-mmWave communications. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pages 1–7. IEEE, 2019.
[82] Bryce Kellogg, Vamsi Talla, and Shyamnath Gollakota. Bringing gesture recognition to all devices. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 303–316, 2014.
[83] Eamonn J Keogh and Michael J Pazzani. Scaling up dynamic time warping for datamining applications. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 285–289, 2000.
[84] Sara Khosravi, Hossein Shokri-Ghadikolaei, and Marina Petrova. Learning-based handover in mobile millimeter-wave networks. IEEE Transactions on Cognitive Communications and Networking, 7(2):663–674, 2020.
[85] Manikanta Kotaru, Kiran Joshi, Dinesh Bharadia, and Sachin Katti. SpotFi: Decimeter level localization using WiFi. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, pages 269–282, 2015.
[86] Manikanta Kotaru, Kiran Joshi, Dinesh Bharadia, and Sachin Katti. SpotFi: Decimeter level localization using WiFi. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, pages 269–282, 2015.
[87] Kyle Krafka, Aditya Khosla, Petr Kellnhofer, Harini Kannan, Suchendra Bhandarkar, Wojciech Matusik, and Antonio Torralba. Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2176–2184, Piscataway, NJ, USA, 2016. IEEE.
[88] Shajahan Kutty and Debarati Sen. Beamforming for millimeter wave communications: An inclusive survey. IEEE Communications Surveys & Tutorials, 18(2):949–973, 2015.
[89] Antoine Lamé. GazeTracking. https://github.com/antoinelame/GazeTracking, September 2024. [Online; accessed 11-September-2024].
[90] Brian J Lee, Ronald D Watkins, Chen-Ming Chang, and Craig S Levin. Low eddy current RF shielding enclosure designs for 3T MR applications. Magnetic Resonance in Medicine, 79(3):1745–1752, 2018.
[91] Kyoung-Min Lee, Annie P Lai, James Brodale, and Arthur Jampolsky. Sideslip of the medial rectus muscle during vertical eye rotation. Investigative Ophthalmology & Visual Science, 48(10):4527–4533, 2007.
[92] Bin Li, Zheng Zhou, Weixia Zou, Xuebin Sun, and Guanglong Du. On the efficient beamforming training for 60 GHz wireless personal area networks. IEEE Transactions on Wireless Communications, 12(2):504–515, 2012.
[93] Chenning Li, Manni Liu, and Zhichao Cao. WiHF: Enable user identified gesture recognition with WiFi. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 586–595. IEEE, 2020.
[94] Chenning Li, Zheng Liu, Yuguang Yao, Zhichao Cao, Mi Zhang, and Yunhao Liu. Wi-Fi see it all: Generative adversarial network-augmented versatile Wi-Fi imaging. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems, pages 436–448, 2020.
[95] Tianhong Li, Lijie Fan, Mingmin Zhao, Yingcheng Liu, and Dina Katabi. Making the invisible visible: Action recognition through walls and occlusions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 872–881, 2019.
[96] Xiaopeng Li, Fengyao Yan, Fei Zuo, Qiang Zeng, and Lannan Luo. Touch well before use: Intuitive and secure authentication for IoT devices.
In The 25th Annual International Conference on Mobile Computing and Networking, MobiCom '19, New York, NY, USA, 2019. Association for Computing Machinery.
[97] Yang Li, Ting Liu, Jing Jiang, and Liang Zhang. Hashtag recommendation with topical attention-based LSTM. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3019–3029, 2016.
[98] Zhengxiong Li, Fenglong Ma, Aditya Singh Rathore, Zhuolin Yang, Baicheng Chen, Lu Su, and Wenyao Xu. WaveSpy: Remote and through-wall screen attack via mmWave sensing. In 2020 IEEE Symposium on Security and Privacy (SP), pages 217–232. IEEE, 2020.
[99] Jaime Lien, Nicholas Gillian, M Emre Karagozler, Patrick Amihood, Carsten Schwesig, Erik Olson, Hakim Raja, and Ivan Poupyrev. Soli: Ubiquitous gesture sensing with millimeter wave radar. ACM Transactions on Graphics (TOG), 35(4):1–19, 2016.
[100] Sun Hong Lim, Sunwoo Kim, Byonghyo Shim, and Jun Won Choi. Deep learning-based beam tracking for millimeter-wave communications under mobility. IEEE Transactions on Communications, 69(11):7458–7469, 2021.
[101] Weiyao Lin, Ming-Ting Sun, Radha Poovandran, and Zhengyou Zhang. Human activity recognition for video surveillance. In 2008 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2737–2740. IEEE, 2008.
[102] Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to diagnose with LSTM recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.
[103] Jialin Liu, Dong Li, Lei Wang, and Jie Xiong. BlinkListener: "Listen" to your eye blink using your smartphone. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(2):1–27, 2021.
[104] Mengxi Liu, Sizhen Bian, and Paul Lukowicz. Non-contact, real-time eye blink detection with capacitive sensing. In Proceedings of the 2022 ACM International Symposium on Wearable Computers, pages 49–53, New York, NY, USA, 2022. Association for Computing Machinery.
[105] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[106] Steven Loria. TextBlob: Simplified Text Processing. https://textblob.readthedocs.io/en/dev/, April 2024. [Online; accessed 17-April-2024].
[107] Lina Ma, Yangtao Ye, Changzhan Gu, and Junfa Mao. High-accuracy contactless detection of eyes' activities based on short-range radar sensing. In 2022 IEEE MTT-S International Microwave Biomedical Conference (IMBioC), pages 266–268, Piscataway, NJ, USA, 2022. IEEE.
[108] Yongsen Ma, Gang Zhou, Shuangquan Wang, Hongyang Zhao, and Woosub Jung. SignFi: Sign language recognition using WiFi. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(1):1–21, 2018.
[109] George R MacCartney, Sijia Deng, and Theodore S Rappaport. Indoor office plan environment and layout-based mmWave path loss models for 28 GHz and 73 GHz. In Proceedings of the IEEE 83rd Vehicular Technology Conference (VTC Spring), pages 1–6. IEEE, 2016.
[110] Aamir Mahmood, Muhammad Ikram Ashraf, Mikael Gidlund, Johan Torsner, and Joachim Sachs. Time synchronization in 5G wireless edge: Requirements and solutions for critical-MTC. IEEE Communications Magazine, 57(12):45–51, 2019.
[111] Christos Masouros and Gan Zheng. Exploiting known interference as green signal power for downlink beamforming optimization. IEEE Transactions on Signal Processing, 63(14):3628–3640, 2015.
[112] I McCowan, D Moore, J Dines, D Gatica-Perez, M Flynn, P Wellner, and H Bourlard.
On the use of information retrieval measures for speech recognition. Technical report, IDIAP Research Institute, Martigny, Switzerland, 2005.
[113] Jess McIntosh, Asier Marzo, Mike Fraser, and Carol Phillips. EchoFlex: Hand gesture recognition using ultrasound imaging. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 1923–1934, 2017.
[114] Jonas Mockus. Bayesian Approach to Global Optimization: Theory and Applications, volume 37. Springer Science & Business Media, 2012.
[115] Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. arXiv preprint arXiv:1610.09513, 2016.
[116] Duy HN Nguyen, Long Bao Le, Tho Le-Ngoc, and Robert W Heath. Hybrid MMSE precoding and combining designs for mmWave multiuser systems. IEEE Access, 5:19167–19181, 2017.
[117] Thomas Nitsche, Carlos Cordeiro, Adriana B Flores, Edward W Knightly, Eldad Perahia, and Joerg C Widmer. IEEE 802.11ad: Directional 60 GHz communication for multi-gigabit-per-second Wi-Fi. IEEE Communications Magazine, 52(12):132–141, 2014.
[118] Thomas Nitsche, Adriana B Flores, Edward W Knightly, and Joerg Widmer. Steering with eyes closed: mm-wave beam steering without in-band measurement. In Proceedings of IEEE Conference on Computer Communications (INFOCOM), pages 2416–2424. IEEE, 2015.
[119] Kai Niu, Fusang Zhang, Jie Xiong, Xiang Li, Enze Yi, and Daqing Zhang. Boosting fine-grained activity sensing by embracing wireless multipath effects. In Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies, pages 139–151, 2018.
[120] Song Noh, Michael D Zoltowski, and David J Love. Multi-resolution codebook and adaptive beamforming sequence design for millimeter wave beam alignment. IEEE Transactions on Wireless Communications, 16(9):5689–5701, 2017.
[121] Michele Polese, Francesco Restuccia, and Tommaso Melodia. DeepBeam: Deep waveform learning for coordination-free beam management in mmWave networks. In Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pages 61–70, 2021.
[122] Swadhin Pradhan, Eugene Chai, Karthikeyan Sundaresan, Lili Qiu, Mohammad A Khojastepour, and Sampath Rangarajan. RIO: A pervasive RFID-based touch gesture interface. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, pages 261–274, 2017.
[123] Qifan Pu, Sidhant Gupta, Shyamnath Gollakota, and Shwetak Patel. Whole-home gesture recognition using wireless signals. In Proceedings of the 19th Annual International Conference on Mobile Computing & Networking, pages 27–38, 2013.
[124] Joaquin Quinonero-Candela, Carl Edward Rasmussen, and Christopher KI Williams. Approximation methods for Gaussian process regression. In Large-Scale Kernel Machines, pages 203–223. MIT Press, 2007.
[125] Colin Raffel and Daniel PW Ellis. Feed-forward networks with attention can solve some long-term memory problems. arXiv preprint arXiv:1512.08756, 2015.
[126] Rima-Maria Rahal and Susann Fiedler. Understanding cognitive and affective mechanisms in social psychology through eye-tracking. Journal of Experimental Social Psychology, 85:103842, 2019.
[127] Dinesh Ramasamy, Sriram Venkateswaran, and Upamanyu Madhow. Compressive tracking with 1000-element arrays: A framework for multi-Gbps mmWave cellular downlinks.
In Proceedings of 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 690–697. IEEE, 2012.
[128] Maryam Eslami Rasekh, Zhinus Marzi, Yanzi Zhu, Upamanyu Madhow, and Haitao Zheng. Noncoherent mmWave path tracking. In Proceedings of the 18th International Workshop on Mobile Computing Systems and Applications, pages 13–18, 2017.
[129] Mattia Rebato, Jihong Park, Petar Popovski, Elisabeth De Carvalho, and Michele Zorzi. Stochastic geometric coverage analysis in mmWave cellular networks with realistic channel and antenna radiation models. IEEE Transactions on Communications, 67(5):3736–3752, 2019.
[130] Sajad Rezaie, Elisabeth De Carvalho, and Carles Navarro Manchón. A deep learning approach to location- and orientation-aided 3D beam selection for mmWave communications. IEEE Transactions on Wireless Communications, 21(12):11110–11124, 2022.
[131] Wonil Roh, Ji-Yun Seol, Jeongho Park, Byunghwan Lee, Jaekon Lee, Yungsoo Kim, Jaeweon Cho, Kyungwhoon Cheun, and Farshid Aryanfar. Millimeter-wave beamforming as an enabling technology for 5G cellular communications: Theoretical feasibility and prototype results. IEEE Communications Magazine, 52(2):106–113, 2014.
[132] Prasun Roy, Subhankar Ghosh, and Umapada Pal. A CNN based framework for unistroke numeral recognition in air-writing. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 404–409. IEEE, 2018.
[133] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[134] Cian Ryan, Brian O'Sullivan, Amr Elrasad, Aisling Cahill, Joe Lemley, Paul Kielty, Christoph Posch, and Etienne Perot. Real-time face & eye tracking and blink detection using event cameras. Neural Networks, 141:87–97, 2021.
[135] Batool Salehi, Mauro Belgiovine, Sara Garcia Sanchez, Jennifer Dy, Stratis Ioannidis, and Kaushik Chowdhury. Machine learning on camera images for fast mmWave beamforming. In IEEE 17th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), pages 338–346. IEEE, 2020.
[136] Science Daily. Eye muscles are resilient to ALS. https://tinyurl.com/cy4kxhv2, September 2024. [Online; accessed 11-September-2024].
[137] Sangho Shin and Henning Schulzrinne. Measurement and analysis of the VoIP capacity in IEEE 802.11 WLAN. IEEE Transactions on Mobile Computing, 8(9):1265–1279, 2009.
[138] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012.
[139] Elahe Soltanaghaei, Avinash Kalyanaraman, and Kamin Whitehouse. Multipath triangulation: Decimeter-level WiFi localization and orientation with a single unaided receiver. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, pages 376–388, 2018.
[140] Youngwook Son, Seongwon Kim, Seongho Byeon, and Sunghyun Choi. Symbol timing synchronization for uplink multi-user transmission in IEEE 802.11ax WLAN. IEEE Access, 6:72962–72977, 2018.
[141] Jiho Song, Junil Choi, and David J Love. Common codebook millimeter wave beam design: Designing beams for both sounding and communication with uniform planar arrays. IEEE Transactions on Communications, 65(4):1859–1872, 2017.
[142] Kunzhe Song, Qijun Wang, Shichen Zhang, and Huacheng Zeng. SiWis: Fine-grained human detection using single WiFi device.
In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, pages 1439–1454, New York, NY, USA, 2024. Association for Computing Machinery.
[143] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[144] Nielen Stander and KJ Craig. On the robustness of a simple domain reduction scheme for simulation-based optimization. Engineering Computations, 19(4):431–450, 2002.
[145] Statista. Number of Internet of Things (IoT) connected devices worldwide from 2019 to 2033. https://www.statista.com/statistics/1194682/iot-connected-devices-vertically/. [Online; accessed 08-May-2025].
[146] Yusuke Sugano, Yasuyuki Matsushita, and Yoichi Sato. Learning-by-synthesis for appearance-based 3D gaze estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1821–1828, Piscataway, NJ, USA, 2014. IEEE.
[147] Li Sun, Souvik Sen, Dimitrios Koutsonikolas, and Kyu-Han Kim. WiDraw: Enabling hands-free drawing in the air on commodity WiFi devices. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, pages 77–89, 2015.
[148] Sanjib Sur, Ioannis Pefkianakis, Xinyu Zhang, and Kyu-Han Kim. WiFi-assisted 60 GHz wireless networks. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, pages 28–41, 2017.
[149] Sanjib Sur, Ioannis Pefkianakis, Xinyu Zhang, and Kyu-Han Kim. Towards scalable and ubiquitous millimeter-wave wireless networks. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 257–271, 2018.
[150] Sanjib Sur, Xinyu Zhang, Parmesh Ramanathan, and Ranveer Chandra. BeamSpy: Enabling robust 60 GHz links under blockage. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI '16), pages 193–206, 2016.
[151] Koh Tadokoro, Toru Yamashita, Yusuke Fukui, Emi Nomura, Yasuyuki Ohta, Setsuko Ueno, Saya Nishina, Keiichiro Tsunoda, Yosuke Wakutani, Yoshiki Takao, et al. Early detection of cognitive decline in mild cognitive impairment and Alzheimer's disease with a novel eye tracking test. Journal of the Neurological Sciences, 427:117529, 2021.
[152] Tzu-Chun Tai, Kate Ching-Ju Lin, and Yu-Chee Tseng. Toward reliable localization by unequal AoA tracking. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys '19, page 444–456, New York, NY, USA, 2019. Association for Computing Machinery.
[153] Hanqing Tao, Shiwei Tong, Hongke Zhao, Tong Xu, Binbin Jin, and Qi Liu. A radical-aware attention-based model for Chinese text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5125–5132, 2019.
[154] Hans-Leo HM Teulings and Arnold JWM Thomassen. Computer-aided analysis of handwriting movements. Visible Language, 13(3):218, 1979.
[155] Texas Instruments. TI AWR6843. https://www.ti.com/tool/AWR6843ISK, September 2024. [Online; accessed 11-September-2024].
[156] Tobii. Tobii Pro Spectrum. https://www.tobii.com, September 2024. [Online; accessed 11-September-2024].
[157] Y Ming Tsang, Ada SY Poon, and Sateesh Addepalli. Coding the beams: Improving beamforming training in mmWave communication system. In Proceedings of IEEE Global Telecommunications Conference - GLOBECOM 2011, pages 1–6. IEEE, 2011.
[158] Deepak Vasisht, Swarun Kumar, and Dina Katabi. Decimeter-level localization with a single WiFi access point. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 165–178, Santa Clara, CA, March 2016. USENIX Association.
[159] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS'17, page 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc.
[160] Raghav H Venkatnarayan, Griffin Page, and Muhammad Shahzad. Multi-user gesture recognition using WiFi. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, pages 401–413, 2018.
[161] Chao Wang, Feng Lin, Zhongjie Ba, Fan Zhang, Wenyao Xu, and Kui Ren. Wavesdropper: Through-wall word detection of human speech via commercial mmWave devices. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 6(2):1–26, 2022.
[162] Chuyu Wang, Lei Xie, Yuancan Lin, Wei Wang, Yingying Chen, Yanling Bu, Kai Zhang, and Sanglu Lu. Thru-the-wall eavesdropping on loudspeakers via RFID by capturing sub-mm level vibration. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 5(4):1–25, 2021.
[163] Fei Wang, Sanping Zhou, Stanislav Panev, Jinsong Han, and Dong Huang. Person-in-WiFi: Fine-grained person perception using WiFi. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5452–5461, 2019.
[164] Hao Wang, Daqing Zhang, Yasha Wang, Junyi Ma, Yuxiang Wang, and Shengjie Li. RT-Fall: A real-time and contactless fall detection system with commodity WiFi devices. IEEE Transactions on Mobile Computing, 16(2):511–526, 2016.
[165] Jue Wang, Deepak Vasisht, and Dina Katabi. RF-IDraw: Virtual touch screen in the air using RF signals. ACM SIGCOMM Computer Communication Review, 44(4):235–246, 2014.
[166] Saiwen Wang, Jie Song, Jaime Lien, Ivan Poupyrev, and Otmar Hilliges. Interacting with Soli: Exploring fine-grained dynamic gesture recognition in the radio-frequency spectrum. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, pages 851–860, 2016.
[167] Xuyu Wang, Chao Yang, and Shiwen Mao. PhaseBeat: Exploiting CSI phase data for vital sign monitoring with commodity WiFi devices. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 1230–1239. IEEE, 2017.
[168] Yao Wang, Wandong Cai, Tao Gu, and Wei Shao. Your eyes reveal your secrets: An eye movement based password inference on smartphone. IEEE Transactions on Mobile Computing, 19(11):2714–2730, 2019.
[169] Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615, 2016.
[170] Yong Wang, Yuhong Shu, and Mu Zhou. A novel eye blink detection method using frequency modulated continuous wave radar. In 2021 IEEE International Workshop on Electromagnetics: Applications and Student Innovation Competition (iWEM), pages 1–3, Piscataway, NJ, USA, 2021. IEEE.
[171] Yuxi Wang, Kaishun Wu, and Lionel M Ni. WiFall: Device-free fall detection by wireless networks. IEEE Transactions on Mobile Computing, 16(2):581–594, 2016.
[172] Zhongqin Wang, Fu Xiao, Ning Ye, Ruchuan Wang, and Panlong Yang. A see-through-wall system for device-free human motion sensing based on battery-free RFID.
ACM Transactions on Embedded Computing Systems (TECS), 17(1):1–21, 2017.
[173] Teng Wei, Shu Wang, Anfu Zhou, and Xinyu Zhang. Acoustic eavesdropping through wireless vibrometry. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, pages 130–141, 2015.
[174] Teng Wei and Xinyu Zhang. mTrack: High-precision passive tracking using millimeter wave radios. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, pages 117–129, 2015.
[175] Zhiqing Wei, Yuan Wang, Liang Ma, Shaoshi Yang, Zhiyong Feng, Chengkang Pan, Qixun Zhang, Yajuan Wang, Huici Wu, and Ping Zhang. 5G PRS-based sensing: A sensing reference signal approach for joint sensing and communication system. IEEE Transactions on Vehicular Technology, 72(3):3250–3263, 2022.
[176] Wikipedia. Substitution cipher. https://en.wikipedia.org/wiki/Substitution_cipher. [Online; accessed 13-June-2021].
[177] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.
[178] Genta Indra Winata, Onno Pepijn Kampman, and Pascale Fung. Attention-based LSTM for psychological stress detection from spoken language using distant supervision. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6204–6208. IEEE, 2018.
[179] Chenshu Wu, Feng Zhang, Yusen Fan, and KJ Ray Liu. RF-based inertial measurement. In Proceedings of the ACM Special Interest Group on Data Communication, pages 117–129, 2019.
[180] Zhenyu Xiao, Lipeng Zhu, Zhen Gao, Dapeng Oliver Wu, and Xiang-Gen Xia. User fairness non-orthogonal multiple access (NOMA) for millimeter-wave communications with analog beamforming. IEEE Transactions on Wireless Communications, 18(7):3411–3423, 2019.
[181] Jiahong Xie, Hao Kong, Jiadi Yu, Yingying Chen, Linghe Kong, Yanmin Zhu, and Feilong Tang. mm3DFace: Nonintrusive 3D facial reconstruction leveraging mmWave signals. In Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services, pages 462–474, New York, NY, USA, 2023. Association for Computing Machinery.
[182] Yaxiong Xie, Zhenjiang Li, and Mo Li. Precise power delay profiling with commodity WiFi. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, MobiCom '15, page 53–64, New York, NY, USA, 2015. ACM.
[183] Yaxiong Xie, Jie Xiong, Mo Li, and Kyle Jamieson. mD-Track: Leveraging multi-dimensionality for passive indoor Wi-Fi tracking. In The 25th Annual International Conference on Mobile Computing and Networking, MobiCom '19, New York, NY, USA, 2019. Association for Computing Machinery.
[184] Yaxiong Xie, Jie Xiong, Mo Li, and Kyle Jamieson. mD-Track: Leveraging multi-dimensionality for passive indoor Wi-Fi tracking. In The 25th Annual International Conference on Mobile Computing and Networking, pages 1–16, 2019.
[185] Jie Xiong and Kyle Jamieson. ArrayTrack: A fine-grained indoor location system. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 71–84, Lombard, IL, April 2013. USENIX Association.
[186] Hao Xue, Du Q Huynh, and Mark Reynolds. SS-LSTM: A hierarchical LSTM model for pedestrian trajectory prediction. In Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1186–1194. IEEE, 2018.
[187] Hongfei Xue, Wenjun Jiang, Chenglin Miao, Fenglong Ma, Shiyang Wang, Ye Yuan, Shuochao Yao, Aidong Zhang, and Lu Su.
DeepMV: Multi-view deep learning for device-free human activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–26, 2020.
[188] Hongfei Xue, Yan Ju, Chenglin Miao, Yijiang Wang, Shiyang Wang, Aidong Zhang, and Lu Su. mmMesh: Towards 3D real-time dynamic human mesh construction using millimeter-wave. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 269–282, 2021.
[189] Zhenyu Yan, Qun Song, Rui Tan, Yang Li, and Adams Wai Kin Kong. Towards touch-to-access device authentication using induced body electric potentials. In The 25th Annual International Conference on Mobile Computing and Networking, MobiCom '19, New York, NY, USA, 2019. Association for Computing Machinery.
[190] Edwin Yang, Qiuye He, and Song Fang. Wink: Wireless inference of numerical keystrokes via zero-training spatiotemporal analysis. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 3033–3047, 2022.
[191] Lei Yang, Qiongzheng Lin, Xiangyang Li, Tianci Liu, and Yunhao Liu. See through walls with COTS RFID system! In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, pages 487–499, 2015.
[192] Xi Yang, Michail Matthaiou, Jie Yang, Chao-Kai Wen, Feifei Gao, and Shi Jin. Hardware-constrained millimeter-wave systems for 5G: Challenges, opportunities, and solutions. IEEE Communications Magazine, 57(1):44–50, 2019.
[193] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.
[194] Wenfang Yuan, Simon MD Armour, and Angela Doufexi. An efficient and low-complexity beam training technique for mmWave communication. In 2015 IEEE 26th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), pages 303–308. IEEE, 2015.
[195] Shichao Yue, Yuzhe Yang, Hao Wang, Hariharan Rahul, and Dina Katabi. BodyCompass: Monitoring sleep posture with wireless signals. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(2):1–25, 2020.
[196] H Zamani, A Abas, and MKM Amin. Eye tracking application on emotion analysis for marketing strategy. Journal of Telecommunication, Electronic and Computer Engineering (JTEC), 8(11):87–91, 2016.
[197] Jiansong Zhang, Zeyu Wang, Zhice Yang, and Qian Zhang. Proximity based IoT device authentication. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pages 1–9, 2017.
[198] L. Zhang, J. Wang, Q. Gao, X. Li, M. Pan, and Y. Fang. LetFi: Letter recognition in the air using CSI. In 2018 IEEE Global Communications Conference (GLOBECOM), pages 1–6, 2018.
[199] Shichen Zhang, Bo Ji, Kai Zeng, and Huacheng Zeng. Realizing uplink MU-MIMO communication in mmWave WLANs: Bayesian optimization and asynchronous transmission. In IEEE INFOCOM 2023 - IEEE Conference on Computer Communications, pages 1–10. IEEE, 2023.
[200] Shichen Zhang, Pedram Kheirkhah Sangdeh, Hossein Pirayesh, Huacheng Zeng, Qiben Yan, and Kai Zeng. AuthIoT: A transferable wireless authentication scheme for IoT devices without input interface. IEEE Internet of Things Journal, 9(22):23072–23085, 2022.
[201] Shichen Zhang, Qijun Wang, Maolin Gan, Zhichao Cao, and Huacheng Zeng.
RadSee: See your handwriting through walls using FMCW radar. In Network and Distributed System Security (NDSS) Symposium, 2025.
[202] Shichen Zhang, Qijun Wang, Kunzhe Song, Qiben Yan, and Huacheng Zeng. RadEye: Tracking eye motion using FMCW radar. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–13, 2025.
[203] Shichen Zhang, Qiben Yan, Tianxing Li, Li Xiao, and Huacheng Zeng. TBP: Temporal beam prediction for mobile millimeter-wave networks. IEEE Internet of Things Journal, 2024.
[204] Tianfang Zhang, Zhengkun Ye, Ahmed Tanvir Mahdad, Md Mojibur Rahman Redoy Akanda, Cong Shi, Yan Wang, Nitesh Saxena, and Yingying Chen. FaceReader: Unobtrusively mining vital signs and vital sign embedded sensitive info via AR/VR motion sensors. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 446–459, New York, NY, USA, 2023. Association for Computing Machinery.
[205] Xi Zhang, Yu Zhang, Zhenguo Shi, and Tao Gu. mmFER: Millimetre-wave radar based facial expression recognition for multimedia IoT applications. In Proceedings of the 29th Annual International Conference on Mobile Computing and Networking, pages 1–15, New York, NY, USA, 2023. Association for Computing Machinery.
[206] Xiaotong Zhang, Zhenjiang Li, and Jin Zhang. Synthesized millimeter-waves for human motion sensing. In Proceedings of the Twentieth ACM Conference on Embedded Networked Sensor Systems (SenSys), page 377–390, 2022.
[207] Xinze Zhang, Walid Brahim, Mingyang Fan, Jianhua Ma, Muxin Ma, and Alex Qi. Radar-based eyeblink detection under various conditions. In Proceedings of the 2023 12th International Conference on Software and Computer Applications, pages 177–183, New York, NY, USA, 2023. Association for Computing Machinery.
[208] Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. Appearance-based gaze estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4511–4520, Piscataway, NJ, USA, 2015. IEEE.
[209] Mingmin Zhao, Tianhong Li, Mohammad Abu Alsheikh, Yonglong Tian, Hang Zhao, Antonio Torralba, and Dina Katabi. Through-wall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7356–7365, 2018.
[210] Mingmin Zhao, Yingcheng Liu, Aniruddh Raghu, Tianhong Li, Hang Zhao, Antonio Torralba, and Dina Katabi. Through-wall human mesh recovery using radio signals. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10113–10122, 2019.
[211] Mingmin Zhao, Yonglong Tian, Hang Zhao, Mohammad Abu Alsheikh, Tianhong Li, Rumen Hristov, Zachary Kabelac, Dina Katabi, and Antonio Torralba. RF-based 3D skeletons. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 267–281, 2018.
[212] Renjie Zhao, Timothy Woodford, Teng Wei, Kun Qian, and Xinyu Zhang. M-Cube: A millimeter-wave massive MIMO software radio. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–14, 2020.
[213] Yue Zheng, Yi Zhang, Kun Qian, Guidong Zhang, Yunhao Liu, Chenshu Wu, and Zheng Yang. Zero-effort cross-domain gesture recognition with Wi-Fi. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, pages 313–325, 2019.
[214] Anfu Zhou, Xinyu Zhang, and Huadong Ma. Beam-forecast: Facilitating mobile 60 GHz networks via model-driven beam steering.
In Proceedings of IEEE Conference on Computer Communications (INFOCOM), pages 1–9. IEEE, 2017.
[215] Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. Attention-based LSTM network for cross-lingual sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 247–256, 2016.
[216] Wangjiang Zhu and Haoping Deng. Monocular free-head 3D gaze tracking with deep learning and geometry constraints. In Proceedings of the IEEE International Conference on Computer Vision, pages 3143–3152, Piscataway, NJ, USA, 2017. IEEE.