EXPLORING SPATIAL-TEMPORAL MULTI-DIMENSIONS IN OPTICAL WIRELESS COMMUNICATION AND SENSING By Xiao Zhang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy 2023

ABSTRACT

Optical Wireless Communication (OWC) techniques are potential alternatives for next-generation wireless communication. These techniques, such as VLC (visible light communication), OCC (optical camera communication), Li-Fi, FSOC (free space optical communication), and LiDAR, are increasingly deployed in our daily lives. However, current OWC approaches are constrained by slow speeds and limited usage cases. The primary goal of this thesis is to boost the data rate of OWC with extended use scenarios and to enable optical wireless sensing by exploiting the potential on both the transmitter and receiver sides through effective, purpose-built strategies. We investigate the possibilities of various spatial-temporal dimensions (e.g., 1D, 2D, 3D, and 4D) as follows.

1D Temporal Optical Wireless Communication. We found that compensation symbols, which are commonly used for fine-grained dimming, are not used for data transmission in OOK-based LiFi for indoor lighting and communication. We exploit compensation symbols in the 1D temporal dimension to address the conflict between fine-grained dimming and transmission. We demonstrate the LiFOD framework, which can be installed on commercial off-the-shelf (COTS) Li-Fi systems to increase their data rate. We utilize compensation symbols, which were previously only used for dimming, to carry data bits (bit patterns) for enhanced throughput.

2D Spatial-Temporal Optical Wireless Communication. In our study of camera-based OWC (i.e., optical camera communication), we first investigate 2D rolling blocks in the camera imaging process rather than 1D rolling strips for improved optical symbol modulation and data rate. Our proposed RainbowRow overcomes the limitation of restricted frequency responses (i.e., tens of Hz) in traditional optical camera communication. We implement low-cost RainbowRow prototypes with adaptations for both indoor office and vehicular networks. The results demonstrate that RainbowRow achieves a 20× data rate improvement compared to existing LED-OCC systems.

3D Spatial Optical Wireless Communication. Compared to existing acoustic and RF-based approaches, underwater optical wireless communication appears promising due to its broad bandwidth and extended communication range. Existing optical tags (bar/QR codes) embed data in a single plane with limited symbol distance and scanning angles. To address this limitation, we exploit 3D spatial diversity to design passive optical tags for simple and robust underwater navigation. We also develop underwater denoising algorithms with CycleGAN, CNN-based relative positioning, and real-time data parsing. The experiments demonstrate that our U-Star system can provide robust self-served underwater navigation guidance.

3D Spatial Optical Wireless Sensing. Vision-based approaches to hand gesture reconstruction rely on time-consuming image processing and adopt a low location sampling rate (frame rate) of about 60 Hz. To overcome this limitation, we propose RoFin, which first exploits 6 spatial-temporal 2D rolling fingertips for real-time 20-joint hand pose reconstruction. RoFin designs active optical labeling for a large number of fingers with fine-grained finger tracking.
These features enable great potential for enhanced multi-user HCI and virtual writing, especially for Parkinson's patients. We implement RoFin gloves fitted with single-colored LED nodes and use commercial cameras as receivers.

4D Spatial-Temporal Optical Wireless Integrated Sensing and Communication. Existing centralized radio frequency control from base stations faces mutual interference and high latency, which cause localization errors. To avoid delay-induced localization errors, we explore optical camera communication for on-site pose parsing of drones. We exploit 4D spatial-temporal diversity (i.e., 3D spatial and 1D temporal diversities) for integrated sensing and communication. We propose PoseFly, an AI-assisted OCC framework with integrated drone identification, on-site localization, quick-link communication, and lighting functions for swarming drones.

The variety of applications in many contexts demonstrates OWC's potential and usefulness as a foundation for next-generation wireless technology. By leveraging multiple dimensions of spatial-temporal diversity, we successfully overcome several limitations of current OWC systems, delivering critical insights and discoveries for the future of optical wireless communication.

Copyright by XIAO ZHANG 2023

ACKNOWLEDGEMENTS

First of all, I would like to say thanks to my advisor Prof. Li Xiao, who gave me invaluable advice and strong support during my Ph.D. journey. She is a great mentor both mentally and technically, and I learned a lot from her. Without Dr. Xiao's patience, support, and guidance, I would not have completed this dissertation. I would also like to express my thanks to my guidance committee members Prof. Matt Mutka, Prof. Tianxing Li, and Prof. Xiaobo Tan for their guidance. As always, my parents are my strongest supporters both mentally and physically. I would like to express my greatest appreciation to my parents, Mr. Hongbao Zhang and Mrs. Liangping Zhu, who give me love unconditionally. I would also like to thank my brother, Mr. Xin Zhang, for his support. Without them, I could not have finished my doctoral degree. I am also grateful to my dear friends Dr. Jie Huang, Mrs. Xuting Zou, Dr. Eakachai Kantawong, Mr. Baobing Lei, and Mr. Yong Lei, who treat me as a family member with sincerity, care, and love. I appreciate them for sharing my happiness and sadness. I would like to thank members of ELANs and other mates, Prof. Yunhao Liu, Prof. Guanhua Tu, Prof. Qiben Yan, Prof. Zhichao Cao, Prof. Charles Ofria, Masoud, James, Yiwen, Hanqing, Griffin, Kanishka, Manni, Chenning, Li, Lingkun, Nick, Jianzhi, Yuanda, Juexing, Guangjing, Bocheng, Ce, Xinyu, Tian, Jingwen, Ao, Yang, Shenghong, Shuqi, Yuzhao, Ming, Wei, and Yan. I would also like to thank our department chair Prof. Abdol Esfahanian, Prof. Sandeep Kulkarni, Prof. Colbry Katy, and colleagues Brenda, Vincent, and Amy. Also, I would like to thank my academic brothers Dr. Pei Huang, Dr. Chin-Jung Liu, Dr. Ruofeng Liu, Dr. Yan Pan, and Dr. Yan Yan for their selfless help. I am also grateful to my master's advisor Prof. Shining Li and other professors, Prof. Zhe Yang, Prof. Yu Zhang, and Prof. Zhigang Li, who inspired me to explore the wireless world. Finally, I would like to thank others who directly or indirectly offered help to me. The projects in this thesis are partially supported by the U.S. National Science Foundation under Grants CNS-2226888, CCF-2007159, and CNS-1617412. As for the remaining errors or deficiencies in this work, the responsibility rests entirely upon the author.
TABLE OF CONTENTS

LIST OF ABBREVIATIONS .......................................................... viii
CHAPTER 1 INTRODUCTION AND MOTIVATION .......................................... 1
    1.1 OWC Background ........................................................... 2
    1.2 Comparisons between Optical and RF Medium ................................ 3
    1.3 Problems in Existing OWC and Our Solutions ............................... 4
    1.4 Dissertation Organization ................................................ 10
CHAPTER 2 LIGHTING EXTRA DATA VIA 1D TEMPORAL DIVERSITY ........................ 11
    2.1 Motivation ............................................................... 11
    2.2 Background and Related Work .............................................. 13
    2.3 Our Approach: LiFOD ...................................................... 16
    2.4 Bit Pattern Discovery .................................................... 19
    2.5 Fine-grained Dimming via CS .............................................. 26
    2.6 Robust Decoding of CS .................................................... 29
    2.7 Implementation and Evaluation ............................................ 33
    2.8 Discussion and Summary ................................................... 41
CHAPTER 3 BOOSTING OCC VIA 2D SPATIAL-TEMPORAL DIVERSITIES ..................... 42
    3.1 Motivation ............................................................... 42
    3.2 Background and Related Work .............................................. 46
    3.3 Our Approach: RainbowRow ................................................. 52
    3.4 2D Rolling Blocks Modeling ............................................... 53
    3.5 Optical Imaging Management ............................................... 60
    3.6 Use Case Adaptations ..................................................... 65
    3.7 Implementation and Evaluation ............................................ 69
    3.8 Discussion and Summary ................................................... 77
CHAPTER 4 3D SPATIAL DIVERSITIES ENABLED UNDERWATER NAVIGATION ................. 79
    4.1 Motivation ............................................................... 79
    4.2 Background and Related Work .............................................. 82
    4.3 Our Approach: U-Star ..................................................... 86
    4.4 Passive 3D Optical Tag ................................................... 89
    4.5 Underwater Positioning ................................................... 93
    4.6 AI-based Mobile Tag Reader ............................................... 95
    4.7 Implementation and Evaluation ............................................ 102
    4.8 Discussion and Summary ................................................... 115
CHAPTER 5 HAND POSE RECONSTRUCTION VIA 3D SPATIAL DIVERSITIES .................. 118
    5.1 Motivation ............................................................... 118
    5.2 Background and Related Work .............................................. 120
    5.3 Our Approach: RoFin ...................................................... 122
    5.4 Active Optical Labeling .................................................. 125
    5.5 3D Spatial Parsing ....................................................... 129
    5.6 Hand Pose Reconstructing ................................................. 134
    5.7 Implementation and Evaluation ............................................ 138
    5.8 Discussion and Summary ................................................... 148
CHAPTER 6 4D SPATIAL-TEMPORAL DIVERSITIES IN SWARMING DRONES ................... 150
    6.1 Motivation ............................................................... 150
    6.2 Background and Related Work .............................................. 152
    6.3 Our Approach: PoseFly .................................................... 154
    6.4 Drone Identification ..................................................... 157
    6.5 Drone Localization ....................................................... 160
    6.6 Drone Quick-Link ......................................................... 163
    6.7 Implementation and Evaluation ............................................ 165
    6.8 Discussion and Summary ................................................... 174
CHAPTER 7 CONCLUSION AND FUTURE WORK ........................................... 176
    7.1 Conclusion ............................................................... 176
    7.2 Ongoing Work ............................................................. 177
    7.3 Future Work .............................................................. 178
BIBLIOGRAPHY ................................................................... 180

LIST OF ABBREVIATIONS

OWC      Optical Wireless Communication
OCC      Optical Camera Communication
LiFi     Light Fidelity
VLC      Visible Light Communication
LiDAR    Light Detection and Ranging
FSOC     Free Space Optical Communication
RF       Radio Frequency
LED      Light Emitting Diode
LD       Laser Diode
PD       Photodiode
LCD      Liquid Crystal Display
CNN      Convolutional Neural Network
DNN      Deep Neural Network
PWM      Pulse Width Modulation
ESP      Effective Subcarrier Pairing
LiFOD    Lighting Extra Data via Fine-grained OWC Dimming
RBR      Rainbow Rows
U-Star   Underwater Stars
RoFin    Rolling Fingertips
PoseFly  Pose parsing of Flying drones
HotSys   Holographic Optical Tag based Systems
AR       Augmented Reality
VR       Virtual Reality
MR       Mixed Reality
XR       AR, VR, and MR
CS       Compensation Symbols
FPS      Frames Per Second
FOV      Field of View
UAV      Unmanned Aerial Vehicle
V2X      Vehicle to Everything
CBS      Centralized Base Station
IMU      Inertial Measurement Unit
LoS      Line-of-Sight
NLoS     Non-Line-of-Sight
OOK      On-Off Keying
MPPM     Multiple-Pulse-Position Modulation
CSK      Color Shift Keying
VPPM     Variable Pulse Position Modulation
PRU      Programmable Real-time Unit
BBB      BeagleBone Black
UOID     Underwater Optical Identification
HCI      Human Computer Interaction
AI       Artificial Intelligence
CV       Computer Vision

CHAPTER 1 INTRODUCTION AND MOTIVATION

Optical Wireless Communication (OWC) has emerged as a compelling alternative to existing Radio Frequency wireless communication thanks to its broad bandwidth, making it a strong contender for the next generation of wireless communication. The high On/Off switching speed of LEDs enables them to serve as efficient high-speed OWC transmitters, allowing for both fast communication and effective lighting in our everyday scenarios. As for OWC receivers, there are two distinct types. The first type is a single-pixel device known as a photodiode (PD). The second type consists of cameras with millions of pixels.
However, current OWC systems mainly focus on point-to-point communication, such as the LiFi system, and do not fully harness the potential of high-dimensional spatial-temporal diversities. This limitation hinders the data throughput of OWC, especially for camera-based OWC applications. To address these limitations, we investigate various spatial-temporal diversities in data embedding, such as a 1D temporal dimming side-channel, 2D spatial-temporal rolling blocks, and 3D spatial diversity. Furthermore, it is challenging to uncover and define these spatial-temporal diversities. We must also deal with technical challenges in system implementation when utilizing these diversities, such as mutual interference among LEDs on both the transmitter and receiver ends, as well as denoising under a variety of ambient conditions. To better motivate our work, I will present an overview of OWC and emphasize the similarities and differences between the optical and traditional radio frequency mediums for wireless communication. Following that, I will showcase five fully developed projects where I served as the first author, focusing on harnessing innovative spatial-temporal diversities for data embedding in optical wireless communication and sensing to overcome the limitations of existing OWC systems.

1.1 OWC Background

1.1.1 OWC enabled Numerous Applications

There are various OWC technologies, as described in [15], such as VLC (Visible Light Communication), LiFi (Light Fidelity), OCC (Optical Camera Communication), FSOC (Free Space Optical Communication), and LiDAR (Light Detection and Ranging). These OWC approaches enable a wide range of applications [60, 82, 3, 116, 49]. For example, OWC techniques can be used in industry, transportation, workplaces, houses, malls, underwater, and space. Depending on the application type and the required data speed, communication type, and platform, different OWC techniques are employed. The traffic flow in optical wireless communication enabled applications is illustrated in Figure 1.1. The comparisons of different kinds of OWC scenarios are given below.

Figure 1.1 The network traffic flow in optical wireless communication and the numerous applications it enables.

1.1.2 Modulated Optical Signals for Communication

Modulation is the technique that alters the amplitude, frequency, or phase of a carrier signal to convey information during signal transmission. We introduce some conventional OWC modulations below. (1) OOK: On-off keying (OOK) modulation is the simplest form of amplitude-shift keying (ASK) modulation [2]. OOK is applied to RF carrier waves as well as optical communication systems. OOK represents digital data by the presence or absence of a carrier wave. Bit '1' is represented by the light being turned on, whereas bit '0' is represented by the light being turned off. (2) VPPM: Variable pulse position modulation (VPPM) is a modulation technology that allows for simultaneous illumination, dimming control, and communication [2]. VPPM is intended for pulse-width-based light dimming and protects against intraframe flicker. In VPPM, the pulse amplitude is always constant, and the dimming is controlled by pulse width rather than amplitude. (3) CSK: Color-shift keying (CSK) is a visible light communication intensity modulation described in the IEEE 802.15.7 standard that sends data imperceptibly by changing the color of red, green, and blue light emitting diodes [2]. The CSK symbol is produced by combining three color light sources from the seven color bands indicated in the standard. The center wavelengths of the three color bands on the xy color coordinates determine the three vertices of the CSK constellation triangle.
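To make the OOK and VPPM descriptions above concrete, the following minimal Python sketch maps a bit string to per-slot light-intensity samples. It is an illustration under assumed parameters: the bit-to-pulse-position mapping, the 50% default duty cycle, and the ten-samples-per-slot granularity are choices made here for clarity, not values taken from the IEEE 802.15.7 standard.

# Minimal, illustrative sketch of OOK and VPPM symbol shaping
# (assumed parameters and bit-to-pulse mapping; not code from the standard).

def ook_waveform(bits: str) -> list[int]:
    """OOK: bit '1' -> light ON for the whole slot, bit '0' -> light OFF."""
    return [1 if b == "1" else 0 for b in bits]

def vppm_waveform(bits: str, dimming: float = 0.5, samples_per_slot: int = 10) -> list[int]:
    """VPPM: constant pulse amplitude; data rides on pulse position and
    brightness is set by pulse width (the dimming duty cycle). Here bit '0'
    places the pulse at the start of the slot and bit '1' at the end,
    a mapping assumed only for illustration."""
    width = round(dimming * samples_per_slot)   # pulse width controls average brightness
    gap = samples_per_slot - width
    samples = []
    for b in bits:
        if b == "0":
            samples.extend([1] * width + [0] * gap)   # leading pulse
        else:
            samples.extend([0] * gap + [1] * width)   # trailing pulse
    return samples

if __name__ == "__main__":
    print(ook_waveform("1011"))                # [1, 0, 1, 1]
    print(vppm_waveform("10", dimming=0.3))    # same data, dimmer light

Lowering the dimming argument shortens every VPPM pulse, dimming the light without altering the transmitted bits; with plain OOK, by contrast, brightness and data are tied together, which is why the standard introduces compensation symbols for dimming, as discussed in Chapter 2.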
1.2 Comparisons between Optical and RF Medium

1.2.1 Physical Feature Differences

Optical radiation is electromagnetic radiation with wavelengths ranging from 100 nanometers to one millimeter. The wavelength range that the human eye can detect is referred to as visible radiation (VIS) and ranges between 400 nm and 800 nm [15]. UV light is optical radiation with wavelengths less than 400 nanometers. Infrared (IR) radiation has wavelengths greater than 800 nm. Microwave (1 mm - 1 m), VHF (1 - 10 m), HF (10 - 100 m), MF (100 - 1,000 m), LF (1 - 10 km), and VLF (10 - 100 km) waves are all examples of RF wavelengths. The bandwidth of optical waves is around 3 PHz, roughly 10,000 times greater than the bandwidth of radio waves (300 GHz). OWC necessitates a direct link between transmitter and receiver. Unlike RF transmissions, optical signals cannot flow through or around obstacles such as non-transparent objects. Light's LoS feature may provide a more secure physical layer than RF-based wireless communication. For RF signals, there are four propagation modes: (1) free space propagation, (2) direct modes (Line-of-Sight), (3) surface modes (groundwave), and (4) Non-Line-of-Sight modes. Lower-frequency radio waves can pass through obstacles like buildings and plants, but this is still considered a Line-of-Sight approach. Surface modes are radio transmissions with lower frequencies ranging from 30 to 3,000 kHz that travel as surface waves following the curvature of the Earth. Non-Line-of-Sight propagation modes include ionospheric modes, meteor scattering, auroral backscatter, sporadic-E propagation, tropospheric scattering, rain scattering, airplane scattering, and lightning scattering [41, 32].

1.2.2 Specific Advantage of Optical Signals

The performance of optical and radio frequency waves for underwater wireless communication differs as well. Two mechanisms impede light transmission in water: absorption and scattering. As a result of scattering, the quantity of photons captured by the receiver is reduced. Furthermore, in a murky underwater environment, numerous photons may arrive with delays, resulting in inter-symbol interference (ISI) [91]. RF yields extremely poor performance for long-distance underwater communication due to factors such as multi-path propagation, channel time variations, and strong signal attenuation (particularly the electromagnetic shielding effect in sea water). As a result, RF systems are constrained to the associated short link range [14]. When compared to an RF system, which necessitates power-hungry antennas and additional energy for cooling, optical wireless communication uses energy-efficient LED bulbs, and the consumed energy serves not only communication but also simultaneous lighting [31]. Thus, OWC can provide considerable energy savings. Offloading traffic from RF networks to optical networks reduces overall power consumption [14].
1.2.3 Common Features of Optical and RF

Despite their distinct physical properties, optical waves and radio frequency waves have several similarities: (1) they have the same propagation speed in air, which is far faster than that of acoustic waves; (2) they share the same upper layers in the network architecture, with the exception of differences in the Physical and MAC layers; (3) they are both essentially electromagnetic waves, i.e., transverse waves rather than longitudinal waves like sound waves; (4) mmWave in the RF spectrum propagates in a LoS way, similar to optical waves; and (5) except for the visible light (VL) band, the rest of the optical spectrum is likewise invisible, similar to RF waves.

1.3 Problems in Existing OWC and Our Solutions

Despite the promising prospects of optical wireless communication, it currently faces various challenges that limit its development and widespread application. For instance, in indoor optical wireless communication, a tradeoff needs to be considered between the user's illumination experience and the efficiency of optical data transmission, as shown in the LiFOD block of Figure 1.2.

Figure 1.2 The problems (illustrated in gray blanks) in existing OWC systems and our solutions: an overview. To address these problems in different applications, we investigate multiple dimensions of spatial-temporal diversities in optical signals' propagation from the transmitter to the receiver.

Another example is in existing optical camera communication, where the limited camera response frequency restricts the achievable data rate to just a few Kbps, as shown in the RainbowRow block of Figure 1.2. Furthermore, existing optical tags are single-plane, lacking the capability to provide additional rich information in three-dimensional space, such as in underwater scenarios, as shown in the U-Star block of Figure 1.2. Similarly, vision-based hand gesture recognition tracks fingers at a sampling rate of only a few tens of Hz, which hinders fine-grained finger tracking, as shown in the RoFin block of Figure 1.2. Finally, to achieve real-time, low-cost, on-site unmanned aerial vehicle (UAV) recognition, localization, and communication, it is challenging to meet all these requirements with one single solution, as shown in the PoseFly block of Figure 1.2. To address these problems, we specifically model spatial-temporal diversities with different dimensions and leverage them for specific OWC applications and scenarios, as described in Figure 1.2. We also briefly introduce each problem with our proposed solution below.

1.3.1 LiFOD to Address Conflicts between Dimming and Communication

Recent trends in lighting include replacing incandescent and fluorescent bulbs with high-intensity LEDs because of their high energy efficiency, low heat generation, and long lifespan [123, 109, 99]. LED lighting saves the average family approximately $225 in electricity bills each year [80]. Another benefit of LEDs is their capability to switch between different light intensities quickly and efficiently [151]. This feature creates opportunities for LEDs to be used as OWC transmitters for both high-speed communication and efficient lighting in everyday situations [11, 132]. However, even with LED bulbs, lighting still accounts for around 15% of an ordinary home's electricity use [80]. Thus, for indoor LED bulbs, transmitting more data robustly with less retransmission, while not sacrificing the user experience of lighting, is another path to improving energy efficiency.
To transmit more data, we can design high-order modulations for transmission. Recent research has focused on high-order modulation to improve throughput in OWC systems [151, 38, 124]. However, in poor optical channel conditions, such as indoor scenarios with complex artificial light sources, or sunny outdoor and underwater scenarios, the nonlinear effect of LEDs and the short symbol distance make decoding high-order modulation more complex and fragile, which leads to more error bits and, subsequently, more retransmissions that consume energy [131, 106, 46]. Thus, most OWC systems, such as OpenVLC and LiFi [30, 84, 64, 151, 126, 69, 34, 19], switch from high-order to low-order modulation such as simple OOK, which is defined as the primary modulation in the OWC standard IEEE 802.15.7 [2]. As noted in [114], a tradeoff exists between the dimming performance and the achieved data rate because the compensation symbols occupy transmission bandwidth. To address the problems above, we propose LiFOD in Chapter 2 to achieve fine-grained dimming and communication simultaneously by utilizing the 1D temporal diversity of optical signals.

1.3.2 RainbowRow to Boost the Restricted Data Rate in Optical Camera Communication

PDs are single-pixel light sensors whose simple and timely readout processing allows for fast light sensing that keeps up with LED switching rates of a couple of hundred kHz [30]. For example, OpenVLC [23] offers a data rate of about 150 Kbps at 3 m for indoor use cases. However, PDs are not practical for outdoor and long-range scenarios due to varied optical environments and strict directional requirements between the transmitter and the receiver. Compared to single-pixel PD approaches, the image sensor (IS) in a camera has millions of pixels (each pixel element can be treated as a PD) and can easily separate ambient light noise from the optical signals of the transmitter because they fall on different pixel zones [15]. Nonetheless, cameras require more processing and readout time for light sensing than single-pixel PDs [84, 30], and thus commercial cameras only offer a frame rate of tens of Hz and a rolling shutter rate of several kHz. Given that LED-based transmitters offer ON/OFF switching rates of several MHz, this turns the camera-based receiver into the bottleneck of OCC systems and greatly restricts the data rate [151]. To overcome the bottleneck of optical camera communication, we introduce the RainbowRow protocol in Chapter 3. This protocol utilizes 2D spatial-temporal diversities of optical signals to significantly enhance the data rate.

1.3.3 U-Star to Address the Limitations of Optical Codes Underwater

Underwater Optical Wireless Communication (UOWC) has shown significant potential due to its longer propagation range, lower propagation delay, and lower power consumption compared with acoustic and RF-based techniques [91, 134, 147, 151, 117, 129, 141]. Moreover, UOWC systems based on passive optical tags, which utilize natural light sources, are more practical because they do not rely on finite battery power in underwater scenarios where frequent battery replacement is not feasible. Similar to terrestrial navigation procedures, underwater navigation systems need to be able to answer two fundamental questions: (1) Where am I now? and (2) How do I get to where I am going?
For GPS-based navigation, systems first determine the user's current location via GPS localization and then provide terrestrial navigation guidance based on a pre-established location database. Another common method of terrestrial navigation guidance involves signage systems, such as visitor guidance boards in museums, on campuses, or along trails. These boards typically feature a tour map with notations (e.g., stars/dots) indicating the user's current location, allowing them to navigate to their desired destination based on the map's guidance [71]. In underwater environments, GPS is not viable, and other underwater acoustic/RF-based localization methods tend to be costly [89]. Consequently, divers traditionally rely on portable waterproof compasses and information provided by their guide before diving, which can be limiting in terms of intelligence, reliability, and flexibility [121, 45, 66]. Inspired by terrestrial navigation, we can adopt waterproof signage systems to show users rich location information for underwater navigation. This, however, has many challenges, as it is hard to find and read a finite-sized map image or messages underwater due to the harsh optical environment. Alternatively, we can use passive tags and a portable tag reader for more embedded and clearer navigation information. In our daily life, passive optical tags such as barcodes and QR (Quick Response) codes are popular [81, 138], but their short communication range makes underwater navigation impossible because users cannot even find the tags to scan them. Increasing the size of the tag could indeed extend the communication range, but it comes with the trade-off of higher costs and a potentially greater disturbance to the original ecological environment. To circumvent the limitations of existing optical tags, we introduce the U-Star system in Chapter 4. This system is designed to offer a self-served navigation solution by leveraging the 3D spatial diversity of optical signals.

1.3.4 RoFin to Relieve Coarse Sampling in Vision Tracking

Human hands are not just vital organs for catching and grabbing; they have also long been used for communication, such as in greetings, sign language for the deaf, or hand signs in sports and war. Hand poses have become a direct and cost-effective means of Human-Computer Interaction (HCI) across a wide variety of applications due to the fast development of computer technology and artificial intelligence (AI). For example, fingers and hands can be used in smart homes to control IoT devices for a variety of purposes (e.g., turning devices on/off), in interactive video games to provide a user-friendly and immersive gaming experience (e.g., accelerating race cars), and in XR (AR, VR, and MR) enabled mobile applications to provide interactive operations that are close to reality (e.g., navigation) [59, 26, 137, 149, 142]. Vision-based hand gesture recognition systems have grown in popularity, simulating human vision to recognize hand shapes at a rate of roughly 60 Hz [137]. Using deep learning, these algorithms attain an accuracy of more than 80%. They do, however, have limitations: (1) They struggle in poor light or at greater distances because the camera's sensor receives little light from the hand. (2) Cameras sample slowly (e.g., 60 Hz) when tracking fingers, mimicking human ocular limits and making it difficult to see detailed hand motions, such as tremors in Parkinson's patients [95, 24, 122].
(3) Complex hand form recognition with around 20 joints results in substantial processing costs and delays. (4) Privacy concerns arise when hand-related frames are captured in sensitive situations, thereby jeopardizing the privacy of the persons involved [139]. To enhance finger tracking accuracy and reduce the overhead of hand pose reconstruction, we introduce the RoFin system in Chapter 5. This system is designed to offer fine-grained finger tracking and precise hand pose reconstruction by leveraging the 3D spatial-temporal diversity of optical signals for sensing.

1.3.5 PoseFly for Low-cost Joint Sensing and Communication

Currently, drones are primarily controlled by a centralized base station (CBS), such as a drone pilot on the ground or a satellite in orbit, utilizing the radio frequency (RF) spectrum [6, 36]. However, these centralized control techniques limit the potential use cases for drones since they lack mutual communication among drones. As a result, on-site data sharing directly among drones, without the need for assistance from a centralized base, becomes challenging. The requirement for each drone in the drone cluster to acquire commands from the CBS and transmit its status, including its surroundings and posture state measured by its onboard sensors such as an IMU (Inertial Measurement Unit), adds to the communication latency under the centralized drone control mechanism. This can lead to significant localization errors, especially in high-motion scenarios, where the back-and-forth communication latency becomes a critical concern. As an example, consider two drones moving at a speed of 20 m/s in opposing directions. The 0.25 s required for location computation and communication between them would result in a 10 m localization error (0.25 × 20 × 2). Furthermore, as the number of drones in the cluster increases, the limited capacity of the RF spectrum becomes increasingly crowded. This congestion can lead to bit errors and retransmissions, exacerbating the localization error even further [144]. Optical camera communication (OCC) has garnered significant attention, particularly with the proliferation of commodity mobile devices equipped with built-in cameras. Compared to photodiode-based techniques like LiFi, OCC offers the advantage of low interference with ambient light. It also facilitates location-based services (LBS), enabling fine-grained AR navigation through the association of data from visible transmitters within a flexible communication range [148, 95, 24, 151, 124]. To enable low-cost localization and communication among swarming drones, we harness the 4D spatial-temporal diversities of optical signals and introduce the PoseFly system in Chapter 6.

1.4 Dissertation Organization

The rest of the dissertation is structured as follows. In Chapter 2, we provide a comprehensive exploration of the dimming side channel and illustrate how we leverage 1D spatial-temporal diversity (i.e., 0D spatial with 1D temporal) to enhance the data rate of Li-Fi. Chapter 3 delves into the details of our proposed RainbowRow protocol, which exploits 2D spatial-temporal diversities (i.e., 1D spatial with 1D temporal) through rolling strips to enhance optical camera communication. In Chapter 4, we introduce 3D hollowed-out optical tags (i.e., 3D spatial with 0D temporal) designed for underwater navigation, extending symbol distances in space.
Shifting our focus to optical wireless sensing, Chapter 5 presents the RoFin system, which leverages 3D spatial-temporal diversities (i.e., 3D spatial with 0D temporal) for fine-grained finger tracking and hand pose reconstruction. In Chapter 6, we delve into the use of 4D spatial-temporal diversities (i.e., 3D spatial with 1D temporal) for on-site pose parsing of swarming drones. Finally, we conclude this dissertation and discuss future research directions in Chapter 7.

CHAPTER 2 LIGHTING EXTRA DATA VIA 1D TEMPORAL DIVERSITY

Owing to the wide spectrum and rapid intensity switching capabilities of LEDs, optical wireless communication (OWC) holds tremendous promise for high-speed data transmission. In difficult conditions, many OWC systems switch from sophisticated, error-prone high-order modulation approaches to the more resilient On-Off Keying (OOK) modulation described in the IEEE OWC standard. In this chapter, we describe LiFOD, a new indoor OOK-based OWC system that can provide fine-grained dimming while maintaining robust communication, with rates of up to 400 Kbps across a 6-meter distance. LiFOD provides two crucial features. First, LiFOD uses Compensation Symbols (CS) as a reliable side-channel to dynamically represent bit patterns for an improved data rate. Second, LiFOD reconfigures the placement of optical data symbols (i.e., OOK symbols) and CS symbols in real time, optimizing them for fine-grained dimming and dependable decoding. Empirical tests using low-cost BeagleBone prototypes with commercial LED lights and photodiodes (PD) demonstrate LiFOD's superiority over state-of-the-art systems. LiFOD achieves a 2.1× throughput boost on the SIGCOMM17 data-trace.

2.1 Motivation

Considering the user experience of lighting, changes in LED brightness may cause undesired flicker when transmitting data via the optical spectrum [2, 38, 124]. Meanwhile, dimming is essential to adjust light intensity for a variety of purposes and activities, such as office or hallway lighting, sleeping, or reading, with benefits that include reduced eye strain, mood setting, and LED life extension. Therefore, within the OWC standard [2], compensation symbols (CS) are employed in OOK modulation for smooth lighting and dimming control, while not affecting wireless communication. The entire PHY frame in OOK-based OWC is split into multiple subframes. In each subframe, a continuous run of CS symbols, whose number is proportional to the length of the subframe, is inserted in front of the OOK symbols (the P, H, RF, and DS fields) to smoothly adjust (i.e., increase, keep, or decrease) the average brightness (AB).

Figure 2.1 Illustration of OOK dimming control with compensation symbols (CS), redesigned from the IEEE OWC standard [2]. A higher ratio of CS symbols in a subframe and a higher CS symbol amplitude can both achieve higher average brightness (AB). (P: preambles; H: PHY headers and extension; CS: compensation symbols; RF: resync field; DS: data symbols; AB: average brightness.)

A tradeoff is observed: when more control is needed to achieve fine-grained dimming, there is less opportunity for wireless communication, which results in lower throughput [114, 2]. Moreover, CS symbols are solely used for dimming [147]. This consumes transmission resources in the time domain and limits the data rate of OOK, which already carries a limited number of bits.
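To make the dimming relationship above concrete, consider a subframe in which a fraction r of the symbol slots carry CS symbols of amplitude h, while the remaining OOK data symbols of amplitude A are ON roughly half the time after scrambling. Under this back-of-the-envelope model (an illustrative assumption of ours, not a formula from the standard or from LiFOD), the average brightness is approximately

\[
\mathrm{AB} \;\approx\; (1 - r)\,\frac{A}{2} \;+\; r\,h, \qquad 0 \le h \le A,\quad 0 \le r \le 1 .
\]

For a fixed CS fraction r, increasing the CS amplitude h raises AB; and when h > A/2, increasing r also raises AB (while dimming the light when h < A/2), which matches the qualitative behavior sketched in Figure 2.1. The cost is that every CS slot is one fewer slot for data, which is precisely the tradeoff LiFOD sets out to remove.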
There are two key observations that motivate our approach. (1) Bit patterns [39, 40, 73] occur in transmitted bit-streams. A bit pattern is a bit sequence (i.e., multiple consecutive bits) that frequently occurs in traffic during a historical period. (2) Compensation symbols have not been used for data transmission in OOK-based OWC networks, as shown in Figure 2.1. In related dimming research [108, 128, 127, 133], approaches focus only on dimming itself without considering the potential for data transmission. However, considering the significant symbol distances between CS and OOK symbols, we can use CS as a reliable side-channel to denote bit patterns for improved throughput. To achieve these goals, we present LiFOD, which uses compensation symbols (CS) not only to assist dimming, as has been done in the past, but also to encode data bits for better throughput in OOK-based OWC networks. In our method, CSs perform dual functions in dimming control and data transfer. A repositioned CS symbol inside the PHY subframe can signify a specific bit pattern within a transmitted sequence. Along with modulation, the transmitter performs a lightweight bit pattern discovery procedure on a regular basis and transmits the most recent bit pattern information to the receiver via preambles.

2.2 Background and Related Work

Single-color LED lamps are the most popular and cost-effective choice for eco- and user-friendly residential lighting fixtures in our daily lives. Lighting and dimming are the primary functions of these LED lamps. In addition, photodiode (PD)-based OWC systems, such as OpenVLC and LiFi [30, 84, 64, 151, 126, 69, 34, 19], with low-order modulations such as OOK, MPPM, and their variants, treat wireless communication as a secondary function of these commercial LED lamps. We provide a primer on OWC dimming functions and modulation below to better define our research problem.

2.2.1 Dimming in OWC

Light dimming is defined as controlling a light source's perceived brightness based on a user's requirements. We classify the primary OWC dimming methods in the IEEE OWC standard [2] into two types, coupled dimming with transmission and decoupled dimming with transmission, as shown in Figure 2.2.

Figure 2.2 Coupled/decoupled OWC dimming with transmission: (1) dimming by control signal amplitude (coupled), (2) dimming by control signal pulse duty (coupled), and (3) dimming by adding compensation symbols (decoupled), where both the number and the amplitude of CS symbols can be adjusted. Core idea of LiFOD: utilizing CS as a robust side-channel to denote more bits.

For coupled dimming with transmission, the control signal's amplitude has no impact on the time slots/carrier bandwidth of the transmission, while the control signal's pulse width influences the carrier's bandwidth. As observed in SmartVLC [115], a drawback of fine-grained coupled dimming control is the lower achievable throughput, because complex modulations that allow fine-grained dimming control waste transmission bandwidth and add more error bits. The researchers proposed Adaptive Multiple Pulse Position Modulation (AMPPM), which designs super symbols to generate more pulse width combinations for fine-grained dimming.
However, AMPPM still provides only discrete-step dimming, with more modulation cost than same-order OOK. Decoupled dimming with transmission inserts compensation symbols (CS) into the data frame alongside the constant-brightness symbols of OOK modulation to adjust the average brightness of the light source. This treats data transmission and light dimming as two relatively independent modules with limited interaction. In comparison to coupled dimming methods, it offers more robust communication and fine-grained dimming control, while also providing the potential of using CS symbols to transmit extra data. However, compared with coupled dimming, the CS symbols take up time slots that could otherwise carry data symbols.

2.2.2 Communication in OWC

Besides lighting, it is also crucial to provide users with high-speed communication. Based on the receiver type and modulation, we classify OWC into two types: (1) Camera-based OWC with high-order modulation. The image sensor in a commercial camera can be treated as millions of single-pixel photodiodes (PD) and requires more processing time than a single PD [109]. The limited frequency response of the camera makes it hard to achieve a sufficiently high data rate, as the switching speed of the transmitters is too fast for the frequency response of the receiver [124, 51]. Rolling shutter cameras on smartphones offer a frequency response only up to a couple of tens of kHz, which is well below the hundreds of kHz needed for high-speed communication. To overcome the bottleneck of camera-based OWC systems, many researchers [72, 123, 38, 124] focus on designing high-order modulation schemes to improve throughput. In [38], the authors proposed ColorBars, which utilizes Color Shift Keying (CSK) modulation to improve the data rate via tri-color LEDs. They achieved a data rate of up to 5.2 Kbps on smartphones. Similarly, Yanbing et al. proposed Composite Amplitude-Shift Keying (CASK) [124] to improve the throughput of camera-based OWC systems. CASK modulates data in a high-order way without a complex CSK constellation design. CASK achieves a data rate of up to 7 Kbps by digitally controlling the On-Off states of several groups of LED chips. These existing high-order modulations are costly because they require specialized devices and therefore cannot scale easily. For example, CSK modulation requires tri-color LEDs as transmitters, which cost more than the single-color LEDs used in OOK and are quite unlikely to be deployed in real life [124]. CSK also needs a complicated and expensive receiver to precisely detect the intensities of three colors, red, green, and blue, in the CIE color space chromaticity diagram [16]. (2) Photodiode-based OWC with primary modulation. Photodiodes (PD) are semiconductor P-N junction devices that convert the analog light signal into electrical current [57, 136]. PDs are single-pixel devices with a small surface area, which allows them to have a fast sensing response time. This means the receiver can achieve fast and robust symbol detection for high-speed communication. Most OWC systems, such as LiFi [30, 84] and OpenVLC [23, 115, 19, 69], adopt PDs as receivers for high-speed transmission and achieve a frequency response of a couple of hundred kHz. To suit such a high transmission frequency, PD-based OWC adopts primary, low-order modulations such as OOK. This is because it is non-trivial to demodulate higher-order optical symbols (e.g., 8-CASK, 32-CSK) at the PD-based clock speed of hundreds of kHz, due to reduced symbol distances compared to OOK symbols.
Moreover, in poor optical channel conditions such as sunshine or underwater scenarios, the nonlinear effect of LEDs and short symbol distances make these modulations more complex and fragile, with more error bits [106, 46, 67, 21, 131]. Higher-order modulations bring more error bits and need more retransmissions to meet the required BER. Thus, most popular OWC systems such as LiFi [30, 84] switch from high-order modulations to low-order modulation such as OOK for robust transmission with a low BER in changing environments with poor channel conditions. The latest version of OpenVLC [23] can achieve, on average, about 150 Kbps at 4 m under optical interference. Our scope: We focus on indoor OWC systems equipped with low-cost PD sensors and single-color commercial LED lamps, which are resilient lighting infrastructures. Our goal is to boost throughput and fine-grained dimming simultaneously without additional cost.

2.3 Our Approach: LiFOD

LiFOD consists of a commercial LED lamp based transmitter and a PD-based receiver. The architecture diagram and workflows of LiFOD are shown in Figure 2.3.

Figure 2.3 System architecture and workflow of LiFOD.

(1) Dimming workflow: After a user turns on an LED lamp, they may start OWC. They can smoothly control the dimming level by adjusting the knob (an actual physical knob or a virtual knob on the IoT user interface). Manual adjustment is the most accessible and most fine-grained manner of dimming control, as opposed to various communication-coupled dimming methods that can only provide digital and discrete-step dimming. OOK symbols have constant brightness for data communication. In contrast, CS symbols are brightness-adjustable for fine-grained light dimming. Instead of the original continuous CS symbol insertion, LiFOD uses discrete CS symbol relocation to denote bit patterns without impacting CS-based smooth dimming or the detection of OOK symbols for robust communication. (2) OWC workflow: Modulation occurs when Internet data from upper layers is encoded as optical data symbols. There are three essential network modules before OWC modulation defined in the standard [2]: source coding, scrambler, and channel coding. Our introduced module in LiFOD is a lightweight bit pattern mining module added after these three network modules, but before modulation. Although scrambling and channel coding have already occurred, there are still some frequently appearing bit sequences (e.g., "001001" in the illustration). These are bit pattern candidates. In a real-world trace, SIGCOMM 2017 [101], as shown in the middle left of Figure 2.3, multiple bit sequences appear with high frequency and introduce bonus bits (i.e., we can add CS symbols to assist transmission and achieve a higher data rate than current standards). (3) Overview: We encode p-length bit patterns into a Compensation Symbol Code (CSC), as shown in the middle right of Figure 2.3.
Each instance of a CSC code increases transmission speed because more bits are transmitted if p > 1. When allocating bits, we first check whether the next p bits match one of the predefined CSCs from our bit pattern discovery. If not, one bit is allocated to an OOK symbol as usual. Otherwise, we define it as a hit: instead of mapping only one bit to an OOK symbol, p bits are transmitted through a single CS symbol. Once the receiver detects a CS symbol's existence, it inserts the corresponding p-bit CSC into the data stream. The receiver now needs to detect only one CS symbol that denotes p bits, instead of needing to detect p OOK symbols. Because (p-1) more bits (i.e., bonus bits) are transmitted whenever there is a hit, and all symbol types (ON/OFF/CS) are used for transmission, it is clear that the data rate of our system will increase. (A code sketch of this allocation logic is given at the end of this section.)

2.3.1 Challenges and Solutions

There are two technical challenges that LiFOD must deal with. When a larger degree of control is necessary to accomplish exact dimming, the capacity for wireless communication is lowered, resulting in poorer throughput [114, 2]. Furthermore, using CS symbols only for dimming costs transmission resources in the temporal domain, limiting the data rate of OOK, which has a limited bit capacity by design. In our design, CSs are used for both dimming control and data transmission. A bit pattern in a transmitted bitstream can be represented by one relocated CS symbol in the PHY subframe. The transmitter periodically conducts lightweight bit pattern discovery in parallel with modulation and notifies the receiver of the latest bit patterns via preambles. Network throughput improves remarkably due to improved data rate and decoding performance. (1) Data rate: CS symbols become data symbols without consuming additional transmission resources in the time domain. Moreover, each CS symbol carries more bits than an OOK symbol. (2) Decoding: CS symbols have a lower detection error rate than OOK symbols. Furthermore, the receiver decodes a CS symbol directly into its corresponding bit pattern instead of decoding multiple OOK symbols for that bit pattern, which reduces the possibility of decoding errors. Our contributions are summarized as follows:

• We creatively exploit compensation symbols (CS symbols), which were traditionally used only for dimming in OOK-based OWC systems, to improve throughput. We explore bit pattern possibilities and propose a greedy mining algorithm to identify multiple bit patterns that maximize the overall throughput.

• We redesign non-flicker optical symbols (OOK and CS symbols) for smooth lighting and communication. This ensures the robust identification of symbol types in a changing environment. Initially, CSs are inserted continuously and proportionally into subframes for constant lighting. In our approach, CSs are relocated to discrete locations to denote bit patterns, which may introduce undesired flicker; however, we also design CS relocation schemes for stable lighting.

• We implement a LiFOD prototype on commercial devices and validate its lighting and communication performance in different transmission settings. Our comprehensive evaluation results demonstrate that LiFOD can achieve up to 400 Kbps at up to 6 m with fine-grained dimming, effectively doubling throughput at a longer range compared with SmartVLC on the SIGCOMM17 data-trace.
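As a concrete illustration of the CS-based bit allocation and recovery described in the overview above, the following minimal Python sketch maps a bitstream to a symbol stream using two assumed CSC patterns. The pattern values, symbol labels, and longest-pattern-first matching order are illustrative assumptions, not the exact LiFOD implementation, which additionally constrains CS placement to meet the dimming target.

# Hypothetical sketch of CSC-based allocation (transmitter side) and the
# inverse mapping (receiver side). CSC patterns and symbol labels are assumed
# for illustration only.

CSCS = {"CS1": "000000", "CS2": "01"}   # example pattern-I and pattern-II

def allocate_symbols(bits: str) -> list[str]:
    """Emit a CS symbol when the next p bits match a CSC (a "hit");
    otherwise emit one OOK symbol ('ON' for '1', 'OFF' for '0')."""
    symbols, i = [], 0
    # Try longer patterns first so CSC-I takes precedence over CSC-II.
    ordered = sorted(CSCS.items(), key=lambda kv: -len(kv[1]))
    while i < len(bits):
        for name, pattern in ordered:
            if bits.startswith(pattern, i):      # hit: p bits -> one CS symbol
                symbols.append(name)
                i += len(pattern)
                break
        else:
            symbols.append("ON" if bits[i] == "1" else "OFF")
            i += 1
    return symbols

def recover_bits(symbols: list[str]) -> str:
    """Receiver side: expand each CS symbol back into its p-bit pattern."""
    return "".join(CSCS.get(s, "1" if s == "ON" else "0") for s in symbols)

if __name__ == "__main__":
    stream = "1001010101110001"
    tx = allocate_symbols(stream)
    assert recover_bits(tx) == stream
    # Every CS symbol carries p bits, so fewer symbols than bits are needed.
    print(len(stream), "bits sent as", len(tx), "symbols:", tx)

In LiFOD, the receiver learns the current CSC table from the frame preambles, so both sides stay synchronized as the mined bit patterns change over time.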
2.4 Bit Pattern Discovery

2.4.1 Mining Challenges.

Throughput improvement depends on the length p and the hit rate in a given data frame. For example, as the length of a bit sequence increases, the probability of a hit decreases, and vice versa. There is a clear tradeoff between bit sequence length and hit probability. Moreover, more than one bit sequence is likely to qualify as a bit pattern. When one bit sequence is selected as a bit pattern, the bitstream will be split by this bit pattern. After one bit pattern is assigned, depending on which pattern is chosen, the resulting allocation of the data bits is wholly changed. The next challenge is to decide which pattern will be selected as the next bit pattern. All options need to be explored based on the choice of the previous bit patterns. An example is illustrated in Figure 2.4. Suppose the bit sequence "01" appears most often when allocating the bitstream "...1001010101110001...". It also offers the maximal bonus bits when compared with other potential bit sequences; in this case, (2 − 1) × 5 = 5 bonus bits. We may encode the bit sequence "01" as one type of CSC. However, other bit sequences may also exist, such as "10", which appears just as often and brings the same number of bonus bits as "01", (2 − 1) × 5 = 5. A challenge of LiFOD is deciding which bit sequence, in this case "01" or "10", should be selected as the bit pattern. (1) If we choose "01" as the bit pattern, the bitstream will be split into three bit segments: "...10", "...1100..." and "...". (2) If we choose "10", the bitstream will be split into four bit segments: "...", "0", "11", and "001...".

Figure 2.4 Bit pattern candidates change in the next round.

Additional bit sequences also frequently appear in the split bit segments produced after the first round of bit pattern selection. These sequences can be chosen as further bit patterns to further speed up the data rate. However, the bit pattern selected in a specific round impacts the bit pattern choice for the next round, and bit pattern candidates discovered in earlier rounds may no longer be candidates. When choosing bit patterns, we need to consider the total bonus bit performance of all chosen bit patterns over all rounds.

2.4.2 Identify Patterns Greedily.

To address the problem above, we execute bit pattern mining in multiple rounds, as shown in Figure 2.5. The bit pattern of each round will be selected as a different type of CSC. After several rounds of mining, there is less opportunity to find bit patterns because the bitstreams have already been split into short segments. Consequently, the obtained bonus bits decrease as the number of rounds increases. Furthermore, if there are too many types of CSCs, the compensation symbol design for modulation becomes more complicated and therefore increases the error rate of demodulation. Therefore, the choice to continue bit pattern mining is a tradeoff between increased data rate and error rate. The number of rounds we run for bit pattern mining depends on the bonus ratio of each round. The bonus ratio is defined as the ratio of the bonus bits introduced by the CSC of a specific round to the number of bits in the entire data frame. When the bonus ratio is less than 10%, bit pattern mining stops at that round, and the previously mined bit patterns are chosen as CSCs.

Figure 2.5 The illustration of multiple-round mining.

According to the analysis above, we design a lightweight greedy algorithm to explore bit patterns over multiple rounds. The goal is to obtain the maximum number of bonus bits in each mining round locally and thereby the maximum bonus bits over all mining rounds globally. Based on our experimental results, we have determined that with bit sequence lengths larger than six bits, the total number of bonus bits we gain starts to fall, and therefore we search for bit sequences whose length is up to 6 bits. The number of possible bit sequences is 2^2 + 2^3 + 2^4 + 2^5 + 2^6 = 124. We scan each of them in the frame, count the hit number, and calculate the bonus bits. We then choose the bit sequence with the most bonus bits as the bit pattern for that mining round. We calculate the bonus ratio of the bit pattern for each round and compare it with the 10% threshold. If the bonus ratio is less than the threshold, mining stops at that round.
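A minimal Python sketch of the greedy, multi-round mining just described is given below. The candidate lengths of 2 to 6 bits, the bonus-bit count of (p − 1) per hit, and the 10% stopping threshold follow the text; the function names, the non-overlapping hit counting, and the tie-break toward longer patterns (which anticipates the CSC decision trick discussed in the ablation study below) are illustrative assumptions rather than the exact LiFOD code.

# Hypothetical sketch of the greedy multi-round bit pattern mining.
# Candidate lengths 2..6 and the 10% bonus-ratio stop rule follow the text;
# everything else is assumed for illustration.

from itertools import product

def candidates(max_len: int = 6):
    """All bit sequences of length 2..max_len (2^2 + ... + 2^6 = 124)."""
    for p in range(2, max_len + 1):
        for combo in product("01", repeat=p):
            yield "".join(combo)

def count_hits(segments: list[str], pattern: str) -> int:
    """Non-overlapping hits of `pattern` across the current bit segments."""
    return sum(seg.count(pattern) for seg in segments)

def mine_patterns(frame: str, stop_ratio: float = 0.10) -> list[str]:
    segments, chosen = [frame], []
    while True:
        # Pick the candidate with the most bonus bits, (p - 1) per hit;
        # on a tie, prefer the longer candidate.
        best = max(candidates(),
                   key=lambda c: ((len(c) - 1) * count_hits(segments, c), len(c)))
        bonus = (len(best) - 1) * count_hits(segments, best)
        if bonus / len(frame) < stop_ratio:    # bonus ratio below 10%: stop
            break
        chosen.append(best)
        # Split every segment on the chosen pattern for the next round.
        segments = [piece for seg in segments for piece in seg.split(best)]
    return chosen

if __name__ == "__main__":
    print(mine_patterns("1001010101110001" * 8))

The transmitter would run such a mining pass periodically over recent traffic and announce the resulting CSC table to the receiver via the preambles, as described in Section 2.3.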
When the bonus ratio is less than 10%, bit pattern mining stops at that round, and the previously mined bit patterns are chosen as CSCs.

Figure 2.5 The illustration of multiple rounds mining.

According to the analysis above, we design a lightweight greedy algorithm to explore bit patterns over multiple rounds. The goal is to obtain the maximum number of bonus bits in each mining round locally and thereby the maximum bonus bits over all mining rounds globally. Based on our experimental results, we determined that with a bit sequence length larger than six bits the total number of bonus bits starts to fall, and therefore we search for bit sequences whose length is up to six bits. The number of possible bit sequences is $\sum_{i=2}^{6} 2^i = 124$. We scan each of them in the frame, count the hit number, and calculate the bonus bits. We then choose the bit sequence with the most bonus bits as the bit pattern of that mining round. We calculate the bonus ratio of the chosen bit pattern for each round and compare it with the 10% threshold; if the bonus ratio is less than the threshold, mining stops at that round.

2.4.3 Ablation Study of Bit Pattern

Real-world Daily Data-trace. The OWC backhaul is connected with the Internet [30]. We conduct CSC code abstraction based on two sets of real-world wireless traffic data: (1) the SIGCOMM 2017 trace [101], which records the wireless network activities at SIGCOMM 2017, and (2) the CAIDA 2019 trace [102], which collects the daily network traffic of a city in the US. These data packets are scrambled and encoded with the convolutional encoder specified in the IEEE 802.11 standards.

Bonus Bits Distribution and Potentials. Figure 2.6 shows heat maps of our bit pattern mining results in Rounds 1 and 2 among different frames from the two traces. There are more bit pattern candidates in Round 1 (i.e., six strongly highlighted columns). In Round 2, there are fewer bit pattern candidates (i.e., two strongly highlighted columns), and the bonus bits in Round 1 are much larger than in Round 2. This implies that abundant known bits are used in the first round of mining because of the high probability of a hit on the CSCs; in higher-order rounds, opportunities to use CSCs are few.

Figure 2.6 Bonus bit heat maps for two rounds mining on two daily traffic: SIGCOMM17 [101] and CAIDA19 [102].

Tricks of CSC Decision in a Round. In general, the decision to choose a particular bit pattern candidate as the CSC code of a round depends on its bonus bits. However, if two bit pattern candidates have identical bonus bits, as occurs in Round 1 of the SIGCOMM17 trace shown at the top of Figure 2.7, we choose the longer candidate "000000" as the bit pattern even if other candidates have the same bonus ratio performance for that round.
The reason is that, when two or more bit pattern candidates have identical bonus bits, splitting on the longer pattern produces shorter bit segments. Thus, there will be fewer hits in the next round, which means more bits are carried by CSC-I and fewer by CSC-II.

Figure 2.7 CSC decision tricks in a mining round.

Figure 2.8 Two CSC can embed considerable extra data.

Two CSCs with Considerable Extra Data. Figure 2.8 shows that in Round 1 of mining, more than 40% of all bits are transmitted as bonus bits through CSC-I for the SIGCOMM17 trace. The CAIDA19 trace also achieves a bonus ratio of more than 20% for CSC-I. As the number of mining rounds increases, a lower percentage of bonus bits can be used; however, the bonus ratio is still above 10% in Round 2 of the SIGCOMM17 trace. The bonus ratio in Round 2 for the CAIDA19 trace remains near 20%, showing almost no decline from Round 1. In Round 3 of mining for both traces, the bonus ratio falls below the 10% threshold, and the mining subsequently stops. Finally, we choose two CSCs (CSC-I and CSC-II) to be used for transmission. The total bonus ratio of the two rounds of mining on the two real-world traces is, combined, more than 40%. Although the transmission rate benefits less directly from bonus bits when utilizing CSC-II, it still provides decoding benefits from the known bits represented by CSC-II. Overall, the more bits represented by CS symbols, the fewer opportunities for false detection of OOK symbols.

Delay and Overhead Measurement. We analyze and measure the overhead of bit pattern mining based on the real-world data traces. The execution time and memory overhead of our greedy bit pattern mining are shown in Table 2.1 and Table 2.2. The bit pattern mining for SIGCOMM 17 and CAIDA 19 consumes 0.78 s and 0.37 s on average, which is short enough to be treated as a normal delay before transmission. The memory cost of our pattern mining is 144 MiB on average for both the SIGCOMM 17 and CAIDA 19 data traces, which is low even compared with the capabilities of MCU-class devices such as the BeagleBone Black (512 MB RAM). The results show that the bit pattern mining of LiFOD is lightweight and real-time, and thus suitable for real-world usage.

Execution Time (s)   Round 1 (min / max / ave)   Round 2 (min / max / ave)   Total (min / max / ave)
SIGCOMM 17           0.44 / 0.87 / 0.61          0.12 / 0.24 / 0.17          0.56 / 1.67 / 0.78
CAIDA 19             0.11 / 0.38 / 0.22          0.07 / 0.25 / 0.15          0.18 / 0.63 / 0.37

Table 2.1 Delay measurement of bit pattern mining on two real-world data traces.

Memory Overhead (MiB)   Round 1 (min / max / ave)   Round 2 (min / max / ave)   Total (min / max / ave)
SIGCOMM 17              72 / 72 / 72                72 / 72 / 72                144 / 144 / 144
CAIDA 19                72 / 72 / 72                72 / 72 / 72                144 / 144 / 144

Table 2.2 Overhead measurement of bit pattern mining on two real-world data traces.
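The multi-round mining procedure of Section 2.4.2 can be summarized in a short Python sketch: count non-overlapping hits for every candidate of length 2 to 6, keep the candidate with the most bonus bits (preferring the longer one on ties, as discussed above), and stop once the bonus ratio of a round drops below 10%. The hit-counting and splitting details are simplified assumptions for illustration, not the exact firmware implementation.

    from itertools import product

    def count_hits(segment, pattern):
        """Count non-overlapping, left-to-right hits of pattern in one segment."""
        hits, i = 0, 0
        while (i := segment.find(pattern, i)) != -1:
            hits, i = hits + 1, i + len(pattern)
        return hits

    def greedy_mine(frame, max_len=6, stop_ratio=0.10):
        """Return the bit patterns (CSC-I, CSC-II, ...) mined from one data frame."""
        candidates = ["".join(c) for n in range(2, max_len + 1)
                      for c in product("01", repeat=n)]        # 2^2 + ... + 2^6 = 124
        segments, patterns = [frame], []
        while True:
            def bonus(p):                                       # (len - 1) bonus bits per hit
                return (len(p) - 1) * sum(count_hits(s, p) for s in segments)
            best = max(candidates, key=lambda p: (bonus(p), len(p)))  # longer wins ties
            if bonus(best) / len(frame) < stop_ratio:           # bonus-ratio stop criterion
                return patterns
            patterns.append(best)
            # Split every segment on the chosen pattern before the next round.
            segments = [piece for s in segments for piece in s.split(best) if piece]

    print(greedy_mine("1001010101110001" * 8))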
2.5 Fine-grained Dimming via CS

2.5.1 Non-flicker Symbol Design

Flicker is the temporal modulation of lighting perceivable by the human eye, which can negatively affect a user's lighting experience. The maximum flickering time period (MFTP) is the maximum time period over which the light intensity can change without being sensed by human eyes. Thus, any brightness changes over periods longer than the MFTP must be avoided (i.e., significant low-frequency brightness changes cause flicker and should be mitigated) [2].

In the current standard, OFF/ON and CS symbols have different amplitudes, and as shown in Figure 2.9, CS-I and CS-II also have different amplitudes. The random distribution of CSCs encoded by LiFOD, which appear in PHY frames at low frequencies, would therefore cause significant flickering. To address this, our flicker-mitigation solution is inspired by Manchester coding [2]: each symbol is extended to include itself and its complementary symbol. This guarantees that any significant brightness change appears too fast to be sensed by human eyes. There are three amplitude scales in the new symbol design, B0, B1, and B2 (brightness: B0 < B1 < B2), for the OFF, ON, CS-I, and CS-II symbols, instead of the four brightness amplitudes in the original symbol design. Symbol OFF is designed as B0+B1: in the first half of the symbol duration it has an amplitude of B0, and in the second half an amplitude of B1. Similarly, symbol ON is designed as B1+B0, CS-I as B2+B0, and CS-II as B0+B2. Our newly designed symbols need only two thresholds rather than three for demodulation, decreasing the complexity and load of symbol detection and further increasing the symbol distance and decoding robustness. Additionally, CS-I and CS-II have the same brightness in our non-flicker symbol design, which further reduces the flickering possibility compared to the standard symbol design.

Figure 2.9 Non-flicker optical symbol design in LiFOD.

Note that there are more CS-I symbols than CS-II symbols. It is easier for the receiver to distinguish the amplitude difference between B2 and B0 than between B1 and B0. Suppose a symbol has an amplitude of B2 in the first half of the symbol duration; in this case, the symbol is decoded as a CS-I symbol directly, without estimating the amplitude of the second half. That is why we design the CS-I symbol as B2+B0 instead of B0+B2. This design decreases the detection error rate (DER) of the CS-I symbol, which carries more data than the CS-II symbol, and ultimately benefits total throughput and BER performance.
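The two-half-period symbol design of Figure 2.9 can be captured in a few lines. In the Python sketch below, each symbol is a (first-half, second-half) amplitude pair and decoding needs only two thresholds; the normalized amplitude and threshold values are placeholders, not measured levels.

    # Each symbol occupies one symbol duration split into two halves.
    # Amplitudes: B0 < B1 < B2 (three levels instead of four).
    B0, B1, B2 = 0.0, 0.5, 1.0          # illustrative normalized levels
    SYMBOLS = {
        "OFF":   (B0, B1),
        "ON":    (B1, B0),
        "CS-I":  (B2, B0),               # brighter half first: easiest to detect
        "CS-II": (B0, B2),
    }

    def decode(first_half, second_half, thr_low=0.25, thr_high=0.75):
        """Classify one received symbol from its two half-period amplitudes."""
        if first_half > thr_high:        # B2 in the first half -> CS-I immediately
            return "CS-I"
        if second_half > thr_high:       # B2 in the second half -> CS-II
            return "CS-II"
        # Otherwise an OOK symbol: ON starts bright (B1), OFF ends bright.
        return "ON" if first_half > thr_low else "OFF"

    assert all(decode(*halves) == name for name, halves in SYMBOLS.items())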
2.5.2 Compensation Symbols Relocation

Fine-grained dimming control. LiFOD consists of two commercial LED lamps that are controlled synchronously, as shown in Figure 2.10. The transmitter sends out OOK symbols via LED1 and sends out compensation symbols via LED1 and LED2 together. LED1's brightness is set by the user and fixed before OWC begins. Users can continuously adjust LED2 via the dimmer knob to provide additional brightness (B2 − B1, i.e., ΔB) and thus increase or decrease the average brightness (AB) without impacting optical symbol detection. This saves transmission bandwidth and does not affect symbol decoding. The number of CS symbols is proportional to each frame's length to guarantee the same AB across frames. This mitigates inter-frame flicker and keeps the brightness constant, even after an updated dimming level is set.

Figure 2.10 Two commercial LED bulbs (<$10) in LiFOD.

Random CSC Locations and Numbers. There are subframes in each frame. Currently, compensation symbols are inserted continuously into subframes for dimming control in the IEEE OWC standard [2]. However, such symbols are incapable of denoting the bit patterns that may appear discretely in the bitstream of a frame. Moreover, the hit numbers of CSC-I and CSC-II are not always the same across subframes, even though different subframes should have the same brightness to reduce intra-frame flicker. This means each subframe should have an equal proportion of CS-I and CS-II symbols.

CS Relocation. In Figure 2.11, there are 40 OOK and CS symbols in each subframe. We set 1/5 of the symbols (i.e., 8 CS symbols) aside for dimming to keep a constant AB in the subframe. Initially, these 8 CS symbols sit at the beginning of each subframe. If there is a CSC-I/II hit in the subframe, we put one CS-I/II symbol at that location; these picked CS-I/II symbols are used both for dimming and for assisting transmission. The remaining CS-I/CS-II symbols at the front of the subframe are used only for dimming. The CS symbols used only for dimming are separated by the resync field (RF) from the symbols used for transmission (OOK and picked CS symbols); we only decode the symbols after the RF field. Compared with the original continuous CS symbols, CS relocation creates a robust side channel for data transmission and, as a side benefit, further mitigates flickering while keeping the brightness constant.

Figure 2.11 CS symbol relocation scheme.
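One plausible reading of the relocation scheme in Figure 2.11 is sketched below in Python: each subframe keeps a fixed budget of 8 CS symbols out of 40, CSC hits inside the data part consume part of that budget, and the remaining dimming-only CS symbols stay in front of the resync field. The budget bookkeeping and the RF placement are assumptions for illustration only.

    def build_subframe(data_symbols, cs_budget=8):
        """Lay out one subframe: dimming-only CS symbols, a resync field, then data.

        data_symbols: symbols of the data part, e.g. "ON", "OFF", "CS-I", "CS-II";
        CS entries are CSC hits relocated into the stream by the allocator.
        """
        hits = sum(s in ("CS-I", "CS-II") for s in data_symbols)
        assert hits <= cs_budget, "more CSC hits than the per-subframe CS budget"
        # CS symbols needed only for dimming stay at the front, so every subframe
        # keeps the same number of CS symbols (constant average brightness).
        # CS-I and CS-II have equal brightness in the non-flicker design, so the
        # type used for the padding is arbitrary here.
        front = ["CS-I"] * (cs_budget - hits)
        return front + ["RF"] + data_symbols             # receiver decodes after RF

    # 2 CSC hits in the data part -> only 6 dimming-only CS symbols up front.
    print(build_subframe(["ON", "CS-I", "OFF", "CS-II", "ON", "OFF"]))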
2.6 Robust Decoding of CS

2.6.1 Dynamic Optical Threshold

As shown in Figure 2.9, the receiver checks the grayscales of the two halves of a received symbol to identify its symbol type using grayscale thresholds. In LiFOD's non-flicker design there are three brightness levels, B0, B1, and B2, and the receiver distinguishes them based on grayscale thresholds informed by a preamble from the transmitter. However, as shown in Figure 2.12, a received grayscale is not identical to the transmitted one under the four different dimming levels (i.e., B2's incremental brightness). The received grayscales of different brightness levels may overlap, and B2 at different dimming settings can influence the perceived brightness of B0 and B1 due to their continuous distribution in the PHY frame. To identify an optical symbol's type under varying brightness, the receiver must therefore be informed of dynamic thresholds among B0, B1, and B2 via preambles from the transmitter. Grayscale thresholds are measured and calculated based on short training symbols in the preamble field, and the threshold values are dynamically adjusted based on these measurements.

Figure 2.12 Grayscale diagram of B0, B1, B2 on four incremental dimming levels.

2.6.2 Rebalanced Magnitude Distance

In addition to our dynamic threshold measurement with preambles for different dimming settings in varying environments, we also need to combat environmental influences. When an optical signal radiates away from its transmitting light source, the signal spreads out in different directions; parts of the spreading light beams reflect off objects and arrive at the receiving light sensor via different paths. Consequently, different ambient light brightness impacts the detection of the original optical symbols.

If the ambient light is weak, the brightness of B1 or B2 dominates the receiver's sensed intensity. When the ambient light gets stronger, it dominates the received brightness, and B0, B1, and B2 all reach a similarly high grayscale level, as shown on the left of Figure 2.13. The same happens when the transmission distance increases: as the distance between transmitter and receiver grows, ambient light again dominates the receiver's brightness, as shown on the right of Figure 2.13, and the intensities of B0, B1, and B2 all fall to a similarly low grayscale level.

Figure 2.13 Influence of ambient light and distance.

These two factors make the perceived brightness magnitudes significantly harder to distinguish from one another, so the received symbol is not identical to the transmitted optical symbol. We therefore estimate the optical channel response using the preamble and conduct equalization to eliminate the influence of ambient light and transmission distance. Suppose the optical channel response is $H(O)$ and the transmitted brightness is $b$. The received brightness is

$b' = H(O) b$.    (2.1)

A sequence of known brightness values $S$ in the preamble is transmitted to help estimate the channel response. $H(O)$ is estimated as

$\hat{H}(O) = S'/S$,    (2.2)

where the received brightness $S'$ includes the ambient light and transmission distance factors. $\hat{H}(O)$ is not equal to $H(O)$ due to other noise sources, such as temperature variation and the noise figure at the receiver, but it is still well estimated because $S$ is known at the receiver. The subsequent brightness magnitudes $x$ (B0, B1, or B2) are finally estimated by multiplying the received brightness $x'$ by the multiplicative inverse of the estimated optical channel response $\hat{H}(O)$:

$\hat{x} = x'/\hat{H}(O)$.    (2.3)
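A minimal Python sketch of the preamble-based estimation in Equations (2.1)-(2.3): the known training brightness S and its received counterpart S' yield Ĥ(O), subsequent samples are divided by Ĥ(O), and each equalized magnitude is snapped to the nearest of the three designed levels. All numeric values here are placeholders, not measurements.

    def estimate_channel(sent_preamble, received_preamble):
        """Estimate H(O) from known training brightness values (Eq. 2.2)."""
        ratios = [r / s for s, r in zip(sent_preamble, received_preamble)]
        return sum(ratios) / len(ratios)

    def equalize(received, h_hat):
        """Invert the channel: x_hat = x' / H_hat(O) (Eq. 2.3)."""
        return [x / h_hat for x in received]

    def classify(x_hat, levels=(0.0, 0.5, 1.0)):       # B0, B1, B2 (illustrative)
        """Snap an equalized magnitude to the nearest designed brightness level."""
        return min(levels, key=lambda b: abs(b - x_hat))

    # Known preamble S and a weaker received copy S' (e.g. a longer distance).
    S  = [0.0, 0.5, 1.0, 0.5]
    Sp = [0.02, 0.31, 0.62, 0.29]
    h  = estimate_channel(S[1:], Sp[1:])               # skip the zero-level sample
    print([classify(x) for x in equalize([0.30, 0.01, 0.60], h)])  # -> [0.5, 0.0, 1.0]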
2.6.3 Robust CSC Notification

Preambles are used in LiFOD to notify the receiver of the CSC codes used in our system. The IEEE 802.15.7 standard [2] defines the format of the Physical Protocol Data Unit (PPDU). The PHY frame consists of a synchronization header (SHR), a PHY header (PHR), and a Physical Service Data Unit (PSDU); the SHR contains the preamble field. CSC-I and CSC-II are prepended to the data packet in the preamble field to inform the receiver of the bit patterns being used. The receiver stores the CSC codes and knows that they are specified for CS-I and CS-II symbols separately.

When the receiver estimates the transmitted brightness magnitude by dividing by the estimated optical channel response $\hat{H}(O)$, the absolute magnitude change on a symbol with a lower magnitude is smaller than that on a higher-magnitude symbol, as shown in Figure 2.14. For example, if the estimation indicates that a received symbol should be magnified by 20%, the absolute magnitude changes of different symbols differ: low-magnitude symbols have a minor error margin, while the magnitude errors of high-magnitude symbols are scattered over a broader range. Because LiFOD adopts only three brightness magnitudes (B0, B1, B2) in its symbol design, the equalization can successfully eliminate the influence of the varying environment. When compensation symbols are used for transmission, dimming does not impact ON/OFF symbol identification, owing to the smaller magnitude estimation error margin of B0 and B1 in OOK symbols compared with B0 and B2 in CS symbols. Nevertheless, if there were too many types of CS symbols, the decoding performance of the higher-magnitude CS symbols would worsen due to the broader estimation error margin. LiFOD uses two CS symbols with B0 and B2 brightness magnitudes, ensuring robust CSC notification.

Figure 2.14 The normalized magnitude estimation error margin of 15 detections in varying environment.

2.7 Implementation and Evaluation

2.7.1 Hardware

Transmitter. Our LiFOD transmitter consists of several commercial components: two regular LED lamps (LED1, LED2), and MOSFET and BeagleBone Black (BBB) boards, as shown in Figure 2.15. LED1 is used to generate constant-brightness OOK symbols; LED1 and LED2 together generate variable-brightness CS symbols. They are controlled uniformly by the BBB board. Because the BBB can only provide 3.3 V control signals, which cannot drive high-power LEDs, we use a MOSFET transistor as a fast switch to drive the LEDs. To provide variable and fine-grained dimming, we wired a potentiometer as a dimmer knob between the DC power supply and the LED's positive lead. We removed the AC-DC converter of the off-the-shelf LED lamp, which otherwise affects the ON-OFF switching speed significantly.

Figure 2.15 LiFOD prototype: transmitter, receiver and experiment scenarios in day and night.
Receiver. The LiFOD receiver prototype has three main components: an analog-to-digital converter (ADC), an operational amplifier (OPA), and a photodiode (PD), as shown in Figure 2.15. The light is sensed by the PD, which converts the optical signal into a small current that is then amplified by the OPA. Finally, the analog values are converted into digital values in the SPI data format, and the SPI data is processed to estimate the analog light intensities for symbol decoding. The driving circuit can be fully powered and controlled by the BBB.

System cost. The system cost of LiFOD is shown in Table 2.3. The BeagleBone Black board ($80) in our prototype can be fully replaced with the cheaper BeagleBone Pocket ($37). In total, including transmitter and receiver, the LiFOD system costs less than $100.

Component       Brand/Model/Type             Unit Price (USD)
LED Bulb        BAOMING-5W-MR16              4.2
MOSFET          BOJACK-30N06LE               0.7
Photodiode      OSRAM SFH206K                1.4
Op-amplifier    Todiys-TLC272                2.4
ADC             TI-ADS7883                   3.2
Potentiometer   HUAREW-PTM15                 0.1
BBB board       BeagleBone Black or Pocket   80 or 37

Table 2.3 Price table and system cost of LiFOD.

2.7.2 Software

There are two main tasks on the software side: (1) send out optical symbols at high speed from the transmitter; (2) demodulate the received optical symbols at high speed and reliably at the receiver. We use low-cost BBB platforms. Ideally, the PRU of the BBB can achieve modulation and demodulation at the 200 MHz level, but commercial LED lamps significantly distort light signals at such high transmission frequencies, so we set the transmission frequency at the hundreds-of-KHz level, the same as the state-of-the-art SmartVLC and OpenVLC. Other software modules, such as our lightweight bit pattern mining and CS relocation shown in Figure 2.3, run on the BBB as firmware to provide services between the PHY layer and the upper layers.

2.7.3 Setup

(1) Dataset. We choose two real-world datasets, SIGCOMM17 and CAIDA19, to simulate a user's daily Internet traffic. (2) Transmission frequency. We set the transmission frequency to be lower than 200 KHz. (3) Sampling rate. To better capture the optical symbol shape, we set the ADC sampling rate to 1.2 MHz, six times the transmission frequency. (4) Ambient light setting. Based on real-world scenarios, we conduct experiments in a 4 m × 8 m living room in both day and night scenarios. (5) Dimming setting. We set the dimming level by finely adjusting the dimmer knob and use a light meter to measure its granularity.

2.7.4 Lighting Performance

Fine-grained dimming: The brightness of LiFOD can be manually adjusted to any continuous setting. We evaluate ten incremental dimming levels at different distances, as shown in Figure 2.16. The dimming range is from 0 lux to 450 lux, which meets the office lighting requirement of the U.S. General Services Administration [27]. As the dimming setting index increases, the brightness sensed by the user increases, with the absolute level depending on the day or night scenario. The experimental results show that the dimming function works well.

Non-flicker performance: We measure the non-flicker performance with the light meter based on photometric quality, i.e., the range of the foot-candle (FC) value from its maximum to its minimum. The wider the range of FC values, the higher the flickering possibility. When the transmission frequency increases, the flicker possibility decreases for both optical symbol designs.
Figure 2.16 shows that the transmission frequency above which users sense no flicker is lower for LiFOD's non-flicker symbols than for the original optical symbols. Because CS symbols can otherwise appear at unexpectedly low frequencies, LiFOD's non-flicker symbols provide smoother, flicker-free lighting than the original symbol design, even at a very high transmission frequency such as 200 KHz. The results show that our flicker-mitigation solution addresses flicker well.

Figure 2.16 Dimming and non-flicker evaluation.

We also investigate users' perception of flickering and the comfort of the lighting, as shown in Table 2.4. Three volunteers were invited to experience the lighting function of LiFOD. Each user scores their experience at 10 dimming settings under different conditions, such as facing the LED lamp directly or from the side and at different distances from it. The results show that all users have a good experience, with comfortable and stable lighting perception.

View / Distance   User A (FLK/LIT)   User B (FLK/LIT)   User C (FLK/LIT)   Average (FLK/LIT)
direct view       9 / 10             10 / 10            10 / 10            9.7 / 10
side view         10 / 8             10 / 10            10 / 9             10 / 9
1 m               8 / 9              10 / 9             9 / 9              9 / 9
3 m               10 / 8             10 / 10            10 / 9             10 / 9
5 m               10 / 8             10 / 10            10 / 9             10 / 9

Table 2.4 Users' perception scores of flickering (FLK) and lighting (LIT) for 10 dimming settings at 100 KHz transmission frequency. If a user senses no flicker or has comfortable lighting at a specific setting, the score is 1, otherwise 0; the value in each cell is the sum over the 10 settings.

2.7.5 Communication Performance

In this section, we evaluate the throughput performance of LiFOD in three aspects: (1) throughput vs. transmission frequency and distance; (2) throughput vs. incidence angle and position; (3) throughput comparison with state-of-the-art OWC schemes that consider fine-grained dimming and high-speed communication simultaneously.

(1) Impact of transmission frequency and distance.

Figure 2.17 Throughput vs. distance and frequency.

We first evaluate LiFOD's throughput at different transmission frequencies and distances based on the two real-world data traces. As shown in Figure 2.17, the throughput increases significantly as the transmission frequency increases at the same distance. Although increasing the distance causes the throughput to decline, the decline is less noticeable due to the reliable OOK modulation and our robust symbol detection. Thanks to the bonus bits introduced by CSC, LiFOD achieves up to 400 Kbps at a range of up to 6 m on SIGCOMM17 traffic. This is about 2.7 times better in throughput and 1.5 times better in communication range compared with the latest OpenVLC (150 Kbps on average at 4 m under optical interference).

(2) Impact of incidence angle and position.

Because light beams emit and spread in a line-of-sight (LOS) manner, the pointing and direction setting is essential in high-speed OWC systems.
We evaluate the influence of different facing angles and receiver locations, as shown in the experimental schematic of Figure 2.18. The transmitter is fixed while the receiver's location and facing angle are changed in increments of 5° and 2 cm from its base location L0 and base direction. We set the transmission distance from the receiver's L0 to the transmitter to 3.5 m and the transmission frequency to 125 KHz for our two data traces.

Figure 2.18 Throughput vs. incidence angle and position.

As shown in Figure 2.18, when the receiver is set at L0, LiFOD can tolerate more unaligned angles. When the receiver is moved left or right within a small range, such as 2 or 4 cm, the behavior is the same. For larger location movements, the throughput can drop dramatically unless a proper angle is set. The performance trend is consistent for the two data traces. Thus, for real-world usage of LiFOD it is important to ensure that the transmitter's light points directly at the receiver; this is, however, consistent with the normal habit of pointing lamps for daily lighting.

(3) Throughput comparison with the state-of-the-art.

Finally, we compare LiFOD with the state-of-the-art methods OOK-CT, MPPM, and AMPPM discussed in SmartVLC [115]. We use the same transmission frequency of 125 KHz and distance of 3.5 m as described in SmartVLC. OOK-CT is OOK with Compensation Time: it keeps the CS symbols' amplitude constant and only changes the number of inserted CS symbols for dimming. Thus, OOK-CT, MPPM, and AMPPM are coupled-dimming-based OWC schemes. We evaluate LiFOD's performance with the SIGCOMM17 and CAIDA19 data traces, and we also transmit OOK symbols without the CSC bonus in LiFOD as a baseline.

Figure 2.19 Comparison with the state-of-the-art [115].

First of all, LiFOD's throughput is better than that of the coupled-dimming-based OWC methods in all scenarios. The reason is that LiFOD decouples dimming from transmission and releases most time slots for standard data symbol transmission. Based on the different CSC bonus ratios of the traces, LiFOD on SIGCOMM17 traffic performs best and achieves 250 Kbps at all dimming settings, an improvement of at least 110% over AMPPM. Although lower than on SIGCOMM17, LiFOD on CAIDA19 traffic, which collects the daily network traffic of a city in the US, still achieves 155 Kbps at all dimming settings, corresponding to at least a 34% improvement over AMPPM in SmartVLC (whose best throughput is 120 Kbps).

2.8 Discussion and Summary

Generalizability. The throughput improvement ratio of LiFOD depends on the bonus ratio of the traffic. Other OWC platforms, such as LiFi systems, can apply the LiFOD approach to improve their performance. Suppose a common OWC platform is improved in engineering or as a product, with robust symbol transmission and decoding at the MHz/GHz level.
In that case, LiFOD can also be adopted to achieve a throughput improvement at the same boost ratio, and may reach data rates of hundreds of Mbps or even Gbps with fine-grained dimming support.

LiFOD exploits the opportunity of expanding dimming methods for use in data transmission: it uses compensation symbols as a side channel to carry data bits and improve throughput in OOK-based OWC networks. First, we design a lightweight greedy algorithm to identify bit patterns that maximize the total bonus bits on real-world traces. Then we utilize the preamble to notify CSC codes and dynamic thresholds and to estimate channel conditions for robust demodulation in a changing optical environment. Most importantly, we design non-flicker optical symbols and a compensation symbol relocation scheme to support smooth lighting and communication with improved throughput. LiFOD achieves up to 400 Kbps throughput at communication ranges up to 6 m with fine-grained dimming. Compared with SmartVLC under the same transmission parameters, LiFOD improves throughput by more than 34% and 110% on the two real-world data traces, respectively, at all dimming levels.

CHAPTER 3

BOOSTING OCC VIA 2D SPATIAL-TEMPORAL DIVERSITIES

Optical camera communication (OCC) has garnered increasing attention, driven by the widespread availability of affordable mobile devices equipped with built-in cameras. Additionally, OCC stands out for its low interference with ambient light, distinguishing it from other optical wireless communication (OWC) techniques. Notably, OCC offers location-based services (LBS), enabling fine-grained AR navigation through the association of data from visible transmitters within a flexible communication range [95, 24]. Despite these advantages, developing a high-speed and practical OCC system remains an open challenge, particularly for LED-based OCC. In this project, our main objective is to design a practical data embedding protocol that capitalizes on the 2D spatial diversities of optical signals. By doing so, we aim to overcome the limitations of existing optical camera communication systems and break through the current bottleneck caused by the low frequency response at the receiver side.

3.1 Motivation

Currently, the radio frequency (RF) spectrum below 10 GHz is widely utilized for everyday wireless communication. However, with the increasing demand for massive high-speed wireless services in the future, even higher RF bands like mmWave and nanometer waves may soon become inadequate [95, 24, 83]. In contrast to the strictly regulated RF band, which covers frequencies between 3 kHz and 300 GHz on the electromagnetic spectrum, the optical spectrum boasts a bandwidth over 10,000 times broader than the RF spectrum [15].

The growing adoption of light-emitting diode (LED) lamps for indoor and outdoor lighting, as well as information display, is due to their energy efficiency, cost-effectiveness, and extended lifespan. These widespread LED infrastructures, including home lighting fixtures, street lamps, traffic lights, and car headlights [13, 85], possess superior ON/OFF switching rates. This characteristic facilitates optical wireless communication (OWC) in various aspects of our daily lives [15, 151]. OWC offers reliable connections through line-of-sight (LOS) spread, ensuring secure communication and high-capacity networks with broad spectrum bandwidth, low power consumption, and high speed compared to RF-based communication [42].
In contrast to RF approaches, optical wireless communication (OWC) offers several advantages, including reliable line-of-sight (LoS) connections for secure communication and spatial multiplexing. High-capacity networks are made possible by leveraging spatial multiplexing and broad spectrum bandwidth, while still maintaining low power consumption for high-speed services [141, 145].

There are primarily two types of OWC based on the receiver type: (1) PD (photodiode) based OWC, exemplified by technologies such as LiFi [84], and (2) camera-based OWC, commonly referred to as optical camera communication (OCC) [2, 15]. OCC can be further classified by transmitter type into (1) LCD-OCC, liquid-crystal-display-based OCC such as screen-camera communication [82, 138, 58], and (2) LED-OCC, LED-based OCC such as ColorBar and CASK [124, 38]. We discuss their differences below.

The LCD-OCC approach captures each frame and subsequently decodes the embedded data, such as a QR code, in that frame. Although the spatial diversity provided by millions of pixels on both the screen and camera sides is exploited for dense data embedding in each frame and achieves hundreds of Kbps under the constraint of the LC's low response frequency of tens of Hz [82, 138], the expensive screen, complicated decoding, and limited range (i.e., within 0.9 m) hinder it from reaching a market as large as LED-OCC's.

LED-OCC utilizes the LED's fast On/Off switching rate rather than the slow liquid crystal and thus, in contrast to LCD-OCC, records data at the camera's shutter rate, which is faster than the frame rate. Researchers have made many attempts to further improve its data rate, including Yanbing et al. [124, 122], who investigated a high-order modulation, CASK (composite amplitude shift keying), which encodes data into different brightness levels, and Pengfei et al. [38, 37], who proposed ColorBar, which uses CSK (color shift keying) in OCC to encode data into different colors. They achieved up to 8 Kbps for commercial smartphone-based OCC. However, these approaches only consider the grayscale difference (amplitude diversity) and color difference (spectrum diversity) recorded in 1D rolling strips for an improved data rate and do not consider the 2D spatial diversity of optical imaging at both the transmitter and receiver sides.

Figure 3.1 The illustration of 2D rolling blocks spatial diversity in our proposed (c) RainbowRow and its comparison with 1D rolling strips spatial diversity in state of the arts in OCC: (a) CASK [124] and (b) ColorBar [38].

As shown in Figure 3.1, in the camera imaging process, existing LED-OCC systems do not consider spatial diversity and treat the whole row (1D rolling strips) from the rolling shutter as one value by taking the overall average. However, the camera can capture transmitter units at different horizontal locations in each row with different amplitudes and colors, generating 2D rolling blocks that embed more data and therefore boost the data rate of OCC.
In short, existing LED-OCC approaches, including the high-order CASK modulation of Yanbing et al. [124, 122] and the CSK-based ColorBar of Pengfei et al. [38, 37], achieve less than 8 Kbps for commercial smartphone-based OCC because they only combine amplitude diversity (grayscale differences) and spectrum diversity (color differences) with 1D rolling strips in modulation, and do not exploit the 2D rolling blocks spatial diversity of camera imaging.

Motivation: (1) RF techniques are insufficient for future numerous high-speed and high-density services due to the congested spectrum and severe interference. (2) PD-based OWC such as LiFi senses light with a single pixel and thus requires rigorous pointing and is vulnerable to ambient light. (3) Although LCD-based OCC uses spatial diversity, its market potential is hindered by the slow LC response frequency, expensive screens, and limited range. (4) Existing LED-OCC approaches do not share the drawbacks in (1)-(3); however, they only consider amplitude and spectrum diversities in 1D rolling strips and achieve a limited data rate. (5) Despite using multiple spatially separated LED sources and camera pixels to achieve spatial-redundancy forward error correction (FEC), UFSOOK (undersampled frequency shift on-off keying) encodes data with On/Off blinking at the frame-rate level (tens of Hz) and does not exploit the rolling effect and 2D rolling blocks in transmission [2].

To address the problems above, we design RainbowRow, an OCC framework with 1D spatial diversity in the design of the transmitter and 1D temporal diversity enabled by the rolling shutter effect, as illustrated in Figure 3.1. RainbowRow is made up of an LED bar with four transmission units and a standard camera. Our RainbowRow protocol has the following five key features. (1) Low cost: it only requires basic LEDs and cameras. (2) High speed: it significantly enhances data transfer, exceeding conventional LED-OCC by a factor of 20. (3) Because of the camera's pixel count and the simple modulation, it remains unaffected by motion and ambient light, with customizable distance and a wide field of view. (4) Energy saving: the LEDs conserve energy while acting as data transmitters and lighting sources. (5) Practicality: RainbowRow is suitable for a variety of applications, including indoor communication and vehicular networks, while also providing illumination.

Figure 3.2 Amplitude diversity: generation at Tx and detection at Rx.

3.2 Background and Related Work

3.2.1 Amplitude Diversity

Amplitude diversity is generated by different brightness levels of the light source and measured by a light sensor (i.e., a photodiode, photoresistor, or camera) as grayscale, as depicted in Figure 3.2. Due to the photoelectric effect, these semiconductor devices transform optical signals into electrical signals, and thus different brightness levels can encode data bits. Suppose the detected grayscale range is normalized from 0 to 255. Ideally, we could design a 256-ASK (amplitude shift keying) modulation mapping 256 grayscale levels into 8 bits.
However, because of the narrow range of illumination and the varied optical environment, most OWC systems can only map 8 grayscale levels into 3 bits. Additionally, as seen in Figure 3.3, the data rate changes nonlinearly while the amplitude diversity changes linearly. When the amplitude diversity increases from 16 to 64, the number of bits denoted by each symbol improves from 4 to 6, but the symbol distance drops sharply from 16 to 4. The shorter symbol distance that comes with higher-order ASK yields minor performance improvements but significant detection errors because of the smaller margin for correct detection between symbols. RainbowRow adopts an amplitude diversity of 4, a relatively low order, to keep transmission robust, and supplements it with spectrum and spatial diversity to increase data throughput while preserving robustness.

Figure 3.3 Symbol distance/bits per symbol vs. amplitude diversity.

Figure 3.4 Spectrum diversity: generation at Tx and detection at Rx.

3.2.2 Spectrum Diversity

Commercial RGB tri-LEDs can generate a variety of colors by combining different amounts of red (700 nm), green (546.1 nm), and blue (435.8 nm) light based on the RGB model. For example, as shown in Figure 3.4, the mixture of pure red and green light emits yellow light. A set of RGB values is ultimately applied to the LED's voltage to generate colored (i.e., different wavelength/frequency) optical symbols. For color detection, three filters with R, G, and B wavelength sensitivities measure the red, green, and blue color components, respectively, and the color of the optical signal is categorized based on the activation of these filters. The sensor also contains a light-to-voltage converter and responds by producing a voltage proportional to the detected color at the receiver. To utilize spectrum diversity for transmission, the IEEE OWC standard [2] defines color shift keying (CSK) modulation.
In CSK, the optical symbols are generated from points on the CSK constellation triangle based on the RGB model, as shown in Figure 3.5. The CSK constellation is decided by combining the three selected color bands, which form a triangle on the xy color coordinates of CIE 1931 [16]. It increases the symbol space and distance compared with ASK modulations of the same order. However, CSK modulation imposes complicated and demanding control requirements on the transmitter, with additional overhead and cost. Moreover, different devices generate different optical signals even with the same input RGB parameters. Furthermore, even when detecting the same optical signal from the same device, the varying optical environment can make accurate symbol recognition at the receiver challenging for high-order CSK such as 16-CSK and 32-CSK [38].

Figure 3.5 Comparison of RGB and HSL model [16].

Compared with the RGB model used for color generation, the HSL model is more natural for describing colors and more popular for color recognition, as shown in Figure 3.5. H stands for Hue, corresponding to red, orange, yellow, green, cyan, blue, violet, and so on; hue reflects the changes and differences among colors most directly, i.e., the spectrum diversity of the optical wavelength. S (Saturation) reflects how many kinds of wavelengths are mixed in the light. L stands for Lightness or Luminance and reflects the grayscale of the light. The HSL model thus separates the lightness and the color of the light, which correspond to the amplitude and spectrum diversities, respectively.

3.2.3 Spatial Diversity

(1) Camera shutter and spatial diversity in the camera. The shutter is an essential camera mechanism that controls the effective exposure time. There are two shutter types, global shutter and rolling shutter, as shown in Figure 3.6. (1) A global shutter exposes the whole scene at the same time: the light sensors at all pixels collect light synchronously, starting collection at the beginning of the exposure and cutting off sensing at its end. (2) Unlike a global shutter, a rolling shutter exposes one row of pixels at a time and generates the entire image row by row.

Spatial diversity is generated by the millions of pixels of the 2D image sensor when multiple light sources appear in the camera's FOV. Each pixel, or cluster of pixels, can record the optical features, such as the amplitude and spectrum diversities, of each light source in the FOV. Based on the camera shutter type and the transmission frequency of the LED sources, spatial diversity can be classified into two categories: (1) spatial diversity with frame-level update speed, and (2) spatial diversity with faster row-level updates, as depicted in Figure 3.7(a) and (b), respectively, and illustrated below.

Figure 3.6 Rolling shutter effect.
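As a rough back-of-the-envelope illustration of why the rolling shutter samples an LED far faster than the frame rate, the Python helper below estimates how many image rows one optical symbol occupies. The sensor geometry and readout time are illustrative assumptions, not measurements from this chapter.

    def rows_per_symbol(tx_freq_hz, rows_per_frame, frame_readout_s):
        """Approximate number of image rows covered by one optical symbol."""
        row_interval = frame_readout_s / rows_per_frame   # time between row exposures
        symbol_duration = 1.0 / tx_freq_hz
        return symbol_duration / row_interval

    # Example: a 1080-row sensor read out in ~1/60 s and an LED switching at 2 kHz
    # give roughly 32 rows (strip height) per symbol, i.e. the receiver effectively
    # samples the LED thousands of times per second instead of 60 times per second.
    print(round(rows_per_symbol(2_000, 1080, 1 / 60), 1))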
Figure 3.7 Spatial diversity in camera imaging with frame / row level updates.

(2) Update at the frame level vs. the row level.

Frame-level updated spatial diversity. When one period of transmitted data from all light sources in the FOV is emitted (synchronously or asynchronously) over the frame period and captured by a camera, whether global shutter or rolling shutter, the captured frame has no rolling strips and the transmitted data is decoded at the frame level. For example, existing screen-camera communication approaches [82, 58] capture each frame as a full unit and subsequently decode the embedded data, such as a QR code, in that frame. UFSOOK [2] is also updated at the frame level, even though it repeats the data over several LEDs to provide spatial-redundancy FEC.

Row-level updated spatial diversity. When one period of transmitted data from all light sources in the FOV is emitted synchronously within the rolling shutter period and captured by a rolling shutter camera, the captured frame has rolling strips and the transmitted data is decoded at the rolling-shutter rate, which is much faster than the frame rate. Compared with existing screen-camera communication and UFSOOK, which utilize the slow frame-level spatial diversity, approaches that adopt the rolling-shutter-level update speed achieve a higher data rate. Nonetheless, these approaches (e.g., ColorBar, CASK [124, 38]) do not consider spatial diversity and only exploit 1D rolling strips in communication, instead of the 2D rolling blocks in our proposed RainbowRow.

Figure 3.8 RainbowRow system overview and technical challenges at both the transmitter and the camera sides.

3.3 Our Approach: RainbowRow

System Overview: Our proposed RainbowRow consists of two parts, as shown in Figure 3.17: (1) a tri-color LED bar based RainbowRow transmitter, and (2) a commercial camera based mobile RainbowRow receiver. LED Transmitter: the LED bar consists of 4 spatial transmission units, each including 3 LED bulbs (i.e., red, green, and blue). Camera Receiver: the receiver is a commercial camera, such as a COTS smartphone. The transmission workflow is: (1) bit stream split, (2) RainbowRow symbol mapping, (3) BeagleBone Black (BBB) based fast and synchronized signal emission, (4) vertical/horizontal gap insertion, (5) mounting initialization, and (6) angle mismatch adaptation.
The decoding workflow at the receiver side is: (1) mounting initialization, (2) use case adaptation, (3) shutter and other camera parameter settings, (4) lens configuration, (5) frame-by-frame image capture, and (6) symbol decoding and data parsing.

Technical Challenges. (1) Modeling of spatial diversity in 2D rolling blocks: the spatial diversity in 2D rolling blocks has never been considered and exploited before. It is a challenge to investigate and model 2D rolling blocks clearly because this spatial diversity depends on the LED transmitter, the optical propagation, and the rolling shutter camera. (2) Optical imaging management at both the Tx and Rx sides: in contrast to the 1D rolling strips in existing work, it is a challenge to control multiple spatially located LED transmission units so that they emit optical signals synchronously at high frequency. The optical signals from different transmission units can also corrupt decoding through mutual interference and overlap, even though the inner fusion of optical signals within each transmission unit is the basis of the amplitude and spectrum diversities. (3) Practical adaptations for real use cases: a misaligned rotation angle between the LED bar and the horizontal axis of the camera results in a data rate drop in the indoor office setting. Additionally, in vehicular scenarios, RainbowRow faces weak optical signals over long distances and a variety of horizontal gaps at different viewing angles.

Our main contributions can be summarized as follows:

• RainbowRow is the first work to employ 2D rolling blocks for LED-based optical camera communication. We model 2D spatial diversity in optical imaging and use it to break the throughput bottleneck of LED-OCC systems.

• We propose the RainbowRow protocol, which exploits the spatial diversity of 2D rolling blocks instead of 1D rolling strips and combines it with amplitude and spectrum diversities to boost LED-OCC's data rate.

• We implement a RainbowRow prototype based on commercial devices and address technical challenges including optical imaging management in transmission and adaptations for indoor/vehicular cases.

• We evaluate RainbowRow on our testbed and conduct a case study of two real-world applications to demonstrate its practicality. RainbowRow achieves up to 170 Kbps, over 20 times that of existing LED-OCC approaches.

3.4 2D Rolling Blocks Modeling

3.4.1 Why an LED bar instead of an LED matrix?

Each row comprises multiple pixels, which can represent multiple colors or grayscales in different parts of that row. This spatial diversity within each row provides great potential to boost throughput at no additional cost by allowing more data to be embedded across multiple light sources. We name this spatial diversity 2D rolling blocks spatial diversity to differentiate it from the 1D rolling strips spatial diversity in the state-of-the-art.
Figure 3.9 The illustration of 2D rolling blocks with diversity combination of amplitude and spectrum.

To take advantage of the spatial diversity at the receiving end, spatially related coding and modulation are required at the transmitting end. As shown in Figure 3.9, the transmitter is designed as an LED bar with multiple transmission units arranged horizontally instead of an LED matrix. Each transmission unit generates temporally varying optical signals with different brightness levels and colors that are recorded as vertical strips, while the other transmission units, located horizontally in the camera's FOV, emit synchronously. RainbowRow creatively combines this spatial diversity with the fast shutter-rate-level temporal diversities (i.e., amplitude and spectrum) in LED-OCC modulation via 2D rolling blocks, rather than using the fully spatial diversity of an LED matrix, which updates only at the slow frame level and suffers more severe vertical interference.

3.4.2 Is it possible to boost OCC via 2D Rolling Blocks?

We propose to combine amplitude diversity, spectrum diversity, and the spatial diversity of 2D rolling blocks to improve the data rate of OCC systems, as shown in Figure 3.9. The benefit of this combination is that we avoid the short-symbol-distance limitation of each individual diversity and can employ a robust, moderate range within each diversity to encode and decode the data separately. Let A denote the amplitude diversity, and let S1 and S2 denote the spectrum diversity and the spatial diversity of 2D rolling blocks, respectively. The number of bits encoded in each symbol can be represented as

$\log_2(A \times S_1) \times S_2$.    (3.1)

For instance, we adopt 4 brightness levels and 4 colors, the same order as 4-CASK and 4-CSK, respectively. The modulation and decoding within each diversity of the 4 spatially located transmission units are simple and reliable compared with high-order modulations such as 8/16-CASK or 32/64-ColorBar [38, 124]. This diversity combination can output a total of $\log_2(4 \times 4) \times 4 = 16$ bits per symbol period without the limitation of a short symbol distance in any individual diversity, and it is both faster and more robust.
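Equation (3.1) is easy to sanity-check in code. The Python snippet below reproduces the 16 bits per symbol of the 4-order configuration (A = 4 amplitudes, S1 = 4 colors, S2 = 4 spatial units) as well as the per-symbol bit counts of the simpler schemes explored next in Section 3.4.3; the parameter names are ours, chosen to match the symbols in Equation (3.1).

    from math import log2

    def bits_per_symbol(amplitudes, colors, spatial_units):
        """Eq. (3.1): log2(A x S1) x S2 bits per symbol period."""
        return log2(amplitudes * colors) * spatial_units

    print(bits_per_symbol(4, 4, 4))   # RainbowRow (4-S-4-A-4-CSK): 16.0 bits
    print(bits_per_symbol(4, 1, 1))   # 4-ASK: 2.0 bits
    print(bits_per_symbol(2, 1, 4))   # 4-SOOK: 4.0 bits
    print(bits_per_symbol(1, 4, 4))   # 4-S-4-CSK: 8.0 bits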
3.4.3 RainbowRow Modulation
(1) Modulation Exploration. To design a robust and fast OCC system, we explore 9 modulation methods on our testbed across the spatial, spectrum, and amplitude diversities, as shown in Figure 3.10. For each diversity, we set up to 4 levels for illustration.
OOK: On-Off-Keying is the primary amplitude-based modulation; it is 2-Amplitude-Shift-Keying. It only has amplitude diversity.
4-ASK: 4-Amplitude-Shift-Keying utilizes four amplitude statuses to denote 2 bits in each symbol. It only has amplitude diversity.
4-SOOK: 4-Spatial-On-Off-Keying adopts basic OOK at four different spatial locations, making each symbol denote 4 bits, 4 times that of OOK. It has amplitude and spatial diversities.
4-S-4-ASK: 4-Spatial-4-Amplitude-Shift-Keying adopts 4-ASK at four different spatial locations, making each symbol denote 8 bits, 4 times that of 4-ASK. It has amplitude and spatial diversities.
4-SC-4-ASK: 4-Spatial-Colored-4-Amplitude-Shift-Keying adopts 4-ASK at four different spatial locations. The only difference from 4-S-4-ASK is that each ASK has a different color instead of the same color. It still only has amplitude and spatial diversities, without spectrum diversity.
4-CSK: 4-Color-Shift-Keying utilizes four colors to denote 2 bits in each symbol. It only has spectrum diversity.
4-A-4-CSK: 4-Amplitude-4-Color-Shift-Keying utilizes four colors combined with four amplitudes to denote 4 bits in each symbol. It has amplitude and spectrum diversities.
C-4-SOOK: Colored-4-Spatial-On-Off-Keying is similar to 4-SOOK with the same number of denoted bits. The only difference is that the OOK at each location uses a different color instead of the same color. It still only has spatial and amplitude diversities, without spectrum diversity.
4-S-4-CSK: 4-Spatial-4-Color-Shift-Keying adopts 4-CSK at four different spatial locations, making each symbol denote 8 bits, 4 times that of 4-CSK. It has spatial and spectrum diversities, without amplitude diversity.
Figure 3.10 The illustration and captured images of 9 explored modulations and RainbowRow balanced coding table.
(2) 4-order RainbowRow. As shown in Figures 3.10 and 3.11, RainbowRow adopts 4-Spatial-4-Amplitude-4-Color-Shift-Keying, which uses 4-CSK combined with 4-ASK at four different locations, making each symbol denote 16 bits. This is a significant improvement on existing work [38, 124]. We named it RainbowRow due to the generated strip patterns with random colors and lightness at different locations on a specific row. Ideally, the RainbowRow protocol can extend to N-order and transmit log2(𝑁 × 𝑁) × 𝑁 bits per RainbowRow symbol. Moreover, RainbowRow can fully utilize the amplitude and spectrum diversities to present random bit sequences at each location, guaranteeing the random appearance of different colors and lightness for non-flickering during data transmission.
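To make the mapping concrete, the sketch below splits one 16-bit RainbowRow symbol into four 4-bit segments, one per transmission unit, and maps each segment to a (color, amplitude-level) pair following the ordering suggested by the balanced coding table in Figure 3.11 (Red→Yellow, level-1→level-4); treat the exact assignment as illustrative rather than the prototype's implementation.

```python
# A minimal sketch of RainbowRow symbol encoding, assuming a uniform mapping of
# 4-bit segments to (color, amplitude-level) pairs in the Figure 3.11 ordering.
COLORS = ["Red", "Green", "Blue", "Yellow"]
LEVELS = [1, 2, 3, 4]

def encode_symbol(bits16: str):
    """Split a 16-bit string into four 4-bit segments, one per transmission unit."""
    assert len(bits16) == 16
    units = []
    for i in range(4):
        value = int(bits16[4 * i: 4 * (i + 1)], 2)   # 0..15
        color = COLORS[value // 4]                   # two bits select the color
        level = LEVELS[value % 4]                    # two bits select the amplitude level
        units.append((color, level))
    return units

print(encode_symbol("0010101111000111"))
# -> [('Red', 3), ('Blue', 4), ('Yellow', 1), ('Green', 4)]
```

Because each of the 16 patterns corresponds to exactly one color/amplitude combination, uniformly distributed data bits give every combination the same appearance probability, which is the balanced-coding property used for flicker mitigation below.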
Figure 3.11 RainbowRow balanced coding table. Each transmission unit (Location #1–#4) uses the same mapping from a (color, amplitude-level) combination to a 4-bit segment:
Color \ Amplitude: level-1, level-2, level-3, level-4
Red: 0000, 0001, 0010, 0011
Green: 0100, 0101, 0110, 0111
Blue: 1000, 1001, 1010, 1011
Yellow: 1100, 1101, 1110, 1111
Undesired Flicker Mitigation. Although we want cameras to clearly record multiple colors and brightness levels for robust communication, we do not want human eyes to sense flickering in the concurrent lighting function. We avoid undesired flickers in two ways. (1) Fast transmission frequency. RainbowRow adopts a transmission frequency of several to tens of kHz, which is much faster than the response frequency of human eyes (i.e., about 60 Hz). (2) Color/Amplitude Balanced Coding. As presented in Figure 3.11, each transmission unit has 16 combinations of color and amplitude (i.e., R1, R2, R3, R4, G1, G2, G3, G4, B1, B2, B3, B4, Y1, Y2, Y3, Y4) that are mapped to 16 different 4-bit segments (e.g., '0010') with equal appearance probability, preventing any color or amplitude from appearing at a low frequency that would have resulted in unwanted flickers.
Figure 3.12 Color choice of RGBY in spectrum diversity.
Color Choice. The top of Figure 3.12 shows that R+G generates Yellow, G+B generates Cyan, and R+B generates Purple. The bottom-left of Figure 3.12 shows the measured hue values on our testbed. Cyan is too close to blue and green. Purple has the shortest wavelength of these six colors, although it has a wider hue gap than yellow. Thus we chose yellow as the 4th color in addition to red, green, and blue. Furthermore, yellow, red, and green have longer wavelengths than cyan and purple, which makes them suitable for long-distance propagation, the same as traffic lights and headlights.
3.5 Optical Imaging Management
Different from traditional wireless systems such as RF-based approaches, which suffer severe interference at the receiver side, the Line-of-Sight (LoS) propagation of optical signals makes their paths easier to manage. In the camera imaging process, the optical signals from the transmitter are projected onto the millions of pixels of the image sensor following the principle of pinhole imaging. Thus, when the camera's parameters are set properly, the main interference comes from the transmitter side and from ambient noise during propagation. In this section, we address the technical challenges of optical imaging management from 1D to 2D at both the transmitter and receiver sides to guarantee robust final decoding.
3.5.1 At Transmitter Side
(1) Fast and Synchronized Transmission.
LED selection. As shown in Figure 3.13, low-power, single-color LED elements only propagate optical signals over a short distance. High-power Tri-LED strips and Tri-LED panels are suitable for achieving spatial diversity and a long communication range. However, the LED control manner of strips and panels is serial control, which would cause incorrect emission of RainbowRow optical symbols. Finally, we adopt 12V T10-194 car interior LED bulbs. Each bulb has 5 single-color 5050 SMD LED elements.
We combine 1 red, 1 green, and 1 blue bulb in each transmission unit, for a total of 12 LED bulbs for fast and synchronized transmission.
BeagleBone Black. In our proposed RainbowRow, the transmitter should control the color and lightness of the 12 LED bulbs synchronously and achieve a transmission frequency of several kHz to match the rolling shutter frequency of commercial smartphones. We adopt the low-cost BeagleBone Black ($80) for fast and synchronized transmission. When using Pulse Width Modulation (PWM) for amplitude control, the BeagleBone's 12 MHz GPIO speed is insufficient, as is that of Arduino boards with a similar 16 MHz GPIO speed. Besides, all the GPIOs mentioned above are read and written in a serial manner.
PRU. However, the BBB has a Programmable Real-time Unit (PRU), which can raise the LED control speed up to 200 MHz and control the 12 LEDs synchronously via its registers. Thus we can exploit the BBB's PRU to achieve fine-grained amplitude control of each of the 12 LED bulbs, with a [0, 100] step range, simultaneously at several to tens of kHz, suitable for the fast and synchronized transmission in our RainbowRow.
Figure 3.13 Optical imaging management at the transmitter side in the RainbowRow design.
(2) Inner-unit Fusion vs. Inter-unit Interference.
From 1D to 2D. In 1D rolling strips based approaches, we only care about the amplitude or color fusion inside a single transmission unit. To generate the expected amplitude level or a specific color, the transmitter should emit the proper amounts of brightness or of the R, G, B color components during the symbol duration. These components overlap and fuse among the optical signals from one transmission unit to provide the basis of the amplitude and spectrum diversities. However, by increasing the number of transmission units from 1 to multiple (e.g., 4 in RainbowRow), the optical signals from different transmission units overlap as well. In contrast to inner-unit color fusion, this mutual interference among different transmission units generates undesired brightness and colors for each unit, which causes wrong amplitude and color detection at the receiver side (e.g., the camera in RainbowRow). The challenge here is to minimize this mutual interference among different transmission units while enhancing the fusion within each transmission unit.
Inner-unit Light Fusion. Each of our self-made transmission units consists of 3 separate R, G, B LED bulbs. Unlike well-encapsulated tiny Tri-LED elements, which emit the expected colors through good color fusion, separate bulbs may cause incorrect symbol detection (e.g., one transmission unit wants to emit yellow by lighting up its red and green bulbs, but the detected color is red or green). We address this issue by encapsulating the R, G, B bulbs with hot melt adhesive and covering them with a sphere cover, as shown in Figure 3.13.
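As a rough illustration of how one (color, amplitude) symbol could be turned into per-bulb drive values inside a transmission unit, consider the sketch below; the duty-cycle values and the function name are illustrative assumptions and do not reproduce the prototype's PRU firmware or color calibration.

```python
# A minimal sketch of inner-unit fusion: mapping a (color, amplitude-level) pair
# onto PWM duty cycles for the R, G, B bulbs of one transmission unit.
BASE_DUTY = {            # which bulbs participate in each color
    "Red":    (1, 0, 0),
    "Green":  (0, 1, 0),
    "Blue":   (0, 0, 1),
    "Yellow": (1, 1, 0),  # red + green fuse into yellow inside the sphere cover
}

def unit_duties(color: str, level: int):
    """Scale the participating bulbs by the amplitude level (1..4) on a [0, 100] range."""
    scale = 25 * level                     # level-4 -> full duty
    return tuple(scale * on for on in BASE_DUTY[color])

print(unit_duties("Yellow", 3))            # (75, 75, 0)
```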
Vertical Interference and Temporal Avoidance. The vertical optical signals with amplitude and spectrum features vary with time, so we can add a proper delay between switching two optical signals to generate vertical gaps, as shown in Figure 3.13. However, a longer delay time sacrifices more of the transmission bandwidth and lowers throughput. We set the delay time to 0.05 times the symbol duration to guarantee a noticeable vertical gap for detection without significant transmission bandwidth sacrifice.
Horizontal Interference and Spatial Avoidance. (1) Sphere cover. The captured RainbowRow symbols in a frame without any cover show strong horizontal overlapping and aliasing, as shown in Figure 3.13. We should constrain the optical signals from a specific transmission unit to its expected spatial area. Inspired by everyday light bulbs, we use a transparent plastic ball as the light cover for each transmission unit; the outside of the ball is left smooth and unsprayed, while the inside surface is sprayed with a thin, uniform coat of white paint. (2) Physical horizontal gaps. In addition, we place the 4 transmission units horizontally with a proper mutual physical distance to further mitigate horizontal interference.
3.5.2 At Receiver Side
(1) Ambient Light Filtering. RainbowRow filters out ambient light from both natural and artificial light sources in two ways. (1) High shutter rate. To record clear rolling blocks, the rolling shutter rate in RainbowRow is set from several to tens of kHz. The faster shutter rate leads to a decrease in the amount of light coming in. In contrast with the active light from the high-power RainbowRow transmitter, most of the weak ambient light is filtered out and not recorded in the captured image frames. (2) Millions of pixels. Even under very strong ambient light such as direct sunlight, thanks to the millions of pixels in the camera, the ambient light source is projected onto different pixel zones than our RainbowRow rolling blocks, following the pinhole imaging principle, as shown in Figure 3.14 (a).
(2) Optical Signal Enhancement. When optical signals from the RainbowRow transmitter propagate to the camera over an increased communication range, two main problems arise. (1) Decreased number of vertical strips on the sphere cover. While the rolling strip's width is constant because of the fixed shutter rate, the increased communication distance results in a smaller captured sphere size. As a result, fewer rolling strips are shown on the cover of each transmission unit. (2) Optical signal attenuation. The non-trivial attenuation of optical signals caused by a longer propagation distance also results in weaker captured RainbowRow symbols.
Figure 3.14 Optical imaging management at the camera side in the RainbowRow design: (a) ambient light filtering, (b) optical signal enhancement via magnifying lens, (c) camera parameter influence on the quality of captured strips.
By placing an appropriate magnifying lens in front of the camera, we solve these issues. The lens helps the camera capture a larger sphere size for each transmission unit, thereby (1) increasing the number of rolling strips shown on the light cover, and (2) enhancing the strength of the optical signals by presenting more pixels.
(3) Capture clear strips via proper camera parameters. The camera parameter setting is crucial for capturing correct and clear RainbowRow strips (i.e., each RainbowRow strip is made up of four rolling blocks, as illustrated in Figures 3.1 and 3.9), so that they can be decoded.
Rolling shutter rate. The strip width 𝑆𝑤 is related to only two factors: (1) the transmission frequency 𝐹𝑡, and (2) the rolling shutter frequency 𝐹𝑟. When 𝐹𝑟 < 𝐹𝑡, the captured strips are mixed together and overlap into wrong optical symbols, as shown in Figure 3.14 (c). When 𝐹𝑟 ≥ 𝐹𝑡, 𝑆𝑤 decreases as 𝐹𝑟 increases, starting from the maximum strip width at 𝐹𝑟 = 𝐹𝑡. Thus we should set 𝐹𝑟 ≈ 𝐹𝑡.
Other parameters. Two other key camera parameters, (1) ISO and (2) resolution, may also affect the quality of the captured RainbowRow strips. ISO refers to the camera's sensitivity to light. Because the high shutter rate setting already filters out the ambient light, a higher ISO setting will not introduce additional noise points. Thus, to enhance the captured RainbowRow strips, the camera should be set to a high ISO. Resolution is defined as the number of pixels in the captured image frame. A higher resolution may improve the clarity of the recorded strips. Therefore, we ought to set a high enough resolution, such as 1080P instead of 480P.
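Once clear strips are captured, the receiver-side processing described in this section comes down to classifying each rolling block's hue and brightness. The sketch below shows one way this could look with OpenCV; the hue centers, brightness levels, and the assumption that a strip image splits evenly into four blocks are placeholders for illustration, not the calibrated pipeline used in the prototype.

```python
import cv2
import numpy as np

# Illustrative hue centers (OpenCV hue range 0-179) and brightness levels;
# a real decoder would calibrate these per device and lighting condition.
HUE_CENTERS = {"red": 0, "yellow": 30, "green": 60, "blue": 120}
V_LEVELS = [70, 120, 180, 240]            # four assumed amplitude levels (V channel)

def classify_block(hsv_block: np.ndarray):
    """Return a (color, level) guess for one rolling block, given its HSV pixels."""
    h = np.median(hsv_block[:, :, 0])
    v = np.median(hsv_block[:, :, 2])
    color = min(HUE_CENTERS,
                key=lambda c: min(abs(h - HUE_CENTERS[c]), 180 - abs(h - HUE_CENTERS[c])))
    level = int(np.argmin([abs(v - lv) for lv in V_LEVELS])) + 1
    return color, level

def decode_strip(strip_bgr: np.ndarray):
    """Split one strip (one symbol period) into 4 horizontal blocks and classify each."""
    hsv = cv2.cvtColor(strip_bgr, cv2.COLOR_BGR2HSV)
    width = hsv.shape[1] // 4
    return [classify_block(hsv[:, i * width:(i + 1) * width]) for i in range(4)]
```

Using the per-block median hue and brightness is one simple way to tolerate the pixel-level noise that remains after ambient light filtering.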
3.6 Use Case Adaptations
Our proposed RainbowRow protocol has great potential because of its expansibility (i.e., increasing the order of the spatial/amplitude/spectrum diversities) and flexibility (i.e., numerous applications including mobile/static, day/night, indoor/outdoor, and terrestrial/aerial). In this section, we deploy the 4-order RainbowRow design to two real-world use cases, (1) indoor office and (2) vehicular networks, by applying some adaptations for their specific requirements.
Figure 3.15 RainbowRow adaptation for indoor office: rotation angle mismatch avoidance. (a) indoor office use case, (b) rotation angle mismatch, (c) centrosymmetric intra-frame embedding.
3.6.1 Adaptations for Indoor Office
(1) Rotation angle mismatch. As shown in Figure 3.15 (a), the RainbowRow transmitter is mounted on the ceiling and the camera is set on the table to access the Internet (e.g., data downloading for multimedia services as a supplement to WiFi, improving user experience with a higher data rate). In this case, both the transmitter and receiver remain relatively fixed. It is normal for the camera's horizontal axis 𝐴𝑐 not to be parallel to the transmitter bar 𝐴𝑡. However, this will decrease the number of RainbowRow strips that can be correctly decoded. We define the angle between 𝐴𝑐 and 𝐴𝑡 as the rotation angle, because they lie in two parallel planes (i.e., ceiling and table).
(2) Centrosymmetric Intra-Frame Embedding. To address the issue above, we simply adjust the original RainbowRow symbol mapping in each frame into a centrosymmetric symbol mapping, as shown in Figure 3.15 (c). For instance, each frame contains 10 RainbowRow strips 𝑆1–𝑆10. When the transmitter embeds the data of one frame, the half of the data (from 𝐿1 and 𝐿2) in 𝑆1 and the half of the data (from 𝐿3 and 𝐿4) in 𝑆10 belong to the same symbol, and similarly for 𝑆2↔𝑆8 and 𝑆3↔𝑆7. Therefore, even with a rotation angle mismatch, we can reconstruct most RainbowRow symbols in each frame and avoid the data rate drop caused by the decreased number of complete RainbowRow strips. We also set frame borders before the first strip and the last strip.
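One plausible reading of this centrosymmetric mapping is sketched below, assuming 10 strips per frame, the simple pairing 𝑆k ↔ 𝑆(11−k), and that a symbol's first two unit values travel in one strip of the pair and its last two in the other; the prototype's exact pairing (which also reserves frame-border strips) may differ.

```python
# A sketch of centrosymmetric intra-frame embedding and its inverse at the receiver.
N_STRIPS = 10   # strips per frame in this example

def embed(symbols):
    """symbols: 10 four-tuples (unit values for L1..L4) -> per-strip values emitted."""
    strips = [[None] * 4 for _ in range(N_STRIPS)]
    for k, (u1, u2, u3, u4) in enumerate(symbols):
        strips[k][0:2] = [u1, u2]                    # first half of symbol k in strip k
        strips[N_STRIPS - 1 - k][2:4] = [u3, u4]     # second half in the symmetric strip
    return strips

def reconstruct(strips):
    """Receiver side: reconnect symmetric strips back into full symbols."""
    return [tuple(strips[k][0:2] + strips[N_STRIPS - 1 - k][2:4]) for k in range(N_STRIPS)]

symbols = [(f"u{k}a", f"u{k}b", f"u{k}c", f"u{k}d") for k in range(N_STRIPS)]
assert reconstruct(embed(symbols)) == symbols
```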
3.6.2 Adaptations for Vehicular Networks
(1) Varied viewing angle & long distance. The RainbowRow transmitters and receivers can be mounted on cars and traffic infrastructure for both uplink and downlink services. Taking the uplink from car B to car A as an example, as shown in Figure 3.16 (a), the camera is installed on the back of A, while the RainbowRow transmitter is mounted on the front of B. In this case, both the transmitter and receiver are mobile. The camera's horizontal axis 𝐴𝑐 and the LED bar 𝐴𝑡 are coplanar. However, these two lines are not parallel when cars A and B are in different or curved lanes. We define the angle between 𝐴𝑐 and 𝐴𝑡 as the viewing angle. Despite the physical horizontal gaps set among nearby transmission units, different viewing angles will result in different physical horizontal gaps being captured, which makes decoding difficult. Furthermore, these gaps decrease significantly at long distances between the transmitter and the receiver because of the perspective principle, as shown in Figure 3.16 (b).
Figure 3.16 RainbowRow adaptation for vehicular networks: avoiding the impact of viewing angle, long distance, and motion. (a) vehicle to vehicle use case, (b) viewing angle mismatch and long distance, (c) impact of varied speed motion.
(2) Use a telescope instead of a magnifying lens. Instead of a magnifying lens, we switch to a telescope lens to visually shrink the distance from the transmitter to the camera and eliminate the varied and shrunken horizontal gaps caused by varied viewing angles and long distances.
(3) Impact of high-speed motion. Although vehicles are in high-speed motion, the speed of light is 3 × 10^8 m/s, which is overwhelmingly faster than the vehicles' speed. Therefore, the optical signals from the transmitter can be recorded in real time on their RainbowRow strips. The main impact of high-speed motion is the variation in the captured sphere shape and size at different motion speeds, which is sometimes beneficial rather than harmful.
3.7 Implementation and Evaluation
Transmitter. We implement a low-cost RainbowRow prototype, as shown in Figure 3.17 (a). The transmitter consists of a BeagleBone Black MCU, self-implemented fast LED drivers with MOSFET transistors, and a 12V self-made Tri-LED bar; the total cost is under $100. Each transmission unit consists of a red, a green, and a blue LED bulb with a white sphere cover.
Receiver. The receiver is a commercial smartphone (VIVO Y71A or iPhone 7) with an additional commercial magnifying/telescope lens (< $10) and performs decoding via OpenCV. Some commercial smartphones already have several camera modules with magnifying and telescopic lenses, such as the Huawei Mate 30, iPhone 13, and Samsung S22.
Setup. The RainbowRow implementation is shown in Figure 3.17 (a). We conduct experiments on our prototypes in two real use case settings: indoor office (Figure 3.15) and vehicular network (Figure 3.16). We also conduct an ablation study and a diversity robustness evaluation (Figure 3.17 (b)). We set different rotation/viewing angles, distances, day or night conditions, with/without lens, relative motion speeds, and camera parameter settings for a comprehensive evaluation.
Figure 3.17 RainbowRow implementation including Tx & Rx and experiment scenarios. (a) RainbowRow implementation with commercial devices, (b) experiment scenarios for ablation study and other comparisons.
3.7.1 RainbowRow in Indoor Office.
Figure 3.18 RainbowRow performance for the indoor office use case: (a) rotation angle, (b) distance, (c) ambient light, (d) lens setting.
We set the transmission frequency at 10 kHz while adjusting other settings to study their impact on the achieved throughput in an indoor office.
Throughput vs. Rotation Angle. We set the rotation angle (Figure 3.15) with 5 settings: -30°, -15°, 0°, +15°, and +30°. We keep the distance at 1 m during daytime with the same lens setting.
As shown in Figure 3.18 (a), RainbowRow achieves its highest throughput of 146 Kbps at 0°, and the throughput decreases as the absolute value of the rotation angle increases. We also present the data rate with the centrosymmetric adaptation for contrast. The results demonstrate that our centrosymmetric adaptation effectively addresses the rotation angle mismatch problem.
Throughput vs. Distance. We set 4 distances: 0.5 m, 1 m, 1.5 m, and 2 m. We keep the rotation angle at 0° during daytime with the same lens setting. As shown in Figure 3.18 (b), the achieved data rate slightly decreases with increasing distance from the transmitter to the receiver, from 148 Kbps at 0.5 m to 143 Kbps at 2 m, a change of only 5 Kbps.
Throughput vs. Ambient Light. We conduct experiments during the day, at night, and with an artificial light source (added human-made light) to study the influence of ambient light. We set the rotation angle at 0° and keep the distance at 1 m with the same lens setting. As shown in Figure 3.18 (c), there is no significant performance difference among the three settings, and RainbowRow achieves 146.4 Kbps, 146.7 Kbps, and 143.2 Kbps, respectively.
Throughput vs. Lens. We also evaluate the influence of different lens settings. We conduct experiments during the day with the rotation angle at 0° and keep the distance at 1 m. As shown in Figure 3.18 (d), the achieved throughput increases with the use of magnification. These results demonstrate that using the magnifying lens successfully addresses the problem of long distance within 2 m.
3.7.2 RainbowRow in Vehicular Networks.
We set the transmission frequency at 10 kHz while adjusting other settings to study their impact on the achieved throughput in vehicular networks.
Throughput vs. Viewing Angle. We set the viewing angle (illustrated in Figure 3.16) with 5 settings: -60°, -30°, 0°, +30°, and +60°. We keep the distance at 4 m during daytime with the telescope. As shown in Figure 3.19 (a), RainbowRow achieves its highest throughput of 128 Kbps at 0°, and the throughput does not decrease as the absolute value of the viewing angle increases. In contrast to PD-based approaches with tight directional requirements, which can only follow a vehicle in the same lane, RainbowRow supports a broad viewing angle between the transmitter and the receiver.
Throughput vs. Distance. We set the distance with 4 settings: 4 m, 6 m, 8 m, and 10 m. We keep the viewing angle at 0° during daytime with the telescope. As shown in Figure 3.19 (b), the achieved data rate increases with increasing distance from the transmitter to the receiver, from 128 Kbps at 4 m to 133 Kbps at 10 m. The reason is that the telescope adaptation is better suited for longer distances.
Figure 3.19 RainbowRow performance for the vehicular network use case: (a) viewing angle, (b) distance, (c) ambient light, (d) relative motion speed.
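As a rough sanity check on these measurements (a back-of-the-envelope bound, not an analysis from the thesis): with 16 bits per symbol from Eq. (3.1), the raw modulation rate upper-bounds the throughput, and the reported numbers sit reasonably close to it once inter-symbol gaps, frame borders, and decoding losses are accounted for.

```python
def raw_bound_kbps(bits_per_symbol: int, f_t_khz: float) -> float:
    """Throughput upper bound if every emitted symbol were captured and decoded."""
    return bits_per_symbol * f_t_khz          # kbit/s

print(raw_bound_kbps(16, 10))   # 160.0 -> vs. ~146 Kbps (indoor) and ~128 Kbps (vehicular)
print(raw_bound_kbps(16, 12))   # 192.0 -> vs. the ~170 Kbps peak reported in the ablation study
```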
Throughput vs. Ambient Light. We conduct experiments during the day, at night, and with an artificial light source to study the influence of ambient light. We set the viewing angle at 0° and keep the distance at 4 m with the same lens setting. As shown in Figure 3.19 (c), there is no significant performance difference among the three settings. RainbowRow achieves 128 Kbps during the day, 128.3 Kbps at night, and 124 Kbps with the artificial light source.
Throughput vs. Relative Motion Speed. We set 4 camera speeds in the horizontal direction to simulate the motion between vehicles. We keep the distance at 2 m during daytime with the same lens setting. As shown in Figure 3.19 (d), there is no significant performance difference among these settings. However, the captured shape at fast motion speeds becomes larger than the static shape, which can even help decoding due to the increased strip length and strip number, as shown in Figure 3.16 (c).
Summary. These results verify that RainbowRow with specific adaptations is suitable for both the indoor office and the vehicular network, with the following benefits: (1) over 120 Kbps data rate with flexible distances up to 10 m; (2) secure indoor communication and broader-view vehicular communication; (3) no additional energy consumption due to its simultaneous lighting function; (4) robustness to rotation/viewing angles and ambient light; and (5) low cost and easy deployment due to already mounted LED bulbs and cameras.
3.7.3 Ablation Study.
Figure 3.20 Ablation study for RainbowRow under different camera parameter settings: (a) transmission frequency (kHz), (b) ISO setting, (c) resolution setting, (d) phone model study.
Transmission frequency. We set the distance at 0.5 m during daytime and set different transmission frequencies from 4 kHz to 12 kHz. As shown in Figure 3.20 (a), RainbowRow can achieve up to 170 Kbps. The data rate increases with increasing transmission frequency.
ISO: We set the transmission frequency at 10 kHz during daytime with a 0.5 m distance and set different ISO values from 400 to 3200. As shown in Figure 3.20 (b), the data rate increases with increasing ISO value.
Resolution: We set the transmission frequency at 10 kHz during daytime with a 0.5 m distance and set different resolutions in [720P, 1080P, 4K]. As shown in Figure 3.20 (c), the data rate increases with increasing resolution.
Different phones: We set the transmission frequency at 10 kHz during daytime with a 0.5 m distance, set the resolution at 1080P, and use two commercial phones. As shown in Figure 3.20 (d), the achieved data rates are similar under the same parameters.
3.7.4 Comparison with existing work.
With existing LED-OCC. As shown in Figure 3.21 (a), both hue and lightness keep proper gaps for robust decoding under varied transmission frequencies, distances, and ambient light.
Combining these, the proper symbol-distance modulation reduces the SER (symbol error rate) and improves throughput compared with other high-order modulation methods such as 16-ColorBar and 32-ColorBar [38], as shown in Figure 3.21 (b)-(c). The throughput of RainbowRow is higher than that of 4-ColorBar and 4-CASK, and even higher than that of the high-order 32-ColorBar and 8-CASK, at all frequencies. The throughput of RainbowRow is about 10× that of 4-ColorBar and 4-CASK with the same diversity order. When the frequency is 5 kHz, RainbowRow can achieve up to 72 Kbps.
With other approaches. Although the currently achieved data rate of over 120 Kbps within 10 m does not compete with similar-range RF techniques such as Bluetooth at 1 Mbps within 10 m, RainbowRow is more secure due to its LoS propagation in physically separate spaces, with great potential for dense spatial multiplexing and simpler interference control compared to RF techniques. We also build a radar map comparing RainbowRow with other approaches, (1) LiFi (LED-PD), (2) RF-based, and (3) screen-camera (LCD-OCC), in 8 aspects with their performance ranking: (1) data rate, (2) distance, (3) security, (4) energy efficiency, (5) flexibility, (6) low price, (7) broad view, and (8) broad bandwidth, as shown in Figure 3.21 (d). These results show that our RainbowRow generally outperforms the existing approaches in terms of practical data rate, long distance, security, energy efficiency, suitability for numerous use cases, low cost, broad view, and uncrowded spectrum.
Figure 3.21 The comparison of RainbowRow with the existing LED-OCC modulation and other related work: (a) robustness of 4-order spectrum and amplitude, (b) SER comparison, (c) throughput comparison, (d) radar map comparison.
3.8 Discussion and Summary
Some Concerns. (1) Additional lens. Thanks to the rapid development of camera technology in smartphones and mobile devices, a smartphone's built-in telescopic camera may be able to take a high-resolution image of the moon, removing the need for a separate telescope lens. (2) Energy consumption. Similar to RF approaches, our RainbowRow can also adopt a wake-up mechanism to turn the OCC function on/off and avoid always-on camera imaging. The LED bulbs in RainbowRow are energy-efficient and also offer simultaneous illumination, whereas RF provides communication only. (3) Practical Use Cases. Our RainbowRow can be deployed in many scenarios, such as indoor/outdoor lighting, traffic signs, vehicle lamps, lighthouses, and underwater/drone communication, because of the wide deployment and low cost of LEDs and commercial cameras.
Future Directions. (1) MAC and Handover. The dense deployment of LEDs for OCC small cells requires multiple-user access [15, 55, 105].
RainbowRow should allow users to switch between different optical cells for handover. It is essential to design handover mechanisms for seamless communication and smooth mobility, which need to be studied appropriately in the future. (2) Higher-speed potential. Our RainbowRow is a worthwhile first attempt that utilizes low-order spatial diversity (i.e., order 4) and reaches a maximum of 170 Kbps, which is 20× that of existing 1D rolling LED-OCC approaches (i.e., < 8 Kbps). In the future, we could explore higher-order RainbowRow (e.g., order 16 or even 64) to boost LED-OCC's data rate further, using an MCU with more control ports and fast synchronous control ability.
In summary, we propose RainbowRow, the first work to utilize the spatial diversity in 2D rolling blocks to boost LED-OCC's data rate for real-world applications. We model 2D rolling blocks and explore a modulation design combining this spatial diversity with other diversities for an improved data rate. Furthermore, we address technical challenges in optical imaging management at both the transmitter and the camera. Then, we deploy the RainbowRow testbed in two real-world use cases with practical adaptations. Our comprehensive experiments and results demonstrate that our RainbowRow protocol can achieve a throughput of over 120 Kbps at up to 10 m and outperforms existing LED-OCC (< 8 Kbps, < 1 m). We believe that RainbowRow can be the beginning of LED-OCC bridging the performance gap for future high-speed applications.
CHAPTER 4
3D SPATIAL DIVERSITIES ENABLED UNDERWATER NAVIGATION
Underwater optical wireless communication techniques hold great promise, offering broad bandwidth and a long communication range in comparison to existing expensive underwater communication methods like acoustic and RF-based techniques. This makes them particularly suitable for underwater navigation assistance, especially in dive and rescue operations. Adopting passive optical tags for object and human identification, as well as location-based services, proves to be a practical solution in these scenarios.
However, existing optical tags, such as barcodes or QR codes, typically employ one- or two-dimensional designs, which can limit the robust decoding and full-directional localization capabilities required for underwater navigation tasks. To address this limitation, we propose a novel passive 3D optical identification tag-based positioning scheme for underwater navigation. Our unique UOID (Underwater Optical Identification) tag enables users to determine their current orientation by utilizing the arc of clockwise positioning elements. Additionally, the tag employs perspective principles to estimate underwater distances accurately. By incorporating these enhancements, our UOID tag overcomes the limitations of existing passive optical tags, providing a more effective and reliable solution for underwater navigation tasks.
4.1 Motivation
The ocean and other natural and man-made water areas (e.g., lakes, rivers, ponds, pools, and reservoirs) account for more than 71% of the surface area of Earth. Although sea exploration has been undertaken throughout history, much of the underwater world remains a mystery that still needs to be explored by humans [89, 91]. Nowadays, there has been growing research interest in numerous water-based applications such as climate change monitoring, oceanic animal studies, oil rig exploration, lost treasure discovery, unmanned operations, scuba diving, search/rescue, and underwater navigation assistance [134].
Additionally, it is reported by Market Reports that the global scuba diving equipment market was valued at USD 1127 million in 2020 and is projected to reach USD 1503 million by 2027 [87]. Most of these applications require reliable, flexible, and fast underwater communication to provide a safe and comfortable experience. However, despite the rapid development and progress of terrestrial and space communication, high-speed underwater wireless communication (UWC) is still not fully explored [89, 76, 62, 9]. There are significant differences between underwater and terrestrial scenarios, such as a harsh environment and a lack of infrastructure deployment. When signals propagate in water, wireless communication faces challenges such as water turbulence, limited power supply, unusable GPS, and blockage by marine animals. Today's most popular UWC techniques adopt acoustic, radio frequency (RF), and optical waves as wireless mediums. Acoustic signals are generated by high-power sonar (sound navigation and ranging) equipment and offer a long communication range, but at the cost of high communication latency. RF-based UWC techniques have low latency but still face high energy consumption and a minimal communication range due to seawater's severe interference with electromagnetic waves [46, 146, 54, 63, 89, 134].
Underwater navigation poses significant challenges due to the limitations of GPS and the cost associated with other acoustic/RF-based methods [89]. Traditionally, divers have relied on waterproof compasses and pre-dive location information from guides, which is not an intelligent, reliable, or flexible solution [121, 45, 66]. Drawing inspiration from terrestrial navigation, an alternative approach involves using waterproof signage systems to display location information for underwater navigation. However, this method faces difficulties, as finite-sized map images or messages are challenging to locate and read underwater due to the harsh optical conditions. In light of these challenges, there is a need for innovative solutions that can provide reliable and efficient underwater navigation assistance, taking advantage of the unique characteristics of the underwater environment.
An alternative solution to address the challenges of underwater navigation is to utilize passive tags along with a portable tag reader, providing embedded and clear navigation information. Passive optical tags, such as barcodes and QR codes, are already popular in our daily lives [81, 138]. However, their short communication range makes them ineffective for underwater navigation, since users may struggle to locate the tags and scan them underwater. Increasing the tag size could enhance the communication range, but this approach comes with drawbacks, such as higher costs and potential disruption to the original ecological environment. Thus, it is essential to explore more efficient and environmentally friendly ways to improve the communication range of passive tags for underwater navigation without compromising on cost and ecological impact.
Figure 4.1 Existing optical tags and 3D spatial diversity.
When discussing passive tags, we define a high-order tag as containing more than five elements per dimension. For example, the barcode on the left of Figure 4.1 contains 16 columns, i.e., 16 elements in its single dimension. We also define a low-order tag as having five or fewer elements per dimension.
High-order tags, however, are not feasible for underwater navigation because, as the number of elements increases, the error rate also increases, since the elements must be physically closer to each other. On the other hand, the amount of data embedded in a low-order barcode or QR code is not rich enough for underwater navigation.
Motivation: (1) Acoustic and RF-based UWC is not feasible because of drawbacks such as high latency, low communication range, or the need for an external power source. (2) High-order optical tags cannot be reliably used for underwater navigation because of their error rates and short communication range. (3) Existing optical tags only utilize 1D/2D spatial diversity for data embedding [111]. Even the 3D versions of Bar/QR codes shown in Figure 4.1 have limited element distances and ignore 3D spatial diversity. As a result, there will be more error bits in decoding, especially in muddy underwater scenarios. (4) Existing bar/QR codes, even in 3D, have limited scanning angles and require the user to move to directly face the surface of the codes, which is inconvenient for underwater navigation activities. (5) We can use 3D spatial features to provide underwater positioning based on the perspective principle, which states that objects such as cubes are observed differently at different distances and angles.
To address the problems above, we design U-Star, an underwater signage system based on passive 3D optical identification tags for underwater navigation, as illustrated in Figure 4.2. U-Star consists of UOID tags and an AI-based mobile tag reader. UOID tags are hollowed-out cubes which consist of data elements and positioning elements. The data elements are positioned with proper non-Line-of-Sight spacing on the UOID tag. The positioning elements are set in different clockwise color sequences along the six faces of the UOID. The U-Star tag reader is built on waterproof mobile devices with standard, commercial cameras.
4.2 Background and Related Work
4.2.1 Underwater Navigation
Underwater navigation is important for human-related underwater activities, such as scuba diving and underwater accident rescue. Natural underwater navigation requires the diver to utilize the physical contours and characteristics of dive sites and to combine basic compass skills to find the path to the destination [45, 43, 7]. Natural underwater navigation is similar to terrestrial navigation: the diver first needs to know his/her current location based on the site map or the physical features of the dive site, and then guide him/herself to the destination based on the map information or prior knowledge. However, natural underwater navigation relies heavily on the diver's familiarity with dive sites. Unfamiliarity with or confusion about a dive site is very dangerous for divers, to the point that many have lost their lives.
Many researchers have made efforts to improve underwater navigation [88, 46, 65, 47]. However, these are based on acoustic and RF techniques that incur significant drawbacks, including high power consumption, high price, long latency, or short communication range.
Figure 4.2 U-Star underwater navigation illustration.
To combat these issues, we explore placing underwater, on-site visible signage tags to provide site location information and navigation guidance. Our approach is inspired by traditional terrestrial navigation techniques such as tour maps and location marks on hiking trails [71, 70] and offers new and innovative techniques for underwater navigation. However, it is not practical to simply place signage tags underwater in the same fashion as terrestrial navigation. This is because it is not as easy for users to move to directly face the tags as it is on land, the underwater optical environment is hostile, and the long communication distance [47, 44] makes effectively reading the signage impossible.
The optical tags used in underwater navigation need three features: (1) Easily observed. The color and brightness are striking enough to be observed by users at long distances (10 m-20 m), and the content on the tag should be visible from practically every angle. (2) Enough data capacity. The data embedded in the tags needs to be large enough to record both the location information and guidance advice. (3) Positioning ability. The tag needs to provide relative position information to the user. Feature (1) depends mainly on material and color choices suited to the underwater scenario; it also relies on the hollowed-out structure of the tag design for a real 3D passive optical tag. Features (2) and (3) fall into the category of optical wireless communication and are discussed below.
4.2.2 Existing passive 1D/2D optical tags
Barcodes and QR codes are widely used machine-readable optical tags in our daily lives. Barcodes, invented in 1951, represent data using parallel lines with varying widths and spacing [113]. They became commercially successful in supermarket checkout systems. Later, two-dimensional (2D) variants known as matrix codes were developed, capable of representing more data per unit area [111, 100]. One of the popular matrix codes is the QR (Quick Response) code, widely used in various aspects of life, such as mobile payment, social E-cards, electronic tickets, and access control. High-order QR codes, like the version 40 QR code (177×177), can embed 23,648 bits [100]. However, in underwater navigation scenarios, using high-order bar/QR codes is not suitable due to their limited scanning angles, restricted data element distance, and the challenging quality of the underwater optical environment. These limitations make it difficult for users to see and scan the codes effectively underwater. Thus, there is a need for more robust and underwater-friendly optical identification tags that can address the unique challenges of navigation in aquatic environments.
These bar/QR codes only focus on 1D and 2D spatial diversity and ignore the potential of three-dimensional spatial diversity in optical tag data embedding. Even with a 3D version of Bar/QR codes (the six planes of a cube covered with the same bar/QR code to ensure consistent content at various angles), the user can record up to three repeated bar/QR codes, which does not increase data element distances and does not fully take advantage of 3D spatial diversity in data embedding. Our 3D optical tag design is inspired by 3D cube-shaped chandeliers, but improved and modified for the data and communication needs of underwater scenarios. Each element inside a 3D light cube can denote bit 1 or 0 via its On or Off status, as opposed to linear or matrix dots on a surface in a bar/QR code.
Although each image of the 3D optical tag captured by our tag reader is a 2D pixel matrix, we can restore the 3D optical tag based on perspective principles. Compared to optical tags with the same tag size and the same amount of embedded data (e.g., 1D and 2D codes, and surface 3D tags with 1D/2D codes attached), our proposed 3D hollowed-out cube improves the data element distance by leveraging 3D spatial diversity in data embedding. In our U-Star system [141], we design UOID, a passive 3D optical identification tag, to utilize 3D spatial diversity to increase the distances among data elements for robust and full-directional underwater decoding.
4.2.3 Optical Positioning and Perspective
Figure 4.3 Perspective principle for positioning.
It is very common for humans to utilize natural or human-made luminous objects for positioning, as shown in Figure 4.3. For example, we can determine orientation by observing the direction of shadows during the daytime, due to the sun's movement, and by observing the direction of the Big Dipper at night, because its orientation is unchanged and always points toward the Earth's North Pole [10, 25]. In addition to orientation and localization based on natural optical objects, lighthouses are an example of positioning based on human-made optical objects. The basic functions of lighthouses are to guide ships, indicate dangerous areas, and help ships determine their positions [94].
For underwater scenarios, researchers have also made many efforts to design optical underwater positioning mechanisms and systems. Akhoundi et al. design RSS (Received Signal Strength) based optical positioning systems that calculate location based on the optical signal received from multiple anchors [4]. In other work [135], the authors proposed a ToA and RSS-based underwater localization system. However, these works require a significant power supply and expensive devices with high-accuracy sensors.
Perspective principles are traditionally used in vision and art [74, 96]. Creatively, we can utilize the perspective principles for ranging and relative positioning. The perspective principle simply describes the visual relationship between the observer and the observed object: (1) increasing the distance between the observer and the object results in a reduced size of the observed object, as shown in Figure 4.3 (d); (2) varying the angle from the viewpoint to the object results in a varying shape and observed content of the object, as shown in Figure 4.3 (e). Our U-Star design also uses UOID tags as fixed underwater beacons, leveraging 3D spatial diversity for optical ranging and orientation guidance in addition to data embedding.
Compared with existing work, our UOID tags are based on passive optical wireless communication and therefore utilize natural light sources to present data and provide relative positioning without energy consumption concerns. The tag readers are also commercial camera-based devices instead of expensive sensors.
4.3 Our Approach: U-Star
Our proposed underwater navigation system consists of two parts, as shown in Figure 4.4: (1) 3D passive optical tags (UOID tags), and (2) an AI-based mobile tag reader.
UOID tags. UOID tags are anchored underwater with fixed facing directions.
They are made of fluorescent materials and can absorb light from the natural underwater environment or a user's flashlight. There are data elements and positioning elements in UOID tags, which are assigned with proper spacing to eliminate LoS blockage in the tag's 3D spatial domain when presenting data.
Figure 4.4 U-Star system diagram including the UOID tag, user operation, and tag reader.
Tag reader. The tag readers are based on commercial smart devices such as smartphones or sports cameras. These devices can capture images of UOID tags and perform robust, real-time underwater data parsing and relative positioning using their onboard computation abilities. The U-Star tag readers have three key modules: (1) CycleGAN-denoising-based pre-processing, (2) CNN-based relative positioning, and (3) 3D-restoring-based decoding.
User operation and navigation procedure. The detailed U-Star underwater navigation procedure is: (1) The diver, equipped with a tag reader, looks for luminous UOID tags. (2) The diver uses the waterproof tag reader to take pictures of a specific UOID tag at the current location. (3) The tag reader performs image style transformation for denoising, and then determines the diver's relative position, including distance estimation and orientation guidance. (4) The diver knows where he/she is and can navigate to new sites based on the pre-recorded data in the backup database, which the tag reader can query with the data embedded in the UOID tag (which we call a query code). The user operation is simple and quick and can be performed at different distances, from all directions, in different environments, and at any time.
4.3.1 Challenges and Solutions
(1) LoS blockage. When capturing tag images, some inside elements are blocked by the elements in front of them due to light's line-of-sight propagation. We address this by assigning elements with proper spacing and a machine learning based restoration. (2) Harsh optical environment. The underwater environment decreases the quality of captured UOID images and thus makes them hard to decode. We design CycleGAN-based algorithms to transform unclear images into clear (Unity3D-style) images before decoding. (3) Underwater relative positioning. The UOID tag is expected to help determine the distance between the user and the tag as well as the user's current orientation for relative positioning. We propose clockwise positioning arc schemes to denote planes and a CNN method to infer relative position. (4) 3D decoding. The tag reader needs to restore each element to a standard 3D space from a random 2D image during decoding. We utilize the perspective principle to reconstruct the 3D structure for data parsing.
4.3.2 Advances Compared with Prior Art
(1) Same tag order with more embedded bits. Although the user can capture one and up to three surface planes of a 3D version of an existing N-order Bar/QR code, the decoded bits are the same as the bits in one plane.
The embedded bits in an N-order barcode are roughly 𝑁. The embedded bits in an N×N QR code are roughly 𝑁² − 4. The embedded bits in an N×N×N UOID tag are 𝑁³ − 6. The number of embedded bits in a UOID tag therefore grows cubically with the order, compared to linear and quadratic growth for same-order 1D/2D optical tags. Even their 3D versions cannot compare to UOID tags (e.g., a 3-order UOID embeds 7× and 4.2× the bits of the same-order Bar and QR code, respectively).
(2) Same tag size & data with larger element distance. The larger the average element distance and the broader the distribution of element distances, the better the detection performance and the fewer the error bits. We measure the distances between all 21 data elements in 3D versions of the Bar/QR codes and in the UOID tag, all with the same embedded bits and tag size (edge length 19 cm). The data element distances in the Bar and QR codes are all smaller than 20 cm, whereas the data element distances in the UOID tag are distributed over a wider range of [5, 30] cm.
Our contributions can be summarized as follows: (1) This is the first work to employ passive 3D optical identification tags for underwater navigation. We model 3D spatial diversity and utilize it to increase the distance between data elements in our proposed UOID tags for simple and robust underwater navigation. (2) We propose a passive 3D optical identification tag based positioning scheme for underwater navigation. Our UOID tag can help users determine their current orientation from the arc of clockwise positioning elements and estimate the underwater distance using perspective principles. (3) We propose AI-based mobile algorithms at the tag reader for robust UOID decoding. We design CycleGAN-based underwater denoising, CNN-based relative positioning, and real-time data parsing algorithms without significant computation overhead, latency, or energy concerns. (4) We implement U-Star and evaluate its performance on UOID tag prototypes in different underwater scenarios. Our experiment results show that a 3-order UOID tag can embed 21 bits of data with a BER of 0.003 at 1 m and less than 0.05 at a distance of up to 3 m. We also make a fair comparison with existing optical tags (Bar, QR) to show the superiority of our UOID tags in underwater navigation. U-Star also achieves over 90% accuracy for both optical ranging at up to 7 m and orientation guidance.
4.4 Passive 3D Optical Tag
4.4.1 3D Spatial Diversity Exploration
Figure 4.5 Surface/real 3D.
Figure 4.6 Proper spacing to combat LoS.
As shown in Figure 4.5, we use a 3D cube instead of a 2D matrix to represent more bits in an optical tag. Naturally, there are two methods to embed data in a 3D cube: (1) embed data on its six surfaces, or (2) embed data on both its surfaces and its inside space (i.e., hollowed-out), which fully utilizes the 3D spatial diversity. For method (1), the tag reader can only capture the dots on 1 and up to 3 surfaces due to the line-of-sight (LoS) characteristic of light. Method (1) also cannot guarantee that the embedded data captured at different angles is always the same (unless all 6 planes cover the same content), because different surfaces may be captured, which means that the tag's decoded data would change without consistency. Additionally, method (1) results in smaller data element distances and a shorter communication range. Thus we choose method (2) to embed data in our UOID design.
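To make the capacity comparison from Section 4.3.2 concrete, the small script below evaluates the nominal bit counts stated there for same-order tags; the function itself is purely illustrative.

```python
def capacity_bits(order: int, tag_type: str) -> int:
    """Nominal embedded bits for an N-order tag, per the counts in Section 4.3.2."""
    if tag_type == "barcode":
        return order                 # ~N bits for an N-order barcode
    if tag_type == "qr":
        return order ** 2 - 4        # ~N^2 - 4 bits for an NxN QR code
    if tag_type == "uoid":
        return order ** 3 - 6        # N^3 - 6 bits for an NxNxN UOID tag
    raise ValueError(tag_type)

for t in ("barcode", "qr", "uoid"):
    print(t, capacity_bits(3, t))    # 3, 5, 21 -> the 7x and 4.2x ratios quoted above
```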
Even with method (2), however, the LoS issue can still occur when we embed data inside a 3D cube, due to mutual blockage among elements that are physically near each other. As shown in Figure 4.6, a 4-order (4×4×4) tag without proper spacing suffers from this mutual blockage. Three factors affect the blockage: (1) Tag order. As the order of the tag increases (3-order, 4-order, 5-order), more and more blockage occurs; similarly, as the order decreases, so does the blockage. (2) Element size. The smaller the element size, the less the blockage. (3) Mutual spacing. The larger the mutual spacing of elements, the less the blockage. We discuss a 3-order UOID tag with fixed element size, and we address the mutual blockage by extending the spacing among nearby nodes to guarantee that the tag reader can capture all elements in most cases.

4.4.2 UOID Tag Design

Positioning and data elements. In our UOID tag design, there are two types of elements: positioning elements and data elements, as shown in Figure 4.7. The positioning elements are on six vertex points with three pairs of colors. The positioning elements help determine the relative position of the user and assist in reconstructing the 3D cube for data parsing. The data elements make up most of the elements in a UOID tag for data embedding. They are located at the two remaining vertex points as well as inside the tag itself.

Figure 4.7 Two element types in UOID: positioning elements and data elements (layers L1-L3 embed 111010, 101111101, and 011000).

Positioning elements. As shown in Figure 4.7, each pair of colored elements occupies a pair of vertex points. Thus, each plane of the cube has three differently colored positioning elements. They can denote the six surfaces based on the generated clockwise arc color sequence (Figure 4.10 (a)). The tag reader can then determine which surface the user is facing based on the captured surfaces of the tag and determine the orientation based on the perspective principle to support underwater navigation. Furthermore, these positioning elements help to reconstruct the 3D structure from captured 2D images based on the perspective principle for data parsing. The reasons for using three instead of four positioning elements to denote a plane are: (1) Three dots can already determine a surface; four dots would sacrifice positions that could otherwise hold data elements and thus decrease the embedded data. (2) Fewer overall colors is desirable, as more colors would increase the color detection error during decoding due to smaller hue gaps.

Data elements. The data elements of our UOID are assigned to various 3D spatial locations. There are three layers L1, L2, and L3. For each layer, we assign data elements in an 'S' shape. If a data element is colored green, the embedded bit is 1; if the data element is not colored, the embedded bit is 0. As illustrated in Figure 4.7, L1 embeds the bits '111010', L2 embeds '101111101', and L3 embeds '011000'. This 3-order UOID tag embeds a total of 3³ − 6 = 21 bits, '111010 101111101 011000'. We set the current angle of view to be the standard coordinate system for data parsing. With the assistance of the positioning elements (Figure 4.16), we can map tag images from any angle of view into the standard coordinate system and then conduct the mass data parsing.

4.4.3 Underwater-specific Tag Design

Color choices. Light with different wavelengths/colors has different absorption rates in water.
As shown in Figure 4.8 (a), green and blue light have less absorption in deeper underwater environments such as a depth of 20 m [140, 89]. However, considering that most commercial underwater activities do not exceed depths of 10 m, the color choices (red, yellow, green, and blue) in the UOID tag are reasonable (for deeper underwater navigation, finer-grained blues and greens can be chosen). As shown in Figure 4.8 (b), these four colors also have sufficient hue-value gaps to decrease wrong color detections during decoding [148]. Green light has the longest emission time after being shined by a flashlight for 5 s, as shown in Figure 4.8 (c). Because data elements are the most numerous and important elements, we set them to green.

Luminous powder. Our UOID tags are passive, without any power supply. As illustrated in Figure 4.8 (d), we coat the elements in luminous powder, which is cheap and nontoxic to marine animals. As shown in Figure 4.8 (c), the luminous powder in our chosen colors keeps emitting light for more than 60 seconds (1 min) after being shined by a flashlight for 5 seconds in our experiments. This ensures that the UOID tags work by absorbing natural underwater light and emitting light in specific colors, allowing us to see and scan UOID tags at any time of day or night.

Figure 4.8 Color choices and luminous powder: (a) underwater color loss, (b) proper hue gaps, (c) emission experiment, (d) the luminous powder of R/Y/G/B in day and night.

4.5 Underwater Positioning

4.5.1 Optical Ranging

For underwater navigation, the perception and estimation of distance is very important. Our UOID tags can give the user a rough sense of the distance between themselves and the tag. We use the approximate size of the captured tag to infer the current distance from the user to the tag. The estimated relative distance is independent of the angle at which the user captures the images. As shown in Figure 4.9, we can estimate the distance based on the captured tag size because the tag size increases as the user gets closer to the tag, due to the spatial perspective principle. We first collect captured images (with the camera set to a fixed focal length) at different distances and use this dataset to train a CNN model for classification offline. We can then use the trained CNN model to predict and estimate the current distance from the user to the tag in real time.

Figure 4.9 3D spatial perspective based optical ranging: with a fixed focal length, the shorter the distance d, the bigger the captured tag size.

4.5.2 Orientation Guidance

We map the six planes of the UOID tag onto six different clockwise color arcs which start from the non-positioning element: Yellow(Y)-Blue(B)-Red(R) maps to Plane 1, BRY to Plane 2, RBY to Plane 3, YRB to Plane 4, RYB to Plane Top, and BYR to Plane Bottom, as shown in Figure 4.10. The UOID tag is fixed underwater (i.e., a specific plane of the UOID tag always faces a specific direction), and thus the user/tag reader can determine his/her orientation based on the plane of the UOID the user is currently facing. For example, as shown in Figure 4.11, Plane 1 faces South. That means that if the user is facing Plane 1, the user knows his/her current orientation is directed North.
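As a hedged sketch of the plane and orientation decision just described (the dictionary look-up below is illustrative only; U-Star actually predicts the facing plane with a CNN, as explained next), the clockwise color arcs map to planes, and the user's heading is opposite to the direction the facing plane points. The heading assignments follow the Plane 1 = South example above and are consistent with the eight orientations defined in the next subsection.

```python
# Clockwise positioning-element color arcs (read starting from the
# non-positioning vertex) mapped to the six cube planes (Figure 4.10).
ARC_TO_PLANE = {
    "YBR": "Plane 1", "BRY": "Plane 2", "RBY": "Plane 3",
    "YRB": "Plane 4", "RYB": "Plane Top", "BYR": "Plane Bottom",
}

# Direction each side plane faces once the tag is fixed underwater.
PLANE_FACES = {"Plane 1": "South", "Plane 2": "East",
               "Plane 3": "North", "Plane 4": "West"}

OPPOSITE = {"North": "South", "South": "North", "East": "West", "West": "East"}

def user_heading(detected_arc: str) -> str:
    """The user looks at the plane, so the user's heading is the opposite
    of the direction that plane faces."""
    plane = ARC_TO_PLANE[detected_arc]
    return OPPOSITE[PLANE_FACES[plane]]

print(user_heading("YBR"))   # facing Plane 1 (which faces South) -> heading North
```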
For underwater navigation, the Plane Top and Plane Bottom faces do not contribute to orientation decisions. Additionally, North, East, South, and West alone are not sufficiently descriptive for navigation. Therefore, we define 8 user-facing orientations: North (facing Plane 1), Northwest (facing Planes 1 & 2), West (facing Plane 2), Southwest (facing Planes 2 & 3), South (facing Plane 3), Southeast (facing Planes 3 & 4), East (facing Plane 4), and Northeast (facing Planes 4 & 1), as shown in Figure 4.11. Naturally, we can determine the plane the user is facing based on the color arc detected in the images.

Figure 4.10 Positioning elements for the plane decision.

However, due to the small size of the elements in captured images, it is hard to judge which plane the user is facing. Thus, we employ CNN models to learn plane features offline and then predict the plane in the captured image in real time, similar to the AI method used in the optical ranging procedure.

4.6 AI-based Mobile Tag Reader

4.6.1 CycleGAN based Denoising

CycleGAN is a popular deep learning method mostly used for image style transfer, which can convert images between Style X and Style Y, for example, to generate a Monet-style image from a real-world picture or vice versa [150].

Figure 4.11 The orientation guidance principle illustration.

We adopt a lightweight CycleGAN to convert the real underwater images taken of the real, physical UOIDs created for U-Star (Style X) into clear Unity3D-style images created in the Unity3D game engine (Style Y) for further processing. The images in real underwater scenarios have random and varying backgrounds (i.e., with noise) behind the UOID tags. The images in the Unity3D version have clear and pure backgrounds (i.e., there is no background noise in these images). Thus, we can utilize CycleGAN to convert real-world images with noise (Style X) into Unity3D-version images without noise (Style Y) to perform underwater denoising, as shown in Figure 4.12. In our CycleGAN-based denoising, instead of the typical unpaired datasets, we create partial-paired datasets, the Real UOID tags (60 images) and the Virtual UOID tags (60 images), for each underwater environment setting in the CycleGAN training procedure, as shown in Figure 4.12. Partial-paired means that the positioning elements are paired between the real UOID tag images and the Unity3D-version images of the training datasets, while the inside data elements are not paired. Partial-paired CycleGAN denoising guarantees mostly correct conversion of the tag structure, the data elements, and the colors of the positioning elements. To train the CycleGAN efficiently, we use three different types of losses, combined as sketched below.

Figure 4.12 CycleGAN based denoising from real underwater tag images to the Unity3D version tag images (partial-paired training dataset: only positioning elements are paired).

Figure 4.13 CycleGAN based denoising: training loss curves of the generator and discriminator (real to Unity3D).
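A minimal PyTorch-style sketch of how the three losses are combined (the function and tensor names are ours, not from the dissertation); it follows Eq. (4.1) and the loss types and weights stated in the next paragraph: L1 for the identity and GAN terms, MSE for the cycle term, with λ1 = 10 and λ2 = 5.

```python
import torch
import torch.nn as nn

l1_loss, mse_loss = nn.L1Loss(), nn.MSELoss()
LAMBDA_GAN, LAMBDA_CYCLE = 10.0, 5.0   # empirically chosen weights

def cyclegan_total_loss(identity_pred, identity_target,
                        disc_pred_on_fake,
                        cycled_image, original_image):
    """L_CycleGAN = L_id + lambda1 * L_GAN + lambda2 * L_cycle  (Eq. 4.1)."""
    loss_id = l1_loss(identity_pred, identity_target)
    # The generator is rewarded when the discriminator scores its fakes as real (ones).
    loss_gan = l1_loss(disc_pred_on_fake, torch.ones_like(disc_pred_on_fake))
    loss_cycle = mse_loss(cycled_image, original_image)
    return loss_id + LAMBDA_GAN * loss_gan + LAMBDA_CYCLE * loss_cycle
```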
Figure 4.14 CycleGAN based denoising: result samples (Real-Pool and Real-Lake images converted to Fake-Unity3D).

More specifically, we apply an identity loss (L_id) for the generator network, a GAN loss (L_GAN) for the discriminator, and a cycle loss (L_cycle) for the cycle step:

$\mathcal{L}_{CycleGAN} = \mathcal{L}_{id} + \lambda_1 \mathcal{L}_{GAN} + \lambda_2 \mathcal{L}_{cycle}$   (4.1)

Both the identity loss and the GAN loss use an L1 loss, while the cycle loss uses an MSE loss. We sum the three losses with different pre-assigned weights (λ1 and λ2) to help the model converge. The values of λ1 and λ2 are selected empirically; in our case, we use 10 for λ1 and 5 for λ2. Integrating the three losses, we feed the pairwise training images into the CycleGAN and train the generators and discriminators. The loss curves during training of the generator and discriminator (from real images to Unity3D-style images) are shown in Figure 4.13. The trend of the loss curves shows that the conversion from real underwater UOID tag images to Unity3D-style UOID tag images converges successfully. Examples of originally captured underwater images and the denoised images are shown in Figure 4.14. We can see that underwater images from both a pool and a lake can be successfully denoised and converted to Unity3D-style images with a mostly correct tag structure, colors, and element positions. The CycleGAN denoising also removes the physical UOID frame components, reducing LoS blockage. Although a few elements have unmatched colors, we can easily correct them based on the original image. The next steps of relative position determination and data parsing can then be based on these converted Unity3D-style UOID tag images to lessen the influence of the harsh underwater optical environment.

Figure 4.15 CNN based relative positioning for optical ranging and orientation guidance, and the adopted ResNet-18 network architecture.

4.6.2 CNN based Relative Positioning

We adopt CNN-based deep learning methods to determine the relative position instead of non-deep, traditional computer vision methods, to simplify the task and decrease the computation overhead. It is difficult to calculate the relative distance directly with different underwater backgrounds, as this requires several steps: (1) locate the tag in the image using AI or CV methods, (2) calculate the tag size, and (3) use the distance-estimation relation to calculate the estimated distance. In comparison, we choose a CNN model because it does not necessitate detecting the tag in the image or calculating the tag size.
Instead, we directly output the predicted distance in different underwater environments using the trained CNN model and the captured images of UOID tags. We create two datasets for offline CNN training: (1) an optical ranging dataset (280 Unity3D-version images and 280 real underwater images) and (2) an orientation dataset (320 Unity3D-version images and 320 real underwater images). The reason for using both real underwater tag images and Unity3D-version tag images in training is to increase the generality of the prediction model. We then use CycleGAN-denoised tag images for real-time relative position determination. As shown in Figure 4.15, our CNN models, ORM (optical ranging model) and OM (orientation model), adopt the ResNet-18 architecture. ResNet-18 is a neural network architecture that adds skip connections between non-adjacent layers, so that a deep layer takes input not only from its immediately preceding layer but also from earlier layers that may retain the original information. This design effectively copes with the vanishing gradient problem in DNNs [35] and increases the network depth with few additional parameters. ResNet has demonstrated superior performance on image classification tasks [17, 18, 1], which makes it particularly suitable for our goal of distinguishing relative positions for both optical ranging and orientation guidance. We follow the ResNet-18 design due to its efficiency and high accuracy on image classification tasks. Specifically, we retain all of the convolutional and pooling layers and modify the output features of the last fully connected layer to match the number of possible options (i.e., 7 for ORM and 8 for OM).

4.6.3 Data Parsing via Perspective Principle

The data elements in captured images appear differently when the user is at different relative positions to the UOID tag. To decode the embedded data in the tag, the tag reader needs to know the 3D locations of the data elements in a standard coordinate system and then perform decoding.

Restore 3D structure. Based on the three pairs of positioning elements, the tag reader can restore the 3D structure of the UOID tag from a captured 2D image in six steps, shown in Figure 4.16 (a): (1) obtain the Unity3D-style UOID image after CycleGAN-based denoising, (2) filter out the three pairs of positioning elements via computer vision tools, (3) decide which positioning element of each pair is in front or in the rear based on element size, (4) find one of the two remaining vertices, (5) find the other remaining vertex, and (6) decide which of the remaining vertices is in front or in the rear based on the element size of nearby positioning elements. Finally, we can reconstruct the 3D structure based on the eight vertices of the 3D cube. For step (4), there are two sub-steps: (4-1) Extend lines Y1R2 and R1Y2 to find the intersection point IP1 (not shown in the figure). Then connect B2 with IP1, which is the cross line of planes Y2R1B2 and Y1R2B2. (4-2) Extend lines Y1B2 and B1Y2 to find the intersection point IP2. Then connect R2 with IP2, which is the cross line of planes B2Y1R2 and B1Y2R2. We can then find the first vertex, which is the intersection point of B2IP1 and R2IP2. The sub-steps for step (5) are similar to those of step (4).

Data element location restoration. As shown in Figure 4.16 (b) and (c), we can restore the locations of the data elements by matching the filtered data elements against the locations calculated from the positioning elements.
If a specific filtered data element is near or at a specific location calculated from the restored 3D structure, it signifies a match. We then mark that location as carrying a data element with bit 1, while the other, vacant calculated data element locations are decoded as bit 0. The tag reader then decodes the embedded data and generates the bitstream based on the data assignment rule illustrated in Figure 4.7.

Figure 4.16 The illustration of 3D structure restoration and data parsing based on the perspective principle: (a) the six steps to restore the 3D structure from a UOID's 2D image via the positioning elements, (b) filtering out data elements, (c) relocating data elements based on the reconstructed 3D structure.

4.7 Implementation and Evaluation

4.7.1 UOID Tags

We implement two versions of UOID tags. One is a virtual N×N×N UOID tag created in the Unity3D cross-platform game engine to simulate UOID tags of various orders and with different permutations of embedded data within tags of the same order. We also implement multiple physical 3×3×3 UOID tags for use underwater.

Virtual UOID tag. The elements in our virtual UOID tags are translucent with fluorescent effects and are assigned with the proper spacing, as shown in Figure 4.10 and Figure 4.16.

Real UOID tag. As shown in Figure 4.17 (a), the UOID tags can be observed well during both day and night because they absorb natural light and emit light. For the elements of our physical UOID tags we employ soft plastic balls (φ = 2 cm) glazed with fluorescent powder and attach them to three types of cube structure frames for exploration (sticks, black plastic, and transparent plastic). We finally choose the black-plastic-frame-based UOID tags (edge: 19 cm, weight: 14 g) for the evaluation.

Figure 4.17 U-Star system implementation, setup, and experiment scenarios in day and night: (a) UOID tags, (b) tag readers, (c) four different underwater environments and captured UOID image samples, (e) experiment scenarios in a pool and a lake during day and night.

4.7.2 Tag Reader

Many commercial smart devices can be adopted for use in our U-Star system, including underwater sports cameras and smartphones with transparent, waterproof cases, as shown in Figure 4.17 (b). These commercial camera devices are popular and cheap. In our experiments, we use a Campark sport camera, which costs less than $50, and set it to a fixed focal length.

4.7.3 Setup

Different underwater environments. Figure 4.17 (c) shows four underwater environments (indoor big tank, outdoor small pond, swimming pool, and big lake) and captured images of UOID tags.

Tag fixation and flashlight. We fix the UOID tags at the bottom of a body of water, i.e., a specific UOID plane always faces a specific direction. We use iron anchors and a connection pole to sink and fix the UOID tag underwater. During the night, the user can use a flashlight for underwater lighting to activate the UOID tags.
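Before turning to the evaluation, the two relative-positioning models from Section 4.6.2 can be sketched in a few lines. This is a minimal torchvision sketch under the stated design (ResNet-18 backbone with the last fully connected layer resized to 7 distance classes for ORM and 8 direction classes for OM), not the exact training code used in U-Star.

```python
import torch.nn as nn
from torchvision.models import resnet18

def build_relative_positioning_model(num_classes: int) -> nn.Module:
    """ResNet-18 with the final fully connected layer resized to the task's
    class count, keeping all convolutional and pooling layers unchanged."""
    model = resnet18(weights=None)   # torchvision >= 0.13; train from scratch
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

orm = build_relative_positioning_model(7)   # optical ranging: 1 m ... 7 m
om = build_relative_positioning_model(8)    # orientation: N, NW, W, SW, S, SE, E, NE
```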
We evaluate three performance aspects of our U-Star system: (1) relative positioning, (2) data parsing, and (3) comparison with existing optical tags. In addition, we conduct an underwater navigation case study in a 4 m × 10 m indoor pool with 4 UOID tags. Finally, we evaluate other aspects such as cost/price, computation overhead, and latency.

4.7.4 Accurate Relative Positioning

We evaluate the relative positioning performance in three aspects: optical ranging accuracy, orientation guidance accuracy (both at the 100th epoch), and their training loss over [5, 200] epochs.

Optical ranging. We have 7 different distance settings: 1 m, 2 m, 3 m, 4 m, 5 m, 6 m, and 7 m. As shown in Figure 4.18 (a), due to the considerable difference in tag size, the ranging accuracy at the 1 m and 7 m distance settings is 100% both with and without CycleGAN denoising. After CycleGAN denoising, the ranging accuracy improves significantly and reaches nearly 100% for the other distance settings as well. The results show that the trained CNN model for optical ranging estimates the distance from the user to the tag well with CycleGAN denoising. Our current U-Star prototypes can thus provide up to 7 meters of optical ranging with an average accuracy of nearly 100%.

Orientation guidance. We provide eight recognized orientations for underwater navigation: North (N), Northwest (NW), West (W), Southwest (SW), South (S), Southeast (SE), East (E), and Northeast (NE). As shown in Figure 4.18 (b), no matter which of the eight recognized orientations the user is facing, the accuracy of our orientation classification is always 100% when performing orientation guidance with CycleGAN-based denoising. We also present the orientation guidance performance without CycleGAN-based denoising for comparison. The results show that the performance with CycleGAN denoising is better than without it, which shows that CycleGAN-based denoising helps the CNN model improve the orientation guidance performance by decreasing the impact of harsh water conditions. The results show that our U-Star system can provide accurate orientation guidance across all eight orientations.

Figure 4.18 Relative positioning performance in terms of (a) optical ranging, (b) orientation guidance, and (c) training loss.

Training loss. For relative positioning, we also measure the loss during CNN training for optical ranging and orientation guidance separately. As shown in Figure 4.18 (c), the optical ranging training loss curves, both with and without denoising, lie above the orientation training loss curves throughout the training process. This means that the features (tag size) in the optical ranging dataset are not as rich as the features (positioning elements and their various permutations) in the orientation dataset. The curves with CycleGAN denoising lie beneath those without CycleGAN denoising during the entire training process, for both the optical ranging training and the orientation training.
This means that CycleGAN denoising helps decrease the training loss more quickly and limits the impact of harsh underwater optical conditions on relative positioning.

4.7.5 Robust Data Parsing

We use our tag reader to capture images of four real UOID tags A1, A2, B1, and B2 with random capturing poses at different distances, in different water conditions, and at different times of day to evaluate the decoding performance of U-Star. A1 and B1 embed raw bits without error correction codes. A2 has 3, 5, and 3 data bits in common with A1 in layers 1, 2, and 3, respectively. A2 also has 3, 4, and 3 Hamming ECC parity bits in layers 1, 2, and 3. B2 has 3, 5, and 3 data bits in common with B1 in layers 1, 2, and 3, respectively, and also has 3, 4, and 3 Hamming ECC parity bits in layers 1, 2, and 3. Hamming ECC [33] can correct one error bit per bitstream, for a total correction capability of three error bits per tag. The bits in A1, A2, B1, and B2 are shown in Table 4.1. We define the BER as the average bit error ratio over all embedded valid data bits in two UOID tags with different data embeddings (i.e., either the two tags A1 and B1 or the two tags A2 and B2). Each BER value is calculated from 30 captured images, and we use it as the metric to evaluate the decoding performance of our U-Star system.

Table 4.1 Embedded bits in the 4 UOID tags A1, A2, B1, and B2 (bits in the 1st / 2nd / 3rd layer): A1: 101101 / 110010001 / 001011; A2: 101101 / 111110011 / 010101; B1: 001101 / 100111010 / 101001; B2: 010101 / 101100111 / 101101. The original table additionally marks which bits are data bits common to A1 & A2 (or B1 & B2), data bits without ECC, Hamming ECC parity bits, and valid data bits.

In addition to comparing UOID tags with and without Hamming ECC codes [110], we also compare the BER performance with and without CycleGAN-based denoising.

In different communication distances. We adjust the distance between the tag and the tag reader to 1 m, 1.5 m, 2 m, 2.5 m, and 3 m in clean water (pool) during the daytime. As shown in Figure 4.19 (a), the BER remains low, consistently less than 0.09 after CycleGAN denoising at all distance settings. We find that the best data parsing distance for the current U-Star prototypes is 1 m, where the BER is 0. The BER performance without CycleGAN denoising is significantly worse than with CycleGAN denoising at 3 m, which confirms that CycleGAN denoising works well, especially at longer distances. Both with and without CycleGAN denoising, the BER with ECC is lower than without ECC. The BER is 0.003 at 1 m and remains below 0.05 up to 3 m with Hamming ECC and CycleGAN denoising applied simultaneously.

In different water conditions. We explore four water conditions during the day in our experiments: an indoor tank with clean water, a small pond, a swimming pool, and a big lake, as shown in Figure 4.17 (c). We conduct these experiments at a distance of 1 m (the best capturing distance for data parsing with the current U-Star prototype, as mentioned above). As shown in Figure 4.19 (b), without CycleGAN denoising, our data parsing performs best in the pool and worst in the pond. This is because the pool is clean enough for data parsing without the denoising process, whereas the small pond changes the colors of the elements too much. After CycleGAN-based denoising, the BER decreases significantly in all four water conditions. The Hamming ECC codes decrease the BER even further, resulting in a BER lower than 0.07 for all four water conditions.
Notably, the tank, pool, and lake scenarios show a BER approaching 0. The average BER decreases from 0.16 to 0.03 after CycleGAN denoising and Hamming error correction. In summary, the BER in all four water conditions is low enough with CycleGAN-based denoising and Hamming error correction for robust data parsing.

Figure 4.19 Decoding performance with (a) different communication distances, (b) different water conditions, and (c) different times of day, each with and without denoising and with and without ECC.

At different times of the day. We conduct experiments during both day and night at a distance of 1 m in the swimming pool and the lake. As shown in Figure 4.19 (c), the BER in the daytime is lower than at night for both the pool and the lake. Even with a flashlight shining to activate the UOID tag, the current UOID tag only has luminous powder covering the element surfaces, which is not as bright as in the daytime. Moreover, at night the BER without denoising in the lake is worse than in the clean pool, because the light emitted from the UOID tag is too weak to pass through the muddier water in the lake. After CycleGAN-based denoising and Hamming error correction, the BER in all four settings decreases significantly and is lower than 0.03. The results show that the current U-Star system performs data parsing well with CycleGAN-based denoising and Hamming error correction both day and night.

4.7.6 Comparison with Existing Optical Tags

We implement 3D versions of the existing Bar/QR codes with the same 21 embedded data bits (101101 110010001 001011) and the same tag size (cube edge: 19 cm) as our UOID tag for a fair comparison across various aspects. The data alignment, the implemented tags, and the comparison experiment scenarios are shown in Figure 4.20 (a). We conduct experiments and make comparisons in the five aspects below to demonstrate the superiority and necessity of our UOID tag design over existing optical tags for underwater navigation.

(1) Same tag order with more embedded bits. Although the user can capture one and up to three surface planes of a 3D version of an existing N-order Bar/QR code, the decoded bits are the same as the bits in one plane. The embedded bits in an N-order barcode are roughly N. The embedded bits in an N×N QR code are roughly N²−4 bits. The embedded bits in an N×N×N UOID tag are N³−6 bits.
As shown in Figure 4.20 (b), the number of embedded bits in a UOID tag grows much faster with the tag order than in 1D/2D optical tags of the same order. Even their 3D versions cannot compare to the UOID tags (e.g., a 3-order UOID embeds 7× and 4.2× the bits of a Bar code and a QR code of the same order, respectively).

(2) Same tag size & data with larger element distance. The larger the average element distance and the broader the distribution of element distances, the better the detection performance and the fewer the error bits. We measure the distances between all 21 data elements in the 3D versions of the Bar/QR codes and in the UOID tag, all with the same embedded bits and tag size (edge: 19 cm). As shown in Figure 4.20 (a) and Figure 4.21 (c), the data element distances in the Bar and QR codes are all smaller than 20 cm, whereas the data element distances in the UOID tag are distributed over a greater range of [5, 30] cm.

Figure 4.20 Comparison between UOID tags and existing optical tags: (a) experiment scenarios, (b) data improvement, (d) better goodput performance.

Figure 4.21 Comparison between UOID tags and existing optical tags: (c) broader element distance, (e) full-directional scanning.

(3) Same tag size & data with longer communication range. We also investigate the goodput performance of the three tags mentioned above in two different underwater scenarios, a clean creek and a muddy river, at distances varying from 1 m to 3 m. In the clean creek, all three tags perform well and produce more than 17 bits of goodput at up to 3 m, as illustrated in Figure 4.20 (a) and (d). However, in the muddy river, the goodput of the 3D versions of the Bar and QR codes drops dramatically after 1.5 m, whereas the UOID tag maintains its high goodput until 2.5 m.

(4) Same tag size & data with broader scanning angles. Furthermore, for all three of the aforementioned tags, we evaluate the goodput performance with varying scanning angles at 0.5 m in the clean creek and the muddy river.
As shown in Figure 4.20 (a) and Figure 4.21 (e), the usable viewing range increases from less than 120° for the existing optical tags to 360° for the UOID tag in both the clean creek and the muddy river.

(5) Other benefits of the UOID design. Compared with the 2D plane (the everyday versions of 1D Bar and 2D QR codes, shown in the left middle of Figure 4.20 (a)) and the solid 3D cube (the 3D versions of Bar/QR), which must maintain the tag's location and orientation in flowing water or currents (i.e., creek, river, tide), the hollowed-out UOID lessens the influence of water currents by allowing the water to flow through the tag, helping it stay stable.

4.7.7 Case Study with Multiple UOID tags

The usage of our U-Star signage system is similar to the barcodes/QR codes adopted in automated supermarket systems. The data embedded in the codes are query codes used for searching a backup database with records for all offered goods. Because storage on the mobile device is large enough, the ability to embed more query codes results in better navigation. Our 3-order UOID tag can embed 2^(3×3×3−6) = 2^21 = 2,097,152 possible query codes. Even with Hamming ECC parity bits, which sacrifice 10 (3+4+3) bits, there are still 11 data bits available, enough to embed 2^11 = 2,048 query codes. As shown in Figure 4.22, we implement four UOID tags with Hamming error correction codes in the case study, and their 11 valid data bits map to distinct query codes in the range [0, 2047] in the backup database. The database stores the current absolute location information, guidance information, and risk warnings such as "shark near", which can be queried via the related query codes.

Figure 4.22 Underwater navigation case study of U-Star in a 4 m × 10 m indoor pool with 4 UOID tags and a backup database.

In our demo in a 4 m × 10 m indoor pool, the user dives in at start site B and plans to go to destination site C and then return. When the user scans Tag B at the start location, the user is given the current absolute location (i.e., facing North and at (2 m, 0.5 m) in the coordinate system) as well as information about its nearby nodes (i.e., D is the nearest tag, 4.5 m to the northeast of B; A is 5.3 m from B to its southeast; and C is 9 m from B to its east) to help the user navigate to other spots. The user intends to visit Tag D first. He looks for a bright dot around 4.5 m away to the northeast of Tag B (the optical ranging of the UOID gives him a sense of underwater distance). If he cannot find his way, he travels to another nearby node such as Tag A. After confirming D's existence, he moves to Tag D and repeats a similar procedure, going to Tag A first (compared with 8.2 m to C, the distance to A is 5 m, and A is the nearest node to D not yet visited). Next, from A, he finally reaches destination C.
His path (locally optimal) is B-D-A-C, and his return path is C-D-B (an effective path), whereas the globally optimal path C-B may not work because he may not be able to confirm B's existence from C. By following the procedures above, he achieves self-guided underwater navigation easily and effectively, regardless of the start and destination tags.

4.7.8 Other Concerns

Cost and price. As shown in Figure 4.23, the main cost of the U-Star system is the tag reader, while the UOID tag is very cheap (less than $3 each). For practicality, the tag reader can be replaced with the user's own smartphone covered with a waterproof case, which costs less than $4. Considering multiple UOID tags deployed underwater, a U-Star system with 20 UOID tags costs less than $100 for an underwater site with an area of about 1,000 m² (20 tags × 7 m × 7 m).

Figure 4.23 Cost and price: one 3×3×3 UOID tag uses element balls (< $1), a stick/plastic frame (< $0.5), hot melt glue (< $0.5), double-sided tape (< $0.5), and luminous powder (< $0.5), about $3 in total; the tag reader is a basic sport camera (about $30) or the user's own smartphone with a self-contained waterproof case (about $3.5 on Amazon); U-Star with 20 tags costs less than $100.

Computation overhead. In underwater situations, battery capacity is limited and batteries are not easy to replace, so the tag reader should not conduct complex computations that drain energy too fast. The training processes are offline; the real-time tasks are denoising, optical ranging, orientation guidance, and decoding. As shown in Figure 4.24, denoising requires the most memory and decoding requires the least. The four tasks together require about 430 MiB of memory, which is not a computational burden for a commercial smart device.

Figure 4.24 Overhead: memory consumption (MiB) and running time (s) of the four real-time tasks (denoise, range, direction, decode) and their total.

Latency. For underwater navigation tasks, time can be important for improving the user experience and even saving lives. Compared with state-of-the-art underwater navigation systems, including audio-based systems, U-Star has nearly no signal propagation delay due to the fast propagation of light, so we only consider the computational latency. As shown in Figure 4.24, optical ranging and orientation guidance have the lowest running time of 0.002 s, while decoding has the longest running time at 1.25 s. All four tasks take 1.59 s in total, which is still quick enough for a good user experience.

4.8 Discussion and Summary

Usage instructions for scanning a UOID. Even with appropriate spacing between data elements in UOID tags, there is some LoS blockage at certain scanning angles. However, by slightly adjusting the capturing pose without moving the user's location, it is simple to avoid blockages and capture all data elements.

The number of guidance directions. Our current U-Star prototype provides user orientation guidance in 8 directions, which is sufficient for practical underwater navigation. U-Star, however, can be updated to finer-grained orientation guidance by the same CNN training with more direction classes (e.g., 16 directions).

UOID deployment. Because GPS is unusable in underwater scenarios, the positions of deployed UOID tags are identified and saved in a backup database on shore at a one-time deployment cost. We can use spring installation techniques to fix UOID tags on the underwater floor with little concern for location and orientation fluctuation caused by tide and flow.
These springs make the tag flexible under tidal forces and let it automatically return to its expected position when the water becomes still, much like how dampers in tall buildings maintain stability; this also extends the tag's usage lifetime.

System robustness and potential side effects on marine animals. (1) Moss/scum removal: Because moss grows slowly, we can periodically (e.g., every month) remove the accumulated moss and maintain the UOID tags as part of underwater infrastructure maintenance. We can utilize an ultrasonic technique to remove moss touchlessly while causing no harm to the UOID tags or other marine life. (2) Luminous powder: To prevent pollution and harm to marine life, we apply non-toxic, non-radioactive, long-lifespan (more than 15 years) luminous powder wrapped with waterproof glue. (3) Marine debris: In the future, we can use integrated molding technology and 3D printing techniques to produce recyclable, solid, and not easily damaged UOID tags to avoid creating marine debris.

Applications that benefit from U-Star. (1) Recreational scuba diving. (2) Underwater rescue. In addition to using fixed UOID tags as infrastructure for safe underwater activities, we can attach smaller UOID tags (which store people's identifying information) to the tops of underwater helmets as mobile UOID tags for people participating in underwater activities. As a result, rescuers can scan UOID tags to identify people and learn about the on-site situation (how many people there are and who is in danger or needs rescue). Trapped people, on the other hand, can scan larger UOID tags on rescuers to actively seek help and instructions. (3) Future directions combined with Augmented Reality. We can upgrade the tag reader from the current sport camera/smartphone to AR goggles to show guidance information in a more direct and visual manner than a small smartphone display, for a WYSIWYG user experience: "see UOID, see INFO".

In summary, we implement the U-Star system for simple and robust underwater navigation. We investigate 3D spatial diversity for data embedding with wider element distances and additionally use it for relative positioning. We address challenges in system design and implementation, e.g., combating harsh underwater environments and restoring the 3D structure for data parsing. Finally, we conduct experiments based on virtual and real UOID tags in multiple underwater scenarios. Our 3-order UOID prototype can embed 21 bits and achieves a BER of 0.003 at 1 m and less than 0.05 at up to 3 m, with nearly 100% relative positioning precision.

CHAPTER 5
HAND POSE RECONSTRUCTION VIA 3D SPATIAL DIVERSITIES

Smart homes, medical devices, education systems, and other emerging cyber-physical systems offer exciting opportunities for sensing-based user interfaces, especially those utilizing fingers and hand gestures as system input. However, existing vision-based approaches, which rely on time-consuming image processing, often adopt a low 60 Hz location sampling rate (frame rate) for real-time hand gesture recognition. Additionally, they may not perform well in low-light environments and have limited detection range. To address these challenges, we propose RoFin, a novel system that leverages the 3D spatial-temporal diversities of optical signals for fine-grained finger tracking and hand pose reconstruction. RoFin stands out as a low-cost and privacy-protected solution, enabling real-time 3D hand pose reconstruction with fine-grained finger tracking capabilities.
It works effectively over a range of distances and under diverse ambient light conditions, providing a more versatile and robust approach to hand gesture recognition and tracking.

Figure 5.1 RoFin can better record the jitter of writing [68]: a camera with one location sample per frame cannot capture the fine-grained finger trace of the on-paper writing (ground truth) of a Parkinson's sufferer.

5.1 Motivation

Some researchers attach on-body sensors (e.g., accelerometers, gyroscopes) to each finger and joint to measure the spatial position variation of the fingers. Other studies utilize wireless signals such as radio frequency signals, acoustic signals, and light signals (e.g., Soli [61], FingerIO [75], and Ali [59]) for hands-free gesture recognition. However, these methods require expensive or specialized devices and have a limited sensing distance of less than 0.5 m.

Vision-based hand gesture identification approaches are widely popular, using processing techniques similar to human eyes to detect hand morphology at a perception frequency of about 60 Hz. The accuracy of vision-based hand gesture recognition exceeds 80% with the aid of deep learning [137]. However, these vision-based methods have several drawbacks: (1) They are not effective in low-light conditions or at long detection ranges due to the limited amount of light reflected from the hand to the camera's image sensor. (2) The low sampling rate (e.g., 60 Hz) of cameras when tracking fingers is similar to the limited perception ability of human eyes, making it challenging to capture the detailed motion trajectory of trembling hands, as observed in patients with Parkinson's disease. (3) Vision-based approaches involve high processing costs and latency, mainly due to the need to recognize hand morphology with about 20 hand joints. (4) The captured frames of scenes containing hands raise privacy concerns, particularly in sensitive circumstances.

Commercial cameras and LEDs are deployed everywhere, making optical camera communication (OCC) a reality in our daily lives. The rolling shutter in commercial cameras exposes one row of pixels at a time and generates a whole image row by row. A clear strip effect appears when the switching speed of the light wave from the transmitter is equal to or slightly less than the rolling shutter speed. Many researchers have tried to improve data rates by collecting data from rolling strips rather than from the entire image frame. However, these systems [124, 122, 125, 147, 148] only exploit the rolling shutter for communication rather than for sensing, such as fine-grained inside-frame location tracking at a high sampling rate (the rolling shutter speed, e.g., 5 kHz) instead of one location sample per frame.

To overcome these limitations, our proposed system, RoFin, leverages the 3D spatial-temporal diversities of optical signals to offer fine-grained finger tracking and hand pose reconstruction. By doing so, RoFin addresses the drawbacks associated with traditional vision-based hand gesture recognition approaches and provides a low-cost, real-time, and privacy-protected solution. RoFin consists of wearable gloves and a commercial camera, as shown in Figure 5.2. Each glove finger and the wrist is attached to one low-power LED node controlled by an Arduino Nano (< $10).
Figure 5.2 3D hand pose reconstruction via six temporal-spatial 2D rolling patterns: each finger is labeled invisibly with a cyclic OOK wave, and the captured rolling patterns provide shutter-speed-level fine-grained 3D spatial information and movement trends from privacy-protected images.

5.2 Background and Related Work

5.2.1 Vision-based 3D Hand Pose Recognition

Numerous works adopt cameras to recognize hand poses. In general, these computer vision approaches can be classified into two categories. (1) Hand image searching in pre-computed databases with machine-learning assistance. These methods capture hand images and then query pre-computed 3D hand models to determine the best-matched hand pose [107, 8, 56]. (2) Calculating the 3D coordinates of hand joints directly and then identifying the hand pose by optimizing an objective function. These methods represent the hand with a 3D hand model and adopt an optimization strategy to speed up hand pose prediction [97, 137, 86]. However, these existing vision-based hand pose recognition methods are based on complete hand morphology, such as hand silhouettes and numerous joints (e.g., 20 joints), with non-trivial tracking and computation overhead. Furthermore, vision-based approaches sample the location variation at the frame update level, and the frame rate is set to ≤ 60 fps, rather than higher, to be compatible with time-consuming image processing.

In contrast, RoFin takes a different approach, enabling 3D hand pose reconstruction using only six 2D rolling spots: the five fingertips and the wrist point of the hand. By relying on fewer tracking points and employing a lightweight pose reconstruction algorithm called HPR, RoFin [143, 142] achieves real-time hand pose reconstruction with an average time cost of 13.8 ms. Additionally, even with a limited 60 fps frame rate, RoFin can sample numerous inside-frame points instead of only one, as in vision-based approaches. This enhanced sampling granularity greatly improves the accuracy and precision of finger tracking.

5.2.2 Strip Effect in Rolling Shutter Camera

Cameras commonly found in our everyday smart devices utilize a low-cost technique called the rolling shutter to reduce the readout time of pixels from the entire image frame. In a rolling shutter camera, the exposure occurs one row of pixels at a time, generating the complete image row by row. This rolling shutter mechanism causes a noticeable strip effect when the switching speed of the light wave from the transmitter is equal to or slightly less than the rolling shutter speed. This strip effect allows for the sequential capture of optical signals containing the transmitted data within a symbol period, enabling optical camera communication (OCC) techniques such as CASK, ColorBar, and others [124, 122, 148]. These OCC techniques leverage the rolling shutter phenomenon to facilitate communication by capturing and interpreting the transmitted optical signals in a series of rolling strips.
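A rough back-of-the-envelope sketch of why the rolling shutter matters for sensing (the 5 kHz row rate and 60 fps values echo the examples used in this chapter; the helper name is ours): even a modest camera exposes dozens of row-sequential observations per frame, all of which collapse into a single location sample in conventional frame-level tracking.

```python
def inside_frame_samples(row_rate_hz: float, frame_rate_fps: float) -> int:
    """Approximate number of row-level observations available within one
    frame period, versus the single per-frame sample used by frame-level
    (vision-based) tracking."""
    return int(row_rate_hz / frame_rate_fps)

# ~5 kHz rolling-shutter row rate with a 60 fps camera:
print(inside_frame_samples(5_000, 60))   # -> 83 row-level samples per frame
```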
The high-rate sampling ability of a rolling shutter camera is not fully utilized in current vision-based finger tracking and hand pose recognition approaches. These methods typically sample only one location of a specific object (e.g., a fingertip) per frame, despite the rolling shutter camera's capability of capturing numerous location samples during a frame period. In contrast, RoFin maximizes the potential of these numerous location samples by employing active LED spheres attached to the fingertips. By tracking the location variation of the center of each LED sphere in 3D space during one frame period, RoFin achieves fine-grained inside-frame finger tracking granularity. This is particularly useful in scenarios involving high-motion states (e.g., shaking), as RoFin can use the resulting deformed ellipses to record the finger's movement accurately. This capability enhances user experiences in activities such as virtual painting and writing. Moreover, RoFin's finely tracked virtual writing traces of Parkinson's patients enable more precise trace optimization compared to vision-based methods, which rely on coarsely sampled traces. This allows RoFin to provide a more accurate and valuable tool for assisting patients with Parkinson's disease in their writing and other motor activities.

5.3 Our Approach: RoFin

RoFin is the first to exploit 2D temporal-spatial rolling fingertips for (1) active optical labeling of fingers/hands, (2) fine-grained inside-frame finger tracking at the rolling shutter speed, and (3) real-time 3D hand pose reconstruction. Each LED node, covered with a same-size sphere, emits a distinct light wave as an optical label, which is invisible to human eyes but perceptible by rolling shutter cameras for robust finger identification. Based on the spots (deformed ellipses) captured via the rolling shutter at a high sampling rate (e.g., 5 kHz), RoFin can parse fine-grained 3D locations and inside-frame variations of the fingertips (left/right, up/down, and front/rear). Finally, RoFin reconstructs a 3D hand pose consisting of 20 points by tracking only 6 key points (5 fingertips and 1 wrist point) for lower latency and computation overhead.

Composition. The RoFin system consists of two parts. (1) RoFin gloves are commercial insulating gloves in which each fingertip and the wrist are fitted with a low-power LED component covered with a plastic ball. These LED components are controlled by an Arduino Nano to generate distinct LED waves that indicate the different fingertips. (2) The RoFin reader is based on commercial cameras (e.g., smartphones, web cameras). These cameras use adjustable-focal-length lenses and rolling shutters with configurable shutter rates.

Workflow. (i) The user puts on the RoFin gloves and makes hand poses. (ii) After setting the rolling shutter rate and focal length, the RoFin reader captures the continuous 2D rolling spots of the six key points (5 fingertips and 1 wrist point) frame by frame.
(iii) The RoFin reader identifies each fingertip/wrist point via a lightweight CNN model with bounding boxes (i.e., YOLOv5). (iv) RoFin parses the 3D location variations of each key point based on the deformed ellipses captured in each frame, with the granularity of the strip width. (v) Finally, RoFin reconstructs the 3D hand pose via the lightweight HPR algorithm based on the parsed label and its fine-grained 3D location.

Figure 5.3 RoFin system overview: composition, workflow, and three main tasks.

3 Main Tasks. At a high level, RoFin answers two questions: (1) identify which fingertip it is, and (2) locate the position of that fingertip and its inside-frame variation with a sampling rate at the rolling shutter speed. RoFin further (3) reconstructs the 3D hand pose via the HPR algorithm based on the outputs of (1) and (2).

5.3.1 Challenges and Solutions

However, we must address three significant technical challenges in developing RoFin:

C1: Each finger from multiple hands must have a distinct and robustly identifiable label, even under varied ambient light and at long distances.

C2: Deciphering the fine-grained 3D fluctuation of a fingertip from the 2D shape (i.e., a distorted ellipse) recorded during a frame period poses a considerable challenge.

C3: RoFin relies on tracking only six key points of a hand to reduce overhead. However, accurately reconstructing a 20-point 3D hand pose from these limited 6 key points in real time presents a significant challenge.

Our contributions can be summarized as follows:

(1) RoFin is the first work to exploit the rolling shutter effect for 3D hand pose reconstruction. We label each fingertip and the wrist point with asynchronous cyclic optical labels. We then adopt a lightweight CNN model with bounding boxes to identify the fingertips and wrist points. Our active optical labeling overcomes the limitations of vision-based techniques and is appropriate for identifying multiple hands in low-light and long-range detection scenarios.

(2) We creatively utilize inside-frame high-rate sampling via the rolling shutter to track each fingertip's 3D location variation within a frame, further enhancing tracking granularity, whereas vision-based approaches use only one 2D location sample per frame period. The improved finger tracking ability has potential for virtual writing for Parkinson's sufferers and a better user experience for virtual writing/painting in AR/VR/MR.

(3) Based on the finger identification and the parsed 3D location information of the 6 key points (5 fingertips and 1 wrist point) from (1) and (2), we design HPR, a real-time and lightweight algorithm that reconstructs a 20-point 3D hand pose from the 6 tracked key points. HPR can efficiently reconstruct a 3D hand pose by direct calculation instead of tracking redundant points, without sacrificing reconstruction accuracy.
(4) We implement RoFin with commercial devices and evaluate (i) finger identification performance in different settings, (ii) inside-frame tracking enhancement in comparison to the vision-based approach, and (iii) hand pose reconstruction error, with Leap Motion as the benchmark, as well as the reconstruction latency. We also discuss potential use cases of RoFin such as multi-user interaction for the metaverse, virtual writing or health monitoring for Parkinson's sufferers, and hand pose commands for smart homes.

5.4 Active Optical Labeling

5.4.1 Temporal Rolling Patterns

The light source emits optical signals that vary with time at the rolling-shutter-speed level during one frame period, and these signals can be recorded row by row in the captured image frame by the rolling shutter camera. Only when the rolling shutter rate is similar to the transmission frequency, however, can we clearly see distinct rolling strips, as illustrated in Figure 5.4. We can utilize the captured rolling spots with distinct strip textures as active optical labels to indicate fingertips. However, optical signals have multiple light features that can be varied over time, such as amplitude, color, and frequency. Which of them ought to be used in the rolling patterns for RoFin? We explored each option, and the captured images are shown in Figure 5.4.

• Amplitude. We can adjust the brightness of the light source over time[20, 152, 118]. The light amplitude fluctuation is vividly captured sequentially.

• Transmission Frequency. We may also alter the ON/OFF switching speed of the light wave.

• Color Spectrum. We could transmit light with different wavelengths. The captured rolling strips are colorful and vary over time in the same way as the color fluctuation of the light source.

Choice. Achieving color spectrum diversity requires RGB LEDs and complicated modulation. Transmission frequency diversity requires both complex modulation and a longer time period to present a complete frequency variation (i.e., only part of the complete pattern could be presented in the captured sphere spot of limited width). To indicate multiple fingertips, amplitude variation is more suitable than different colors or transmission frequencies, which require more complex devices (i.e., multi-color LEDs, high-clock-rate MCUs) and control overhead. Thus, we apply single-color LEDs with Pulse Width Modulation (PWM).

Figure 5.4 Captured strips impacted by shutter speed. Light feature selection for temporal rolling patterns.

5.4.2 Fingertip and Hand Indication

Each attached LED element can emit a different amplitude wave as its active optical label. However, we cannot synchronously control each light source so that they start their temporal rolling patterns at the same time. Additionally, because of their various positions inside the field of view (FOV) of the camera, the recorded rolling strips may begin at different times. These asynchrony problems make it difficult for the RoFin reader (i.e., the camera) to recognize the embedded identification information from the different light sources (i.e., LEDs).
Thus, we design an asynchronous Cyclic-Pilot On-Off-Keying (CP-OOK) labeling scheme for the different fingertips from multiple hands and wrist-assisted hand indication.

• CP-OOK based Fingertip Indication. The optical label consists of two parts: (1) a CP (cyclic pilot), which takes one symbol period at the beginning, and (2) an indication sequence formed by 5 (extensible) OOK (On-Off Keying) symbols, as shown in Figure 5.5 (a). Aside from the Off symbol (dark), the optical label design has two amplitude levels: the CP symbol has the highest brightness, whereas the On symbol is dimmer than CP. Instead of the usual very long preamble[2], we designed a short pilot (i.e., the CP). Because the number of rolling strips revealed in the finger pattern (the circle or ellipse) is restricted, we must ensure that at least one complete optical label is shown in each rolling pattern for further decoding. Furthermore, to improve the robustness of these optical labels in variable environments, we use a total of 2 non-dark amplitudes (Am_CP and Am_On) instead of additional amplitude levels (e.g., the 5 amplitude levels of amplitude shift keying). We encode the index of each finger as a binary number in the OOK symbols, as shown in Figure 5.5 (a). When the finger index is 11, for example, the binary number is 01011 and the indication sequence is [Off, On, Off, On, On]. The length of the indication sequence is determined by the number of fingers being tracked: 3 OOK symbols can represent up to 8 fingers, enough for 1 hand, and 4 OOK symbols can represent 16 fingers, enough for 3 hands. In general, N OOK symbols can represent $2^N$ fingers, which is appropriate for $2^N/5$ hands. The transmission frequency of the light waves is the same as or slightly slower than the rolling shutter rate, and thus these optical labels are clearly recorded for further finger identification, as shown in Figure 5.5 (b).

• Wrist-assisted Hand Indication. We assign each finger from the multiple hands of multiple users a finger index, as illustrated in Figure 5.5 (c). For example, for users A, B, and so on, we assign A's right hand as hand #1 and A's left hand as hand #2, then B's right hand as hand #3, and so on in the same manner. We evaluate three hands (A's right and left hands and B's right hand) and assign these fingers indication indices from 1 to 15, finger by finger, as shown in Figure 5.5 (c).

Figure 5.5 The scheme design of CP-OOK fingertip indication and wrist point labeling.

However, the 5 fingertips alone are not enough to determine a hand in 3D space. Besides the five fingertips of a hand, we also attach one LED node covered with a same-size sphere at the end of the wrist. This additional wrist point is more vital for hand pose reconstruction than any of the five fingertips.
Furthermore, different hands should have distinct indications for the 6 key points of each hand (5 fingertips and 1 wrist point) so that hands can be differentiated and each hand pose correctly reconstructed when they appear in the camera view at the same time. Based on the analysis above, the wrist indication should be more distinctive than the fingertip indications, but without introducing additional non-trivial overhead (e.g., using different light features such as colored LEDs, or different modulation schemes such as FSK). To achieve this design goal, we use the same CP-OOK modulation technique as in fingertip indication, but set the leftmost indication bit to 1 and use the remaining bit sequence as the wrist indication for differentiation. Given three hands #1, #2, and #3, a 4-bit indication sequence is required to denote the 15 fingers. Thus, originally, the binary number for finger #11 is 1011, but we set its indication sequence to 01011, which is [Off, On, Off, On, On], to make it compatible with the wrist point indication. For the wrist point of hand #2, the binary number is 10. Following the rule above, the indication sequence of this wrist point is set to 10010, which is [On, Off, Off, On, Off].

5.5 3D Spatial Parsing

Although vision-based approaches can use higher frame rates (e.g., 120 fps, 240 fps) for sampling, the image processing is still time-consuming. Thus, vision-based approaches cannot reconstruct hand poses as fast as the higher frame rates would allow and normally set the frame rate to about 60 fps for a real-time user experience. Unlike vision-based approaches, which use only one 2D location sample (x, y) per frame, RoFin tracks numerous 3D location samples (the inside-frame trajectory of the sphere's center, i.e., the deformed ellipse) at a high sampling rate (i.e., rolling shutter speed). Thus RoFin perceives rapid or subtle motion changes (e.g., writing jitters of Parkinson's sufferers) more sensitively than vision approaches at the same frame rate. However, it is challenging to parse the 3D coordinates (x, y, and z) from the deformed ellipses.

5.5.1 Depth Info Estimation: Z

Perspective Principle. We keep the LED light source fixed in the FOV, but as we move the light source closer to or farther from the camera, the size of the captured spot grows and shrinks accordingly due to the perspective principle. Based on the size of the captured spot, we can calculate the depth info (Z, i.e., front and back), as shown in Figure 5.6 (a).

Figure 5.6 Absolute depth calculation via perspective principle by using wrist point as the reference.

Absolute Depth Calculation. The wrist point is designed not only to assist hand indication; its captured diameter $\phi_w$ (unit: pixels) can also be used to calculate the absolute distance $d$ of the key points to the camera. As shown in Figure 5.6 (b), the distances to the camera satisfy $\frac{1\,\mathrm{m}}{d} = \frac{\phi_w}{\phi_{1\mathrm{m}}}$. Thus, the absolute distance $d$ from the wrist point to the camera can be formulated as $d = \frac{\phi_{1\mathrm{m}}}{\phi_w} \times 1\,\mathrm{m}$. To do so, we measure and store the captured spot diameter of the wrist point at 1 m, $\phi_{1\mathrm{m}}$, as the reference, and estimate the depth info of all six key points in the same manner.

Coordinate Transformation. We set the center of the wrist point as the origin of the 3D coordinate system.
As shown on the right of Figure 5.2, the z value of each of the five fingertips is set to its physical depth distance relative to the wrist point. The center (x, y) of each fingertip's spot in the image plane is a pixel value, which we also need to convert to a physical distance. We use the pixel range of the wrist point's diameter, which maps to the 19 mm of the plastic sphere, as the reference to convert the relative X/Y values of each fingertip's center into its physical distance relative to the wrist point.

5.5.2 Inside-frame Fine-grained X/Y Tracking

Why high-rate inside-frame sampling? In real-life situations, the tracked objects (e.g., vehicles, drones, or fingers) are mostly mobile with random trajectories. For example, numerous location samples per unit time are required to recover the real trajectory of a fingertip used as a brush in virtual writing/painting. Either a long, random curve drawn quickly or a small curve requires more samples to capture more details. However, existing vision-based approaches sample the location variation only at the level of frame updates. Besides, the frame rate is set to about 60 fps rather than higher, considering the time-consuming image processing. To close this gap, we creatively propose to utilize the rolling shutter effect for numerous inside-frame location samples.

Figure 5.7 (a) Impacts on the shape variation of the deformed ellipse: (i) motion direction and (ii) motion speed. (b) The sphere center's location variation is recorded in the deformed ellipse with the granularity of strip width.

Impact of Motion Direction. We move the light source in different directions while keeping the distance to the camera plane and the motion speed fixed. For example: (1) and (2) from left to right (→) and reversed (←); (3) and (4) from bottom to top (↑) and reversed (↓); (5) and (6) from upper-left to bottom-right (↘) and reversed (↖); and (7) and (8) from bottom-left to upper-right (↗) and reversed (↙). As shown in Figure 5.7 (a-i), the captured spot changes from the previous circle to an ellipse, and its long axis reflects the moving direction of the light source.

Impact of Motion Speed. We set 4 levels of motion speed for the light source (i.e., low, medium, fast, and super-fast) with the same motion direction (↗) and a fixed distance to the camera plane. As the motion speed increases, so does the length of the ellipse's long axis, as shown in Figure 5.7 (a-ii).

Numerous Inside-frame X/Y Location Samples. The pixel index ranges of the captured circle or ellipse in columns and rows reflect the horizontal and vertical location information independently. A circular shape means the fingertip/wrist point is not moving or is moving slowly in the image plane during the entire frame period, and its center location (x, y) can be treated as its location in the horizontal and vertical directions.
The deformed ellipse records the detailed inside-frame motion at a sample rate equal to the rolling shutter speed, as illustrated in Figure 5.2.

5.5.3 Finger Tracking among Frames

As shown in Figure 5.7 (a-i), opposite moving directions of the light source produce the same rolling pattern shape (i.e., an ellipse with a similar long-axis direction) within a single frame. For example, there are 3 frames in Figure 5.8 (a): frame1, frame2, and frame3. In frame2, the light source may be moving along either (↗) or (↙), and thus we cannot determine the fingertip's motion from a single frame alone.

Figure 5.8 RoFin's finger tracking among frames combined with numerous inside-frame samples.

Moving Trend Determination. However, if we combine the inside-frame moving direction candidates across two consecutive frames, we can determine the finger's moving trend. Because these frames are generated continuously in time, the end position of the finger pattern in the previous frame will be close to the start position of the finger pattern in the next frame, as shown in Figure 5.8 (b). Thus, we can determine the finger's moving trend by finding the closest positions of the finger pattern in two consecutive frames. In this example, the position points RL1-end and RL2-start are the closest position points between the two consecutive frames frame1 and frame2.

Moving Trajectory Generation. More importantly, the moving trend determination is a one-time initialization phase that requires only one frame duration to determine the end position of each finger pattern and record it as the start position for the next frame. In this example, using the finger pattern position in frame3, we know that the point RL3-start is the start point. Then we can track finger locations by combining these numerous inside-frame samples and updating them frame by frame, as illustrated in Figure 5.8 (b). Finally, RoFin generates a finer-grained moving trajectory than the vision-based approach.

5.6 Hand Pose Reconstructing

5.6.1 Identify Rolling Labels via CNN

Traditionally, we could decode these optical labels via amplitude thresholds. However, due to the variable optical environment, it is difficult to configure the thresholds dynamically. Even under the same ambient light settings, the captured rolling pattern of each finger requires different thresholds for decoding. Furthermore, the amplitude gap between the CP and On symbols narrows dramatically in strong ambient light, which can cause numerous decoding errors.

Convolutional Neural Networks (CNNs) are widely applied in computer vision object classification due to their great robustness and accuracy. The benefits include: (1) offline training and online identification can reduce latency for real-time finger label parsing; (2) even under strong ambient light where CP and On are difficult to distinguish, the CNN model can learn the features in the repeating dark and bright rolling strips. We adopt YOLOv5 to identify our optical labels together with their bounding boxes. YOLO (You Only Look Once) models are commonly used for object detection because of their fast inference and high accuracy.
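To make the reader-side detection step concrete, the sketch below shows one way a custom-trained YOLOv5 detector could be queried for the rolling-pattern classes. The checkpoint name, the confidence threshold, and the helper function are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of RoFin-style label detection with a custom-trained YOLOv5
# model. The checkpoint name "rofin_labels.pt" and the threshold are assumed.
import torch

# Load a custom YOLOv5 checkpoint through the public ultralytics/yolov5 hub entry.
model = torch.hub.load("ultralytics/yolov5", "custom", path="rofin_labels.pt")
model.conf = 0.5  # detection confidence threshold (assumed value)

def detect_key_points(frame):
    """Return (label, x_center, y_center, radius) for each detected sphere."""
    results = model(frame)                    # run inference on one image frame
    detections = results.xyxy[0].tolist()     # rows: [x1, y1, x2, y2, conf, cls]
    key_points = []
    for x1, y1, x2, y2, conf, cls in detections:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        radius = max(x2 - x1, y2 - y1) / 2.0  # bounding box -> rough sphere radius
        key_points.append((model.names[int(cls)], cx, cy, radius))
    return key_points
```

The returned center and radius per sphere correspond to the x, y, and radius values that the later 3D spatial parsing steps consume.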
As shown in Figure 5.9 (a), the adopted detection network combines a feature-extraction backbone (drawn as an EfficientNet backbone in the figure) with BiFPN (Bi-directional Feature Pyramid Network) layers to extract object features effectively. These features are then fed through the prediction nets, which output both the object's class and the location of its bounding box. We capture 90 images of the 3 RoFin gloves under 3 different ambient light strengths, with 10 images for each setting.

Figure 5.9 Label parsing via YOLOv5.

Then, we manually label each rolling pattern with one of 18 class labels (i.e., F1-F15, W1-W3), as shown in Figure 5.9 (b). We then adopt data augmentation via gray-scale modification to increase the size of the training dataset. Finally, we use the trained model to infer each rolling pattern's label with bounding boxes. As shown in Figure 5.9 (c), the trained model outputs the labels accurately with high confidence. Besides, the output bounding boxes include each sphere's x, y, and radius for 3D spatial parsing and further hand pose reconstruction.

5.6.2 Cluster Fingers and Wrists into Hands

Grouped 6 Key Points of a Hand. Based on the fingertips and wrists identified from multiple hands above, we can compute which hand each belongs to and then easily cluster the fingertips and wrist point of each hand together. For example, the fingers with indication numbers in [1, 2, 3, 4, 5] and the wrist point with an indication number of 1 should be grouped into hand #1, because their calculated hand index is the same, namely 1. As shown in Figure 5.10, the wrist labeling avoids clustering fingers with the wrist point of another hand and thus guarantees accurate hand pose reconstruction afterwards.

Figure 5.10 Finger clustering with the correct wrist point into a hand.

3D Coordinates of 6 Key Points. The 6 key points with 3D coordinates clustered into one hand are input into the HPR model, which then outputs the reconstructed 3D hand pose in real time. Unlike fine-grained finger tracking with numerous inside-frame sampled points in an image frame, real-time hand pose reconstruction requires only one 3D location sample per frame for each of the six key points.

Figure 5.11 Hand and the illustration of the HPR (hand pose reconstructing) model for hand pose reconstruction via six tracked 3D key points.

5.6.3 Lightweight HPR Model

Given the 3D positions of the 5 fingertips and 1 wrist point of a hand, the 3D hand pose is fully determined, and thus we can reconstruct the 3D hand pose. In comparison to vision-based approaches, our approach tracks only 6 key points instead of 20 points for less tracking and computation overhead. However, it is challenging to reconstruct a 20-joint hand pose from only 6 key points in real time. To overcome this challenge, we design the lightweight HPR model illustrated below.
Given the six key points with 3D coordinates (the wrist point $p_O$, the tip of the thumb $p_A$, the tip of the index finger $p_B$, the tip of the middle finger $p_C$, the tip of the ring finger $p_D$, and the tip of the little finger $p_E$), as shown in Figure 5.11, how can we reconstruct a 20-joint hand pose? The intuitive answer is to calculate the 3D locations of the other 14 points: $p_{A_1}$, $p_{A_2}$ (i.e., we simplify the thumb to 2 joints), $p_{B_1}$, $p_{B_2}$, $p_{B_3}$, $p_{C_1}$, $p_{C_2}$, $p_{C_3}$, $p_{D_1}$, $p_{D_2}$, $p_{D_3}$, $p_{E_1}$, $p_{E_2}$, and $p_{E_3}$.

The Plane of the Projected Palm. As shown in Figure 5.11, the fingers and the palm can be projected onto a plane, which we define as the projected palm $P_{palm}$. In fact, the tip of the index finger, the tip of the little finger, and the wrist point define $P_{palm}$ (i.e., $P_{BOE}$).

The Plane Formed by Finger Joints. The joints of a finger form a finger plane. These finger planes (except the thumb's finger plane) are perpendicular to the plane $P_{palm}$. For example, the joints of the index finger $p_{B_1}$, $p_{B_2}$, $p_{B_3}$, $p_B$ and the wrist point $p_O$ generate the finger plane $P_{OB_1B_2B_3B}$, and $P_{OB_1B_2B_3B} \perp P_{palm}$ (i.e., $P_{OB_1B_2B_3B} \perp P_{BOE}$). In contrast to the finger planes above, the thumb's finger plane is almost parallel to the plane $P_{palm}$ (i.e., $P_{OA_1A_2A} \parallel P_{BOE}$). Thus we can find these 5 finger planes based on the known plane $P_{BOE}$, as shown in Figure 5.11.

Given the 5 connection lines between each fingertip and the wrist point (i.e., $l_{OA}$, $l_{OB}$, $l_{OC}$, $l_{OD}$, and $l_{OE}$) and the calculated finger planes $P_{OA_1A_2A}$, $P_{OB_1B_2B_3B}$, $P_{OC_1C_2C_3C}$, $P_{OD_1D_2D_3D}$, and $P_{OE_1E_2E_3E}$, we can determine the 14 unknown joints on the finger planes via the following two rules.

• We can simplify the finger bending because each finger section of one finger bends with a similar or proportional angle, as shown in Figure 5.11.

• The length from the fingertip to the wrist point, $l_{con}$, equals the sum of each finger section's projection onto the line $l_{con}$.

Thus, we can calculate the bending angle and further find each unknown joint location. As shown in Figure 5.11, the finger joints of the index finger vary in its finger plane with different lengths of $l_{con}$ (i.e., $l_{OB}$). Thus, given a value of the variable $l_{con}$, the 3D locations of the other joints of this finger are fixed and can be calculated. For example, we know the length of each finger section of the index finger (i.e., $l_{OB_1}$, $l_{B_1B_2}$, $l_{B_2B_3}$, and $l_{B_3B}$) from the initial measurement step. Given the calculated $l_{OB}$ (i.e., $l_{con}$), the unknown bending angle $\alpha$ for the index finger can be calculated from the equation below:

$l_{OB_1}\cos 2\alpha + l_{B_1B_2}\cos\alpha + l_{B_2B_3}\cos\alpha + l_{B_3B}\cos\alpha = l_{OB}.$

5.7 Implementation and Evaluation

5.7.1 RoFin Gloves

We implement three wearable RoFin gloves for the experiments, as shown in Figure 5.12. The main components of one pair of RoFin gloves are listed in Table 5.1: lightweight insulated breathable gloves, 2 Arduino Nano MCUs, 12 green LEDs wrapped with 12 green plastic balls ($\phi$ = 19 mm), and 9V li-ion batteries for the power supply. The total weight of one pair of RoFin gloves is 132 g (including the two batteries' weight of 60 g), while the total price is only 26.3$.

Component         Price (USD)           Details (for each)
insulated gloves  0.6 x 2 = 1.2         24 cm x 15 cm, 18 g
Arduino Nano      10 x 2 = 20           ATmega328P, 5 V, 16 MHz
LED               0.02 x 12 = 0.24      5 mm, green, 20000 mcd, 20 mA
plastic cover     0.08 x 12 = 0.96      19 mm, green, lightweight
battery           2 x 2 = 4             rechargeable batteries cost about 7 x 2 = 14$
Total price       26.3                  mass production would lower the price further

Table 5.1 Components in one pair of RoFin gloves.
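To connect the glove hardware in Table 5.1 with the labeling scheme of Section 5.4, the sketch below shows one way the per-symbol PWM brightness schedule for a key point could be derived. The 8-bit duty-cycle levels, the shutter rate, and the function names are illustrative assumptions, not measured RoFin parameters or the authors' firmware.

```python
# Sketch of the per-symbol PWM schedule a glove controller could emit for one
# key point's CP-OOK label (Section 5.4). Duty-cycle levels and shutter rate
# are assumed example values.

# Assumed 8-bit PWM brightness per symbol type: CP brightest, On dimmer, Off dark.
PWM_LEVEL = {"CP": 255, "On": 128, "Off": 0}

def cp_ook_symbols(index, n_bits=5):
    """Finger/wrist index -> [CP, s1..sN] with s_i in {'On', 'Off'} (MSB first)."""
    bits = format(index, f"0{n_bits}b")       # e.g. 11 -> '01011'
    return ["CP"] + ["On" if b == "1" else "Off" for b in bits]

def pwm_schedule(symbols, shutter_rate_hz=8000):
    """One (duty, duration_s) entry per symbol; the symbol period is matched to
    the rolling-shutter row rate so each symbol spans roughly one strip."""
    symbol_period = 1.0 / shutter_rate_hz
    return [(PWM_LEVEL[s], symbol_period) for s in symbols]

# Example: fingertip #11 -> CP followed by [Off, On, Off, On, On], sent cyclically.
print(pwm_schedule(cp_ook_symbols(11)))
```

On the glove's Arduino Nano, each (duty, duration) pair would correspond to a PWM write to the LED pin followed by a short delay, repeated cyclically.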
5.7.2 RoFin Reader

There are numerous commercial smart devices, widely available and reasonably priced, that can be used as the RoFin reader, including smartphones, drone cameras, and even underwater sports cameras. In our experiments, we use commercial smartphones such as the iPhone 7, VIVO Y71A, and Samsung S20, as shown in Figure 5.12 (b).

We evaluate RoFin's performance in three respects: (1) label identification under different ambient light settings, distances, and cameras; (2) inside-frame tracking performance in contrast to the vision-based method; and (3) hand pose reconstruction performance with Leap Motion as the benchmark. We then also discuss RoFin's use cases and other concerns such as privacy and power consumption.

5.7.3 Robust Label Parsing

In this subsection, we evaluate the label parsing performance under different settings: (1) ambient light [low, medium, strong], (2) sensing distance [0.5m, 1.5m, 2.5m], (3) different hands [#H1, #H2, #H3], (4) different labels [F1-F15, W1-W3], and (5) different cameras [iPhone 7, VIVO Y71A, Samsung S20], as shown in Figure 5.12 (b) and (c).

Figure 5.12 System implementation: RoFin gloves (prototype & circuit diagram), RoFin reader (commercial cameras) and experiment scenarios (varied ambient light strength and distances from 0.5m to 2.5m).

(1) Impact of Ambient Light. We use the trained model to predict the labels in images captured in three different ambient light environments at the same distance of 0.5m with 3 hands. As shown in Figure 5.13 (a), label parsing achieves its best accuracy of 0.94 under strong ambient light, with an average accuracy of 0.91. These results demonstrate that RoFin's label parsing works robustly under varied ambient light, even in darkness, and outperforms vision-based approaches, which cannot work in darkness and lack identification ability.

(2) Impact of Sensing Distance. We predict the labels in images captured at three sensing distances under the same strong ambient light setting. The average label parsing accuracy is shown in Figure 5.13 (b). The accuracy drops slowly with increasing distance: RoFin achieves its best accuracy of 0.93 at 0.5m and 0.77 at 2.5m. These results demonstrate that RoFin works robustly at varied sensing distances, even at 2.5m, outperforming vision-based approaches with limited working distance (i.e., about 1m).

(3) Impact of Different Hands. We also evaluate the label parsing performance of the six labels from each of the different hands. As shown in Figure 5.13 (c), hands #1 and #3 achieve high prediction accuracy of more than 0.96, while hand #2 achieves the lowest accuracy of 0.77. The reason is that
the F6-F10 labels from hand #2 have rolling patterns that are more easily confused than those of hands #1 and #3. Even so, the average label parsing accuracy still reaches 0.91, which demonstrates the effectiveness of our optical labeling and parsing scheme.

Figure 5.13 Label parsing accuracy performance in varied settings and latency evaluation.

(4) Impact of Different Labels. We also present the confusion matrix of the trained label parsing model for the 18 different classes (i.e., F1-F15, W1-W3) in Figure 5.13 (d). It shows that the labels from hand #2 are more easily misidentified as other labels than those from hands #1 and #3, which is consistent with the results in Figure 5.13 (c). It also shows that the rolling pattern of W2 [CP, On, Off, Off, On, Off] is confused with F6 [CP, Off, Off, On, On, Off]. That is because the reversed rolling pattern of F6, [Off, On, On, Off, Off, CP] (i.e., [... Off, On, On, Off, Off, CP, Off, On, On, Off, Off, CP ...]), has high similarity with W2 when the amplitude of the CP symbol is similar to that of the On symbol.

(5) Impact of Different Cameras. We use the trained model to parse the labels captured by the cameras of different commercial smartphones [iPhone 7, VIVO Y71A, and Samsung S20] to measure their label parsing latency. As shown in Figure 5.13 (e), the labels captured by the iPhone are parsed in the shortest time, while the average parsing latency across these different cameras is about 12 ms (i.e., 83 Hz), which is less than 16.7 ms (i.e., 60 Hz). These results demonstrate that RoFin achieves real-time label parsing.

5.7.4 Enhanced Inside-frame Tracking

Figure 5.14 Z estimation performance and the enhanced inside-frame tracking performance.

In this subsection, we evaluate the accuracy of Z estimation and the enhanced inside-frame tracking of X/Y.

• Z Estimation Performance. Setup. We place a wooden hand model wearing a RoFin glove on the desk, as shown in Figure 5.14 (a). The fingertips are at different distances from the camera image plane (XY plane). The hand model keeps the same pose but with 3 different orientations to the camera. We also set the camera at 3 different rotations to capture the RoFin glove. Then we measure the distances from the fingertips' projected points on the desk to the camera plane as the Z ground truth.

Z Estimation Accuracy. As shown in Figure 5.14 (b), although the error of the depth info Z estimated via RoFin varies with the hand orientation and camera rotation, RoFin achieves an average estimation error of 1.6 cm at a sensing distance of 0.5m.

• X/Y Tracking Performance. Setup. We bind one fingertip of the RoFin glove to a pen (blue marker) and draw on transparent plastic paper hanging parallel to the camera's image plane, as shown in Figure 5.14 (c).
We also set up two cameras at a fixed distance of 0.5m while the user is drawing. One camera follows the traditional vision-based approach and captures video as usual at a 60 fps frame rate, while the other camera (the RoFin reader) captures video of the rolling patterns at the same 60 fps frame rate but with a high rolling shutter rate (8 kHz). Thus we track 3 traces of the user's drawing at the same time: (1) the ground truth on the plastic paper, (2) the trace tracked by the vision approach, and (3) the trace tracked by RoFin.

X/Y Tracking Enhancement. We ask the user to draw 3 different letters: (1) M, with more straight lines, (2) C, with a curve, and (3) a rotated α, with a more complex curve, each at two writing speeds: normal and faster. As shown in Figure 5.14 (d) and (e), RoFin tracks 4 times as many location points for the same letter, which significantly enhances the granularity of the tracked trace compared with vision-based tracking. Besides, RoFin achieves more accurate trace tracking than the vision-based method for all three letters due to its fine-grained inside-frame sampling. In a nutshell, this demonstrates that our low-cost RoFin provides accurate Z estimation and enhanced X/Y tracking.

5.7.5 Real-time Hand Pose Reconstruction

We define 10 hand poses, as shown in Figure 5.15, for the hand pose reconstruction evaluation.

Figure 5.15 10 defined hand poses: (a) bend index finger, (b) point with index finger, (c) close the fist, (d-g) pinch thumb with index, middle, ring, and little finger, (h) turn palm to the left, (i) turn palm to the right, (j) the palm.

We capture images of the wooden hand model wearing the RoFin glove with the RoFin reader for the different hand poses. Then we run the HPR model and evaluate its accuracy and latency with Leap Motion as the benchmark.

Figure 5.16 Hand pose reconstructing performance.

• Reconstructing Accuracy. Impact of Ambient Light. We define the deviation error as the average difference in x, y, and z between RoFin and Leap Motion. As shown in Figure 5.16 (b), the average deviation errors of the three ambient light settings [low, medium, strong] at 0.5m have similar distributions, and most deviation errors are below 22 mm. Among the three ambient light settings, the medium ambient light achieves the best performance because the RoFin reader can capture the clearest contours of the six key points' spheres.

Impact of Sensing Distance. As shown in Figure 5.16 (c), the average deviation errors of the three distances [0.5m, 1.5m, 2.5m] are similar and are mostly within 28 mm. The 1.5m setting achieves the best performance with an average deviation error of 14 mm, while the 2.5m setting has the largest average deviation error of 19 mm. These results demonstrate that our HPR model works well up to 2.5 m, whereas vision approaches usually work within 1 m and Leap Motion works within 0.5m.

Impact of Different Poses. We also evaluate the reconstruction deviation error for the 10 hand poses defined above.
As shown in Figure 5.16 (d), the reconstructed y has the largest deviation error compared with x and z, especially for hand pose (b), pointing with the index finger. The reason is that the finger planes of the ring and little fingers are not exactly perpendicular to the projected palm plane, as assumed in our simplified HPR model. Among the 10 hand poses, pose (j) achieves the lowest average reconstruction deviation error of 7.6 mm.

• Reconstructing Latency. For hand pose reconstruction, the main advantages of RoFin over vision-based approaches are its fewer tracked key points and its flexible, longer sensing distance. We evaluate the hand pose reconstruction latency and compare it with the vision-based approach MediaPipe, run on the same platform (a ThinkPad T480 with an Intel(R) Core(TM) i7-8650U CPU) for the different hand poses at the same 0.5m distance and strong ambient light setting. As shown in Figure 5.16 (e), the latency of the RoFin HPR model is below 21 ms, with an average latency of 13.8 ms (72 Hz), which is less than 16.7 ms (60 Hz). The vision-based MediaPipe has an average latency of 47.5 ms. Although finger label parsing requires about 12 ms per image frame, the label parsing module and the HPR module can still run in a pipelined manner to achieve real-time processing. These results demonstrate that our HPR model achieves real-time hand pose reconstruction because it tracks only 6 key points with a simplified HPR model.

5.7.6 Use Cases

In this subsection, we describe three potential use cases for RoFin gloves, corresponding to RoFin's three main features: (1) finger/hand identification, (2) fine-grained inside-frame X/Y tracking, and (3) real-time hand pose reconstruction.

Multi-user Interaction for AR/VR/MR. RoFin can track inside-frame X/Y location samples at the rolling shutter rate and thus provides fine-grained finger tracking, especially for high-speed or small-scale motion. Multiple users can use their fingertips to write or paint virtually at the same time in front of the camera. Thus, RoFin can serve as a user interface with a better user experience for AR/VR/MR and with privacy protection, since users may only want the camera to capture the writing trace rather than their faces, as shown in Figure 5.17 (a).

Virtual Writing or Health Monitoring for Parkinson's Sufferers. Our RoFin system can track a fine-grained writing trace, including subtle trembling, whereas vision-based approaches (one inside-frame location sample per frame at about 60 fps) cannot track it clearly, much like human eyes. Parkinson's sufferers can use our RoFin glove to write characters virtually. We can then use the fine-grained trace tracked by RoFin to better smooth the trace (e.g., by connecting the midpoints of adjacent trace sub-lines), as shown in Figure 5.17 (b). Besides, the tracked fine-grained trace can also be utilized for medical diagnosis and health monitoring.

Hand Pose Commands for Video Games/Smart Homes. RoFin achieves real-time hand pose reconstruction with less computation overhead and high accuracy.
With use cases similar to other hand gesture recognition approaches, our low-cost RoFin system can be used as a hand pose command input interface for video games and smart homes. Figure 5.17 (c) shows reconstructed hand pose examples produced by RoFin's HPR model.

Figure 5.17 Three possible use cases for our low-cost RoFin: (1) multi-user MR interactions with identification and protected privacy, (2) finer-grained tracking of the writing of a Parkinson's sufferer[68], (3) real-time hand pose commands.

5.8 Discussion and Summary

Non-vision based Solutions. Our RoFin outperforms the vision-based approach in several aspects: (1) it provides finger indication, (2) finer-grained finger tracking at the same frame rate setting, (3) fewer tracked key points and faster hand pose reconstruction, (4) a longer working distance and robustness under varied ambient light, and (5) privacy protection and low cost. As for non-vision-based solutions, there are two types: (1) on-body sensor based approaches[12, 52, 48], and (2) hand-free approaches[120, 61, 75, 59]. Compared with our RoFin, these approaches have some limitations: (1) they require specific or expensive sensors and devices instead of commercial LED nodes, such as mmWave chips or FBG sensors, (2) their sensing distances are limited to the near-hand area (i.e., within 0.5m), and (3) they lack finger or hand identification ability and thus cannot serve multiple users with user identification.

Privacy Leakage. Vision approaches such as human eyes and MediaPipe, as well as Leap Motion with its integrated camera, can cause privacy leakage. One example is shown in Figure 5.18. The user issues a hand pose command while his/her hand holds a bank card. Leap Motion and MediaPipe leak sensitive information (i.e., the CVV number), which may result in financial loss.

Figure 5.18 Sensitive data leakage of vision-based SOTA approaches.

Power Consumption and Safety. Our RoFin gloves are made from electrically insulating rubber gloves, and the voltage at the LED node side is less than 3V, ensuring the safety of users wearing the gloves. The current through one RoFin glove's circuit is 75 mA, and the power consumption is 225 mW. With our 600 mAh, 9 V li-ion battery, one RoFin glove can work for approximately 5.4 Wh / 225 mW = 24 hours before needing to be recharged.

Limitation of RoFin. Compared with hand-free approaches such as vision-based methods, our current RoFin prototype requires the user to wear gloves attached with plastic spheres and includes wires and a battery. This limitation can be relieved in the future by ergonomic design, textile techniques, energy harvesting, or even passive labeling optimization.

Future Directions. (1) Optimize the spheres and explore backscatter-based passive fingertip labeling: we can decrease the sphere size and exploit energy harvesting techniques for reduced weight and ease of use. (2) Update the HPR model.
We can improve HPR for hand poses in which the finger planes are not perpendicular to the projected palm plane (e.g., hand pose (b)). (3) Extend RoFin to body gesture recognition: the core idea of RoFin can easily be extended to human body gesture reconstruction with predictable benefits.

In summary, we exploit 2D temporal-spatial rolling to reconstruct 3D hand poses. We address technical challenges in the RoFin system design and implementation, e.g., active optical labeling of fingertips, fine-grained 3D information parsing of rolling fingertips, and lightweight 20-joint 3D hand pose reconstruction from 6 tracked key points. We then undertake studies using RoFin gloves in a variety of circumstances. The results demonstrate that RoFin can robustly identify fingers, parse fine-grained 3D info, and achieve real-time hand pose reconstruction. RoFin is a low-cost but effective solution for human-computer interaction with promising use cases.

CHAPTER 6
4D SPATIAL-TEMPORAL DIVERSITIES IN SWARMING DRONES

Drones have become increasingly popular in both the industry and research communities due to their numerous advantages, such as low cost, small size, adaptability, ease of use, and a wide range of potential applications. However, the current control method for swarming drones relies on stand-alone modes and centralized radio frequency control from a ground-based base station, which lacks drone-to-drone communication. This approach has several drawbacks, including a crowded RF spectrum with mutual interference, high latency, and a lack of on-site drone-to-drone interactions.

To address these limitations, we propose PoseFly, an AI-assisted Optical Camera Communication (OCC) system designed for drone clusters. OCC offers several benefits, including high spatial multiplexing capability, Line of Sight (LoS) security, broader bandwidth, and an intuitive vision-based manner. By leveraging the rolling shutter effect in drone sensing and communication, PoseFly provides drone identification, on-site localization, quick-link communication, and lighting functionalities. This innovative approach offers a more efficient and reliable solution for sensing and communication within drone clusters, enhancing their overall performance and capabilities.

6.1 Motivation

Drones, one type of unmanned aerial vehicle (UAV), attract increasing attention because of their advantages over manned aircraft, including their small size, low cost, simplicity of operation, and broad potential applications[112, 53, 103, 93, 79]. Drones are now used in a variety of fields, such as aerial photography, plant protection, express delivery, transportation, animal monitoring, surveying and mapping, power inspection, disaster relief, news reporting, selfies, and film and television production. Drones are projected to play significant roles in the integrated development of sensing, communication, and computing in the near future, owing to ongoing advances in artificial intelligence and their superior mobility. According to Verified Market Research, the global drone market, worth an estimated USD 19.23 billion in 2020, is expected to grow to USD 63.05 billion by 2028, with a CAGR of 16.01 percent between 2021 and 2028[28].

Nonetheless, the current approach to drone control relies on a centralized base station (CBS) on the ground. This technique has several limitations, including RF spectrum congestion, which causes interference, significant latency, and the absence of real-time drone-to-drone interactions on-site.
In centralized control, the transmission between the drones and the CBS can naturally be avoided through on-site interactions among drones in a distributed manner. We could use RF to establish distributed drone-to-drone communication. However, due to Non-Line-of-Sight (NLoS) propagation, eavesdroppers can easily detect RF signals, and there are nontrivial multi-path effects and the resulting mutual interference[6, 36]. Even though there is no back-and-forth communication cost between drones and the CBS in RF-based distributed drone-to-drone communication, the growing drone population may cause the RF spectrum to become crowded, which could lead to more localization errors owing to retransmission and lag.

There are two main issues in localizing drones with high mobility: (1) computing a drone's localization information, including distance, posture, speed, and so on; and (2) promptly receiving the computed localization information. In fact, we can use the on-site posture features of a drone (the transmitter) and compute at the receiving side (another drone) instead of relying on the transmitter's IMU, to reduce transmission overhead. For instance, when a flock of geese is flying together, goose A (the receiver) observes goose B (the transmitter) and processes B's posture features in A's brain, rather than goose B computing its own position and notifying A.

Figure 6.1 PoseFly: 4-in-1 OCC for swarming drones, similar to geese flying and their relative localization and collaboration.

To overcome the limitations of existing work, we introduce PoseFly, a novel approach that leverages the 2D spatial-temporal diversities of rolling shutter cameras for on-site drone positioning. As depicted in Figure 6.1, PoseFly makes use of four inexpensive LEDs with plastic covers. One of these LEDs is red, and the remaining three are green. The red LED in the front-left corner of the drone emits unique cyclic OOK (On-Off Keying) waves, serving as an optical identification for each drone. As a result, drones with onboard cameras can easily identify one another. Furthermore, when coupled with the green LEDs, the red LED aids in localization. PoseFly precisely calculates the positions of the drones and enables rapid data flow between them via Optical Camera Communication (OCC) links by evaluating changes in the arrangement of these LEDs.

6.2 Background and Related Work

6.2.1 Drone Identification

Vision-based methods could be used to identify drones. For example, a camera can take an image of a drone and identify it based on its shape and features; the reader then uses a grayscale image of the scene and detects the drone based on its silhouette[104]. However, these systems cannot work well at night, as the captured images of drones are not clear enough, nor do they work at longer distances. RF systems can identify drones in a few ways. Drones typically communicate at a much higher frequency than other mobile devices. If the RF connection is monitored, the frequency used could indicate whether a device is a drone or not. However, other wireless devices may communicate at the same frequency, which can cause misidentification[78].
Instead of the clear images with complete morphology needed by computer vision, or the ambiguous RF spectrum indication, PoseFly[144] simply requires one active LED node that carries the indication sequence and can work well in both day and night.

6.2.2 Drone Localization

We present related work on drone localization below, as illustrated in Figure 6.2. (1) RF. Current RF-based drone localization methods are based on received signal strength or time difference of arrival. By monitoring the signal strength of an emitter or the change in its time of arrival, a receiver could determine the direction and speed of the drone. However, interference in the path can corrupt the localization results[77]. (2) Vision. Vision-based localization approaches use cameras to record several frames of a scene, then detect a drone and calculate its velocity and future position[90]. While this is certainly effective, it has non-trivial processing overhead, especially for image processing of the drone's morphology against a varied background when the drone is flying. (3) IMU. Drones can also measure their own localization data (e.g., position and velocity) via an inertial measurement unit (IMU) and send them out to other drones. However, these messages would need to be sent constantly and received over long distances. Thus, IMU-based methods have non-trivial send-out communication overhead and time delay, especially when there are numerous drones with severe interference[98]. (4) GPS. Although GPS systems can provide accurate location information, they also have send-out costs and cannot work well in urban canyons, caves, or tunnels. (5) LiDAR. LiDAR systems can provide on-site localization of nearby drones; however, they have high energy consumption.

Figure 6.2 Drone localization approaches: GPS, IMU, vision, RF, LiDAR, and PoseFly.

In contrast to the above-mentioned drone localization approaches, PoseFly requires only one image frame to determine velocity and orientation. PoseFly uses 4 LED nodes to indicate which direction the drone is facing, allowing its orientation to be determined. Velocity can also be found through the orbs: the faster the drone moves, the more the orbs deform in one direction. PoseFly is free from interference among multiple drones thanks to the spatial diversity of the camera's millions of pixels, which capture different drones in different image zones. The illuminated balls allow PoseFly to work during day and night over flexible distances. Since these energy-efficient LED balls also provide a lighting function, PoseFly is a green localization approach. Moreover, PoseFly's localization does not incur a send-out cost, because the reader captures the drone's image (light propagates at about 3×10^8 m/s) and then processes it locally. Besides, PoseFly's on-site localization relies only on the drones themselves and thus can work in caves and tunnels where GPS cannot.

6.2.3 Drone Communication

Today, most drones communicate via the radio frequency medium. RF signals can travel over relatively long distances. However, RF systems can be prone to eavesdroppers, jammers, and interference[29]. The RF signal is sent through open space, and anybody can listen or send their own confounding signals. PoseFly is based on Line-of-Sight propagation, so its signals can be blocked from attackers outside the swarming drones, which makes it more secure than RF-based communication.
Similarly, jammers would have to send more light directly into the receiver to jam the camera.

6.3 Our Approach: PoseFly

Our proposed PoseFly is composed of two parts, as illustrated in Figure 6.3: (1) a commercial-LED-based PoseFly transmitter and (2) an AI-assisted, commercial-camera-based PoseFly reader. One drone can carry both a transmitter and a receiver and act as a transceiver.

PoseFly transmitter. The PoseFly transmitter consists of 4 commercial low-power LED components, one attached to each corner of a four-rotor drone. These 4 LEDs, one red and the other three green, are covered with plastic balls of the same color and controlled by an Arduino Nano.

PoseFly receiver/reader. The PoseFly reader is based on commercial cameras, which can be the cameras mounted on the drones. These cameras use adjustable focal length lenses and configurable rolling shutter rates and frame rates.

Figure 6.3 The system overview including transmitter and receiver, and the workflow of PoseFly.

Four Integrated Functions: (1) Drone identification: The red LED generates OOK waves with cyclic pilots to indicate the index of a drone in the drone cluster. For example, the OOK wave [On, Off, Off, On] indicates that the index of the drone is 0b1001, i.e., #9. (2) Drone on-site localization: The PoseFly reader can estimate the distance from the transmitter to the reader based on the captured size of the four LEDs. Furthermore, the reader can perform on-site angle parsing based on the shape and color pattern formed by the four LEDs. Additionally, the shape of the rolling spot varies from a circle to an ellipse at different drone motion speeds, which helps the reader estimate speed. (3) Drone quick-link: At the same time, the other three green LEDs create a quick-link channel among nearby drones by fast on-off switching. (4) Lighting: These LED components provide a lighting function in dark environments or at night.

Workflow: As shown in Figure 6.3, these four functions are achieved step by step at different distances between two drones. (1) First, a drone, Drone A, notices a bright spot through its camera at long distance (>20m); the spot is another drone, Drone B, seen via B's lighting function. (2) Drone A then flies closer to B based on its distance estimation function (<20m) and conducts drone identification (<12m) to learn Drone B's index number in the cluster. (3) Later, Drone A flies closer to B and performs finer-grained localization of B, such as estimating B's motion speed and posture angle. (4) When the two drones require mutual data sharing, they can fly to within 4m and utilize the quick-link channel to share information such as flight instructions and the on-site posture info of other drones.

There are three main technical challenges, as illustrated in Figure 6.4 and outlined below:

C1: Robust identification of drones at long distances.
Unlike geese, drones cannot easily recognize other drones with similar appearances through visual recognition alone. To address this, we propose attaching optical marks or labels to drones. However, traditional static marks or existing bar/QR codes are passive and can only work within a limited recognition distance, typically around 1 meter.

C2: Lightweight yet precise localization (distance, speed, angle). Geese can sense the posture of other geese using various visual features, such as the head, wings, and feet. However, applying the same method to sense a drone's posture would introduce non-trivial computation overhead, which is not desirable for real-time applications.

C3: Decoding asynchronous rolling strips in rolling spots with random locations in a frame. The rolling strips generated in each rolling spot are not synchronized for decoding when the drones are flying. This asynchrony poses a challenge in efficiently and accurately decoding the information from the rolling strips, particularly when they appear at random locations within a frame.

Figure 6.4 Three main challenges in PoseFly: robust drone indication, asynchronous spatial data combination, and localization with high motion.

Our contribution can be summarized as follows:

(1) This is the first work to exploit rolling patterns, which were previously used solely for optical camera communication, for on-site drone posture parsing, including relative distance, speed, and angle estimation.

(2) We thoroughly investigate the spatial rolling patterns and design the 4-in-1 PoseFly, an AI-assisted approach for drone identification, drone localization, drone communication, and lighting with commercial LEDs and cameras.

(3) We address the challenges via cyclic pilots and OOK for active optical labeling and robust quick-link communication. We adopt CNN models for accurate and robust identification and localization at the receiver side.

(4) We evaluate PoseFly on our implemented prototypes in both day and night with varying distances and motion speeds. The experimental results show that PoseFly can identify drones with nearly 100% accuracy within 12m while providing accurate pose parsing (100% distance estimation within 20m, and 100% speed and angle estimation within 4m). Additionally, PoseFly provides a quick-link channel of 5 Kbps on average at up to 4m.

6.4 Drone Identification

For drone interactions, drone detection is critical. However, current optical labels like barcodes and QR codes are passive and only function at close ranges of a few centimeters. To overcome this limitation, we design active optical labels for drone identification at long distances (up to 12m). We present our active optical label design at the transmitter side and the CNN-based robust label parsing solution below.

6.4.1 High-capacity Optical Labeling

Rolling Shutter Strip Effect. A global shutter exposes the entire scene at once.
6.4.1 High-capacity Optical Labeling

Rolling shutter strip effect. A global shutter exposes the entire scene at once. In contrast, the rolling shutter in commercial CMOS cameras exposes one row of pixels at a time while building an entire image row by row. Figure 6.5 illustrates the rolling shutter strip effect, which occurs when the rolling shutter speed and the switching speed of the light wave from the transmitter are roughly equal. Thus, temporal optical signals carrying the transmitted data during symbol periods can be successively captured as rolling strips.

CP-OOK Label Wave Design. In PoseFly, each drone is identified by an optical label that regularly emits distinct amplitude waves that are invisible to human eyes (the on-off switching rate is too high, e.g., above the kHz range, to be perceived by human eyes [124, 2]).

Figure 6.5 Rolling strip effect and cyclic CP-based active optical label design: 4 OOK symbols denote up to 16 drones.

The optical label is comprised of two components: (1) a CP (cyclic pilot), which begins with one symbol period of adjustable length (strip width) and is used to delimit an entire optical label, and (2) indication symbols, which are made up of four (or more) OOK (on-off keying) symbols. Besides the darkness of the Off symbol, there are two amplitude levels generated by PWM (pulse width modulation) control: the On symbol has a lower brightness than the CP symbol, while the CP symbol has the highest brightness.

High Indication Capacity. We embed the drone's binary index into the OOK indication symbols. For example, when the drone index is 9, the binary number is 1001 and the indication symbols are [On, Off, Off, On]. The number of drones in the cluster determines how long the indication symbols are: 4 OOK symbols can indicate up to 16 drones. In general, N OOK symbols can represent $2^N$ numbers for $2^N$ drones, which is promising for high-capacity indication and identification of drone swarms. Although some drones may be very close and appear in the camera's FOV at the same time, their different optical labels tell the observing drone who they are.
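To illustrate the label format described above, the short sketch below (not the authors' transmitter firmware) maps a drone index to one cycle of its CP-OOK label using three symbolic brightness levels; the constant names CP_LEVEL/ON_LEVEL/OFF_LEVEL are placeholders for the PWM duty cycles an actual transmitter would use.

```python
# Illustrative encoder for the CP-OOK active label; not the authors' firmware.
# Three symbolic brightness levels are assumed: CP (brightest), On, Off (dark).

CP_LEVEL, ON_LEVEL, OFF_LEVEL = 2, 1, 0   # placeholders for PWM duty cycles

def encode_label(drone_index, n_symbols=4):
    """One cycle of the label: 1 cyclic-pilot symbol + N OOK indication symbols."""
    if not 0 <= drone_index < 2 ** n_symbols:
        raise ValueError("index exceeds the 2^N label capacity")
    bits = [(drone_index >> i) & 1 for i in reversed(range(n_symbols))]
    return [CP_LEVEL] + [ON_LEVEL if b else OFF_LEVEL for b in bits]

# Drone #9 -> 0b1001 -> [CP, On, Off, Off, On]; the transmitter repeats this
# cycle continuously so that every camera frame contains complete labels.
print(encode_label(9))   # [2, 1, 0, 0, 1]
```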
6.4.2 CNN based Robust Label Parsing

Traditionally, an amplitude threshold would be used to decode these optical labels. However, it is difficult to configure the threshold dynamically due to the drones' nonlinear movement, the long distances, and the dynamic optical environment. We therefore adopt convolutional neural network (CNN) based label parsing in PoseFly to avoid this complexity and decoding overhead, for the following reasons: (1) training is offline and identification is online, which reduces latency for real-time drone label parsing; (2) the CNN model can learn the features in the repeated dark and bright rolling strips even in conditions where it is difficult to distinguish the amplitudes of CP and On.

Figure 6.6 Adopted CNN networks in PoseFly: ResNet-18 with modified fully connected layers.

We capture real images of the optical labels from 15 drones at various distances, in day and night, to use as training data. The CNN models adopted in PoseFly, shown in Figure 6.6, use the ResNet-18 architecture. They are the Drone Identification Model (DIM), Distance Estimation Model (DEM), Speed Estimation Model (SEM), and Angle Parsing Model (APM). ResNet has demonstrated exceptional performance on image classification benchmarks [17, 18, 1], which makes it well suited to our objective of identifying rolling strip patterns and the shapes and color patterns they create. The output dimension of the last fully connected layer is modified to match the number of classes (15 in DIM, 5 in DEM, 4 in SEM, and 8 in APM) while all other layers remain unchanged.

6.5 Drone Localization

The on-site drone localization (pose parsing) in our proposed PoseFly consists of three parts: (1) distance estimation, (2) relative speed estimation, and (3) on-site angle parsing. We present the challenges and design details below.

6.5.1 Relative Distance Estimation

For drone localization, the perception and estimation of distance is very important for interactions among flying drones. For example, an accurate estimate of the distance between two drones can avoid unexpected collisions and maintain specific flight formations, similar to geese flying in formation for complex collaborative tasks. The quadrangle generated by the four LED spots of the PoseFly transmitter gives another drone a rough sense of the distance between them. We use the size of the captured quadrangle to infer the current relative distance between two drones.

Figure 6.7 Distance estimation via perspective principle: longer distance, smaller captured drone size.

As shown at the bottom of Figure 6.7, we can estimate the distance from the captured drone size because the drone appears larger as it gets closer to the other drone, following the perspective principle. We first collect captured images (the camera is set to a fixed focal length) at different distances and use this data set to train the CNN classification model offline. We then use the trained CNN model to predict the current relative distance between two drones in real time. To filter out strong ambient light and emphasize the four colored spots, we set the rolling shutter to a high shutter speed, such as 4000 Hz in our experiments. In the current version of PoseFly, we use 5 distance classes: 4m, 8m, 12m, 16m, and 20m. The captured quadrangles in day and night with random poses are shown in Figure 6.12 (c).
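As a concrete illustration of the four classifiers, the sketch below builds ResNet-18 models with the modified fully connected layers described above; it assumes a PyTorch/torchvision setup, which the text does not specify, so this is only one possible realization.

```python
# One possible realization of the four classifiers (DIM, DEM, SEM, APM),
# assuming a PyTorch/torchvision setup; the text does not name a framework.
import torch.nn as nn
from torchvision.models import resnet18

def make_posefly_head(num_classes):
    """ResNet-18 backbone with only the final fully connected layer replaced."""
    model = resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

dim = make_posefly_head(15)  # Drone Identification Model: 15 optical labels
dem = make_posefly_head(5)   # Distance Estimation Model: {4, 8, 12, 16, 20} m
sem = make_posefly_head(4)   # Speed Estimation Model: {static, low, medium, fast}
apm = make_posefly_head(8)   # Angle Parsing Model: 8 relative angles
```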
6.5.2 Relative Speed Estimation

As with distance estimation, drone speed is critical for drones' collaboration and collision avoidance. In PoseFly, we exploit the relation we discovered between motion speed and the varying shape of the spot generated by one of the four LEDs.

First, we explore the relation between different motion speeds and the captured spot shape at a fixed distance between the camera and the light source. We set different motion speeds of the light source to simulate the drone's motion and capture the shape of the generated spot. As shown in Figure 6.8, we set 4 levels of movement speed of the light source (i.e., static, low, medium, and fast) and move the light source along the same path (↗) without any movement in the front-back direction; the shape of the captured rolling pattern changes accordingly. As the speed of the light source increases, the shape morphs from a circle into an ellipse and the length of the ellipse's long axis grows, both for light sources embedding data and for those without embedded data.

Figure 6.8 Relation between motion speed and spot shape: the faster the speed, the larger the shape variation of the spot.

In PoseFly, we capture images of the shape of each spot generated by the four LEDs for speed estimation within 4m. To make the SEM more robust, we capture these images in day and night at 4 different motion speeds with random moving paths and use them as the training dataset for the SEM.

6.5.3 Relative Angle Parsing

We model the drone as a rigid body and use the four LEDs to denote the drone's bottom plane. The red LED is mounted at the left-front corner of the drone and can be treated as the positioning element that denotes the facing angle of the drone.

Figure 6.9 On-site angle parsing via colored-arc variation.

As shown in Figure 6.9, we define the relative angle as 0° when the camera captures the drone's tail end. As the drone turns, the captured red spot rotates clockwise in 45° increments. Using this rule, we define 8 relative angle states: [0° or 360°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°]. In principle, we can determine the relative angle of the captured drone from the position of the red spot within the color arc detected in the image. However, due to the small size of the LED spots in captured images, it is hard to judge the relative angle directly. Thus, we employ CNN models to learn relative-angle features offline and then predict the relative angle in the captured image in real time, similar to the AI method used for optical label parsing, distance estimation, and relative speed estimation. As before, we set a high rolling shutter speed to suppress ambient light when capturing the images of color arcs. The captured training images at 4m in day and night are shown at the bottom of Figure 6.9.
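For intuition only, the sketch below shows a naive geometric version of the color-arc rule described above: quantizing the bearing of the red spot, relative to the centroid of the four spots, into the 8 angle classes. PoseFly itself uses the CNN based APM for this task; the spot pixel coordinates here are assumed to come from some hypothetical spot detector, and the mapping is a simplification.

```python
# Naive geometric version of the color-arc rule, for intuition only; PoseFly
# uses the CNN based APM. Spot pixel coordinates are assumed to come from a
# hypothetical spot detector, and the bearing-to-class mapping is simplified.
import math

ANGLE_CLASSES = [0, 45, 90, 135, 180, 225, 270, 315]   # degrees

def quantize_relative_angle(red_xy, green_xys):
    """Quantize the red spot's bearing w.r.t. the four-spot centroid into 8 bins."""
    points = [red_xy] + list(green_xys)
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    # Image y grows downward, so negate dy to obtain a conventional bearing.
    bearing = math.degrees(math.atan2(-(red_xy[1] - cy), red_xy[0] - cx)) % 360
    return ANGLE_CLASSES[int((bearing + 22.5) // 45) % 8]

print(quantize_relative_angle((120, 80), [(80, 80), (80, 120), (120, 120)]))  # 45
```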
6.6 Drone Quick-Link

The sensed postures of nearby drones can be stored locally for a drone's own use. At the same time, this posture information can be shared with nearby drones, and the communication range can be extended by using some drones as relay nodes. Thus, even if some drones are far away or blocked from the line of sight (LoS) by other drones, they can still communicate with each other. To achieve this goal, we design a quick-link channel for data sharing and communication; the details of the PoseFly quick-link are presented below.

6.6.1 Modulation Design

Quick-link is one type of OCC, which provides data sharing for small amounts of burst data [2]. In PoseFly, we design the quick-link to provide a robust optical channel at a comparable data rate (hundreds of bps to several Kbps) while the other three functions run simultaneously. The challenge is that the three captured green spots are randomly located within a captured image frame due to the high motion of the drone, and their locations vary from frame to frame. Thus, even if we successfully record the data in one of the three green spots, we cannot tell which spot it is and therefore cannot complete the decoding correctly. Furthermore, unlike the optical labels, adopting PWM-based amplitude shift keying here would sacrifice transmission bandwidth and decrease the data rate significantly.

In PoseFly, we first determine which green spot it is (i.e., 𝐿1, 𝐿2, or 𝐿3) based on the colored arc in the captured image. For the modulation within each green spot, we design CP (cyclic preamble) based cyclic OOK data sequences with only bright and dark amplitude levels for a robust quick link. The CP takes the same duration as the CP in the optical labels. The CP in the green spots consists of dark strips with adjustable width. The symbol length of the OOK data sequence is set to 32 bits, with the first and last symbols fixed to On as gaps between the CP and the valid data symbols, as shown in Figure 6.10.

Figure 6.10 Quick link modulation design in PoseFly.

The data sequence may contain a run of dark strips as long as the CP, which could make it hard to recognize the CP among the rolling strips. Nevertheless, we can give the CP a long symbol length to prevent such confusion during decoding. For example, if we set the CP to 8 symbol periods, the probability that the data sequence contains 8 continuous Off symbols is $(30-8)/C_{30}^{8} \approx 4\times10^{-6}$, which is low enough to neglect potential conflicts. Thus, we set the CP as 8 continuous Off symbols. The amount of data embedded in each spot depends on how many rolling strips it contains, and the total data amount in one image frame is the sum of the strips across all three spots. In each frame, we embed different data into the three green LEDs and choose symbol durations for the OOK symbols and CP that guarantee at least one entire cyclic CP plus data sequence appears in each spot. To robustly detect the data symbols between CPs, PoseFly performs quick-link communication within 4m.

Figure 6.11 Robust symbol detection when drones are flying.

As shown in Figure 6.11, whatever the positions of the three spots are in a captured frame under different motion, the strips remain clear. Using the three transmission units recorded in each frame, we can collect the data from each green spot and then reconstruct the bit stream. Finally, the data is transferred over the quick link provided by PoseFly, frame by frame.
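The following sketch (illustrative, not the authors' decoder) shows how the CP of 8 continuous Off symbols and the On gaps can be used to extract the 30 payload bits from one green spot, assuming the strip amplitudes have already been binarized into a 0/1 symbol sequence.

```python
# Illustrative decoder for one green spot; not the authors' implementation.
# Input: the spot's rolling strips already binarized into a 0/1 symbol list.
# One unit = CP (8 continuous Off) + On gap + 30 data bits + On gap.

CP_PATTERN = [0] * 8     # cyclic preamble: 8 continuous Off symbols
UNIT_LEN = 32            # gap(1) + 30 data bits + gap(1)

def decode_spot(symbols):
    """Return the 30 payload bits of the first complete CP + data unit, if any."""
    for i in range(len(symbols) - len(CP_PATTERN) - UNIT_LEN + 1):
        if symbols[i:i + len(CP_PATTERN)] == CP_PATTERN:        # locate the CP
            unit = symbols[i + len(CP_PATTERN): i + len(CP_PATTERN) + UNIT_LEN]
            if unit[0] == 1 and unit[-1] == 1:                  # check the On gaps
                return unit[1:-1]
    return None   # no complete unit captured in this spot

# A frame carries three such spots; concatenating their 30-bit payloads
# reconstructs the 90 bits embedded per frame.
```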
In our prototype, each image frame embeds 30×3 (the number of spots) = 90 valid OOK data symbols (i.e., 90 bits). With the camera frame rate set to 60 frames per second, the quick link in our proposed PoseFly achieves 60×90 = 5400 bits per second, i.e., 5.4 Kbps, which is sufficient for quick-link communication among drones to send commands, urgent messages, and drone pose information.

6.7 Implementation and Evaluation

6.7.1 Transmitter

We implement the PoseFly transmitter prototype for experiments as shown in Figure 6.12. The main components of one PoseFly prototype are listed in Table 6.1: an entry-level drone, 1 Arduino Nano MCU, and 1 red and 3 green LEDs wrapped in 1 red and 3 green plastic balls (φ = 19mm). The total weight of the added components, excluding the drone, is 25g (the drone's own battery powers the Arduino Nano), and their total price, excluding the drone, is only about 12$.

6.7.2 Receiver

Numerous commercial smart devices can serve as the PoseFly reader in our prototype. As shown in Figure 6.12 (b), such camera devices are widely available and reasonably priced, such as the VIVO Y71A and the iPhone 7 we used. To extend the working distance of PoseFly, we use a commercial portable lens for smartphone photography; the lens we used costs about 5$. This universal 20x lens can capture clear images of objects at long distance. In real use scenarios, PoseFly readers are the cameras mounted on drones, similar to the cameras in our prototype.

Figure 6.12 PoseFly implementation including transmitter (a) and receiver (b). The experiment scenarios and setup (c).

Component           Price ($)   Details
entry-level drone   40          size: 14cm × 14cm, 125g
Arduino Nano        10          ATmega328P, 5V, 16MHz
LED                 0.1         5mm, green/red, 20000mcd, 20mA
plastic cover       0.3         19mm, green, lightweight
portable lens       ≈ 8         Bostionye 20x mobile lens
Total price         < 60        cheaper if mass produced
Table 6.1 Components in PoseFly.

6.7.3 Setup

Drone size. The drone used in our prototypes is tiny: 14cm×14cm. In the future, PoseFly can be equipped on bigger drones (e.g., 1m×1m) for better performance, such as longer distance and higher data rate, thanks to stronger LED power and a larger number of strips shown in each LED spot.

Different optical environments. Figure 6.12 (c) shows our implemented PoseFly transmitter flying in two environments (day and night), as well as the experiment scenarios in day and night at different distances.

Simulating drone flight. In our experiments, we hold the drone in hand or hang it on a hanger and simulate flight at different distances, angles, and speeds relative to the PoseFly receiver (a smartphone), in both day and night.

We evaluate PoseFly's performance on our implemented testbed in three respects: (1) drone identification accuracy, (2) drone localization accuracy, including distance, speed, and angle estimation, and (3) quick-link performance.
Finally, we measure the computation overhead and the latency caused by the running time of each function, and compare PoseFly with state-of-the-art approaches.

Figure 6.13 Drone identification: (a) captured optical labels of #4 at different distances, (b) optical label identification accuracy in both day and night, (c) training loss curves over epochs [0, 200] in both day and night.

6.7.4 Identification Accuracy

In our experiment, we evaluate the identification accuracy of 15 active optical labels with index numbers in the range [1, 15]. We capture the optical labels shown in the red LED spot at 3 distance settings (4m, 8m, and 12m), in both day and night, with random drone postures. We capture 10 images for each setting (a specific optical label, a specific distance, day/night), for a total of 10×15×3×2 = 900 images as the training dataset. Sampled images of label #4 are shown in Figure 6.13 (a). We evaluate the label identification accuracy in day and night, along with the training loss over [0, 200] epochs. Although the number of strips displayed on the cover decreases as the distance from the drone to the camera grows, and the labels become hard for human eyes to recognize as shown in Figure 6.13 (a), the cyclic rolling pattern remains distinctive enough for the CNN to classify, as demonstrated in Figure 6.13 (b). The identification accuracy over the 15 optical labels averages 100% in the daytime and more than 97% at night. The training loss curve for the daytime dataset drops faster and earlier than the nighttime one, as shown in Figure 6.13 (c). The reason is that it is harder to distinguish the amplitudes of the CP and On symbols at night due to the fusion of optical signals.

6.7.5 Localization Accuracy

(1) Distance Estimation. We evaluate the distance estimation accuracy over 5 settings: [4m, 8m, 12m, 16m, 20m]. We capture the spot shape of the drone with random postures and speeds in both day and night. We capture 10 images for each setting (a specific distance, day/night), for a total of 10×5×2 = 100 images as the training dataset. As shown in Figure 6.14 (a), the distance estimation accuracy in the daytime reaches 100% across all distance settings, which demonstrates that PoseFly can provide distance ranging among drones within 20m in the daytime. Similarly, PoseFly works well for distance estimation at night, with 100% accuracy within 20m.

Figure 6.14 Drone localization performance: (a) distance estimation accuracy, (b) speed estimation accuracy, (c) angle estimation accuracy in both day and night with models saved at the 200th epoch.

(2) Relative Speed Estimation. We evaluate the speed estimation accuracy over 4 settings: [static, low, medium, fast].
We capture the spot shape of the drone with random postures at 4m in both day and night. We capture 10 images for each setting (a specific speed, day/night), for a total of 10×4×2 = 80 images as the training dataset. As shown in Figure 6.14 (b), the speed estimation accuracy across all four speed settings reaches 100% for both day and night, which demonstrates that PoseFly can provide robust relative speed estimation among drones.

(3) Relative Angle Parsing. We evaluate the relative angle estimation accuracy over 8 settings: [0° or 360°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°]. We capture the spot shape of the drone with random speeds at 4m in both day and night. We capture 10 images for each setting (a specific relative angle, day/night), for a total of 10×8×2 = 160 images as the training dataset. As shown in Figure 6.14 (c), the CNN model saved at the 200th epoch classifies the drones into the 8 relative angle options accurately, with 100% estimation accuracy within a 4m sensing distance for both day and night.

To sum up, our AI-assisted drone pose parsing/localization works well in all three aspects, in both day and night, at different distances for flying drones.

Figure 6.15 Quick link performance: (a) BER, and (b) throughput.

6.7.6 Quick-link Evaluation

We evaluate the quick-link performance within 4m (0.5m, 1m, 1.5m, 2m, 2.5m, 3m, 3.5m, 4m) in both day and night. We set the shutter speed properly (12 kHz) with respect to the transmission frequency to capture clear rolling strips on the three green spots in each frame, and set the frame rate to 60 FPS. For each setting (a specific distance, day/night), we capture 10 images, for a total of 8×2×10 = 160 images, to measure the BER and achieved throughput.

BER performance. We decode the OOK data sequence between two CPs. As shown in Figure 6.15 (a), the bit error rate in each frame is 0 within 2m for both day and night. As the distance increases, the BER increases as well due to the weaker optical signals at longer distances. Nevertheless, our prototype still achieves an average BER below 0.08 at 4m. The BER in the daytime is higher than at night because, at the same distance, the strong ambient light in the daytime reduces the amplitude gap between the captured On and Off symbols.

Throughput performance. The valid data bits in each frame are the sum of the valid data in the three green spots, calculated as 30 bits (32−2) × 3 spots × frame rate (60 FPS) × (1−BER). As shown in Figure 6.15 (b), PoseFly achieves 5.4 Kbps within 2m for both day and night. Although the throughput drops with increased transmission distance, the loss is limited. Even at 4m, PoseFly still achieves an average throughput over 5 Kbps. Although the captured spot size decreases with increased distance, we can still capture complete and differentiable strips at 4m with the lens.
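The throughput formula above can be checked numerically with a few lines; the sketch below simply evaluates 30 × 3 × 60 × (1 − BER) for the BER values reported in the text.

```python
# Numeric check of the throughput formula above, using values from the text.
def quicklink_throughput(ber, bits_per_spot=30, spots=3, fps=60):
    """Effective quick-link data rate in bits per second."""
    return bits_per_spot * spots * fps * (1.0 - ber)

print(quicklink_throughput(ber=0.0))    # 5400.0 bps = 5.4 Kbps (BER 0 within 2 m)
print(quicklink_throughput(ber=0.08))   # 4968.0 bps, close to the ~5 Kbps average at 4 m
```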
6.7.7 Overhead

Figure 6.16 Computation cost and latency evaluation.

Computation overhead. For drones, battery capacity is limited. The LEDs provide the lighting function and are energy efficient, so we only consider the computation overhead at the reader side. The reader should not perform complex computations or drain its energy too quickly. The training processes are offline, while drone identification and the distance, speed, and angle estimations are real-time tasks conducted step by step with low computation cost while the drone is flying. As shown in Figure 6.16, the quick link requires the most memory because it decodes narrower strips than the CNN based tasks above. Together, these tasks require 313 MiB of memory, which is not a computational burden for a commercial smart device.

Latency. For collaboration tasks among drones, time can be important to improve efficacy and efficiency. Compared with state-of-the-art drone localization systems, including audio-based systems, PoseFly incurs nearly no signal propagation delay thanks to the fast propagation of light. Thus we only consider the computational latency. As shown in Figure 6.16, drone identification and on-site localization (distance, speed, and angle estimation) have low running times of about 0.07s-0.09s each. These functions can run in a pipelined manner (i.e., 0.07s-0.09s in total) and thus achieve real-time on-site pose parsing. For example, given two drones with a 20m/s relative speed, after Drone A completes its pose parsing for Drone B, the parsed distance may carry only a 20m/s × 0.09s = 1.8m estimation error. The distance estimation in PoseFly is designed for the discrete distance ranges [4m, 8m, 12m, 16m, 20m], so a 1.8m estimation error is acceptable and practical. Unlike real-time on-site pose parsing, the quick-link function is designed for information sharing when needed (e.g., roughly which drones are nearby, or broadcast commands), which does not strictly require real-time communication. Thus, its 0.31s latency is acceptable, similar to collaboration among geese.

6.8 Discussion and Summary

Comparison with Existing Work. (1) Passive optical labels. Compared with passive optical labels such as barcodes and QR codes [111] of a size (2cm×2cm) similar to the red cover in our prototype, we measured that these passive labels only work within 50cm. (2) RF based localization. RF based localization can provide distance estimation with errors of roughly several meters and a localization time of more than 70 seconds, while not offering the other aspects of drone pose parsing in PoseFly, such as angle and speed estimation [92]. (3) RF/OCC communication. RF techniques can provide long communication distances, but they face severe interference when there are massive numbers of drones. Existing OCC approaches achieve a similar throughput of several Kbps, but they do not provide optical labeling or on-site localization functions [38, 37].

Other Concerns. (1) Discrete values. The current PoseFly provides discrete relative localization instead of continuous relative distance/angle/speed values. However, PoseFly is designed for swarming drones' collaboration, which does not require exact relative positioning, similar to geese flying in formation. (2) Modulated ambient light. Although modulated light sources such as LiFi (>100 kHz) transmitters exist, PoseFly can filter them out via the spatial diversity of millions of camera pixels and the different operating frequency (about 10 kHz). (3) Frame gap loss.
The transmitted data in the quick-link channel are repeated for broadcast, so data loss caused by frame gaps does not affect the final decoded data.

In summary, we propose PoseFly for simple and robust on-site drone pose parsing via optical camera communication. We design a color-arc scheme, investigate the spatial embedding ability of rolling shutter cameras, and are the first to exploit it for drone localization, including relative distance, speed, and angle estimation. Besides, we design active optical labels with cyclic pilots and frame-level data sequences for high-capacity drone indication and quick-link communication, enabling real-time and smooth collaboration among drones. Finally, we conduct experiments on the implemented prototype in various scenarios. The experiments show that PoseFly achieves near 100% accuracy for drone identification at up to 12m, 100% drone localization accuracy, and a 5 Kbps average data rate with an average BER below 0.08 at up to 4m, in both day and night. These results demonstrate that PoseFly works well.

CHAPTER 7 CONCLUSION AND FUTURE WORK

7.1 Conclusion

Because the demand for high-speed wireless communication services keeps growing while the RF bandwidth remains limited and crowded, there has been a boom in research and industry interest in optical wireless communication (OWC). The new technology ushers in a potential world of fast and ubiquitous wireless communications and enables integrated sensing and communication, along with new challenges in developing OWC techniques.

First, we propose LiFOD to improve the data rate of LiFi systems. We exploit compensation symbols, previously used only for dimming, to indicate bit patterns in modulation as a dimming side-channel. We addressed challenges including greedy bit pattern mining and compensation redesign and relocation. LiFOD utilizes 1D temporal diversity in data embedding.

Second, we propose RainbowRow to boost the data rate of optical camera communication. We exploit 2D rolling blocks in optical imaging to transmit more bits per optical symbol. By redesigning the transmitter with linear LED bulbs and addressing optical signals' interference, RainbowRow achieves a 20× data rate improvement over existing OCC systems.

Third, we embed data bits in a 3D spatial manner to overcome the limitations of existing passive optical tags. U-Star is a cost-effective and practical underwater self-navigation solution for large-scale applications. We utilize deep learning and color-arc designs to address challenges such as underwater denoising, relative positioning, and robust decoding.

Besides communication, we also exploit 3D spatial-temporal diversities for optical wireless sensing in RoFin. We design low-cost RoFin gloves with 6 key points and utilize the rolling shutter effect to reconstruct the hand pose in real time. RoFin can also provide fine-grained finger tracking for numerous applications, such as virtual writing for Parkinson sufferers.

Finally, we propose PoseFly, which utilizes 4D (3D spatial with 1D temporal) diversities for on-site pose parsing of drones. PoseFly is designed as a low-cost but effective integrated optical sensing and communication framework for large-scale drone networks, with 4 functions: massive drone indication, a quick-link channel, lighting, and multi-level drone positioning. These studies and the outcomes from our implemented prototypes validate the ideas we advocated.
These results demonstrate that the multi-dimensions of spatial-temporal diversities we explored in optical wireless communication can indeed improve the performance of OWC systems. Our work may enable numerous applications in the future as promising techniques for next-generation wireless communication and networks.

7.2 Ongoing Work

Following the projects above, our ongoing project is HotSys, which focuses on systems of holographic optical tags for scalable and collaborative mobile infrastructures. Most existing OWC based techniques for vehicular mobile systems adopt omnidirectional beamforming. This requires strict beam alignment, which leads to a limited communication field of view and lacks relative positioning capabilities [130]. Therefore, we propose HotSys, a system of Holographic Optical Tags, to overcome this limitation and support scalable and collaborative mobile systems, which may include Vehicle-to-Everything (V2X) systems, as shown in Figure 7.1.

Figure 7.1 Research Objectives of HotSys: (1) holographic optical tag design, (2) distributed collaborative localization, (3) middleware design for multi-to-multi networking.

HotSys tags are virtual 3D tags embedded with data and positioning elements in 3D space. The images of a HotSys virtual 3D tag are delivered in multiple directions via a new multi-direction reflector. The HotSys tags attach to individual vehicles for simultaneous multi-to-multi communications (i.e., multi-to-multi communication means that a node can transmit to and receive from multiple directions at the same time, as shown in Figure 7.1). Multi-to-multi communications using the HotSys tags will not require beam alignment and can therefore exploit data embedded in 3-dimensional space for fast and robust data transmission. The system will include middleware to enable collaborative positioning identification of the mobile vehicles within the system. As a result, HotSys tags on the vehicles will be composed into a distributed system to construct a reliable and accurate localization system and a scalable collaborative communication system. The prototype of the HotSys tag is shown in Figure 7.2.

Figure 7.2 The design illustration of Holographic Optical Tags.

7.3 Future Work

In the future, our research will continue to explore the multi-dimensions of spatial-temporal diversities to further enhance optical wireless communication (OWC) and enable novel OWC sensing techniques. These advancements have a wide range of potential application scenarios, including cellular connectivity, smart homes, V2X communication, underwater communication, e-health, space communication, smart shopping, and more. However, it is important to note that while these applications mainly focus on the user side, we must also pay attention to the infrastructure side. Related technologies, such as data center optical networks, virtualized radio access networks, MIMO (Multiple Input Multiple Output) [50], Full-Duplex spectrum [5], beamforming [119], and smart surfaces [22], form the backbone and foundation that enable and support the diverse applications mentioned above.

Integrating research efforts in both user-side applications and advanced infrastructure technologies is crucial to fully harness the potential of optical wireless communication and achieve next-generation wireless networks. By focusing on both aspects, we can create a comprehensive ecosystem that addresses the challenges and opportunities of optical wireless communication.
On the user side, exploring diverse application scenarios and developing innovative solutions for areas like smart homes, underwater communication, and e-health will lead to practical implementations of optical wireless communication in everyday life. Simultaneously, advancing infrastructure technologies such as data center optical networks, virtualized radio access networks, MIMO, Full-Duplex spectrum, beamforming, and smart surfaces will provide a strong foundation to support the increasing demands of optical wireless communication networks.

Combining these research efforts will lead to a well-rounded and future-proof approach to optical wireless communication, enabling efficient and reliable wireless communication systems that cater to the diverse needs of modern society. This integration will play a vital role in shaping the next-generation wireless landscape and unlocking new possibilities for communication and connectivity.

BIBLIOGRAPHY

[1] COCO. https://paperswithcode.com/dataset/coco, 2014.

[2] IEEE standard for local and metropolitan area networks–part 15.7: Short-range optical wireless communications. IEEE Std 802.15.7-2018 (Revision of IEEE Std 802.15.7-2011), pages 1–407, April 2019.

[3] Yun Ai, Aashish Mathur, Gyan Deep Verma, Long Kong, and Michael Cheffena. Comprehensive physical layer security analysis of fso communications over málaga channels. IEEE Photonics Journal, 12(6):1–17, 2020.

[4] Farhad Akhoundi, Amir Minoofar, and Jawad A Salehi. Underwater positioning system based on cellular underwater wireless optical cdma networks. In 2017 26th Wireless and Optical Communication Conference (WOCC), pages 1–3. IEEE, 2017.

[5] Muhammad Amjad, Fayaz Akhtar, Mubashir Husain Rehmani, Martin Reisslein, and Tariq Umer. Full-duplex communication in cognitive radio networks: A survey. IEEE Communications Surveys & Tutorials, 19(4):2158–2191, 2017.

[6] Lorenzo Bertizzolo, Salvatore D'Oro, Ludovico Ferranti, Leonardo Bonati, Emrecan Demirors, Zhangyu Guan, Tommaso Melodia, and Scott Pudlewski. Swarmcontrol: An automated distributed control framework for self-optimizing drone networks. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, pages 1768–1777, 2020.

[7] Azzedine Boukerche and Peng Sun. Design of algorithms and protocols for underwater acoustic wireless sensor networks. ACM Computing Surveys (CSUR), 53(6):1–34, 2020.

[8] Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 666–682, 2018.

[9] Charles J. Carver, Zhao Tian, Hongyong Zhang, Kofi M. Odame, Alberto Quattrini Li, and Xia Zhou. Amphilight: Direct air-water communication with laser light. GetMobile: Mobile Comp. and Comm., 24(3):26–29, January 2021.

[10] Darren Caulfield and Kenneth Dawson-Howe. Direction of camera based on shadows. In Proceedings of the Irish Machine Vision and Image Processing Conference, pages 216–223. Citeseer, 2004.

[11] Nan Cen, Neil Dave, Emrecan Demirors, Zhangyu Guan, and Tommaso Melodia. Libeam: Throughput-optimal cooperative beamforming for indoor visible light networks. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pages 1972–1980. IEEE, 2019.

[12] Weiya Chen, Chenchen Yu, Chenyu Tu, Zehua Lyu, Jing Tang, Shiqi Ou, Yan Fu, and Zhidong Xue. A survey on hand pose estimation with wearable sensors and computer-vision-based methods. Sensors, 20(4):1074, 2020.
180 [13] Ramesh Kumar Chidambaram and Rammohan Arunachalam. Automotive headlamp high power led cooling system and its effect on junction temperature and light intensity. Journal of Thermal Engineering, 6(6):354–368, 2020. [14] Mostafa Zaman Chowdhury, Moh Khalid Hasan, Md Shahjalal, Md Tanvir Hossan, and Yeong Min Jang. Optical wireless hybrid networks: Trends, opportunities, challenges, and research directions. IEEE Communications Surveys & Tutorials, 22(2):930–966, 2020. [15] Mostafa Zaman Chowdhury, Md Tanvir Hossan, Amirul Islam, and Yeong Min Jang. A comparative survey of optical wireless technologies: Architectures and applications. IEEE Access, 6:9819–9840, 2018. [16] CIE. chromaticity diagram, 1931. [17] CIFAR10.https://paperswithcode.com/sota/image-classification-on-cifar-10, 2009. [18] CIFAR100.https://paperswithcode.com/sota/image-classification-on-cifar-100, 2009. [19] Minhao Cui, Yuda Feng, Qing Wang, and Jie Xiong. Sniffing visible light communication through walls. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–14, 2020. [20] Minhao Cui, Qing Wang, and Jie Xiong. Breaking the limitations of visible light com- munication through its side channel. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems, pages 232–244, 2020. [21] Francisco J. Escribano, José Sáez Landete, and Alexandre Wagemakers. Chaos-based mul- ticarrier VLC modulator with compensation of LED nonlinearity. IEEE Trans. Communi- cations, 67(1):590–598, 2019. [22] Roberto Flamini, Danilo De Donno, Jonathan Gambini, Francesco Giuppi, Christian Maz- zucco, Angelo Milani, and Laura Resteghini. Toward a heterogeneous smart electromagnetic environment for millimeter-wave communications: An industrial viewpoint. IEEE Transac- tions on Antennas and Propagation, 70(10):8898–8910, 2022. [23] Ander Galisteo, Diego Juara, and Domenico Giustiniano. Research in visible light commu- nication systems with openvlc1. 3. In 2019 IEEE 5th World Forum on Internet of Things (WF-IoT), pages 539–544. IEEE, 2019. [24] Ander Galisteo, Qing Wang, Aniruddha Deshpande, Marco Zuniga, and Domenico Gius- tiniano. Follow that light: Leveraging leds for relative two-dimensional localization. In Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies, pages 187–198, 2017. [25] Jazmine Gaona and Ray Oltion. Natural navigation. 2013. 181 [26] Jun Gong, Yang Zhang, Xia Zhou, and Xing-Dong Yang. Pyro: Thumb-tip gesture recogni- tion using pyroelectric infrared sensing. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 553–563, 2017. [27] GSA. U.s. general services administration, 6.15 lighting. https://www.gsa.gov/node/ 82715, 2021. [28] GSA. Verified market research. https://www.verifiedmarketresearch.com/ product/drones-market/, 2022. [29] Lav Gupta, Raj Jain, and Gabor Vaszkun. Survey of important issues in uav communication networks. IEEE Communications Surveys & Tutorials, 18(2):1123–1152, 2015. [30] Harald Haas, Liang Yin, Yunlu Wang, and Cheng Chen. What is lifi? Journal of lightwave technology, 34(6):1533–1544, 2015. [31] MA Hadi. Wireless communication tends to smart technology li-fi and its comparison with wi-fi. American Journal of Engineering Research (AJER), 5(5):40–47, 2016. [32] C Haldoupis and K Schlegel. Characteristics of midlatitude coherent backscatter from the ionospheric e region obtained with sporadic e scatter experiment. 
Journal of Geophysical Research: Space Physics, 101(A6):13387–13397, 1996. [33] Richard W Hamming. Error detecting and error correcting codes. The Bell system technical journal, 29(2):147–160, 1950. [34] J. Hao, Y. Yang, and J. Luo. Ceilingcast: Energy efficient and location-bound broadcast through led-camera communication. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pages 1–9, 2016. [35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [36] Justin Hu, Ariana Bruno, Drew Zagieboylo, Mark Zhao, Brian Ritchken, Brendon Jackson, Joo Yeon Chae, Francois Mertil, Mateo Espinosa, and Christina Delimitrou. To centralize or not to centralize: A tale of swarm coordination. arXiv preprint arXiv:1805.01786, 2018. [37] P. Hu, P. H. Pathak, H. Zhang, Z. Yang, and P. Mohapatra. High speed led-to-camera communication using color shift keying with flicker mitigation. IEEE Transactions on Mobile Computing, 19(7):1603–1617, 2020. [38] Pengfei Hu, Parth H Pathak, Xiaotao Feng, Hao Fu, and Prasant Mohapatra. Colorbars: Increasing data rate of led-to-camera communication using color shift keying. In proceedings of the 11th ACM conference on Emerging Networking experiments and technologies, pages 1–13, 2015. [39] Pei Huang, Jun Huang, and Li Xiao. Exploiting modulation scheme diversity in multicar- rier wireless networks. In 2016 13th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), pages 1–9. IEEE, 2016. 182 [40] Neminath Hubballi and Mayank Swarnkar. 𝑏𝑖𝑡𝑐𝑜𝑑𝑖𝑛𝑔: Network traffic classification through encoded bit level signatures. IEEE/ACM Transactions on Networking, 26(5):2334–2346, 2018. [41] RD Hunsucker and HF Bates. Survey of polar and auroral region effects on hf propagation. Radio Science, 4(4):347–365, 1969. [42] Ayesha Ijaz, Lei Zhang, Maxime Grau, Abdelrahim Mohamed, Serdar Vural, Atta U Quddus, Muhammad Ali Imran, Chuan Heng Foh, and Rahim Tafazolli. Enabling massive iot in 5g and beyond systems: Phy radio frame design considerations. IEEE Access, 4:3322–3339, 2016. [43] Tariq Islam and Seok-Hwan Park. A comprehensive survey of the recently proposed local- ization protocols for underwater sensor networks. IEEE Access, 2020. [44] Mohammad Jahanbakht, Wei Xiang, Lajos Hanzo, and Mostafa Rahimi Azghadi. Inter- net of underwater things and big marine data analytics—a comprehensive survey. IEEE Communications Surveys & Tutorials, 2021. [45] Fahad Jalal and Faizan Nasir. Underwater navigation, localization and path planning for autonomous vehicles: A review. In 2021 International Bhurban Conference on Applied Sciences and Technologies (IBCAST), pages 817–828. IEEE, 2021. [46] Junsu Jang and Fadel Adib. Underwater backscatter networking. In Proceedings of the ACM Special Interest Group on Data Communication, pages 187–199. 2019. [47] Ruhul Khalil, Mohammad Babar, Tariqullah Jan, and Nasir Saeed. Towards the internet of underwater things: Recent developments and future challenges. IEEE Consumer Electronics Magazine, 2020. [48] Jun Sik Kim, Byung Kook Kim, Minsu Jang, Kyumin Kang, Dae Eun Kim, Byeong-Kwon Ju, and Jinseok Kim. Wearable hand module and real-time tracking algorithms for measuring finger joint angles of different hand sizes with high accuracy using fbg strain sensor. Sensors, 20(7):1921, 2020. [49] Aleksandra Kostic-Ljubisavljevic and Branka Mikavica. 
Challenges and opportunities of vlc application in intelligent transportation systems. In Encyclopedia of Information Science and Technology, Fifth Edition, pages 1051–1064. IGI Global, 2021. [50] Erik G Larsson, Ove Edfors, Fredrik Tufvesson, and Thomas L Marzetta. Massive mimo for next generation wireless systems. IEEE communications magazine, 52(2):186–195, 2014. [51] Hui-Yu Lee, Hao-Min Lin, Yu-Lin Wei, Hsin-I Wu, Hsin-Mu Tsai, and Kate Ching-Ju Lin. Rollinglight: Enabling line-of-sight light-to-camera communications. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services, pages 167–180, 2015. 183 [52] Yongjun Lee, Myungsin Kim, Yongseok Lee, Junghan Kwon, Yong-Lae Park, and Dongjun Lee. Wearable finger tracking and cutaneous haptic interface with soft sensors for multi- fingered virtual manipulation. IEEE/ASME Transactions on Mechatronics, 24(1):67–77, 2018. [53] Bin Li, Zesong Fei, and Yan Zhang. Uav communications for 5g and beyond: Recent advances and future trends. IEEE Internet of Things Journal, 6(2):2241–2263, 2018. [54] Chenning Li, Hanqing Guo, Shuai Tong, Xiao Zeng, Zhichao Cao, Mi Zhang, Qiben Yan, Li Xiao, Jiliang Wang, and Yunhao Liu. Nelora: Towards ultra-low snr lora communication with neural-enhanced demodulation. In Proceedings of ACM SenSys, 2021. [55] Juan Li, Xu Bao, Wance Zhang, and Nan Bao. Qoe probability coverage model of indoor visible light communication network. IEEE Access, 8:45390–45399, 2020. [56] Rui Li, Zhenyu Liu, and Jianrong Tan. A survey on 3d hand pose estimation: Cameras, methods, and datasets. Pattern Recognition, 93:251–272, 2019. [57] Tianxing Li, Chuankai An, Zhao Tian, Andrew T Campbell, and Xia Zhou. Human sens- ing using visible light communication. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, pages 331–344, 2015. [58] Tianxing Li, Chuankai An, Xinran Xiao, Andrew T Campbell, and Xia Zhou. Real-time screen-camera communication behind any scene. In Proceedings of the 13th Annual Inter- national Conference on Mobile Systems, Applications, and Services, pages 197–211, 2015. [59] Tianxing Li, Xi Xiong, Yifei Xie, George Hito, Xing-Dong Yang, and Xia Zhou. Recon- structing hand poses using visible light. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(3):1–20, 2017. [60] You Li and Javier Ibanez-Guzman. Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems. IEEE Signal Processing Magazine, 37(4):50–61, 2020. [61] Jaime Lien, Nicholas Gillian, M Emre Karagozler, Patrick Amihood, Carsten Schwesig, Erik Olson, Hakim Raja, and Ivan Poupyrev. Soli: Ubiquitous gesture sensing with millimeter wave radar. ACM Transactions on Graphics (TOG), 35(4):1–19, 2016. [62] Chi Lin, Yongda Yu, Jie Xiong, Yichuan Zhang, Lei Wang, Guowei Wu, and Zhongxuan Luo. Shrimp: a robust underwater visible light communication system. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, pages 134–146, 2021. [63] Huaiyin Lu, Ming Jiang, and Julian Cheng. Deep learning aided robust joint channel clas- sification, channel estimation, and signal detection for underwater optical communication. IEEE Transactions on Communications, 69(4):2290–2303, 2020. [64] Philip Lundrigan, Neal Patwari, and Sneha K. Kasera. On-off noise power communication. 
In The 25th Annual International Conference on Mobile Computing and Networking, MobiCom ’19, New York, NY, USA, 2019. Association for Computing Machinery. 184 [65] Chengcai Lv, Binjian Shen, Chuan Tian, Shengzong Zhang, Liang Yu, and Dazhen Xu. Signal design and processing for underwater acoustic positioning and communication inte- grated system. In 2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP), pages 89–93. IEEE, 2020. [66] Nino E Merencilla, Alvin Sarraga Alon, Glenn John O Fernando, Elaine M Cepe, and Den- nis C Malunao. Shark-eye: A deep inference convolutional neural network of shark detection for underwater diving surveillance. In 2021 International Conference on Computational In- telligence and Knowledge Economy (ICCIKE), pages 384–388. IEEE, 2021. [67] Raed Mesleh, Hany Elgala, and Harald Haas. Led nonlinearity mitigation techniques in optical wireless ofdm communication systems. Journal of Optical Communications and Networking, 4(11):865–875, 2012. [68] Micrographia. U.s. general services administration, 6.15 lighting. https:// en.wikipedia.org/wiki/Micrographia_%28handwriting%29, 2022. [69] Muhammad Sarmad Mir, Borja Genoves Guzman, Ambuj Varshney, and Domenico Giustini- ano. Passivelifi: rethinking lifi for low-power and long range rf backscatter. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, pages 697–709, 2021. [70] Olga Mirgorodskaya, Olesya Ivanchenko, and Narine Dadayan. Using digital signage technologies in retail marketing activities. In Proceedings of the International Scientific Conference-Digital Transformation on Manufacturing, Infrastructure and Service, pages 1–7, 2020. [71] András J Molnár. Trailsigner: A conceptual model of hiking trail networks with consistent signage planning and management. In Information Modelling and Knowledge Bases XXXII, pages 1–25. IOS Press, 2020. [72] Mohammed SA Mossaad, Steve Hranilovic, and Lutz Lampe. Visible light communications using ofdm and multiple leds. IEEE Transactions on Communications, 63(11):4304–4313, 2015. [73] N Muraleedharan, Anna Thomas, S Indu, and BS Bindhumadhava. A traffic monitoring and policy enforcement framework for http. In 2020 Third ISEA Conference on Security and Privacy (ISEA-ISAP), pages 81–86. IEEE. [74] Zhang Nan, Zhang Fan, and Enmao Liu. Design of a shared platform for interactive public art from perspective of dynamic vision. In 2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA), pages 37–42. IEEE, 2020. [75] Rajalakshmi Nandakumar, Vikram Iyer, Desney Tan, and Shyamnath Gollakota. Finge- rio: Using active sonar for fine-grained finger tracking. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 1515–1525, 2016. 185 [76] Ibrahima N’Doye, Ding Zhang, Mohamed-Slim Alouini, and Taous-Meriem Laleg-Kirati. Establishing and maintaining a reliable optical wireless communication in underwater envi- ronment. IEEE Access, 9:62519–62531, 2021. [77] Phuc Nguyen, Taeho Kim, Jinpeng Miao, Daniel Hesselius, Erin Kenneally, Daniel Massey, Eric Frew, Richard Han, and Tam Vu. Towards rf-based localization of a drone and its controller. In Proceedings of the 5th workshop on micro aerial vehicle networks, systems, and applications, pages 21–26, 2019. [78] Phuc Nguyen, Mahesh Ravindranatha, Anh Nguyen, Richard Han, and Tam Vu. Investigating cost-effective rf-based detection of drones. 
In Proceedings of the 2nd workshop on micro aerial vehicle networks, systems, and applications for civilian use, pages 17–22, 2016. [79] Phuc Nguyen, Hoang Truong, Mahesh Ravindranathan, Anh Nguyen, Richard Han, and Tam Vu. Matthan: Drone presence detection by identifying physical signatures in the drone’s rf communication. In Proceedings of the 15th annual international conference on mobile systems, applications, and services, pages 211–224, 2017. [80] U.S. Department of Energy. Lighting Choices to Save You Money. https://www.energy. gov/energysaver/lighting-choices-save-you-money, 2022. [81] Hao Pan, Yi-Chao Chen, Lanqing Yang, Guangtao Xue, Chuang-Wen You, and Xiaoyu Ji. mqrcode: Secure qr code using nonlinearity of spatial frequency in light. In The 25th Annual International Conference on Mobile Computing and Networking, page 27. ACM, 2019. [82] Kun Qian, Yumeng Lu, Zheng Yang, Kai Zhang, Kehong Huang, Xinjun Cai, Chenshu Wu, and Yunhao Liu. Aircode: Hidden screen-camera communication on an invisible and inaudible dual channel. In NSDI, pages 457–470, 2021. [83] Qualcomm. Making 5g nr a reality: leading the technology inventions for a unified, more capable 5g air interface. White paper, 2016. [84] E Ramadhani and GP Mahardika. The technology of lifi: A brief introduction. In IOP Conf. Series: Materials Science and Engineering, volume 3, pages 1–10, 2018. [85] A Rammohan and C RameshKumar. Investigation on light intensity and temperature dis- tribution of automotive’s halogen and led headlight. In 2017 International conference on Microelectronic Devices, Circuits and Systems (ICMDCS), pages 1–6. IEEE, 2017. [86] Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Sign language recognition: A deep survey. Expert Systems with Applications, 164:113794, 2021. [87] Market Reports. Globle scube diving equipment industry research report, growth trends and competitive analysis 2021-2027. https://www.marketreportsworld.com/ global- scuba-diving-equipment-industry-18271751, 2021. [88] Aleksandr Rodionov, Petr Unru, and Aleksandr Golov. Long-range underwater acoustic navigation and communication system. In 2020 IEEE Eurasia Conference on IOT, Commu- nication and Engineering (ECICE), pages 60–63. IEEE, 2020. 186 [89] Nasir Saeed, Abdulkadir Celik, Tareq Y Al-Naffouri, and Mohamed-Slim Alouini. Under- water optical wireless communications, networking, and localization: A survey. Ad Hoc Networks, 94:101935, 2019. [90] Krishna Raj Sapkota, Steven Roelofsen, Artem Rozantsev, Vincent Lepetit, Denis Gillet, Pascal Fua, and Alcherio Martinoli. Vision-based unmanned aerial vehicle detection and tracking for sense and avoid systems. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1556–1561. Ieee, 2016. [91] Giuseppe Schirripa Spagnolo, Lorenzo Cozzella, and Fabio Leccese. Underwater optical wireless communications: Overview. Sensors, 20(8):2261, 2020. [92] Zhambyl Shaikhanov, Ahmed Boubrima, and Edward W Knightly. Autonomous drone net- works for sensing, localizing and approaching rf targets. In 2020 IEEE Vehicular Networking Conference (VNC), pages 1–8. IEEE, 2020. [93] Abhishek Sharma, Pankhuri Vanjani, Nikhil Paliwal, Chathuranga M Wijerathna Basnayaka, Dushantha Nalin K Jayakody, Hwang-Cheng Wang, and P Muthuchidambaranathan. Com- munication and networking technologies for uavs: A survey. Journal of Network and Computer Applications, 168:102739, 2020. [94] Truman R Strobridge. Chronology of Aids to Navigation and the Old Lighthouse Service, 1716-1939. 
Public Affairs Division, United States Coast Guard, 1974. [95] Sanjib Sur, Ioannis Pefkianakis, Xinyu Zhang, and Kyu-Han Kim. Towards scalable and ubiquitous millimeter-wave wireless networks. In Proceedings of the 24th Annual Interna- tional Conference on Mobile Computing and Networking, pages 257–271, 2018. [96] Witold Szymański and Maurycy Kin. The perspective transformation in illusionistic ceiling painting of late baroque. Teka Komisji Architektury, Urbanistyki i Studiów Krajobrazowych, 15(1):104–112, 2019. [97] Andrea Tagliasacchi, Matthias Schröder, Anastasia Tkach, Sofien Bouaziz, Mario Botsch, and Mark Pauly. Robust articulated-icp for real-time hand tracking. In Computer graphics forum, volume 34, pages 101–114. Wiley Online Library, 2015. [98] Lei Tao, Tao Hong, Yichen Guo, Hangyu Chen, and Jinmeng Zhang. Drone identifica- tion based on centernet-tensorrt. In 2020 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), pages 1–5. IEEE, 2020. [99] Zhao Tian, Kevin Wright, and Xia Zhou. The darklight rises: Visible light communication in the dark. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, pages 2–15, 2016. [100] Sumit Tiwari. An introduction to qr code technology. In 2016 international conference on information technology (ICIT), pages 39–44. IEEE, 2016. [101] CAIDA UCSD. SIGCOMM’17 anonymized internet traces. https://www.caida.org/ data/passive/passive_dataset.xml, 2017. 187 [102] CAIDA UCSD. CAIDA’19 anonymized internet traces. https://www.caida.org/data/ passive/passive_dataset.xml, 2019. [103] Hanif Ullah, Nithya Gopalakrishnan Nair, Adrian Moore, Chris Nugent, Paul Muschamp, and Maria Cuevas. 5g communication: an overview of vehicle-to-everything, drones, and healthcare use-cases. IEEE Access, 7:37251–37268, 2019. [104] Eren Unlu, Emmanuel Zenou, and Nicolas Riviere. Using shape descriptors for uav detection. Electronic Imaging, 2018(9):128–1, 2018. [105] Suseela Vappangi and VV Mani. Concurrent illumination and communication: A survey on visible light communication. Physical Communication, 33:90–114, 2019. [106] Qing Wang, Marco Zuniga, and Domenico Giustiniano. Passive communication with ambient light. In Proceedings of the 12th International on Conference on emerging Networking EXperiments and Technologies, pages 97–104, 2016. [107] Robert Wang, Sylvain Paris, and Jovan Popović. 6d hands: markerless hand-tracking for computer aided design. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 549–558, 2011. [108] X. Wang, J. P. Linnartz, and T. Tjalkens. An intelligent lighting system: Learn user preferences from inconsistent feedback. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, UbiComp ’16, page 1620–1626, New York, NY, USA, 2016. Association for Computing Machinery. [109] Zeyu Wang, Zhice Yang, Qianyi Huang, Lin Yang, and Qian Zhang. Als-p: Light weight visible light positioning via ambient light sensor. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pages 1306–1314. IEEE, 2019. [110] WiKi. Hamming code. https:en.wikipedia.orgwikiHamming_code, 2021. [111] WiKi. Barcode. https://en.wikipedia.org/wiki/Barcode#Matrix_(2D) _barcodes, 2022. [112] WiKi. Unmanned aerial vehicle. https://en.wikipedia.org/wiki/Unmanned_ aerial_vehicle, 2022. [113] Norman J Woodland and Silver Bernard. Classifying apparatus and method, October 7 1952. US Patent 2,612,994. 
[114] Hongjia Wu, Qing Wang, Jie Xiong, and Marco Zuniga. Smartvlc: When smart lighting meets vlc. In Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies, pages 212–223, 2017.
[115] Hongjia Wu, Qing Wang, Jie Xiong, and Marco Zuniga. Smartvlc: Co-designing smart lighting and communication for visible light networks. IEEE Transactions on Mobile Computing, 19(8):1956–1970, 2019.
[116] Xiping Wu, Mohammad Dehghani Soltani, Lai Zhou, Majid Safari, and Harald Haas. Hybrid lifi and wifi networks: A survey. IEEE Communications Surveys & Tutorials, 23(2):1398–1420, 2021.
[117] Yue Wu, Purui Wang, Kenuo Xu, Lilei Feng, and Chenren Xu. Turboboosting visible light backscatter communication. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '20, pages 186–197, New York, NY, USA, 2020. Association for Computing Machinery.
[118] Yue Wu, Purui Wang, Kenuo Xu, Lilei Feng, and Chenren Xu. Turboboosting visible light backscatter communication. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 186–197, 2020.
[119] Zhenyu Xiao, Lipeng Zhu, Yanming Liu, Pengfei Yi, Rui Zhang, Xiang-Gen Xia, and Robert Schober. A survey on millimeter-wave beamforming enabled uav communications and networking. IEEE Communications Surveys & Tutorials, 24(1):557–610, 2021.
[120] Huichuan Xu, Daisuke Iwai, Shinsaku Hiura, and Kosuke Sato. User interface by virtual shadow projection. In 2006 SICE-ICASE International Joint Conference, pages 4814–4817. IEEE, 2006.
[121] Beiya Yang and Erfu Yang. A survey on radio frequency based precise localisation technology for uav in gps-denied environment. Journal of Intelligent & Robotic Systems, 103(3):1–30, 2021.
[122] Y. Yang, J. Hao, and J. Luo. Ceilingtalk: Lightweight indoor broadcast through led-camera communication. IEEE Transactions on Mobile Computing, 16(12):3308–3319, 2017.
[123] Yanbing Yang, Jie Hao, and Jun Luo. Ceilingtalk: Lightweight indoor broadcast through led-camera communication. IEEE Transactions on Mobile Computing, 16(12):3308–3319, 2017.
[124] Yanbing Yang and Jun Luo. Boosting the throughput of led-camera vlc via composite light emission. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pages 315–323. IEEE, 2018.
[125] Yanbing Yang and Jun Luo. Composite amplitude-shift keying for effective led-camera vlc. IEEE Transactions on Mobile Computing, 19(03):528–539, 2020.
[126] Yanbing Yang, Jun Luo, Chen Chen, Wen-De Zhong, and Liangyin Chen. Synlight: synthetic light emission for fast transmission in cots device-enabled vlc. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pages 1297–1305. IEEE, 2019.
[127] Yang Yang, Zhimin Zeng, Julian Cheng, and Caili Guo. Spatial dimming scheme for optical ofdm based visible light communication. Optics Express, 24(26):30254–30263, 2016.
[128] Yang Yang, Zhimin Zeng, Julian Cheng, and Caili Guo. A novel hybrid dimming control scheme for visible light communications. IEEE Photonics Journal, 9(6):1–12, 2017.
[129] Zhice Yang, Zeyu Wang, Jiansong Zhang, Chenyu Huang, and Qian Zhang. Polarization-based visible light positioning. IEEE Transactions on Mobile Computing, 18(3):715–727, 2019.
[130] Ibrar Yaqoob, Latif U Khan, SM Ahsan Kazmi, Muhammad Imran, Nadra Guizani, and Choong Seon Hong. Autonomous driving cars in smart cities: Recent advances, requirements, and challenges. IEEE Network, 34(1):174–181, 2019.
[131] Kai Ying, Zhenhua Yu, Robert J Baxley, Hua Qian, Gee-Kung Chang, and G Tong Zhou. Nonlinear distortion mitigation in visible light communications. IEEE Wireless Communications, 22(2):36–45, 2015.
[132] Fahad Zafar, Masuduzzaman Bakaul, and Rajendran Parthiban. Laser-diode-based visible light communication: Toward gigabit class communication. IEEE Communications Magazine, 55(2):144–151, 2017.
[133] Fahad Zafar, Dilukshan Karunatilaka, and Rajendran Parthiban. Dimming schemes for visible light communication: the state of research. IEEE Wireless Communications, 22(2):29–35, 2015.
[134] Zhaoquan Zeng, Shu Fu, Huihui Zhang, Yuhan Dong, and Julian Cheng. A survey of underwater optical wireless communications. IEEE Communications Surveys & Tutorials, 19(1):204–238, 2016.
[135] Bo Zhang and Hoi Dick Ng. An experimental investigation of the explosion characteristics of dimethyl ether-air mixtures. Energy, 107:1–8, 2016.
[136] Chi Zhang and Xinyu Zhang. Pulsar: Towards ubiquitous visible light localization. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, pages 208–221, 2017.
[137] Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. Mediapipe hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214, 2020.
[138] Kai Zhang, Yi Zhao, Chenshu Wu, Chaofan Yang, Kehong Huang, Chunyi Peng, Yunhao Liu, and Zheng Yang. Chromacode: A fully imperceptible screen-camera communication system. IEEE Transactions on Mobile Computing, 2019.
[139] Lan Zhang, Kebin Liu, Xiang-Yang Li, Cihang Liu, Xuan Ding, and Yunhao Liu. Privacy-friendly photo capturing and sharing system. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 524–534. ACM, 2016.
[140] Weidong Zhang, Lili Dong, Xipeng Pan, Peiyu Zou, Li Qin, and Wenhai Xu. A survey of restoration and enhancement for underwater images. IEEE Access, 7:182259–182279, 2019.
[141] Xiao Zhang, Hanqing Guo, James Mariani, and Li Xiao. U-star: An underwater navigation system based on passive 3d optical identification tags. In Proceedings of the 28th Annual International Conference on Mobile Computing and Networking, pages 648–660, 2022.
[142] Xiao Zhang, Griffin Klevering, Juexing Wang, Li Xiao, and Tianxing Li. Rofin: 3d hand pose reconstructing via 2d rolling fingertips. Proceedings of the 21st ACM International Conference on Mobile Systems, Applications, and Services, conditionally accepted, 2023.
[143] Xiao Zhang, Griffin Klevering, and Li Xiao. Exploring rolling shutter effect for motion tracking with objective identification. In Proceedings of the Twentieth ACM Conference on Embedded Networked Sensor Systems, pages 816–817, 2022.
[144] Xiao Zhang, Griffin Klevering, and Li Xiao. Posefly: On-site pose parsing of swarming drones via 4-in-1 optical camera communication. In 2023 IEEE 24th International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), pages 1–10. IEEE, 2023.
[145] Xiao Zhang, James Mariani, Li Xiao, and Matt W Mutka. Lifod: Lighting extra data via fine-grained owc dimming. In 2022 19th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), pages 73–81. IEEE, 2022.
[146] Xiao Zhang and Li Xiao. Effective subcarrier pairing for hybrid delivery in relay networks. In 2020 IEEE 17th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), pages 238–246. IEEE, 2020.
[147] Xiao Zhang and Li Xiao. Lighting extra data via owc dimming. In Proceedings of the Student Workshop, pages 29–30, 2020.
[148] Xiao Zhang and Li Xiao. Rainbowrow: Fast optical camera communication. In 2020 IEEE 28th International Conference on Network Protocols (ICNP), pages 1–6. IEEE, 2020.
[149] Run Zhao, Dong Wang, Qian Zhang, Xueyi Jin, and Ke Liu. Smartphone-based handwritten signature verification using acoustic signals. Proceedings of the ACM on Human-Computer Interaction, 5(ISS):1–26, 2021.
[150] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
[151] Shilin Zhu, Chi Zhang, and Xinyu Zhang. Automating visual privacy protection using a smart led. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, pages 329–342, 2017.
[152] Shilin Zhu, Chi Zhang, and Xinyu Zhang. Lishield: Create a capture-resistant environment against photographing. In Proceedings of the 9th ACM Workshop on Wireless of the Students, by the Students, and for the Students, pages 23–23, 2017.