FEDERATED REINFORCEMENT LEARNING FOR CONTENT DISSEMINATION IN UAV NETWORKS

By 

Amit Kumar Bhuyan 

A DISSERTATION 

Submitted to 
Michigan State University 
in partial fulfillment of the requirements   
for the degree of 

Electrical and Computer Engineering – Doctor of Philosophy 

2025 

ABSTRACT 

In disaster scenarios with compromised communication infrastructure, Unmanned Aerial Vehicles (UAVs) can provide ad hoc connectivity for resilient information dissemination. This thesis develops a hierarchical UAV-assisted framework of federated multi-armed bandit learning for post-disaster content dissemination. The framework incorporates a two-tier UAV hierarchy consisting of Anchor UAVs (A-UAVs) with high-cost backhaul connectivity and more mobile Micro-Ferrying UAVs (MF-UAVs) without backhaul links. This hierarchy allows storage-intensive tasks to be strategically offloaded to A-UAVs while the mobility of MF-UAVs is leveraged to dynamically ferry content across disconnected user clusters. By integrating trajectory-aware selective caching strategies into UAV operations, the framework aligns aerial mobility patterns with evolving spatio-temporal content demands. The algorithmic innovation of the framework stems from a federated bandit and stateless reinforcement learning paradigm, which enables UAVs to collaboratively learn content popularity profiles and adapt their caching policies to localized user request patterns. Unlike centralized methods, the federated approach preserves data locality and minimizes inter-UAV communication overhead, which is critical in bandwidth- and energy-constrained post-disaster environments. The multi-armed bandit learning mechanism uses a multi-dimensional reward feedback architecture that captures content relevance, inter-UAV delivery latency, and caching diversity across disjoint and isolated user communities. The thesis also explores the interplay between UAV energy budgets, caching capacities, and mission-critical delivery constraints such as quality-of-service expectations expressed as tolerable access delay. In summary, the research in this thesis bridges multi-agent learning and mission-oriented aerial networking toward smart content dissemination in networks with sparse connectivity.

ACKNOWLEDGEMENTS 

I would like to express my deepest gratitude to my advisor, Dr. Subir Biswas, whose guidance, insight, and support have been central to my academic journey. His mentorship has profoundly shaped my research and growth. I am also sincerely thankful to my dissertation committee members, Dr. Nihar Mahapatra, Dr. Shaunak Bopardikar, and Dr. Sandeep Kulkarni, for their valuable insights, constructive feedback, and time throughout this process.

I have been fortunate to share my PhD journey with outstanding colleagues and friends. I am especially grateful to my lab colleague and close friend, Hrishikesh Dutta, whose companionship, support, and stimulating conversations greatly enriched my doctoral experience. My sincere appreciation also goes to Gao Kang, whose friendship and encouragement have been truly meaningful. I also thank Yida Yang for his genuine camaraderie, helpfulness and support at every step. I extend my thanks to Avirup Roy and Tianxiang Zhang for fostering a positive lab environment. I would also like to acknowledge some old members and new additions to our lab group, Daniel McDermott, Grace Michael, Tushig Bolorchuluun, Fengyang Shang, Santhosh Manoharan, and Reeve Fernandes, whose presence and enthusiasm enhanced our team’s dynamic.

I am grateful for the consistent administrative support from the College of Engineering and the Department of Electrical and Computer Engineering at Michigan State University, specifically Dr. Katy Colbry, Dr. Tim Hogan, Dr. Nelson Sepulveda, Dr. Yiming Deng, and staff members Lisa Clark, Michelle Stewart, Casie Medina, Michele Pursell, Brian Wright, Laurene Rashid, Meagan Kroll, and Jessica Pung.

I also want to acknowledge past mentors and collaborators who helped shape my early academic path, including Dr. D. L. Woodard, Dr. J. B. Harley, Dr. A. Zare, Dr. S. N. Merchant, Dr. J. H. Nirmal, Dr. K. Samudravijaya, Dr. S. K. Kopparapu, and friends Vivek Patkar and Apoorva Kayal for their meaningful involvement during formative stages of my research career.

To my friends from childhood to adulthood, thank you for always being there, lending an ear, offering advice, and providing much-needed laughter along the way.

I owe my greatest gratitude to my family, especially to my mother, Kabita Rani Bhuyan (Amma), whose unconditional love, enduring strength, and sacrifice have provided the foundation upon which I stand today. I am profoundly thankful to my sister, Reshma, whose steadfast support, quiet resilience, and constant belief in me have remained a dependable source of strength throughout this journey. I extend my thanks to my brother-in-law, Babandeep, for his constant encouragement.

To my family here, words fall short. My heartfelt appreciation and deepest thanks go to my wife, Kristyn (Mrs. Bhuyan), who is the greatest inspiration and pillar of strength in my life, without whose unwavering love, support, thoughtful encouragement, and understanding this achievement would not have been possible. I am immensely thankful for my daughters, Daisy and Charlotte, whose joyful presence provides hope and purpose every day.

Finally, to all my immediate and extended family members, your continuous encouragement and love have been invaluable throughout this journey.

TABLE OF CONTENTS

Chapter 1: Introduction
1.1 Background and Motivation
1.2 UAVs and Micro-UAVs for Content Provisioning
1.3 Advantages of Proactive Caching over Traditional Caching Techniques
1.4 Cooperative Federated Reinforcement Learning-based Techniques
1.5 Learning Caching Policies and Enhancing UAV-aided Content Dissemination in Disaster-Affected Areas
1.6 Strategic Joint Deployment of UAVs for Cache Space Utilization
1.7 Trajectory-aware Collaborative Caching using Swarm of Micro-Ferrying UAVs
1.8 Dissertation Objectives
1.9 Scope of the Dissertation

Chapter 2: Related Work
2.1 Post-Disaster Content Provisioning
2.2 Use of UAVs/Micro-UAVs for disaster management services
2.3 UAV-based Content Provisioning via Proactive Caching
2.4 Learning-based Caching for UAV-based content dissemination system
2.5 Federated, Bandit-based and Reinforcement learning for adaptive UAV caching
2.6 Summary

Chapter 3: UAV Centric Content Caching for Communication Challenged Scenarios
3.1 Motivation
3.2 Design Objective
3.3 System Model
3.4 Caching Policies
3.5 Content Dissemination Performance
3.6 Experimental Results and Analysis
3.7 Summary and Conclusion

Chapter 4: Using QoS-aware Caching for Handling Demand Heterogeneity in UAV-based Content Provisioning
4.1 Motivation
4.2 Design Objective
4.3 System Model
4.4 Caching Policies to Handle Heterogeneity
4.5 Deployment and Trajectory Planning of UAVs
4.6 Content Dissemination Performance
4.7 Experimental Setup
4.8 Experimental Results and Analysis
4.9 Summary and Conclusion

Chapter 5: Multi-Armed Bandit Learning for Content Provisioning in Network of UAVs
5.1 Motivation
5.2 Design Objectives
5.3 Caching based on Content Pre-loading at Anchor UAVs
5.4 Decentralized Caching with Multi-Armed Bandit
5.5 Experiments and Results
5.6 Summary and Conclusion

Chapter 6: Distributed Federated-Multi-Armed Bandit Learning for Content Management in Connected UAVs
6.1 Motivation
6.2 Design Objective
6.3 System Model
6.4 Limitations of Cache Pre-loading at A-UAVs
6.5 Federated Multi-Armed Bandit Learning for Caching
6.6 Experiments and Results
6.7 Summary and Conclusion

Chapter 7: Benchmarking UAV Trajectory-Aware Caching Policies in Infrastructure-Less Networks
7.1 Motivation
7.2 Design Objective
7.3 System Model
7.4 Content Request and Provisioning Model
7.5 Trajectory-aware Content Placement Planning for Ferrying UAVs
7.6 Content Dissemination Performance and Experimental Results
7.7 Summary and Conclusion

Chapter 8: Top-k Multi-Armed Bandit Learning for Trajectory-Aware Caching in Swarms of Micro-UAVs
8.1 Motivation
8.2 Design Objective
8.3 System Model
8.4 Caching based on Content Pre-loading at A-UAVs
8.5 Decentralized Caching with Multi-Armed Bandit
8.6 Experimental Results and Content Dissemination Performance
8.7 Summary and Conclusion

Chapter 9: Federated Multi-Armed Bandit Learning for Trajectory-aware Caching Policy in Content Dissemination System using Swarm of UAVs
9.1 Motivation
9.2 Design Objective
9.3 System Model
9.4 Content Caching Problem Formulation
9.5 Benchmark Caching Policy with A-Priori Demand Knowledge
9.6 Federated Multi-Armed Bandit Learning for Content Caching
9.7 Experimental Results and Content Dissemination Performance
9.8 Conclusion

Chapter 10: Conclusions and Future Works
10.1 Key Findings and Design Guidelines
10.2 Future Directions

BIBLIOGRAPHY

Chapter 1: Introduction

1.1 Background and Motivation

Catastrophic events, including natural disasters such as earthquakes and floods as well as human-induced crises such as wars, have profound impacts on both the physical landscape and critical human-made infrastructure, notably communication systems. In the aftermath of such events, the collapse of conventional communication networks can significantly hinder disaster response and relief efforts [1], [2]. Such situations often cut communities off from crucial information about disaster dynamics, relief operations, weather conditions, and rehabilitation efforts. Access to this information is essential, and sometimes lifesaving, for the affected communities [3], [4].

This thesis targets scenarios in which a disaster- or war-stricken population is forced into multiple clusters of isolated communities with diverse information needs. The diversity, or heterogeneity, of these needs is reflected in the varying popularity of requested content across communities, which depends on each community’s proximity to the disaster and the users’ geo-temporal context. For instance, a community close to a fire might prioritize information about nearby fire stations, whereas one farther away may focus on transportation for evacuation. Additionally, the expected Quality-of-Service, measured by tolerable access delay (𝑇𝐴𝐷) [5], [6], [7], varies across content depending on the urgency and type of information needed.

Numerous studies have explored Device-to-Device (D2D) communication [8], [9] and ad hoc networks [10], [11] as ways to bridge gaps in communication infrastructure, although these approaches are limited when the infrastructure collapses completely. Delay Tolerant Networks (DTNs) have been proposed to facilitate content transfer across fragmented communities [12], [13], [14], [15], [16]; they address routing delays but often neglect the challenges of data caching and the effects of device mobility on caching efficiency.

Moreover, ensuring consistent content availability, given the wide variance in user request patterns, remains an open challenge. Some strategies employ function approximation to predict request dynamics, an approach that depends on extensive data collection and may not suit the urgent nature of information dissemination in crises.

This thesis highlights the potential of Unmanned Aerial Vehicles (UAVs) as alternative platforms for content delivery, leveraging their mobility while working within the constraints of limited storage, energy, and flight duration [17], [18], [19]. These insights aim to refine the understanding and management of content dissemination in disaster-affected areas and emphasize the need for new solutions that ensure timely and reliable information access under challenging conditions.

1.2 UAVs and Micro-UAVs for Content Provisioning

This research introduces a family of caching solutions that employ trajectory-aware Unmanned Aerial Vehicles (UAVs) to facilitate the flow of information in areas where traditional communication infrastructure has been compromised by disasters or conflicts. The solution adopts a two-tier system that leverages anchor UAVs (A-UAVs) equipped with high-cost vertical connectivity, such as satellite links [20], [21], [22], and a network of micro-ferrying UAVs (MF-UAVs) [23], [24], [25] that operate without such links. The MF-UAVs are pivotal in transferring and disseminating content among A-UAVs, making content widely accessible across fragmented communities without requiring every UAV to have a direct vertical connection.

The main goal is to develop content caching and downloading strategies tailored to the storage limitations of the A-UAVs and MF-UAVs, the diverse content demands of the communities, and the distribution of content requests. A significant focus is placed on analyzing how the trajectories of MF-UAV fleets affect content availability for these segmented groups. The intention is to extend content reach within these communities by overcoming the obstacles of restricted connectivity.

By leveraging this two-tier strategy, the thesis aims to identify efficient methods for distributing content. It accounts for the varying demands for information, the sizes of MF-UAV fleets, and their storage capacities in order to maximize content access for isolated groups. This approach addresses the immediate need for reliable information in crisis situations and also aims to set a benchmark for content delivery mechanisms in challenging environments.

1.2.1 Advantages of Employing UAVs

Using UAVs for content provisioning in scenarios where communication infrastructure is absent or damaged offers several advantages:

a) Rapid Deployment: UAVs can be quickly deployed to areas lacking communication infrastructure, enabling swift establishment of temporary communication networks. This is particularly beneficial in disaster-hit areas where existing infrastructure has been destroyed, or in remote regions that lack such facilities.

b) Flexibility and Scalability: UAVs offer flexible and scalable solutions for content delivery. They can be used in a variety of scenarios, ranging from small-scale deployments that cover specific areas to larger networks of multiple UAVs that work together to cover wider regions.

c) Cost-Effectiveness: Compared to constructing traditional communication infrastructure, UAVs are a cost-effective solution, especially in hard-to-reach areas. They eliminate the need for physical infrastructure such as towers and cables, reducing both initial setup costs and ongoing maintenance expenses.

d) Dynamic Network Topology: UAVs can dynamically adjust their positions based on the demand for communication services, optimizing network performance and coverage. This adaptability enables efficient use of resources and improved connectivity in areas where user density fluctuates.

e) Improved Accessibility: By providing overhead communication links, UAVs can make content and internet services more accessible to remote and underserved communities. This helps bridge the digital divide and promotes equal access to information.

f) Enhanced Data Collection and Distribution: UAVs equipped with sensors and cameras can collect and distribute a wide range of data, including live video feeds, which can be crucial for surveillance, environmental monitoring, and disaster management.

1.2.2 Challenges Faced in UAV-based Content Provisioning

Deploying UAVs for content delivery in areas without traditional communication infrastructure raises specific challenges, notably power management and operational efficiency. These UAVs are constrained by limited flight duration and operational range, both of which are critical to their ability to deliver content over extended distances or periods. The power required for operations such as downloading and transmitting content, receiving content requests, and maintaining basic flight is significant. Each of these operations depletes the UAV’s battery, and energy is consumed in both the flight and idle states [18], [26], [27], [28], [29].
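The per-operation energy accounting described above can be sketched as follows. This is a minimal illustration, assuming each operating state draws a constant power and that radio states add on top of propulsion; the state names and every power and time value are hypothetical placeholders, not measurements or models from this thesis.

```python
# Hypothetical constant power draw (watts) per UAV operating state.
# Illustrative placeholders only; not measured values from this thesis.
POWER_W = {
    "flight": 150.0,   # propulsion while moving
    "idle": 100.0,     # propulsion while hovering in place
    "download": 5.0,   # radio: fetching content
    "transmit": 10.0,  # radio: sending content to users
    "receive": 2.0,    # radio: listening for content requests
}

def energy_used_wh(seconds_in_state):
    """Energy (watt-hours) consumed, given seconds spent in each state.

    Radio states are counted on top of the propulsion states, since a UAV
    communicates while it flies or hovers.
    """
    joules = sum(POWER_W[state] * secs for state, secs in seconds_in_state.items())
    return joules / 3600.0

def remaining_battery_wh(capacity_wh, seconds_in_state):
    """Battery energy left after the given activity, floored at zero."""
    return max(0.0, capacity_wh - energy_used_wh(seconds_in_state))

one_hour = {"flight": 1800, "idle": 1800, "download": 300, "transmit": 600, "receive": 600}
print(round(energy_used_wh(one_hour), 2))               # 127.42
print(round(remaining_battery_wh(200.0, one_hour), 2))  # 72.58
```

Changing the placeholder values shows how each state's share of the battery budget shifts.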

A crucial aspect of using UAVs for content distribution is balancing power consumption against operational effectiveness. Downloading and transmitting content to users, for instance, requires careful management of power use, especially since transmission requires substantially more power than receiving requests. This discrepancy stems largely from the physics of signal transmission: ensuring that a signal reaches the receiver with sufficient strength to be decoded correctly demands a higher energy output. Because power consumption grows with transmission power, communication requirements must be carefully balanced against the operational longevity of the UAVs. Improving one can significantly degrade the other, potentially reducing the effectiveness of content distribution efforts.
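To make the transmit-versus-receive asymmetry concrete, the sketch below uses the Friis free-space equation, P_r = P_t G_t G_r (λ / 4πd)², to compute the minimum transmit power that delivers a target received power at distance d. Free-space propagation, unit antenna gains, the 2.4 GHz carrier, and the -80 dBm target are assumptions chosen for illustration; real air-to-ground channels add further losses, and this is not the energy model used in this thesis.

```python
import math

SPEED_OF_LIGHT = 3.0e8  # m/s

def dbm_to_w(dbm):
    """Convert a power level in dBm to watts."""
    return 10 ** (dbm / 10) / 1000.0

def required_tx_power_w(rx_power_w, distance_m, freq_hz, gain_tx=1.0, gain_rx=1.0):
    """Minimum transmit power under the Friis free-space model.

    Friis: P_r = P_t * G_t * G_r * (wavelength / (4 * pi * d))**2, solved for P_t.
    """
    wavelength = SPEED_OF_LIGHT / freq_hz
    path_gain = (wavelength / (4 * math.pi * distance_m)) ** 2
    return rx_power_w / (gain_tx * gain_rx * path_gain)

p_rx = dbm_to_w(-80.0)  # assumed receiver sensitivity
near = required_tx_power_w(p_rx, 100.0, 2.4e9)
far = required_tx_power_w(p_rx, 200.0, 2.4e9)
print(round(far / near, 6))  # 4.0
```

Because the required power grows with the square of distance in this model, serving users farther from the UAV costs disproportionately more transmit energy.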

Furthermore, to maintain uninterrupted content provisioning, the aim should be to maximize the on-time of the UAVs. However, UAV operation incurs inherent energy expenses, such as those cited above. One straightforward approach is to minimize communication energy expenditure, which includes the costs of content download, transmission, and reception. Such an approach depends heavily on the average data consumption rate of individual users. Service providers have reported an average monthly data consumption of 25-30 gigabytes per user [30], [31], [32]. With an average population density of 10⁵ users/sq-mile [33], [34], the daily data requirement per square mile can reach 80-100 terabytes. Such requirements can partly be addressed by installing Non-Volatile Memory Express solid-state drives (NVMe SSDs) [35], [36] on UAVs, which can store content volumes of this magnitude. Nevertheless, the communication energy spent storing and replacing content on these large storage devices can still deplete the UAVs’ batteries. Worse, as the storage capacity of such drives increases, the communication energy expenditure scales with it, leading to even faster battery depletion.
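The daily estimate above follows from simple arithmetic. This sketch assumes a 30-day month, decimal units (1 TB = 1000 GB), and a density of 10⁵ users per square mile; the function name is illustrative.

```python
def daily_demand_tb(monthly_gb_per_user, users_per_sq_mile, days_per_month=30):
    """Daily data demand per square mile, in terabytes (1 TB = 1000 GB)."""
    daily_gb_per_user = monthly_gb_per_user / days_per_month
    return daily_gb_per_user * users_per_sq_mile / 1000.0

low = daily_demand_tb(25, 1e5)   # about 83.3 TB per day
high = daily_demand_tb(30, 1e5)  # 100 TB per day
print(f"{low:.1f}-{high:.1f} TB per square mile per day")  # 83.3-100.0 TB per square mile per day
```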

This limitation requires content to be intelligently spread across UAVs. Moreover, because the total storage capacity of the UAV network cannot hold all the content required now and in the future, efficient content management strategies are needed.

1.3 Advantages of Proactive Caching over Traditional Caching Techniques

Proactive caching is an advanced strategy that improves data storage and access beyond conventional methods such as LRU, FIFO, and LFU [37], [38], [39], and it is particularly valuable for UAVs given their limited storage and battery capacity. Unlike traditional caching, which typically evicts the oldest or least-used content, proactive caching anticipates future demand by predicting which content will be needed soon. This foresight allows UAVs to prioritize and store only the most relevant information, optimizing resource use and improving content delivery efficiency. Key to this approach is the use of data on content popularity and request rates, which enables the system to dynamically adjust its cache to meet anticipated needs. As a result, high-priority content is ready for users when requested, maximizing cache space efficiency and minimizing latency.
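The contrast with reactive eviction can be illustrated on a toy trace. The sketch compares LRU against a proactive cache pre-loaded with the most popular items. It assumes popularity is known exactly (here, counted from the trace itself), which stands in for the predicted or learned popularity a real system would use; the trace, capacity, and function names are hypothetical.

```python
from collections import Counter, OrderedDict

def lru_hits(requests, capacity):
    """Count cache hits under Least-Recently-Used eviction."""
    cache = OrderedDict()
    hits = 0
    for item in requests:
        if item in cache:
            hits += 1
            cache.move_to_end(item)        # mark as most recently used
        else:
            cache[item] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used item
    return hits

def proactive_hits(requests, capacity, popularity):
    """Count hits for a cache pre-loaded with the `capacity` most popular items."""
    cached = set(sorted(popularity, key=popularity.get, reverse=True)[:capacity])
    return sum(1 for item in requests if item in cached)

trace = ["a", "b", "a", "c", "a", "d", "b", "a", "e", "b"]
print(lru_hits(trace, 2))                         # 2
print(proactive_hits(trace, 2, Counter(trace)))   # 7
```

LRU keeps evicting the popular items whenever one-off requests arrive, whereas the pre-loaded cache serves every request for the two most popular items, including their first requests.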

1.4 Cooperative Federated Reinforcement Learning-based Techniques

Building on these principles, machine learning (ML) techniques such as reinforcement learning (RL) [40], [41] and Multi-Armed Bandit (MAB) algorithms [42], [43], [44] allow UAVs to develop caching policies that adapt in real time without prior knowledge of content popularity or request rates. These ML strategies enable UAVs to balance exploring new caching choices against exploiting effective existing ones, thereby optimizing long-term objectives such as lower latency and higher content accessibility. This adaptive framework allows UAVs to tailor caching decisions dynamically to fluctuating network conditions and user demands, using trial and error to refine strategies based on direct feedback from the environment.

[Figure 1.1 legend: Anchor UAV; Micro-Ferrying UAV; MF-UAV Trajectory; User Community; Satellite Link; MF-UAV and A-UAV information sharing via lateral link; Communication Infrastructure Destruction]

Figure 1.1. Coordinated UAV system for content dissemination in environments without communication infrastructure
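As an illustration of the exploration-exploitation trade-off described above, the following is a minimal epsilon-greedy bandit in which each arm is a content item and the reward is 1 when the cached item is requested. The request model (fixed per-item request probabilities hidden from the learner), the value of ε, and the function name are assumptions for illustration; this is a generic bandit, not the specific algorithm developed in later chapters.

```python
import random

def epsilon_greedy_cache(request_prob, rounds, epsilon=0.1, seed=0):
    """Learn which single content item to cache via epsilon-greedy exploration.

    request_prob[i] is the probability (unknown to the learner) that item i is
    requested in a round; caching item i yields reward 1 if it is requested.
    """
    rng = random.Random(seed)
    n_items = len(request_prob)
    counts = [0] * n_items       # times each item was cached
    values = [0.0] * n_items     # running mean reward per item
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_items)                        # explore
        else:
            arm = max(range(n_items), key=lambda i: values[i])  # exploit
        reward = 1.0 if rng.random() < request_prob[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
    return values, counts

values, counts = epsilon_greedy_cache([0.1, 0.8, 0.3], rounds=5000)
print(counts)  # pulls are expected to concentrate on item 1, the most requested
```

A larger ε explores more and adapts faster when popularity shifts, at the cost of caching less popular items more often.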

Furthermore, cooperative learning algorithms strengthen collaborative caching within UAV networks. In this setting, multiple UAVs or nodes share insights to refine caching decisions network-wide, which is crucial for coordinated operations such as surveillance or content delivery. Federated Learning (FL) [45], [46], [47] is a key technique in this context: it supports distributed learning and decision-making that continuously refine caching strategies based on collective data, thereby improving content availability and user experience.

These adaptive learning mechanisms allow UAVs to adjust caching policies on the fly in response to varying user demands and network conditions. By combining the exploration capabilities of MAB [48], [49] with the distributed intelligence of FL [50], [51], [52], UAV networks can achieve greater resource efficiency and content delivery performance. This blend of proactive and cooperative learning equips UAVs to continuously evolve their caching strategies, improving cache space utilization and user experience by minimizing content access delays. This continuous improvement cycle helps UAVs anticipate and meet changing content needs, enhancing the overall efficiency and responsiveness of the network.

1.5 Learning Caching Policies and Enhancing UAV-aided Content Dissemination in Disaster-Affected Areas

In UAV-enhanced communication systems, Anchor UAVs (A-UAVs) with advanced communication technology such as satellite links work alongside Micro-Ferrying UAVs (MF-UAVs) to bridge gaps in disaster-struck areas and ensure that critical data reaches isolated communities. This network relies on content caching, especially within A-UAVs, which employ Smart Cache Duplication to optimize their storage by prioritizing essential data. This strategy faces challenges such as accurately predicting demand and adapting to the changing nature of emergencies, in which information priorities can shift quickly.

The variability of information needs across communities, where the urgency and type of required information can differ significantly, introduces further complexity. Quality of Service (QoS) requirements such as Tolerable Access Delay (𝑇𝐴𝐷) add another layer, since user expectations for prompt content delivery influence caching decisions. Traditional methods struggle in such scenarios because content popularity and delay tolerances are difficult to forecast.
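One simple way to express the 𝑇𝐴𝐷 requirement: a request counts as satisfied only if its content is delivered within that content's tolerable access delay. The sketch below computes the satisfied fraction over a list of requests; the tuple layout, content names, and delay values are hypothetical examples, not data from this thesis.

```python
from typing import List, Optional, Tuple

# Each request: (content_id, tolerable_access_delay_s, access_delay_s or None if never delivered)
Request = Tuple[str, float, Optional[float]]

def tad_satisfaction(requests: List[Request]) -> float:
    """Fraction of requests delivered within their tolerable access delay (TAD)."""
    if not requests:
        return 0.0
    satisfied = sum(
        1 for _, tad, delay in requests if delay is not None and delay <= tad
    )
    return satisfied / len(requests)

example = [
    ("fire_station_map", 60.0, 45.0),     # urgent content, delivered in time
    ("evacuation_bus", 600.0, 900.0),     # delivered, but after its TAD
    ("weather_update", 1800.0, None),     # never delivered
    ("relief_schedule", 3600.0, 1200.0),  # delivered in time
]
print(tad_satisfaction(example))  # 0.5
```

Under this metric, a late delivery counts the same as no delivery, which is why the caching decision must account for both what is popular and how quickly it is needed.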

To navigate these challenges, this thesis integrates the Top-k Multi-Armed Bandit (MAB) algorithm, a technique that permits UAVs to refine caching strategies based on observed demand. This adaptive method moves beyond static pre-loaded caches to a dynamic model that aligns with real-time needs and accommodates diverse and evolving content requirements and 𝑇𝐴𝐷 expectations. By adopting a multi-dimensional reward mechanism, the Top-k MAB approach enables UAVs to continuously update their caching policies, improving content availability and meeting varying QoS demands. This strategy provides a tailored, responsive approach to content delivery in UAV-supported networks, enhancing the relevance and timeliness of the information provided to affected communities.
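The idea of refining a cache of k items from observed demand can be illustrated with a small simulation. The sketch below is a minimal illustration, assuming pure UCB selection (the thesis pairs UCB with ε-greedy exploration), a reward that scales the share of requests a cached content serves by its 𝑇𝐴𝐷 slack, and a Zipf-like request generator. The names `top_k_ucb`, `reward`, `run`, the weights `w_hit`/`w_delay`, and the uniform delay model are all hypothetical choices made for illustration, not the thesis's exact formulation.

```python
import math
import random


def top_k_ucb(q, n, t, k, c=2.0):
    """Indices of the k contents with the largest UCB index q[i] + bonus.

    Untried contents (n[i] == 0) get an infinite index, so every content is
    cached at least once before the exploration bonus takes over.
    """
    def ucb(i):
        if n[i] == 0:
            return math.inf
        return q[i] + math.sqrt(c * math.log(t) / n[i])
    return sorted(range(len(q)), key=ucb, reverse=True)[:k]


def reward(hits, requests, delay, tad, w_hit=0.7, w_delay=0.3):
    """Hypothetical 2-D reward: share of requests served, scaled by TAD slack."""
    share = hits / requests
    slack = max(0.0, 1.0 - delay / tad)  # 1 if instant, 0 once delay >= TAD
    return share * (w_hit + w_delay * slack)


def run(num_contents=20, k=4, rounds=2000, reqs_per_round=30, tad=10.0, seed=0):
    rng = random.Random(seed)
    # Zipf-like popularity that the learner never observes directly.
    weights = [1.0 / (i + 1) for i in range(num_contents)]
    q = [0.0] * num_contents  # estimated value of caching each content
    n = [0] * num_contents    # number of rounds each content was cached
    for t in range(1, rounds + 1):
        cache = top_k_ucb(q, n, t, k)
        reqs = rng.choices(range(num_contents), weights=weights, k=reqs_per_round)
        for i in cache:  # semi-bandit feedback: only cached contents are scored
            delay = rng.uniform(0.0, 2 * tad)  # hypothetical delivery delay
            r = reward(reqs.count(i), reqs_per_round, delay, tad)
            n[i] += 1
            q[i] += (r - q[i]) / n[i]  # incremental sample mean
    return top_k_ucb(q, n, rounds + 1, k, c=0.0)  # greedy final cache
```

Calling `run()` returns the k content indices the learner would cache after the simulated rounds; with a popularity profile this skewed, the most requested contents should dominate the result even though the learner starts with no popularity knowledge.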

While the Top-k MAB approach offers significant improvements, certain limitations remain, particularly in large-scale scenarios with sparse user activity. Relying on MF-UAVs to relay content availability information across distant regions slows down the learning process, especially when information must traverse multiple communities. Additionally, less frequently requested content often suffers from unreliable popularity estimates, which reduces the effectiveness of caching decisions. These issues are compounded as the content pool grows, making it harder to maintain stable and timely learning. To address these challenges, this thesis introduces a Federated Multi-Armed Bandit (FedMAB) framework, in which A-UAVs collaboratively refine their caching strategies by sharing learned models rather than raw data. This cooperation accelerates learning, enhances the stability of caching decisions, and ensures that even low-demand content receives fair consideration, ultimately improving the overall responsiveness and equity of information delivery in critical situations.
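Sharing learned models instead of raw data can be pictured as each A-UAV blending its own MAB value table with the tables received from peers. The sketch below is a minimal illustration, assuming each model is a per-content Q-vector and that a peer's weight shrinks as its softmax-normalized popularity profile diverges (in KL divergence) from the local one. The weight rule `1 / (1 + KL)`, the temperature, and the function names are hypothetical; they show one possible form of divergence-based weighted aggregation, not the thesis's exact rule.

```python
import math


def softmax(values, temp=1.0):
    """Turn a Q-vector into a probability-like popularity profile."""
    m = max(values)
    exps = [math.exp((v - m) / temp) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def divergence_weighted_aggregate(local_q, peer_qs, temp=0.1):
    """Blend a local Q-vector with peer Q-vectors.

    Each model's weight is 1 / (1 + KL(local || model)) on softmax profiles,
    so peers with very different popularity profiles contribute less. The
    local model has zero divergence from itself and therefore weight 1.
    """
    p_local = softmax(local_q, temp)
    models = [local_q] + peer_qs
    weights = [1.0 / (1.0 + kl(p_local, softmax(m, temp))) for m in models]
    total = sum(weights)
    return [sum(w * m[i] for w, m in zip(weights, models)) / total
            for i in range(len(local_q))]


local = [0.9, 0.5, 0.1]
similar_peer = [0.85, 0.55, 0.1]
dissimilar_peer = [0.1, 0.5, 0.9]
print(divergence_weighted_aggregate(local, [similar_peer, dissimilar_peer]))
```

The design choice here is that a peer serving a community with a very different demand profile (the non-IID case) is down-weighted rather than excluded, so the local model still benefits from estimates of contents it has rarely observed.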

1.6 Strategic Joint Deployment of UAVs for Cache Space Utilization

In UAV content dissemination networks, operating micro-ferrying UAVs (MF-UAVs) independently without collaboration can lead to inefficiencies, notably content duplication across UAVs. This redundancy hampers cache utilization: UAVs may carry identical data while other crucial information remains uncached. Such an approach not only wastes cache space but also fails to address diverse user needs effectively.

Such redundancy also undermines the efficiency of Anchor UAVs (A-UAVs), which are crucial because of their larger cache capacities and broader communication capabilities. When MF-UAVs carry duplicated content, the A-UAVs' potential to distribute a varied range of information, which is especially critical in emergencies, is not fully leveraged.

Implementing a coordinated UAV deployment with a unified caching strategy can rectify these issues. By aligning caching decisions among MF-UAVs, the network makes better use of cache space and ensures that a broader variety of content is distributed efficiently. This collective approach eliminates unnecessary duplication and allows strategic content allocation based on user demand, thereby improving information availability and network performance in crisis scenarios.
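The effect of aligning caching decisions can be seen by counting how many distinct contents a group of MF-UAVs carries. The sketch below is a minimal illustration, assuming every UAV has the same capacity and that contents are already ranked by estimated demand; the round-robin assignment and the function names are hypothetical, and the sketch deliberately ignores Smart Cache Duplication, which would keep extra copies of essential content.

```python
def independent_caching(ranked_contents, num_uavs, capacity):
    """Every MF-UAV greedily caches the same top-`capacity` contents."""
    return [list(ranked_contents[:capacity]) for _ in range(num_uavs)]


def coordinated_caching(ranked_contents, num_uavs, capacity):
    """Round-robin the top num_uavs * capacity contents across UAVs,
    so no content is duplicated and popular items are spread out."""
    caches = [[] for _ in range(num_uavs)]
    for rank, content in enumerate(ranked_contents[: num_uavs * capacity]):
        caches[rank % num_uavs].append(content)
    return caches


def distinct_count(caches):
    """Number of distinct contents available across all UAV caches."""
    return len({c for cache in caches for c in cache})


ranked = list(range(100))  # content IDs ordered by estimated demand
print(distinct_count(independent_caching(ranked, num_uavs=3, capacity=5)))  # 5
print(distinct_count(coordinated_caching(ranked, num_uavs=3, capacity=5)))  # 15
```

With three UAVs holding five items each, independent caching exposes only five distinct contents, whereas the coordinated assignment exposes fifteen using the same total cache space.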

1.7 Trajectory-aware Collaborative Caching using Swarm of Micro-Ferrying UAVs

Deploying UAVs strategically for content caching is essential for enhancing the functionality of UAV-supported networks. The collaboration between Anchor UAVs (A-UAVs) and Micro-Ferrying UAVs (MF-UAVs) is vital, with A-UAVs acting as central storage hubs for in-demand content and MF-UAVs distributing this content. Applying Top-k Multi-Armed Bandit learning at A-UAVs enables UAVs to adjust their caching strategies in real time to meet user demand. This method considers the diverse urgency and importance of content requests, along with the content distribution patterns between UAVs, to optimize content availability.

Nonetheless, these strategies face challenges, such as the risk of content duplication among MF-UAVs, which can waste valuable cache space. Moreover, without tailoring caching policies to Quality of Experience (QoE) requirements, UAVs may indiscriminately cache identical content, ignoring the specific needs of their target areas.

An evolved approach, Top-k Multi-Armed Bandit Learning with Selective Caching, addresses these issues by caching content selectively, taking into account what is already stored across the network to prevent redundancy. This selective strategy ensures varied and efficient cache usage that aligns more closely with user community demands. Continuous adaptation and learning from environmental interactions allow highly relevant content to be prioritized, thereby enhancing network performance. As a result, content is available where and when it is needed, sustaining a high quality of service and maximizing the UAV network's impact.
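Taking into account what is already stored across the network can be reduced to a simple rule for one MF-UAV. The sketch below is a minimal illustration, assuming the MF-UAV knows the cache of the MF-UAV preceding it on the same trajectory and the time gap between their visits to a community. The rule used here (skip the predecessor's contents when that gap is within 𝑇𝐴𝐷, since such requests can still be met within the tolerable delay) is a simplified, hypothetical version of trajectory-aware selective caching, and the function and parameter names are illustrative only.

```python
def selective_cache(ranked_contents, capacity, predecessor_cache,
                    visit_gap, tad):
    """Fill one MF-UAV cache in order of estimated demand.

    If the preceding MF-UAV visits the same community no more than `tad`
    time units apart (hypothetical rule), its contents are treated as
    already available and skipped, freeing slots for other contents.
    """
    skip = set(predecessor_cache) if visit_gap <= tad else set()
    cache = []
    for content in ranked_contents:
        if len(cache) == capacity:
            break
        if content not in skip:
            cache.append(content)
    return cache


ranked = list(range(10))  # content IDs ordered by estimated demand
print(selective_cache(ranked, 3, [0, 1, 2], visit_gap=4.0, tad=5.0))  # [3, 4, 5]
print(selective_cache(ranked, 3, [0, 1, 2], visit_gap=8.0, tad=5.0))  # [0, 1, 2]
```

When the predecessor is close enough in time, the MF-UAV moves on to the next most demanded contents instead of carrying a redundant copy; when it is too far apart, the MF-UAV falls back to caching the most popular contents itself.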

However, introducing federated learning presents a new challenge. While FedMAB improves adaptability through collaborative model updates, its aggregation process can inadvertently reduce the effectiveness of selective caching. This creates a trade-off between global coordination and localized efficiency, in which selective strategies risk being overshadowed by uniform model consensus. This thesis examines this tension and proposes a latency-aware coordination mechanism that preserves the benefits of selective caching without compromising the collaborative strengths of federated learning. By aligning the learning dynamics with operational constraints, the proposed framework supports balanced, efficient, and context-sensitive content dissemination across the UAV network. This thesis also attempts to convert this reactive caching policy into a proactive one by incorporating crowd-counting algorithms. This method uses advanced computer vision techniques to estimate population density, which can compensate for weak popularity estimates caused by request sampling discrepancies.
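One way a population estimate can compensate for sparse request samples is to shrink the observed request share toward a prior and then scale it by the crowd count. The sketch below is a minimal illustration of that idea using a Beta-prior posterior mean; the prior share, the pseudo-count `n0`, and the function names are hypothetical assumptions, not the crowd-counting integration developed in this thesis.

```python
def estimated_share(observed, total_observed, prior_share, n0=50.0):
    """Posterior-mean request share under a Beta prior worth n0 pseudo-requests.

    With few observed requests the estimate stays near prior_share; as
    total_observed grows, the observed frequency dominates.
    """
    return (observed + n0 * prior_share) / (total_observed + n0)


def expected_demand(observed, total_observed, prior_share, crowd_count, n0=50.0):
    """Scale the smoothed share by the population estimated from images."""
    return crowd_count * estimated_share(observed, total_observed, prior_share, n0)


# Sparse sampling: 2 of only 3 requests asked for this content.
print(estimated_share(2, 3, prior_share=0.1))
# Dense sampling: 200 of 300 requests asked for it.
print(estimated_share(200, 300, prior_share=0.1))
# Demand estimate for a community where crowd counting finds about 400 people.
print(expected_demand(2, 3, prior_share=0.1, crowd_count=400))
```

With three requests the raw share of two thirds is pulled strongly toward the prior, while with three hundred requests the observed frequency largely survives, so an unreliable estimate from a handful of requests no longer dictates the caching decision on its own.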

[Figure 1.2 diagram. Group labels: Characterization of a Network of UAVs; Adaptive Learning-based Caching Policies; Trajectory-aware Learning for Caching Policies. Chapter blocks: Chapter-3 UAV-centric Content Caching Architecture; Chapter-4 Demand Heterogeneity Characterization; Chapter-5 Multi-Armed Bandit Learning for Content Provisioning; Chapter-6 Distributed Federated-Multi-Armed Bandit Learning for Content Management; Chapter-7 Content Placement Planning for UAV Trajectory-aware Caching Policy (Joint Deployment of UAVs); Chapter-8 Trajectory-aware Content Dissemination in Swarm of Micro-UAVs using Top-k Multi-Armed Bandit; Chapter-9 Federated Multi-Armed Bandit Learning for Trajectory-aware Caching Policy; Chapter-10 (Future Work 1) Crowd Estimation-based Context-Aware Caching Using Bandits; Chapter-10 (Future Work 2) Preemptive Caching at UAVs using Large Language Models.]

Figure 1.2. Thesis Organization (The grey blocks are the works that this thesis has achieved, and the orange blocks are the future directions)

1.8 Dissertation Objectives

The objectives of the dissertation are multifaceted and aim to enhance UAV-based content provisioning systems through advanced algorithmic frameworks:

1- The dissertation characterizes a multi-tiered UAV-aided framework specifically tailored for managing content dissemination in disaster-affected areas. This framework is designed to cope with variability in content demand, ensuring that all users receive timely and relevant information. An essential element is enhancing the caching decisions of UAVs by leveraging knowledge from the UAVs of adjacent communities with varying user demands, ensuring an uninterrupted flow of information.

2- It develops a Federated Multi-Armed Bandit Learning-based framework to enhance content delivery via on-the-fly learning of caching policies for UAV-based systems. The intention is twofold: first, to maximize content availability, so that users have access to the most pertinent information when needed; and second, to minimize the amount of downloaded content, improving the efficiency of the network.

3- A core component of this research is the creation and thorough examination of a joint deployment strategy for UAV networks. The strategy aims to increase content availability across the network, substantially reduce periods when content is unavailable, and increase the diversity of content within the UAV-aided caching system.

4- Furthermore, the dissertation explores the construction of an adaptive learning-based framework with an integrated Selective Caching method. This approach is sensitive to UAV trajectories and QoS parameters and focuses on elevating content availability, limiting access delays, reducing redundancy in content replication, and ensuring that the cached content sequence mirrors the benchmark for optimal caching sequences.

5- Finally, this thesis also attempts to convert this reactive learning-based caching policy into a proactive caching policy by incorporating crowd-counting algorithms. This addition aids the Federated Multi-Armed Bandit-based caching policy by compensating for weak reward estimates caused by the low request sampling problem.

Together, these objectives aim to deliver a robust, responsive, and efficient content delivery service via UAVs, even in the most challenging environments. Through these learning mechanisms, UAV networks can dynamically adapt to the immediate needs of different user communities, managing cache space effectively to maintain high service quality under the constraints and uncertainties inherent in post-disaster environments.

1.9 Scope of the Dissertation

The main goal of this thesis is to propose a Federated Multi-Armed Bandit Learning-based content dissemination system for cache-enabled Unmanned Aerial Vehicles that ensures content provisioning in communication infrastructure-less scenarios. The proposed methods maximize service provisioning performance in terms of content availability, content access delay, content low availability period, and cached content sequence similarity. The organization of the thesis is as follows:

This thesis presents a comprehensive exploration of UAV-centric content caching architectures aimed at enhancing communication in challenging environments. Chapter 2 lays the groundwork by reviewing relevant literature, highlighting key developments and existing strategies in UAV-assisted communication and content caching. In Chapter 3, we delve into the design of a UAV-centric Content Caching Architecture for Communication-challenged Environments, which establishes the core principles behind deploying UAVs to facilitate efficient data delivery where traditional communication infrastructure is lacking or damaged. Chapter 4 expands on this by discussing how to handle demand heterogeneity in UAV-aided content caching, addressing the complexities of diverse content demands across different user groups in such environments. Chapter 5 introduces an approach that employs Multi-Armed Bandit Learning for Content Provisioning in a Network of UAVs, focusing on optimizing content delivery by learning user preferences and demand patterns over time. Chapter 6 focuses on a model sharing strategy that incorporates Federated Learning on bandits, which strengthens the confidence and fairness of the bandit-based caching models across the disaster scenario. In Chapter 7, we explore Content Placement Planning for UAV Trajectory-aware Caching Policy in Infrastructure-less Wireless Networks, aiming to enhance the efficiency of content delivery by optimizing UAV flight paths based on content caching needs and network topology. Chapter 8 further advances this discussion by integrating Top-k Multi-Armed Bandit Learning for Content Dissemination in Swarms of Micro-UAVs with a trajectory-aware selective caching algorithm. This fine-tunes the content delivery process by identifying groups of micro-UAVs traversing in close proximity and prioritizing the contents most in demand. Chapter 9 discusses the hurdles faced when incorporating the Federated Multi-Armed Bandit to learn the caching policy, and presents the algorithmic additions that tackle the issues of model sharing in the presence of the trajectory-aware selective caching algorithm. Finally, Chapter 10 discusses one of the immediate future extensions of this thesis, showing the impact of crowd counting applied to an image dataset of real disaster-affected scenes and explaining the effect of precise crowd estimation on the learning-based caching policy. It also reflects on the research journey, summarizes achievements, and outlines potential future directions, highlighting the scalability of the proposed architectures and algorithms, their adaptability to emerging hurdles, and their potential impact on future UAV-assisted communication systems. Through this organized structure, the thesis systematically addresses the challenges of UAV-based content caching and delivery, offers innovative solutions, and paves the way for future advancements in the field.

Chapter 2: Related Work

This chapter builds upon the foundational motivations outlined in Chapter 1 by offering a structured review of existing literature that informs the development of UAV-assisted content dissemination strategies. It examines five interrelated areas that form the core of this thesis: post-disaster content provisioning frameworks, the role of UAVs and micro-UAVs in disaster response, UAV-based content provisioning through proactive caching, learning-based caching systems for dynamic environments, and recent developments in federated learning and bandit-based reinforcement learning for adaptive, decentralized decision-making.

2.1 Post-Disaster Content Provisioning

In the wake of disasters, whether natural or human-made, efficient content provisioning is crucial for effective response and recovery efforts. This entails not only disseminating vital information to affected populations and first responders but also coordinating among the various agencies and stakeholders involved in the disaster management process [53], [54]. The existing literature on content provisioning frameworks in post-disaster scenarios highlights a variety of approaches, each with its own set of challenges, solutions, and limitations. These include Mobile Ad-Hoc Networks (MANETs), Delay-Tolerant Networks (DTNs), social media platforms, satellite communications, Content Distribution Networks (CDNs), blockchain technology, and the Internet of Things (IoT).

MANETs and DTNs are often highlighted for their ability to provide flexible and resilient connectivity in the absence of traditional communication infrastructure. For example, [55], [56] discuss the design and application of DTNs in challenging communication environments, including disaster-impacted areas. Social media platforms, on the other hand, have been increasingly recognized for their role in disaster communication. The authors of [57] examine how social media is used for emergency management, emphasizing its capacity for rapid information dissemination and public engagement.

Despite their potential, MANET- and DTN-based frameworks face various limitations and can suffer from connectivity and scalability issues. Similarly, social media platforms may struggle with misinformation and information overload. To tackle these shortcomings, IoT applications in disaster management are gaining attention for their ability to provide real-time data and enhance situational awareness. The work in [58] discusses the integration of IoT technologies in emergency response systems, highlighting their potential to improve monitoring, content provisioning, and coordination. This motivates the use of UAVs as a viable content dissemination solution that extends IoT-assisted response systems.

2.2 Use of UAVs/Micro-UAVs for disaster management services

UAVs, especially micro-UAVs, have emerged as crucial tools in managing and mitigating the aftermath of disasters [59]. Their applications span various critical tasks, including aerial surveillance, terrestrial imaging, precision agriculture, and infrastructure inspection in areas struck by calamities [60], [61]. These aerial vehicles excel at gathering real-time data and offer a bird's-eye view of disaster-struck regions, which is instrumental for effective and timely decision-making [23], [24]. The agility and small size of micro-UAVs make them particularly suitable for navigating constrained spaces, enabling assessments in areas that are otherwise inaccessible to traditional disaster response machinery [62].

The utility of UAVs in such scenarios is multifaceted. Primarily, they are deployed for their ability to survey large and hard-to-reach areas quickly and efficiently, providing critical information on the extent of damage, identifying stranded victims, and assessing the needs of affected communities [63]. Their application in precision agriculture, for instance through yield estimation and crop monitoring, underscores their capability in managing resources and assessing environmental conditions, which can be adapted for disaster assessment and recovery efforts [9].

Several studies have focused on optimizing the capabilities of UAVs for enhanced service provision in disaster management. Works such as [64], [65] propose methods to optimize UAV flight paths and schedule communication tasks. These strategies aim to extend service coverage by optimizing UAV hovering times and employing multi-hop relaying between multiple UAVs, including device-to-device (D2D) routing. Such approaches are pivotal for ensuring continuous and effective communication in disaster-affected areas, especially when traditional communication infrastructure is compromised.

Energy-aware strategies, including the use of multi-armed bandit algorithms [66], [67], focus on selecting user hotspots for efficient data transmission while minimizing UAV energy consumption. Additionally, employing multiple UAVs at different altitudes [68] or using dynamic leader selection in a master-slave architecture [62] optimizes both coverage area and energy usage, ensuring longer operational periods in critical situations.

While many studies have emphasized enhancing communication range [69] and optimizing flight paths, there is a noted gap in addressing content placement and caching strategies specific to disaster management scenarios. However, the adoption of solid-state drives (SSDs) for increased caching capacity [70] suggests a direction toward integrating more sophisticated data handling capabilities in UAVs, enhancing their utility in disseminating vital information and services in disaster-struck regions.

Despite these advancements, simply increasing storage space does not directly solve the content availability challenge. A larger storage capacity results in higher energy consumption for downloading and updating content, which requires long-range communication equipment (refer to Section 1.2.2). As the storage space grows, the communication energy expense also scales, reducing the UAV's flight time by consuming energy that could otherwise support flight operations. In contrast, adopting inter-UAV content exchange methods that use short-range communication equipment can significantly reduce communication energy, conserving more power for prolonged flight. This approach not only supports efficient content management but also extends the UAV's operational time, making it a more viable solution for disaster management scenarios where endurance and efficient content delivery are critical. This multifaceted utility underscores the significance of UAVs not only in bridging connectivity gaps but also in ensuring timely, efficient, and secure access to vital content and services in diverse operational contexts.

2.3 UAV-based Content Provisioning via Proactive Caching

In the evolving landscape of Unmanned Aerial Vehicle (UAV) technology, the strategic placement and caching of content have emerged as pivotal components in enhancing the efficacy of UAV-based communication networks, particularly within Internet of Things (IoT) networks and disaster management scenarios. This section reviews the methodologies proposed in the literature for optimizing these aspects and sheds light on their potential to reshape how information is disseminated in critical situations.

The advent of the Named Data Networking (NDN) architecture in IoT networks represents a significant step forward in content distribution. Studies such as [71] illustrate how UAVs can harness this architecture to gather data directly from the field and deliver it efficiently to the intended recipients, thereby avoiding retransmissions and enhancing overall network performance. This approach not only simplifies the data delivery process but also significantly reduces the burden on the network infrastructure.

Further exploring UAV-enabled communication, the research in [72] introduces strategies in which UAVs proactively transmit content to a select group of ground nodes (GNs). These GNs are algorithmically chosen to cooperatively cache the necessary content, forming a broad and efficient distribution network that maximizes accessibility for end users. The probabilistic cache placement technique discussed in [73] further refines this process by enhancing cache hit probabilities, using a homogeneous Poisson Point Process to model the spatial placement of wireless nodes.
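A standard way to reason about probabilistic caching under a homogeneous Poisson Point Process (PPP) is independent thinning: if nodes have intensity λ and each node stores content i with probability p_i, the nodes holding content i again form a PPP of intensity λp_i, so a user finds a copy within radius r with probability 1 − exp(−λp_iπr²). The sketch below computes the resulting hit probability, the sum over contents of q_i(1 − exp(−λp_iπr²)), for a popularity vector q. It is a minimal illustration of this general model, assuming a Zipf popularity profile and per-node capacity expressed as the sum of the p_i; it is not necessarily the exact formulation used in [73], and the names `hit_probability` and `zipf` are illustrative.

```python
import math


def hit_probability(popularity, cache_probs, density, radius):
    """Cache hit probability for a typical user when caching nodes form a
    homogeneous PPP of intensity `density` and each node independently
    stores content i with probability cache_probs[i] (independent thinning).
    """
    area = math.pi * radius ** 2
    return sum(q * (1.0 - math.exp(-density * p * area))
               for q, p in zip(popularity, cache_probs))


def zipf(n, s=0.8):
    """Normalized Zipf popularity profile with exponent s."""
    w = [1.0 / (i + 1) ** s for i in range(n)]
    total = sum(w)
    return [x / total for x in w]


q = zipf(10)
top_two = [1.0, 1.0] + [0.0] * 8  # every node caches the 2 most popular items
spread = [0.2] * 10               # same expected load per node: sum(p) = 2
print(hit_probability(q, top_two, density=0.01, radius=10.0))
print(hit_probability(q, spread, density=0.01, radius=10.0))
```

Comparing the two placements shows why probabilistic placement is attractive: spreading the same per-node capacity over more contents can raise the chance that some nearby node holds the requested item, depending on node density and how skewed the popularity is.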

Addressing the challenges faced by small-cell base stations (SBSs) under heavy data traffic, several studies [74], [75] propose the use of UAVs as a relief mechanism. By caching enhancement-layer information, UAVs can efficiently serve high-definition video streaming requests, while the base-layer information is handled by the SBSs. This dual-layer approach not only alleviates pressure on the SBSs but also incorporates measures for interference management and security against potential eavesdropping, showcasing the multifaceted benefits of integrating UAVs into existing networks.

The optimization of cache placement in areas with high data traffic is another critical area of research. In [76], the authors use greedy algorithms and the Lagrange dual method to strategically determine the content to be cached on UAVs, taking into consideration the dynamic nature of user movement across different coverage areas. This emphasis on adapting to user heterogeneity marks a significant advance in customizing content availability and directly addresses the limitations of temporally static user movement models.

Despite these innovative approaches, a gap remains in effectively maximizing cache capacity within UAV-aided content dissemination networks, especially in scenarios characterized by demand heterogeneity. This gap signifies an area ripe for further exploration, as highlighted by research efforts on traffic offloading methods and learning-based caching strategies. Studies [72], [73], [74], [75] reveal that by considering factors such as content popularity and size, the effective caching capacity of UAVs can be significantly enhanced. This is particularly evident in UAV-enabled small-cell networks, where data traffic is offloaded from SBSs to UAVs, allowing popular content to be proactively cached and delivered directly to users as needed.

However, these mechanisms, while promising for scenarios of partial infrastructure destruction, face limitations when they must serve as fully functional alternatives where communication infrastructures are completely obliterated. Additionally, the reliance of most existing mechanisms on temporally static models of global content popularity [73] fails to capture the real-world complexity and variability of content demands, particularly in disaster scenarios.

This thesis advances the conversation around UAV-based content provisioning systems, focusing on proactive caching [73] as a cornerstone of efficient and resilient communication networks. By critically examining the current state of research and identifying areas for further investigation, this work contributes to the development of more sophisticated, adaptive, and robust UAV-based content distribution frameworks capable of meeting the nuanced demands of modern communication challenges. Through proactive caching and strategic content placement, UAVs hold the promise of transforming information dissemination, particularly in scenarios where traditional communication infrastructures are compromised or entirely absent.

2.4 Learning-based Caching for UAV-based content dissemination system

The integration of Unmanned Aerial Vehicles (UAVs) into content dissemination networks represents a transformative approach to addressing the dynamic needs of modern communication systems. This is particularly important in scenarios characterized by challenging environments and demand heterogeneity. This thesis explores the application of learning-based caching strategies and the joint optimization of caching and trajectory decisions [77], leveraging the agility and flexibility of UAVs to optimize content delivery in various contexts.

Recent studies have proposed methods that combine caching decisions with UAV trajectory planning to minimize content delivery delays and enhance user satisfaction. For instance, the research in [78] introduces a system in which online decisions are made from inputs processed through a Convolutional Neural Network (CNN), while the subsequent caching and trajectory optimizations are performed offline using a Clustering-Based Two Layered (CBTL) algorithm. This dual-phase approach balances immediate decision-making with strategic planning, ensuring a cohesive content distribution strategy.

Further advancements in this field have been demonstrated by [79], where a deep Q-learning-based framework is employed to jointly optimize UAV trajectory and radio resource allocation. This method specifically addresses the complexities of large networks with an extensive range of state-action pairs, underscoring the potential of deep reinforcement learning techniques for navigating the intricacies of UAV-based content dissemination.

However, existing models often overlook the critical factor of content popularity heterogeneity, which varies significantly with the geographic location of users. This oversight limits the applicability of such models in real-world scenarios where user demand and content preferences can shift drastically across different regions. To bridge this gap, our work introduces an approach that incorporates the variability of content popularity into the learning-based caching and decision-making process.

In UAV trajectory control, [80] proposes mechanisms that dynamically adjust UAV missions based on real-time observations, including the decision to continue service delivery along a planned trajectory or to return to a charging station. This adaptability is crucial for maximizing operational efficiency and ensuring uninterrupted service provision. Similarly, the authors of [81] formulate joint optimization problems that seek the most energy-efficient UAV trajectories while managing radio resources and cache replacement.

Despite these technological strides, previous studies have not sufficiently addressed the specific challenges posed by disaster geographies, such as demand heterogeneity and the physical impacts on UAV flight decisions. Our research fills this void by analyzing how disaster-induced variations in geography and user demand affect caching policies and UAV trajectory planning.

Furthermore, the methodologies explored in [76], [78], [79], [80], [81] primarily use long-term estimation techniques, which may not respond adequately to rapid changes in network conditions and user demand. This thesis argues for the development of more responsive and adaptable learning-based caching mechanisms that can swiftly adjust to evolving environmental and network dynamics. Additionally, there is a notable absence of efforts aimed at maximizing cache space utilization and reducing reliance on costly server downloads through direct UAV-to-user content delivery.

To address these shortcomings, our work develops a comprehensive framework that not only considers the heterogeneity of content popularity but also incorporates adaptive learning methods to optimize UAV caching decisions and flight trajectories in real time. By leveraging machine learning algorithms, including reinforcement learning and its stateless variant, Multi-Armed Bandits, we establish a benchmark for evaluating the effectiveness of UAV-based content dissemination strategies.

This framework enhances the efficiency of content delivery networks, particularly in disaster recovery operations and other critical scenarios where traditional communication infrastructure is compromised. Through learning-based caching strategies, our approach offers a more agile, responsive, and user-centric model of UAV content dissemination that can dynamically adjust to the unique demands of diverse geographic and operational contexts.

2.5  Federated, Bandit-Based and Reinforcement Learning for Adaptive UAV Caching

To address the limitations of static or centralized approaches, recent literature has explored learning-based methods that enable UAVs to make caching decisions adaptively in dynamic environments. Reinforcement learning (RL), particularly through deep Q-networks and actor-critic methods, has been employed to jointly optimize content delivery and trajectory planning [82], [83], [84], [85], [86], [87], [88], [89], [90]. These methods offer adaptability but often rely on centralized infrastructure or long convergence periods, which makes them less suitable for disaster scenarios characterized by network volatility and limited infrastructure.

In parallel, Multi-Armed Bandit (MAB) algorithms have been studied for online caching decisions under uncertainty. MAB-based methods treat content selection as a reward-driven exploration-exploitation trade-off. While effective in adapting to changing demand, early implementations typically assumed globally uniform popularity or lacked inter-agent coordination [91], [92], [93], [94], [95], [96], [97], [98], [99]. More recent work has attempted to combine MABs with contextual information, yet challenges remain in achieving scalable coordination and responsiveness across distributed UAV agents.
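The Python sketch below illustrates bandit-driven caching under the following assumptions:

- each content is an arm;
- a cached content earns reward 1 when it serves the round's single request;
- the UAV caches the `cache_size` contents with the highest UCB1 index, a simple top-k extension of UCB1.

The function `ucb_cache`, the popularity vector, and all parameter values are hypothetical. This is not the FedMAB formulation developed later in this thesis.

```python
import math
import random


def ucb_cache(num_contents, cache_size, request_probs, rounds, seed=0):
    """Cache the cache_size contents with the highest UCB1 index each round.

    A cached content earns reward 1 if it serves that round's request, else 0.
    """
    rng = random.Random(seed)
    counts = [0] * num_contents    # rounds in which each content (arm) was cached
    values = [0.0] * num_contents  # running mean reward (empirical hit rate)

    def ucb_index(arm, t):
        if counts[arm] == 0:
            return math.inf  # force every content to be tried at least once
        return values[arm] + math.sqrt(2.0 * math.log(t) / counts[arm])

    for t in range(1, rounds + 1):
        ranked = sorted(range(num_contents), key=lambda a: ucb_index(a, t), reverse=True)
        cached = ranked[:cache_size]
        # One request drawn from the popularity profile, which the learner does not know.
        request = rng.choices(range(num_contents), weights=request_probs)[0]
        for arm in cached:
            reward = 1.0 if arm == request else 0.0
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return values, counts


probs = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
values, counts = ucb_cache(len(probs), cache_size=2, request_probs=probs, rounds=2000)
```

The exploration bonus shrinks as a content is cached more often. The cache therefore converges toward the most frequently requested contents while still occasionally sampling the others.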

Furthermore, federated learning has gained traction as a way to address scalability and privacy constraints in decentralized environments [100], [101], [102], [103], [104]. In the federated paradigm, each processing node independently learns a local caching policy and periodically contributes to a shared global model through model aggregation rather than raw data exchange. This is particularly beneficial in disaster zones, where bandwidth and power constraints make centralized updates infeasible.
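The aggregation step can be sketched as follows. The sketch assumes that:

- each UAV's local model is a vector of per-content scores;
- the aggregator forms a sample-weighted average in the style of federated averaging (FedAvg).

Only these parameter vectors are exchanged; raw request logs stay on each UAV. The function name and the example values are hypothetical.

```python
from typing import List, Sequence


def federated_average(local_models: Sequence[Sequence[float]],
                      num_samples: Sequence[int]) -> List[float]:
    """Aggregate local parameter vectors, weighting each by its local sample count."""
    if not local_models or len(local_models) != len(num_samples):
        raise ValueError("need one sample count per local model")
    total = sum(num_samples)
    global_model = [0.0] * len(local_models[0])
    for model, n in zip(local_models, num_samples):
        weight = n / total
        for j, theta in enumerate(model):
            global_model[j] += weight * theta
    return global_model


# Hypothetical per-content scores learned by two UAVs from 300 and 100 local requests.
global_model = federated_average([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]], [300, 100])
```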

The integration of federated learning with MABs, referred to as Federated Multi-Armed Bandit (FedMAB) learning, offers a hybrid approach that combines the local adaptivity of MABs with the scalability and privacy-preserving characteristics of federated learning. Unlike methods relying on long-term global demand estimation, FedMAB supports geo-temporal heterogeneity by learning and sharing local caching decisions across UAVs. The literature demonstrates the effectiveness of federated learning in mobile and distributed networks (e.g., FL-based edge classification systems) [105], [106], [107], [108], [109], [110], [111], but its application to content dissemination under complete infrastructure failure is still emerging. Federated aggregation has also been achieved in graph-structured learning paradigms such as deep neural networks (DNNs) [112], [113], [114], [115], [116], where averaging model parameters is both intuitive and achievable. Combining federated learning with tabular methods such as MAB and RL, however, has fundamental limitations: such aggregations are not straightforward and can degrade learning performance and cause a loss of local context.
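The contextual loss noted above can be illustrated with a small, hypothetical example. Two UAVs serve communities whose local popularity rankings disagree. Their per-content estimates are merged by naive element-wise averaging, an aggregation rule assumed here purely for illustration.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring contents."""
    return sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]


def average_tables(tables):
    """Naive element-wise average of per-content value tables."""
    n = len(tables)
    return [sum(column) / n for column in zip(*tables)]


# Hypothetical local estimates: community A favors content 0, community B favors content 2.
uav_a = [0.60, 0.35, 0.05]
uav_b = [0.05, 0.35, 0.60]
merged = average_tables([uav_a, uav_b])
print(top_k(uav_a, 1), top_k(uav_b, 1), top_k(merged, 1))  # prints [0] [2] [1]
```

With a one-slot cache, both communities would be served content 1. That content has a local popularity of 0.35 in each community, whereas each community's locally preferred content has popularity 0.60.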

Our thesis builds directly upon these developments by implementing a FedMAB framework tailored for disaster-affected regions. It incorporates demand heterogeneity, varying quality-of-service constraints, and inter-UAV collaboration without requiring centralized coordination. By leveraging both the theoretical advantages and empirical performance of MABs and federated updates, our approach provides a foundation for scalable, resilient, and context-sensitive UAV caching strategies.

2.6  Summary

Existing learning-based UAV-aided content dissemination systems face several challenges, including a lack of adaptability to rapid changes and demand heterogeneity, inefficient utilization of UAVs’ caching capabilities, and insufficient focus on real-time adjustments for optimized content delivery. The methods designed in this thesis address these drawbacks by incorporating machine learning algorithms that account for demand variability across regions, optimize caching strategies according to UAV trajectories in real time, and ensure efficient use of UAVs’ cache space for content delivery. By focusing on adaptability, efficiency, and responsiveness, our approaches enhance the effectiveness of UAV-aided content dissemination systems in meeting diverse operational demands.

The next chapter characterizes a UAV-aided content dissemination framework that can serve the content needs of stranded users in disjoint communities within a disaster-affected region. The framework is designed for scenarios in which the communication infrastructure has been completely destroyed by unforeseen catastrophic events.

Chapter 3:  UAV Centric Content Caching for Communication Challenged Scenarios

This chapter introduces a specialized UAV-based caching framework aimed at enhancing content delivery in disaster-affected areas where traditional communication infrastructure is absent. Using static anchor UAVs for direct content access and mobile ferrying UAVs for broader content distribution, the system optimizes content availability through strategic caching and content duplication methods tailored to the constraints of UAV storage capacity. The framework’s effectiveness is demonstrated through analytical and simulation-based evaluations that highlight its capacity to adapt to various disaster scenarios, UAV trajectories, and operational constraints. The chapter details the framework’s architecture, its caching strategies, and the role of UAV trajectories in maximizing content accessibility for isolated user communities in crisis situations.

3.1  Motivation 

Using UAVs for content provisioning without communication infrastructure faces specific challenges. Chief among them is limited power, which constrains flight duration and operational range and thereby limits a UAV’s ability to deliver content over long distances or for extended periods. Consider a basic model of UAV power expenditure with the following parameters.

a)  $P_{\mathrm{download}}$: Power consumption of the communication module when actively downloading content.

b)  $P_{\mathrm{tx}}$: The power used for transmitting content to users, influenced by factors like distance, data rate, and the efficiency of the communication protocol.

c)  $P_{\mathrm{rx}}$: The power consumed by the UAV’s communication system for receiving content requests, depending on receiver sensitivity and signal processing requirements.

d)  $P_{c}$: The power consumed by the circuitry of the UAV’s communication system, including the transmitter circuitry, the receiver circuitry, and any signal processing components.

e)  $P_{\mathrm{idle}}$: Power consumption of the communication module when on but not actively downloading.

f)  $P_{\mathrm{flight}}$: Power consumption for keeping the UAV in the air (motors, avionics, etc.).

g)  $T_{\mathrm{download}}$: Time spent downloading content.

h)  $T_{\mathrm{tx}}$: Time taken to transmit content to users.

i)  $T_{\mathrm{rx}}$: Time taken to receive content requests from users.

j)  $T_{\mathrm{total}}$: Total flight time available (an average flight time of 30 minutes is considered).

k)  $E_{\mathrm{battery}}$: Total energy available from the UAV’s battery, measured in watt-hours (Wh). Battery capacity is also commonly quoted as charge in milliamp-hours (mAh), which converts to energy through the battery’s nominal voltage.

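Based on the above parameters, the depleted energy, remaining energy and remaining on-time can be approximated as follows:

$$E_{\mathrm{depleted}} = P_{\mathrm{download}}\,T_{\mathrm{download}} + (P_{\mathrm{tx}} + P_{c})\,T_{\mathrm{tx}} + (P_{\mathrm{rx}} + P_{c})\,T_{\mathrm{rx}} \tag{3.1}$$

$$E_{\mathrm{remaining}} = E_{\mathrm{battery}} - E_{\mathrm{depleted}} \tag{3.2}$$

$$T_{\mathrm{remaining}} = \frac{E_{\mathrm{remaining}}}{P_{\mathrm{flight}} + P_{\mathrm{idle}}} = \frac{E_{\mathrm{battery}} - E_{\mathrm{depleted}}}{P_{\mathrm{flight}} + P_{\mathrm{idle}}} \tag{3.3}$$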
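A direct transcription of Eqs. (3.1) to (3.3) is sketched below in Python. The numeric values are hypothetical, chosen only so the arithmetic is easy to follow; they are not measured UAV parameters.

```python
def remaining_on_time(e_battery, p_download, t_download, p_tx, t_tx,
                      p_rx, t_rx, p_c, p_flight, p_idle):
    """Eqs. (3.1)-(3.3): energies in Wh, powers in W, times in hours."""
    e_depleted = p_download * t_download + (p_tx + p_c) * t_tx + (p_rx + p_c) * t_rx  # (3.1)
    e_remaining = e_battery - e_depleted                                            # (3.2)
    return e_remaining / (p_flight + p_idle)                                        # (3.3)


# Hypothetical values chosen only to make the arithmetic easy to follow.
hours = remaining_on_time(e_battery=100.0, p_download=10.0, t_download=0.5,
                          p_tx=5.0, t_tx=1.0, p_rx=1.0, t_rx=0.5, p_c=1.0,
                          p_flight=175.0, p_idle=1.0)
```

The depleted energy is $E_{\mathrm{depleted}} = 5 + 6 + 1 = 12$ Wh. That leaves 88 Wh to sustain a combined 176 W of flight and idle load, i.e., half an hour of remaining on-time.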

Eqns. 3.1 through 3.3 show that the remaining on-time depends on the content download load and on the lateral communication load with the users. Note that the transmission power $P_{\mathrm{tx}}$ is significantly higher than the reception power $P_{\mathrm{rx}}$, as follows from the Friis transmission equation [117], [118] for free space. In a wireless communication system, the transmission power is the power the transmitter must emit so that the signal reaches the receiver with sufficient strength ($P_{r}$) to be decoded correctly. This relationship can be illustrated through the Friis transmission equation in a simplified form, assuming free space and line-of-sight communication:

$$P_{r} = P_{\mathrm{tx}}\,G_{\mathrm{tx}}\,G_{r}\left(\frac{\lambda}{4\pi d}\right)^{2} \tag{3.4}$$

The equation above relates $P_{\mathrm{tx}}$ to $P_{r}$ for known values of $G_{\mathrm{tx}}$, $G_{r}$, $\lambda$ and $d$. These are, respectively, the transmitter antenna gain, the receiver antenna gain, the wavelength of the transmitted signal, and the distance between transmitter and receiver. Because the received power falls off as $1/d^{2}$, doubling the distance to a user requires four times the transmission power to achieve the same $P_{r}$. The resulting increase in power consumption necessitates a careful balance between communication range and the operational longevity of the UAVs [30]. Enhancing one can significantly degrade the other, potentially reducing the effectiveness of content distribution efforts.
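The quadratic range penalty can be checked numerically by solving Eq. (3.4) for $P_{\mathrm{tx}}$. The sketch below assumes a 2.4 GHz link, unity antenna gains, and a -80 dBm received-power requirement. These values are illustrative, not taken from the experiments in this chapter.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def required_tx_power(p_r, g_tx, g_r, freq_hz, distance_m):
    """Solve Eq. (3.4) for P_tx: P_tx = P_r / (G_tx * G_r * (lambda / (4*pi*d))**2)."""
    wavelength = SPEED_OF_LIGHT / freq_hz
    free_space_gain = (wavelength / (4.0 * math.pi * distance_m)) ** 2
    return p_r / (g_tx * g_r * free_space_gain)


def dbm_to_watts(dbm):
    """Convert a power level in dBm to watts."""
    return 10 ** (dbm / 10.0) / 1000.0


# Hypothetical link: 2.4 GHz, isotropic antennas, -80 dBm needed at the receiver.
p_r = dbm_to_watts(-80.0)
powers = {d: required_tx_power(p_r, 1.0, 1.0, 2.4e9, d) for d in (100.0, 200.0, 400.0)}
```

Each doubling of the distance multiplies the required transmission power by four.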

To maintain unhindered content provisioning, the aim should be to maximize the on-time of the UAVs. However, UAVs incur inherent energy expenses, such as those listed above. One straightforward approach is to minimize the communication energy expenditure, which includes the content download, transmission and reception expenses. Such an approach depends heavily on the average data consumption rate of individual users. Service providers have reported an average monthly data consumption of 25-30 gigabytes per user [30], [31], [32]. With an average population density of $10^{5}$ users/sq-mile [33], [34], the data requirement per day can reach up to 80-100 terabytes. Attempts to handle such requirements can be made by installing Non-Volatile Memory Express solid-state drive (NVMe SSD) memory cards [35], [36] in UAVs that can store content of this magnitude. Nevertheless, the communication energy expenditure can still deplete the UAVs’ batteries while content is stored and replaced in these memory devices. To exacerbate the situation, as the storage capacity of such memory cards increases, the communication energy expenditure scales with it, leading to even faster depletion of the UAV battery.
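The daily demand figure follows from simple arithmetic. The sketch makes three assumptions:

- a 30-day month;
- decimal units (1 TB = 1000 GB);
- a density of $10^{5}$ users per square mile, which is the value that reproduces the quoted 80-100 TB per day over one square mile.

```python
def daily_demand_tb(monthly_gb_per_user, users_per_sq_mile, area_sq_miles=1.0,
                    days_per_month=30):
    """Daily data demand in TB, using decimal units (1 TB = 1000 GB)."""
    daily_gb_per_user = monthly_gb_per_user / days_per_month
    return daily_gb_per_user * users_per_sq_mile * area_sq_miles / 1000.0


for monthly_gb in (25, 30):
    print(monthly_gb, round(daily_demand_tb(monthly_gb, 1e5), 1))
```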

This energy limitation necessitates intelligently spreading content across UAVs. Moreover, since the total storage capacity of the UAV network cannot hold all the content required now and in the future, efficient content management strategies are required.

3.2  Design Objective

The objective of this chapter is to design and validate a comprehensive UAV-enabled content dissemination framework optimized for environments lacking fixed communication infrastructure. This involves developing a detailed architectural model that leverages UAVs for content delivery, formulating optimal content placement and caching strategies tailored to specific UAV trajectories, and exploring how these trajectories influence caching efficiency. Additionally, the chapter constructs analytical models capable of estimating content availability within this framework, supported by extensive simulation experiments. These simulations test and evaluate the effectiveness of the proposed strategies across a spectrum of network conditions and operational scenarios, thereby assessing the framework’s applicability.

3.3  System Model

3.3.1  UAV Hierarchy  

The content distribution system is organized in two layers, namely, the anchor UAVs (i.e., A-UAVs) and the ferrying UAVs (i.e., F-UAVs). As shown in Figure 3.1, each partitioned community of users is served by an A-UAV using a lateral wireless link such as WiFi. A-UAVs can also download content from the Internet via an expensive vertical link such as satellite-based Internet. One monolithic system design approach is to let the A-UAVs download all needed content, as requested by their local users, via the vertical links. In this approach, with no inter-A-UAV data transfer, the following shortcomings will be encountered. First, different A-UAVs will duplicate downloads over the expensive vertical links because of overlapping requests for popular content, incurring high download costs. Second, storage constraints will cap the number of contents that can be downloaded and stored in an A-UAV, thus limiting content availability. Finally, due to limited infrastructure availability, some user communities are left without content access because no dedicated A-UAV is assigned to them.

[Figure 3.1 legend: Anchor UAV; Ferrying UAV; Ferrying UAV’s trajectories; Community within lateral link of F/A-UAV; Satellite; Vertical Link]

Figure 3.1.  Coordinated UAV system for content caching and distribution in environments without communication

To address these problems, a set of ferrying UAVs (i.e., F-UAVs) is introduced. Unlike A-UAVs, the F-UAVs do not possess vertical links, but they do have lateral links such as WiFi, through which they can communicate with the A-UAVs. The role of these UAVs is to cache content and ferry it among the A-UAVs. Users in one community can thereby access, via the F-UAVs, content that was downloaded by the A-UAVs serving other communities.

3.3.2  Content Request and Provisioning Model 

Content requests are generated by the community users and directed first to the local A-UAV and then, if needed, to a visiting F-UAV.

Content Popularity and Requests: Studies have shown that content request patterns often follow a Zipf distribution, in which the popularity of a content is inversely proportional to a power of its popularity rank within a larger pool [119]. The popularity of content $i$ is given as

$$p(i) = \frac{1/i^{\alpha}}{\sum_{c=1}^{C} 1/c^{\alpha}} \tag{3.5}$$

The parameter $C$ represents the total number of contents in the pool, and the Zipf parameter $\alpha$ determines the skewness of the distribution. Requests are generated as a Poisson process, which is the most prevalent way to capture real-time user request arrivals.
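A minimal sketch of the request model follows. It computes Zipf popularities from Eq. (3.5) over ranks 1 to $C$ and generates Poisson arrivals using exponential inter-arrival times. The arrival rate `rate`, the horizon, and the function names are illustrative assumptions.

```python
import random


def zipf_popularity(num_contents, alpha):
    """Eq. (3.5): p(i) = i**-alpha / sum_{c=1..C} c**-alpha, for ranks i = 1..C."""
    weights = [i ** -alpha for i in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def poisson_requests(popularity, rate, horizon, seed=0):
    """Poisson arrivals up to `horizon`, each naming a Zipf-drawn content rank."""
    rng = random.Random(seed)
    ranks = range(1, len(popularity) + 1)
    t, requests = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential inter-arrival time with mean 1/rate
        if t > horizon:
            return requests
        requests.append((t, rng.choices(ranks, weights=popularity)[0]))


popularity = zipf_popularity(1000, 1.001)
requests = poisson_requests(popularity, rate=1.0, horizon=100.0)
```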

Tolerable Access Delay and Content Provisioning: For each generated request, a Tolerable Access Delay (TAD) is specified. TAD is a quality-of-service parameter that indicates how long a requesting user waits before the content is provisioned via download. After receiving a request from one of its community users, the relevant A-UAV first searches its local storage for the content. If the content is not found, the A-UAV waits for a potential future delivery of the content by one of the traveling F-UAVs. If no F-UAV carrying that content arrives within the specified TAD, the A-UAV downloads it via the vertical link.
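The provisioning rule can be sketched as a small decision function. The argument `ferry_eta` is a hypothetical input: the time until an F-UAV carrying the requested content reaches the community, which a simulator would derive from the F-UAV schedule. The vertical-link download time itself is not modeled.

```python
from typing import Optional, Set, Tuple


def serve_request(content: int, local_cache: Set[int],
                  ferry_eta: Optional[float], tad: float) -> Tuple[str, float]:
    """Return (serving source, access delay) for one request.

    ferry_eta: time until an F-UAV carrying `content` is at the community
    (0.0 if one is hovering there now, None if none is expected).
    """
    if content in local_cache:
        return "A-UAV", 0.0               # served from local A-UAV storage
    if ferry_eta is not None and ferry_eta <= tad:
        return "F-UAV", ferry_eta         # delivered by a ferrying UAV within the TAD
    return "vertical-link", tad           # TAD expired; download time not modeled
```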

3.4  Caching Policies

The caching-related design questions to be addressed are:

a) which content should be downloaded into the A-UAVs via the vertical links, so that they can serve their own community directly and the remote communities via the traveling F-UAVs;

b) which content should be transferred from the A-UAVs to the F-UAVs via the lateral links and subsequently cached within the F-UAVs; and

c) which inter-community trajectories the F-UAVs should follow.

This chapter addresses these questions under the assumption of pre-assigned, globally known content popularities and static content pre-placement before user requests are generated. In terms of F-UAV trajectories, different pre-programmed trajectories are characterized along with different static content placement strategies. After understanding and characterizing such static policies, the goal will be to develop runtime and dynamic mechanisms for all these design components and report them in a future publication.

3.4.1  Caching at Anchor UAVs (A-UAVs) 

A naïve strategy for the A-UAVs would be to cache the most popular contents (i.e., following the globally known Zipf distribution) to fill their individual storage space of $C_A$ contents. This naïve fully duplicated (FD) [120], [121] mechanism has the shortcoming that it limits the number of accessible contents for all user communities to $C_A$, the A-UAV cache size. This limitation can be addressed by storing a certain number of unique (exclusive) contents in each A-UAV and sharing those contents across the communities via the traveling F-UAVs. This Smart Cache Duplication (SCD) mechanism can effectively increase the number of contents accessible to all users across the entire system, thus improving the overall availability within a given TAD.

Let the size of the duplicate segment of the A-UAV cache be $\lambda C_A$ and that of the unique segment be $(1-\lambda) C_A$, where $\lambda$ is a duplication factor that decides the level of content duplication in A-UAVs. This results in $N_A (1-\lambda) C_A$ unique contents stored across all $N_A$ A-UAVs in the system, and these can be shared across all user communities via the mobile F-UAVs. In popularity rank, these unique contents come immediately after the top $\lambda C_A$ popular contents, which are duplicated in all the A-UAVs. For symmetry, the $N_A (1-\lambda) C_A$ unique contents are uniformly randomly distributed across the $N_A$ A-UAVs. The total number of contents in the system is $C_{\mathrm{sys}} = \lambda C_A + N_A (1-\lambda) C_A$.
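The SCD placement can be sketched as follows, identifying contents by popularity rank (1 is most popular). Rounding $\lambda C_A$ to an integer and the fixed random seed are assumptions of this sketch.

```python
import random


def scd_placement(num_a_uavs, cache_size, dup_factor, seed=0):
    """Smart Cache Duplication over contents identified by popularity rank.

    Ranks 1..lambda*C_A are duplicated in every A-UAV; the next
    N_A*(1-lambda)*C_A ranks are shuffled and split evenly, one share per A-UAV.
    Returns (list of per-A-UAV caches, C_sys).
    """
    rng = random.Random(seed)
    n_dup = round(dup_factor * cache_size)   # duplicate segment size
    n_unique = cache_size - n_dup            # unique segment size per A-UAV
    duplicated = set(range(1, n_dup + 1))
    unique = list(range(n_dup + 1, n_dup + num_a_uavs * n_unique + 1))
    rng.shuffle(unique)                      # uniformly random assignment
    caches = [duplicated | set(unique[k * n_unique:(k + 1) * n_unique])
              for k in range(num_a_uavs)]
    c_sys = n_dup + num_a_uavs * n_unique    # lambda*C_A + N_A*(1-lambda)*C_A
    return caches, c_sys


caches, c_sys = scd_placement(num_a_uavs=20, cache_size=50, dup_factor=0.4)
```

With $\lambda = 1$ the unique segments are empty and every A-UAV holds the same 50 contents, which is the FD strategy.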

It should be noted that with $\lambda$ set to one, the SCD system reduces to the fully duplicated (FD) strategy. With higher $\lambda$ values, users have local access to more of the highly popular contents, but access to fewer of the low-popularity contents that are stored uniquely across the system-wide A-UAVs and can be reached only via the mobile F-UAVs. A lower $\lambda$ creates the opposite effect. The goal is to choose a $\lambda$ that strikes the right balance between these effects and maximizes the overall availability.

3.4.2  Caching at Ferrying UAVs (F-UAVs) 

The purpose of the F-UAVs is to ferry around the $N_A (1-\lambda) C_A$ unique contents stored in all $N_A$ A-UAVs. Given a limited per-F-UAV caching space $C_F$, the F-UAV caching policy can be determined based on its trajectory, the value of $\lambda$, and the Zipf parameter defining the content popularity.

Consider a situation in which an F-UAV $k$ is approaching A-UAV $i$. Let $U_i$ be the set of all unique contents in the entire system except the ones stored in A-UAV $i$. To maximize content availability for the users in A-UAV $i$’s community, the F-UAV should carry as many of the relatively low-popularity unique contents from $U_i$ as its cache space permits. To enable such access, F-UAV $k$ should carry the $C_F$ most popular contents from the set $U_i$ while approaching A-UAV $i$. The size of the set $U_i$ can be expressed as $|U_i| = (N_A - 1)(1-\lambda) C_A$. When $C_F \le |U_i|$, the F-UAV should carry the $C_F$ most popular contents, as outlined above. Otherwise, the F-UAV carries all $|U_i|$ unique contents, leaving part of its cache (i.e., $C_F - |U_i|$) empty. This underutilization of F-UAV cache space occurs for large $\lambda$ values, which cause heavy duplication within the A-UAVs and thus leave few unique contents stored.
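The F-UAV caching rule can be sketched as below. Contents are again identified by popularity rank, so ascending order corresponds to decreasing popularity. The toy placement (three A-UAVs, $C_A = 4$, $\lambda = 0.5$) is hypothetical.

```python
def f_uav_cache(a_uav_caches, target, f_cache_size, duplicated):
    """Return the F-UAV cache for a visit to A-UAV `target`.

    Contents are popularity ranks (1 = most popular), so ascending order is
    decreasing Zipf popularity.
    """
    u_i = set()
    for k, cache in enumerate(a_uav_caches):
        if k != target:
            u_i |= cache - duplicated     # unique contents held by other A-UAVs
    u_i -= a_uav_caches[target]           # defensive: drop anything the target holds
    return sorted(u_i)[:f_cache_size]     # shorter than f_cache_size => underutilized


# Hypothetical toy placement: 3 A-UAVs, C_A = 4, lambda = 0.5 (ranks 1-2 duplicated).
dup = {1, 2}
caches = [{1, 2, 3, 4}, {1, 2, 5, 6}, {1, 2, 7, 8}]
print(f_uav_cache(caches, 0, 3, dup))   # [5, 6, 7]
print(f_uav_cache(caches, 0, 10, dup))  # [5, 6, 7, 8]
```

With $C_F = 10$ the F-UAV can fill only 4 slots. This matches $|U_i| = (N_A - 1)(1-\lambda)C_A = 2 \times 0.5 \times 4 = 4$, i.e., an effective cache of $\min\{C_F, |U_i|\}$.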

3.4.3  Trajectory of Ferrying UAVs

An F-UAV’s trajectory is represented by the sequence of visited A-UAVs and the hovering duration at each A-UAV. Trajectory sequences can be categorized as partitioned or global cycles. With a partitioned trajectory cycle, an F-UAV goes around a specific part of the system containing a fixed subset of the A-UAVs; for example, F-UAVs A and B in Figure 3.1 follow a partitioned cycle over A-UAVs X, Y, Z and W. With a global cycle, an F-UAV moves around all the A-UAVs in the system, like F-UAVs C and D in Figure 3.1. Intuitively, if the contents cached in the unique segments of A-UAVs have very low popularity, then the global cycle would be beneficial. Conversely, when some of the A-UAVs maintain unique contents with comparatively high popularity, a partitioned cycle may be more rewarding. These options will be evaluated in the experiments in Section 3.6.

The cycle time of an F-UAV trajectory is $T_{\mathrm{cycle}} = N_A^{C} \times (T_{\mathrm{hover}} + T_{\mathrm{transit}})$, where:

- $N_A^{C}$ is the number of A-UAVs in the cycle (partitioned or global);
- $T_{\mathrm{hover}}$ is the hover duration at each A-UAV;
- $T_{\mathrm{transit}}$ is the transit time between two consecutive A-UAVs in a sequence.

$T_{\mathrm{transit}}$ depends on the F-UAV flying speed, the inter-A-UAV distance, wind speed and direction, and other environmental factors. $T_{\mathrm{hover}}$ should be set according to the data transfer rate and the amount of data that needs to be exchanged between the F-UAV and the A-UAV. It should be noted that A-UAVs do not follow a trajectory, since they are stationed at their respective communities for uninterrupted content dissemination.
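A short sketch of the cycle-time relation follows. It adds a lower bound on the hover time under the assumption of a constant lateral-link rate. The 8 Gb payload and 400 Mb/s rate used in the checks are hypothetical.

```python
def cycle_time(num_a_uavs_in_cycle, t_hover, t_transit):
    """T_cycle = N_A^C * (T_hover + T_transit), all times in seconds."""
    return num_a_uavs_in_cycle * (t_hover + t_transit)


def min_hover_time(data_bits, rate_bps):
    """Lower bound on T_hover: time to exchange `data_bits` at a constant `rate_bps`."""
    return data_bits / rate_bps


print(cycle_time(20, 20.0, 10.0))  # 600.0: global cycle over 20 A-UAVs
print(cycle_time(4, 20.0, 10.0))   # 120.0: partitioned cycle over, e.g., X, Y, Z and W
```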

3.5  Content Dissemination Performance

3.5.1  Content Availability  

Availability is defined as the probability of finding a requested content within the local A-UAV or a future visiting F-UAV within a TAD. Consider a situation in which each F-UAV cycles in a round-robin manner through all the A-UAVs, alternately hovering at an A-UAV and transiting to the next. For a content requested from a community, an F-UAV may or may not be accessible within the specified TAD. This probability is as follows:

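$$P_{FA} = \begin{cases} \dfrac{N_F\,(T_{\mathrm{hover}} + TAD)}{N_A\,(T_{\mathrm{hover}} + T_{\mathrm{transit}})}, & TAD < \left(\dfrac{N_A}{N_F} - 1\right) T_{\mathrm{hover}} + \dfrac{N_A}{N_F}\, T_{\mathrm{transit}} \\[2ex] 1, & TAD \ge \left(\dfrac{N_A}{N_F} - 1\right) T_{\mathrm{hover}} + \dfrac{N_A}{N_F}\, T_{\mathrm{transit}} \end{cases} \tag{3.6}$$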

Let the threshold $\left(\frac{N_A}{N_F} - 1\right) T_{\mathrm{hover}} + \frac{N_A}{N_F} T_{\mathrm{transit}}$ be the time between one F-UAV leaving a community and the next one arriving, assuming evenly spaced F-UAVs. If the TAD is at least this threshold, the F-UAV’s accessibility to the requesting community is guaranteed; otherwise, it follows the first expression in Eqn. 3.6. Note that physical accessibility to the F-UAV does not guarantee access to the requested content, since the F-UAV can store only a limited number (i.e., $C_F$) of unique contents. Let $P_F$ be the probability that the requested content can be found within the F-UAV following the caching strategy stated in Section 3.4. It can be expressed as:

$$P_F = \sum_{i = C_A + 1}^{C_A + 1 + C_{EFF}} p(i) \tag{3.7}$$

where $p(i)$ is the Zipf-distributed popularity defined in Section 3.3. The effective cache size of the F-UAV is given as $C_{EFF} = \min\{C_F, (N_A - 1) \times (1-\lambda) \times C_A\}$. The effective cache size is smaller than $C_F$ when the F-UAV’s cache is partly empty, i.e., underutilized (see Section 3.4). The probability that the requested content can be found within the A-UAV local to the request-generating community can be expressed as:

$$P_A = \sum_{i = 1}^{\lambda \times C_A + (1-\lambda) \times C_A} p(i) \tag{3.8}$$

Combining the three probabilities above, the overall availability can be stated as:

$$P_{\mathrm{avail}} = P_A + P_{FA} \times P_F \tag{3.9}$$

To summarize, local contents from A-UAVs (i.e., both duplicate and unique) and unique contents from future visiting F-UAVs contribute towards the overall availability $P_{\mathrm{avail}}$ within a specified TAD. Note that all contents unavailable within the specified TAD have to be downloaded by the A-UAVs using their expensive vertical links, such as satellite Internet. In expectation, a fraction $1 - P_{\mathrm{avail}}$ of requests therefore triggers vertical-link downloads, so availability indirectly indicates the content download cost in the system.
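Eqs. (3.6) and (3.9) can be evaluated directly, as sketched below. $P_A$ and $P_F$ are passed in as inputs, since they would be computed from Eqs. (3.7) and (3.8). The function names are illustrative.

```python
def p_fa(n_a, n_f, t_hover, t_transit, tad):
    """Eq. (3.6): probability that an F-UAV is reachable within the TAD (n_f >= 1)."""
    ratio = n_a / n_f
    gap = (ratio - 1.0) * t_hover + ratio * t_transit  # time between F-UAV visits
    if tad >= gap:
        return 1.0
    return n_f * (t_hover + tad) / (n_a * (t_hover + t_transit))


def p_avail(p_a, p_fa_value, p_f):
    """Eq. (3.9): P_avail = P_A + P_FA * P_F."""
    return p_a + p_fa_value * p_f


print(p_fa(n_a=20, n_f=10, t_hover=20.0, t_transit=10.0, tad=20.0))
```

Consider the default parameters of Section 3.6: $N_A = 20$, $N_F = 10$, $T_{\mathrm{hover}} = 20$ s, $T_{\mathrm{transit}} = 10$ s, and $TAD = 20$ s. The threshold is then 40 s, and $P_{FA} = 400/600 \approx 0.67$.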

3.5.2  Low Availability Period  

Consider the scenario in Figure 3.2 with two A-UAVs and one F-UAV. The users in a community have access to the content in the F-UAV for a duration of $TAD + T_{\mathrm{hover}}$. After this access window ends, it takes $2T_{\mathrm{transit}} + T_{\mathrm{hover}} - TAD$ until the users can access the F-UAV’s content again. This is the F-UAV’s cycle time $2(T_{\mathrm{hover}} + T_{\mathrm{transit}})$ minus the access window. During this period, content availability for the users comes only from the local A-UAV, without access to the F-UAV.
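For the single-F-UAV case, the low availability period can be cross-checked numerically, as sketched below. The sketch assumes that community 0 is visited at the start of each cycle, with the F-UAV hovering there during $[0, T_{\mathrm{hover}})$. The time step is also an assumption; the scan is exact only when all durations are multiples of it.

```python
def lap_single_ferry(n_a, t_hover, t_transit, tad):
    """Closed form for one F-UAV: cycle time minus the access window (hover + TAD)."""
    cycle = n_a * (t_hover + t_transit)
    return max(cycle - (t_hover + tad), 0.0)


def lap_by_scan(n_a, t_hover, t_transit, tad, step=0.5):
    """Scan one cycle and sum the time during which community 0 cannot reach the F-UAV."""
    cycle = n_a * (t_hover + t_transit)
    lacking, t = 0.0, 0.0
    while t < cycle:
        hovering_here = t < t_hover       # F-UAV hovers at community 0 in [0, T_hover)
        wait = cycle - t                  # time until the next hover at community 0
        if not hovering_here and wait > tad:
            lacking += step
        t += step
    return lacking


print(lap_single_ferry(2, 20.0, 10.0, 20.0), lap_by_scan(2, 20.0, 10.0, 20.0))  # 20.0 20.0
```

Take two A-UAVs with $T_{\mathrm{hover}} = 20$ s, $T_{\mathrm{transit}} = 10$ s, and $TAD = 20$ s. Both methods give 20 s, which equals $2T_{\mathrm{transit}} + T_{\mathrm{hover}} - TAD$ and agrees with Eq. (3.10) when $N_F = 1$.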

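The duration during which a community lacks access to the F-UAV is referred to as the low availability period. It can be generally expressed as:

$$LAP = \frac{N_A\,T_{\mathrm{transit}} + (N_A - 1)\,T_{\mathrm{hover}} - TAD}{N_F} \tag{3.10}$$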

where $N_A$ and $N_F$ are the numbers of A-UAVs and F-UAVs in the system. With higher transit and hovering times and larger $N_A$, the low availability period increases, while the overall availability derived in Eqns. 3.6 through 3.9 decreases.

[Figure 3.2 timeline: one F-UAV alternating between A-UAV 1 and A-UAV 2, with intervals labeled $T_{\mathrm{hover}}$, $T_{\mathrm{transit}}$, $2T_{\mathrm{transit}} + T_{\mathrm{hover}}$, $T_{\mathrm{hover}} + TAD$, $T_{\mathrm{transit}} - TAD$ and $2T_{\mathrm{transit}} + T_{\mathrm{hover}} - TAD$]

Figure 3.2.  (Top) Scenario with $TAD = 0$; (Bottom) With non-zero $TAD$

3.5.3  Content Access Delay 

Any request that is served by a local A-UAV experiences zero access delay. There is also no access delay if a request for content held by an F-UAV is generated while that F-UAV is hovering in the community. Therefore, the only scenario with a non-zero access delay is the one in which the requested content is available at an F-UAV that is not currently visiting the requesting community. The probability $P_{CAD}$ of that scenario can be expressed as:

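$$P_{CAD} = \begin{cases} \dfrac{N_F \times TAD}{N_A \times (T_{\mathrm{hover}} + T_{\mathrm{transit}})}, & TAD < \dfrac{T_{\mathrm{cycle}}}{N_F} - T_{\mathrm{hover}} \\[2ex] \dfrac{N_F \times \left(\dfrac{T_{\mathrm{cycle}}}{N_F} - T_{\mathrm{hover}}\right)}{N_A \times (T_{\mathrm{hover}} + T_{\mathrm{transit}})}, & TAD \ge \dfrac{T_{\mathrm{cycle}}}{N_F} - T_{\mathrm{hover}} \end{cases} \tag{3.11}$$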

Note that the access delay is upper bounded by the specified $TAD$. As per the second expression in Eqn. 3.11, if the $TAD$ is larger than the time it takes for the F-UAV to reach the request-generating community, the content is delayed by at most the time the F-UAV takes to arrive. Conversely, for lower $TAD$s, the content is delayed by at most the $TAD$ duration.

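The average delays incurred in these two cases are:

$$\mathrm{Delay}_{\mathrm{av}} = \begin{cases} \dfrac{TAD}{2}, & TAD < \dfrac{T_{\mathrm{cycle}}}{N_F} - T_{\mathrm{hover}} \\[2ex] \dfrac{1}{2}\left(\dfrac{T_{\mathrm{cycle}}}{N_F} - T_{\mathrm{hover}}\right), & TAD \ge \dfrac{T_{\mathrm{cycle}}}{N_F} - T_{\mathrm{hover}} \end{cases} \tag{3.12}$$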
These averages are the midpoints between the minimum (zero) and the maximum possible delays. Combining $P_{CAD}$ and $\mathrm{Delay}_{\mathrm{av}}$, the access delay ($AD$) can be expressed as:

$$AD = P_{CAD} \times \sum_{\forall i \in C_F} p(i) \times \mathrm{Delay}_{\mathrm{av}} \tag{3.13}$$
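Eqs. (3.11) to (3.13) can be combined in a few lines. The sketch assumes a global cycle, so that $T_{\mathrm{cycle}} = N_A(T_{\mathrm{hover}} + T_{\mathrm{transit}})$. The popularity mass of the F-UAV’s contents, $\sum p(i)$, is passed in as an input; the value 0.3 used below is hypothetical.

```python
def _visit_gap(n_a, n_f, t_hover, t_transit):
    """T_cycle / N_F - T_hover for a global cycle with T_cycle = N_A (T_hover + T_transit)."""
    t_cycle = n_a * (t_hover + t_transit)
    return t_cycle / n_f - t_hover


def p_cad(n_a, n_f, t_hover, t_transit, tad):
    """Eq. (3.11): probability that a request sees a non-zero access delay."""
    gap = _visit_gap(n_a, n_f, t_hover, t_transit)
    return n_f * min(tad, gap) / (n_a * (t_hover + t_transit))


def average_delay(n_a, n_f, t_hover, t_transit, tad):
    """Eq. (3.12): midpoint of the zero and the maximum possible delay."""
    return min(tad, _visit_gap(n_a, n_f, t_hover, t_transit)) / 2.0


def access_delay(f_uav_popularity_mass, n_a, n_f, t_hover, t_transit, tad):
    """Eq. (3.13): AD = P_CAD * (sum of p(i) over F-UAV contents) * Delay_av."""
    return (p_cad(n_a, n_f, t_hover, t_transit, tad) * f_uav_popularity_mass
            * average_delay(n_a, n_f, t_hover, t_transit, tad))


ad = access_delay(0.3, n_a=20, n_f=10, t_hover=20.0, t_transit=10.0, tad=20.0)
```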

3.6  Experimental Results and Analysis

Experiments were carried out using simulations implementing the request generation, UAV caching, and F-UAV movement strategies presented in Section 3.4. The default experimental parameters are $N_C = 1000$, $N_A = 20$, $N_F = 10$, $C_A = C_F = 50$, $\mu = 1$, $T_{\mathrm{hover}} = 20$ secs, $T_{\mathrm{transit}} = 10$ secs, $TAD = 20$ secs and $\alpha = 1.001$.

[Figure 3.3 plot: Content Availability (%) versus Lambda ($\lambda$) from 0 to 1, analytical model and simulation curves for $N_F$ = 0, 5, 10 and 15]

Figure 3.3. Content Availability with changing $\lambda$ for different $N_F$

3.6.1  Impacts of F-UAVs on Content Availability 

Figure 3.3 depicts the benefits of the ferrying UAVs in terms of improving content availability as defined in Section 3.5. The figure shows availability computed analytically and from simulation experiments (i.e., the average computed from the success of 10! requests for each availability point). The two agree closely, which validates the analytical model.

Content availability is evaluated for varying $\lambda$ values, representing the split between duplicated and unique contents cached within the A-UAVs, as described in Section 3.4. The following observations can be made from Figure 3.3.

- First, adding F-UAVs improves availability by ferrying contents that are not otherwise available to a community in its local A-UAV’s cache.
- Second, the percentage increase in availability is more pronounced for lower values of $\lambda$, for which more unique contents are cached in the A-UAVs. Since the F-UAVs ferry those unique contents across different communities, availability depends more strongly on the contents cached in the F-UAVs for smaller $\lambda$.
- Third, there is an optimum duplication factor $\lambda$ at which content availability is maximized for a given number of A-UAVs, F-UAVs, and default system parameters. Beyond this optimal operating point, availability decreases due to cache underutilization in F-UAVs, as explained in Section 3.4.

3.6.2  Impacts of the Number of User Communities  

Figure 3.4(a) shows the impact of the number of deployed A-UAVs (i.e., the number of communities) on availability, while keeping the number of F-UAVs constant. These results are computed analytically from the equations provided in Section 3.5. The numbers show the percentage increase in availability compared to the no-F-UAV case. The figure shows that the benefits of data ferrying consistently decrease as the number of A-UAVs increases. The main reason is the reduction in the probability $P_{FA}$ (Eqn. 3.6) of physical access to the F-UAVs, caused by their longer overall cycle times. As shown in Section 3.6.3, this can be mitigated by deploying more F-UAVs.
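The trend in Figure 3.4(a) can be traced through $P_{FA}$ in Eq. (3.6). The sweep below assumes $N_F = 5$ and the default $T_{\mathrm{hover}}$, $T_{\mathrm{transit}}$ and $TAD$. It covers only this access-probability factor, not the full availability.

```python
def p_fa(n_a, n_f, t_hover, t_transit, tad):
    """Eq. (3.6): probability that an F-UAV is reachable within the TAD."""
    ratio = n_a / n_f
    gap = (ratio - 1.0) * t_hover + ratio * t_transit
    if tad >= gap:
        return 1.0
    return n_f * (t_hover + tad) / (n_a * (t_hover + t_transit))


sweep = {n_a: p_fa(n_a, 5, 20.0, 10.0, 20.0) for n_a in (5, 10, 15, 20)}
```

$P_{FA}$ falls from 1 with five A-UAVs to 1/3 with twenty, because each F-UAV’s cycle grows with $N_A$.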

40 

 
[Figure 3.4(a): maximum increase in availability (in %) and content availability (in %) versus the number of A-UAVs (1 A-UAV per community; 5 to 20), with $N_F = 5$; annotated maximum availabilities: 69.36%, 65.82%, 63.97%, 62.82%]

[Figure 3.4(b): A-UAV contribution and F-UAV contribution to availability (in %) versus Lambda (𝜆) from 0 to 1, for $N_A$ = 5, 10, 15, and 20]

Figure 3.4. (a) Maximum increase in Availability with A-UAVs, (b) Contribution % from A- and 

F-UAVs, (c) Maximum increase in Availability with F-UAVs 

Figure 3.4 (cont'd)

[Figure 3.4(c): maximum increase in availability (in %) versus the number of F-UAVs (5 to 15), with $N_A = 20$; annotated maximum availabilities include 63.07% and 69.56%]

A content can be provisioned to a user either by its local A-UAV or by a visiting F-UAV. Hence, availability has an A-UAV component and an F-UAV component, which are shown separately in Figure 3.4(b). As expected, as the amount of duplicated cached content in the A-UAVs goes up (i.e., with larger 𝜆), the contribution from the A-UAVs rises accordingly. The A-UAVs' contribution is, however, smaller for a larger number of communities, since the unique contents are distributed uniformly at random across more A-UAVs, as explained in Section 3.4. The contribution of the F-UAVs decreases because of the fall in $P_{FA}$, as discussed for Figure 3.4(a).

3.6.3  Impacts of Deploying Multiple F-UAVs 

Deploying more F-UAVs increases the probability $P_{FA}$ of physical access to the F-UAVs, thus improving availability over the corresponding no-F-UAV scenarios, as shown in Figure 3.4(c). These results are for 20 A-UAVs and are computed both analytically and from simulation. The improvement in $P_{FA}$ with an increasing number of F-UAVs can be derived from Eqn. 3.6 as

$$\Delta P_{FA} = \frac{(N_F^{new} - N_F^{old}) \times (TAD + T_{hover})}{N_A \times (T_{hover} + T_{transit})}$$

where $N_F^{new}$ and $N_F^{old}$ are the numbers of deployed F-UAVs after and before the additional deployments, respectively. $\Delta P_{FA}$ captures the rise in the accessibility of F-UAVs to communities, which in turn improves overall availability as given in Eqns. 1-4.
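A small numerical sketch makes the expression above concrete. It assumes, as the form of $\Delta P_{FA}$ implies, that below saturation $P_{FA}$ grows linearly with $N_F$, and it caps the value at 1 in the same way as Eqn. 4.5 in Chapter 4. The function names and the timing values are hypothetical illustrations, not parameters from this study.

```python
def p_fa(n_f: int, n_a: int, tad: float, t_hover: float, t_transit: float) -> float:
    """Physical-access probability to F-UAVs within TAD (linear form, capped at 1)."""
    return min(1.0, n_f * (tad + t_hover) / (n_a * (t_hover + t_transit)))


def delta_p_fa(n_f_new: int, n_f_old: int, n_a: int, tad: float,
               t_hover: float, t_transit: float) -> float:
    """Gain in P_FA from adding F-UAVs; exact only while neither value is capped."""
    return (n_f_new - n_f_old) * (tad + t_hover) / (n_a * (t_hover + t_transit))


# Hypothetical setting: 20 A-UAVs, TAD = 30 s, hover 10 s, transit 60 s.
n_a, tad, t_hover, t_transit = 20, 30.0, 10.0, 60.0
for n_f in (5, 10, 15):
    print(n_f, round(p_fa(n_f, n_a, tad, t_hover, t_transit), 3))
print("gain 5 -> 10:", round(delta_p_fa(10, 5, n_a, tad, t_hover, t_transit), 3))
```

In this sketch, once $N_F (TAD + T_{hover})$ reaches $N_A (T_{hover} + T_{transit})$ the cap is hit, and further F-UAVs no longer raise $P_{FA}$.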

3.6.4  Effects of Hover Time and Tolerable Access Delay 

Content availability is affected by both the F-UAV hover time and the user-specified TAD in an interdependent manner. These dependencies are shown in Figure 3.5(a) with $N_A = 10$, $N_F = 5$, and 𝜆 = 0.8. The figure shows non-monotonic behavior of availability with varying hover time and 𝑇𝐴𝐷. One notable observation is that for low 𝑇𝐴𝐷s, availability increases with hover time $T_{hover}$, whereas for high 𝑇𝐴𝐷s it decreases. This can be explained as follows. First, for $TAD < T_{transit}$, when an F-UAV travels from community i to the next community j, the F-UAV does not contribute to availability at community i or j for a duration of $T_{transit} - TAD$ (see Figure 3.2). In this case, it is advantageous for the F-UAV to hover over a community. Second, for $TAD > T_{transit}$, an increase in hover time makes the condition $(TAD - T_{hover}) > T_{transit}$ less likely to hold. In other words, the likelihood of exhausting the given 𝑇𝐴𝐷 before reaching the next community increases. It is therefore beneficial to hover less, which increases the accessibility of F-UAVs at the subsequent communities in the cycle before the TAD expires. Finally, for $TAD = T_{transit}$, an F-UAV adds to availability within 𝑇𝐴𝐷 irrespective of its hovering decision.


Figure 3.5. (a) Availability for variable $T_{hover}$ and TAD, (b) Increase in availability for different trajectories, (c) Unique contents for varying $N_A$ and 𝜆

Figure 3.5 (cont'd)

3.6.5  Effect of F-UAV Trajectory on Content Availability

The trajectory of an F-UAV determines which contents it carries and, through the A-UAVs in its trajectory cycle, its contribution to overall content availability. Figure 3.5(b) depicts these impacts, calculated analytically, for a system with 640 A-UAVs, 128 F-UAVs, $C_F = 200$, 𝛼 = 0.8, and all other parameters at their default values. The increase in availability is reported as the percentage difference between the baseline no-F-UAV case and the maximum content availability (i.e., at the optimal duplication factor 𝜆) for each F-UAV trajectory.

The global cycle (GC) trajectory refers to the case in which an F-UAV visits all A-UAVs in a single cycle. The content volume needed to fill the F-UAVs in this scenario (i.e., $C_A \cdot (1 - \lambda) \cdot N_A$) is quite high due to the large number of A-UAVs in the cycle, as shown in Figure 3.5(c). The next trajectory examined is partitioned cycle-1 ($PC_1$), in which the A- and F-UAVs are each divided into two sets, i.e., 2 sets of 320 A-UAVs and 2 sets of 64 F-UAVs. F-UAVs from the first set cycle around the first set of A-UAVs, and the same applies to the second sets of A-UAVs and F-UAVs. Functionally, this scenario is equivalent to a scaled-down GC system with half as many A-UAVs and F-UAVs as used for the GC results. In this scenario, the content availability is slightly larger than in the GC case, as can be seen in Figure 3.5(b), for the following reasons. First, the shorter cycle duration increases the probability $P_{FA}$ (Eqn. 3.6). Second, each cycle still contains enough A-UAVs to supply a sufficient number of unique contents (Figure 3.5(c)). Third, the optimal duplication factor 𝜆 is the same for both cases, i.e., 0.95. Thus, any increase in $P_{FA}$ increases content availability at the optimal 𝜆. The second and third partitioned cycles ($PC_2$ and $PC_3$) are functionally identical to $PC_1$, except that both the F-UAVs and the A-UAVs are divided into 4 and 8 equal sets, respectively. Because each cycle still has enough A-UAVs to fill its F-UAVs, content availability increases. Dividing the A-UAVs and F-UAVs further into 16 and 32 equal sets (i.e., $PC_4$ and $PC_5$) reduces availability due to the fewer A-UAVs in each F-UAV cycle (Figure 3.5(b)). In such cases, the cache space in the F-UAVs goes underutilized at the optimal 𝜆 value. To ensure that the F-UAV caches are adequately filled, a sub-optimal 𝜆 with less duplication is chosen. This can be seen in Figure 3.5(c), where 𝜆 reduces from 0.95 to 0.90 for $PC_4$ and to 0.80 for $PC_5$. This indicates that, for a given number of A- and F-UAVs, there exists an optimal partitioning at which the overall content availability is maximized.

3.6.6  Effects of Content Duplication on Access Delay 

Figure 3.6 shows a consistent reduction in average access delay with increasing A-UAV duplication factor 𝜆, computed from the analytical equations given in Section 3.5.

This reduction is explained as follows. With higher 𝜆, less popular contents are cached in the F-UAVs. As contents with low popularity are less likely to be requested under the Zipf distribution (see Section 3.3), the average access delay decreases accordingly. A substantial reduction in access delay due to underutilization of the F-UAVs' caches, explained in Section 3.4, can be seen in Figure 3.6 for values above 𝜆 = 0.95. It can also be seen that the average content access delay increases with the number of F-UAVs. Since delay arises only for contents that are cached in F-UAVs, more F-UAVs increase the quantity 𝑃=:E, which adds to the access delay.

[Figure 3.6: average access delay in seconds (AD) versus Lambda (𝜆) from 0 to 1, for $N_F$ = 5, 10, and 15]

Figure 3.6.   Increase in delay with increasing F-UAVs for varying 𝜆 

An F-UAV's hover time affects its overall cycle duration, which in turn determines how long the content availability from that F-UAV to the users remains low. During such Low Availability Periods (LAP), explained with Eqn. 3.10 in Section 3.5, only the contents cached locally in the A-UAVs remain available. LAP shrinks when more F-UAVs are deployed. This underlying effect is visible in Figure 3.3, where adding F-UAVs reduces LAP and boosts availability.

3.7  Summary and Conclusion

This chapter investigates caching policies in UAV networks for content dissemination in communication-challenged systems. Cache-enabled UAVs serve communities of users in a disaster- or war-stricken area by caching popular contents, reducing the need for downloads over satellites and other expensive vertical links. A framework is adopted in which two types of UAVs, namely anchor UAVs and ferrying UAVs, are deployed. Through analytical modeling and simulation experiments, the chapter establishes an optimal content duplication strategy in which a certain number of popular objects are duplicated in all anchor UAVs and a certain number of non-duplicated (unique) contents are carried in both types of UAVs. It was shown that content availability in such a system can be maximized by appropriately dimensioning the content duplication factor. The system was functionally validated, and its performance was evaluated for different scenarios, including various ferrying UAV trajectories. The next chapter extends this concept to a heterogeneous demand scenario in which requests can follow different popularity distributions. Additionally, to emulate a more realistic scenario, the generated requests are accompanied by user-specific tolerable access delays. Furthermore, the dynamics of ferrying UAVs' trajectories are considered to enhance the collective content provisioning capability of the UAV-aided network for all the aforementioned design components.

Chapter 4:  Using QoS-aware Caching for Handling Demand Heterogeneity in UAV-based Content Provisioning

4.1  Motivation 

In disaster- or conflict-affected areas, the collapse of communication infrastructure poses a significant barrier to timely information dissemination and recovery efforts. Deploying Unmanned Aerial Vehicles (UAVs) to form ad hoc networks has gained importance as a plausible solution, given their ability to navigate and operate in areas without stable infrastructure. Despite this, existing UAV-based communication models are largely ill-suited to environments with total infrastructure failure, especially when faced with demand heterogeneity. Such heterogeneity manifests through varying content popularity, urgency, and Quality of Service (QoS) expectations such as tolerable access delays. Handling it requires a more nuanced content caching approach than current systems offer, since they rely on static, long-term estimates of request patterns. There is a need for an agile and adaptive UAV-aided content dissemination framework capable of addressing these multifaceted challenges directly, ensuring that critical information reaches all user communities efficiently and reliably.

4.2  Design Objective

The research in this chapter aims to conceptualize and develop a UAV-aided content caching system tailored for communication-challenged environments. This system is envisioned to efficiently accommodate the heterogeneous demands of isolated user communities, optimizing content availability without excessive reliance on costly vertical connectivity. By leveraging the developed algorithmic approaches for content caching, the framework seeks to ensure high-availability content access across diverse user communities. A pivotal goal is to articulate the interdependencies among user demand patterns, users' urgencies, and caching mechanisms, with a focus on identifying the operational configurations that maximize content dissemination efficiency. Through rigorous simulation experiments and analytical modeling, the proposed system will be validated and evaluated, underscoring its potential as a resilient communication solution in the aftermath of disasters.

4.3  System Model

4.3.1  UAV Hierarchy  

The content distribution system is organized in two layers, namely, the anchor UAVs (A-UAVs) and the ferrying UAVs (F-UAVs). As shown in Figure 3.1, each partitioned community of users is served by an A-UAV using a lateral wireless link such as Wi-Fi. A-UAVs can download content via an expensive vertical link such as satellite-based Internet.

One monolithic design approach is to let the A-UAVs download all needed content, as requested by their local users. Without any inter-A-UAV data transfer, this approach has the following shortcomings. First, duplicated downloads incur a high cost over the expensive A-UAV vertical links because of overlaps in requests from different communities. Second, storage constraints cap the number of contents that can be downloaded and stored in each A-UAV, thus limiting content availability. To address these, ferrying UAVs (F-UAVs) are introduced. Unlike A-UAVs, the mobile F-UAVs do not possess vertical links, but they have lateral links such as Wi-Fi, through which they communicate with the A-UAVs and the users. The F-UAVs share with each community the contents downloaded by the A-UAVs serving other communities.

After receiving a request from one of its community users, an A-UAV first searches its 

local storage for the content. If not found, it waits for a potential future delivery of the content by 

one of the traveling F-UAVs. If no F-UAV with that content arrives within the specified TAD, the 

A-UAV downloads it. 
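The request-handling order above (local cache, then a visiting F-UAV within TAD, then a vertical-link download) can be sketched as follows. This is a minimal illustration, assuming that the F-UAVs scheduled to reach the A-UAV are known as (arrival delay, cached contents) pairs; `serve_request` and that representation are hypothetical.

```python
from typing import Iterable, Set, Tuple


def serve_request(content: int,
                  local_cache: Set[int],
                  fuav_visits: Iterable[Tuple[float, Set[int]]],
                  tad: float) -> str:
    """Return which tier provisions a request at an A-UAV.

    fuav_visits holds (delay_until_arrival_s, cached_contents) pairs for the
    F-UAVs expected at this A-UAV; this representation is hypothetical.
    """
    if content in local_cache:
        return "A-UAV"                          # served from local storage
    for delay, fuav_cache in sorted(fuav_visits, key=lambda v: v[0]):
        if delay > tad:
            break                               # later visits miss the TAD
        if content in fuav_cache:
            return "F-UAV"                      # delivered by a ferrying UAV
    return "download"                           # costly vertical (e.g., satellite) link


visits = [(12.0, {8, 9}), (45.0, {11, 12})]
print(serve_request(3, {1, 2, 3}, visits, tad=30.0))   # A-UAV
print(serve_request(9, {1, 2, 3}, visits, tad=30.0))   # F-UAV
print(serve_request(12, {1, 2, 3}, visits, tad=30.0))  # download
```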

4.3.2  Content Request and Provisioning Model 

Content Popularity: Studies have shown [119], [122] that content request patterns follow a Zipf distribution, in which the popularity of the content at rank i in a pool is inversely proportional to $i^{\alpha}$ [119]. The popularity of content i is given as:

$$p_z(i) = \frac{1/i^{\alpha}}{\sum_{j=1}^{C} 1/j^{\alpha}} \qquad (4.1)$$

The parameter 𝐶 represents the total number of contents in the pool, and the Zipf parameter 𝛼 determines the skewness of the distribution. The popularity sequence of contents may vary across communities, which introduces popularity heterogeneity, the focus of this chapter.
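A minimal Python sketch of Eqn. 4.1 follows. It assumes ranks start at 1 and represents a community's heterogeneity simply as a permutation of content IDs laid over the same Zipf curve. The helper names are hypothetical, and the two example sequences are the first ten entries of Communities 1 and 2 in Figure 4.3, renormalized over that truncated ten-content pool.

```python
def zipf_popularity(num_contents: int, alpha: float) -> list[float]:
    """p_z(i) for ranks i = 1..C (Eqn. 4.1): i^(-alpha) normalized over the pool."""
    weights = [i ** -alpha for i in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def community_popularity(sequence: list[int], alpha: float) -> dict[int, float]:
    """Popularity of each content ID at one community, given that community's
    popularity sequence (content IDs ordered from most to least popular)."""
    probs = zipf_popularity(len(sequence), alpha)
    return {content: probs[rank] for rank, content in enumerate(sequence)}


pop_1 = community_popularity([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], alpha=0.8)
pop_2 = community_popularity([4, 5, 6, 7, 8, 9, 1, 10, 2, 3], alpha=0.8)
print(round(pop_1[1], 3), round(pop_2[1], 3))  # content 1 is less popular at community 2
```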

Content Requests: Poisson-distributed request generation is a prevalent way to model user requests in practical networks.
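In simulation, the Poisson assumption is commonly realized by drawing exponential inter-arrival times. The sketch below is one such realization, assuming a single request rate per community and content ranks drawn from a given popularity vector (for example, the output of Eqn. 4.1). The function name, rate, and horizon are hypothetical.

```python
import random


def generate_requests(rate: float, horizon: float, popularity: list[float],
                      seed: int = 0) -> list[tuple[float, int]]:
    """Poisson request arrivals for one community over [0, horizon] seconds.

    Inter-arrival times are exponential with mean 1/rate, and each request
    draws a content rank 1..C with probability popularity[rank - 1].
    """
    rng = random.Random(seed)
    ranks = range(1, len(popularity) + 1)
    t, requests = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return requests
        requests.append((t, rng.choices(ranks, weights=popularity)[0]))


reqs = generate_requests(rate=2.0, horizon=100.0, popularity=[0.5, 0.3, 0.2])
print(len(reqs), reqs[:3])
```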

Tolerable Access Delay and Content Provisioning: For each generated request, a Tolerable Access Delay (TAD) [123], [124] is specified. TAD is a Quality-of-Service parameter that indicates how long a user is willing to wait before a requested content can be accessed. If a content is not available from the UAVs within the specified TAD, it is downloaded from a central server using the expensive vertical links of the A-UAVs.

4.4  Caching Policies to Handle Heterogeneity

This chapter focuses on the following caching-related design questions: a) which contents should be downloaded and cached in A-UAVs so that they can serve their own community directly and remote communities via traveling F-UAVs; b) which contents should be cached when the popularity and TAD of contents vary across communities; c) which contents should be transferred from the A-UAVs to the F-UAVs; and d) which inter-community trajectories the F-UAVs should follow.

This chapter addresses these questions with pre-assigned and globally known heterogeneous content popularities, and with content pre-placements at the A-UAVs. Different pre-programmed F-UAV trajectories are characterized together with a multitude of content placement strategies. Building on the understanding of these scenarios, runtime and dynamic mechanisms have been developed and are reported in later chapters.

[Figure 4.1 (Full Duplication): the popularity sequence at Communities 1, 2, and 3 is {1, 2, 3, …, 22, …}; A-UAVs 1, 2, and 3 each cache {1, 2, …, 10}]

Figure 4.1.  Example of FD at 3 A-UAVs with 10 cached contents in the system

4.4.1  Caching at Anchor UAVs (A-UAVs) 

A naïve fully duplicated (FD) [119] mechanism limits the number of accessible contents for all user communities to $C_A$, the A-UAV cache size, because every A-UAV duplicates the contents requested by its user community (see Figure 4.1). This limitation can be addressed by storing a certain number of exclusive contents in each A-UAV and sharing those contents across communities via the traveling F-UAVs. This Smart Exclusive Caching (SEC) mechanism can effectively increase the number of contents accessible to all users across the entire system, thus improving the overall availability within a given TAD.

[Figure 4.2 (Smart Exclusive Caching with Homogeneous Popularity Sequence): the popularity sequence at Communities 1, 2, and 3 is {1, 2, 3, …, 22, …}. With SSF 𝜆 = 0.7, Segment 1 of each of A-UAVs 1, 2, and 3 caches {1, 2, 3, 4, 5, 6, 7}; Segment 2 caches {9, 12, 14} in A-UAV 1, {11, 15, 16} in A-UAV 2, and {8, 10, 13} in A-UAV 3]

Figure 4.2.  Content Caching Policy in 3 A-UAVs with Cache Size $C_A = 10$ for Homogeneous Content Popularity Sequence

Consider a disaster- or war-stricken area with a homogeneous content popularity sequence across all user communities, where an A-UAV is assigned to each community for content provisioning. Under the SEC mechanism, the cache space of an A-UAV has two segments, Segment-1 and Segment-2. Let the size of Segment-1 of the A-UAV cache be $C_{S1} = \lambda C_A$ and that of Segment-2 be $C_{S2} = (1 - \lambda) C_A$, where 𝜆 is a storage segmentation factor (SSF) that determines the size of Segment-1. The top $\lambda C_A$ popular contents are cached in Segment-1 of the A-UAVs. These contents are the same across all A-UAVs, whereas the Segment-2 contents differ. This results in $C_{S2}^{total} = N_A (1 - \lambda) C_A$ Segment-2 contents stored across all $N_A$ A-UAVs, which can be shared with all user communities via the mobile F-UAVs. These contents are the next most popular after the top $\lambda C_A$ Segment-1 contents. For symmetry, all $N_A (1 - \lambda) C_A$ Segment-2 contents are distributed uniformly at random across the $N_A$ A-UAVs. The total number of contents in this content dissemination system is:

$$C_{tot} = \lambda C_A + N_A (1 - \lambda) C_A \qquad (4.2)$$

Figure 4.2 shows an example of this caching policy with 3 A-UAVs and storage segmentation factor 𝜆 = 0.7. Contents in Segment-1 are the same across all three A-UAVs, while those in Segment-2 differ. The total set of contents across all A-UAVs is {1, …, 16}, in agreement with Eqn. 4.2 ($0.7 \times 10 + 3 \times 0.3 \times 10 = 16$).
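The SEC placement and Eqn. 4.2 can be reproduced with a short sketch. It assumes contents are identified by their popularity rank (1 = most popular) and uses a seeded shuffle for the uniform random distribution of Segment-2 contents, so the exact Segment-2 sets will generally differ from those drawn in Figure 4.2. `sec_placement` is a hypothetical helper name.

```python
import random


def sec_placement(n_a: int, cache_size: int, lam: float, seed: int = 0):
    """Smart Exclusive Caching for a homogeneous popularity sequence.

    Segment-1 (lam * C_A slots) holds the top-ranked contents in every A-UAV;
    the next N_A * (1 - lam) * C_A ranks are shuffled and dealt out so that
    Segment-2 contents never repeat across A-UAVs.
    """
    s1 = round(lam * cache_size)            # C_S1
    s2 = cache_size - s1                    # C_S2
    segment1 = list(range(1, s1 + 1))
    pool = list(range(s1 + 1, s1 + n_a * s2 + 1))
    random.Random(seed).shuffle(pool)       # uniform random distribution
    return [(segment1, sorted(pool[k * s2:(k + 1) * s2])) for k in range(n_a)]


caches = sec_placement(n_a=3, cache_size=10, lam=0.7)
for k, (seg1, seg2) in enumerate(caches, start=1):
    print(f"A-UAV {k}: Segment-1 {seg1}, Segment-2 {seg2}")
c_tot = len({c for seg1, seg2 in caches for c in seg1 + seg2})
print(c_tot)  # 0.7*10 + 3*(1-0.7)*10 = 16, matching Eqn. 4.2
```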

[Figure 4.3 (Popularity-Based Caching with Heterogeneous Popularity Sequence): the popularity sequences are {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …, 22, …} at Community 1, {4, 5, 6, 7, 8, 9, 1, 10, 2, 3, 11, 12, …, 22, …} at Community 2, and {9, 10, 11, 12, 13, 5, 6, 14, 1, 2, 3, 4, 7, 8, 15, 16, …, 22, …} at Community 3. With SSF 𝜆 = 0.7, Segment 1 caches {1, 2, 3, 4, 5, 6, 7} in A-UAV 1, {4, 5, 6, 7, 8, 9, 1} in A-UAV 2, and {9, 10, 11, 12, 13, 5, 6} in A-UAV 3; Segment 2 caches {15, 19, 20} in A-UAV 1, {14, 16, 22} in A-UAV 2, and {17, 18, 21} in A-UAV 3]

Figure 4.3.  Content Caching Policy in 3 A-UAVs with Cache Size $C_A = 10$ for Heterogeneous Content Popularity Sequence

4.4.2  Caching to Cater to Heterogeneous Popularity Sequences

The caching policy described so far assumes that contents have the same popularity sequence for requests across all communities. In practice, requests can be heterogeneous in that the popularity sequence of requested contents can differ across communities. For example, in the case of a fire outbreak, information about fire trucks and medical care is the most popular content for areas in the vicinity of the fire, whereas areas in the path of the fire's spread need logistical support for relocation. In such heterogeneous popularity cases, the previous caching policy may not be the best fit, since the most popular contents may not be the same for all communities.

This limitation can be addressed by caching a community's most popular contents in Segment-1 of its local A-UAV. Figure 4.3 shows a scenario with three A-UAVs at their respective communities, each with a different content popularity sequence. Each A-UAV caches the most popular contents according to its community's popularity sequence. It can be observed that contents {5, 6} are cached in all the A-UAVs, contents {1, 4, 7} are cached in A-UAVs 1 and 2, content 9 is cached in A-UAVs 2 and 3, and contents {2, 3}, {8}, and {10, 11, 12, 13} are cached only in A-UAV 1, 2, and 3, respectively. Contents that are cached in one or some of the A-UAVs, but not in all of them (here {1, 2, 3, 4, 7, 8, 9, 10, 11, 12, 13}), are called exclusive contents of Segment-1, whereas contents cached at all A-UAVs (here {5, 6}) are called non-exclusive contents of Segment-1. Therefore, unlike SEC, the number of distinct contents in Segment-1 across all A-UAVs may exceed $\lambda C_A$, i.e., $C_{S1}^{total} = C_{NE} + C_E^{total} \geq \lambda C_A$, where $C_{NE}$ and $C_E^{total}$ are the numbers of non-exclusive and total exclusive contents in Segment-1. As in SEC, contents in Segment-2 do not repeat across A-UAVs. The total number of contents in the system is then:

$$C_{tot} = C_{NE} + C_E^{total} + N_A (1 - \lambda) C_A \geq \lambda C_A + N_A (1 - \lambda) C_A \qquad (4.3)$$

Eqn. 4.3 can be validated with Figures 4.2 and 4.3. The contents stored across all A-UAVs are {1, …, 16} with SEC (Figure 4.2), whereas in the heterogeneous case (Figure 4.3) the cached contents are {1, …, 22}. The objective here is to choose a 𝜆 that strikes the right balance between the two segments and maximizes the overall content availability in the heterogeneous case.
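The Segment-1 accounting behind Eqn. 4.3 can be checked mechanically. The sketch below classifies contents as non-exclusive (cached at every A-UAV) or exclusive (cached at some but not all), using the cache contents listed in Figure 4.3 as read from the extracted figure. The helper name is hypothetical.

```python
def segment1_accounting(segment1: list[set[int]]) -> tuple[set[int], set[int]]:
    """Split the Segment-1 contents of all A-UAVs into non-exclusive ones
    (cached at every A-UAV) and exclusive ones (cached at some, not all)."""
    non_exclusive = set.intersection(*segment1)
    exclusive = set.union(*segment1) - non_exclusive
    return non_exclusive, exclusive


# Segment-1 and Segment-2 contents from Figure 4.3 (lambda = 0.7, C_A = 10).
seg1 = [{1, 2, 3, 4, 5, 6, 7}, {4, 5, 6, 7, 8, 9, 1}, {9, 10, 11, 12, 13, 5, 6}]
seg2 = [{15, 19, 20}, {14, 16, 22}, {17, 18, 21}]
ne, ex = segment1_accounting(seg1)
n_a, c_a, lam = 3, 10, 0.7
c_tot = len(ne) + len(ex) + sum(len(s) for s in seg2)     # left side of Eqn. 4.3
print(sorted(ne), sorted(ex))
print(c_tot, ">=", round(lam * c_a + n_a * (1 - lam) * c_a))
```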

One limitation of popularity-based caching relates to the tolerable access delay (𝑇𝐴𝐷). Popularity-based caching does not consider content-specific 𝑇𝐴𝐷s, which reduces availability for contents with both low 𝑇𝐴𝐷 and low popularity. This is explained using the following example.

Consider a content 𝑥 whose popularity rank lies within the top $\lambda C_A$ and another content 𝑦 ranked below $\lambda C_A$. Under popularity-based caching, content 𝑥 is cached in Segment-1 of an A-UAV, and content 𝑦 is cached in Segment-2 of one of the A-UAVs. Therefore, content 𝑦 is ferried across all communities by the F-UAVs. Let the inter-community distances be such that an F-UAV j reaches a community within 30 seconds of the departure of the previous F-UAV j − 1. If the 𝑇𝐴𝐷 associated with 𝑥 is 100 seconds and that of 𝑦 is 5 seconds, a request for 𝑦 will rarely be served by the A-UAV/F-UAV content dissemination network, because the 5-second window is much shorter than the gap of up to 30 seconds between consecutive F-UAV visits. As a result, 𝑦 will mostly be downloaded over the vertical links. This issue is addressed by the value-based caching strategy proposed in the next subsection.

[Figure 4.4 (Value-Based Caching, with a TAD of 5 sec for content 15 and 100 sec for the other contents): the popularity sequences at Communities 1, 2, and 3 are the same as in Figure 4.3. With SSF 𝜆 = 0.7, Segment 1 caches {1, 2, 3, 4, 5, 6, 7} in A-UAV 1, {4, 5, 6, 7, 8, 9, 15} in A-UAV 2, and {9, 10, 11, 12, 13, 5, 15} in A-UAV 3; Segment 2 caches {15, 19, 20} in A-UAV 1, {14, 16, 22} in A-UAV 2, and {17, 18, 21} in A-UAV 3]

Figure 4.4. Example of VBC with low TAD content '15' at Segment-1 of A-UAV

4.4.3  Value-based Caching to Handle Heterogeneous TAD 

All the policies discussed so far make the caching decision for a content based on its popularity at the community where it is requested. However, the promptness with which a content needs to be provisioned is not always positively correlated with its popularity. For example, requests for logistical support information can be more popular than requests for information about first responders in a post-disaster situation, yet the TAD for first-responder information is expected to be shorter. To prioritize caching of such contents in Segment-1 of the A-UAVs, this chapter devises a value-based caching policy in which the value of a requested content is calculated from its popularity and its associated 𝑇𝐴𝐷. The value of a content 𝑖 can be expressed as:

$$V(i) = \kappa \upsilon \times \frac{p_z(i)}{TAD(i)} = \kappa \times \frac{TAD_{min}}{p_z(1)} \times \frac{p_z(i)}{TAD(i)} \qquad (4.4)$$

Here, $p_z(i)$ is the popularity of the content as per the Zipf distribution, $TAD(i)$ is the tolerable access delay associated with the content request, $\kappa \in [0, 1]$ is a scalar weight that increases with decreasing popularity, and 𝜐 is a normalization constant. For a given Zipf (popularity) parameter 𝛼, the normalization constant $\upsilon = TAD_{min}/p_z(1)$ is calculated from the minimum possible 𝑇𝐴𝐷 ($TAD_{min}$) and the maximum possible popularity $p_z(1)$. Since $p_z(i) \leq p_z(1)$ and $TAD(i) \geq TAD_{min}$, the quantity $V(i)$ is bounded in [0, 1], and it increases with $p_z(i)$ and with decreasing $TAD(i)$. This value-based caching policy increases the likelihood that contents requested with low 𝑇𝐴𝐷 are cached in Segment-1 of the A-UAVs, thus making them more readily available (see Figure 4.4).
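Eqn. 4.4 translates directly into code. The sketch below holds 𝜅 constant, because the chapter states only that 𝜅 increases as popularity decreases without giving its functional form. The TAD assignment (5 s for content 15, 100 s otherwise) loosely mirrors the Figure 4.4 example and is otherwise hypothetical, as is the function name.

```python
def content_value(p_i: float, tad_i: float, p_1: float, tad_min: float,
                  kappa: float = 1.0) -> float:
    """V(i) = kappa * (TAD_min / p_z(1)) * p_z(i) / TAD(i), as in Eqn. 4.4."""
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [0, 1]")
    upsilon = tad_min / p_1                 # normalization constant
    return kappa * upsilon * p_i / tad_i


# Zipf popularity over 22 contents with alpha = 0.8 (rank = content ID).
alpha, n = 0.8, 22
weights = {i: i ** -alpha for i in range(1, n + 1)}
total = sum(weights.values())
p = {i: w / total for i, w in weights.items()}
tad = {i: 5.0 if i == 15 else 100.0 for i in p}      # hypothetical TADs
values = {i: content_value(p[i], tad[i], p[1], tad_min=5.0) for i in p}
ranked = sorted(values, key=values.get, reverse=True)
print(ranked[:8])  # the low-TAD content 15 now outranks the most popular content
```

With these numbers, content 15 moves ahead of content 1 in the value ranking, which is the effect the value-based policy is designed to produce for Segment-1.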

4.4.4  Caching at Ferrying UAVs (F-UAVs) 

The purpose of the F-UAVs is to ferry around the $C_E^{total} + N_A (1 - \lambda) C_A$ contents stored across the $N_A$ A-UAVs (see Eqn. 4.3). Because the per-F-UAV cache space ($C_F$) is limited, the F-UAV caching policy should be determined based on the F-UAV trajectory, the value of 𝜆, the Zipf popularity, and the 𝑇𝐴𝐷s associated with the contents to be cached.

F-UAV caching policy is explained in the pseudocode below. 

Algorithm 4.1. F-UAV Caching Algorithm with Value-based policy at A-UAV

1.  Input: Total A-UAVs in its trajectory, 𝑇𝐴𝐷, next A-UAV i, present A-UAV i − 1
2.  Output: $C_F$ contents for F-UAV j
3.  Initialize $C_A$ contents in each A-UAV based on value of contents
4.  while True:
5.        if F-UAV is leaving for next A-UAV i then do
6.              for each content k in the cache $C_A^i$ of A-UAV i do
7.                    Check if k is in the $C_F$ cache space of F-UAV j
8.                    if true then do
9.                          Replace k with the highest-value content from $C_A^{i-1}$
                              that is not cached in F-UAV j or A-UAV i
10.                   end if
11.             end for
12.       end if
13.       Update next A-UAV i, present A-UAV i − 1
14. end while
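The replacement step of Algorithm 4.1 can be sketched in Python as follows. The sketch assumes caches are sets of content IDs and every content has a precomputed value $V(i)$. It also assumes that a redundant content is kept when A-UAV i − 1 has no eligible replacement, a case the pseudocode leaves open. `refresh_fuav_cache` and the example caches (in the style of Figure 4.2) are hypothetical.

```python
def refresh_fuav_cache(fuav_cache: set[int], next_auav: set[int],
                       present_auav: set[int], value: dict[int, float]) -> set[int]:
    """Departure step of Algorithm 4.1 for F-UAV j leaving A-UAV i-1 for A-UAV i.

    Each content the F-UAV carries that A-UAV i already caches is redundant at
    the next stop, so it is swapped for the highest-value content of A-UAV i-1
    that neither the F-UAV nor A-UAV i holds.
    """
    cache = set(fuav_cache)
    # Eligible replacements from the present A-UAV, best value first.
    candidates = sorted(present_auav - cache - next_auav, key=value.get, reverse=True)
    for k in sorted(next_auav & cache):
        if not candidates:
            break                       # no eligible content left; keep k (assumption)
        cache.remove(k)
        cache.add(candidates.pop(0))
    return cache


value = {c: 1.0 / c for c in range(1, 23)}     # hypothetical: lower ID = higher value
auav_2 = {1, 2, 3, 4, 5, 6, 7, 11, 15, 16}    # present stop (A-UAV i-1)
auav_3 = {1, 2, 3, 4, 5, 6, 7, 8, 10, 13}     # next stop (A-UAV i)
print(sorted(refresh_fuav_cache({8, 9, 11}, auav_3, auav_2, value)))  # 8 is swapped for 15
```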

Consider a situation in which an F-UAV j is approaching A-UAV i. Let $U_i$ be the set of all exclusive contents in Segment-1 of all A-UAVs and all contents in Segment-2 of all A-UAVs in the entire system, except those stored in A-UAV i. To maximize content availability for the users in A-UAV i's community, the F-UAV should carry the $C_F$ top-valued contents (see Eqn. 4.4) from the set $U_i$ while approaching A-UAV i. The size of the set $U_i$ can be expressed as $|U_i| = C_E^{total} + (N_A - 1)(1 - \lambda) C_A$. When $C_F \leq |U_i|$, the F-UAV should carry the $C_F$ top-valued contents as outlined above. Otherwise, the F-UAV should carry all $|U_i|$ contents, leaving part of its cache (i.e., $C_F - |U_i|$ slots) empty. This causes underutilization of the F-UAV cache space.
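The selection rule can be sketched as below, assuming each A-UAV cache is a set of content IDs and values come from Eqn. 4.4. Computing $U_i$ as a set difference follows the verbal definition above. The helper name and the value ordering used in the example are hypothetical; the caches are those listed in Figure 4.3.

```python
def fuav_load_for(target: int, auav_caches: list[set[int]], c_f: int,
                  value: dict[int, float]) -> tuple[list[int], int]:
    """Contents an F-UAV should carry while approaching A-UAV `target`.

    U_i is everything cached at the other A-UAVs but not at A-UAV i. The F-UAV
    takes the C_F top-valued members of U_i; if |U_i| < C_F it carries all of
    U_i and the remaining slots stay empty (cache underutilization).
    """
    others = set().union(*(c for k, c in enumerate(auav_caches) if k != target))
    u_i = others - auav_caches[target]
    load = sorted(u_i, key=value.get, reverse=True)[:c_f]
    return load, c_f - len(load)        # (contents carried, empty slots)


# A-UAV caches (Segment-1 plus Segment-2) from Figure 4.3.
auav = [{1, 2, 3, 4, 5, 6, 7, 15, 19, 20},
        {1, 4, 5, 6, 7, 8, 9, 14, 16, 22},
        {5, 6, 9, 10, 11, 12, 13, 17, 18, 21}]
value = {c: 1.0 / c for c in range(1, 23)}   # hypothetical value ordering
print(fuav_load_for(0, auav, c_f=5, value=value))
print(fuav_load_for(0, auav, c_f=15, value=value)[1])  # 3 empty slots
```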

The next section discusses the deployment and trajectory planning methods for F-UAVs 

employed in this chapter to boost content availability for the requesting users.  

4.5  Deployment and Trajectory Planning of UAVs

Ferrying UAVs travel across communities to share contents among the A-UAVs. In this 

setting, the trajectory of F-UAVs can greatly impact content availability to the requesting users. 

4.5.1  Trajectory Sequence and Cycle  

An F-UAV's trajectory cycle is represented by the sequence of A-UAVs that it visits. The cycle time of an F-UAV trajectory is $T_{cycle} = N_A^{p} \times (T_{hover} + T_{transition})$, where $N_A^{p}$ is the number of A-UAVs in the F-UAV's sequence, $T_{hover}$ is the hover duration at each A-UAV, and $T_{transition}$ is the transition time between two consecutive A-UAVs in the F-UAV's sequence. $T_{transition}$ depends on the F-UAV flying speed, the inter-community distance, wind speed and direction, and other environmental factors. $T_{hover}$ is the minimum duration necessary for completing the data exchange between UAVs.
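As a quick illustration, the sketch below computes $T_{cycle}$ and, under the added assumption that $N_F$ F-UAVs share one sequence at even spacing, the interval between successive F-UAV visits to an A-UAV. That interval minus one hover time equals the threshold that 𝑇𝐴𝐷 is compared against in Eqn. 4.5. Function names and timings are hypothetical.

```python
def cycle_time(n_a_in_sequence: int, t_hover: float, t_transition: float) -> float:
    """T_cycle: hover plus transition time for every A-UAV in the sequence."""
    return n_a_in_sequence * (t_hover + t_transition)


def revisit_interval(n_a_in_sequence: int, n_f: int, t_hover: float,
                     t_transition: float) -> float:
    """Time between consecutive F-UAV visits to one A-UAV, assuming the n_f
    F-UAVs follow the same sequence and are evenly spaced along it."""
    return cycle_time(n_a_in_sequence, t_hover, t_transition) / n_f


# Hypothetical timings: 10 A-UAVs in the sequence, 10 s hover, 60 s transition.
print(cycle_time(10, 10.0, 60.0))           # 700.0 seconds
print(revisit_interval(10, 5, 10.0, 60.0))  # 140.0 seconds with 5 F-UAVs
```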

4.6  Content Dissemination Performance

4.6.1  Content Availability  

Content availability is defined as the probability of finding a requested content in the UAV-aided caching paradigm within the specified Tolerable Access Delay (𝑇𝐴𝐷). The accessibility of the F-UAVs within a given 𝑇𝐴𝐷, as they transition in a round-robin manner across the A-UAVs in their trajectory, is expressed as:

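$$P_{FA} = \begin{cases} \dfrac{N_F \times (T_{hover} + TAD)}{N_A \times (T_{hover} + T_{transition})}, & TAD < \left(\dfrac{N_A}{N_F} - 1\right) T_{hover} + \dfrac{N_A}{N_F} T_{transition} \\ 1, & TAD \geq \left(\dfrac{N_A}{N_F} - 1\right) T_{hover} + \dfrac{N_A}{N_F} T_{transition} \end{cases} \qquad (4.5)$$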

The condition in Eqn. 4.5 shows that when the interval between two consecutive F-UAV visits to an A-UAV is less than 𝑇𝐴𝐷, the contents cached in the F-UAVs are always accessible. However, when 𝑇𝐴𝐷 is less than that interval, the contents in the F-UAVs are only partially accessible. Note that physical accessibility to F-UAVs does not guarantee access to a requested content, since the F-UAVs can store only a limited number (i.e., $C_F$) of contents. Let $P_F$ be the probability that the requested content can be found within an F-UAV. It can be expressed as:

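$$P_F = \frac{\sum_{i=C_A+1}^{C_A+C_{EFF}} V(i)}{\sum_{i=1}^{C} V(i)} = \frac{\sum_{i=C_A+1}^{C_A+C_{EFF}} \kappa \times \frac{p_z(i)}{TAD(i)} \times \frac{TAD_{min}}{p_z(1)}}{\sum_{i=1}^{C} \kappa \times \frac{p_z(i)}{TAD(i)} \times \frac{TAD_{min}}{p_z(1)}} \qquad (4.6)$$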

where $V(i)$ is the value of content i (Eqn. 4.4), and $C_A$, $C_E$, and $C$ are the cache size, the number of exclusive contents in Segment-1 of each A-UAV, and the total number of contents, respectively. The effective cache size of an F-UAV is:

$$C_{EFF} = \min\{C_F, (N_A - 1)(C_E + (1 - \lambda) C_A)\} \qquad (4.7)$$

Now, $P_A$, the probability of finding a requested content in the local A-UAV of the request-generating community, is expressed as:

𝑃: =

∑

A×&-;(!CA)×&-
"<!

T(,)

∑

&
"<!

T(,)

=

∑

A×&-;(!CA)×&-
"<!

U×

=#(")
.-@(")

×

.-@:"0
=#(!)

∑

&
"<!

U×

=#(")
.-@(")

×

.-@:"0
=#(!)

                         (4.8) 

Combining Eqns. 4.5, 4.6 and 4.8, the overall availability is:

$$P_{Avail} = P_A + P_{FA} \times P_F$$                                                      (4.9)

To summarize, local contents from A-UAVs and contents from future visiting F-UAVs contribute towards the overall availability $P_{Avail}$ within the specified 𝑇𝐴𝐷𝑠.
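The sketch below evaluates Eqns. 4.6 to 4.9 for a list of content values $V(1), \ldots, V(C)$ indexed by popularity rank. It is illustrative only: the function name and the example parameters ($C_E = 20$, $\lambda = 0.5$) are hypothetical, and the summation limits follow Eqns. 4.6 and 4.8 as printed.

```python
def availability(values, c_a, c_f, c_e, n_a, lam, p_fa):
    """Return (P_A, P_F, P_Avail) following Eqns. 4.6-4.9.

    values: V(1..C) as a list, index 0 holding V(1).
    c_a, c_f: A-UAV and F-UAV cache sizes; c_e: exclusive Segment-1 contents;
    n_a: number of A-UAVs; lam: storage segmentation factor; p_fa: Eqn. 4.5.
    """
    c = len(values)
    total = sum(values)
    # Eqn. 4.7: effective F-UAV cache size, floored to whole contents
    c_eff = int(min(c_f, (n_a - 1) * (c_e + (1 - lam) * c_a)))
    # Eqn. 4.8: local A-UAV holds ranks 1 .. lam*C_A + (1-lam)*C_A
    local_upper = round(lam * c_a + (1 - lam) * c_a)
    p_a = sum(values[:local_upper]) / total
    # Eqn. 4.6: ranks C_A+1 .. C_A+1+C_EFF (1-based), clipped to C
    lower, upper = c_a + 1, min(c, c_a + 1 + c_eff)
    p_f = sum(values[lower - 1:upper]) / total
    # Eqn. 4.9: overall availability
    return p_a, p_f, p_a + p_fa * p_f


# Equal values for all 2000 contents, Table 4.1 cache sizes, hypothetical C_E and lam
p_a, p_f, p_avail = availability([1.0] * 2000, c_a=100, c_f=100, c_e=20,
                                 n_a=20, lam=0.5, p_fa=1.0)
print(p_a, p_f, p_avail)
```

With equal values, $P_A$ reduces to $C_A / C$, which makes the example easy to check by hand; real value lists would come from Eqn. 4.4.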


 
Note that all contents that are unavailable within their specified 𝑇𝐴𝐷𝑠 are downloaded by the A-UAVs over expensive vertical links such as satellite Internet. Availability therefore indirectly indicates the content download cost in the system: the higher the availability, the fewer the costly downloads.

4.6.2  Low Availability Period  

Consider the scenario in Figure 4.5 with two A-UAVs and one F-UAV. The users in a community can access the content in the F-UAV for a duration of $TAD + T_{hover}$. The time that elapses before the F-UAV returns and the users in that community can access its content again is $2T_{transition} + T_{hover} - TAD$. During this period, content is available only from the local A-UAV.

[Figure 4.5 timelines: F-UAV visits to A-UAV 1 and A-UAV 2 with intervals $T_{hover}$ and $T_{transition}$; the gap before the F-UAV returns is $2T_{transition} + T_{hover}$ (top) and, with access window $T_{hover} + TAD$, $2T_{transition} + T_{hover} - TAD$ (bottom)]

Figure 4.5.   (Top) Scenario with 𝑇𝐴𝐷 = 0; (Bottom) With non-zero 𝑇𝐴𝐷 

This duration is referred to as the low availability period, which can be generally expressed as: 

$$LAP = \frac{N_A\, T_{transition} + (N_A - 1)\, T_{hover} - TAD}{N_F}$$                                      (4.10)

where $N_A$ and $N_F$ are the numbers of A-UAVs and F-UAVs in the system. The low availability period is an effect of long cycle duration and does not necessarily indicate the actual content availability in the system.
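Eqn. 4.10 can be checked against the two-A-UAV, one-F-UAV scenario of Figure 4.5. In the minimal sketch below, the function name is hypothetical, and clamping negative results to zero is an added assumption that is not part of Eqn. 4.10.

```python
def low_availability_period(n_a, n_f, t_hover, t_transition, tad):
    """Low availability period of Eqn. 4.10, in seconds.

    Clamping negative values to zero is an assumption, not part of Eqn. 4.10.
    """
    lap = (n_a * t_transition + (n_a - 1) * t_hover - tad) / n_f
    return max(0.0, lap)


# Figure 4.5 scenario: two A-UAVs, one F-UAV, T_hover = 20 s, T_transition = 10 s
print(low_availability_period(2, 1, 20, 10, 0))   # 2*T_transition + T_hover = 40 s
print(low_availability_period(2, 1, 20, 10, 15))  # shortened by TAD to 25 s
```

For $N_A = 2$ and $N_F = 1$ the expression reduces to $2T_{transition} + T_{hover} - TAD$, matching the bottom timeline of Figure 4.5.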

4.6.3  Content Access Delay 

Any request that is served by the local A-UAV or found within a visiting F-UAV experiences zero access delay. The only scenario with a non-zero access delay is the one in which the requested content is available at an F-UAV that is not currently visiting the requesting community.

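The probability of this scenario, $P_{TAD}$, can be expressed as:

$$P_{TAD} = \begin{cases} \dfrac{N_F \times TAD}{N_A \times (T_{hover} + T_{transition})}, & \text{for } TAD < \dfrac{T_{cycle}}{N_F} - T_{hover} \\[2ex] \dfrac{N_F \times \left(\dfrac{T_{cycle}}{N_F} - T_{hover}\right)}{N_A \times (T_{hover} + T_{transition})}, & \text{for } TAD \geq \dfrac{T_{cycle}}{N_F} - T_{hover} \end{cases}$$                      (4.11)

where $T_{cycle}$ is the duration of an F-UAV's round-robin cycle across the A-UAVs.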

As per the second case of Eqn. 4.11, if 𝑇𝐴𝐷 is larger than the time it takes for the F-UAV to reach the request-generating community, the content is delayed by the time the F-UAV takes to reach that community. Conversely, for lower 𝑇𝐴𝐷𝑠, the content is delayed by at most the 𝑇𝐴𝐷 duration. The average delays incurred in these two cases are:

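$$Delay_{av} = \begin{cases} \dfrac{TAD}{2}, & \text{for } TAD < \dfrac{T_{cycle}}{N_F} - T_{hover} \\[2ex] \dfrac{1}{2}\left(\dfrac{T_{cycle}}{N_F} - T_{hover}\right), & \text{for } TAD \geq \dfrac{T_{cycle}}{N_F} - T_{hover} \end{cases}$$                                  (4.12)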

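Using $P_{TAD}$ and $Delay_{av}$, the access delay is calculated as follows:

$$AD = P_{TAD} \times \frac{\sum_{\forall i \in C_F} \kappa \times \frac{p_z(i)}{TAD(i)} \times \frac{TAD_{min}}{p_z(1)}}{\sum_{i=1}^{C} \kappa \times \frac{p_z(i)}{TAD(i)} \times \frac{TAD_{min}}{p_z(1)}} \times Delay_{av}$$                              (4.13)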

Here, the access delay is calculated separately for each content-specific 𝑇𝐴𝐷.
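Eqns. 4.11 to 4.13 can be combined into a single helper, sketched below. It assumes $T_{cycle} = N_A (T_{hover} + T_{transition})$ (one visit to every A-UAV per round-robin cycle), which is consistent with the threshold of Eqn. 4.5; the parameter `ferried_value_fraction` is a hypothetical name for the ratio of sums in Eqn. 4.13.

```python
def access_delay(n_a, n_f, t_hover, t_transition, tad, ferried_value_fraction):
    """Return (P_TAD, Delay_av, AD) following Eqns. 4.11-4.13.

    ferried_value_fraction: value-weighted share of contents held by the
    F-UAV, i.e. the ratio of sums in Eqn. 4.13 (hypothetical parameter name).
    """
    # Assumption: one round-robin cycle visits every A-UAV once.
    t_cycle = n_a * (t_hover + t_transition)
    # Time between one F-UAV leaving and the next F-UAV arriving
    gap = t_cycle / n_f - t_hover
    denom = n_a * (t_hover + t_transition)
    if tad < gap:
        p_tad = n_f * tad / denom      # Eqn. 4.11, first case
        delay_av = tad / 2             # Eqn. 4.12, first case
    else:
        p_tad = n_f * gap / denom      # Eqn. 4.11, second case
        delay_av = gap / 2             # Eqn. 4.12, second case
    return p_tad, delay_av, p_tad * ferried_value_fraction * delay_av  # Eqn. 4.13


# Table 4.1 defaults with a low (15 s) and a high (240 s) TAD
print(access_delay(20, 10, 20, 10, 15, ferried_value_fraction=0.5))
print(access_delay(20, 10, 20, 10, 240, ferried_value_fraction=0.5))
```

With the defaults, the gap between F-UAV visits is 40 s, so the 15 s request falls in the first case and the 240 s request in the second.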

4.7  Experimental Setup

For experimentation, specific modules were developed for implementing the request generation, UAV caching, and F-UAV movement strategies presented in Sections 4.3, 4.4 and 4.5. Unless mentioned otherwise, the default experimental parameters in Table 4.1 are used for all experiments. Each data point in the presented results represents an average computed from simulation experiments with 10V content requests.


 
 
Table 4.1. Default Values for Model Parameters

| # | Variables | Default Value |
|---|-----------|---------------|
| 1 | Total number of contents, $C$ | 2000 |
| 2 | Number of A-UAVs, $N_A$ | 20 |
| 3 | Number of F-UAVs, $N_F$ | 10 |
| 4 | Cache space in A-UAV, $C_A$ | 100 |
| 5 | Cache space in F-UAV, $C_F$ | 100 |
| 6 | Poisson request rate parameter, $\mu$ | 1 |
| 7 | Hover time of F-UAV, $T_{hover}$ | 20 seconds |
| 8 | Transition time of F-UAV, $T_{transition}$ | 10 seconds |
| 9 | Tolerable Access Delay, $TAD$ | 240 seconds |
| 10 | Zipf parameter (Popularity), $\alpha$ | 0.7 |
| 11 | Ferrying UAV Trajectory | Round-robin |

4.7.1  Heterogeneity in Content Popularity Sequence 

To emulate real-life heterogeneity, a swap-based mechanism is used. Two parameters, namely a swap probability 𝜇 and a swap difference 𝛿, are used to create different popularity sequences from a given sequence. The swap probability 𝜇 is the probability with which the popularities of two contents (e.g., ‘x’ and ‘y’) within the original popularity sequence are swapped. The swap difference 𝛿 determines which content (e.g., ‘y’) is swapped with content ‘x’. The difference between the original sequence and the new sequence is measured using the Smith-Waterman distance [125]. To capture heterogeneity in content popularity sequences, different communities are programmed with different popularity sequences obtained using this method.
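The swap mechanism and the sequence distance can be sketched as follows. The exact swap rule is not spelled out above, so this sketch assumes, hypothetically, that the content at position x is swapped with the content 𝛿 positions later with probability 𝜇. The Smith-Waterman scoring (match +1, mismatch −1, gap −1) and the normalization by sequence length are also assumptions, as are all function names.

```python
import random


def swap_sequence(seq, mu, delta, rng):
    """Hypothetical swap rule: position x swaps with x + delta with probability mu."""
    out = list(seq)
    for x in range(len(out) - delta):
        if rng.random() < mu:
            out[x], out[x + delta] = out[x + delta], out[x]
    return out


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local-alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def normalized_sw_distance(a, b):
    """0 for identical sequences, approaching 1 as they diverge (assumed normalization)."""
    return 1.0 - smith_waterman_score(a, b) / max(len(a), len(b))


rng = random.Random(7)
base = list(range(1, 101))  # content IDs ordered by decreasing popularity
variant = swap_sequence(base, mu=0.4, delta=5, rng=rng)
print(normalized_sw_distance(base, variant))
```

Each community would receive its own `variant`; a swap only permutes content IDs, so every community still ranks the same catalogue.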

4.7.2  Heterogeneity in Tolerable Access Delay (TAD) 

This chapter uses binary request 𝑇𝐴𝐷s (i.e., low and high 𝑇𝐴𝐷) to incorporate heterogeneity in tolerable access delay. Experiments are conducted by broadly classifying the contents into 5 popularity classes, such that Class 1 contains the most popular contents and Class 5 contents are the least popular. At any given time, 𝛾% of the requests for contents from only one class have low 𝑇𝐴𝐷. The remaining requests for contents from that class, and all requests for contents from the other classes, have high 𝑇𝐴𝐷.
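A small sketch of this binary-𝑇𝐴𝐷 request model is given below. The class boundaries are read from the Figure 4.9 labels, treating 50, 150, 250 and 350 as inclusive upper bounds of Classes 1 to 4 (an assumption); 𝛾 is expressed as a fraction; the 15 s and 240 s values match Section 4.8.3; and the helper names are hypothetical.

```python
import random

# Popularity-rank upper bounds (inclusive) of Classes 1-4; Class 5 is open-ended.
CLASS_UPPER_BOUNDS = [50, 150, 250, 350]


def popularity_class(rank):
    """Map a 1-based popularity rank to Class 1 (most popular) .. Class 5."""
    for cls, upper in enumerate(CLASS_UPPER_BOUNDS, start=1):
        if rank <= upper:
            return cls
    return len(CLASS_UPPER_BOUNDS) + 1


def request_tad(rank, low_class, gamma, rng, low_tad=15.0, high_tad=240.0):
    """Low TAD for a fraction gamma of requests to low_class, high TAD otherwise."""
    if popularity_class(rank) == low_class and rng.random() < gamma:
        return low_tad
    return high_tad


rng = random.Random(0)
tads = [request_tad(rank, low_class=3, gamma=0.7, rng=rng) for rank in range(151, 251)]
print(sum(t == 15.0 for t in tads), "of", len(tads), "Class 3 requests have low TAD")
```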

4.8  Experimental Results and Analysis

4.8.1  Impacts of Value-Based Caching 

The overall increase in content availability using value-based content caching and the joint deployment of ferrying UAVs is shown in Figure 4.6. The performance improvement is measured against a popularity-based caching policy at A-UAVs with round-robin trajectories of F-UAVs. Content availability is evaluated for varying UAV cache sizes.

[Figure 4.6 plot: maximum availability (%) versus UAV cache size (0 to 2000) for TAD-Popularity based loading and Popularity based loading, shown overall and separately for low-TAD and high-TAD contents]

Figure 4.6.   Improvement in maximum availability of contents by loading 

UAVs using value (𝑇𝐴𝐷+Popularity) of contents 


 
 
 
 
 
It can be seen from Figure 4.6 that a maximum increase in availability of approximately 

12%  can  be  achieved  by  using  value-based  caching  policy  at  the  A-UAVs  while  F-UAVs  are 

following their respective round-robin trajectory. The benefits of value-based caching in scenarios 

with  multi-dimensional  demand  heterogeneities  are  attributed  to  various  factors  including 

heterogeneity in popularity sequence, 𝑇𝐴𝐷 associated with the content requests, popularity of low 

𝑇𝐴𝐷 contents, and value of a content. The effects of these factors are depicted individually in the 

following sub-sections.  

Figure 4.7. Difference between two sequences with varying 𝜇 and 𝛿 

4.8.2  Effects of Heterogeneity in Content Popularity Sequence 

Different popularity sequences are generated using the parameters swap probability 𝜇 and 

swap difference 𝛿. 

Figure 4.7 shows the normalized Smith-Waterman distance between the sequences. The maximum difference between two sequences is recorded at 𝜇 = 0.5. For a given value of 𝜇, the difference does not vary substantially with the swap difference 𝛿; it initially increases with 𝛿 and then decreases.

Figure 4.8.   Increase in availability with respect to scenario without F-UAVs by varying 𝜇 and 𝛿 

Figure 4.8 shows the increase in availability while employing popularity-based caching for 

heterogeneous  content  sequence  at  communities.  This  is  compared  with  a  scenario  without  F-

UAVs. The increase in availability is shown for 𝛼 = 0.9 with varying swap probability and swap 

difference. 

The observations are as follows. First, the increase in availability for 𝜇 = 0 corresponds to the case where the popularity sequences are the same in all communities (i.e., the homogeneous case). In this case, the increase in content availability due to contents ferried by F-UAVs is approximately 7.5%. Second, the maximum increase in availability, about 9%, is recorded for 𝜇 = 0.4 and 𝛿 = 50. This improvement is due to incorporating popularity-based caching at A-UAVs to tackle heterogeneity in the popularity sequence. This demonstrates that the benefits of popularity-based caching are higher in scenarios where the content sequence is more heterogeneous.

[Figure 4.9 panels (a) and (b): availability (%) of contents with TAD = 15 seconds and TAD = 240 seconds versus content popularity sequence ID, Class 1 (1-50), Class 2 (50-150), Class 3 (150-250), Class 4 (250-350), Class 5 (>350), with popularity-based and value-based caching for 𝛾 = 0.3 and 0.7]

Figure 4.9. (a) Availability of Low TAD contents, (b) Availability of High TAD contents, (c) 

Average Availability 


 
 
 
 
 
 
 
Figure 4.9 (cont’d)

[Figure 4.9 panel (c): average availability (%) versus content popularity sequence ID, Class 1 (1-50) to Class 5 (>350)]

4.8.3  Impact of Value of Contents with Different TAD  

To observe the benefits of value-based caching, experiments are conducted with popularity parameter 𝛼 = 0.4, low 𝑇𝐴𝐷 = 15 seconds and high 𝑇𝐴𝐷 = 240 seconds. The analysis is done for five content popularity classes. The parameter 𝛾 takes the values 0.3 and 0.7 to evaluate the effect of a low and a high probability of low 𝑇𝐴𝐷 contents within a given class. The remaining parameters follow Table 4.1. The performance comparison between value-based and popularity-based caching policies is shown in Figure 4.9 separately for low and high 𝑇𝐴𝐷 content requests, along with the average availability.

The observations are as follows. First, the availability of low 𝑇𝐴𝐷 contents is higher with value-based caching than with popularity-based caching for all popularity classes and values of 𝛾. The gain in availability grows for less popular classes, such that Class 5 observes the maximum increase in availability of low 𝑇𝐴𝐷 contents, approximately 7% (Figure 4.9a). This is because value-based caching favors the storage of low 𝑇𝐴𝐷 contents by increasing their computed value (see Section 4.4.3). However, if a content is highly popular, its value does not improve by a large margin. This can be seen in Figure 4.9a, where the availability of low 𝑇𝐴𝐷 contents does not increase if they belong to Class 1. Second, the availability of high 𝑇𝐴𝐷 contents decreases for all popularity classes and across all values of 𝛾 under value-based caching. This is due to the replacement of high 𝑇𝐴𝐷 contents by low 𝑇𝐴𝐷 (high value) contents at the A-UAVs. Figure 4.9b shows this adverse effect, where the maximum reduction in availability of high 𝑇𝐴𝐷 contents is observed at Class 5. Third, the average availability under the value-based caching policy is best for the middle range of content popularities. This can be seen in Figure 4.9c, where the increase in average availability is maximum (approximately 4%) when low 𝑇𝐴𝐷 content requests are from Class 3, beyond which it tapers off. Physically, this means that caching a content of very low popularity in the A-UAVs is not beneficial even if it has low 𝑇𝐴𝐷, because such a content is unlikely to be requested. Finally, for higher 𝛾, the effect of value-based caching is comparatively more pronounced, since more contents from a class have low 𝑇𝐴𝐷. These effects manifest differently when the cache space of UAVs is varied.

Effects  of  scaling  caching  capacity  of  UAVs  while  employing  value-based  caching  are 

discussed next.  

4.8.4  Impacts of UAV Cache Size on Value-Based Caching

To explore the extent to which value-based caching can increase availability, the cache space is varied. Parameters are set as follows: high 𝑇𝐴𝐷 = 240 seconds, low 𝑇𝐴𝐷 = 5 seconds, 𝛾 = 0.95, and the remaining parameters according to Table 4.1.


 
[Figure 4.10 plot: increase in maximum availability (%) versus UAV cache size (200 to 2000): increase in overall availability, increase in availability for low-TAD contents, and decrease in availability for high-TAD contents]

Figure 4.10.   Increase in Availability with value-based caching with respect to popularity-based 

caching for increasing cache space 

Figure 4.10 compares the impact of increasing cache size on content availability under value-based and popularity-based caching. The observations from the figure are as follows. First, for a total of 2000 contents, the maximum increase in availability with value-based caching, approximately 12%, is recorded for UAV cache sizes in the range 900-1100. Beyond a cache size of 1100, the increase in availability of low 𝑇𝐴𝐷 contents diminishes because contents of very low popularity are cached at the A-UAVs. Second, the increase in overall availability is attributed to the increase in availability of low 𝑇𝐴𝐷 contents. Third, the availability of high 𝑇𝐴𝐷 contents reduces marginally: due to the increased value of low 𝑇𝐴𝐷 contents, high 𝑇𝐴𝐷 contents are replaced at the A-UAVs and their availability reduces.

Beyond the gains of value-based caching, content availability can also be improved by exploiting F-UAV trajectories.

4.9  Summary and Conclusion

This chapter designs a UAV-aided content dissemination system to enable content availability for users in the absence of communication infrastructure in disaster scenarios. Two types of UAVs are used, namely, anchor UAVs and ferrying UAVs. Anchor UAVs provide contents to users in their respective communities at all times, while ferrying UAVs provide contents intermittently by sharing those cached in the anchor UAVs. A popularity-based caching policy has been introduced that takes the heterogeneity in content popularity sequence into consideration when caching contents in the anchor UAVs. A value-based caching policy has also been explored, in which a content is cached in an A-UAV when it is likely to be requested and its associated tolerable access delay is low, signifying urgency of requirement. The developed caching policies deal with demand heterogeneity by associating a value with each content based on its popularity and tolerable access delay. Together, the popularity-based and value-based caching policies improve content availability by approximately 12%. The next chapter on this topic will incorporate adaptive algorithms to learn the caching policy and ferrying UAV trajectories on-the-fly in time-varying disaster regions.


 
 
 
 
Chapter 5:  Multi-Armed Bandit Learning for Content Provisioning in Network of UAVs

In disaster-hit regions, the obliteration of communication infrastructure leaves communities isolated, deprived of crucial information for survival and relief. This chapter introduces an innovative solution in which a UAV-based content dissemination network is designed to operate independently of traditional communication systems. Inherent challenges such as limited UAV energy, storage, and flight capabilities necessitate a sophisticated approach to content management if UAVs are to serve as a vital link for disseminating essential information.

[Figure 5.1 diagram: Anchor UAVs with satellite links serving communities (i, j, t, u, v, w, x, y, z) in a region of communication infrastructure destruction, with a Ferrying UAV trajectory and information sharing between A-UAV and F-UAV]

Figure 5.1.  Coordinated UAV system for content caching and distribution in environments  

without communication infrastructure 


 
 
5.1  Motivation 

This  chapter  is  shaped  by  the  dire  necessity  for  a  resilient,  UAV-assisted  content 

distribution  system  capable  of  functioning  in  the  absence  of  conventional  communication 

networks.  The  challenge  is  amplified  by  the  diverse  and  urgent  information  needs  of  isolated 

communities,  coupled  with  UAV  operational  limitations.  Traditional  content  distribution 

strategies often neglect the nuanced demand and spatial-temporal request heterogeneity. Hence, 

the research in this chapter designs an adaptive, decentralized caching strategy, employing a Top-

k  Multi-Armed  Bandit  Learning  model,  to  ensure  the  prioritized  delivery  of  critical  content  to 

affected populations via on-the-fly learning of caching policies. 

5.2  Design Objectives

The primary goal of the research conducted in this chapter is to develop a UAV-aided content caching and dissemination framework that can dynamically adapt to the unique demands of disaster-stricken communities. By employing a Top-k Multi-Armed Bandit Learning model, the system aims to optimize content caching decisions in real time, taking into account the geographical and temporal variations in content popularity as well as the heterogeneous demands of the users. The objectives are multi-fold:

a.  This chapter designs a decentralized learning mechanism that enables UAVs to make informed caching decisions on-the-fly, thereby maximizing the relevance and accessibility of content to stranded users.

b.  The designed method incorporates a multi-dimensional reward structure within the learning model that accounts for both local and global content popularity trends, facilitating a caching strategy that improves overall content dissemination.

c.  This chapter explores the interactions between the dynamically learned caching policies, Quality of Service (QoS) expectations (specifically, tolerable access delay), and user demand patterns, aiming to fine-tune the learning model for enhanced performance.

d.  The designed framework has been rigorously tested and validated through simulation experiments and analytical modeling to establish its effectiveness in a range of disaster scenarios, UAV configurations, and content popularity distributions.

This research endeavors to bridge the gap in current UAV-based communication solutions 

by introducing an agile, adaptive caching system that responds to the immediate needs of disaster-

affected populations. This can potentially transform the landscape of emergency communication 

and information dissemination in the face of infrastructure collapse. 

5.3  Caching based on Content Pre-loading at Anchor UAVs

This section discusses caching policies based on content pre-loading at A-UAVs, which assume pre-assigned, static, and globally known content popularities. After examining the limitations of these caching policies, this chapter proposes a runtime, dynamic, and adaptive Top-k Multi-armed Bandit based caching mechanism, which is explained in a later section.

5.3.1  Pre-loading Policies at Anchor UAVs (A-UAVs)

The Fully Duplicated (FD) mechanism [91] is a naive approach in which A-UAVs download content over their vertical links upon request by local users. However, the FD mechanism has limitations such as content duplication, high vertical-link download costs, and suboptimal utilization of UAV cache space. Smart Exclusive Caching (SEC) [91] overcomes the limitations of the FD mechanism by storing a set number of unique contents in each A-UAV and sharing them among communities via F-UAVs. Assuming globally known, homogeneous content popularity across all user communities, the SEC mechanism divides the cache into two segments. Segment-1 contains the top $\lambda C_A$ popular contents, cached in all A-UAVs, while Segment-2 contains $(1 - \lambda) C_A$ unique contents, where $\lambda$ is the Storage Segmentation Factor. The total number of contents in the system under SEC is:

$$C_{tot} = \lambda C_A + N_A (1 - \lambda) C_A$$                                                    (5.1)

Popularity-Based Caching (PBC) [93] is employed when different communities have different content preferences. PBC divides the cache space of an A-UAV into two segments, considering the heterogeneous popularity sequence of the local community. Segment-1 caches the most popular contents, which can be exclusive to an A-UAV ($C_E$) or non-exclusive, i.e., cached across multiple A-UAVs ($C_{NE}$), while Segment-2 is the same as in SEC. Denoting the total number of unique contents in the system by $C_E^{total}$, the total number of replicated contents across the system can be represented as follows:

$$C_{replicated} = C_{NE} + N_A (1 - \lambda) C_A$$                                                (5.2)

Therefore, by modifying Eqn. 5.1, the total number of contents in the system can be expressed as:

$$C_{tot} = C_{NE} + C_E^{total} + N_A (1 - \lambda) C_A \geq \lambda C_A + N_A (1 - \lambda) C_A$$              (5.3)
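Eqns. 5.1 to 5.3 amount to simple bookkeeping over cache segments. The sketch below assumes that $\lambda C_A$ is rounded to a whole number of contents; $\lambda = 0.5$, $C_{NE} = 30$ and $C_E^{total} = 400$ are hypothetical example values, while $C_A = 100$ and $N_A = 20$ are the Table 4.1 defaults. All function names are illustrative.

```python
def segment_sizes(c_a, lam):
    """Split an A-UAV cache of c_a slots into Segment-1 and Segment-2."""
    seg1 = round(lam * c_a)  # rounding to whole contents is an assumption
    return seg1, c_a - seg1


def sec_total_contents(c_a, n_a, lam):
    """Eqn. 5.1: lam*C_A shared contents plus N_A*(1-lam)*C_A unique ones."""
    seg1, seg2 = segment_sizes(c_a, lam)
    return seg1 + n_a * seg2


def pbc_counts(c_a, n_a, lam, c_ne, c_e_total):
    """Eqns. 5.2 and 5.3: (replicated contents, total contents) under PBC."""
    _, seg2 = segment_sizes(c_a, lam)
    replicated = c_ne + n_a * seg2
    total = c_ne + c_e_total + n_a * seg2
    return replicated, total


print(sec_total_contents(100, 20, 0.5))
print(pbc_counts(100, 20, 0.5, c_ne=30, c_e_total=400))
```

The inequality of Eqn. 5.3 holds in this example because $C_{NE} + C_E^{total} = 430$ exceeds $\lambda C_A = 50$.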

Value-Based Caching (VBC) [93] further enhances the caching policy by storing the top-valued contents in Segment-1 of an A-UAV, where the value of a content comprises its popularity and tolerable access delay. The value of a content $i$ is calculated as:

$$V(i) = \kappa \upsilon \times \frac{p_z(i)}{TAD(i)} = \kappa \times \frac{TAD_{min}}{p_z(1)} \times \frac{p_z(i)}{TAD(i)}$$                                 (5.4)

In this equation, $p_z(i)$ represents the content’s popularity as per the Zipf distribution, $TAD(i)$ is the content’s tolerable access delay, $\kappa$ is a scalar weight that increases as popularity decreases, and $\upsilon$ is a normalization constant. The normalization constant is calculated for a given Zipf (popularity) parameter $\alpha$ using the minimum possible 𝑇𝐴𝐷 ($TAD_{min}$) and the maximum possible popularity, $p_z(1)$. The value $V(i)$ is bounded in $[0, 1]$, increases as $p_z(i)$ increases and $TAD(i)$ decreases, and provides a holistic, quantifiable measure for caching decisions.
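Because the Zipf normalizing constant cancels in Eqn. 5.4, $V(i)$ can be computed from the popularity rank alone. This sketch assumes $p_z(i) \propto i^{-\alpha}$ and treats $\kappa$ as a constant; the text describes $\kappa$ as increasing with decreasing popularity, which is not modeled here. The function name is hypothetical.

```python
def content_value(rank, tad, tad_min, alpha, kappa=1.0):
    """Value V(i) of Eqn. 5.4 for the content with popularity rank `rank`.

    For a Zipf law p_z(i) proportional to i**(-alpha), p_z(i)/p_z(1) = i**(-alpha),
    so the Zipf normalizing constant cancels. kappa is held constant here.
    """
    return kappa * (tad_min / tad) * rank ** (-alpha)


print(content_value(1, 15, 15, 0.7))    # most popular and most urgent content
print(content_value(1, 240, 15, 0.7))   # most popular but delay-tolerant content
print(content_value(100, 15, 15, 0.7))  # less popular, urgent content
```

The most popular content with the minimum 𝑇𝐴𝐷 attains the upper bound $V = \kappa$; relaxing its 𝑇𝐴𝐷 from 15 s to 240 s scales its value by 15/240.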

The caching policy for F-UAVs remains the same for all the discussed and forthcoming caching policies for A-UAVs [91], [93], [126]. An F-UAV ferries content from already visited A-UAVs to the A-UAVs it will visit next in its trajectory. The caching policy of an A-UAV therefore determines the utility of an F-UAV, and every A-UAV should maintain sufficient contents in its cache space to optimize F-UAV cache utilization.

5.3.2  Limitations of Cache Pre-loading at A-UAVs

The caching policies discussed in this section rely on pre-loading content into A-UAVs, which has certain limitations. This approach assumes a priori knowledge of the popularity distribution of all the content in the system, which can hinder practical feasibility during deployment. Local popularity estimation of requested content within individual A-UAVs can partially alleviate this issue, but it cannot adjust the crucial storage segmentation factor ($\lambda$) (see Section 5.3.1) for maximizing availability across the entire system of A-UAVs and their communities. Collaborative global popularity estimation can be introduced, but it fails to capture demand heterogeneity across different A-UAV communities.

The limitations listed above can be addressed by employing a Top-k Multi-armed Bandit (Top-k MAB) learning-based caching mechanism at the A-UAVs, which is explained in the following section. This paradigm leverages the expected-reward maximization attribute of MAB and the intelligence-sharing nature of the proposed multi-dimensional reward structure for caching decisions at the A-UAVs.

5.4  Decentralized Caching with Multi-Armed Bandit

Once an A-UAV is deployed into a community, its subsequent action is to decide which contents to download (via its vertical link) and cache such that content availability to the requesting users is maximized. This goal is achieved by employing a Top-k Multi-Armed Bandit learning agent in the A-UAV.

5.4.1  Top-k Multi-Armed Bandit Learning

Multi-Armed Bandit is a classic problem in reinforcement learning [127] and decision-making. At each round $t$, an agent chooses an arm $A_t$ out of $N$ arms, denoted by $A_1, A_2, \ldots, A_N$, and observes a reward $R_t$. Each arm $i$ has an unknown reward distribution with mean $\mu_i$ and variance $\sigma_i^2$. The agent’s goal is to maximize the total expected reward $R_T$ over $T$ rounds, where $T$ is the total number of rounds (time horizon):

$$R_T = \max \sum_{t=1}^{T} E[R_t]$$                                                       (5.5)

This chapter uses a variant of MAB called Top-k Multi-Armed Bandit [127], [128]. Here, 

the agent has to choose 𝑘 arms out of a larger set of 𝑁 arms, as opposed to choosing one arm in 

classical MAB, and receives a reward for each arm in the chosen set. The goal of the agent is to 

maximize the total cumulative reward 𝑅@ obtained over a finite time horizon 𝑇:  

$$R_T = \max \sum_{t=1}^{T} \sum_{i=1}^{k} E[R_{i,t}]$$                                                       (5.6)

[Figure 5.2 diagram: Top-k MAB model at each A-UAV, in which the agent selects k cache slots (1, 2, …, k) from total contents 1, 2, …, N as its action and receives a reward from the environment (UAV-caching system)]

Figure 5.2. Top-k Multi-Armed Bandit Learning for Caching Policy at A-UAVs 


 
 
5.4.2  Decentralized Caching using Top-k Multi-Armed Bandit

In the UAV-caching scenario, there is a Top-k MAB agent in each A-UAV, and choosing a content for caching corresponds to choosing an arm. The ‘k’ of the Top-k MAB agent corresponds to the caching capacity of the A-UAV, i.e., $k = C_A$. The agent’s aim is to select $C_A$ contents out of a larger set of $N$ contents to be cached in an A-UAV such that content availability to the users is maximized. The UAV-aided content dissemination system is the learning environment with which the A-UAVs interact through their actions of choosing specific sets of contents to cache. The feedback from the environment for these actions is in the form of rewards/penalties: actions are rewarded when cached contents are requested by the users and served within the given tolerable access delay, and penalized otherwise. The top $C_A$ contents that accumulate the most reward from the corresponding community and other communities are chosen to be cached at an A-UAV. Note that the Top-k MAB agents in the A-UAVs are provided with no a priori information about the content popularity in the corresponding user communities.

A learning decision epoch for each Top-k MAB agent is set according to the F-UAVs’ accessibility at the corresponding community (i.e., an F-UAV’s visiting frequency). This is because the F-UAVs carry the content availability information from the communities in their trajectories, which is leveraged for learning at the A-UAVs’ Top-k MAB agents through appropriately designed multi-dimensional rewards. The agent learns to cache contents via a multi-dimensional reward structure with three parts, namely local, ferrying, and global rewards. The first corresponds to the increase in availability at an A-UAV’s own community, i.e., the increase in local availability ($\delta_L$). The second relates to the contents cached in an A-UAV that are responsible for an increase in availability at other communities, i.e., the ferried content availability ($\delta_F$). A global reward is received when cached contents contribute to an increase in average availability across all communities; this is called the increase in global availability ($\delta_G$). The three types of rewards are given below:

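$$R_i^L = \begin{cases} 1, & \text{for } \delta_L > 0 \\ -1, & \text{for } \delta_L < 0 \end{cases}$$                                                 (5.7)

$$R_i^F = \begin{cases} 1, & \text{for } \delta_F > 0 \\ -1, & \text{for } \delta_F < 0 \end{cases}$$                                                 (5.8)

$$R_i^G = \begin{cases} 1, & \text{for } \delta_G > 0 \\ -1, & \text{for } \delta_G < 0 \end{cases}$$                                                 (5.9)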

In the above equations, $R_i^L$, $R_i^F$ and $R_i^G$ are the rewards according to the increase in local, ferried and global availability, respectively, for content $i$ cached in an A-UAV.
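The three reward signals of Eqns. 5.7 to 5.9 map availability changes to ±1. The sketch below assumes a reward of 0 when an availability change is exactly zero (a case the equations leave undefined) and, hypothetically, sums the three signals into a single $r(i)$; how the rewards are combined is not specified above. The function names are illustrative.

```python
def sign_reward(delta):
    """Eqns. 5.7-5.9: +1 for an increase in availability, -1 for a decrease.

    The equations leave delta == 0 undefined; returning 0 is an assumption.
    """
    if delta > 0:
        return 1
    if delta < 0:
        return -1
    return 0


def multi_dimensional_reward(delta_local, delta_ferried, delta_global):
    """Return (R^L, R^F, R^G) and their sum; summing into one r(i) is hypothetical."""
    parts = (sign_reward(delta_local), sign_reward(delta_ferried), sign_reward(delta_global))
    return parts, sum(parts)


print(multi_dimensional_reward(0.02, -0.01, 0.005))
```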

Learning is achieved using a tabular method in which a Q-table is maintained for all contents in the system. The value corresponding to each content is called a Q-value or action-value [127]. At every learning epoch, the agent updates the Q-value of a content according to the multi-dimensional rewards in Eqns. 5.7-5.9, obtained from its interaction with the environment (the UAV-aided content dissemination system), and thereby learns the best actions (contents to cache). The recursive Q-value update for a content $i$ is given as follows:

$$Q(i) \leftarrow Q(i) + \alpha\,[r(i) - Q(i)]$$                                             (5.10)

Here, $Q(i)$ represents the Q-value of content $i$, $r(i)$ is the reward received for caching content $i$, and $\alpha$ is a hyper-parameter that controls the learning rate. The Q-values of all contents are initialized to zero, so that a Top-k MAB agent starts with no a priori information and all contents have equal importance in the initial caching decisions. An epsilon-greedy ($\epsilon$-greedy) exploration strategy is implemented, which gives every content a chance to be cached in an A-UAV. As learning progresses, exploration decays, and the contents with the highest Q-values are exploited to maximize the accumulated reward, which improves the caching policy and thus increases content availability.
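The update rule of Eqn. 5.10 and an $\epsilon$-greedy Top-k selection can be sketched as follows. This is a minimal illustration, not the author's simulator: the helper names, the toy reward (+1 for caching contents 7, 8 and 9, −1 otherwise), the learning rate 0.1, and the decay factor 0.98 are all hypothetical.

```python
import random


def update_q(q, i, reward, lr):
    """Eqn. 5.10: Q(i) <- Q(i) + lr * (r(i) - Q(i))."""
    q[i] += lr * (reward - q[i])


def select_top_k(q, k, epsilon, rng):
    """Epsilon-greedy Top-k: k random contents w.p. epsilon, else the k highest Q-values."""
    if rng.random() < epsilon:
        return sorted(rng.sample(range(len(q)), k))
    return sorted(sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k])


# Toy environment (hypothetical): caching contents 7, 8, 9 earns +1, anything else -1.
rng = random.Random(42)
n_contents, k, lr, epsilon = 10, 3, 0.1, 1.0
q = [0.0] * n_contents  # zero initialization: no a priori popularity knowledge
for epoch in range(300):
    for i in select_top_k(q, k, epsilon, rng):  # only cached contents get feedback
        update_q(q, i, 1.0 if i >= 7 else -1.0, lr)
    epsilon *= 0.98  # exploration decays as learning progresses
print(select_top_k(q, k, 0.0, rng))
```

Because only cached contents receive feedback, an untried content keeps its initial Q-value of zero; this is why some exploration is needed before the greedy choice becomes reliable.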

The proposed algorithm enables Top-k MAB agents in A-UAVs to learn the caching policy, and the contents cached at A-UAVs emulate the cache pre-loading segmentation behavior described in Section 5.3.1. However, the caching policy and the corresponding content availability may fluctuate because less popular contents receive fewer requests, leading to weak or unstable reward estimates. This results in Q-values that are highly sensitive to requests for less popular contents and less sensitive to requests for popular contents. Therefore, changes in the Q-values of less popular contents may lead to intermittent variations in caching, particularly in Segment-2 (see Section 5.3.1). Moreover, there can be $\binom{N}{k}$ combinations of contents for the Top-k MAB agent to sample for caching. As a result, the reward of each content is estimated only at long intervals, which weakens the estimate of its reward distribution as $N$ increases. These oscillations can be controlled by empirically selecting $\epsilon$ and its decay rate. To reduce the dependence of the caching policy on the choice of $\epsilon$, the Upper Confidence Bound (UCB) strategy is used [127], [128]. The Top-k MAB agent maintains an upper confidence bound on the expected reward of each content and selects the set of $C_A$ contents with the highest UCB at each epoch:

𝑈((𝑖) = 𝑄((𝑖) + y

𝛼[ log(𝑡)
𝑁((𝑖)

																																																							(5.11) 

Here, 𝑈((𝑖) is the UCB of content ‘𝑖’ at epoch ‘𝑡’; 𝑄((𝑖) is the updated Q-value at epoch ‘𝑡’; 𝛼[ is 

a hyperparameter that controls the degree of exploration; 𝑁((𝑖) is the number of time content ‘𝑖’ 

has been requested till epoch ‘𝑡’. The first term represents the reward estimate, and the second 

term depicts the uncertainty in reward estimate. UCB selects the content that has high potential for 

high reward but hasn’t been requested frequently. The promotes exploration without externally 

80 

 
inducing  an  exploration  parameter  such  as  𝜖.  For  this  chapter,  𝜖-greedy  exploration  strategy  is 

applied according to the UCB values, as shown in Step 7-16 in Algorithm 5.1.  
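The UCB score of Eq. (5.11) and the selection of the highest-scoring contents can be sketched as follows. This is an illustrative sketch, not the thesis implementation: the exploration-degree hyperparameter of Eq. (5.11) is called `alpha_c` here, and contents never requested so far are assumed to receive an infinite bound so that each is tried at least once (Eq. 5.11 is undefined when the request count is zero).

```python
import math
from typing import List


def ucb_scores(q: List[float], counts: List[int], t: int, alpha_c: float) -> List[float]:
    """Eq. (5.11): U_t(i) = Q_t(i) + sqrt(alpha_c * log(t) / N_t(i)).

    Contents with N_t(i) == 0 get an infinite bound (an assumption of this sketch).
    """
    scores = []
    for qi, ni in zip(q, counts):
        if ni == 0:
            scores.append(math.inf)
        else:
            scores.append(qi + math.sqrt(alpha_c * math.log(t) / ni))
    return scores


def top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k contents with the highest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


q = [0.5, 0.2, 0.9, 0.1]      # current Q-values
counts = [40, 2, 50, 0]       # request counts N_t(i)
u = ucb_scores(q, counts, t=100, alpha_c=2.0)
print(top_k(u, k=2))          # [3, 1]: rarely requested contents outrank content 2
```

The example shows the behavior described above: content 2 has the highest Q-value, yet the rarely requested contents 3 and 1 are selected first because of their large uncertainty term.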

The following pseudo code explains the caching policy at an A-UAV with a Top-k MAB agent.

Algorithm 5.1. Caching policy at an A-UAV with Top-k MAB Learning

1.  Initialization:
    a.  $N$: Total contents in the system
    b.  $C_A$: Caching capacity of an A-UAV
    c.  $Q$: Array of size $N$ initialized with 0's (Q-table)
    d.  $\epsilon$: Exploration rate
    e.  $\alpha$: Learning rate for Q-table update
    f.  $\alpha_c$: Degree of exploration (if UCB is used)
2.  Load the A-UAV's cache with $C_A$ randomly chosen contents.
3.  while True:
        \\ Check for learning epoch
4.      if an F-UAV is visiting the A-UAV then do
5.          for each content $i$ in the A-UAV cache ($C_A$ contents) do
6.              Get reward $r(i)$          \\ according to Eqns. 5.7-5.9
7.              Update $Q(i)$              \\ $Q(i) \leftarrow Q(i) + \alpha[r(i) - Q(i)]$
                                            \\ use $U(i)$ from Eqn. 5.11 in place of $Q(i)$ for selection if UCB is employed
8.          end for
9.          $value$ = copy($Q$)            \\ make a copy of the Q-table
            \\ Reload contents (select arms)
10.         for $i$ = 1 to $C_A$ do
11.             Generate random number $x \in [0, 1)$
12.             if $x < \epsilon$ then do
13.                 Load 1 randomly chosen content $c$ with $value[c] \neq -\infty$ to the A-UAV and set $value[c] = -\infty$
14.             else
15.                 $c_{max}$ = argmax($value$)
16.                 Load $c_{max}$ to the A-UAV
17.                 Set $value[c_{max}] = -\infty$
18.             end if
19.         end for
20.     end if
21.     Check the $\epsilon$ decay condition.
22.     if the condition holds then do
23.         Update $\epsilon$
24.     end if
25. end while
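The arm-selection part of Algorithm 5.1 (Steps 9-19) can be sketched in Python as below. This is a sketch under stated assumptions: `value_src` holds one score per content (Q-values, or UCB values when UCB is employed), and the random branch draws only from contents not yet loaded so that the cache holds $C_A$ distinct contents; the function name `reload_cache` is hypothetical.

```python
import math
import random
from typing import List, Optional


def reload_cache(value_src: List[float], cache_size: int, epsilon: float,
                 rng: Optional[random.Random] = None) -> List[int]:
    """Select cache_size distinct contents with epsilon-greedy over value_src."""
    rng = rng or random.Random()
    value = list(value_src)                      # Step 9: copy of the Q-table
    cache = []
    for _ in range(cache_size):                  # Step 10: one slot at a time
        remaining = [i for i, v in enumerate(value) if v != -math.inf]
        if rng.random() < epsilon:               # Steps 11-13: explore
            c = rng.choice(remaining)
        else:                                    # Steps 15-16: exploit
            c = max(remaining, key=lambda i: value[i])
        cache.append(c)
        value[c] = -math.inf                     # Step 17: never pick a content twice
    return cache


print(reload_cache([0.1, 0.9, 0.4, 0.7], cache_size=2, epsilon=0.0))  # [1, 3]
```

Masking a selected content with −∞ is what turns a single-arm 𝜖-greedy rule into a Top-k selection: each pass through the loop picks the best (or a random) content among those still unselected.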

The Top-k MAB agent at an A-UAV learns a near-optimal caching policy within a finite time horizon and approaches the best caching policy asymptotically. The cached contents can boost content availability in their respective communities as well as in other distant communities via F-UAVs.
 
Table 5.1    Default Values for Model Parameters

#     Variables                                          Default Value
1     Total number of contents, $C$                      1000
2     Number of A-UAVs, $N_A$                            12
3     Number of F-UAVs, $N_F$                            3
4     Cache space in A-UAV, $C_A$                        100
5     Cache space in F-UAV, $C_F$                        100
6     Poisson request rate parameter, $\mu$              1 request/sec
7     Hover time of F-UAV, $T_{hover}$                   600 seconds
8     Transition time of F-UAV, $T_{transition}$         300 seconds
9     Zipf parameter (Popularity), $\alpha$              0.7
10    Ferrying UAV Trajectory                            Round-robin

5.5  Experiments and Results

Experiments are performed with a discrete event simulator to analyze the performance of the proposed Top-k MAB learning-based caching mechanism. The simulator generates content requests with exponentially distributed inter-request intervals, and the requested contents follow a Zipf popularity distribution [126]. The mathematical expressions of the cache pre-loading policy are also implemented in the simulator. To capture heterogeneity in the content popularity sequences of different communities, contents are swapped with a predetermined probability [93], and the difference between sequences is measured using the Smith-Waterman distance [125]. The experimental parameters for the proposed Top-k MAB learning-based caching and cache pre-loading policies are listed in Table 5.1. The performance of the proposed mechanism is evaluated via the following metrics.
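The request process described above can be sketched as a Poisson arrival stream with Zipf-distributed content choices. This sketch assumes the Table 5.1 defaults ($\mu$ = 1 request/sec, $C$ = 1000, $\alpha$ = 0.7); the 1000-second window, the function name `generate_requests`, and the seed are hypothetical, and the per-community popularity swapping [93] is not modeled.

```python
import random
from itertools import accumulate
from typing import List, Tuple


def generate_requests(duration_s: float, rate: float, num_contents: int,
                      alpha: float, seed: int = 0) -> List[Tuple[float, int]]:
    """Return (arrival_time_s, content_rank) pairs for one community.

    Inter-request times are exponential with mean 1/rate (a Poisson process),
    and each request picks a content rank from a Zipf(alpha) popularity law.
    """
    rng = random.Random(seed)
    ranks = list(range(1, num_contents + 1))
    # Cumulative Zipf weights 1/rank**alpha, precomputed once for fast sampling.
    cum_weights = list(accumulate(1.0 / r ** alpha for r in ranks))
    t, requests = 0.0, []
    while True:
        t += rng.expovariate(rate)               # exponential gap to next request
        if t > duration_s:
            return requests
        rank = rng.choices(ranks, cum_weights=cum_weights, k=1)[0]
        requests.append((t, rank))


reqs = generate_requests(duration_s=1000.0, rate=1.0, num_contents=1000, alpha=0.7)
print(len(reqs))  # Poisson-distributed count with mean 1000
```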

5.5.1  Performance Metrics

Content Availability ($P_{avail}$): It is defined as the ratio between cache hits and generated requests within a time interval. Cache hits are requests served to the users from the contents cached in the UAV-aided caching system (i.e., without a download). Therefore, content availability also indirectly indicates the content download cost of the system.

Jaro-Winkler Similarity (𝐽𝑊𝑆): It is a measure of the similarity between two sequences [129]. It is computed from the number of matches, the number of transpositions required within the matches, and the similarity in the prefixes of both sequences. 𝐽𝑊𝑆 is used to compute the similarity between the content sequence from the learnt caching policy and the content sequence according to cache pre-loading.
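A compact implementation of the standard Jaro-Winkler definition, applied to sequences of content IDs rather than strings, is sketched below. It assumes the conventional defaults (prefix scaling factor p = 0.1, prefix length capped at 4); whether the thesis experiments used exactly these settings is an assumption.

```python
from typing import Sequence


def jaro(a: Sequence, b: Sequence) -> float:
    """Jaro similarity from matches within a window and transpositions."""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched, b_matched = [False] * len(a), [False] * len(b)
    matches = 0
    for i, x in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == x:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched elements that appear in a different order count as half-transpositions.
    a_seq = [x for x, m in zip(a, a_matched) if m]
    b_seq = [y for y, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: Sequence, b: Sequence, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score by the length of the common prefix (up to max_prefix)."""
    sim = jaro(a, b)
    prefix = 0
    for x, y in zip(a, b):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


learnt = [1, 2, 3, 5, 4]        # cached content IDs in learnt order
best = [1, 2, 3, 4, 5]          # content IDs in pre-loading order
print(round(jaro_winkler(best, learnt), 4))  # 0.9533
```

The prefix bonus rewards agreement on the first (most popular) cached contents, which matches the intent of comparing a learnt cache sequence against the best sequence.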

Access Delay (𝐴𝐷): The performance of the Top-k MAB model is also evaluated based on the access delay, which is the end-to-end delay between the generation of a content request and its provisioning from the contents cached in the UAVs. This chapter reports the epoch-wise average access delay to show the improvement in the caching policy as learning progresses.
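The two epoch-level metrics can be computed from per-request records as sketched below. The `Request` record and its field names are hypothetical, and averaging the access delay over cache hits only (ignoring requests that fall back to a download) is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    created_s: float             # time the request was generated
    served_s: Optional[float]    # time it was provisioned from a UAV cache (None if downloaded)


def content_availability(requests: List[Request]) -> float:
    """P_avail: cache hits divided by generated requests in the interval."""
    if not requests:
        return 0.0
    hits = sum(1 for r in requests if r.served_s is not None)
    return hits / len(requests)


def average_access_delay(requests: List[Request]) -> float:
    """Mean delay from request generation to provisioning, over cache hits only."""
    delays = [r.served_s - r.created_s for r in requests if r.served_s is not None]
    return sum(delays) / len(delays) if delays else 0.0


epoch = [Request(0.0, 0.0), Request(10.0, 130.0), Request(20.0, None), Request(30.0, None)]
print(content_availability(epoch), average_access_delay(epoch))  # 0.5 60.0
```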

5.5.2  Effect of Exploration Strategies on Learnt Caching Policy

To assess the viability of the proposed Top-k MAB learning-based caching policy in scenarios with demand heterogeneity, two types of content popularity sequences are used, and consecutive communities have different popularity sequences. For the 𝜖-greedy strategy, the initial exploration rate is 𝜖 = 1 with a decay rate of 0.0025 per epoch. The degree of exploration in UCB is set to $\alpha_c = 2$. Figure 5.3(a) shows the convergence behavior of the learnt caching policy and compares the exploration strategies employed in the Top-k MAB model.

The convergence behavior is shown in terms of the content availability achieved by the learnt caching policy. The observations from Figure 5.3(a) are as follows. First, the figure shows that by employing a Top-k MAB agent at every A-UAV, a near-optimal caching policy can be learnt. The algorithm leverages the multi-dimensional reward structure of Eqns. 5.7-5.9 to achieve content availability close to the benchmark performance (see Section 5.3.1). Second, when the agent uses the UCB exploration strategy, the content availability settles at a sub-optimal value. During the initial learning epochs, the content availability increases promptly because all contents have high upper confidence values, which prevents premature exploitation; these high values stem from the small number of sampled requests. As learning progresses, however, the sparse requests for unpopular contents keep their upper confidence values high, which sustains exploratory behavior.

[Figure 5.3 plots. (a) Content Availability vs. Epoch for Cache Pre-loading policy, Top-k MAB with UCB+𝜖-greedy, Top-k MAB with 𝜖-greedy, and Top-k MAB with UCB. (b) Average Access Delay (in seconds) vs. Epoch for Top-k MAB with UCB+𝜖-greedy, Top-k MAB with 𝜖-greedy, and Top-k MAB with UCB.]

Figure 5.3. Comparison between exploration strategies in Top-k MAB and pre-loading using (a) Content Availability; (b) Access Delay
 
In the 𝜖-greedy strategy, the algorithmically decayed 𝜖 avoids this persistent uncertainty-driven exploration. This can be seen from the content availability with the 𝜖-greedy exploration strategy, which is better than that with UCB. Finally, to retain the initial surge in content availability while limiting the unbounded exploratory behavior, 𝜖-greedy exploration is applied to the UCB values of the contents. This hybrid exploration strategy helps boost the content availability by 5% toward the benchmark performance. Similarly, Figure 5.3(b) shows the convergence behavior of the Top-k MAB learning-based caching agent in terms of access delay. This is computed for a 𝑇𝐴𝐷 of 300 seconds, and it is observed that as learning progresses, the access delay for requested contents decreases while the content availability increases. This reflects the improvement in the learnt caching policy over the learning epochs. The largest reduction in access delay is observed when 𝜖-greedy exploration is applied to the UCB values of the contents.

Figure 5.4. Change in learnt caching policy of A-UAV with TAD 

5.5.3  Impact of Tolerable Access Delay on Learning Performance

To show the learning capability of the proposed Top-k MAB model, experiments are conducted with 𝑇𝐴𝐷s ranging from 300 to 1200 seconds. The content availability according to the learnt caching policy with varying 𝑇𝐴𝐷 is shown in Figure 5.4. The figure compares the proposed caching mechanism with the benchmark performance computed from the cache pre-loading policy discussed in Section 5.3.1. The following observations can be made from Figure 5.4. First, the learnt caching policy achieves performance close to the benchmark for all values of 𝑇𝐴𝐷. Second, the best possible performance (i.e., the benchmark) changes with 𝑇𝐴𝐷, and the Top-k MAB agents in the A-UAVs adapt to the user-defined 𝑇𝐴𝐷. Figure 5.4 also shows that the learning performance varies with 𝑇𝐴𝐷; in particular, the role of the multi-dimensional reward structure of the MAB agent becomes more evident at higher 𝑇𝐴𝐷. The information related to availability beyond the local community, i.e., the ferried and global availability increments (refer Section 5.4.2), is derived from a large number of content requests. This improves the estimated rewards at the A-UAVs and thus their caching decisions.

Figure 5.5.  Jaro-Winkler similarity for (a) A-UAVs and (b) F-UAVs

5.5.4  Cache Similarity of Learnt Sequence with Best Sequence

The effect of learning on the cached content sequence is demonstrated in Figure 5.5. Figure 5.5(a) plots the Jaro-Winkler Similarity (𝐽𝑊𝑆) of cached content sequences for all 12 A-UAVs. The key observations are as follows. First, the 𝐽𝑊𝑆 between the best caching sequence from the cache pre-loading policy (see Section 5.3.1) and the cached content sequences learnt by the Top-k MAB agents at the A-UAVs converges near 0.9, with some variance. Since 1 indicates complete similarity and 0 implies no similarity, this represents a high degree of similarity after convergence. Second, the cached contents improve over epochs as learning progresses. The lower 𝐽𝑊𝑆 values at initial epochs signify that the A-UAVs have no a priori content popularity information, local or global. As the MAB agents learn over epochs of generated content requests, the cached contents in the A-UAVs become more similar to the best caching sequence. Third, 𝐽𝑊𝑆 is an indirect representation of the storage segmentation factor (𝜆), which is used to decide the segment sizes in the cache pre-loading policies. A higher 𝐽𝑊𝑆 implies that, along with learning the caching policy, the Top-k MAB agents learn to emulate this segmentation behavior. Finally, the partial dissimilarity of the cached content sequences can be ascribed to the uncertainty associated with the Q-values of contents with low popularity. This uncertainty also leads to an oscillatory convergence of 𝐽𝑊𝑆 for the A-UAVs. The same behavior appears in the 𝐽𝑊𝑆 for F-UAVs, as shown in Figure 5.5(b). Since F-UAVs ferry contents that are requested less frequently, the low popularity of such contents makes their 𝐽𝑊𝑆 improve more slowly than that of the A-UAVs.

5.6  Summary and Conclusion

In this chapter, a UAV-aided content dissemination system is designed that can learn the caching policy on-the-fly without a priori content popularity information. Two types of UAVs, namely anchor and ferrying UAVs, are introduced to restore content provisioning in disaster- or war-stricken scenarios. Cache-enabled anchor UAVs are stationed at each stranded community of users for uninterrupted content provisioning, and ferrying UAVs act as content transfer agents across anchor UAVs. The evolution of pre-loading-based caching policies, which require a priori information about content popularity, is discussed. A decentralized Top-k Multi-Armed Bandit learning-based caching policy is proposed to overcome this limitation of cache pre-loading. It learns the caching policy on-the-fly with the help of a multi-dimensional reward structure that encapsulates local and global availability information. The forthcoming chapters address the characterization of shared intelligence across UAVs, as well as UAV trajectories and deployment strategies, which can build the foundation for developing distributed learning-model sharing approaches to improve content provisioning.
 
Chapter 6:  Distributed Federated-Multi-Armed Bandit Learning for Content Management in Connected UAVs

In the aftermath of disasters such as earthquakes, floods, or armed conflicts, survivors are 

often  forced  to  relocate  into  regions  with  severely  compromised  or  entirely  destroyed 

communication  infrastructure.  In  these  situations,  access  to  critical  information,  ranging  from 

emergency services and rescue updates to weather conditions and medical logistics, can determine 

the  success  of  relief  efforts.  The  UAV-aided  caching  system  introduced  in  this  chapter  builds 

directly on the learning mechanisms developed in Chapter 5 by extending them into a federated 

and  distributed  framework  that  is  better  suited  to  such  fragmented  and  high-variability 

environments. 

As described earlier, this chapter presents a two-tiered content dissemination architecture. 

Communities  of  stranded  users  are  each  served  by  Anchor  UAVs  (A-UAVs),  which  maintain 

vertical connectivity to centralized content repositories. A set of Ferrying UAVs (F-UAVs), which 

operate without vertical links, travel between A-UAVs and propagate cached content throughout 

the  network.  The  key  advancement  lies  in  the  introduction  of  Federated  Multi-Armed  Bandit 

(FedMAB)  Learning,  a  decentralized  learning  mechanism  that  enables  UAVs  to  optimize  their 

caching  policies  on-the-fly  based  on  local  user  demands  while  periodically  aggregating  their 

learning to ensure system-wide coordination. 

6.1  Motivation 

The caching policies developed in Chapter 5 addressed on-the-fly learning within individual UAVs, but they remain limited in scope when faced with distributed user communities exhibiting strong geo-temporal variations in content demand. In such disconnected environments, local demand at one community may be vastly different from that at another, often driven by proximity to the disaster, accessibility to relief, or evolving user needs. Moreover, content requests are not uniform in urgency. Some may require immediate access (e.g., evacuation routes), while others can tolerate delay (e.g., food distribution updates).

This chapter is motivated by the need to enhance learning agility, model generalization, 

and coordination across UAVs deployed in such multi-community environments. While a single 

UAV may adapt to its local context, the opportunity to share learning models across UAVs unlocks 

faster convergence, improved robustness, and reduced reliance on repeated exploration. Federated 

Multi-Armed  Bandit  (FedMAB)  Learning  makes  this  possible  by  allowing  each  UAV  to 

independently learn caching decisions while periodically aggregating Q-values or model updates. 

This  ensures  each  UAV  not  only  reflects  its  local  reality  but  benefits  from  insights  gathered 

elsewhere. 

Unlike  prior  works  that  assume  globally  known  popularity  or  rely  on  slow-to-adapt 

function approximators, this chapter introduces a collaborative and distributed learning framework 

that prioritizes responsiveness and scalability. In doing so, it aligns learning-based caching strategies with the practical realities of post-disaster operations, namely limited backhaul, varying QoS needs, and non-uniform content value across communities.

6.2  Design Objective 

The  primary  objective  of  this  chapter  is  to  present  a  UAV-aided  content  caching  and 

dissemination framework that can learn optimal caching policies on-the-fly using Federated Multi-

Armed  Bandit  (FedMAB)  Learning.  The  system  is  designed  to  operate  effectively  in 

infrastructure-deficient disaster scenarios and adapt to geo-temporal variations in content demand. 

a.  This  chapter  proposes  a  UAV-based  caching  framework  that  allows  each  UAV  to 

autonomously learn its content caching policy in real time by analyzing locally observed 

content request patterns. 

b.  It  introduces  Multi-Armed  Bandit  learning  algorithms  that  jointly  consider  local 

observations  and  shared  insights  from  other  UAVs  to  improve  caching  decisions  that 

reflect both local and global content popularity. 

c.  It  presents  a  federated  model  aggregation  technique  that  enables  UAVs  to  periodically 

exchange their learned Q-tables, thereby enhancing the overall caching efficiency without 

exchanging raw data. 

d.  It  investigates  the  relationship  between  the  learned  caching  strategies  and  Quality  of 

Service  (QoS)  expectations  by  incorporating  Tolerable  Access  Delay  (TAD)  as  a  key 

constraint in content relevance and urgency. 

e.  It explores the trade-offs between content demand variability and the responsiveness of learned caching policies, offering insights into parameter tuning for optimal policy convergence.

f.  It  validates  the  effectiveness  of  the  proposed  caching  model  through  simulation-based 

experiments and analytical evaluations across diverse disaster configurations and content 

demand patterns. 

Through  these  objectives,  the  chapter  seeks  to  demonstrate  a  scalable,  adaptive,  and 

decentralized  caching  strategy  that  aligns  with  the  operational  realities  of  disconnected,  post-

disaster environments. It further aims to deliver a robust, distributed solution that enhances the 

agility, resilience, and efficiency of UAV-assisted content dissemination under extreme conditions 

where infrastructure-based communication is no longer viable. 

[Figure 6.1 diagram. Labels: Satellite Link; Communication Infrastructure Destruction; Model sharing for Federated Learning. Legend: Anchor UAV, Ferrying UAV, F-UAV Trajectory.]

Figure 6.1.  Coordinated UAV system for content caching and distribution in environments 

without communication infrastructure 

6.3  System Model

6.3.1  UAV Hierarchy 

As shown in Figure 6.1, a two-tiered UAV-assisted content dissemination system is deployed. Each community is served by a dedicated A-UAV that uses a lateral wireless connection (e.g., Wi-Fi) to communicate with users in that community. Note that the role of A-UAVs can also be served by ground vehicles with similar mobility restrictions and communication equipment for both vertical and lateral links. The system in Figure 6.1 introduces a set of ferrying UAVs (F-UAVs), which are mobile and only have lateral communication links such as Wi-Fi. These lateral links are used for transferring content to and from the A-UAV and the users in the community that the F-UAV is currently visiting. Unlike the A-UAVs, the F-UAVs do not possess vertical links. The F-UAVs act as content transfer agents across different user communities by selectively transferring content across the A-UAVs. The ferrying UAVs also provide a means for isolated communities to access content, making the system more resilient and accessible for all users.

When a user in a community requests a content, the serving local A-UAV first checks its local storage. If the content is not found, the A-UAV waits for a potential delivery by a passing F-UAV. This allows content to be cached and transferred among the A-UAVs, enabling users in different communities to access content that was downloaded by other A-UAVs. Only if no F-UAV arrives within the specified tolerable access delay (TAD) does the A-UAV download the content via its expensive vertical link. In this way, the proposed two-tiered UAV-assisted content dissemination system mitigates the limitations of the FD approach.

6.3.2  Content Demand and Provisioning Model  

The content requests generated by the users in a community follow popularity distributions and quality-of-service requirements that differ across communities, as outlined below.

Content Popularity: Research has shown that the pattern of content requests from a population often follows a Zipf distribution [91], [119], [126], in which the popularity of a content is inversely proportional to its popularity rank raised to the power 𝛼. The popularity of content ‘𝑖’ (the content of rank 𝑖) is given as:

$$p(i) = \frac{1/i^{\alpha}}{\sum_{j=1}^{C} 1/j^{\alpha}} \tag{6.1}$$

The Zipf parameter, 𝛼, determines the skewness of the distribution, while C represents the total number of contents in the pool. Note that while the specific content requested by a user follows the Zipf distribution, the inter-request time of a user follows an exponential distribution.
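Eq. (6.1) can be evaluated directly, as sketched below with the Chapter 5 defaults ($C$ = 1000, $\alpha$ = 0.7); the function name `zipf_popularity` is illustrative. The ratio between consecutive ranks is $(1 + 1/i)^{\alpha}$, a power law in the rank rather than a constant (geometric) factor.

```python
from typing import List


def zipf_popularity(num_contents: int, alpha: float) -> List[float]:
    """Eq. (6.1): p(i) = i**-alpha / sum_j j**-alpha for ranks i = 1..C."""
    weights = [rank ** -alpha for rank in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]


p = zipf_popularity(num_contents=1000, alpha=0.7)
print(round(sum(p), 6))   # 1.0: a valid probability distribution
print(p[0] / p[1])        # equals 2**0.7, about 1.62
```

A larger 𝛼 concentrates requests on the top-ranked contents, which makes a small cache more effective.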

Tolerable Access Delay: For each requested content, the user specifies a Tolerable Access Delay 

(TAD) [123], [124], which serves as a quality-of-service parameter and represents the amount of 

time the user is willing to wait before the content is provisioned for download. 

Content Provisioning: Upon receiving a request from one of its community users, the relevant A-UAV first searches its local storage for the content. If the content is not found, the A-UAV waits for a potential future delivery by a traveling F-UAV. If no F-UAV arrives with the requested content within the specified TAD, the A-UAV then downloads it through its vertical link, which is usually expensive. In other words, the system attempts to provision the requested content without incurring the cost of downloading from the centralized server, by waiting up to the user-specified TAD for the content to arrive on a passing F-UAV.
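The three-step provisioning rule can be sketched as a single function. The sketch is illustrative: the name `provision` and the list of upcoming F-UAV visits as (arrival time, cached contents) pairs are assumptions, and the download time over the vertical link itself is ignored (the request is assumed to be provisioned once the TAD expires).

```python
from typing import Iterable, Set, Tuple


def provision(content: int, a_uav_cache: Set[int], t_request: float, tad: float,
              f_uav_visits: Iterable[Tuple[float, Set[int]]]) -> Tuple[str, float]:
    """Return (source, provisioning_time) for one request at an A-UAV."""
    # 1) Serve immediately from the A-UAV's local cache if possible.
    if content in a_uav_cache:
        return "A-UAV cache", t_request
    # 2) Otherwise wait for the first F-UAV that arrives within the TAD carrying it.
    for arrival, f_cache in sorted(f_uav_visits, key=lambda v: v[0]):
        if t_request <= arrival <= t_request + tad and content in f_cache:
            return "F-UAV", arrival
    # 3) TAD expired: download over the (expensive) vertical link.
    return "vertical link", t_request + tad


visits = [(250.0, {7, 9})]
print(provision(7, {1, 2}, t_request=100.0, tad=300.0, f_uav_visits=visits))  # ('F-UAV', 250.0)
print(provision(7, {1, 2}, t_request=100.0, tad=100.0, f_uav_visits=visits))  # ('vertical link', 200.0)
```

A longer TAD widens the window in which a passing F-UAV can serve the request, trading user waiting time for fewer costly downloads.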

6.4  Limitations of Cache Pre-loading at A-UAVs 

All the caching policies discussed so far rely on content pre-loading in the A-UAVs, which leads to the following limitations. Such pre-loading largely assumes prior knowledge of the underlying popularity distributions of the entire content population in the system. This assumption can seriously impede practical feasibility from a deployment standpoint. Its impact can be partially mitigated by estimating the local popularity of the contents requested within each A-UAV's community. Such estimates, however, would fail to adjust the storage segmentation factor (λ), which is crucial for maximizing availability across the entire system of A-UAVs and the users in their communities. Although global content popularities can be estimated by introducing collaboration among the local popularity estimation modules, such collaboration would fail to capture demand heterogeneity across the communities of different A-UAVs.
The limitations listed above can be addressed by employing a Federated Multi-Armed Bandit (f-MAB) learning-based caching mechanism at the A-UAVs. This paradigm combines the expected-reward maximization of MAB with the intelligence-sharing nature of Federated Learning for caching decisions at the A-UAVs. The f-MAB learning-based caching policy is presented in the following section.

6.5  Federated Multi-Armed Bandit Learning for Caching 

Once an A-UAV is deployed into a community, its next task is to decide which contents to download (via its vertical link) and cache so that content availability to the requesting users is maximized. This goal is achieved by employing a Top-k Multi-Armed Bandit learning agent in the A-UAV.

6.5.1  Top-k Multi-Armed Bandit Learning

Multi-Armed Bandit is a classic problem in reinforcement learning [130] and decision-making, where an agent is faced with a set of actions or “arms” to choose from, each associated with an unknown reward distribution. The objective of the agent is to maximize the total expected reward over a sequence of trials or rounds [127]. Formally, let there be $N$ arms, denoted by $A_1, A_2, \ldots, A_N$. Each arm $i$ has an unknown reward distribution with mean $\mu_i$ and variance $\sigma_i^2$. At each round $t$, the agent chooses an arm $A_t$ and observes a reward $R_t$ drawn independently from the reward distribution of the chosen arm. The agent's goal is to maximize the total expected reward $R_T$ over $T$ rounds, where $T$ is the total number of rounds (time horizon):

$$R_T = \max \sum_{t=1}^{T} \mathbb{E}[R_t] \tag{6.2}$$

This thesis uses a variant of MAB called the Top-k Multi-Armed Bandit [128]. Here, the agent has to choose $k$ arms out of a larger set of $N$ arms, as opposed to choosing one arm in the classical MAB, and receives a reward for each arm in the chosen set. The goal of the agent is to maximize the total cumulative reward $R_T$ obtained over a finite time horizon $T$:

$$R_T = \max \sum_{t=1}^{T} \sum_{i=1}^{k} \mathbb{E}[R_{i,t}] \tag{6.3}$$
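A toy simulation makes Eq. (6.3) concrete: in every round the agent picks $k$ arms and collects a reward from each. The sketch below uses hypothetical Bernoulli arms with made-up means and 𝜖-greedy selection over EMA Q-values; it returns the realized cumulative reward (the sample counterpart of the sum in Eq. 6.3), not the UAV-specific rewards defined later.

```python
import random
from typing import List


def run_top_k_bandit(means: List[float], k: int, rounds: int, epsilon: float,
                     alpha: float, seed: int = 0) -> float:
    """Top-k bandit: each round pick k arms, observe a Bernoulli reward for
    every chosen arm, update its Q-value, and accumulate the total reward."""
    rng = random.Random(seed)
    n = len(means)
    q = [0.0] * n
    total = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:                       # explore: random k arms
            chosen = rng.sample(range(n), k)
        else:                                            # exploit: k highest Q-values
            chosen = sorted(range(n), key=lambda i: q[i], reverse=True)[:k]
        for i in chosen:
            r = 1.0 if rng.random() < means[i] else 0.0  # Bernoulli reward of arm i
            q[i] += alpha * (r - q[i])
            total += r
    return total


means = [0.1, 0.05, 0.9, 0.2, 0.8]   # hypothetical arm means; best pair sums to 1.7
print(run_top_k_bandit(means, k=2, rounds=2000, epsilon=0.1, alpha=0.1))
```

For these means, always choosing the best two arms would yield an expected 2000 × 1.7 = 3400, which upper-bounds the expected cumulative reward of any policy here.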

6.5.2  Decentralized Caching using Top-k Multi-Armed Bandit 

In the UAV-caching scenario, there is a Top-k MAB agent in each A-UAV. Here, choosing a content for caching corresponds to choosing an arm. The ‘k’ of the Top-k MAB agent corresponds to the caching capacity of the A-UAV, i.e., $k = C_A$. The agent aims to select $C_A$ contents out of a larger set of ‘𝑁’ contents to be cached in an A-UAV such that content availability to the users is maximized. Here, the UAV-aided content dissemination system is the learning environment with which the A-UAVs interact through their actions of choosing specific sets of contents to cache, as shown in Figure 6.2. The feedback from the environment for the taken actions is in the form of rewards/penalties. Actions are rewarded when cached contents are requested by the users and served to them within the given tolerable access delay; otherwise, the actions are penalized. The top $C_A$ contents that accumulate the most reward from the corresponding community and other communities are chosen to be cached at an A-UAV. Note that the Top-k MAB agents in the A-UAVs are provided with no a priori information about the content popularity at the corresponding user communities.
[Figure 6.2 diagram: Top-k MAB Model at each A-UAV. The agent selects cache slots 1, 2, …, k from the total contents 1, 2, …, N (action) and receives a reward from the environment (UAV-caching system).]

Figure 6.2.    Top-k Multi-Armed Bandit Learning for Caching Policy at A-UAVs 

A learning decision epoch for each Top-k MAB agent is set according to the F-UAVs' accessibility at the corresponding community (i.e., an F-UAV's visiting frequency). This is because the F-UAVs carry the content availability information from the communities along their trajectories. This information is leveraged for learning by the A-UAVs' Top-k MAB agents through appropriately designed rewards. The agent learns to cache contents via a multi-dimensional numerical reward structure with three parts, namely local, ferrying, and global rewards. The first corresponds to the increase in availability at an A-UAV's own community, i.e., the increase in local availability ($\delta_l$). The second relates to contents that are cached in an A-UAV and are responsible for an increase in availability at other communities, i.e., the ferried content availability ($\delta_f$). A global reward is received when cached contents contribute to an increase in the average availability across all communities; this is called the increase in global availability ($\delta_g$). The three types of rewards are given below:

$$R_i^{L} = \begin{cases} 1, & \text{for } \delta_l > 0 \\ -1, & \text{for } \delta_l < 0 \end{cases} \tag{6.4}$$

$$R_i^{F} = \begin{cases} 1, & \text{for } \delta_f > 0 \\ -1, & \text{for } \delta_f < 0 \end{cases} \tag{6.5}$$

$$R_i^{G} = \begin{cases} 1, & \text{for } \delta_g > 0 \\ -1, & \text{for } \delta_g < 0 \end{cases} \tag{6.6}$$

In the above equations, $R_i^{L}$, $R_i^{F}$, and $R_i^{G}$ are the rewards for content ‘𝑖’ cached in an A-UAV according to the increases in local, ferried, and global availability, respectively.
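The sign-based rewards of Eqs. (6.4)-(6.6) can be sketched as below. Two details are assumptions of this sketch rather than statements from the text: a zero change in availability yields a reward of 0 (the equations leave that case unspecified), and the hypothetical helper `content_reward` combines the three parts by a plain sum to form 𝑟(𝑖).

```python
def sign_reward(delta: float) -> int:
    """Eqs. (6.4)-(6.6): +1 if availability increased, -1 if it decreased.

    The equations leave delta == 0 unspecified; this sketch assumes 0 there.
    """
    if delta > 0:
        return 1
    if delta < 0:
        return -1
    return 0


def content_reward(delta_l: float, delta_f: float, delta_g: float) -> int:
    """Hypothetical combination of R^L, R^F and R^G for one cached content (a plain sum)."""
    return sign_reward(delta_l) + sign_reward(delta_f) + sign_reward(delta_g)


print(content_reward(0.05, 0.01, 0.003))   # 3: availability rose everywhere
print(content_reward(0.02, -0.01, 0.0))    # 0: local gain offset by a ferried loss
```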

Using the aforementioned Top-k MAB model, an A-UAV agent learns a caching policy that serves user requests so as to increase content availability across the communities. Learning is achieved using a tabular method in which a Q-table holds a value for each action, i.e., each content that can be cached in an A-UAV. The value corresponding to each content is called a Q-value or action-value [127], [130]. The agent updates the Q-value of a content at every learning epoch according to the rewards in Eqns. 6.4-6.6, obtained from its interaction with the environment (the UAV-aided content dissemination system), and thereby learns the best actions (contents to cache). The Q-value update for a content ‘𝑖’ is given as follows:

𝑄(𝑖) ← 𝑄(𝑖) + 𝛼[𝑟(𝑖) − 𝑄(𝑖)]                                           (6.7) 

Here, 𝑄(𝑖) represents the Q-value of a content ‘𝑖’; 𝑟(𝑖) is the reward received by caching content ‘𝑖’; 𝛼 is a hyper-parameter that controls the learning rate. The Q-values of all contents are initialized with zero so that the Top-k MAB agent starts without a priori information, which also gives all contents equal importance for caching decisions. An epsilon-greedy (𝜖-greedy) exploration strategy is implemented, which gives every content a nonzero chance of being cached in an A-UAV. As learning progresses, exploration decays and the contents with the highest Q-values are exploited with the aim of maximizing the accumulated reward, which increases content availability.

Based on Algorithm 6.1, which captures the concepts discussed thus far, the Top-k MAB agents in the A-UAVs learn the caching policy. After learning converges, the contents cached at the A-UAVs emulate the cache pre-loading segmentation behavior. However, the caching policy and the corresponding average content availability remain oscillatory due to the low request rates for less popular contents. Because less popular contents are sampled (requested) rarely, their reward estimates are weak (or unstable). As a result, the Q-values of highly popular contents are less sensitive to individual content requests, whereas the Q-values of less popular contents are very sensitive to them. In other words, when a request for a popular content is generated and served to the requesting user, the updated Q-value of that content does not change drastically. However, when an unpopular content is requested and served, its Q-value changes abruptly. A sudden increase or decrease in the Q-value of a less popular content may result in its addition to or removal from the cache of an A-UAV. This leads to intermittent variations in the caching of some contents, mostly those cached in Segment-2 of the cache pre-loading policies. Such oscillation depends on, and can be controlled by, the choice of 𝜖 and its decay rate.

The following pseudocode describes the caching policy at an A-UAV with a Top-k MAB agent.

Algorithm 6.1. Caching policy at an A-UAV with Top-k MAB Learning

1.  Initialization:
    a.  𝑁: Total number of contents in the system
    b.  C_A: Caching capacity of an A-UAV
    c.  𝑄: Array of size 𝑁 initialized with 0's (Q-table)
    d.  𝜖: Exploration rate
    e.  𝛼: Learning rate for the Q-table update
2.  Load the A-UAV's cache with C_A randomly chosen contents.
3.  while True:
4.      if an F-UAV is visiting the A-UAV then do
            \\ Each F-UAV visit to the A-UAV marks a learning epoch
5.          for each content 𝑖 in the A-UAV cache (C_A contents) do
                \\ Loop through every content cached in the A-UAV
6.              Get reward 𝑟(𝑖)        \\ According to Eqns. 7-9
7.              Update 𝑄(𝑖)            \\ 𝑄(𝑖) ← 𝑄(𝑖) + 𝛼[𝑟(𝑖) − 𝑄(𝑖)]
                                       \\ 𝑄(𝑖) ← U_t(𝑖) if UCB is employed
8.          end for
9.          value = copy(𝑄)            \\ Make a copy of the Q-table
10.         for slot = 1 to C_A do
                \\ Reload contents (select arms)
11.             Generate a random number 𝑥 ∈ [0, 1)
                \\ For the 𝜖-greedy action selection strategy
12.             if 𝑥 < 𝜖 then do
13.                 c_loaded = a randomly chosen content with value[c_loaded] ≠ −inf
14.             else
15.                 c_loaded = c_max = argmax(value)
16.             end if
17.             Load c_loaded to the A-UAV
18.             Set value[c_loaded] = −inf
                \\ To avoid redundant reloading of the same content
19.         end for
20.     end if
21.     Check the 𝜖 decay condition.
22.     if true then do
23.         Update 𝜖                   \\ 𝜖 = 𝜖 × decay_𝜖, where decay_𝜖 = 0.99
24.     end if
25. end while
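A minimal Python sketch of steps 5-24 of Algorithm 6.1 follows. It assumes that the rewards of Eqns. 7-9 have already been computed and are supplied as a dictionary, since the reward computation is not reproduced here. The function names and the toy reward in the usage block are hypothetical; the sketch only illustrates the local Q-value update, the 𝜖-greedy reload of C_A distinct contents with −inf masking, and the multiplicative 𝜖 decay.

```python
import random
from typing import Dict, List


def top_k_mab_epoch(q: List[float], cache: List[int], rewards: Dict[int, float],
                    capacity: int, alpha: float, epsilon: float,
                    rng: random.Random) -> List[int]:
    """One learning epoch of Algorithm 6.1 (steps 5-19), run on an F-UAV visit.

    q:       Q-table with one entry per content (length N); updated in place
    cache:   contents currently cached at the A-UAV
    rewards: observed reward r(i) per cached content (stand-in for Eqns. 7-9)
    Returns the reloaded cache of `capacity` distinct contents.
    """
    if capacity > len(q):
        raise ValueError("cache capacity cannot exceed the number of contents")
    # Steps 5-8: Eqn. 6.7 update for every cached content.
    for i in cache:
        q[i] += alpha * (rewards.get(i, 0.0) - q[i])
    # Step 9: copy the Q-table so that masking does not modify it.
    value = list(q)
    new_cache: List[int] = []
    # Steps 10-19: reload `capacity` distinct contents (select arms).
    for _ in range(capacity):
        if rng.random() < epsilon:
            # Explore: a random content not yet loaded in this epoch.
            candidates = [c for c, v in enumerate(value) if v != float("-inf")]
            c_loaded = rng.choice(candidates)
        else:
            # Exploit: the content with the highest remaining value (first on ties).
            c_loaded = max(range(len(value)), key=value.__getitem__)
        new_cache.append(c_loaded)
        value[c_loaded] = float("-inf")  # step 18: avoid reloading the same content
    return new_cache


def decay_epsilon(epsilon: float, decay: float = 0.99) -> float:
    """Step 23: multiplicative epsilon decay."""
    return epsilon * decay


if __name__ == "__main__":
    rng = random.Random(0)
    n_contents, capacity = 10, 3
    q = [0.0] * n_contents
    cache = rng.sample(range(n_contents), capacity)  # step 2: random initial cache
    # Hypothetical reward: lower-indexed contents are more valuable.
    rewards = {i: 1.0 - i / n_contents for i in range(n_contents)}
    eps = 1.0
    for _ in range(200):
        cache = top_k_mab_epoch(q, cache, rewards, capacity, 0.1, eps, rng)
        eps = decay_epsilon(eps)
    print(sorted(cache), round(eps, 3))
```

Masking the copied values rather than the Q-table itself keeps the learnt estimates intact while guaranteeing that the reloaded cache holds C_A distinct contents in both the exploration and exploitation branches.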

To reduce the dependence of the caching policy on the choice of 𝜖, the Upper Confidence Bound (UCB) strategy is used [127]. The Top-k MAB agent maintains an upper confidence bound on the expected reward of each content and selects the set of $C_A$ contents with the highest UCB values at each epoch:

$$U_t(i) = Q_t(i) + \alpha_u \sqrt{\frac{\log t}{N_t(i)}} \qquad (6.8)$$

Here, $U_t(i)$ is the UCB of content ‘𝑖’ at epoch ‘𝑡’; $Q_t(i)$ is the updated Q-value at epoch ‘𝑡’; $\alpha_u$ is a hyperparameter that controls the degree of exploration; and $N_t(i)$ is the number of times content ‘𝑖’ has been requested up to epoch ‘𝑡’. The first term represents the reward estimate, and the second term represents the uncertainty in that estimate. UCB therefore favors contents that have the potential for high reward but have not been requested frequently, which promotes exploration without an externally imposed exploration parameter such as 𝜖. In this chapter, the 𝜖-greedy exploration strategy is applied to the UCB values, as shown in Steps 7-16 of Algorithm 6.1.
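The hybrid strategy described above (𝜖-greedy applied to UCB values instead of raw Q-values) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, $N_t(i)$ is taken as the cumulative request count, $t \ge 1$, and a content that has never been requested is given an infinite bonus because Eqn. 6.8 is undefined at $N_t(i) = 0$.

```python
import math
import random
from typing import List, Sequence


def ucb_values(q: Sequence[float], counts: Sequence[int], t: int,
               alpha_u: float = 2.0) -> List[float]:
    """Eqn. 6.8: U_t(i) = Q_t(i) + alpha_u * sqrt(log(t) / N_t(i)), for t >= 1.

    Eqn. 6.8 is undefined when N_t(i) = 0; such contents get +inf here (an assumption).
    """
    values = []
    for q_i, n_i in zip(q, counts):
        if n_i == 0:
            values.append(math.inf)
        else:
            values.append(q_i + alpha_u * math.sqrt(math.log(t) / n_i))
    return values


def hybrid_select(q: Sequence[float], counts: Sequence[int], t: int, capacity: int,
                  epsilon: float, rng: random.Random, alpha_u: float = 2.0) -> List[int]:
    """Epsilon-greedy over UCB values: random content with prob. epsilon, else max UCB."""
    value = ucb_values(q, counts, t, alpha_u)
    chosen: List[int] = []
    for _ in range(capacity):
        remaining = [c for c in range(len(value)) if c not in chosen]
        if rng.random() < epsilon:
            c = rng.choice(remaining)
        else:
            c = max(remaining, key=value.__getitem__)
        chosen.append(c)
    return chosen


if __name__ == "__main__":
    q = [0.9, 0.5, 0.4, 0.1]      # toy Q-values
    counts = [40, 10, 1, 0]       # toy cumulative request counts N_t(i)
    print(hybrid_select(q, counts, t=10, capacity=2, epsilon=0.0, rng=random.Random(0)))
    # prints [3, 2]: the rarely requested contents win on their uncertainty bonus
```

In the usage example, the two least-requested contents take both cache slots despite having the lowest Q-values, which illustrates why standalone UCB can keep exploring when requests for unpopular contents stay sparse.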

The Top-k MAB agent at an A-UAV learns a near-optimal caching policy within a finite time horizon and approaches the best caching policy asymptotically. However, this learning method for caching has the following limitations. First, since global content availability information is ferried by the F-UAVs, a large disaster area with multiple communities makes learning sluggish. Second, communities with fewer users generate fewer content requests at their local A-UAV. In such scenarios, requests for less popular contents are even rarer, because content popularity within each community is heavily skewed and follows a Zipf distribution. This leads to weak estimates of the reward distribution and sensitive (unstable) Q-values for less popular contents, which makes the caching policy very sensitive to requests for unpopular content. Finally, there are $\binom{N}{C_A}$ possible combinations of contents for the Top-k MAB agent to sample for caching. As a result, the reward of each content is estimated only at long intervals, which weakens the estimate of the reward distribution as 𝑁 increases.
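To illustrate the second limitation, the sketch below computes a Zipf popularity profile and the expected number of requests that the most and least popular contents receive within one observation window. It assumes the standard Zipf form $p_r \propto r^{-s}$ (Eqn. 6.1 is not reproduced in this section, so this form is an assumption); the exponent is called s here to avoid a clash with the learning rate 𝛼, although Table 6.1 names it 𝛼 = 0.4. The 900-second window and the function names are hypothetical.

```python
import random
from collections import Counter
from typing import List


def zipf_pmf(n: int, s: float) -> List[float]:
    """Zipf popularity of rank r = 1..n: p_r proportional to r**(-s)."""
    weights = [r ** (-s) for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_requests(n: int, s: float, num_requests: int, seed: int = 0) -> Counter:
    """Draw content requests (0-based content ids) from the Zipf popularity profile."""
    rng = random.Random(seed)
    ids = rng.choices(range(n), weights=zipf_pmf(n, s), k=num_requests)
    return Counter(ids)


if __name__ == "__main__":
    n_contents, s = 1000, 0.4       # Table 6.1 defaults
    requests = 0.5 * 900            # hypothetical: 0.5 requests/s over a 900 s window
    pmf = zipf_pmf(n_contents, s)
    print(f"expected requests, most popular content : {pmf[0] * requests:.2f}")
    print(f"expected requests, least popular content: {pmf[-1] * requests:.3f}")
    counts = sample_requests(n_contents, s, int(requests))
    print("contents never requested in the window:", n_contents - len(counts))
```

Under these assumptions the least popular contents are expected to receive well under one request per window, so their Q-values are updated only occasionally, which is the source of the weak reward estimates described above.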

These limitations can be alleviated by a Federated Multi-Armed Bandit learning approach to the caching policy, which aggregates the Top-k MAB models of all the A-UAVs. The mechanism is explained below.

6.5.3  A Brief on Federated Aggregation 

Federated learning (FL) [131], [132], [133] is a distributed machine learning technique that allows multiple devices/servers to collaboratively train a model without sharing their local data with a central server/arbitrator. In this approach, data remains on the individual devices/servers, and only model updates are exchanged between them.

Popular federated learning techniques include federated aggregation (also called weighted aggregation) [131], [132], secure aggregation [133], and differential privacy aggregation [132]. In federated (weighted) aggregation, each device's model update is multiplied by a weight that reflects the importance or contribution of that device towards the final model. These weighted models are then combined by taking a weighted average, which is called the aggregated model. In secure aggregation, the model parameters of a device are encrypted before being sent to another device for aggregation. In differential privacy, random noise is added to model updates before they are sent to another device, in order to preserve the privacy of individual devices; such noisy updates are generated using Laplace, Gaussian, or exponential mechanisms [132]. Overall, federated learning enables improved global model training, decentralized training, privacy, security, and scalability.

For the application in this thesis, the focus is on the improved global training and scalability attributes of FL. Federated learning is achieved using the federated (weighted) aggregation technique to combine model updates into a model that reflects the collective knowledge of all devices in the system. The federated aggregation process typically involves three main steps:

Initialization: Assume there are ‘𝑚’ devices in the system, and each device ‘𝑖’ is initialized with a model represented by a vector of parameters denoted by ‘$w_i$’.

Local training: Each device/server ‘𝑖’ trains the model locally using its own local data and updates ‘$w_i$’.

Model aggregation: The aggregation is done at a central server, chosen a priori to act as a central arbitrator or group leader. This central server holds the weight ‘$W_i$’ of each device, based on its importance or intended contribution towards the aggregated model. One common approach is to assign each device a weight based on the number of data samples ‘$n_i$’ available at device ‘𝑖’, as shown below:

$$W_i = \frac{n_i}{\sum_{j=1}^{m} n_j} \qquad (6.9)$$

Another approach is to assign a weight to a device using some measure of its performance, such as the accuracy ‘$acc_i$’ of the device's local model:

$$W_i = \frac{\mathit{acc}_i}{\sum_{j=1}^{m} \mathit{acc}_j} \qquad (6.10)$$

In general, the choice of weights depends on the specific application and the aim of the federated learning system. The locally trained models are sent to the central server and aggregated into a new, improved model via weighted aggregation, as shown in the expression below:

$$w_{aggregated} = \frac{\sum_{i=1}^{m} W_i \cdot w_i}{\sum_{i=1}^{m} W_i} \qquad (6.11)$$

where ‘$w_{aggregated}$’ is the aggregated model, which improves the desired performance by giving more weight to better-performing individual models and less weight to poorly performing ones.
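A minimal Python sketch of Eqns. 6.9 and 6.11 follows, treating each device's model as a flat list of parameters. The function names and toy values are hypothetical. Accuracy-based weights (Eqn. 6.10) use the same normalization, so a list of accuracies can be passed to sample_count_weights in place of sample counts.

```python
from typing import List, Sequence


def sample_count_weights(n_samples: Sequence[float]) -> List[float]:
    """Eqn. 6.9 (or 6.10 with accuracies): W_i = n_i / sum_j n_j."""
    total = float(sum(n_samples))
    return [n / total for n in n_samples]


def weighted_aggregate(models: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Eqn. 6.11: element-wise weighted average of the m parameter vectors w_i."""
    if not models or len(models) != len(weights):
        raise ValueError("need a non-empty list of models and one weight per model")
    total = float(sum(weights))
    dim = len(models[0])
    return [sum(W * w[k] for W, w in zip(weights, models)) / total for k in range(dim)]


if __name__ == "__main__":
    models = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # w_1, w_2, w_3 (toy values)
    W = sample_count_weights([100, 300, 600])       # n_i: data samples per device
    print(W, weighted_aggregate(models, W))
```

The division by the sum of weights in weighted_aggregate mirrors the denominator of Eqn. 6.11, so the result is unchanged whether or not the weights were normalized beforehand.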

One advantage of federated learning is that it reduces the need to transfer large amounts of data to a central server/arbitrator for processing. This is particularly useful when the data is large and network connectivity is limited or expensive. Another advantage is that federated learning enables real-time model updates, making it possible to adapt quickly to changes or anomalies in the data or environment. The application in this chapter leverages these advantages of federated learning.

[Figure 6.3 schematic: A-UAVs 1-4 each run a Top-k MAB model (agent, cache slots 1, 2, …, k drawn from total contents 1, 2, …, N, and an action/reward loop with the UAV-caching environment) that builds personal experience; an F-UAV moving along its trajectory performs federated aggregation of the A-UAV Q-tables (first 10 contents shown) and returns an updated Q-table.]

Figure 6.3. Federated Multi-Armed Bandit Learning for Caching Policy at Anchor UAVs
 
6.5.4  Distributed Caching with Federated Multi-Armed Bandit 

In the UAV-caching scenario, each device in the classical definition of federated learning is analogous to an A-UAV. The local data, based on which each UAV's Top-k MAB model is updated, corresponds to the information about cached contents, cache hits, and content availability. Note that cache hits are the number of times a content cached in an A-UAV is requested and served within the given tolerable access delay (TAD). The model, in this case, refers to the Q-table of an A-UAV's Top-k MAB agent. The role of the central server for model aggregation is played by the mobile F-UAVs, which are chosen because they can access the model parameters (i.e., Q-tables) of all the A-UAVs along their respective trajectories. The aggregated model at an F-UAV is sent to the A-UAV in its vicinity over the lateral link, with the objective of improving the Top-k MAB model developed at that A-UAV. The updated Top-k MAB model helps an A-UAV decide which contents to cache, based on the top ‘$C_A$’ Q-values. Since this caching method combines the model-sharing attribute of federated learning with the expected-reward maximization attribute of MAB, it is called the Federated Multi-Armed Bandit model (f-MAB).

Following the first step of the standard federated learning paradigm, the Q-table is initialized at each A-UAV. The learning epoch is set based on the F-UAV visiting frequency. As in the Top-k MAB procedure, the Q-values of the individual Top-k MAB agents of the A-UAVs are updated at each learning epoch, following the recursive reward-estimate equation of MAB shown in Eqn. 6.7. These Q-values capture the respective communities' content request patterns and the respective A-UAVs' caching decisions. This is called the personal experience of an A-UAV's Top-k MAB model (refer to Figure 6.3), and the process corresponds to the second step, i.e., the local training stage, of the standard federated learning paradigm. After gaining personal experience for an epoch duration, the model at an A-UAV can be improved using the models of its adjacent A-UAVs via the F-UAVs. Model aggregation in f-MAB also occurs according to the frequency of F-UAV visits to an A-UAV along the F-UAV's trajectory. Note that the quality of the aggregated model depends on the freshness of the information (i.e., Q-tables) ferried by an F-UAV. Therefore, the quality of the aggregated model improves as the number of F-UAVs in the UAV-aided caching system increases.

Model aggregation for regression/classification models such as neural networks involves aggregating model parameters such as coefficients/weights [131]. For f-MAB, model aggregation involves aggregating Q-values. As explained in the previous subsection, every device has an associated weight that reflects its importance during aggregation. Similarly, in UAV-caching using f-MAB, the weight of each A-UAV's model, when the F-UAV is visiting A-UAV ‘𝑥’, is given by the following expression:

$$W_{x,y} = \frac{1 - KL(P_x \| P_y)}{\sum_{k=1}^{m} \left(1 - KL(P_x \| P_k)\right)} \qquad (6.12)$$

Here, ‘$W_{x,y}$’ represents the weight associated with A-UAV 𝑦's model when the F-UAV is at A-UAV 𝑥. The parameter ‘𝑚’ is the total number of A-UAVs in the system. $P_x$ and $P_y$ are the content popularity distributions estimated at A-UAVs 𝑥 and 𝑦, respectively. $KL(P_x \| P_y)$ is the Kullback-Leibler divergence, or relative entropy [134], which measures the difference between the two distributions. Using the weights from Eqn. 6.12, the aggregated model is computed as follows:

$$Q_x^{agg}(i) = \sum_{y=1}^{m} W_{x,y} \cdot Q_y(i) = \frac{\sum_{y=1}^{m} \left(1 - KL(P_x \| P_y)\right) \cdot Q_y(i)}{\sum_{k=1}^{m} \left(1 - KL(P_x \| P_k)\right)} \qquad (6.13)$$

In Eqn. 6.13, $Q_x^{agg}(i)$ refers to the aggregated Q-value of content ‘𝑖’ at A-UAV ‘𝑥’. The physical significance of using KL divergence to compute the weight, or importance, of an A-UAV's model is as follows. The term “$1 - KL(P_x \| P_y)$” represents how similar the content popularity distributions are at the communities near A-UAVs ‘𝑥’ and ‘𝑦’. Therefore, the more similar the estimated content popularity distribution of A-UAV ‘𝑦’ is to that of A-UAV ‘𝑥’, the larger the weight associated with A-UAV 𝑦's model, and vice versa.
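The KL-weighted aggregation of Eqns. 6.12 and 6.13 can be sketched in Python as follows, assuming the F-UAV carries each A-UAV's Q-table and estimated popularity distribution. The function names are hypothetical, and the small eps that guards against log(0) for sparse popularity estimates is an added assumption. Because KL divergence is unbounded, $1 - KL(P_x \| P_y)$ becomes negative for sufficiently dissimilar distributions; the sketch applies Eqn. 6.12 literally and does not clip such terms.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(P || Q) in nats; eps guards against log(0) for sparse estimates (an assumption)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def kl_weights(x: int, popularity: Sequence[Sequence[float]]) -> List[float]:
    """Eqn. 6.12: W_{x,y} proportional to 1 - KL(P_x || P_y), for y = 1..m (including x)."""
    raw = [1.0 - kl_divergence(popularity[x], p_y) for p_y in popularity]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate_q(x: int, q_tables: Sequence[Sequence[float]],
                popularity: Sequence[Sequence[float]]) -> List[float]:
    """Eqn. 6.13: Q_x^agg(i) = sum_y W_{x,y} * Q_y(i) for every content i."""
    w = kl_weights(x, popularity)
    n_contents = len(q_tables[0])
    return [sum(w_y * q_y[i] for w_y, q_y in zip(w, q_tables)) for i in range(n_contents)]


if __name__ == "__main__":
    popularity = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.3, 0.3, 0.4]]  # toy P_1..P_3
    q_tables = [[0.9, 0.5, 0.1], [0.8, 0.6, 0.2], [0.1, 0.4, 0.9]]    # toy Q_1..Q_3
    print([round(v, 3) for v in kl_weights(0, popularity)])
    print([round(v, 3) for v in aggregate_q(0, q_tables, popularity)])
```

In the toy example, A-UAV 2, whose popularity profile is close to that of A-UAV 1, receives almost the same weight as A-UAV 1's own model, while the dissimilar A-UAV 3 contributes noticeably less.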

After the Q-tables are aggregated at the visiting F-UAV, the aggregated Q-table ‘$Q^{agg}$’ replaces the existing Q-table at an A-UAV. This is shown in Figure 6.1, where F-UAV 𝑖 is visiting A-UAV 𝑥; it therefore has access to the Q-table of A-UAV 𝑥, and model aggregation is done at F-UAV 𝑖. In contrast, F-UAV 𝑗 is in transit and has not yet reached the next A-UAV on its trajectory, so no model aggregation takes place at F-UAV 𝑗.

These aggregated Q-tables improve the estimated reward associated with each content. The number of requests generated at a community during a learning epoch may not be sufficient for a good reward estimate at its A-UAV, so without model aggregation the A-UAVs have weaker estimates of the reward distribution. Through Q-table aggregation, the reward estimates are improved without physically collecting content request information across all the A-UAVs.

However, Q-table aggregation loses the context of local popularity when demand is heterogeneous across communities. This problem is analogous to the personalization-generalization problem in federated learning [131], [132]. An A-UAV's Q-table updated with personal experience corresponds to the personalized (local) model, and the aggregated Q-table represents the generalized (global) model. This chapter uses a weighted averaging technique to preserve the local popularity context while improving the reward estimate:

$$Q_x^{upd}(i) = w_1 \cdot Q_x(i) + w_2 \cdot Q_x^{agg}(i) \qquad (6.14)$$

This is equivalent to

$$Q_x^{upd}(i) = w_1 \cdot Q_x(i) + \frac{e^{-\beta_d t}}{\beta_s} \cdot \left(1 - Q_x(i)\right) \cdot Q_x^{agg}(i) \qquad (6.15)$$

In Eqns. 6.14 and 6.15, ‘𝑡’ is the epoch number. The weights $w_1$ and $w_2$ determine the contributions of the local and global (aggregated) models to the updated model ‘$Q_x^{upd}$’. From Eqns. 6.14 and 6.15, $w_2$ can be expressed as follows:

$$w_2 = \frac{e^{-\beta_d t}}{\beta_s} \cdot \left(1 - Q_x(i)\right) \qquad (6.16)$$

Here, $\beta_d$ and $\beta_s$ represent the weight decay factor and the scaling factor, respectively. These parameters ensure that the contribution of the global (aggregated) model decreases as learning progresses. This idea is based on the assumption that, as learning progresses with the help of model aggregation, the individual local models come to reflect the true value of the contents, after which model aggregation may no longer be necessary to improve the reward estimate. Also, Eqn. 6.16 contains the term “$1 - Q_x(i)$”, which represents regret [127]. This ensures that the updated f-MAB model can improve on the existing Top-k MAB model while keeping the Q-value within the true value of a content.
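The following minimal sketch implements Eqns. 6.14-6.16 in Python, assuming the reconstruction $w_2 = e^{-\beta_d t}(1 - Q_x(i))/\beta_s$. The default values ($w_1 = 0.99$, $\beta_d = 0.05$, $\beta_s = 2$) are the ones reported for the simulation experiments in this chapter; the function names and toy Q-values are hypothetical.

```python
import math
from typing import List, Sequence


def global_weight(q_local: float, t: int, beta_d: float, beta_s: float) -> float:
    """Eqn. 6.16: w_2 = exp(-beta_d * t) / beta_s * (1 - Q_x(i))."""
    return math.exp(-beta_d * t) / beta_s * (1.0 - q_local)


def personalized_update(q_local: Sequence[float], q_agg: Sequence[float], t: int,
                        w1: float = 0.99, beta_d: float = 0.05,
                        beta_s: float = 2.0) -> List[float]:
    """Eqns. 6.14/6.15: Q_x^upd(i) = w_1 * Q_x(i) + w_2(i) * Q_x^agg(i)."""
    return [w1 * ql + global_weight(ql, t, beta_d, beta_s) * qa
            for ql, qa in zip(q_local, q_agg)]


if __name__ == "__main__":
    q_local, q_agg = [0.2, 0.8], [0.6, 0.9]   # toy local and aggregated Q-values
    for t in (0, 10, 100):
        print(t, [round(v, 4) for v in personalized_update(q_local, q_agg, t)])
```

Printing the update for increasing 𝑡 shows the global term fading, so $Q_x^{upd}(i)$ tends toward $w_1 \cdot Q_x(i)$; a larger $\beta_d$ makes this hand-over to the local model happen sooner.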

For the simulation experiments, $w_1$ is set empirically to 0.99. This high value is chosen because the local content popularity at a community does not vary with time in these experiments. Note that the choice of the weight $w_1$ in Eqns. 6.14 and 6.15 is essential, especially in scenarios with time-varying content popularity. In such scenarios, content popularity estimates such as $P_x$ and $P_y$ (along with their Q-values) change with time. To determine the weight $w_1$ in such cases, the following expression can be used:

$$w_1 = 1 - \frac{JSD\left(P_x^{t}, P_x^{t_1}\right)}{\ln 2} \;\Rightarrow\; w_1 = 1 - \frac{\frac{1}{2}\left[KL\left(P_x^{t} \| M\right) + KL\left(P_x^{t_1} \| M\right)\right]}{\ln 2} \qquad (6.17)$$

Here, $JSD(P_x^{t}, P_x^{t_1})$ is the Jensen-Shannon divergence [134], which measures the dissimilarity between $P_x^{t}$ and $P_x^{t_1}$, the content popularity distributions at times 𝑡 and $t_1$, respectively. 𝑀 is the average distribution, calculated as $M = (P_x^{t} + P_x^{t_1})/2$. $KL(P_x \| M)$ is the Kullback-Leibler divergence, or relative entropy [134], which measures the difference between an individual distribution ($P_x^{t}$ or $P_x^{t_1}$) and the average distribution 𝑀. Since the JSD computed with natural logarithms is bounded by ln 2, dividing by ln 2 keeps $w_1$ within [0, 1]. The weight $w_1$ is high if the content popularity distribution does not change substantially between times 𝑡 and $t_1$, and low otherwise. This weight $w_1$ determines the importance of the local model under time-varying local popularity and, hence, how strongly the updated Q-table $Q_x^{upd}$ depends on it.
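Eqn. 6.17 can be computed directly from two popularity snapshots, as in the minimal Python sketch below. The function names are hypothetical; terms with a zero probability contribute nothing to the KL sums, and 𝑀 is strictly positive wherever either snapshot is, so no division by zero occurs.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(P || Q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def local_weight(p_t: Sequence[float], p_t1: Sequence[float]) -> float:
    """Eqn. 6.17: w_1 = 1 - JSD(P_x^t, P_x^t1) / ln 2, with M the average distribution."""
    m = [(a + b) / 2.0 for a, b in zip(p_t, p_t1)]
    jsd = 0.5 * kl(p_t, m) + 0.5 * kl(p_t1, m)
    return 1.0 - jsd / math.log(2)


if __name__ == "__main__":
    print(local_weight([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # 1.0: popularity unchanged
    print(local_weight([1.0, 0.0], [0.0, 1.0]))            # 0.0: popularity fully changed
```

The two extreme cases confirm the intended behavior: an unchanged popularity profile keeps full trust in the local model, while a completely changed profile removes it.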

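Eqn. 6.15 can be rewritten by substituting $w_1$ and $Q_x^{agg}$ as:

$$Q_x^{upd}(i) = \left(1 - \frac{\frac{1}{2}\left[KL\left(P_x^{t} \| M\right) + KL\left(P_x^{t_1} \| M\right)\right]}{\ln 2}\right) \cdot Q_x(i) + \frac{e^{-\beta_d t}}{\beta_s} \cdot \left(1 - Q_x(i)\right) \cdot \frac{\sum_{y=1}^{m} \left(1 - KL(P_x \| P_y)\right) \cdot Q_y(i)}{\sum_{k=1}^{m} \left(1 - KL(P_x \| P_k)\right)} \qquad (6.18)$$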

The hyperparameters $\beta_d$ and $\beta_s$ can be tuned empirically to ensure that the caching policy at an A-UAV does not lose the local context of content popularity, especially when content popularity is heterogeneous across communities.

The handling of the personalization-generalization problem in f-MAB-based caching is shown in Figure 6.3, where model aggregation is done at the F-UAV and the model at A-UAV ‘2’ is updated using a weighted average of the personal and aggregated models. Algorithmically, the implementation of the f-MAB-based caching policy at an A-UAV is similar to Top-k MAB-based caching, except for line 7 in Algorithm 6.1: the Q-table update uses Eqn. 6.18 along with Eqn. 6.7 to incorporate federated learning into the Top-k Multi-Armed Bandit. The new Q-table is used for the caching decision, where the top $C_A$ contents with the highest Q-values are cached at an A-UAV. Such contents can boost content availability at their respective communities as well as at other, distant communities via the F-UAVs.
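Putting the pieces together, the following hypothetical Python sketch shows how line 7 of Algorithm 6.1 could be modified for f-MAB: after the local Eqn. 6.7 update (assumed already applied to q_tables), Eqn. 6.18 blends A-UAV x's Q-table with the KL-weighted aggregate, and the top $C_A$ contents are cached. It assumes the F-UAV carries every A-UAV's Q-table and popularity estimate, that the KL-based weights do not sum to zero, and it uses the reconstructed $w_2 = e^{-\beta_d t}(1 - Q_x(i))/\beta_s$; all names are illustrative.

```python
import math
from typing import List, Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(P || Q) in nats, with a small eps to avoid log(0) (an assumption)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def f_mab_update(x: int, q_tables: Sequence[Sequence[float]],
                 popularity: Sequence[Sequence[float]], p_prev: Sequence[float],
                 t: int, beta_d: float, beta_s: float) -> List[float]:
    """Eqn. 6.18 for A-UAV x; p_prev is A-UAV x's popularity estimate at time t_1."""
    p_now = popularity[x]
    m_avg = [(a + b) / 2 for a, b in zip(p_now, p_prev)]
    w1 = 1 - (0.5 * kl(p_now, m_avg) + 0.5 * kl(p_prev, m_avg)) / math.log(2)  # Eqn. 6.17
    raw = [1 - kl(p_now, p_y) for p_y in popularity]                           # Eqn. 6.12 numerators
    total = sum(raw)
    decay = math.exp(-beta_d * t) / beta_s
    updated = []
    for i, q_xi in enumerate(q_tables[x]):
        q_agg = sum(r * q_y[i] for r, q_y in zip(raw, q_tables)) / total         # Eqn. 6.13
        updated.append(w1 * q_xi + decay * (1 - q_xi) * q_agg)                 # Eqns. 6.15/6.18
    return updated


def top_c(q: Sequence[float], capacity: int) -> List[int]:
    """Cache the C_A contents with the highest Q-values (ties keep the lower index first)."""
    return sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:capacity]


if __name__ == "__main__":
    popularity = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]  # toy P_1..P_3
    q_tables = [[0.9, 0.2, 0.1], [0.8, 0.5, 0.3], [0.1, 0.4, 0.9]]    # toy Q_1..Q_3
    q_new = f_mab_update(0, q_tables, popularity, p_prev=popularity[0],
                         t=5, beta_d=0.05, beta_s=2.0)
    print([round(v, 3) for v in q_new], top_c(q_new, capacity=2))
```

When the popularity estimate at A-UAV x has not changed (p_prev equals the current estimate), $w_1 = 1$ and the update reduces to Eqn. 6.15 with the full local Q-value retained.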

Table 6.1. Default Values for Model Parameters

#    Variables                                         Default Value
1    Total number of contents, $N$                     1000
2    Number of A-UAVs, $N_A$                           12
3    Number of F-UAVs, $N_F$                           3
4    Cache space in A-UAV, $C_A$                       200
5    Cache space in F-UAV, $C_F$                       200
6    Poisson request rate parameter, $\mu$             0.5 requests/sec
7    Hover time of F-UAV, $T_{hover}$                  600 seconds
8    Transition time of F-UAV, $T_{transition}$        300 seconds
9    Zipf parameter (popularity), $\alpha$             0.4
10   Ferrying UAV trajectory                           Round-robin

6.6  Experiments and Results  

A discrete event simulator was used to experimentally evaluate the performance of the proposed f-MAB and Top-k MAB learning-based caching mechanisms. Content requests are generated using an exponential distribution for the inter-request intervals and a Zipf distribution for content popularity (refer to Eqn. 6.1). Cache pre-loading is done using the analytical pre-loading expressions, to evaluate the best achievable benchmark performance when the content popularities across all communities are known a priori. Unless specified otherwise, the parameter values from Table 6.1 are used as defaults.

The following performance metrics are evaluated.  

Content Availability ($P_{avail}$): This is defined as the ratio between cache hits and generated requests within a time interval. Cache hits are requests for which the content is provided to the users from the caches in the UAV-aided caching system. Conversely, when a content is not available in the caches of either type of UAV, it must be downloaded from the cloud. In other words, content availability indirectly indicates the reduction in from-cloud download cost achieved by deploying smart caching.

Cache Distribution Optimality (CDO): This determines the optimality of the learnt caching policy in terms of the caching sequence. Jaro-Winkler Similarity (𝐽𝑊𝑆) [93] is used to represent CDO by computing the similarity between the content sequence from the learnt caching policy and the content sequence according to cache pre-loading. It is computed from the number of matches, the number of transpositions required within the matches, and the similarity of the prefixes of both sequences. It is a normalized similarity measure, where 1 represents optimal caching and 0 represents non-optimal caching.

Access Delay (𝐴𝐷): Access delay is defined as the total latency between the generation of a content request and the delivery of the content to the user from the cache of any of the UAVs. AD is reported over time to demonstrate how it improves as caching policy learning progresses.
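For reference, the standard Jaro-Winkler similarity used for CDO can be computed over content-ID sequences as sketched below. The sketch follows the textbook definition (match window of max(|s1|, |s2|)/2 − 1, transpositions counted as half the out-of-order matches, prefix scale p = 0.1, prefix capped at 4 elements); whether [93] uses exactly these constants is an assumption, and the sequences in the usage block are hypothetical.

```python
from typing import Sequence


def jaro(s1: Sequence, s2: Sequence) -> float:
    """Jaro similarity between two sequences (here: content IDs in caching order)."""
    if not s1 and not s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, a in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == a:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: half the number of matched elements that appear in a different order.
    seq1 = [a for a, used in zip(s1, matched1) if used]
    seq2 = [b for b, used in zip(s2, matched2) if used]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (matches / len(s1) + matches / len(s2) + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: Sequence, s2: Sequence, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score by the length of the common prefix (up to max_prefix)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


if __name__ == "__main__":
    preloaded = [3, 1, 4, 15, 9, 2]   # hypothetical cache pre-loading order
    learnt = [3, 1, 15, 4, 9, 7]      # hypothetical learnt caching order
    print(round(jaro_winkler(preloaded, learnt), 4))
```

The prefix bonus rewards learnt policies that agree with pre-loading on the most popular (first-ranked) contents, which is why JWS suits a ranked caching sequence better than a plain set overlap.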

6.6.1  Effect of Caching Mechanisms and Exploration Strategies

To understand the viability of the proposed caching policies in scenarios with demand heterogeneity, a unique content popularity sequence is used for each community. To capture heterogeneity in the content popularity sequences of different communities, contents are swapped with a pre-decided probability [93], and the difference between the sequences is determined and maintained using the Smith-Waterman Distance (SWD) [93]. This is a normalized distance measure, where an SWD value of 1 means that the content popularity sequences are completely different and an SWD of 0 means that they are identical. Additionally, two different request generation rates, 0.5 and 0.01 requests/second, are used across the communities to capture demand heterogeneity. To implement f-MAB, the weight decay factor is set to $\beta_d$ = 0.01 and 0.05, and a scaling factor of $\beta_s$ = 2 is chosen empirically. Two values of $\beta_d$ are used to demonstrate the effect of the personalization-generalization problem in Federated Multi-Armed Bandit Learning, which is explained later. For the 𝜖-greedy strategy, the initial exploration rate is set to 𝜖 = 1 and decays at a rate of 0.0025 per learning epoch. The degree of exploration in the Upper Confidence Bound (UCB) exploration strategy is set to $\alpha_u$ = 2.
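The heterogeneity setup can be approximated with the sketch below, which perturbs a popularity ranking by random swaps and measures the result with a normalized Smith-Waterman distance. The swap scheme, the scoring constants (match = 1, mismatch = −1, linear gap penalty = 1) and the normalization by min(|a|, |b|) are assumptions for illustration; [93] may define SWD differently, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def perturb_by_swaps(seq: Sequence[int], swap_prob: float, rng: random.Random) -> List[int]:
    """Swap each position with a random position with probability swap_prob (hypothetical)."""
    out = list(seq)
    for i in range(len(out)):
        if rng.random() < swap_prob:
            j = rng.randrange(len(out))
            out[i], out[j] = out[j], out[i]
    return out


def smith_waterman_score(a: Sequence[int], b: Sequence[int],
                         match: int = 1, mismatch: int = -1, gap: int = 1) -> int:
    """Best local-alignment score (Smith-Waterman with a linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur[j] = max(0, diag, prev[j] - gap, cur[j - 1] - gap)
            best = max(best, cur[j])
        prev = cur
    return best


def smith_waterman_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Normalized distance in [0, 1]: 0 for identical sequences, 1 when nothing aligns.

    Dividing by the best achievable score min(|a|, |b|) (match = 1) is an assumption.
    """
    if not a or not b:
        return 1.0
    return 1.0 - smith_waterman_score(a, b) / min(len(a), len(b))


if __name__ == "__main__":
    rng = random.Random(7)
    base = list(range(50))   # popularity ranking at a reference community (toy)
    for p in (0.0, 0.1, 0.5):
        other = perturb_by_swaps(base, p, rng)
        print(p, round(smith_waterman_distance(base, other), 3))
```

Raising the swap probability breaks long aligned runs in the ranking, so the distance grows toward 1; a target SWD can then be maintained by choosing the swap probability per community.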

Figure 6.4 shows the convergence behavior of the learnt caching policies, comparing f-MAB with the different exploration strategies employed in the Top-k MAB model. Figure 6.4a shows the learning dynamics in terms of the improvement in content availability.

The observations from Figure 6.4a are as follows. First, the figure shows that a near-optimal caching policy can be learnt by employing an f-MAB agent at every A-UAV. The algorithm leverages the intelligence-sharing attribute of federated learning to achieve content availability close to the benchmark performance. The model-sharing approach in federated learning reduces the dependence of each A-UAV's MAB model on its own community's content requests alone. By including the aggregated model in the Q-table updates (see Eqn. 6.18), the Q-values at each A-UAV capture the requests generated across all communities. Such Q-values represent improved reward estimates, which lead to better learning towards a more effective caching policy.

This improvement in reward can be seen in Figure 6.4c, where f-MAB consistently accumulates higher rewards as it learns an improved caching policy.

[Figure 6.4: four panels, (a)-(d), each plotted against Learning Epoch]

Figure 6.4. Comparison between f-MAB, different exploration strategies in Top-k MAB and Cache Pre-Loading in terms of (a) Content Availability; (b) Access Delay; (c) Cumulative reward; (d) Epoch-wise Standard Deviation in Content Availability of A-UAVs

The second observation is that increasing the weight decay factor $\beta_d$ from 0.01 to 0.05 increases content availability. As discussed previously, the weight decay factor helps balance the personalization-generalization trade-off in Federated Multi-Armed Bandit Learning. Physically, a very slow decay of the aggregated model's weight (refer to Eqn. 6.16) may increase generalization, leading to replicated caching behavior across all A-UAVs. The effect of over-generalization can also be observed in Figure 6.4c, where, as learning progresses, the curve for $\beta_d$ = 0.01 accumulates less reward than the one for $\beta_d$ = 0.05. A more detailed analysis of the weight decay factor and its effect on content availability is provided in Figure 6.5.

The third observation concerns the performance comparison between f-MAB and the Top-k MAB approaches with various exploration strategies. The multi-dimensional reward structure of the Top-k MAB models at the A-UAVs helps generate caching policies that improve performance during the initial learning epochs, as was also highlighted through Eqns. 7-9. As learning progresses, however, the performance improvement tapers off. This is due to the insufficient number of content requests at individual A-UAVs, which leads to weak Q-value estimates.

Finally, when the agent uses the standalone UCB exploration strategy, content availability settles at a sub-optimal value. During the initial learning epochs, however, content availability increases promptly because all contents have high upper confidence values, which favors exploration over exploitation. This can be seen in Eqn. 6.8, where $N_t(i)$ represents the number of requests generated for content ‘𝑖’ up to learning epoch ‘𝑡’. During the initial learning epochs, few requests have been generated for any content, which keeps all upper confidence values high. Physically, because of this initially high confidence in all contents, the model fails to prioritize a subset of contents to cache, which leads to exploratory behavior. As learning progresses, the sparse requests for unpopular contents keep their upper confidence values high, which sustains the exploratory behavior. An algorithmically induced 𝜖 in the 𝜖-greedy strategy avoids this sustained exploratory behavior because 𝜖 decays. However, 𝜖 is a predetermined exploration parameter controlled by its decay rate (refer to Algorithm 6.1), and a faster decay can limit the exploration capability of the proposed algorithm, forcing it to converge to a suboptimal caching policy.

Therefore, to retain the initial surge in content availability while limiting the unbounded exploratory behavior, 𝜖-greedy exploration is applied to the upper confidence bound values of the contents. This hybrid exploration strategy boosts content availability beyond that of the respective non-hybrid strategies. Specifically, when applied in conjunction with the f-MAB approach, the hybrid strategy achieves a 7% performance improvement compared to Top-k MAB with any standalone exploration strategy.

Figure 6.4b shows the convergence behavior of f-MAB and Top-k MAB in terms of access delay, computed for a 𝑇𝐴𝐷 of 1800 seconds. As learning progresses, the access delay for requested contents decreases while content availability increases. The largest reduction in access delay is observed when f-MAB is applied in tandem with the hybrid UCB/𝜖-greedy exploration described above.

Another way of representing the learning convergence behavior is the standard deviation (SD) of epoch-wise content availability across all A-UAVs. This characterizes the fairness of the learnt caching behavior across all A-UAVs as learning progresses. The physical significance of observing the standard deviation of availability is as follows. A content ‘𝑖’ with low popularity cached at A-UAV ‘x’ can help increase availability at A-UAV ‘y’ via an F-UAV. If ‘𝑖’ is highly popular at A-UAV ‘y’, it serves more user requests there and commensurately earns a high ferrying reward $R_i^{F}$ and global reward $R_i^{G}$ at A-UAV ‘x’ (refer to Eqns. 8 and 9). If both rewards are high, the Q-value of ‘𝑖’ increases at A-UAV ‘x’, which leads to caching of a locally unpopular content. This violates the estimation criteria for non-IID samples, because a content is cached at an A-UAV based on estimates driven by another A-UAV with different content popularity preferences. This phenomenon contributes to the standard deviation in content availability.

The comparison in Figure 6.4d reveals that the f-MAB strategy decreases the standard deviation in availability, indicating synchronized learning of caching policies among all A-UAVs. In contrast, the Top-k MAB strategy exhibits an increased standard deviation. This implies unfair learning behavior among A-UAVs due to non-independent and identically distributed (non-IID) content request patterns. The proposed f-MAB approach reduces the exclusive dependence on the reward structure. It incorporates the notions of local and global popularity through personalized and aggregated models. This leads to a fair and simultaneous improvement in content availability across communities. The residual standard deviation after convergence is due to the inherent demand heterogeneity and sparse intra-community request generation. Overall, Figure 6.4d demonstrates the superior performance of f-MAB over Top-k MAB for learnt caching decision-making.
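The fairness metric can be sketched as the per-epoch population standard deviation of content availability across A-UAVs. The layout `availability[epoch][uav]` and the use of the population (rather than sample) SD are assumptions for illustration. The availability numbers below are hypothetical, not results from Figure 6.4d.

```python
from statistics import pstdev

def availability_sd_per_epoch(availability):
    """availability[e][u]: fraction of requests at A-UAV u served within TAD in epoch e.
    Returns one population standard deviation per epoch; lower means fairer."""
    return [pstdev(row) for row in availability]

# Hypothetical numbers: two epochs, three A-UAVs.
history = [
    [0.40, 0.60, 0.50],   # early epoch: uneven availability across communities
    [0.70, 0.72, 0.71],   # later epoch: more synchronized learning
]
sds = availability_sd_per_epoch(history)
print(sds)
```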

[Figure 6.5 image: (a) content availability vs. learning epochs; (b) convergence (in terms of epochs) vs. weight decay rate 𝛽𝑑 ∈ {0.1, 0.05, 0.01, 0.007}, annotated “For best learnt performance”; (c) offset from benchmark performance vs. weight decay factor 𝛽𝑑 ∈ {0.1, 0.05, 0.01, 0.007}, annotated “Least performance offset achieved from f-MAB learning based caching policy”]

Figure 6.5.  (a) Effect of weight decay factor 𝛽𝑑 on content availability; (b) Convergence with 𝛽𝑑; (c) Offset from benchmark performance

Note that in f-MAB, the aggregated model (see Eqn. 6.15) and its contribution to the update of the Q-values are controlled by a weight decay factor 𝛽𝑑 (see Eqn. 6.16). The effects of 𝛽𝑑 on content availability are shown in Figure 6.5, and the observations are as follows. First, the best content availability is achieved with 𝛽𝑑 = 0.05. For this value, Figure 6.5c also shows the minimum offset of the achieved content availability from the benchmark performance. Second, a higher weight decay factor ensures an initial surge in performance, but the surge tapers off as learning progresses. Figure 6.5c shows that with such 𝛽𝑑 the learning performance settles at a suboptimal value. Finally, lower values of 𝛽𝑑 make the learning sluggish, as observed in Figures 6.5a and 6.5b. The resulting content availability is also suboptimal, as seen in Figure 6.5c. Therefore, the weight associated with the aggregated model $Q^{agg}$ must be computed with a careful, empirical selection of 𝛽𝑑.
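Eqn. 6.16 is not reproduced in this section, so the following is only a hypothetical sketch of how a weight decay factor 𝛽𝑑 could shrink the aggregated model's contribution over epochs. It assumes an element-wise blend $Q_i \leftarrow (1-w_t)\,Q_i^{local} + w_t\,Q_i^{agg}$ with $w_t = w_0 e^{-\beta_d t}$. Both the blend and the exponential schedule, as well as the names `aggregation_weight` and `personalized_q`, are illustrative assumptions rather than the f-MAB update itself.

```python
import math

def aggregation_weight(beta_d, epoch, w0=1.0):
    """Hypothetical weight on the aggregated model, decaying exponentially with epochs."""
    return w0 * math.exp(-beta_d * epoch)

def personalized_q(q_local, q_agg, beta_d, epoch):
    """Blend local and aggregated Q-values element-wise (assumed form, not Eqn. 6.16)."""
    w = aggregation_weight(beta_d, epoch)
    return [(1 - w) * ql + w * qa for ql, qa in zip(q_local, q_agg)]

# How much influence the aggregated model keeps after 50 epochs for the 𝛽𝑑 values of Figure 6.5.
for beta_d in (0.1, 0.05, 0.01, 0.007):
    print(beta_d, round(aggregation_weight(beta_d, 50), 3))
```

Under this assumed schedule, larger 𝛽𝑑 removes the aggregated model's influence sooner, which is the knob Figure 6.5 sweeps empirically.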

[Figure 6.6 image: panels (a)-(d) plot 𝐽𝑊𝑆 against learning epoch]

Figure 6.6.  (a) JWS at A-UAVs with f-MAB; (b) JWS at A-UAVs with Top-k MAB; (c) JWS at F-UAVs with f-MAB; (d) JWS at F-UAVs with Top-k MAB

6.6.2  Quality of Learnt Cache Sequence  

This section reports the quality of algorithmically learnt cache sequences in terms of their similarity to the theoretically best possible cache sequence. Note that the best possible caching sequence can be derived from cache pre-loading policies. The quality of a learnt caching policy is reported in terms of cache distribution optimality (CDO), which can be calculated from the Jaro-Winkler Similarity (𝐽𝑊𝑆) [93].
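The 𝐽𝑊𝑆 computation can be sketched as follows, treating a cache as an ordered sequence of content IDs. This is the textbook Jaro-Winkler definition: a matching window of ⌊max(|s1|, |s2|)/2⌋ − 1, prefix scale p = 0.1, and a common prefix capped at 4. How the learnt and best cache sequences are ordered before comparison is an assumption here. The function names are illustrative.

```python
def jaro(s1, s2):
    """Jaro similarity of two sequences (strings or lists of content IDs)."""
    if not s1 and not s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, a in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len(s2))):
            if not matched2[j] and s2[j] == a:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched elements that appear in a different order count as half-transpositions.
    seq1 = [a for a, m in zip(s1, matched1) if m]
    seq2 = [b for b, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score by the length of the common prefix (up to max_prefix)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)

best = [1, 2, 3, 4, 5, 6]     # hypothetical best cache sequence (content IDs)
learnt = [1, 2, 4, 3, 5, 9]   # hypothetical learnt cache sequence
print(round(jaro_winkler(best, learnt), 3))
```

A value of 1 means the learnt sequence is identical to the best sequence and 0 means no matching contents, matching the interpretation used for Figure 6.6.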

Cache distribution optimality can be interpreted as an indirect representation of the storage segmentation factor (𝜆), which decides the segment sizes under cache pre-loading policies. A higher 𝐽𝑊𝑆 implies that, along with learning the caching policy, the MAB agents learn to emulate this segmentation behavior, so the cached contents approach the optimum as learning progresses. Lower 𝐽𝑊𝑆 values at initial epochs signify that the A-UAVs have no a priori content popularity information, either local or global. As the MAB agents learn from the generated content requests over time, the cached contents in the A-UAVs become more similar to the best caching sequence. Thus, the agents indirectly learn to emulate cache segmentation, which is reflected in the increasing cache distribution optimality. The partial dissimilarity of the cached content sequence can be ascribed to the uncertainty associated with the Q-values of contents. This uncertainty also leads to an oscillatory convergence of 𝐽𝑊𝑆 for A-UAVs (refer to Figures 6.6a and 6.6b). The same behavior appears in the 𝐽𝑊𝑆 for F-UAVs, because F-UAV caching depends on the caching decisions at A-UAVs.

Figures 6.6a and 6.6b plot the Jaro-Winkler Similarity of cached content sequences for all 12 A-UAVs while employing f-MAB and Top-k MAB, respectively. The key observations are as follows.

First, the 𝐽𝑊𝑆 between the best caching sequence under the cache pre-loading policy and the caching sequences learnt by f-MAB agents converges near 0.9, although with some variance. Physically, this represents high cache distribution optimality, where 1 indicates complete similarity and 0 indicates no similarity.

Second, the caching sequence learnt by Top-k MAB agents shows an initial increase in similarity, which tapers off as learning progresses. This is due to the subpar Q-values of contents, as seen from the weak reward estimates in Figure 6.4c.

Third, the caching sequences learnt with Top-k MAB have high variance. The reason is twofold: the precedence of global rewards over local penalties, and the agent's unawareness of global popularity. Intermittently, cached contents accumulate large rewards due to global rewards that supersede local penalties. This leads to poor caching decisions locally, resulting in reduced content availability. Moreover, an agent's unawareness of global popularity prevents it from limiting the offset caused by poor caching decisions.

Finally, the quality of the learnt caching sequence at the A-UAVs affects the learnt caching sequence at the F-UAVs. Figures 6.6c and 6.6d show the 𝐽𝑊𝑆 of cached content at 3 F-UAVs with f-MAB and Top-k MAB, respectively. F-UAVs ferry contents that are requested less frequently. The low popularity of such contents therefore leads to a comparatively sluggish improvement of their 𝐽𝑊𝑆 relative to the A-UAVs. Poor caching decisions at the A-UAVs under Top-k MAB affect the caching decisions at the F-UAVs more than under f-MAB, as shown in Figures 6.6c and 6.6d. Top-k MAB yields 10-15% lower 𝐽𝑊𝑆 than the f-MAB learning-based policy, indicating lower cache distribution optimality.

6.6.3  Impacts of Tolerable Access Delay

To gain insights into the learning capabilities of the proposed f-MAB and Top-k MAB models, experiments are conducted with tolerable access delays (𝑇𝐴𝐷) ranging from 1200 to 2400 seconds. The content availability achieved by the learnt caching policies with varying 𝑇𝐴𝐷 is shown in Figures 6.7a and 6.7b. The figures compare the proposed caching mechanisms, viz. f-MAB and Top-k MAB, with the benchmark performance computed using the cache pre-loading policy. Figures 6.7a and 6.7b are two different ways of emphasizing the learning behavior for varying 𝑇𝐴𝐷 scenarios.

[Figure 6.7 image: panels (a) and (b) plot content availability against learning epoch]

Figure 6.7.  (a-b) Two different ways to show content availability performance with different TADs

The following observations can be made from Figure 6.7. First, the caching policy learnt with the f-MAB learning-based caching mechanism achieves performance closer to the benchmark for all values of 𝑇𝐴𝐷. Second, Figures 6.7a-b show that the best possible performance (i.e., the benchmark) changes with 𝑇𝐴𝐷. Third, the f-MAB and Top-k MAB agents in the A-UAVs adapt to the user-defined 𝑇𝐴𝐷 via dynamic learning. In other words, the role of the multi-dimensional MAB reward structure and of the federated model-sharing approach becomes more evident as content 𝑇𝐴𝐷 increases. In particular, the information related to global availability, i.e., 𝛿. and 𝛿/, is derived from a larger count of content requests. This improves the estimated rewards at the A-UAVs, thus improving their caching decisions. Since the f-MAB model leverages the personal experience of individual A-UAVs, f-MAB's performance improves commensurately with the enhanced performance of Top-k MAB.

[Figure 6.8 image: panels (a) and (b) plot content availability against learning epoch]

Figure 6.8.  (a-b) Two different ways to show content availability performance with different Zipf popularity skewness 𝛼

6.6.4  Impacts of Content Popularity Skewness

Zipf Popularity Skewness 

Content popularity skewness, represented by the Zipf parameter 𝛼, changes the relative importance of contents. As 𝛼 increases, the most popular contents become more popular and the popularity of less popular contents falls. Figures 6.8a and 6.8b show, in two different ways, how the proposed learning-based mechanisms cope with different Zipf popularity skewness 𝛼. Both f-MAB and Top-k MAB policies adjust to changes in 𝛼. Because the popularity of highly requested contents increases with 𝛼, the Q-values of popular contents develop faster than with lower 𝛼. This behavior favors the learning progression of both f-MAB and Top-k MAB agents. Consistent with the previous observations, f-MAB outperforms Top-k MAB in terms of content availability. However, this improvement comes with added pre-convergence computational complexity, as shown in Figure 6.9. The computational load for both proposed caching methods is calculated for 1800 requests per epoch. For Top-k MAB, the load is calculated for the recursive Q-value update equation [128] and is constant. For f-MAB, in contrast, the computation scales with the number of contents due to the weight calculation using KL divergence (refer to Eqns. 15-21). Note that the additional computation with f-MAB tapers off post-convergence due to the improved Q-values of contents. Physically, this implies that the f-MAB caching agent has learnt to balance the local content requirements of the respective communities with the global needs of the disaster-affected regions.

[Figure 6.9 image: number of mathematical operations (0 to 25000) vs. number of contents (100, 400, 1000, 2000) for Top-k MAB and f-MAB in panels (a) and (b). Panel (b) is annotated: “As learning progresses the weight decay factor 𝛽𝑑 becomes negligible. Therefore, the need for computation associated with KL divergence and federated aggregation tapers off.”]

Figure 6.9.    Computation complexity (a) before convergence, and (b) after convergence 

Note that the aforementioned experiments were conducted for a heterogeneous scenario to show the generalization capabilities of the proposed caching mechanisms.

A homogeneous demand scenario is a special case of the generalized heterogeneous case. With homogeneity in both content popularity and TAD, the benchmark performance is computed using Smart Exclusive Caching (SEC). For homogeneous TAD with community-specific content popularity, Popularity Based Caching (PBC) determines the benchmark performance. Both SEC and PBC are special cases of Value Based Caching (VBC). The proposed f-MAB and Top-k MAB models remain applicable in these scenarios for on-the-fly learning of caching policies.

[Figure 6.10 image: caching performance against learning epoch]

Figure 6.10.   Federated Multi-Armed Bandit Learning based Caching Performance comparison in Heterogeneous and Homogeneous Scenarios

Figure 6.10 shows the applicability of f-MAB for learning the caching policy in both heterogeneous and homogeneous demand scenarios. The performance improvements in the two scenarios are comparable.

6.7  Summary and Conclusion

In this chapter, a UAV-aided content dissemination system is proposed that can learn the caching policy on-the-fly without a priori content popularity information. Two types of UAVs, viz. anchor and ferrying UAVs, are introduced to support content provisioning in a disaster- or war-stricken scenario. Cache-enabled anchor UAVs are stationed at each stranded community of users for uninterrupted content provisioning. Ferrying UAVs act as content transfer agents across the anchor UAVs. The evolution of pre-loading-based caching policies, which require a priori information about content popularity, is discussed. A decentralized Top-k Multi-Armed Bandit Learning-based caching policy is proposed to ameliorate the limitation of cache pre-loading. It learns the caching policy on-the-fly by maximizing the estimated reward for the increase in local and global content availability. To improve the Q-value estimates, a distributed Federated Multi-Armed Bandit Learning-based caching policy is proposed. This method combines the Q-values of all anchor UAVs to produce a better estimate of the top popular contents at a community. Future work includes algorithmically coping with time-varying content popularity and adaptive trajectory planning in the presence of operational unreliabilities of the UAVs. The next chapter characterizes UAV trajectories and deployment strategies. This characterization can build the foundation for developing trajectory-aware learning-model sharing techniques to enhance content dissemination.

Chapter 7:  Benchmarking UAV Trajectory-Aware Caching Policies in Infrastructure-Less Networks

7.1  Motivation 

This research is driven by the critical need for reliable communication in environments where disasters or conflicts have compromised or completely destroyed traditional communication infrastructure. Developing solutions that can quickly and efficiently bridge these communication gaps is therefore urgent. Unmanned Aerial Vehicles (UAVs) present a promising avenue for addressing this challenge due to their flexibility and rapid deployment capabilities. However, using UAVs effectively in such scenarios requires a deep understanding of their operational dynamics. Specifically, it requires understanding how the planning of their flight paths, including hovering and transitioning behaviors, impacts the availability of essential content and the cost of its delivery. This chapter aims to fill this gap by exploring the dynamics of UAV trajectory planning in communication-challenged scenarios.

7.2  Design Objective

The primary goal of this work is to enhance the accessibility of critical information in areas where standard communication systems are no longer viable. To achieve this, the chapter introduces a novel Joint Deployment of Ferrying UAVs (JDFU) algorithm designed to optimize content availability across various scenarios. The algorithm advances beyond traditional content caching and UAV deployment strategies by dynamically adjusting to the specific needs and constraints of disaster-affected environments. A key part of this objective is to understand the trade-offs between the UAVs' operational parameters and the tolerable delays in accessing requested content. By doing so, the research seeks to identify an operational sweet spot that ensures maximum content availability. Additionally, simulation experiments and analytical models are developed to validate the effectiveness of the proposed trajectory planning and deployment strategies. These are also used in subsequent chapters as performance benchmarks for learning models that develop trajectory-aware caching policies. These tools are instrumental not only in assessing the performance of the JDFU algorithm but also in fine-tuning the overall approach to UAV-aided content dissemination in challenging conditions.

7.3  System Model

7.3.1  UAV Hierarchy  

The two-tier UAV-aided content dissemination system is shown in Figure 3.1. Each partitioned community of users is served by an A-UAV using a lateral wireless link such as Wi-Fi. Consider a naïve fully duplicated (FD) approach, in which A-UAVs download all contents requested by the users [120] with no inter-A-UAV data transfer. This approach has the following shortcomings. First, different A-UAVs will duplicate downloads via the expensive vertical links, because requests for popular contents overlap across communities. This will incur high download costs. Second, storage constraints will cap the number of contents that can be downloaded and stored in each A-UAV, thus limiting content availability. Finally, due to limited infrastructure availability, some communities of users may be cut off from content access if no dedicated A-UAV is assigned to them.

To address these issues, a set of ferrying UAVs (i.e., F-UAVs) is introduced. Unlike A-UAVs, the mobile F-UAVs do not possess vertical links. They do have lateral links such as Wi-Fi, through which they can communicate with the A-UAVs and the users. The role of these UAVs is to cache and ferry content among the A-UAVs. This allows users in one community to access content that was downloaded by A-UAVs serving other communities.

After receiving a request from one of its community users, an A-UAV first searches its local storage for the content. If the content is not found, the A-UAV waits for a potential future delivery of the content by one of the traveling F-UAVs. If no F-UAV carrying that content arrives within the specified tolerable access delay (TAD), the A-UAV downloads it via the vertical link.

To address F-UAV trajectory planning, different pre-programmed trajectories are characterized together with the static content placement strategies described below.

7.3.2  Caching Policies  

Caching at Anchor UAVs (A-UAVs): As mentioned before, the FD mechanism limits the number of accessible contents for all user communities to $C_A$, the A-UAV cache size. This limitation can be addressed by filling one part of each A-UAV's cache with the same contents, viz. duplicate contents, and the remaining cache space with unique contents. The unique contents in all the A-UAVs are shared across the communities via the traveling F-UAVs. This Smart Cache Duplication (SCD) mechanism can effectively increase the number of contents accessible to all users across the entire system. This improves the overall availability within a given TAD.

Let the size of the duplicate segment of the A-UAV cache be $\lambda C_A$ and that of the unique segment be $(1-\lambda) C_A$, where 𝜆 is a duplication factor that decides the level of content duplication in A-UAVs. This results in $N_A (1-\lambda) C_A$ unique contents stored across all $N_A$ A-UAVs in the system, and these can be shared across all user communities via the mobile F-UAVs. These unique contents rank in popularity immediately after the top $\lambda C_A$ popular contents that are duplicated in all the A-UAVs. For symmetry, all $N_A (1-\lambda) C_A$ unique contents are uniformly randomly distributed across the $N_A$ A-UAVs. The total number of contents in the system is $C_{sys} = \lambda C_A + N_A (1-\lambda) C_A$.
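The SCD placement can be sketched as follows. It assumes content IDs equal popularity ranks (1 = most popular) and rounds $\lambda C_A$ to an integer. The function name `scd_placement` is hypothetical. Every A-UAV stores the top $\lambda C_A$ contents, and the next $N_A(1-\lambda)C_A$ contents are shuffled and split evenly across the A-UAVs.

```python
import random

def scd_placement(n_a, c_a, lam, rng):
    """Smart Cache Duplication sketch. Returns (per-A-UAV caches, C_sys).
    Each cache lists the duplicate segment first, then its unique segment."""
    dup = round(lam * c_a)                 # duplicate segment size, lambda * C_A
    uniq_per_uav = c_a - dup               # unique segment size, (1 - lambda) * C_A
    duplicated = list(range(1, dup + 1))   # top lambda*C_A contents, stored everywhere
    unique = list(range(dup + 1, dup + n_a * uniq_per_uav + 1))
    rng.shuffle(unique)                    # uniformly random spread of unique contents
    caches = [duplicated + sorted(unique[u * uniq_per_uav:(u + 1) * uniq_per_uav])
              for u in range(n_a)]
    c_sys = dup + n_a * uniq_per_uav       # C_sys = lambda*C_A + N_A*(1 - lambda)*C_A
    return caches, c_sys

caches, c_sys = scd_placement(n_a=4, c_a=10, lam=0.3, rng=random.Random(0))
print(c_sys)      # 3 duplicated + 4 * 7 unique contents
print(caches[0])
```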

Caching at Ferrying UAVs (F-UAVs): The purpose of the F-UAVs is to ferry around the $N_A (1-\lambda) C_A$ unique contents stored in all $N_A$ A-UAVs. Given the limited per-F-UAV caching space $C_F$, an F-UAV's caching policy can be determined based on its trajectory, the value of 𝜆, and the Zipf parameter defining the content popularity.

Consider a situation in which F-UAV k is approaching A-UAV i. Let $U_i$ be the set of all unique contents in the entire system except the ones stored in A-UAV i. To maximize content availability for the users in A-UAV i's community, the F-UAV should carry as many contents from set $U_i$ as its cache space permits. These contents are of lower popularity than the duplicated ones. Accordingly, F-UAV k should carry the $C_F$ most popular contents from set $U_i$ while approaching A-UAV i. The size of the set is $|U_i| = (N_A - 1)(1-\lambda) C_A$. When $C_F \le |U_i|$, the F-UAV carries the $C_F$ most popular contents as outlined above. Otherwise, it carries all $|U_i|$ unique contents, leaving part of its cache (i.e., $C_F - |U_i|$) empty. Such underutilization of F-UAV cache space arises from large 𝜆 values. These lead to heavy duplication within the A-UAVs and thus few stored unique contents.
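The F-UAV selection rule above can be sketched as follows. It assumes that content IDs are popularity ranks (smaller ID = more popular) and that each A-UAV cache lists its `dup` duplicate contents first. The function name `fuav_cache_for` is hypothetical. The sketch returns the contents carried toward A-UAV `target` and the number of F-UAV cache slots that stay empty when $C_F > |U_i|$.

```python
def fuav_cache_for(target, caches, dup, c_f):
    """Contents an F-UAV carries when approaching A-UAV `target`: the C_F most popular
    members of U_i, i.e. the unique contents held by all other A-UAVs."""
    u_i = sorted({x for k, cache in enumerate(caches) if k != target
                  for x in cache[dup:]})
    carried = u_i[:c_f]
    empty_slots = max(c_f - len(u_i), 0)   # underutilized cache when |U_i| < C_F
    return carried, empty_slots

# Hypothetical system: N_A = 3, C_A = 4, lambda = 0.5 (2 duplicated contents per cache).
caches = [[1, 2, 5, 8], [1, 2, 3, 6], [1, 2, 4, 7]]
print(fuav_cache_for(0, caches, dup=2, c_f=3))   # C_F <= |U_i| = 4
print(fuav_cache_for(0, caches, dup=2, c_f=6))   # C_F > |U_i|: two slots stay empty
```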

7.4  Content Request and Provisioning Model

Content requests are generated by the users in communities and sent to their respective local A-UAVs or to a visiting F-UAV, in that order of preference.

Content Popularity: Studies have shown [119] that content request patterns often follow a Zipf distribution, in which a content's popularity is inversely proportional to a power of its popularity rank. The popularity of content $i$ is given as $p_\alpha(i) = (1/i)^{\alpha} / \sum_{k=1}^{C} (1/k)^{\alpha}$, where $C$ represents the total number of contents in the pool and the Zipf parameter 𝛼 determines the skewness of the distribution.
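The Zipf popularity profile can be computed directly from the formula above. The sketch uses $C = 2000$ and $\alpha = 0.7$, the values later used in the experiments of Section 7.6. The ratio between consecutive ranks is $((i+1)/i)^{\alpha}$, so it is not constant across ranks. The function name is illustrative.

```python
def zipf_popularity(num_contents, alpha):
    """p_alpha(i) = (1/i)^alpha / sum_k (1/k)^alpha for ranks i = 1..C."""
    weights = [(1.0 / i) ** alpha for i in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]

p = zipf_popularity(2000, 0.7)
ratio = p[0] / p[1]   # equals 2 ** alpha; p[1] / p[2] equals 1.5 ** alpha instead
print(round(p[0], 4), round(ratio, 4))
```

Increasing 𝛼 concentrates more probability mass on the top ranks, which is the skewness effect examined in Section 6.6.4.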

Content Requests: Poisson-distributed request generation is a widely used way to model user requests in practical network scenarios.

Tolerable Access Delay and Content Provisioning: For each generated request, a Tolerable Access Delay (TAD) [123], [124] is specified. TAD is a Quality-of-Service parameter that indicates how long a user is willing to wait before a requested content can be accessed. From the network's perspective, the content may not be available from the A-UAV or visiting F-UAVs within the specified TAD. In that case, it has to be downloaded from a central server using the expensive vertical link of the A-UAV. Therefore, to reduce the download cost, the contents cached in F-UAVs need to be readily available within the A-UAV/F-UAV network.
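The request and provisioning model can be sketched in two parts. Requests arrive as a Poisson process, i.e., with exponential inter-arrival times. Each request is resolved in the order of preference described above: local A-UAV, then a visiting F-UAV within 𝑇𝐴𝐷, then the vertical link. The upcoming F-UAV visits are passed in as hypothetical `(arrival_delay_s, contents)` pairs. Reporting a vertical-link delay equal to 𝑇𝐴𝐷 (download issued when 𝑇𝐴𝐷 expires, download time ignored) is an assumption.

```python
import random

def serve_request(content, local_cache, fuav_visits, tad):
    """Resolve one request at an A-UAV. fuav_visits: list of (arrival_delay_s, contents).
    Returns (source, delay_s)."""
    if content in local_cache:
        return "A-UAV", 0.0
    for delay, carried in sorted(fuav_visits, key=lambda v: v[0]):
        if delay > tad:
            break                          # later visits arrive after the TAD expires
        if content in carried:
            return "F-UAV", delay
    return "vertical-link", tad            # assumption: download issued at TAD expiry

def poisson_arrivals(rate, horizon, rng):
    """Poisson request process: exponential inter-arrival times with mean 1/rate."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return times
        times.append(t)

arrivals = poisson_arrivals(rate=1.0, horizon=1000.0, rng=random.Random(0))
print(len(arrivals))
print(serve_request(9, {5, 6}, [(600.0, {9}), (2400.0, {5})], tad=1800))
```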

7.5  Trajectory-aware Content Placement Planning for Ferrying UAVs

7.5.1  Trajectory Sequence and Cycle  

An F-UAV's trajectory is represented by the sequence of visited A-UAVs and the hovering duration at each A-UAV. Figure 3.1 shows that F-UAVs A and B follow a partitioned cycle of the A-UAV sequence X, Y, Z and W, whereas F-UAVs C and D follow a global cycle. The choice of sequence depends on the popularity of contents cached in the A-UAVs.

The cycle time of an F-UAV trajectory is $T_{cycle} = N_A^{C} \times (T_{hover} + T_{transition})$, where $N_A^{C}$ is the number of A-UAVs in the F-UAV's sequence and $T_{hover}$ is the hover duration at each A-UAV. $T_{transition}$ is the transition time between two consecutive A-UAVs in the F-UAV's sequence. $T_{transition}$ depends on the F-UAV flying speed, inter-community distance, wind speed and direction, and other environmental factors. $T_{hover}$ should not be less than the minimum duration required for successful data transfer between UAVs and users. The minimum hover time is determined by (a) the data transfer rate; (b) the amount of data that needs to be exchanged between the F-UAV and the A-UAV; (c) multipath fading; and (d) shadowing and fading due to UAV height, among other factors.
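The cycle-time relation can be worked through with a short sketch. The minimum hover time is modelled, as a hypothetical simplification, as the exchanged data volume divided by an effective link rate that already absorbs fading and shadowing losses. The names and numbers are illustrative: 12 A-UAVs, 1 GB exchanged at 100 Mbit/s, 90 s hover and 120 s transition.

```python
def min_hover_time(data_bits, effective_rate_bps):
    """Hypothetical lower bound on T_hover: time to exchange the data at an
    effective link rate that already accounts for fading and shadowing."""
    return data_bits / effective_rate_bps

def cycle_time(n_anchor_in_cycle, t_hover, t_transition, t_hover_min=0.0):
    """T_cycle = N_A^C * (T_hover + T_transition) for a round-robin trajectory.
    Rejects hover durations below the minimum needed for data transfer."""
    if t_hover < t_hover_min:
        raise ValueError("T_hover is below the minimum hover duration")
    return n_anchor_in_cycle * (t_hover + t_transition)

t_min = min_hover_time(8e9, 1e8)   # 1 GB at 100 Mbit/s
print(t_min, cycle_time(12, t_hover=90, t_transition=120, t_hover_min=t_min))
```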

7.5.2  Joint Deployment of Ferrying UAVs (JDFU) Algorithm  

The caching mechanisms discussed so far are based on round-robin trajectories, in which an F-UAV sequentially visits all the A-UAVs with equal hover durations at each A-UAV. With multiple F-UAVs, all of them use the same trajectory with equal spacing. The time gap between two consecutive F-UAV visits to a community (i.e., an A-UAV) is therefore $T_{cycle}/N_F$. Consider a scenario where F-UAV $i$ leaves A-UAV $j$ and the next F-UAV $i+1$ reaches A-UAV $j$ after $T_{cycle}/N_F$. Suppose a content requested from the F-UAVs has a Tolerable Access Delay $TAD > T_{cycle}/N_F$. Then the content is served to the user with at least $TAD - T_{cycle}/N_F$ to spare before the 𝑇𝐴𝐷 is exhausted. If this extra time exceeds the gap between two consecutive F-UAV visits ($T_{cycle}/N_F$), it can be leveraged by deploying multiple F-UAVs in groups flying together, as explained below.
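The timing argument can be checked numerically. The sketch generalizes the gap to groups of size $N_F^G$ as $T_{cycle} N_F^G / N_F$. It assumes, hypothetically, that the group size divides $N_F$ evenly. The example values ($T_{cycle}$ = 2160 s, $N_F$ = 6, 𝑇𝐴𝐷 = 1800 s) are illustrative.

```python
def visit_gap(t_cycle, n_f, group_size=1):
    """Time between consecutive (groups of) F-UAVs reaching an A-UAV."""
    if group_size < 1 or n_f % group_size:
        raise ValueError("group size must divide N_F")
    return t_cycle * group_size / n_f

def tad_slack(tad, t_cycle, n_f, group_size=1):
    """Spare time TAD - gap left after the next (group of) F-UAV(s) arrives."""
    return tad - visit_gap(t_cycle, n_f, group_size)

def worth_grouping(tad, t_cycle, n_f, group_size=1):
    """Criterion from the text: the spare time exceeds the gap between visits."""
    return tad_slack(tad, t_cycle, n_f, group_size) > visit_gap(t_cycle, n_f, group_size)

print(visit_gap(2160, 6), tad_slack(1800, 2160, 6), worth_grouping(1800, 2160, 6))
print(visit_gap(2160, 6, 3), tad_slack(1800, 2160, 6, 3), worth_grouping(1800, 2160, 6, 3))
```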

Here we introduce an F-UAV trajectory mechanism, termed Joint Deployment of Ferrying UAVs (JDFU), to leverage this extra time. In this mechanism, multiple F-UAVs fly together while following the same trajectory at the same time. The $N_F$ F-UAVs in a system can be deployed in groups of different sizes when employing JDFU. If they are deployed with $N_F^G$ F-UAVs per group, there will be $N_F/N_F^G$ such groups, viz. $(N_F^G \times N_F/N_F^G)$ F-UAVs. This is called the JDFU configuration. Note that the groups of $N_F^G$ F-UAVs still visit the A-UAVs (at their corresponding communities) in a round-robin manner.

JDFU has two major advantages. First, for high-𝑇𝐴𝐷 requests, it reduces the extra time $TAD - T_{cycle}/N_F$ during which the F-UAV does not add to content availability. This is explained as follows. $T_{cycle}/N_F$ is the time it takes an F-UAV to reach the next community in its sequence after the previous F-UAV leaves that community. If $TAD > T_{cycle}/N_F$, then the F-UAV reaches the community $TAD - T_{cycle}/N_F$ before the 𝑇𝐴𝐷 is exhausted. Employing JDFU can leverage this duration to improve the availability of contents at other communities. Second, employing JDFU increases the effective caching capacity of the F-UAVs compared to F-UAVs deployed without JDFU. This is explained as follows. Without JDFU, each F-UAV follows its own trajectory. It carries only the $C_F$ most popular contents out of the $(N_A - 1)(1-\lambda) C_A$ unique contents cached in the A-UAVs on its trajectory (refer to Section 7.3.2). With JDFU, a group of F-UAVs can carry $N_F^G \cdot C_F$ of these contents instead of $C_F$, where $N_F^G$ is the number of F-UAVs traversing in a group. Therefore, the effective caching capacity of the F-UAVs increases from $C_F$ to $N_F^G \cdot C_F$, which can significantly enhance content availability.

However, this increase in effective cache size comes with the following downsides. First, the time interval during which content availability depends only on the A-UAVs increases. This is because, under JDFU, the equal spacing between groups of F-UAVs depends on the number of groups. Consider the time a group of F-UAVs $k$ takes to reach A-UAV $j$ after the previous group $k-1$ leaves the community. This time increases with $N_F^G$, and during it content availability depends only on the A-UAVs. Second, to completely fill the increased effective cache of the F-UAVs, the duplication factor 𝜆 should be lowered. A lower 𝜆 ensures enough contents in Segment-2 (the unique segment) of the A-UAVs to avoid underutilization of the F-UAV cache (refer to Section 7.3.2). A lower 𝜆 increases the effective cache utilization of the group of F-UAVs. However, it also reduces the number of most popular contents in Segment-1 (the duplicate segment) of the A-UAVs. If the 𝑇𝐴𝐷 is sufficiently high, the popular contents cached in F-UAVs due to low 𝜆 can still be accessed before the 𝑇𝐴𝐷 is exhausted. Therefore, the reduction in 𝜆 will have minimal or no effect on content availability for high 𝑇𝐴𝐷.

The pseudocode to compute $N_F^G$ and determine the JDFU configuration is as follows.

Algorithm 7.1.    JDFU Algorithm

1.  Input: Total UAVs $N_A$, $N_F$, $TAD$ and $T_{transition}$
2.  Output: JDFU configuration
3.  Initialize $T_{hover}$, F-UAV trajectory to round-robin, $N_F^G = 1$
4.  while True:
5.        compute $T_{cycle}$
6.        if $TAD > \frac{T_{cycle} \times N_F^G}{N_F} - T_{hover}$ then do
7.              Increment $N_F^G$
8.              while True:
9.                    Decrement 𝜆
10.                   if $(N_A - 1)(1-\lambda) C_A \ge N_F^G \cdot C_F$ then do
11.                         Cache $N_F^G \cdot C_F$ contents in $N_F^G$ F-UAVs
12.                         break
13.                   end if
14.             end while
15.       else if $T_{hover} > T_{hover}^{min}$ then do
16.             Decrement $T_{hover}$
17.       end if
18.       compute JDFU configuration from $N_F^G$
19. end while
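A runnable sketch that loosely follows Algorithm 7.1 is given below. It makes several assumptions that are not stated in the listing:

- Only group sizes that divide $N_F$ are tried.
- Every A-UAV is in the round-robin cycle, so $T_{cycle} = N_A(T_{hover} + T_{transition})$.
- The slack test of step 6 is applied to the candidate (next) group size, so that the Eq. (7.1) accessibility stays at 1 after regrouping.
- 𝜆 is lowered one content at a time, and only as far as needed.
- The loop stops once neither the group size nor the hover time can change; the listing leaves termination implicit.

The function name `jdfu` and its parameters are hypothetical.

```python
def jdfu(n_a, n_f, c_a, c_f, tad, t_transition, t_hover, t_hover_min,
         lam=0.9, hover_step=1.0):
    """Sketch of a JDFU configuration search. Returns group size, number of groups,
    duplication factor lambda and hover time. hover_step must be positive."""
    group = 1
    dup = round(lam * c_a)                        # duplicate segment size, lambda * C_A
    while True:
        t_cycle = n_a * (t_hover + t_transition)  # all A-UAVs in the cycle (assumption)
        # Next group size that splits the N_F F-UAVs into equal groups.
        nxt = next((g for g in range(group + 1, n_f + 1) if n_f % g == 0), None)
        if nxt is None:
            break                                 # already one group of all F-UAVs
        gap_next = t_cycle * nxt / n_f            # time between group visits after regrouping
        if tad > gap_next - t_hover:
            # Lower lambda until unique contents of the other A-UAVs fill N_F^G * C_F.
            new_dup = dup
            while new_dup > 0 and (n_a - 1) * (c_a - new_dup) < nxt * c_f:
                new_dup -= 1
            if (n_a - 1) * (c_a - new_dup) < nxt * c_f:
                break                             # even lambda = 0 cannot fill the group cache
            group, dup = nxt, new_dup
        elif t_hover - hover_step >= t_hover_min:
            t_hover -= hover_step                 # shorter hovering shrinks the cycle time
        else:
            break
    return {"group_size": group, "num_groups": n_f // group,
            "lambda": dup / c_a, "t_hover": t_hover}

print(jdfu(n_a=12, n_f=6, c_a=100, c_f=100, tad=1800,
           t_transition=60, t_hover=60, t_hover_min=30))
```

With a generous 𝑇𝐴𝐷, the sketch merges all F-UAVs into larger groups and lowers 𝜆 to keep the group cache full. With a tight 𝑇𝐴𝐷, it only shortens hovering, down to the minimum.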

The increase in content availability achieved by JDFU largely depends on the 𝑇𝐴𝐷 of the requested contents. When a content cached in F-UAVs is requested with a very high 𝑇𝐴𝐷, the F-UAVs have up to 𝑇𝐴𝐷 to reach the request-generating community. Hence, the benefit of employing JDFU increases with the 𝑇𝐴𝐷 specified in the content requests. Inter-community distances also affect the availability gain from the JDFU algorithm, since F-UAV groups can reach closely located communities before the specified 𝑇𝐴𝐷 expires. This phenomenon is elaborated later along with supporting experimental results.

7.6  Content Dissemination Performance and Experimental Results

Content availability is used as the metric to evaluate the performance of the proposed algorithm. It is defined as the probability that a requested content can be found in the UAV-aided caching paradigm within the specified 𝑇𝐴𝐷. For F-UAVs transitioning in a round-robin manner across the A-UAVs in their trajectory, the F-UAVs' accessibility within a given 𝑇𝐴𝐷 is expressed as follows.

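$$
P_{FA} =
\begin{cases}
\dfrac{N_F \,(T_{hover} + TAD)}{N_F^G N_A \,(T_{hover} + T_{transition})}, & \text{for } TAD < \left(\dfrac{N_F^G N_A}{N_F} - 1\right) T_{hover} + \dfrac{N_F^G N_A}{N_F}\, T_{transition} \\[1ex]
1, & \text{for } TAD \ge \left(\dfrac{N_F^G N_A}{N_F} - 1\right) T_{hover} + \dfrac{N_F^G N_A}{N_F}\, T_{transition}
\end{cases}
\tag{7.1}
$$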
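Eq. (7.1) can be evaluated directly, as in the sketch below. The threshold in the second case equals the gap between group visits, $N_F^G N_A (T_{hover} + T_{transition}) / N_F$, minus $T_{hover}$. At that threshold the first case evaluates to exactly 1, so the two cases join continuously. The function name `p_fa` and the example values are illustrative.

```python
def p_fa(n_f, group_size, n_a, t_hover, t_transition, tad):
    """Eq. (7.1): probability that an F-UAV group is present or arrives within TAD."""
    gap = group_size * n_a * (t_hover + t_transition) / n_f   # time between group visits
    if tad >= gap - t_hover:
        return 1.0
    return n_f * (t_hover + tad) / (group_size * n_a * (t_hover + t_transition))

print(p_fa(n_f=6, group_size=1, n_a=12, t_hover=60, t_transition=60, tad=1800))
print(p_fa(n_f=6, group_size=6, n_a=12, t_hover=60, t_transition=60, tad=600))
```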

The relative difference between the 𝑇𝐴𝐷 and the time taken by an F-UAV or a group of F-UAVs to reach an A-UAV after the previous group has left determines the accessibility of F-UAVs. Note that physical accessibility to an F-UAV does not guarantee access to the requested content. The F-UAV or group of F-UAVs can store only a limited number (i.e., $N_F^G \cdot C_F$) of unique contents. Let $P_F$ be the probability that the requested content can be found within the F-UAV or group of F-UAVs. It can be expressed as:

$$P_F = \sum_{i=C_A+1}^{C_A+C_{eff}} p_z(i) \tag{7.2}$$

where $p_z(i)$ is the Zipf-distributed popularity defined in Section 7.4. The effective cache size of the F-UAV is given as $C_{eff} = \min\{N_F^{G} \times C_F,\ (N_A - 1) \times (1 - \lambda) \times C_A\}$.

Now, let $P_A$ be the probability that the requested content can be found within the A-UAV that is local to the community from which the content request was generated. This is expressed as:

$$P_A = \sum_{i=1}^{\lambda \times C_A + (1-\lambda) \times C_A} p_z(i) \tag{7.3}$$

Combining the three probabilities above, the overall availability can be stated as:

$$P_{Avail} = P_A + P_{FA} \times P_F \tag{7.4}$$

To summarize, local contents at the A-UAVs (i.e., both duplicate and unique) and unique contents on F-UAVs that visit within the specified 𝑇𝐴𝐷 contribute to the overall availability $P_{Avail}$. All contents that are unavailable within the specified 𝑇𝐴𝐷 must be downloaded by the A-UAVs over their expensive vertical links, such as satellite Internet. Therefore, availability also serves as an indirect, inverse indicator of the content download cost in the system.
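The chain of Eqns. 7.1-7.4 can be sketched numerically as follows. This is a minimal Python sketch under stated assumptions, not the simulator used for the experiments: it assumes F-UAV groups of equal size that are evenly spaced along one round-robin cycle, cache sizes rounded to whole contents, and content ranks that follow the Zipf popularity of Section 7.4. The values $N = 2000$, $C_A = C_F = 100$, $\alpha = 0.7$, $N_A = 20$, $N_F = 10$, $T_{transition} = 10$ s, $T_{hover} = 20$ s and $\lambda = 0.75$ are taken from the experiments in this chapter, while the group size of 2 and all function names are hypothetical.

```python
def zipf_popularity(n_contents, alpha):
    """Zipf popularity p_z(i) for ranks i = 1..N (list index 0 holds rank 1)."""
    weights = [1.0 / (i ** alpha) for i in range(1, n_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def f_uav_accessibility(n_f, group_size, n_a, t_hover, t_transition, tad):
    """P_FA from Eqn. 7.1 for evenly spaced groups of group_size F-UAVs."""
    ratio = group_size * n_a / n_f  # A-UAV visits between consecutive groups
    threshold = (ratio - 1) * t_hover + ratio * t_transition
    if tad >= threshold:
        return 1.0
    return n_f * (t_hover + tad) / (group_size * n_a * (t_hover + t_transition))


def overall_availability(pop, c_a, c_f, lam, n_a, n_f, group_size,
                         t_hover, t_transition, tad):
    """P_Avail = P_A + P_FA * P_F (Eqns. 7.2-7.4)."""
    # Eqn. 7.3: the upper limit lam*C_A + (1 - lam)*C_A equals C_A.
    p_a = sum(pop[:c_a])
    # Effective F-UAV (group) cache size C_eff, rounded to whole contents.
    c_eff = int(round(min(group_size * c_f, (n_a - 1) * (1 - lam) * c_a)))
    # Eqn. 7.2: ranks C_A + 1 .. C_A + C_eff (list indices c_a .. c_a + c_eff - 1).
    p_f = sum(pop[c_a:c_a + c_eff])
    p_fa = f_uav_accessibility(n_f, group_size, n_a, t_hover, t_transition, tad)
    return p_a + p_fa * p_f


pop = zipf_popularity(2000, 0.7)
for tad in (5, 30, 60, 120):
    a = overall_availability(pop, c_a=100, c_f=100, lam=0.75, n_a=20, n_f=10,
                             group_size=2, t_hover=20, t_transition=10, tad=tad)
    print(f"TAD = {tad:>3} s -> availability = {a:.3f}")
```

Because $P_{FA}$ in Eqn. 7.1 is linear in 𝑇𝐴𝐷 until it saturates at 1, the computed availability rises with 𝑇𝐴𝐷 and stays flat once 𝑇𝐴𝐷 exceeds the group revisit threshold (100 s for these assumed values).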

Before exploring the impact of the JDFU algorithm on content availability, it is important to understand the effects of hover and transition time with respect to tolerable access delay, which are discussed next. For experimentation, dedicated modules were added to implement request generation, UAV caching, and F-UAV movement strategies. For all experiments, the number of contents is $N = 2000$, the cache sizes are $C_A = C_F = 100$, the Poisson request rate is $\mu = 1$ request/second, and the Zipf parameter is $\alpha = 0.7$.
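For reference, the request workload described above (Poisson arrivals at rate $\mu = 1$ request/second over a pool of $N = 2000$ contents with Zipf parameter $\alpha = 0.7$) could be generated as in the following Python sketch. It is illustrative only: the function name, the seeded generator, and the use of a single merged request stream are assumptions, not the thesis's simulation modules.

```python
import itertools
import random


def generate_requests(duration_s, rate, n_contents, alpha, seed=0):
    """Return (time_s, content_rank) pairs for a Poisson stream of Zipf requests."""
    rng = random.Random(seed)
    # Exponential inter-arrival times with mean 1/rate give a Poisson process.
    times = []
    t = rng.expovariate(rate)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate)
    # Cumulative Zipf weights 1/i^alpha; random.choices scales by the last
    # cumulative weight, so no explicit normalization is needed.
    cum = list(itertools.accumulate(1.0 / i ** alpha for i in range(1, n_contents + 1)))
    ranks = rng.choices(range(1, n_contents + 1), cum_weights=cum, k=len(times))
    return list(zip(times, ranks))


requests = generate_requests(duration_s=3600, rate=1.0, n_contents=2000, alpha=0.7)
print(len(requests), "requests; first five:", requests[:5])
```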

[Plot: maximum availability (in %, 0 to 100) versus UAV cache size (0 to 2000) for value-based loading + JDFU 3X3+1 and for popularity-based loading, each shown for all contents, low-𝑇𝐴𝐷 contents, and high-𝑇𝐴𝐷 contents.]

Figure 7.1.   Improvement in maximum availability of contents by loading UAVs using value of contents and deploying F-UAVs in groups

7.6.1  Impacts of Value-Based Caching and JDFU

Figure 7.1 shows the overall increase in content availability obtained with value-based content caching and joint deployment of ferrying UAVs. The improvement is measured against a baseline that uses the popularity-based caching policy at the A-UAVs and round-robin F-UAV trajectories without JDFU. Content availability is evaluated for varying UAV cache sizes.

It can be seen from Figure 7.1 that a maximum increase in availability of approximately 25% is achieved by using the value-based caching policy at the A-UAVs together with JDFU for the F-UAVs. In scenarios with multi-dimensional demand heterogeneity, the benefits of value-based caching with JDFU are attributed to several factors, including heterogeneity in the popularity sequence, the 𝑇𝐴𝐷 associated with the content requests, the popularity of low-𝑇𝐴𝐷 contents, the value of a content, and the JDFU configuration. The effects of these factors are examined individually in the following subsections.

7.6.2  Effects of Hover Time and Tolerable Access Delay 

F-UAV hover time and 𝑇𝐴𝐷 have an interdependent impact on content availability. This is shown in Figure 7.2 with $N_A = 20$, $N_F = 10$, and $T_{transition} = 10$ seconds.

The surface plot in Figure 7.2 shows that content availability is nonmonotonic when hover time and tolerable access delay are varied. The most noticeable observation is the dichotomous behavior of availability with increasing hover time for low and high 𝑇𝐴𝐷. For low 𝑇𝐴𝐷, availability increases with hover time, whereas for high 𝑇𝐴𝐷, availability decreases with longer hover time. The explanation for this behavior is as follows. First, for $TAD < T_{transition}$, while traveling from the present community i to the next community j, an F-UAV does not contribute to availability for a duration of $T_{transition} - TAD$ (refer to Figure 7.2). As a result, some of the requests generated for contents cached in F-UAVs must be downloaded over vertical links because the F-UAV is partially inaccessible. Hence, hovering over a community is more beneficial for content availability, even though the resulting increase in average availability may be unfair, since availability increases only for the community over which the F-UAV is hovering.

 
Figure 7.2.   Content availability for different 𝑇𝐴𝐷 with varying hover time 

The region to the left of the red line in Figure 7.2 shows this effect. Second, for $TAD > T_{transition}$, a longer hover time reduces the likelihood that the condition $(TAD - T_{hover}) > T_{transition}$ holds. In other words, the likelihood of exhausting the given 𝑇𝐴𝐷 before reaching the next community increases. Hovering less is therefore beneficial, as it increases the accessibility of F-UAVs at future communities in the cycle before 𝑇𝐴𝐷 expires. This behavior is captured in the region to the right of the red line in Figure 7.2. Finally, for $TAD = T_{transition}$, an F-UAV contributes to availability within 𝑇𝐴𝐷 at all times, irrespective of its hovering decision. If the F-UAV hovers at the present community i, it serves all requests generated at i, whereas moving to the next community j ensures its accessibility at j, since it reaches j within 𝑇𝐴𝐷. The red line in Figure 7.2 shows this behavior. Note that this contrasting behavior holds for a fixed transition time $T_{transition}$. With the same number of UAVs, the effect of transition time on availability for varying tolerable access delay is discussed next.
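The dichotomy around $TAD = T_{transition}$ can also be read directly from the first case of Eqn. 7.1, i.e., below its saturation threshold. Differentiating $P_{FA}$ with respect to $T_{hover}$ gives the following short derivation, offered as a supporting check on the analytical model rather than as part of the original analysis:

```latex
\frac{\partial P_{FA}}{\partial T_{hover}}
  = \frac{N_F}{N_F^{G} N_A}\cdot
    \frac{(T_{hover}+T_{transition})-(T_{hover}+TAD)}{(T_{hover}+T_{transition})^{2}}
  = \frac{N_F}{N_F^{G} N_A}\cdot
    \frac{T_{transition}-TAD}{(T_{hover}+T_{transition})^{2}}
```

The derivative is positive for $TAD < T_{transition}$, negative for $TAD > T_{transition}$, and zero at $TAD = T_{transition}$, where $P_{FA} = N_F/(N_F^{G} N_A)$ regardless of hover time; these are the three regimes described above.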


 
 
Figure 7.3.   Content availability for different 𝑇𝐴𝐷 with varying transition time 

7.6.3  Effects of Transition Time and Tolerable Access Delay 

The effect of $T_{transition}$ and 𝑇𝐴𝐷 on content availability is shown in Figure 7.3 for $T_{hover} = 20$ seconds.

There are two major observations. First, content availability decreases with increasing transition time. A high transition time reduces the accessibility of F-UAVs at the communities they visit next, which reduces availability (refer to Eqn. 7.1). Second, content availability increases with 𝑇𝐴𝐷. This is intuitive, since a larger 𝑇𝐴𝐷 gives an F-UAV more time to reach the request-generating user community. Both observations can also be verified from Eqns. 7.1 and 7.4: below its saturation threshold, $P_{FA}$ grows linearly with 𝑇𝐴𝐷 and is inversely proportional to $T_{hover} + T_{transition}$. Therefore, the maximum content availability occurs at the lowest transition time $T_{transition}$ and the highest user-specified 𝑇𝐴𝐷.

 
 
[Plot: content availability (in %) and delay (in seconds) versus JDFU configuration (No F-UAV, 10X1, 5X2, 3X3+1, and 2X5 F-UAV) for α = 0.9, μ = 0.4, δ = 50, with 𝑇𝐴𝐷 = 240 s and 300 s.]

Figure 7.4.   Availability and delay with JDFU for different 𝑇𝐴𝐷

[Plot: increase in content availability (in %) with respect to no F-UAV versus JDFU configuration (10X1, 5X2, 3X3+1, and 2X5 F-UAV) for α = 0.9 and α = 0.4, with 𝑇𝐴𝐷 = 300 s, μ = 0.4, δ = 50.]

Figure 7.5.   Increase in availability with JDFU for different 𝛼

[Plot: delay (in seconds) versus JDFU configuration (No F-UAV, 10X1, 5X2, 3X3+1, and 2X5 F-UAV) with 20 A-UAVs, for α = 0.9 and α = 0.4, with 𝑇𝐴𝐷 = 300 s, μ = 0.4, δ = 50.]

Figure 7.6.   Delay with JDFU for different 𝛼

7.6.4  Effect of Joint Deployment of Ferrying UAVs (JDFU) 

To show the benefits of JDFU, the first set of experiments is conducted with popularity parameter $\alpha = 0.9$, $TAD = 240$ and $300$ seconds, swap probability $\mu = 0.4$, and swap difference $\delta = 50$. The remaining parameters are as per Table 7.1.

Figure 7.4 shows how availability changes with F-UAV group size for the JDFU configurations (Section 7.5.2). The following inferences can be drawn from the figure. First, availability increases with group size. However, for $TAD = 240$ seconds, the increase in availability is limited for the 2 × 5 JDFU configuration with 5 F-UAVs in each group. This is because the groups of F-UAVs do not reach the next community on their trajectory before 𝑇𝐴𝐷 expires. Second, when 𝑇𝐴𝐷 increases to 300 seconds, the benefit of JDFU is retained for the 2 × 5 configuration, since each group of F-UAVs reaches the next community on its trajectory before 𝑇𝐴𝐷 expires. Third, content access delay increases with F-UAV group size under JDFU. Finally, delay also increases correspondingly with 𝑇𝐴𝐷.
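The role of group size can be illustrated with the saturation threshold of Eqn. 7.1, i.e., the smallest 𝑇𝐴𝐷 at which a group of F-UAVs is always reached in time. The Python sketch below evaluates that threshold for the equal-size splits of 10 F-UAVs over 20 A-UAVs. It assumes evenly spaced groups and hypothetical timings ($T_{hover} = 20$ s, $T_{transition} = 10$ s); the Table 7.1 settings may differ, and the unequal 3 × 3 + 1 split is not covered by this closed form.

```python
def tad_threshold(n_f, group_size, n_a, t_hover, t_transition):
    """Smallest TAD with P_FA = 1 in Eqn. 7.1 (equal, evenly spaced groups)."""
    ratio = group_size * n_a / n_f
    return (ratio - 1) * t_hover + ratio * t_transition


def equal_jdfu_configurations(n_f):
    """(number_of_groups, group_size) pairs that split n_f F-UAVs evenly."""
    return [(n_f // g, g) for g in range(1, n_f + 1) if n_f % g == 0]


for groups, size in equal_jdfu_configurations(10):
    threshold = tad_threshold(10, size, 20, t_hover=20, t_transition=10)
    print(f"{groups}x{size}: TAD threshold = {threshold:.0f} s")
```

Under these assumed timings the thresholds are 40, 100, 280, and 580 s for the 10 × 1, 5 × 2, 2 × 5, and 1 × 10 splits, so larger groups need a larger 𝑇𝐴𝐷 before their revisit gap is fully covered.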

Figures 7.5 and 7.6 show the effects of different popularity distributions on the increase in availability and on delay when employing JDFU. Figure 7.5 compares the increase in availability for $\alpha = 0.9$ and $\alpha = 0.4$ across JDFU configurations. First, for smaller F-UAV group sizes, a higher $\alpha$ yields a larger increase in content availability, because with a higher $\alpha$ the popular contents are more likely to be requested. Second, for larger F-UAV group sizes, a lower $\alpha$ yields a larger increase in availability, because the less popular contents cached in F-UAVs are then more likely to be requested, which is not the case for a high $\alpha$. Figure 7.6 shows that delay increases with a low $\alpha$, since content requests are spread more evenly across all contents cached in A-UAVs and F-UAVs. This is a consequence of the Zipf distribution (refer to Section 7.4) and the JDFU configurations. As the F-UAV group size increases, the number of contents served by F-UAVs increases as well, and because a group takes longer to reach a community, delay increases with group size. Next, the benefits of JDFU are explored by increasing the caching capacity of the UAVs.
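The popularity argument can be checked in isolation with the F-UAV term $P_F$ of Eqn. 7.2. The Python sketch below computes the Zipf probability mass held by an F-UAV group (ranks $C_A + 1$ to $C_A + C_{eff}$) for $\alpha = 0.9$ and $\alpha = 0.4$. It assumes $N = 2000$, $C_A = C_F = 100$, $N_A = 20$, and $\lambda = 0.75$, and deliberately ignores accessibility ($P_{FA}$) and local availability ($P_A$), so it illustrates only one factor behind Figure 7.5. The function names are hypothetical.

```python
def zipf_popularity(n_contents, alpha):
    """Zipf popularity p_z(i); list index 0 holds rank 1."""
    weights = [1.0 / i ** alpha for i in range(1, n_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def f_uav_mass(alpha, group_size, n=2000, c_a=100, c_f=100, n_a=20, lam=0.75):
    """P_F (Eqn. 7.2): popularity mass of ranks C_A+1 .. C_A+C_eff."""
    pop = zipf_popularity(n, alpha)
    c_eff = int(round(min(group_size * c_f, (n_a - 1) * (1 - lam) * c_a)))
    return sum(pop[c_a:c_a + c_eff])


for size in (1, 2, 3, 5):
    print(f"group size {size}: P_F(alpha=0.9) = {f_uav_mass(0.9, size):.3f}, "
          f"P_F(alpha=0.4) = {f_uav_mass(0.4, size):.3f}")
```

With these assumptions, the higher $\alpha$ places more mass on the F-UAV contents for a single F-UAV, while the lower $\alpha$ does so for groups of 5, in line with the crossover described above.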

7.6.5  Impacts of UAV Cache Size on JDFU

This experiment examines the effects of JDFU on availability as the caching capacity of the UAVs is increased. For best results, the value-based caching policy is followed at the A-UAVs. The parameters are set as follows: high $TAD = 240$ seconds, low $TAD = 5$ seconds, $\gamma = 0.95$, and a 3 × 3 + 1 JDFU configuration (three groups of 3 F-UAVs plus one additional F-UAV); the remaining parameters take the default values in Table 7.1.

 
[Plot: increase in maximum availability (in %) versus UAV cache size (200 to 2000) with value-based A-UAV loading and JDFU 3X3+1 of F-UAVs, shown as the increase in overall availability and the increases for low-𝑇𝐴𝐷 and high-𝑇𝐴𝐷 contents.]

Figure 7.7.   Increase in availability with value-based caching and JDFU as compared to popularity-based caching

Figure 7.7 shows the combined effect of JDFU and the value-based caching policy on availability, relative to the popularity-based caching policy. The observations are as follows. First, the increase in availability is maintained for all cache sizes until the A-UAV cache size equals the total number of contents in the system. As Figure 7.7 shows, once the cache size reaches 2000, the benefits of JDFU cease to exist, since an A-UAV can then cache all 2000 contents irrespective of the caching policy. Second, whereas the value-based caching policy alone favors the availability of low-𝑇𝐴𝐷 contents, combining it with JDFU yields high availability for both low- and high-𝑇𝐴𝐷 contents. Third, availability increases with cache size up to a certain size and tapers off beyond it. This can be explained as follows. With a cache size of 500, one A-UAV and a group of 3 F-UAVs can together store all 2000 contents in the content pool (4 × 500 = 2000). Beyond this point, any further increase in cache size leads to underutilization of the F-UAV cache space under JDFU. To compensate for this underutilization, the storage segmentation factor $\lambda$ is reduced (refer to Section 7.3), which reduces availability. Finally, high-𝑇𝐴𝐷 contents do not contribute to the increase in availability beyond a cache size of 1000. The reasons are twofold: the excessive reduction of $\lambda$ to compensate for the space underutilization of F-UAVs, and the replacement of high-popularity contents by very low-popularity contents with low 𝑇𝐴𝐷, whose value is raised by the value-based caching policy (see Eqn. 4.4).

7.6.6  Hovering Required to Maximize Content Availability with JDFU Algorithm 

To explore the benefits and limits of the JDFU algorithm, 10 F-UAVs are deployed in groups of $N_F^{G} = 2$, 3, and 5. For this experiment, $N_A = 20$ and $T_{transition} = 10$ seconds. Figures 7.8 to 7.11 show the impact of JDFU for varying $T_{hover}$ and 𝑇𝐴𝐷.

Figure 7.8. Content Availability Without JDFU

 
 
Figure 7.9. Content Availability with JDFU configuration 5 × 2

Figure 7.10. Content Availability with JDFU configuration 3 × 3 + 1

 
 
 
Figure 7.11. Content Availability with JDFU configuration 2 × 5

The observations are as follows. First, the maximum content availability with the JDFU algorithm is attained with the 2 × 5 configuration. Here, F-UAVs are deployed in groups of 5; since there are 10 F-UAVs in total, there are 2 such groups. Filling the cache space of 5 F-UAVs requires 500 unique contents. With 20 A-UAVs in the system, the storage segmentation factor $\lambda$ is set to 0.75 so that the number of unique contents in the system is $C_A \cdot (1 - \lambda) \cdot N_A = 100 \cdot (1 - 0.75) \cdot 20 = 500$. Therefore, F-UAVs ferry 500 contents, as opposed to 100 contents without JDFU, which increases content availability. Second, the next two JDFU configurations, viz. 3 × 3 + 1 and 5 × 2, are functionally similar to 2 × 5, except that the groups consist of 3 F-UAVs and 2 F-UAVs, respectively. The F-UAVs therefore ferry 300 and 200 contents in these configurations, fewer than the 500 contents of the 2 × 5 configuration, which explains their lower availability. The lowest maximum availability is attained without JDFU. Third, for high hover time $T_{hover}$, the JDFU algorithm fails to provide additional availability. This is due to the reduction in $P_{FA}$, which reduces content availability (see Eqns. 7.1-7.4).
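The choice of $\lambda$ above follows from matching the system's unique contents, $C_A (1-\lambda) N_A$, to the cache space of one F-UAV group, $N_F^{G} C_F$. Solving for $\lambda$ gives $\lambda = 1 - N_F^{G} C_F / (N_A C_A)$. The Python sketch below applies this rule to the three group sizes; the function name is hypothetical, and the rule is inferred from the worked example rather than stated as the thesis's tuning procedure.

```python
def segmentation_factor_for_group(group_size, c_f, n_a, c_a):
    """lambda such that N_A * (1 - lambda) * C_A unique contents fill one group.

    Solves C_A * (1 - lambda) * N_A = group_size * C_F for lambda.
    """
    lam = 1.0 - (group_size * c_f) / (n_a * c_a)
    if not 0.0 <= lam <= 1.0:
        raise ValueError("no lambda in [0, 1] matches this group cache size")
    return lam


for size in (2, 3, 5):
    lam = segmentation_factor_for_group(size, c_f=100, n_a=20, c_a=100)
    unique = 100 * (1 - lam) * 20
    print(f"group of {size}: lambda = {lam:.2f}, unique contents = {unique:.0f}")
```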

7.6.7  JDFU with Different Inter-community Distances 

For the experiments so far, the transition time, which represents the intercommunity distance, is kept fixed at $T_{transition} = 10$ seconds. Figure 7.12 shows the impact of JDFU for different inter-community distances. It considers three scenarios, namely communities located nearby, moderately apart, and far apart, and shows their effects on availability when the proposed mechanisms are applied.

Figure 7.12.   Benefits of JDFU for different intercommunity distances 

In Figure 7.12, red, black, and blue lines represent low, moderate, and high 𝑇𝐴𝐷 values, respectively. Solid and dashed lines depict shorter and longer hovering durations, respectively. The key observations from Figure 7.12 are as follows. First, employing JDFU boosts content availability for all combinations of inter-community distances and 𝑇𝐴𝐷s except very low 𝑇𝐴𝐷 values. This can be seen across Figures 7.12a-c. Second, the benefits of JDFU diminish as inter-community distances increase. This can be observed in Figures 7.12b and 7.12c, where a group size of 3 F-UAVs adds more to availability for moderately separated communities than for far-apart communities. Third, the benefit of JDFU becomes more substantial as 𝑇𝐴𝐷 increases. This can be seen in Figure 7.12c, where a group size of 3 F-UAVs produces higher content availability with $TAD = 300$ seconds than with $TAD = 150$ seconds. Finally, more hovering is beneficial in all inter-community distance scenarios except for very low 𝑇𝐴𝐷 values [92], [135] (Figures 7.12a-c).

All of these observations can be attributed to the accessibility of F-UAVs (refer to Eqn. 7.1): increasing the intercommunity distance decreases the accessibility probability $P_{FA}$, whereas increasing 𝑇𝐴𝐷 increases $P_{FA}$. An increase in the Low Availability Period (𝐿𝐴𝑃) can also be used as a measure of the reduction in F-UAV accessibility (Figure 4.5).

Note that JDFU benefits the availability of requested contents irrespective of the caching policy employed at the UAVs. However, performance can be enhanced further with a well-formulated caching policy such as value-based caching. Although the experiments are conducted for round-robin trajectories, JDFU can also be used to improve content availability with other trajectories. Therefore, joint deployment of ferrying UAVs and value-based caching can serve as general algorithmic solutions to caching decision problems in communication-challenged environments.

7.7  Summary and Conclusion

This chapter explores trajectory characterization and planning in UAV-aided networks for content dissemination in infrastructure-less systems. Cache-enabled UAVs serve communities of users in a disaster/war-stricken area by caching popular contents, in order to reduce content downloading over satellites and other expensive vertical links. A framework is adopted in which two types of UAVs, namely anchor UAVs and ferrying UAVs, are deployed. Through analytical modeling and simulation experiments, the chapter establishes a trajectory design paradigm that considers the user-specified tolerable access delay and the nature of the disaster/war-stricken region. It is shown that content availability can be maximized by appropriately choosing the hover time of ferrying UAVs at each community. The chapter also introduces a novel Joint Deployment of Ferrying UAVs (JDFU) algorithm, which leverages the user-specified tolerable access delay associated with requested content and intercommunity distances to improve content availability by deploying ferrying UAVs in groups. The system has been functionally validated, and its performance is evaluated for different scenarios, including stochastic content request generation and various ferrying UAV trajectories. The next chapter on this topic incorporates runtime, dynamic, and adaptive mechanisms to learn trajectory-aware caching policies suited to ferrying UAV trajectories. Such learning-driven caching policies can be developed on the fly for all of these design components, so that content popularities, optimal caching, and the best UAV trajectories can be learned online in time-varying disaster regions.

 
 
 
 
 
 
 
 
 
 
 
 
Chapter 8:  Top-k Multi-Armed Bandit Learning for Trajectory-Aware Caching in Swarms of Micro-UAVs

Continuing  from  the  trajectory  considerations  discussed  in  the  previous  chapter,  this 

chapter advances our understanding of how Micro-Unmanned Aerial Vehicles (Micro-UAVs) can 

be  effectively  utilized  for  content  dissemination  in  environments  devoid  of  standard 

communication  infrastructures  due  to  disasters  or  conflicts.  We  now  focus  on  enhancing  the 

adaptability of these UAV systems through trajectory-aware, adaptive caching strategies. These 

strategies are designed to dynamically respond to changing conditions and demands in disaster-

stricken areas, leveraging the mobility and flexibility of Micro-UAVs. 

[Illustration: (a) anchor UAVs with satellite links serve user communities after communication infrastructure destruction, while micro-ferrying UAVs move along MF-UAV trajectories and share information with A-UAVs via lateral links; every user community can follow a different content popularity pattern. (b) Popularity of the first 25 contents for different Zipf popularity parameter values.]

Figure 8.1. (a) Coordinated UAV system for content caching and distribution in environments without communication infrastructure; (b) Zipf Popularity Distribution

8.1  Motivation 

Effective communication during disasters is crucial for efficient relief operations and timely dissemination of vital information. Micro-UAVs offer a promising response to the disruption of traditional communication networks, since they can navigate to and serve isolated or inaccessible areas. However, the potential of these UAVs cannot be fully realized without addressing the challenges posed by their limited storage capacities and the dynamic nature of disaster environments. There is a compelling need for a content management system that not only understands the geographic and temporal aspects of content demand but also integrates the flight trajectories of the UAVs. Such a system would ensure that content delivery is both strategic and context-aware, maximizing the impact and utility of the UAVs deployed in these critical scenarios.

8.2  Design Objective 

The  primary  objective  of  this  chapter  is  to  design  a  decentralized,  trajectory-aware,  adaptive 

content  management  system  utilizing  Micro-UAVs  that  optimizes  content  delivery  to  disaster-

affected populations. The following design goals will guide the development of this system: 

a)  This chapter develops a trajectory-aware adaptive caching policy that not only responds to changes in content popularity and user demand but also incorporates UAV flight paths and operational constraints. This trajectory-aware approach ensures that caching decisions improve the overall efficiency of content dissemination via cache-enabled UAVs.

b)  Utilizing a Top-k Multi-Armed Bandit (MAB) learning approach, the system adapts to real-

time changes in content popularity and user demand. This learning is informed by shared 

data across Micro-UAVs, optimizing content availability on each UAV. 

c)  Furthermore, this chapter implements a Selective Caching Algorithm that manages the trade-off between Micro-UAV storage and accessibility by minimizing content redundancy. By ensuring that only essential content is stored and disseminated, this mechanism reduces the storage burden and improves the responsiveness of the UAVs to critical needs. It focuses on the joint geographical deployment of Micro-UAVs to manage this trade-off, ensuring that UAVs are deployed in a manner that maximizes content reach while considering their regional accessibility.


 
d)  It analyzes how adaptive caching decisions influenced by the learning algorithms affect 

quality  of  service,  particularly  in  terms  of  the  Tolerable  Access  Delay  (𝑇𝐴𝐷),  which 

measures the urgency of different types of information and community expectations. 

e)  The  proposed  mechanism  enables  the  system  to  modify  caching  decisions  in  real-time, 

based on immediate feedback from the environment and user interactions. Such real-time 

adaptability  accommodates  sudden  changes  in  content  demand  and  UAV  operational 

conditions. 

Through these objectives, this chapter aims to further develop the capabilities of Micro-UAVs in 

delivering  critical  information  under  challenging  circumstances,  ensuring  that  they  operate  not 

only as carriers of content but as smart, adaptive components of a larger disaster response strategy. 

This trajectory-aware caching model is intended to be robust yet flexible, capable of adapting to 

both the physical and informational landscapes of emergency scenarios. 

8.3  System Model

8.3.1  UAV Hierarchy 

As shown in Figure 8.1, a two-tiered UAV-assisted content dissemination system is deployed. Each community is served by a dedicated A-UAV that uses a lateral wireless connection (e.g., Wi-Fi) to communicate with users in that community. The system also introduces a set of Micro-UAVs with low power budgets for the role of ferrying (MF-UAVs), unlike the A-UAVs, which operate with much larger power budgets. MF-UAVs are mobile and possess only lateral communication links such as Wi-Fi; unlike the A-UAVs, they do not have expensive vertical communication interfaces such as satellite links. Effectively, the MF-UAVs act as content transfer agents across user communities by selectively transferring content among the A-UAVs through their lateral links.

8.3.2  Content Demand and Provisioning Model

The content popularity distribution, quality of service, and content provisioning are outlined below.

Content Popularity: Research has shown that user content request patterns often follow a Zipf distribution [91], [92], in which the popularity of a content is inversely proportional to its rank raised to the power of the Zipf parameter $\alpha$. The popularity of content ‘𝑖’ is given as:

$$p_z(i) = \frac{(1/i)^{\alpha}}{\sum_{k=1}^{N} (1/k)^{\alpha}} \tag{8.1}$$

The Zipf parameter, $\alpha$, determines the skewness of the distribution, while $N$ represents the total number of contents in the pool. The inter-request time from a user follows an exponential distribution [91], i.e., each user's requests form a Poisson process.

Tolerable Access Delay: For each requested content, the user specifies a Tolerable Access Delay 

(𝑇𝐴𝐷) [70], which serves as a quality-of-service parameter and represents the amount of time the 

requesting user can wait before the content is downloaded. 

Content Provisioning: Upon receiving a request from one of its community users, the relevant A-UAV first searches its local storage for the content. If the content is not found, the A-UAV waits for a potential delivery by a traveling MF-UAV. If no MF-UAV arrives with the requested content within the specified 𝑇𝐴𝐷, the A-UAV then downloads it through its vertical link. Since vertical links such as satellite links are expensive, caching strategies that keep requested content accessible from the UAVs can be effective in reducing content provisioning costs.
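The three-step provisioning rule can be summarized by the following Python sketch of how a single request is resolved at an A-UAV. The data structures (a set of cached content IDs and a list of MF-UAV arrival times with their cache contents) are hypothetical simplifications, and the sketch does not model vertical-link download time; it only records that the download starts when 𝑇𝐴𝐷 expires.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    source: str   # "local", "mf_uav", or "vertical"
    delay: float  # seconds from request to delivery (or to vertical download start)


def provision(content_id, local_cache, mf_arrivals, tad):
    """Resolve one request at an A-UAV using the three-step provisioning rule.

    local_cache: set of content IDs cached at the A-UAV.
    mf_arrivals: (arrival_time_s, cached_ids) pairs for MF-UAVs, with times
                 measured from the moment the request is received.
    tad: tolerable access delay of the request, in seconds.
    """
    if content_id in local_cache:          # step 1: local hit
        return Outcome("local", 0.0)
    for arrival, cached in sorted(mf_arrivals, key=lambda a: a[0]):
        if arrival > tad:                  # later MF-UAVs miss the deadline
            break
        if content_id in cached:           # step 2: an MF-UAV delivers in time
            return Outcome("mf_uav", arrival)
    return Outcome("vertical", tad)        # step 3: download once TAD expires


print(provision(7, {1, 2, 3}, [(12.0, {7}), (4.0, {7, 9})], tad=10.0))
# Outcome(source='mf_uav', delay=4.0)
```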

8.4  Caching based on Content Pre-loading at A-UAVs  

This section discusses caching policies based on content pre-loading at A-UAVs, which assume pre-assigned, static, and globally known content popularities. After examining the limitations of these caching policies, the chapter designs a runtime, dynamic, and adaptive caching mechanism based on Top-k Multi-armed Bandit learning, which is explained in Section 8.5.

8.4.1  Pre-loading Policies at Anchor UAVs (A-UAVs)

The Fully Duplicated (FD) mechanism [91] is a naive approach in which A-UAVs download content from vertical links upon request by local users. FD has major limitations, including content duplication, high vertical-link download costs, and underutilization of UAV cache space. Because all A-UAVs end up caching the same popular contents, with a cache size of $C_A$ contents per UAV the total caching capacity of the system is limited to $C_A$. Smart Exclusive Caching (SEC) [91], [92] overcomes these limitations of FD by storing a set number of unique contents in each A-UAV and sharing them among communities via traveling MF-UAVs. Assuming globally known, homogeneous content popularity across all user communities, the SEC mechanism divides the cache into two segments of size $C_{S1}$ and $C_{S2}$. Segment-1 contains the top $C_{S1} = \lambda \cdot C_A$ popular contents, cached in all A-UAVs, while Segment-2 contains $C_{S2} = (1 - \lambda) \cdot C_A$ unique contents, where $\lambda$ is the Storage Segmentation Factor. This results in $C_{S2}^{total} = N_A \cdot (1 - \lambda) \cdot C_A$ Segment-2 contents stored across all $N_A$ A-UAVs, which can be shared across all user communities via the mobile MF-UAVs. The factor $\lambda$ needs to be tuned according to network, content, and demand conditions. The total number of contents in the system under SEC is given as:

$$C_{sys} = \lambda \cdot C_A + N_A \cdot (1 - \lambda) \cdot C_A \tag{8.2}$$
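As a concrete illustration of Eqn. 8.2, the Python sketch below builds SEC cache contents for a set of A-UAVs from a popularity-ranked content list and counts the distinct contents in the system. The assignment of Segment-2 contents as consecutive rank blocks, one block per A-UAV, is an assumption made for illustration; SEC only requires that Segment-2 contents be unique across A-UAVs. The function names are hypothetical.

```python
def sec_caches(n_contents, c_a, n_a, lam):
    """Per-A-UAV content lists under Smart Exclusive Caching (ranks start at 1)."""
    seg1 = round(lam * c_a)    # C_S1: top contents duplicated in every A-UAV
    seg2 = c_a - seg1          # C_S2: unique contents per A-UAV
    shared = list(range(1, seg1 + 1))
    caches = []
    next_rank = seg1 + 1
    for _ in range(n_a):
        # Assumed assignment: each A-UAV takes the next block of seg2 ranks.
        block_end = min(next_rank + seg2, n_contents + 1)
        caches.append(shared + list(range(next_rank, block_end)))
        next_rank += seg2
    return caches


def sec_system_contents(c_a, n_a, lam):
    """Eqn. 8.2: C_sys = lambda*C_A + N_A*(1 - lambda)*C_A."""
    return lam * c_a + n_a * (1 - lam) * c_a


caches = sec_caches(n_contents=2000, c_a=100, n_a=20, lam=0.75)
distinct = set().union(*caches)
print(len(distinct), sec_system_contents(100, 20, 0.75))  # 575 575.0
```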

Popularity-Based Caching (PBC) [93] is employed when different communities have different content preferences. Considering the heterogeneous popularity sequence of a community, the PBC approach, like SEC, divides the cache space of the local A-UAV into two segments of size $C_{S1}$ and $C_{S2}$. Segment-1 caches the most popular contents, which can be exclusive to an A-UAV ($C_E$) or non-exclusive, i.e., cached across multiple A-UAVs ($C_{NE}$), such that $C_{S1} = C_E + C_{NE}$. Depending on the exclusivity of the contents in $C_{S1}$, the total number of exclusive contents across all A-UAVs is denoted $C_E^{total}$. Segment-2 is the same as in SEC. Therefore, by modifying Eqn. 8.2, the total number of contents in the system can be expressed as:

$$C_{sys} = C_{NE} + C_E^{total} + N_A \cdot (1 - \lambda) \cdot C_A \;\Rightarrow\; C_{sys} \ge \lambda \cdot C_A + N_A \cdot (1 - \lambda) \cdot C_A \tag{8.3}$$

Value-Based Caching (VBC) [93] further enhances the caching policy by storing the top-valued contents in Segment-1 of the A-UAVs, where the value of a content combines its popularity and its tolerable access delay. The value of a content ‘𝑖’ is calculated as:

$$V(i) = \kappa \upsilon^{*} \times \frac{p_z(i)}{TAD(i)} \;\Rightarrow\; V(i) = \kappa \times \frac{TAD_{min}}{p_z(1)} \times \frac{p_z(i)}{TAD(i)} \tag{8.4}$$

In this equation, $p_z(i)$ represents the content’s popularity as per the Zipf distribution, $TAD(i)$ is the content’s tolerable access delay, $\kappa$ is a scalar weight that increases as popularity decreases, and $\upsilon^{*}$ is a normalization constant. For a given Zipf (popularity) parameter $\alpha$, the normalization constant is calculated from the minimum possible 𝑇𝐴𝐷 ($TAD_{min}$) and the maximum possible popularity $p_z(1)$, i.e., $\upsilon^{*} = TAD_{min} / p_z(1)$. The value $V(i)$ is bounded in $[0, 1]$, and it increases as $p_z(i)$ increases and as $TAD(i)$ decreases. The content’s value thus provides a holistic, quantifiable measure for caching decisions.
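The following Python sketch computes content values with Eqn. 8.4 and picks the Segment-1 contents of an A-UAV by value. It is a simplified illustration: the functional form of $\kappa$ is not given here, so the sketch uses a constant weight $\kappa = 1$ (a hypothetical choice), and $TAD_{min}$ is taken as the smallest 𝑇𝐴𝐷 among the listed contents. The five-content example and its 𝑇𝐴𝐷 values are also hypothetical.

```python
def zipf_popularity(n_contents, alpha):
    """Zipf popularity p_z(i); list index 0 holds rank 1."""
    weights = [1.0 / i ** alpha for i in range(1, n_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def content_values(popularity, tad, kappa=1.0):
    """Eqn. 8.4: V(i) = kappa * (TAD_min / p_z(1)) * p_z(i) / TAD(i)."""
    upsilon = min(tad) / popularity[0]  # normalization constant upsilon*
    return [kappa * upsilon * p / t for p, t in zip(popularity, tad)]


def vbc_segment1(values, size):
    """1-based ranks of the `size` highest-valued contents (Segment-1)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return [i + 1 for i in order[:size]]


pop = zipf_popularity(5, 0.7)
tad = [300, 300, 5, 300, 5]  # hypothetical TADs in seconds for ranks 1..5
values = content_values(pop, tad)
print(vbc_segment1(values, 2))  # [3, 5]
```

Here the low-𝑇𝐴𝐷 contents ranked 3 and 5 outrank the most popular content, which mirrors how VBC favors urgent contents over merely popular ones.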

The caching policy of the micro-ferrying UAVs is the same under all of the A-UAV caching policies discussed above; MF-UAV caching is discussed further in Section 8.5. An MF-UAV ferries content across the A-UAVs it visits along its trajectory. The caching policy of the A-UAVs therefore determines the utility of the MF-UAVs, and every A-UAV should maintain sufficient contents in its cache space to maximize the utilization of MF-UAV cache space.

8.4.2  Limitations of Cache Pre-loading at A-UAVs 

   The caching policies discussed in this section rely on pre-loading content into A-UAVs, 

which  has  certain  limitations.  These  approaches  assume  a  priori  knowledge  of  the  popularity 

distribution  of  all  the  content  in  the  system,  which  can  hinder  practical  feasibility  during 

deployment.  Local  popularity  estimation  of  requested  content  within  individual  A-UAVs  can 

partially alleviate this issue, but it cannot adjust the crucial storage segmentation factor (𝜆) (see 

Section  8.4.1)  for  maximizing  availability  across  the  entire  system  of  A-UAVs  and  their 

communities. Collaborative global popularity estimation can be introduced, but it fails to capture 

locally meaningful demand heterogeneity across different communities.  

8.5  Decentralized Caching with Multi-Armed Bandit 

This section presents a plausible solution to the aforementioned shortcomings by using

Top-k Multi-Armed Bandit learning for caching decisions at the A-UAVs. This facilitates faster 

learning and is adaptive to heterogeneous user demand patterns through information sharing via 

micro-UAVs. Based on the forthcoming mechanism, the caching policy for micro-ferrying UAVs 

is also modified to leverage their ubiquity, which is discussed later.  

8.5.1  Top-k Multi-Armed Bandit Learning

Multi-Armed Bandit is a classic problem in reinforcement learning [130] and decision-making. At each round $t$, an agent chooses an arm $A_t$ out of $N$ arms, denoted by $A_1, A_2, \ldots, A_N$, and observes a reward $R_t$. Each arm $i$ has an unknown reward distribution with mean $\mu_i$ and variance $\sigma_i^2$. The agent's goal is to maximize the total expected reward $R_T$ over $T$ rounds, where $T$ is the total number of rounds (time horizon):

$$R_T = \max \sum_{t=1}^{T} \mathbb{E}[R_t] \qquad (8.5)$$

This thesis uses a variant of MAB called Top-k Multi-Armed Bandit [128]. Here, the agent has to choose $k$ arms simultaneously out of a larger set of $N$ arms, and it receives a reward for each arm in the chosen set. This contrasts with classical MAB approaches, in which only one arm is chosen. The goal of the agent is to maximize the total cumulative reward $R_T$ obtained over a finite time horizon $T$:

$$R_T = \max \sum_{t=1}^{T} \sum_{i=1}^{k} \mathbb{E}[R_{t,i}] \qquad (8.6)$$
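The Top-k objective of Eqn. 8.6 is made concrete by the toy sketch below. It plays a Top-k bandit with Bernoulli arms using $\epsilon$-greedy selection and running sample means. It then compares the learned set with the oracle set, i.e., the $k$ arms with the largest true means. The arm means, horizon and $\epsilon$ are hypothetical. This generic learner is not the multi-dimensional reward structure of Section 8.5.2.

```python
import random


def oracle_top_k(means, k):
    """Benchmark set: indices of the k arms with the largest true mean reward."""
    return set(sorted(range(len(means)), key=lambda i: means[i], reverse=True)[:k])


def epsilon_greedy_top_k(means, k, horizon, epsilon=0.1, seed=0):
    """Toy Top-k MAB: each round choose k arms and observe a Bernoulli reward for every one."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n          # pulls per arm
    estimates = [0.0] * n     # running sample-mean reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            chosen = rng.sample(range(n), k)       # explore: random k-subset
        else:                                      # exploit: k best current estimates
            chosen = sorted(range(n), key=lambda i: estimates[i], reverse=True)[:k]
        for i in chosen:                           # a reward is observed for every chosen arm
            reward = 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
            estimates[i] += (reward - estimates[i]) / counts[i]
            total_reward += reward
    learned = set(sorted(range(n), key=lambda i: estimates[i], reverse=True)[:k])
    return learned, total_reward


means = [0.1, 0.9, 0.2, 0.8, 0.3]                  # hypothetical arm means
learned, reward = epsilon_greedy_top_k(means, k=2, horizon=3000)
print(learned == oracle_top_k(means, 2))
```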

8.5.2  Caching at A-UAV using Top-k Multi-Armed Bandit 

In the scenario of UAV-caching, there is a Top-k MAB agent in each A-UAV. Here, choosing a content for caching corresponds to choosing an arm. The ‘k’ of the Top-k MAB agent corresponds to the caching capacity of the A-UAV, i.e., $k = C_A$. The agent's aim is to select $C_A$ contents out of the total pool of $N$ contents to be cached in an A-UAV such that content availability to the users is maximized.

The UAV-aided content dissemination system is the learning environment. The A-UAVs interact with it through their actions of choosing specific sets of contents to be cached, and the environment's feedback on those actions takes the form of rewards/penalties. Micro-ferrying UAVs play a crucial role in transferring information across the UAV-aided system, which helps in computing appropriate rewards/penalties, as shown in Figure 8.2. Actions are rewarded when cached contents are requested by the users and served to them within the given tolerable access delay, and penalized otherwise.

The top $C_A$ contents that accumulate the most reward from the corresponding community and other communities are chosen to be cached at an A-UAV. Note that the Top-k MAB agents in the A-UAVs are provided with no a priori information about content popularity at the corresponding user communities.

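[Figure content: community users send content requests $Req_t(i)$ to Anchor UAV ‘y’. The Top-k Multi-Armed Bandit agent on Anchor UAV ‘y’ receives two kinds of reward:

- the local reward $R_{t,i,\mathbb{L}}$ from its community;
- the ferrying and global rewards $R_{t,i,\mathbb{F}} + R_{t,i,\mathbb{G}}$, relayed by a micro-ferrying UAV from the network of A-UAVs and MF-UAVs (environment).

Actions are selected either with UCB, $A_t = \arg\max_i \left[\mathcal{Q}_t(i) + \sqrt{\alpha_u \log t / N_t(i)}\right]$, or with $\epsilon$-greedy (take action with probability $p_\epsilon$). The top $C_A$ contents with the highest estimated reward $\mathbb{E}[R_{t,i,(\cdot)}]$ are cached.]

Figure 8.2. Top-k Multi-Armed Bandit Learning for Caching Policy at A-UAVs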

A good choice for the learning decision epoch of each Top-k MAB agent is the MF-UAVs' accessibility at the corresponding community, i.e., an MF-UAV's visiting frequency. This is because the MF-UAVs carry content availability information from the communities along their trajectories. The A-UAVs' Top-k MAB agents leverage this information for learning through appropriately designed multi-dimensional rewards.

The agent learns to cache contents via a reward structure with three parts: local, ferrying, and global rewards. Let $\mathbb{L}$, $\mathbb{F}$ and $\mathbb{G}$ denote, respectively, the sets of locally requested contents, contents requested at other communities, and contents requested across all communities. These contents can be served to the users directly by an A-UAV or indirectly via the visiting MF-UAVs. A cached content is rewarded if it is served to a user within the given TAD and an increase in content availability is observed. The type of reward is determined by the set to which the cached content belongs. The expressions for the three types of rewards are as follows:

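$$R_{i,\mathbb{L}} = \mathbb{I}_{1}(i \in \mathbb{L}, \delta_{\mathbb{L}} \geq 0) + \mathbb{I}_{-1}(i \notin \mathbb{L}, \delta_{\mathbb{L}} < 0) \qquad (8.7)$$

$$R_{i,\mathbb{F}} = \frac{1}{N_A - 1} \sum_{j=1,\, j \neq \mathbb{X}}^{N_A} \mathbb{I}_{1}(i \in \mathbb{F}, \delta_{\mathbb{F}} \geq 0) + \frac{1}{N_A - 1} \sum_{j=1,\, j \neq \mathbb{X}}^{N_A} \mathbb{I}_{-1}(i \notin \mathbb{F}, \delta_{\mathbb{F}} < 0) \qquad (8.8)$$

$$R_{i,\mathbb{G}} = \frac{1}{N_A} \sum_{j=1}^{N_A} \mathbb{I}_{1}(i \in \mathbb{G}, \delta_{\mathbb{G}} \geq 0) + \frac{1}{N_A} \sum_{j=1}^{N_A} \mathbb{I}_{-1}(i \notin \mathbb{G}, \delta_{\mathbb{G}} < 0) \qquad (8.9)$$

where $\mathbb{I}_{1}(A) = 1$ if $A$ is true and $0$ otherwise, and $\mathbb{I}_{-1}(A) = -1$ if $A$ is true and $0$ otherwise (the penalty term).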

The above equations compute the reward according to the increase in availability due to content ‘𝑖’ cached at A-UAV ‘𝕏’. Here, $R_{i,\mathbb{L}}$, $R_{i,\mathbb{F}}$ and $R_{i,\mathbb{G}}$ are the local, ferrying and global rewards, respectively. The terms $\delta_{\mathbb{L}}$, $\delta_{\mathbb{F}}$ and $\delta_{\mathbb{G}}$ correspond, respectively, to the increase in local availability, ferried content availability and global availability.

Each type of reward is contingent upon the condition in the indicator function $\mathbb{I}_{1/-1}(\cdot)$. The first terms in Eqns. 8.7, 8.8 and 8.9 represent the reward accumulated by caching content ‘𝑖’ at A-UAV ‘𝕏’. The second terms are the penalties associated with the adverse condition. Note that $R_{i,\mathbb{F}}$ and $R_{i,\mathbb{G}}$ are higher if content ‘𝑖’ is requested and served at more communities.
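The paired indicator terms of Eqns. 8.7-8.9 are sketched below. The sketch makes one interpretive assumption that is ours, not the source's: the summand for community $j$ checks membership in that community's requested-and-served set and uses that community's availability change. All sets and $\delta$ values are hypothetical.

```python
def indicator_reward(in_set, delta):
    """Paired indicator terms of Eqns. 8.7-8.9.

    +1 (I_1) if the content is in the requested set and availability did not drop,
    -1 (I_-1) if it is not in the set and availability dropped, 0 otherwise.
    """
    if in_set and delta >= 0:
        return 1.0
    if not in_set and delta < 0:
        return -1.0
    return 0.0


def local_reward(i, local_requests, delta_local):
    """Eqn. 8.7 at the A-UAV's own community."""
    return indicator_reward(i in local_requests, delta_local)


def ferrying_reward(i, x, requests, deltas):
    """Eqn. 8.8: average over the N_A - 1 communities other than A-UAV x."""
    others = [j for j in range(len(requests)) if j != x]
    return sum(indicator_reward(i in requests[j], deltas[j]) for j in others) / len(others)


def global_reward(i, requests, deltas):
    """Eqn. 8.9: average over all N_A communities."""
    n_a = len(requests)
    return sum(indicator_reward(i in requests[j], deltas[j]) for j in range(n_a)) / n_a


# Hypothetical snapshot for N_A = 4 communities, evaluated at A-UAV x = 0
requests = [{1, 2}, {2, 3}, {2}, {4}]      # contents requested and served per community
deltas = [0.02, 0.01, -0.01, 0.0]          # change in availability per community
print(local_reward(2, requests[0], deltas[0]),   # local 1.0
      ferrying_reward(2, 0, requests, deltas),   # ferrying 1/3
      global_reward(2, requests, deltas))        # global 0.5
```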

Learning is achieved using a tabular method where a Q-table is maintained for all contents 

in the A-UAVs. The value corresponding to each content is called a Q-value or action-value [136]. 

The  agent  updates  the  Q-value  for  a  content  at  every  learning  epoch  according  to  the  multi-

dimensional  rewards  in  Eqns.  8.7-8.9  from  the  interaction  with  the  environment  (UAV-aided 

content  dissemination  system)  and  learns  the  best  actions  (contents  cached).  The  recursive 

expression which explains Q-value update for a content ‘𝑖’ at A-UAV ‘𝕏’ is given as follows: 

$$\mathcal{Q}_{t+1}(i) = (1 - \alpha)\,\mathcal{Q}_t(i) + \alpha \left[ R_{t,i,\mathbb{L}} + \mathbb{I}_{1}(\delta) \left( R_{t,i,\mathbb{F}} + R_{t,i,\mathbb{G}} \right) \right] \qquad (8.10)$$

Here:

- $\mathcal{Q}_t(i)$ is the Q-value of content ‘𝑖’ at the $t^{th}$ epoch.
- $R_{t,i,(\cdot)}$ is the respective reward received by caching content ‘𝑖’.
- $\delta$ is the condition for the indicator function $\mathbb{I}_{1}(\delta)$, which is 1 if micro-ferrying UAVs are present in the communication range of A-UAV ‘𝕏’ and 0 otherwise.
- $\alpha$ is a hyper-parameter that controls the learning rate.

The Q-values for all contents are initialized to zero so that a Top-k MAB agent has no a priori information. This also gives all contents equal importance in the initial caching decisions. As learning progresses, the Q-values improve, and the contents with the highest Q-values are cached so as to maximize the accumulated reward. This improves the caching policy and thus increases content availability.
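One step of Eqn. 8.10 for a single content can be written as below. The gate implements $\mathbb{I}_{1}(\delta)$: ferrying and global rewards count only when an MF-UAV is in range. The constant reward values used in the loop are hypothetical. They merely show how a zero-initialized Q-value approaches its target geometrically under a constant learning rate.

```python
def q_update(q, alpha, r_local, r_ferry, r_global, mf_uav_in_range):
    """One step of Eqn. 8.10 for a single content at A-UAV X."""
    gate = 1.0 if mf_uav_in_range else 0.0          # I_1(delta)
    target = r_local + gate * (r_ferry + r_global)  # multi-dimensional reward
    return (1.0 - alpha) * q + alpha * target


# Zero initialization (no a priori popularity information), then repeated updates
# with a hypothetical constant reward: Q approaches the target 2.0 geometrically.
q = 0.0
for _ in range(50):
    q = q_update(q, alpha=0.1, r_local=1.0, r_ferry=0.5, r_global=0.5, mf_uav_in_range=True)
print(round(q, 4))  # 1.9897
```

After $n$ such updates, $\mathcal{Q}_n = 2\,(1 - 0.9^{n})$. With no MF-UAV in range, the same call would only track the local reward.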

Note that the Top-k MAB agent may have to sample a very large number, i.e., $\binom{N}{k}$, of content combinations for caching. Consequently, the reward for each individual content combination is estimated only infrequently, after long intervals. This can lead to weak estimates of the reward distribution as the global content population size $N$ increases. The issue is handled by empirically selecting $\epsilon$ and its decay rate in the $\epsilon$-greedy action selection policy [137].

To reduce the dependence of the caching policy on the choice of $\epsilon$, an Upper Confidence Bound (UCB) strategy is used [137]. The Top-k MAB agent maintains an upper confidence bound on the expected reward of each content. At each epoch it selects the set of $C_A$ contents with the highest UCB:

$$\mathcal{U}_t(i) = \mathcal{Q}_t(i) + \sqrt{\frac{\alpha_u \log(t)}{N_t(i)}} \qquad (8.11)$$

Here:

- $\mathcal{U}_t(i)$ is the UCB of content ‘𝑖’ at epoch ‘𝑡’.
- $\mathcal{Q}_t(i)$ is the updated Q-value at epoch ‘𝑡’.
- $\alpha_u$ is a hyperparameter that controls the degree of exploration.
- $N_t(i)$ is the number of times content ‘𝑖’ has been requested up to epoch ‘𝑡’.

The first term represents the reward estimate, and the second term captures the uncertainty in that estimate. UCB therefore favors contents that have a high potential for reward but have not been requested frequently. This promotes exploration without an externally imposed exploration parameter such as $\epsilon$. In this chapter, $\mathcal{U}_t(i)$ is used in place of $\mathcal{Q}_t(i)$ when deciding whether to cache content ‘𝑖’, as shown in Steps 7 to 14 of Algorithm 8.1.
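Eqn. 8.11 and the Top-$C_A$ selection can be sketched as below. One convention is our assumption rather than the source's: a content with $N_t(i) = 0$ receives an infinite score, which avoids dividing by zero and guarantees that untried contents are evaluated first. The Q-values and counts are hypothetical, and $\alpha_u = 2$ matches Section 8.6.1.

```python
import math


def ucb_scores(q_values, counts, t, alpha_u=2.0):
    """Eqn. 8.11: U_t(i) = Q_t(i) + sqrt(alpha_u * log(t) / N_t(i))."""
    scores = []
    for q, n in zip(q_values, counts):
        if n == 0:
            scores.append(math.inf)  # assumption: contents with no observations are tried first
        else:
            scores.append(q + math.sqrt(alpha_u * math.log(t) / n))
    return scores


def select_top_k(scores, k):
    """Indices of the k largest scores (the C_A contents to cache)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


q_values = [0.9, 0.5, 0.4]   # hypothetical Q-values
counts = [100, 2, 0]         # hypothetical N_t(i)
scores = ucb_scores(q_values, counts, t=10)
print(select_top_k(scores, 2))  # [2, 1]
```

Content 0 has the highest Q-value but is well sampled, so its exploration bonus is small (about 0.21). Content 1 has been seen only twice, so its bonus (about 1.52) lifts it into the cache.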

The following pseudocode explains the caching policy at an A-UAV with a Top-k MAB agent.

Algorithm 8.1 Caching policy at an A-UAV with Top-k MAB Learning

1.  Initialization:
    a.  N: Total contents in the system
    b.  C_A: Caching capacity of an A-UAV
    c.  𝒰: Q-table with UCB, one entry for each of the N contents, initialized with 0's
    d.  α: Learning rate for Q-table update
    e.  α_u: Degree of exploration (in UCB)
2.  Load A-UAV's cache with C_A randomly chosen contents.
3.  while True:
4.      Check for learning epoch at A-UAV, i.e., the t-th epoch
5.      if True then do
6.          for each content i in the A-UAV cache (C_A contents) do
7.              Get reward R_{t,i,(·)}      \\ according to Eqns. 8.7-8.9
8.              Update 𝒰(i)                 \\ from Eqns. 8.10 and 8.11
9.          end for
10.         value = copy(𝒰)                \\ make a copy of UCB values
            \\ Reload contents (select arms)
11.         for c = 1 to C_A do
12.             c_max = argmax(value)
13.             Load c_max to A-UAV
14.             Set value[c_max] = −∞
15.         end for
16.     end if
17. end while
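Algorithm 8.1 can be exercised end to end with the toy Python sketch below. It illustrates the loop under stated assumptions and is not the simulator behind the results in Section 8.6. The assumptions are:

- The environment is hypothetical. It returns the expected reward $2p_i - 1$ of a content with request probability $p_i$ (+1 if requested, -1 penalty otherwise) instead of the multi-dimensional reward of Eqns. 8.7-8.9. Only the local term of Eqn. 8.10 is active.
- $N_t(i)$ counts the epochs in which content $i$ was evaluated while cached.
- Untried contents receive an infinite UCB.

```python
import math
import random


def run_top_k_mab(request_prob, cache_size, epochs, alpha=0.1, alpha_u=2.0, seed=0):
    """Toy run of Algorithm 8.1 at a single A-UAV.

    request_prob -- hypothetical per-content request probabilities (the environment)
    cache_size   -- C_A, the number of contents the A-UAV can cache (k of Top-k)
    """
    rng = random.Random(seed)
    n = len(request_prob)
    q = [0.0] * n                               # Q-table over all N contents, zero-initialized
    counts = [0] * n                            # N_t(i): epochs in which content i was cached
    cache = rng.sample(range(n), cache_size)    # Step 2: random initial cache
    for t in range(1, epochs + 1):              # Steps 3-5: one pass per learning epoch
        for i in cache:                         # Steps 6-9: reward and update cached contents
            reward = 2.0 * request_prob[i] - 1.0          # hypothetical expected reward
            q[i] = (1.0 - alpha) * q[i] + alpha * reward  # Eqn. 8.10 with I_1(delta) = 0
            counts[i] += 1
        # Step 10: copy of UCB values (Eqn. 8.11); untried contents get +inf
        value = [q[i] + math.sqrt(alpha_u * math.log(t) / counts[i]) if counts[i] else math.inf
                 for i in range(n)]
        cache = []
        for _ in range(cache_size):             # Steps 11-15: repeated argmax with masking
            c_max = max(range(n), key=value.__getitem__)
            cache.append(c_max)                 # Step 13: load c_max to the A-UAV
            value[c_max] = -math.inf            # Step 14: exclude it from the next argmax
    return cache, q, counts


probs = [0.9, 0.7, 0.3, 0.2, 0.1, 0.05]         # hypothetical request probabilities
cache, q, counts = run_top_k_mab(probs, cache_size=2, epochs=500)
greedy = sorted(range(len(q)), key=q.__getitem__, reverse=True)[:2]
print(sorted(greedy))  # [0, 1]
```

The repeated argmax with $-\infty$ masking (Steps 11 to 15) is equivalent to taking the $C_A$ largest UCB values. It is kept in that form here to mirror the pseudocode.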

8.5.3  Proof of convergence

Over a sufficiently long time horizon, the Top-k MAB agent at an A-UAV converges to a caching policy that asymptotically approaches the benchmark caching policy. The argument for convergence rests on the intrinsic regret-minimizing characteristics of MAB [138], as shown below.

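$$C_A = \{\, i \mid i \in N,\ 1 \leq i \leq k \,\} = \underset{k}{\arg\min}\, \big( Regret(T) \big) = \underset{k}{\arg\min} \left[ \sum_{t=1}^{T} \left( \max_{k} \sum_{i=1}^{k} R_{t,i^*} - \sum_{i=1}^{k} R_{t,i} \right) \right] \qquad (8.12)$$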

where:

- $T$ is the total number of epochs (time horizon);
- $k$ is the number of contents cached at each epoch;
- $i^*$ represents the optimal caching action;
- $i$ is the caching action selected by the Top-k MAB agent at the $t^{th}$ epoch.

Eqn. 8.12 expresses the difference between the reward obtained by caching with the benchmark policy and the reward obtained by the algorithm. Post-convergence, the instantaneous regret should be minimal, which is demonstrated experimentally in this chapter. Ideally, for a perfectly designed reward structure, the average regret should vanish asymptotically, i.e., $\lim_{T \to \infty} Regret(T)/T = 0$ [129].
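The behavior of Eqn. 8.12 can be illustrated with the short sketch below, which tracks cumulative regret over epochs. It computes the expected-reward version (pseudo-regret): the benchmark per epoch is the sum of the $k$ largest means. The means and the chosen sets are hypothetical.

```python
def cumulative_regret(means, chosen_sets):
    """Cumulative (pseudo-)regret of Eqn. 8.12 using expected rewards."""
    k = len(chosen_sets[0])
    best = sum(sorted(means, reverse=True)[:k])   # benchmark: oracle top-k reward per epoch
    total, history = 0.0, []
    for chosen in chosen_sets:
        total += best - sum(means[i] for i in chosen)
        history.append(total)
    return history


means = [0.5, 0.25, 0.125, 0.0]                   # hypothetical expected rewards
chosen = [[2, 3], [1, 2], [0, 2], [0, 1], [0, 1]]  # a learner that settles on {0, 1}
history = cumulative_regret(means, chosen)
print(history)                                    # [0.625, 1.0, 1.125, 1.125, 1.125]
print(history[-1] / len(history))                 # average regret Regret(T)/T
```

Once the learner keeps selecting the oracle set, the cumulative regret stops growing and $Regret(T)/T$ decays toward zero. This is the post-convergence behavior described by Eqn. 8.15.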

The convergence of the estimated rewards (Q-values) to the true values (expected rewards) in a MAB setup, including Top-k MAB scenarios, can be analyzed using the Law of Large Numbers (LLN) [140] and concepts of stochastic approximation. For simplicity, this work first considers the proof for a single arm and then extends the idea to all ‘𝑘’ arms in the Top-k selection. According to the weak law of large numbers [141], the estimated value of content ‘𝑖’ will lie within an arbitrarily small offset ‘$\epsilon_i$’ of its true value, with probability approaching one. This is expressed as follows:

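$$\left| \lim_{T \to \infty} \mathcal{Q}_{t+1}(i) - \mu_i^* \right| < \epsilon_i \;\Rightarrow\; \left| \lim_{T \to \infty} \frac{1}{n} \sum_{t=1}^{n} \left[ R_{t,i,\mathbb{L}} + \mathbb{I}_{1}(\delta) \left( R_{t,i,\mathbb{F}} + R_{t,i,\mathbb{G}} \right) \right] - \mu_i^* \right| < \epsilon_i \qquad (8.13)$$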

Here, a single content/arm ‘𝑖’ has a true value of $\mu_i^*$, and $\mathcal{Q}_{t+1}(i)$ is the estimated reward (Q-value) of content ‘𝑖’ after it has been selected ‘𝑛’ times. The reward is taken from the second term (weighted reward) of Eqn. 8.10.

For convergence, the weight ‘$\alpha$’ is chosen empirically so that it satisfies the Robbins-Monro stochastic approximation conditions [139] for a non-constant ‘$\alpha$’, namely $\sum_{n} \alpha_n(i) = \infty$ and $\sum_{n} \alpha_n(i)^2 < \infty$. The weight ‘$\alpha$’ plays the role of ‘$1/n$’ in Eqn. 8.13. With $\alpha_n(i) = 1/n$, the recursive update of Eqn. 8.10 reduces to the sample mean of the first $n$ rewards.

Extending the concept to all top ‘𝑘’ contents, Eqn. 8.13 can be modified using Eqn. 8.6:

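$$\left| \lim_{T \to \infty} \frac{1}{n} \sum_{t=1}^{n} \left( \sum_{i=1}^{k} \mathcal{Q}_{t+1}(i) \right) - \sum_{i=1}^{k} \mu_i^* \right| < \sum_{i=1}^{k} \epsilon_i \;\Rightarrow\; \left| \lim_{T \to \infty} \frac{1}{n} \sum_{t=1}^{n} \left( \sum_{i=1}^{k} \left[ R_{t,i,\mathbb{L}} + \mathbb{I}_{1}(\delta) \left( R_{t,i,\mathbb{F}} + R_{t,i,\mathbb{G}} \right) \right] \right) - \sum_{i=1}^{k} \mu_i^* \right| < \sum_{i=1}^{k} \epsilon_i \qquad (8.14)$$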

The convergence proof for each of the top ‘𝑘’ contents individually follows the same logic as for a single content, provided each content is sampled infinitely often. That is, each content, including the top ‘𝑘’ contents, must be selected infinitely often as the total number of selections $T \to \infty$. In practice, this requirement is met by exploration strategies (such as $\epsilon$-greedy or UCB) that ensure all arms are explored sufficiently over time.
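The role of the Robbins-Monro conditions can be illustrated with the sketch below. It applies the recursive update of Eqn. 8.10 to a stream of rewards from a single hypothetical Bernoulli arm with mean 0.3, comparing two step-size schedules:

- $\alpha_n = 1/n$ satisfies both conditions, and the update reproduces the sample mean exactly.
- A constant $\alpha = 0.1$ violates $\sum_n \alpha_n^2 < \infty$.

```python
import random


def recursive_estimate(rewards, step_size):
    """Apply Q <- (1 - a_n) Q + a_n r_n, written as Q += a_n (r_n - Q)."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += step_size(n) * (r - q)
    return q


rng = random.Random(1)
rewards = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(20000)]  # hypothetical arm, mean 0.3

q_rm = recursive_estimate(rewards, lambda n: 1.0 / n)   # sum a_n = inf, sum a_n^2 < inf
q_const = recursive_estimate(rewards, lambda n: 0.1)    # constant a: sum a_n^2 diverges
sample_mean = sum(rewards) / len(rewards)
print(abs(q_rm - sample_mean) < 1e-9)                   # True: 1/n steps give the sample mean
print(round(q_rm, 3), round(q_const, 3))
```

The constant-step estimate keeps fluctuating around 0.3, because every new reward carries weight 0.1. A constant $\alpha$ therefore tracks the mean rather than converging to it.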

Assume the Top-k MAB based caching policy succeeds, so that the ideal set of contents is cached at the A-UAVs, i.e., $C_A = \{\, i^* \mid i^* \in N,\ 1 \leq i^* \leq k \,\}$. For this caching decision, $\sum_{i=1}^{k} \epsilon_i = 0$ according to Eqn. 8.14. Therefore, the instantaneous regret post-convergence can be derived from Eqns. 8.12 and 8.14 as follows:

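$$\max_{k} \sum_{i=1}^{k} \left[ R_{t,i^*,\mathbb{L}} + \mathbb{I}_{1}(\delta) \left( R_{t,i^*,\mathbb{F}} + R_{t,i^*,\mathbb{G}} \right) \right] - \sum_{i=1}^{k} \left[ R_{t,i,\mathbb{L}} + \mathbb{I}_{1}(\delta) \left( R_{t,i,\mathbb{F}} + R_{t,i,\mathbb{G}} \right) \right] \approx 0 \qquad (8.15)$$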

Evidence of convergence supporting the above expression is shown in Figure 8.7, where near-optimal contents cached at the A-UAVs lead to $\sum_{i=1}^{k} \epsilon_i \approx 0$. Under the learnt caching policy, the cached contents can boost content availability at their respective communities as well as at other distant communities via MF-UAVs.

8.5.4  Selective Caching at Micro-Ferrying UAVs (MF-UAVs)

Ideally, the purpose of the MF-UAVs is to ferry around a subset of the $C_{E}^{total} + N_A \cdot (1 - \lambda) \cdot C_A$ contents stored across the $N_A$ A-UAVs (see Section 8.4). Each MF-UAV has limited caching space ($C_{MF}$). Its caching policy should therefore be determined by its trajectory, the learnt caching policy at the A-UAVs, content request patterns, and the TADs associated with the contents to be cached.

[Figure content: an A-UAV with MF-UAV1 to MF-UAV8 in transit ($T_{transit}$). Each MF-UAV caches 25 contents drawn from a different content-index range: 86-273, 94-344, 118-370, 175-442, 197-405, 276-588, 363-623 and 388-838.]

Figure 8.3. Algorithmic selection of cached contents at MF-UAVs in conjunction with Top-k Multi-Armed Bandit learning at A-UAV
 
MF-UAV caching policy is explained in the pseudocode below. 

Algorithm 8.2 MF-UAV Caching Algorithm with Top-k MAB learning-based caching policy at A-UAVs

1.  Input: Total A-UAVs in its trajectory, TAD, next A-UAV ‘𝑥’, present A-UAV ‘𝑥 − 1’
2.  Output: C_MF contents for MF-UAV ‘𝑦’
3.  Caching at A-UAVs using Top-k MAB policy (Algorithm 8.1)
4.  while True:
5.      if MF-UAV leaving for next A-UAV ‘𝑥’ then do
            // Contents that are not in the future visiting A-UAV
6.          Update ferrying content knowledge
            // Function call from the present A-UAV ‘𝑥 − 1’
7.          Call content-wise_TAD ( )
            // Present A-UAV sends MF-UAV visiting frequency
8.          Call MF-UAV_visiting_frequency ( )
            // Check what content the last MF-UAV ferried
9.          Call Check_previous_MF-UAV_roster ( )
                Return roster contents with respective TADs
            // Compute request interval for last MF-UAV roster
10.         Calculate least popular content's request interval
11.         Check if request time is less than its TAD and MF-UAV visiting duration
12.         if True then do
13.             Cache same roster
14.         else
15.             Cache next best roster
16.         end if
17.         Check if other MF-UAVs are flying with MF-UAV ‘𝑦’
18.         for l = 0 to length(MF-UAVs flying together) do
19.             for k = 0 to length(A-UAV ‘𝑥’ cache C_A^x) do
20.                 Check if k in C_MF cache space of MF-UAV ‘𝑦’
21.                 if True then do
22.                     Replace ‘k’ with highest value content from C_A^{x−1}
                        not cached in MF-UAV ‘𝑦’ and A-UAV ‘𝑥’
23.                 end if
24.             end for
25.             Cache next best roster
26.         end for
27.     end if
28.     Update next A-UAV ‘𝑥’, present A-UAV ‘𝑥 − 1’
29. end while

The role of MF-UAVs is to ferry contents from previously visited A-UAVs to the next A-UAV to be visited, so that this A-UAV benefits from contents cached at other A-UAVs. Algorithm 8.2 describes this process in detail, and Figure 8.3 shows the impact of this collaborative algorithm.

Consider a situation in which an MF-UAV ‘𝑦’ is ready to leave A-UAV ‘𝑥 − 1’. Before caching contents, it needs the following information from A-UAV ‘𝑥 − 1’:

1) which contents are eligible for ferrying;
2) the MF-UAVs' visiting frequency;
3) which roster of ferrying contents the last MF-UAV ferried, where a roster is a grouping of contents based on their popularity or value;
4) whether the next roster's contents are likely to be requested within the given TAD;
5) whether MF-UAVs are flying in close proximity to each other.

Based on this information, MF-UAV ‘𝑦’ selectively caches contents while keeping its cache diverse with respect to the contents cached by other MF-UAVs in its proximity. That is, MF-UAVs that fly close to each other or in groups ferry contents from consecutive rosters. A roster has the same size as an MF-UAV's cache. Therefore, if MF-UAVs fly in groups of size $N_{MF}^{g}$, the group caches $N_{MF}^{g} \times C_{MF}$ contents. Such a selective caching policy at MF-UAVs maximizes content availability by avoiding redundant cache duplication.
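The roster logic can be sketched as below. The sketch is a simplification of Algorithm 8.2 under assumptions of ours:

- Rosters are consecutive slices of $C_{MF}$ contents from a popularity- or value-ranked list.
- Step 11 is read as "the least popular content's request interval is shorter than both its TAD and the MF-UAV visiting interval".
- The roster index wraps around after the last roster.

The helper names and all numbers are hypothetical.

```python
def build_rosters(ranked_contents, roster_size):
    """Split contents ranked by popularity/value into consecutive rosters of C_MF contents."""
    return [ranked_contents[s:s + roster_size] for s in range(0, len(ranked_contents), roster_size)]


def next_roster(prev_index, least_popular_interval, tad, visit_interval, n_rosters):
    """Simplified Steps 10-16 of Algorithm 8.2.

    Keep the previous roster if its least popular content is expected to be requested
    before both its TAD and the MF-UAV visiting interval expire; otherwise move on.
    """
    if least_popular_interval < min(tad, visit_interval):
        return prev_index
    return (prev_index + 1) % n_rosters   # assumption: wrap around after the last roster


def group_rosters(start_index, group_size, n_rosters):
    """MF-UAVs flying together take consecutive rosters so that no content is duplicated."""
    return [(start_index + g) % n_rosters for g in range(group_size)]


rosters = build_rosters(list(range(1, 101)), roster_size=25)   # 100 ferry-eligible contents
idx = next_roster(0, least_popular_interval=500, tad=600, visit_interval=400, n_rosters=len(rosters))
group = group_rosters(idx, group_size=3, n_rosters=len(rosters))
cached = [c for r in group for c in rosters[r]]
print(idx, group, len(set(cached)))   # 1 [1, 2, 3] 75
```

A group of three MF-UAVs thus holds $3 \times 25 = 75$ distinct contents, matching $N_{MF}^{g} \times C_{MF}$.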

8.6  Experimental Results and Content Dissemination Performance 

Simulation experiments are performed to analyze the performance of the proposed Top-k MAB learning-based caching mechanism and of selective caching at the micro-ferrying UAVs. An event-driven simulator generates content requests with exponentially distributed inter-request intervals, following a Zipf popularity distribution (refer to Eqn. 8.1). To capture heterogeneity in the content popularity sequence at different communities, contents are swapped with a pre-decided probability [142]. The difference between the resulting sequences is measured using the Smith-Waterman Distance [125]. Default experimental parameters for the proposed Top-k MAB learning-based caching and for the cache pre-loading policies are listed in Table 8.1.

Table 8.1. Default Values for Model Parameters

#    Variable                                                              Default Value
1    Total number of contents, $C$                                         2000
2    Number of A-UAVs, $N_A$                                               4
3    Number of MF-UAVs, $N_{MF}$                                           8
4    A-UAV's cache space (as number of contents), $C_A$                    200
5    MF-UAV's cache space (as number of contents), $C_{MF}$                25
6    Poisson request rate parameter, $\mu$ (requests/s)                    1
7    Hover rate of MF-UAV, $R_{Hover} = T_{Hover}/T_{Trajectory}$          1/6
8    Transit rate of MF-UAV, $R_{Transit} = T_{Transit}/T_{Trajectory}$    1/12
9    Zipf parameter (popularity), $\alpha$                                 0.4
10   Micro-ferrying UAV trajectory                                         Round-robin

The performance of the proposed mechanism is evaluated via the following metrics.

Content Availability ($P_{avail}$): Defined as the ratio between cache hits and generated requests for a given tolerable access delay. Cache hits are requests served to users from the contents cached in the UAV-aided caching system, i.e., without a download. Content availability therefore also indirectly indicates the content download cost of the system.

Cache Distribution Optimality (CDO): This measures the optimality of the learnt caching policy in terms of the caching sequence. CDO is expressed as the Jaro-Winkler Similarity (JWS) [143] between the content sequence from the learnt caching policy and the content sequence according to cache pre-loading. JWS is computed from three quantities: the number of matches, the number of transpositions required within the matches, and the similarity of the prefixes of both sequences. It is a normalized similarity measure where 1 represents optimal caching and 0 means non-optimal caching.

Access Delay ($AD$): The performance of the Top-k MAB model and of the selective caching policy for micro-ferrying UAVs is also evaluated through the access delay. This is the end-to-end delay between the generation of a content request and its provisioning from the contents cached in the UAVs. This chapter reports the epoch-wise average access delay to show how the caching policy improves as learning progresses.
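The three metrics can be computed as in the sketch below. Jaro-Winkler is implemented in its standard form: matching window $\max(|s|,|t|)/2 - 1$, half-counted transpositions, prefix scale $p = 0.1$ and prefix length capped at 4. The exact parameters used with [143] may differ. The caching sequences, hit counts and delays in the example are hypothetical.

```python
def content_availability(hits_within_tad, total_requests):
    """P_avail: fraction of generated requests served from UAV caches within their TAD."""
    return hits_within_tad / total_requests if total_requests else 0.0


def mean_access_delay(delays):
    """Epoch-wise average access delay over served requests (seconds)."""
    return sum(delays) / len(delays) if delays else 0.0


def jaro(s, t):
    """Jaro similarity between two sequences (e.g., lists of cached content IDs)."""
    if not s and not t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, a in enumerate(s):
        for j in range(max(0, i - window), min(i + window + 1, len(t))):
            if not t_hit[j] and t[j] == a:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_m = [a for a, h in zip(s, s_hit) if h]
    t_m = [b for b, h in zip(t, t_hit) if h]
    transpositions = sum(a != b for a, b in zip(s_m, t_m)) / 2
    return (matches / len(s) + matches / len(t) + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Jaro-Winkler similarity: Jaro boosted by the length of the common prefix."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


benchmark = [1, 2, 3, 4, 5, 6]      # hypothetical benchmark caching sequence
learnt = [1, 2, 3, 5, 4, 6]         # hypothetical learnt sequence with one swap
print(round(jaro_winkler(learnt, benchmark), 4))   # 0.9611
print(content_availability(870, 1000), mean_access_delay([120.0, 300.0, 180.0]))  # 0.87 200.0
```

A single adjacent swap costs one transposition. The shared three-item prefix then pulls the score up from the Jaro value $17/18$ to about 0.961, which is why JWS rewards learnt sequences that agree with the benchmark on the most popular contents.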

[Figure annotation: about 5% increase in content availability with Top-k MAB; here, $\eta_{Availability} = \text{Content Availability} / \text{Benchmark Availability}$.]

Figure 8.4. Increase in Content Availability with Top-k MAB and Selective Caching Policy

8.6.1  Effect of Exploration Strategies on Learnt Caching Policy

To assess the viability of the proposed Top-k MAB learning-based caching policy under demand heterogeneity, two types of content popularity sequences are used, with adjacent communities having different popularity sequences. For the UCB exploration strategy, the degree of exploration is set to $\alpha_u = 2$. To show the effectiveness of selective caching at micro-ferrying UAVs (MF-UAVs), the TAD Ratio $R_{TAD}$ for contents {51-75} is set lower than the default, to $R_{TAD} = 1/8$. TADs are represented as a ratio with respect to the trajectory time ($T_{Trajectory}$) to keep the proposed algorithms general.

Figures 8.4 and 8.5 show the convergence behavior of the learnt caching policy, with the Top-k MAB model at the A-UAVs and selective caching at the MF-UAVs. Convergence is shown in terms of the content availability achieved by the learnt caching policy.

[Figure annotations: about 9% increase in content availability with Top-k MAB and Selective Caching. With lower $R_{TAD}$ for some contents, the selective caching policy modifies the micro-ferrying UAV caching roster accordingly.]

Figure 8.5. Responsiveness of Selective Caching to User Demand, i.e., TAD

The observations from Figures 8.4 and 8.5 are as follows.

First, the figures show that a Top-k MAB agent at every A-UAV combined with selective caching at MF-UAVs can learn a caching policy whose content dissemination performance is close to the benchmark performance [142]. The algorithm leverages the multi-dimensional reward structure of Eqns. 8.7-8.9 to learn the caching policy on-the-fly (see Section 8.5.2).

Second, the selective caching policy at micro-ferrying UAVs uses the information they share among themselves and with the A-UAVs to bring content availability closer to the benchmark performance by approximately 9% (see Figure 8.5). Each MF-UAV uses the currently visited A-UAV's caching information and the preceding MF-UAV's caching decision to algorithmically select its own contents for caching, as also shown in Figure 8.3. Such selective caching reduces the redundancy of the same content being available through multiple sources at the same time. The difference in the effectiveness of selective caching can be seen between Figures 8.4 and 8.5, where the caching decisions at MF-UAVs differ because $R_{TAD}$ differs between the two scenarios.

Third, with the UCB exploration strategy, content availability increases promptly during the initial learning epochs. All contents then have high upper confidence values, because few requests have been sampled, and this avoids excessive exploitation. As learning progresses, the sparse requests for unpopular contents keep their upper confidence values high, which maintains consistent exploratory behavior. Figures 8.4 and 8.5 show that this exploration strategy alone brings content availability approximately 5% closer to the benchmark performance than popularity estimation-based methods [78], [79], [80], [81].

[Figure annotations: with Top-k MAB and the selective caching policy, content access delays are substantially less than the TAD of 600 seconds. With a TAD of 450 seconds for contents {51-75}, the learnt caching policy adjusts to provide lower content access delay.]

Figure 8.6. Delay with Top-k MAB and Selective Caching Policy

Similarly, Figure 8.6 shows the convergence behavior, in terms of access delay, of the Top-k MAB learning-based caching agent at the A-UAVs and selective caching at the micro-ferrying UAVs. As learning progresses, the access delay for requested contents decreases while content availability increases. This shows the improvement in the learnt caching policy over the learning epochs and its effect on content access delay. The largest reduction in access delay is observed when Upper Confidence Bound (UCB) exploration is used at the Top-k MAB agents of the A-UAVs and selective caching is applied at the micro-ferrying UAVs.

8.6.2  Cache Similarity of Learnt Sequence with Best Sequence

The effects of learning on the cached content sequence are demonstrated in Figure 8.7. It 

plots Cache Distribution Optimality (CDO) of the cached content sequences for all the A-UAVs 

in terms of Jaro-Winkler Similarity (JWS).  

[Figure annotations: the post-convergence oscillations reflect the sensitive Q-values of the contents ferried by micro-ferrying UAVs. Initial oscillations with all caching methods indicate no a priori content popularity information.]

Figure 8.7. Learnt cached content sequence's similarity with benchmark sequence

The key observations are as follows.

First, the average $CDO$ converges near 0.9, although with a certain variance. This $CDO$ is measured between the benchmark caching sequence from the cache pre-loading policy (see Section 8.4) and the cached content sequences learnt by the Top-k MAB agents at the A-UAVs. Physically, it represents a high degree of similarity after convergence, where 1 indicates complete similarity and 0 implies no similarity.

Second, the cached contents improve over epochs as learning progresses. The lower $CDO$ values in the initial epochs reflect that the A-UAVs have no a priori local or global content popularity information. As the MAB agents learn over epochs of generated content requests, the cached contents in the A-UAVs become more similar to the best caching sequence.

Third, $CDO$ is an indirect representation of the storage segmentation factor ($\lambda$), which cache pre-loading policies use to decide the segment sizes [93]. A higher $CDO$ implies that, while learning the caching policy, the Top-k MAB agents also learn to emulate this segmentation behavior.

Finally, the partial dissimilarity of the cached content sequence can be ascribed to the uncertainty (or regret) associated with the Q-values of low-popularity contents. This uncertainty also makes the convergence of $CDO$ for the A-UAVs oscillatory.

The impact of selective caching at micro-ferrying UAVs can be clearly seen in Figure 8.7. Selective caching at the MF-UAVs, together with the Top-k MAB caching agent at the A-UAVs, leads to a $CDO$ of nearly 0.9. This depends on the effective caching capacity of the MF-UAVs, which is dictated by the TADs associated with content requests and by the MF-UAVs' visiting frequency at the A-UAVs (refer to Algorithm 8.2). The dependence of the contents' Q-values on such information also adds to the post-convergence oscillation. For the computation of $CDO$, the benchmark caching sequence is derived with the same effective caching capacity as the selective caching algorithm at the micro-ferrying UAVs.

8.6.3  Leveraging the Micro-Ferrying UAVs for Better Effective Caching Capacity

To examine how selective caching at micro-ferrying UAVs exploits the effective caching capacity, experiments are conducted with different TAD Ratios $R_{TAD}$. Performance is compared with a scenario that uses one relatively larger ferrying UAV (F-UAV). Such an F-UAV can carry sophisticated communication equipment as payload, including a larger caching capacity (≥ the total caching capacity of all MF-UAVs). Figure 8.8 shows the content availability under the learnt caching policy with 24 MF-UAVs. The remaining parameters are set to the default values in Table 8.1.

[Figure content: two panels, each plotting $\eta_{Availability}$ (0.5 to 1) against Effective Caching Capacity (8, 6, 4 and 3). Each panel compares three schemes: F-UAV with Cache Pre-Loading, MF-UAV with Cache Pre-Loading, and MF-UAV with Top-k + Selective Caching. The best learnt value is marked as $C_{MF}^{eff} = 4 \cdot C_{MF}$ in (a) and $C_{MF}^{eff} = 3 \cdot C_{MF}$ in (b).]

Figure 8.8. (a) Best learnt $C_{MF}^{eff}$ for $R_{TAD} = 1/6$, (b) for $R_{TAD} = 1/8$

The following observations can be made from Figure 8.8(a).

First, for $R_{TAD} = 1/6$, the best content availability is achieved with an effective caching capacity of $4 \cdot C_{MF}$, i.e., four times the caching capacity of an MF-UAV. Physically, this means that 4 MF-UAVs fly very close to each other. Within such a closely flying group, none of the pending requests for contents cached at the MF-UAVs expires by exceeding its TAD.

Second, content availability increases with effective caching capacity up to a certain point, beyond which it decreases. This is due to two opposing effects:

a) the low availability period [91] of a content grows with effective caching capacity, which eventually decreases content availability;
b) more types of contents are cached at the MF-UAVs as effective caching capacity grows, which increases content availability.

Selective caching at the MF-UAVs therefore handles the trade-off between these opposing behaviors. It chooses a caching policy that increases the effective caching capacity without increasing the low availability period of the contents cached at the MF-UAVs.

Note that the previous explanation is valid for a particular $R_{TAD}$. The best learnt effective caching capacity differs when the 𝑇𝐴𝐷𝑠 associated with the content requests change. This is demonstrated in Figure 8.8(b), where a decrease in $R_{TAD}$ from 1/6 to 1/8 reduces the best learnt effective caching capacity. Therefore, the learning capability of the Top-k MAB agents at the A-UAVs has an indirect dependence on the effective caching capacity of the MF-UAVs.

This also emphasizes the motivation for employing micro-UAVs in the role of ferrying contents. With a given cost budget for UAVs in a content dissemination system, micro-UAVs provide flexibility in caching policies, since their effective caching capacity can be altered to fit the users' needs. This flexibility cannot be leveraged with relatively larger and pricier UAVs, especially under equipment cost constraints.

8.7  Summary and Conclusion

In this chapter, a micro-UAV aided content dissemination system is proposed that can learn caching policies on-the-fly without a priori content popularity information. Two types of UAVs are introduced for content provisioning in a disaster/war-stricken scenario, viz. anchor UAVs and micro-ferrying UAVs. Cache-enabled anchor UAVs are stationed at each stranded community of users for uninterrupted content provisioning. Micro-ferrying UAVs act as content transfer agents across the anchor UAVs. A decentralized Top-k Multi-Armed Bandit Learning-based caching policy is proposed to ameliorate the limitations of existing caching methods. It learns the caching policy on-the-fly by maximizing the estimated multi-dimensional reward for the increase in local and global content availability. It is shown that a Top-k MAB learning-based caching policy achieves a content availability of ≈82% of the maximum achievable content availability. To improve the Q-value estimates, a Selective Caching Algorithm is introduced at the micro-ferrying UAVs. This method combines the information shared between anchor UAVs and micro-ferrying UAVs to reduce redundant copies of contents and to produce a better estimate of the top popular contents at a community. Selective caching at micro-ferrying UAVs along with the Top-k MAB learning-based caching policy at anchor UAVs boosts the content availability to ≈87% of the maximum achievable content availability. With the proposed caching policies, a scaled-up micro-UAV aided network is shown to attain a content availability of nearly 95% of the maximum achievable content availability. Future work on this research includes algorithmically coping with time-varying content popularity and adaptive trajectory planning in the presence of operational unreliabilities of the UAVs. Furthermore, subsequent experiments will focus on model sharing approaches such as Federated Learning in the presence of selective caching at Micro-Ferry UAVs.

Chapter 9:  Federated Multi-Armed Bandit Learning for Trajectory-aware Caching Policy in Content Dissemination System using Swarm of UAVs

In the aftermath of large-scale disasters such as earthquakes, floods, and armed conflicts, 

survivors  are  often  left  in  isolated  regions  without  functional  communication  infrastructure. 

Traditional  content  dissemination  mechanisms  become  ineffective,  creating  an  urgent  need  for 

adaptive  and  resilient  alternatives.  Building  upon  the  trajectory-aware  caching  framework 

developed in Chapter 8, this chapter introduces a federated, learning-driven solution that further 

enhances content availability in fragmented environments. 

Specifically,  this  chapter  presents  a  Federated  Multi-Armed  Bandit  (FedMAB)  learning 

approach where UAVs collaborate by sharing learned models rather than raw user data. Through 

this strategy, UAVs jointly optimize their caching decisions while preserving their nuanced local 

content  caching  perspective  and  minimizing  overgeneralization  of  the  shared  models.  The 

architecture builds upon a two-tier structure of anchor UAVs (A-UAVs) and micro-ferrying UAVs 

(MF-UAVs)  that  incorporates  selective  caching  strategies  and  federated  model  aggregation  to 

dynamically adapt to varying user demands, diverse content priorities, and tolerable access delays. 

9.1  Motivation 

Although decentralized learning through Multi-Armed Bandit algorithms enhances content 

caching decisions at individual UAVs, isolated learning can result in slow convergence and weak 

reward  estimation,  especially  under  heterogeneous  and  dynamic  content  demand.  Furthermore, 

trajectory-aware caching strategies, while effective, remain vulnerable to operational uncertainties 

and shifting user preferences. 

This chapter is motivated by the need to accelerate learning convergence, to enhance caching robustness across a geographically distributed UAV swarm, and to ensure coordinated decision-making without relying on centralized control. Federated Multi-Armed Bandit Learning addresses these challenges by allowing UAVs to share their learned models, which enables rapid adaptation, scalable decision-making, and resilient operation in disaster-affected regions.

9.2  Design Objective

The primary objective of this chapter is to create a distributed and federated learning framework 

that  enables  UAVs  to  dynamically  learn  and  optimize  trajectory-aware  caching  policies  in 

environments where conventional communication infrastructure is unavailable. 

a)  First,  this  chapter  designs  a  Federated  Multi-Armed  Bandit  (FedMAB)  based  caching 

framework  that  enables  UAVs  to  collaboratively  refine  their  caching  decisions  while 

maintaining the privacy of user demand information. 

b)  Second,  it  introduces  a  multi-dimensional  reward  structure  that  captures  local  content 

demand,  ferrying-based  dissemination  patterns,  and  global  content  popularity  to  guide 

effective and adaptive caching strategies. 

c)  Third, the chapter presents a divergence-based weighted aggregation method to ensure that 

UAVs experiencing similar content request patterns contribute more significantly during 

federated  model  updates,  thereby  improving  the  alignment  between  local  and  global 

caching priorities. 

d)  Fourth, it designs a Selective Caching Algorithm for micro-ferrying UAVs (MF-UAVs), 

which  strategically  minimizes  content  redundancy  across  the  swarm  while  maximizing 

overall content accessibility for isolated communities. 

e)  Fifth, the chapter develops a controlled latency mechanism for federated model updates 

that balances learning responsiveness with caching stability, ensuring that UAVs can adapt 

efficiently while maintaining high system performance. 

f)  Finally,  it  validates  the  designed  FedMAB  framework  through  extensive  simulation 

experiments and analytical modeling, demonstrating its effectiveness in enhancing content 

availability, reducing access delay, improving cache optimality, and enabling adaptability 

under changing user preferences. 

Through these objectives, this chapter aims to establish a robust, scalable, and resilient UAV-aided 

content  dissemination  system  that  responds  intelligently  to  real-world  challenges  encountered 

during disaster recovery operations. 

9.3  System Model

9.3.1  UAV Hierarchy 

As shown in Figure 9.1, a two-tiered UAV-assisted content dissemination system is deployed. Each community is served by a dedicated A-UAV; the A-UAVs operate with much larger power budgets than the Micro-UAVs described next. The A-UAVs use lateral wireless connections (e.g., Wi-Fi) to communicate with the users in their community, and they can download content via an expensive vertical link such as satellite-based internet. The system introduces a set of low-power-budget Micro-UAVs [63] for the role of ferrying (MF-UAVs). MF-UAVs are mobile and possess only lateral communication links such as Wi-Fi; unlike the A-UAVs, they do not possess expensive vertical communication interfaces such as satellite links. Effectively, the MF-UAVs act as content transfer agents across different user communities by selectively caching and transferring content across the A-UAVs through their lateral links.

[Figure 9.1: anchor UAVs with vertical links are stationed over user communities whose communication infrastructure is damaged; micro-ferrying UAVs travel between them; inter-UAV communication and users are served through lateral links.]

Figure 9.1. Coordinated UAV system for content dissemination in environments without communication infrastructure

Discussion: The concept of hierarchical UAV structuring and content clustering is aligned with 

the principles of efficient content dissemination. While the designed framework does not explicitly 

impose a higher-layer structure for clustering communities, aspects of hierarchical coordination 

already emerge through the two-tier UAV system. The Anchor UAVs (A-UAVs) inherently serve 

as local caching coordinators for their respective communities, while Micro-Ferrying UAVs (MF-

UAVs) transport content across different regions, effectively creating a layered distribution system 

without rigid structuring. 

Additionally, content placement decisions in FedMAB, to be discussed later, naturally result in 

implicit clustering, as the learning model prioritizes frequently requested content within specific regions, which ensures that communities receive relevant cached data without requiring manual

segmentation.  This  data-driven  approach  allows  the  system  to  dynamically  adapt  to  evolving 

content access patterns rather than relying on predefined clusters. 

The model also integrates an implicit indexing mechanism through its Q-value system, also to 

be discussed in the forthcoming sections, where content importance is dynamically ranked based 

on request patterns. This ranking ensures that MF-UAVs retrieve and distribute the most relevant 

content without needing a predefined indexing structure.  

9.3.2  Content Demand and Provisioning Model

The content popularity distribution, quality of service, and content provisioning models are outlined below.

Content Popularity: Research has shown that user content request patterns often follow a power-law distribution such as the Zipf distribution [91]. In the Zipf distribution, the popularity of a content is inversely proportional to its rank raised to the power 𝛼 (for 𝛼 = 1, simply to the inverse of its rank). Popularity of content '𝑖' is given as:

$$\mathcal{P}_{\alpha}(i) = \frac{1/i^{\alpha}}{\sum_{k=1}^{C} 1/k^{\alpha}} \tag{9.1}$$

The Zipf parameter 𝛼 determines the distribution's skewness, while the total number of contents in the pool is represented by the parameter 𝐶. The inter-request time of a user follows the widely used exponential distribution [91], i.e., each user's requests form a Poisson process.
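As an illustration of Eqn. 9.1 and the exponential inter-request model, the Python sketch below computes Zipf popularities for ranks 1 to C and draws a synthetic request stream for one user. The function names, the request rate, and the parameter values (C = 100, α = 0.8) are hypothetical choices for illustration, not the simulation settings used in this thesis.

```python
import random


def zipf_popularity(num_contents, alpha):
    """Eqn. 9.1: P_alpha(i) = (1 / i**alpha) / sum_k (1 / k**alpha), ranks i = 1..C."""
    weights = [1.0 / (i ** alpha) for i in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_requests(popularity, rate, horizon, rng):
    """Generate (time, rank) pairs for one user.

    Inter-request times are exponential with the given rate (a Poisson process);
    each requested rank is drawn according to the Zipf popularity.
    """
    ranks = list(range(1, len(popularity) + 1))
    t, requests = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return requests
        requests.append((t, rng.choices(ranks, weights=popularity, k=1)[0]))


pop = zipf_popularity(num_contents=100, alpha=0.8)
# P(i) / P(i+1) = ((i + 1) / i) ** alpha, so the rank-1 / rank-2 ratio is 2 ** alpha.
ratio_12 = pop[0] / pop[1]
reqs = sample_requests(pop, rate=0.5, horizon=100.0, rng=random.Random(7))
```

Because the ratio between consecutive popularities depends on the rank, a larger α concentrates the request mass on the first few ranks.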

Note that Zipf distribution is widely recognized as an appropriate model for content popularity, 

with empirical validation across multiple domains, including publications, online video platforms, 

social media, and recommendation systems such as Netflix, Instagram and more. This distribution 

effectively captures the heavy-tailed nature of content requests, where a small fraction of items 

accounts for the majority of demand. 

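[Figure 9.2 flowchart: Start → content request generated → A-UAV searches its cache. If the content is found in the A-UAV, it is delivered to the user. Otherwise, until the TAD expires, the A-UAV checks whether an MF-UAV has visited and has the content; if so, the content is delivered to the user. If the TAD expires, the A-UAV downloads the content and delivers it to the user → End.]

Figure 9.2. Content Delivery Process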

Tolerable Access Delay (TAD): For each generated request, a TAD [70] is specified. TAD is a Quality-of-Service parameter that indicates how long a user is willing to wait for the requested content to be provisioned. Operationally, if a content is not available from the UAVs within the specified TAD, it must be downloaded from a central server using the expensive vertical links of the A-UAVs. Note that the 𝑇𝐴𝐷 is request-specific: it can differ across contents depending on the requesting user's urgency.

Content Provisioning: Upon receiving a request from one of its community users, the locally deployed A-UAV first searches its local storage for the content. If the content is not found, the A-UAV waits for a potential future delivery by a traveling MF-UAV. If no MF-UAV arrives with the requested content within the specified TAD, the A-UAV then downloads it through its vertical link. Since vertical links such as satellite links are expensive, caching strategies that make content accessible from the UAVs can effectively reduce the overall content provisioning costs.
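The provisioning order described above can be sketched as a small Python function. The function name and its input format, a list of hypothetical (arrival time, carried contents) pairs for visiting MF-UAVs, are assumptions made for illustration; in this sketch, a request that misses both the local cache and every MF-UAV arriving within its TAD is downloaded over the vertical link when the TAD expires.

```python
def provision(content, t_request, tad, local_cache, mf_visits):
    """Return (source, serve_time) for one request, following the order of Figure 9.2.

    local_cache: set of contents cached at the community's A-UAV.
    mf_visits: list of (arrival_time, set_of_carried_contents) for MF-UAVs
               visiting this A-UAV (hypothetical input format).
    """
    # 1) The A-UAV first searches its own cache.
    if content in local_cache:
        return ("A-UAV", t_request)
    deadline = t_request + tad
    # 2) Otherwise it waits for MF-UAVs that arrive before the TAD expires.
    for arrival, carried in sorted(mf_visits, key=lambda visit: visit[0]):
        if t_request <= arrival <= deadline and content in carried:
            return ("MF-UAV", arrival)
    # 3) TAD expired: download through the expensive vertical link.
    return ("vertical-link", deadline)


cache = {1, 2, 3}
visits = [(12.0, {7, 8}), (5.0, {4, 5})]
print(provision(1, 0.0, 10.0, cache, visits))  # ('A-UAV', 0.0)
print(provision(4, 0.0, 10.0, cache, visits))  # ('MF-UAV', 5.0)
print(provision(7, 0.0, 10.0, cache, visits))  # ('vertical-link', 10.0)
```

Content 7 is carried by an MF-UAV, but that MF-UAV arrives after the 10-unit TAD, so the request falls back to the vertical link.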

9.4  Content Caching Problem Formulation

The caching problem focuses on selecting which contents should be stored at UAVs to maximize content availability while considering storage constraints, access latency, and varying user demand across communities. For a given number of Anchor and Micro-ferrying UAVs, the caching problem at the UAVs can be defined as follows.

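$$\underset{\forall n \in \mathcal{N}}{\text{maximize}} \ \left( \frac{1}{\mathcal{N}} \sum_{n=1}^{\mathcal{N}} \mathbb{P}^{n}_{avail} \right) \tag{9.2}$$

$$\text{such that} \ \sum_{i=1}^{N_A} |C_A| + \sum_{j=1}^{N_{MF}} |C_{MF}| < |C| \tag{9.3}$$

$$\text{and} \ \mathcal{T}^{serve}_{h} - \mathcal{T}^{req}_{h} \le TAD_{h}, \ h \in \mathcal{H}_n, \ \forall \mathcal{H}_n = \{1, 2, 3, \cdots\} \tag{9.4}$$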

where $\mathbb{P}^{n}_{avail} = \mathcal{H}_n / \mathcal{R}_n$, $\mathcal{H}_n$ is the number of contents provisioned at community 'n' by the UAV system (both A-UAVs and MF-UAVs), $\mathcal{R}_n$ is the total number of requests made by users at community 'n', $\mathcal{N}$ is the number of communities, $C$ is the total number of contents in the pool, $C_A$ is the cache of each A-UAV, $C_{MF}$ is the cache of each MF-UAV, $N_A$ is the number of A-UAVs, $N_{MF}$ is the number of MF-UAVs, $\mathcal{T}^{req}_{h}$ is the time at which a content 'h' is requested by a user, $\mathcal{T}^{serve}_{h}$ is the time when content 'h' is served to the user by the UAV system, and $TAD_{h}$ is the tolerable access delay associated with content 'h'.

The caching problem aims to maximize the overall content availability, as shown in Eqn. 9.2. This objective is subject to keeping the cumulative caching capacity of the UAVs below the total number of contents in the content pool, as captured in Eqn. 9.3. An additional constraint is imposed by the tolerable access delay associated with a content served to the user by the UAV-aided system (refer to Eqn. 9.4).
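To make the objective and constraints concrete, the Python sketch below evaluates Eqns. 9.2-9.4 on a hypothetical request log. Each log entry is assumed to be a tuple (request time, serve time, TAD, served by the UAV system?); requests served over the vertical link count toward $\mathcal{R}_n$ but not toward $\mathcal{H}_n$. The function names and the toy numbers are illustrative only.

```python
def community_availability(log):
    """H_n / R_n for one community.

    log: list of (t_req, t_serve, tad, served_by_uav) tuples (hypothetical format).
    A request counts toward H_n only if a UAV served it within its TAD (Eqn. 9.4).
    """
    if not log:
        return 0.0
    hits = sum(
        1
        for t_req, t_serve, tad, served_by_uav in log
        if served_by_uav and t_serve - t_req <= tad
    )
    return hits / len(log)


def mean_availability(logs):
    """Objective of Eqn. 9.2: average availability over all communities."""
    return sum(community_availability(log) for log in logs) / len(logs)


def cache_budget_ok(n_a, cap_a, n_mf, cap_mf, pool_size):
    """Constraint of Eqn. 9.3: total UAV cache space is smaller than the content pool."""
    return n_a * cap_a + n_mf * cap_mf < pool_size


logs = [
    [(0.0, 0.0, 5.0, True), (1.0, 4.0, 5.0, True), (2.0, 7.0, 5.0, False)],
    [(0.0, 0.5, 1.0, True), (0.0, 1.0, 1.0, False)],
]
print(mean_availability(logs))             # (2/3 + 1/2) / 2 ≈ 0.583
print(cache_budget_ok(6, 20, 24, 5, 500))  # True
```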

9.5  Benchmark Caching Policy with A-Priori Demand Knowledge

This section focuses on the following caching-related design questions: a) which contents should be downloaded and cached in the A-UAVs so that they can serve their own community directly, and the remote communities via the traveling MF-UAVs; b) which contents should be cached when the popularity and 𝑇𝐴𝐷 of contents vary across communities; c) which contents should be transferred from the A-UAVs to the MF-UAVs; and d) what the benchmark caching policy is under heterogeneous content popularity at each user community and heterogeneous request-specific 𝑇𝐴𝐷s.

These questions are addressed by formulating a benchmark caching policy with a priori known heterogeneous content popularities. This benchmark caching policy is also modified to cater to request-specific 𝑇𝐴𝐷s. After establishing the benchmark, runtime and dynamic mechanisms are developed in a subsequent section.

9.5.1  Caching at Anchor UAVs (A-UAVs)

For simplicity, consider a disaster/war-stricken area with homogeneous content popularity across all the user communities. An A-UAV is assigned to each community for content provisioning. The number of A-UAVs in the system is denoted by $N_A$. In such a scenario, the effective caching capacity of the A-UAVs can be maximized by storing a certain number of unique contents in each A-UAV and sharing those contents across the communities via the traveling MF-UAVs. To maximize the effective caching capacity of all $N_A$ A-UAVs, the cache space of each A-UAV is divided into two segments [91], namely, Segment-1 and Segment-2. Let the sizes of Segment-1 and Segment-2 of the A-UAV cache be $|C_{S1}|$ and $|C_{S2}|$, respectively. They can be expressed as follows:

$$|C_{S1}| = \lambda \times |C_A| \tag{9.5}$$

$$|C_{S2}| = (1 - \lambda) \times |C_A| \tag{9.6}$$

where $\lambda$ is a storage segmentation factor (SSF) that decides the split between the segments within an A-UAV [91]. The top $\lambda \cdot |C_A|$ popular contents are cached in Segment-1. These contents are the same across all A-UAVs, whereas the contents stored in Segment-2 differ from one A-UAV to another. This results in the following total number of Segment-2 contents stored across all $N_A$ A-UAVs:

$$|C_{S2}^{total}| = N_A \times (1 - \lambda) \times |C_A| \tag{9.7}$$

These contents are shared across all user communities via the mobile MF-UAVs. In popularity, they rank immediately after the top $\lambda \cdot |C_A|$ Segment-1 contents common to all the A-UAVs. For symmetry, all $N_A \times (1 - \lambda) \times |C_A|$ Segment-2 contents are distributed uniformly at random across the $N_A$ A-UAVs. Hence, for a given Zipf parameter $\alpha$, which determines the distribution's skewness, the total number of contents in the system is as follows:

$$|C_{sys}^{\alpha}| = \lambda \times |C_A| + N_A \times (1 - \lambda) \times |C_A| \Rightarrow |C_{sys}^{\alpha}| = \left[\lambda + N_A \times (1 - \lambda)\right] \times |C_A| \tag{9.8}$$
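The segmentation in Eqns. 9.5-9.8 can be checked numerically. The Python sketch below assumes integer segment sizes (it rounds $\lambda \times |C_A|$) and, for reproducibility, deals the Segment-2 contents out round-robin instead of uniformly at random as in the text. The function names are hypothetical. With two A-UAVs, $|C_A| = 10$ and $\lambda = 0.7$ (illustrative values), it yields the layout shown in Figure 9.3: contents 1-7 in every A-UAV and 13 distinct contents in the system.

```python
def segment_sizes(cap_a, lam):
    """Eqns. 9.5-9.6, rounded to whole contents (an assumption of this sketch)."""
    s1 = round(lam * cap_a)
    return s1, cap_a - s1


def system_contents(n_a, cap_a, lam):
    """Eqns. 9.7-9.8: shared Segment-1 contents plus all distinct Segment-2 contents."""
    s1, s2 = segment_sizes(cap_a, lam)
    return s1 + n_a * s2


def benchmark_caches(n_a, cap_a, lam):
    """Top-s1 ranks go to every A-UAV; the next n_a * s2 ranks are dealt out round-robin."""
    s1, s2 = segment_sizes(cap_a, lam)
    caches = [set(range(1, s1 + 1)) for _ in range(n_a)]
    next_rank = s1 + 1
    for _ in range(s2):
        for cache in caches:
            cache.add(next_rank)
            next_rank += 1
    return caches


caches = benchmark_caches(n_a=2, cap_a=10, lam=0.7)
print(sorted(caches[0]))            # [1, 2, 3, 4, 5, 6, 7, 8, 10, 12]
print(sorted(caches[1]))            # [1, 2, 3, 4, 5, 6, 7, 9, 11, 13]
print(system_contents(2, 10, 0.7))  # 13
```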

Now consider a heterogeneous demand scenario in which every community has a different demand pattern, and each content is requested with a fixed, pre-decided 𝑇𝐴𝐷. The above caching policy is modified as follows to address such a situation. Some contents from Segment-1, termed exclusive contents, are cached in one or some of the A-UAVs, but not in all of them [93], whereas the remaining contents from Segment-1, termed non-exclusive contents, are cached at all the A-UAVs [91]. Therefore, unlike the homogeneous popularity scenario, the number of contents in Segment-1 across all A-UAVs may exceed $\lambda \times |C_A|$ due to the A-UAV-specific exclusive contents. This is shown below:

$$|C_{S1}^{total}| = |C_{NE}| + |C_{E}^{total}| \ge \lambda \times |C_A| \tag{9.9}$$

Similar to the caching policy in a homogeneous popularity scenario, contents in Segment-2 do not repeat across the A-UAVs. If $C_{NE}$ and $C_{E}^{total}$ are the non-exclusive and total exclusive contents in Segment-1, then the total number of contents in the system can be modified from Eqn. 9.8 and expressed as follows:

$$|C_{sys}^{\alpha}| = |C_{NE}| + |C_{E}^{total}| + N_A \times (1 - \lambda) \times |C_A| \Rightarrow |C_{sys}^{\alpha}| \ge \left[\lambda + N_A \times (1 - \lambda)\right] \times |C_A| \tag{9.10}$$

The classification of content as exclusive or non-exclusive is determined by predefined access 

constraints that specify whether a piece of content is intended for a single community or multiple 

communities. Exclusive content is assigned to a specific segment and is cached at designated A-

UAVs serving that community. Non-exclusive content is intended for broader dissemination and 

is made available across multiple A-UAVs to maximize accessibility. In the benchmark models, these classifications dictate where content is stored, which ensures that exclusive content remains within its intended segment while non-exclusive content is widely distributed. However, in FedMAB, discussed in the forthcoming section, caching decisions are not constrained by predefined exclusivity labels. Instead, the learning model determines caching locations

dynamically based on observed request patterns. Content initially classified as exclusive or non-

exclusive may be placed in different locations if the bandit-driven learning process identifies a 

more efficient caching strategy. This ensures that caching adapts to real-world demand rather than 

being restricted by static classifications. 

Note that the above caching policies take the contents' popularity into consideration while making caching decisions. However, the promptness with which a content needs to be provisioned, i.e., its 𝑇𝐴𝐷, may not always be positively correlated with its popularity. Hence, unlike the cache space optimization undertaken thus far, the caching policy requires modification from a standpoint that considers the significance of content.

Now consider a heterogeneous demand scenario where every community has a different demand pattern, and each content is requested with its own specific 𝑇𝐴𝐷 [91]. If a content is requested with a small 𝑇𝐴𝐷, the user is not willing to wait for a visiting MF-UAV to deliver the content. Therefore, caching such time-critical contents at the A-UAVs becomes imperative. To prioritize caching of such contents in Segment-1 of the A-UAVs, this chapter devises a value-based caching policy where the value of a requested content '𝒽' is calculated from its popularity and its 𝑇𝐴𝐷 as follows:

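$$\mathcal{V}(h) = \kappa \times \frac{TAD_{min}}{p_{\alpha}(1)} \times \frac{\mathcal{P}_{\alpha}(h)}{TAD_{h}} \Rightarrow \mathcal{V}(h) = \kappa\upsilon \times \frac{\mathcal{P}_{\alpha}(h)}{TAD_{h}} \tag{9.11}$$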

Here, $\mathcal{P}_{\alpha}(h)$ is the popularity of the content as per the Zipf distribution, $TAD_{h}$ is the tolerable access delay associated with the content request, $\kappa \in [0,1]$ is a scalar weight that increases with decreasing popularity, and $\upsilon$ is a normalization constant. For a given Zipf (popularity) parameter $\alpha$, the normalization constant $\upsilon = TAD_{min}/p_{\alpha}(1)$ is calculated from the minimum possible 𝑇𝐴𝐷 ($TAD_{min}$) and the maximum possible popularity, $p_{\alpha}(1)$. Since $\mathcal{P}_{\alpha}(h) \le p_{\alpha}(1)$ and $TAD_{h} \ge TAD_{min}$, the quantity $\mathcal{V}(h)$ is bounded in [0, 1]; it increases with $\mathcal{P}_{\alpha}(h)$ and decreases as $TAD_{h}$ grows. This value-based caching policy increases the likelihood that contents requested with a low 𝑇𝐴𝐷 are cached in Segment-1 of the A-UAVs, thus making them more readily available. Note that the cache space maximization method developed in Eqns. 9.5-9.10 still applies to this scenario. Here, the contents to be cached are chosen based on their values instead of their popularity, as shown below:

$$|C_{sys}^{\mathcal{V}}| = |C_{NE}| + |C_{E}^{total}| + N_A \times (1 - \lambda) \times |C_A| \tag{9.12}$$

9.5.2  Caching at Micro-Ferrying UAVs (MF-UAVs)

The purpose of the MF-UAVs is to ferry the $|C_{E}^{total}| + N_A \times (1 - \lambda) \times |C_A|$ contents stored across the $N_A$ A-UAVs (see Eqn. 9.10). Due to the limited per-MF-UAV caching space [63] (i.e., $|C_{MF}|$), their caching policy should be determined based on the trajectories, the value of $\lambda$, the Zipf popularity, and the 𝑇𝐴𝐷s associated with the contents to be cached [126].

Consider a situation in which an MF-UAV 'j' is approaching A-UAV 'i'. Let $U_i$ be the set of all exclusive contents in Segment-1 of all A-UAVs and all contents from Segment-2 of all A-UAVs in the entire system, except the ones stored in A-UAV 'i'. To maximize content availability for the users in A-UAV i's community, the MF-UAV should carry the $|C_{MF}|$ top-valued contents (refer to Eqns. 9.11 and 9.12) from the set $U_i$ while approaching A-UAV i.

[Figure 9.3: A-UAV 1 caches contents 1-7, 9, 11 and 13, and A-UAV 2 caches contents 1-8, 10 and 12; one MF-UAV carries contents 8, 10 and 12, and the other carries contents 9, 11 and 13.]

Figure 9.3. Caching Policy at MF-UAVs

The size of the set $U_i$ can be expressed as $|U_i| = |C_{E}^{total}| + (N_A - 1) \cdot (1 - \lambda) \cdot |C_A|$. When $|C_{MF}| \le |U_i|$, the MF-UAV should carry the $|C_{MF}|$ top-valued contents as outlined above. Otherwise, the MF-UAV should carry all $|U_i|$ contents, leaving part of the MF-UAV cache (i.e., $|C_{MF}| - |U_i|$ slots) empty. This implies that an apt choice of caching policy at the A-UAVs affects the utilization of the MF-UAVs' caches.
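The benchmark MF-UAV rule (carry the top-valued contents of $U_i$, or all of $U_i$ if it is smaller than the MF-UAV cache) can be sketched in Python as follows. Here $U_i$ is computed as the contents cached at the other A-UAVs minus those already at A-UAV i, which matches the definition above under the benchmark layout. The function names are hypothetical, and the caches and values are the illustrative Figure 9.3 example with values taken proportional to 1/rank (an assumption corresponding to identical TADs).

```python
def ferry_set(caches, i):
    """U_i: contents held by some other A-UAV but not by A-UAV i."""
    others = set().union(*(c for j, c in enumerate(caches) if j != i))
    return others - caches[i]


def mf_payload(caches, i, value, cap_mf):
    """Top-valued contents of U_i that fit in one MF-UAV cache of size cap_mf."""
    ranked = sorted(ferry_set(caches, i), key=lambda h: value[h], reverse=True)
    return ranked[:cap_mf]  # if cap_mf > |U_i|, the remaining slots stay empty


shared = set(range(1, 8))
caches = [shared | {9, 11, 13}, shared | {8, 10, 12}]  # A-UAV 1 and A-UAV 2 (Figure 9.3)
value = {h: 1.0 / h for h in range(1, 14)}
print(mf_payload(caches, 0, value, cap_mf=2))  # [8, 10]
print(mf_payload(caches, 0, value, cap_mf=5))  # [8, 10, 12]
```

With `cap_mf=5` only three contents are available, so two MF-UAV cache slots remain unused, which is the under-utilization case noted above.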

The MF-UAV caching policy is summarized in the pseudocode below.

Algorithm 9.1. MF-UAV Caching Algorithm with Value-based policy executed at the A-UAVs

1.  Input: Total A-UAVs in its trajectory, 𝑇𝐴𝐷, next A-UAV '𝑖', present A-UAV '𝑖 − 1'
2.  Output: $C_{MF}$ contents for MF-UAV '𝑗'
3.  Initialize $C_A$ contents in each A-UAV based on value of contents
4.  while True:
5.        if MF-UAV '𝑗' is leaving for next A-UAV '𝑖' then do
6.              for each content '𝑘' in A-UAV '𝑖' cache $C_A^{i}$ do
7.                    Check if '𝑘' is in the $C_{MF}$ cache of MF-UAV '𝑗'
8.                    if true then do
9.                          Replace '𝑘' with the highest-value content from $C_A^{i-1}$ not cached in MF-UAV '𝑗' or A-UAV '𝑖'
10.                   end if
11.             end for
12.       end if
13.       Update next A-UAV '𝑖', present A-UAV '𝑖 − 1'
14. end while
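Lines 5-11 of Algorithm 9.1 can be read as the following Python sketch of a single departure of MF-UAV 'j' from A-UAV 'i − 1' towards A-UAV 'i'. The function name is hypothetical, and two details are assumptions not fixed by the pseudocode: the MF-UAV cache is modeled as a list of content IDs, and when no replacement candidate remains, the redundant content is simply kept.

```python
def refresh_mf_cache(mf_cache, next_cache, prev_cache, value):
    """One pass of Algorithm 9.1 (lines 6-11) as MF-UAV j leaves A-UAV i-1 for A-UAV i.

    mf_cache: list of content IDs carried by MF-UAV j.
    next_cache / prev_cache: sets of contents cached at A-UAV i and A-UAV i-1.
    value: mapping content -> value V(h) from Eqn. 9.11.
    """
    mf = list(mf_cache)
    for k in sorted(next_cache):
        if k in mf:  # k would be a redundant copy at A-UAV i
            candidates = [h for h in prev_cache if h not in mf and h not in next_cache]
            if candidates:  # assumption: keep k when nothing can replace it
                best = max(candidates, key=lambda h: value[h])
                mf[mf.index(k)] = best
    return mf


a_uav_1 = set(range(1, 8)) | {9, 11, 13}  # next A-UAV i
a_uav_2 = set(range(1, 8)) | {8, 10, 12}  # present A-UAV i-1
value = {h: 1.0 / h for h in range(1, 14)}
print(refresh_mf_cache([1, 2, 10], a_uav_1, a_uav_2, value))  # [8, 12, 10]
```

Contents 1 and 2 are already at the next A-UAV, so they are swapped for the highest-value contents of the present A-UAV that the next A-UAV lacks.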

9.5.3  Theoretical Performance Upper-Bound

In this section, a theoretical performance upper-bound is computed when the A-UAVs and MF-UAVs follow the benchmark caching policy described in Sections 9.5.1 and 9.5.2. Consider a UAV-caching system with $N_A$ A-UAVs and $N_{MF}$ MF-UAVs. The number of MF-UAVs traveling in a group is denoted by $N_{MF}^{G}$. MF-UAVs traverse the complete disaster region in $\mathcal{T}_{cycle}$ seconds. The hover ratio $\mathcal{R}_{hov}$ is the ratio of the time an MF-UAV stays at a community before leaving for the next one to $\mathcal{T}_{cycle}$. The transition ratio $\mathcal{R}_{trans}$ is the ratio of the time an MF-UAV takes to travel from one community to the next to $\mathcal{T}_{cycle}$. For simplicity, the inter-community distances are kept the same. The content request pattern is heterogeneous across communities with popularity parameter $\alpha$. Every request 'h' is accompanied by its respective $TAD_{h}$.

The performance upper-bound has three important parts, namely, the probability $\mathbb{P}_A$ that the content is found in an A-UAV 'i', the probability $\mathbb{P}_{MF}$ of a content being found in an MF-UAV, and the probability $\mathbb{P}_{Access}$ that an MF-UAV is accessible near an A-UAV before content requests expire. The accessibility probability $\mathbb{P}_{Access}$ is computed according to a condition $\mathbb{T}_{cond}$, which is given below:

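$$\mathbb{T}_{cond} = \left(\frac{N_{MF}^{G} \times N_A}{N_{MF}} - 1\right) \times \mathcal{R}_{hov} \times \mathcal{T}_{cycle} + \left(\frac{N_{MF}^{G} \times N_A}{N_{MF}}\right) \times \mathcal{R}_{trans} \times \mathcal{T}_{cycle}$$

$$\Rightarrow \mathbb{T}_{cond} = \left[\left(\frac{N_{MF}^{G} \times N_A}{N_{MF}} - 1\right) \times \mathcal{R}_{hov} + \left(\frac{N_{MF}^{G} \times N_A}{N_{MF}}\right) \times \mathcal{R}_{trans}\right] \times \mathcal{T}_{cycle} \tag{9.13}$$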

Eqn. 9.13 computes the time an MF-UAV takes to revisit a location. To ensure that the formulation remains applicable across different UAV configurations, it is important to clarify how the grouping of MF-UAVs and the accessibility factors influence the caching process. The first term in parentheses in Eqn. 9.13 does not become zero, because $N_{MF}^{G}$ represents a dynamically determined grouping of MF-UAVs based on their flight dynamics and proximity. Since the grouping varies depending on how MF-UAVs ferry content and reduce redundant transmissions, the ratio $\frac{N_{MF}^{G} \times N_A}{N_{MF}}$ does not equal one, which ensures that the first term remains nonzero. Moreover, all involved parameters are strictly positive and, by definition, $N_{MF}^{G} \le N_{MF}$; the first term is non-negative whenever $N_{MF}^{G} \times N_A \ge N_{MF}$, i.e., whenever the ratio is at least one. The subtraction of one ensures that the formulation correctly accounts for accessibility relative to the number of A-UAVs in the MF-UAVs' trajectory cycle.

Additionally, $\mathcal{R}_{hov}$ is formulated as a probability weight rather than a strict ratio to $\mathcal{T}_{cycle}$, which ensures adaptability in determining the weighted contribution of hovering time. By treating $\mathcal{R}_{hov}$ as a probability weight, the framework allows for dynamic adjustments based on MF-UAV group behavior, which prevents an over-simplified proportionality that does not account for variations in flight patterns and spatial arrangements. This ensures that the influence of hovering time is contextually adjusted rather than statically imposed, preserving the generality of the formulation.
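For a quick numerical check of Eqn. 9.13, the Python sketch below evaluates both of its forms. The function names and all parameter values are hypothetical (12 MF-UAVs flying in groups of 4, 6 A-UAVs, a 600 s cycle); with them the ratio $N_{MF}^{G} N_A / N_{MF}$ equals 2, so the first term is positive.

```python
def revisit_threshold(n_mf, n_mf_group, n_a, r_hov, r_trans, t_cycle):
    """T_cond from Eqn. 9.13 (factored form)."""
    ratio = n_mf_group * n_a / n_mf
    return ((ratio - 1) * r_hov + ratio * r_trans) * t_cycle


def revisit_threshold_expanded(n_mf, n_mf_group, n_a, r_hov, r_trans, t_cycle):
    """T_cond from the first line of Eqn. 9.13 (hover term + transit term)."""
    ratio = n_mf_group * n_a / n_mf
    return (ratio - 1) * r_hov * t_cycle + ratio * r_trans * t_cycle


params = dict(n_mf=12, n_mf_group=4, n_a=6, r_hov=0.05, r_trans=0.1, t_cycle=600.0)
t_cond = revisit_threshold(**params)
print(t_cond)  # approximately 150 s: (1 * 0.05 + 2 * 0.1) * 600
```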

Depending on the condition being satisfied, $\mathbb{P}_{Access}$ is computed using the following piece-wise expression:

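$$\mathbb{P}_{Access} = \begin{cases} \dfrac{N_{MF} \times \left(\mathcal{R}_{hov}\mathcal{T}_{cycle} + \overline{TAD}\right)}{N_{MF}^{G} \times N_A \times \left(\mathcal{R}_{hov}\mathcal{T}_{cycle} + \mathcal{R}_{trans}\mathcal{T}_{cycle}\right)}, & \text{for } \overline{TAD} < \mathbb{T}_{cond} \\ 1, & \text{for } \overline{TAD} \ge \mathbb{T}_{cond} \end{cases}$$

$$= \begin{cases} \dfrac{N_{MF} \times \left(\mathcal{R}_{hov}\mathcal{T}_{cycle} + \overline{TAD}\right)}{N_{MF}^{G} \times N_A \times \left(\mathcal{R}_{hov} + \mathcal{R}_{trans}\right)\mathcal{T}_{cycle}}, & \text{for } \overline{TAD} < \mathbb{T}_{cond} \\ 1, & \text{for } \overline{TAD} \ge \mathbb{T}_{cond} \end{cases} \tag{9.14}$$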

Here, $\overline{TAD}$ is the mean 𝑇𝐴𝐷, which is used for generalization. The second part of the piece-wise expression in Eqn. 9.14 shows that for $\overline{TAD} \ge \mathbb{T}_{cond}$, i.e., a sufficiently large 𝑇𝐴𝐷, the contents in an MF-UAV are always accessible. However, for $\overline{TAD}$ less than $\mathbb{T}_{cond}$, the contents in MF-UAVs are only partially accessible; substituting $\overline{TAD} = \mathbb{T}_{cond}$ from Eqn. 9.13 makes the first branch equal to one, so the two branches meet continuously. Note that physical accessibility to MF-UAVs does not guarantee access to a requested content, since the MF-UAVs can store only a limited number of contents. The probability $\mathbb{P}_{MF}$ that a content can be found in an MF-UAV is given below:

$$\mathbb{P}_{MF} = \frac{\sum_{h \notin C_A^{i},\ h \in U_i} \mathcal{V}(h)}{\sum_{\forall C} \mathcal{V}(h)} \Rightarrow \mathbb{P}_{MF} = \frac{\sum_{h \notin C_A^{i},\ h \in U_i} \kappa \times \frac{TAD_{min}}{p_{\alpha}(1)} \times \frac{\mathcal{P}_{\alpha}(h)}{TAD_{h}}}{\sum_{h=1}^{C} \kappa \times \frac{TAD_{min}}{p_{\alpha}(1)} \times \frac{\mathcal{P}_{\alpha}(h)}{TAD_{h}}} \tag{9.15}$$

Here, the numerator runs over the (at most $|C_{MF}|$) top-valued contents of $U_i$ carried by the approaching MF-UAV, as described in Section 9.5.2.
The above expression considers the value of the contents from Eqn. 9.11. Now, $\mathbb{P}_A$, the probability of finding a requested content in the local A-UAV of the request-generating community, is expressed as:

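$$\mathbb{P}_A = \frac{\sum_{h \in C_A^{i}} \mathcal{V}(h)}{\sum_{\forall C} \mathcal{V}(h)} \Rightarrow \mathbb{P}_A = \frac{\sum_{h \in C_A^{i}} \kappa\upsilon \times \frac{\mathcal{P}_{\alpha}(h)}{TAD_{h}}}{\sum_{h=1}^{C} \kappa\upsilon \times \frac{\mathcal{P}_{\alpha}(h)}{TAD_{h}}} \tag{9.16}$$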

Combining Eqns. 9.14, 9.15 and 9.16, the average content availability at a community ‘𝑛’ can be 

expressed as: 

$$\mathbb{P}^{n}_{avail} = \mathbb{P}_A + \mathbb{P}_{Access} \times \mathbb{P}_{MF} \tag{9.17}$$

Eqn. 9.17 shows that the contents from A-UAV ‘𝑖’ and contents from future visiting MF-UAVs 

contribute towards the average availability ℙ%

’?’,& at community ‘𝑛’ within the specified 𝑇𝐴𝐷𝑠. 

Note that all contents that are unavailable within the specified 𝑇𝐴𝐷𝑠 are downloaded by the A-UAVs over their expensive vertical links, such as satellite Internet links. Thus, availability indirectly indicates the content download cost in the system.
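The availability bookkeeping of Eqns. 9.15-9.17 can be illustrated numerically with the Python sketch below. It is a minimal sketch under stated assumptions, not the implementation used in this thesis: popularity is assumed to follow a Zipf law, the content value is taken as proportional to 𝒫(ℎ)/𝑇𝐴𝐷ℎ (the constant factors in Eqns. 9.15 and 9.16 cancel in each ratio), the cache index sets are passed in directly, and `p_access` stands in for the accessibility probability of Eqn. 9.14. All function and parameter names are hypothetical.

```python
import numpy as np


def zipf_popularity(num_contents: int, skew: float) -> np.ndarray:
    """Assumed Zipf popularity P(h) for content ranks h = 1..C (stored at index h-1)."""
    ranks = np.arange(1, num_contents + 1, dtype=float)
    weights = ranks ** (-skew)
    return weights / weights.sum()


def average_availability(popularity, tad, a_uav_cache, mf_uav_cache, p_access):
    """Sketch of Eqn. 9.17: P_avail = P_A + P_access * P_MF.

    The content value is taken as P(h) / TAD_h; any constant factor
    (kappa, upsilon, ...) cancels in the ratios of Eqns. 9.15 and 9.16.
    """
    value = np.asarray(popularity, dtype=float) / np.asarray(tad, dtype=float)
    total_value = value.sum()

    local = sorted(set(a_uav_cache))
    # Contents carried by MF-UAVs that are not already cached at the local A-UAV.
    ferried_only = sorted(set(mf_uav_cache) - set(local))

    p_a = value[local].sum() / total_value          # Eqn. 9.16
    p_mf = value[ferried_only].sum() / total_value  # Eqn. 9.15
    return p_a + p_access * p_mf


pop = zipf_popularity(10, skew=0.8)
availability = average_availability(pop, np.ones(10), [0, 1, 2], [2, 3, 4], p_access=0.5)
```

In the usage example, content 2 sits in both caches and is counted only once, through $\mathbb{P}_A$, which is why the MF-UAV term is restricted to contents absent from the local A-UAV.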

The aim of the learning-based caching policy, discussed in the next section, is to achieve the 

above-mentioned benchmark performance in terms of content availability. The proposed learning 

is achieved in a distributed manner in which all UAVs learn the caching policy without a priori 

demand information and without explicit sharing of user request data. 

9.6  Federated Multi-Armed Bandit Learning for Content Caching

9.6.1  Caching Policy using Top-k Multi-Armed Bandit

Upon deployment in a community, an A-UAV's primary task is to optimize content availability for users by determining which contents to download and cache through its vertical link. One way to approach this objective is to use a Top-k Multi-Armed Bandit (Top-k MAB) learning agent within the A-UAV.

The  Top-k  MAB  learning,  a  variant  of  the  classical  Multi-Armed  Bandit  problem  in 

reinforcement learning, is employed to maximize the cumulative reward ℝ(𝑇) over a finite time 

horizon  𝑇  [128].  In  contrast  to  the  traditional  MAB,  this  variant  involves  choosing  𝑘  arms 

simultaneously from a set of 𝑀 arms and receiving individual rewards for each arm selected. 

$$\mathbb{R}_T = \max \left( \sum_{t=1}^{T} \sum_{i=1}^{k} \mathbb{E}\left[\mathbb{R}_t(i)\right] \right) \qquad (9.18)$$

Each A-UAV is assumed to be equipped with a Top-k MAB agent. Here, selecting a content for caching corresponds to choosing an arm, with ‘𝑘’ in ‘Top-k’ representing the caching capacity $C_A$ of the A-UAV. The agent's objective is to choose $C_A$ contents from a larger set of 𝐶 contents in order to maximize content availability for users.

[Figure: Top-k MAB agent at an A-UAV. MF-UAVs arriving along their trajectories deliver Δℱ, Δ𝒢 and contents; the agent calculates Δℒ, updates ℚ, computes the UCB, and loads $C_A$ contents, increasing ferried-content and global availability.]

Figure 9.4. Top-k Multi-Armed Bandit Learning for Caching Policy at A-UAVs

In the UAV-aided content dissemination environment, A-UAVs interact with the environment by selecting specific content sets (i.e., MAB actions) for caching. The environment's feedback for the taken actions is in the form of rewards or penalties. Micro-ferrying UAVs play a vital role in transferring information across the system, contributing to the computation of rewards and penalties. Actions are rewarded when cached contents are requested and served within the tolerable access delay; otherwise, they are penalized.

The learning epoch for each Top-k MAB agent is chosen based on the MF-UAVs' accessibility at the corresponding community; therefore, the epoch duration is influenced by the visiting frequency of MF-UAVs. MF-UAVs carry the content availability information of the A-UAVs already visited along their trajectories. The Top-k MAB agents leverage this information and learn to cache contents through a multi-dimensional reward structure encompassing local, ferrying, and global rewards. These rewards depend on the availability of the sets of locally served contents (ℒ), contents served at other communities via ferrying (ℱ), and contents served across all communities (𝒢). These contents can be served to users directly by an A-UAV or indirectly via the visiting MF-UAVs. If a cached content is served to a user within the given TAD and an increase in content availability is observed, the caching decision for that content is rewarded. The type of reward is determined by the set to which the cached content belongs. The expressions for the three types of rewards are as follows:

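$$\mathbb{R}(i, \mathcal{L}) = [(i \in \mathcal{L}) \wedge (\Delta\mathcal{L} \ge 0)] - [(i \notin \mathcal{L}) \wedge (\Delta\mathcal{L} < 0)] \qquad (9.19)$$

$$\mathbb{R}(i, \mathcal{F}) = \frac{1}{N_A - 1} \sum_{n=1,\, n \neq \mathbb{X}}^{N_A} [(i \in \mathcal{F}) \wedge (\Delta\mathcal{F} \ge 0)] - \frac{1}{N_A - 1} \sum_{n=1,\, n \neq \mathbb{X}}^{N_A} [(i \notin \mathcal{F}) \wedge (\Delta\mathcal{F} < 0)]$$

$$\Rightarrow \mathbb{R}(i, \mathcal{F}) = \frac{1}{N_A - 1} \sum_{n=1,\, n \neq \mathbb{X}}^{N_A} [(i \in \mathcal{F}) \wedge (\Delta\mathcal{F} \ge 0)] + \frac{1}{N_A - 1} \sum_{n=1,\, n \neq \mathbb{X}}^{N_A} \Big( [\neg((i \notin \mathcal{F}) \wedge (\Delta\mathcal{F} < 0))] - 1 \Big) \qquad (9.20)$$

$$\mathbb{R}(i, \mathcal{G}) = \frac{1}{N_A} \sum_{n=1}^{N_A} [(i \in \mathcal{G}) \wedge (\Delta\mathcal{G} \ge 0)] - \frac{1}{N_A} \sum_{n=1}^{N_A} [(i \notin \mathcal{G}) \wedge (\Delta\mathcal{G} < 0)]$$

$$\Rightarrow \mathbb{R}(i, \mathcal{G}) = \frac{1}{N_A} \sum_{n=1}^{N_A} [(i \in \mathcal{G}) \wedge (\Delta\mathcal{G} \ge 0)] + \frac{1}{N_A} \sum_{n=1}^{N_A} \Big( [\neg((i \notin \mathcal{G}) \wedge (\Delta\mathcal{G} < 0))] - 1 \Big) \qquad (9.21)$$

$$\text{where } [A] = \begin{cases} 1, & \text{if } A \text{ is true} \\ 0, & \text{otherwise} \end{cases}$$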

The above equations are used to compute the reward received by the Top-k MAB agent at A-UAV ‘𝕏’. Caching content ‘𝑖’ at A-UAV ‘𝕏’ is rewarded if it leads to an increase in availability. Here, ℝ(𝑖, ℒ), ℝ(𝑖, ℱ), and ℝ(𝑖, 𝒢) are the local, ferrying, and global rewards, respectively. The terms Δℒ, Δℱ, and Δ𝒢 correspond to the increase in local availability, ferried content availability, and global availability, respectively. Each type of reward is contingent upon satisfying the condition 𝑓(𝑖) in the Iverson bracket [𝑓(𝑖)]. The first terms in Eqns. 9.19, 9.20, and 9.21 represent the reward accumulated by caching content ‘𝑖’ at A-UAV ‘𝕏’, whereas the second terms are the penalties associated with adverse conditions. Note that ℝ(𝑖, ℱ) and ℝ(𝑖, 𝒢) are higher if content ‘𝑖’ is requested and served at more communities.
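The reward terms of Eqns. 9.19-9.21 reduce to Iverson brackets and averages, which the following Python sketch makes concrete. It assumes, beyond what the equations state explicitly, that set membership and the availability change are evaluated per community 𝑛 and passed in as parallel lists indexed by community; the function names are hypothetical.

```python
from typing import Sequence


def iverson(condition: bool) -> int:
    """Iverson bracket [A]: 1 if A is true, 0 otherwise."""
    return 1 if condition else 0


def signed_term(in_set: bool, delta: float) -> int:
    """[(i in S) and (delta >= 0)] - [(i not in S) and (delta < 0)]."""
    return iverson(in_set and delta >= 0) - iverson((not in_set) and delta < 0)


def local_reward(in_local: bool, delta_local: float) -> int:
    """Eqn. 9.19: local reward R(i, L)."""
    return signed_term(in_local, delta_local)


def ferrying_reward(in_ferried: Sequence[bool], delta_ferried: Sequence[float], own: int) -> float:
    """Eqn. 9.20: average over the N_A - 1 communities other than A-UAV `own`."""
    others = [n for n in range(len(in_ferried)) if n != own]
    return sum(signed_term(in_ferried[n], delta_ferried[n]) for n in others) / len(others)


def global_reward(in_global: Sequence[bool], delta_global: Sequence[float]) -> float:
    """Eqn. 9.21: average over all N_A communities."""
    terms = [signed_term(s, d) for s, d in zip(in_global, delta_global)]
    return sum(terms) / len(terms)
```

For example, `ferrying_reward([True, True, False, True], [0.1, 0.0, -0.2, 0.3], own=0)` averages +1, -1 and +1 over the three other communities and returns 1/3.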

Learning employs a tabular approach in which a Q-table covering all contents is maintained at each A-UAV. Each content corresponds to a Q-value, or action-value [130], in the Q-table. The Q-value indicates the importance of a content depending on its popularity and request frequency. Additionally, it indirectly captures the geographical relevance of the content, i.e., where in the disaster region the content has been requested. The Top-k MAB agent updates the Q-value of a content at each learning epoch based on the multi-dimensional rewards (Eqns. 9.19-9.21). These rewards are derived from the interactions of an A-UAV's agent with the UAV-aided content dissemination system and shape its understanding of the optimal actions (contents to cache). The recursive Q-value update for content ‘𝑖’ at A-UAV ‘𝕏’ is given as follows:
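$$\mathbb{Q}_{t+1}(i) = (1 - \alpha_L)\,\mathbb{Q}_t(i) + \alpha_L \Big( \mathbb{R}_t(i, \mathcal{L}) + \big[(x, y, z)_{A\text{-}UAV_{\mathbb{X}}} = (x, y, z)_{MF\text{-}UAV}\big] \times \big(\mathbb{R}_t(i, \mathcal{F}) + \mathbb{R}_t(i, \mathcal{G})\big) \Big) \qquad (9.22)$$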

In this context, $\mathbb{Q}_t(i)$ denotes the Q-value associated with content ‘𝑖’ at the $t^{th}$ epoch, and $\mathbb{R}_t(i, \cdot)$ signifies the corresponding reward gained from caching content ‘𝑖’. The term $\big[(x, y, z)_{A\text{-}UAV_{\mathbb{X}}} = (x, y, z)_{MF\text{-}UAV}\big]$ defines the condition inside the Iverson bracket, taking the value 1 if micro-ferrying UAVs are within the communication range of A-UAV ‘𝕏’ and 0 otherwise. The hyper-parameter $\alpha_L$ governs the learning rate.
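A single application of Eqn. 9.22 can be sketched in Python as follows. The position-equality Iverson bracket is replaced here by a boolean `mf_uav_in_range`, an assumption that mirrors the textual description (an MF-UAV within communication range of A-UAV ‘𝕏’) rather than exact position equality; `update_q` is a hypothetical name.

```python
def update_q(q_value: float, alpha: float, r_local: float, r_ferry: float,
             r_global: float, mf_uav_in_range: bool) -> float:
    """Eqn. 9.22: Q_{t+1}(i) = (1 - a) Q_t(i) + a (R_L + [MF in range] (R_F + R_G))."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("learning rate alpha must be in (0, 1]")
    # Ferrying and global rewards only reach A-UAV X while an MF-UAV is present.
    ferry_term = (r_ferry + r_global) if mf_uav_in_range else 0.0
    return (1.0 - alpha) * q_value + alpha * (r_local + ferry_term)


# Three epochs for one content; an MF-UAV is within range only in the second epoch.
q = 0.0  # Q-values start at zero, as described in the text
for in_range in (False, True, False):
    q = update_q(q, alpha=0.1, r_local=1, r_ferry=0.5, r_global=0.5, mf_uav_in_range=in_range)
```

The loop ends with q ≈ 0.361: the second epoch contributes the ferrying and global rewards, while the other two epochs only add the local reward.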

Initially, all the Q-values for contents start at zero, ensuring no prior information for the Top-k 

MAB agent and assigning equal importance to all contents for caching decisions. As the learning 

process advances, Q-values evolve, and the best contents, characterized by the highest Q-values, 

are cached. This approach aims to maximize the cumulative reward, subsequently enhancing the 

caching policy and, in turn, improving content availability.  

The Top-k MAB agent faces a challenge due to the large number of content combinations, $\binom{C}{k}$, that it must sample for caching in order to obtain good estimates of the values of all contents. Infrequent sampling results in weak estimates of the reward distributions, especially as the global content population 𝐶 increases. To address this, 𝜖 and its decay rate are chosen empirically in the 𝜖-greedy action selection policy [139]. To reduce the policy's dependence on 𝜖, an Upper Confidence Bound (UCB) strategy [128] is employed.

$$\mathbb{U}_t(i) = \mathbb{Q}_t(i) + \sqrt{\frac{\varsigma \log(t)}{\mathbb{n}_t(i)}} \qquad (9.23)$$

The UCB, denoted as $\mathbb{U}_t(i)$, is calculated using the updated Q-value $\mathbb{Q}_t(i)$, the controlling hyper-parameter 𝜍, and the number of times $\mathbb{n}_t(i)$ that content ‘𝑖’ has been requested. This strategy aids content selection by favoring items with high reward potential and infrequent requests, thus promoting exploration without introducing an external exploration parameter 𝜖.

Algorithm 9.2. Caching policy at an A-UAV with Top-k MAB Learning

1.  Initialization:
    a.  𝐶: Total contents in the system
    b.  $C_A$: Caching capacity of an A-UAV
    c.  𝕌: Size |𝐶| initialized with 0's (Q-table with UCB)
    d.  𝛼: Learning rate for Q-table update
    e.  𝜍: Degree of exploration (in UCB)
2.  Load the A-UAV's cache with $C_A$ randomly chosen contents.
3.  while True:
4.      Check for a learning epoch at the A-UAV, i.e., the $t^{th}$ epoch
5.      if True then do
6.          for 𝑖 = 0 to length(A-UAV cache of size $C_A$) do
7.              Get reward $\mathbb{R}_t(i, \cdot)$             \\ according to Eqns. 9.19-9.21
8.              Update $\mathbb{Q}_t(i)$ and $\mathbb{U}_t(i)$     \\ from Eqns. 9.22 and 9.23
9.          end for
10.         value = copy(𝕌)                      \\ make a copy of UCB values
11.         for 𝑖 = 0 to length(A-UAV cache of size $C_A$) do    \\ Reload contents (select arms)
12.             $c_{max}$ = argmax(value)
13.             Load $c_{max}$ to A-UAV
14.             Set value[$c_{max}$] = −∞
15.         end for
16.     end if
17. end while
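The arm-selection part of Algorithm 9.2 (Eqn. 9.23 followed by repeated arg-max with −∞ masking) can be sketched in Python as below. Two details are assumptions not specified in the text: contents with no recorded requests ($\mathbb{n}_t(i) = 0$) receive an infinite UCB so that they are tried first, and the epoch counter starts at 𝑡 = 1 so that log 𝑡 is defined. Function names are hypothetical.

```python
import math
from typing import List

import numpy as np


def ucb_scores(q: np.ndarray, counts: np.ndarray, t: int, sigma: float) -> np.ndarray:
    """Eqn. 9.23: U_t(i) = Q_t(i) + sqrt(sigma * log(t) / n_t(i))."""
    scores = np.full(q.shape, np.inf)
    seen = counts > 0
    # Contents never requested keep +inf so that they are explored first (assumption).
    scores[seen] = q[seen] + np.sqrt(sigma * math.log(t) / counts[seen])
    return scores


def select_top_k(scores: np.ndarray, k: int) -> List[int]:
    """Steps 10-15 of Algorithm 9.2: repeated argmax with -inf masking."""
    value = scores.copy()
    chosen = []
    for _ in range(k):
        c_max = int(np.argmax(value))
        chosen.append(c_max)
        value[c_max] = -np.inf  # never pick the same content twice
    return chosen


q_table = np.array([0.1, 0.9, 0.5, 0.3])
cache = select_top_k(ucb_scores(q_table, np.array([5, 5, 0, 5]), t=10, sigma=1.0), k=2)
```

Here content 2 has never been requested, so it is loaded first, followed by content 1, which has the highest Q-value. With 𝜍 = 0 the selection reduces to greedy Top-k on the Q-values; sorting the scores would give the same result apart from tie-breaking, but the explicit loop mirrors the pseudocode.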

The segment-based caching benchmarks in this study are used solely for theoretical comparison and do not reflect how the Top-k MAB framework operates. These benchmarks represent deterministic edge cases for establishing performance bounds, whereas the learning-based Top-k MAB model handles fairness dynamically. Caching decisions are determined through a multi-dimensional reward structure that accounts for individual content popularity, inter-community content influence, and the global impact of caching choices. Unlike benchmarks that impose static constraints, the combinatorial nature of Top-k MAB inherently ensures fairness by dynamically adjusting caching priorities based on real-time content request patterns.

The Q-values computed in the combinatorial bandit model quantify content importance rather than encoding predefined allocation constraints. Caching is limited only by UAV storage capacity, so fairness emerges as a function of storage limitations rather than of arbitrary segment-size constraints.

Furthermore, the caching efficiency of Top-k MAB depends entirely on request generation patterns (i.e., sampling), unlike deterministic benchmarks that rely on predefined segment structures. Larger communities inherently generate higher request volumes, which leads them to receive more cached content organically through bandit learning. This eliminates the need for manually imposed fairness constraints. Similar effects can be expected when content requests signal urgency or emergency; the learning-based model adapts to such request patterns and thereby determines the importance of the corresponding contents.
Complexity Analysis: To evaluate the computational feasibility of the Top-k Multi-Armed Bandit algorithm, we analyze the complexity of its key components: initialization, reward calculation, action selection, and Q-value updates. These components collectively determine the scalability of the algorithm for large-scale UAV-assisted content dissemination systems.

The algorithm begins with an initialization phase in which each A-UAV creates a Q-table to track the rewards associated with all 𝑁 contents (i.e., 𝑁 = 𝐶). This step, performed only once, has a time complexity of 𝑂(𝑁); given its one-time nature, it does not materially affect the scalability of the system. During each learning epoch, the algorithm computes rewards for the 𝑘 arms selected from the 𝑁 contents. Across 𝑇 epochs, this reward calculation, which evaluates the multi-dimensional reward structure, has a complexity of 𝑂(𝑇·𝑘); since 𝑘 ≪ 𝑁, it is light enough for real-time operation. The most computationally intensive step is action selection, where the Upper Confidence Bound (UCB) method ranks all 𝑁 contents to identify the top 𝑘 arms to cache. Sorting the contents at each epoch costs 𝑂(𝑁 log 𝑁), which over 𝑇 epochs leads to a total of 𝑂(𝑇·𝑁 log 𝑁). While this step dominates the overall computational cost, it grows only by a logarithmic factor beyond linear in 𝑁, which keeps it scalable for large content pools. Finally, the Q-values of the selected arms are updated at each epoch based on the observed rewards. This update has a complexity of 𝑂(𝑇·𝑘), which remains manageable because 𝑘 is small relative to 𝑁.

Combining these components, the overall time complexity of the Top-k MAB algorithm is 𝑂(𝑁) + 𝑂(𝑇·𝑘) + 𝑂(𝑇·𝑁 log 𝑁) + 𝑂(𝑇·𝑘). For 𝑘 ≪ 𝑁, the dominant term is 𝑂(𝑇·𝑁 log 𝑁), so the overall time complexity is 𝑂(𝑇·𝑁 log 𝑁), with the action selection step contributing the largest computational cost. This quasi-linear growth in the number of contents makes the algorithm suitable for large-scale systems, even with a large content pool. Furthermore, the selective caching mechanism employed by MF-UAVs, explained in Section 9.6.3, has a linear complexity of $O\big(N_{Trajectory} \cdot (C_{MF} + N_A)\big)$ and complements the learning-based approach by ensuring effective content distribution without introducing significant computational overhead. Here, $C_{MF}$ is the cache size of MF-UAVs, $N_A$ is the number of A-UAVs, and $N_{Trajectory}$ is the number of trajectory changes.

Thus, the overall complexity of the Top-k MAB algorithm is dominated by 𝑂(𝑇·𝑁 log 𝑁), which makes it scalable for large content pools. The linear complexity of the MF-UAV caching mechanism further supports efficient operation in distributed settings. The designed framework is computationally efficient, scalable, and well-suited for real-time UAV-assisted content dissemination in disaster-stricken environments.

There are a few limitations of the above learning-based caching method. First, relying on MF-UAVs to ferry global content availability information makes learning slow, especially in large disaster areas with multiple communities. Second, communities with fewer users generate fewer content requests for their local A-UAVs. The problem is compounded for less popular contents, whose popularity drops sharply under the Zipf distribution (see Section 9.3.2). This lack of requests creates a learning challenge, leading to less accurate reward distribution estimates [140] and hence unstable Q-values for less popular contents. Finally, the Top-k MAB agent has to sample $\binom{C}{k}$ content combinations. This results in infrequent reward estimates as 𝐶 (i.e., the total number of contents in the pool) increases, thus weakening the estimate of the reward distribution [140] and leading to sensitive and unstable Q-values.
These  challenges  can  be  mitigated  by  employing  a  Federated  Multi-Armed  Bandit  (FedMAB) 

Learning approach. Such an approach involves integrating the Top-k MAB models from all A-

UAVs. The mechanism is presented below. 

9.6.2  Distributed Caching with Federated Multi-Armed Bandit 

This mechanism applies the principles of Federated Learning [63], [132], [133], [144] to the UAV-

caching scenario. Each A-UAV serves as a client [144] in Federated Learning, with its local model 

representing information about cached contents, cache hits [72], and content availability. Note that 

cache hits indicate how often a content cached in an A-UAV is requested and served within the 

user-specified 𝑇𝐴𝐷, as defined in Section 9.3.2. The Q-table of the Top-k MAB agent serves as the

model for each A-UAV. 

MF-UAVs act as model aggregators [144], chosen for their ability to access the Q-tables of the A-UAVs along their respective trajectories. They receive the Q-tables from all A-UAVs on their trajectories, aggregate them to improve the Top-k MAB model at each A-UAV, and send the aggregated model back to the A-UAVs. The aggregated model helps the A-UAVs decide which contents to cache based on the top $|C_A|$ Q-values.

As per the standard Federated Learning paradigm, Q-tables are initialized at the A-UAVs, and learning epochs are set based on the MF-UAVs' visiting frequencies. The dependence of the learning epoch on MF-UAV visits is important for capturing the rewards ℝ(ℱ) and ℝ(𝒢), which are associated with the ferried contents and their impact on global availability (refer to Eqns. 9.20 and 9.21). The Q-values of the individual Top-k MAB agents are updated at each epoch, thus capturing the latest content request patterns and A-UAV caching decisions. This is termed “personal experience”, which is akin to the local training stage in Federated Learning. After gaining personal experience, an A-UAV's model is improved by exchanging information with its adjacent A-UAVs through the traveling MF-UAVs. Note that the quality of the aggregated model depends on the freshness of the information ferried by the MF-UAVs. Cost constraints can limit the deployment of high-cost UAVs, leading to less frequent information updates. Therefore, deploying a larger number of affordable MF-UAVs within the UAV-assisted caching system ensures more consistent information collection, which is crucial for maintaining model accuracy under budgetary limitations.

[Figure: an MF-UAV carries Δℱ, Δ𝒢, contents and the estimated popularity $\mathcal{P}_{\mathbb{Y}}$ from A-UAV ‘𝕐’ to A-UAV ‘𝕏’, which compares it with its own estimated popularity $\mathcal{P}_{\mathbb{X}}$ (popularity bars over content IDs 1-7) in a divergence calculation, computes the contribution factor $\mathfrak{C}_{\mathbb{X},\mathbb{Y}}$ (significance of 𝕐 at 𝕏), and updates $\mathbb{Q}^{Ag.}_{\mathbb{X}}$. The aggregated model is computed from the contributions of all A-UAVs; the estimated popularity at every community is derived from the requests received by the A-UAVs.]

Figure 9.5. Contribution factor for Federated Multi-Armed Bandit implementation at A-UAVs

Unlike the aggregation of neural-network weight matrices in regression or classification models [132], the Federated Multi-Armed Bandit involves the aggregation of Q-values. Each A-UAV's contribution during aggregation is determined by its importance, analogous to the weights associated with client devices in the classical Federated Learning algorithm. This importance is defined as a contribution factor, which is crucial during the aggregation process and is calculated from the similarity between the estimated content popularity distributions of the contributing A-UAVs. The mechanism is given by the following expressions:
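$$\mathfrak{C}_{\mathbb{X},\mathbb{Y}} = \frac{\rho - \mathfrak{D}_{KL}(\mathcal{P}_{\mathbb{X}} \| \mathcal{P}_{\mathbb{Y}})}{\sum_{\mathbb{l}=1}^{N_A} \left(\rho - \mathfrak{D}_{KL}(\mathcal{P}_{\mathbb{X}} \| \mathcal{P}_{\mathbb{l}})\right)} \qquad (9.24a)$$

$$\text{where } \rho = \max\left( \sum_{\mathbb{l}=1}^{N_A} \mathfrak{D}_{KL}(\mathcal{P}_{\mathbb{X}} \| \mathcal{P}_{\mathbb{l}}) \right) \qquad (9.24b)$$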

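Using Eqn. 9.24, the aggregated model can be expressed as:

$$\mathbb{Q}^{Ag.}_{\mathbb{X}}(i) = \sum_{\mathbb{y}=1}^{N_A} \left( \mathfrak{C}_{\mathbb{X},\mathbb{y}} \times \mathbb{Q}_{\mathbb{y}}(i) \right) = \frac{\sum_{\mathbb{y}=1}^{N_A} \left( \left(\rho - \mathfrak{D}_{KL}(\mathcal{P}_{\mathbb{X}} \| \mathcal{P}_{\mathbb{y}})\right) \times \mathbb{Q}_{\mathbb{y}}(i) \right)}{\sum_{\mathbb{l}=1}^{N_A} \left(\rho - \mathfrak{D}_{KL}(\mathcal{P}_{\mathbb{X}} \| \mathcal{P}_{\mathbb{l}})\right)} \qquad (9.25)$$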

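In Eqn. 9.24, the contribution factor $\mathfrak{C}_{\mathbb{X},\mathbb{Y}}$ denotes the significance of A-UAV ‘𝕐’'s model when the MF-UAV is at A-UAV ‘𝕏’, and $N_A$ is the total number of A-UAVs. $\mathcal{P}_{\mathbb{X}}$ and $\mathcal{P}_{\mathbb{Y}}$ are the content popularity distributions estimated at A-UAVs ‘𝕏’ and ‘𝕐’, respectively. The KL divergence [134], denoted as $\mathfrak{D}_{KL}(\mathcal{P}_{\mathbb{X}} \| \mathcal{P}_{\mathbb{Y}})$, quantifies the difference between these distributions. Thus, the term $\rho - \mathfrak{D}_{KL}(\mathcal{P}_{\mathbb{X}} \| \mathcal{P}_{\mathbb{Y}})$ represents how similar the content popularity distributions near A-UAVs ‘𝕏’ and ‘𝕐’ are. Note that 𝜌 is included in the expressions because $\mathfrak{D}_{KL}(\cdot \| \cdot)$ is unbounded; since 𝜌 is at least as large as every individual divergence, each similarity term remains non-negative.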

Using the contribution factor from Eqn. 9.24, the aggregated model is determined through Eqn. 9.25, in which $\mathbb{Q}^{Ag.}_{\mathbb{X}}(i)$ signifies the aggregated Q-value of content ‘𝑖’ at A-UAV ‘𝕏’. A higher importance is assigned to A-UAV ‘𝕪’'s model if its estimated content popularity distribution is more similar to that of A-UAV ‘𝕏’, and vice versa.
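Eqns. 9.24 and 9.25 can be sketched with NumPy as follows. The sketch reads 𝜌 in Eqn. 9.24b as the sum of the KL divergences from $\mathcal{P}_{\mathbb{X}}$ to every $\mathcal{P}_{\mathbb{l}}$ (the maximum in that equation is applied to this single sum), adds a small `eps` to avoid log(0), and falls back to uniform weights when all divergences are zero; these choices and all names are assumptions rather than details taken from the text.

```python
import numpy as np


def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """D_KL(p || q) for discrete distributions, smoothed by eps."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def contribution_factors(popularities: np.ndarray, x: int) -> np.ndarray:
    """Eqn. 9.24: C_{X,Y} for every A-UAV Y, as seen from A-UAV X."""
    div = np.array([kl_divergence(popularities[x], p_y) for p_y in popularities])
    rho = div.sum()         # Eqn. 9.24b, read as the sum of divergences
    similarity = rho - div  # non-negative because rho >= each divergence
    if similarity.sum() == 0.0:
        return np.full(len(div), 1.0 / len(div))
    return similarity / similarity.sum()


def aggregate_q(q_tables: np.ndarray, popularities: np.ndarray, x: int) -> np.ndarray:
    """Eqn. 9.25: Q^Ag._X(i) = sum_y C_{X,y} * Q_y(i), for every content i."""
    return contribution_factors(popularities, x) @ q_tables


# Rows are A-UAVs; A-UAVs 0 and 1 share a popularity profile, A-UAV 2 differs.
pops = np.array([[0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
weights = contribution_factors(pops, x=0)
```

Under this reading, the example gives weights of 0.5, 0.5 and 0 for A-UAV ‘𝕏’ = 0, i.e., the model of the dissimilar A-UAV does not contribute to the aggregate at A-UAV 0.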

Aggregating the Q-tables improves the estimated reward associated with each content. In a learning epoch, the requests generated at a community might be insufficient for an accurate reward estimate at its A-UAV; without model aggregation, A-UAVs end up with a weaker estimate of the reward distribution. Q-table aggregation, as proposed above, improves the estimated rewards without requiring content request information from all the A-UAVs.

However, Q-table aggregation overlooks local popularity nuances when demand varies among communities. This issue mirrors the personalization-generalization problem in Federated Learning [144], [145]. An A-UAV's Q-table, updated with personal experience using Eqns. 9.22 and 9.23, is akin to a personalized (local) model, while the aggregated Q-table in Eqn. 9.25 represents a generalized (global) model. This chapter employs weighted averaging to retain the local popularity context while improving reward estimation, as expressed below:

$$\mathbb{Q}^{Upd.}_{\mathbb{X}}(i) = \omega_1 \mathbb{Q}_{\mathbb{X}}(i) + \omega_2 \mathbb{Q}^{Ag.}_{\mathbb{X}}(i) \qquad (9.26)$$

Here, the weights $\omega_1$ and $\omega_2$ determine the influence of the local and global (aggregated) models, respectively, in the updated model $\mathbb{Q}^{Upd.}_{\mathbb{X}}$. For the experiments in this chapter, $\omega_1$ is empirically set to 0.99, indicating a strong preference for local content popularity, which is assumed to be relatively stable over time. However, the choice of $\omega_1$ is pivotal, especially in scenarios where content popularity is dynamic. For such cases, $\omega_1$ is selected adaptively as follows:

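$$\omega_1 = 1 - \frac{\mathfrak{D}_{JS}\left(\mathcal{P}^{t'}_{\mathbb{X}}, \mathcal{P}^{t}_{\mathbb{X}}\right)}{\ln 2}; \qquad \mathfrak{D}_{JS}\left(\mathcal{P}^{t'}_{\mathbb{X}}, \mathcal{P}^{t}_{\mathbb{X}}\right) = \frac{1}{2}\left( \mathfrak{D}_{KL}\left(\mathcal{P}^{t'}_{\mathbb{X}} \| \mathcal{M}\right) + \mathfrak{D}_{KL}\left(\mathcal{P}^{t}_{\mathbb{X}} \| \mathcal{M}\right) \right) \qquad (9.27)$$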

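Here, $\mathfrak{D}_{JS}(\mathcal{P}^{t'}_{\mathbb{X}}, \mathcal{P}^{t}_{\mathbb{X}})$ is the Jensen-Shannon divergence [134], indicating the dissimilarity between the content popularity distributions at times $t'$ and $t$. $\mathcal{M}$ is the mean distribution, calculated as $\mathcal{M} = (\mathcal{P}^{t'}_{\mathbb{X}} + \mathcal{P}^{t}_{\mathbb{X}})/2$, and $\mathfrak{D}_{KL}(\cdot \| \mathcal{M})$ is the Kullback-Leibler divergence. Since the Jensen-Shannon divergence computed with natural logarithms is bounded by $\ln 2$, $\omega_1$ lies in $[0, 1]$. A high $\omega_1$ value implies minimal change in content popularity over time.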
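The adaptive local weight of Eqn. 9.27 can be computed as in the following Python sketch. It assumes that both popularity estimates are given as non-negative vectors over the same content IDs (they are normalized inside the function) and uses natural logarithms so that the ln 2 bound applies; the function names are hypothetical.

```python
import numpy as np


def _kl(a: np.ndarray, b: np.ndarray) -> float:
    """D_KL(a || b) in nats, skipping zero-probability entries of a."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))


def js_divergence(p_prev, p_now) -> float:
    """Jensen-Shannon divergence with mean distribution M = (P_t' + P_t) / 2."""
    p = np.asarray(p_prev, dtype=float)
    q = np.asarray(p_now, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # m > 0 wherever p > 0 or q > 0, so no division by zero
    return 0.5 * (_kl(p, m) + _kl(q, m))


def local_weight(p_prev, p_now) -> float:
    """Eqn. 9.27: omega_1 = 1 - D_JS / ln 2, which lies in [0, 1]."""
    return 1.0 - js_divergence(p_prev, p_now) / np.log(2.0)
```

An unchanged popularity profile yields $\omega_1 = 1$ (keep the local model), while two profiles with disjoint support yield $\omega_1 = 0$.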

The weight $\omega_2$ associated with the aggregated model in Eqn. 9.26 can be expressed as:

$$\omega_2 = e^{-\beta_d t} \times \frac{1 - \mathbb{Q}_{\mathbb{X}}(i)}{\beta_s} \qquad (9.28)$$
Here, $\beta_d$ and $\beta_s$ represent the weight decay factor and the scaling factor, respectively. These parameters ensure that the contribution of the global (aggregated) model decreases as learning progresses. The formulation of Eqn. 9.28 reflects this decreasing relevance of the global model, with an embedded regret component $1 - \mathbb{Q}_{\mathbb{X}}(i)$. The idea rests on the assumption that, as learning progresses, the local models come to reflect the true value of the contents. Substituting the weights $\omega_1$ and $\omega_2$ from Eqns. 9.27 and 9.28 into Eqn. 9.26 gives:

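$$\mathbb{Q}^{Upd.}_{\mathbb{X}}(i) = \left(1 - \frac{\mathfrak{D}_{JS}\left(\mathcal{P}^{t'}_{\mathbb{X}}, \mathcal{P}^{t}_{\mathbb{X}}\right)}{\ln 2}\right) \times \mathbb{Q}_{\mathbb{X}}(i) + \left(\frac{e^{-\beta_d t}}{\beta_s} \times \left(1 - \mathbb{Q}_{\mathbb{X}}(i)\right)\right) \times \mathbb{Q}^{Ag.}_{\mathbb{X}}(i) \qquad (9.29)$$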

The hyper-parameters $\beta_d$ and $\beta_s$ should be optimized empirically to maintain local relevance in content popularity, particularly in heterogeneous environments.
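The personalized update of Eqns. 9.26-9.29 can be sketched per content in Python as follows. This is an illustration, not the thesis code: $\omega_1$ is passed in (either the fixed 0.99 used in the experiments or the adaptive value from Eqn. 9.27), and the $\beta_d$, $\beta_s$ values in the usage line are arbitrary illustrative numbers, not tuned settings.

```python
import numpy as np


def global_weight(q_local: np.ndarray, t: int, beta_d: float, beta_s: float) -> np.ndarray:
    """Eqn. 9.28: omega_2 = exp(-beta_d * t) * (1 - Q_X(i)) / beta_s, per content."""
    return np.exp(-beta_d * t) * (1.0 - q_local) / beta_s


def personalized_update(q_local, q_agg, omega_1: float, t: int,
                        beta_d: float, beta_s: float) -> np.ndarray:
    """Eqns. 9.26 and 9.29: Q^Upd._X = omega_1 * Q_X + omega_2 * Q^Ag._X."""
    q_local = np.asarray(q_local, dtype=float)
    q_agg = np.asarray(q_agg, dtype=float)
    omega_2 = global_weight(q_local, t, beta_d, beta_s)
    return omega_1 * q_local + omega_2 * q_agg


early = personalized_update([0.2, 0.8], [1.0, 1.0], omega_1=0.99, t=0, beta_d=0.1, beta_s=10.0)
late = personalized_update([0.2, 0.8], [1.0, 1.0], omega_1=0.99, t=500, beta_d=0.1, beta_s=10.0)
```

At 𝑡 = 0 the aggregated model still adds 0.08 and 0.02 to the two contents (larger for the content with the lower local Q-value, through the regret term), whereas at 𝑡 = 500 the factor $e^{-\beta_d t}$ has effectively removed the global contribution.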

This framework integrates Federated Learning with Top-k Multi-Armed Bandit principles for high-performance caching at anchor UAVs (A-UAVs). It leverages model aggregation at MF-UAVs and updates the A-UAV models with a balance of personalized and aggregated information. The updated Q-table is used for caching decisions, prioritizing the contents with the highest Q-values to enhance content availability across all user communities.

The dependency on MF-UAVs for model aggregation does not introduce significant bottlenecks, as multiple MF-UAVs operate in parallel to facilitate decentralized updates across different A-UAVs. This naturally distributes the computational load and prevents any single UAV from becoming an aggregation bottleneck. Additionally, the contribution-based weighting mechanism (Eqn. 9.24) assigns larger weights to models from A-UAVs with similar popularity distributions, which limits the influence of less relevant updates and further improves aggregation efficiency.

For even greater scalability, an alternative approach could involve hierarchical aggregation, where A-UAVs first perform localized model updates before forwarding refined models to MF-UAVs for global merging. Such a multi-level aggregation structure could reduce communication overhead in networks with a larger number of UAVs and thereby further enhance system scalability. While the current approach remains effective under the evaluated network conditions, future work can explore hierarchical aggregation strategies to optimize performance in extremely large deployments.

The designed FedMAB framework preserves privacy in federated learning-based caching by limiting the shared information to content popularity distribution parameters and Q-values, which serve as reward distribution estimates. Unlike approaches that transmit graph-based models representing user data or raw request logs, FedMAB does not share individual user content requests, behavioral patterns, or identifiable data. The aggregation process exchanges only learned model parameters relevant to content caching decisions, which makes the framework inherently privacy-preserving. Since Q-values represent estimated rewards from caching actions rather than explicit user information, the system optimizes caching policies without exposing sensitive data. Furthermore, these Q-values are aggregated at MF-UAVs, where updates are computed from local learning experiences without requiring access to the underlying user-generated request data.

This decentralized learning mechanism lets UAVs collaboratively refine caching strategies while preserving privacy. Since no content request logs or personal information are transmitted directly, this structure supports secure model sharing, minimizing the risk of data leakage while maintaining collaborative learning efficiency.
[Figure: four MF-UAVs (MF-UAV 1-4) flying on their trajectories; depending on whether the 𝑇𝐴𝐷 exceeds a threshold, the collectively cached (ferried) content set has size 4·$|C_{MF}|$ or 3·$|C_{MF}|$.]

Figure 9.6. Increase in collective caching capacity of MF-UAVs through Selective Caching

9.6.3  Selective Caching at Micro-Ferrying UAVs (MF-UAVs)

The role of MF-UAVs is to ferry contents from previously visited A-UAVs to future visited A-UAVs, so that the latter benefit from the contents cached at other A-UAVs. Ideally, the MF-UAVs would ferry around a subset of the $C^{J}_{total} + N_A \cdot (1 - \lambda) \cdot C_A$ contents stored across the $N_A$ A-UAVs (see Section 9.5.1). However, such an implementation leads to replication of all ferried contents, resulting in underutilized cache space at the MF-UAVs. Due to the limited caching space per MF-UAV (i.e., $C_{MF}$), their caching policy should be determined jointly from their trajectories, the learnt caching policy at the A-UAVs, content request patterns, and the tolerable access delays (𝑇𝐴𝐷𝑠) associated with the contents to be cached. A “Selective Caching” mechanism for the MF-UAVs is described in the pseudocode below.

Algorithm 9.3. Selective Caching Algorithm at MF-UAV with FedMAB caching at A-UAV

1.  Input: Total A-UAVs in its trajectory, 𝑇𝐴𝐷, next A-UAV ‘𝕏’, present A-UAV ‘𝕏 − 1’
2.  Output: $C_{MF}$ contents for MF-UAV ‘𝒴’
3.  Caching at A-UAVs using FedMAB policy                      // Eqns. 9.19-9.29
4.  while True:
5.      if MF-UAV leaving for next A-UAV ‘𝕏’ then do
            // Contents that are not in the future visiting A-UAV
6.          Update ferrying content knowledge
            // Function call from the present A-UAV ‘𝕏 − 1’
7.          Call content-wise_TAD ( )
            // Present A-UAV sends MF-UAV visiting frequency
8.          Call MF-UAV_visiting_frequency ( )
            // Check what content the last MF-UAV ferried
9.          Call Check_previous_MF-UAV_roster ( )   // Returns roster contents with respective TADs
            // Compute request interval for last MF-UAV roster
10.         Calculate the least popular content's request interval
11.         Check if the request time is less than its TAD and the MF-UAV visiting duration
12.         if True then do
13.             Cache same roster
14.         else
15.             Cache next best roster
16.         end if
17.         Check if other MF-UAVs are flying with MF-UAV ‘𝒴’
18.         for 𝑙 = 0 to length(MF-UAVs flying together) do
19.             for 𝑘 = 0 to length(A-UAV ‘𝕏’ cache $C_A^{\mathbb{X}}$) do
20.                 Check if 𝑘 is in the $C_{MF}$ cache space of MF-UAV ‘𝒴’
21.                 if True then do
22.                     Replace ‘𝑘’ with the highest value content from $C_A^{\mathbb{X}-1}$
                        not cached in MF-UAV ‘𝒴’ and A-UAV ‘𝕏’
23.                 end if
24.             end for
25.             Cache next best roster
26.         end for
27.     end if
28.     Update next A-UAV ‘𝕏’, present A-UAV ‘𝕏 − 1’
29. end while

In Algorithm 9.3, the process of selective caching is described in detail. Consider a situation in which an MF-UAV ‘𝒴’ is ready to leave the A-UAV ‘𝕏 − 1’. Before caching contents, it needs the following information from A-UAV ‘𝕏 − 1’: 1) which contents are eligible for ferrying to A-UAV ‘𝕏’; 2) how frequently MF-UAVs visit; 3) which roster of ferrying content the last MF-UAV ferried, where a roster is a grouping of contents based on their popularity or value; 4) whether the contents of the next roster are likely to be requested within the given 𝑇𝐴𝐷; and 5) whether MF-UAVs are flying in groups. Based on this information, MF-UAV ‘𝒴’ selectively caches contents that help maintain diversity among the contents cached by all MF-UAVs in its vicinity. That is, if MF-UAVs are flying in groups or traversing in close proximity to each other, they ferry contents from consecutive rosters. Note that the size of a roster is the same as an MF-UAV’s cache size, 𝐶_MF. Therefore, if a subset of MF-UAVs is considered collectively as a group of size 𝑁_MF^g, the number of contents cached by the group is 𝑁_MF^g × 𝐶_MF. Such a selective caching policy at MF-UAVs maximizes content availability by avoiding redundant content replication.
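The roster rule can be illustrated with a Python sketch that assigns consecutive rosters to a group of MF-UAVs leaving A-UAV ‘𝕏 − 1’. Each roster is the size of one MF-UAV cache. The sketch rests on four assumptions:

- Eligible contents are those held at ‘𝕏 − 1’ but not yet cached at ‘𝕏’.
- Rosters are formed by sorting eligible contents on a hypothetical per-content value estimate.
- Assignment wraps around if the group outnumbers the rosters.
- The TAD check (item 4 above) is left out.

All names are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def make_rosters(ranked: Sequence[int], roster_size: int) -> List[List[int]]:
    """Split a value-ranked content list into consecutive rosters (roster size = MF-UAV cache size)."""
    return [list(ranked[i:i + roster_size]) for i in range(0, len(ranked), roster_size)]


def assign_rosters(
    value: Dict[int, float],   # hypothetical value estimate per content
    cache_prev: Set[int],      # contents at the present A-UAV 'X - 1'
    cache_next: Set[int],      # contents already cached at the next A-UAV 'X'
    c_mf: int,                 # MF-UAV cache size (= roster size)
    last_roster: int,          # roster index ferried by the previous MF-UAV (-1 if none)
    group_size: int,           # number of MF-UAVs flying together
) -> List[List[int]]:
    """Give each MF-UAV in a group the next consecutive roster of eligible contents."""
    # Eligible for ferrying: available at 'X - 1' but not yet cached at 'X'.
    eligible = sorted(cache_prev - cache_next, key=lambda c: value[c], reverse=True)
    rosters = make_rosters(eligible, c_mf)
    if not rosters:
        return [[] for _ in range(group_size)]
    # Consecutive rosters; wrapping around (an assumption) if the group outnumbers them.
    return [rosters[(last_roster + 1 + l) % len(rosters)] for l in range(group_size)]


caches = assign_rosters(
    value={c: 100 - c for c in range(100)},   # toy values: lower ID means higher value
    cache_prev=set(range(100)),
    cache_next=set(range(10)),
    c_mf=5, last_roster=0, group_size=3,
)
print(caches)  # [[15, 16, 17, 18, 19], [20, 21, 22, 23, 24], [25, 26, 27, 28, 29]]
```

In the example, a group of 3 MF-UAVs with a cache size of 5 carries 3 × 5 = 15 distinct contents. This matches the group-size × cache-size count stated above.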

9.6.4 Enhancing Federated Learning with Controlled Latency

Equipping A-UAVs with federated multi-armed bandit (FedMAB) learning algorithms offers a promising avenue for adaptive learning and decision-making based on user demands and network conditions. However, the model aggregation in FedMAB, while enhancing content delivery services, can inadvertently diminish the benefits of selective caching. This matters because selective caching is crucial for managing a UAV network’s storage resources effectively. To address this, an approach that integrates FedMAB learning at A-UAVs with selective caching at MF-UAVs through controlled latency is proposed. It preserves the benefits of both federated learning at A-UAVs and selective caching at MF-UAVs by introducing controlled latency into the A-UAVs’ learning cycles.

Mechanism Details: The modified FedMAB learning algorithm with latency introduces a deliberate delay in the divergence-based weighted computation updates of A-UAVs. In simpler terms, it adds a delay between the Top-k MAB update and the model aggregation at A-UAVs. This delay is managed through a latency_counter, which tracks the number of learning epochs elapsed since the last federated learning update. Only when this counter reaches the predefined threshold 𝑇_L does the A-UAV proceed with its learning and cache update process via federated learning (refer to Eqns. 9.24-9.29). This controlled latency gives MF-UAVs more time for data analysis and informed decision-making regarding selective caching.
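To make the counter semantics concrete, the following Python sketch replays the latency_counter logic of Algorithm 9.4. The counter starts at 0. The federated update fires and resets it when it reaches the latency threshold (`t_l` in the code); otherwise it is incremented by 1. The sketch lists the epochs at which the federated update fires. It assumes the counter is checked once per learning epoch and omits local Top-k MAB learning, which continues every epoch. The function name `aggregation_epochs` is hypothetical.

```python
from typing import List


def aggregation_epochs(t_l: int, n_epochs: int) -> List[int]:
    """Epochs at which an A-UAV runs its federated update under latency threshold t_l."""
    counter = 0
    fired = []
    for epoch in range(n_epochs):
        # Local Top-k MAB learning would run here in every epoch (omitted).
        if counter >= t_l:
            fired.append(epoch)   # divergence-based weighted aggregation (Eqns. 9.24-9.29)
            counter = 0
        else:
            counter += 1          # postpone the federated learning cycle
    return fired


print(aggregation_epochs(2, 12))  # [2, 5, 8, 11]
print(aggregation_epochs(0, 4))   # [0, 1, 2, 3]
```

Under these assumptions, the federated update recurs every threshold + 1 epochs. A threshold of 0 reduces to aggregating in every epoch, that is, FedMAB without controlled latency.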

During the latency period, A-UAVs continue to collect data, learn via Top-k MAB agents, and 

perform their regular operational functions. However, they postpone the federated learning cycle’s 

execution, allowing MF-UAVs to assess and analyze the cached content across various A-UAVs. 

MF-UAVs can then identify which contents are likely to be in higher demand and ensure their 

availability by ferrying them between A-UAVs. This synchronization of learning with the mobility 

patterns of MF-UAVs enables more strategic and informed decisions regarding content caching 

and distribution. 

Algorithm 9.4. Federated Multi-Armed Bandit Learning with Strategic Latency for A-UAVs 

1.  Input:
    a.  𝐶: Total contents in the system.
    b.  𝐶_A: Caching capacity of an A-UAV.
    c.  𝑇_L: Latency threshold.
2.  Initialization:
    a.  Set latency_counter to 0 for each A-UAV
    b.  Initialize each A-UAV’s cache with 𝐶_A randomly selected contents
    c.  Set Q-values for all contents to 0
        // These values help track content demand.
    d.  Define learning rate (𝛼) and exploration parameter (𝜍)
3.  Main Loop:
4.      While the system is running:
5.            Check if it is time (current epoch) for a learning update
                  // This could be determined by MF-UAV flight time
6.            Calculate the reward for each content 𝑖 in the A-UAV
7.            Update the Q-values of all cached contents using MAB
                  // Based on the calculated reward and the learning rate (𝛼)
8.            If latency_counter >= 𝑇_L then:
9.                  Compute divergence-based weights
10.                 Update Q-values using Eqns. 9.24-9.29
11.                 Reset latency_counter to 0
                        // Indicating an update has been completed.
12.           Else:
13.                 Increment latency_counter by 1
                        // This delays the federated learning update cycle
14.           Copy the Q-values to a temporary list for manipulation
15.           For each slot in the A-UAV’s cache:
16.                 Select the not-yet-cached content with the highest Q-value
17.                 Update the cache to include this content
                        // Replace the least demanded content if necessary.
18.                 Set the selected content’s Q-value to −∞
                        // In the temporary list, to avoid reselection
19.     Repeat steps 4-18 for an adaptive system
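Steps 7 and 14-18 can be sketched in Python as below. Algorithm 9.4 states only that Q-values are updated from the reward with learning rate 𝛼. The incremental rule Q ← Q + 𝛼(ℝ − Q) used here is therefore an assumption, chosen as a common bandit update. The federated step (Eqns. 9.24-9.29) and the latency gating are not reproduced. Contents are integer IDs, `rewards` is assumed to hold one entry per cached content, and the function names are hypothetical.

```python
import math
from typing import Dict, Set


def update_q(q: Dict[int, float], rewards: Dict[int, float], alpha: float) -> None:
    """Step 7 (assumed rule): Q(i) <- Q(i) + alpha * (R(i) - Q(i)) for each rewarded content."""
    for i, r in rewards.items():
        q[i] += alpha * (r - q[i])


def refresh_cache(q: Dict[int, float], c_a: int) -> Set[int]:
    """Steps 14-18: fill the A-UAV's c_a cache slots greedily by Q-value."""
    temp = dict(q)                          # step 14: temporary copy; q is left unchanged
    cache: Set[int] = set()
    for _ in range(min(c_a, len(temp))):    # step 15: one pass per cache slot
        best = max(temp, key=temp.get)      # step 16: highest Q-value not yet cached
        cache.add(best)                     # step 17: include it in the cache
        temp[best] = -math.inf              # step 18: mask it to avoid reselection
    return cache


q = {c: 0.0 for c in range(6)}
update_q(q, rewards={0: 1.0, 3: 0.5, 5: 0.2}, alpha=0.5)
print(refresh_cache(q, c_a=2))  # {0, 3}
```

The slot-by-slot loop amounts to keeping the `c_a` contents with the highest Q-values. `heapq.nlargest(c_a, q, key=q.get)` would return the same set up to tie-breaking, but the loop is kept here to mirror steps 15-18.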

This latency-based approach enhances content availability across the network and optimizes the 

use of network resources, ensuring a balance between learning efficacy and caching efficiency. By 

integrating the dynamic learning capabilities of A-UAVs with the selective caching strategies of 

MF-UAVs, the system becomes more resilient, efficient, and user-centric. 

9.7 Experimental Results and Content Dissemination Performance

Simulation experiments were conducted to evaluate the performance of the designed FedMAB learning-based caching mechanism and of selective caching at micro-ferrying UAVs. An event-driven simulator was used to generate content requests. Inter-arrival times followed an exponential distribution, and content popularity followed a Zipf distribution (see Eqn. 9.1). To account for variations in content popularity across different communities, contents were swapped with a predetermined probability [93], and the differences between the resulting sequences were maintained using the Smith-Waterman distance [125]. The default system parameters for the FedMAB-based caching and cache pre-loading policies are provided in Table 9.1.
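A minimal Python sketch of such a request stream follows. It assumes the standard Zipf form 𝑝_𝑖 ∝ 1/𝑖^𝛼 for Eqn. 9.1, since that equation is not reproduced in this section. It uses the Table 9.1 defaults of 2000 contents, Zipf parameter 0.4, and a request rate of 1 request/sec. The community-specific content swapping [93] is omitted, and the function names are hypothetical.

```python
import itertools
import random
from typing import List, Tuple


def zipf_probs(n_contents: int, alpha: float) -> List[float]:
    """Zipf popularity: p_i proportional to 1 / i**alpha for ranks i = 1..n (assumed form)."""
    weights = [1.0 / (i ** alpha) for i in range(1, n_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def generate_requests(n_contents: int, alpha: float, mu: float,
                      horizon: float, seed: int = 0) -> List[Tuple[float, int]]:
    """Return (arrival_time, content_rank) pairs up to `horizon` seconds.

    Inter-arrival gaps are exponential with rate mu, so arrivals form a Poisson
    process; content ranks (0-based) are drawn from the Zipf distribution.
    """
    rng = random.Random(seed)
    cum = list(itertools.accumulate(zipf_probs(n_contents, alpha)))
    ranks = range(n_contents)
    t, requests = 0.0, []
    while True:
        t += rng.expovariate(mu)           # exponential gap between events
        if t > horizon:
            return requests
        requests.append((t, rng.choices(ranks, cum_weights=cum, k=1)[0]))


reqs = generate_requests(n_contents=2000, alpha=0.4, mu=1.0, horizon=600.0)
```

Precomputing cumulative weights avoids re-summing the 2000 probabilities for every request.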

Table 9.1: Default Values for Model Parameters

#    Variable                                                          Default Value
1    Total number of contents, 𝐶                                        2000
2    Number of A-UAVs, 𝑁_A                                              4
3    Number of MF-UAVs, 𝑁_MF                                            8
4    A-UAV’s cache space (content count), 𝐶_A                           200
5    MF-UAV’s cache space, 𝐶_MF                                         25
6    Poisson request rate parameter, 𝜇 (request/sec)                    1
7    Hover rate of MF-UAV, 𝑅_hover = 𝑇_hover/𝑇_trajectory                1/6
8    Transit rate of MF-UAV, 𝑅_transit = 𝑇_transit/𝑇_trajectory          1/12
9    Zipf parameter (popularity), 𝛼                                     0.4
10   Micro-ferrying UAV trajectory                                      Round-robin

The simulation models the impact of lateral link range on content dissemination. An MF-UAV begins serving content upon entering the WiFi transmission range of a community, even before reaching its boundary. The duration of this early transmission, denoted Δ𝑡_comm, depends on the MF-UAV’s transit speed. If Δ𝑡_comm is significantly shorter than the Poisson-distributed content request generation time (𝑇_req), the adjusted hover time remains approximately unchanged (𝑇′_hover ≈ 𝑇_hover). Conversely, if Δ𝑡_comm is comparable to or exceeds 𝑇_req, the adjusted hover time increases to 𝑇′_hover ≈ 𝑇_hover + Δ𝑡_comm. At the same time, the transit time decreases to 𝑇′_transit ≈ 𝑇_transit − Δ𝑡_comm.
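The adjustment rule above can be sketched as a small Python function. The text says only that the early-serving duration (`dt_comm` in the code) must be "significantly shorter" than the request generation time (`t_req`) for the hover time to stay unchanged. The cut-off ratio of 0.1 used below is therefore a hypothetical choice. Clamping the transit time at zero is an added safeguard. The example applies the Table 9.1 hover and transit rates to an assumed trajectory time of 120 s.

```python
from typing import Tuple


def adjusted_times(t_hover: float, t_transit: float, dt_comm: float,
                   t_req: float, ratio: float = 0.1) -> Tuple[float, float]:
    """Return (adjusted hover time, adjusted transit time) for an MF-UAV.

    dt_comm: time spent serving a community while still in transit (lateral link).
    t_req:   content request generation time.
    ratio:   hypothetical cut-off for "dt_comm is significantly shorter than t_req".
    """
    if dt_comm < ratio * t_req:
        return t_hover, t_transit                 # adjusted hover ~ original hover
    # Serving time shifts from transit to hover; transit is clamped at zero (safeguard).
    return t_hover + dt_comm, max(t_transit - dt_comm, 0.0)


t_traj = 120.0                                    # assumed trajectory time (s)
t_hover, t_transit = t_traj / 6, t_traj / 12      # hover rate 1/6, transit rate 1/12
print(adjusted_times(t_hover, t_transit, dt_comm=3.0, t_req=1.0))   # (23.0, 7.0)
print(adjusted_times(t_hover, t_transit, dt_comm=0.05, t_req=1.0))  # (20.0, 10.0)
```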

The performance of the designed mechanism was evaluated using the following metrics:

Content Availability (𝑃_avail): This is the ratio of cache hits within a tolerable access delay to the number of generated requests. A cache hit is content provided to users from UAV-cached content without needing a download. Content availability therefore indirectly reflects the content download cost of the system.

Cache Distribution Optimality (CDO): This metric assesses the optimality of the learned caching policy in terms of the caching sequence. CDO is measured with the Jaro-Winkler Similarity (JWS) [143], which compares the content sequence from the learned caching policy with the cache pre-loading sequence. JWS considers the number of matches, the required transpositions, and the prefix similarity of both sequences. A normalized similarity measure is used, where 1 indicates optimal caching and 0 indicates non-optimal caching.

Access Delay (𝐴𝐷): The FedMAB model and the selective caching policy for micro-ferrying UAVs are also evaluated in terms of access delay. This is the end-to-end delay between the generation of a content request and its provisioning from the contents cached in the UAVs. This chapter reports the epoch-wise average access delay to show the improvement in the caching policy as learning progresses.
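Two of these metrics can be sketched in Python for illustration. The access-delay bookkeeping is an assumed representation: `None` marks a request that had to be downloaded. The Winkler prefix weight of 0.1 and the maximum prefix of 4 are the customary defaults, not values stated in this chapter. Sequences are lists of content IDs, so the same function can compare a learned cache sequence with a pre-loading sequence.

```python
from typing import Optional, Sequence


def content_availability(delays: Sequence[Optional[float]], tad: float) -> float:
    """Share of generated requests served from UAV caches within the TAD.

    delays[i] is the access delay of request i, or None if it was not served
    from any UAV cache (i.e. it had to be downloaded).
    """
    hits = sum(1 for d in delays if d is not None and d <= tad)
    return hits / len(delays) if delays else 0.0


def jaro_winkler(a: Sequence, b: Sequence, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler similarity of two sequences: 1 = identical, 0 = no similarity."""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)   # matching distance
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, x in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == x:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched elements appearing in a different order, halved.
    a_seq = [x for x, hit in zip(a, a_hit) if hit]
    b_seq = [y for y, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    m = float(matches)
    jaro = (m / len(a) + m / len(b) + (m - transpositions) / m) / 3.0
    # Winkler adjustment: reward a common prefix of up to max_prefix elements.
    prefix = 0
    for x, y in zip(a, b):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * p * (1.0 - jaro)


print(content_availability([1.0, None, 5.0, 2.0], tad=2.0))  # 0.5
```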

9.7.1 Effect of Controlled Latency Induced Federated Learning on Content Availability

To understand the applicability of the designed FedMAB-based caching policy together with selective caching, experiments were conducted with different durations of controlled latency. This is achieved with caching policies learnt through models that update with different levels of latency. Each MAB model uses a hybrid exploration strategy combining UCB and 𝜖-greedy, with the degree of exploration set to 2. In addition, to show the effectiveness of selective caching at micro-ferrying UAVs (MF-UAVs), the TAD ratio 𝑅_TAD for contents 51 to 75 is kept lower than the default 𝑅_TAD of 1/8. Note that TADs are represented as a ratio with respect to the trajectory time (𝑇_trajectory) to ensure the generalizability of the designed algorithms. Figure 9.7 shows the convergence behavior of the caching policy learnt with the FedMAB model at the A-UAVs and selective caching at the MF-UAVs. The comparison emphasizes the effects of controlled latency.
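This section does not spell out how UCB and 𝜖-greedy are combined. The Python sketch below therefore assumes one plausible hybrid for a Top-k caching agent. Contents are first ranked by a UCB score 𝑄 + 𝑐·√(ln 𝑡 / 𝑛). Each of the 𝑘 slots is then swapped, with probability 𝜖, for a random not-yet-chosen content. Treating the degree of exploration (2) as the UCB coefficient 𝑐 is an assumption, as are 𝜖 = 0.1 and the function names.

```python
import math
import random
from typing import List


def ucb_scores(q: List[float], counts: List[int], t: int, c: float) -> List[float]:
    """UCB score per content, Q + c * sqrt(ln t / n); never-requested contents get +inf."""
    return [math.inf if n == 0 else qi + c * math.sqrt(math.log(t) / n)
            for qi, n in zip(q, counts)]


def hybrid_topk(q: List[float], counts: List[int], t: int, k: int,
                rng: random.Random, c: float = 2.0, epsilon: float = 0.1) -> List[int]:
    """Pick k content indices to cache (assumed UCB + epsilon-greedy hybrid).

    Contents are ranked by UCB score; each of the k slots is then replaced,
    with probability epsilon, by a uniformly random content not yet chosen.
    Requires t >= 1 (the current learning round).
    """
    scores = ucb_scores(q, counts, t, c)
    chosen = sorted(range(len(q)), key=lambda i: scores[i], reverse=True)[:k]
    for slot in range(len(chosen)):
        if rng.random() < epsilon:
            pool = [i for i in range(len(q)) if i not in chosen]
            if pool:
                chosen[slot] = rng.choice(pool)
    return chosen


rng = random.Random(42)
print(hybrid_topk([0.1, 0.9, 0.5, 0.3], [5, 5, 5, 0], t=20, k=2, rng=rng, epsilon=0.0))  # [3, 1]
```

With 𝜖 = 0 the rule is pure UCB. A content that has never been requested gets an infinite score and is cached first, which is the source of the prompt early exploration attributed to UCB in Section 9.7.2.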

Figure 9.7. Increase in content availability by controlling the learning latency in Federated 

Learning aided caching policy 

The convergence behavior is shown in terms of relative content availability: the ratio between the content availability achieved by the designed method and that of the deterministic baseline method from Eqns. 9.13-9.17. The key outcomes are as follows.

First, the best content availability is achieved with the maximum induced latency when implementing FedMAB to learn the caching policy. This parameter controls when the divergence-based weight computation, and eventually the aggregation of the Top-k MAB models, is applied (refer to Eqns. 9.24-9.29 and Algorithm 9.4).

Second, learning is more prompt in the models with the least latency or no latency. However, their converged learning performance is subpar, as seen in the attained content availability.

Third, learning progresses more slowly as the controlled latency for aggregation increases, whereas the converged learning performance improves with it. For the least controlled latency, the individual model’s epoch-wise reward estimate is weak (𝔼[ℝ] ≠ 𝕣∗) because only a limited number of content requests are experienced within an epoch. Here, ℝ is the reward received during the 𝑡-th epoch and 𝕣∗ is the true reward. Also, due to the mobility of the MF-UAVs, access to ferrying and global content availability information cannot be guaranteed, leading to a weak and sensitive reward estimate. In contrast, with a high controlled latency, the individual model’s reward estimate is substantially stable, i.e., 𝔼[ℝ] ≈ 𝕣∗. Additionally, owing to the induced latency for model aggregation, the content availability information from adjacent communities can be accessed via MF-UAVs with high likelihood. This leads to a better overall reward estimate, thereby improving the content caching policy. However, the explicit introduction of latency into the learning algorithm makes the model update process sluggish, as can be seen in Figure 9.7.

[Figure 9.8 plots. Left panel: "Importance of Information on Model Update", Relative Content Availability vs. Epochs. Right panel: "Fairness in Content Delivery across All Communities", Content Availability Standard Deviation vs. Epochs. Legend (both panels): Estimation; Multi-Armed Bandit (UCB+𝜖-greedy); Top-k MAB + Selective Caching; FedMAB_Sel.]
Figure 9.8. (Left) Evolution of learning-based caching policy with information sharing; (Right) 

Uniformity of performance at all A-UAVs 

9.7.2 Evolution of Learning based Caching Policies and their Impacts

The evolution of the learning-based caching policies designed in this chapter, and their comparison, are shown in terms of relative content availability. The observations from Figure 9.8 are as follows.

First, the figure shows that by employing the FedMAB model along with selective caching, a caching policy can be learnt that provides content dissemination performance close to the benchmark performance [93]. The benchmark performance, using Value-based Caching, is calculated with the aid of a priori information on content popularity and takes into account the heterogeneity in user demand (see Eqns. 9.9-9.12). The designed FedMAB algorithm leverages the multi-dimensional reward structure and divergence-based weighted aggregation to account for heterogeneity [146], [147], as explained in Eqns. 9.19-9.29. This lets it learn the caching policy on the fly (see Sections 9.6.1 and 9.6.2).

Second, the selective caching policy at micro-ferrying UAVs leverages the information shared among the MF-UAVs and with the A-UAVs. It boosts content availability by approximately 20%, bringing it closer to the benchmark performance (see Figure 9.8). Each MF-UAV uses the currently visited A-UAV’s caching information and the preceding MF-UAV’s caching decision to algorithmically select its own contents for caching, as also shown in Figure 9.6. Such selective caching reduces the redundancy of multiple copies of the same content being available through multiple sources at the same time.

Third, the differences in the efficacy and limitations of selective caching can be observed in Figures 9.7 and 9.8, where caching decisions at MF-UAVs differ due to the model aggregation in the two scenarios. The effectiveness of controlled latency can be seen in Figure 9.8: the benefits of divergence-based weighted aggregation are preserved while the advantages of selective caching are retained.

Fourth, when the agent uses the UCB exploration strategy, content availability increases promptly during the initial learning epochs. The low sampling of requests keeps the upper confidence values of all contents high, which avoids excessive exploitation. As learning progresses, the sparse requests for unpopular contents keep their upper confidence values high, which maintains consistent exploratory behavior. Figure 9.8 shows that this exploration strategy alone boosts content availability by approximately 10% more than popular estimation-based methods [78], [79], [80], [81], bringing it closer to the benchmark performance.

Finally, the standard deviation of the performance across all A-UAVs is recorded to show the progression of the learning-based caching policy. Note that FedMAB_Sel shows the lowest standard deviation, indicating the highest level of fairness in performance. Here, 𝜎 is computed as an average over 150 learning epochs. Also, in contrast to its content availability behavior, the MAB algorithm with the hybrid action selection strategy shows a less uniform improvement in this metric than the estimation-based methods. This can be attributed to the intermittent accessibility of MF-UAVs, which limits information access. This behavior is not seen in either learning variant with selective caching, as the caching information is spread across multiple MF-UAVs.
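The fairness measure can be sketched in Python as below. The chapter states only that 𝜎 is averaged over 150 learning epochs. Using the population standard deviation across A-UAVs in each epoch is an assumption, as is averaging over non-overlapping 150-epoch windows. The input layout `avail[e][u]` and the function name are hypothetical.

```python
import statistics
from typing import List, Sequence


def fairness_sigma(avail: Sequence[Sequence[float]], window: int = 150) -> List[float]:
    """Average across-A-UAV standard deviation of content availability per window.

    avail[e][u] is the content availability of A-UAV u in epoch e. For each
    non-overlapping block of `window` epochs, the per-epoch population standard
    deviation across A-UAVs is averaged; lower values indicate fairer delivery.
    """
    per_epoch = [statistics.pstdev(row) for row in avail]
    return [statistics.fmean(per_epoch[s:s + window])
            for s in range(0, len(per_epoch) - window + 1, window)]


rows = [[0.5, 0.5, 0.5, 0.5]] * 150 + [[0.4, 0.6, 0.4, 0.6]] * 150
print(fairness_sigma(rows))  # two windows: about 0.0 and 0.1
```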

Discussion: FedMAB_Sel allows the caching policy to evolve the most, such that an increased 𝑇𝐴𝐷 is leveraged to achieve the highest content availability. Top-k MAB with selective caching and the multi-dimensional reward structure follow, in that order. These observations deepen the understanding of the components of FedMAB_Sel.

With a high 𝑇𝐴𝐷, the ferrying and global rewards, ℝ(𝑖, ℱ) and ℝ(𝑖, 𝒢) respectively, bring the estimated reward ℝ_T closer to the true mean. This is because a high 𝑇𝐴𝐷 gives the MF-UAVs more time to transit before the request expires. Furthermore, the selective caching algorithm becomes more effective with a high 𝑇𝐴𝐷, since it allows more MF-UAVs to collaborate and thus avoid caching copies of the same contents amongst themselves. The last component of FedMAB_Sel, the divergence-based weighted updates of the models, gives each A-UAV explicit knowledge of the content popularity at adjacent communities, thereby avoiding content replication at A-UAVs.

Together, these components of FedMAB_Sel increase content availability, which is the primary objective of this work. Note that a high value of 𝑇𝐴𝐷 also allows unconstrained application of controlled latency, which proves beneficial in boosting content availability, as shown in Figures 9.7 and 9.8.

[Figure 9.9 plots. Left panel: "Promptness with Controlled Learning Latency". Right panel: "Promptness with Excessive Learning Latency". Both panels: Relative Content Availability vs. Epochs; legend: Top-k MAB + Selective Caching; FedMAB_Sel.]

Figure 9.9. Balance between reactiveness and performance of the FedMAB_Sel caching policy in the case of time-varying user preferences

9.7.3  Adaptability with Changing User Preferences 

The adaptability of the developed learning-based caching mechanisms is further illustrated in Figure 9.9. It shows the ability of the FedMAB_Sel approach to learn the caching policy in a setting where user preferences change over time. It also highlights the reactive nature of FedMAB_Sel, whose content availability increases more promptly than that of the standalone Top-k MAB implementation or any of its predecessors. Note that dynamic user preference patterns are simulated using Smith-Waterman distance-based sequence swapping [125] and by changing the Zipf parameter (refer to Eqn. 9.1).

The reactiveness of the learning-based caching policies is compared using three measures: reactiveness time (𝜓), lowest performance point (𝜒), and crossover ratio (𝜁). Reactiveness time (𝜓) captures the time taken for the system to start improving its performance after the demand scenario changes. 𝜒 represents the lowest point in performance after the demand scenario change, just before the system begins to recover. Crossover ratio (𝜁) is the time remaining after one algorithm’s performance surpasses another’s (i.e., 𝜏 − 𝜏_c), relative to the time constant 𝜏, which is the duration of the fixed demand scenario. Here, 𝜏_c is the time at which an algorithm’s performance surpasses another’s. Therefore, the crossover ratio can be expressed as 𝜁 = (𝜏 − 𝜏_c)/𝜏. For interpretability, 𝜏_c = 𝜏 when the performance of an algorithm does not surpass that of its predecessor. This gives 𝜁 = 0 and indicates that there is no relative improvement in performance within the time constant 𝜏.

FedMAB_Sel exhibits relatively lower values of 𝜓 and higher values of 𝜒 at every level of controlled latency. This indicates the promptness of the developed caching method compared to its predecessors. The crossover ratio 𝜁, on the other hand, reveals a more nuanced picture. For a controlled latency of 2 epochs, 𝜁 is high, but it decreases for a latency of 10 epochs, albeit with improved relative performance. This shows that a high controlled latency in the divergence-based weighted updates of FedMAB_Sel can improve performance significantly, but at the cost of the model’s reactiveness. Under the realistic assumption that user preferences have a high time constant 𝜏, FedMAB_Sel is therefore relatively more reactive than the learning-based caching mechanisms discussed above.
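The three reactiveness measures can be computed from per-epoch performance traces, as in the Python sketch below. It makes four assumptions:

- Each trace starts at the demand scenario change and spans one fixed-demand period of 𝜏 epochs.
- 𝜓 is the first epoch after which performance rises.
- The crossover time (`tau_c` in the code) is the first epoch at which algorithm A strictly outperforms algorithm B.
- Traces are smooth; a real, noisy trace would likely need smoothing first.

The function names are hypothetical.

```python
from typing import Sequence, Tuple


def psi_chi(perf: Sequence[float]) -> Tuple[int, float]:
    """Reactiveness time psi and lowest performance point chi after a demand change.

    psi is the first epoch after which performance increases; chi is the
    performance at that epoch, i.e. just before recovery begins.
    """
    for e in range(len(perf) - 1):
        if perf[e + 1] > perf[e]:
            return e, perf[e]
    return len(perf) - 1, perf[-1]          # no recovery observed in the trace


def crossover_ratio(perf_a: Sequence[float], perf_b: Sequence[float]) -> float:
    """zeta = (tau - tau_c) / tau for algorithm A against algorithm B.

    tau is the length of the fixed-demand period; tau_c is the first epoch at
    which A surpasses B, set to tau when A never does (so zeta = 0).
    """
    tau = len(perf_a)
    tau_c = next((t for t in range(tau) if perf_a[t] > perf_b[t]), tau)
    return (tau - tau_c) / tau


a = [0.50, 0.40, 0.45, 0.60, 0.70]
b = [0.60, 0.55, 0.50, 0.50, 0.50]
print(psi_chi(a))             # (1, 0.4)
print(crossover_ratio(a, b))  # 0.4
```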

Note that while the Zipf model is used to represent content request generation patterns, the designed Federated Multi-Armed Bandit (FedMAB) framework remains agnostic to the specific distribution governing user requests. The caching decisions are purely data-driven, relying on observed request patterns rather than prior assumptions about the underlying distribution. If content requests were generated according to an alternative distribution, such as a normal, Student’s t, or beta distribution, the learning process would naturally adjust the caching policies to match the observed demand structure. This adaptability allows the system to remain effective under diverse content popularity dynamics, optimizing caching decisions based on real-world user behavior rather than a predefined statistical model.

Discussion: The TAD for the experiments is chosen to be less than the combined hovering and transit durations. This emphasizes the reactive nature of the algorithm by constraining the duration allowed for a request before it is served via download. Setting the TAD too high allows the MF-UAVs to reduce the caching frequency of those contents. Conversely, for a very low TAD, the model overestimates the value of those contents, which leads to them being cached at A-UAVs for ready availability.

Figure 9.10. Access delay as a determinant for the choice of learning-based caching policy (two viewing perspectives)

9.7.4 The Interplay Between Learning Latency and Content Access Delay

The choice of learning-based caching policy with respect to access delay is highlighted in Figure 9.10. This figure emphasizes the components whose combination leads to the proposed FedMAB_Sel caching policy: the multi-dimensional reward structure, selective caching, and divergence-based weighted aggregation. It also scrutinizes the behavior of these components under the influence of shared information in different learning-based caching scenarios. The observations are as follows.

First, both Figures 9.8 and 9.10 show that content availability and access delay improve as shared information increases, irrespective of the caching policy used. However, the efficacy of the different versions of the learning-based caching policies varies.

Second, Figure 9.10 demonstrates the effect on access delay of applying different versions of the learning-based caching policy as learning progresses. For each learning-based caching policy, the delay decreases as the number of epochs increases, which is intuitive. Along with more contents being provided from the hierarchical UAV-aided content dissemination system, the delay decreases since more relevant contents are stored at A-UAVs.

Third, with the multi-dimensional reward structure and the hybrid action selection strategy, access delay decreases, since the relevance of the contents cached in both A-UAVs and MF-UAVs improves.

Finally, as the learning-based caching policy evolves, the access delay decreases up to a certain point. However, an increase in delay can be seen for the FedMAB_Sel-based policy, for several reasons. When the model evolves from standalone MAB to Top-k MAB with the multi-dimensional reward structure, the caching policy of the A-UAVs improves, leading to high-value content being cached at A-UAVs. This results in better content being ferried via the MF-UAVs without adding significant delay. FedMAB_Sel improves the caching policy of the A-UAVs and jointly improves the policies of the MF-UAVs. This allows more content to be ferried from adjacent communities before the requests’ lifetime expires, which increases the dependence on the hierarchical UAV-aided dissemination system for content provisioning. Therefore, an increase in access delay is observed along with a boost in content availability, which is the primary objective of this work (refer to Section 9.4).

Discussion: The designed framework inherently mitigates latency through a combination of controlled learning updates, demand-aware caching policies, and dynamic UAV coordination. As detailed in Algorithm 9.4, the system regulates the timing of federated model updates to balance responsiveness and caching stability. This ensures that UAVs do not prematurely overwrite cached content while remaining adaptable to real-time requests. The incorporation of Tolerable Access Delay (TAD) constraints ensures that content with tighter delay requirements is prioritized at A-UAVs, while content with more relaxed delay tolerance is efficiently transferred via MF-UAVs.

Additionally, high-demand conditions naturally strengthen the learning process of the Federated Multi-Armed Bandit (FedMAB) framework. Increased accessibility of MF-UAVs in such scenarios exposes the model to a broader range of content requests, which allows it to refine caching decisions based on diverse demand patterns. However, local high-demand conditions are not entirely dependent on MF-UAVs; each UAV continues learning independently to track sudden surges in requests. This keeps the personalized Q-table highly reliable, as increased demand inherently leads to higher sampling, which is particularly beneficial for combinatorial bandits. The higher sampling rate gives UAVs stronger confidence in the learned content priorities. It also improves caching efficiency through a high confidence contribution factor ℭ𝕏,𝕐 and reduces overall latency.

The results in Figures 9.7 and 9.10 empirically validate these latency mitigation mechanisms. The system leverages controlled federated learning updates, demand-aware caching, and increased sampling during high-demand periods. In this way it maintains low access delays under varying demand conditions, which supports robust performance in real-world UAV-assisted content dissemination scenarios.

[Figure 9.11 plot: "Jaro-Winkler Similarity", CDO vs. Epochs. Legend: Estimation; Multi-Armed Bandit (UCB+𝜖-greedy); Top-k MAB + Selective Caching; FedMAB_Sel.]

Figure 9.11. Learnt cached content sequence’s similarity with benchmark sequence 

9.7.5 Cache Similarity of Learnt Sequence with Best Sequence

The effects of learning on the cached content sequence are demonstrated in Figure 9.11. It plots the Cache Distribution Optimality (CDO) of the cached content sequences for all the A-UAVs in terms of Jaro-Winkler Similarity (JWS). The key observations are as follows.

First, the average 𝐶𝐷𝑂 converges near 0.95, with lower variance than for its Top-k MAB predecessor. This 𝐶𝐷𝑂 compares the benchmark caching sequence from the cache pre-loading policy (see Section 9.5) with the cached content sequences learnt by the FedMAB agents at A-UAVs. Physically, this represents a high degree of similarity after convergence, where 1 indicates complete similarity and 0 implies no similarity.

Second, the cached contents improve over epochs as learning progresses. The lower 𝐶𝐷𝑂 values in the initial epochs reflect that the A-UAVs have no a priori local or global content popularity information. As the MAB agents learn over multiple epochs of generated content requests, the cached contents in the A-UAVs become increasingly similar to the optimal caching sequence, which, in turn, improves the efficacy of FedMAB.

Third, 𝐶𝐷𝑂 is an indirect representation of the storage segmentation factor (𝜆), which is used to decide the segment sizes according to cache pre-loading policies [91], [142]. A higher 𝐶𝐷𝑂 implies that, in addition to learning the caching policy, the FedMAB agents learn to emulate this segmentation behavior.

Finally, the partial dissimilarity of the cached content sequence can be ascribed to the uncertainty (or regret) associated with the Q-values of contents with low popularity. This also leads to an oscillatory convergence of 𝐶𝐷𝑂 for the A-UAVs.

The impacts of selective caching at micro-ferrying UAVs can be distinctly seen in Figure 9.6. Selective caching at the MF-UAVs, along with the Top-k MAB caching agent at A-UAVs, leads to a 𝐶𝐷𝑂 of nearly 0.9, albeit with some variance. This depends on the effective caching capacity of the MF-UAVs. That capacity is dictated by the 𝑇𝐴𝐷s associated with content requests and the MF-UAVs’ visiting frequency at A-UAVs (refer to Algorithm 9.3). The dependence of the contents’ Q-values on such information also adds to the post-convergence oscillation. FedMAB mitigates such oscillatory uncertainties by enhancing the value difference between in-demand and low-demand contents, thereby improving the expected reward. Note that for the computation of 𝐶𝐷𝑂, the benchmark caching sequence is derived by considering the same effective caching capacity as the selective caching algorithm at the micro-ferrying UAVs.

9.8 Conclusion

In this chapter, we design a micro-UAV-assisted content dissemination system that learns caching policies on the fly without prior knowledge of content popularity. Two types of UAVs are introduced for content provisioning in disaster or war-stricken scenarios: anchor UAVs and micro-ferrying UAVs. Cache-enabled anchor UAVs are stationed at each stranded community of users to provide uninterrupted content delivery, while micro-ferrying UAVs act as content transfer agents between the anchor UAVs.

To  overcome  the  limitations  of  existing  caching  methods,  we  introduce  a  decentralized 

Federated Multi-Armed Bandit (FedMAB) learning-based caching policy. This method leverages 

the collective intelligence of all A-UAVs to increase promptness in learning the caching policy, 

while reducing the redundant copies of the contents across the network. The policy at each A-UAV 

learns the caching decisions dynamically by maximizing an estimated multi-dimensional reward 

aimed at increasing both local and global content availability. Our results show that the FedMAB 

learning-based caching policy achieves approximately 88% of the maximum achievable content 

availability. 

To further improve the Q-value estimates, we implement a Selective Caching Algorithm at the micro-ferrying UAVs. This method leverages information shared between anchor UAVs and micro-ferrying UAVs to further reduce redundant content copies and to provide a better estimate of the most popular content within a community. Combining selective caching at micro-ferrying UAVs with the FedMAB learning-based caching policy at anchor UAVs boosts content availability to approximately 94% of the maximum achievable level. With the designed caching policies, a scaled-up micro-UAV-assisted network is shown to attain content availability close to the maximum achievable content availability.
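The intuition behind selective caching, ferrying only what the destination A-UAV does not already hold, can be illustrated with the short sketch below. The ranking rule, the helper names `selective_ferry_cache` and `availability`, and the demand values are hypothetical; the sketch only contrasts duplicate-avoiding selection with naive top-popularity pre-loading and is not the Selective Caching Algorithm itself.

```python
def selective_ferry_cache(q_values, dest_cache, capacity):
    """Choose up to `capacity` contents for an MF-UAV to ferry (illustrative rule).

    q_values:   dict content id -> popularity estimate shared by the A-UAVs
    dest_cache: set of content ids already cached at the destination A-UAV
    """
    # Drop contents the destination already holds, so no redundant copy is ferried.
    candidates = [c for c in q_values if c not in dest_cache]
    # Rank the rest by estimated popularity (highest first), ties broken by id.
    candidates.sort(key=lambda c: (-q_values[c], c))
    return candidates[:capacity]


def availability(popularity, available):
    """Fraction of destination requests that can be served from local copies."""
    return sum(popularity.get(c, 0.0) for c in set(available))


pop = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}   # destination demand (made up)
dest = {"a"}
ferried = selective_ferry_cache(pop, dest, capacity=2)   # ["b", "c"]
naive = sorted(pop, key=pop.get, reverse=True)[:2]       # ["a", "b"]: duplicates "a"
print(ferried, availability(pop, dest | set(ferried)), availability(pop, dest | set(naive)))
```

With a two-item ferrying capacity, skipping the already-cached content `a` lets the MF-UAV carry `b` and `c`, which raises the fraction of destination demand that can be served locally compared with pre-loading the two most popular items.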

Discussion and Future Work: Future work includes developing algorithms to handle time-varying 

content  popularity  and  implementing  adaptive  trajectory  planning  to  address  operational 

unreliabilities of the UAVs. Additionally, it is necessary to explore methods for preserving the 

richness of information when converting multi-modal disaster data into smaller-sized formats to 

enhance effective content caching capacity.  

While the developed framework has been validated through simulations, real-world deployment would require addressing practical challenges that include UAV coordination, wireless interference, and energy constraints. The decentralized structure of the system mitigates coordination complexity by allowing A-UAVs and MF-UAVs to operate autonomously, while federated learning refines caching strategies without requiring continuous central control.

A transition to operational implementation would follow a phased approach. Initial deployment 

could  involve  small-scale  testbed  experiments,  where  UAVs  execute  FedMAB-based  caching 

policies  in  controlled  environments.  This  would  allow  for  real-world  validation  of  caching 

performance under dynamic network conditions. The next phase would involve field trials in real-

world  UAV-assisted  networks,  where  interference,  energy  efficiency,  and  UAV  mobility 

constraints could be evaluated. Future extensions could also explore hardware integration with UAV control algorithms to ensure that caching decisions align with real-time flight dynamics.

By adopting a structured deployment roadmap, the designed framework can be progressively 

refined for real-world applications while maintaining its efficiency in adaptive caching. 

Chapter 10: Conclusions and Future Works 

This thesis investigates the complex problem of content dissemination in environments that 

lack  communication  infrastructure  due  to  natural  disasters  or  conflicts.  To  overcome  this 

challenge, the research introduces a hierarchical UAV-assisted content dissemination framework 

designed  to  provide  essential  information  to  isolated  communities.  The  proposed  architecture 

incorporates  Anchor  UAVs  (A-UAVs),  which  provide  stationary  caching  points  with  costly 

satellite-like backhaul connectivity, and Micro-Ferrying UAVs (MF-UAVs) that transfer cached 

contents across different communities without direct backhaul connections. 

A central aspect of this thesis is the use of Multi-Armed Bandit (MAB) and Federated Multi-Armed Bandit (FedMAB) learning methodologies to cache contents dynamically and adaptively. Unlike traditional caching strategies that rely on static or globally known content popularity, the proposed model captures local content demands and temporal variations without requiring centralized coordination. FedMAB enables UAVs to collaboratively enhance their caching policies by sharing learned models rather than raw user requests, which yields a scalable learning framework.

The research further extends the MAB framework by integrating trajectory-aware caching 

policies.  This  approach  considers  UAV  mobility  patterns  and  content  request  dynamics  to 

determine  optimal  caching  strategies.  By  effectively  leveraging  trajectories,  the  system 

significantly  improves  content  delivery  efficiency,  reduces  redundancy  in  content  storage,  and 

increases overall availability of critical information to users. 

Another important contribution is the Selective Caching Algorithm developed specifically 

for MF-UAVs. This algorithm strategically selects and manages cached contents to maximize their 

availability while minimizing redundant storage across UAV fleets. Through careful evaluation, 

this method demonstrated substantial improvements in caching efficiency and system performance 

compared to traditional pre-loading benchmarks. 

Extensive simulations were conducted to validate the effectiveness of the proposed approaches against established benchmark models, including traditional caching policies. The results indicate that combining federated learning, multi-armed bandits, and trajectory-aware caching policies significantly outperforms traditional approaches, achieving high content availability, reduced access delay, and improved caching stability under varying user demand patterns.

This research delivers a practical and robust solution for UAV-based content dissemination, effectively addressing the critical challenges posed by infrastructure-deprived environments. By incorporating federated learning, bandit algorithms, adaptive caching, and trajectory awareness, the developed framework provides resilience, scalability, and responsiveness. The methodologies and algorithms established in this thesis lay the groundwork for future research on adaptive UAV systems that balance operational efficiency, learning responsiveness, and strategic content management.

10.1  Key Findings and Design Guidelines 

The core ideas that can be deduced from the results presented in this thesis are summarized below.

a)  Federated Model Aggregation Enhances Scalability: The integration of Federated Multi-Armed Bandit learning enables collaborative policy refinement across multiple UAVs without relying on centralized coordination. This allows each A-UAV to build locally relevant models while benefiting from global content popularity trends shared through MF-UAV-based model aggregation. The system achieves rapid convergence of caching policies, which makes it suitable for large-scale and dynamic disaster-response networks.

b)  Mobility-Aware Caching Improves Efficiency: Considering UAV trajectory patterns and 

user  request  dynamics  substantially  enhances  content  delivery  performance.  Caching 

decisions informed by UAV mobility trajectories reduce redundancy in content storage and 

improve  overall  content  availability.  To  optimize  system  performance,  UAV  caching 

algorithms  should  dynamically  adapt  based  on  real-time  trajectory  data  and  content 

demand predictions. By aligning caching decisions with known or predicted UAV flight 

paths, the system maximizes content exposure to target communities and minimizes missed 

delivery opportunities. 

c)  Selective Caching at MF-UAVs Optimizes Network Utility: The Selective Caching Algorithm at MF-UAVs strategically curates content based on urgency, request likelihood, and caching history. This approach improves effective cache utilization across UAV swarms by minimizing overlap in stored content, which increases system-wide availability. When deployed at scale, this mechanism promotes both diversity and relevance of the cached items.

d)  Multi-Dimensional Reward Structures Enable Contextual Adaptation: The use of local, ferrying, and global reward components in caching decision-making leads to finely tuned learning behavior. This structure allows UAVs to adapt dynamically to regional variations in content demand as well as to system-wide utility metrics. It supports learning policies that are simultaneously locally responsive and globally efficient.

e)  Hierarchical  Architecture  Increases  Robustness:  A  two-tier  UAV  network  architecture 

comprising A-UAVs and MF-UAVs ensures operational continuity even under constrained 

conditions.  The  hierarchical  structure  simplifies  learning  coordination,  offloads  content 

ferrying, and enhances fault tolerance. As a result, the system scales robustly with minimal 

central infrastructure. 

f)  Dynamic Adaptability Under Changing Conditions: The proposed federated multi-armed bandit and trajectory-aware caching solutions consistently outperform traditional static and centralized methods, particularly in unpredictable and dynamic scenarios. Emphasizing dynamic adaptability allows the system to swiftly adjust caching strategies in response to shifting user demands and environmental changes while maintaining high system performance and reliability.

g)  Balanced Optimization Across System Constraints: The framework accounts for trade-offs among QoS, computation, communication cost, cache capacity, and UAV budgets. Learning-driven caching policies help balance content relevance against delivery feasibility, ensuring that the system operates within resource bounds while maintaining high service quality.
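Finding d) can be illustrated with a minimal reward sketch that combines local, ferrying, and global components as a weighted sum. The specific terms (request shares for the local and ferrying components, an inverse-copy-count diversity term for the global component), the weights in `RewardWeights`, and the function `multi_dim_reward` are assumptions made for illustration; they are not the reward definitions derived in the earlier chapters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    local: float = 0.5    # hypothetical weights, not values from the thesis
    ferry: float = 0.3
    glob: float = 0.2


def multi_dim_reward(content, local_req, ferry_req, copies, w=RewardWeights()):
    """Weighted sum of local, ferrying, and global reward terms for one content.

    local_req: dict content -> requests in this A-UAV's community
    ferry_req: dict content -> requests at communities reached through MF-UAVs
    copies:    dict content -> number of other A-UAVs already caching it
    """
    r_local = local_req.get(content, 0) / max(sum(local_req.values()), 1)
    r_ferry = ferry_req.get(content, 0) / max(sum(ferry_req.values()), 1)
    # Diversity term: the fewer copies elsewhere, the larger the global reward.
    r_global = 1.0 / (1 + copies.get(content, 0))
    return w.local * r_local + w.ferry * r_ferry + w.glob * r_global


local_req = {"a": 6, "b": 4}
ferry_req = {"a": 1, "c": 9}
copies = {"a": 2}
for c in ("a", "b", "c"):
    print(c, round(multi_dim_reward(c, local_req, ferry_req, copies), 3))
```

In this example, content `c` has no local demand but strong demand along the ferrying route and no copies elsewhere, so it outranks `a`, which is locally popular but already replicated at two other A-UAVs.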

10.2  Future Directions 

10.2.1  Crowd Estimation-Based Context-Aware Caching 

This  direction  explores  a  context-aware  caching  mechanism  that  leverages  crowd 

estimation  and  environmental  sensing  to  guide  content  placement  decisions.  Traditional  MAB 

approaches  rely  solely  on  observed  request  frequency,  without  accounting  for  the  urgency, 

relevance,  or  situational  context  of  content  demand.  In  disaster-stricken  environments,  such 

limitations  can  significantly  impair  content  availability  when  observational  data  is  sparse  or 

missing. 

The proposed solution integrates multimodal context extraction using techniques such as image captioning and image–text matching with vision-language models (e.g., CLIP), crowd estimation models based on auxiliary point guidance, and structured image analysis to infer user and environmental conditions from onboard UAV camera sensors. These multimodal signals are processed to generate real-time context scores that encapsulate content urgency, relevance, damage severity, and inferred user intent.

A context scoring model is developed to estimate content demand when explicit request 

data is unavailable. For example, crowd density estimation or disaster damage detection informs 

content selection priorities when direct user input is missing. These context-driven metrics are then 

integrated into the Federated Multi-Armed Bandit (FedMAB) framework to dynamically update 

caching strategies based on environmental insights. 

The  proposed  system  includes  a  pipeline  for  population  and  popularity  estimation,  and 

temporal  decay  modeling  that  balances  short-term  surges  in  demand  with  long-term  value 

retention.  This  helps  UAVs  adapt  their  caching  policy  in  response  to  changing  real-world 

conditions. 
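One common way to realize such temporal decay modeling is to keep two exponentially weighted moving averages of each content's request share, one with fast decay that reacts to surges and one with slow decay that retains long-term value, and to rank contents by a blend of the two. The sketch below assumes this construction; the class `DecayedPopularity` and the parameters `fast`, `slow`, and `alpha` are hypothetical and would need tuning against real request traces.

```python
import numpy as np


class DecayedPopularity:
    """Short- and long-horizon popularity estimates per content (illustrative)."""

    def __init__(self, n_contents, fast=0.5, slow=0.05, alpha=0.5):
        self.short = np.zeros(n_contents)  # fast decay: tracks sudden surges
        self.long = np.zeros(n_contents)   # slow decay: retains long-term value
        self.fast, self.slow, self.alpha = fast, slow, alpha

    def observe(self, counts):
        """Fold one slot of request counts into both moving averages."""
        counts = np.asarray(counts, dtype=float)
        share = counts / max(counts.sum(), 1.0)
        self.short = (1 - self.fast) * self.short + self.fast * share
        self.long = (1 - self.slow) * self.long + self.slow * share

    def score(self):
        """Blend of the two horizons used to rank contents for caching."""
        return self.alpha * self.short + (1 - self.alpha) * self.long


est = DecayedPopularity(3)
for _ in range(50):
    est.observe([8, 2, 0])   # content 0 is steadily popular
for _ in range(3):
    est.observe([1, 1, 8])   # sudden surge for content 2
print(est.short.round(3), est.long.round(3), est.score().round(3))
```

After a three-slot surge for content 2, the short-horizon estimate ranks it first while the long-horizon estimate still favors content 0, and the blended score keeps both ahead of content 1.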

By incorporating multimodal insights into the reward formulation and decision logic, the caching framework aligns delivery priorities with real-world urgency. The use of Federated Learning enables distributed UAVs to share model parameters without central coordination, which keeps learning adaptive and robust in communication-challenged environments. This approach opens new avenues for deploying intelligent, context-aware caching in extreme scenarios where traditional demand estimation falls short.

10.2.2  Large Action Models (LAM) for Enhanced Decision Sampling and Strategic Caching 

This future direction proposes integrating Large Action Models (LAM) with Federated Multi-Armed Bandit (FedMAB) learning frameworks to enhance the decision-making processes in UAV-assisted content dissemination systems. Traditional MAB methods typically deal with a limited set of discrete actions, which constrains their ability to efficiently explore and exploit a vast action space in highly dynamic environments. LAM addresses this limitation by generating extensive and diverse action spaces based on historical data, simulation outcomes, and predictive modeling.

Incorporating  LAM  with  FedMAB  can  significantly  expand  the  scope  and  precision  of 

caching policies by enabling UAVs to sample and assess a broad spectrum of potential actions 

rapidly. The system can dynamically construct and evaluate large corpora of actions for various 

scenarios,  such  as  sudden  shifts  in  content  popularity,  unexpected  UAV  failures,  or  rapid 

environmental changes. By combining predictive analytics from machine learning models, such 

as  Transformer-based  predictors,  with  LAM-generated  action  spaces,  UAVs  can  proactively 

formulate and test multiple strategic scenarios before deployment. 

Furthermore, the integration of federated learning within LAM allows UAV networks to collaboratively refine their large-scale action databases without compromising local autonomy. Each UAV contributes insights based on its local context and outcomes, which improves the accuracy of collective decision-making. This collective intelligence enables the system to rapidly identify optimal actions from a comprehensive and evolving action repository, thus significantly enhancing the robustness and adaptability of content dissemination policies.

This approach to large-scale decision sampling can facilitate more accurate, proactive, and resilient caching strategies. Future research will explore algorithmic advancements to optimize action space generation and evaluation, along with scalability improvements to support extensive deployments in real-world, resource-constrained disaster-response scenarios.

10.2.3  Contextual Federated Multi-Armed Bandit Learning for Collective Caching 

This extension of the thesis seeks to design and implement a novel caching policy framework for a swarm of micro-ferrying unmanned aerial vehicles (UAVs) that utilizes Contextual Federated Multi-Armed Bandit Learning. The primary focus is on maximizing content availability through a collective caching strategy that intelligently adapts to the heterogeneous and time-varying demands of user content requests, which follow a Zipf popularity distribution. By integrating contextual variables derived from the UAVs’ operational environment, such as their flight patterns and proximity in formations, this framework aims to enhance the efficiency and responsiveness of content delivery networks in diverse scenarios.
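Heterogeneous Zipf demand of the kind assumed here can be generated for simulation as in the sketch below, where each community shares the same Zipf skew `s` over popularity ranks but assigns ranks to contents through its own random permutation. The per-community permutation is an assumption used to model heterogeneity, and the function names are hypothetical; time variation could be added by redrawing the permutation, or changing `s`, between epochs.

```python
import numpy as np


def zipf_popularity(n_contents, s):
    """Zipf pmf over popularity ranks 1..n_contents with skew parameter s."""
    ranks = np.arange(1, n_contents + 1)
    weights = ranks ** (-float(s))
    return weights / weights.sum()


def community_requests(n_contents, s, n_requests, rng):
    """Draw one slot of requests for a community with its own content ranking.

    The random permutation (an assumption) makes demand heterogeneous across
    communities while every community still follows a Zipf law over ranks.
    """
    order = rng.permutation(n_contents)   # order[0] is this community's top content
    p = np.empty(n_contents)
    p[order] = zipf_popularity(n_contents, s)
    return rng.multinomial(n_requests, p)


rng = np.random.default_rng(7)
demand = [community_requests(10, 0.8, 100, rng) for _ in range(3)]
print(np.stack(demand))
```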

At the core of the suggested system is the development of a caching algorithm tailored to manage and exploit the complexities of a hierarchically structured multi-UAV system. The caching algorithm will be designed to continuously learn and adapt to the operational context of the UAV swarm. By observing the trajectories and grouping patterns of the UAVs, the algorithm will adjust the cached content on each UAV to minimize redundancy and ensure that a diverse range of data is available across the network. This approach leverages the inherent characteristics of the UAVs’ flight routes and community engagement patterns, making it possible to tailor content delivery to specific regional demands dynamically.

Furthermore,  the  proposed  caching  policy  will  incorporate  a  multivariate  Contextual 

Federated Multi-Armed Bandit (CFMAB) learning model that utilizes a complex aggregation of 

Q-values derived from multiple UAVs operating in concert. This model will allow for the sharing 

and  updating  of  multivariate  model  information  across  the  swarm  without  significant  delays, 

facilitating a responsive and adaptive system capable of handling non-independent and identically 

distributed  (non-IID)  data  scenarios  [148].  The  use  of  a  joint  distribution-based  divergence 

mechanism  in  the  CFMAB  model  will  help  to  synchronize  and  optimize  the  caching  decisions 

across the UAV swarm which will enhance the overall system’s efficiency and effectiveness. 
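The joint distribution-based divergence mechanism is not specified further here. As a simpler stand-in, the sketch below weights each UAV's Q-table by the exponentiated negative Kullback-Leibler divergence between its request distribution and that of the receiving UAV, so that UAVs with similar demand contribute more. The marginal KL divergence, the temperature `tau`, and the helper names are assumptions, not the CFMAB design.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two request distributions, smoothed to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))


def divergence_weighted_q(target, dists, q_tables, tau=1.0):
    """Aggregate Q-tables for UAV `target`, down-weighting UAVs whose request
    distribution diverges from the target's (assumed rule, not the CFMAB spec)."""
    d = np.array([kl_divergence(dists[target], dist) for dist in dists])
    w = np.exp(-d / tau)   # similar UAVs get weights close to 1
    w /= w.sum()
    return w @ np.stack([np.asarray(qt, dtype=float) for qt in q_tables])


dists = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
q_tables = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]
print(divergence_weighted_q(0, dists, q_tables).round(3))
```

For UAV 0, the second UAV with a similar distribution receives a weight close to its own, while the third UAV, whose demand is concentrated on content 2, is strongly down-weighted.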

An essential component of this framework is its ability to balance the trade-off between 

the effective caching capacity of the UAV fleet and their accessibility. This balance is critical to 

maintaining high levels of service quality, especially in terms of meeting the QoS expectations and 

tolerable access delays specified by users. The caching policy will be continuously refined through 

the CFMAB learning process, which will study the interactions between the learnt caching policies 

and the QoS expectations. Such continuous learning will allow the network to adapt its behavior and optimize content delivery based on actual user experiences and feedback.

The  proposed  caching  policy  framework  for  the  UAV  swarm  will  utilize  advanced 

contextual multi-armed bandits and federated learning techniques to dynamically adapt to changing

environmental and user demand conditions. By addressing the challenges of multivariate, non-IID 

data in a real-world application [149], this research will not only enhance the performance of UAV-

based content distribution networks but also contribute significant insights and methodologies to 

the field of distributed machine learning and autonomous vehicle coordination. 

10.2.4  Adaptive Trajectory Planning in the Presence of Operational Unreliabilities of the Micro UAVs

The proposed initiative to enhance the effectiveness of unmanned aerial vehicles (UAVs) 

in content distribution by addressing the issue of operational unreliabilities requires an integrated 

approach that utilizes both adaptive trajectory route planning and advanced caching techniques. 

This approach aims to refine the current framework, which primarily focuses on predetermined 

trajectories for UAVs and Micro-UAVs, by incorporating dynamic decision-making capabilities 

that respond to fluctuations in operational stability. 

The central objective of this enhancement is to implement adaptive trajectory route planning that compensates for the unreliability of Micro UAV operations. This entails the design of a dynamic flight plan that not only adjusts in real time to the status and availability of UAVs within the network but also ensures that the collective storage and dissemination capabilities of the fleet are not compromised by individual UAV failures.

At a higher level, the primary goal of this development is to ensure that the network of UAVs can maintain high levels of content availability and reliability even when individual units fail or deviate from their expected operational parameters. This would be achieved by dynamically modifying the caching policies and trajectory plans of the remaining UAVs. Such adjustments are crucial not only for optimizing the distribution of content across the affected areas but also for ensuring that the collective capabilities of the UAV swarm are utilized efficiently. The design of the flight plan also plays a critical role: it must be flexible enough to allow real-time reconfiguration, enabling UAVs to regroup or reposition themselves to cover areas left uncovered by the failure of one or more units. This strategic regrouping is expected to maximize content availability by leveraging the unaffected parts of the network to compensate for the affected segments.
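A very simple baseline for the regrouping described above is a greedy reassignment in which the communities of a failed MF-UAV are handed, one at a time, to the surviving MF-UAV with the lowest cost, taken here as travel distance plus a penalty per community already served. This is a hypothetical heuristic for illustration (`reassign_on_failure` and `load_penalty` are not from the thesis); it ignores energy budgets and does not jointly re-optimize caching, both of which the envisioned adaptive planner would need to handle.

```python
import math


def reassign_on_failure(assignments, positions, community_xy, failed, load_penalty=1.0):
    """Greedily hand the communities of failed MF-UAVs to surviving ones.

    assignments:  dict uav id -> list of community ids on its ferrying route
    positions:    dict uav id -> (x, y) current position of the UAV
    community_xy: dict community id -> (x, y) location of the community
    failed:       set of uav ids that dropped out
    load_penalty: hypothetical cost per community a survivor already serves
    """
    survivors = {u: list(cs) for u, cs in assignments.items() if u not in failed}
    if not survivors:
        raise ValueError("no surviving MF-UAVs to take over the communities")
    orphaned = [c for u in sorted(failed) for c in assignments.get(u, [])]
    for c in orphaned:
        target = community_xy[c]
        # Cost = travel distance to the community + penalty for current workload.
        best = min(sorted(survivors),
                   key=lambda u: math.dist(positions[u], target)
                   + load_penalty * len(survivors[u]))
        survivors[best].append(c)
    return survivors


assignments = {"m1": ["A", "B"], "m2": ["C"], "m3": ["D"]}
positions = {"m1": (0, 0), "m2": (10, 0), "m3": (5, 5)}
community_xy = {"A": (0, 1), "B": (1, 0), "C": (10, 1), "D": (5, 6)}
new_plan = reassign_on_failure(assignments, positions, community_xy, failed={"m3"})
print(new_plan)   # {'m1': ['A', 'B'], 'm2': ['C', 'D']}
```

Here community D goes to `m2` rather than `m1` because both are equally far from it and `m2` currently serves fewer communities.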

Through the integration of these methodologies, the framework aims to provide a resilient, 

scalable, and highly adaptive solution to the challenges posed by the operational unreliability of 

Micro UAVs in disaster-stricken or isolated regions, thus enhancing the effectiveness of UAV-

assisted communication and content dissemination systems. 

10.2.5  Integration of Proactive and Predictive Caching with Federated Multi-Armed Bandit Learning

In the proposed framework for integrating proactive and predictive caching methods with 

Federated Multi-Armed Bandit Learning (FMABL) or Contextual Federated Multi-Armed Bandit 

(CFMAB)  learning  for  caching  in  UAV/Micro-UAV  networks,  we  aim  to  transition  from 

traditional reactive caching strategies to a more predictive model. Traditional methods primarily 

respond  to  immediate  content  requests,  gradually  improving  the  caching  decisions  based  on 

received  user  feedback  and  the  associated  rewards.  These  systems,  while  adaptive,  typically 

initiate improvements only after the content requests are made by UAVs or Micro-UAVs. Such 

reactive  approaches  can  limit  the  efficiency  and  responsiveness  of  the  network,  especially  in 

dynamic and fast-evolving operational environments. 

The  principal  objective  of  this  new  proposal  is  to  implement  Transformer  Learning 

algorithms [150], [151] to anticipate changes in content popularity. Transformer Learning models, 

known for their effectiveness in understanding time-series data and patterns in sequence prediction 

tasks, will be employed to forecast the fluctuating demands for content [152]. By predicting these 

changes,  the  system  can  proactively  adjust  caching  policies  before  actual  requests  occur, 

potentially enhancing the network’s efficiency and user satisfaction. 

This proactive approach is augmented by an advanced caching policy developed through 

MAB/FMABL/CFMAB.  The  MAB/FMABL/CFMAB  framework  leverages  the  decentralized 

nature of UAV networks to aggregate insights from individual UAV experiences, even in a non-

independent and identically distributed (non-IID) data environment [145]. This method allows for 

a dynamic adaptation of caching strategies that are both informed by local conditions and enhanced 

through a collective learning process. The integration of Transformer Learning predictions with such a caching policy is expected to constitute a significant enhancement over existing methods. By forecasting content popularity trends, the Transformer model provides a predictive input that the MAB/FMABL/CFMAB system uses to pre-adjust its strategies. This means that caching decisions can be refined in anticipation of future demands rather than solely in reaction to past requests. For example, if the Transformer Learning model predicts a surge in demand for specific types of content in a particular area, the FMABL system can proactively direct UAVs to cache this content in advance, rather than waiting for the demand to materialize.
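The pre-adjustment idea can be sketched by blending bandit Q-values with a forecast demand share before selecting contents to cache. To keep the example self-contained, a one-step linear extrapolation stands in for the Transformer forecaster; this substitution, the blending weight `beta`, and the function names are assumptions, and the sketch further assumes that Q-values are normalized so that they are on the same scale as the demand shares.

```python
import numpy as np


def forecast_next(history):
    """Stand-in forecaster (NOT a Transformer): one-step linear extrapolation
    of each content's request count from the last two slots, clipped at zero."""
    h = np.asarray(history, dtype=float)
    if len(h) < 2:
        return h[-1]
    return np.clip(2 * h[-1] - h[-2], 0, None)


def proactive_cache(q_values, history, cache_size, beta=0.5):
    """Rank contents by a blend of bandit Q-values and forecast demand share."""
    pred = forecast_next(history)
    share = pred / max(pred.sum(), 1e-12)
    score = (1 - beta) * np.asarray(q_values, dtype=float) + beta * share
    return np.argsort(-score, kind="stable")[:cache_size]


q = [0.5, 0.3, 0.2, 0.0]                  # learned Q-values (normalized shares)
history = [[50, 30, 20, 0],               # request counts per slot, oldest first
           [45, 30, 15, 10],
           [40, 30, 10, 20]]              # demand for content 3 is rising
print(proactive_cache(q, history, cache_size=3))
```

Purely reactive selection from the Q-values would cache contents 0, 1, and 2; the forecast of rising demand for content 3 replaces content 2 before the surge is observed.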

Moreover, this proactive caching approach will not only improve the timeliness and relevance of the delivered content but also optimize the network’s resources. By reducing the need for rapid, reactive changes in caching strategies, the system can operate more smoothly and efficiently, focusing its computational and communication resources on maintaining optimal service rather than on constant adjustment.

In practice, the integration process involves the continuous training of the Transformer

model  with  historical  and  real-time  data  to  refine  its  predictive  accuracy.  Concurrently,  the 

MAB/FMABL/CFMAB framework adjusts its parameters based on both the predictions from the 

Transformer  model  and  the  ongoing  feedback  from  the  UAV  network.  This  dual-input  system 

ensures  that  the  caching  policy  remains  robust  and  adaptable,  capable  of  handling  the  inherent 

uncertainties  and  variability  of  UAV-assisted  content  delivery  environments.  It  combines  the 

predictive power of Transformer Learning with the adaptive, decentralized learning capabilities of 

MAB/FMABL/CFMAB to create a caching system that not only responds to but anticipates user 

needs and content popularity trends. 

BIBLIOGRAPHY 

[1]    B. Mukherjee, M. F. Habib, and F. Dikbiyik, “Network adaptability from disaster disruptions and cascading failures,” IEEE Commun. Mag., vol. 52, no. 5, pp. 230–238, May 2014, doi: 10.1109/MCOM.2014.6815917.

[2]    S. V. Kartalopoulos, “Surviving a disaster [optical communications],” IEEE Commun. Mag., vol. 40, no. 7, pp. 124–126, Jul. 2002, doi: 10.1109/MCOM.2002.1018017.

[3]    H. Rong, Z. Wang, H. Jiang, Z. Xiao, and F. Zeng, “Energy-Aware Clustering and Routing in Infrastructure Failure Areas With D2D Communication,” IEEE Internet Things J., vol. 6, no. 5, pp. 8645–8657, Oct. 2019, doi: 10.1109/JIOT.2019.2922202.

[4]    M. Matracia, N. Saeed, M. A. Kishk, and M.-S. Alouini, “Post-Disaster Communications: Enabling Technologies, Architectures, and Open Challenges,” IEEE Open J. Commun. Soc., vol. 3, pp. 1177–1205, 2022, doi: 10.1109/OJCOMS.2022.3192040.

[5]    R. Dong, C. She, W. Hardjawana, Y. Li, and B. Vucetic, “Deep Learning for Radio Resource Allocation With Diverse Quality-of-Service Requirements in 5G,” IEEE Trans. Wireless Commun., vol. 20, no. 4, pp. 2309–2324, Apr. 2021, doi: 10.1109/TWC.2020.3041319.

[6]    M. El Tanab and W. Hamouda, “Fast-Grant Learning-Based Approach for Machine-Type Communications With NOMA,” in ICC 2021 - IEEE International Conference on Communications, Montreal, QC, Canada: IEEE, Jun. 2021, pp. 1–6. doi: 10.1109/ICC42927.2021.9500606.

[7]    K.-H. Liu and W. Liao, “Intelligent Offloading for Multi-Access Edge Computing: A New Actor-Critic Approach,” in ICC 2020 - 2020 IEEE International Conference on Communications (ICC), Dublin, Ireland: IEEE, Jun. 2020, pp. 1–6. doi: 10.1109/ICC40277.2020.9149387.

[8]    M. S. M. Gismalla et al., “Survey on Device to Device (D2D) Communication for 5GB/6G Networks: Concept, Applications, Challenges, and Future Directions,” IEEE Access, vol. 10, pp. 30792–30821, 2022, doi: 10.1109/ACCESS.2022.3160215.

[9]    I. O. Sanusi, K. M. Nasr, and K. Moessner, “Radio Resource Management Approaches for Reliable Device-to-Device (D2D) Communication in Wireless Industrial Applications,” IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 3, pp. 905–916, Sep. 2021, doi: 10.1109/TCCN.2020.3032679.

[10]    D. Shumeye Lakew, U. Sa’ad, N.-N. Dao, W. Na, and S. Cho, “Routing in Flying Ad Hoc 
Networks: A Comprehensive Survey,” IEEE Commun. Surv. Tutorials, vol. 22, no. 2, pp. 
1071–1120, 2020, doi: 10.1109/COMST.2020.2982452. 

[11]    M. Maad Hamdi, L. Audah, S. Abduljabbar Rashid, A. Hamid Mohammed, S. Alani, and A. Shamil Mustafa, “A Review of Applications, Characteristics and Challenges in Vehicular Ad Hoc Networks (VANETs),” in 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey: IEEE, Jun. 2020, pp. 1–7. doi: 10.1109/HORA49412.2020.9152928.

[12]    M. Li, F. R. Yu, P. Si, W. Wu, and Y. Zhang, “Resource Optimization for Delay-Tolerant 
Data in Blockchain-Enabled IoT With Edge Computing: A Deep Reinforcement Learning 
Approach,”  IEEE  Internet  Things  J.,  vol.  7,  no.  10,  pp.  9399–9412,  Oct.  2020,  doi: 
10.1109/JIOT.2020.3007869. 

[13]    J. Liang et al., “LTP for Reliable Data Delivery From Space Station to Ground Station in 
the Presence of Link Disruption,” IEEE Aerosp. Electron. Syst. Mag., vol. 38, no. 9, pp. 24–
33, Sep. 2023, doi: 10.1109/MAES.2023.3290134. 

[14]    L. Yang et al., “Resource Consumption of a Hybrid Bundle Retransmission Approach on 
Deep-Space Communication Channels,” IEEE Aerosp. Electron. Syst. Mag., vol. 36, no. 11, 
pp. 34–43, Nov. 2021, doi: 10.1109/MAES.2021.3094787. 

[15]    L. Yang, R. Wang, J. Liang, Y. Zhou, K. Zhao, and X. Liu, “Acknowledgment Mechanisms for Reliable File Transfer Over Highly Asymmetric Deep-Space Channels,” IEEE Aerosp. Electron. Syst. Mag., vol. 37, no. 9, pp. 42–51, Sep. 2022, doi: 10.1109/MAES.2022.3192508.

[16]    L. Yang, R. Wang, Y. Zhou, J. Liang, K. Zhao, and S. Burleigh, “An Analytical Framework 
for Disruption of Licklider Transmission Protocol in Mars Communications,” IEEE Trans. 
Veh. Technol., vol. 71, no. 5, pp. 5430–5444, May 2022, doi: 10.1109/TVT.2022.3153959. 

[17]    M. Zhang, M. El-Hajjar, and S. X. Ng, “Intelligent Caching in UAV-Aided Networks,”
IEEE  Trans.  Veh.  Technol.,  vol.  71,  no.  1,  pp.  739–752,  Jan.  2022,  doi: 
10.1109/TVT.2021.3125396. 

[18]    S. Gu, X. Sun, Z. Yang, T. Huang, W. Xiang, and K. Yu, “Energy-Aware Coded Caching 
Strategy  Design  With  Resource  Optimization  for  Satellite-UAV-Vehicle-Integrated 
Networks,”  IEEE  Internet  Things  J.,  vol.  9,  no.  8,  pp.  5799–5811,  Apr.  2022,  doi: 
10.1109/JIOT.2021.3065664. 

[19]    Y. Zhou et al., “Caching and UAV Friendly Jamming for Secure Communications With 
Active Eavesdropping Attacks,” IEEE Trans. Veh. Technol., vol. 71, no. 10, pp. 11251–
11256, Oct. 2022, doi: 10.1109/TVT.2022.3186730. 

[20]    Y.  Tian,  G.  Pan,  M.  A.  Kishk,  and  M.-S.  Alouini,  “Stochastic  Analysis  of  Cooperative 
Satellite-UAV Communications,” IEEE Trans. Wireless Commun., vol. 21, no. 6, pp. 3570–
3586, Jun. 2022, doi: 10.1109/TWC.2021.3121299. 

[21]    H. Kong, M. Lin, W.-P. Zhu, H. Amindavar, and M.-S. Alouini, “Multiuser Scheduling for 
Asymmetric  FSO/RF  Links  in  Satellite-UAV-Terrestrial  Networks,”  IEEE  Wireless 
Commun. Lett., vol. 9, no. 8, pp. 1235–1239, Aug. 2020, doi: 10.1109/LWC.2020.2986750. 

[22]    J.-H. Lee, J. Park, M. Bennis, and Y.-C. Ko, “Integrating LEO Satellites and Multi-UAV 
Reinforcement Learning for Hybrid FSO/RF Non-Terrestrial Networks,” IEEE Trans. Veh. 
Technol., vol. 72, no. 3, pp. 3647–3662, Mar. 2023, doi: 10.1109/TVT.2022.3220696. 

[23]    Y. Zheng, Z. Chen, D. Lv, Z. Li, Z. Lan, and S. Zhao, “Air-to-air visual detection of micro-UAVs: An experimental evaluation of deep learning,” IEEE Robot. Autom. Lett., vol. 6, no. 2, pp. 1020–1027, 2021.

[24]    A. Mohammadi, Y. Feng, C. Zhang, S. Rawashdeh, and S. Baek, “Vision-based autonomous 
landing using an MPC-controlled micro UAV on a moving platform,” in 2020 International 
Conference on Unmanned Aircraft Systems (ICUAS), IEEE, Sep. 2020, pp. 771–780.

[25]    K.-B. Kang, J.-H. Choi, B.-L. Cho, J.-S. Lee, and K.-T. Kim, “Analysis of Micro-Doppler 
Signatures of Small UAVs Based on Doppler Spectrum,” IEEE Trans. Aerosp. Electron. 
Syst., vol. 57, no. 5, pp. 3252–3267, Oct. 2021, doi: 10.1109/TAES.2021.3074208. 

[26]    W.  Wang  et  al.,  “Energy-Constrained  UAV-Assisted  Secure  Communications  With 
Position Optimization and Cooperative Jamming,” IEEE Trans. Commun., vol. 68, no. 7, 
pp. 4476–4489, Jul. 2020, doi: 10.1109/TCOMM.2020.2989462. 

[27]    Y.  Zeng  and  R.  Zhang,  “Energy-Efficient  UAV  Communication  With  Trajectory 
Optimization,” IEEE Trans. Wireless Commun., vol. 16, no. 6, pp. 3747–3760, Jun. 2017, 
doi: 10.1109/TWC.2017.2688328. 

[28]    M.  Monwar,  O.  Semiari,  and  W.  Saad,  “Optimized  Path  Planning  for  Inspection  by 
Unmanned  Aerial  Vehicles  Swarm  with  Energy  Constraints,”  in  2018  IEEE  Global 
Communications  Conference  (GLOBECOM),  Abu  Dhabi,  United  Arab  Emirates:  IEEE, 
Dec. 2018, pp. 1–6. doi: 10.1109/GLOCOM.2018.8647342. 

[29]    J. Gu, H. Wang, G. Ding, Y. Xu, Z. Xue, and H. Zhou, “Energy-Constrained Completion 
Time Minimization in UAV-Enabled Internet of Things,” IEEE Internet Things J., vol. 7, 
no. 6, pp. 5491–5503, Jun. 2020, doi: 10.1109/JIOT.2020.2981092. 

[30]    “What is the maximum range for a UAV’s remote control?” Accessed: Mar. 26, 2024.
[Online]. Available:
https://www.linkedin.com/advice/0/what-maximum-range-uavs-remote-control-skills-drones-zdfqc

[31]    “Mobile data traffic forecast – Mobility Report.” Accessed: Mar. 26, 2024. [Online].
Available:
https://www.ericsson.com/en/reports-and-papers/mobility-report/dataforecasts/mobile-traffic-forecast

[32]    “Understanding Average Data Usage Per Month for Home Internet - CompareInternet.com.”
Accessed: Mar. 26, 2024. [Online]. Available:
https://www.compareinternet.com/blog/average-data-usage-per-month-home-internet/

[33]    J.  O.  Klompmaker  et  al.,  “Racial,  Ethnic,  and  Socioeconomic  Disparities  in  Multiple 
Measures of Blue and Green Spaces in the United States,” Environ Health Perspect, vol. 
131, no. 1, p. 017007, Jan. 2023, doi: 10.1289/EHP11164. 

[34]    H. Ritchie, E. Mathieu, and M. Roser, “Which countries are most densely populated?,” Our
World in Data, Feb. 2024, Accessed: Mar. 26, 2024. [Online]. Available:
https://ourworldindata.org/most-densely-populated-countries

[35]    Q.  Xu  et  al.,  “Performance  analysis  of  NVMe  SSDs  and  their  implication  on  real  world 
databases,” in Proceedings of the 8th ACM International Systems and Storage Conference, 
Haifa Israel: ACM, May 2015, pp. 1–11. doi: 10.1145/2757667.2757684. 

[36]    J. Zhang, F. Meng, L. Qiao, and K. Zhu, “Design and Implementation of Optical Fiber SSD 
Exploiting FPGA Accelerated NVMe,” IEEE Access, vol. 7, pp. 152944–152952, 2019, doi: 
10.1109/ACCESS.2019.2947181. 

[37]    Tanwir, G. Hendrantoro, and A. Affandi, “Early result from adaptive combination of LRU, 
LFU and FIFO to improve cache server performance in telecommunication network,” in 
2015  International  Seminar  on  Intelligent  Technology  and  Its  Applications  (ISITIA), 
Surabaya, Indonesia: IEEE, May 2015, pp. 429–432. doi: 10.1109/ISITIA.2015.7220019. 

[38]    H. Gomaa, G. G. Messier, C. Williamson, and R. Davies, “Estimating Instantaneous Cache 
Hit Ratio Using Markov Chain Analysis,” IEEE/ACM Trans. Networking, vol. 21, no. 5, pp. 
1472–1483, Oct. 2013, doi: 10.1109/TNET.2012.2227338. 

[39]    S. Maffeis, “Cache management algorithms for flexible filesystems,” SIGMETRICS
Perform. Eval. Rev., vol. 21, no. 2, pp. 16–25, Dec. 1993, doi: 10.1145/174215.174219. 

[40]    X. Wang et al., “Deep Reinforcement Learning: A Survey,” IEEE Trans. Neural Netw.
Learning Syst., vol. 35, no. 4, pp. 5064–5078, Apr. 2024, doi:
10.1109/TNNLS.2022.3207346. 

[41]    J. Oh et al., “Discovering Reinforcement Learning Algorithms,” in Advances in Neural
Information Processing Systems, Curran Associates, Inc., 2020, pp. 1060–1070. Accessed:
Apr. 26, 2024. [Online]. Available:
https://proceedings.neurips.cc/paper_files/paper/2020/hash/0b96d81f0494fde5428c7aea243c9157-Abstract.html

[42]    F. Li, D. Yu, H. Yang, J. Yu, H. Karl, and X. Cheng, “Multi-Armed-Bandit-Based Spectrum 
Scheduling Algorithms in Wireless Networks: A Survey,” IEEE Wireless Commun., vol. 
27, no. 1, pp. 24–30, Feb. 2020, doi: 10.1109/MWC.001.1900280. 

[43]    X. Zhou and B. Ji, “On Kernelized Multi-Armed Bandits with Constraints”. 

[44]    A. Kalvit and A. Zeevi, “A Closer Look at the Worst-case Behavior of Multi-armed Bandit 
Algorithms,” in Advances in Neural Information Processing Systems, Curran Associates, 
Inc.,  2021,  pp.  8807–8819.  Accessed:  Apr.  26,  2024. 
[Online].  Available: 
https://proceedings.neurips.cc/paper_files/paper/2021/hash/49ef08ad6e7f26d7f200e1b2b9
e6e4ac-Abstract.html 

[45]    C. Shi and C. Shen, “Federated Multi-Armed Bandits,” Proceedings of the AAAI Conference
on Artificial Intelligence, vol. 35, no. 11, Art. no. 11, May 2021, doi:
10.1609/aaai.v35i11.17156. 

[46]    D. C. Nguyen et al., “Federated Learning for Smart Healthcare: A Survey,” ACM Comput.
Surv., vol. 55, no. 3, pp. 1–37, Mar. 2023, doi: 10.1145/3501296. 

[47]    D.  C.  Nguyen,  M.  Ding,  P.  N.  Pathirana,  A.  Seneviratne,  J.  Li,  and  H.  Vincent  Poor, 
“Federated  Learning  for  Internet  of  Things:  A  Comprehensive  Survey,”  IEEE  Commun. 
Surv. Tutorials, vol. 23, no. 3, pp. 1622–1658, 2021, doi: 10.1109/COMST.2021.3075439. 

[48]    A. K. Bhuyan, H. Dutta, and S. Biswas, “Multi-Armed Bandit Learning for Content
Provisioning in Network of UAVs,” in GLOBECOM 2023 - 2023 IEEE Global
Communications Conference, Dec. 2023, pp. 1143–1148. doi:
10.1109/GLOBECOM54140.2023.10437463. 

[49]    A. K. Bhuyan, H. Dutta, and S. Biswas, “Top-k Multi-Armed Bandit Learning for Content 
Dissemination in Swarms of Micro-UAVs,” Jan. 15, 2025, arXiv: arXiv:2404.10845. doi: 
10.48550/arXiv.2404.10845. 

[50]    A. K. Bhuyan, H. Dutta, and S. Biswas, “Towards Federated Multi-Armed Bandit Learning 
for  Content  Dissemination  Using  Swarm  of  UAVs,”  ACM  Trans.  Internet  Things,  p. 
3733841, May 2025, doi: 10.1145/3733841. 

[51]    A.  K.  Bhuyan,  H.  Dutta,  and  S.  Biswas,  “Federated  Multi-Armed  Bandit  Learning  for 
Caching  in  UAV-aided  Content  Dissemination,”  Ad  Hoc  Networks,  vol.  151,  p.  103306, 
Dec. 2023, doi: 10.1016/j.adhoc.2023.103306. 

[52]    A.  K.  Bhuyan,  H.  Dutta,  and  S.  Biswas,  “Distributed  Federated-Multi-Armed  Bandit 
Learning  for  Content  Management  in  Connected  UAVs,”  IEEE  Internet  of  Things 
Magazine, vol. 6, no. 4, pp. 130–136, Dec. 2023, doi: 10.1109/IOTM.001.2300081. 

[53]    C.  Mouradian,  N.  T.  Jahromi,  and  R.  H.  Glitho,  “NFV  and  SDN-Based  Distributed  IoT 
Gateway for Large-Scale Disaster Management,” IEEE Internet Things J., vol. 5, no. 5, pp. 
4119–4131, Oct. 2018, doi: 10.1109/JIOT.2018.2867255. 

[54]    Y. Liu, F. Zhou, C. Chen, Z. Zhu, T. Shang, and J.-M. Torres-Moreno, “Disaster Protection 
in Inter-DataCenter Networks Leveraging Cooperative Storage,” IEEE Trans. Netw. Serv. 
Manage., vol. 18, no. 3, pp. 2598–2611, Sep. 2021, doi: 10.1109/TNSM.2021.3089049. 

[55]    H. Verma and N. Chauhan, “MANET based emergency communication system for natural 
disasters,”  in  International  Conference  on  Computing,  Communication  &  Automation, 
Greater Noida, India: IEEE, May 2015, pp. 480–485. doi: 10.1109/CCAA.2015.7148424. 

[56]    K. Fall, “A delay-tolerant network architecture for challenged internets,” in Proceedings of 
the  2003  conference  on  Applications,  technologies,  architectures,  and  protocols  for 
computer  communications,  Karlsruhe  Germany:  ACM,  Aug.  2003,  pp.  27–34.  doi: 
10.1145/863955.863960. 

[57]    L. Palen and A. L. Hughes, “Social Media in Disaster Communication,” in Handbook of 
Disaster  Research,  H.  Rodríguez,  W.  Donner,  and  J.  E.  Trainor,  Eds.,  Cham:  Springer 
International Publishing, 2018, pp. 497–518. doi: 10.1007/978-3-319-63254-4_24. 

[58]    G.  Broll,  E.  Rukzio,  M.  Paolucci,  M.  Wagner,  A.  Schmidt,  and  H.  Hussmann,  “Perci: 
Pervasive Service Interaction with the Internet of Things,” IEEE Internet Comput., vol. 13, 
no. 6, pp. 74–81, Nov. 2009, doi: 10.1109/MIC.2009.120. 

[59]    H. Sami, R. Saado, A. E. Saoudi, A. Mourad, H. Otrok, and J. Bentahar, “Opportunistic 
UAV  Deployment  for  Intelligent  On-Demand  IoV  Service  Management,”  IEEE  Trans. 
Netw.  Serv.  Manage.,  vol.  20,  no.  3,  pp.  3428–3442,  Sep.  2023,  doi: 
10.1109/TNSM.2023.3242205. 

[60]    X. Liu et al., “Challenges and opportunities for autonomous micro-UAVs in precision
agriculture,” IEEE Micro, vol. 42, no. 1, pp. 61–68, 2022. 

[61]    J. Gago et al., “Nano and micro unmanned aerial vehicles (UAVs): a new grand challenge 
for precision agriculture?,” Current protocols in plant biology, vol. 5, no. 1, p. 20103, 2020. 

[62]    S. Misra, P. K. Deb, and K. Saini, “Dynamic leader selection in a master-slave architecture-
based  micro  UAV  swarm,”  in  2021  IEEE  Global  Communications  Conference 
(GLOBECOM), IEEE, Dec. 2021, pp. 1–6. 

[63]    W. Wen, Y. Jia, and W. Xia, “Federated learning in SWIPT-enabled micro-UAV swarm 
networks: A joint design of scheduling and resource allocation,” in 2021 13th International 
Conference on Wireless Communications and Signal Processing (WCSP), IEEE, Oct. 2021, 
pp. 1–5. 

[64]    N. Zhao, “UAV-assisted emergency networks in disasters,” IEEE Wireless
Communications, vol. 26, no. 1, pp. 45–51, 2019. 

[65]    X. Liu et al., “Transceiver Design and Multihop D2D for UAV IoT Coverage in Disasters,”
IEEE Internet Things J., vol. 6, no. 2, pp. 1803–1815, Apr. 2019, doi:
10.1109/JIOT.2018.2877504. 

[66]    E.  M.  Mohamed,  S.  Hashima,  and  K.  Hatano,  “Energy  aware  multiarmed  bandit  for 
millimeter  wave-based  UAV  mounted  RIS  networks,”  IEEE  Wireless  Communications 
Letters, vol. 11, no. 6, pp. 1293–1297, 2022. 

[67]    A. Amrallah, E. M. Mohamed, G. K. Tran, and K. Sakaguchi, “Optimization of UAV 3D
Trajectory in a Post-disaster Area Using Dual Energy-Aware Bandits,” IEICE
Communications Express, 2023. 

[68]    M.  Mozaffari,  W.  Saad,  M.  Bennis,  and  M.  Debbah,  “Efficient  deployment  of  multiple 
unmanned aerial vehicles for optimal wireless coverage,” IEEE Communications Letters, 
vol. 20, no. 8, pp. 1647–1650, 2016. 

[69]    A.  Al-Hourani,  S.  Kandeepan,  and  S.  Lardner,  “Optimal  LAP  altitude  for  maximum 
coverage,” IEEE Wireless Communications Letters, vol. 3, no. 6, pp. 569–572, 2014. 

[70]    Y. Zeng, “Wireless communications with unmanned aerial vehicles: Opportunities and
challenges,” IEEE Communications Magazine, vol. 54, no. 5, pp. 36–42, 2016. 

[71]    W.  Ejaz,  M.  A.  Azam,  S.  Saadat,  F.  Iqbal,  and  A.  Hanan,  “Unmanned  Aerial  Vehicles 
enabled IoT Platform for Disaster Management,” Energies, vol. 12, no. 14, Art. no. 14, Jan. 
2019, doi: 10.3390/en12142706. 

[72]    X. Xu, Y. Zeng, Y. L. Guan, and R. Zhang, “Overcoming Endurance Issue: UAV-Enabled 
Communications With Proactive Caching,” IEEE J. Select. Areas Commun., vol. 36, no. 6, 
pp. 1231–1244, Jun. 2018, doi: 10.1109/JSAC.2018.2844979. 

[73]    X.  Lin,  J.  Xia,  and  Z.  Wang,  “Probabilistic  caching  placement  in  UAV-assisted 
heterogeneous wireless networks,” Physical Communication, vol. 33, pp. 54–61, Apr. 2019, 
doi: 10.1016/j.phycom.2019.01.004. 

[74]    N.  Zhao  et  al.,  “Caching  UAV  Assisted  Secure  Transmission  in  Hyper-Dense  Networks 
Based on Interference Alignment,” IEEE Trans. Commun., vol. 66, no. 5, pp. 2281–2294, 
May 2018, doi: 10.1109/TCOMM.2018.2792014. 

[75]    N.  Zhao  et  al.,  “Caching  Unmanned  Aerial  Vehicle-Enabled  Small-Cell  Networks: 
Employing  Energy-Efficient  Methods  That  Store  and  Retrieve  Popular  Content,”  IEEE 
Vehicular  Technology  Magazine,  vol.  14,  no.  1,  pp.  71–79,  Mar.  2019,  doi: 
10.1109/MVT.2018.2881228. 

[76]    T.  Zhang,  Y.  Wang,  Y.  Liu,  W.  Xu,  and  A.  Nallanathan,  “Cache-Enabling  UAV 
Communications: Network Deployment and Resource Allocation,” IEEE Trans. Wireless 
Commun., vol. 19, no. 11, pp. 7470–7483, Nov. 2020, doi: 10.1109/TWC.2020.3011881. 

[77]    A. K. Bhuyan and H. Dutta, “Design of a Heuristic IoT-Based Approach as a Solution to a 
Self-Aware  Social  Distancing  Paradigm,”  in  Soft  Computing  Techniques  in  Connected 
Healthcare Systems, CRC Press, 2023. 

[78]    H.  Wu,  F.  Lyu,  C.  Zhou,  J.  Chen,  L.  Wang,  and  X.  Shen,  “Optimal  UAV  Caching  and 
Trajectory in Aerial-Assisted Vehicular Networks: A Learning-Based Approach,” IEEE J. 
Select.  Areas  Commun.,  vol.  38,  no.  12,  pp.  2783–2797,  Dec.  2020,  doi: 
10.1109/JSAC.2020.3005469. 

[79]    T.  Zhang,  “Caching  placement  and  resource  allocation  for  cache-enabling  UAV  NOMA 
networks,” IEEE Transactions on Vehicular Technology, vol. 69, no. 11, pp. 12897–12911, 
2020. 

[80]    S. Chai and V. K. N. Lau, “Online Trajectory and Radio Resource Optimization of Cache-
Enabled  UAV  Wireless  Networks  With  Content  and  Energy  Recharging,”  IEEE  Trans. 
Signal Process., vol. 68, pp. 1286–1299, 2020, doi: 10.1109/TSP.2020.2971457. 

[81]    A. Al-Hilo, M. Samir, C. Assi, S. Sharafeddine, and D. Ebrahimi, “UAV-Assisted Content 
Delivery  in  Intelligent  Transportation  Systems-Joint  Trajectory  Planning  and  Cache 
Management,”  IEEE  Trans.  Intell.  Transport.  Syst.,  vol.  22,  no.  8,  pp.  5155–5167,  Aug. 
2021, doi: 10.1109/TITS.2020.3020220. 

[82]    Y. Wei, F. R. Yu, M. Song, and Z. Han, “Joint Optimization of Caching, Computing, and 
Radio  Resources  for  Fog-Enabled  IoT  Using  Natural  Actor–Critic  Deep  Reinforcement 
Learning,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2061–2073, Apr. 2019, doi: 
10.1109/JIOT.2018.2878435. 

[83]    X. Wu, X. Li, J. Li, P. C. Ching, V. C. M. Leung, and H. V. Poor, “Caching Transient
Content for IoT Sensing: Multi-Agent Soft Actor-Critic,” IEEE Transactions on
Communications, vol. 69, no. 9, pp. 5886–5901, Sep. 2021, doi:
10.1109/TCOMM.2021.3086535. 

[84]    S.  Araf,  A.  S.  Saha,  S.  H.  Kazi,  N.  H.  Tran,  and  Md.  G.  R.  Alam,  “UAV  Assisted 
Cooperative  Caching  on  Network  Edge  Using  Multi-Agent  Actor-Critic  Reinforcement 
Learning,” IEEE Transactions on Vehicular Technology, vol. 72, no. 2, pp. 2322–2337, Feb. 
2023, doi: 10.1109/TVT.2022.3209079. 

[85]    W. Jiang, D. Feng, Y. Sun, G. Feng, Z. Wang, and X.-G. Xia, “Proactive Content Caching 
Based  on  Actor–Critic  Reinforcement  Learning  for  Mobile  Edge  Networks,”  IEEE 
Transactions on Cognitive Communications and Networking, vol. 8, no. 2, pp. 1239–1252, 
Jun. 2022, doi: 10.1109/TCCN.2021.3130995. 

[86]    C.  Wang  et  al.,  “Heterogeneous  Edge  Caching  Based  on  Actor-Critic  Learning  With 
Attention Mechanism Aiding,” IEEE Transactions on Network Science and Engineering, 
vol. 10, no. 6, pp. 3409–3420, Nov. 2023, doi: 10.1109/TNSE.2023.3260882. 

[87]    M. Yan, M. Luo, C. A. Chan, A. F. Gygax, C. Li, and C.-L. I, “Energy-Efficient Content 
Fetching  Strategies  in  Cache-Enabled  D2D  Networks  via  an  Actor-Critic  Reinforcement 
Learning  Structure,”  IEEE  Transactions  on  Vehicular  Technology,  vol.  73,  no.  11,  pp. 
17485–17495, Nov. 2024, doi: 10.1109/TVT.2024.3419012. 

[88]    X.  Gao,  Y.  Sun,  H.  Chen,  X.  Xu,  and  S.  Cui,  “Joint  Computing,  Pushing,  and  Caching 
Optimization for Mobile-Edge Computing Networks via Soft Actor–Critic Learning,” IEEE 
Internet  of  Things  Journal,  vol.  11,  no.  6,  pp.  9269–9281,  Mar.  2024,  doi: 
10.1109/JIOT.2023.3323433. 

[89]    Y. Xiao, H. Yu, Y. Yang, Y. Wang, J. Liu, and N. Ansari, “Adaptive Joint Routing and 
Caching  in  Knowledge-Defined  Networking:  An  Actor-Critic  Deep  Reinforcement 
Learning Approach,” IEEE Transactions on Mobile Computing, vol. 24, no. 5, pp. 4118–
4135, May 2025, doi: 10.1109/TMC.2024.3521247. 

[90]    C. Zhong, M. C. Gursoy, and S. Velipasalar, “Deep Reinforcement Learning-Based Edge 
Caching  in  Wireless  Networks,”  IEEE  Transactions  on  Cognitive  Communications  and 
Networking, vol. 6, no. 1, pp. 48–61, Mar. 2020, doi: 10.1109/TCCN.2020.2968326. 

[91]    A.  K.  Bhuyan,  H.  Dutta,  and  S.  Biswas,  “Towards  a  UAV-centric  Content  Caching 
Architecture for Communication-challenged Environments,” in GLOBECOM 2022 - 2022 
IEEE Global Communications Conference, IEEE, Dec. 2022, pp. 468–473. 

[92]    A. K. Bhuyan, H. Dutta, and S. Biswas, “UAV Trajectory Planning For Improved Content 
Availability in Infrastructure-less Wireless Networks,” in 2023 International Conference on 
Information Networking (ICOIN), Bangkok, Thailand: IEEE, Jan. 2023, pp. 376–381. doi: 
10.1109/ICOIN56518.2023.10048929. 

[93]    A. K. Bhuyan, H. Dutta, and S. Biswas, “Handling Demand Heterogeneity in UAV-aided
Content Caching in Communication-challenged Environments,” in 2023 IEEE 24th
International Symposium on a World of Wireless, Mobile and Multimedia Networks
(WoWMoM), Boston, MA, USA: IEEE, Jun. 2023, pp. 107–116. doi:
10.1109/WoWMoM57956.2023.00025. 

[94]    X. Xu, M. Tao, and C. Shen, “Collaborative Multi-Agent Multi-Armed Bandit Learning for 
Small-Cell Caching,” IEEE Transactions on Wireless Communications, vol. 19, no. 4, pp. 
2570–2585, Apr. 2020, doi: 10.1109/TWC.2020.2966599. 

[95]    P. Blasco and D. Gündüz, “Multi-armed bandit optimization of cache content in wireless 
infostation networks,” in 2014 IEEE International Symposium on Information Theory, Jun. 
2014, pp. 51–55. doi: 10.1109/ISIT.2014.6874793. 

[96]    X.  Xu  and  M.  Tao,  “Decentralized  Multi-Agent  Multi-Armed  Bandit  Learning  With 
Calibration for Multi-Cell Caching,” IEEE Transactions on Communications, vol. 69, no. 
4, pp. 2457–2472, Apr. 2021, doi: 10.1109/TCOMM.2020.3045050. 

[97]    G. Tabei, Y. Ito, T. Kimura, and K. Hirata, “Design of Multi-Armed Bandit-Based Routing 
for  in-Network  Caching,”  IEEE  Access,  vol.  11,  pp.  82584–82600,  2023,  doi: 
10.1109/ACCESS.2023.3301961. 

[98]    Y. Han, L. Ai, R. Wang, J. Wu, D. Liu, and H. Ren, “Cache Placement Optimization in 
Mobile  Edge  Computing  Networks  With  Unaware  Environment—An  Extended  Multi-
Armed Bandit Approach,” IEEE Transactions on Wireless Communications, vol. 20, no. 
12, pp. 8119–8133, Dec. 2021, doi: 10.1109/TWC.2021.3090440. 

[99]    S. A. Bitaghsir, A. Dadlani, M. Borhani, and A. Khonsari, “Multi-Armed Bandit Learning 
for  Cache  Content  Placement  in  Vehicular  Social  Networks,”  IEEE  Communications 
Letters, vol. 23, no. 12, pp. 2321–2324, Dec. 2019, doi: 10.1109/LCOMM.2019.2941482. 

[100]  S.  Liu,  J.  Yu,  X.  Deng,  and  S.  Wan,  “FedCPF:  An  Efficient-Communication  Federated 
Learning Approach for Vehicular Edge Computing in 6G Communication Networks,” IEEE 
Transactions  on  Intelligent  Transportation  Systems,  vol.  23,  no.  2,  pp.  1616–1629,  Feb. 
2022, doi: 10.1109/TITS.2021.3099368. 

[101]  Z.  Wang  et  al.,  “Asynchronous  Federated  Learning  Over  Wireless  Communication 
Networks,” IEEE Transactions on Wireless Communications, vol. 21, no. 9, pp. 6961–6978, 
Sep. 2022, doi: 10.1109/TWC.2022.3153495. 

[102]  H.  Ye,  L.  Liang,  and  G.  Y.  Li,  “Decentralized  Federated  Learning  With  Unreliable 
Communications,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 3, pp. 
487–500, Apr. 2022, doi: 10.1109/JSTSP.2022.3152445. 

[103]  H. Chen, S. Huang, D. Zhang, M. Xiao, M. Skoglund, and H. V. Poor, “Federated Learning 
Over  Wireless  IoT  Networks  With  Optimized  Communication  and  Resources,”  IEEE 
Internet  of  Things  Journal,  vol.  9,  no.  17,  pp.  16592–16605,  Sep.  2022,  doi: 
10.1109/JIOT.2022.3151193. 

[104]  H.  Chen,  M.  Xiao,  and  Z.  Pang,  “Satellite-Based  Computing  Networks  with  Federated 
Learning,”  IEEE  Wireless  Communications,  vol.  29,  no.  1,  pp.  78–84,  Feb.  2022,  doi: 
10.1109/MWC.008.00353. 

[105]  J.  Lee,  F.  Solat,  T.  Y.  Kim,  and  H.  V.  Poor,  “Federated  Learning-Empowered  Mobile 
Network  Management  for  5G  and  Beyond  Networks:  From  Access  to  Core,”  IEEE 
Communications  Surveys  &  Tutorials,  vol.  26,  no.  3,  pp.  2176–2212,  2024,  doi: 
10.1109/COMST.2024.3352910. 

[106]  X. Zhou et al., “Decentralized P2P Federated Learning for Privacy-Preserving and Resilient 
Mobile Robotic Systems,” IEEE Wireless Communications, vol. 30, no. 2, pp. 82–89, Apr. 
2023, doi: 10.1109/MWC.004.2200381. 

[107]  B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, “Cost-Effective Federated Learning in 
Mobile Edge Networks,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 
12, pp. 3606–3621, Dec. 2021, doi: 10.1109/JSAC.2021.3118436. 

[108]  R. Yu and P. Li, “Toward Resource-Efficient Federated Learning in Mobile Edge
Computing,” IEEE Network, vol. 35, no. 1, pp. 148–155, Jan. 2021, doi:
10.1109/MNET.011.2000295. 

[109]  A. Li, J. Sun, P. Li, Y. Pu, H. Li, and Y. Chen, “Hermes: an efficient federated learning 
framework  for  heterogeneous  mobile  clients,”  in  Proceedings  of  the  27th  Annual 
International Conference on Mobile Computing and Networking, New Orleans Louisiana: 
ACM, Oct. 2021, pp. 420–437. doi: 10.1145/3447993.3483278. 

[110]  M. Gecer and B. Garbinato, “Federated Learning for Mobility Applications,” ACM Comput.
Surv., vol. 56, no. 5, pp. 1–28, May 2024, doi: 10.1145/3637868. 

[111]  C. Feng, H. H. Yang, D. Hu, Z. Zhao, T. Q. S. Quek, and G. Min, “Mobility-Aware Cluster
Federated Learning in Hierarchical Wireless Networks,” IEEE Transactions on Wireless
Communications, vol. 21, no. 10, pp. 8441–8458, Oct. 2022, doi:
10.1109/TWC.2022.3166386. 

[112]  Y.  Venkatesha,  Y.  Kim,  L.  Tassiulas,  and  P.  Panda,  “Federated  Learning  With  Spiking 
Neural Networks,” IEEE Transactions on Signal Processing, vol. 69, pp. 6183–6194, 2021, 
doi: 10.1109/TSP.2021.3121632. 

[113]  C. He et al., “FedGraphNN: A Federated Learning System and Benchmark for Graph Neural 
Networks,” Sep. 08, 2021, arXiv: arXiv:2104.07145. doi: 10.48550/arXiv.2104.07145. 

[114]  K. Xie et al., “Efficient Federated Learning With Spike Neural Networks for Traffic Sign 
Recognition,” IEEE Transactions on Vehicular Technology, vol. 71, no. 9, pp. 9980–9992, 
Sep. 2022, doi: 10.1109/TVT.2022.3178808. 

[115]  Z.  Li,  T.  Lin,  X.  Shang,  and  C.  Wu,  “Revisiting  Weighted  Aggregation  in  Federated 
Learning with Neural Networks,” in Proceedings of the 40th International Conference on 
Machine Learning, PMLR, Jul. 2023, pp. 19767–19788. Accessed: May 13, 2025. [Online]. 
Available: https://proceedings.mlr.press/v202/li23s.html 

[116]  C. Liu, C. Lou, R. Wang, A. Y. Xi, L. Shen, and J. Yan, “Deep Neural Network Fusion via
Graph Matching with Applications to Model Ensemble and Federated Learning,” in
Proceedings of the 39th International Conference on Machine Learning, PMLR, Jun. 2022,
pp. 13857–13869. Accessed: May 13, 2025. [Online]. Available:
https://proceedings.mlr.press/v162/liu22k.html 

[117]  Y. H. Cho and W. J. Byun, “Generalized Friis Transmission Equation for Orbital Angular 
Momentum Radios,” IEEE Trans. Antennas Propagat., vol. 67, no. 4, pp. 2423–2429, Apr. 
2019, doi: 10.1109/TAP.2019.2891438. 

[118]  W.  J.  Byun  and  Y.  Heui  Cho,  “Analysis  of  a  200-GHz  OAM  Radio  Link  Using  a 
Generalized  Friis  Transmission  Equation,”  in  2019  IEEE  International  Symposium  on 
Antennas and Propagation and USNC-URSI Radio Science Meeting, Atlanta, GA, USA: 
IEEE, Jul. 2019, pp. 1051–1052. doi: 10.1109/APUSNCURSINRSM.2019.8888820. 

[119]  R. Wang, S. Biswas, S. Das, and J. Rao, “Collaborative Caching for Dynamic Map
Dissemination in Vehicular Networks,” in 2019 IEEE ComSoc International
Communications Quality and Reliability Workshop (CQR), Apr. 2019, pp. 1–6. doi:
10.1109/CQR.2019.8880103. 

[120]  P. Blasco and D. Gunduz, “Learning-Based Optimization of Cache Content in a Small Cell 
Base Station,” Feb. 21, 2014, arXiv: arXiv:1402.3247. Accessed: Mar. 21, 2024. [Online]. 
Available: http://arxiv.org/abs/1402.3247 

[121]  R.  Wang,  J.  Rao,  C.  Zhou,  and  S.  Biswas,  “Connectionless  Edge-Cache  Servers  for 
Reducing  Cellular  Bandwidth  Usage  in  Vehicular  Networks,”  in  2021  International 
Conference  on  COMmunication  Systems  &  NETworkS  (COMSNETS),  Bangalore,  India: 
IEEE, Jan. 2021, pp. 516–524. doi: 10.1109/COMSNETS51098.2021.9352746. 

[122]  B. Lee, “Web Caching and Zipf-like Distributions: Evidence and Implications,” IEEE
INFOCOM, 2009, Accessed: Mar. 21, 2024. [Online]. Available:
https://cir.nii.ac.jp/crid/1570854176352908672 

[123]  H. Yao, C. Bai, M. Xiong, D. Zeng, and Z. Fu, “Heterogeneous cloudlet deployment and
user-cloudlet association toward cost effective fog computing,” Concurrency and
Computation: Practice and Experience, vol. 29, no. 16, p. e3975, 2017, doi:
10.1002/cpe.3975. 

[124]  S. Ali, N. Rajatheva, and W. Saad, “Fast Uplink Grant for Machine Type Communications: 
Challenges and Opportunities,” IEEE Commun. Mag., vol. 57, no. 3, pp. 97–103, Mar. 2019, 
doi: 10.1109/MCOM.2019.1800475. 

[125]  T.  F.  Smith  and  M.  S.  Waterman,  “Identification  of  common  molecular  subsequences,” 
Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, Mar. 1981, doi: 10.1016/0022-
2836(81)90087-5. 

[126]  A. K. Bhuyan, H. Dutta, and S. Biswas, “UAV Trajectory Planning For Improved Content 
Availability in Infrastructure-less Wireless Networks,” in 2023 International Conference on 
Information Networking (ICOIN), IEEE, Jan. 2023, pp. 376–381. 

[127]  A. Slivkins, “Introduction to Multi-Armed Bandits,” MAL, vol. 12, no. 1–2, pp. 1–286, Nov.
2019, doi: 10.1561/2200000068. 

[128]  W. Cao, J. Li, Y. Tao, and Z. Li, “On Top-k Selection in Multi-Armed Bandits and Hidden
Bipartite Graphs,” in Advances in Neural Information Processing Systems, Curran
Associates, Inc., 2015. Accessed: Feb. 29, 2024. [Online]. Available:
https://proceedings.neurips.cc/paper_files/paper/2015/hash/ab233b682ec355648e7891e66c54191b-Abstract.html

[129]  W. W. Cohen, P. Ravikumar, and S. E. Fienberg, “A Comparison of String Metrics for
Matching Names and Records”. 

[130]  R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction. in Adaptive
computation and machine learning. Cambridge, Mass: MIT Press, 1998. 

[131]  H. Zhu, J. Xu, S. Liu, and Y. Jin, “Federated learning on non-IID data: A survey,”
Neurocomputing, vol. 465, pp. 371–390, Nov. 2021, doi: 10.1016/j.neucom.2021.07.098. 

[132]  T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated Learning: Challenges, Methods, 
and Future Directions,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, May 2020, 
doi: 10.1109/MSP.2020.2975749. 

[133]  H. Wang, Z. Kaplan, D. Niu, and B. Li, “Optimizing Federated Learning on Non-IID Data 
with Reinforcement Learning,” in IEEE INFOCOM 2020 - IEEE Conference on Computer 
Communications,  Toronto,  ON,  Canada:  IEEE,  Jul.  2020,  pp.  1698–1707.  doi: 
10.1109/INFOCOM41043.2020.9155494. 

[134]  A. P. Majtey, P. W. Lamberti, and D. P. Prato, “Jensen-Shannon divergence as a measure 
of distinguishability between mixed quantum states,” Phys. Rev. A, vol. 72, no. 5, p. 052310, 
Nov. 2005, doi: 10.1103/PhysRevA.72.052310. 

[135]  A. K. Bhuyan, H. Dutta, and S. Biswas, “Towards a UAV-centric Content Caching
Architecture for Communication-challenged Environments,” in GLOBECOM 2022 - 2022
IEEE Global Communications Conference, Rio de Janeiro, Brazil: IEEE, Dec. 2022, pp.
468–473. doi: 10.1109/GLOBECOM48099.2022.10001616. 

[136]  R. S. Sutton and A. G. Barto, Reinforcement Learning, second edition: An Introduction.
MIT Press, 2018. 

[137]  W. Cao, J. Li, Y. Tao, and Z. Li, “On top-k selection in multi-armed bandits and hidden 
bipartite graphs,” Advances in Neural Information Processing Systems, vol. 28, 2015. 

[138]  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018. 

[139]  H. Robbins and S. Monro, “A Stochastic Approximation Method,” The Annals of
Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951. 

[140]  T.  C.  T.  Kotiah,  “Chebyshev’s  inequality  and  the  law  of  large  numbers,”  International 
Journal of Mathematical Education in Science and Technology, vol. 25, no. 3, pp. 389–398, 
May 1994, doi: 10.1080/0020739940250310. 

[141]  T.  C.  T.  Kotiah,  “Chebyshev’s  inequality  and  the  law  of  large  numbers,”  International 
Journal of Mathematical Education in Science and Technology, vol. 25, no. 3, pp. 389–398, 
May 1994, doi: 10.1080/0020739940250310. 

[142]  A. K. Bhuyan, H. Dutta, and S. Biswas, “Handling Demand Heterogeneity in UAV-aided
Content Caching in Communication-challenged Environments,” in 2023 IEEE 24th
International Symposium on a World of Wireless, Mobile and Multimedia Networks
(WoWMoM), Boston, MA, USA: IEEE, Jun. 2023, pp. 107–116. doi:
10.1109/WoWMoM57956.2023.00025. 

[143]  M. Cheatham et al., “On the efficient execution of bounded Jaro-Winkler distances,”
Semant. web, vol. 8, no. 2, pp. 185–196, Jan. 2017, doi: 10.3233/SW-150209. 

[144]  Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated Learning with Non-
IID Data,” 2018, doi: 10.48550/arXiv.1806.00582. 

[145]  A. K. Bhuyan, H. Dutta, and S. Biswas, “Unsupervised Speaker Diarization in Distributed
IoT Networks Using Federated Learning,” IEEE Transactions on Emerging Topics in
Computational Intelligence, vol. 9, no. 2, pp. 1934–1946, Apr. 2025, doi:
10.1109/TETCI.2024.3482855. 

[146]  M. Ye, X. Fang, B. Du, P. C. Yuen, and D. Tao, “Heterogeneous Federated Learning: State-
of-the-art and Research Challenges,” ACM Comput. Surv., vol. 56, no. 3, pp. 1–44, Mar. 
2024, doi: 10.1145/3625558. 

[147]  E. T. Martínez Beltrán et al., “Decentralized Federated Learning: Fundamentals, State of 
the Art, Frameworks, Trends, and Challenges,” IEEE Communications Surveys & Tutorials, 
vol. 25, no. 4, pp. 2983–3013, 2023, doi: 10.1109/COMST.2023.3315746. 

[148]  A. K. Bhuyan and J. H. Nirmal, “Comparative study of voice conversion framework with 
line spectral frequency and Mel-Frequency Cepstral Coefficients as features using artificial 
neural networks,” in 2015 International Conference on Computers, Communications, and 
Systems (ICCCS), Nov. 2015, pp. 230–235. doi: 10.1109/CCOMS.2015.7562906. 

[149]  A.  K.  Bhuyan,  H.  Dutta,  and  S.  Biswas,  “Unsupervised  Quasi-Silence  based  Speech 
Segmentation  for  Speaker  Diarization,”  in  2022  IEEE  9th  International  Conference  on 
Sciences  of  Electronics,  Technologies  of  Information  and  Telecommunications  (SETIT), 
May 2022, pp. 170–175. doi: 10.1109/SETIT54465.2022.9875932. 

[150]  Y. Kim, “Sequence-to-Sequence Learning with Latent Neural Grammars,” in Advances in
Neural Information Processing Systems, Curran Associates, Inc., 2021, pp. 26302–26317.
Accessed: Apr. 26, 2024. [Online]. Available:
https://proceedings.neurips.cc/paper_files/paper/2021/hash/dd17e652cd2a08fdb8bf7f68e2ad3814-Abstract.html

[151]  D. Cai and W. Lam, “Graph Transformer for Graph-to-Sequence Learning,” Proceedings 
of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, Art. no. 05, Apr. 2020, 
doi: 10.1609/aaai.v34i05.6243. 

[152]  L.  Yang,  T.  L.  J.  Ng,  B.  Smyth,  and  R.  Dong,  “HTML:  Hierarchical  Transformer-based 
Multi-task Learning for Volatility Prediction,” in Proceedings of The Web Conference 2020, 
Taipei Taiwan: ACM, Apr. 2020, pp. 441–451. doi: 10.1145/3366423.3380128. 
